DataFrame in Expression Language: Difference between revisions
No edit summary |
No edit summary |
||
Line 256: | Line 256: | ||
<pre> | <pre> | ||
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id)]).ToCsv() | ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id)]).ToCsv() | ||
Returns: | |||
id;text;both | id;text;both | ||
0;zero;zero=0 | 0;zero;zero=0 | ||
Line 263: | Line 263: | ||
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["text": Def("", text + "=" + id)]).ToCsv() | ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["text": Def("", text + "=" + id)]).ToCsv() | ||
Returns: | |||
id;text | id;text | ||
0;zero=0 | 0;zero=0 | ||
Line 270: | Line 270: | ||
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id), "both+1": Def("", both + 1), "text": Def("", "Done: " + Column("both+1")), "constant": 1234]).ToCsv() | ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id), "both+1": Def("", both + 1), "text": Def("", "Done: " + Column("both+1")), "constant": 1234]).ToCsv() | ||
Returns: | |||
id;text;both;both+1;constant | id;text;both;both+1;constant | ||
0;Done: zero=01;zero=0;zero=01;1234 | 0;Done: zero=01;zero=0;zero=01;1234 |
Revision as of 14:14, 19 December 2019
DataFrame represents a two dimensional array of data with one-to-many columns and zero-to-many rows, like a relational database table, an Excel sheet or a CSV data file. Each column in the DataFrame has a name, and there must not be more than one column with the same name. DataFrame is the generic data structure used to manage all kinds data in QPR ProcessAnalyzer expression engine that run in-memory. DataFrames as linked to other entities in QPR ProcessAnalyzer as follows:
- Datatable contents is fetched into the memory as a DataFrame object
- DataFrame can be stored (persisted) to a Datatable
- ETL operations, such as joining, unions, filtering and grouping are based on the DataFrames
- Data extracted from an external data source, e.g. using ODBC, is fetched to the in-memory calculation as a DataFrame.
- When using a loading scripts, cases and events data is fed to the model using the DataFrame.
DataFrame properties | Description |
---|---|
Columns (String*) | DataFrame columns names as an array in the order the columns are in the DataFrame. |
Rows (Object**) | Returns the data content of the DataFrame as a two-dimensional array (matrix). The column names are not part of the data content.
Examples: DatatableById(5).DataFrame.Rows[0][0] Returns: the value in the first row and first column in a datatable with id 5. |
<column name> (Object*) |
Returns an array of values of given column in the datatable. If the column name contains spaces, the Column function needs to be used to refer to a column. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right Returns: [zero, two, three] ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right[2] Returns: three |
DataFrame functions | Parameters | Description |
---|---|---|
Append (DataFrame) |
|
Creates a new DataFrame that has the contents of given DataFrame added to the end of this DataFrame. When the data is combined, the order of columns matters, not the names of the columns. The resulting DataFrame gets column names from this DataFrame. If the number of columns is different between this DataFrame and the other DataFrame, an exception is thrown. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).Append( ToDataFrame([[1, "one"], [4, "four"]], ["id", "text"]) ); Returns: id;text 0;zero 2;two 3;three 1;one 4;four |
ColumnIndexes (Integer*) |
|
Convert DataFrame column names into column indexes. The indexes are starting from zero. If a column is not found, an exception is given. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ColumnIndexes(["right", "id"]) Returns: [1, 0] |
Column (Object*) |
|
Returns an array of values of given column in the order rows are in the datatable. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Column("right") Returns: [zero, two, three] |
Columns (DataFrame) |
|
Creates a new DataFrame having only the defined columns of the original DataFrame. Note that Columns function is different than Columns property - difference is that the function has parameters. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Columns([1]).ToCsv() Returns: right zero two three |
Head (DataFrame) |
|
Creates a new DataFrame that only contains top number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Head(2).ToCsv() Results string: id;right 0;zero 2;two |
Join (DataFrame) |
|
See Joining DataFrames. |
GroupBy (DataFrame) |
|
Creates a new DataFrame based on the current DataFrame. The resulting DataFrame has rows grouped by given columns and values aggregated using given functions. In the resulting DataFrame one row in the end result corresponds with one group. Parameters:
Examples: ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id"], [ "ids": Def("", Sum(id)), "texts": Def("", StringJoin(",", text)), "constant": 123 ]).ToCsv() Returns: ids;texts;constant 0;zero,zero2;123 6;two,two,two3;123 3;three;123 ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id", "text"], [ "ids": Def("", Sum(id)), "texts": Def("", StringJoin(",", text)), "constant": 123 ]).ToCsv() Returns: ids;texts;constant 0;zero;123 0;zero2;123 4;two,two;123 2;two3;123 3;three;123 Analysis("OperationLog") .GroupBy( ["User Name"], [ "User Name": Def("", Column("User Name")[0]), "Count": Def("Rows", CountTop(Rows)), "Avg. Duration": Def("", Average(Duration)), "Max. Duration": Def("", Max(Duration)) ] ).ToCsv() Returns (similar to this): User Name;Count;Avg. Duration;Max. Duration ;207;0.617434782608696;20.556 Administrator;665;16.3750631578947;4225.497 qpr;128;2.158765625;20.346 |
Merge (DataFrame) |
See Merging DataFrames. | |
OrderBy (DataFrame) |
|
Creates a new DataFrame having rows ordered in an ascending order using the given expression evaluated on each row. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderBy(text).ToCsv() Returns: id;right 3;three 2;two 0;zero Analysis("OperationLog").OrderBy(Duration).Head(1) Results string: id;right 0;zero 2;two 3;three |
OrderByDescending (DataFrame) |
|
Creates a new DataFrame having rows ordered in an descending order using the given expression evaluated on each row. See OrderBy above for examples. |
Persist (DataTable) |
|
Writes the DataFrame into QPR ProcessAnalyzer database as a DataTable having the given name. If a DataTable with that name does not exist in the project, a new DataTable is created. If a DataTable with that name already exists, the DataFrame will be stored into that DataTable. The function returns a DataTable object which the data was persisted into. The additional parameters support:
Examples: Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"])); right.Persist("RightDataTable", ["ProjectName": "TestData"]) Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, its contents will be overwritten by the new content. Let("newData", ToDataFrame([[4, "four"]], ["id", "right"])); newData.Persist("RightDataTable", ["ProjectName": "TestData", "Append": true]) Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, new content will be appended into the end of the table. |
SetColumns (DataFrame) |
|
Creates a new DataFrame based on the current DataFrame, where new columns have been created and/or existing columns have been modified. New and modified columns are defined using an array, where the column name is as a key and as a value there is the expression to calculate the new or modified column. When specifying a column that already exists, the column values are modified. When specifying a new column name, that column is created as a new column to the resulting DataFrame. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id)]).ToCsv() Returns: id;text;both 0;zero;zero=0 2;two;two=2 3;three;three=3 ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["text": Def("", text + "=" + id)]).ToCsv() Returns: id;text 0;zero=0 2;two=2 3;three=3 ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id), "both+1": Def("", both + 1), "text": Def("", "Done: " + Column("both+1")), "constant": 1234]).ToCsv() Returns: id;text;both;both+1;constant 0;Done: zero=01;zero=0;zero=01;1234 2;Done: two=21;two=2;two=21;1234 3;Done: three=31;three=3;three=31;1234 |
Tail (DataFrame) |
|
Creates a new DataFrame that has only the bottom number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned. Example: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Tail(2).ToCsv() Results string: id;right 2;two 3;three |
ToCsv (String) | (none) |
Converts a DataFrame into a CSV data. The CSV data has the following formatting:
Example: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv() Returns: id;right 0;zero 2;two 3;three |
Where (DataFrame) |
|
Creates a new DataFrame having only rows for which the given condition expression returns true. The condition expression can refer to the columns of the DataFrame (see the example below). Examples: Let("df", ToDataFrame([[0, "zero"], [2, "two", true], [3, "three"]], ["id", "string"])); All the following expression return the same: df.Where(id < 3); df.Where(Column("id") < 3); df.Where(_[0] < 3); Returns: id;string 0;zero 2;two |
Zip (DataFrame) |
|
Creates a new DataFrame that has the contents of given DataFrame appended as new columns into the end of this DataFrame. Returns a new DataFrame that has the colums from both the data frames so that the columns from the other DataFrame are appended to the end of the columns of this DataFrame. If the number of rows is different between this DataFrame and the other DataFrame, an exception is thrown. There must not be duplicate column names in the DataFrames - otherwise an exception is thrown. Examples: Let("df1", ToDataFrame([[0, "zero"], [1, "one"], [4, "four"]], ["id", "text"])); Let("df2", ToDataFrame([[1, "one"], [2, "two"], [3, "three"]], ["id2", "text2"])); df1.Zip(df2).ToCsv(); Returns: id;text;id2;text2 0;zero;1;one 1;one;2;two 4;four;3;three |
The following functions can be used to initialize DataFrame objects:
Function | Parameters | Description |
---|---|---|
ToDataFrame |
|
Creates a DataFrame based on the given matrix (two dimensional array) and column names. Number of column names should be the same as the number of columns in the matrix. Examples: ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv() Returns: A string containing: id;right 0;zero 2;two 3;three |
Joining DataFrames
Performs a joining operation between two DataFrames. Columns which the joining is based on, can be defined as follows:
- If joining using one column having the same name in both DataFrames, the column name is specified as as string.
- If joining using several columns having the same names in both DataFrames, the column names are specified as an array.
- If joining using columns that have different names between the DataFrames, columns are specified as an array of key-value pairs, where the key is the column name in the left side DataFrame, and value is the column name in the right side DataFrame.
The join type can be inner (row is generated if both DataFrames have the key) or leftouter (at least one row is generated for each left side DataFrame row, even if there is no matching other row, in which case null is given as value for the other columns). The default is inner.
Examples:
Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"])); Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"])); left.join(right, "id").ToCsv() Returns: id;left;right 0;zero;zero Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"])); Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"])); left.join(right, "id", "leftouter").ToCsv() Returns: id;left;right 0;zero;zero 1;one; Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"])); Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"])); left.join(right, ["idleft1": "idright1"], "inner"); Returns: idleft1;idleft2;left;idright2;right 0;0;zerozeroleft;0;zerozeroright 0;0;zerozeroleft;1;zeroright 0;1;zeroleft;0;zerozeroright 0;1;zeroleft;1;zeroright Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"])); Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"])); left.join(right, ["idleft1": "idright1", "idleft2": "idright2"], "inner"); Returns: idleft1;idleft2;left;right 0;0;zerozeroleft;zerozeroright 0;1;zeroleft;zeroright
Merging DataFrames
Creates a new DataFrame that has the contents of given (target) DataFrame merged with another (source) DataFrame using method described in parameters. Principles follow how MERGE SQL command works in SQL Server. In order for the merge to work properly, both the source and target table must have exactly the same columns in exactly the same order (otherwise error is given).
Parameters:
- source': DataFrame to be merged with this (target) DataFrame.
- matchColumns: Identifies columns to be matched with each other. Value can be either single column index (numerical) or name (in this case, the column identified with this value is used in both the DataFrames to match the rows), or an array of any number of:
- Column identifiers or names: In this case, the column identified with this value is used in both the DataFrames to match the rows.
- Hierarchical arrays containing identifiers or names both in their context object and value object. In this case, the column identified with the context value is used to identify the column in this DataFrame, whereas the column identified with the value of the context is used to identify the column in the other DataFrame to match the rows.
- copyOnMatch: An array similar to the one in 2.2.2 of column names or indexes of columns to copy from the source to target, if a matching row is found. If the value is _remove , then all the rows for which a match was found will be removed from the resulting table. If the value is null, then the target row will be completely replaced with all the columns in the source row that have identical name in both source and target table. The default value is null.
- copyIfNotMatchedInTarget: An array similar to the one in 2.3 of column names or indexes of columns to copy from the source to target, if a matching row is not found.
- addIfNotMatchedInTarget: A boolean value that defines whether the source row is added to the result if there is no matching row in the target DataFrame. The default value is true.
- keepIfNotMatchedBySource: A boolean value that defines whether the target row is added to the result if there is no matching row in the source DataFrame. The default value is true.
Examples:
Let("target", ToDataFrame([[0, "zero", "target"], [1, "", "target"]], ["id", "text", "frame"])); Let("source", ToDataFrame([[1, "one", "source"], [2, "two", "source"], [3, "three", "source"]], ["id", "text", "frame"])); target.Merge(source, "id").ToCsv() Returns (one key, default parameters, identical dataframe columns): id;text;frame 0;zero;target 1;one;source 2;two;source 3;three;source target.Merge(source, "id", ["text"]).ToCsv() Returns (one key, default parameters, identical dataframe columns, copy only text column from source): id;text;frame 0;zero;target 1;one;target 2;two; 3;three; target.Merge(source, "id", ["text"], _remove).ToCsv() Returns (one key, default parameters, identical dataframe columns, copy only text column from source, remove rows found only in source): id;text;frame 0;zero;target 1;one;target target.Merge(source, "id", ["text"], _remove, false).ToCsv() Returns (one key, identical dataframe columns, copy only text column from source, remove rows found only in source or only in target): id;text;frame 1;one;target target.Merge(source, "id", _remove, _remove, false).ToCsv() Returns (one key, identical dataframe columns, remove all rows): id;text;frame Let("target", ToDataFrame([[0, 0, "zerozeroleft", "target"], [0, 1, "zeroleft", "target"], [1, 2, "left", "target"], [4, 5, "fourleft", "target"]], ["idleft1", "idleft2", "textleft", "frame"])); Let("source", ToDataFrame([[0, 0, "zerozeroright", "source"], [0, 1, "zeroright", "target"], [1, 2, "oneright", "source"], [2, 3, "tworight", "source"], [3, 4, "threeright", "source"]], ["idright1", "idright2", "textright", "frame"])); target.Merge(source, ["idleft1": "idright1"]).ToCsv() Returns (one key, default parameters, different dataframe columns, copy all matching columns): idleft1;idleft2;textleft;frame 0;0;zerozeroleft;source 0;0;zerozeroleft;target 0;1;zeroleft;source 0;1;zeroleft;target 1;2;left;source 4;5;fourleft;target 2;;;source 3;;;source target.Merge(source, ["idleft1": "idright1"], []).ToCsv() Returns (one key, default parameters, different dataframe columns, copy only key column): idleft1;idleft2;textleft;frame 0;0;zerozeroleft;target 0;0;zerozeroleft;target 0;1;zeroleft;target 0;1;zeroleft;target 1;2;left;target 4;5;fourleft;target 2;;; 3;;; target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy all matching columns): idleft1;idleft2;textleft;frame 0;0;zerozeroleft;source 0;1;zeroleft;target 1;2;left;source 4;5;fourleft;target 2;3;;source 3;4;;source target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy only textright -column): idleft1;idleft2;textleft;frame 0;0;zerozeroright;target 0;1;zeroright;target 1;2;oneright;target 4;5;fourleft;target 2;3;tworight; 3;4;threeright; target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright", "frame"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy textright and frame -columns): idleft1;idleft2;textleft;frame 0;0;zerozeroright;source 0;1;zeroright;target 1;2;oneright;source 4;5;fourleft;target 2;3;tworight;source 3;4;threeright;source target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], _remove).ToCsv() Returns (two keys, default parameters, different dataframe columns, remove all matching rows, copy only matching columns): idleft1;idleft2;textleft;frame 4;5;fourleft;target 2;3;; 3;4;; target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id columns and textright-column for rows not found in target): idleft1;idleft2;textleft;frame 0;0;zerozeroright;target 0;1;zeroright;target 1;2;oneright;target 4;5;fourleft;target 2;3;tworight; 3;4;threeright; target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target): idleft1;idleft2;textleft;frame 0;0;zerozeroright;target 0;1;zeroright;target 1;2;oneright;target 4;5;fourleft;target 2;3;tworight;source 3;4;threeright;source target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], null, ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv() Returns (two keys, default parameters, different dataframe columns, don't copy any columns from source for matching columns, copy id columns and textright-column for rows not found in target): idleft1;idleft2;textleft;frame 0;0;zerozeroleft;source 0;1;zeroleft;target 1;2;left;source 4;5;fourleft;target 2;3;tworight; 3;4;threeright; target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"], false).ToCsv() Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target, remove all rows not found in source): idleft1;idleft2;textleft;frame 0;0;zerozeroright;target 0;1;zeroright;target 1;2;oneright;target 2;3;tworight;source 3;4;threeright;source