DataFrame in Expression Language: Difference between revisions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search
No edit summary
 
(142 intermediate revisions by 3 users not shown)
Line 1: Line 1:
DataFrame represents a two dimensional array of data with one-to-many columns and zero-to-many rows, like a relational database table, an Excel sheet or a CSV data file. Each column in the DataFrame has a name, and there must not be more than one column with the same name. DataFrame is the generic data structure used to manage all kinds data in QPR ProcessAnalyzer expression engine that run in-memory. DataFrames as linked to other entities in QPR ProcessAnalyzer as follows:
DataFrame represents a two dimensional array of data with one-to-many columns and zero-to-many rows, like relational database table, Excel sheet or CSV data file. Each column in the DataFrame has a name, and there must not be more than one column with the same name.
* Datatable contents is fetched into the memory as a DataFrame object
 
* DataFrame can be stored (persisted) to a Datatable
DataFrame is the generic data structure used to manage all kinds data in QPR ProcessAnalyzer expression engine that run in-memory. As DataFrame is an in-memory object, processing large dataset using DataFrames requires considerable amount of memory in QPR ProcessAnalyzer Server. Instead of using DataFrames, data can also be processed using [[SqlDataFrame_in_Expression_Language|SqlDataFrames]] (in Snowflake or SQL Server) or using [[DataFlow_in_Expression_Language|DataFlows]] (as a stream with low memory requirements).
 
DataFrames as linked to other entities in as follows:
* Datatable contents can be fetched into the memory as DataFrame
* DataFrame can be stored (persisted) to Datatable
* ETL operations, such as joining, unions, filtering and grouping are based on the DataFrames
* ETL operations, such as joining, unions, filtering and grouping are based on the DataFrames
* Data extracted from an external data source, e.g. using ODBC, is fetched to the in-memory calculation as a DataFrame.
* Data extracted from an external data source, e.g. using ODBC, is fetched to the in-memory calculation as a DataFrame.
* When using a [[QPR_ProcessAnalyzer_Model_Datasources#Loading Data using Loading Script|loading scripts]], cases and events data is fed to the model using the DataFrame.
* When using a [[QPR_ProcessAnalyzer_Model_Datasources#Loading Data using Loading Script|loading scripts]], cases and events data is fed to the model using the DataFrame.
== DataFrame Properties ==


{| class="wikitable"
{| class="wikitable"
Line 12: Line 18:
||Columns (String*)
||Columns (String*)
||DataFrame columns names as an array in the order the columns are in the DataFrame.
||DataFrame columns names as an array in the order the columns are in the DataFrame.
|-
||ColumnNames (Dictionary*)
||DataFrame columns metadata (names and data types) in an array of dictionaries, where each dictionary has the '''Name''' and '''Datatype''' properties. For in-memory DataFrames, the precise data types are available only if the dataframe originates directly from a datatable. In other cases, for example in DataFrames originating from modification operations, the data type ''Any'' is returns for all columns.
Example: 3rd column data type for a variable stored DataFrame:
<pre>
myDataframe.Columns[2].Datatype
</pre>
|-
||ColumnMappings (Dictionary)
||Gives column mappings configured for this DataFrame. Returned data is a dictionary, where keys are mapping names (''CaseId'', ''EventType'', ''Timestamp'') and value is the column name. The ColumnMappings property returns ''null'' if column mappings have not been defined.
Example:
<pre>
let caseIdColumnName = myDataFrame.ColumnMappings("CaseId");
</pre>
|-
||DataSourceConnection
||Returns connection object used by this dataframe to connect to its datasource. For in-memory data frames, ''null'' is returned.
|-
|-
||Rows (Object**)
||Rows (Object**)
Line 36: Line 61:
|-
|-
|}
|}
== DataFrame Functions ==


{| class="wikitable"
{| class="wikitable"
Line 41: Line 68:
! '''Parameters'''
! '''Parameters'''
! '''Description'''
! '''Description'''
|-
||Aggregate (DataFrame)
||
# Aggregated columns (string array, or key-value pairs)
# Aggregation methods (string array)
||
Create a new DataFrame from a ''GroupedDataFrame'' by performing aggregations to all groups separately and returning one row for each group. Also returns all the columns used in the grouping of the values. Parameters:
# '''columns''': Key-value pairs where each mapping describes the column name in the original DataFrame name (key) and the name of the created column (value). Columns having null value in the dictionary are not renamed.
# '''aggregation method''': Array of string values describing aggregation method for each of the aggregation. The length of the array must be equal to the length of columns array.
Supported aggregations are:
* '''Average''': Average value of the specified column.
* '''Count''': Count of rows in this group.
* '''DateTimeRange''': Duration in seconds between the minimum and the maximum values of the DateTimes of the specified column.
* '''Max''': Maximum value of the specified column.
* '''Median''': Median value of the specified column.
* '''Min''': Minimum value of the specified column.
* '''Sum''': Sum of the specified column.
* '''List''': Combines several string values into one string. Optionally a separator character and sorting order for the list can be defined as follows (see example below): #{"Function": "List", "Ordering": ["<ColumnToSort>"], "Separator": ","}
Examples:
<pre>
ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["text"])
.Aggregate(["ids": "id"], ["sum"]).ToCsv()
Returns string:
text;id
three;3
two;36
zero;10
ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["text"])
.Aggregate(
  ["average": "id", "sum": "id", "min": "id", "max": "id", "median": "id"],
  ["average", "sum", "min", "max", "median"]
).ToCsv()
Returns string:
text;average;sum;min;max;median
three;3;3;3;3;3
two;12;36;2;22;12
zero;5;10;0;10;5
ToDataFrame([[0, DateTime(2020, 1)], [0, DateTime(2020, 4)], [0, DateTime(2020, 2)], [1, DateTime(2019, 1)], [1, DateTime(2009, 1)]], ["id", "timestamp"])
.GroupBy(["id"])
.Aggregate(
  ["duration": "timestamp", "count": "id"],
  ["DateTimeRange", "Count"]
).ToCsv()
Returns string:
id;duration;count
0;7862400;3
1;315532800;2
ToDataFrame([[1, "zero"], [1, "one"], [1, "two"], [2, "two"], [3, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["id"])
.Aggregate(
  ["text"],
  [#{"Function": "List", "Ordering": ["text"], "Separator": ", "}]
).ToCsv()
Returns:
id;text
1;one, two, zero
2;two
3;three, two
</pre>
|-
|-
||Append (DataFrame)
||Append (DataFrame)
||
||
* DataFrame which data to append  
# DataFrame to append
# Include all columns (boolean)
||
||
Creates a new DataFrame that has the contents of given DataFrame added to the end of this DataFrame. When the data is combined, the order of columns matters, not the names of the columns. The resulting DataFrame gets column names from this DataFrame.
Creates a new dataframe that has the contents of given dataframe added to the end of this dataframe. The appending behavior is affected by the ''include all columns'' parameter.


If the number of columns is different between this DataFrame and the other DataFrame, an exception is thrown.
Parameters:
# '''Appended dataframe''': The dataframe that is appended to the end.
# '''Include all columns''' (boolean): If ''true'', the result will have all the columns that were present in either of the dataframes. The order of the input columns does not matter. The order of the output columns is the same as the order of the columns in the original dataframe followed by all the rest of the columns existing only in appended dataframe in alphabetical order. If a column doesn't exist in other dataframe, ''null'' values will be set to those columns for dataframes in which the column does not exist. If parameter is ''false'' (default), the result will contain only the columns that were present in the original dataframe. The columns in the appended dataframe must be in the same order as they were in the context data frame.


Examples:
Example 1: Without including all columns:
<pre>
<pre>
Let("dataframe1", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]));
let dataframe1 = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]);
Let("dataframe2", ToDataFrame([[1, "one"], [4, "four"]], ["id", "text"]));
let dataframe2 = ToDataFrame([[1, "one"], [4, "four"]], ["id", "text"]);
dataframe1.Append(dataframe2);
dataframe1.Append(dataframe2).ToCsv();


Returns:
Returns:
Line 63: Line 159:
1;one
1;one
4;four
4;four
</pre>
Example 1: With including all columns:
<pre>
let dataframe1 = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]);
let dataframe2 = ToDataFrame([["one", 1], ["four", 4]], ["text2", "id"]);
dataframe1.Append(dataframe2, true).ToCsv();
Returns:
id;text;text2
0;zero;
2;two;
3;three;
1;;one
4;;four
</pre>
|-
||Clone (DataFrame)
||DataFrame to clone
||
Returns a new DataFrame that is an exact copy of the data frame this method was called for.
Examples:
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Clone()
Returns:A copy of the original DataFrame object.
</pre>
</pre>
|-
|-
||ColumnIndexes (Integer*)
||ColumnIndexes (Integer*)
||
||
* Column names (String*)
Column names (String*)
||
||
Convert DataFrame column names into column indexes. The indexes are starting from zero. If a column is not found, an exception is given.
Convert DataFrame column names into column indexes. The indexes are starting from zero. If a column is not found, an exception is given.
Line 79: Line 202:
||Column (Object*)
||Column (Object*)
||
||
* Column name
Column name
||
||
Returns an array of values of given column in the order rows are in the datatable.
Returns an array of values of given column in the order rows are in the datatable.
Line 91: Line 214:
||Columns (DataFrame)
||Columns (DataFrame)
||
||
* Column indexes
Array of column names
||
||
Creates a new DataFrame having only the defined columns of the original DataFrame.
Creates a new DataFrame having only the defined columns of the original DataFrame. Note that Columns function is different than Columns property (difference is that the function has parameters).
 
Note that Columns function is different than Columns property - difference is that the function has parameters.


Examples:
Examples:
<pre>
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Columns([1]).ToCsv()
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Columns(["right"]).ToCsv()
Returns:
Returns:
right
right
Line 106: Line 227:
three
three


</pre>
|-
||ExcludeValues (DataFrame)
||
# Column name (string)
# Value (single item) or values (array) to exclude
||
Creates a new DataFrame having only rows for which given column does not have any of the specified values.
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"], [2, "two"]], ["id", "left"]).ExcludeValues("id", 1).ToCsv()
Returns:
id;left
0;zero
2;two
ToDataFrame([[0, "zero"], [1, "one"], [2, "two"]], ["id", "left"]).ExcludeValues("left", ["one", "two", "three"]).ToCsv()
Returns:
id;left
0;zero
</pre>
</pre>
|-
|-
||Head (DataFrame)
||Head (DataFrame)
||
||
* Number of rows
Number of top rows
||
||
Creates a new DataFrame that only contains top number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.
Creates a new DataFrame that only contains the defined top number of rows of this DataFrame. If the DataFrame has less than the defined top rows, all rows are returned.


Examples:
Examples:
Line 123: Line 265:
</pre>
</pre>
|-
|-
||Join (DataFrame)
||IncludeOnlyValues
||
# Column name (string)
# Value (single item) or values (array) to include
||
Create a new DataFrame containing only those rows for which the given column has any of the given values. Values can be provided as a single object (if there is only one value) or an array of objects (if multiple values).
 
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).IncludeOnlyValues("id", 1).ToCsv()
Returns:
id;left
1;one
 
ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("id", [0, 1, 2]).ToCsv()
Returns:
id;left
0;zero
1;one
 
ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("left", ["zero", "three"]).ToCsv()
Returns:
id;left
0;zero
3;three
 
df.IncludeOnlyValues("EventType", ["start", "end"])
Returns a data frame containing all the rows in DataFrame df that have either "start" or "end" value in the column "EventType".
</pre>
|-
||[[Joining_DataFrames|Join]] (DataFrame)
||
||
# Other DataFrame
# Other DataFrame
# Columns to join
# Columns to join
# join type
# Join type
||
||
See [[#Joining_DataFrames|Joining DataFrames]].
|-
||GroupBy (GroupedDataFrame)
||
Grouped columns (string array)
||
Creates a ''GroupedDataFrame'' object based based on given columns. Takes as a parameter an array of column names, based on which to group the rows. For examples, see the ''Aggregate'' function.


|-
|-
Line 140: Line 317:


Parameters:
Parameters:
# '''Columns''': Columns to group identified by column indexes or column names.
# '''Columns''': Columns to group identified by column names.
# '''Aggregation expressions''': Array containing the column name as a key and the aggregation expression as a value.
# '''Aggregation expressions''': Array containing the column name as a key and the aggregation expression as a value.


Line 147: Line 324:
ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id"],
ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id"],
[
[
   "ids": Def("", Sum(id)),
   "ids": () => Sum(id),
   "texts": Def("", StringJoin(",", text)),
   "texts": () => StringJoin(",", text),
   "constant": 123
   "constant": 123
]).ToCsv()
]).ToCsv()
Line 159: Line 336:
ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id", "text"],
ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id", "text"],
[
[
   "ids": Def("", Sum(id)),
   "ids": () => Sum(id),
   "texts": Def("", StringJoin(",", text)),
   "texts": () => StringJoin(",", text),
   "constant": 123
   "constant": 123
]).ToCsv()
]).ToCsv()
Line 170: Line 347:
2;two3;123
2;two3;123
3;three;123
3;three;123
Analysis("OperationLog")
.GroupBy(
  ["User Name"],
  [
    "User Name": Def("", Column("User Name")[0]),
    "Count": Def("Rows", CountTop(Rows)),
    "Avg. Duration": Def("", Average(Duration)),
    "Max. Duration": Def("", Max(Duration))
  ]
).ToCsv()
Returns (similar to this):
User Name;Count;Avg. Duration;Max. Duration
;207;0.617434782608696;20.556
Administrator;665;16.3750631578947;4225.497
qpr;128;2.158765625;20.346
</pre>
</pre>
|-
|-
||Merge (DataFrame)
||[[Merging_DataFrames|Merge]] (DataFrame)
||
||
||
||
See [[#Merging_DataFrames|Merging DataFrames]].
|-
|-
||OrderBy (DataFrame)
||OrderBy (DataFrame)
||
||
* ordering expression
# primary ordering expression
# secondary ordering expression
# ...
||
||
Creates a new DataFrame having rows ordered in an ascending order using the given expression evaluated on each row.
Creates a new DataFrame having rows ordered in an ascending order using the given expression(s) evaluated on each row. Rows that have same ordering value in the primary ordering exprssion, are sorted based on the secondary ordering expression.


Examples:
Examples:
Line 213: Line 374:
id;right
id;right
0;zero
0;zero
2;two
3;three
el.Analysis("EventTypes").OrderBy(Count, Name)
Returns event types in eventlog ordered primarily by Count and secondarily by Name.
</pre>
If there is a need to sort by some column ascending and some column descending, the sortings can be chained. Example:
<pre>
el.Analysis("EventTypes").OrderByDescending(Name).OrderBy(Count)
Returns event types in eventlog ordered primarily by Count ascending and secondarily by Name descending.
</pre>
|-
||OrderByColumns (DataFrame)
||
# Columns to be ordered (String array)
# Sorting order (boolean array)
||
Creates a new DataFrame having rows ordered by given columns in given directions. Note that the ordered columns need to contain same type of data, because ordering is not possible between different data types. Parameters:
# '''columns''': Column to be sorted.
# '''sort order''': Array of boolean values indicating whether to sort the columns in ascending (''true'') or descending (''false'') direction. The length of the array must be equal to the length of columns array.
Null values are always first in the order (both in ascending and descending order).
Examples:
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id"], [false]).ToCsv()
Returns string:
id;text
3;three
2;two
0;zero
ToDataFrame([[0, "zero"], [0, "nolla"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id", "text"], [true, false]).ToCsv()
Returns string:
id;text
0;zero
0;nolla
2;two
2;two
3;three
3;three
Line 219: Line 418:
||OrderByDescending (DataFrame)
||OrderByDescending (DataFrame)
||
||
* ordering expression
# primary ordering expression
# secondary ordering expression
# ...
||
||
Creates a new DataFrame having rows ordered in an descending order using the given expression evaluated on each row. See OrderBy above for examples.
Creates a new DataFrame having rows ordered in an descending order using the given expression(s) evaluated on each row. See OrderBy above for examples.
 
|-
|-
||Persist (DataTable)
||Persist (DataTable)
||
||
* DataTable name
* Datatable name (String)
* Additional parameters
* Additional parameters (Dictionary)
||
||
Writes the DataFrame into QPR ProcessAnalyzer database as a DataTable having the given name. If a DataTable with that name does not exist in the project, a new DataTable is created. If a DataTable with that name already exists, the DataFrame will be stored into that DataTable. The function returns a DataTable object which the data was persisted into.
Writes DataFrame into datatable. If a datatable with that name does not exist in the project, a new datatable is created. If a datatable with that name already exists, the DataFrame will be stored into that DataTable. The function returns the written datatable object.


The additional parameters support:
The following parameters are supported:
* Append: Can be used to determine whether to append (true) or overwrite (false) the existing data. Default is overwrite.
* '''ProjectName''': Name of the project where the datatable is created.
* ProjectName: Name of the project under which the DataTable is to be created.
* '''ProjectId''': Id of project where the datatable is created.
* ProjectId: Id of the project under which the DataTable is to be created.
* '''Append''', '''ImportExistingColumnOnly''' and '''MatchByColumns''': See the [[QPR_ProcessAnalyzer_Objects_in_Expression_Language#Datatable|import function]] for details.


Examples:
Examples:
<pre>
<pre>
Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
let right = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]);
right.Persist("RightDataTable", ["ProjectName": "TestData"])
right.Persist("RightDataTable", ["ProjectName": "TestData"])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, its contents will be overwritten by the new content.
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, its contents will be overwritten by the new content.


Let("newData", ToDataFrame([[4, "four"]], ["id", "right"]));
let newData = ToDataFrame([[4, "four"]], ["id", "right"]);
newData.Persist("RightDataTable", ["ProjectName": "TestData", "Append": true])
newData.Persist("RightDataTable", ["ProjectName": "TestData", "Append": true])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, new content will be appended into the end of the table.
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, new content will be appended into the end of the table.
</pre>
The following script reads data from a datatable (MyProject -> MyDatatable) convert one column (MyColumn) into floats and writes the data back to the same datatable.
<pre>
let project = (Projects.Where(Name=="MyProject"))[0];
let datatable = (project.Datatables.Where(Name=="MyDatatable"))[0];
DatatableById(datatable.Id).DataFrame
.SetColumns([
"MyColumn": () => ToFloat(Column("MyColumn"))
])
.Persist(datatable.Name, ["ProjectId": project.Id, "Append": false]);
</pre>
|-
||RemoveColumns (DataFrame)
||
Column names (string array)
||
Creates new DataFrame where the defined columns have been removed. Throws an exception if the original DataFrame doesn't contain any of the defined columns.
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RemoveColumns(["id"]).ToCsv()
Returns string:
left
zero
one
</pre>
|-
||RenameAllColumns (DataFrame)
||Template string
||
Renames all columns of the DataFrame using a name template. The template needs to contain '''{0}''' that will be replaced by the original name of the column. For escapings, '''{{''' is replaced with '''{''' and '''}}''' with '''}'''. Throws an exception if the template doesn't contain {0}.
Example:
<pre>
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"])
  .RenameAllColumns("new_{0}").ToCsv();
Returns:
new_id;new_left
0;zero
1;one
</pre>
|-
||RenameColumns (DataFrame)
||Key-value pairs of name mappings
||
Renames DataFrame columns. Takes a parameter of key-value pairs describing how old names (value) are changed to new names (key). Throws an exception if any of the (old) column names don't exist. Renaming DataFrame columns doesn't copy the data, so it's a quick operation for even large datasets.
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RenameColumns(["newId": "id", "newLeft": "left"]).ToCsv()
Returns string:
newId;newLeft
0;zero
1;one
</pre>
|-
||Select
||Column names (string array, or key-value pairs)
||
Creates a new DataFrame where only the selected columns are included. Allows also to change column names when the parameter contains key-value pairs where the original column names are as values and new columns names as keys (see the examples). Throws an exception if any of the columns specified does not exist.
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["left"]).ToCsv()
Returns string:
left
zero
one
ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["newLeft": "left"]).ToCsv()
Returns string:
newLeft
zero
one
ToDataFrame([[0, "zero", "nolla"], [1, "one", "yksi"]], ["id", "left", "right"]).Select(["left": "newLeft", "right"]).ToCsv()
Returns string:
newLeft;right
zero;nolla
one;yksi
</pre>
|-
||SelectDistinct
||Column names (string array, or key-value pairs)
||
Returns only distinct values within given columns or all columns (if no columns are specified). Takes as parameter the column names defining for which columns the distinct value combinations are returned. When no parameter is provided, returns all columns in the dataframe. Supports also renaming columns similar to the Select function.
Examples:
<pre>
ToDataFrame([[0, "Apple"], [1, "Orange"], [1, "Orange"], [1, "Apple"], [2, "Apple"]], ["number", "text"])
.SelectDistinct()
Returns:
number;text
0;Apple
1;Orange
1;Apple
2;Apple
ToDataFrame([[0, "Apple"], [1, "Orange"], [1, "Orange"], [1, "Apple"], [2, "Apple"]], ["number", "text"])
.SelectDistinct(["text"])
Returns:
text
Apple
Orange
</pre>
</pre>
|-
|-
||SetColumns (DataFrame)
||SetColumns (DataFrame)
||
||
* new/modified columns as array
New/modified columns as array
||
||
Creates a new DataFrame based on the current DataFrame, where new columns have been created and/or existing columns have been modified. New and modified columns are defined using an array, where the column name is as a key and as a value there is the expression to calculate the new or modified column. When specifying a column that already exists, the column values are modified. When specifying a new column name, that column is created as a new column to the resulting DataFrame.
Creates a new DataFrame based on the current DataFrame, where new columns have been created and/or existing columns have been modified. New and modified columns are defined using an array, where the column name is as a key and as a value there is the expression to calculate the new or modified column. When specifying a column that already exists, the column values are modified. When specifying a new column name, that column is created as a new column to the resulting DataFrame.
Line 255: Line 560:
Examples:
Examples:
<pre>
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id)]).ToCsv()
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
"both": () => text + "=" + id
]).ToCsv()
Returns:
Returns:
id;text;both
id;text;both
Line 262: Line 569:
3;three;three=3
3;three;three=3


ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["text": Def("", text + "=" + id)]).ToCsv()
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
"text": () => text + "=" + id
]).ToCsv()
Returns:
Returns:
id;text
id;text
Line 269: Line 578:
3;three=3
3;three=3


ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": Def("", text + "=" + id), "both+1": Def("", both + 1), "text": Def("", "Done: " + Column("both+1")), "constant": 1234]).ToCsv()
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
"both": () => text + "=" + id,
"both+1": () => both + 1,
"text": () => "Done: " + Column("both+1"),
"constant": 1234
]).ToCsv()
Returns:
Returns:
id;text;both;both+1;constant
id;text;both;both+1;constant
Line 275: Line 589:
2;Done: two=21;two=2;two=21;1234
2;Done: two=21;two=2;two=21;1234
3;Done: three=31;three=3;three=31;1234
3;Done: three=31;three=3;three=31;1234
</pre>
|-
||Skip (DataFrame)
||Number of rows to skip
||Returns DataFrame where the defined number of rows is removed ("skipped"). Returns an empty DataFrame if skipping more rows than there are rows in the source DataFrame.
Examples:
<pre>
ToDataFrame([[0, "zero"], [1, "one"], [2, "two"], [3, "three"]], ["id", "right"]).Skip(1).ToCsv()
Results string:
id;right
1;one
2;two
3;three
</pre>
</pre>
|-
|-
||Tail (DataFrame)
||Tail (DataFrame)
||
||
* Number of rows
Number of rows
||
||
Creates a new DataFrame that has only the bottom number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.
Creates a new DataFrame that has only the bottom number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.
Line 293: Line 621:
|-
|-
||ToCsv (String)
||ToCsv (String)
||(none)
||includeHeaders (boolean)
||
||
Converts a DataFrame into a CSV data. The CSV data has the following formatting:
Converts a DataFrame into a CSV data (i.e. string). The CSV data has the following formatting:
* Column separator: semicolon
* Column separator: semicolon (;)
* Decimal separator in numeric fields: period
* Decimal separator in numeric fields: period (.)
* Quotation character for text fields: double quotes (used when the textual value contains semicolon, double quotes, linebreak or tabulator)
* Quotation character for text fields: double quotes (") (used when the textual value contains semicolon, double quotes, linebreak or tabulator)
* Escape character: Double quotes in textual fields are escaped with two double quotes.
* Escape character: Double quotes in textual fields are escaped with two double quotes ("")
* Date format for date fields: yyyy-MM-dd HH:mm:ss,fff
* Date format for date fields: yyyy-MM-dd HH:mm:ss,fff
* First line: contains column headers
* Timespan (duration) format: dd.hh:mm:ss
* First line contains column headers
 
Parameter ''includeHeaders'' defines whether the header (column names) is returned (true, default) or not (false) as the first row.


Example:
Example:
Line 313: Line 644:
</pre>
</pre>
|-
|-
||Where (DataFrame)
||WithDenseRankNumberColumn (DataFrame)
||
# New column name (String)
# Order by columns (String array)
# Partition by columns (String array)
# Ascending/descending order (Boolean array)
||
Similar to the WithRankColumn function, except rank numbers doesn't contain gaps when there are rows with same rank values.
 
<pre>
let data = ToDataFrame([
["A", "Dallas", 8],
["B", "Dallas", 5],
["C", "Dallas", 5],
["D", "Dallas", 4],
["B", "New York", 6],
["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithDenseRankColumn("Rank", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Rank
C;New York;2;1
D;Dallas;4;2
B;Dallas;5;3
C;Dallas;5;3
B;New York;6;4
A;Dallas;8;5
</pre>
|-
||WithColumn (DataFrame)
||
||
* condition expression
# New column name (String)
# Calculation expression
||
Creates new DataFrame with new column having value evaluated using given expression. If the column name already exist in the original DataFrame, it gets replaced with the evaluated expression.
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"])
  .WithColumn("combinedtext", text + "=" + (id * 2))
  .ToCsv();
Returns:
id;text;combinedtext
0;zero;zero=0
2;two;two=4
3;three;three=6
</pre>
|-
||WithRankColumn (DataFrame)
||
# New column name (String)
# Order by columns (String array)
# Partition by columns (String array)
# Ascending/descending order (Boolean array)
||
Similar to the WithRowNumberColumn function, except the produced numbering is based on the ranking logic. The difference to the row number is that the rows with an equal sorting value, gets the same rank number. The numbering continues being same as the row number, i.e. there are gaps in the assigned ranks if there are rows same rank values (see the example).
 
<pre>
let data = ToDataFrame([
["A", "Dallas", 8],
["B", "Dallas", 5],
["C", "Dallas", 5],
["D", "Dallas", 4],
["B", "New York", 6],
["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithRankColumn("Rank", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Rank
C;New York;2;1
D;Dallas;4;2
B;Dallas;5;3
C;Dallas;5;3
B;New York;6;5
A;Dallas;8;6
</pre>
|-
||WithRowNumberColumn (DataFrame)
||
# New column name (String)
# Order by columns (String array)
# Partition by columns (String array)
# Ascending/descending order (Boolean array)
||
Creates new DataFrame with a new column containing ''row numbers'' based on defined partitions and ordering. Partitions are defined as one or several columns, and each partition will contain own row numbering starting from one. Rows are ordered within each partition and the row numbering is based in the ordering (which is separate than the ordering of the result data).
 
Parameters:
# '''New column name''': Name of the new column containing the row number.
# '''Order by columns''': Array of column(s) to order the rows where the row number is based on.
# '''Partition by columns''' (optional): Array of column(s) to partition the rows by. Each partition will have own row numbering starting from one. Partition parameter can be omitted, which will treat entire data as one partition.
# '''Ascending/descending order''' (optional): Array of booleans to define whether the ordering for each column (in the ''Order by columns'' parameter) is ascending (''true'', by default) or descending (''false'').
 
Examples:
<pre>
let data = ToDataFrame([
["A", "Dallas", 8],
["B", "Dallas", 5],
["C", "Dallas", 4],
["B", "New York", 6],
["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithRowNumberColumn("Order", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Order
C;New York;2;1
C;Dallas;4;2
B;Dallas;5;3
B;New York;6;4
A;Dallas;8;5
 
data.WithRowNumberColumn("Order", ["Revenue"], ["Region"]).OrderByColumns(["Region", "Revenue"], [true, true]).ToCsv()
Returns:
Customer;Region;Revenue;Order
C;Dallas;4;1
B;Dallas;5;2
A;Dallas;8;3
C;New York;2;1
B;New York;6;2
 
data.WithRowNumberColumn("Order", ["Revenue"], ["Region"], ["false"]).OrderByColumns(["Region", "Revenue"], [true, false]).ToCsv()
Returns:
Customer;Region;Revenue;Order
A;Dallas;8;1
B;Dallas;5;2
C;Dallas;4;3
B;New York;6;1
C;New York;2;2
</pre>
|-
||<span id="Where">Where</span> (DataFrame)
||
Condition expression
||
||
Creates a new DataFrame having only rows for which the given condition expression returns true. The condition expression can refer to the columns of the DataFrame (see the example below).
Creates a new DataFrame having only rows for which the given condition expression returns true. The condition expression can refer to the columns of the DataFrame (see the example below).
Line 321: Line 780:
Examples:
Examples:
<pre>
<pre>
Let("df", ToDataFrame([[0, "zero"], [2, "two", true], [3, "three"]], ["id", "string"]));
let df = ToDataFrame([[0, "zero"], [2, "two", true], [3, "three"]], ["id", "string"]);


All the following expression return the same:
All the following expression return the same:
Line 336: Line 795:
||Zip (DataFrame)
||Zip (DataFrame)
||
||
* DataFrame
DataFrame
||
||
Creates a new DataFrame that has the contents of given DataFrame appended as new columns into the end of this DataFrame. Returns a new DataFrame that has the colums from both the data frames so that the columns from the other DataFrame are appended to the end of the columns of this DataFrame. If the number of rows is different between this DataFrame and the other DataFrame, an exception is thrown. There must not be duplicate column names in the DataFrames - otherwise an exception is thrown.
Creates a new DataFrame that has the contents of given DataFrame appended as new columns into the end of this DataFrame. Returns a new DataFrame that has the colums from both the data frames so that the columns from the other DataFrame are appended to the end of the columns of this DataFrame. If the number of rows is different between this DataFrame and the other DataFrame, an exception is thrown. There must not be duplicate column names in the DataFrames - otherwise an exception is thrown.
Line 342: Line 801:
Examples:
Examples:
<pre>
<pre>
Let("df1", ToDataFrame([[0, "zero"], [1, "one"], [4, "four"]], ["id", "text"]));
let df1 = ToDataFrame([[0, "zero"], [1, "one"], [4, "four"]], ["id", "text"]);
Let("df2", ToDataFrame([[1, "one"], [2, "two"], [3, "three"]], ["id2", "text2"]));
let df2 = ToDataFrame([[1, "one"], [2, "two"], [3, "three"]], ["id2", "text2"]);
df1.Zip(df2).ToCsv();
df1.Zip(df2).ToCsv();
Returns:
Returns:
Line 354: Line 813:
|}
|}


The following functions can be used to initialize DataFrame objects:
== DataFrame sources ==
The following functions can be used to extract data as DataFrames:
{| class="wikitable"
{| class="wikitable"
!'''Function'''
!'''DataFrame&nbsp;functions'''
!'''Parameters'''
! '''Parameters'''
! '''Description'''
! '''Description'''
|-
|-
||ToDataFrame
||ImportOdbc (DataFrame)
||
||
# matrix
# connection string (string)
# column names array
# query (string)
# query execution timeout (integer)
||
||
Creates a DataFrame based on the given matrix (two dimensional array) and column names. Number of column names should be the same as the number of columns in the matrix.
Runs given query in given ODBC datasource and returns data as DataFrame. [[PA_Configuration_database_table|AllowExternalDatasources]] setting needs to be ''true'' to use the ImportODBC function. Note also that the ODBC connection requires an ODBC driver to the datasource to be installed in the QPR ProcessAnalyzer Server.


Examples:
Example: Contents of OdbcTest table in the ODBC datasource is fetched and returned as a DataFrame.
<pre>
<pre>
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
ImportOdbc(
Returns:
  "Driver={SQL Server};Server=localhost;DataBase=QPR_PA1;Trusted_Connection=yes",
A string containing:
  "SELECT * FROM OdbcTest",
id;right
  500
0;zero
)
2;two
3;three
</pre>
</pre>
|-
|-
|}
||<span id="ImportOdbcSecure">ImportOdbcSecure</span> (DataFrame)
||
# project id (Integer)
# connection string key (string)
# query (string)
# query execution timeout (integer)
||
Similar command as ImportOdbc, except instead of the plain text connection string, a [[Storing_Secrets_for_Scripts|secret name]] is provided. Also a project id where to fetch the connection string key from needs to be provided.


==Joining DataFrames==
Example: Contents of OdbcTest table in the ODBC datasource is fetched and returned as a DataFrame.
Performs a [https://en.wikipedia.org/wiki/Relational_algebra#Joins_and_join-like_operators joining operation] between two DataFrames. Columns which the joining is based on, can be defined as follows:
# If joining using one column having the same name in both DataFrames, the column name is specified as as string.
# If joining using several columns having the same names in both DataFrames, the column names are specified as an array.
# If joining using columns that have different names between the DataFrames, columns are specified as an array of key-value pairs, where the key is the column name in the left side DataFrame, and value is the column name in the right side DataFrame.
 
The join type can be '''inner''' (row is generated if both DataFrames have the key) or '''leftouter''' (at least one row is generated for each left side DataFrame row, even if there is no matching other row, in which case null is given as value for the other columns). The default is ''inner''.
 
Examples:
<pre>
<pre>
Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]));
ImportOdbcSecure(
Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
  12,
left.join(right, "id").ToCsv()
  "MySecureConnectionString",
Returns:
  "SELECT * FROM OdbcTest",
id;left;right
  500
0;zero;zero
)
 
Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]));
Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
left.join(right, "id", "leftouter").ToCsv()
Returns:
id;left;right
0;zero;zero
1;one;
 
Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"]));
Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"]));
left.join(right, ["idleft1": "idright1"], "inner");
Returns:
idleft1;idleft2;left;idright2;right
0;0;zerozeroleft;0;zerozeroright
0;0;zerozeroleft;1;zeroright
0;1;zeroleft;0;zerozeroright
0;1;zeroleft;1;zeroright
 
Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"]));
Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"]));
left.join(right, ["idleft1": "idright1", "idleft2": "idright2"], "inner");
Returns:
idleft1;idleft2;left;right
0;0;zerozeroleft;zerozeroright
0;1;zeroleft;zeroright
</pre>
</pre>
|-
||ToDataFrame
||
# data as 2-dimensional array
# column names (array of strings) or column metadata (array of dictionaries)
||
Creates a DataFrame object containing the given data (in two dimensional array) and the array of column names. Number of column names must be the same as the number of columns in the two dimensional array.


==Merging DataFrames==
To be able to define data types for the columns, the second parameter can also be an array of dictionaries (one for each column), where each dictionary contains ''Name'' and ''DataType'' properties (see an example below). Available data types are ''String'', ''Integer'', ''Float'', ''DateTime'', ''Boolean'', ''Duration'' (Timespan) and ''Any'' (can contain any type of data).


Creates a new DataFrame that has the contents of target DataFrame merged with a source DataFrame. Principles follow how merging works in the SQL language (https://en.wikipedia.org/wiki/Merge_(SQL)). Note that the Merge is not able to create new columns to the target DataFrame.
The first parameter can also be an existing DataFrame which will create a copy of the DataFrame.
 
Parameters:
# '''Source DataFrame''': DataFrame to be merged with the current DataFrame.
# '''Columns to MATCH''': Columns used to match the source and target DataFrame. This parameter is defined like the join function's matchColumns parameter (2. parameter).
# '''Columns to UPDATE when matching rows''': Columns to copy from the source to target for the matching rows. This parameter is defined like the join function's matchColumns parameter (2. parameter).
#* If the value is ''_remove'', all matching rows are be removed from the resulting table.
#* If the value is ''null'', all rows are copied from the source to target (the ones having matching names). The default value is null.
#* If the value is ''[]'', no columns are updated.
# '''Columns to CREATE for non-match rowsing''': Columns to create to target, if a matching row is not found. This parameter is defined like the previous update parameter.
# '''Keep or DELETE if not matched by source''': Boolean value that defines whether those rows in the target DataFrame where no matching row in the source DataFrame is found, are included in the result (true) or deleted (false). The default value is ''true''.


Examples:
Examples:
Update case attribute values for some cases and create non-existing:
<pre>
<pre>
Let("CaseData", ToDataFrame([
ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
["0", "New York", 6],
["1", "Dallas", 3],
["2", "Dallas", 8],
["3", "Chicago", 4],
["4", "New York", 2]
],["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
["0", 7],
["1", 5],
["4", 2],
["5", 2]
],["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id").tocsv();
 
Returns:
Returns:
Case id;Region;Cost
A string containing:
0;New York;7
id;right
1;Dallas;5
0;zero
2;Dallas;8
2;two
3;Chicago;4
3;three
4;New York;2
5;;2
</pre>
 
Update case attribute values for some cases and not create non-existing:
<pre>
Let("CaseData", ToDataFrame([
["0", "New York", 6],
["1", "Dallas", 3],
["2", "Dallas", 8],
["3", "Chicago", 4],
["4", "New York", 2]
],["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
["0", 7],
["1", 5],
["4", 2],
["5", 2]
],["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id", null, _remove).tocsv();


ToDataFrame(
  [[0, "zero"], [2, "two"], [3, "three"]],
  [
    #{"Name": "id", "DataType": "Integer"},
    #{"Name": "right", "DataType": "String"}
  ]
).ToCsv();
Returns:
Returns:
Case id;Region;Cost
id;right
0;New York;7
0;zero
1;Dallas;5
2;two
2;Dallas;8
3;three
3;Chicago;4
4;New York;2
</pre>
 
Update case attribute values for some cases (create non-existing) where columns to match have different names in the source and target DataFrames ("Case id" in target and "Case" in source are used to match, and "Cost" in target is updated from "Variable Cost" in source):
<pre>
Let("CaseData", ToDataFrame([
["0", "New York", 6],
["1", "Dallas", 3],
["2", "Dallas", 8],
["3", "Chicago", 4],
["4", "New York", 2]
],["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
["0", 7, 4],
["1", 5, 2],
["4", 2, 0],
["5", 2, 1]
],["Case", "Cost", "Variable Cost"]));
CaseData.Merge(UpdatedData, ["Case id": "Case"], ["Cost": "Variable Cost"]).tocsv();
 
Returns:
Case id;Region;Cost
0;New York;4
1;Dallas;2
2;Dallas;8
3;Chicago;4
4;New York;0
5;;1
</pre>
 
Delete matching cases (don't create non-matching):
<pre>
Let("CaseData", ToDataFrame([
["0", "New York", 6],
["1", "Dallas", 3],
["2", "Dallas", 8],
["3", "Chicago", 4],
["4", "New York", 2]
],["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
["0"],
["1"],
["4"],
["5"]
],["Case id"]));
CaseData.Merge(UpdatedData, "Case id", _remove, _remove).tocsv();
 
Returns:
Case id;Region;Cost
2;Dallas;8
3;Chicago;4
</pre>
 
Update matching cases, create non-matching by source as new, and delete non-matching by target:
<pre>
Let("CaseData", ToDataFrame([
["0", "New York", 6],
["1", "Dallas", 3],
["2", "Dallas", 8],
["3", "Chicago", 4],
["4", "New York", 2]
],["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
["0", 7],
["1", 5],
["4", 2],
["5", 2]
],["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id", null, null, false).tocsv();
 
Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
4;New York;2
5;;2
</pre>
 
<pre>
Let("target", ToDataFrame([[0, "zero", "target"], [1, "", "target"]], ["id", "text", "frame"]));
Let("source", ToDataFrame([[1, "one", "source"], [2, "two", "source"], [3, "three", "source"]], ["id", "text", "frame"]));
 
target.Merge(source, "id").ToCsv()
Returns (one key, default parameters, identical dataframe columns):
id;text;frame
0;zero;target
1;one;source
2;two;source
3;three;source
 
target.Merge(source, "id", ["text"]).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source):
id;text;frame
0;zero;target
1;one;target
2;two;
3;three;
 
target.Merge(source, "id", ["text"], _remove).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source, remove rows found only in source):
id;text;frame
0;zero;target
1;one;target
 
target.Merge(source, "id", ["text"], _remove, false).ToCsv()
Returns (one key, identical dataframe columns, copy only text column from source, remove rows found only in source or only in target):
id;text;frame
1;one;target
 
target.Merge(source, "id", _remove, _remove, false).ToCsv()
Returns (one key, identical dataframe columns, remove all rows):
id;text;frame
 
 
Let("target", ToDataFrame([[0, 0, "zerozeroleft", "target"], [0, 1, "zeroleft", "target"], [1, 2, "left", "target"], [4, 5, "fourleft", "target"]], ["idleft1", "idleft2", "textleft", "frame"]));
Let("source", ToDataFrame([[0, 0, "zerozeroright", "source"], [0, 1, "zeroright", "target"], [1, 2, "oneright", "source"], [2, 3, "tworight", "source"], [3, 4, "threeright", "source"]], ["idright1", "idright2", "textright", "frame"]));
 
target.Merge(source, ["idleft1": "idright1"]).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;0;zerozeroleft;target
0;1;zeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;;;source
3;;;source
 
target.Merge(source, ["idleft1": "idright1"], []).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy only key column):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;target
0;0;zerozeroleft;target
0;1;zeroleft;target
0;1;zeroleft;target
1;2;left;target
4;5;fourleft;target
2;;;
3;;;
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;;source
3;4;;source
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright -column):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy textright and frame -columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;source
0;1;zeroright;target
1;2;oneright;source
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], _remove).ToCsv()
Returns (two keys, default parameters, different dataframe columns, remove all matching rows, copy only matching columns):
idleft1;idleft2;textleft;frame
4;5;fourleft;target
2;3;;
3;4;;
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], null, ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, don't copy any columns from source for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;tworight;
3;4;threeright;
 
target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"], false).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target, remove all rows not found in source):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
2;3;tworight;source
3;4;threeright;source
</pre>
</pre>
|-
|}

Latest revision as of 10:48, 7 August 2024

DataFrame represents a two dimensional array of data with one-to-many columns and zero-to-many rows, like relational database table, Excel sheet or CSV data file. Each column in the DataFrame has a name, and there must not be more than one column with the same name.

DataFrame is the generic data structure used to manage all kinds data in QPR ProcessAnalyzer expression engine that run in-memory. As DataFrame is an in-memory object, processing large dataset using DataFrames requires considerable amount of memory in QPR ProcessAnalyzer Server. Instead of using DataFrames, data can also be processed using SqlDataFrames (in Snowflake or SQL Server) or using DataFlows (as a stream with low memory requirements).

DataFrames as linked to other entities in as follows:

  • Datatable contents can be fetched into the memory as DataFrame
  • DataFrame can be stored (persisted) to Datatable
  • ETL operations, such as joining, unions, filtering and grouping are based on the DataFrames
  • Data extracted from an external data source, e.g. using ODBC, is fetched to the in-memory calculation as a DataFrame.
  • When using a loading scripts, cases and events data is fed to the model using the DataFrame.

DataFrame Properties

DataFrame properties Description
Columns (String*) DataFrame columns names as an array in the order the columns are in the DataFrame.
ColumnNames (Dictionary*) DataFrame columns metadata (names and data types) in an array of dictionaries, where each dictionary has the Name and Datatype properties. For in-memory DataFrames, the precise data types are available only if the dataframe originates directly from a datatable. In other cases, for example in DataFrames originating from modification operations, the data type Any is returns for all columns.

Example: 3rd column data type for a variable stored DataFrame:

myDataframe.Columns[2].Datatype
ColumnMappings (Dictionary) Gives column mappings configured for this DataFrame. Returned data is a dictionary, where keys are mapping names (CaseId, EventType, Timestamp) and value is the column name. The ColumnMappings property returns null if column mappings have not been defined.

Example:

let caseIdColumnName = myDataFrame.ColumnMappings("CaseId");
DataSourceConnection Returns connection object used by this dataframe to connect to its datasource. For in-memory data frames, null is returned.
Rows (Object**) Returns the data content of the DataFrame as a two-dimensional array (matrix). The column names are not part of the data content.

Examples:

DatatableById(5).DataFrame.Rows[0][0]
Returns: the value in the first row and first column in a datatable with id 5.
<column name> (Object*)

Returns an array of values of given column in the datatable. If the column name contains spaces, the Column function needs to be used to refer to a column.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right
Returns: [zero, two, three]

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right[2]
Returns: three

DataFrame Functions

DataFrame functions Parameters Description
Aggregate (DataFrame)
  1. Aggregated columns (string array, or key-value pairs)
  2. Aggregation methods (string array)

Create a new DataFrame from a GroupedDataFrame by performing aggregations to all groups separately and returning one row for each group. Also returns all the columns used in the grouping of the values. Parameters:

  1. columns: Key-value pairs where each mapping describes the column name in the original DataFrame name (key) and the name of the created column (value). Columns having null value in the dictionary are not renamed.
  2. aggregation method: Array of string values describing aggregation method for each of the aggregation. The length of the array must be equal to the length of columns array.

Supported aggregations are:

  • Average: Average value of the specified column.
  • Count: Count of rows in this group.
  • DateTimeRange: Duration in seconds between the minimum and the maximum values of the DateTimes of the specified column.
  • Max: Maximum value of the specified column.
  • Median: Median value of the specified column.
  • Min: Minimum value of the specified column.
  • Sum: Sum of the specified column.
  • List: Combines several string values into one string. Optionally a separator character and sorting order for the list can be defined as follows (see example below): #{"Function": "List", "Ordering": ["<ColumnToSort>"], "Separator": ","}

Examples:

ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["text"])
.Aggregate(["ids": "id"], ["sum"]).ToCsv()
Returns string:
text;id
three;3
two;36
zero;10

ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["text"])
.Aggregate(
  ["average": "id", "sum": "id", "min": "id", "max": "id", "median": "id"],
  ["average", "sum", "min", "max", "median"]
).ToCsv()
Returns string:
text;average;sum;min;max;median
three;3;3;3;3;3
two;12;36;2;22;12
zero;5;10;0;10;5

ToDataFrame([[0, DateTime(2020, 1)], [0, DateTime(2020, 4)], [0, DateTime(2020, 2)], [1, DateTime(2019, 1)], [1, DateTime(2009, 1)]], ["id", "timestamp"])
.GroupBy(["id"])
.Aggregate(
  ["duration": "timestamp", "count": "id"],
  ["DateTimeRange", "Count"]
).ToCsv()
Returns string:
id;duration;count
0;7862400;3
1;315532800;2

ToDataFrame([[1, "zero"], [1, "one"], [1, "two"], [2, "two"], [3, "two"], [3, "three"]], ["id", "text"])
.GroupBy(["id"])
.Aggregate(
  ["text"],
  [#{"Function": "List", "Ordering": ["text"], "Separator": ", "}]
).ToCsv()
Returns:
id;text
1;one, two, zero
2;two
3;three, two
Append (DataFrame)
  1. DataFrame to append
  2. Include all columns (boolean)

Creates a new dataframe that has the contents of given dataframe added to the end of this dataframe. The appending behavior is affected by the include all columns parameter.

Parameters:

  1. Appended dataframe: The dataframe that is appended to the end.
  2. Include all columns (boolean): If true, the result will have all the columns that were present in either of the dataframes. The order of the input columns does not matter. The order of the output columns is the same as the order of the columns in the original dataframe followed by all the rest of the columns existing only in appended dataframe in alphabetical order. If a column doesn't exist in other dataframe, null values will be set to those columns for dataframes in which the column does not exist. If parameter is false (default), the result will contain only the columns that were present in the original dataframe. The columns in the appended dataframe must be in the same order as they were in the context data frame.

Example 1: Without including all columns:

let dataframe1 = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]);
let dataframe2 = ToDataFrame([[1, "one"], [4, "four"]], ["id", "text"]);
dataframe1.Append(dataframe2).ToCsv();

Returns:
id;text
0;zero
2;two
3;three
1;one
4;four

Example 1: With including all columns:

let dataframe1 = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]);
let dataframe2 = ToDataFrame([["one", 1], ["four", 4]], ["text2", "id"]);
dataframe1.Append(dataframe2, true).ToCsv();

Returns:
id;text;text2
0;zero;
2;two;
3;three;
1;;one
4;;four
Clone (DataFrame) DataFrame to clone

Returns a new DataFrame that is an exact copy of the data frame this method was called for.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Clone()
Returns:A copy of the original DataFrame object.
ColumnIndexes (Integer*)

Column names (String*)

Convert DataFrame column names into column indexes. The indexes are starting from zero. If a column is not found, an exception is given.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ColumnIndexes(["right", "id"])
Returns: [1, 0]
Column (Object*)

Column name

Returns an array of values of given column in the order rows are in the datatable.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Column("right")
Returns: [zero, two, three]
Columns (DataFrame)

Array of column names

Creates a new DataFrame having only the defined columns of the original DataFrame. Note that Columns function is different than Columns property (difference is that the function has parameters).

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Columns(["right"]).ToCsv()
Returns:
right
zero
two
three

ExcludeValues (DataFrame)
  1. Column name (string)
  2. Value (single item) or values (array) to exclude

Creates a new DataFrame having only rows for which given column does not have any of the specified values.

Examples:

ToDataFrame([[0, "zero"], [1, "one"], [2, "two"]], ["id", "left"]).ExcludeValues("id", 1).ToCsv()
Returns:
id;left
0;zero
2;two

ToDataFrame([[0, "zero"], [1, "one"], [2, "two"]], ["id", "left"]).ExcludeValues("left", ["one", "two", "three"]).ToCsv()
Returns:
id;left
0;zero
Head (DataFrame)

Number of top rows

Creates a new DataFrame that only contains the defined top number of rows of this DataFrame. If the DataFrame has less than the defined top rows, all rows are returned.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Head(2).ToCsv()
Results string:
id;right
0;zero
2;two
IncludeOnlyValues
  1. Column name (string)
  2. Value (single item) or values (array) to include

Create a new DataFrame containing only those rows for which the given column has any of the given values. Values can be provided as a single object (if there is only one value) or an array of objects (if multiple values).

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).IncludeOnlyValues("id", 1).ToCsv()
Returns:
id;left
1;one

ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("id", [0, 1, 2]).ToCsv()
Returns:
id;left
0;zero
1;one

ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("left", ["zero", "three"]).ToCsv()
Returns:
id;left
0;zero
3;three

df.IncludeOnlyValues("EventType", ["start", "end"])
Returns a data frame containing all the rows in DataFrame df that have either "start" or "end" value in the column "EventType".
Join (DataFrame)
  1. Other DataFrame
  2. Columns to join
  3. Join type
GroupBy (GroupedDataFrame)

Grouped columns (string array)

Creates a GroupedDataFrame object based based on given columns. Takes as a parameter an array of column names, based on which to group the rows. For examples, see the Aggregate function.

GroupBy (DataFrame)
  1. Array of columns to group
  2. Array of grouping expressions

Creates a new DataFrame based on the current DataFrame. The resulting DataFrame has rows grouped by given columns and values aggregated using given functions. In the resulting DataFrame one row in the end result corresponds with one group.

Parameters:

  1. Columns: Columns to group identified by column names.
  2. Aggregation expressions: Array containing the column name as a key and the aggregation expression as a value.

Examples:

ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id"],
[
  "ids": () => Sum(id),
  "texts": () => StringJoin(",", text),
  "constant": 123
]).ToCsv()
Returns:
ids;texts;constant
0;zero,zero2;123
6;two,two,two3;123
3;three;123

ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id", "text"],
[
  "ids": () => Sum(id),
  "texts": () => StringJoin(",", text),
  "constant": 123
]).ToCsv()
Returns:
ids;texts;constant
0;zero;123
0;zero2;123
4;two,two;123
2;two3;123
3;three;123
Merge (DataFrame)
OrderBy (DataFrame)
  1. primary ordering expression
  2. secondary ordering expression
  3. ...

Creates a new DataFrame having rows ordered in an ascending order using the given expression(s) evaluated on each row. Rows that have same ordering value in the primary ordering exprssion, are sorted based on the secondary ordering expression.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderBy(text).ToCsv()
Returns:
id;right
3;three
2;two
0;zero

Analysis("OperationLog").OrderBy(Duration).Head(1)
Results string:
id;right
0;zero
2;two
3;three

el.Analysis("EventTypes").OrderBy(Count, Name)
Returns event types in eventlog ordered primarily by Count and secondarily by Name.

If there is a need to sort by some column ascending and some column descending, the sortings can be chained. Example:

el.Analysis("EventTypes").OrderByDescending(Name).OrderBy(Count)
Returns event types in eventlog ordered primarily by Count ascending and secondarily by Name descending.
OrderByColumns (DataFrame)
  1. Columns to be ordered (String array)
  2. Sorting order (boolean array)

Creates a new DataFrame having rows ordered by given columns in given directions. Note that the ordered columns need to contain same type of data, because ordering is not possible between different data types. Parameters:

  1. columns: Column to be sorted.
  2. sort order: Array of boolean values indicating whether to sort the columns in ascending (true) or descending (false) direction. The length of the array must be equal to the length of columns array.

Null values are always first in the order (both in ascending and descending order).

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id"], [false]).ToCsv()
Returns string:
id;text
3;three
2;two
0;zero

ToDataFrame([[0, "zero"], [0, "nolla"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id", "text"], [true, false]).ToCsv()
Returns string:
id;text
0;zero
0;nolla
2;two
3;three
OrderByDescending (DataFrame)
  1. primary ordering expression
  2. secondary ordering expression
  3. ...

Creates a new DataFrame having rows ordered in an descending order using the given expression(s) evaluated on each row. See OrderBy above for examples.

Persist (DataTable)
  • Datatable name (String)
  • Additional parameters (Dictionary)

Writes DataFrame into datatable. If a datatable with that name does not exist in the project, a new datatable is created. If a datatable with that name already exists, the DataFrame will be stored into that DataTable. The function returns the written datatable object.

The following parameters are supported:

  • ProjectName: Name of the project where the datatable is created.
  • ProjectId: Id of project where the datatable is created.
  • Append, ImportExistingColumnOnly and MatchByColumns: See the import function for details.

Examples:

let right = ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]);
right.Persist("RightDataTable", ["ProjectName": "TestData"])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, its contents will be overwritten by the new content.

let newData = ToDataFrame([[4, "four"]], ["id", "right"]);
newData.Persist("RightDataTable", ["ProjectName": "TestData", "Append": true])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, new content will be appended into the end of the table.

The following script reads data from a datatable (MyProject -> MyDatatable) convert one column (MyColumn) into floats and writes the data back to the same datatable.

let project = (Projects.Where(Name=="MyProject"))[0];
let datatable = (project.Datatables.Where(Name=="MyDatatable"))[0];
DatatableById(datatable.Id).DataFrame
.SetColumns([
	"MyColumn": () => ToFloat(Column("MyColumn"))
])
.Persist(datatable.Name, ["ProjectId": project.Id, "Append": false]);
RemoveColumns (DataFrame)

Column names (string array)

Creates new DataFrame where the defined columns have been removed. Throws an exception if the original DataFrame doesn't contain any of the defined columns.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RemoveColumns(["id"]).ToCsv()
Returns string:
left
zero
one
RenameAllColumns (DataFrame) Template string

Renames all columns of the DataFrame using a name template. The template needs to contain {0} that will be replaced by the original name of the column. For escapings, {{ is replaced with { and }} with }. Throws an exception if the template doesn't contain {0}.

Example:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"])
  .RenameAllColumns("new_{0}").ToCsv();
Returns:
new_id;new_left
0;zero
1;one
RenameColumns (DataFrame) Key-value pairs of name mappings

Renames DataFrame columns. Takes a parameter of key-value pairs describing how old names (value) are changed to new names (key). Throws an exception if any of the (old) column names don't exist. Renaming DataFrame columns doesn't copy the data, so it's a quick operation for even large datasets.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RenameColumns(["newId": "id", "newLeft": "left"]).ToCsv()
Returns string:
newId;newLeft
0;zero
1;one
Select Column names (string array, or key-value pairs)

Creates a new DataFrame where only the selected columns are included. Allows also to change column names when the parameter contains key-value pairs where the original column names are as values and new columns names as keys (see the examples). Throws an exception if any of the columns specified does not exist.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["left"]).ToCsv()
Returns string:
left
zero
one

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["newLeft": "left"]).ToCsv()
Returns string:
newLeft
zero
one

ToDataFrame([[0, "zero", "nolla"], [1, "one", "yksi"]], ["id", "left", "right"]).Select(["left": "newLeft", "right"]).ToCsv()
Returns string:
newLeft;right
zero;nolla
one;yksi
SelectDistinct Column names (string array, or key-value pairs)

Returns only distinct values within given columns or all columns (if no columns are specified). Takes as parameter the column names defining for which columns the distinct value combinations are returned. When no parameter is provided, returns all columns in the dataframe. Supports also renaming columns similar to the Select function.

Examples:

ToDataFrame([[0, "Apple"], [1, "Orange"], [1, "Orange"], [1, "Apple"], [2, "Apple"]], ["number", "text"])
.SelectDistinct()
Returns:
number;text
0;Apple
1;Orange
1;Apple
2;Apple

ToDataFrame([[0, "Apple"], [1, "Orange"], [1, "Orange"], [1, "Apple"], [2, "Apple"]], ["number", "text"])
.SelectDistinct(["text"])
Returns:
text
Apple
Orange
SetColumns (DataFrame)

New/modified columns as array

Creates a new DataFrame based on the current DataFrame, where new columns have been created and/or existing columns have been modified. New and modified columns are defined using an array, where the column name is as a key and as a value there is the expression to calculate the new or modified column. When specifying a column that already exists, the column values are modified. When specifying a new column name, that column is created as a new column to the resulting DataFrame.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
	"both": () => text + "=" + id
]).ToCsv()
Returns:
id;text;both
0;zero;zero=0
2;two;two=2
3;three;three=3

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
	"text": () => text + "=" + id
]).ToCsv()
Returns:
id;text
0;zero=0
2;two=2
3;three=3

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns([
	"both": () => text + "=" + id,
	"both+1": () => both + 1,
	"text": () => "Done: " + Column("both+1"),
	"constant": 1234
]).ToCsv()
Returns:
id;text;both;both+1;constant
0;Done: zero=01;zero=0;zero=01;1234
2;Done: two=21;two=2;two=21;1234
3;Done: three=31;three=3;three=31;1234
Skip (DataFrame) Number of rows to skip Returns DataFrame where the defined number of rows is removed ("skipped"). Returns an empty DataFrame if skipping more rows than there are rows in the source DataFrame.

Examples:

ToDataFrame([[0, "zero"], [1, "one"], [2, "two"], [3, "three"]], ["id", "right"]).Skip(1).ToCsv()
Results string:
id;right
1;one
2;two
3;three
Tail (DataFrame)

Number of rows

Creates a new DataFrame that has only the bottom number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.

Example:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Tail(2).ToCsv()
Results string:
id;right
2;two
3;three
ToCsv (String) includeHeaders (boolean)

Converts a DataFrame into a CSV data (i.e. string). The CSV data has the following formatting:

  • Column separator: semicolon (;)
  • Decimal separator in numeric fields: period (.)
  • Quotation character for text fields: double quotes (") (used when the textual value contains semicolon, double quotes, linebreak or tabulator)
  • Escape character: Double quotes in textual fields are escaped with two double quotes ("")
  • Date format for date fields: yyyy-MM-dd HH:mm:ss,fff
  • Timespan (duration) format: dd.hh:mm:ss
  • First line contains column headers

Parameter includeHeaders defines whether the header (column names) is returned (true, default) or not (false) as the first row.

Example:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
Returns:
id;right
0;zero
2;two
3;three
WithDenseRankNumberColumn (DataFrame)
  1. New column name (String)
  2. Order by columns (String array)
  3. Partition by columns (String array)
  4. Ascending/descending order (Boolean array)

Similar to the WithRankColumn function, except rank numbers doesn't contain gaps when there are rows with same rank values.

let data = ToDataFrame([
	["A", "Dallas", 8],
	["B", "Dallas", 5],
	["C", "Dallas", 5],
	["D", "Dallas", 4],
	["B", "New York", 6],
	["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithDenseRankColumn("Rank", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Rank
C;New York;2;1
D;Dallas;4;2
B;Dallas;5;3
C;Dallas;5;3
B;New York;6;4
A;Dallas;8;5
WithColumn (DataFrame)
  1. New column name (String)
  2. Calculation expression

Creates new DataFrame with new column having value evaluated using given expression. If the column name already exist in the original DataFrame, it gets replaced with the evaluated expression.

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"])
  .WithColumn("combinedtext", text + "=" + (id * 2))
  .ToCsv();
Returns:
id;text;combinedtext
0;zero;zero=0
2;two;two=4
3;three;three=6
WithRankColumn (DataFrame)
  1. New column name (String)
  2. Order by columns (String array)
  3. Partition by columns (String array)
  4. Ascending/descending order (Boolean array)

Similar to the WithRowNumberColumn function, except the produced numbering is based on the ranking logic. The difference to the row number is that the rows with an equal sorting value, gets the same rank number. The numbering continues being same as the row number, i.e. there are gaps in the assigned ranks if there are rows same rank values (see the example).

let data = ToDataFrame([
	["A", "Dallas", 8],
	["B", "Dallas", 5],
	["C", "Dallas", 5],
	["D", "Dallas", 4],
	["B", "New York", 6],
	["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithRankColumn("Rank", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Rank
C;New York;2;1
D;Dallas;4;2
B;Dallas;5;3
C;Dallas;5;3
B;New York;6;5
A;Dallas;8;6
WithRowNumberColumn (DataFrame)
  1. New column name (String)
  2. Order by columns (String array)
  3. Partition by columns (String array)
  4. Ascending/descending order (Boolean array)

Creates new DataFrame with a new column containing row numbers based on defined partitions and ordering. Partitions are defined as one or several columns, and each partition will contain own row numbering starting from one. Rows are ordered within each partition and the row numbering is based in the ordering (which is separate than the ordering of the result data).

Parameters:

  1. New column name: Name of the new column containing the row number.
  2. Order by columns: Array of column(s) to order the rows where the row number is based on.
  3. Partition by columns (optional): Array of column(s) to partition the rows by. Each partition will have own row numbering starting from one. Partition parameter can be omitted, which will treat entire data as one partition.
  4. Ascending/descending order (optional): Array of booleans to define whether the ordering for each column (in the Order by columns parameter) is ascending (true, by default) or descending (false).

Examples:

let data = ToDataFrame([
	["A", "Dallas", 8],
	["B", "Dallas", 5],
	["C", "Dallas", 4],
	["B", "New York", 6],
	["C", "New York", 2]
], ["Customer", "Region", "Revenue"]);
data.WithRowNumberColumn("Order", ["Revenue"]).OrderByColumns(["Revenue"], [true]).ToCsv();
Returns:
Customer;Region;Revenue;Order
C;New York;2;1
C;Dallas;4;2
B;Dallas;5;3
B;New York;6;4
A;Dallas;8;5

data.WithRowNumberColumn("Order", ["Revenue"], ["Region"]).OrderByColumns(["Region", "Revenue"], [true, true]).ToCsv()
Returns:
Customer;Region;Revenue;Order
C;Dallas;4;1
B;Dallas;5;2
A;Dallas;8;3
C;New York;2;1
B;New York;6;2

data.WithRowNumberColumn("Order", ["Revenue"], ["Region"], ["false"]).OrderByColumns(["Region", "Revenue"], [true, false]).ToCsv()
Returns:
Customer;Region;Revenue;Order
A;Dallas;8;1
B;Dallas;5;2
C;Dallas;4;3
B;New York;6;1
C;New York;2;2
Where (DataFrame)

Condition expression

Creates a new DataFrame having only rows for which the given condition expression returns true. The condition expression can refer to the columns of the DataFrame (see the example below).

Examples:

let df = ToDataFrame([[0, "zero"], [2, "two", true], [3, "three"]], ["id", "string"]);

All the following expression return the same:
df.Where(id < 3);
df.Where(Column("id") < 3);
df.Where(_[0] < 3);

Returns:
id;string
0;zero
2;two
Zip (DataFrame)

DataFrame

Creates a new DataFrame that has the contents of given DataFrame appended as new columns into the end of this DataFrame. Returns a new DataFrame that has the colums from both the data frames so that the columns from the other DataFrame are appended to the end of the columns of this DataFrame. If the number of rows is different between this DataFrame and the other DataFrame, an exception is thrown. There must not be duplicate column names in the DataFrames - otherwise an exception is thrown.

Examples:

let df1 = ToDataFrame([[0, "zero"], [1, "one"], [4, "four"]], ["id", "text"]);
let df2 = ToDataFrame([[1, "one"], [2, "two"], [3, "three"]], ["id2", "text2"]);
df1.Zip(df2).ToCsv();
Returns:
id;text;id2;text2
0;zero;1;one
1;one;2;two
4;four;3;three

DataFrame sources

The following functions can be used to extract data as DataFrames:

DataFrame functions Parameters Description
ImportOdbc (DataFrame)
  1. connection string (string)
  2. query (string)
  3. query execution timeout (integer)

Runs given query in given ODBC datasource and returns data as DataFrame. AllowExternalDatasources setting needs to be true to use the ImportODBC function. Note also that the ODBC connection requires an ODBC driver to the datasource to be installed in the QPR ProcessAnalyzer Server.

Example: Contents of OdbcTest table in the ODBC datasource is fetched and returned as a DataFrame.

ImportOdbc(
  "Driver={SQL Server};Server=localhost;DataBase=QPR_PA1;Trusted_Connection=yes",
  "SELECT * FROM OdbcTest",
  500
)
ImportOdbcSecure (DataFrame)
  1. project id (Integer)
  2. connection string key (string)
  3. query (string)
  4. query execution timeout (integer)

Similar command as ImportOdbc, except instead of the plain text connection string, a secret name is provided. Also a project id where to fetch the connection string key from needs to be provided.

Example: Contents of OdbcTest table in the ODBC datasource is fetched and returned as a DataFrame.

ImportOdbcSecure(
  12,
  "MySecureConnectionString",
  "SELECT * FROM OdbcTest",
  500
)
ToDataFrame
  1. data as 2-dimensional array
  2. column names (array of strings) or column metadata (array of dictionaries)

Creates a DataFrame object containing the given data (in two dimensional array) and the array of column names. Number of column names must be the same as the number of columns in the two dimensional array.

To be able to define data types for the columns, the second parameter can also be an array of dictionaries (one for each column), where each dictionary contains Name and DataType properties (see an example below). Available data types are String, Integer, Float, DateTime, Boolean, Duration (Timespan) and Any (can contain any type of data).

The first parameter can also be an existing DataFrame which will create a copy of the DataFrame.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
Returns:
A string containing:
id;right
0;zero
2;two
3;three

ToDataFrame(
  [[0, "zero"], [2, "two"], [3, "three"]],
  [
    #{"Name": "id", "DataType": "Integer"},
    #{"Name": "right", "DataType": "String"}
  ]
).ToCsv();
Returns:
id;right
0;zero
2;two
3;three