DataFrame in Expression Language: Difference between revisions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search
Line 281: Line 281:
||
||
See [[#Joining_DataFrames|Joining DataFrames]].
See [[#Joining_DataFrames|Joining DataFrames]].
|-
||GroupBy (GroupedDataFrame)
||
Columns (string array)
||
Create a GroupedDataFrame based on grouping the current DataFrame based on given columns. Takes as a parameter an array of column labels, for which to group the rows. For examples, see the Aggregate function.


|-
|-

Revision as of 10:35, 2 October 2020

DataFrame represents a two dimensional array of data with one-to-many columns and zero-to-many rows, like a relational database table, an Excel sheet or a CSV data file. Each column in the DataFrame has a name, and there must not be more than one column with the same name. DataFrame is the generic data structure used to manage all kinds data in QPR ProcessAnalyzer expression engine that run in-memory. DataFrames as linked to other entities in QPR ProcessAnalyzer as follows:

  • Datatable contents is fetched into the memory as a DataFrame object
  • DataFrame can be stored (persisted) to a Datatable
  • ETL operations, such as joining, unions, filtering and grouping are based on the DataFrames
  • Data extracted from an external data source, e.g. using ODBC, is fetched to the in-memory calculation as a DataFrame.
  • When using a loading scripts, cases and events data is fed to the model using the DataFrame.

Extract Data to DataFrames

DataFrame functions Parameters Description¨
Analysis (DataFrame)
  1. analysisType (integer/string)
  2. Additional parameters as key-value pairs

Runs the given type of analysis and returns the results as a DataFrame. If an EventLog is available in the current context, the analysis is run for that EventLog.

The analysis type can be given as a string (e.g. "Cases") or numeric (e.g. 5).

Additional parameters are defined as a key-value pair collection.

Examples:

Analysis("OperationLog")
Returns: Filter report analysis

EventLogById(1).Analysis(5)
Returns: Analysis 5 (Cases) for event log having id 1.

EventLogById(1).Analysis("Cases")
Returns: Analysis 5 (Cases) for event log having id 1.

Analysis(25, ["ScriptId": 123, "SelectedAnalysisResult": "MyResult", "parameter1": "value1", "parameter2": "value2"])
Returns: Run a script which id is 123 using the provided parameters. (Remember to use "SheetName"="MyResult" parameter for the shown result query.) 

EventLogById(1).Analysis(14, ["MaximumCount": 30, "SelectedAttributeType": "Region"])
Returns: Analysis 14 (Root causes) for event log having id 1.
ImportOdbc (DataFrame)
  1. connection string (string)
  2. query (string)

Runs the given query to the given ODCB datasource, and returns data as a DataFrame. AllowExternalDatasources setting needs to be True to be able to use the ImportODBC function. Note also that the ODBC connection requires an ODBC driver specific to the datasource to be installed in the QPR ProcessAnalyzer Server computer.

Examples:

ImportOdbc("Driver={SQL Server};Server=localhost;DataBase=QPR_PA1;Trusted_Connection=yes", "SELECT * FROM OdbcTest")
Returns: The contents of OdbcTest table in given ODBC data source inside a DataFrame.
ImportOdbcSecure (DataFrame)
  1. Project id (Integer)
  2. connection string key (string)
  3. query (string)

Similar command as the ImportODBC, except instead of a plaintext connection string, a secure string key is provided. Also a project id, from where to fetch the connection string key, needs to be provided.

Examples:

ImportOdbcSecure(12, "MySecureConnectionString", "SELECT * FROM OdbcTest")
Returns: The contents of OdbcTest table in given ODBC data source inside a DataFrame.

DataFrame Properties

DataFrame properties Description
Columns (String*) DataFrame columns names as an array in the order the columns are in the DataFrame.
Rows (Object**) Returns the data content of the DataFrame as a two-dimensional array (matrix). The column names are not part of the data content.

Examples:

DatatableById(5).DataFrame.Rows[0][0]
Returns: the value in the first row and first column in a datatable with id 5.
<column name> (Object*)

Returns an array of values of given column in the datatable. If the column name contains spaces, the Column function needs to be used to refer to a column.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right
Returns: [zero, two, three]

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).right[2]
Returns: three

DataFrame Functions

DataFrame functions Parameters Description
Aggregate

Create a new DataFrame by performing selected aggregation operations to all the groups separately and returning one row for each group with the results of the aggregations. Also returns all the columns used in the grouping of the values.

Parameters:

  1. columns: Key-value pairs where each key-value pair describes the column name in the original DataFrame name (key) and the name (value) of the created column. Columns having null value in the dictionary are not renamed at all. Instead, the original label will be used for them. If this parameter is given as an array and an array element is not a hierarchical array, the created dictionary entry for that element will have null value.
  2. aggregation function: An array of string values. The length of the array must be equal to the length of columns-array.

Supported aggregation functions are:

  • Average: Calculates the average value of the specified column.
  • Count: Calculates the count of rows in this group.
  • DateTimeRange: Calculates the duration in seconds between the minimum and the maximum values of the datetime (#27613#) values of the specified column.
  • Max: Calculates the maximum value of the specified column.
  • Median: Calculates the median value of the specified column.
  • Min: Calculates the minimum value of the specified column.
  • Sum: Calculates the sum value of the specified column.

Examples:

ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"]).GroupBy(["text"]).Aggregate(["ids": "id"], ["sum"]).ToCsv()
Returns string:
text;id
three;3
two;36
zero;10

ToDataFrame([[0, "zero"], [10, "zero"], [2, "two"], [12, "two"], [22, "two"], [3, "three"]], ["id", "text"]).GroupBy(["text"]).Aggregate(["average": "id", "sum": "id", "min": "id", "max": "id", "median": "id"], ["average", "sum", "min", "max", "median"]).ToCsv()
Returns string:
text;average;sum;min;max;median
three;3;3;3;3;3
two;12;36;2;22;12
zero;5;10;0;10;5

ToDataFrame([[0, DateTime(2020, 1)], [0, DateTime(2020, 4)], [0, DateTime(2020, 2)], [1, DateTime(2019, 1)], [1, DateTime(2009, 1)]], ["id", "timestamp"]).GroupBy(["id"]).Aggregate(["duration": "timestamp", "count": "id"], ["DateTimeRange", "Count"]).ToCsv()
Returns string:
id;duration;count
0;7862400;3
1;315532800;2
Append (DataFrame)

DataFrame which data to append

Creates a new DataFrame that has the contents of given DataFrame added to the end of this DataFrame. When the data is combined, the order of columns matters, not the names of the columns. The resulting DataFrame gets column names from this DataFrame.

If the number of columns is different between this DataFrame and the other DataFrame, an exception is thrown.

Examples:

Let("dataframe1", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]));
Let("dataframe2", ToDataFrame([[1, "one"], [4, "four"]], ["id", "text"]));
dataframe1.Append(dataframe2);

Returns:
id;text
0;zero
2;two
3;three
1;one
4;four
Clone (DataFrame) DataFrame to clone

Returns a new DataFrame that is an exact copy of the data frame this method was called for.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Clone()
Returns:A copy of the original DataFrame object.
ColumnIndexes (Integer*)

Column names (String*)

Convert DataFrame column names into column indexes. The indexes are starting from zero. If a column is not found, an exception is given.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ColumnIndexes(["right", "id"])
Returns: [1, 0]
Column (Object*)
  • Column name

Returns an array of values of given column in the order rows are in the datatable.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Column("right")
Returns: [zero, two, three]
Columns (DataFrame)

Array of column names

Creates a new DataFrame having only the defined columns of the original DataFrame. Note that Columns function is different than Columns property (difference is that the function has parameters).

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Columns(["right"]).ToCsv()
Returns:
right
zero
two
three

Head (DataFrame)

Number of rows

Creates a new DataFrame that only contains top number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Head(2).ToCsv()
Results string:
id;right
0;zero
2;two
IncludeOnlyValues
  1. Column label (string)
  2. Values to include (string array)

Create a new filtered DataFrame having only rows for which given column has a value that equals to any of the specified values.

Parameters:

  1. Column: Column label.
  2. Values to include: Value or an array of values to compare column values with.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).IncludeOnlyValues("id", 1).ToCsv()
Returns string:
id;left
1;one

ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("id", [0, 1, 2]).ToCsv()
Returns string:
id;left
0;zero
1;one

ToDataFrame([[0, "zero"], [1, "one"], [3, "three"]], ["id", "left"]).IncludeOnlyValues("left", ["zero", "three"]).ToCsv()
id;left
0;zero
3;three

df.IncludeOnlyValues("EventType", ["start", "end"])
Returns a data frame containing all the rows in DataFrame df that have either "start" or "end" value in the column named "EventType".
Join (DataFrame)
  1. Other DataFrame
  2. Columns to join
  3. join type

See Joining DataFrames.

GroupBy (GroupedDataFrame)

Columns (string array)

Create a GroupedDataFrame based on grouping the current DataFrame based on given columns. Takes as a parameter an array of column labels, for which to group the rows. For examples, see the Aggregate function.

GroupBy (DataFrame)
  1. Array of columns to group
  2. Array of grouping expressions

Creates a new DataFrame based on the current DataFrame. The resulting DataFrame has rows grouped by given columns and values aggregated using given functions. In the resulting DataFrame one row in the end result corresponds with one group.

Parameters:

  1. Columns: Columns to group identified by column names.
  2. Aggregation expressions: Array containing the column name as a key and the aggregation expression as a value.

Examples:

ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id"],
[
  "ids": () => Sum(id),
  "texts": () => StringJoin(",", text),
  "constant": 123
]).ToCsv()
Returns:
ids;texts;constant
0;zero,zero2;123
6;two,two,two3;123
3;three;123

ToDataFrame([[0, "zero"], [0, "zero2"], [2, "two"], [2, "two"], [2, "two3"], [3, "three"]], ["id", "text"]).GroupBy(["id", "text"],
[
  "ids": () => Sum(id),
  "texts": () => StringJoin(",", text),
  "constant": 123
]).ToCsv()
Returns:
ids;texts;constant
0;zero;123
0;zero2;123
4;two,two;123
2;two3;123
3;three;123

Analysis("OperationLog")
.GroupBy(
  ["User Name"],
  [
    "User Name": () => Column("User Name")[0],
    "Count": () => CountTop(Rows),
    "Avg. Duration": () => Average(Duration),
    "Max. Duration": () => Max(Duration)
  ]
).ToCsv()

Returns (similar to this):
User Name;Count;Avg. Duration;Max. Duration
;207;0.617434782608696;20.556
Administrator;665;16.3750631578947;4225.497
qpr;128;2.158765625;20.346
Merge (DataFrame)

See Merging DataFrames.

OrderBy (DataFrame)
  1. primary ordering expression
  2. secondary ordering expression
  3. ...

Creates a new DataFrame having rows ordered in an ascending order using the given expression(s) evaluated on each row. Rows that have same ordering value in the primary ordering exprssion, are sorted based on the secondary ordering expression.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderBy(text).ToCsv()
Returns:
id;right
3;three
2;two
0;zero

Analysis("OperationLog").OrderBy(Duration).Head(1)
Results string:
id;right
0;zero
2;two
3;three

el.Analysis("EventTypes").OrderBy(Count, Name)
Returns event types in eventlog ordered primarily by Count and secondarily by Name.

If there is a need to sort by some column ascending and some column descending, the sortings can be chained. Example:

el.Analysis("EventTypes").OrderByDescending(Name).OrderBy(Count)
Returns event types in eventlog ordered primarily by Count ascending and secondarily by Name descending.
OrderByColumns (DataFrame)
  1. columns (string array)
  2. sort order (boolean array)

Creates a new DataFrame having rows ordered based on given columns in given direction.

Parameters:

  1. columns: Column column labels.
  2. sort order: Array of boolean values. Each boolean value indicates whether to sort the column at corresponding index in the columns-array in ascending (true) or descending (false) direction. The length of the array must be equal to the length of columns-array.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id"], [true]).ToCsv()
Returns string:
id;text
0;zero
2;two
3;three

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id"], [false]).ToCsv()
Returns string:
id;text
3;three
2;two
0;zero

ToDataFrame([[0, "zero"], [0, "nolla"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id", "text"], [true, true]).ToCsv()
Returns string:
id;text
0;nolla
0;zero
2;two
3;three

ToDataFrame([[0, "zero"], [0, "nolla"], [2, "two"], [3, "three"]], ["id", "text"]).OrderByColumns(["id", "text"], [true, false]).ToCsv()
Returns string:
id;text
0;zero
0;nolla
2;two
3;three
OrderByDescending (DataFrame)
  1. primary ordering expression
  2. secondary ordering expression
  3. ...

Creates a new DataFrame having rows ordered in an descending order using the given expression(s) evaluated on each row. See OrderBy above for examples.

Persist (DataTable)
  • DataTable name
  • Additional parameters

Writes a DataFrame into a DataTable in the QPR ProcessAnalyzer database. If a DataTable with that name does not exist in the project, a new DataTable is created. If a DataTable with that name already exists, the DataFrame will be stored into that DataTable. The function returns the written DataTable object.

The additional parameters support:

  • Append: Can be used to determine whether to append (true) or overwrite (false) the existing data. Default is false.
  • ProjectName: Name of the project under which the DataTable is to be created.
  • ProjectId: Id of the project under which the DataTable is to be created.

Examples:

Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
right.Persist("RightDataTable", ["ProjectName": "TestData"])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, its contents will be overwritten by the new content.

Let("newData", ToDataFrame([[4, "four"]], ["id", "right"]));
newData.Persist("RightDataTable", ["ProjectName": "TestData", "Append": true])
Results: Id of the new data table named "RightDataTable" created into project named TestData (which is created if it doesn't already exist). If the table already existed, new content will be appended into the end of the table.

The following script reads data from a datatable (MyProject -> MyDatatable) convert one column (MyColumn) into floats and writes the data back to the same datatable.

let project = (Projects.Where(Name=="MyProject"))[0];
let datatable = (project.Datatables.Where(Name=="MyDatatable"))[0];
DatatableById(datatable.Id).DataFrame
.SetColumns([
	"MyColumn": () => ToFloat(Column("MyColumn"))
])
.Persist(datatable.Name, ["ProjectId": project.Id, "Append": false]);
RemoveColumns (DataFrame)

Column labels (string array)

Creates a new DataFrame that contains a copy of the current data frame without the specified columns. Throws an exception if any of the columns specified in columns is not found.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RemoveColumns(["id"]).ToCsv()
Returns string:
left
zero
one
RenameColumns (DataFrame)

Renames columns of this DataFrame. Takes as a parameter value convertible to a StringDictionary (#70489#) object where each key-value pair describes the old name (key) and the new name (value) of the column.. Throws an exception if any of the columns specified in columnNameMappings is not found.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RenameColumns(["id": "newId", "left": "newLeft"]).ToCsv()
Returns string:
newId;newLeft
0;zero
1;one

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).RenameColumns(#{"id": "newId", "left": "newLeft"}).ToCsv()
Returns string:
newId;newLeft
0;zero
1;one
Select

Creates a new DataFrame where only selected columns of this DataFrame are added. Returns a new DataFrame that has all the values of specified columns and column names modified as requested. Also allows renaming those columns in the created DataFrame at the same time. Throws an exception if any of the columns specified in columns is not found.

The parameter is a value convertible to a StringDictionary object where each key-value pair describes the old name (key) and the new name (value) of the column. Columns having null value in the dictionary are not renamed at all. Instead, the original label will be used for them. If this parameter is given as an array and an array element is not a hierarchical array (#29290#), the created dictionary entry for that element will have null value.

Examples:

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["left"]).ToCsv()
Returns string:
left
zero
one

ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]).Select(["left": "newLeft"]).ToCsv()
Returns string:
newLeft
zero
one

ToDataFrame([[0, "zero", "nolla"], [1, "one", "yksi"]], ["id", "left", "right"]).Select(["left": "newLeft", "right"]).ToCsv()
Returns string:
newLeft;right
zero;nolla
one;yksi

ToDataFrame([[0, "zero", "nolla"], [1, "one", "yksi"]], ["id", "left", "right"]).Select(#{"left": "newLeft", "right": null}).ToCsv()
Returns string (note: columns may be in the different order as dictionary objects don't always return the keys in the same order they were initialized):
right;newLeft
nolla;zero
yksi;one
SetColumns (DataFrame)

New/modified columns as array

Creates a new DataFrame based on the current DataFrame, where new columns have been created and/or existing columns have been modified. New and modified columns are defined using an array, where the column name is as a key and as a value there is the expression to calculate the new or modified column. When specifying a column that already exists, the column values are modified. When specifying a new column name, that column is created as a new column to the resulting DataFrame.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": () => text + "=" + id]).ToCsv()
Returns:
id;text;both
0;zero;zero=0
2;two;two=2
3;three;three=3

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["text": () => text + "=" + id]).ToCsv()
Returns:
id;text
0;zero=0
2;two=2
3;three=3

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "text"]).SetColumns(["both": () => text + "=" + id, "both+1": () => both + 1, "text": () => "Done: " + Column("both+1"), "constant": 1234]).ToCsv()
Returns:
id;text;both;both+1;constant
0;Done: zero=01;zero=0;zero=01;1234
2;Done: two=21;two=2;two=21;1234
3;Done: three=31;three=3;three=31;1234
Tail (DataFrame)

Number of rows

Creates a new DataFrame that has only the bottom number of rows of this DataFrame. If the DataFrame has less than n rows, all its rows are returned.

Example:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).Tail(2).ToCsv()
Results string:
id;right
2;two
3;three
ToCsv (String) includeHeaders (boolean)

Converts a DataFrame into a CSV data. The CSV data has the following formatting:

  • Column separator: semicolon
  • Decimal separator in numeric fields: period
  • Quotation character for text fields: double quotes (used when the textual value contains semicolon, double quotes, linebreak or tabulator)
  • Escape character: Double quotes in textual fields are escaped with two double quotes.
  • Date format for date fields: yyyy-MM-dd HH:mm:ss,fff
  • First line: contains column headers

Parameter includeHeaders defines whether the header (column names) is returned (true) or not (false) as the first row. Default is true.

Example:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
Returns:
id;right
0;zero
2;two
3;three
Where (DataFrame)

Condition expression

Creates a new DataFrame having only rows for which the given condition expression returns true. The condition expression can refer to the columns of the DataFrame (see the example below).

Examples:

Let("df", ToDataFrame([[0, "zero"], [2, "two", true], [3, "three"]], ["id", "string"]));

All the following expression return the same:
df.Where(id < 3);
df.Where(Column("id") < 3);
df.Where(_[0] < 3);

Returns:
id;string
0;zero
2;two
Zip (DataFrame)

DataFrame

Creates a new DataFrame that has the contents of given DataFrame appended as new columns into the end of this DataFrame. Returns a new DataFrame that has the colums from both the data frames so that the columns from the other DataFrame are appended to the end of the columns of this DataFrame. If the number of rows is different between this DataFrame and the other DataFrame, an exception is thrown. There must not be duplicate column names in the DataFrames - otherwise an exception is thrown.

Examples:

Let("df1", ToDataFrame([[0, "zero"], [1, "one"], [4, "four"]], ["id", "text"]));
Let("df2", ToDataFrame([[1, "one"], [2, "two"], [3, "three"]], ["id2", "text2"]));
df1.Zip(df2).ToCsv();
Returns:
id;text;id2;text2
0;zero;1;one
1;one;2;two
4;four;3;three

The following functions can be used to initialize DataFrame objects:

Function Parameters Description
ToDataFrame
  1. data (2-dimensional array)
  2. column names (array)

Creates a DataFrame based on the given two dimensional array and array of column names. Number of column names should be the same as the number of columns in the matrix.

The first parameter can also be an existing DataFrame which will create a copy of the DataFrame.

Examples:

ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]).ToCsv()
Returns:
A string containing:
id;right
0;zero
2;two
3;three

Joining DataFrames

Performs a joining operation between two DataFrames.

Parameters:

  1. DataFrame: The other DataFrame to join.
  2. Columns to match: Columns which the joining is based on, can be defined as follows:
    • If joining using one column having the same name in both DataFrames, the column name is specified as as string.
    • If joining using several columns having the same names in both DataFrames, the column names are specified as a string array.
    • If joining using columns having different names between the DataFrames, columns are specified as an array of key-value pairs, where the key is the column name in the left side DataFrame, and value is the column name in the right side DataFrame.
  3. Join type which can be
    • inner (default): row is generated if both DataFrames have the key.
    • leftouter: at least one row is generated for each left side DataFrame row, even if there is no matching other row (in that case null is given as value for the other columns).
    • outer: at least one row is generated both for the left and right side DataFrames even if there is no matching other row (in that case null is given as value for the other columns).

Examples:

Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]));
Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
left.join(right, "id").ToCsv()
Returns:
id;left;right
0;zero;zero

Let("left", ToDataFrame([[0, "zero"], [1, "one"]], ["id", "left"]));
Let("right", ToDataFrame([[0, "zero"], [2, "two"], [3, "three"]], ["id", "right"]));
left.join(right, "id", "leftouter").ToCsv()
Returns:
id;left;right
0;zero;zero
1;one;

Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"]));
Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"]));
left.join(right, ["idleft1": "idright1"], "inner");
Returns:
idleft1;idleft2;left;idright2;right
0;0;zerozeroleft;0;zerozeroright
0;0;zerozeroleft;1;zeroright
0;1;zeroleft;0;zerozeroright
0;1;zeroleft;1;zeroright

Let("left", ToDataFrame([[0, 0, "zerozeroleft"], [0, 1, "zeroleft"], [1, 2, "oneleft"]], ["idleft1", "idleft2", "left"]));
Let("right", ToDataFrame([[0, 0, "zerozeroright"], [0, 1, "zeroright"], [2, 3, "tworight"], [3, 4, "threeright"]], ["idright1", "idright2", "right"]));
left.join(right, ["idleft1": "idright1", "idleft2": "idright2"], "inner");
Returns:
idleft1;idleft2;left;right
0;0;zerozeroleft;zerozeroright
0;1;zeroleft;zeroright

Merging DataFrames

Creates a new DataFrame that has the contents of a DataFrame (target) merged with another DataFrame (source). The merging works in the same principle as in the SQL language. Note that the merging does not create new columns but the result DataFrame has the same columns as the target DataFrame. Merging has the following principle:

Merge.png

Parameters:

  1. Source DataFrame: DataFrame to be merged with the target DataFrame.
  2. Columns to used for matching: Columns used to match the source and target DataFrames. This parameter has similar syntax as the 2. parameter in the join function.
  3. Columns to UPDATE for matching rows: Columns to update from the source DataFrame to the target DataFrame for the rows that match between the DataFrames. Note that the matching columns cannot be updated because they always have same values due to the matching logic. This parameter has similar syntax as the 2. parameter in the join function. Following special values can also be used:
    • If the value is _remove, all matching rows are be deleted from the resulting DataFrame.
    • If the value is [], no columns are updated to the resulting DataFrame.
    • If the value is null (default), all columns are copied from the source to target (the ones having matching names).
  4. Columns to CREATE for non-match rows in source: Columns to create to the result DataFrame, if a matching row in the source DataFrame is not found in the target DataFrame. This parameter has similar syntax as the 2. parameter in the join function. Following special values can also be used:
    • If the value is _remove, no new rows are created to the resulting DataFrame.
    • If the value is [], new rows are created to the resulting DataFrame but all columns of the created rows get empty values.
    • If the value is null (default), when creating rows, all columns are copied from the source to target (the ones having matching names).
  5. Keep or DELETE non-matching rows in target: Boolean value that defines whether those rows in the target DataFrame where no matching row in the source DataFrame is found, are included in the result DataFrame (true) or deleted (false). The default value is true.

Examples:

Update case attribute values for some cases and create non-existing:

Let("CaseData", ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id").tocsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
2;Dallas;8
3;Chicago;4
4;New York;2
5;;2

Update case attribute values for some cases and don't create non-existing:

Let("CaseData", ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id", null, _remove).tocsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
2;Dallas;8
3;Chicago;4
4;New York;2

Update case attribute values for some cases (create non-existing) where columns to match have different names in the source and target DataFrames ("Case id" in target and "Case" in source are used to match, and "Cost" in target is updated from "Variable Cost" in source):

Let("CaseData", ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
	["0", 7, 4],
	["1", 5, 2],
	["4", 2, 0],
	["5", 2, 1]
], ["Case", "Cost", "Variable Cost"]));
CaseData.Merge(UpdatedData, ["Case id": "Case"], ["Cost": "Variable Cost"]).tocsv();

Returns:
Case id;Region;Cost
0;New York;4
1;Dallas;2
2;Dallas;8
3;Chicago;4
4;New York;0
5;;1

Delete matching cases (don't create non-matching):

Let("CaseData", ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
	["0"],
	["1"],
	["4"],
	["5"]
], ["Case id"]));
CaseData.Merge(UpdatedData, "Case id", _remove, _remove).tocsv();

Returns:
Case id;Region;Cost
2;Dallas;8
3;Chicago;4

Update matching cases, create non-matching by source as new, and delete non-matching by target:

Let("CaseData", ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]));
Let("UpdatedData", ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]));
CaseData.Merge(UpdatedData, "Case id", null, null, false).tocsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
4;New York;2
5;;2
Let("target", ToDataFrame([[0, "zero", "target"], [1, "", "target"]], ["id", "text", "frame"]));
Let("source", ToDataFrame([[1, "one", "source"], [2, "two", "source"], [3, "three", "source"]], ["id", "text", "frame"]));

target.Merge(source, "id").ToCsv()
Returns (one key, default parameters, identical dataframe columns):
id;text;frame
0;zero;target
1;one;source
2;two;source
3;three;source

target.Merge(source, "id", ["text"]).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source):
id;text;frame
0;zero;target
1;one;target
2;two;
3;three;

target.Merge(source, "id", ["text"], _remove).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source, remove rows found only in source):
id;text;frame
0;zero;target
1;one;target

target.Merge(source, "id", ["text"], _remove, false).ToCsv()
Returns (one key, identical dataframe columns, copy only text column from source, remove rows found only in source or only in target):
id;text;frame
1;one;target

target.Merge(source, "id", _remove, _remove, false).ToCsv()
Returns (one key, identical dataframe columns, remove all rows):
id;text;frame


Let("target", ToDataFrame([[0, 0, "zerozeroleft", "target"], [0, 1, "zeroleft", "target"], [1, 2, "left", "target"], [4, 5, "fourleft", "target"]], ["idleft1", "idleft2", "textleft", "frame"]));
Let("source", ToDataFrame([[0, 0, "zerozeroright", "source"], [0, 1, "zeroright", "target"], [1, 2, "oneright", "source"], [2, 3, "tworight", "source"], [3, 4, "threeright", "source"]], ["idright1", "idright2", "textright", "frame"]));

target.Merge(source, ["idleft1": "idright1"]).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;0;zerozeroleft;target
0;1;zeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;;;source
3;;;source

target.Merge(source, ["idleft1": "idright1"], []).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy only key column):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;target
0;0;zerozeroleft;target
0;1;zeroleft;target
0;1;zeroleft;target
1;2;left;target
4;5;fourleft;target
2;;;
3;;;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;;source
3;4;;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright -column):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy textright and frame -columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;source
0;1;zeroright;target
1;2;oneright;source
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], _remove).ToCsv()
Returns (two keys, default parameters, different dataframe columns, remove all matching rows, copy only matching columns):
idleft1;idleft2;textleft;frame
4;5;fourleft;target
2;3;;
3;4;;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], null, ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, don't copy any columns from source for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"], false).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target, remove all rows not found in source):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
2;3;tworight;source
3;4;threeright;source