Merging DataFrames

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search

Merge function for DataFrames creates a new DataFrame that has the contents of a DataFrame (target) merged with another DataFrame (source). The merging works in the same principle as in the SQL language. Note that the merging does not create new columns but the result DataFrame has the same columns as the target DataFrame. Merging has the following principle:

Merge.png

Parameters:

  1. Source DataFrame: DataFrame to be merged with the target DataFrame.
  2. Columns to used for matching: Columns used to match the source and target DataFrames. This parameter has similar syntax as the 2. parameter in the join function.
  3. Columns to UPDATE for matching rows: Columns to update from the source DataFrame to the target DataFrame for the rows that match between the DataFrames. Note that the matching columns cannot be updated because they always have same values due to the matching logic. This parameter has similar syntax as the 2. parameter in the join function. Following special values can also be used:
    • If the value is _remove, all matching rows are be deleted from the resulting DataFrame.
    • If the value is [], no columns are updated to the resulting DataFrame.
    • If the value is null (default), all columns are copied from the source to target (the ones having matching names).
  4. Columns to CREATE for non-match rows in source: Columns to create to the result DataFrame, if a matching row in the source DataFrame is not found in the target DataFrame. This parameter has similar syntax as the 2. parameter in the join function. Following special values can also be used:
    • If the value is _remove, no new rows are created to the resulting DataFrame.
    • If the value is [], new rows are created to the resulting DataFrame but all columns of the created rows get empty values.
    • If the value is null (default), when creating rows, all columns are copied from the source to target (the ones having matching names).
  5. Keep or DELETE non-matching rows in target: Boolean value that defines whether those rows in the target DataFrame where no matching row in the source DataFrame is found, are included in the result DataFrame (true) or deleted (false). The default value is true.

Examples:

Update case attribute values for some cases and create non-existing:

let CaseData = ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]);
let UpdatedData = ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]);
CaseData.Merge(UpdatedData, "Case id").toCsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
2;Dallas;8
3;Chicago;4
4;New York;2
5;;2

Update case attribute values for some cases and don't create non-existing:

let CaseData = ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]);
let UpdatedData = ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]);
CaseData.Merge(UpdatedData, "Case id", null, _remove).toCsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
2;Dallas;8
3;Chicago;4
4;New York;2

Update case attribute values for some cases (create non-existing) where columns to match have different names in the source and target DataFrames ("Case id" in target and "Case" in source are used to match, and "Cost" in target is updated from "Variable Cost" in source):

let CaseData = ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]);
let UpdatedData = ToDataFrame([
	["0", 7, 4],
	["1", 5, 2],
	["4", 2, 0],
	["5", 2, 1]
], ["Case", "Cost", "Variable Cost"]);
CaseData.Merge(UpdatedData, ["Case id": "Case"], ["Cost": "Variable Cost"]).toCsv();

Returns:
Case id;Region;Cost
0;New York;4
1;Dallas;2
2;Dallas;8
3;Chicago;4
4;New York;0
5;;1

Delete matching cases (don't create non-matching):

let CaseData = ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]);
let UpdatedData = ToDataFrame([
	["0"],
	["1"],
	["4"],
	["5"]
], ["Case id"]);
CaseData.Merge(UpdatedData, "Case id", _remove, _remove).toCsv();

Returns:
Case id;Region;Cost
2;Dallas;8
3;Chicago;4

Update matching cases, create non-matching by source as new, and delete non-matching by target:

let CaseData = ToDataFrame([
	["0", "New York", 6],
	["1", "Dallas", 3],
	["2", "Dallas", 8],
	["3", "Chicago", 4],
	["4", "New York", 2]
], ["Case id", "Region", "Cost"]);
let UpdatedData = ToDataFrame([
	["0", 7],
	["1", 5],
	["4", 2],
	["5", 2]
], ["Case id", "Cost"]);
CaseData.Merge(UpdatedData, "Case id", null, null, false).toCsv();

Returns:
Case id;Region;Cost
0;New York;7
1;Dallas;5
4;New York;2
5;;2
let target = ToDataFrame([[0, "zero", "target"], [1, "", "target"]], ["id", "text", "frame"]);
let source = ToDataFrame([[1, "one", "source"], [2, "two", "source"], [3, "three", "source"]], ["id", "text", "frame"]);

target.Merge(source, "id").ToCsv()
Returns (one key, default parameters, identical dataframe columns):
id;text;frame
0;zero;target
1;one;source
2;two;source
3;three;source

target.Merge(source, "id", ["text"]).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source):
id;text;frame
0;zero;target
1;one;target
2;two;
3;three;

target.Merge(source, "id", ["text"], _remove).ToCsv()
Returns (one key, default parameters, identical dataframe columns, copy only text column from source, remove rows found only in source):
id;text;frame
0;zero;target
1;one;target

target.Merge(source, "id", ["text"], _remove, false).ToCsv()
Returns (one key, identical dataframe columns, copy only text column from source, remove rows found only in source or only in target):
id;text;frame
1;one;target

target.Merge(source, "id", _remove, _remove, false).ToCsv()
Returns (one key, identical dataframe columns, remove all rows):
id;text;frame


let target = ToDataFrame([[0, 0, "zerozeroleft", "target"], [0, 1, "zeroleft", "target"], [1, 2, "left", "target"], [4, 5, "fourleft", "target"]], ["idleft1", "idleft2", "textleft", "frame"]);
let source = ToDataFrame([[0, 0, "zerozeroright", "source"], [0, 1, "zeroright", "target"], [1, 2, "oneright", "source"], [2, 3, "tworight", "source"], [3, 4, "threeright", "source"]], ["idright1", "idright2", "textright", "frame"]);

target.Merge(source, ["idleft1": "idright1"]).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;0;zerozeroleft;target
0;1;zeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;;;source
3;;;source

target.Merge(source, ["idleft1": "idright1"], []).ToCsv()
Returns (one key, default parameters, different dataframe columns, copy only key column):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;target
0;0;zerozeroleft;target
0;1;zeroleft;target
0;1;zeroleft;target
1;2;left;target
4;5;fourleft;target
2;;;
3;;;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy all matching columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;;source
3;4;;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright -column):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy textright and frame -columns):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;source
0;1;zeroright;target
1;2;oneright;source
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], _remove).ToCsv()
Returns (two keys, default parameters, different dataframe columns, remove all matching rows, copy only matching columns):
idleft1;idleft2;textleft;frame
4;5;fourleft;target
2;3;;
3;4;;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
4;5;fourleft;target
2;3;tworight;source
3;4;threeright;source

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], null, ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright"]).ToCsv()
Returns (two keys, default parameters, different dataframe columns, don't copy any columns from source for matching columns, copy id columns and textright-column for rows not found in target):
idleft1;idleft2;textleft;frame
0;0;zerozeroleft;source
0;1;zeroleft;target
1;2;left;source
4;5;fourleft;target
2;3;tworight;
3;4;threeright;

target.Merge(source, ["idleft1": "idright1", "idleft2": "idright2"], ["textleft": "textright"], ["idleft1": "idright1", "idleft2": "idright2", "textleft": "textright", "frame"], false).ToCsv()
Returns (two keys, default parameters, different dataframe columns, copy only textright-column for matching columns, copy id, frame and textright-column for rows not found in target, remove all rows not found in source):
idleft1;idleft2;textleft;frame
0;0;zerozeroright;target
0;1;zeroright;target
1;2;oneright;target
2;3;tworight;source
3;4;threeright;source