SqlDataFrame in Expression Language: Difference between revisions
No edit summary |
|||
(57 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
SqlDataFrame represents tabular data similar to | SqlDataFrame represents tabular data similar to an SQL query result. Data managed using SqlDataFrames is processed in the datasource where the data is located, i.e., in Snowflake or SQL Server (not in QPR ProcessAnalyzer Server memory). For each SqlDataFrame, there is an SQL query generated that is run in the datasource where the referenced datatables are located. | ||
SqlDataFrame operations itself don't cause the SQL to execute in the datasource, but it will happen when the '''Collect''' function is called for an SqlDataFrame, which generates and executes the SQL query representing the SqlDataFrame and loads the data into QPR ProcessAnalyzer memory as a DataFrame (where it can be presented in a dashboard). | |||
Each SqlDataFrame contain information about the datasource where the SQL query will be executed. When writing queries with several SqlDataFrames, the SqlDataFrames need to be located in the same datasource, to be able to execute the queries. If needed, data can be moved between datasources by using the ''Import'' or ''Persist'' functions. Alternatively, processing can be done in-memory by calling ''Collect'' for an SqlDataFrame and continuing calculation as in-memory DataFrame. | |||
There is a similar API for the SqlDataFrames as there is for the DataFrames. Note that merging is not possible between SqlDataFrames like it's for DataFrames, but SqlDataFrame can be merged into a [[QPR_ProcessAnalyzer_Objects_in_Expression_Language#Datatable|datatable]]. | |||
== SqlDataFrame properties == | |||
{| class="wikitable" | {| class="wikitable" | ||
!'''SqlDataFrame | !'''Property''' | ||
! '''Description''' | |||
|- | |||
||ColumnTypes (Dictionary) | |||
|| | |||
Returns information about the columns of the SqlDataFrame as an array of dictionaries with keys Name and DataType. Columns are returned in the same order as the columns exist in the data table. Column types are calculated based on the type of the column in the relational database management system (i.e. type of the column in an SQL table). | |||
Examples: | |||
<pre> | |||
table.SqlDataFrame.ColumnTypes | |||
Returns: [ #{ "Name": "name1", "DataType": "Integer" }, #{ "Name": "name2", "DataType": "String" } ] | |||
table.SqlDataFrame.ColumnTypes.Name | |||
Returns: ["name1", "name2"] | |||
table.SqlDataFrame.ColumnTypes.DataType | |||
Returns: ["Integer", "String"] | |||
</pre> | |||
|- | |||
||DataSourceConnection | |||
||Returns connection object used by this SqlDataFrame to connect to its datasource. | |||
|- | |||
||NColumns (Integer) | |||
||Returns number of columns in the dataset represented by the SqlDataFrame. | |||
|- | |||
||NRows (Integer) | |||
||Returns number of rows in the dataset represented by the SqlDataFrame. | |||
|} | |||
== SqlDataFrame functions == | |||
{| class="wikitable" | |||
!'''Function''' | |||
! '''Parameters''' | ! '''Parameters''' | ||
! '''Description''' | ! '''Description''' | ||
|- | |- | ||
||Aggregate ( | ||Aggregate (SqlDataFrame) | ||
|| | || | ||
# Aggregated columns (string array or key-value pairs) | # Aggregated columns (string array or key-value pairs) | ||
Line 18: | Line 51: | ||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
|- | |- | ||
|| | ||Append | ||
||( | || | ||
# DataFrame to append | |||
# Include all columns (boolean) | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
|- | |||
||ApplyFilter (SqlDataFrame) | |||
|| | || | ||
Executes the SQL query for the SqlDataFrame in the datasource and returns results as an in-memory DataFrame. Then processing of the data can be continued as the in-memory DataFrame. | # Filter object | ||
# SqlDataFrame for cases | |||
|| | |||
Filters the SqlDataFrame (assumed to contain events data) using given filter and returns SqlDataFrame containing the remaining event data after filtering is performed. Requires that the source SqlDataFrame has ''CaseId'', ''EventType'' and ''TimeStamp'' mappings defined. | |||
Parameters: | |||
# '''filter''': Filter object. | |||
# '''cases''' (optional): SqlDataFrame containing corresponding case data. Case data should have the ''CaseId'' mapping defined. | |||
Example: Returns SqlDataFrame containing only events for case id "12345". | |||
<pre> | |||
let model = ModelById(123); | |||
model | |||
.EventsDataTable | |||
.SqlDataFrame | |||
.ApplyFilter( | |||
#{ | |||
"Items": [ #{ | |||
"Type": "IncludeCases", | |||
"Items": [ #{ | |||
"Type": "Case", | |||
"Values": [ "12345" ] | |||
}] | |||
}] | |||
}, | |||
model.CasesDataTable.SqlDataFrame | |||
) | |||
</pre> | |||
|- | |||
||Collect (SqlDataFrame) | |||
||Enable result caching (boolean) | |||
|| | |||
Executes the SQL query for the SqlDataFrame in the datasource and returns results as an in-memory DataFrame. Then processing of the data can be continued as the in-memory DataFrame. The ''Enable result caching'' parameter determines whether unique (false) or repeatable (true) object names are used in the query that affects whether query results may come from a cache or need to be recalculated each time. | |||
Examples: | Examples: | ||
Line 36: | Line 107: | ||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
|- | |- | ||
|| | ||GroupBy (GroupedDataFrame) | ||
|| | || | ||
Grouped columns (string array) | |||
|| | || | ||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
|- | |- | ||
|| | ||Head (SqlDataFrame) | ||
|| | ||Number of top rows | ||
|| | || | ||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
Line 59: | Line 130: | ||
# Columns to match (String or key-value pairs) | # Columns to match (String or key-value pairs) | ||
# Join type (String) | # Join type (String) | ||
||Same functionality as in the [[ | ||Same functionality as in the [[Joining_DataFrames|DataFrame]]. | ||
|- | |- | ||
||OrderByColumns (SqlDataFrame) | ||OrderByColumns (SqlDataFrame) | ||
|| | || | ||
# | # Ordered columns (String array) | ||
# Sorting order (boolean array) | # Sorting order (boolean array) | ||
|| | || | ||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
|- | |||
||Persist (SqlDataFrame) | |||
|| | |||
# DataTable name | |||
# Additional parameters | |||
|| | |||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. Additionally, if the SQL query for the SqlDataFrame is run in the same system as the target datatable, all data processing and storage is done within the system to achieve efficient operation. | |||
|- | |||
||Pivot (SqlDataFrame) | |||
|| | |||
# Pivot column name (String) | |||
# Value column name (String) | |||
# Pivot values (String array) | |||
# Aggregate function (String) | |||
|| | |||
Performs the pivot operation for the dataframe. More information about pivot: https://www.techonthenet.com/sql_server/pivot.php. | |||
Parameters: | |||
# '''Pivot column name''': Column in the source dataframe that will be aggregated. | |||
# '''Value column name''': Column in the source dataframe that determines to which generated column the row belongs to. | |||
# '''Pivot values''': Array of column names that the pivot operate will generate. | |||
# '''Aggregate function''': Aggregate function for combining the values from pivot column (the first parameter). | |||
Example: | |||
<pre> | |||
let data = ModelById(1) | |||
.EventsDatatable | |||
.SqlDataFrame; | |||
let result1 = data | |||
.Select(["User", "Event type"]) | |||
.Pivot("User", "User", ["John", "Greg", "James", "Sharon"], "Count").Collect(); | |||
// Returns user counts (on columns) for each event type (on rows) | |||
let result2 = data | |||
.Select(["User", "Event type", "Cost"]) | |||
.Pivot("Cost", "User", ["John", "Greg", "James", "Sharon"], "Sum").Collect(); | |||
// Returns costs for each user (on columns) and event type (on rows) | |||
</pre> | |||
|- | |||
||RemoveColumns (SqlDataFrame) | |||
||Column names (string array) | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
||RenameColumns (SqlDataFrame) | |||
||Key-value pairs of name mappings | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
||RenameAllColumns (SqlDataFrame) | |||
||Template string | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |- | ||
||Select (SqlDataFrame) | ||Select (SqlDataFrame) | ||
||Column names (string array, or key-value pairs) | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
||SelectDistinct (SqlDataFrame) | |||
|| | || | ||
Column names (string array, or key-value pairs) | Column names (string array, or key-value pairs) | ||
|| | || | ||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | ||
|- | |||
||Skip (SqlDataFrame) | |||
||Number of rows to skip | |||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |- | ||
||TakeSample (SqlDataFrame) | ||TakeSample (SqlDataFrame) | ||
||Number of rows (Integer) | ||Number of rows (Integer) | ||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | || | ||
Takes a random sample of given number of rows of the original rows. Parameter is the number of rows to take. | |||
Example: Returns 100 random rows out of a datatable. | |||
<pre> | |||
DatatableById(1) | |||
.SqlDataFrame | |||
.TakeSample(100) | |||
.Collect(); | |||
</pre> | |||
|- | |||
||Unpivot (SqlDataFrame) | |||
|| | |||
# Value column name (String) | |||
# Name column name (String) | |||
# Columns to unpivot (String array) | |||
|| | |||
Performs the unpivot operation for the dataframe, i.e., rotates columns into rows. More information about unpivot: https://docs.snowflake.com/en/sql-reference/constructs/unpivot.html. | |||
Parameters: | |||
# '''Value column name''': Name of the generated column containing the unpivotted values. | |||
# '''Name column name''': Name of the generated column containing original column names of the unpivotted values. | |||
# '''Columns to unpivot''': Names of the columns in the source dataframe to be unpivotted. | |||
Note that the Unpivot function is only supported for SqlDataFrames (not in-memory DataFrames). | |||
Example: | |||
<pre> | |||
let result = df.Unpivot( | |||
"Value", | |||
"Name", | |||
["Column 1", "Column 2", "Column 3"] | |||
).Collect(); | |||
</pre> | |||
This example reads case attributes from a model and performs unpivot for them. Only string type of columns are unpivotted and also the case id columns is ignored. | |||
<pre> | |||
let caseAttributes = ModelById(123).CasesDatatable; | |||
caseAttributes.SqlDataFrame.Unpivot( | |||
"Value", | |||
"Case attribute", | |||
caseAttributes.Columns.Where(Datatype == "String" && Name != "CaseId").Name | |||
).Collect().toCsv() | |||
</pre> | |||
|- | |||
||<span id="WithClusterColumn">WithClusterColumn</span> (SqlDataFrame) | |||
|| | |||
# New column with cluster labels (String) | |||
# Parameters dictionary | |||
|| | |||
Performs clustering to data in an SQLDataFrame and creates a new SqlDataFrame with a column containing cluster labels. Cluster labels are integers starting from 0. | |||
Parameters: | |||
# New column name to contain the cluster labels. | |||
# Configuration for the clustering algorithm with following properties supported: | |||
#* '''NClusters''': Number of clusters to split the data into. | |||
#* '''Columns''': Columns to use in the clustering. | |||
Clustering uses the K-Means algorithm implemented with scikit-learn KMeans python package where random_state is set to be 0. Different data types are preprocessed before performing the clustering as follows: | |||
* Integers and floats are converted into numeric values between 0 and 1 using sklearn.preprocessin.MinMaxScaler. | |||
* Timestamps are converted into integer values, after it will get the same treatment as the numeric types. | |||
* All other data types are converted into multiple one-hot-encoded columns using sklearn.preprocessing.OneHotEncoder. | |||
Example: Cluster data in a datatable into 5 clusters based on columns "Region" and "Account Manager". | |||
<pre> | |||
DataTableById(1) | |||
.SqlDataFrame | |||
.WithClusterColumn("ClusterLabel", #{"NClusters": 5, "Columns": ["Region", "Account Manager", "Cost"]}) | |||
.Collect().ToCsv() | |||
</pre> | |||
|- | |||
||WithDenseRankNumberColumn (SqlDataFrame) | |||
|| | |||
# New column name (String) | |||
# Order by columns (String array) | |||
# Partition by columns (String array) | |||
# Ascending/descending order (Boolean array) | |||
|| | |||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
||WithColumn (SqlDataFrame) | |||
|| | |||
# New column name (String) | |||
# New column expression | |||
|| | |||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]], except instead of in-memory expressions, SqlDataFrame use [[SQL Expressions]]. | |||
|- | |||
||WithRankColumn (SqlDataFrame) | |||
|| | |||
# New column name (String) | |||
# Order by columns (String array) | |||
# Partition by columns (String array) | |||
# Ascending/descending order (Boolean array) | |||
|| | |||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |||
||WithRowNumberColumn (SqlDataFrame) | |||
|| | |||
# New column name (String) | |||
# Order by columns (String array) | |||
# Partition by columns (String array) | |||
# Ascending/descending order (Boolean array) | |||
|| | |||
Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]]. | |||
|- | |- | ||
||Where (SqlDataFrame) | ||Where (SqlDataFrame) | ||
||Condition expression | ||Condition expression | ||
||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]] | ||Same functionality as in the [[DataFrame_in_Expression_Language|DataFrame]], except instead of in-memory expressions, SqlDataFrame use [[SQL Expressions]]. | ||
|} | |} |
Latest revision as of 13:15, 7 October 2024
SqlDataFrame represents tabular data similar to an SQL query result. Data managed using SqlDataFrames is processed in the datasource where the data is located, i.e., in Snowflake or SQL Server (not in QPR ProcessAnalyzer Server memory). For each SqlDataFrame, there is an SQL query generated that is run in the datasource where the referenced datatables are located.
SqlDataFrame operations itself don't cause the SQL to execute in the datasource, but it will happen when the Collect function is called for an SqlDataFrame, which generates and executes the SQL query representing the SqlDataFrame and loads the data into QPR ProcessAnalyzer memory as a DataFrame (where it can be presented in a dashboard).
Each SqlDataFrame contain information about the datasource where the SQL query will be executed. When writing queries with several SqlDataFrames, the SqlDataFrames need to be located in the same datasource, to be able to execute the queries. If needed, data can be moved between datasources by using the Import or Persist functions. Alternatively, processing can be done in-memory by calling Collect for an SqlDataFrame and continuing calculation as in-memory DataFrame.
There is a similar API for the SqlDataFrames as there is for the DataFrames. Note that merging is not possible between SqlDataFrames like it's for DataFrames, but SqlDataFrame can be merged into a datatable.
SqlDataFrame properties
Property | Description |
---|---|
ColumnTypes (Dictionary) |
Returns information about the columns of the SqlDataFrame as an array of dictionaries with keys Name and DataType. Columns are returned in the same order as the columns exist in the data table. Column types are calculated based on the type of the column in the relational database management system (i.e. type of the column in an SQL table). Examples: table.SqlDataFrame.ColumnTypes Returns: [ #{ "Name": "name1", "DataType": "Integer" }, #{ "Name": "name2", "DataType": "String" } ] table.SqlDataFrame.ColumnTypes.Name Returns: ["name1", "name2"] table.SqlDataFrame.ColumnTypes.DataType Returns: ["Integer", "String"] |
DataSourceConnection | Returns connection object used by this SqlDataFrame to connect to its datasource. |
NColumns (Integer) | Returns number of columns in the dataset represented by the SqlDataFrame. |
NRows (Integer) | Returns number of rows in the dataset represented by the SqlDataFrame. |
SqlDataFrame functions
Function | Parameters | Description |
---|---|---|
Aggregate (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
Append |
|
Same functionality as in the DataFrame. |
ApplyFilter (SqlDataFrame) |
|
Filters the SqlDataFrame (assumed to contain events data) using given filter and returns SqlDataFrame containing the remaining event data after filtering is performed. Requires that the source SqlDataFrame has CaseId, EventType and TimeStamp mappings defined. Parameters:
Example: Returns SqlDataFrame containing only events for case id "12345". let model = ModelById(123); model .EventsDataTable .SqlDataFrame .ApplyFilter( #{ "Items": [ #{ "Type": "IncludeCases", "Items": [ #{ "Type": "Case", "Values": [ "12345" ] }] }] }, model.CasesDataTable.SqlDataFrame ) |
Collect (SqlDataFrame) | Enable result caching (boolean) |
Executes the SQL query for the SqlDataFrame in the datasource and returns results as an in-memory DataFrame. Then processing of the data can be continued as the in-memory DataFrame. The Enable result caching parameter determines whether unique (false) or repeatable (true) object names are used in the query that affects whether query results may come from a cache or need to be recalculated each time. Examples: DataTableById(123).SqlDataFrame.Head(100).Collect().ToCsv() Returns the top 100 rows from datatable id 123. |
ExcludeValues (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
GroupBy (GroupedDataFrame) |
Grouped columns (string array) |
Same functionality as in the DataFrame. |
Head (SqlDataFrame) | Number of top rows |
Same functionality as in the DataFrame. |
IncludeOnlyValues (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
Join (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
OrderByColumns (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
Persist (SqlDataFrame) |
|
Same functionality as in the DataFrame. Additionally, if the SQL query for the SqlDataFrame is run in the same system as the target datatable, all data processing and storage is done within the system to achieve efficient operation. |
Pivot (SqlDataFrame) |
|
Performs the pivot operation for the dataframe. More information about pivot: https://www.techonthenet.com/sql_server/pivot.php. Parameters:
Example: let data = ModelById(1) .EventsDatatable .SqlDataFrame; let result1 = data .Select(["User", "Event type"]) .Pivot("User", "User", ["John", "Greg", "James", "Sharon"], "Count").Collect(); // Returns user counts (on columns) for each event type (on rows) let result2 = data .Select(["User", "Event type", "Cost"]) .Pivot("Cost", "User", ["John", "Greg", "James", "Sharon"], "Sum").Collect(); // Returns costs for each user (on columns) and event type (on rows) |
RemoveColumns (SqlDataFrame) | Column names (string array) | Same functionality as in the DataFrame. |
RenameColumns (SqlDataFrame) | Key-value pairs of name mappings | Same functionality as in the DataFrame. |
RenameAllColumns (SqlDataFrame) | Template string | Same functionality as in the DataFrame. |
Select (SqlDataFrame) | Column names (string array, or key-value pairs) | Same functionality as in the DataFrame. |
SelectDistinct (SqlDataFrame) |
Column names (string array, or key-value pairs) |
Same functionality as in the DataFrame. |
Skip (SqlDataFrame) | Number of rows to skip | Same functionality as in the DataFrame. |
TakeSample (SqlDataFrame) | Number of rows (Integer) |
Takes a random sample of given number of rows of the original rows. Parameter is the number of rows to take. Example: Returns 100 random rows out of a datatable. DatatableById(1) .SqlDataFrame .TakeSample(100) .Collect(); |
Unpivot (SqlDataFrame) |
|
Performs the unpivot operation for the dataframe, i.e., rotates columns into rows. More information about unpivot: https://docs.snowflake.com/en/sql-reference/constructs/unpivot.html. Parameters:
Note that the Unpivot function is only supported for SqlDataFrames (not in-memory DataFrames). Example: let result = df.Unpivot( "Value", "Name", ["Column 1", "Column 2", "Column 3"] ).Collect(); This example reads case attributes from a model and performs unpivot for them. Only string type of columns are unpivotted and also the case id columns is ignored. let caseAttributes = ModelById(123).CasesDatatable; caseAttributes.SqlDataFrame.Unpivot( "Value", "Case attribute", caseAttributes.Columns.Where(Datatype == "String" && Name != "CaseId").Name ).Collect().toCsv() |
WithClusterColumn (SqlDataFrame) |
|
Performs clustering to data in an SQLDataFrame and creates a new SqlDataFrame with a column containing cluster labels. Cluster labels are integers starting from 0. Parameters:
Clustering uses the K-Means algorithm implemented with scikit-learn KMeans python package where random_state is set to be 0. Different data types are preprocessed before performing the clustering as follows:
Example: Cluster data in a datatable into 5 clusters based on columns "Region" and "Account Manager". DataTableById(1) .SqlDataFrame .WithClusterColumn("ClusterLabel", #{"NClusters": 5, "Columns": ["Region", "Account Manager", "Cost"]}) .Collect().ToCsv() |
WithDenseRankNumberColumn (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
WithColumn (SqlDataFrame) |
|
Same functionality as in the DataFrame, except instead of in-memory expressions, SqlDataFrame use SQL Expressions. |
WithRankColumn (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
WithRowNumberColumn (SqlDataFrame) |
|
Same functionality as in the DataFrame. |
Where (SqlDataFrame) | Condition expression | Same functionality as in the DataFrame, except instead of in-memory expressions, SqlDataFrame use SQL Expressions. |