DataFlow in Expression Language: Difference between revisions

From QPR ProcessAnalyzer Wiki
Latest revision as of 22:47, 15 February 2023

DataFlow is an object in the expression language that represents a stream of tabular data. Its data structure is similar to that of the DataFrame, but a DataFrame stores all of its contents in system memory, so large data volumes require correspondingly large amounts of memory. In a DataFlow, by contrast, the content "flows" from the source to the destination, and the data can be manipulated while only a small portion of the entire dataset is in memory at any one time. DataFlows are therefore suitable for ETL where data volumes are high.

A DataFlow continues to run until it completes. It completes automatically when all queried items have been returned, and it can also be completed explicitly by calling the Complete function. After a DataFlow has been completed, no new items can be added to it. When a DataFlow is collected into an in-memory DataFrame, the Collect call waits until the DataFlow completes to make sure that all items are included in the DataFrame.

== DataFlow properties ==
{| class="wikitable"
!'''Property'''
!'''Description'''
|-
||HasFailed (boolean)
||Returns true if the DataFlow is in the failed state, i.e., the Fail function has been called for it.
|-
||IsCompleted (boolean)
||Returns true when the Complete function has been called for the DataFlow and there are no more unread items in it.
|}
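
The HasFailed property can be illustrated with a minimal sketch. It uses only the functions documented on this page; reading the property with dot notation and the <code>//</code> comment syntax are assumptions, and the error message is made up for illustration.
<pre>
// Create a DataFlow that initially contains one row.
let flow = ToDataFlow(ToDataFrame([[1, "red"]], ["id", "color"]));

// Fail completes the DataFlow and sets it to the failed state.
flow.Fail("Source system was unreachable.");

// According to the property description, this is expected to be true.
flow.HasFailed
</pre>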

== DataFlow functions ==
{| class="wikitable"
!'''Function'''
!'''Parameters'''
!'''Description'''
|-
||Append (DataFlow)
||DataFrame to append
||
Adds the given DataFrame to the DataFlow.

Examples:
<pre>
let myDataFlow = ToDataFlow(ToDataFrame([], ["id", "color"]));
myDataFlow
  .Append(ToDataFrame([[1, "red"], [2, "green"]], ["id", "color"]));
myDataFlow
  .Complete()
  .Collect()
  .ToCsv()
</pre>
|-
||Collect (DataFrame)
||Parameters (Dictionary)
||
Returns an in-memory DataFrame extracted from the DataFlow. Returns null if either the timeout has been exceeded or the DataFlow has been completed and is empty.

Parameters:
# CollectChunk: When ''true'', returns (as a DataFrame) the rows that are currently in the DataFlow waiting to be processed. When ''false'', waits for the DataFlow to complete and returns its entire contents (no more new items will arrive in the DataFlow).
# Timeout: Maximum number of milliseconds to wait for data to appear in the DataFlow before returning null.

Examples:
<pre>
ToDataFlow(ToDataFrame([], ["id", "color"]))
  .Append(ToDataFrame([[1, "red"], [2, "green"]], ["id", "color"]))
  .Append(ToDataFrame([[3, "blue"]], ["id", "color"]))
  .Complete()
  .Collect(#{"CollectChunk": true})
  .ToCsv()
</pre>
|-
||Complete (DataFlow)
||(none)
||
Declares that the DataFlow is completed, i.e., no more new items will be added to it.

Examples:
<pre>
myDataFlow.Complete();
</pre>
|-
||Fail (DataFlow)
||Error message (String)
||
Completes the DataFlow and sets it to the failed state with the given error message, which describes why the DataFlow failed. Returns the DataFlow object itself.

Note that after a DataFlow has been completed, no new items can be added to it.

Examples:
<pre>
dataFlow.Fail("Error occurred during data extraction.");
</pre>
|-
||Persist (Datatable)
||
# Datatable name (String)
# Additional parameters (Dictionary)
||Writes the DataFlow into a datatable. Works similarly to the function of the same name in the [[DataFrame_in_Expression_Language#DataFrame_Functions|DataFrame]].
|}
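
The Timeout parameter of Collect is not covered by the examples above, so the following minimal sketch shows one way it might be used. It assumes that CollectChunk and Timeout can be given in the same parameter dictionary and that <code>//</code> starts a comment; the timeout value of 5000 is arbitrary.
<pre>
let flow = ToDataFlow(ToDataFrame([], ["id", "color"]));
flow.Append(ToDataFrame([[1, "red"]], ["id", "color"]));

// Take the rows currently waiting in the DataFlow, waiting at most
// 5000 milliseconds for data to appear. The result is null on timeout.
let chunk = flow.Collect(#{"CollectChunk": true, "Timeout": 5000});

// Declare that no more items will be added.
flow.Complete();
chunk.ToCsv()
</pre>
Because Collect may return null, a script that relies on the chunk should check the result before using it.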

== DataFlow sources ==

The following functions extract data from different sources and create DataFlows.

{| class="wikitable"
!'''Function'''
!'''Parameters'''
!'''Description'''
|-
||ExtractOdbc (DataFlow)
||Query parameters (Dictionary)
||(available in the future)
|-
||ExtractSalesforce (DataFlow)
||Query parameters (Dictionary)
||(available in the future)
|-
||[[ExtractSap_Function|ExtractSap]] (DataFlow)
||Query parameters (Dictionary)
||[[ExtractSap_Function|(see documentation)]]
|-
||ToDataFlow (DataFlow)
||Initialization DataFrame/SqlDataFrame
||
Creates a new DataFlow and optionally initializes it with the given DataFrame or SqlDataFrame.

Examples:
<pre>
ToDataFlow(ToDataFrame([[1, "red"], [2, "green"]], ["id", "color"]))
  .Append(ToDataFrame([[3, "blue"]], ["id", "color"]))
  .Complete()
  .Collect()
  .ToCsv()
</pre>
|}
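
A DataFlow created by one of these functions can also be written directly to a destination instead of being collected into memory. The following is a minimal sketch of that source-to-destination pattern, not a definitive recipe: the datatable name "ColorsTable" is hypothetical, and it is assumed that the additional parameters of Persist can be left out and that <code>//</code> starts a comment.
<pre>
// Source: a DataFlow initialized from a small in-memory DataFrame.
ToDataFlow(ToDataFrame([[1, "red"], [2, "green"]], ["id", "color"]))
  // Declare that no more items will be added.
  .Complete()
  // Destination: write the contents into a datatable.
  .Persist("ColorsTable")
</pre>
With a real extraction source such as ExtractSap, the DataFlow completes automatically when all queried items have been returned, so the explicit Complete call would likely not be needed.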