SQL Expressions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search

SQL expressions are special expressions, which are converted into SQL and run in an external system that supports SQL (e.g., SQL Server, Snowflake, Databricks, Redshift). Only a subset of QPR ProcessAnalyzer expression language functionalities are supported by the SQL expressions (which are explained in this page). SQL expressions are used e.g. in the Where and WithColumn functions in SQLDataFrames.

Operators

Following operators are supported by by the SQL expressions:

  • Arithmetic operators: +, -, *, /, %
  • Comparison operators: ==, <, <=, >, >=, !=.
  • Logical operators: &&, ||, !
  • Data types: string ("this is a string"), integer (123), decimal number (123.45), boolean (true, false), null value (null)

Expressions resulting in boolean value need to be enclosed to the CaseWhen function. For example, expression Column("MyColumn") < 5 doesn't work, but following does:

CaseWhen(Column("MyColumn") < 5, true, false)

Mathematical functions

Function Description
Abs Returns absolute value of given number. Example:
Abs(-5)
Returns: 5
Ceiling

Returns given value rounded to the nearest equal or larger integer. The data type should be one of the numeric data types. If the value is null, then the result is also null.

Floor

Returns given value rounded to the nearest equal or smaller integer. The data type should be one of the numeric data types. If the value is null, then the result is also null.

Log Returns given logarithm of given number. The first parameter is the base number and the second parameter is number to take the logarithm from. Example:
Log(10, 1000)
Returns: 3

Log(2, 16)
Returns: 4
Mod Returns remainder of division operation of given numbers. The first parameter is the dividend and the second parameter is the divisor. Example:
Mod(10, 4)
Returns: 2
Power Returns given power of given number. First parameter is the base number and second is the exponent: Example:
Power(2, 4)
Returns: 16
Rand Returns random number between 0 and 1, where 0 is included and 1 is excluded. It's possible that when getting random for multiple columns, the same row may have the same random number. Example:
Rand()
Returns: 0.78452600 (for example)
Round

Rounds given number to given precision. Parameters:

  • number: Number to round.
  • precision (optional): Number of digits to round the given number to. When positive value, the given number is rounded to the corresponding number of decimals. When zero or omitted, the given number is rounded to the nearest integer. When negative value, the given number is rounded to the corresponding number of digits left of the decimal point (e.g. tens, hundreds, etc.).

Examples:

Round(123.456)
Returns 123

Round(123.456, 2)
Returns 123.46

Round(123.456, -1)
Returns 120

Round(123.456, 0)
Returns 123
Sign Return sign of given number, i.e., returns 1 (when number is a positive number) or 0 (when number is a negative number or zero). Example:
Sign(-2)
Returns: -1
Sqrt Return square root of given number. Example:
Sqrt(9)
Returns: 3

Date functions

Function Description
DateAdd

Adds a duration to a date, i.e., moves the date back or forth in time. The duration is specified as integer and the unit can be chosen from variety of options. The duration can also be a negative number for moving to an earlier time.

Parameters:

  1. time unit: Unit of time that the duration number represents. Supported values are year, quarter, month, week, day, hour, minute, second and millisecond.
  2. duration: Duration as number to add to the date. This needs to be integer, so decimal numbers cannot be used.
  3. date: Date to move.
DateAdd("month", 1, "01/01/2022")
Returns 1st of February 2022 at midnight.

DateAdd("year", -2, "01/01/2022")
Returns 1st of January 2020  at midnight.

DateAdd("hour", 12, "01/01/2022")
Returns 1st of January 2022 at 12 o'clock.
DateDiff

Calculates how many of the specified date part boundaries there are between the specified dates.

Parameters:

  1. time unit: Time unit in which the difference is calculated between the dates. Supported values are year, quarter, month, week, day, hour, minute, second and millisecond
  2. start date: Starting timestamp.
  3. end date: Ending timestamp.
Day

Returns the days of the month (1-31) of given timestamp.

Day(Column("DateColumn"))
DurationBetweenDates Calculates the duration between the specified dates and returns the precise duration as float. This function gives more precise result than the DateDiff function that complies with the traditional logic in the SQL language. Parameters:
  1. time unit: Time unit in which the duration is returned. Supported values are year, quarter, month, week, day, hour, minute, second and millisecond.
  2. start date: Starting timestamp.
  3. end date: Ending timestamp.

Example:

DurationBetweenDates("hour", Column("date1"), Column("date2"))
Hour Returns the hours part (0-59) of given timestamp.
Millisecond Returns the milliseconds part (0-999) of given timestamp.
Minute Returns the minutes part (0-59) of given timestamp.
Month Returns the months part (1-12) of given timestamp.
Second Returns the seconds part (0-59) of given timestamp.
TruncateDate

Truncates given date to given time unit. The truncation gives the timestamp of the beginning of the period defined by the time unit (logic thus resembles the floor function for numbers).

Parameters:

  1. date: Date to be truncated.
  2. time unit: Time unit to which the date is truncated. Supported values are year, halfyear, quarter, month, week, day, hour, minute, second, and millisecond. For the week unit, SQL Server or Snowflake settings (depending where the processing is done) determine the first day of the week (usually Sunday or Monday).

Examples:

TruncateDate(Column("MyDateColumn"), "year")
Returns timestamp representing beginning of a year, e.g. 2022-01-01 00:00:00.

TruncateDate(Column("MyDateColumn"), "day")
Returns timestamp representing beginning of a day, e.g. 2022-04-16 00:00:00.
Year Returns the year of given timestamp.

String functions

Function Description
CharIndex Returns the starting position of the first occurrence of the first argument in the second argument. The first character has the starting position 1. Returns 0 if the search doesn't find any results.

Parameters:

  • Target expression: String where to search from.
  • Searched expression: String that is searched from the other string.
  • Start index (optional): Number for the position from where to start the search. Value 1 is the start position.

Examples:

CharIndex("c", "abcdefg")
Returns: 3

CharIndex("ef", "abcdefg")
Returns: 5

CharIndex("fe", "abcdefg")
Returns: 0

CharIndex("c", "abcdefgc", 4)
Returns: 8
Concat

Return the concatenated string value of given values.

Concat("part 1", "part 2")
Returns "part 1part 2"

Concat(Column("column1"), " ", Column("column2"))
Returns column1 and column2 value concatenated separated by space.
Length

Returns length of a given string, i.e., the number of characters in a string.

Examples:

Length("test")
Returns: 4

Length("long text ending to space ")
Returns: 26

Length(Column("MyData"))
Like

Determines whether given string matches given pattern, and returns either true (for match) or false (for no match). Following special characters can be used in the pattern:

  • % (percentage): for matching any sequence of characters
  • _ (underscore): for matching any single character

Parameters:

  • String to find matches from
  • Search pattern
  • Escape character (optional)

When the percentage and underscore characters need to be used as normal characters in the search pattern (not as special characters), escaping for them is needed. By default, there is no escape character available, so it needs to be defined using the 3rd parameter. Any character can be used as the escape character, but it might be a good practice to use the backslash (\). See below an example of using the escape character.

The comparison is case-sensitive for Snowflake, and for SQL Server it is dependent on the server collation.

The Like function can be used with Where and CaseWhen functions which accept boolean values.

Examples:

Like(Column("TextData"), "a%")
Returns true for all rows starting with "a".

Like(Column("TextData"), "%aa%")
Returns true for all rows containing "aa".

Like(Column("TextData"), "abc_")
Returns true for all rows starting with "abc" and having four characters.

Like(Column("TextData"), "__a")
Returns true for all rows ending to "a" and having three characters.

dataFrame.Where(Like(Column("Region"), "t%"))
Filters dataframe rows where column Region starts with "t".

Like("a%b", "%\\%%", "\\")
Searches for a percentage character.
Substring

Returns substring of given string, based on given start index and length. Function has following parameters:

  • expression: Expression that evaluates to the string where the substring is taken.
  • start: Starting index of the substring. Start index 1 is the beginning of the string.
  • length: Length of the substring. This parameter is optional, and if left out, the substring spans until the end.
Substring("123456789", 5)
Returns: 56789

Substring("123456789", 5, 3)
Returns: 567

Substring(Column("MyColumn"), 1, 4)
Returns the first 4 characters of the MyColumn data.
ToLower Return string where all the characters of the input string have been converted into lower case characters.

Examples:

ToLower("Test")
Returns: test
ToUpper Return string where all the characters of the input string have been converted into upper case characters.

Example:

ToUpper("Test")
Returns: TEST
Trim

Trims a string, i.e., removes spaces from the beginning and end of a string. Only characters with code 32 are removed, which is the most common space character.

Example:

Trim(Column("MyColumn"))
Returns column "MyColumn" where leading and trailing spaces have been removed.

Trim(" test test  ")
Returns "test test".

AggregateFrom function

Aggregates a value from related objects, where there can be several of them a source object (e.g., when going from cases to events). See the type of relations between objects in the process mining concepts. The diagram shows that when going to direction where the target objects has count of N, AggregateFrom function needs to be used.

Parameters:

  1. Aggregation level: Aggregation level to aggregate from. Also an expression can be used to produce the dataset.
  2. Aggregation function: Aggregation function or object definition.
  3. Expression: Expression to generate the values to be aggregated. Default value is null.
  4. Filter: Optional filter to apply prior to performing the aggregation. Filter is given as JSON like syntax using dictionaries, arrays and scalar values.

Example for EventTypes:

AggregateFrom(Events, "Count")
Returns the number of events having each event type.

Example for Cases:

AggregateFrom(Events, #{ "Function": "List", "Ordering": ["TimeStamp"], "Separator": "#,#" }, Column("EventType"))
Returns variation/event type path string for all the cases.

GetValueFrom(Variations, AggregateFrom(Cases, "Count"))
Returns the number of cases having the same variation for every case.

DateDiff("Seconds", AggregateFrom(Events.Where(Column("EventType") == "Sales Order"), "Min", Column("TimeStamp")), AggregateFrom(Events.Where(Column("EventType") == "Invoice"), "Max", Column("Timestamp")))
Returns the duration in seconds between the first occurrence of "Sales Order"-event type and the last occurrence of "Invoice" event type for each case.

Example for Model:

AggregateFrom(Cases, "Count", null, #{"Items":[#{"Type":"IncludeCases","Items":[#{"Type":"CaseAttributeValue","Values":["Dallas"], "Attribute":"Region"}]}]})
Returns the total number of cases in the model having "Dallas" as the value of "Region" case attribute.

GetValueFrom function

Retrieves a value from a related object that there may be only one for a source object (e.g., when going from events to cases). See the type of relations between objects in the process mining concepts. The diagram shows that when going to direction where the target objects has count of 1, GetValueFrom function needs to be used.

Parameters:

  1. Aggregation level: Aggregation level to aggregate from. This includes possible additional data frame expressions to prepare the aggregation level.
  2. Expression: Expression to evaluate in given aggregation level to get the returned value.
  3. Filter: Optional filter to apply prior to performing expression evaluation. Filter is given as dictionary following the JSON filter syntax.

Examples as measure expression for events:

GetValueFrom(Cases, Column("Account Manager\"))
Returns for each event the value of Account Manager case attribute.

GetValueFrom(Variations, Column("Variation"))
Returns for each event variation/event type path string of its case.

Examples as measure expression for events:

GetValueFrom(Variations, AggregateFrom(Cases, "Count"))
Returns the number of cases having the same variation for every case.

GetValueFrom(Cases, Column("Variation"), #{"Items":[#{"Type":"IncludeEventTypes","Items":[#{"Type":"EventType","Values":["Shipment","Invoice"]}]}]})
Returns cases with their variations where only "Shipment" and "Invoice" event types are taken into account.

Other functions

Function Description
CaseWhen

Goes through conditions and returns a value when the first condition is true, similar to an if-then-else structure. Once a condition is true, it will stop reading and return the result. If no conditions are true, it returns the value in the else expression. If the else expression is not defined (i.e. there are even number of parameters), null value is returned.

Consists of any number of pairs of condition and value expressions followed by an optional else expression. The odd parameters are the conditions and the even parameters are the return values.

CaseWhen(Column("a") == null, 1, Column("a") < 1.0, 2, 3)
Returns 1 if the value of column "a" is null.
Returns 2 if the value of column "a" is less than 1.0.
Returns 3 otherwise.

Returns given value rounded to the nearest equal or larger integer. The data type should be one of the numeric data types. If the value is null, then the result is also null.

Coalesce

Returns the first non-null parameter. There can be any number of parameters. If all parameters are null, returns null.

Coalesce(null, 3, 2)
Returns 3.

Coalesce(Column("column1"), "N/A")
Returns column "column1" value, except replaces nulls with "N/A".
Column

Return the value of given column.

Column("column1")

Column("My Column 2")
In

Returns true, if the test expression (given as the first parameter) matches with any of the other expressions given as parameters (starting from the seconds parameter).

Examples:

In(Column("Country"), "Germany", "France", "UK")
Return true if column "Country" is Germany, France or UK.

In("Germany", Column("SourceCountry"), Column("DestinationCountry"))
Returns true, if column SourceCountry or DestinationCountry is Germany (or both columns).
Lag Similar to the Lead function, except refers to the previous row(s) instead of the next row(s).

Example: For events list, get the MyAttribute attribute value from the two steps previous event:

Lag(Column("MyAttribute"), [TimeStamp, EventType], [true, true], [CaseId], 2, "N/A"))
Lead

Returns value that is evaluated by accessing data in the next row. If a column with given columnId already exists in the DataFrame, the new column will overwrite the existing column.

Parameters:

  1. expression: Expression that is evaluated in the context of the next row(s).
  2. order by: Array of column(s) to order the rows within each partition.
  3. ascending/descending order: Array of boolean values, where each value indicates whether to sort the corresponding column in ascending (true) or descending (false) order. By default, it's the ascending order. The length of the array must be equal to the length of previous parameter (orderBy).
  4. partition by: Array of column(s) to partition the rows. Partitioning divides the dataset into distinct groups which witch are each sorted separately. When referring to the next row(s), partitions cannot see each others.
  5. offset (optional): Number of rows to go forward. Setting a negative offset will refer to the previous row(s) instead of next row(s). Default value is 1.
  6. default (optional): Expression to return when the offset goes out of bounds.

Example: For events list, get the next event timestamp:

Lead(TimeStamp, [TimeStamp, EventType], [true, true], [CaseId]))
Variable

Returns value of given variable that is available in the context where the SQL expression is run. Supports number, string and boolean values.

Examples:

let myRegion = "Dallas";
DatatableById(123).SqlDataFrame.Where(Column("Region") == Variable("myRegion")).Collect()
Filters datatable by Region is Dallas.

Process mining objects

The following variable names are supported in the beginning of a root expression and in the AggregateFrom and GetValueFrom functions:

  • Cases: Returns SqlDataFrame for cases. There are following properties:
    • CaseId: Case id.
    • All case attributes (can be referred using Column("<AttributeName>")
  • Events: Returns SqlDataFrame for events with following properties:
    • CaseId: Case id.
    • EventType: Event type name.
    • Timestamp: Event timestamp.
    • All event attributes (can be referred using Column("<AttributeName>")
  • EventTypes: Returns SqlDataFrame for event types. There are the following properties:
    • EventType: Event type name.
  • Variations: Returns SqlDataFrame for variations. There the following properties:
    • Variation: Variation identifier, which is concatenated event type names separated by separator "#,#".
  • Flows: Returns SqlDataFrame for flows. There are the following properties:
    • FromEventType: Event type name of the flow start.
    • ToEventType: Event type name of the flow end.
  • FlowOccurrences: Returns SqlDataFrame for flow occurrences. There are the following properties:
    • CaseId: Case id.
    • FromEventType: Event type name of the flow start.
    • FromTimeStamp: Time stamp of the flow start event.
    • All starting event attributes when adding text From to the beginning of the attribute name (can be referred using Column("From<AttributeName>")
    • ToEventType: Event type name of the flow end.
    • ToTimeStamp: Time stamp of the flow end event.
    • All ending event attributes when adding text To to the beginning of the attribute name (can be referred using Column("To<AttributeName>")
  • Model: Returns SqlDataFrame containing one row representing the model. There are the following properties:
    • ModelId: Model id.

After these variables, all functions supported by the SqlDataFrame can be used.

Examples: For cases (and also events), the case id can be referred using CaseId:

Cases.Where(CaseId == "Case_123")

Assuming that there is an Order Id column that is mapped to the CaseId, also the original column name can be used:

Cases.Where(Column("Order Id") == "Case_123")

For events, the event type can be referred using EventType:

Events.Where(EventType == "Order created")

Assuming that there is an Process step column that is mapped to the EventType, also the original column name can be used:

Cases.Where(Column("Process step") == "Order created")