Machine Learning Functions in Expression Language


Latest revision as of 07:26, 19 November 2024

This page describes the expression language functions and properties that implement machine learning functionality, such as clustering and prediction. For prediction, random forest is a supported algorithm. For clustering, the following algorithms are supported: KModes, KMeans and BalancedKMeans.

Snowflake models

For Snowflake-based models and data frames, machine learning functionality is available in Create Predicted Eventlog (prediction) and WithClusterColumn (clustering).

Clustering functions

Function Parameters Description
KModes
  • Matrix to cluster
  • Target number of clusters
  • Distance function

Performs KModes clustering for a numeric matrix. Implementation uses Accord.NET KModes method (http://accord-framework.net/docs/html/T_Accord_MachineLearning_KModes.htm).

Parameters:

  1. Matrix to cluster. Rows (1st dimension) represent data points and columns represent feature values (2nd dimension).
  2. Target number of clusters.
  3. Distance function to be used in the clustering process.

Returns an array with the following two elements:

  • Element 0: Array describing which cluster each input data point belongs to. Clusters are identified by a zero-based index.
  • Element 1: Array containing the following two elements:
    • Element 0: Computed final error of the clustering.
    • Element 1: Number of iterations performed in the clustering.

Examples:

KModes([[1, 2], [2, 3], [2, 2]], 2)
Returns (e.g.): [[0, 1, 0], [0, 2]]

KModes([[1, 2], [2, 3], [2, 2]], 3)
Returns (e.g.): [[2, 1, 0], [0, 1]]
KMeans
  • Matrix to cluster
  • Target number of clusters
  • Distance function
  • Additional parameters

Performs KMeans clustering for a numeric matrix. Implementation uses Accord.NET KMeans function (http://accord-framework.net/docs/html/T_Accord_MachineLearning_KMeans.htm).

Parameters:

  1. Matrix to cluster, where rows (1st dimension) represent data points and columns represent feature values (2nd dimension).
  2. Target number of clusters
  3. Distance function to be used in the clustering process.
  4. Additional parameters: Optional key value pairs. Supported keys and values: ComputeCovariance: If true, the result will include covariance matrices. Default = false.

Returns an array having the following elements:

  • Element 0: An array of all the cluster labels for all the rows in the input matrix in the same order as they were given in the matrix parameter.
  • Element 1: An array of length 2 having the following elements:
    • Element 0: Computed final error of the clustering.
    • Element 1: Number of iterations performed in the clustering.
  • Element 2: Covariance matrices. Only returned if ComputeCovariance is true.

Examples:

KMeans([[1, 2], [2, 3], [2, 2]], 2)
Returns (e.g.): [[0, 1, 0], [0.16667, 2]]

KMeans([[1, 2], [2, 3], [2, 2]], 3)
Returns (e.g.): [[2, 1, 0], [0, 1]]

KMeans([[1, 2], [2, 3], [2, 2]], 2, "manhattan", true)
Returns (e.g.): [[0, 1, 0], [0.33333, 2], <covariance matrices (k * columns * columns)>]

KMeans(OneHot(Codify([[123, "foo"], [456, "bar"], [456, "foo"]])), 2)
Returns (e.g.): [[0, 1, 0], [0.33333, 2]]
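
The nested return value described above can be easier to read once it is unpacked. The following is a minimal Python sketch, not QPR ProcessAnalyzer code: the helper name summarize_clustering is hypothetical, and the input is simply the result of the first KMeans example copied as a literal. It ignores the optional covariance element.

```python
# Hypothetical helper (an assumption, not part of QPR ProcessAnalyzer):
# interprets a KMeans/KModes-style result of the form
# [labels, [final_error, iterations]].
def summarize_clustering(result):
    labels = result[0]
    final_error, iterations = result[1]
    # Group the row indexes of the input matrix by their cluster label.
    clusters = {}
    for row_index, label in enumerate(labels):
        clusters.setdefault(label, []).append(row_index)
    return {"clusters": clusters, "error": final_error, "iterations": iterations}


# Result literal taken from the first KMeans example above.
summary = summarize_clustering([[0, 1, 0], [0.16667, 2]])
print(summary["clusters"])    # {0: [0, 2], 1: [1]}
print(summary["iterations"])  # 2
```

Here rows 0 and 2 of the input matrix share cluster 0, and row 1 is alone in cluster 1.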
BalancedKMeans
  • Matrix to cluster
  • Target number of clusters
  • Distance function
  • Additional parameters

Performs Balanced KMeans clustering for the given numeric matrix. The algorithm is based on http://accord-framework.net/docs/html/T_Accord_MachineLearning_BalancedKMeans.htm. The parameters and the return value structure are identical to those of the KMeans function.

Codify

Matrix to codify

Encodes the unique values of each column as unique integers. Based on the Accord.NET codification functionality: http://accord-framework.net/docs/html/T_Accord_Statistics_Filters_Codification.htm. Returns a codified matrix with exactly the same dimensions as the input matrix.

Examples:

Codify([["a", 4], ["c", 4], ["b", 3], ["c", 3]])
Returns: [[0, 0], [1, 0], [2, 1], [1, 1]]

Codify([[1,2], [3,4], [1,4]])
Returns: [[0, 0], [1, 1], [0, 1]]

Codify([[123, "foo"], [456, "bar"], [456, "foo"]])
Returns: [[0, 0], [1, 1], [1, 0]]
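
The examples above are consistent with codes being assigned per column in first-seen order. The following minimal Python sketch illustrates that behavior under this assumption; it is not the Accord.NET or QPR ProcessAnalyzer implementation, and the function name codify is only illustrative.

```python
def codify(matrix):
    """Replace each value with a per-column integer code.

    Assumption: codes are assigned in first-seen order within each column,
    which matches the documented examples.
    """
    if not matrix:
        return []
    # One value-to-code dictionary per column.
    column_codes = [{} for _ in matrix[0]]
    result = []
    for row in matrix:
        coded_row = []
        for col, value in enumerate(row):
            codes = column_codes[col]
            # The first time a value is seen it gets the next free code.
            coded_row.append(codes.setdefault(value, len(codes)))
        result.append(coded_row)
    return result


print(codify([["a", 4], ["c", 4], ["b", 3], ["c", 3]]))
# [[0, 0], [1, 0], [2, 1], [1, 1]]
```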
OneHot

Numeric matrix

One-hot encodes all matrix columns. Implementation uses Accord.NET OneHot method (http://accord-framework.net/docs/html/M_Accord_Math_Jagged_OneHot_1.htm).

Returns a matrix consisting of a concatenation of the one-hot encodings of each of the input matrix columns. The returned matrix has at least as many columns as the input matrix. For each input column, the corresponding one-hot vector contains zeros except for a single 1.

Examples:

OneHot([[0], [2], [1], [3]])
Returns: [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

OneHot(Codify([[123, "foo"], [456, "bar"], [456, "foo"]]))
Returns: [[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0]]
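
The column-wise concatenation described above can be sketched in Python as follows. This is an illustration only, not the Accord.NET implementation. It assumes that the input holds non-negative integer codes (such as the output of Codify) and that each column is expanded to a width of its largest code plus one, which matches the documented examples.

```python
def one_hot(matrix):
    """One-hot encode every column and concatenate the encodings row by row.

    Assumptions: values are non-negative integers, and each column is
    expanded to (largest value in the column + 1) positions.
    """
    if not matrix:
        return []
    # Width of the one-hot vector for each input column.
    widths = [max(column) + 1 for column in zip(*matrix)]
    result = []
    for row in matrix:
        encoded = []
        for value, width in zip(row, widths):
            # A single 1 at the index given by the value, 0 elsewhere.
            encoded.extend(1 if i == value else 0 for i in range(width))
        result.append(encoded)
    return result


print(one_hot([[0, 0], [1, 1], [1, 0]]))
# [[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 1, 0]]
```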

Prediction functions

Function Parameters Description
MLModel
  • ML model type (String)
  • Additional parameters (key-value pairs)

Creates a new machine learning model for making predictions. Takes the type of the model to create as a parameter. Currently the supported value is randomforest, which uses the Accord.NET RandomForest algorithm. Additional parameters:

  • ComputeCovariance: If true, the result will include covariance matrices. Default value is false.
let myMLModel = MLModel("randomforest", #{"ComputeCovariance": true});
Train (MLModel)
  • Input data
  • Expected outcomes
  • Parameters

Trains the given MLModel using the given input data and expected outcomes. Returns the trained MLModel object.

Parameters:

  • input data: Two dimensional array where:
    • The first dimension (rows) specifies different data points.
    • The second dimension (columns) specifies the feature values.
  • expected outcomes: An array of expected outcomes for each row in the input data. Must be in the same order as the rows in the input data.
  • parameters: Additional parameters for the MLModel:
    • NumberOfTrees: number of trees in the random forest. Default value is 10.
    • SampleRatio: Proportion of samples used to train each of the trees in the decision forest. Default value is 0.632.
Transform (array)

Input data

Transforms the given input data using the MLModel to generate predictions. Takes the input data as a parameter, which is a two-dimensional array where the first dimension (rows) specifies different data points and the second dimension (columns) specifies the feature values.

Returns an array of predictions. The prediction for each row in the input data is at the same index in the returned array.

MLModel (Machine Learning Model)

These properties are available for the MLModel object.

MLModel properties Description
Type Returns the exact type of the MLModel.

Examples

Example #1: Train a model using an event log and test its performance by replaying the training data itself.


Def("GetOneHotColumnInformation", (
  Let("el", _),
  ToDictionary([
    "et": OrderByValue(el.EventTypes),
    "at": ToDictionary(ConcatTop(OrderByTop(el.CaseAttributes, Name).[_: Values]))
  ])
));

Def("GenerateOneHot", "cases", (
  Let("columnInformation", _),
  cases.(
    Let("cas", _),
    Flatten(
      [
        columnInformation.Get("et").(Let("et", _), If(Count(cas.EventsByType(et)) > 0, 1, 0)),
        (
          Let("atColumns", columnInformation.Get("at")),
          OrderByValue(atColumns.Keys).(
            Let("key", _),
            Let("values", atColumns.Get(key)),
            Let("caseValue", cas.Attribute(key)),
            values.(If(_ == caseValue, 1, 0))
          )
        )
      ]
    )
  )
));

Let("el", EventLogById(1));
Let("columnInformation", el.GetOneHotColumnInformation());
Let("allCases", el.Cases);
Let("allCasesOH", columnInformation.GenerateOneHot(el.Cases));
Let("trainDataOH", allCasesOH);
Let("outcomes", allCases.(Duration > TimeSpan(24)));
Let("testDataOH", allCasesOH);
Let("predictions", 
  MLModel("randomforest")
    .Train(trainDataOH, outcomes)
    .Transform(trainDataOH));
Sum(Zip(outcomes, predictions).(_[0] == _[1] != 0)) / Count(outcomes)
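
The last expression of the example appears to compute the share of predictions that equal the expected outcomes. As a hedged illustration of that metric only, here is a minimal Python sketch; the function name accuracy and the sample lists are hypothetical and do not come from QPR ProcessAnalyzer.

```python
def accuracy(outcomes, predictions):
    """Return the fraction of predictions that equal the expected outcomes."""
    if len(outcomes) != len(predictions):
        raise ValueError("outcomes and predictions must have the same length")
    if not outcomes:
        raise ValueError("at least one outcome is required")
    # Count the positions where the prediction matches the expected outcome.
    matches = sum(
        1 for expected, predicted in zip(outcomes, predictions)
        if expected == predicted
    )
    return matches / len(outcomes)


# Made-up sample data: three of the four predictions are correct.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

Because the model is tested on the same data it was trained on, a score computed this way is likely to be optimistic.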

Example #2: Train a model using a 75% sample of an event log and test its performance using the remaining 25% of the event log.

Def("GetOneHotColumnInformation", (
  Let("el", _),
  ToDictionary([
    "et": OrderByValue(el.EventTypes),
    "at": ToDictionary(ConcatTop(OrderByTop(el.CaseAttributes, Name).[_: Values]))
  ])
));

Def("GenerateOneHot", "cases", (
  Let("columnInformation", _),
  cases.(
    Let("cas", _),
    Flatten(
      [
        columnInformation.Get("et").(Let("et", _), If(Count(cas.EventsByType(et)) > 0, 1, 0)),
        (
          Let("atColumns", columnInformation.Get("at")),
          OrderByValue(atColumns.Keys).(
            Let("key", _),
            Let("values", atColumns.Get(key)),
            Let("caseValue", cas.Attribute(key)),
            values.(If(_ == caseValue, 1, 0))
          )
        )
      ]
    )
  )
));

Let("el", EventLogById(1));
Let("columnInformation", el.GetOneHotColumnInformation());
Let("allCases", Shuffle(el.Cases));
Let("lastTrainCaseIndex", 0.75 * CountTop(el.Cases));
Let("trainCases", allCases[NumberRange(0, lastTrainCaseIndex)]);
Let("testCases", allCases[NumberRange(lastTrainCaseIndex + 1, CountTop(el.Cases) - 1)]);
Let("trainDataOH", columnInformation.GenerateOneHot(trainCases));
Let("testDataOH", columnInformation.GenerateOneHot(testCases));
Let("trainOutcomes", trainCases.(Duration > TimeSpan(24)));
Let("testOutcomes", testCases.(Duration > TimeSpan(24)));
Let("predictions", 
  MLModel("randomforest")
    .Train(trainDataOH, trainOutcomes)
    .Transform(testDataOH));
Sum(Zip(testOutcomes, predictions).(_[0] == _[1] != 0)) / Count(testOutcomes)
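
The shuffle-and-split step of the example can be sketched in Python as follows. This is a generic illustration of a 75/25 hold-out split, not QPR ProcessAnalyzer code: the function name train_test_split, the fixed seed and the integer case identifiers are all assumptions made for the sketch.

```python
import random


def train_test_split(cases, train_fraction=0.75, seed=0):
    """Shuffle the cases and split them into disjoint training and test sets."""
    shuffled = list(cases)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    # Round down so that the split index is always a valid integer position.
    split_index = int(train_fraction * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]


# Hypothetical case identifiers 0..7.
train, test = train_test_split(range(8))
print(len(train), len(test))  # 6 2
```

Slicing at a single integer index guarantees that every case lands in exactly one of the two sets.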

Example #3: Three sets of cases: training cases, target cases (subset of training cases) and test cases (independent set of cases). Try to predict which cases in the test set will eventually end up becoming a case in target cases.

Def("GetOneHotColumnInformation", (
  Let("el", _),
  ToDictionary([
    "et": OrderByValue(el.EventTypes),
    "at": ToDictionary(ConcatTop(OrderByTop(el.CaseAttributes, Name).[_: Values]))
  ])
));

Def("GenerateOneHot", "cases", (
  Let("columnInformation", _),
  cases.(
    Let("cas", _),
    Flatten(
      [
        columnInformation.Get("et").(Let("et", _), If(Count(cas.EventsByType(et)) > 0, 1, 0)),
        (
          Let("atColumns", columnInformation.Get("at")),
          OrderByValue(atColumns.Keys).(
            Let("key", _),
            Let("values", atColumns.Get(key)),
            Let("caseValue", cas.Attribute(key)),
            values.(If(_ == caseValue, 1, 0))
          )
        )
      ]
    )
  )
));

Let("el", <event log to use>);
Let("trainCases", <cases to use for training>);
Let("targetCases", <cases representing the properties we want to try to predict (subset of traincases)>);
Let("testCases", <cases to use for testing>);
Let("targetCasesDict", ToDictionary(targetCases:true));
Let("outcomes", trainCases.(Let("c", _), targetCasesDict.ContainsKey(c) ? 1 : 0));
Let("columnInformation", el.GetOneHotColumnInformation());

Let("mlModel", MLModel("randomforest"));
mlModel.Train(columnInformation.GenerateOneHot(trainCases), outcomes);
mlModel.Transform(columnInformation.GenerateOneHot(testCases));
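
The outcome labels in this example are derived from membership: a training case gets 1 if it also belongs to the target cases and 0 otherwise. A minimal Python sketch of that labeling step, using hypothetical string case identifiers and an illustrative function name, could look like this:

```python
def label_outcomes(train_cases, target_cases):
    """Return 1 for each training case found in the target cases, else 0."""
    # A set gives constant-time membership checks, similar in spirit to the
    # dictionary lookup used in the example above.
    target_set = set(target_cases)
    return [1 if case in target_set else 0 for case in train_cases]


# Hypothetical case identifiers.
print(label_outcomes(["c1", "c2", "c3", "c4"], ["c2", "c4"]))  # [0, 1, 0, 1]
```

The labels stay in the same order as the training cases, which Train requires for its expected outcomes.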

Example #4: Customized version of example #3 using actual event type and attribute names. Three sets of cases: training cases, target cases (subset of training cases) and test cases (independent set of cases). Try to predict which cases in the test set will eventually end up becoming a case in target cases. Generate HTML result ready to be sent out in an email message.

Def("GenerateOneHot", "cases", { let columnInformation = _;
  cases.{ let cas = _;
    Flatten( [
      { let etColumns = columnInformation.Get("et"); etColumns.{ let  et = _; If(Count(cas.EventsByType(et)) > 0, 1, 0) } }, 
      { let atColumns = columnInformation.Get("at"); OrderByValue(atColumns.Keys).{ let key = _; let values = atColumns.Get(key); let caseValue = cas.Attribute(key); values.(If(_ == caseValue, 1, 0)) } }
    ] )
  }
});

// Make predictions for the whole model:
// let el = ModelById(39694).EventLog;

// Make predictions for cases in a particular filter
let el = EventLogById(109773);

let currenttime = now;

let trainCases = el.Cases.Where(Catch(currenttime - EventTimeStampsByType("hs_analytics_last_visit_timestamp")[0] > TimeSpan(30) , true));
let targetCases = trainCases.Where(_.Attribute("Lifecycle Stage").In(["opportunity", "marketingqualifiedlead", "customer", "salesqualifiedlead"]));
let testCases = el.Cases.Where(Catch(currenttime - EventTimeStampsByType("hs_analytics_last_visit_timestamp")[0] < TimeSpan(30) , false));

let targetCasesDict = ToDictionary(targetCases:true);
let outcomes = trainCases.{ let c = _; targetCasesDict.ContainsKey(c) ? 1 : 0 };
let columnInformation = ToDictionary([
  "et": OrderByValue(el.EventTypes).Where(_.Name.In(["hs_email_last_click_date","first_conversion_date", "hs_analytics_last_visit_timestamp"])),
  "at": ToDictionary(ConcatTop(OrderByTop(el.CaseAttributes.Where(_.Name.In(["Lifecycle Stage", "Original Source", "QPR Digest", "Unsubscribed from all email"])) , Name).[_: Values]))
]);
let mlModel = MLModel("randomforest");
mlModel.Train(columnInformation.GenerateOneHot(trainCases), outcomes);

let predictions = mlModel.Transform(columnInformation.GenerateOneHot(testCases));
let predictedCases = ToDataFrame(Zip(testCases, predictions).Where(_[1] == 1) , ["id", "pred"]).id;

let body = "<html><body><table><tr><td>Last visited</td><td>Name</td></tr><tr>" +
  StringJoin( "</tr><tr>", "<td>" + predictedCases.EventTimeStampsByType("hs_analytics_last_visit_timestamp")[0] + "</td><td>" + predictedCases.Name + "</td>") +
  "</tr></table></body></html>";

body;