Difference between revisions of "Clustering Analysis"

From QPR ProcessAnalyzer Wiki
 
(3 intermediate revisions by the same user not shown)
Line 1: Line 1:
The Clustering Analysis view groups cases in the model so that the cases inside a group are similar to each other (e.g. cases that have the same case attribute values end up in the same group). Clustering is based on advanced Machine Learning and Artificial Intelligence algorithms. By default, Clustering Analysis uses a built-in, in-memory k-modes algorithm with categorized values for Event Type occurrences and Case Attribute values. The algorithm does not guarantee convergence to the global optimum, which means that subsequent Clustering Analysis runs may produce slightly different results. See this [https://en.wikipedia.org/wiki/Cluster_analysis Wikipedia article] for more about the idea behind clustering.
+
Clustering Analysis groups cases in the event log so that the cases within a group are similar to each other, e.g. cases that have the same case attribute values end up in the same group. Clustering is based on advanced machine learning and artificial intelligence algorithms. By default, the clustering analysis uses a built-in, in-memory ''k-modes'' algorithm with categorized values for event type occurrences and case attribute values. The algorithm does not guarantee convergence to the global optimum, which means that subsequent clustering analysis runs may produce slightly different results. See this [https://en.wikipedia.org/wiki/Cluster_analysis Wikipedia article] for more about the idea behind clustering.
  
You can use the Clustering Analysis View, for example, to check data integrity. That is, the Clustering Analysis might reveal that the model actually contains data from two different processes.
+
Clustering analysis has many use cases, e.g. you can use it to check data integrity: the analysis might reveal that the model actually contains data from two different processes.
  
 
[[File:Clusteringanalysis.png|800px]]
 
 
== Clustering Calculation Principle ==
 
 
Clustering analysis consists of two phases:
 
* Phase 1: Clustering
+
* Phase 1: Actual clustering (i.e. dividing cases into similar groups)
* Phase 2: Root cause analysis to explain the clustering results
+
* Phase 2: Root cause analysis to explain the clusters
 
 
The dropdown settings affect the data features that are given to the clustering phase. You have done the right thing by including only the one event type that occurs for each case. This way, the clustering in effect uses only the case attribute information. However, when the root cause analysis explains the results, it finds that some event types correlate strongly with the clustering results, even though those event types were not included in the clustering phase as parameters.
 
  
“Saving of cluster identity”: When a set of clusters emerges that seems to have a useful meaning, I would like to be able to “save” it, e.g. by creating a new case attribute on the fly that is set to the cluster identity (in the case above, 33% of the cases should get this “new” attribute set to “Cluster 01”), e.g. to go back to process discovery and filter only on “Cluster 01”.
+
This means that the clustering analysis does not show the individual cases in each cluster, but the features that describe each cluster. Note that the case attribute and event type settings affect the data features that are given to the clustering phase.
Doesn’t seem to be possible? What are your thoughts on the “next step” when a set of interesting clusters are identified?
 
  
Correct. It is not possible to save the clustering results at the moment. The next step would be to identify relevant business areas based on the clustering results. I would do this by reducing the number of event types and case attributes from the clustering source parameters so that eventually, I would only have the 1-3 most relevant features left. Then finally, I would create a calculated case attribute or new Filters to use that grouping in further analysis.
+
[[File:Clusteringanalysis.png|1100px]]
  
Another similar analysis would be to state, e.g., that “I am looking for two clusters”, where cluster one contains all cases that pass through the “won” event and the other cluster contains the cases that pass through “lost”. Now, this is what the Root cause analysis does, right?
+
== How to Use Clustering Analysis ==
 
+
The right panel contains the clustering analysis results. The table shows the clusters, how many cases are in each cluster, and the following details for each cluster:
If the clustering is done using only one case attribute that has only two possible values, while the Event Type dimension is “disabled” by selecting only one event type that occurs in all cases, and the number of clusters is set to 2, then the phase 1 clustering will most likely produce two clusters that each contain only one of the two kinds of cases. After this, the phase 2 root cause analysis will give exactly the same result as a normal root cause analysis, provided that the random initialization managed to select initial cases from different case attribute groups.
 
 
 
== Left Panel ==
 
You can use the left panel to filter cases. Note that you are not bound to using just the Flowchart analysis, as you can change the analysis by right-clicking the analysis and selecting a different type of analysis shown on the panel.
 
 
 
== Right Panel ==
 
The right panel contains the clustering analysis. The table shows the clusters, how many cases are in each cluster, and the following details for each cluster:
 
 
* '''Feature''' and '''Value''': These two columns list the case attribute and other values that are common to the cases in the cluster.
 
 
* '''Cluster Density %''': Share of cases having this feature value within the cluster (i.e. the number of cases having the value shown on the row in this particular cluster divided by the number of cases in the cluster * 100).
 
 
* '''Total Density %''': Share of cases having this feature value in the whole data set (i.e. the total number of cases having the value shown on the row divided by the total number of cases * 100).
 
 
* '''Contribution %''': Share of cases whose membership in this cluster can be explained by this feature value. The scale is such that 0% means that the feature value isn't specific to this cluster and 100% means that all cases belonging to this cluster can be explained by this feature value.
 +
 +
The clustering analysis has the following settings:
 +
* '''Clusters''': Number of clusters which the cases are divided into.
 +
* '''Cluster rows''': Number of describing features shown for each cluster. The best describing features are on top.
 +
* '''Attributes''': Case attributes that are taken into account in the clustering analysis. If none is selected, all case attributes are selected. You can restrict which case attributes are selected, if you want the clustering to be done based on only certain features.
 +
* '''Events''': Event types whose occurrences are taken into account in the clustering analysis. If none is selected, all event types are selected.
 +
You can use the left panel to filter cases. Note that you are not bound to using just the Flowchart analysis, as you can change the analysis by right-clicking the analysis and selecting a different type of analysis shown on the panel.
  
 
[[Category: QPR ProcessAnalyzer]]
 

Latest revision as of 22:24, 13 June 2021

Clustering Analysis groups cases in the event log so that the cases within a group are similar to each other, e.g. cases that have the same case attribute values end up in the same group. Clustering is based on advanced machine learning and artificial intelligence algorithms. By default, the clustering analysis uses a built-in, in-memory k-modes algorithm with categorized values for event type occurrences and case attribute values. The algorithm does not guarantee convergence to the global optimum, which means that subsequent clustering analysis runs may produce slightly different results. See this Wikipedia article for more about the idea behind clustering.
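
To illustrate why repeated runs can differ, here is a minimal sketch of a k-modes style loop in Python. It is not the implementation used by QPR ProcessAnalyzer; the function names (hamming, kmodes), the seed parameter, and the tuple-based record format are assumptions made for this example only.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions where two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def kmodes(records, k, seed=0, max_iter=20):
    """Toy k-modes over a list of equal-length tuples.

    Returns (labels, modes). The initial modes are picked at random,
    so different seeds may end in different local optima.
    """
    rng = random.Random(seed)
    modes = rng.sample(records, k)  # random initial modes (hypothetical choice)
    labels = [0] * len(records)
    for _ in range(max_iter):
        # Assignment step: each record goes to the nearest mode.
        new_labels = [
            min(range(k), key=lambda c: hamming(r, modes[c])) for r in records
        ]
        # Update step: each mode becomes the most frequent value per column.
        new_modes = []
        for c in range(k):
            members = [r for r, l in zip(records, new_labels) if l == c]
            if not members:
                new_modes.append(modes[c])  # keep the old mode for an empty cluster
                continue
            new_modes.append(
                tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
            )
        if new_labels == labels and new_modes == modes:
            break  # nothing changed, so the loop has converged
        labels, modes = new_labels, new_modes
    return labels, modes
```

If both initial modes happen to be drawn from the same group of identical cases, this sketch can converge to a single non-empty cluster, which is one way a run can end in a local rather than a global optimum.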

Clustering analysis has many use cases, e.g. you can use it to check data integrity: the analysis might reveal that the model actually contains data from two different processes.

Clustering analysis consists of two phases:

  • Phase 1: Actual clustering (i.e. dividing cases into similar groups)
  • Phase 2: Root cause analysis to explain the clusters

This means that the clustering analysis does not show the individual cases in each cluster, but the features that describe each cluster. Note that the case attribute and event type settings affect the data features that are given to the clustering phase.

Clusteringanalysis.png

How to Use Clustering Analysis

The right panel contains the clustering analysis results. The table shows the clusters, how many cases are in each cluster, and the following details for each cluster:

  • Feature and Value: These two columns list the case attribute and other values that are common to the cases in the cluster.
  • Cluster Density %: Share of cases having this feature value within the cluster (i.e. the number of cases having the value shown on the row in this particular cluster divided by the number of cases in the cluster * 100).
  • Total Density %: Share of cases having this feature value in the whole data set (i.e. the total number of cases having the value shown on the row divided by the total number of cases * 100).
  • Contribution %: Share of cases whose membership in this cluster can be explained by this feature value. The scale is such that 0% means that the feature value isn't specific to this cluster and 100% means that all cases belonging to this cluster can be explained by this feature value.
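
The two density columns follow directly from the definitions above and can be sketched in Python as shown below. The function name density_percentages and the dictionary-based case format are hypothetical. Contribution % is left out because its exact formula is not given here.

```python
def density_percentages(cases, labels, cluster, feature, value):
    """Return (cluster_density_pct, total_density_pct) for one feature value.

    cases  : list of dicts mapping feature name -> value (assumed format)
    labels : cluster id of each case, in the same order as cases
    """
    in_cluster = [c for c, l in zip(cases, labels) if l == cluster]
    if not in_cluster:
        raise ValueError("cluster has no cases")
    cluster_hits = sum(1 for c in in_cluster if c.get(feature) == value)
    total_hits = sum(1 for c in cases if c.get(feature) == value)
    # Cluster Density %: cases with the value in the cluster / cases in the cluster * 100
    cluster_density = 100.0 * cluster_hits / len(in_cluster)
    # Total Density %: cases with the value overall / all cases * 100
    total_density = 100.0 * total_hits / len(cases)
    return cluster_density, total_density


# Eight made-up cases split into two clusters of four.
cases = [{"Region": r} for r in ["EU", "EU", "EU", "US", "EU", "US", "US", "US"]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
result = density_percentages(cases, labels, cluster=0, feature="Region", value="EU")
print(result)  # (75.0, 50.0)
```

In this made-up data, three of the four cases in cluster 0 have Region = EU (75%), while four of all eight cases do (50%), so the value is more common inside the cluster than in the data set as a whole.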

The clustering analysis has the following settings:

  • Clusters: Number of clusters which the cases are divided into.
  • Cluster rows: Number of describing features shown for each cluster. The best describing features are on top.
  • Attributes: Case attributes that are taken into account in the clustering analysis. If none is selected, all case attributes are selected. You can restrict which case attributes are selected, if you want the clustering to be done based on only certain features.
  • Events: Event types whose occurrences are taken into account in the clustering analysis. If none is selected, all event types are selected.
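
As a rough illustration of how the Attributes and Events settings shape the input of the clustering phase, the sketch below builds a categorical feature set for one case. The case structure, the function name build_features, and the yes/no encoding of event type occurrences are all assumptions; the actual categorization used by the product is not described here.

```python
def build_features(case, all_attributes, all_event_types, attributes=(), events=()):
    """Return a dict of categorical features for one case.

    case: {"attributes": {...}, "events": [...]} (hypothetical structure)
    An empty selection means "use everything", mirroring the settings above.
    """
    used_attributes = list(attributes) or list(all_attributes)
    used_events = list(events) or list(all_event_types)

    # Case attribute values are used as categorical features as they are.
    features = {name: case["attributes"].get(name) for name in used_attributes}

    # Assumed encoding: one yes/no feature per selected event type.
    occurred = set(case["events"])
    for event_type in used_events:
        features["Event: " + event_type] = "yes" if event_type in occurred else "no"
    return features


case = {
    "attributes": {"Region": "EU", "Product": "A"},
    "events": ["Order Created", "Invoice Sent"],
}
all_attributes = ["Region", "Product"]
all_event_types = ["Order Created", "Invoice Sent", "Order Cancelled"]

everything = build_features(case, all_attributes, all_event_types)
restricted = build_features(
    case, all_attributes, all_event_types,
    attributes=["Region"], events=["Order Created"],
)
```

With nothing selected, every case attribute and event type becomes a feature. Restricting the selections shrinks the feature set, so the clusters are formed on the chosen features only.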

You can use the left panel to filter cases. Note that you are not bound to using just the Flowchart analysis, as you can change the analysis by right-clicking the analysis and selecting a different type of analysis shown on the panel.