Clustering Analysis: Difference between revisions

From QPR ProcessAnalyzer Wiki

Revision as of 19:09, 28 July 2023

Clustering Analysis divides cases into groups so that the cases within each group are as similar to each other as possible in terms of their case attribute values and the event types that occurred. Clustering is based on a so-called unsupervised machine learning algorithm: the analysis uses the k-modes algorithm with categorized values for event type occurrences and case attribute values. Due to the nature of the algorithm, different clustering runs may produce slightly different results. See this Wikipedia article for more about the idea behind clustering. Clustering analysis is an easy way to understand and explain the eventlog without knowing anything about it beforehand. It can also be used to check data integrity, as the analysis might reveal that the eventlog contains data from two distinct processes that cannot meaningfully be compared to each other.

Clustering Analysis Overview

Clustering analysis is available as a view in the Navigation menu. When creating custom dashboards, the clustering can also be added to a dashboard as a preset. The dashboard remembers the settings made for the clustering.


The clustering analysis is shown in a table where rows are grouped by cluster: each group is a cluster, and each row shows one describing feature of that cluster. The table has the following columns:

  • Feature: The describing feature of the cluster, i.e., the case attribute and its value, or the event type name.
  • Cluster density %: Share of cases having this feature value within the cluster, i.e., the number of cases in this particular cluster having the value shown on the row, divided by the number of cases in the cluster, multiplied by 100.
  • Total density %: Share of cases having this feature value in the entire eventlog, i.e., the total number of cases having the value shown on the row, divided by the total number of cases, multiplied by 100.
  • Contribution %: Shows how much more common this feature is in this cluster compared to the entire eventlog. The higher the value, the more the feature characterizes the cluster. The contribution percentage is calculated as the cluster density percentage minus the total density percentage.
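The three percentages can be illustrated with a small calculation. The following is a minimal sketch, not the product's implementation: the function name `density_table` and the representation of a case as a set of feature strings are assumptions made for this example only.

```python
from collections import Counter

def density_table(case_features, cluster_of_case, cluster_id):
    """Return (feature, cluster density %, total density %, contribution %) rows.

    case_features: one set of feature strings per case (hypothetical format).
    cluster_of_case: the cluster id of each case, in the same order.
    """
    in_cluster = [f for f, c in zip(case_features, cluster_of_case) if c == cluster_id]
    total_counts = Counter(feat for f in case_features for feat in f)
    cluster_counts = Counter(feat for f in in_cluster for feat in f)
    rows = []
    for feat, n in cluster_counts.items():
        cluster_density = 100.0 * n / len(in_cluster)
        total_density = 100.0 * total_counts[feat] / len(case_features)
        # Contribution is the cluster density minus the total density.
        rows.append((feat, cluster_density, total_density, cluster_density - total_density))
    # Strongest contribution first.
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows
```

For example, if a cluster contains two cases that both have `Region=EU`, and two of the four cases in the whole eventlog have that value, the cluster density is 100 %, the total density is 50 % and the contribution is 50 %.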

Clustering Analysis Settings

Clustering analysis has the following settings:

  • Clusters: Number of clusters which the cases are divided into.
  • Cluster rows: Number of describing features shown for each cluster. The features are shown in the order of strongest contribution.
  • Attributes: Case attributes that are included in the clustering analysis. If none is selected, all case attributes are included. You can restrict the selection if you want the clustering to be based on only certain features.
  • Events: Event types whose occurrences are included in the clustering analysis. If none is selected, all event types are included.
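How the Attributes and Events settings shape the data given to the clustering can be sketched as follows. This is an illustration under assumptions: the function `build_feature_rows`, the case dictionary layout and the yes/no categorization of event type occurrences are hypothetical and are not taken from the product.

```python
def build_feature_rows(cases, attributes=None, events=None):
    """Turn cases into rows of categorical features.

    cases: list of dicts like {"attributes": {...}, "events": [...]} (hypothetical format).
    attributes / events: selected names; an empty selection means "use all".
    """
    all_attrs = sorted({a for c in cases for a in c["attributes"]})
    all_events = sorted({e for c in cases for e in c["events"]})
    use_attrs = list(attributes) if attributes else all_attrs
    use_events = list(events) if events else all_events
    rows = []
    for c in cases:
        # Case attribute values are used as categories as they are.
        row = {a: c["attributes"].get(a) for a in use_attrs}
        # Assumption: an event type occurrence is categorized as "yes" or "no".
        for e in use_events:
            row["event:" + e] = "yes" if e in c["events"] else "no"
        rows.append(row)
    return rows
```

Calling `build_feature_rows(cases, attributes=["Region"])` would keep only the `Region` attribute but still include every event type, because the event selection is empty.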

Clustering Analysis Calculation

Calculating the clustering analysis has the following steps:

  1. Taking a random sample of cases from the eventlog.
  2. Performing the clustering on the sampled cases using machine learning. As a result, each case belongs to a certain cluster.
  3. Running the root causes analysis to find the explaining factors for each cluster.

The root causes analysis is used so that the clustering analysis does not show the individual cases in each cluster, but the features that describe each cluster (a long list of individual cases wouldn't be very easy to read). Note that the case attribute and event type settings affect the data features that are given to the clustering phase.
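The sampling and clustering steps can be sketched with a simplified k-modes implementation. This is a minimal illustration under assumptions, not the product's algorithm: the function names, the `sample_size`, `iterations` and `seed` parameters, and the choice of initial modes are all hypothetical, and the root causes step is left out. Each case is a tuple of categorical values, and `k` must not exceed the number of distinct rows in the sample.

```python
import random
from collections import Counter

def hamming(a, b):
    # Simple matching dissimilarity: number of features whose values differ.
    return sum(x != y for x, y in zip(a, b))

def k_modes(rows, k, iterations=10, seed=0):
    """Assign each row (a tuple of categorical values) to one of k clusters."""
    rng = random.Random(seed)
    # Start from k distinct rows so that no two initial modes are identical.
    modes = rng.sample(list(dict.fromkeys(rows)), k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign every row to its nearest mode.
        labels = [min(range(k), key=lambda j: hamming(r, modes[j])) for r in rows]
        # Update each mode to the most frequent value of every feature.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                modes[j] = tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))
    return labels

def cluster_sample(rows, k, sample_size, seed=0):
    """Step 1: take a random sample of cases. Step 2: cluster the sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return sample, k_modes(sample, k, seed=seed)
```

Because both the sample and the initial modes are chosen randomly, changing `seed` can change the result, which matches the remark that different clustering runs may give slightly different results.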