Best Practices for Designing Models

From QPR ProcessAnalyzer Wiki
This page describes common best practices for designing a suitable structure for a process mining model and for configuring the model settings. Best practices for writing the ETL scripts that actually create and update models are described separately.

== Datatable column data types ==
* Always use the most suitable [[Importing Data to Datatable from CSV File#Data types|datatypes]] for datatable columns, as the datatypes have a remarkable performance impact and they also affect how data can be used in the analysis. Datatable column datatypes also become the case and event attribute datatypes in the model. As a general rule, avoid the ''string'' datatype when other datatypes can be used. Here are some guidelines:
** If there are only two possible values, ''boolean'' is the best datatype. The boolean values ''true'' and ''false'' can be mapped to a textual presentation in charts, so the string datatype is not needed to get the desired textual presentation in dashboards.
** If there is numerical data that doesn't contain decimals, or decimal precision is not required, ''integer'' is the best datatype.
** For timestamps, the string datatype will definitely not work, so make sure to use the ''date'' type and check that the conversion from the textual value during the import interprets the data correctly. Even if the value is not a precise timestamp but the precision is, for example, a day, the date datatype is still the best.

== Best practices for designing dashboards ==

=== Visualization and usability best practices ===
* Use conditional formatting to improve KPI visualization.
* Use on-screen settings for settings that users often want to change, as they are easier to use than opening the settings dialog. They also guide users to change parameters that might be relevant from the analysis viewpoint.
* Check that each measure and dimension has a descriptive unit. For example, the generic terms "cases" and "events" might not describe the counts best, as cases might be orders, incidents etc.
* Define a custom label text if it describes the measure or dimension better. Still, for many measures and dimensions, the automatically generated title is suitable.
* Define a custom title for the chart if the automatically generated title is not descriptive.
* Note also how the [[Measure, Dimension and Column Settings#Special values|special values]], such as null and empty strings, are presented, and set a descriptive label name. For example, when showing the duration between events, cases that don't have the events are shown as nulls. Instead of an empty label, the label can be "Events missing".
* If there are no meaningful filters that can be created from the chart, disable creating filters from the chart. When filtering is disabled, the cursor doesn't indicate clickable items, which would otherwise attract users to make clicks leading to a dead end.
* Limit the shown case/event attributes and event types if there are some that are not needed. This makes the dropdown lists shorter, making it easier for users to find the important selections. This doesn't have a performance impact, though.
 
=== Performance optimization best practices ===
* When choosing the Analyzed objects setting, prefer ''Cases'' over ''Events'', as calculating from events is notably slower than from cases, because in usual models there are a lot more events than cases. Some KPIs, even though using the event-level data, can be calculated from the cases. Also ''Variations'', ''Event types'' and ''Flows'' are generally fast. On the other hand, ''Flow Occurrences'' is slow, as the number of flow occurrence objects is even greater than the event count.
* Prefer ready-made measures and dimensions, and create a custom one only when no ready-made one is available. This is because the ready-made measures and dimensions have been optimized for performance. When creating a custom measure or dimension, you need to be careful not to write the calculation logic in a non-optimal way.
* For some simple calculations, the [[Measure, Dimension and Column Settings#Statistical calculations|Statistical calculations]] may be used instead of writing a custom expression. Also the [[Measure, Dimension and Column Settings#Advanced settings|Adjustment expression]] is useful in avoiding custom expressions in certain cases.
* Limit the number of returned rows: The ''Max rows'' setting determines the number of returned rows, and the fewer rows there are, the better the performance. Usually in dashboards, limiting the amount of shown data is desired also for usability, e.g., showing only the top-20 items.
* The more charts there are in a dashboard, the longer it takes to open the dashboard and show all charts. This is because each chart requests a calculation, and all the calculations need to be done at the same time in the server. If the dashboard opens slowly, the charts can be divided into multiple dashboards and [[Label and Link|links]] created to navigate between them.
* ''Sorting'' the data affects performance, so use sorting only when it's relevant for the analysis.
* ''Group Rows Exceeding Maximum'' goes through all rows that would otherwise be left out of the calculation, which has an impact on performance, so use it only when the information is useful for the analysis.
* An alternative to a chart filter is to include the filtering in the ''Analyzed objects'' setting, e.g. ..., which might improve performance.
* Don't use dimensioning when it's not needed. When the goal is to have a row for each root object, dimensioning is unnecessary. For example, ''Cases'' as Analyzed objects and dimensioning by case id will lead to a row for each case, but the same result can be achieved by disabling dimensioning.
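The row-limiting ideas above can be sketched outside the product with plain Python (this is not QPR ProcessAnalyzer's own expression language, and the data is made up): sort the grouped values and keep only the top rows, optionally summing the rest into one group.

```python
from collections import Counter

# Hypothetical data: the region attribute of each case.
case_regions = ["North", "South", "North", "East", "North", "South", "West"]

counts = Counter(case_regions)

# Keep only the top-2 rows, like a Max rows setting of 2 in a chart.
top2 = counts.most_common(2)

# "Group Rows Exceeding Maximum" would sum the remaining rows into one:
others = sum(counts.values()) - sum(c for _, c in top2)

print(top2)    # [('North', 3), ('South', 2)]
print(others)  # 2
```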
 
=== Advanced performance optimization best practices ===
* For slow charts, use the [[QPR ProcessAnalyzer Chart#Benchmark performance|Benchmark Performance]] to find the fastest settings. Usually setting up a working chart is the first thing to do, and if the chart appears too slow, you can try to find another, faster way to calculate the same chart.
* Avoid making the same calculations multiple times in different measures. If there are repeating expressions, create a separate measure for the repeating part, and define that measure as a variable, which can then be referenced from the other measures.
* Different models can be used in the same dashboard, and for example filtering still works if the models use the same case/event attribute and event type names. This allows creating models optimized for specific charts, which might improve performance.
* Sampling improves performance in cases where it can be used. The idea of sampling is to pick only a portion of the root objects for the dimension and measure calculation. Unfortunately, in most cases it cannot be used, as it affects the analysis results, for example in object counts.
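The trade-off of sampling can be illustrated with a plain-Python sketch (the durations are synthetic, and this shows only the general idea, not product functionality): an average-type KPI survives sampling reasonably well, while any count-type KPI is off by the sampling factor.

```python
import random

random.seed(0)  # deterministic, so the example is repeatable

# Hypothetical case durations in days for 10,000 cases.
durations = [random.uniform(1, 30) for _ in range(10_000)]

# Sample 10% of the cases for the measure calculation.
sample = random.sample(durations, 1_000)

full_avg = sum(durations) / len(durations)
sample_avg = sum(sample) / len(sample)

# The averages land close to each other, but the counts differ tenfold.
print(f"full avg: {full_avg:.2f}, sample avg: {sample_avg:.2f}")
print(f"full count: {len(durations)}, sample count: {len(sample)}")
```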
 
=== Other best practices ===
* '''Use presets as a basis''': The presets contain commonly needed analyses, so in many cases you'll find what you are looking for from the presets. It's easy to take a preset as a basis and continue modifying the chart settings for your customized needs.
* '''Map freely''': Mappings can be done freely, so dimensions don't always need to go to the X-axis and measures to the Y-axis.
* '''Avoid Custom layout settings''': Avoid Custom layout settings, as their compatibility with future QPR ProcessAnalyzer versions might not be maintained. Use Custom layout only when it's absolutely necessary for the visualization.
* '''For large data exports, use CSV''': When there is a need to export a large amount of data, prefer the CSV export over the Excel export, because CSV performs better for large data.
 
== Best practices for creating models ==
* If the data contains a numerical score (such as a number between 1 and 5), ''integer'' is a better datatype than ''string''.
* All datatypes support ''null'' values to mark a missing or non-existing value. The null value can be used to mark anything – its meaning is just a matter of decision. For non-existing numerical values, using null is better than zero, as nulls are ignored in calculations (such as in averages). Note that strings can also contain the empty string value, which is different from the null value. In addition, booleans can actually contain three values: true, false and null.
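As an illustration of why the datatype choices above matter, here is a plain-Python sketch (column names and formats are made up; real imports are configured in the datatable, not written by hand) of casting textual CSV values to proper types, and of null versus zero in an average:

```python
from datetime import datetime

# Raw CSV values arrive as strings; cast them to proper datatypes.
row = {"Approved": "true", "Priority": "3", "Started": "2022-03-31 09:35"}

approved = row["Approved"] == "true"                           # boolean
priority = int(row["Priority"])                                # integer score 1..5
started = datetime.strptime(row["Started"], "%Y-%m-%d %H:%M")  # date

# Null (None) vs. zero: nulls can be skipped in an average, zeros cannot.
durations = [5.0, None, 3.0, None, 4.0]
known = [d for d in durations if d is not None]
avg_ignoring_nulls = sum(known) / len(known)                    # 4.0
avg_if_zeros = sum(d or 0 for d in durations) / len(durations)  # 2.4
print(avg_ignoring_nulls, avg_if_zeros)
```

Storing missing durations as zeros would silently drag the average down, which is why null is the better marker for a non-existing value.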


== Performance optimization ==
* Include in the model only those case and event attributes that are used in the dashboards, because loading the model is slower and it uses more memory when there are more attributes. For advanced analysis, such as finding root causes and clustering, more attributes may be useful, but not for dashboards using only specific attributes. The calculation performance itself doesn't deteriorate when the number of attributes increases.
* Include in models only those event types that are needed by the dashboards and analyses. The more events there are, the longer model loading takes, the more memory the model uses, and the longer calculations take. Event types can also be excluded using filters, improving the measure calculation performance similarly to a corresponding smaller model, but on the other hand calculating the filter takes time.
* For large models, the [[Automatic_Model_Loading_on_Server_Startup|Load Model on Startup]] setting may be useful, so that the initial opening of a dashboard doesn't take too long, as with the setting the model is already available in memory. On the other hand, pre-loading many models automatically consumes more memory, so models that are not used regularly should not be loaded automatically into memory.
* As new data is loaded into the model, the model becomes progressively slower to use. If the oldest data in the model is not shown by the dashboards, it's recommended to remove it from the model at the same time as new data is loaded. The old data may be stored in an archive model if it's still needed in some situations.
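The rolling removal of old data can be sketched as follows (plain Python; the cutoff, tuple layout and event types are illustrative, not the product's ETL API):

```python
from datetime import datetime, timedelta

# Hypothetical event rows: (case id, event type, timestamp).
events = [
    ("c1", "Order Created", datetime(2021, 1, 5)),
    ("c2", "Order Created", datetime(2022, 2, 1)),
    ("c2", "Order Shipped", datetime(2022, 2, 3)),
]

# Keep only events from the last 365 days; move the rest to an archive.
cutoff = datetime(2022, 3, 31) - timedelta(days=365)
kept = [e for e in events if e[2] >= cutoff]
archived = [e for e in events if e[2] < cutoff]

print(len(kept), len(archived))  # 2 1
```

Note that in a real model you would typically remove whole cases rather than individual events, so that cases are not split in half by the cutoff.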


== Usability for dashboarding ==
* Use concise names for event types, as shorter names are easier for users to read and they also provide slightly better performance. This also applies to case and event attribute values. For readability, it's also better that, if there are names that are close to each other, the differences are in the beginning of the name rather than at the end, as the end may be cropped out if there is a lack of screen space.
* Use the model description to document any relevant details regarding the model for other users, for example the meaning of the event types and case/event attributes, which is definitely not obvious to other users and perhaps not even to the author after some time. The model description field can be found in the [[QPR_ProcessAnalyzer_Project_Workspace#Editing_Model_Settings|Model Properties dialog]].
* When data is sorted, the datatypes matter: numerical data is sorted by the numeric values, whereas strings are sorted alphabetically. For example, when sorting the numbers 9 and 10 in ascending order, 9 comes before 10, but if the data is stored as strings, "10" comes before "9". If strings need to be in a specific order, this needs to be taken into account when formatting the data; for example, the string values can be prefixed with an order number. The previous example with 9 and 10 can be worked around by storing the string values "09" and "10".
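The 9-versus-10 sorting example can be verified directly in plain Python:

```python
nums = [9, 10, 2]
strs = ["9", "10", "2"]

print(sorted(nums))  # [2, 9, 10] – numeric order
print(sorted(strs))  # ['10', '2', '9'] – alphabetical order

# Workaround when values must stay strings: zero-pad to a fixed width.
padded = [s.zfill(2) for s in strs]
print(sorted(padded))  # ['02', '09', '10']
```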

Latest revision as of 09:35, 31 March 2022
