Best Practices for Designing Models

From QPR ProcessAnalyzer Wiki

Revision as of 23:25, 28 March 2022

This page describes common best practices for designing a suitable structure for a process mining model and for configuring the model settings. Best practices for writing the ETL scripts that create and update models are described separately.

Performance

  • Always use the most suitable datatypes for datatable columns, as datatypes have a significant performance impact and also affect how the data can be used in the analysis. Datatable column datatypes also become the case and event attribute datatypes in the model. As a general rule, avoid the string datatype when another datatype can be used. Here are some guidelines:
    • If there are only two possible values, boolean is the best datatype. The boolean values are called true and false, and they can be mapped to a textual presentation in charts, so the string datatype is not needed to get the desired textual presentation in dashboards.
    • If the data is numerical and either contains no decimals or doesn't require decimal precision, integer is the best datatype.
    • For timestamps, the string datatype will definitely not work, so make sure to use the date datatype and check that the conversion from a textual value during the import interprets the data correctly. Even when the value is not a precise timestamp but has a precision of, for example, a day, the date datatype is still the best.
    • If the data contains a numerical score (such as a number between 1 and 5), integer is better than string.
  • All datatypes support null values to mark a missing or non-existent value. The null value can be used to mark anything; its meaning is a matter of decision. For non-existent numerical values, null is better than zero, as nulls are ignored in calculations (such as averages). Note that strings can also contain the empty string value, which is different from the null value. In addition, booleans can actually contain three values: true, false and null.
  • Include in the model only those case and event attributes that are used in the dashboards, because the more attributes there are, the slower the model loads and the more memory it uses. For advanced analysis, such as finding root causes and clustering, more attributes may be useful, but not for dashboards that use only specific attributes. The calculation performance itself doesn't deteriorate as the number of attributes increases.
  • Include in the model only those event types that are needed by the dashboards and analyses. The more events there are, the longer the model takes to load, the more memory it uses, and the longer calculations take. Event types can also be excluded using filters, which brings the measure calculation performance close to that of a correspondingly smaller model, but calculating the filter itself takes time.
  • For large models, the Load Model on Startup setting may be useful: with the setting enabled, the model is already available in memory, so the initial opening of a dashboard doesn't take too long. On the other hand, pre-loading many models automatically consumes more memory, so models that are not used regularly should not be loaded into memory automatically.
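
The datatype and null guidelines above can be illustrated with a small, product-independent sketch in plain Python. It is not QPR ProcessAnalyzer ETL code: the column names (approved, score, created), the "yes"/"no" mapping and the date format are assumptions made up for the example, and the helper functions are hypothetical. The sketch converts textual source values to boolean, integer and date values, keeps missing values as null (None), and shows how an average differs when missing scores are stored as null rather than zero.

```python
from datetime import datetime
from statistics import mean

# Hypothetical mapping for a two-valued textual column.
BOOL_MAP = {"yes": True, "no": False}


def is_missing(text):
    """Treat None and empty/whitespace-only strings as missing."""
    return text is None or text.strip() == ""


def to_bool(text):
    """Convert a two-valued text to a boolean, or None when missing."""
    if is_missing(text):
        return None
    return BOOL_MAP[text.strip().lower()]  # KeyError on unexpected values


def to_int(text):
    """Convert a whole-number text to an integer, or None when missing."""
    return None if is_missing(text) else int(text)


def to_date(text, fmt="%Y-%m-%d"):
    """Convert a textual date to a datetime, or None when missing.

    The format string is an assumption; it must match the source data.
    """
    return None if is_missing(text) else datetime.strptime(text.strip(), fmt)


def average_ignoring_nulls(values):
    """Average of the non-null values, mimicking null-ignoring calculations."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Made-up source rows where every value arrives as text.
raw_rows = [
    {"approved": "yes", "score": "4", "created": "2022-03-28"},
    {"approved": "no", "score": "", "created": "2022-03-01"},
    {"approved": "", "score": "2", "created": ""},
]

typed_rows = [
    {
        "approved": to_bool(row["approved"]),
        "score": to_int(row["score"]),
        "created": to_date(row["created"]),
    }
    for row in raw_rows
]

scores = [row["score"] for row in typed_rows]

# Missing score kept as null: only 4 and 2 are averaged.
avg_with_nulls = average_ignoring_nulls(scores)

# Missing score stored as zero: the zero pulls the average down.
avg_with_zeros = mean(0 if s is None else s for s in scores)

print(avg_with_nulls, avg_with_zeros)
```

With these rows the null-based average is 3, while the zero-based average is 2, which is why null is the better marker for a score that doesn't exist. The third row also shows the three-valued boolean: its approved value is neither true nor false but null.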

Usability

  • Use concise names for event types, as shorter names are easier to read in the UI and also provide slightly better performance. The same applies to case and event attribute values. For readability, when names are similar to each other, the differences should be at the beginning of the name rather than at the end, as the end may be cropped when space is limited.
  • Use the model description to document any relevant details about the model for other users, for example the meaning of the event types and the case/event attributes. The model description field can be found in the Model Properties dialog.
  • When data is sorted, its datatype matters: numerical values are sorted by value, whereas strings are sorted alphabetically. For example, when the numbers 9 and 10 are sorted in ascending order, 9 comes first, but if they are stored as strings, "10" comes first. So if string values need to be in a specific order other than alphabetical, this needs to be taken into account in the naming. Textual values can be prefixed with an order number. The previous example with 9 and 10 can be worked around by storing the string values "09" and "10".
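
The sorting behavior described above can be reproduced with a short sketch in plain Python, whose default string ordering is character by character, as in alphabetical sorting. The helper names zero_pad and prefix_with_order, as well as the Low/Medium/High labels, are hypothetical and only illustrate the two workarounds: zero-padding numeric strings and prefixing textual values with an order number.

```python
numbers = [10, 9]
strings = ["10", "9"]

sorted_numbers = sorted(numbers)  # numeric order: 9 before 10
sorted_strings = sorted(strings)  # character order: "10" before "9"


def zero_pad(values):
    """Left-pad numeric strings with zeros to a common width."""
    width = max(len(v) for v in values)
    return [v.zfill(width) for v in values]


# "9" becomes "09", so character order now matches numeric order.
sorted_padded = sorted(zero_pad(strings))


def prefix_with_order(names_in_order):
    """Prefix each name with its position so that sorting keeps the order."""
    width = len(str(len(names_in_order)))
    return [
        f"{position:0{width}d} {name}"
        for position, name in enumerate(names_in_order, start=1)
    ]


# Alphabetically these would sort as High, Low, Medium.
labels = prefix_with_order(["Low", "Medium", "High"])
sorted_labels = sorted(labels)

print(sorted_numbers, sorted_strings, sorted_padded, sorted_labels)
```

The order prefix is zero-padded to the width of the largest position, so the same approach keeps working when there are ten or more values.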