Importing Data to Datatable from CSV File: Difference between revisions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search
No edit summary
 
(17 intermediate revisions by the same user not shown)
Line 1: Line 1:
<div style="border:1px solid #dfdfdf;padding:0.5em 1em 0.5em 1em;background-color:#E7EAEC;margin:10px 0px 0px 10px;">
Data from CSV files can be imported to datatables in the [[QPR_ProcessAnalyzer_Project_Workspace#Importing_Data_to_Datatable_from_CSV_File_.28available_in_2021.6.29|Workspace screen]]. The data is added to the existing datatable and no existing data is removed during the import. There is also a matching mode available, which is able to update existing rows in the datatable based on the CSV file data rather than adding new rows (see the ''Match by columns'' setting for more information).
This functionality is available from 2021.6.
</div>


Data can be imported to datatables from CSV files in the [[QPR_ProcessAnalyzer_Project_Workspace|Workspace]] screen by following [[QPR_ProcessAnalyzer_Project_Workspace#Importing_Data_to_Datatable_from_CSV_File_.28available_in_2021.6.29|these instructions]].
When the CSV file is selected, the import analyzes the contents and finds suitable data types and other conversion settings for each column. In some special formats, the automatic data type detection does not work, requiring to adjust the settings manually. When importing timestamps from different time zones, note how to [[Managing Time Zones and Local Time|manage the time zones and local time]].
 
When a CSV file is selected, the import tries to interpret the contents of the file and find the suitable data types and other conversion parameters for each column. In some special cases, the automatic data type detection does not work, and then the settings need to be set manually.
 
The import process can be cancelled. In that case data imported thus far is stored to the datatable.


== Import Settings ==
== Import Settings ==
The data import dialog has the following settings:
The data import dialog has the following settings:
* '''Select CSV file''': This button opens the file selector to select the CSV file if you want to change the file.
* '''Select CSV file''': This button opens the file selector to select the CSV file if you want to change the file.
* '''Match by columns''': Choose datatable columns (one or several) to match with the CSV file data as follows: If a matching row is found in the datatable, that row is updated based on the CSV file data. If no matching row is found, a new row is created to the datatable (this is the behavior when no matching is used). Only text, integer and date type of columns can be selected as matching columns, as equality comparison is possible with those types. If the datatable doesn't yet have any data, matching cannot be used. New columns can also be created to the datatable when the matching is used.
* '''Column separator''': Column separator character used by the CSV file. The column separator is automatically detected when the file is selected. In some cases, the automatic detection does not find the correct column separator, and this selection allows to change the column separator. If you want to use space as a column separator, select ''<space''>, or if you want to use tabulator, select ''<tab''>. Note that when changing the column separator, all column specific settings in the table are reset. It's also possible to define several characters for a column separator if that is the case in the CSV file.
* '''Column separator''': Column separator character used by the CSV file. The column separator is automatically detected when the file is selected. In some cases, the automatic detection does not find the correct column separator, and this selection allows to change the column separator. If you want to use space as a column separator, select ''<space''>, or if you want to use tabulator, select ''<tab''>. Note that when changing the column separator, all column specific settings in the table are reset. It's also possible to define several characters for a column separator if that is the case in the CSV file.
* '''Skip errors''': Allows to continue the import when encountering errors, and import ''null'' values in place of values where the conversion could not be made. If error skipping is not in use, the import stops to the first encountered error. The data conversions are checked entirely before storing data to the datatable, so conversion errors are observed early and they will not lead to partially imported data. When skipping errors, there is a summary of all the encountered errors when the import is completed.
* '''Skip errors''': Allows to continue the import when encountering errors, and import ''null'' values in place of values where the conversion could not be made. If error skipping is not in use, the import stops to the first encountered error. The data conversions are checked entirely before storing data to the datatable, so conversion errors are observed early and they will not lead to partially imported data. When skipping errors, there is a summary of all the encountered errors when the import is completed.
Line 17: Line 12:


== Column Settings ==
== Column Settings ==
There is a settings section for each column in the CSV file, defining how the data is treated in the import. The title of the column shows the column name in the CSV file.
There are settings for each CSV file column, defining how the data is converted in the import. The title of the column shows the column name in the CSV file.


The following options are possible:
The following options are possible:
* '''Import data to a new column in the datatable''': Write a column name that doesn't exist in the datatable. When the import is started, the needed new columns are automatically created to the datatable. When importing to a new column, the data type for the new column can be chosen freely.
* '''Import data to a new column in the datatable''': Write a column name that doesn't exist in the datatable. When the import is started, the needed new columns are automatically created to the datatable. When importing to a new column, the data type for the new column can be chosen freely. Note that the maximum allowed column length is 128 characters.
* '''Import data to an existing column in the datatable''': Choose an existing column in the datatable from the dropdown list. For existing columns, the data type of the column is visible and it cannot be changed, so only that type of data can be imported to the column.
* '''Import data to an existing column in the datatable''': Choose an existing column in the datatable from the dropdown list. For existing columns, the data type of the column is visible and it cannot be changed, so only that type of data can be imported to the column.
* '''Not to import the data in the column''': When choosing ''<do not import>'', the column is ignored in the import.
* '''Not to import the column''': Choose the ''<do not import>'' option to ignore this column in the import.


== Data types ==
== Data types ==
Line 28: Line 23:
{| class="wikitable"
{| class="wikitable"
! '''Data type'''
! '''Data type'''
! '''Use'''
! '''Use cases'''
! '''Settings'''
! '''Conversion settings'''
|-
|-
||String
||String
Line 37: Line 32:
|-
|-
||Integer
||Integer
||Data type for whole numbers, i.e. numbers that don't need decimals. Integers are also suitable for storing distinct values that can be ordered (e.g. 1=low, 2=medium, 3=high), which makes it easier to visualize data in a logical way. Integers have a advantage over decimal numbers: it's possible to make equality comparisons for integers. Converted text can contain thousand separators, which are automatically detected.
||Data type for whole numbers (i.e. numbers that don't need decimals). Integers are also suitable for storing any distinct values that can be ordered (e.g. 1=low, 2=medium, 3=high), which makes it easier to show data in a logical order in visualizations. Integers have an advantage over decimal numbers, as it's possible to make equality comparisons for integers, for example when creating filters.
 
Text converted into integer, can use thousand separators, which are automatically detected.
||
||
* Unit: If there is a (textual) unit in the CSV file, it needs to be written here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Null value: Text that is imported as null.
* Null value: Text that is imported as null.
|-
|-
||Decimal number
||Decimal&nbsp;number
||Data type that stores numerical data with decimals. It's not possible to make equality comparisons for decimal numbers (i.e. equals than, not equals than, less or equal than, greater or equal than). Only more than and less than comparison can be made for example when filtering data. Converted text can contain thousand separators, which are automatically detected.
||Data type that stores numbers with decimal precision. Note that it's not possible to make equality comparisons for decimal numbers (i.e. ''equals than'', ''not equals than'', ''less or equal than'', ''greater or equal than''). Only ''more than'' and ''less than'' comparison can be made, for example when filtering data.
 
Text converted into decimal numbers can contain thousand separators, which are automatically detected.
||
||
*
* Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Unit: If there is a (textual) unit in the CSV file, it needs to be written here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Null value: Text that is imported as null.
* Null value: Text that is imported as null.
|-
|-
||Date
||Date
||Data type stores a precise timestamp, which is an essential data type for process mining. This data type can also be used to store less fine graining time units, such as days, weeks, months or years. In that case the stored timestamp is the point of time in the beginning of the day, week etc.
||Date is an essential data type for process mining, as it stores a precise timestamp. Dates can also be used to store less fine-grained time units, such as days, weeks, months or years. In that case, the timestamp points to the beginning of the stored interval (day, week, etc.).
||
||
* Date format: Describes how the data is interpreted as date.
* Date format: Describes how the data is interpreted as date (see more in [[#Date Formats|date formats]]).
|-
|-
||Boolean
||Boolean
||Data type that contains only two values, true and false. The ''null' value is also available (like in other data types) which in practise is the third possible value for boolean type of data. If no more values are needed, boolean is the most efficient data type to use. Note that in dashboards, the true and false values can be translated into user understandable texts, such as "yes" and "no".  
||Data type that can contain two values, ''true'' and ''false''. The ''null'' value is also available (like in other data types) which in practice is the third possible value for boolean data. If only two (or three) distinct values are needed, boolean is the most efficient data type to use. Note that in dashboard visualization, the true and false values can be translated into user understandable texts, such as "yes"/"no" or "is compliant"/"is not compliant" depending on the use case.  
||
||
* True value: Text that is imported as true value.
* True value: Text that is imported as true value.
Line 62: Line 60:
|-
|-
||Duration
||Duration
||Data type that contains a time duration (e.g. 5 hours, 14 days, 3 seconds, etc.). Corresponding expression language data type is Timespan. Duration is quite close to decimal number, except its benefit is that it also contains the duration unit (thus it doesn't need to be agreed separately). Duration type of data is also very easy to show in any other duration unit with a simple selection in the dashboard. Note that the duration data type does not store information when the duration started or ended, but only the duration itself.
||Data type that contains a time duration (e.g. 5 hours, 14 days, 3 seconds, etc.). Duration is close to decimal number, except durations also have the unit natively embedded (thus it doesn't need to be remembered or recorded separately). Duration type of data is also easy to show in any duration units with a simple selection in the dashboard. Note that the duration data type does not store information when the duration started or ended, but only the duration itself. Corresponding expression language data type for duration is ''Timespan''.
||
||
* Time unit: Time unit that the number represents.
* Time unit: Time unit that the imported number represents.
* Unit: If there is a (textual) unit in the CSV file, it needs to be written here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
* Null value: Text that is imported as null.
* Null value: Text that is imported as null.
|}
|}
Line 81: Line 79:
* '''HH''': hours as two digits based on 24-hour clock (e.g. 00, 01, 02, ... 23)
* '''HH''': hours as two digits based on 24-hour clock (e.g. 00, 01, 02, ... 23)
* '''H''': hours as one or two digits based on 24-hour clock (e.g. 1, 2, ... 23)
* '''H''': hours as one or two digits based on 24-hour clock (e.g. 1, 2, ... 23)
* '''hh''': same as HH, except for 12-hour clock
* '''hh''': same as HH, except for 12-hour clock (e.g. 01, 02, ... 12)
* '''h''': same as H, except for 12-hour clock
* '''h''': same as H, except for 12-hour clock (e.g. 1, 2, ... 12)
* '''mm''': similar as HH except for minutes
* '''mm''': minutes as two digits (e.g. 00, 01, 02, ... 59)
* '''m''': similar as H except for minutes
* '''m''': minutes as one or two digits based on 24-hour clock (e.g. 1, 2, ... 59)
* '''ss''': similar as HH except for seconds
* '''ss''': seconds as two digits (e.g. 00, 01, 02, ... 59)
* '''s''': similar as H except for seconds
* '''s''': seconds as one or two digits based on 24-hour clock (e.g. 1, 2, ... 59)
* '''fff''': 1/1000th of second as three digits (e.g. 000, 001, 002, ... 999)
* '''fff''': milliseconds as three digits (e.g. 000, 001, 002, ... 999)
* '''ff''': 1/100th of second as two digits (e.g. 00, 01, 02, ... 99)
* '''ff''': 1/100th of second as two digits (e.g. 00, 01, 02, ... 99)
* '''f''': 1/10th of second as one digit (e.g. 0, 1, 2, ... 9)
* '''f''': 1/10th of second as one digit (e.g. 0, 1, 2, ... 9)
* '''aa''': either "AM" or "PM" indicating before or after midday (needed when h/hh is used)
* '''tt''': either "AM" or "PM" indicating before or after midday (needed when h or hh is used)
* '''a''': either "A" or "P" indicating before or after midday (needed when h/hh is used)
* '''t''': either "A" or "P" indicating before or after midday (needed when h or hh is used)
* '''tt''': same as "aa"
* '''t''': same as "a"
* '''zzz''': Timezone offset as hours and minutes in format (+/-)<hours>:<minutes> (e.g. -23:00, -07:00, +00:00, +00:30, +02:00, +23:00)
* '''zzz''': Timezone offset as hours and minutes in format (+/-)<hours>:<minutes> (e.g. -23:00, -07:00, +00:00, +00:30, +02:00, +23:00)
* '''zz''': Timezone offset hours as two digits preceded by plus/minus (e.g. -23, ..., -01, +00, +01, ... +23)
* '''zz''': Timezone offset hours as two digits preceded by plus/minus (e.g. -23, ..., -01, +00, +01, ... +23)

Latest revision as of 20:23, 21 August 2022

Data from CSV files can be imported to datatables in the Workspace screen. The data is added to the existing datatable and no existing data is removed during the import. There is also a matching mode available, which is able to update existing rows in the datatable based on the CSV file data rather than adding new rows (see the Match by columns setting for more information).

When the CSV file is selected, the import analyzes the contents and finds suitable data types and other conversion settings for each column. In some special formats, the automatic data type detection does not work, requiring to adjust the settings manually. When importing timestamps from different time zones, note how to manage the time zones and local time.

Import Settings

The data import dialog has the following settings:

  • Select CSV file: This button opens the file selector to select the CSV file if you want to change the file.
  • Match by columns: Choose datatable columns (one or several) to match with the CSV file data as follows: If a matching row is found in the datatable, that row is updated based on the CSV file data. If no matching row is found, a new row is created to the datatable (this is the behavior when no matching is used). Only text, integer and date type of columns can be selected as matching columns, as equality comparison is possible with those types. If the datatable doesn't yet have any data, matching cannot be used. New columns can also be created to the datatable when the matching is used.
  • Column separator: Column separator character used by the CSV file. The column separator is automatically detected when the file is selected. In some cases, the automatic detection does not find the correct column separator, and this selection allows to change the column separator. If you want to use space as a column separator, select <space>, or if you want to use tabulator, select <tab>. Note that when changing the column separator, all column specific settings in the table are reset. It's also possible to define several characters for a column separator if that is the case in the CSV file.
  • Skip errors: Allows to continue the import when encountering errors, and import null values in place of values where the conversion could not be made. If error skipping is not in use, the import stops to the first encountered error. The data conversions are checked entirely before storing data to the datatable, so conversion errors are observed early and they will not lead to partially imported data. When skipping errors, there is a summary of all the encountered errors when the import is completed.
  • Character encoding: Character encoding used by the CSV file. Most text files use the UTF-8 encoding which is the default. If some of the characters in the preview table look incorrect, try to change other character encoding to find the correct encoding. Note that there is no automatic detection for the character encoding. Note also that changing the character encoding will reset all column specific settings.

Column Settings

There are settings for each CSV file column, defining how the data is converted in the import. The title of the column shows the column name in the CSV file.

The following options are possible:

  • Import data to a new column in the datatable: Write a column name that doesn't exist in the datatable. When the import is started, the needed new columns are automatically created to the datatable. When importing to a new column, the data type for the new column can be chosen freely. Note that the maximum allowed column length is 128 characters.
  • Import data to an existing column in the datatable: Choose an existing column in the datatable from the dropdown list. For existing columns, the data type of the column is visible and it cannot be changed, so only that type of data can be imported to the column.
  • Not to import the column: Choose the <do not import> option to ignore this column in the import.

Data types

It's very important to set the most suitable data types for each column, as it greatly affects the usage of the data in the analyses. Incorrect data type usually make the analysis impossible. The following data types are available:

Data type Use cases Conversion settings
String Data type for textual data. Also suitable for distinct classification values that are not numerical, e.g. "low", "medium", "high". The length of the string is unlimited, but in practice it's more efficient to use shorter strings (e.g. maximum of 100 characters) for process mining purposes. The string data type make a distinction between empty string and null value. The CSV file import does not import any empty strings, as empty data is imported as nulls.
  • Null value: Text that is imported as null.
Integer Data type for whole numbers (i.e. numbers that don't need decimals). Integers are also suitable for storing any distinct values that can be ordered (e.g. 1=low, 2=medium, 3=high), which makes it easier to show data in a logical order in visualizations. Integers have an advantage over decimal numbers, as it's possible to make equality comparisons for integers, for example when creating filters.

Text converted into integer, can use thousand separators, which are automatically detected.

  • Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
  • Null value: Text that is imported as null.
Decimal number Data type that stores numbers with decimal precision. Note that it's not possible to make equality comparisons for decimal numbers (i.e. equals than, not equals than, less or equal than, greater or equal than). Only more than and less than comparison can be made, for example when filtering data.

Text converted into decimal numbers can contain thousand separators, which are automatically detected.

  • Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
  • Null value: Text that is imported as null.
Date Date is an essential data type for process mining, as it stores a precise timestamp. Dates can also be used to store less fine-grained time units, such as days, weeks, months or years. In that case, the timestamp points to the beginning of the stored interval (day, week, etc.).
  • Date format: Describes how the data is interpreted as date (see more in date formats).
Boolean Data type that can contain two values, true and false. The null value is also available (like in other data types) which in practice is the third possible value for boolean data. If only two (or three) distinct values are needed, boolean is the most efficient data type to use. Note that in dashboard visualization, the true and false values can be translated into user understandable texts, such as "yes"/"no" or "is compliant"/"is not compliant" depending on the use case.
  • True value: Text that is imported as true value.
  • False value: Text that is imported as false value.
  • Null value: Text that is imported as null.
Duration Data type that contains a time duration (e.g. 5 hours, 14 days, 3 seconds, etc.). Duration is close to decimal number, except durations also have the unit natively embedded (thus it doesn't need to be remembered or recorded separately). Duration type of data is also easy to show in any duration units with a simple selection in the dashboard. Note that the duration data type does not store information when the duration started or ended, but only the duration itself. Corresponding expression language data type for duration is Timespan.
  • Time unit: Time unit that the imported number represents.
  • Unit: If there is a (textual) unit in the CSV file, it needs to be defined here, so that the textual part can be omitted and the data can be interpreted as numerical.
  • Null value: Text that is imported as null.

Date Formats

The date format defines how a text is interpreted as a date. The below listed standard parts identifying e.g. day, month and year part, can be used in the date format. In addition, the date format can contain any other characters which need to exists as such in the text to be converted into date. If the time zone offset is provided, dates are imported as UTC.

Supported date parts:

  • yyyy: year as four digits
  • yy: year as two digits (the last digits of the year)
  • MM: month as two digits (e.g. 01, 02, ... 10, 11, 12)
  • M: month as one or two digits (e.g. 1, 2, ... 10, 11, 12)
  • dd: day as two digits (e.g. 01, 02, ...)
  • d: day as one or two digits (e.g. 1, 2, ... 10, 11, 12, ...)
  • HH: hours as two digits based on 24-hour clock (e.g. 00, 01, 02, ... 23)
  • H: hours as one or two digits based on 24-hour clock (e.g. 1, 2, ... 23)
  • hh: same as HH, except for 12-hour clock (e.g. 01, 02, ... 12)
  • h: same as H, except for 12-hour clock (e.g. 1, 2, ... 12)
  • mm: minutes as two digits (e.g. 00, 01, 02, ... 59)
  • m: minutes as one or two digits based on 24-hour clock (e.g. 1, 2, ... 59)
  • ss: seconds as two digits (e.g. 00, 01, 02, ... 59)
  • s: seconds as one or two digits based on 24-hour clock (e.g. 1, 2, ... 59)
  • fff: milliseconds as three digits (e.g. 000, 001, 002, ... 999)
  • ff: 1/100th of second as two digits (e.g. 00, 01, 02, ... 99)
  • f: 1/10th of second as one digit (e.g. 0, 1, 2, ... 9)
  • tt: either "AM" or "PM" indicating before or after midday (needed when h or hh is used)
  • t: either "A" or "P" indicating before or after midday (needed when h or hh is used)
  • zzz: Timezone offset as hours and minutes in format (+/-)<hours>:<minutes> (e.g. -23:00, -07:00, +00:00, +00:30, +02:00, +23:00)
  • zz: Timezone offset hours as two digits preceded by plus/minus (e.g. -23, ..., -01, +00, +01, ... +23)
  • z: Timezone offset hours as one or two digits preceded by plus/minus (e.g. -23, ..., -1, +0, +1, ... +23)

Supported CSV Formats

CSV files complying with the following rules can be imported to QPR ProcessAnalyzer:

  • Row separator can be: (1) carriage return + line feed, (2) line feed, or (3) carriage return. The entire file needs to use the same row separator consistently. The row separator is automatically detected, and user cannot change the assumed row separator.
  • Column separator can be any character or several characters (but it cannot be same as the row separator or the text qualifier character).
  • First line of the file contains headers (column names).
  • Text qualifier character is the quotation mark ("). All values that contain column separator, line separator, or the text qualifier, need to be enclosed using the text qualifier (text qualifier can also be used in other cases). Text qualifier characters in the values need to be escaped using two text qualifiers ("").
  • There must be equal number of columns in each row.
  • File cannot start with line breaks.