Importing Data to Datatable from CSV File: Difference between revisions
No edit summary |
|||
Line 5: | Line 5: | ||
# Click the '''Import''' button. | # Click the '''Import''' button. | ||
# Select the CSV file to be imported, adjust data type and conversion settings if needed, and click '''Start import'''. | # Select the CSV file to be imported, adjust data type and conversion settings if needed, and click '''Start import'''. | ||
When a CSV file is selected, the import tries to interpret the contents of the file and find the suitable data types and other conversion parameters for each column. In some special cases, the automatic data type detection does not work, and then the settings need to be set manually. | |||
== Import Settings == | == Import Settings == | ||
Line 57: | Line 59: | ||
* Null value: Text that is imported as null. | * Null value: Text that is imported as null. | ||
|} | |} | ||
== Date Formats == | |||
The date format defines how a text is interpreted and converted to a date. There are different parts that can be used in the date format, listed below. In addition, date format can contain any other characters which need to exists as such in the same position in the text to be converted into date. If the timezone offset is provided, dates are imported as UTC. | |||
# '''yyyy''': year as four digits | |||
# '''yy''': year as two digits (the last digits of the year) | |||
# '''MM''': month as two digits (e.g. 01, 02, ... 10, 11, 12) | |||
# '''M''': month as one or two digits (e.g. 1, 2, ... 10, 11, 12) | |||
# '''dd''': day as two digits (e.g. 01, 02, ...) | |||
# '''d''': day as one or two digits (e.g. 1, 2, ... 10, 11, 12, ...) | |||
# '''HH''': hours as two digits based on 24-hour clock (e.g. 00, 01, 02, ... 23) | |||
# '''H''': hours as one or two digits based on 24-hour clock (e.g. 1, 2, ... 23) | |||
# '''hh''': same as HH, except for 12-hour clock | |||
# '''h''': same as H, except for 12-hour clock | |||
# '''mm''': similar as HH except for minutes | |||
# '''m''': similar as H except for minutes | |||
# '''ss''': similar as HH except for seconds | |||
# '''s''': similar as H except for seconds | |||
# '''fff''': 1/1000th of second as three digits (e.g. 000, 001, 002, ... 999) | |||
# '''ff''': 1/100th of second as two digits (e.g. 00, 01, 02, ... 99) | |||
# '''f''': 1/10th of second as one digit (e.g. 0, 1, 2, ... 9) | |||
# '''aa''': either "AM" or "PM" indicating before or after midday (needed when h/hh is used) | |||
# '''a''': either "A" or "P" indicating before or after midday (needed when h/hh is used) | |||
# '''tt''': same as "aa" | |||
# '''t''': same as "a" | |||
# '''z''': Timezone offset hours as one or two digits preceded by plus/minus (e.g. -23, ..., -1, +0, +1, ... +23) | |||
# '''zz''': Timezone offset hours as two digits preceded by plus/minus (e.g. -23, ..., -01, +00, +01, ... +23) | |||
# '''zzz''': Timezone offset as hours and minutes in format (+/-)<hours>:<minutes> (e.g. -23:00, -07:00, +00:00, +00:30, +02:00, +23:00) | |||
== Supported CSV Formats == | == Supported CSV Formats == |
Revision as of 13:54, 21 June 2021
Data can be imported to datatables from CSV files using the Workspace screen. Import is done as follows:
- On the left side projects hierarchy, select the project where the target datatable is located.
- Open the Datatables tab.
- Select the datatable where to import data.
- Click the Import button.
- Select the CSV file to be imported, adjust data type and conversion settings if needed, and click Start import.
When a CSV file is selected, the import tries to interpret the contents of the file and find the suitable data types and other conversion parameters for each column. In some special cases, the automatic data type detection does not work, and then the settings need to be set manually.
Import Settings
The data import dialog has the following settings:
- Select CSV file: Button to select the CSV file. You can use this button to change the CSV file if you want to change the file.
- Column separator: Column separator character used by the CSV file. The column separator is automatically detected when the file is selected. In some cases, the automatic detection does not find the correct column separator, and this selection allows to change the column separator. If you want to use space as a column separator, select <space>, or if you want to use tabulator, select <tab>. Note that when changing the column separator, all column specific settings in the table are reset.
- Skip errors: Allows to continue the import when encountering errors and import null values in place of cells where the conversion could not be made. If error skipping is not in use, the import will stop to the first encountered error. The data conversions are checked before importing actual data to the datatables, so data conversion errors will not lead to partially done imported. When skipping errors, there is a summary of all the encountered errors during the import when the import is completed.
- Character encoding: Character encoding used by the CSV file. Most text files use UTF-8 encoding which is the default. If some of the characters in the CSV file look incorrect in the table preview, try to change other character encoding to find the correct one. Note that there is no automatic detection of the character encoding. Note also that changing the character encoding will reset all column specific settings
Data types
It's very important to set the most suitable data types for each column, as it greatly affects the usage of the data in the analyses. Incorrect data type usually make the analysis impossible. The following data types are available:
Data type | Use | Settings |
---|---|---|
String | Data type for textual data. Also suitable for distinct classification values that are not numerical, e.g. "low", "medium", "high". The length of the string is unlimited, but in practice it's more efficient to use shorter strings (e.g. maximum of 100 characters) for process mining purposes. The string data type make a distinction between empty string and null value. The CSV file import does not import any empty strings, as empty data is imported as nulls. |
|
Integer | Data type for whole numbers, i.e. numbers that don't need decimals. Integers are also suitable for storing distinct values that can be ordered (e.g. 1=low, 2=medium, 3=high), which makes it easier to visualize data in a logical way. Integers have a advantage over decimal numbers: it's possible to make equality comparisons for integers. |
|
Decimal number | Data type that stores numerical data with decimals. It's not possible to make equality comparisons for decimal numbers (i.e. equals than, not equals than, less or equal than, greater or equal than). Only more than and less than comparison can be made for example when filtering data. |
|
Date | Data type stores a precise timestamp, which is an essential data type for process mining. This data type can also be used to store less fine graining time units, such as days, weeks, months or years. In that case the stored timestamp is the point of time in the beginning of the day, week etc. |
|
Boolean | Data type that contains only two values, true and false. The null' value is also available (like in other data types) which in practise is the third possible value for boolean type of data. If no more values are needed, boolean is the most efficient data type to use. Note that in dashboards, the true and false values can be translated into user understandable texts, such as "yes" and "no". |
|
Duration | Data type that contains a time duration (e.g. 5 hours, 14 days, 3 seconds, etc.). Corresponding expression language data type is Timespan. Duration is quite close to decimal number, except its benefit is that it also contains the duration unit (thus it doesn't need to be agreed separately). Duration type of data is also very easy to show in any other duration unit with a simple selection in the dashboard. Note that the duration data type does not store information when the duration started or ended, but only the duration itself. |
|
Date Formats
The date format defines how a text is interpreted and converted to a date. There are different parts that can be used in the date format, listed below. In addition, date format can contain any other characters which need to exists as such in the same position in the text to be converted into date. If the timezone offset is provided, dates are imported as UTC.
- yyyy: year as four digits
- yy: year as two digits (the last digits of the year)
- MM: month as two digits (e.g. 01, 02, ... 10, 11, 12)
- M: month as one or two digits (e.g. 1, 2, ... 10, 11, 12)
- dd: day as two digits (e.g. 01, 02, ...)
- d: day as one or two digits (e.g. 1, 2, ... 10, 11, 12, ...)
- HH: hours as two digits based on 24-hour clock (e.g. 00, 01, 02, ... 23)
- H: hours as one or two digits based on 24-hour clock (e.g. 1, 2, ... 23)
- hh: same as HH, except for 12-hour clock
- h: same as H, except for 12-hour clock
- mm: similar as HH except for minutes
- m: similar as H except for minutes
- ss: similar as HH except for seconds
- s: similar as H except for seconds
- fff: 1/1000th of second as three digits (e.g. 000, 001, 002, ... 999)
- ff: 1/100th of second as two digits (e.g. 00, 01, 02, ... 99)
- f: 1/10th of second as one digit (e.g. 0, 1, 2, ... 9)
- aa: either "AM" or "PM" indicating before or after midday (needed when h/hh is used)
- a: either "A" or "P" indicating before or after midday (needed when h/hh is used)
- tt: same as "aa"
- t: same as "a"
- z: Timezone offset hours as one or two digits preceded by plus/minus (e.g. -23, ..., -1, +0, +1, ... +23)
- zz: Timezone offset hours as two digits preceded by plus/minus (e.g. -23, ..., -01, +00, +01, ... +23)
- zzz: Timezone offset as hours and minutes in format (+/-)<hours>:<minutes> (e.g. -23:00, -07:00, +00:00, +00:30, +02:00, +23:00)