Anonymize data: Difference between revisions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
Usually confidential data is handled in process mining, and thus being able to anonymize confidential data is an essential feature. Data can be anonymized using scripts with the following methods:
Confidential data is usually handled in process mining, and thus being able to anonymize data is an essential feature. Data can be anonymized using scripts with the following methods:
* Within the source system, for example a VIEW in source database
* Within the query from the source system, for example, in the SQL SELECT statement
* When data is extracted from a source system, the data is immediately anonymized and the anonymized data is stored to datatables.
* When data is extracted from a source system, the data is immediately anonymized and the anonymized data is stored to datatables.
* Data in the datatables are analymized and stored to other datatables, where the anonymized data can be exported or visualized in dashboards.
* Inside the SQL Sandbox database
 
* Transformation for data loaded into QPR database: Data in the datatables are analymized and stored to other datatables, where the anonymized data can be exported or visualized in dashboards.
If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.
* Transformation when model is loaded into memory
** Expression language query
** SQL query
* Transformation in a Dashboard query: Individual Dashboard(s) can show the data in anonymized format
* Transformation when exporting data from QPR: If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.


== Pseudonymization ==
Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).
Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).
<pre>
<pre>
Line 36: Line 43:
</pre>
</pre>


== Shuffling ==
Following example anonymizes data by shuffling values in each of the selected columns:
Following example anonymizes data by shuffling values in each of the selected columns:
<pre>
<pre>
Line 53: Line 61:
</pre>
</pre>


== Masking ==
Data can be masked using the following script:
Data can be masked using the following script:
<pre>
<pre>
Line 70: Line 79:
</pre>
</pre>


== Column removal ==
Columns containing confidential data can be removed as follows:
Columns containing confidential data can be removed as follows:
<pre>
<pre>

Revision as of 12:07, 14 December 2022

Confidential data is usually handled in process mining, and thus being able to anonymize data is an essential feature. Data can be anonymized using scripts with the following methods:

  • Within the source system, for example a VIEW in source database
  • Within the query from the source system, for example, in the SQL SELECT statement
  • When data is extracted from a source system, the data is immediately anonymized and the anonymized data is stored to datatables.
  • Inside the SQL Sandbox database
  • Transformation for data loaded into QPR database: Data in the datatables are analymized and stored to other datatables, where the anonymized data can be exported or visualized in dashboards.
  • Transformation when model is loaded into memory
    • Expression language query
    • SQL query
  • Transformation in a Dashboard query: Individual Dashboard(s) can show the data in anonymized format
  • Transformation when exporting data from QPR: If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.

Pseudonymization

Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).

let anonymizationDict = #{};
function AnonymizeColumn(columnName, oldValue) {
  let dict
  if (!anonymizationDict.ContainsKey(columnName)) {
    dict = #{};
    anonymizationDict.Set(columnName, dict);
  }
  else {
    dict = anonymizationDict[columnName];
  }
  if (!dict.ContainsKey(oldValue)) {
    dict.Set(oldValue, `${dict.Count + 1}`);
  }
  return dict[oldValue];
}

function Anonymize(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => AnonymizeColumn(col, Column(col))
    ]);
  }
}
let sourceDatatable = DataTableById(1);
Anonymize(sourceDatatable.SqlDataFrame.Collect(), ["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Shuffling

Following example anonymizes data by shuffling values in each of the selected columns:

function Anonymize(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let shuffledData = Shuffle(NumberRange(0, CountTop(df.Rows) - 1));
    let col = cols[i];
    let j = 0;
    df = df.SetColumns([
      `${col}`: () => df.Column(col)[shuffledData[j++]]
    ]);
  }
}
let sourceDatatable = DataTableById(1);
Anonymize(sourceDatatable.SqlDataFrame.Collect(), ["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Masking

Data can be masked using the following script:

function Anonymize(df, cols, maskCharacters) {
  let mask = StringJoin("", Repeat(maskCharacters, "*"));
  for (let i = 0; i < CountTop(cols); ++i) {
    let shuffledData = Shuffle(NumberRange(0, CountTop(df.Rows) - 1));
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col).length < mask.length, mask, mask + Column(col).Substring(mask.length))
    ]);
  }
}
let sourceDatatable = DataTableById(1);
Anonymize(sourceDatatable.SqlDataFrame.Collect(), ["Case Id", "Company Code", "Customer Name"], 5)
  .Persist(sourceDatatable.Name + "_anonymized1", #{"ProjectId": sourceDatatable.Project.Id})

Column removal

Columns containing confidential data can be removed as follows:

let sourceDatatable = DataTableById(1);
sourceDatatable.SqlDataFrame.Collect()
  .RemoveColumns(["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})