Anonymize data: Difference between revisions

From QPR ProcessAnalyzer Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
Process mining usually needs to handle confidential data, and thus being able to anonymize data is an essential feature. For example, personally identifiable information (PII) may need to be removed.
Process mining usually needs to handle confidential data, and thus being able to anonymize data is an essential feature. For example, personally identifiable information (PII) may need to be removed.


= General=
== General==
Data can be anonymized using scripts with the following methods:
Data can be anonymized using scripts with the following methods:
* Within the source system, for example a VIEW in source database. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
* Within the source system, for example a VIEW in source database. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
* Within the query to fetch the data from the source system, for example in the SQL SELECT statement. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
* Within the query to fetch the data from the source system, for example in the SQL SELECT statement. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
Line 14: Line 14:
* Transformation when exporting data from QPR: If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.
* Transformation when exporting data from QPR: If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.


= Anonymization techniques =
== Anonymization techniques ==
== Pseudonymization ==
=== Pseudonymization ===
Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).
Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).
<pre>
<pre>
Line 49: Line 49:
</pre>
</pre>


== Shuffling ==
=== Shuffling ===
Following example anonymizes data by shuffling values in each of the selected columns:
Following example anonymizes data by shuffling values in each of the selected columns:
<pre>
<pre>
Line 70: Line 70:
</pre>
</pre>


== Masking ==
=== Masking ===
Data can be masked using the following script. The given number of characters are masked.
Data can be masked using the following script. The given number of characters are masked.
<pre>
<pre>
Line 92: Line 92:
</pre>
</pre>


== Generalization: rounding numbers ==
=== Generalization: rounding numbers ===
Numerical data can be anonymized by rounding the precise original value to a more generic level (e.g., to the nearest hundred):
Numerical data can be anonymized by rounding the precise original value to a more generic level (e.g., to the nearest hundred):
<pre>
<pre>
Line 112: Line 112:
</pre>
</pre>


== Random noise ==
=== Random noise ===
This script add random noise to numerical data. The random noise is defined as a number between minimum and maximum value.
This script add random noise to numerical data. The random noise is defined as a number between minimum and maximum value.
<pre>
<pre>
Line 134: Line 134:
</pre>
</pre>


==Column removal==
=== Column removal ===
Columns containing confidential data can be removed as follows:
Columns containing confidential data can be removed as follows:
<pre>
<pre>

Revision as of 20:05, 14 December 2022

Process mining usually needs to handle confidential data, and thus being able to anonymize data is an essential feature. For example, personally identifiable information (PII) may need to be removed.

General

Data can be anonymized using scripts with the following methods:

  • Within the source system, for example a VIEW in source database. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
  • Within the query to fetch the data from the source system, for example in the SQL SELECT statement. In this approach, the confidential data is not loaded to QPR ProcessAnalyzer at all.
  • When data is extracted from a source system, the data is immediately anonymized and the anonymized data is stored to datatables.
  • Inside the SQL Sandbox database
  • Transformation for data loaded into QPR database: Data in the datatables are analymized and stored to other datatables, where the anonymized data can be exported or visualized in dashboards.
  • Transformation when model is loaded into memory
    • Expression language query
    • SQL query
  • Transformation in a Dashboard query: Individual Dashboard(s) can show the data in anonymized format
  • Transformation when exporting data from QPR: If the original data is stored in the system, make sure that users who are only allowed to see the anonymized data, don't have access to the original data.

Anonymization techniques

Pseudonymization

Following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each anonymized data value gets an numeric value starting from one (this can also be called pseudonymization).

let anonymizationMappings = #{};
function PseudonymizeColumn(columnName, originalValue) {
  let dict;
  if (!anonymizationMappings.ContainsKey(columnName)) {
    dict = #{};
    anonymizationMappings.Set(columnName, dict);
  } else {
    dict = anonymizationMappings[columnName];
  }
  if (!dict.ContainsKey(originalValue)) {
    dict.Set(originalValue, `${columnName}: ${dict.Count + 1}`);
  }
  return dict[originalValue];
}

function Pseudonymize(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => PseudonymizeColumn(col, Column(col))
    ]);
  }
}

let sourceDatatable = DataTableById(1);
Pseudonymize(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Shuffling

Following example anonymizes data by shuffling values in each of the selected columns:

function Shuffle(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let shuffledData = Shuffle(NumberRange(0, CountTop(df.Rows) - 1));
    let col = cols[i];
    let j = 0;
    df = df.SetColumns([
      `${col}`: () => df.Column(col)[shuffledData[j++]]
    ]);
  }
}

let sourceDatatable = DataTableById(1);
Shuffle(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Masking

Data can be masked using the following script. The given number of characters are masked.

function Mask(df, cols, maskCharacters) {
  let mask = StringJoin("", Repeat(maskCharacters, "*"));
  let maskLength = mask.length;
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null || Column(col).length < maskLength, mask, mask + Column(col).Substring(maskLength))
    ]);
  }
}

let sourceDatatable = DataTableById(1);
Mask(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"],
  5
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Generalization: rounding numbers

Numerical data can be anonymized by rounding the precise original value to a more generic level (e.g., to the nearest hundred):

function RoundNumbers(df, cols, precision) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Round(Column(col) / precision, 0) * precision)
    ]);
  }
}

let sourceDatatable = DataTableById(1);
RoundNumbers(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Cost"],
  100
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})

Random noise

This script add random noise to numerical data. The random noise is defined as a number between minimum and maximum value.

function Mask(df, cols, min, max) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Column(col) + min + (max - min) * Random())
    ]);
  }
}

let sourceDatatable = DataTableById(66);
Mask(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Item_OrderQuantity"],
  100,
  50
).Persist(sourceDatatable.Name + "_anonymized2", #{"ProjectId": sourceDatatable.Project.Id})

Column removal

Columns containing confidential data can be removed as follows:

let sourceDatatable = DataTableById(1);
sourceDatatable.SqlDataFrame.Collect()
  .RemoveColumns(["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id})