Anonymize data

From QPR ProcessAnalyzer Wiki

Revision as of 23:58, 14 December 2022

Process mining usually involves handling confidential data, so the ability to anonymize data is an essential feature. For example, personally identifiable information (PII) may need to be removed.

Principles and best practices

Data can be anonymized in the following phases of the data flow:

  • Within the source system, for example using a view in the source database. In this approach, the confidential data is not fetched to QPR ProcessAnalyzer at all, and the anonymization can be done in a way that even the QPR ProcessAnalyzer administrators don't have access to the original data.
  • Within the query that fetches the data from the source system. The original data is again not fetched to QPR ProcessAnalyzer, but QPR ProcessAnalyzer developer users may be able to see the original data, because they can change the queries to remove the anonymization. The benefit of this approach compared to the previous one is that no changes to the source system are needed for the anonymization.
  • While data is extracted from a source system, it's immediately anonymized on the QPR ProcessAnalyzer side and stored to datatables. In this approach, the original data is not stored permanently in QPR ProcessAnalyzer, and it's not possible to access the original data after the extraction. As in the previous approach, developer users can get the original data by modifying the script that performs the anonymization.
  • Original data is loaded into QPR ProcessAnalyzer datatables, and the anonymized data is written to other datatables. The benefit of this approach is that the anonymization technique can be changed without reloading the data from the source system. When this approach is used, it's important to set permissions so that users don't have access to the datatables containing the original data.
  • Anonymization is performed when the model is loaded into memory. This approach is similar to the previous one, except that the data is transformed when the model is used (ELT instead of ETL). This approach requires using a loading script.
  • Anonymization is performed in dashboard queries, so that individual dashboards show the data anonymized. This approach is meant only for testing the anonymization, for example to find a suitable technique. It should not be used in production, because even though they see only anonymized data in the dashboard, the users still have access to the original data in the process mining model.
  • Transformation when exporting data from QPR ProcessAnalyzer: if the original data is stored in the system, make sure that users who are only allowed to see the anonymized data don't have access to the original data.

Choosing the most suitable anonymization method is about finding a balance between the level of privacy and the usefulness of the analysis that can be done on the anonymized data. The more the data is anonymized, the more details are lost, which limits what kind of analysis can be done based on the data.

Anonymization techniques

Pseudonymization

The following example script anonymizes the selected columns in a datatable and writes the result to a new datatable. Each distinct data value is replaced with a numeric identifier starting from one (this can also be called pseudonymization).

let anonymizationMappings = #{};
function PseudonymizeColumn(columnName, originalValue) {
  let dict;
  if (!anonymizationMappings.ContainsKey(columnName)) {
    dict = #{};
    anonymizationMappings.Set(columnName, dict);
  } else {
    dict = anonymizationMappings[columnName];
  }
  if (!dict.ContainsKey(originalValue)) {
    dict.Set(originalValue, `${columnName}: ${dict.Count + 1}`);
  }
  return dict[originalValue];
}

function Pseudonymize(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => PseudonymizeColumn(col, Column(col))
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
Pseudonymize(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id});
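
For illustration, the same sequential-ID mapping can be sketched in plain Python (this is not the QPR ProcessAnalyzer expression language, and the row data below is hypothetical):

```python
# Sketch of sequential-ID pseudonymization: each distinct value in a column
# is replaced with "<column>: <n>", numbering distinct values from one.
def pseudonymize(rows, cols):
    mappings = {}  # column name -> {original value -> pseudonym}
    out = []
    for row in rows:
        new_row = dict(row)
        for col in cols:
            mapping = mappings.setdefault(col, {})
            value = row[col]
            if value not in mapping:
                # The next distinct value gets the next number.
                mapping[value] = f"{col}: {len(mapping) + 1}"
            new_row[col] = mapping[value]
        out.append(new_row)
    return out

rows = [
    {"Case Id": "A-17", "Customer Name": "Acme"},
    {"Case Id": "A-18", "Customer Name": "Acme"},
]
anonymized = pseudonymize(rows, ["Case Id", "Customer Name"])
```

Note that repeated occurrences of the same value map to the same pseudonym, so relationships between rows are preserved.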

Shuffling

The following example anonymizes data by shuffling the values within each of the selected columns:

function ShuffleColumns(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let shuffledData = Shuffle(NumberRange(0, CountTop(df.Rows) - 1));
    let col = cols[i];
    let j = 0;
    df = df.SetColumns([
      `${col}`: () => df.Column(col)[shuffledData[j++]]
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
ShuffleColumns(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id});
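
The shuffling idea can be sketched in plain Python as follows (hypothetical data; the real script operates on a QPR ProcessAnalyzer dataframe instead):

```python
import random

def shuffle_column(rows, col, rng):
    # Collect the column's values, shuffle them, and write them back in the
    # new order, breaking the link between each row and its original value.
    values = [row[col] for row in rows]
    rng.shuffle(values)
    for row, value in zip(rows, values):
        row[col] = value
    return rows

rows = [{"Case Id": f"A-{i}"} for i in range(5)]
shuffled = shuffle_column(rows, "Case Id", random.Random(42))
```

The set of values in the column stays exactly the same; only their assignment to rows changes, so column-level statistics remain intact.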

Masking

Data can be masked using the following script, which replaces the given number of characters at the beginning of each value with asterisks.

function Mask(df, cols, maskCharacters) {
  let mask = StringJoin("", Repeat(maskCharacters, "*"));
  let maskLength = mask.length;
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null || Column(col).length < maskLength, mask, mask + Column(col).Substring(maskLength))
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
Mask(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"],
  5
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id});
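
The masking rule itself can be sketched in plain Python (hypothetical values; not the expression language used above):

```python
def mask_value(value, mask_length):
    mask = "*" * mask_length
    # Values that are missing or shorter than the mask are replaced entirely;
    # longer values keep the part after the masked prefix.
    if value is None or len(value) < mask_length:
        return mask
    return mask + value[mask_length:]
```

Keeping the unmasked tail means values that differ only in their prefix become indistinguishable, which is the intended privacy effect.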

Generalization: broader categories

Categorical data can be generalized by mapping each value to a broader category.

function MapCategories(df, cols, mappings) {
  let reverseMappings = #{};
  mappings.Keys.{
    let key = _;
    mappings[key].{
      let value = _;
      reverseMappings.Set(value, key);
    }
  };
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => {
        let value = Column(col);
        return If(reverseMappings.ContainsKey(value), reverseMappings[value], value);
      }
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
MapCategories(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Country"],
  #{
    "Europe": ["Germany", "Spain", "Sweden", "Norway"],
    "Middle East": ["Saudi Arabia", "Qatar", "United Arab Emirates"],
    "North America": ["USA", "Canada", "Mexico"]
  }
).Persist(sourceDatatable.Name + "_anonymized3", #{"ProjectId": sourceDatatable.Project.Id});
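
The reverse-mapping step can be sketched in plain Python: the category definitions are inverted into a value-to-category lookup, and unmapped values pass through unchanged (hypothetical data):

```python
def map_categories(values, mappings):
    # Invert {category: [values]} into {value: category}.
    reverse = {value: category
               for category, members in mappings.items()
               for value in members}
    # Values without a mapping are left as-is.
    return [reverse.get(v, v) for v in values]

mapped = map_categories(
    ["Germany", "Qatar", "Canada", "Japan"],
    {
        "Europe": ["Germany", "Spain", "Sweden", "Norway"],
        "Middle East": ["Saudi Arabia", "Qatar", "United Arab Emirates"],
        "North America": ["USA", "Canada", "Mexico"],
    },
)
```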

Generalization: rounding numbers

Numerical data can be anonymized by rounding the precise original value to a more generic level (e.g., to the nearest hundred):

function RoundNumbers(df, cols, precision) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Round(Column(col) / precision, 0) * precision)
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
RoundNumbers(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Cost"],
  100
).Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id});
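
The rounding logic can be sketched in plain Python (note that Python's built-in round() uses banker's rounding for exact halves, so edge cases may differ from other environments):

```python
def round_to_precision(value, precision):
    if value is None:
        return None
    # Round to the nearest multiple of the precision, e.g. the nearest hundred.
    return round(value / precision) * precision
```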

Adding random noise

This script adds random noise to numerical data: a random number between the given minimum and maximum values is added to each value.

function AddNoise(df, cols, min, max) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Column(col) + min + (max - min) * Random())
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(66);
AddNoise(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Item_OrderQuantity"],
  50,
  100
).Persist(sourceDatatable.Name + "_anonymized2", #{"ProjectId": sourceDatatable.Project.Id});
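
The same uniform additive noise can be sketched in plain Python (hypothetical data; a seeded generator is used here only to make the example reproducible):

```python
import random

def add_noise(value, min_noise, max_noise, rng):
    if value is None:
        return None
    # uniform(a, b) returns a random float in the range [a, b].
    return value + rng.uniform(min_noise, max_noise)

rng = random.Random(7)
noisy = add_noise(1000, 50, 100, rng)
```

Uniform noise keeps each value within a known distance of the original, so aggregates over many rows remain approximately correct while individual values are obscured.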

Column removal

Columns containing confidential data can be removed as follows:

let sourceDatatable = DataTableById(1);
sourceDatatable.SqlDataFrame.Collect()
  .RemoveColumns(["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_anonymized", #{"ProjectId": sourceDatatable.Project.Id});