Anonymize data

From QPR ProcessAnalyzer Wiki
Latest revision as of 14:06, 23 March 2023

Process mining often handles sensitive data, and thus being able to anonymize the data is an essential feature. For example, personally identifiable information (PII) can be removed. The goal of anonymization is to hide the sensitive details in the data while preserving its analytical utility.

Principles and best practices

Data can be anonymized in different phases of the data flow depending on requirements. QPR ProcessAnalyzer supports the following approaches for anonymizing data:

  • In the source system, for example using a view in the source system's database. In this approach, the original data is not fetched into QPR ProcessAnalyzer at all, so the anonymization can be done in a way that even the QPR ProcessAnalyzer administrators don't have access to the original data.
  • In the query that fetches the data from the source system. As in the previous approach, the original data is not fetched into QPR ProcessAnalyzer at all. The difference is that QPR ProcessAnalyzer developer users may be able to see the original data, because they can change how the anonymization is done or even remove it. The benefit compared to the previous approach is that no changes to the source system are needed to implement the anonymization.
  • During data extraction from the source system, where the data is immediately anonymized in QPR ProcessAnalyzer and stored in datatables. In this approach, the original data is not stored permanently in QPR ProcessAnalyzer, and it's not possible to access the original data after the extraction. As in the previous approach, developer users can see the original data by modifying the script that performs the anonymization. The original data is transferred from the source system to QPR ProcessAnalyzer, but usually this doesn't matter because the transferred data is always encrypted.
  • Original data is loaded into QPR ProcessAnalyzer datatables, then anonymized and written to other datatables. The benefit of this approach is that the anonymization method can be changed without reloading the data from the source system. When this approach is used, it's important to set permissions so that users don't have access to the datatables containing the original data.
  • Anonymization is performed when the model is loaded into memory. This approach is similar to the previous one, except that the data is transformed when the model is used (ELT instead of ETL). It requires using a loading script. The benefit is that the data doesn't need to be stored twice in datatables; a possible drawback is that loading takes longer due to the added computation for the anonymization.
  • Anonymization is performed in dashboard queries, so that individual dashboards show the data anonymized. This approach is meant only for testing anonymization methods, for example to find a suitable technique. It cannot be used in production, because even though they see anonymized data, dashboard users still have access to the original data in the process mining model.
  • If the anonymized data is meant to be used outside QPR ProcessAnalyzer, the anonymization can be performed when exporting data from QPR ProcessAnalyzer. If the original data is stored in QPR ProcessAnalyzer, make sure that users who are only allowed to see the anonymized data don't have access to the original data.

Choosing the most suitable anonymization method is about balancing the level of privacy against the analytical usefulness of the anonymized data. The more the data is anonymized, the more details are lost, which limits the kinds of analyses and findings that can be made from the data. To get started with anonymization, gather requirements for both:

  1. What sensitive data should not be revealed
  2. What kind of process mining analyses are intended based on the data

Anonymization techniques

Pseudonymization with running numbering

Pseudonymization is an anonymization method where each data value is mapped to another value from which the original value cannot be derived. The idea is that the same data value always gets the same pseudonymized value, which keeps the data more usable for analysis despite the anonymization.

The following example script anonymizes selected columns in a datatable and writes the result to a new datatable. Each distinct data value gets a running number starting from one, prefixed with the column name. Note that this approach may assign different anonymized values the next time the anonymization is run if the data has changed.

let skipAnonymizationForValues = [null, "Unknown"]; // list values that are not anonymized
let mappings = #{};
function PseudonymizeColumn(columnName, originalValue) {
  if (IndexOf(skipAnonymizationForValues, originalValue) > -1) {
    return originalValue;
  }
  let dict;
  if (!mappings.ContainsKey(columnName)) {
    dict = #{};
    mappings.Set(columnName, dict);
  } else {
    dict = mappings[columnName];
  }
  if (originalValue == null) {
    originalValue = "(null)";
  }
  if (!dict.ContainsKey(originalValue)) {
    dict.Set(originalValue, `${columnName}: ${dict.Count + 1}`);
  }
  return dict[originalValue];
}

function Pseudonymize(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      col: () => PseudonymizeColumn(col, Column(col))
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
Pseudonymize(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_pseudonymized", #{"ProjectId": sourceDatatable.Project.Id});
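
The running-numbering idea can also be sketched in plain Python (illustrative only, not the ProcessAnalyzer expression language; the row data and column names are hypothetical):

```python
def pseudonymize(rows, cols, skip=(None, "Unknown")):
    """Replace values in the given columns with per-column running numbers."""
    mappings = {}  # column name -> {original value -> pseudonym}
    result = []
    for row in rows:
        new_row = dict(row)
        for col in cols:
            value = row[col]
            if value in skip:
                continue  # leave configured values untouched
            table = mappings.setdefault(col, {})
            if value not in table:
                table[value] = f"{col}: {len(table) + 1}"  # next running number
            new_row[col] = table[value]
        result.append(new_row)
    return result

rows = [{"Customer Name": "Acme"}, {"Customer Name": "Beta"}, {"Customer Name": "Acme"}]
anonymized = pseudonymize(rows, ["Customer Name"])
# Repeated "Acme" maps to the same pseudonym "Customer Name: 1"
```

As in the script above, the mapping lives only in memory for the duration of the run, which is why rerunning on changed data can produce different pseudonyms.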

Pseudonymization with hashing

The following example script anonymizes selected columns in a datatable and writes the result to a new datatable. The anonymization is done by calculating a hash value for each data value. The benefit of hashing is that the same data always gets the same hash value, so the anonymization can be run again and, for example, saved filters still refer to the same items. Note that the hashing needs to be done in a secure way by including a secret string in the hashed value; without knowing the secret, the hashing cannot be reversed with a brute-force attack.

let secret = "JvqcfiCDksrHqe94maYcm3RvEe0eAY"; // replace with your own secret

function PseudonymizeByHashing(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      col: () => `${col}: ${Hash(`${Column(col)}${secret}`)}`
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
PseudonymizeByHashing(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_pseudonymizedWithHash", #{"ProjectId": sourceDatatable.Project.Id});
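
A standard construction for this kind of keyed hashing is HMAC. Below is a minimal Python sketch of the same idea (illustrative only; the secret and column name are placeholders):

```python
import hmac
import hashlib

SECRET = b"replace-with-your-own-secret"  # placeholder; keep confidential

def hash_pseudonym(column, value):
    # Keyed hash (HMAC-SHA256): deterministic for the same input, but
    # infeasible to brute-force back to the original without the secret.
    digest = hmac.new(SECRET, f"{column}:{value}".encode(), hashlib.sha256)
    return f"{column}: {digest.hexdigest()[:16]}"
```

Because the output is deterministic, rerunning the anonymization keeps all pseudonyms stable across runs, unlike the running-numbering approach.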

Shuffling

The following example anonymizes data by shuffling the values in each of the selected columns:

function ShuffleData(df, cols) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let shuffledData = Shuffle(NumberRange(0, CountTop(df.Rows) - 1));
    let col = cols[i];
    let j = 0;
    df = df.SetColumns([
      `${col}`: () => df.Column(col)[shuffledData[j++]]
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
ShuffleData(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"]
).Persist(sourceDatatable.Name + "_shuffled", #{"ProjectId": sourceDatatable.Project.Id});
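
The shuffling technique can be sketched in plain Python (illustrative only; the row data is hypothetical). The point is that the multiset of values in the column is preserved while the link to the other fields of each row is broken:

```python
import random

def shuffle_column(rows, col, rng=None):
    # Break the link between a row and its value in `col` while keeping
    # the overall value distribution intact.
    rng = rng or random.Random()
    values = [row[col] for row in rows]
    rng.shuffle(values)
    return [{**row, col: value} for row, value in zip(rows, values)]

rows = [{"Case Id": i, "Customer Name": name}
        for i, name in enumerate(["Acme", "Beta", "Gamma", "Delta"])]
shuffled = shuffle_column(rows, "Customer Name", random.Random(42))
```

Note that shuffling keeps aggregate statistics per column exact, but destroys any row-level correlation between the shuffled column and the rest of the data.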

Masking

Data can be masked using the following script, which replaces the given number of characters from the beginning of each value with asterisks:

function Mask(df, cols, maskCharacters) {
  let mask = StringJoin("", Repeat(maskCharacters, "*"));
  let maskLength = mask.length;
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null || Column(col).length < maskLength, mask, mask + Column(col).Substring(maskLength))
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
Mask(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Case Id", "Company Code", "Customer Name"],
  5
).Persist(sourceDatatable.Name + "_masked", #{"ProjectId": sourceDatatable.Project.Id});
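
The per-value masking rule can be sketched in plain Python (illustrative only). As in the script above, values shorter than the mask are replaced entirely, so their length is not revealed:

```python
def mask_value(value, n):
    # Replace the first n characters with asterisks; short or missing
    # values become a full mask so their length is not leaked.
    if value is None or len(value) <= n:
        return "*" * n
    return "*" * n + value[n:]
```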

Generalization: broader categories

Categorical data can be generalized by mapping each value to a broader category:

function MapCategories(df, cols, mappings) {
  let reverseMappings = #{};
  mappings.Keys.{
    let key = _;
    mappings[key].{
      let value = _;
      reverseMappings.Set(value, key);
    }
  };
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => {
        let value = Column(col);
        return If(reverseMappings.ContainsKey(value), reverseMappings[value], value);
      }
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
MapCategories(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Country"],
  #{
    "Europe": ["Germany", "Spain", "Sweden", "Norway"],
    "Middle East": ["Saudi Arabia", "Qatar", "United Arab Emirates"],
    "North America": ["USA", "Canada", "Mexico"]
  }
).Persist(sourceDatatable.Name + "_generalizedCategories", #{"ProjectId": sourceDatatable.Project.Id});
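
The mapping inversion used above can be sketched in plain Python (illustrative only; the category names are examples). The `{category: [values]}` dictionary is inverted once into `{value: category}` so each lookup is constant-time:

```python
def build_reverse_mapping(mappings):
    # Invert {category: [values]} into {value: category} for fast lookups.
    return {value: category
            for category, values in mappings.items()
            for value in values}

def generalize(value, reverse_mapping):
    # Values without a configured category are passed through unchanged.
    return reverse_mapping.get(value, value)

reverse = build_reverse_mapping({
    "Europe": ["Germany", "Spain"],
    "North America": ["USA", "Canada"],
})
```

Passing unmapped values through unchanged matches the script above; if that is a privacy risk, an explicit "Other" fallback category could be returned instead.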

Generalization: rounding numbers

Numerical data can be anonymized by rounding the precise original value to a more generic level (e.g., to the nearest hundred):

function RoundNumbers(df, cols, precision) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Round(Column(col) / precision, 0) * precision)
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(1);
RoundNumbers(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Cost"],
  100
).Persist(sourceDatatable.Name + "_rounded", #{"ProjectId": sourceDatatable.Project.Id});
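
The rounding rule is simply rounding to the nearest multiple of the chosen precision, sketched here in plain Python (illustrative only):

```python
def round_to(value, precision):
    # Round to the nearest multiple of `precision`, e.g. the nearest hundred.
    if value is None:
        return None
    return round(value / precision) * precision
```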

Adding random noise

This script adds random noise to numerical data. The noise is a random number between the given minimum and maximum values.

function AddRandomNoise(df, cols, min, max) {
  for (let i = 0; i < CountTop(cols); ++i) {
    let col = cols[i];
    df = df.SetColumns([
      `${col}`: () => If(Column(col) == null, null, Column(col) + min + (max - min) * Random())
    ]);
  }
  return df;
}

let sourceDatatable = DataTableById(66);
AddRandomNoise(
  sourceDatatable.SqlDataFrame.Collect(),
  ["Cost"],
  50,
  100
).Persist(sourceDatatable.Name + "_randomNoise", #{"ProjectId": sourceDatatable.Project.Id});
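
The noise rule can be sketched in plain Python (illustrative only): each non-null value is shifted by a uniformly random amount drawn from the configured interval.

```python
import random

def add_noise(value, low, high, rng=None):
    # Add uniform random noise from [low, high]; nulls pass through.
    if value is None:
        return None
    rng = rng or random.Random()
    return value + rng.uniform(low, high)
```

Note that purely additive noise with a fixed positive range shifts every value upward; drawing the noise from a symmetric interval such as [-50, 50] would preserve averages better, at the cost of sometimes revealing values close to the originals.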

Column removal

Columns containing confidential data can be removed as follows:

let sourceDatatable = DataTableById(1);
sourceDatatable.SqlDataFrame.Collect()
  .RemoveColumns(["Case Id", "Company Code", "Customer Name"])
  .Persist(sourceDatatable.Name + "_columnsRemoved", #{"ProjectId": sourceDatatable.Project.Id});