You are developing a machine learning model using data stored in Google BigQuery. This dataset contains several values classified as Personally Identifiable Information (PII), such as names, addresses, and social security numbers. To comply with data privacy laws and reduce the sensitivity of the dataset before using it for training, you need to anonymize or mask these sensitive columns without removing them, as every column is critical for the model's performance. How should you proceed?
Explanation:
The correct answer is B. The Cloud Data Loss Prevention (DLP) API can scan a dataset for sensitive values and replace them using Format-Preserving Encryption (FPE). FPE produces ciphertext with the same length and character set as the input, so every column stays in the dataset in its original format, which matters here because every column is critical to the model's performance. For a given key, the same input always maps to the same surrogate, so the encrypted columns keep their equality relationships and remain usable as features. Running the DLP API from a Dataflow pipeline lets the transformation scale to the full BigQuery dataset. The result is a less sensitive dataset that supports compliance with data privacy regulations without removing any columns.
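As a rough illustration of the FPE step described above, the following sketch builds the kind of de-identification configuration the DLP API accepts for table columns. It is a minimal sketch under stated assumptions, not the exam's reference solution: the helper name `build_fpe_deidentify_config`, the column name `ssn`, the KMS key path, and the wrapped key bytes are all hypothetical placeholders. The code only constructs the configuration dictionary and does not call any Google Cloud service.

```python
import base64


def build_fpe_deidentify_config(fields, kms_key_name, wrapped_key_b64, alphabet="NUMERIC"):
    """Build a DLP de-identify config that applies FPE to the given columns.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # The data encryption key is assumed to be wrapped by a Cloud KMS key.
    crypto_key = {
        "kms_wrapped": {
            "wrapped_key": base64.b64decode(wrapped_key_b64),
            "crypto_key_name": kms_key_name,
        }
    }
    return {
        "record_transformations": {
            "field_transformations": [
                {
                    # Columns to transform; every other column passes through unchanged.
                    "fields": [{"name": name} for name in fields],
                    "primitive_transformation": {
                        "crypto_replace_ffx_fpe_config": {
                            "crypto_key": crypto_key,
                            # Alphabet the input values are drawn from.
                            "common_alphabet": alphabet,
                        }
                    },
                }
            ]
        }
    }


# Placeholder values, assumed for illustration only.
example_config = build_fpe_deidentify_config(
    fields=["ssn"],
    kms_key_name="projects/my-project/locations/global/keyRings/my-ring/cryptoKeys/my-key",
    wrapped_key_b64=base64.b64encode(b"placeholder-wrapped-key").decode(),
)
```

In a real pipeline, a dictionary like `example_config` would typically be passed as the `deidentify_config` of a de-identification request, for example from inside a Dataflow transform that processes BigQuery rows in batches. Columns such as names and addresses may need a different alphabet or a different transformation, since FPE expects inputs drawn from a fixed alphabet.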