
You are tasked with designing a real-time prediction engine on Google Cloud that processes incoming files, some of which may contain Personally Identifiable Information (PII). To ensure compliance with data protection standards, you plan to use the Cloud Data Loss Prevention (DLP) API to scan these files. The solution must minimize the risk of unauthorized access to PII, be cost-effective, and scale to handle varying loads. Given these requirements, what is the most secure and efficient approach to managing the data? Choose the best option.
A
Directly stream all files to BigQuery and use scheduled DLP API scans to identify and classify PII. This approach leverages BigQuery's analytics capabilities but may expose PII during the scanning interval.
B
Implement a two-bucket system: 'Processing' and 'Archived'. Stream files to the 'Processing' bucket, apply DLP API scans in real-time, and then move files to 'Archived' based on scan results. This reduces exposure time but may increase costs due to real-time scanning.
C
Use a single bucket for all files, applying DLP API scans periodically. Files identified as containing PII are then encrypted. This method is simple but leaves PII exposed until the scan completes.
D
Establish a three-bucket system: 'Quarantine', 'Sensitive', and 'Non-sensitive'. All incoming files are initially placed in 'Quarantine'. Periodic DLP API scans then move files to 'Sensitive' or 'Non-sensitive' based on content, ensuring PII is never exposed in an unsecured state.
E
Combine streaming files directly into BigQuery with real-time DLP API scans before storage. This ensures immediate PII detection but may not be cost-effective for high-volume data streams.
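The quarantine pattern described in option D can be automated so that no human ever touches files in an unclassified state: a function triggered on each upload to the 'Quarantine' bucket inspects the file with the DLP API and moves it to the appropriate destination. The sketch below assumes a Cloud Functions GCS-trigger entry point; the bucket names, project ID, and infoTypes are illustrative assumptions, not values from the question.

```python
# Hypothetical sketch of option D's three-bucket quarantine pattern.
# Assumed names: example-sensitive, example-non-sensitive, example-project.

SENSITIVE_BUCKET = "example-sensitive"
NON_SENSITIVE_BUCKET = "example-non-sensitive"


def route_bucket(findings):
    """Pure routing decision: any DLP finding sends the file to 'Sensitive'."""
    return SENSITIVE_BUCKET if findings else NON_SENSITIVE_BUCKET


def classify_file(event, context):
    """Cloud Functions entry point for a GCS 'finalize' event (a sketch)."""
    # Imported lazily so route_bucket() stays importable without the
    # google-cloud-* client libraries installed.
    from google.cloud import dlp_v2, storage

    bucket_name, blob_name = event["bucket"], event["name"]
    storage_client = storage.Client()
    source_bucket = storage_client.bucket(bucket_name)
    blob = source_bucket.blob(blob_name)
    content = blob.download_as_bytes()

    # Inspect the file contents for PII with the DLP API.
    dlp = dlp_v2.DlpServiceClient()
    response = dlp.inspect_content(
        request={
            "parent": "projects/example-project",  # assumed project ID
            "inspect_config": {
                "info_types": [
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "PHONE_NUMBER"},
                ],
            },
            "item": {"byte_item": {"type_": "BYTES_UNSPECIFIED", "data": content}},
        }
    )

    # Move the file out of Quarantine based on the scan result, then delete
    # the original so PII never lingers in an unclassified location.
    dest_bucket = storage_client.bucket(route_bucket(response.result.findings))
    source_bucket.copy_blob(blob, dest_bucket, blob_name)
    blob.delete()
```

Because classification happens on upload and the original is deleted immediately after the copy, files spend minimal time in 'Quarantine', which is the property that makes option D the most secure of the choices above.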