An external customer supplies you with a daily data dump from their database, which is transferred to Google Cloud Storage (GCS) as comma-separated values (CSV) files. Your objective is to analyze this data using Google BigQuery. However, there is a possibility that some rows within these CSV files might be incorrectly formatted or corrupted. What approach would you take to construct an efficient data pipeline that ensures the integrity of the data while loading it into BigQuery?
A. Use federated data sources, and check data in the SQL query.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Explanation:
The correct answer is D. A Google Cloud Dataflow batch pipeline lets you validate and transform each row before it reaches BigQuery: well-formed rows are written to the analytics table, while malformed or corrupted rows are routed to a separate dead-letter table for later analysis, so valid data is loaded and bad data is preserved rather than silently lost. Option A (federated data sources) queries the CSV files in place, but corrupted rows can still cause query errors, and SQL alone gives you no convenient way to quarantine them. Option B only monitors job metrics and can alert you to failures; it cannot clean or recover bad data. Option C sets max_bad_records to 0, which makes the entire load job fail as soon as a single malformed row is encountered, so neither the good rows nor the bad rows end up in BigQuery.
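For concreteness, here is a minimal Apache Beam (Dataflow) sketch of the dead-letter pattern described above. The bucket path, project, dataset, table names, and the three-column schema (id, name, amount) are hypothetical placeholders, not part of the question; a real pipeline would use the customer's actual schema and be launched with the Dataflow runner.

```python
# Hedged sketch: split valid and malformed CSV rows into separate BigQuery tables.
# All resource names and the schema below are assumptions for illustration only.
import csv

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

VALID_TAG = 'valid'
DEAD_LETTER_TAG = 'dead_letter'


class ParseCsvRow(beam.DoFn):
    """Parses one CSV line; emits a row dict on success, or the raw line plus error on failure."""

    def process(self, line):
        try:
            fields = next(csv.reader([line]))
            if len(fields) != 3:
                raise ValueError('expected 3 fields, got %d' % len(fields))
            row = {
                'id': int(fields[0]),
                'name': fields[1],
                'amount': float(fields[2]),
            }
            yield beam.pvalue.TaggedOutput(VALID_TAG, row)
        except Exception as err:  # malformed or corrupted row
            yield beam.pvalue.TaggedOutput(
                DEAD_LETTER_TAG, {'raw_line': line, 'error': str(err)})


def run(argv=None):
    # Runner, project, region, temp_location, etc. are supplied on the command line,
    # e.g. --runner=DataflowRunner --project=my-project --temp_location=gs://...
    options = PipelineOptions(argv)
    with beam.Pipeline(options=options) as p:
        results = (
            p
            | 'ReadCsv' >> beam.io.ReadFromText('gs://example-bucket/daily_dump/*.csv')
            | 'ParseRows' >> beam.ParDo(ParseCsvRow()).with_outputs(
                VALID_TAG, DEAD_LETTER_TAG))

        # Valid rows go to the main analytics table.
        results[VALID_TAG] | 'WriteValid' >> beam.io.WriteToBigQuery(
            'my-project:analytics.daily_data',
            schema='id:INTEGER,name:STRING,amount:FLOAT',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)

        # Malformed rows go to a dead-letter table for later inspection.
        results[DEAD_LETTER_TAG] | 'WriteDeadLetter' >> beam.io.WriteToBigQuery(
            'my-project:analytics.daily_data_dead_letter',
            schema='raw_line:STRING,error:STRING',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)


if __name__ == '__main__':
    run()
```

The key design choice is the tagged output: a single ParDo splits the input into a valid collection and a dead-letter collection, so bad rows are kept alongside their error message instead of being dropped or failing the whole load.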