
Answer-first summary for fast verification
Answer: 1. Check for duplicate rows in the BigQuery tables that have the daily partition data size doubled. 2. Check the BigQuery Audit logs to find job IDs. 3. Use Cloud Monitoring to determine when the identified Dataflow jobs started and the pipeline code version. 4. When more than one pipeline ingests data into a table, stop all versions except the latest one.
The correct answer is Option C. This approach systematically investigates the data increase by checking for duplicates, auditing job logs, monitoring job start times, and managing pipeline versions effectively. It's crucial to identify if duplicate rows are causing the data size increase and to understand the changes that led to this situation by examining job IDs and pipeline versions. Stopping older pipeline versions prevents further data duplication. Other options either focus on deduplication without addressing the root cause, suggest immediate rollback without investigation, or overlook the importance of auditing and monitoring to identify the issue's origin.
Author: LeetQuiz Editorial Team
Ultimate access to all questions.
No comments yet.
You are overseeing your company's data lake on BigQuery, with data ingestion pipelines pulling data from Pub/Sub into BigQuery tables. After a new pipeline version was deployed, there was a 50% increase in daily stored data, with some tables' daily partition sizes doubling, despite no change in Pub/Sub data volumes. What steps should you take to investigate and resolve this sudden data increase?
A
B
C
D