
Explanation:
The correct answer is Option C. This approach systematically investigates the data increase by checking for duplicates, auditing job logs, monitoring job start times, and managing pipeline versions effectively. It's crucial to identify if duplicate rows are causing the data size increase and to understand the changes that led to this situation by examining job IDs and pipeline versions. Stopping older pipeline versions prevents further data duplication. Other options either focus on deduplication without addressing the root cause, suggest immediate rollback without investigation, or overlook the importance of auditing and monitoring to identify the issue's origin.
Ultimate access to all questions.
No comments yet.
You are overseeing your company's data lake on BigQuery, with data ingestion pipelines pulling data from Pub/Sub into BigQuery tables. After a new pipeline version was deployed, there was a 50% increase in daily stored data, with some tables' daily partition sizes doubling, despite no change in Pub/Sub data volumes. What steps should you take to investigate and resolve this sudden data increase?
A
B
C
D