You have three years of historical data stored in BigQuery and a data pipeline that loads new data into the same dataset every day. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30-90 days of data, the query scans the entire table. These full table scans have caused an unexpected increase in costs. You want to resolve the issue as cost-effectively as possible while still allowing efficient SQL queries. What should you do?


Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE type.

72.9%

Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.

10.4%

Modify your pipeline to maintain the last 30-90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.

10.4%

Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

6.3%
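
The partitioning option above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names (`partitioned_copy_ddl`, `last_n_days_query`), the table names, and the `event_date` column are all hypothetical. The helpers only build SQL strings; running them against BigQuery is left to whatever client the team already uses.

```python
from datetime import date, timedelta


def partitioned_copy_ddl(source: str, target: str, date_column: str,
                         is_timestamp: bool = False) -> str:
    """Build a CREATE TABLE ... AS SELECT statement with daily partitioning.

    All table and column names passed in are illustrative assumptions.
    """
    # A TIMESTAMP column is converted to a DATE for daily partitioning;
    # a DATE column can be used as the partitioning column directly.
    partition_expr = f"DATE({date_column})" if is_timestamp else date_column
    return (
        f"CREATE TABLE `{target}`\n"
        f"PARTITION BY {partition_expr}\n"
        f"AS SELECT * FROM `{source}`"
    )


def last_n_days_query(table: str, date_column: str, days: int,
                      today: date) -> str:
    """Build a query that filters on the partitioning column (assumed DATE).

    A constant filter on the partitioning column lets BigQuery prune
    partitions outside the range instead of scanning the whole table.
    """
    start = today - timedelta(days=days)
    return (
        f"SELECT * FROM `{table}`\n"
        f"WHERE {date_column} >= DATE '{start.isoformat()}'"
    )


if __name__ == "__main__":
    # Hypothetical table names, used only for demonstration.
    print(partitioned_copy_ddl("proj.ds.events",
                               "proj.ds.events_partitioned",
                               "event_date"))
    print(last_n_days_query("proj.ds.events_partitioned",
                            "event_date", 90, date.today()))
```

With the table partitioned by day, a 90-day filter should read roughly 90 partitions rather than all three years of data, which is why this option addresses the cost problem without changing how the Data Science team writes SQL.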

Google Professional Data Engineer

