
Answer-first summary for fast verification
Answer: Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
The correct answer is C. The question states that BigQuery will host all future transformation pipelines, so the Parquet data must end up available in BigQuery. Migrating the data directly to BigQuery and refactoring the Spark pipelines to read and write BigQuery tables satisfies that requirement, while running the jobs on Dataproc Serverless keeps everything on managed services, minimizes changes to the existing ETL logic, and keeps overhead costs low. Options A and B leave the data in Cloud Storage, which would require an extra load step or external tables before BigQuery pipelines could use it, and option D runs the jobs on Dataproc on Compute Engine, which adds cluster-management overhead that Dataproc Serverless avoids.
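The refactoring described in option C can be sketched in PySpark using the spark-bigquery connector, which is available on Dataproc Serverless. This is a minimal sketch, not a definitive implementation; the project, dataset, table, and bucket names are hypothetical placeholders, and the filter is a stand-in for your actual transformation logic.

```python
# Hedged sketch: a Spark job refactored to read and write BigQuery directly,
# as in answer C. Assumes the spark-bigquery connector is on the classpath
# (it is provided by the Dataproc Serverless runtime).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-bigquery-etl").getOrCreate()

# Before migration the job read Parquet from HDFS, e.g.:
#   df = spark.read.parquet("hdfs:///customer_data/")
# After migration it reads the table straight from BigQuery.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.customer_ds.customers")  # hypothetical table
    .load()
)

# Placeholder transformation; substitute the existing ETL logic here.
transformed = df.filter(df.active == True)

# Write the result back to BigQuery; the connector stages intermediate
# files in a Cloud Storage bucket you designate.
(
    transformed.write.format("bigquery")
    .option("table", "my-project.customer_ds.active_customers")  # hypothetical
    .option("temporaryGcsBucket", "my-staging-bucket")  # hypothetical bucket
    .mode("overwrite")
    .save()
)
```

Such a script can then be submitted without managing any cluster, for example with `gcloud dataproc batches submit pyspark etl_job.py --region=REGION`, which is what makes Dataproc Serverless the low-overhead choice over option D's Dataproc on Compute Engine.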
Author: LeetQuiz Editorial Team
Your organization currently utilizes an on-premises Apache Hadoop cluster to store customer data in Apache Parquet format. Daily data processing tasks are handled by Apache Spark jobs running on this cluster. As part of your migration strategy, both the Spark jobs and the Parquet data need to be transferred to Google Cloud. BigQuery will be the new platform for future data transformation pipelines, requiring the Parquet data to be accessible within BigQuery. Your goal is to leverage managed services to simplify this process while also minimizing changes to ETL data processing and controlling overhead costs. What steps should you take to achieve this?
A
Migrate your data to Cloud Storage and migrate the metadata to Dataproc Metastore (DPMS). Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
B
Migrate your data to Cloud Storage and register the bucket as a Dataplex asset. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc Serverless.
C
Migrate your data to BigQuery. Refactor Spark pipelines to write and read data on BigQuery, and run them on Dataproc Serverless.
D
Migrate your data to BigLake. Refactor Spark pipelines to write and read data on Cloud Storage, and run them on Dataproc on Compute Engine.