
Answer-first summary for fast verification
Answer: Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
Option C is the best approach. BigQuery is a fully serverless data warehouse that transforms structured data at scale using standard SQL, which matches both stated requirements: faster development (SQL instead of PySpark) and faster execution (no cluster to provision or tune). By ingesting the data directly from Cloud Storage into BigQuery and running the transformations as SQL queries that write to a new table, you avoid the overhead of maintaining an intermediate processing cluster such as Dataproc (option A), an unnecessary detour through Cloud SQL with federated queries (option B), and the Apache Beam Python SDK (option D), which is not SQL-based and would not speed up development.
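As a rough sketch of option C, the pipeline can be expressed with the `google-cloud-bigquery` Python client: one load job to ingest the Cloud Storage files, then one SQL query that writes the transformed result to a new table. The project, dataset, table, bucket names, and the `GROUP BY` transformation below are illustrative assumptions; the real SQL would come from translating your PySpark logic.

```python
def build_transform_sql(source: str, destination: str) -> str:
    """Illustrative SQL replacing a hypothetical PySpark groupBy/agg step."""
    return (
        f"CREATE OR REPLACE TABLE `{destination}` AS\n"
        "SELECT customer_id, SUM(amount) AS total_amount\n"
        f"FROM `{source}`\n"
        "GROUP BY customer_id"
    )

def run_pipeline(project: str) -> None:
    # Imported here so the SQL builder above works without the package installed.
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client(project=project)

    # 1. Ingest raw files from Cloud Storage into a staging table
    #    (assumed bucket path and table names).
    load_job = client.load_table_from_uri(
        "gs://my-bucket/raw/*.csv",
        f"{project}.analytics.raw_events",
        job_config=bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            autodetect=True,
            skip_leading_rows=1,
        ),
    )
    load_job.result()  # block until the load completes

    # 2. Transform with BigQuery SQL, writing the output to a new table.
    sql = build_transform_sql(
        f"{project}.analytics.raw_events",
        f"{project}.analytics.transformed_events",
    )
    client.query(sql).result()  # block until the query completes
```

Because both steps run as serverless BigQuery jobs, there is no cluster to size or keep warm, which is the source of the speedup over the Dataproc-based alternatives.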
Author: LeetQuiz Editorial Team
You have a batch processing pipeline for structured data on Google Cloud, currently powered by PySpark for data transformations at scale. However, the pipeline requires more than twelve hours to execute fully. To improve both development speed and pipeline execution time, you want to transition to a serverless tool that uses SQL syntax. Your raw data has already been transferred to Cloud Storage. What steps would you take to construct the new pipeline on Google Cloud to meet both speed and processing requirements?
A
Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
B
Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
C
Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
D
Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.