
You have a batch processing pipeline for structured data on Google Cloud, currently powered by PySpark for data transformations at scale. However, the pipeline takes more than twelve hours to run end to end. To improve both development speed and pipeline run time, you want to move to a serverless tool that uses SQL syntax. Your raw data has already been transferred to Cloud Storage. How should you build the new pipeline on Google Cloud so that it meets both the speed and processing requirements?
A
Convert your PySpark commands into SparkSQL queries to transform the data, and then run your pipeline on Dataproc to write the data into BigQuery.
B
Ingest your data into Cloud SQL, convert your PySpark commands into SparkSQL queries to transform the data, and then use federated queries from BigQuery for machine learning.
C
Ingest your data into BigQuery from Cloud Storage, convert your PySpark commands into BigQuery SQL queries to transform the data, and then write the transformations to a new table.
D
Use Apache Beam Python SDK to build the transformation pipelines, and write the data into BigQuery.
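For reference, the pattern described in option C can be sketched with the BigQuery Python client: load the raw files from Cloud Storage into a staging table, express the transformation as a BigQuery SQL query, and materialize the result into a new table. This is a minimal sketch under stated assumptions, not a full pipeline; the project, dataset, table, and bucket names (example_project, staging.raw_events, analytics.events_clean, gs://example-bucket/raw/*.csv) and the CSV source format are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials and project

# Load the raw structured files from Cloud Storage into a staging table.
# Bucket, dataset, and table names here are hypothetical.
load_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the files
)
load_job = client.load_table_from_uri(
    "gs://example-bucket/raw/*.csv",       # hypothetical source path
    "example_project.staging.raw_events",  # hypothetical staging table
    job_config=load_config,
)
load_job.result()  # block until the load job completes

# Rewrite the former PySpark transformation as BigQuery SQL and
# write the result to a new table (CREATE TABLE AS SELECT pattern).
transform_sql = """
CREATE OR REPLACE TABLE `example_project.analytics.events_clean` AS
SELECT
  user_id,
  TIMESTAMP_TRUNC(event_ts, DAY) AS event_day,
  SUM(amount) AS daily_amount
FROM `example_project.staging.raw_events`
GROUP BY user_id, event_day
"""
client.query(transform_sql).result()  # block until the query finishes
```

Because both the load and the transformation run entirely inside BigQuery, this approach is serverless and SQL-driven, which is what the question's tooling requirement describes; the SQL query shown is a placeholder for whatever logic the original PySpark commands implemented.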