
You are a Machine Learning Engineer at a retail company and need to optimize the ML pipeline that processes structured sales data on Google Cloud. The current pipeline moves raw data into Cloud Storage and then runs PySpark transformations, but it takes over 12 hours to complete, which does not meet the business's need for near-real-time analytics. The company requires a serverless solution that uses SQL syntax for ease of use and efficiency. The solution must also be cost-effective and scalable, and it must minimize operational overhead. Which of the following approaches best meets these requirements? (Choose one correct option)
A
Design the transformation pipelines using Data Fusion's GUI for a no-code solution, then store the processed data in BigQuery for analytics.
B
Use BigQuery Load to transfer your data into BigQuery, convert your PySpark commands into BigQuery SQL queries for transformations, and save the results to a new table.
C
Convert your PySpark into SparkSQL queries and run your pipeline on Dataproc for distributed processing, then store the data in BigQuery for analysis.
D
Import the data into Cloud SQL, transform PySpark commands into SQL queries for processing, and use BigQuery federated queries to access the data for ML tasks.
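
Options B, C, and D all involve rewriting PySpark transformations as SQL. As a rough illustration of what such a rewrite can look like, the sketch below builds a SQL statement equivalent to a simple PySpark filter-and-aggregate step. The table names, column names (`store_id`, `amount`), and the helper `build_store_totals_sql` are hypothetical and not taken from the question; the SQL uses BigQuery-style backtick quoting as an assumption.

```python
# Hypothetical PySpark step being rewritten (shown for comparison only):
#   df.filter(df.amount > 0).groupBy("store_id").agg(F.sum("amount").alias("total_sales"))

def build_store_totals_sql(source_table: str, dest_table: str) -> str:
    """Return a SQL statement that writes per-store sales totals to a new table.

    Both table names are placeholders supplied by the caller.
    """
    return (
        f"CREATE OR REPLACE TABLE `{dest_table}` AS\n"
        f"SELECT store_id, SUM(amount) AS total_sales\n"
        f"FROM `{source_table}`\n"
        f"WHERE amount > 0\n"        # mirrors df.filter(df.amount > 0)
        f"GROUP BY store_id"         # mirrors df.groupBy("store_id")
    )

# Placeholder table identifiers, assumed for this example only.
sql = build_store_totals_sql(
    "my_project.sales.raw_sales",
    "my_project.sales.store_totals",
)
print(sql)
```

The function only produces the statement text; which engine runs it, and how the raw data gets there, is exactly what the options above differ on.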