
Answer-first summary for fast verification
Answer: Add a ContainerOp to your pipeline that spins up a Dataproc cluster, runs a transformation, and then saves the transformed data in Cloud Storage.
The recommended approach is to add a ContainerOp to your Kubeflow pipeline that spins up a Dataproc cluster, runs the PySpark transformation step, and saves the transformed data in Cloud Storage. This leverages Dataproc, a fully managed Spark service, to handle PySpark transformations on large datasets scalably and efficiently. It integrates cleanly with Kubeflow Pipelines and requires less manual setup and configuration than the other options.
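As a rough illustration of what such a step might look like, the sketch below builds the shell command that a ContainerOp (running a `google/cloud-sdk` image) could execute: create a transient Dataproc cluster, submit the PySpark transformation job, then delete the cluster. The project, region, bucket, and script names are placeholder assumptions, and in a real pipeline you would wrap this command in `kfp.dsl.ContainerOp` (or use the reusable Dataproc components from the KFP component library) rather than run it directly.

```python
# Hedged sketch -- all names below (project, region, cluster, bucket,
# transform.py) are illustrative placeholders, not values from the question.
PROJECT = "my-project"        # assumed GCP project ID
REGION = "us-central1"        # assumed Dataproc region
CLUSTER = "kfp-transform"     # assumed transient cluster name
BUCKET = "gs://my-bucket"     # assumed Cloud Storage bucket

def dataproc_transform_command(pyspark_uri: str, output_uri: str) -> str:
    """Return the bash command a ContainerOp would run for this step:
    create cluster -> submit PySpark job -> tear the cluster down."""
    steps = [
        f"gcloud dataproc clusters create {CLUSTER} "
        f"--project {PROJECT} --region {REGION}",
        f"gcloud dataproc jobs submit pyspark {pyspark_uri} "
        f"--cluster {CLUSTER} --region {REGION} -- --output {output_uri}",
        f"gcloud dataproc clusters delete {CLUSTER} "
        f"--region {REGION} --quiet",
    ]
    # Chain with && so a failed step stops the pipeline step early.
    return " && ".join(steps)

cmd = dataproc_transform_command(
    f"{BUCKET}/code/transform.py", f"{BUCKET}/transformed/"
)
```

Because the transformed data lands in Cloud Storage, the downstream training and evaluation steps of the pipeline can read it without any coupling to the (now deleted) cluster.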
Author: LeetQuiz Editorial Team
You have built a machine learning model that is trained on data stored in Parquet files. This data is accessed through a Hive table hosted on Google Cloud Platform (GCP). For preprocessing, you utilized PySpark and exported the resulting data as a CSV file into Google Cloud Storage. Following this preprocessing stage, you perform additional steps to train and evaluate your model. You now wish to automate and parameterize this entire model training workflow using Kubeflow Pipelines on Google Cloud. What should you do to achieve this?
A
Remove the data transformation step from your pipeline.
B
Containerize the PySpark transformation step, and add it to your pipeline.
C
Add a ContainerOp to your pipeline that spins up a Dataproc cluster, runs a transformation, and then saves the transformed data in Cloud Storage.
D
Deploy Apache Spark at a separate node pool in a Google Kubernetes Engine cluster. Add a ContainerOp to your pipeline that invokes a corresponding transformation job for this Spark instance.