
You want to automatically refresh an ML model as soon as new data becomes available, using Google Kubernetes Engine (GKE) and Kubeflow Pipelines within a CI/CD workflow. Your data engineering team has set up a pipeline that cleans datasets and saves them to a Cloud Storage bucket. The solution must minimize latency between data arrival and model update, scale to handle varying data volumes, and remain cost-efficient. Which of the following architectures best meets these requirements? (Choose one correct option)
A
Implement a lightweight Python client deployed on App Engine that continuously polls the Cloud Storage bucket for new files and initiates a training job on GKE upon detecting new data, ensuring immediate processing but potentially incurring higher costs due to constant polling.
B
Configure Cloud Scheduler to trigger periodic checks of the Cloud Storage bucket for new files, initiating a training job on GKE only when new data is found. This approach reduces costs by minimizing unnecessary operations but may introduce delays in model updates.
C
Use Dataflow to process the data and write the resulting files to Cloud Storage, then automatically start the training job on GKE once the files are written. This method leverages Dataflow's scalability for large datasets but may not be the most cost-effective for small or frequently updated datasets.
D
Set up a Cloud Storage trigger to publish a message to a Pub/Sub topic upon the arrival of new files in the bucket. A Cloud Function, subscribed to this topic, then initiates the training job on GKE. This solution offers a balance between low latency, scalability, and cost efficiency.
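For reference, the event flow in option D can be sketched as a minimal Cloud Function handler. The sketch below only decodes the Pub/Sub message carrying the Cloud Storage notification and builds the pipeline parameters; the actual Kubeflow launch is shown as a comment, since the `KFP_ENDPOINT` value and the `train_pipeline.yaml` package name are assumptions, not part of the original question.

```python
import base64
import json


def extract_gcs_event(pubsub_message: dict) -> dict:
    """Decode a Pub/Sub message produced by a Cloud Storage
    object-finalize notification and return the parameters the
    training pipeline needs (bucket name and full gs:// path)."""
    payload = json.loads(base64.b64decode(pubsub_message["data"]))
    return {
        "bucket": payload["bucket"],
        "data_path": f"gs://{payload['bucket']}/{payload['name']}",
    }


def handle_event(event: dict, context=None) -> dict:
    """Cloud Function entry point (sketch). In a real deployment this
    would launch a Kubeflow Pipelines run on GKE, e.g.:

        kfp.Client(host=KFP_ENDPOINT).create_run_from_pipeline_package(
            "train_pipeline.yaml", arguments=params)

    Here only the parameter extraction is shown so the sketch stays
    self-contained."""
    params = extract_gcs_event(event)
    return params
```

A usage note: the Cloud Function reacts only when a message arrives, so there is no idle polling cost (unlike option A) and no scheduled-check delay (unlike option B), which is why this design balances latency, scalability, and cost.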