Ultimate access to all questions.
You are leading a data science team at a large international corporation that primarily trains large-scale models using high-level TensorFlow APIs on Google Cloud AI Platform with GPUs. Typically, iterating on a new version of a model takes your team a few weeks or even months. Recently, you have been asked to review and reduce your team’s Google Cloud compute costs while ensuring that the model's performance remains unaffected. Considering this, what approach should you take?
Explanation:
The correct answer is C. Preemptible VMs are a cost-effective solution because they are significantly cheaper than standard VMs, although they can be terminated at any time. By using checkpoints, you can save the progress of your training job at regular intervals, which allows you to resume the job if a preemptible VM is terminated. This ensures that your model's performance is not impacted, even if the training job is interrupted, while also reducing compute costs. Migrating to Kubeflow on Google Kubernetes Engine aligns with the need to manage resource allocation efficiently.