You are leading a data science team at a large international corporation that primarily trains large-scale models using high-level TensorFlow APIs on Google Cloud AI Platform with GPUs. Typically, iterating on a new version of a model takes your team a few weeks or even months. Recently, you have been asked to review and reduce your team’s Google Cloud compute costs while ensuring that the model's performance remains unaffected. Considering this, what approach should you take?

Exam-Like

Use AI Platform to run distributed training jobs with checkpoints.

24.0%

Use AI Platform to run distributed training jobs without checkpoints.

11.5%

Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs with checkpoints.

61.5%

Migrate to training with Kubeflow on Google Kubernetes Engine, and use preemptible VMs without checkpoints.

3.1%

Google Professional Machine Learning Engineer