
Answer-first summary for fast verification
Answer: Use the tf.distribute.Strategy API and run a distributed training job.
The correct answer is D. Using the tf.distribute.Strategy API to run a distributed training job can significantly reduce training time without sacrificing model performance. Given the dataset's size (three million X-ray images, each approximately 2 GB), training on a single machine with one GPU is compute-bound, even with otherwise robust specifications. Distributed training splits the workload across multiple GPUs or multiple machines, which can drastically speed up the process.

Option A is not ideal: increasing instance memory and batch size does not address the primary bottleneck, which is GPU compute throughput, not host RAM. Option B is incorrect: the K80 is an older, less powerful GPU than the P100, so the swap would slow training down. Option C, enabling early stopping, only cuts epochs after validation performance plateaus; it does not make each training step faster and risks halting before the model reaches its optimal performance.
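As a minimal sketch of the recommended approach: tf.distribute.MirroredStrategy replicates the model across all GPUs on one machine, while tf.distribute.MultiWorkerMirroredStrategy extends the same pattern across multiple machines (the setup Vertex AI uses for multi-node jobs). The model architecture below is a placeholder, not part of the question.

```python
import tensorflow as tf

# MirroredStrategy uses every GPU visible on this machine; on CPU-only
# hosts it falls back to a single replica, so the code still runs.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the strategy scope so they are
# mirrored across replicas and gradients are aggregated correctly.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu",
                               input_shape=(128, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(4),  # placeholder head, e.g. 4 classes
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# model.fit(train_dataset, epochs=...)  # each global batch is split
# evenly across replicas, so per-step wall-clock time drops as GPUs
# (or workers) are added.
```

For a multi-node Vertex AI job you would swap in MultiWorkerMirroredStrategy and configure the worker pool in the job spec; the training loop itself stays unchanged.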
Author: LeetQuiz Editorial Team
You are training an object detection machine learning model on a large dataset comprising three million X-ray images, each approximately 2 GB in size. The training is conducted on Vertex AI using a Compute Engine instance with 32 cores, 128 GB of RAM, and 1 NVIDIA P100 GPU. Despite the robust computational resources, you observe that the model training process is exceptionally slow. To optimize and reduce the training time while maintaining the model's performance, what should you do?
A
Increase the instance memory to 512 GB, and increase the batch size.
B
Replace the NVIDIA P100 GPU with a K80 GPU in the training job.
C
Enable early stopping in your Vertex AI Training job.
D
Use the tf.distribute.Strategy API and run a distributed training job.