
Answer-first summary for fast verification
Answer: D and E.
D: A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM, designed for memory-intensive deep learning workloads and offering superior scalability.
E: A combination of a cluster with 2 a2-megagpu-16g machines for training and a separate n1-highcpu-64 machine with a v2-8 TPU for inference, optimizing both training and inference phases.
The correct hardware setup must accommodate both the model's size (20 GB of weights and embeddings) and the large batch size (1024 examples at roughly 1 MB each, i.e. about 1 GB of data per batch), while remaining cost-efficient and scalable for future model expansions. GPUs are preferred for their acceleration of deep learning training. Option D, with 2 a2-megagpu-16g machines, provides 640 GB of GPU memory per machine (16 A100 GPUs at 40 GB each), comfortably exceeding the model's 20 GB footprint, plus 1.4 TB of RAM per machine to stage batch data efficiently. This machine type is designed specifically for memory-intensive deep learning workloads, making it the best fit for the scenario described. Option E refines this approach by separating training and inference onto dedicated hardware, which can improve efficiency in both phases but at higher cost and operational complexity.
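The sizing argument above can be verified with a quick back-of-envelope calculation. The figures (1 MB per example, batch of 1024, 20 GB model, stated GPU-memory totals) come from the question; the per-GPU memory values derived from those totals, and the assumption that under plain data parallelism each GPU must hold the full model, are this sketch's own working assumptions, not part of the original question:

```python
# Back-of-envelope check of the hardware options using figures from the
# question. Per-GPU memory (16 GB V100, 40 GB A100) is inferred from the
# stated per-machine totals; "one GPU must hold the full model" is the
# data-parallel baseline assumed here.

example_mb = 1          # ~1 MB per training example
batch_size = 1024       # examples per batch
model_gb = 20           # weights + embeddings

batch_gb = example_mb * batch_size / 1024
print(f"Data per batch: {batch_gb:.1f} GB")  # -> 1.0 GB

per_gpu_gb = {
    "Tesla V100 (option B)": 128 // 8,      # 128 GB across 8 GPUs -> 16 GB
    "Tesla A100 (options D/E)": 640 // 16,  # 640 GB across 16 GPUs -> 40 GB
}

for gpu, mem in per_gpu_gb.items():
    # Each data-parallel replica needs the full 20 GB model, plus
    # headroom for gradients, optimizer state, and activations.
    print(f"{gpu}: {mem} GB per GPU, holds the 20 GB model: {mem >= model_gb}")
```

On these numbers, a single 16 GB V100 cannot even hold the 20 GB model without model parallelism, while a 40 GB A100 holds it with room for gradients and activations, which is the core reason option D dominates option B.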
Author: LeetQuiz Editorial Team
In your role as a Machine Learning Engineer at a biotech startup, your team is pioneering the development of deep learning models inspired by biological organisms. This innovative approach requires the creation of custom TensorFlow operations in C++ and the training of models on datasets with exceptionally large batch sizes. Given that each example in your dataset is approximately 1MB in size, the average network size (including weights and embeddings) is 20GB, and a typical batch size is 1024 examples, which hardware setup would best support your models while considering cost-efficiency and scalability for future model expansions? Choose the best option.
A
A cluster with 4 n1-highcpu-96 machines, each with 96 vCPUs and 86 GB RAM, suitable for CPU-intensive tasks but may lack the necessary GPU acceleration for deep learning.
B
A cluster with 2 n1-highcpu-64 machines, each with 8 NVIDIA Tesla V100 GPUs (128 GB GPU memory in total), and a n1-highcpu-64 machine with 64 vCPUs and 58 GB RAM, offering a balance between GPU acceleration and CPU resources.
C
A cluster with an n1-highcpu-64 machine with a v2-8 TPU and 64 GB RAM, providing specialized hardware for tensor operations but may limit scalability due to fixed memory.
D
A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM, designed for memory-intensive deep learning workloads and offering superior scalability.
E
A combination of a cluster with 2 a2-megagpu-16g machines for training and a separate n1-highcpu-64 machine with a v2-8 TPU for inference, optimizing both training and inference phases.