
Answer-first summary for fast verification
Answer: B — A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM
The best hardware configuration for your models is option B: 'A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM.' Two constraints drive the choice. First, custom TensorFlow operations written in C++ cannot run on TPUs, which only execute ops compiled for the TPU runtime; this rules out option C. Training deep networks of this size on CPUs alone would be prohibitively slow, which rules out option D. Second, the workload needs substantial accelerator memory: the model is about 20 GB of weights and embeddings, and each batch is roughly 1 GB (1024 examples at ~1 MB each), before accounting for activations, gradients, and optimizer state. Option B's 640 GB of GPU memory per machine accommodates this comfortably, whereas option A offers only 128 GB per machine across its V100s. The 16 A100 GPUs per machine accelerate training, while the 96 vCPUs and 1.4 TB of RAM keep the input pipeline and host-side processing from becoming a bottleneck.
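The memory reasoning above can be checked with a quick back-of-the-envelope calculation. This is a sketch, not a sizing tool: the per-GPU capacities (16 GB for a Tesla V100, 40 GB for an A100 in a2-megagpu-16g) are assumed from the totals stated in the options, and training-time overhead such as activations and optimizer state is deliberately ignored.

```python
# Back-of-the-envelope memory math using the numbers from the question.
# Assumption: only the weights and one in-flight batch are counted;
# activations, gradients, and optimizer state (which can add a
# several-fold overhead) are ignored.

MB_PER_EXAMPLE = 1         # each example is ~1 MB
EXAMPLES_PER_BATCH = 1024  # examples per batch
MODEL_GB = 20              # weights + embeddings

batch_gb = MB_PER_EXAMPLE * EXAMPLES_PER_BATCH / 1024  # ~1 GB per batch

# Total GPU memory per machine for the two GPU-backed options
# (per-GPU sizes inferred from the stated totals):
option_a_gpu_gb = 8 * 16   # 8 x V100 at 16 GB each = 128 GB
option_b_gpu_gb = 16 * 40  # 16 x A100 at 40 GB each = 640 GB

print(f"per-batch data: {batch_gb:.1f} GB, model: {MODEL_GB} GB")
print(f"option A GPU memory per machine: {option_a_gpu_gb} GB")
print(f"option B GPU memory per machine: {option_b_gpu_gb} GB")
```

Even under these optimistic assumptions, the ~21 GB working set leaves little headroom once replicated across data-parallel workers on option A's 128 GB, while option B's 640 GB per machine absorbs it easily.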
Author: LeetQuiz Editorial Team
You work for a biotech startup that focuses on developing cutting-edge deep learning models based on the properties of biological organisms. Your team often engages in early-stage experimental phases with novel ML model architectures and frequently writes custom TensorFlow operations in C++. The models are trained on extensive datasets with substantial batch sizes, where a typical batch contains 1024 examples and each example is approximately 1 MB in size. The average size of an entire network, including all weights and embeddings, is around 20 GB. Considering these requirements, which hardware configuration would be the most suitable for your models?
A. A cluster with 2 n1-highcpu-64 machines, each with 8 NVIDIA Tesla V100 GPUs (128 GB GPU memory in total), and an n1-highcpu-64 machine with 64 vCPUs and 58 GB RAM
B. A cluster with 2 a2-megagpu-16g machines, each with 16 NVIDIA Tesla A100 GPUs (640 GB GPU memory in total), 96 vCPUs, and 1.4 TB RAM
C. A cluster with an n1-highcpu-64 machine with a v2-8 TPU and 64 GB RAM
D. A cluster with 4 n1-highcpu-96 machines, each with 96 vCPUs and 86 GB RAM