
Answer-first summary for fast verification
Answer: Configure the machines of the first two worker pools to have GPUs and to use a container image where your training code runs. Configure the third worker pool to use the reductionserver container image without accelerators, and choose a machine type that prioritizes bandwidth.
The correct answer is B because it matches how the Reduction Server strategy is meant to be deployed. The first two worker pools run the training computation for the custom language model, so they need GPUs and the container image that holds your training code. The third worker pool hosts the Reduction Server replicas, whose job is to aggregate gradients and exchange them with the workers rather than to compute them, so it needs no accelerators. It should use the reductionserver container image on a machine type chosen for high network bandwidth, which speeds up gradient exchange. Community discussion unanimously supports B (100% consensus), noting that Reduction Server is designed for GPU-based training, not TPUs, and that reducer nodes benefit from bandwidth rather than accelerators. Options A and D attach GPUs or TPUs to the Reduction Server pool, which wastes resources. Options C and D put TPUs on the training workers, which the Reduction Server strategy does not support.
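To make the layout in option B concrete, here is a minimal sketch of the three worker pool specs as plain Python dictionaries, in the shape Vertex AI custom jobs accept. The helper name `build_worker_pool_specs`, the machine types, the GPU type, the replica counts, and the training image URI are illustrative assumptions, not values from the question. The Reduction Server image URI is the one published in the Vertex AI documentation at the time of writing and should be verified before use.

```python
from typing import Any, Dict, List

# Assumption: Reduction Server image URI as listed in the Vertex AI docs;
# check the current documentation before relying on it.
REDUCTION_SERVER_IMAGE = (
    "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest"
)


def build_worker_pool_specs(
    train_image: str,
    worker_replicas: int,
    reducer_replicas: int,
    gpu_machine_type: str = "n1-standard-16",   # illustrative choice
    gpu_type: str = "NVIDIA_TESLA_V100",        # illustrative choice
    gpus_per_machine: int = 2,                   # illustrative choice
    reducer_machine_type: str = "n1-highcpu-16", # CPU-only, chosen for bandwidth
) -> List[Dict[str, Any]]:
    """Return three pool specs: chief, workers, and Reduction Server replicas."""

    def gpu_pool(replicas: int) -> Dict[str, Any]:
        # Pools 0 and 1 run the training code, so they get GPUs.
        return {
            "machine_spec": {
                "machine_type": gpu_machine_type,
                "accelerator_type": gpu_type,
                "accelerator_count": gpus_per_machine,
            },
            "replica_count": replicas,
            "container_spec": {"image_uri": train_image},
        }

    # Pool 2 only aggregates gradients: no accelerator fields at all.
    reducer_pool = {
        "machine_spec": {"machine_type": reducer_machine_type},
        "replica_count": reducer_replicas,
        "container_spec": {"image_uri": REDUCTION_SERVER_IMAGE},
    }

    # The first pool holds the single chief replica.
    return [gpu_pool(1), gpu_pool(worker_replicas), reducer_pool]


specs = build_worker_pool_specs(
    train_image="us-docker.pkg.dev/my-project/my-repo/trainer:latest",  # hypothetical
    worker_replicas=3,
    reducer_replicas=4,
)
```

A list like `specs` could then be passed as the `worker_pool_specs` argument when creating a custom job with the Vertex AI SDK. The training code itself must also use a GPU all-reduce path (NCCL) for the reducers to take part, which this sketch does not cover.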
Author: LeetQuiz Editorial Team
You are training a custom language model on Vertex AI using a large dataset and plan to employ the Reduction Server strategy. How should you configure the worker pools for the distributed training job?
A
Configure the machines of the first two worker pools to have GPUs, and to use a container image where your training code runs. Configure the third worker pool to have GPUs, and use the reductionserver container image.
B
Configure the machines of the first two worker pools to have GPUs and to use a container image where your training code runs. Configure the third worker pool to use the reductionserver container image without accelerators, and choose a machine type that prioritizes bandwidth.
C
Configure the machines of the first two worker pools to have TPUs and to use a container image where your training code runs. Configure the third worker pool to use the reductionserver container image without accelerators, and choose a machine type that prioritizes bandwidth.
D
Configure the machines of the first two pools to have TPUs, and to use a container image where your training code runs. Configure the third pool to have TPUs, and use the reductionserver container image.