
Answer-first summary for fast verification
Answer: Decrease the CPU utilization target in the autoscaling configurations
The question involves a Vertex AI endpoint whose memory-intensive preprocessing causes it to fail to autoscale under multiple concurrent requests. Vertex AI online prediction autoscaling is driven by CPU utilization (and GPU duty cycle on accelerator-backed deployments), not by memory usage. Option D (decrease the CPU utilization target) is correct because a lower target makes the endpoint add replicas earlier as CPU usage climbs, so extra capacity is available before memory-intensive requests pile up on a single replica. Option A (a machine type with more memory) raises per-replica capacity but does not fix the scaling behavior: if CPU utilization stays below the target, the endpoint still will not scale out. Option B (decrease the number of workers per machine) reduces per-machine concurrency rather than addressing the autoscaling trigger. Option C (increase the CPU utilization target) would make the problem worse by delaying scale-out until CPU usage is even higher, degrading performance during memory-intensive requests. Community discussion supports D with 74% consensus, emphasizing that the autoscaling settings, not the machine specs, must be tuned to respond to fluctuating demand.
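As a sketch of what adjusting the target looks like, here is a hedged example using the `gcloud` CLI; the endpoint and model IDs, region, and display name are placeholders, and the target value of 50 is illustrative (lower than Vertex AI's default CPU target of 60%):

```shell
# Illustrative deployment with a lowered CPU utilization target so the
# endpoint scales out earlier. ENDPOINT_ID and MODEL_ID are placeholders.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=preprocessing-model \
  --machine-type=n1-standard-4 \
  --min-replica-count=1 \
  --max-replica-count=10 \
  --autoscaling-metric-specs=cpu-usage=50
```

With `cpu-usage=50`, Vertex AI adds replicas (up to `--max-replica-count`) once average CPU utilization across replicas exceeds 50%, rather than waiting for the higher default threshold.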
Author: LeetQuiz Editorial Team
You have deployed a custom model with multiple memory-intensive preprocessing steps to a Vertex AI endpoint. While initial validation showed acceptable latency, the endpoint fails to autoscale properly under multiple concurrent requests. What steps should you take to resolve this?
A
Use a machine type with more memory
B
Decrease the number of workers per machine
C
Increase the CPU utilization target in the autoscaling configurations
D
Decrease the CPU utilization target in the autoscaling configurations