You have a machine learning model built with scikit-learn that has been deployed to a Vertex AI endpoint. While testing the model against live production traffic, you notice that the number of requests per hour is double what you initially anticipated. To ensure that the endpoint scales efficiently and that users do not experience high latency, what action should you take to handle potential future increases in demand?
Explanation:
The correct answer is B. Configuring an appropriate minReplicaCount based on expected baseline traffic ensures that enough replicas are always running to serve that baseline efficiently. Vertex AI's built-in autoscaling can then provision additional replicas automatically as demand rises (up to maxReplicaCount), maintaining performance and preventing high latency. Options A, C, and D are less suitable: they either introduce redundant resources, risk under-provisioning during peak traffic, or incur unnecessary cost without directly addressing the scaling need.
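As a rough illustration of the recommended approach, the replica bounds can be set when deploying the model to the endpoint with the gcloud CLI. The endpoint ID, model ID, machine type, and replica counts below are placeholder assumptions, not values from the question:

```shell
# Sketch only: deploy a model to an existing Vertex AI endpoint with
# autoscaling bounds. min-replica-count covers the expected baseline load;
# Vertex AI scales out toward max-replica-count as traffic increases.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=sklearn-model-deployment \
  --machine-type=n1-standard-4 \
  --min-replica-count=2 \
  --max-replica-count=10
```

Choosing min-replica-count from observed baseline traffic (rather than leaving it at 1) avoids cold-start latency spikes, while max-replica-count caps cost during unexpected surges.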