
Answer-first summary for fast verification
Answer: Deploy the model on a Vertex AI endpoint manually by creating a custom inference container.
The question requires full control over infrastructure and minimal inference time. Option A (deploying on a Vertex AI endpoint with a custom inference container) provides the best balance: it gives full control over the container environment (dependencies, runtime, and hardware such as GPUs) while leveraging Vertex AI's managed, optimized serving infrastructure for low latency. This is supported by the community discussion, where option A holds 75% of the vote and the top-voted comment highlights its benefits for control and latency. Option D (a manual GKE deployment with a custom YAML manifest) also offers control, but it lacks Vertex AI's managed serving optimizations and adds operational overhead without guaranteed latency improvements. Options B and C rely on Model Garden deployments, which provide less control over the infrastructure and may apply default configurations that are not tuned for low latency, as noted in the discussion.
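A rough sketch of what option A looks like in practice, using the `gcloud ai` CLI. The project ID, Artifact Registry path, image name, routes, port, endpoint/model IDs, and machine/accelerator types below are all placeholders chosen for illustration, not values from the question:

```shell
# Hypothetical sketch of a custom-container Vertex AI deployment.
# All IDs, paths, and hardware choices are placeholder assumptions.

# 1. Build and push the custom serving image for Gemma.
docker build -t us-central1-docker.pkg.dev/PROJECT_ID/repo/gemma-serve:latest .
docker push us-central1-docker.pkg.dev/PROJECT_ID/repo/gemma-serve:latest

# 2. Upload the model to the Vertex AI Model Registry with the custom container.
gcloud ai models upload \
  --region=us-central1 \
  --display-name=gemma-custom \
  --container-image-uri=us-central1-docker.pkg.dev/PROJECT_ID/repo/gemma-serve:latest \
  --container-ports=8080 \
  --container-predict-route=/predict \
  --container-health-route=/health

# 3. Create an endpoint and deploy the model with GPU acceleration for low latency.
gcloud ai endpoints create --region=us-central1 --display-name=gemma-endpoint
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=us-central1 \
  --model=MODEL_ID \
  --display-name=gemma-deployment \
  --machine-type=g2-standard-12 \
  --accelerator=type=nvidia-l4,count=1
```

The custom container is where the "full control" requirement is satisfied: you choose the serving framework, dependencies, and routes, while Vertex AI handles autoscaling and endpoint management.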
Author: LeetQuiz Editorial Team
You are an ML researcher at an investment bank experimenting with the Gemma large language model (LLM) for an internal use case. You require full control over the model's underlying infrastructure and need to minimize the model's inference time. Which serving configuration should you use?
A
Deploy the model on a Vertex AI endpoint manually by creating a custom inference container.
B
Deploy the model on a Google Kubernetes Engine (GKE) cluster by using the deployment options in Model Garden.
C
Deploy the model on a Vertex AI endpoint by using one-click deployment in Model Garden.
D
Deploy the model on a Google Kubernetes Engine (GKE) cluster manually by creating a custom YAML manifest.