
You are a Machine Learning Engineer at a tech company that has developed a model using AI Platform. The model is now moving to production, where it serves a few thousand queries per second, but you are encountering latency issues. The current setup uses a load balancer to distribute requests across multiple Kubeflow CPU-only pods on Google Kubernetes Engine (GKE). Given that you cannot alter the underlying infrastructure, which of the following strategies would be MOST effective in improving serving latency? (Choose one correct option)
A. Switch to the tensorflow-model-server-universal version of TensorFlow Serving to leverage its broader compatibility features.
B. Increase the max_enqueued_batches parameter in TensorFlow Serving significantly to allow more batches to be processed simultaneously, hoping to reduce latency.
C. Recompile TensorFlow Serving from source to include CPU-specific optimizations and instruct GKE to select an appropriate baseline minimum CPU platform for the serving nodes to ensure the availability of necessary CPU features.
D. Significantly increase the max_batch_size parameter in TensorFlow Serving to process larger batches at once, aiming to decrease the overall processing time.
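
For context on options B and D: TensorFlow Serving's request batching is configured through a batching-parameters file passed to tensorflow_model_server, which is where max_batch_size and max_enqueued_batches are set. Below is a minimal Python sketch of how such a file might be written and the server launched with batching enabled; the parameter values, model name, and paths are illustrative assumptions, not values taken from the question.

```python
import subprocess
import textwrap

# Illustrative batching parameters in the text-protobuf format that
# tensorflow_model_server reads via --batching_parameters_file.
# max_batch_size and max_enqueued_batches are the fields referenced
# in options B and D; the values here are placeholders.
batching_config = textwrap.dedent("""\
    max_batch_size { value: 32 }
    batch_timeout_micros { value: 2000 }
    max_enqueued_batches { value: 100 }
    num_batch_threads { value: 8 }
""")

with open("/tmp/batching_parameters.txt", "w") as f:
    f.write(batching_config)

# Hypothetical model name and base path; replace with your exported SavedModel.
subprocess.run([
    "tensorflow_model_server",
    "--rest_api_port=8501",
    "--model_name=my_model",
    "--model_base_path=/models/my_model",
    "--enable_batching=true",
    "--batching_parameters_file=/tmp/batching_parameters.txt",
])
```

Larger batches and deeper queues generally trade per-request latency for throughput, which is one reason tuning these values alone may not resolve a latency problem.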
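For context on option C: building TensorFlow Serving from source lets the binary use CPU instruction sets such as AVX2 and FMA, and GKE's minimum CPU platform setting ensures the serving nodes actually provide them. The sketch below shows the two steps under stated assumptions: the bazel optimization flags, cluster name, node-pool name, zone, and CPU platform are all hypothetical placeholders.

```python
import subprocess

# 1) Recompile TensorFlow Serving with CPU-specific optimizations
#    (AVX2/FMA here are illustrative; match the flags to your target CPUs).
subprocess.run([
    "bazel", "build", "-c", "opt",
    "--copt=-mavx2", "--copt=-mfma",
    "tensorflow_serving/model_servers:tensorflow_model_server",
], check=True)

# 2) Create a GKE node pool pinned to a minimum CPU platform so every node
#    supports the instruction sets the optimized binary was compiled for.
subprocess.run([
    "gcloud", "container", "node-pools", "create", "optimized-serving-pool",
    "--cluster=my-serving-cluster",
    "--zone=us-central1-a",
    "--min-cpu-platform=Intel Skylake",
], check=True)
```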