
Answer-first summary for fast verification
Answer: C. Apply post-training quantization to your SavedModel by reducing the floating-point precision from tf.float64 to tf.float16 to decrease model size and accelerate inference.
The most effective first step to reduce serving latency with minimal performance impact, especially under budget constraints, is post-training quantization: reducing precision from tf.float64 to tf.float16. This roughly halves the model's size and memory footprint, enables faster computation on float16-capable hardware such as GPUs and TPUs, and typically has a negligible effect on accuracy in regression tasks.

Why the other options fall short:
A. Switching from CPU to GPU serving adds infrastructure cost and may not meaningfully reduce latency for small models or low batch sizes, which conflicts with the stated budget constraint.
B. Dropout is a regularization technique used during training to prevent overfitting; raising the rate to 0.8 does nothing to reduce inference latency and would likely degrade predictive performance.
D. Enabling dropout in _PREDICT mode is incorrect: dropout must be disabled at inference so predictions are deterministic and reliable, and the dropout rate is a model property, not a TensorFlow Serving configuration setting.
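The recommended step can be sketched with TensorFlow Lite's post-training float16 quantization path. This is a minimal sketch assuming TensorFlow 2.x; `TinyRegressor` and the temporary directory are illustrative stand-ins for the trained housing-price DNN, and the sketch uses float32 variables for brevity (the question's tf.float64 model would be exported the same way).

```python
import tempfile

import tensorflow as tf


class TinyRegressor(tf.Module):
    """Illustrative stand-in for the trained housing-price regressor."""

    def __init__(self):
        super().__init__()
        self.w1 = tf.Variable(tf.random.normal([128, 128]))
        self.b1 = tf.Variable(tf.zeros([128]))
        self.w2 = tf.Variable(tf.random.normal([128, 1]))

    @tf.function(input_signature=[tf.TensorSpec([None, 128], tf.float32)])
    def __call__(self, x):
        h = tf.nn.relu(tf.matmul(x, self.w1) + self.b1)
        return tf.matmul(h, self.w2)


model = TinyRegressor()
saved_dir = tempfile.mkdtemp()
tf.saved_model.save(
    model, saved_dir, signatures=model.__call__.get_concrete_function()
)

# Post-training float16 quantization: weights are stored in half precision,
# roughly halving model size; no retraining or calibration data is needed.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16 = converter.convert()
```

Because the quantization happens after training, it requires no changes to the training pipeline, which is what makes it the cheapest first step under the budget constraint.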
Author: LeetQuiz Editorial Team
You are a Machine Learning Engineer at a real estate technology company. You've developed a Deep Neural Network (DNN) regressor using TensorFlow to predict housing prices based on various features. The model was trained with tf.float64 precision and performs satisfactorily in terms of accuracy. However, before deploying the model to production, your team has identified a critical requirement to reduce the model's serving latency from 10 ms at the 90th percentile to 8 ms at the 90th percentile, with minimal impact on the model's predictive performance. The solution must also consider cost constraints, as the company is operating on a tight budget. Given these constraints, what is the first step you should take to achieve the desired reduction in serving latency? Choose the best option.
A
Switch from CPU to GPU serving to leverage the parallel processing capabilities of GPUs for faster inference.
B
Increase the dropout rate to 0.8 during training to reduce model complexity and potentially decrease inference time.
C
Apply quantization to your SavedModel by reducing the floating point precision from tf.float64 to tf.float16 to decrease model size and accelerate inference.
D
Modify the TensorFlow Serving settings to increase the dropout rate to 0.8 in _PREDICT mode, aiming to reduce inference time by simplifying the model during prediction.
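The flaw shared by options B and D can be verified directly: a Keras dropout layer is active only when called with `training=True`, and acts as an identity operation at inference, so raising the rate cannot reduce prediction latency. A minimal sketch (the 0.8 rate matches the options; the input shape is arbitrary):

```python
import numpy as np
import tensorflow as tf

# Dropout only fires in training mode; at inference it passes inputs through.
layer = tf.keras.layers.Dropout(rate=0.8)
x = tf.ones([1, 1000])

train_out = layer(x, training=True)   # ~80% of units zeroed, rest scaled by 1/(1 - 0.8)
infer_out = layer(x, training=False)  # identity: output equals the input
```

Since the inference-mode graph contains no dropout computation at all, there is nothing for a higher rate to "simplify" at serving time.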