
Google Professional Machine Learning Engineer
You have trained a Deep Neural Network (DNN) regressor using TensorFlow to predict housing prices from features such as location, size, and amenities. The model was built with tf.float64 precision using TensorFlow's Estimator API, and it showed good predictive performance during evaluation. Before deploying to production, however, you observed a serving latency of 10 milliseconds at the 90th percentile on CPUs, while the production environment requires a latency of no more than 8 milliseconds at the 90th percentile. You are prepared to accept a slight reduction in predictive accuracy to meet this latency requirement. You need to quickly reduce the serving latency and then evaluate the impact on model accuracy. What should you try first to achieve the desired latency?
Explanation:
The correct answer is B: Apply quantization to your SavedModel by reducing the floating point precision to tf.float16. Quantization is a model optimization technique that lowers the numerical precision of the model's weights, which shrinks the model size and reduces the compute and memory bandwidth needed at inference time; this typically yields faster loading and execution, and therefore lower latency. Unlike switching from CPU to GPU, which introduces additional hardware requirements and potentially more complex deployment changes, quantization can be applied directly within the TensorFlow framework relatively quickly. The accuracy loss from reducing precision to tf.float16 is generally minimal, making this the most suitable first step toward meeting the latency requirement.
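For reference, here is a minimal sketch of one common way to apply post-training float16 quantization to an exported SavedModel, using the TensorFlow Lite converter. The export directory, output file name, and input shape below are hypothetical placeholders, not values from the question:

```python
import numpy as np
import tensorflow as tf

# Convert the exported SavedModel with post-training float16 quantization.
# "exported_model/1681234567" is a hypothetical Estimator export directory.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/1681234567")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_model = converter.convert()

with open("housing_regressor_fp16.tflite", "wb") as f:
    f.write(tflite_model)

# Sanity-check the quantized model with the TFLite interpreter
# before re-running the full evaluation set.
interpreter = tf.lite.Interpreter(model_path="housing_regressor_fp16.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]

# Feed one random sample shaped like the model's input (placeholder data).
sample = np.random.rand(*input_details["shape"]).astype(input_details["dtype"])
interpreter.set_tensor(input_details["index"], sample)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details["index"])
```

After conversion, you would re-run the evaluation set through the quantized model and compare its regression metric (for example, RMSE) against the original float64 model, confirming both the latency gain and the acceptability of any accuracy loss before promoting it to production.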