
Answer-first summary for fast verification
Answer: Dynamic range quantization
The correct answer is B: Dynamic range quantization. This technique reduces inference latency without retraining the model: the trained weights are converted from 32-bit floats to 8-bit integers at conversion time, and activations are quantized dynamically at inference. This typically shrinks the model to roughly a quarter of its size and significantly speeds up inference, at the cost of only a small accuracy loss. In contrast, weight pruning, model distillation, and dimensionality reduction all typically require retraining (or at least fine-tuning) the model, which contradicts the requirement of not training a new model.
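As a sketch, dynamic range quantization is applied with the TensorFlow Lite converter by setting the default optimization flag. The tiny Keras model below is a stand-in; in practice you would load the data scientist's trained model instead.

```python
import tensorflow as tf

# Tiny stand-in model; in practice, load the trained model instead,
# e.g. tf.keras.models.load_model(...) or TFLiteConverter.from_saved_model(...).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2),
])

# Dynamic range quantization: Optimize.DEFAULT stores weights as 8-bit
# integers; activations are quantized on the fly at inference time.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The resulting flatbuffer can be bundled with the mobile app.
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```

No representative dataset or retraining step is needed, which is what distinguishes this from full integer quantization and from the other options in the question.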
Author: LeetQuiz Editorial Team
You are an ML engineer at a mobile gaming company. A data scientist on your team recently trained a TensorFlow model for a mobile game application. Your task is to deploy this model into the mobile app to optimize its performance. Despite the model's good accuracy, you find that the inference latency does not meet the stringent production requirements of the mobile application. To ensure a smooth user experience, you need to reduce the inference time by 50%. Accepting a slight decrease in model accuracy is acceptable to meet the latency requirement. Without retraining a new model, which model optimization technique should you try first to reduce latency?
A. Weight pruning
B. Dynamic range quantization
C. Model distillation
D. Dimensionality reduction