
As an ML engineer at a mobile gaming company, you're tasked with deploying a TensorFlow model into a mobile app to enhance user experience by reducing game loading times. The model's current inference latency is 200ms, double the production target of 100ms. Management has approved a small accuracy decrease, up to 2%, to reach the target latency. Given these constraints, and without the option to retrain the model, which optimization technique should you prioritize to meet the latency goal? Choose the best option.
A
Model distillation: Training a smaller, faster model to mimic the behavior of the original model, which requires retraining and thus is not applicable here.
B
Dynamic range quantization: Reduces the precision of the model's weights from floating point to 8-bit integers, significantly decreasing latency and model size with minimal accuracy loss, and requires no retraining or calibration data.
C
Weight pruning: Eliminates unnecessary weights in the model to reduce size and latency, but may require fine-tuning to maintain accuracy, which involves retraining.
D
Dimensionality reduction: Reduces the number of features in the input data. This changes the model's expected input, which would itself require retraining, and it does not directly optimize the model's inference latency.
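
The correct answer, B, can be sketched with the TensorFlow Lite converter. The snippet below is a minimal illustration, assuming a trained Keras model stands in for the production model (the tiny two-layer network here is a placeholder, not the real game model); setting `optimizations = [tf.lite.Optimize.DEFAULT]` without a representative dataset applies dynamic range quantization, with no retraining step involved.

```python
import tensorflow as tf

# Placeholder for the already-trained production model (assumption:
# the real model is loaded, e.g. via tf.keras.models.load_model).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Dynamic range quantization: Optimize.DEFAULT with no representative
# dataset quantizes weights to int8 at conversion time. No retraining,
# no calibration data required.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# tflite_model is a flatbuffer (bytes) ready to ship inside the app.
print(type(tflite_model), len(tflite_model))
```

The resulting `.tflite` flatbuffer is what the mobile app bundles and runs through the TFLite interpreter; full integer or float16 quantization are alternatives, but they either need calibration data or yield smaller latency gains on CPU.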