
Answer-first summary for fast verification
Answer: Reduce the batch size of instances in each prediction request to decrease memory consumption.
Analyzing each option in the context of the given scenario:

- **Request a quota increase for the number of prediction requests to handle more traffic:** This does not address the 'Out of Memory' error. The issue is memory usage per request, not the number of requests the endpoint is allowed to receive.
- **Switch to batch prediction mode for all prediction requests to reduce memory usage:** Batch prediction is asynchronous and introduces latency, which contradicts the requirement for real-time recommendations.
- **Reduce the batch size of instances in each prediction request to decrease memory consumption:** This is the most effective solution. A smaller batch reduces the memory footprint of each request, directly addressing the 'Out of Memory' error while preserving real-time serving.
- **Implement base64 encoding for the input data to reduce the size before sending it for prediction:** Base64 encoding does not compress data; it inflates it by roughly one third, which would worsen the memory issue.

Therefore, the optimal solution is to **reduce the batch size of instances in each prediction request to decrease memory consumption**, as it resolves the error while meeting the requirements of minimizing latency and cost and ensuring scalability.
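In practice, reducing the batch size can be done client-side by splitting a large instance list into smaller per-request chunks. A minimal sketch, assuming a `predict_fn` callable that wraps a single prediction request (for example, a Vertex AI endpoint's `predict()` call) and an illustrative batch size of 32; both names and the size are assumptions, not values from the question:

```python
def predict_in_batches(predict_fn, instances, batch_size=32):
    """Send instances in small chunks so each request stays within memory limits.

    predict_fn: callable taking a list of instances and returning a list of
    predictions (assumed here to wrap one online prediction request).
    """
    predictions = []
    for start in range(0, len(instances), batch_size):
        chunk = instances[start:start + batch_size]
        # One smaller request per chunk: peak memory per request drops,
        # while the endpoint still serves online (real-time) traffic.
        predictions.extend(predict_fn(chunk))
    return predictions
```

Each request now carries at most `batch_size` instances, so the serving container's per-request memory footprint shrinks without switching away from online prediction.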
Author: LeetQuiz Editorial Team
You are working as a Machine Learning Engineer for a retail company that has deployed a high-demand recommendation model on Vertex AI for real-time inference. The model is crucial for providing personalized recommendations to users in real-time. However, during peak hours, you encounter an 'Out of Memory' error during online prediction requests. The company emphasizes minimizing latency and cost while ensuring scalability. Given these constraints, what is the best course of action to resolve the 'Out of Memory' error without compromising the real-time inference capability? Choose the best option.
A. Request a quota increase for the number of prediction requests to handle more traffic.
B. Switch to batch prediction mode for all prediction requests to reduce memory usage.
C. Reduce the batch size of instances in each prediction request to decrease memory consumption.
D. Implement base64 encoding for the input data to reduce the size before sending it for prediction.