
You work for a small company that uses Vertex AI to deploy Machine Learning models with autoscaling capabilities for serving online predictions in a production environment. The currently deployed model handles approximately 20 prediction requests per hour, maintaining an average response time of one second. Recently, you retrained this model using a new batch of data. To ensure the updated model performs well under real production conditions, you initiated a canary test, directing about 10% of the production traffic to the new model. During this testing phase, you observed that prediction requests to the new model are taking significantly longer to complete, often between 30 and 180 seconds. What steps should you take to address this issue?
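For context, a canary traffic split of the kind described in the scenario might be set up with the Vertex AI Python SDK roughly as sketched below. This is an illustrative sketch only, not part of the question; the project, endpoint, model, and deployed-model IDs are placeholders.

```python
from google.cloud import aiplatform

# Hypothetical project, region, and resource IDs used purely for illustration.
aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("1234567890")   # existing production endpoint (placeholder ID)
new_model = aiplatform.Model("9876543210")     # newly retrained model (placeholder ID)

# Deploy the retrained model alongside the current one and route ~10% of
# traffic to it; the key "0" refers to the model being deployed in this call.
endpoint.deploy(
    model=new_model,
    deployed_model_display_name="model-canary",
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=2,
    traffic_split={"1111111111": 90, "0": 10},  # existing deployed model ID keeps 90%
)
```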
A. Submit a request to raise your project quota to ensure that multiple prediction services can run concurrently.
B. Turn off autoscaling for the online prediction service of your new model. Use manual scaling with one node always available.
C. Remove your new model from the production environment. Compare the code of the new and existing models to identify the cause of the performance bottleneck.
D. Remove your new model from the production environment. For a short trial period, send all incoming prediction requests to BigQuery. Request batch predictions from your new model, and then use the Data Labeling Service to validate your model’s performance before promoting it to production.
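For reference, the approach described in option B (manual scaling with one node always available) can be approximated in the Vertex AI SDK by setting the minimum and maximum replica counts to the same value, so the autoscaler never changes the node count and a warm node is always available to serve requests. This is a sketch under the same placeholder IDs as above, not a definitive implementation.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint("1234567890")   # placeholder endpoint ID
new_model = aiplatform.Model("9876543210")     # placeholder model ID

# Pin the canary deployment to exactly one prediction node by making the
# minimum and maximum replica counts equal, so the very low traffic volume
# (~20 requests per hour) never leads to scaling activity between requests.
endpoint.deploy(
    model=new_model,
    deployed_model_display_name="model-canary-pinned",
    machine_type="n1-standard-2",
    min_replica_count=1,
    max_replica_count=1,
    traffic_split={"1111111111": 90, "0": 10},  # existing deployed model ID keeps 90%
)
```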