
A Generative AI Engineer has built an LLM application using a pay-per-token Foundation Model API. As they prepare for production deployment, how can they ensure the model endpoint can handle high volumes of incoming requests?
A. Switch to using External Models instead
B. Throttle the incoming batch of requests manually to avoid rate-limiting issues
C. Change to a model with fewer parameters in order to reduce hardware constraint issues
D. Deploy the model using provisioned throughput, as it comes with performance guarantees
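For context, here is a minimal sketch of what option D can look like in practice, assuming the Databricks Python SDK (`databricks-sdk`) is installed and authenticated; the endpoint name, entity name, and throughput values below are illustrative placeholders, not values from the question.

```python
# Sketch: create a provisioned throughput serving endpoint with the Databricks SDK.
# Entity names and throughput ranges are placeholders; check your workspace's
# supported models and the optimization info for the actual tokens/sec bands.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput

w = WorkspaceClient()

w.serving_endpoints.create(
    name="llm-prod-endpoint",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="system.ai.meta_llama_v3_1_70b_instruct",  # placeholder model
                entity_version="1",
                # Provisioned throughput band (tokens/sec); guarantees capacity
                # for high request volumes, unlike pay-per-token endpoints.
                min_provisioned_throughput=0,
                max_provisioned_throughput=100,
            )
        ]
    ),
)
```

The key distinction the question targets: pay-per-token endpoints are rate-limited and shared, while provisioned throughput endpoints reserve dedicated capacity with performance guarantees, which is why option D addresses high request volumes in production.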