
A company is developing a generative AI (GenAI) application that analyzes customer service calls in real time and generates suggested responses for human customer service agents. The application must process 500,000 concurrent calls during peak hours with less than 200 ms end-to-end latency for each suggestion. The company uses existing architecture to transcribe customer call audio streams. The application must not exceed a predefined monthly compute budget and must maintain auto scaling capabilities.

Which solution will meet these requirements?

A. Deploy a large, complex reasoning model on Amazon Bedrock. Purchase provisioned throughput and optimize for batch processing.
B. Deploy a low-latency, real-time optimized model on Amazon Bedrock. Purchase provisioned throughput and set up automatic scaling policies.
C. Deploy a large language model (LLM) on an Amazon SageMaker real-time endpoint that uses dedicated GPU instances.
D. Deploy a mid-sized language model on an Amazon SageMaker serverless endpoint that is optimized for batch processing.

Explanation:
Option B is correct because it addresses all the key requirements:
- Latency: a low-latency, real-time optimized model on Amazon Bedrock is designed for the sub-200 ms inference each suggestion requires.
- Scale: automatic scaling policies let capacity track the 500,000-call concurrency peak.
- Cost: provisioned throughput buys a predictable amount of capacity, which keeps spend inside a fixed monthly compute budget.
- Operations: Bedrock is fully managed, so no model-serving infrastructure has to be built or maintained. A minimal sketch of this pattern follows.
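For illustration, here is a minimal Python (boto3) sketch of the Option B pattern, assuming the usual two-step flow: purchase Bedrock provisioned throughput for a real-time model, then invoke it per transcribed call turn. The model ID, model-unit count, names, and prompt are placeholder assumptions, and scaling the purchased model units is managed separately and not shown.

```python
import boto3

# Control-plane client: purchase dedicated capacity for a low-latency model.
bedrock = boto3.client("bedrock")

response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="call-suggestions-throughput",  # placeholder name
    modelId="amazon.titan-text-lite-v1",  # assumption: any low-latency Bedrock model
    modelUnits=2,  # placeholder; size to expected peak load
)
provisioned_model_arn = response["provisionedModelArn"]

# Data-plane client: generate a suggested reply for one transcribed call turn.
runtime = boto3.client("bedrock-runtime")

result = runtime.converse(
    modelId=provisioned_model_arn,  # invoke through the provisioned capacity
    messages=[{
        "role": "user",
        "content": [{"text": "Customer: My order arrived damaged. Suggest an agent reply."}],
    }],
    inferenceConfig={"maxTokens": 200, "temperature": 0.2},  # short outputs keep latency low
)
print(result["output"]["message"]["content"][0]["text"])
```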
Why other options are incorrect:
Option A: Optimizes for batch processing rather than real-time inference, and a large, complex reasoning model is unlikely to meet the sub-200 ms latency requirement.
Option C: SageMaker real-time endpoints on dedicated GPU instances can deliver low latency, but they require far more infrastructure management than Amazon Bedrock's fully managed service, and the option says nothing about provisioned throughput, cost controls, or automatic scaling policies (the sketch after this list illustrates the setup involved).
Option D: Serverless endpoints are optimized for intermittent traffic and are unlikely to sustain consistent low latency across 500,000 concurrent calls, and optimizing for batch processing directly contradicts the real-time requirement.
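To make the "more infrastructure management" point about Option C concrete, here is a hedged boto3 sketch of what a self-managed SageMaker deployment entails: you register the model, define and pay for GPU capacity, create the endpoint, and wire up scaling yourself through Application Auto Scaling. All resource names, the container image URI, model artifact location, instance type, and scaling targets below are placeholder assumptions.

```python
import boto3

sm = boto3.client("sagemaker")

# 1. Register the model: you supply the serving container and weights yourself.
sm.create_model(
    ModelName="llm-suggestions",
    PrimaryContainer={
        "Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/llm-serving:latest",  # placeholder
        "ModelDataUrl": "s3://example-bucket/model.tar.gz",  # placeholder
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
)

# 2. Choose and provision dedicated GPU capacity explicitly.
sm.create_endpoint_config(
    EndpointConfigName="llm-suggestions-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "llm-suggestions",
        "InstanceType": "ml.g5.2xlarge",  # placeholder GPU instance type
        "InitialInstanceCount": 2,
    }],
)

# 3. Create the real-time endpoint.
sm.create_endpoint(
    EndpointName="llm-suggestions",
    EndpointConfigName="llm-suggestions-config",
)

# 4. Scaling is a separate service you configure and tune yourself.
aas = boto3.client("application-autoscaling")
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/llm-suggestions/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=50,
)
aas.put_scaling_policy(
    PolicyName="llm-suggestions-scaling",
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/llm-suggestions/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # invocations per instance; needs load testing to tune
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```

Each of these resources then has to be monitored, patched, and capacity-planned, which is exactly the operational overhead a fully managed service removes.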
Key considerations for real-time GenAI applications:
- Choose a model sized and optimized for real-time inference rather than maximum reasoning depth.
- Reserve capacity (for example, Bedrock provisioned throughput) so latency stays predictable under peak load and spend stays within budget.
- Pair reserved capacity with automatic scaling so the application can absorb traffic peaks such as 500,000 concurrent calls.
- Prefer managed services when the team should focus on the application rather than on serving infrastructure.