
Explanation:
Option B is the correct answer because it addresses both the performance bottlenecks and the multi-Region requirement:
Token batching: Grouping multiple requests together reduces per-call API overhead, which improves throughput and lowers latency during peak periods.
Cross-Region inference profiles: These automatically route traffic across the Regions covered by the profile, so the surge from 10,000 to 30,000 requests per hour is absorbed by capacity in multiple Regions.
The 2-second response time requirement: Cross-Region routing sends each request to a Region with available capacity, which keeps responses within the SLA even at peak load. A minimal sketch of both techniques follows this list.
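The sketch below, in Python with boto3, illustrates both ideas under stated assumptions: it uses the Bedrock Runtime Converse API, and the cross-Region inference profile ID shown is illustrative (the geography prefix such as "us." marks a multi-Region profile; verify the exact IDs available in your account).

```python
import boto3
from concurrent.futures import ThreadPoolExecutor

# Entry-point Region only; with a cross-Region inference profile,
# Bedrock can route each request to any Region the profile covers.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

# Illustrative cross-Region inference profile ID (assumption: check
# your account for the actual profile IDs available to you).
PROFILE_ID = "us.anthropic.claude-3-haiku-20240307-v1:0"

def invoke_one(prompt: str) -> str:
    """One Converse call, routed through the inference profile."""
    response = bedrock.converse(
        # The profile ID is passed where a model ID normally goes.
        modelId=PROFILE_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 256},
    )
    return response["output"]["message"]["content"][0]["text"]

def invoke_batch(prompts: list[str], max_workers: int = 8) -> list[str]:
    """Dispatch a collected batch of prompts concurrently, amortizing
    connection setup and per-call overhead across the whole group."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(invoke_one, prompts))
```

Because the Region-prefixed profile ID is supplied as the modelId, routing stays on the service side and the application needs no Region-selection logic of its own.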
Why other options are incorrect:
Option A: Provisioned throughput in a single Region does not satisfy the multi-Region requirement, and one Region's capacity can still be exhausted by the 3x surge. Exponential backoff only retries failed requests; it adds latency rather than throughput (see the sketch after this list).
Option C: Auto scaling Lambda functions scale the application tier, not Bedrock model inference, so the inference bottleneck remains. Client-side round-robin is cruder than the managed routing of cross-Region inference profiles, and a single backup model unit (MU) cannot absorb the surge.
Option D: Batch inference with Amazon S3 and asynchronous retrieval through Amazon SQS is designed for offline workloads; its turnaround time violates the 2-second response requirement for real-time inference.
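For contrast, here is a minimal sketch of option A's retry strategy, assuming botocore's standard ClientError. It shows why backoff alone cannot meet the SLA: every retry adds sleep time without adding any capacity.

```python
import random
import time

from botocore.exceptions import ClientError

def with_backoff(call, max_attempts: int = 5):
    """Retry a throttled Bedrock call with exponential backoff and jitter.

    This smooths over transient ThrottlingExceptions, but each retry
    adds sleep time and creates no extra throughput, so a sustained 3x
    surge still misses a 2-second SLA.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Wait 100 ms * 2^attempt, plus up to 100 ms of jitter.
            time.sleep(0.1 * (2 ** attempt) + random.uniform(0, 0.1))
    raise RuntimeError("request still throttled after all retries")
```

Reusing invoke_one from the earlier sketch, a guarded call would look like with_backoff(lambda: invoke_one("Hello")).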
Key AWS Services/Concepts:
Amazon Bedrock, Anthropic Claude 3 Haiku, cross-Region inference profiles, token batching, provisioned throughput and model units (MUs), AWS Lambda, Amazon S3, Amazon SQS
Question:
A company is using Amazon Bedrock and Anthropic Claude 3 Haiku to develop an AI assistant. The AI assistant normally processes 10,000 requests each hour but experiences surges of up to 30,000 requests each hour during peak usage periods. The AI assistant must respond within 2 seconds while operating across multiple AWS Regions.
The company observes that during peak usage periods, the AI assistant experiences throughput bottlenecks that cause increased latency and occasional request timeouts. The company must resolve the performance issues.
Which solution will meet this requirement?
A. Purchase provisioned throughput and sufficient model units (MUs) in a single Region. Configure the application to retry failed requests with exponential backoff.
B. Implement token batching to reduce API overhead. Use cross-Region inference profiles to automatically distribute traffic across available Regions.
C. Set up auto scaling AWS Lambda functions in each Region. Implement client-side round-robin request distribution. Purchase one model unit (MU) of provisioned throughput as a backup.
D. Implement batch inference for all requests by using Amazon S3 buckets across multiple Regions. Use Amazon SQS to set up an asynchronous retrieval process.