
Explanation:
Option B is the correct answer because it is the only configuration that addresses every requirement in a single, integrated evaluation:
Retrieve-and-generate evaluation job: Evaluates both the retrieval component (the chunking strategies) and the generation component (the FM responses), which is essential for assessing the complete RAG pipeline.
Custom precision-at-k metrics: Quantify retrieval quality so the different chunking strategies can be compared directly (see the sketch after this list).
LLM-as-a-judge metric with a 1–5 scale: Provides a graded, numeric assessment of response quality from both foundation models, which is what makes the required deployment quality thresholds enforceable.
Includes each chunking strategy in the evaluation dataset: Ensures all chunking approaches are tested under identical conditions.
Uses Anthropic Claude Sonnet: A supported evaluator model for LLM-as-a-judge evaluations in Amazon Bedrock.
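For the retrieval side, precision-at-k is simple to compute per query: it is the fraction of the top-k retrieved chunks that are actually relevant. A minimal Python sketch (the chunk IDs and relevance labels are hypothetical):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)

# Hypothetical example: the same query retrieved under two chunking strategies.
gold = {"c2", "c5", "c9"}                      # chunks labeled relevant
fixed_size = ["c2", "c7", "c5", "c1", "c9"]    # results with fixed-size chunks
semantic = ["c2", "c5", "c9", "c3", "c8"]      # results with semantic chunks

print(precision_at_k(fixed_size, gold, k=3))   # 0.666...
print(precision_at_k(semantic, gold, k=3))     # 1.0
```

Averaging this score over the evaluation dataset, once per chunking strategy, gives a direct retrieval-quality comparison.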
Option A: A retrieve-only job evaluates retrieval alone and never assesses the generation quality of the two FMs; it also pushes threshold enforcement out of the evaluation into a separate CI/CD pipeline.
Option C: Running a separate job for every chunking-strategy and FM combination is inefficient and prevents an integrated comparison, and manual score review neither scales nor enforces deployment thresholds automatically.
Option D: Fragments the work into multiple retrieve-only jobs plus separate per-FM judge jobs, so there is no single, integrated comparison; the retrieve-only jobs assess retrieval quality without evaluating generation, and faithfulness and citation precision alone do not measure overall response quality for the two FMs.
Option B provides the most holistic approach that evaluates both retrieval strategies and foundation model responses in a single, integrated evaluation job.
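For context, here is a minimal sketch of what such a job could look like. It assumes the boto3 bedrock client's create_evaluation_job API with a RagEvaluation application type; the job name, role ARN, knowledge base ID, S3 URIs, model identifiers, and metric names are placeholders, and the exact request shape should be verified against the current Amazon Bedrock documentation:

```python
import boto3

bedrock = boto3.client("bedrock")

# All names, ARNs, IDs, URIs, and metric names below are placeholders.
response = bedrock.create_evaluation_job(
    jobName="rag-chunking-and-fm-comparison",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",
    applicationType="RagEvaluation",
    evaluationConfig={
        "automated": {
            # One dataset per chunking strategy keeps every strategy
            # inside the same evaluation job.
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "semantic-chunking-questions",
                        "datasetLocation": {"s3Uri": "s3://eval-bucket/semantic.jsonl"},
                    },
                    # Judge metrics for the generated answers; custom
                    # precision-at-k scores cover the retrieval side.
                    "metricNames": ["Builtin.Correctness", "Builtin.Faithfulness"],
                },
                # ...repeat for each remaining chunking strategy's dataset.
            ],
            # Claude Sonnet as the LLM-as-a-judge evaluator model.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # Retrieve-and-generate: the job queries the knowledge base and then
    # generates answers with the FM under test.
    inferenceConfig={
        "ragConfigs": [
            {
                "knowledgeBaseConfig": {
                    "retrieveAndGenerateConfig": {
                        "type": "KNOWLEDGE_BASE",
                        "knowledgeBaseConfiguration": {
                            "knowledgeBaseId": "KB123EXAMPLE",
                            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
                        },
                    }
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://eval-bucket/results/"},
)
print(response["jobArn"])
```

To compare the two FMs, the retrieveAndGenerateConfig would point at each model in turn (or in separate RAG configs, if the API accepts more than one per job), and the resulting judge scores can then be checked against the deployment thresholds.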
Question:
A company uses Amazon Bedrock to implement a Retrieval Augmented Generation (RAG)-based system to serve medical information to users. The company needs to compare multiple chunking strategies, evaluate the generation quality of two foundation models (FMs), and enforce quality thresholds for deployment.
Which Amazon Bedrock evaluation configuration will meet these requirements?
A
Create a retrieve-only evaluation job that uses a supported version of Anthropic Claude Sonnet as the evaluator model. Configure metrics for context relevance and context coverage. Define deployment thresholds in a separate CI/CD pipeline.
B
Create a retrieve-and-generate evaluation job that uses custom precision-at-k metrics and an LLM-as-a-judge metric with a scale of 1–5. Include each chunking strategy in the evaluation dataset. Use a supported version of Anthropic Claude Sonnet to evaluate responses from both FMs.
C
Create a separate evaluation job for each chunking strategy and FM combination. Use Amazon Bedrock built-in metrics for correctness and completeness. Manually review scores before deployment approval.
D
Set up a pipeline that uses multiple retrieve-only evaluation jobs to assess retrieval quality. Create separate evaluation jobs for both FMs that use Amazon Nova Pro as the LLM-as-a-judge model. Evaluate based on faithfulness and citation precision metrics.