
Explanation:
Option B is the correct answer because it is the only configuration that addresses every requirement in a single, integrated evaluation:
Retrieve-and-generate evaluation job: Evaluates both the retrieval component (the chunking strategies) and the generation component (the FM responses), which is essential for assessing the complete RAG pipeline.
Custom precision-at-k metrics: Quantify retrieval quality so the different chunking strategies can be compared directly (see the sketch after this list).
LLM-as-a-judge metric with a 1–5 scale: Provides a graded, numeric assessment of response quality from both foundation models, which is what makes the required deployment quality thresholds enforceable.
Includes each chunking strategy in the evaluation dataset: Ensures all chunking approaches are tested under identical conditions.
Uses Anthropic Claude Sonnet: A supported evaluator model for LLM-as-a-judge evaluations in Amazon Bedrock.
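For the retrieval side, precision-at-k is simple to compute per query: it is the fraction of the top-k retrieved chunks that are actually relevant. A minimal Python sketch (the chunk IDs and relevance labels are hypothetical):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)

# Hypothetical example: the same query retrieved under two chunking strategies.
gold = {"c2", "c5", "c9"}                      # chunks labeled relevant
fixed_size = ["c2", "c7", "c5", "c1", "c9"]    # results with fixed-size chunks
semantic = ["c2", "c5", "c9", "c3", "c8"]      # results with semantic chunks

print(precision_at_k(fixed_size, gold, k=3))   # 0.666...
print(precision_at_k(semantic, gold, k=3))     # 1.0
```

Averaging this score over the evaluation dataset, once per chunking strategy, gives a direct retrieval-quality comparison.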
Option A: A retrieve-only job evaluates retrieval alone and never assesses the generation quality of the two FMs; it also pushes threshold enforcement out of the evaluation into a separate CI/CD pipeline.
Option C: Running a separate job for every chunking-strategy and FM combination is inefficient and prevents an integrated comparison, and manual score review neither scales nor enforces deployment thresholds automatically.
Option D: Fragments the work into multiple retrieve-only jobs plus separate per-FM judge jobs, so there is no single, integrated comparison; the retrieve-only jobs assess retrieval quality without evaluating generation, and faithfulness and citation precision alone do not measure overall response quality for the two FMs.
Option B provides the most holistic approach that evaluates both retrieval strategies and foundation model responses in a single, integrated evaluation job.
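For context, here is a minimal sketch of what such a job could look like. It assumes the boto3 bedrock client's create_evaluation_job API with a RagEvaluation application type; the job name, role ARN, knowledge base ID, S3 URIs, model identifiers, and metric names are placeholders, and the exact request shape should be verified against the current Amazon Bedrock documentation:

```python
import boto3

bedrock = boto3.client("bedrock")

# All names, ARNs, IDs, URIs, and metric names below are placeholders.
response = bedrock.create_evaluation_job(
    jobName="rag-chunking-and-fm-comparison",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",
    applicationType="RagEvaluation",
    evaluationConfig={
        "automated": {
            # One dataset per chunking strategy keeps every strategy
            # inside the same evaluation job.
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "semantic-chunking-questions",
                        "datasetLocation": {"s3Uri": "s3://eval-bucket/semantic.jsonl"},
                    },
                    # Judge metrics for the generated answers; custom
                    # precision-at-k scores cover the retrieval side.
                    "metricNames": ["Builtin.Correctness", "Builtin.Faithfulness"],
                },
                # ...repeat for each remaining chunking strategy's dataset.
            ],
            # Claude Sonnet as the LLM-as-a-judge evaluator model.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # Retrieve-and-generate: the job queries the knowledge base and then
    # generates answers with the FM under test.
    inferenceConfig={
        "ragConfigs": [
            {
                "knowledgeBaseConfig": {
                    "retrieveAndGenerateConfig": {
                        "type": "KNOWLEDGE_BASE",
                        "knowledgeBaseConfiguration": {
                            "knowledgeBaseId": "KB123EXAMPLE",
                            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
                        },
                    }
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://eval-bucket/results/"},
)
print(response["jobArn"])
```

To compare the two FMs, the retrieveAndGenerateConfig would point at each model in turn (or in separate RAG configs, if the API accepts more than one per job), and the resulting judge scores can then be checked against the deployment thresholds.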
Question:
A company uses Amazon Bedrock to implement a Retrieval Augmented Generation (RAG)-based system to serve medical information to users. The company needs to compare multiple chunking strategies, evaluate the generation quality of two foundation models (FMs), and enforce quality thresholds for deployment.
Which Amazon Bedrock evaluation configuration will meet these requirements?
A
Create a retrieve-only evaluation job that uses a supported version of Anthropic Claude Sonnet as the evaluator model. Configure metrics for context relevance and context coverage. Define deployment thresholds in a separate CI/CD pipeline.
B
Create a retrieve-and-generate evaluation job that uses custom precision-at-k metrics and an LLM-as-a-judge metric with a scale of 1–5. Include each chunking strategy in the evaluation dataset. Use a supported version of Anthropic Claude Sonnet to evaluate responses from both FMs.
C
Create a separate evaluation job for each chunking strategy and FM combination. Use Amazon Bedrock built-in metrics for correctness and completeness. Manually review scores before deployment approval.
D
Set up a pipeline that uses multiple retrieve-only evaluation jobs to assess retrieval quality. Create separate evaluation jobs for both FMs that use Amazon Nova Pro as the LLM-as-a-judge model. Evaluate based on faithfulness and citation precision metrics.