
Explanation:
Option B is the correct solution because Amazon Bedrock evaluation jobs are purpose-built to assess prompt effectiveness, model behavior, and response quality in a repeatable and automated manner. Evaluation jobs support both quantitative metrics and LLM-based judgment, making them suitable for detecting subtle response quality regressions that simple sentiment or latency metrics cannot capture.
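As a concrete illustration, a custom prompt dataset for an evaluation job is a JSON Lines file stored in Amazon S3, with one test case per line. The sketch below builds and uploads a tiny dataset; the bucket name, object key, and record fields (prompt, referenceResponse, category) are illustrative assumptions to verify against the current Amazon Bedrock documentation.

```python
import json
import boto3

# Hypothetical bucket and key, used only for illustration.
BUCKET = "qa-eval-datasets"
KEY = "customer-service/prompts.jsonl"

# One record per test case. Field names follow the documented custom prompt
# dataset format (prompt / referenceResponse / category); confirm against the
# current Amazon Bedrock documentation before relying on them.
records = [
    {
        "prompt": "A customer asks how to reset their account password.",
        "referenceResponse": "Guide the customer through the self-service password reset flow.",
        "category": "account-support",
    },
    {
        "prompt": "A customer reports a duplicate charge on their invoice.",
        "referenceResponse": "Apologize, confirm the charge details, and explain the refund process.",
        "category": "billing",
    },
]

body = "\n".join(json.dumps(record) for record in records)
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=body.encode("utf-8"))
```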
Key reasons why Option B meets all requirements:
Automatic comparison of multiple prompt templates: Amazon Bedrock evaluation jobs can test multiple prompt templates and model configurations against the same custom datasets (see the job-creation sketch after this list).
Detection of response quality issues: Evaluation jobs provide comprehensive metrics and can use LLM-based judges to identify quality issues that go beyond simple sentiment analysis.
Quantitative metrics: Bedrock evaluation jobs generate structured scoring output, giving per-metric numeric scores that can be compared across prompt templates and model configurations.
Human reviewer feedback: Amazon Bedrock also supports human-based evaluation jobs, so reviewers can rate or rank responses as part of the same workflow.
Prevent deployment of low-quality configurations: By integrating with AWS CodePipeline, the solution can block deployments that do not meet the predefined quality threshold (a gating sketch appears at the end of this explanation).
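As referenced in the list above, starting an automated evaluation job from code is straightforward with boto3. The sketch below is a minimal example rather than a definitive implementation: the model identifier, IAM role ARN, S3 URIs, metric names, and the exact request structure are assumptions that should be checked against the current CreateEvaluationJob API reference.

```python
import boto3

bedrock = boto3.client("bedrock")

# Placeholder ARNs, URIs, and names for illustration only.
response = bedrock.create_evaluation_job(
    jobName="customer-service-prompt-eval-v2",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "Generation",
                    "dataset": {
                        "name": "customer-service-prompts",
                        "datasetLocation": {
                            "s3Uri": "s3://qa-eval-datasets/customer-service/prompts.jsonl"
                        },
                    },
                    # Built-in metric names vary by task type; confirm in the API reference.
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0",
                    "inferenceParams": '{"temperature": 0.2}',
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://qa-eval-results/customer-service/"},
)
print("Started evaluation job:", response["jobArn"])
```

Comparing multiple prompt templates then amounts to running one such job per template (or per model configuration) against the same dataset and comparing the resulting scores.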
Why other options are incorrect:
Option A: Relies on daily manual review and lacks automated quality-threshold enforcement. Amazon QuickSight visualizes response patterns but does not provide the structured, repeatable evaluation the requirements call for.
Option C: Focuses on operational metrics (latency and error rates) rather than response quality, and does not address prompt effectiveness or content quality evaluation at all.
Option D: Samples production traffic, which does not provide consistent, repeatable test cases. Amazon Comprehend sentiment analysis is too coarse for comprehensive response quality assessment, does not evaluate prompt effectiveness, and the option includes no mechanism for human reviewer feedback.
Additional context: Amazon Bedrock evaluation jobs are purpose-built for this use case. They support automatic evaluations with built-in and custom metrics, LLM-as-a-judge scoring, and human-based review workflows, and they write structured results to Amazon S3 where downstream automation can consume them.
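To make the quality gate concrete, a CodeBuild action inside the CodePipeline stage could poll the evaluation job and fail the build when the aggregate score falls below the threshold, which blocks promotion of that configuration. The sketch below assumes a job ARN, a results bucket, a 0.8 threshold, and an output record schema with per-metric scores under automatedEvaluationResult; all of these are placeholders to adapt to the actual job output.

```python
import json
import sys
import time

import boto3

JOB_ARN = "arn:aws:bedrock:us-east-1:123456789012:evaluation-job/example"  # placeholder
THRESHOLD = 0.8  # example quality threshold
RESULTS_BUCKET = "qa-eval-results"  # placeholder
RESULTS_PREFIX = "customer-service/"

bedrock = boto3.client("bedrock")
s3 = boto3.client("s3")

# Wait for the evaluation job to finish.
while True:
    job = bedrock.get_evaluation_job(jobIdentifier=JOB_ARN)
    if job["status"] in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

if job["status"] != "Completed":
    sys.exit(f"Evaluation job ended with status {job['status']}")

# Collect per-record scores from the output files. The output layout and
# record schema used here are assumptions; adjust to the actual job output.
scores = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=RESULTS_BUCKET, Prefix=RESULTS_PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".jsonl"):
            continue
        body = s3.get_object(Bucket=RESULTS_BUCKET, Key=obj["Key"])["Body"].read()
        for line in body.decode("utf-8").splitlines():
            record = json.loads(line)
            for score in record.get("automatedEvaluationResult", {}).get("scores", []):
                scores.append(score["result"])

average = sum(scores) / len(scores) if scores else 0.0
print(f"Average evaluation score: {average:.3f}")

# A nonzero exit fails the CodeBuild action, which blocks the pipeline stage.
if average < THRESHOLD:
    sys.exit("Quality threshold not met; blocking deployment.")
```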
A company has a customer service application that uses Amazon Bedrock to generate personalized responses to customer inquiries. The company needs to establish a quality assurance process to evaluate prompt effectiveness and model configurations across updates. The process must automatically compare outputs from multiple prompt templates, detect response quality issues, provide quantitative metrics, and allow human reviewers to give feedback on responses. The process must prevent configurations that do not meet a predefined quality threshold from being deployed.
Which solution will meet these requirements?
A. Create an AWS Lambda function that sends sample customer inquiries to multiple Amazon Bedrock model configurations and stores responses in Amazon S3. Use Amazon QuickSight to visualize response patterns. Manually review outputs daily. Use AWS CodePipeline to deploy configurations that meet the quality threshold.
B. Use Amazon Bedrock evaluation jobs to compare model outputs by using custom prompt datasets. Configure AWS CodePipeline to run the evaluation jobs when prompt templates change. Configure CodePipeline to deploy only configurations that exceed the predefined quality threshold.
C. Set up Amazon CloudWatch alarms to monitor response latency and error rates from Amazon Bedrock. Use Amazon EventBridge rules to notify teams when thresholds are exceeded. Configure a manual approval workflow in AWS Systems Manager.
D. Use AWS Lambda functions to create an automated testing framework that samples production traffic and routes duplicate requests to the updated model version. Use Amazon Comprehend sentiment analysis to compare results. Block deployment if sentiment scores decrease.