
## Answer

**A: Bilingual Evaluation Understudy (BLEU)**
## Detailed Explanation

### Question Analysis

The question asks for an appropriate model evaluation strategy to assess the accuracy of machine-generated translations from English into other languages using LLMs. The key requirement is evaluating **translation accuracy** by examining the generated text.

### Evaluation of Options

**A: Bilingual Evaluation Understudy (BLEU)**
- **Optimal choice**: BLEU is designed specifically for evaluating machine translation quality. It measures the similarity between machine-generated translations and human reference translations using clipped n-gram precision combined with a brevity penalty.
- **Why it fits**: BLEU directly addresses the core requirement of translation accuracy assessment. It is widely accepted in both research and industry, making it the standard metric for this use case.
- **Strengths**: Quantifies translation quality objectively, correlates reasonably well with human judgment, and is computationally efficient.

**B: Root Mean Squared Error (RMSE)**
- **Not suitable**: RMSE is used for regression problems, where predictions are continuous numerical values. It measures the average magnitude of the error between predicted and actual values.
- **Why it doesn't fit**: Translation is a text-generation task, not numerical prediction. RMSE cannot be applied to textual data or translation quality assessment.

**C: Recall-Oriented Understudy for Gisting Evaluation (ROUGE)**
- **Less suitable**: ROUGE was designed primarily for evaluating text summarization systems, though it can be adapted to translation. It focuses on recall-oriented measures of n-gram overlap.
- **Why it's suboptimal**: While ROUGE can provide some insight, it is not specifically optimized for translation evaluation. BLEU is the established standard for machine translation; ROUGE is better suited to summarization tasks.

**D: F1 Score**
- **Not suitable**: The F1 score combines precision and recall for classification tasks, typically binary or multi-class classification.
- **Why it doesn't fit**: Translation is a generation task, not a classification problem, so the F1 score cannot evaluate the quality of generated text translations.

### Conclusion

BLEU (Option A) is the most appropriate evaluation strategy because it is designed specifically for machine translation quality assessment. It provides an objective, standardized way to compare machine-generated translations against human reference translations, which matches the company's need to evaluate translation accuracy. The other metrics target different types of machine learning tasks and are not suitable for evaluating translation quality.
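To make the BLEU mechanics concrete, here is a minimal, self-contained sketch of clipped n-gram precision with a brevity penalty in pure Python. This is a simplified single-reference version without smoothing; production systems typically rely on established implementations such as sacrebleu or NLTK, and all names below (`ngram_counts`, `bleu`) are illustrative, not part of any library.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        # Unsmoothed BLEU collapses to 0 if any n-gram level has no overlap.
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

ref = "the quick brown fox jumps over the lazy dog".split()
print(bleu(ref, ref))  # perfect match -> 1.0
print(bleu("the fast brown fox".split(), ref, max_n=2))
```

Note that an exact match scores 1.0, while a short, partially correct candidate is penalized both by lower n-gram precision and by the brevity penalty; real implementations add smoothing so that short segments with a missing higher-order n-gram do not score exactly 0.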
Author: LeetQuiz Editorial Team
## Question

A company uses generative AI with large language models (LLMs) to translate training manuals from English into other languages. They need to assess the accuracy of the generated translated text. Which model evaluation approach satisfies this need?
- **A.** Bilingual Evaluation Understudy (BLEU)
- **B.** Root mean squared error (RMSE)
- **C.** Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
- **D.** F1 score