
## Answer

**A: Bilingual Evaluation Understudy (BLEU)**
## Detailed Explanation

### Question Analysis

The question asks for an appropriate model evaluation strategy to assess the accuracy of machine-generated translations from English into other languages using LLMs. The key requirement is evaluating **translation accuracy** by examining the generated text.

### Evaluation of Options

**A: Bilingual Evaluation Understudy (BLEU)**
- **Optimal choice**: BLEU is designed specifically for evaluating machine translation quality. It measures the similarity between machine-generated translations and human reference translations using clipped n-gram precision combined with a brevity penalty.
- **Why it fits**: BLEU directly addresses the core requirement of translation accuracy assessment. It is widely accepted in both research and industry, making it the standard metric for this use case.
- **Strengths**: Quantifies translation quality objectively, correlates reasonably well with human judgment, and is computationally efficient.

**B: Root Mean Squared Error (RMSE)**
- **Not suitable**: RMSE is used for regression problems, where predictions are continuous numerical values. It measures the average magnitude of the error between predicted and actual values.
- **Why it doesn't fit**: Translation is a text-generation task, not numerical prediction. RMSE cannot be applied to textual data or translation quality assessment.

**C: Recall-Oriented Understudy for Gisting Evaluation (ROUGE)**
- **Less suitable**: ROUGE was designed primarily for evaluating text summarization systems, though it can be adapted to translation. It focuses on recall-oriented measures of n-gram overlap.
- **Why it's suboptimal**: While ROUGE can provide some insight, it is not specifically optimized for translation evaluation. BLEU is the established standard for machine translation; ROUGE is better suited to summarization tasks.

**D: F1 Score**
- **Not suitable**: The F1 score combines precision and recall for classification tasks, typically binary or multi-class classification.
- **Why it doesn't fit**: Translation is a generation task, not a classification problem, so the F1 score cannot evaluate the quality of generated text translations.

### Conclusion

BLEU (Option A) is the most appropriate evaluation strategy because it is designed specifically for machine translation quality assessment. It provides an objective, standardized way to compare machine-generated translations against human reference translations, which matches the company's need to evaluate translation accuracy. The other metrics target different types of machine learning tasks and are not suitable for evaluating translation quality.
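To make the BLEU mechanics concrete, here is a minimal, self-contained sketch of clipped n-gram precision with a brevity penalty in pure Python. This is a simplified single-reference version without smoothing; production systems typically rely on established implementations such as sacrebleu or NLTK, and all names below (`ngram_counts`, `bleu`) are illustrative, not part of any library.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of clipped
    n-gram precisions (n = 1..max_n) times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0.0:
        # Unsmoothed BLEU collapses to 0 if any n-gram level has no overlap.
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty discourages overly short candidates.
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * geo_mean

ref = "the quick brown fox jumps over the lazy dog".split()
print(bleu(ref, ref))  # perfect match -> 1.0
print(bleu("the fast brown fox".split(), ref, max_n=2))
```

Note that an exact match scores 1.0, while a short, partially correct candidate is penalized both by lower n-gram precision and by the brevity penalty; real implementations add smoothing so that short segments with a missing higher-order n-gram do not score exactly 0.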
Author: LeetQuiz Editorial Team
## Question

A company uses generative AI with large language models (LLMs) to translate training manuals from English into other languages. They need to assess the accuracy of the generated translated text. Which model evaluation approach satisfies this need?
- **A.** Bilingual Evaluation Understudy (BLEU)
- **B.** Root mean squared error (RMSE)
- **C.** Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
- **D.** F1 score