
Answer-first summary for fast verification
Answer: BLEU metric
The BLEU (Bilingual Evaluation Understudy) metric is specifically designed for evaluating machine translation quality by comparing machine-generated translations against reference human translations. It measures the precision of n-gram matches between the candidate and reference translations, combined with a brevity penalty that penalizes overly short outputs, making it well suited for benchmarking LLMs on translation tasks. The community discussion shows 100% consensus on option A, with the comment highlighting that BLEU is explicitly named for bilingual evaluation. The other options are less suitable: NDCG evaluates ranking systems, ROUGE is designed primarily for text summarization, and recall alone is insufficient because it does not comprehensively capture translation quality.
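To make the mechanics concrete, here is a minimal sketch of sentence-level BLEU: clipped n-gram precisions combined via a geometric mean and scaled by a brevity penalty. This is a simplified illustration, not a reference implementation; production tools such as sacreBLEU add smoothing, standardized tokenization, and support for multiple references.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU for a single reference.

    Geometric mean of clipped 1..max_n-gram precisions, multiplied by
    a brevity penalty that discourages overly short candidates.
    """
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference,
        # so repeating a correct word cannot inflate precision.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    # Without smoothing, any zero precision drives the score to zero.
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty: 1 if the candidate is at least as long as the
    # reference, exponentially smaller otherwise.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

For example, an exact match scores 1.0, while a candidate sharing no n-grams with the reference scores 0.0; partial overlaps fall in between.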
Author: LeetQuiz Editorial Team
A Generative AI Engineer has developed an LLM-based system for automatic text translation between two languages. They now need to benchmark multiple LLMs on this task and select the best one. They possess an evaluation dataset containing known high-quality translation examples. They want to evaluate each LLM using this dataset with a performant metric.
Which metric should they choose for this evaluation?
A. BLEU metric
B. NDCG metric
C. ROUGE metric
D. RECALL metric