Databricks Certified Generative AI Engineer - Associate

Explanation:

BLEU (Bilingual Evaluation Understudy) was designed specifically to evaluate machine translation quality by comparing machine-generated translations against human reference translations. It measures the precision of n-gram matches between the candidate and the reference, combined with a penalty for candidates that are too short, which makes it well suited to benchmarking LLMs on a translation task when high-quality reference translations are available. The community discussion shows 100% consensus on option A (BLEU), with the comment noting that the metric's name explicitly refers to bilingual evaluation. The other options are less suitable: NDCG evaluates ranking systems, ROUGE is used primarily for text summarization, and recall on its own does not capture translation quality comprehensively.
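
To make the n-gram precision idea concrete, here is a minimal sketch of a simplified, single-reference BLEU in plain Python, used to rank two models. It is not the reference implementation: it assumes whitespace tokenization, applies no smoothing, and averages sentence-level scores rather than pooling counts over the whole corpus as standard BLEU does. The names `simple_bleu`, `model_a`, `model_b`, and the sample sentences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    cand = candidate.split()  # assumption: whitespace tokenization
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count at its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives zero
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


# Hypothetical evaluation set and model outputs.
references = ["the cat sat on the mat", "he reads a book every night"]
outputs = {
    "model_a": ["the cat sat on the mat", "he reads a book each night"],
    "model_b": ["cat the mat on sat", "a book he night reads"],
}

# Average the per-sentence scores for each model (a simplification).
scores = {
    name: sum(simple_bleu(c, r) for c, r in zip(cands, references)) / len(references)
    for name, cands in outputs.items()
}
best = max(scores, key=scores.get)
print(scores, best)
```

For an actual benchmark, an established implementation such as sacreBLEU or NLTK would be the safer choice, since tokenization and smoothing details change the scores noticeably.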

A Generative AI Engineer has developed an LLM-based system for automatic text translation between two languages. They now need to benchmark multiple LLMs on this task and select the best one. They possess an evaluation dataset containing known high-quality translation examples. They want to evaluate each LLM using this dataset with a performant metric.

Which metric should they choose for this evaluation?

Last updated: February 7, 2026 at 14:03

BLEU metric: 82.4%
NDCG metric: 4.3%
ROUGE metric: 7.3%
RECALL metric: 6.0%