
A Generative AI Engineer has developed an LLM-based system for automatic text translation between two languages. They now need to benchmark multiple LLMs on this task and select the best one. They have an evaluation dataset of known high-quality translation examples and want to score each LLM against it using a performant metric.
Which metric should they choose for this evaluation?
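
For context, a widely used reference-based metric for machine translation is BLEU, which scores n-gram overlap between each candidate translation and a known-good reference. The sketch below is a simplified, pure-Python corpus-level BLEU (one reference per sentence, whitespace tokenization, no smoothing), not a replacement for an established implementation. The model names, example sentences, and the `corpus_bleu` helper are hypothetical and serve only to illustrate how several LLMs might be ranked on the same evaluation set.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Simplified corpus-level BLEU: one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    cand_len = ref_len = 0
    for cand, ref in zip(candidates, references):
        c, r = cand.split(), ref.split()  # assumption: whitespace tokenization
        cand_len += len(c)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            c_counts, r_counts = ngrams(c, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it appears in the reference.
            matches[n - 1] += sum((c_counts & r_counts).values())
            totals[n - 1] += sum(c_counts.values())
    if cand_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the 1..max_n modified precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return bp * math.exp(log_precision)


# Hypothetical evaluation set and model outputs, for illustration only.
references = ["the cat sits on the mat", "i like green tea very much"]
model_outputs = {
    "model_a": ["the cat sits on the mat", "i like green tea very much"],
    "model_b": ["the cat sits on a mat", "i like green tea a lot"],
}

scores = {name: corpus_bleu(outs, references) for name, outs in model_outputs.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In practice, an established implementation such as sacreBLEU or NLTK's `bleu_score` module would usually be preferable, since tokenization and smoothing choices noticeably affect the reported score.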