
Explanation:
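The company is developing a foreign language learning app that uses an LLM to improve text coherence. It has compiled a diverse text dataset, augmented it with examples of more readable versions, and wants the LLM's output to closely match those examples.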
This is fundamentally a text generation evaluation problem where we need to measure how closely machine-generated text matches human-provided reference texts.
A. Value of the loss function
B. Semantic robustness
C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score
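Optimal choice: ROUGE is specifically designed for evaluating text generation quality by comparing machine-generated text against human reference texts. It measures how much of the reference wording, counted as overlapping n-grams and word sequences, reappears in the generated text.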
For this use case, ROUGE can quantify how well the LLM's "more readable" outputs match the style and content of the provided enhanced examples. It's widely accepted in NLP for summarization, translation, and text generation evaluation.
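To make the overlap idea concrete, here is a minimal sketch of ROUGE-N for a single candidate/reference pair. It assumes plain whitespace tokenization and lowercasing, which real ROUGE toolkits refine with stemming and other preprocessing; the function names `ngrams` and `rouge_n` and the example sentences are illustrative, not taken from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Compute ROUGE-N recall, precision, and F1 for one candidate/reference pair.

    Assumption: whitespace tokenization and lowercasing only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts only as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical example: a "more readable" reference and an LLM output.
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
scores = rouge_n(candidate, reference, n=1)
print(scores)
```

Here five of the six reference unigrams reappear in the candidate, so ROUGE-1 recall is 5/6. Calling `rouge_n(candidate, reference, n=2)` scores bigrams instead, which is stricter because adjacent word pairs must match.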
D. Latency of the text generation
Direct comparison capability: ROUGE provides quantitative metrics for comparing generated text against reference texts, which aligns perfectly with the requirement to "resemble the provided examples."
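Multiple dimensions of evaluation: Different ROUGE variants assess different aspects. ROUGE-1 and ROUGE-2 measure unigram and bigram overlap, while ROUGE-L measures the longest common subsequence and therefore reflects word order at the sentence level.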
Industry standard: ROUGE is widely used in academic research and industry for evaluating text generation systems, including summarization, translation, and text simplification tasks—exactly what's needed here.
Alignment with readability assessment: ROUGE measures lexical overlap rather than readability itself, so it serves as a proxy here. Output that shares wording and phrasing with the "more readable versions" is likely to share their style as well. The company wants the LLM to produce text that matches those versions, and ROUGE measures how closely the generated text matches them.
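The sequence-based variant, ROUGE-L, can be sketched in the same spirit. This is a simplified sentence-level version under the same assumptions (whitespace tokenization, lowercasing, one reference); `lcs_length` and `rouge_l` are illustrative names, and the F1 shown is the plain harmonic mean rather than the weighted F-measure some implementations use.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Sentence-level ROUGE-L recall, precision, and F1 (simplified sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


# Hypothetical example: same words as the reference, but reordered.
reference = "the cat sat on the mat"
reordered = "on the mat the cat sat"
print(rouge_l(reordered, reference))
```

The reordered sentence contains every reference word, so its unigram overlap is complete, yet its longest common subsequence with the reference is only three tokens long. ROUGE-L recall therefore drops to 0.5, which illustrates why the variants are reported together rather than relying on one.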
The ROUGE score is the most appropriate metric because it directly measures the similarity between generated text and reference examples using several overlap measures. This matches the company's requirement to evaluate whether LLM outputs resemble the provided examples of more readable text. The other options measure something else: the loss function value reflects training optimization, latency reflects response speed, and semantic robustness reflects stability of output under input changes. None of them directly assesses similarity to reference texts.
A company is launching a mobile app for foreign language learning that uses a large language model (LLM) to improve text coherence. They have compiled a diverse text dataset and augmented it with examples of more readable versions. They want the LLM's output to closely match the style and quality of these enhanced examples.
Which metric should the company use to evaluate if the LLM's outputs align with these provided examples?
A. Value of the loss function
B. Semantic robustness
C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score
D. Latency of the text generation