
Answer-first summary for fast verification
Answer: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score
## Detailed Explanation

### Understanding the Problem Context

The company is developing a foreign language learning app that uses an LLM to improve text coherence. They have:

1. A diverse dataset of text
2. Augmented it with examples of more readable versions
3. The goal that the LLM's output resemble these enhanced examples in style and quality

This is fundamentally a **text generation evaluation problem**: we need to measure how closely machine-generated text matches human-provided reference texts.

### Analysis of Each Option

**A. Value of the loss function** - **Not suitable**: Loss functions (such as cross-entropy) are used during training to optimize model parameters, not to evaluate final output quality against reference examples. They measure prediction error during training, not similarity to target outputs in deployment.

**B. Semantic robustness** - **Not suitable**: Semantic robustness refers to a model's ability to maintain consistent meaning despite input variations or adversarial perturbations. While important for reliability, it does not measure similarity to reference examples of readable text.

**C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score** - **Optimal choice**: ROUGE is designed specifically for evaluating text generation quality by comparing machine-generated text against human reference texts. It measures:

- **N-gram overlap** (ROUGE-N): captures word-sequence similarity
- **Longest common subsequence** (ROUGE-L): measures structural similarity
- **Recall-oriented scoring**: focuses on how much of the reference content appears in the generated text

For this use case, ROUGE can quantify how well the LLM's "more readable" outputs match the style and content of the provided enhanced examples. It is widely accepted in NLP for evaluating summarization, translation, and text generation.

**D. Latency of the text generation** - **Not suitable**: Latency measures response time, which matters for user experience but does not assess whether the output resembles the reference examples in readability or coherence.

### Why ROUGE Is Specifically Appropriate

1. **Direct comparison capability**: ROUGE provides quantitative metrics for comparing generated text against reference texts, which aligns with the requirement that outputs "resemble the provided examples."
2. **Multiple dimensions of evaluation**: Different ROUGE variants assess different aspects:
   - ROUGE-1 (unigrams): basic word overlap
   - ROUGE-2 (bigrams): phrase-level similarity
   - ROUGE-L: sentence-level structural similarity
   - ROUGE-SU: skip-bigram overlap with unigram inclusion
3. **Industry standard**: ROUGE is widely used in academic research and industry for evaluating text generation systems, including summarization, translation, and text simplification tasks, which is exactly what is needed here.
4. **Alignment with readability assessment**: ROUGE primarily measures content overlap rather than readability directly, but since the company's goal is for outputs to match its "more readable versions," measuring how closely generated text matches those reference examples is precisely what is required.

### Alternative Metrics Considered and Rejected

- **BLEU score**: Similar to ROUGE but precision-oriented, and better suited to machine translation than to evaluating coherence against readable references.
- **Perplexity**: Measures the model's confidence in its own predictions, not output quality relative to references.
- **Human evaluation**: Valuable, but not among the offered options and resource-intensive at scale.

### Conclusion

ROUGE score is the most appropriate metric because it directly measures the similarity between generated text and reference examples through multiple overlap criteria.
This aligns perfectly with the company's requirement to evaluate whether LLM outputs resemble the provided examples of more readable text. The other options either measure different aspects (loss function, latency) or address different concerns (semantic robustness) that don't directly assess the similarity to reference texts.
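To make the overlap intuition concrete, here is a minimal pure-Python sketch of ROUGE-N recall and ROUGE-L recall. This is a deliberately simplified illustration (whitespace tokenization, no stemming, recall only); real evaluations typically use an established implementation such as the `rouge-score` package.

```python
from collections import Counter

def rouge_n_recall(reference: str, candidate: str, n: int = 1) -> float:
    """ROUGE-N recall: fraction of reference n-grams that also appear in the candidate."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()  # naive whitespace tokenization for illustration
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref:
        return 0.0
    # Clipped overlap: each reference n-gram can be matched at most as often as it occurs
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

def rouge_l_recall(reference: str, candidate: str) -> float:
    """ROUGE-L recall: longest common subsequence length over reference length."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref:
        return 0.0
    # Classic dynamic-programming LCS table
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(ref)

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(rouge_n_recall(reference, candidate, n=1))  # 5/6: five of six reference unigrams appear
print(rouge_l_recall(reference, candidate))       # 5/6: LCS is "the cat on the mat"
```

In practice the company would treat each "more readable version" in its dataset as the reference and the LLM's output as the candidate, then average the scores across the evaluation set.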
Author: LeetQuiz Editorial Team
A company is launching a mobile app for foreign language learning that uses a large language model (LLM) to improve text coherence. They have compiled a diverse text dataset and augmented it with examples of more readable versions. They want the LLM's output to closely match the style and quality of these enhanced examples.
Which metric should the company use to evaluate if the LLM's outputs align with these provided examples?
A. Value of the loss function
B. Semantic robustness
C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score
D. Latency of the text generation