AWS Certified AI Practitioner

Explanation:

The Bilingual Evaluation Understudy (BLEU) score is the most appropriate metric for this scenario because it evaluates generated text by comparing it to human-written reference outputs using n-gram precision. Because the comparison is made on exact surface tokens, a creative spelling or abbreviation is credited only when the generated text uses the same form as the references, which makes BLEU well suited to assessing how closely the LLM mimics those stylistic elements.

Why BLEU is optimal:

N-gram matching: BLEU counts the n-grams (contiguous sequences of tokens, typically words) in the generated text that also appear in the reference texts. This directly measures how well the model reproduces specific phrasing patterns, including creative spellings and shortened words.
Style preservation: Since the company wants the chatbot to use teenage language with creative spelling and abbreviations, BLEU can quantify how closely the generated text matches reference examples written in that style.
Established standard: BLEU was originally developed for machine translation and is a widely used metric in NLP for scoring generated text against reference texts.
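The n-gram matching described above can be illustrated with a minimal sketch of single-reference BLEU. This is a simplified, hypothetical implementation for illustration only (whitespace tokenization, n-grams up to length 2, uniform weights, no smoothing); the example sentences are invented, and production evaluations would normally rely on an established library instead.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total

def bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU: geometric mean of clipped
    precisions for n = 1..max_n, multiplied by the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

# Hypothetical reference written in the target style.
reference = "c u l8r 2nite".split()

styled = "c u l8r 2nite".split()           # same creative spellings
partial = "c u later 2nite".split()        # one word spelled formally
formal = "see you later tonight".split()   # same meaning, formal spelling

print(bleu(styled, reference))
print(bleu(partial, reference))
print(bleu(formal, reference))
```

In this sketch the formally spelled sentence shares no tokens with the reference, so it scores zero even though its meaning is the same, while the partially styled sentence lands in between. That is the surface-level sensitivity the explanation relies on.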

Why other options are less suitable:

F1 score (A): Primarily used for classification tasks to balance precision and recall, not for evaluating text generation quality or stylistic matching.
BERTScore (B): Uses contextual embeddings to measure semantic similarity, which might overlook specific surface-level stylistic features like creative spelling and abbreviations that are crucial here.
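ROUGE (C): A recall-oriented metric designed mainly for summarization. It measures how much of the reference's n-grams and word sequences appear in the generated text, so it rewards coverage of the reference rather than penalizing generated wording that departs from the target style. That makes it a weaker fit than BLEU for evaluating stylistic imitation in conversational contexts.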
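The difference between a precision-oriented and a recall-oriented overlap score can be seen in a small sketch. The functions below are simplified, hypothetical unigram versions (not the full BLEU or ROUGE definitions, and real ROUGE tooling typically reports more than recall), and the example sentences are invented.

```python
from collections import Counter

def overlap(candidate, reference):
    """Clipped count of unigrams shared by candidate and reference."""
    cand, ref = Counter(candidate), Counter(reference)
    return sum(min(count, ref[tok]) for tok, count in cand.items())

def unigram_precision(candidate, reference):
    """BLEU-style view: share of generated tokens found in the reference."""
    return overlap(candidate, reference) / len(candidate)

def unigram_recall(candidate, reference):
    """ROUGE-style view: share of reference tokens found in the output."""
    return overlap(candidate, reference) / len(reference)

reference = "c u l8r".split()
# The candidate covers the whole reference but adds formally worded text.
candidate = "c u l8r my dear friend".split()

print(unigram_precision(candidate, reference))  # 0.5
print(unigram_recall(candidate, reference))     # 1.0
```

Here recall is perfect because every reference token is covered, while precision drops because half of the generated tokens fall outside the reference style.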

BLEU's precision-based approach with n-gram matching makes it the best choice for quantifying how well the LLM adopts the target audience's specific linguistic style.

Which metric should the education company use to evaluate whether its custom LLM's responses match the creative spelling and shortened words typical of teenage language?

Exam-Like

Last updated: May 8, 2026 at 14:02

A. F1 score

0.0%

B. BERTScore

25.0%

C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)

37.5%

D. Bilingual Evaluation Understudy (BLEU) score

37.5%