
Explanation:
The Bilingual Evaluation Understudy (BLEU) score is the most appropriate metric for this scenario because it evaluates text generation quality by comparing machine-generated text to human-written reference outputs using n-gram precision. Because n-gram matching operates on exact surface tokens, BLEU captures whether the LLM reproduces specific stylistic elements such as creative spelling and abbreviations, rather than only conveying the same meaning.
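Why BLEU is optimal:
BLEU counts exact n-gram overlaps, so a response earns credit only when it reproduces the same surface forms, such as nonstandard spellings and shortened words, that appear in the human-written references.
Why other options are less suitable:
F1 score: a classification metric built from precision and recall over predicted labels; it does not measure the style of generated text.
BERTScore: compares contextual embeddings, so it rewards semantic similarity and tends to tolerate differences in spelling and wording.
ROUGE: recall-oriented and used mainly for summarization, where the question is how much of the reference content is covered.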
BLEU's precision-based approach with n-gram matching makes it the best choice for quantifying how well the LLM adopts the target audience's specific linguistic style.
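To make the n-gram matching idea concrete, here is a minimal sketch of the clipped (modified) n-gram precision that BLEU is built on. It is not a full BLEU implementation: the brevity penalty and the geometric mean over n-gram orders are left out, and the function names and example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference (the "clipping" step used by BLEU).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical example data: one teen-style reference and two model replies.
reference = "omg thx 4 the help c u l8r".split()
slang_reply = "omg thx 4 the help c u soon".split()
formal_reply = "thank you for the help see you later".split()

slang_p1 = modified_precision(slang_reply, reference, 1)    # 0.875
formal_p1 = modified_precision(formal_reply, reference, 1)  # 0.25
print(slang_p1, formal_p1)
```

The formal reply means the same thing as the reference but shares almost none of its surface tokens, so its precision is low. That is the behavior the explanation relies on when it says BLEU measures adoption of a specific spelling style.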
Which metric should the education company use to evaluate whether its custom LLM's responses match the creative spelling and shortened words typical of teenage language?
A. F1 score
B. BERTScore
C. Recall-Oriented Understudy for Gisting Evaluation (ROUGE)
D. Bilingual Evaluation Understudy (BLEU) score