Explanation
For evaluating the accuracy of a fine-tuned LLM for a help desk question-answering system, the F1 score is the most appropriate metric among the given options.
Why F1 Score is Correct:
- The F1 score is the harmonic mean of precision and recall, providing a single, balanced measure of a model's accuracy on classification and answer-extraction tasks.
- For question-answering systems, we typically need to evaluate how well the model provides correct answers, which means measuring both:
  - Precision: how many of the predicted answers are actually correct
  - Recall: how many of the correct answers the model captures
- The F1 score combines both metrics, making it well suited to measuring overall answer accuracy in question-answering and information-retrieval tasks (a minimal computation sketch follows below).
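To make this concrete, here is a minimal sketch of a token-overlap F1 computation in the style of SQuAD-like QA evaluation. The function name, whitespace tokenization, and lowercasing are illustrative assumptions, not a specific library's API.

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.
    Uses simple whitespace tokenization and lowercasing (an assumption)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty -> perfect match; only one empty -> no overlap
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token minimum counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # fraction of predicted tokens found in the reference
    recall = num_same / len(gold_tokens)     # fraction of reference tokens recovered by the prediction
    return 2 * precision * recall / (precision + recall)

# Example: partial overlap yields a score between 0 and 1
print(token_f1("reset your password via the portal",
               "reset the password in the self-service portal"))
```

Averaging a score like this over a held-out set of help desk questions before and after fine-tuning is one way to quantify the accuracy improvement being evaluated.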
Why Other Options Are Less Suitable:
- A. Precision: While important, precision alone doesn't give the complete picture. A model could have high precision but low recall (missing many correct answers).
- B. Time to first token: This measures response latency, not accuracy. It's about performance speed, not the quality of answers.
- D. Word error rate: This metric is typically used in speech recognition and transcription tasks to measure how closely a transcript matches a reference, not for evaluating the correctness of generated answers in a question-answering system.
Additional Context:
For LLM evaluation in question-answering tasks, other relevant metrics might include:
- Exact Match (EM): Whether the answer exactly matches the ground truth (see the sketch after this list)
- ROUGE-L / BLEU: For evaluating the quality of generated text against reference answers
- Human evaluation: For subjective assessment of answer quality
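As an illustration, a minimal Exact Match check might look like the following. The normalization steps (lowercasing, stripping punctuation and English articles, collapsing whitespace) mirror common SQuAD-style evaluation and are assumptions rather than a fixed standard.

```python
import re
import string

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Exact Match after light normalization: lowercase, drop punctuation
    and English articles, collapse whitespace (assumed conventions)."""
    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        text = re.sub(r"\b(a|an|the)\b", " ", text)
        return " ".join(text.split())
    return normalize(prediction) == normalize(ground_truth)

# Example: matches after normalization despite case and punctuation differences
print(exact_match("The VPN client.", "VPN client"))  # True
```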
However, among the given options, the F1 score is the most comprehensive metric for evaluating accuracy improvements from fine-tuning.