
## Answer

**C: F1 score**
## Detailed Explanation

When evaluating whether fine-tuning has improved a large language model's accuracy for a help desk question-answering system, the **F1 score** is the most appropriate metric among the given options. Here's why:

### Why F1 Score (Option C) Is Optimal

1. **Balanced evaluation**: The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both aspects. For a help desk system:
   - **Precision** measures how many of the model's answers are correct (avoiding incorrect information).
   - **Recall** measures how many of the correct answers the model retrieves (avoiding missed relevant information).
2. **Question-answering context**: In help desk scenarios, both false positives (incorrect answers) and false negatives (missed correct answers) are problematic. The F1 score accounts for both, making it superior to using precision or recall alone.
3. **Classification task alignment**: Evaluating LLM responses for accuracy typically involves classifying each answer as correct or incorrect, which aligns with classification metrics such as the F1 score.

### Analysis of Other Options

- **A: Precision**: While important, precision alone doesn't capture whether the model is missing relevant information. A model could achieve high precision by being overly conservative and answering only a few questions, which isn't ideal for a help desk.
- **B: Time to first token**: This measures latency, not accuracy. While important for user experience, it doesn't assess whether fine-tuning improved the correctness of answers.
- **D: Word error rate**: WER is primarily used for speech recognition and transcription tasks, where it measures accuracy by comparing word sequences. It isn't suitable for evaluating the semantic accuracy of question-answering systems, where different wordings can convey the same correct meaning.
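As a minimal sketch of the metric itself, the snippet below frames each help desk response as a binary correct/incorrect judgment and computes precision, recall, and F1 from the resulting counts. The evaluation labels are invented for illustration; they are not part of the question.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = correct answer given)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical evaluation set: 1 = this response should be (or was judged) correct
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 1, 1, 0, 1, 0]

p, r, f1 = precision_recall_f1(y_true, y_pred)
# p = 0.8, r ≈ 0.667, f1 ≈ 0.727
```

Comparing F1 on the same evaluation set before and after fine-tuning gives a single number to judge whether accuracy improved.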
### Best Practices Consideration

For evaluating fine-tuned LLMs in question-answering applications, standard practice involves using metrics that assess both the correctness and completeness of responses. The F1 score is widely used in machine learning for classification tasks with imbalanced datasets, or whenever both precision and recall are important, which is exactly the case for help desk systems: users need answers that are both accurate and comprehensive.

Therefore, **C: F1 score** provides the most comprehensive evaluation of whether fine-tuning improved the model's accuracy for this use case.
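To make the precision-only pitfall concrete, here is a small hypothetical comparison: a conservative model that answers only a fifth of the answerable questions, all correctly, scores perfect precision but a low F1, while a balanced model scores higher on F1 despite lower precision. The numbers are illustrative, not from the question.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Conservative model: perfect precision, but answers very few questions
conservative = f1_from(1.0, 0.2)   # ≈ 0.333

# Balanced model: slightly lower precision, much better coverage
balanced = f1_from(0.85, 0.80)     # ≈ 0.824
```

The F1 score ranks the balanced model higher, matching the help desk goal of answers that are both accurate and complete.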
Author: LeetQuiz Editorial Team
## Question

A company has fine-tuned a large language model (LLM) for a help desk question-answering system. How should the company measure whether the fine-tuning improved the model's accuracy?

- A: Precision
- B: Time to first token
- C: F1 score
- D: Word error rate