
## Answer

**C: F1 score**
## Detailed Explanation

When evaluating whether fine-tuning has improved a large language model's accuracy for a help desk question-answering system, the **F1 score** is the most appropriate metric among the given options. Here's why:

### Why F1 Score (Option C) Is Optimal

1. **Balanced evaluation**: The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both aspects. For a help desk system:
   - **Precision** measures how many of the model's answers are correct (avoiding incorrect information).
   - **Recall** measures how many of the correct answers the model retrieves (avoiding missed relevant information).
2. **Question-answering context**: In help desk scenarios, both false positives (incorrect answers) and false negatives (missed correct answers) are problematic. The F1 score accounts for both, making it superior to using precision or recall alone.
3. **Classification task alignment**: Evaluating LLM responses for accuracy typically involves classifying each answer as correct or incorrect, which aligns with classification metrics such as the F1 score.

### Analysis of Other Options

- **A: Precision**: While important, precision alone doesn't capture whether the model is missing relevant information. A model could achieve high precision by being overly conservative and answering only a few questions, which isn't ideal for a help desk.
- **B: Time to first token**: This measures latency, not accuracy. While important for user experience, it doesn't assess whether fine-tuning improved the correctness of answers.
- **D: Word error rate**: WER is primarily used for speech recognition and transcription tasks, where it measures accuracy by comparing word sequences. It isn't suitable for evaluating the semantic accuracy of question-answering systems, where different wordings can convey the same correct meaning.
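As a minimal sketch of the metric itself, the snippet below frames each help desk response as a binary correct/incorrect judgment and computes precision, recall, and F1 from the resulting counts. The evaluation labels are invented for illustration; they are not part of the question.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = correct answer given)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical evaluation set: 1 = this response should be (or was judged) correct
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 1, 1, 0, 1, 0]

p, r, f1 = precision_recall_f1(y_true, y_pred)
# p = 0.8, r ≈ 0.667, f1 ≈ 0.727
```

Comparing F1 on the same evaluation set before and after fine-tuning gives a single number to judge whether accuracy improved.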
### Best Practices Consideration

For evaluating fine-tuned LLMs in question-answering applications, standard practice involves using metrics that assess both the correctness and completeness of responses. The F1 score is widely used in machine learning for classification tasks with imbalanced datasets, or whenever both precision and recall are important, which is exactly the case for help desk systems: users need answers that are both accurate and comprehensive.

Therefore, **C: F1 score** provides the most comprehensive evaluation of whether fine-tuning improved the model's accuracy for this use case.
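To make the precision-only pitfall concrete, here is a small hypothetical comparison: a conservative model that answers only a fifth of the answerable questions, all correctly, scores perfect precision but a low F1, while a balanced model scores higher on F1 despite lower precision. The numbers are illustrative, not from the question.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Conservative model: perfect precision, but answers very few questions
conservative = f1_from(1.0, 0.2)   # ≈ 0.333

# Balanced model: slightly lower precision, much better coverage
balanced = f1_from(0.85, 0.80)     # ≈ 0.824
```

The F1 score ranks the balanced model higher, matching the help desk goal of answers that are both accurate and complete.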
Author: LeetQuiz Editorial Team
## Question

A company has fine-tuned a large language model (LLM) for a help desk question-answering system. How should the company measure whether the fine-tuning improved the model's accuracy?

- A: Precision
- B: Time to first token
- C: F1 score
- D: Word error rate