
**Answer: C. F1 score**
## Explanation

For evaluating the accuracy of a fine-tuned LLM for a help desk question-answering system, the **F1 score** is the most appropriate metric among the given options.

### Why F1 Score Is Correct

1. The **F1 score** is the harmonic mean of precision and recall, providing a balanced measure of a model's accuracy on classification and answer-matching tasks.
2. For question-answering systems, we need to evaluate how well the model provides correct answers, which involves measuring both:
   - **Precision**: How many of the predicted answers (or answer tokens) are actually correct
   - **Recall**: How many of the correct answers are captured by the model
3. The F1 score combines both metrics into a single number, making it well suited for evaluating overall accuracy in information retrieval and question-answering tasks.

### Why the Other Options Are Less Suitable

- **A. Precision**: While important, precision alone doesn't give the complete picture. A model could have high precision but low recall (missing many correct answers).
- **B. Time to first token**: This measures response latency, not accuracy. It reflects speed, not the quality of answers.
- **D. Word error rate**: This is typically used for speech recognition or transcription tasks, not for evaluating the correctness of generated answers in a question-answering system.

### Additional Context

For LLM evaluation on question-answering tasks, other relevant metrics include:

- **Exact Match (EM)**: Whether the answer exactly matches the ground truth
- **ROUGE-L/BLEU scores**: For evaluating text generation quality
- **Human evaluation**: For subjective assessment of answer quality

However, among the given options, the F1 score is the most comprehensive metric for evaluating accuracy improvements from fine-tuning.
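To make the precision/recall/F1 relationship concrete, here is a minimal sketch of token-overlap F1 scoring for QA answers, in the style of common extractive-QA evaluation. The function name `qa_f1` and the whitespace tokenization are illustrative assumptions, not a specific library's API; real evaluation scripts typically also normalize punctuation and articles.

```python
from collections import Counter

def qa_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    # Count tokens present in both answers (multiset intersection)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)  # fraction of predicted tokens that are correct
    recall = num_same / len(gold_tokens)     # fraction of reference tokens the model recovered
    return 2 * precision * recall / (precision + recall)  # harmonic mean

# Example: partial overlap rewards both correctness and coverage
print(qa_f1("reset password", "reset your password"))  # → 0.8
```

Note how a prediction that is fully precise ("reset password" contains no wrong tokens) still scores below 1.0 because it misses a reference token; precision alone (option A) would hide that gap.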
Author: Ritesh Yadav
A company has fine-tuned a large language model (LLM) to answer questions for a help desk. The company wants to determine if the fine-tuning has enhanced the model's accuracy. Which metric should the company use for the evaluation?
A. Precision
B. Time to first token
C. F1 score
D. Word error rate