
Answer-first summary for fast verification
Answer: **D** — Regularly log BLEU and ROUGE scores on a fixed set of evaluation queries and compare them over time
## Explanation

**D. Regularly log BLEU and ROUGE scores on a fixed set of evaluation queries and compare them over time**

BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) are standard metrics for evaluating the quality of generated text in natural language processing tasks.

### Why this approach works:

- **BLEU** measures the precision of n-grams in generated text compared to reference text
- **ROUGE** measures recall by comparing overlapping n-grams, word sequences, and word pairs
- Using a **fixed set of evaluation queries** ensures consistent comparison over time
- **Regular logging** in MLflow allows tracking performance trends and detecting degradation
- This directly addresses the concern about "generating useful and accurate responses"

### Why not the other options:

- **A**: Monitoring retrieval accuracy only assesses the document retrieval component, not the response generation quality
- **B**: Tracking query volume monitors usage patterns but doesn't measure performance quality
- **C**: Learning rate and number of training epochs are training hyperparameters, not production performance metrics

This approach provides quantitative, reproducible metrics for detecting drift in response generation quality over time.
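The workflow above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `EVAL_SET`, `evaluate_and_log`, and the unigram-only metric helpers are hypothetical names invented here (real deployments would typically use full BLEU/ROUGE implementations such as `sacrebleu` or `rouge-score`), while `mlflow.start_run` and `mlflow.log_metric` are the standard MLflow Tracking calls.

```python
from collections import Counter

def bleu1_precision(candidate: str, reference: str) -> float:
    """Unigram BLEU-style precision: clipped fraction of candidate tokens found in the reference."""
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)  # clipped counts
    return sum(overlap.values()) / max(len(cand), 1)

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram ROUGE-style recall: fraction of reference tokens recovered by the candidate."""
    cand, ref = candidate.split(), reference.split()
    overlap = Counter(cand) & Counter(ref)
    return sum(overlap.values()) / max(len(ref), 1)

# Hypothetical fixed evaluation set: (query, reference answer) pairs.
# Keeping this set frozen is what makes scores comparable across runs.
EVAL_SET = [
    ("How do I reset my password?",
     "go to settings and click reset password"),
    ("What is the refund policy?",
     "refunds are issued within 14 days of purchase"),
]

def evaluate_and_log(generate, run_name="drift-check"):
    """Score the model on the fixed set and log mean metrics to MLflow (if installed)."""
    bleu_scores = [bleu1_precision(generate(q), ref) for q, ref in EVAL_SET]
    rouge_scores = [rouge1_recall(generate(q), ref) for q, ref in EVAL_SET]
    mean_bleu = sum(bleu_scores) / len(bleu_scores)
    mean_rouge = sum(rouge_scores) / len(rouge_scores)
    try:
        import mlflow
        with mlflow.start_run(run_name=run_name):
            mlflow.log_metric("mean_bleu1", mean_bleu)
            mlflow.log_metric("mean_rouge1", mean_rouge)
    except ImportError:
        pass  # MLflow not available; metrics are still returned for inspection
    return mean_bleu, mean_rouge
```

Running `evaluate_and_log` on a schedule (daily, or per deployment) produces a time series of `mean_bleu1` / `mean_rouge1` in the MLflow UI; a sustained drop relative to earlier runs is the drift signal the question asks about.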
Author: LeetQuiz
You have deployed a RAG model for document retrieval and response generation in a customer service application. Over time, you want to monitor if the performance of your model degrades, particularly in terms of its ability to generate useful and accurate responses. Which of the following approaches would be most appropriate for using MLflow to monitor model drift over time?
- **A.** Monitor the accuracy of the retrieval step over time
- **B.** Track the number of queries processed by the model daily
- **C.** Monitor the change in the learning rate and number of training epochs used in fine-tuning the model
- **D.** Regularly log BLEU and ROUGE scores on a fixed set of evaluation queries and compare them over time