
Answer-first summary for fast verification
Answer: Curate a dataset that can test the retrieval and generation components of the system separately. Use MLflow's built-in evaluation metrics to perform the evaluation on the retrieval and generation components.
Option C is the optimal choice because it provides a systematic, modular approach to evaluating a RAG system: the retrieval and generation components are assessed separately, so the engineer can pinpoint whether a weakness lies in document retrieval accuracy or in answer generation quality, with MLflow's built-in evaluation metrics providing objective measurement. The community discussion supports this approach, with 100% consensus and upvoted comments emphasizing its methodical value for debugging and optimization. Option A (ROUGE score) is limited because it only evaluates generation quality and says nothing about retrieval effectiveness. Option B (LLM-as-a-judge) can be subjective and expensive, and it likewise evaluates only the final answers. Option D (benchmarking multiple LLMs) focuses solely on the generation component and does not address potential retrieval issues.
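To make the modular evaluation concrete, the sketch below computes retrieval metrics (precision@k and recall@k, the same quantities MLflow's built-in `precision_at_k`/`recall_at_k` metrics report when you call `mlflow.evaluate` with `model_type="retriever"`) and a simple generation check separately over a curated eval set. The dataset rows, document IDs, and the substring-match generation check are illustrative assumptions, not part of the question; MLflow's GenAI metrics (e.g. answer correctness) would replace the substring check in practice.

```python
# Sketch: scoring retrieval and generation independently on a curated
# eval set. Plain Python is used here so the arithmetic is visible;
# MLflow's built-in retriever metrics compute the same quantities.

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved doc IDs that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for d in top_k if d in relevant) / len(top_k)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant doc IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

# Hypothetical eval row: ground-truth relevant docs for the retriever
# plus a reference answer for the generator.
eval_set = [
    {"question": "How many PTO days do new hires get?",
     "relevant_docs": {"pto_policy", "benefits_overview"},
     "retrieved": ["pto_policy", "holiday_list", "benefits_overview"],
     "reference_answer": "15 days",
     "generated_answer": "New hires receive 15 days of PTO."},
]

for row in eval_set:
    p = precision_at_k(row["retrieved"], row["relevant_docs"], k=3)
    r = recall_at_k(row["retrieved"], row["relevant_docs"], k=3)
    # Crude generation check: does the answer contain the reference?
    ok = row["reference_answer"].lower() in row["generated_answer"].lower()
    print(f"precision@3={p:.2f} recall@3={r:.2f} answer_ok={ok}")
```

Because the two scores are computed independently, a low recall@k with a passing generation check points at the retriever, while good retrieval scores with wrong answers point at the generator or the prompt.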
Author: LeetQuiz Editorial Team
A Generative AI Engineer has developed a RAG application to help employees interpret HR documentation. The prototype is functional and has received positive initial feedback from internal testers. How should the engineer now formally evaluate the system's performance and identify areas for improvement?
A. Use ROUGE score to comprehensively evaluate the quality of the final generated answers.
B. Use an LLM-as-a-judge to evaluate the quality of the final answers generated.
C. Curate a dataset that can test the retrieval and generation components of the system separately. Use MLflow's built-in evaluation metrics to perform the evaluation on the retrieval and generation components.
D. Benchmark multiple LLMs with the same data and pick the best LLM for the job.