
Answer-first summary for fast verification
Answer: Curate a dataset that can test the retrieval and generation components of the system separately. Use MLflow's built-in evaluation metrics to perform the evaluation on the retrieval and generation components.
Option B is correct because it directly addresses the core requirement: identifying specific areas for improvement in the RAG system. By curating a dataset that tests the retrieval and generation components separately and applying MLflow's built-in evaluation metrics to each, the engineer can pinpoint whether problems lie in document-retrieval accuracy or in answer-generation quality. This component-wise approach follows established RAG evaluation best practice. Option D (LLM-as-a-judge) has some support in the discussion, but community consensus and upvotes favor option B, particularly after a commenter revised their initial position upon learning that MLflow supports LLM-based evaluation metrics. Option A (cosine similarity) is too narrow to evaluate the system comprehensively. Option C (benchmarking multiple LLMs) addresses only the generation component and does not isolate improvement areas across the rest of the RAG pipeline.
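The component-wise evaluation that option B describes can be sketched as follows. This is a minimal illustration, not MLflow's actual API: the two metrics below are plain-Python stand-ins for the kind of built-in metrics MLflow provides (a precision-at-k style retrieval metric and a token-overlap generation metric), and the example record and document IDs are invented.

```python
# Sketch of evaluating the retrieval and generation components separately.
# Metric implementations and the sample record are illustrative stand-ins,
# not MLflow code.

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Retrieval metric: fraction of the top-k retrieved documents
    that appear in the ground-truth relevant set."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def token_f1(prediction, reference):
    """Generation metric: token-overlap F1 between the generated
    answer and the curated reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = set(pred) & set(ref)
    overlap = sum(min(pred.count(t), ref.count(t)) for t in common)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# One curated evaluation record (hypothetical data): ground-truth
# relevant docs exercise the retriever; a reference answer exercises
# the generator.
example = {
    "question": "What is the PTO carryover limit?",
    "relevant_docs": {"doc_12", "doc_40"},
    "retrieved_docs": ["doc_12", "doc_7", "doc_40"],
    "reference_answer": "employees may carry over five days of pto",
    "generated_answer": "employees may carry over five days of pto each year",
}

retrieval_score = precision_at_k(
    example["retrieved_docs"], example["relevant_docs"], k=3
)
generation_score = token_f1(
    example["generated_answer"], example["reference_answer"]
)
print(f"precision@3 = {retrieval_score:.2f}")  # retrieval component
print(f"answer F1   = {generation_score:.2f}")  # generation component
```

Scoring each component against its own ground truth is what lets the engineer tell a retrieval failure (low precision@k) apart from a generation failure (low answer F1 despite good retrieval), which a single end-to-end score cannot do.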
Author: LeetQuiz Editorial Team
A Generative AI Engineer has developed a RAG application that enables employees to retrieve answers from an internal knowledge base, such as Confluence pages or Google Drive. The prototype is functional and has received positive feedback from internal testers. The engineer now wants to conduct a formal evaluation of the system's performance and identify areas for improvement.
How should the Generative AI Engineer evaluate the system?
A
Use cosine similarity score to comprehensively evaluate the quality of the final generated answers.
B
Curate a dataset that can test the retrieval and generation components of the system separately. Use MLflow's built-in evaluation metrics to perform the evaluation on the retrieval and generation components.
C
Benchmark multiple LLMs with the same data and pick the best LLM for the job.
D
Use an LLM-as-a-judge to evaluate the quality of the final answers generated.