
Answer-first summary for fast verification
Answer: Massive Multi-task Language Understanding (MMLU) score
The question asks which metric is NOT appropriate for monitoring a deployed LLM application in production. MMLU is a static benchmark used during model development and evaluation to compare model capabilities; it is not computed from live traffic and so is not an ongoing production-monitoring signal. In contrast, the number of customer inquiries processed per unit of time (throughput), the factual accuracy of responses (quality), and the time taken to generate a response (latency) are all measured at serving time and are standard metrics for a production customer service application. The community discussion confirms this, with 100% consensus on A and the explanation that MMLU belongs to pre-deployment evaluation, not production monitoring.
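To make the contrast concrete: the three production-appropriate metrics (B, C, D) can all be tracked from live traffic, while an MMLU score cannot. A minimal in-memory sketch in Python, assuming a simple request loop where each response's latency and a factual-accuracy judgment are available (the class and method names are illustrative, not from any monitoring library):

```python
import time
from dataclasses import dataclass, field


@dataclass
class ProductionMetrics:
    """Minimal tracker for throughput (B), accuracy (C), and latency (D)."""
    latencies: list = field(default_factory=list)  # seconds per response (D)
    accurate: int = 0                              # responses judged factually accurate (C)
    total: int = 0                                 # inquiries processed (B)
    started: float = field(default_factory=time.monotonic)

    def record(self, latency_s: float, is_accurate: bool) -> None:
        """Record one handled inquiry."""
        self.total += 1
        self.latencies.append(latency_s)
        if is_accurate:
            self.accurate += 1

    def throughput_per_min(self) -> float:
        """Inquiries processed per minute since startup (B)."""
        elapsed = max(time.monotonic() - self.started, 1e-9)
        return self.total / (elapsed / 60.0)

    def accuracy_rate(self) -> float:
        """Fraction of responses judged factually accurate (C)."""
        return self.accurate / self.total if self.total else 0.0

    def avg_latency_s(self) -> float:
        """Mean response-generation time in seconds (D)."""
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0
```

Note there is no method for an MMLU score: it is produced by running the model offline against a fixed benchmark dataset, not derived from production requests, which is exactly why option A is the odd one out.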
Author: LeetQuiz Editorial Team
A Generative AI Engineer has deployed an LLM application at a manufacturing company to assist with customer service inquiries. They need to identify the key enterprise metrics for monitoring the application in production.
Which of the following is NOT a metric they would implement for their customer service LLM application in production?
A. Massive Multi-task Language Understanding (MMLU) score
B. Number of customer inquiries processed per unit of time
C. Factual accuracy of the response
D. Time taken for LLM to generate a response