
Answer-first summary for fast verification
Answer: Employ prompt optimization techniques and cache common query results in Databricks
## Explanation

**Option B is the correct answer** because it directly addresses the cost issue caused by high LLM inference usage while maintaining response quality.

### Why Option B is effective:

- **Prompt optimization techniques** reduce unnecessary token usage by making prompts more efficient and targeted, which directly lowers inference costs.
- **Caching common query results** prevents redundant LLM calls for frequently asked questions, significantly reducing inference costs.
- This approach maintains response quality, since cached responses are identical to original LLM outputs and optimized prompts still generate high-quality responses.

### Why other options are less effective:

- **Option A (Model checkpointing)**: Checkpointing is primarily for training scenarios, not inference cost control. It saves training progress but doesn't reduce inference costs.
- **Option C (Autoscaling)**: While autoscaling manages compute resources efficiently, it doesn't directly reduce the number of LLM inference calls or the token usage that drives costs.
- **Option D (Reducing max tokens)**: This could compromise response quality by truncating potentially important content, and the cost savings may be minimal compared to the other approaches.

This solution leverages Databricks' capabilities for caching and prompt management to achieve significant cost reductions while preserving the application's effectiveness.
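The two cost levers in Option B can be sketched together in a few lines. This is an illustrative assumption, not a Databricks API: `call_llm` is a hypothetical stand-in for a model-serving request (with a counter so the savings are visible), `build_prompt` trims the retrieved context to the top-k chunks (prompt optimization), and a hash-keyed dictionary caches repeated queries.

```python
import hashlib

# Hypothetical stand-in for a model-serving endpoint call (assumption,
# not a real Databricks API); the counter makes cache savings observable.
llm_calls = 0

def call_llm(prompt: str) -> str:
    global llm_calls
    llm_calls += 1
    return f"response to: {prompt[:40]}"

def build_prompt(question: str, retrieved_chunks: list[str], top_k: int = 2) -> str:
    """Prompt optimization sketch: keep only the top-k retrieved chunks
    instead of stuffing every passage into the context window."""
    context = "\n".join(retrieved_chunks[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"

_cache: dict[str, str] = {}

def answer(question: str, retrieved_chunks: list[str]) -> str:
    prompt = build_prompt(question, retrieved_chunks)
    # Normalize before hashing so trivially different phrasings share an entry.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

chunks = ["Refunds are issued within 30 days.", "Shipping takes 5 days.", "Footer text."]
answer("What is the refund policy?", chunks)
answer("what is the refund policy?", chunks)  # served from cache
print(llm_calls)  # 1
```

In production the dictionary would typically be replaced by a shared store (e.g. a Delta table or an external cache), but the cost logic is the same: identical normalized prompts never hit the endpoint twice, and trimmed context lowers the token count of the calls that do go through.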
Author: LeetQuiz
Question 17: You are working with a Retrieval-Augmented Generation (RAG) application that uses a large language model (LLM) to generate responses. The cost of running this application is increasing due to high usage of the LLM for inference. What is the most effective way to use Databricks features to control costs without compromising the quality of responses?

A. Use model checkpointing to avoid retraining the LLM from scratch for each query
B. Employ prompt optimization techniques and cache common query results in Databricks
C. Use the Databricks autoscaling feature to scale compute clusters based on LLM load
D. Decrease the number of tokens used for generation by reducing the max tokens parameter in the LLM