
Answer-first summary for fast verification
Answer: Employ prompt optimization techniques and cache common query results in Databricks
## Explanation

**Option B is the correct answer** because it directly addresses the cost issue caused by high LLM inference usage while maintaining response quality.

### Why Option B is effective:

- **Prompt optimization techniques** reduce unnecessary token usage by making prompts more efficient and targeted, which directly lowers inference costs.
- **Caching common query results** prevents redundant LLM calls for frequently asked questions, significantly reducing inference costs.
- This approach maintains response quality, since cached responses are identical to original LLM outputs and optimized prompts still generate high-quality responses.

### Why other options are less effective:

- **Option A (Model checkpointing)**: Checkpointing is primarily for training scenarios, not inference cost control. It saves training progress but doesn't reduce inference costs.
- **Option C (Autoscaling)**: While autoscaling manages compute resources efficiently, it doesn't directly reduce the number of LLM inference calls or the token usage that drives costs.
- **Option D (Reducing max tokens)**: This could compromise response quality by truncating potentially important content, and the cost savings may be minimal compared to the other approaches.

This solution leverages Databricks' capabilities for caching and prompt management to achieve significant cost reductions while preserving the application's effectiveness.
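The two cost levers in Option B can be sketched together in a few lines. This is an illustrative assumption, not a Databricks API: `call_llm` is a hypothetical stand-in for a model-serving request (with a counter so the savings are visible), `build_prompt` trims the retrieved context to the top-k chunks (prompt optimization), and a hash-keyed dictionary caches repeated queries.

```python
import hashlib

# Hypothetical stand-in for a model-serving endpoint call (assumption,
# not a real Databricks API); the counter makes cache savings observable.
llm_calls = 0

def call_llm(prompt: str) -> str:
    global llm_calls
    llm_calls += 1
    return f"response to: {prompt[:40]}"

def build_prompt(question: str, retrieved_chunks: list[str], top_k: int = 2) -> str:
    """Prompt optimization sketch: keep only the top-k retrieved chunks
    instead of stuffing every passage into the context window."""
    context = "\n".join(retrieved_chunks[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"

_cache: dict[str, str] = {}

def answer(question: str, retrieved_chunks: list[str]) -> str:
    prompt = build_prompt(question, retrieved_chunks)
    # Normalize before hashing so trivially different phrasings share an entry.
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)
    return _cache[key]

chunks = ["Refunds are issued within 30 days.", "Shipping takes 5 days.", "Footer text."]
answer("What is the refund policy?", chunks)
answer("what is the refund policy?", chunks)  # served from cache
print(llm_calls)  # 1
```

In production the dictionary would typically be replaced by a shared store (e.g. a Delta table or an external cache), but the cost logic is the same: identical normalized prompts never hit the endpoint twice, and trimmed context lowers the token count of the calls that do go through.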
Author: LeetQuiz
Question 17: You are working with a Retrieval-Augmented Generation (RAG) application that uses a large language model (LLM) to generate responses. The cost of running this application is increasing due to high usage of the LLM for inference. What is the most effective way to use Databricks features to control costs without compromising the quality of responses?

A. Use model checkpointing to avoid retraining the LLM from scratch for each query
B. Employ prompt optimization techniques and cache common query results in Databricks
C. Use the Databricks autoscaling feature to scale compute clusters based on LLM load
D. Decrease the number of tokens used for generation by reducing the max tokens parameter in the LLM