
Question: 20
You are working on a Retrieval-Augmented Generation (RAG) application using a large language model (LLM) on Databricks. The cost of inference has increased significantly due to high traffic. You want to use Databricks features to control the costs associated with running the LLM while maintaining reasonable performance for end-users. Which of the following methods would be the BEST way to control LLM costs in your RAG application on Databricks?
A
Use Databricks Auto-Scaling clusters to dynamically adjust the number of nodes in your cluster based on workload, reducing costs during periods of low traffic.
B
Use MLflow to log all LLM responses and track usage, but do not change the underlying infrastructure as Databricks optimizes costs automatically.
C
Cache all LLM-generated responses in Databricks to avoid repeated queries to the model.
D
Utilize Databricks Serverless endpoints, which automatically adjust based on the number of incoming requests, to optimize cost-per-query for LLM inference.
Correct Answer: D

Explanation:
Databricks serverless Model Serving endpoints are well suited to variable traffic: they scale automatically with incoming request volume and can scale to zero when idle, so you pay only for the compute you actually use. Costs drop when there are few requests, and capacity scales back up to maintain performance when traffic increases. This makes them ideal for controlling costs in high-traffic scenarios while preserving a good end-user experience.
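As a concrete illustration, here is a minimal sketch of creating such an endpoint with the Databricks Python SDK (`databricks-sdk`). The endpoint name and the Unity Catalog model name are hypothetical placeholders, and the exact class names should be verified against the current SDK documentation:

```python
# A minimal sketch, assuming the databricks-sdk package and a model
# already registered in Unity Catalog. Endpoint and model names below
# are hypothetical placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()  # reads workspace host/token from the environment

w.serving_endpoints.create(
    name="rag-llm-endpoint",  # hypothetical endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.models.rag_llm",  # hypothetical UC model
                entity_version="1",
                workload_size="Small",         # smallest provisioned size
                scale_to_zero_enabled=True,    # pay nothing when idle
            )
        ]
    ),
)
```

With `scale_to_zero_enabled=True`, the endpoint releases compute entirely during idle periods, which is the key cost lever the question is testing.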
Option A (Auto-Scaling clusters) is beneficial but may not scale as efficiently for inference workloads as Serverless endpoints, and you still pay for idle cluster time.
Option B (Using MLflow to log responses) helps with tracking but doesn't directly control infrastructure costs.
Option C (Caching LLM-generated responses) can reduce redundant computation, but it doesn't address the core issue of dynamic cost optimization based on traffic patterns, especially if cache misses are frequent or the query distribution is large and varied (a minimal caching sketch follows this explanation).
Thus, Databricks Serverless endpoints offer the most effective balance of cost control and performance for variable traffic in an LLM-based RAG application.
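That said, caching can usefully complement a serverless endpoint. Below is a minimal sketch of an in-memory response cache keyed by a hash of the prompt; `query_llm_endpoint` is a hypothetical placeholder for whatever call your application makes to the serving endpoint. A production version would likely persist the cache (e.g., in a Delta table) and normalize prompts before hashing:

```python
import hashlib

# Minimal in-memory response cache keyed by a SHA-256 hash of the prompt.
# query_llm_endpoint is a hypothetical placeholder for the actual call
# your application makes to the model serving endpoint.
_cache: dict[str, str] = {}

def cached_completion(prompt: str, query_llm_endpoint) -> str:
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        # Cache miss: pay for one inference call, then store the result.
        _cache[key] = query_llm_endpoint(prompt)
    # Cache hit: zero inference cost for repeated identical prompts.
    return _cache[key]
```

Note that in a RAG application, user queries rarely repeat verbatim, which is exactly why caching alone (Option C) is weaker than serverless scaling as the primary cost control.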