
Question: 20
You are working on a Retrieval-Augmented Generation (RAG) application that uses a large language model (LLM) on Databricks. The cost of inference has risen significantly due to high traffic. You want to use Databricks features to control the cost of running the LLM while maintaining reasonable performance for end users. Which of the following methods would be the BEST way to control LLM costs in your RAG application on Databricks?
A. Run inference on auto-scaling clusters.
B. Use MLflow to log model responses.
C. Cache LLM-generated responses.
D. Serve the model with Databricks Serverless endpoints.
Answer: D
Explanation:
Databricks Serverless endpoints are highly efficient for handling variable traffic because they scale dynamically with incoming request volume, including scaling down when idle. You pay only for the compute you actually use, so costs drop when there are few requests and the endpoint scales up to maintain performance when traffic increases. This makes them well suited to controlling cost in high-traffic scenarios while preserving a good user experience.
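As a concrete illustration, here is a minimal sketch of creating such an endpoint with the Databricks SDK for Python. The endpoint name and Unity Catalog model path are placeholders, and the exact config classes can differ slightly between SDK versions.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()

# Create a serverless Model Serving endpoint that scales with traffic.
# scale_to_zero_enabled lets the endpoint release compute when idle,
# so you stop paying between bursts of requests.
w.serving_endpoints.create(
    name="rag-llm-endpoint",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                entity_name="main.rag.llm_model",  # placeholder UC model
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```

Enabling scale-to-zero is where most of the savings come from under bursty traffic, at the cost of a short cold-start delay on the first request after an idle period.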
Option A (Auto-Scaling clusters) is beneficial but may not scale as efficiently for inference workloads as Serverless endpoints, and you still pay for idle cluster time.
Option B (Using MLflow to log responses) helps with tracking but doesn't directly control infrastructure costs.
Option C (Caching LLM-generated responses) can reduce redundant computation, but it does not address dynamic cost optimization based on traffic patterns, especially when cache misses are frequent or the query set is large and varied (a minimal caching sketch follows the conclusion below).
Thus, Databricks Serverless endpoints offer the most effective balance of cost control and performance for variable traffic in an LLM-based RAG application.
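That said, response caching can still complement a serverless endpoint for workloads with repeated queries. The sketch below is illustrative only: it uses a plain in-process dictionary and a hypothetical call_llm callable standing in for the actual endpoint invocation; a shared store such as a Delta table or Redis would be more realistic in production.

```python
import hashlib
from typing import Callable, Dict

# In-memory cache keyed by a hash of the prompt text.
_cache: Dict[str, str] = {}

def cached_generate(prompt: str, call_llm: Callable[[str], str]) -> str:
    """Return a cached response if available; otherwise call the LLM once."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key in _cache:
        return _cache[key]       # cache hit: no inference cost
    response = call_llm(prompt)  # cache miss: one paid inference call
    _cache[key] = response
    return response
```

Caching only pays off when identical (or normalized) prompts recur; for highly varied RAG queries the hit rate stays low, which is why it is not the best primary cost control here.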