
**Answer (for fast verification):** C. max-tokens
## Explanation

The correct answer is **C. max-tokens**.

**Why max-tokens is the correct parameter:**

1. **Purpose of max-tokens**: The `max-tokens` parameter controls the maximum number of tokens (words or word pieces) that the model can generate in its response, which directly limits the length of the output.
2. **Problem statement**: The question describes "excessively long responses" that are causing:
   - Increased latency (longer response times)
   - Higher token costs (more tokens generated means higher cost)
3. **How adjusting max-tokens helps**:
   - Reducing the `max-tokens` value caps how long a response can be.
   - Shorter responses mean:
     - Faster generation (reduced latency)
     - Fewer tokens billed (lower cost)

**Why the other options are incorrect:**

- **A. temperature**: Controls the randomness/creativity of responses (higher = more random, lower = more deterministic). It does not control response length.
- **B. top-p**: Controls nucleus sampling, i.e., the cumulative probability threshold used for token selection. It affects response quality and coherence, not length.
- **D. top-k**: Controls the number of highest-probability tokens considered during sampling. It affects response diversity, not length.

**Best Practice Tip**: When optimizing for cost and latency in production applications, setting an appropriate `max-tokens` limit is crucial. Determine the right balance between response completeness and cost/latency constraints for your specific use case; a minimal configuration sketch follows below.
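The question does not name a specific service, so as a hedged illustration, here is a minimal sketch assuming the model is invoked through Amazon Bedrock's Converse API via boto3. The exact parameter name varies by API (`maxTokens` here; `max_tokens` or `max-tokens` elsewhere), and the model ID, prompt, and values shown are illustrative only.

```python
import boto3

# Assumption: the model is hosted on Amazon Bedrock and reachable in this region.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # illustrative model ID
    messages=[
        {
            "role": "user",
            "content": [{"text": "Summarize our refund policy in two sentences."}],
        }
    ],
    inferenceConfig={
        # maxTokens caps the length of the generated output, which is what
        # reduces both latency and per-request token cost.
        "maxTokens": 150,
        # temperature and topP shape randomness/sampling, not response length.
        "temperature": 0.2,
        "topP": 0.9,
    },
)

print(response["output"]["message"]["content"][0]["text"])
```

Lowering `maxTokens` too far can truncate answers mid-sentence, so the limit is typically tuned against representative prompts rather than set to the smallest value that reduces cost.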
Author: Ritesh Yadav