
Answer-first summary for fast verification
Answer: max-tokens
The `max-tokens` parameter sets an upper limit on the number of tokens (word pieces) the model can generate in its response. When responses are excessively long, driving up both latency and token costs, lowering `max-tokens` caps the response length and reduces both.

**Explanation of other options:**
- **A. temperature**: Controls the randomness/creativity of responses (lower = more deterministic, higher = more creative)
- **B. top-p**: Controls vocabulary diversity through nucleus sampling
- **D. top-k**: Controls vocabulary diversity by limiting sampling to the k most likely tokens

Only `max-tokens` directly addresses response length, latency, and token cost.
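For illustration, here is a minimal sketch of passing these inference parameters in a request, assuming a Bedrock-style Converse API call (the model ID, prompt, and parameter values are placeholders, not from the question; exact parameter names vary by provider, e.g. `maxTokens` vs `max_tokens`):

```python
import boto3

# Hypothetical example: parameter names follow the Bedrock Converse API;
# the model ID and values below are placeholders chosen for illustration.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",  # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our refund policy."}]}],
    inferenceConfig={
        "maxTokens": 256,    # lower cap on generated tokens -> shorter responses, less latency/cost
        "temperature": 0.2,  # randomness/creativity (does not bound length)
        "topP": 0.9,         # nucleus-sampling diversity (does not bound length)
    },
)

print(response["output"]["message"]["content"][0]["text"])
```

Lowering `maxTokens` is the only change above that directly bounds how many tokens are produced; `temperature` and `topP` only affect how each token is chosen.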
Author: Ritesh Yadav