Explanation
The correct answer is C. max-tokens.
Why max-tokens is the correct parameter:
- max-tokens sets an upper bound on the number of tokens (words or sub-word pieces) that the model can generate in its response
- Overly long summaries directly increase:
  - Latency: Longer responses take more time to generate, because tokens are produced one at a time
  - Token cost: More output tokens mean higher API costs
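The relationship between output length, cost, and latency can be sketched with simple arithmetic. The price and throughput figures below are hypothetical placeholders, not values from any real provider, and the function names are made up for illustration.

```python
# Hypothetical figures for illustration only; substitute your provider's real numbers.
PRICE_PER_1K_OUTPUT_TOKENS = 0.002  # assumed price in USD per 1,000 output tokens
TOKENS_PER_SECOND = 50.0            # assumed generation throughput


def estimate_output_cost(output_tokens: int) -> float:
    """Estimated cost in USD for generating `output_tokens` tokens."""
    return output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS


def estimate_generation_seconds(output_tokens: int) -> float:
    """Estimated time in seconds to generate `output_tokens` tokens."""
    return output_tokens / TOKENS_PER_SECOND


if __name__ == "__main__":
    # Compare a short summary with a long one under the assumed figures.
    for n in (200, 800):
        cost = estimate_output_cost(n)
        seconds = estimate_generation_seconds(n)
        print(f"{n} tokens: ~${cost:.4f}, ~{seconds:.1f} s")
```

Under these assumptions both cost and generation time scale linearly with the number of output tokens, which is why capping output length addresses both problems at once.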
How adjusting max-tokens helps:
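- Setting a lower max-tokens limit caps the length of every summary the model returns
- This reduces both generation time and cost, since fewer tokens are generated and billed
- The limit is a hard cutoff, not an instruction: the model does not plan around it, so a response that hits the limit may end mid-sentence. Pairing the limit with a prompt that asks for a brief summary helps keep the output complete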
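The hard-cutoff behavior can be illustrated with a toy decoding loop. This is a minimal sketch, not a real model or API: `generate` and its return values are hypothetical, and a list of words stands in for the tokens a model would produce.

```python
from typing import Iterable, List, Tuple


def generate(token_stream: Iterable[str], max_tokens: int) -> Tuple[List[str], str]:
    """Toy stand-in for a decoding loop (hypothetical, not a real API).

    Returns the emitted tokens and the reason generation stopped.
    """
    output: List[str] = []
    for token in token_stream:
        # Stop as soon as the cap is reached, even if more tokens are pending.
        if len(output) >= max_tokens:
            return output, "max_tokens"
        output.append(token)
    # The stream ended on its own before the cap was hit.
    return output, "end_of_stream"


if __name__ == "__main__":
    tokens = "The patient reports mild headache and no fever .".split()
    short, reason = generate(tokens, max_tokens=4)
    print(" ".join(short), "|", reason)
    full, reason = generate(tokens, max_tokens=20)
    print(" ".join(full), "|", reason)
```

With the cap set to 4, the sketch stops after "The patient reports mild", which shows why the limit alone bounds length but does not guarantee a well-formed summary.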
Why other options are incorrect:
- A. temperature: Controls randomness/creativity of responses, not length
- B. top-p: Restricts sampling to the most probable tokens whose cumulative probability reaches p (nucleus sampling); affects word choice and diversity, not length
- D. stop-sequences: Used to define stopping points for generation, but doesn't directly control overall response length
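To see why stop-sequences do not bound length on their own, consider a simplified post-hoc view of what a stop sequence does. The helper `apply_stop_sequences` is hypothetical and only approximates the effect by trimming finished text; real services stop during generation.

```python
from typing import List


def apply_stop_sequences(text: str, stop_sequences: List[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence.

    Hypothetical helper: if no stop sequence appears, the text is returned unchanged.
    """
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


if __name__ == "__main__":
    with_stop = "Summary: stable vitals.\n\nNotes: follow up in two weeks."
    without_stop = "Summary: stable vitals. Notes: follow up in two weeks."
    print(repr(apply_stop_sequences(with_stop, ["\n\n"])))
    print(repr(apply_stop_sequences(without_stop, ["\n\n"])))
```

The second call returns the full text because the stop sequence never appears. Length is bounded only if the model happens to emit the sequence, whereas max-tokens applies unconditionally.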
Best Practice:
For healthcare chatbots where concise, accurate information is crucial, setting an appropriate max-tokens limit ensures responses remain focused and cost-effective while maintaining quality.