
Answer: max-tokens
## Explanation

When a healthcare chatbot on Bedrock generates long summaries, it increases both latency (response time) and token cost. The parameter that directly controls the length of the generated output is **max-tokens**.

### Parameter Analysis:

1. **max-tokens**: Sets the maximum number of tokens (words/subwords) the model can generate in its response. Reducing this value limits the length of the summaries, which will:
   - Reduce latency (shorter responses are generated faster)
   - Lower token cost (fewer tokens generated means lower cost)
2. **temperature**: Controls randomness in the output (higher = more creative/random, lower = more deterministic). It does not directly control response length.
3. **top-p**: Controls nucleus sampling for diversity in responses. It affects quality/variety, not length.
4. **stop-sequences**: Defines sequences that cause the model to stop generating. While this can indirectly limit length, max-tokens is the direct parameter for controlling response length.

### Why C is Correct:

- The problem specifically mentions "long summaries" causing increased latency and token cost
- max-tokens is the parameter that directly limits the maximum length of generated text
- Setting max-tokens to a lower value produces shorter summaries, reducing both latency and cost

### Best Practice:

For healthcare chatbots, set an appropriate max-tokens value based on:
- The typical length needed for summaries
- Cost constraints
- User experience requirements (response time expectations)
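As a concrete sketch, here is how these parameters map onto a Bedrock Converse API request body. The field names (`maxTokens`, `temperature`, `topP`, `stopSequences`) follow the Converse API's `inferenceConfig`; the model ID and stop sequence are illustrative assumptions, not part of the question.

```python
def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build a Bedrock Converse request body with a capped output length."""
    return {
        "modelId": "anthropic.claude-3-haiku-20240307-v1:0",  # example model ID
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {
            "maxTokens": max_tokens,   # directly caps output length -> lower latency and cost
            "temperature": 0.2,        # low randomness for factual summaries; not a length control
            "topP": 0.9,               # nucleus sampling; affects diversity, not length
            "stopSequences": ["END_OF_SUMMARY"],  # hypothetical marker for early stopping
        },
    }

request = build_request("Summarize the patient intake notes in 3 sentences.",
                        max_tokens=200)
# In a real application this body would be sent with:
#   boto3.client("bedrock-runtime").converse(**request)
```

Lowering `max_tokens` here is the one change that guarantees shorter output; the other fields tune quality, not length.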
Author: Jin H