
Ultimate access to all questions.
Explanation:
When a healthcare chatbot on Bedrock generates long summaries, it increases both latency (response time) and token cost. The parameter that directly controls the length of the generated output is max-tokens.
max-tokens: This parameter sets the maximum number of tokens (words/subwords) that the model can generate in its response. By reducing this value, you can limit the length of the summaries, which will:
temperature: Controls randomness in the output (higher = more creative/random, lower = more deterministic). This doesn't directly control response length.
top-p: Controls nucleus sampling for diversity in responses. This affects quality/variety but not length.
stop-sequences: Defines sequences that cause the model to stop generating. While this can indirectly limit length, max-tokens is the direct parameter for controlling response length.
For healthcare chatbots, you might want to set an appropriate max-tokens value based on:
No comments yet.