Thinking Token Budget
Token budget parameters for thinking LLMs usually cap how many internal reasoning tokens the model may spend before producing the visible answer.
Common names by API/provider include:
- max_tokens / max_output_tokens: caps generated output tokens, sometimes including hidden reasoning tokens depending on the API.
- reasoning_effort: qualitative budget like low, medium, high; the API maps this to an internal reasoning-token allowance.
- thinking_budget / budget_tokens: explicit number of hidden reasoning tokens allowed for models that expose thinking controls.
- max_completion_tokens: in some APIs, caps both reasoning tokens and final answer tokens together.
Why it matters:
- Higher budget: useful for hard math, coding, planning, and multi-step debugging.
- Lower budget: cheaper, faster, enough for simple Q&A or formatting tasks.
- Too low: model may answer prematurely or miss steps.
- Too high: slower and more expensive, sometimes overthinks simple tasks.
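The trade-offs above can be sketched as a small lookup that picks a qualitative effort level per task. The task categories, the function name pick_reasoning_effort, and the mapping itself are assumptions made for illustration, not values recommended by any provider.

```python
# Hypothetical mapping from task type to a qualitative reasoning effort.
# Simple tasks get a low budget (cheaper, faster); multi-step tasks get a
# high one. Tune these choices against your own workload.
EFFORT_BY_TASK = {
    "formatting": "low",
    "simple_qa": "low",
    "math": "high",
    "coding": "high",
    "planning": "high",
    "debugging": "high",
}


def pick_reasoning_effort(task_type: str, default: str = "medium") -> str:
    """Return an effort level for a task, falling back to a middle setting."""
    return EFFORT_BY_TASK.get(task_type, default)


if __name__ == "__main__":
    for task in ("formatting", "math", "summarization"):
        print(task, "->", pick_reasoning_effort(task))
```

Unknown task types fall back to "medium" so that neither failure mode (answering prematurely or overthinking) is the default.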
Mental model: total completion budget = hidden reasoning tokens + visible answer tokens
If the completion cap is tight, a thinking model may spend tokens reasoning and have too little room left for the final answer.
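A minimal sketch of that arithmetic, assuming an API where one completion cap covers both reasoning and answer tokens. The function names and the numbers are illustrative only.

```python
def answer_room(completion_cap: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer when the cap covers both kinds."""
    return max(completion_cap - reasoning_tokens, 0)


def cap_is_too_tight(completion_cap: int, reasoning_budget: int,
                     min_answer_tokens: int) -> bool:
    """True if spending the full reasoning budget would leave less room
    than the answer is expected to need."""
    return answer_room(completion_cap, reasoning_budget) < min_answer_tokens


if __name__ == "__main__":
    # With a 1000-token cap and 800 tokens of reasoning, 200 remain.
    print(answer_room(1000, 800))
    # An answer expected to need about 300 tokens would be cut off.
    print(cap_is_too_tight(1000, 800, 300))
```

A check like this can run before a request is sent, so the cap is raised or the reasoning budget lowered instead of getting a truncated answer back.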
Example qualitative setting: { "model": "reasoning-model", "reasoning_effort": "medium", "max_output_tokens": 1000 }
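Example explicit thinking budget: { "thinking": { "type": "enabled", "budget_tokens": 2048 }, "max_output_tokens": 4096 }
Here the output cap is set above the thinking budget, because an API that counts reasoning tokens against the cap would otherwise leave no room for the final answer.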
The exact parameter name depends on the model provider and API.
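Because the names vary, one option is to keep the provider-specific field names in a single helper. The sketch below assumes just two styles, using the generic field names from the examples in this note. The function build_request and the style labels "qualitative" and "explicit" are hypothetical and do not correspond to any particular provider's SDK.

```python
import json


def build_request(model: str, style: str, max_output_tokens: int,
                  effort: str = "medium", budget_tokens: int = 2048) -> dict:
    """Build a request payload for one of two assumed budget styles."""
    payload = {"model": model, "max_output_tokens": max_output_tokens}
    if style == "qualitative":
        # Qualitative budget: the API maps the level to a token allowance.
        payload["reasoning_effort"] = effort
    elif style == "explicit":
        # Explicit budget: a fixed number of hidden reasoning tokens.
        payload["thinking"] = {"type": "enabled", "budget_tokens": budget_tokens}
    else:
        raise ValueError(f"unknown budget style: {style!r}")
    return payload


if __name__ == "__main__":
    print(json.dumps(build_request("reasoning-model", "qualitative", 1000)))
    print(json.dumps(build_request("reasoning-model", "explicit", 4096)))
```

Check the real field names and limits against the documentation of the API you are calling before adapting this.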