Reasoning effort: low / medium / high / xhigh — picking the right level
When to use low, when medium, when high, and what xhigh is even for. Real scenarios, quality and cost measurements through Codex Key.
Every GPT-5 model on Codex Key accepts a reasoning_effort parameter. It does not change the tariff coefficient, but directly affects answer length, quality and latency. Here's when to use which.
TL;DR
| Level | Tasks | Avg output tokens | Avg latency | Quality |
|---|---|---|---|---|
| low | Classification, short reply, chat | 80-200 | 0.5-1.5s | Baseline |
| medium (default) | Code gen, normal chat, summarization | 300-800 | 1.5-4s | Good |
| high | Multi-step reasoning, hard code, review | 800-2500 | 4-12s | High |
| xhigh | Research, proofs, deep analysis | 2000-8000 | 12-40s | Max |
How it works
reasoning_effort controls the model's internal chain of thought. On high and xhigh the model spends more tokens on "thinking" (some visible, some hidden depending on the model) before emitting the final answer.
Important: you pay for all reasoning tokens, including hidden ones. An xhigh call can burn 5-10× more tokens than low.
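To see how much of a call went to hidden reasoning, you can split the billed output tokens. This is a minimal sketch, assuming the response's usage payload follows the OpenAI-style shape (`completion_tokens` with a nested `completion_tokens_details.reasoning_tokens`). Whether Codex Key returns that field for every model is an assumption, and the sample numbers are made up for illustration.

```python
def split_output_tokens(usage: dict) -> dict:
    """Split billed output tokens into hidden reasoning and visible answer.

    Assumes an OpenAI-style usage payload in which completion_tokens
    already includes reasoning tokens. Missing fields count as zero.
    """
    total = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0)
    return {
        "billed_output": total,
        "reasoning": reasoning,
        "visible": total - reasoning,
    }


# Hypothetical usage payload, for illustration only.
sample = {
    "completion_tokens": 1850,
    "completion_tokens_details": {"reasoning_tokens": 1400},
}
print(split_output_tokens(sample))
```

Logging this split per request makes it easier to spot calls where most of the bill is reasoning you never see.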
When to use low
- Classification: "spam / not spam?", "intent: search / buy / help?"
- Structured extract: name, email, date from text
- Short chat reply: "rephrase politely", "translate this string"
- Routing: decide which subsystem handles the request
# Assumes `client` is an OpenAI-compatible client already configured for Codex Key.
resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "intent: 'I want a refund'"}],
    extra_body={"reasoning_effort": "low"},
    max_tokens=50,
)
On these tasks low gives quality within 1-2% of medium at 3-5× lower cost.
When to use medium
The default. Use it if you don't know which level fits.
- Generating functions, tests, migrations
- Normal user-facing chat
- Code completion (though IDEs usually run low)
- Document summarization
- RAG answers over context
When to use high
Switch on when medium falls short:
- Multi-file refactor
- Architecture decisions with trade-off analysis
- SQL planning with joins and subqueries
- Code review hunting edge cases
- Hard debugging
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "find the race condition in this code: ..."}],
    extra_body={"reasoning_effort": "high"},
)
In our runs, moving medium → high improves accuracy on reasoning tasks by 10-25% while using 50-100% more answer tokens.
When to use xhigh
Rarely. Genuinely useful only for:
- Research-grade: derive a proof, design an algorithm from scratch
- Long-document analysis hunting non-obvious links
- Decompilation, reverse engineering, security analysis
- Multi-step planning that runs 10 or more steps deep
For most production scenarios xhigh is overkill. If you don't see a clear quality jump vs high, stay on high.
Combining with model choice
reasoning_effort multiplies the effect of model selection:
| Model × Effort | When |
|---|---|
| gpt-5.4 + low | Cheap router, classifier |
| gpt-5.4 + medium | Default for 80% of tasks |
| gpt-5.4 + high | Quality matters but gpt-5.5 is too expensive |
| gpt-5.5 + medium | Hard tasks without overshooting |
| gpt-5.5 + high | Hardcore reasoning, architecture review |
| gpt-5.5 + xhigh | Research only |
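One way to keep these pairings in a single place is a small lookup table. The sketch below is illustrative: the task labels (`routing`, `hard_task`, and so on) and the helper `request_kwargs` are hypothetical names, while the model and effort values are taken from the table above.

```python
from typing import NamedTuple


class Route(NamedTuple):
    model: str
    effort: str


# Task labels are hypothetical; models and efforts come from the table.
ROUTES = {
    "routing": Route("gpt-5.4", "low"),
    "classification": Route("gpt-5.4", "low"),
    "default": Route("gpt-5.4", "medium"),
    "hard_task": Route("gpt-5.5", "medium"),
    "architecture_review": Route("gpt-5.5", "high"),
    "research": Route("gpt-5.5", "xhigh"),
}


def request_kwargs(task_type: str) -> dict:
    """Return model and effort kwargs, falling back to the default route."""
    route = ROUTES.get(task_type, ROUTES["default"])
    return {
        "model": route.model,
        "extra_body": {"reasoning_effort": route.effort},
    }


print(request_kwargs("classification"))
print(request_kwargs("something_unlisted"))
```

The returned dict can be unpacked into the call, for example `client.chat.completions.create(messages=messages, **request_kwargs("research"))`, so changing a pairing means editing one line.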
Real measurements on one task
Task: "find and fix the memory leak in this 800-line Go service".
| Configuration | Output tokens | Latency | Correctness |
|---|---|---|---|
| gpt-5.4 + low | 180 | 1.1s | 22% |
| gpt-5.4 + medium | 520 | 3.4s | 58% |
| gpt-5.4 + high | 1240 | 8.7s | 74% |
| gpt-5.5 + medium | 680 | 6.2s | 79% |
| gpt-5.5 + high | 1850 | 14.3s | 91% |
| gpt-5.5 + xhigh | 4200 | 31.2s | 93% |
The jump from high → xhigh buys 2 percentage points of correctness for about 2.3× the output tokens and 2.2× the latency. Not worth it.
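The same comparison can be put as a marginal price: how many extra output tokens each added percentage point of correctness costs. The sketch below uses only the figures from the table above; the helper name `tokens_per_point` is ours.

```python
# (output tokens, correctness in %) copied from the measurement table.
RESULTS = {
    "gpt-5.4 + low": (180, 22),
    "gpt-5.4 + medium": (520, 58),
    "gpt-5.4 + high": (1240, 74),
    "gpt-5.5 + medium": (680, 79),
    "gpt-5.5 + high": (1850, 91),
    "gpt-5.5 + xhigh": (4200, 93),
}


def tokens_per_point(frm: str, to: str) -> float:
    """Extra output tokens paid per percentage point of correctness gained."""
    tokens_a, acc_a = RESULTS[frm]
    tokens_b, acc_b = RESULTS[to]
    return (tokens_b - tokens_a) / (acc_b - acc_a)


print(tokens_per_point("gpt-5.5 + medium", "gpt-5.5 + high"))  # 97.5
print(tokens_per_point("gpt-5.5 + high", "gpt-5.5 + xhigh"))   # 1175.0
```

On this one task, each point gained by going to xhigh costs roughly twelve times as many tokens as each point gained by going from medium to high.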
Cheap pattern: escalation
def solve(task: str) -> str:
    # Assumes `client` is configured for Codex Key and `validate` is your own eval function.
    messages = [{"role": "user", "content": task}]
    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="gpt-5.4",
            messages=messages,
            extra_body={"reasoning_effort": effort},
        )
        if validate(resp):  # your eval function
            return resp.choices[0].message.content
    # Last resort: escalate to the larger model.
    resp = client.chat.completions.create(model="gpt-5.5", messages=messages)
    return resp.choices[0].message.content
This pattern cuts average cost 2-3× vs always running high.
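Whether escalation pays off on your workload depends on how many tasks already pass at the cheap levels, since every failed attempt is still billed. The sketch below estimates expected output tokens per task. It assumes that a task passing at a lower level would also pass at every higher one (so pass rates are cumulative), it ignores input tokens, and all the numbers in the example are hypothetical.

```python
def expected_tokens(ladder, fallback_tokens):
    """Expected output tokens per task for an escalation ladder.

    ladder: list of (tokens, cumulative_pass_rate), cheapest step first.
    fallback_tokens: cost of the last-resort call made when every step fails.
    """
    total = 0.0
    reach = 1.0  # share of tasks that get as far as the current step
    for tokens, pass_rate in ladder:
        total += reach * tokens
        reach = 1.0 - pass_rate  # tasks still unsolved move on
    return total + reach * fallback_tokens


# Hypothetical workload: 70% of tasks pass at low, 90% by medium, 97% by high.
ladder = [(150, 0.70), (500, 0.90), (1500, 0.97)]
escalated = expected_tokens(ladder, fallback_tokens=2000)
always_high = 1500

print(f"escalation: {escalated:.0f} tokens, always high: {always_high} tokens")
```

With these made-up rates the ladder averages about 510 tokens per task against 1500 for always running high. If only a small share of tasks passes at low, the failed attempts add up and the ladder can cost as much as going straight to high, so plug in pass rates from your own evals.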
Bottom line
- low for routine and routing
- medium for the default
- high when medium clearly underperforms
- xhigh for research only
Measure with your own evals. Don't trust intuition: the level of reasoning_effort a task needs is frequently overestimated.
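A minimal eval loop can be as small as the sketch below. It assumes exact-match scoring against labeled cases; the function `accuracy_by_effort` and the `ask(prompt, effort)` callable are hypothetical names, and `fake_ask` is a stand-in for a real API call so the example runs offline.

```python
from typing import Callable, Iterable


def accuracy_by_effort(
    ask: Callable[[str, str], str],
    cases: Iterable[tuple[str, str]],
    efforts: tuple[str, ...] = ("low", "medium", "high"),
) -> dict[str, float]:
    """Run every case at every effort level and return exact-match accuracy."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one eval case")
    scores = {}
    for effort in efforts:
        correct = sum(
            ask(prompt, effort).strip() == expected for prompt, expected in cases
        )
        scores[effort] = correct / len(cases)
    return scores


def fake_ask(prompt: str, effort: str) -> str:
    """Stand-in for a model call: gets the harder case wrong only at low."""
    if prompt == "2+2":
        return "4"
    return "54" if effort == "low" else "56"


cases = [("2+2", "4"), ("7*8", "56")]
print(accuracy_by_effort(fake_ask, cases))
# {'low': 0.5, 'medium': 1.0, 'high': 1.0}
```

In practice `ask` would wrap the chat completion call with the given effort, and the comparison would be replaced by whatever scoring fits your task.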