API GUIDE

Reasoning effort: low / medium / high / xhigh — picking the right level

When to use low, when medium, when high, and what xhigh is even for. Real scenarios, quality and cost measurements through Codex Key.

May 19, 2026 · reasoning · api · optimization · gpt-5

Every GPT-5 model on Codex Key accepts a reasoning_effort parameter. It does not change the per-token price, but it directly affects answer length, quality, and latency, and therefore what each call costs. Here's when to use which.

TL;DR

Level	Tasks	Avg output tokens	Avg latency	Quality
low	Classification, short reply, chat	80-200	0.5-1.5s	Baseline
medium (default)	Code gen, normal chat, summarization	300-800	1.5-4s	Good
high	Multi-step reasoning, hard code, review	800-2500	4-12s	High
xhigh	Research, proofs, deep analysis	2000-8000	12-40s	Max

How it works

reasoning_effort controls the model's internal chain of thought. On high and xhigh the model spends more tokens on "thinking" (some visible, some hidden depending on the model) before emitting the final answer.

Important: you pay for all reasoning tokens, including hidden ones. Going by the output ranges in the TL;DR table, an xhigh call can burn 10× or more the tokens of a low call.
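
To see how much of a bill is hidden reasoning, you can split the usage numbers a response reports. The sketch below assumes the endpoint returns OpenAI-style usage fields, where `completion_tokens` already includes `completion_tokens_details.reasoning_tokens`. Whether Codex Key exposes these fields for every model is an assumption to check against a real response. `split_output_tokens` is a hypothetical helper name, and the sample numbers are illustrative.

```python
def split_output_tokens(usage: dict) -> dict:
    """Split billed output tokens into visible answer vs. hidden reasoning.

    Assumes an OpenAI-style usage payload, where `completion_tokens`
    already includes `completion_tokens_details.reasoning_tokens`.
    """
    total = usage.get("completion_tokens", 0)
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens", 0) or 0
    return {
        "billed_output": total,
        "reasoning": reasoning,
        "visible": total - reasoning,
    }


# Illustrative numbers, not a measurement.
sample_usage = {
    "prompt_tokens": 120,
    "completion_tokens": 900,
    "completion_tokens_details": {"reasoning_tokens": 640},
}
print(split_output_tokens(sample_usage))
```

With the OpenAI Python SDK, `resp.usage.model_dump()` should give a dict of this shape, so the helper can be pointed at real responses when comparing effort levels.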

When to use low

Classification: "spam / not spam?", "intent: search / buy / help?"
Structured extract: name, email, date from text
Short chat reply: "rephrase politely", "translate this string"
Routing: decide which subsystem handles the request

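from openai import OpenAI

# Assumes the OpenAI Python SDK; OpenAI() reads OPENAI_API_KEY and OPENAI_BASE_URL
# from the environment, so point them at your Codex Key credentials and endpoint.
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "intent: 'I want a refund'"}],
    extra_body={"reasoning_effort": "low"},
    max_tokens=50,
)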

On these tasks low gives quality within 1-2% of medium at 3-5× lower cost.

When to use medium

The default. Use it if you don't know which level fits.

Generating functions, tests, migrations
Normal user-facing chat
Code completion (though IDEs usually run low)
Document summarization
RAG answers over context

When to use high

Switch on when medium falls short:

Multi-file refactor
Architecture decisions with trade-off analysis
SQL planning with joins and subqueries
Code review hunting edge cases
Hard debugging

resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "find the race condition in this code: ..."}],
    extra_body={"reasoning_effort": "high"},
)

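In our runs, moving medium → high adds 10-25% accuracy on reasoning tasks for 50-100% more answer tokens on average. Hard tasks can cost more: in the Go memory-leak measurement below, output tokens grow roughly 2.4-2.7×.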

When to use xhigh

Rarely. Genuinely useful only for:

Research-grade: derive a proof, design an algorithm from scratch
Long-document analysis hunting non-obvious links
Decompilation, reverse engineering, security analysis
Multi-step planning with depth 10+

For most production scenarios xhigh is overkill. If you don't see a clear quality jump vs high, stay on high.

Combining with model choice

reasoning_effort multiplies the effect of model selection:

Model × Effort	When
`gpt-5.4` + `low`	Cheap router, classifier
`gpt-5.4` + `medium`	Default for 80% of tasks
`gpt-5.4` + `high`	Quality matters but `gpt-5.5` is too expensive
`gpt-5.5` + `medium`	Hard tasks without overshooting
`gpt-5.5` + `high`	Hardcore reasoning, architecture review
`gpt-5.5` + `xhigh`	Research only
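
One way to apply the table is a small routing map that turns a task category into request arguments. This is a minimal sketch: the category names (`classify`, `quality_on_budget`, and so on) and the helper `request_kwargs` are hypothetical, while the model and effort pairs are copied from the table above.

```python
from typing import NamedTuple


class Route(NamedTuple):
    model: str
    effort: str


# Category names are hypothetical labels; model/effort pairs follow the table.
ROUTES = {
    "classify": Route("gpt-5.4", "low"),
    "default": Route("gpt-5.4", "medium"),
    "quality_on_budget": Route("gpt-5.4", "high"),
    "hard": Route("gpt-5.5", "medium"),
    "architecture_review": Route("gpt-5.5", "high"),
    "research": Route("gpt-5.5", "xhigh"),
}


def request_kwargs(category: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create.

    Unknown categories fall back to the default route.
    """
    route = ROUTES.get(category, ROUTES["default"])
    return {
        "model": route.model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"reasoning_effort": route.effort},
    }


print(request_kwargs("classify", "intent: 'I want a refund'"))
```

The result can be passed straight through as `client.chat.completions.create(**request_kwargs(...))`. Keeping the map in one place makes it easy to move a category up or down a level after an eval run.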

Real measurements on one task

Task: "find and fix the memory leak in this 800-line Go service".

Configuration	Output tokens	Latency	Correctness
`gpt-5.4` + `low`	180	1.1s	22%
`gpt-5.4` + `medium`	520	3.4s	58%
`gpt-5.4` + `high`	1240	8.7s	74%
`gpt-5.5` + `medium`	680	6.2s	79%
`gpt-5.5` + `high`	1850	14.3s	91%
`gpt-5.5` + `xhigh`	4200	31.2s	93%

The jump from high → xhigh on gpt-5.5: +2 points of correctness (91% → 93%) for 2.3× the tokens and 2.2× the latency. Not worth it.
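
The same comparison can be computed for every step in the table rather than just the last one. The sketch below copies the rows from the table above. `step_cost` is a hypothetical helper that returns the correctness gain in percentage points along with the token and latency ratios.

```python
# (configuration, output tokens, latency in seconds, correctness in %)
# copied from the measurement table above.
RUNS = [
    ("gpt-5.4 + low", 180, 1.1, 22),
    ("gpt-5.4 + medium", 520, 3.4, 58),
    ("gpt-5.4 + high", 1240, 8.7, 74),
    ("gpt-5.5 + medium", 680, 6.2, 79),
    ("gpt-5.5 + high", 1850, 14.3, 91),
    ("gpt-5.5 + xhigh", 4200, 31.2, 93),
]
by_name = {run[0]: run for run in RUNS}


def step_cost(before, after):
    """Return (correctness gain in points, token ratio, latency ratio)."""
    _, tok_a, lat_a, acc_a = before
    _, tok_b, lat_b, acc_b = after
    return acc_b - acc_a, tok_b / tok_a, lat_b / lat_a


STEPS = [
    ("gpt-5.4 + medium", "gpt-5.4 + high"),
    ("gpt-5.5 + medium", "gpt-5.5 + high"),
    ("gpt-5.5 + high", "gpt-5.5 + xhigh"),
]
for a, b in STEPS:
    gain, tok, lat = step_cost(by_name[a], by_name[b])
    print(f"{a} -> {b}: +{gain} pts, {tok:.1f}x tokens, {lat:.1f}x latency")
```

On these numbers, gpt-5.5 medium → high buys 12 points for about 2.7× the tokens, while high → xhigh buys 2 points for about 2.3×. The token ratio is similar but the payoff is far smaller.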

Cheap pattern: escalation

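def solve(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    # Escalate effort on the cheaper model until the answer passes validation.
    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="gpt-5.4",
            messages=messages,
            extra_body={"reasoning_effort": effort},
        )
        if validate(resp):    # your eval function
            return resp.choices[0].message.content
    # Last resort: the stronger model at its default effort (medium).
    resp = client.chat.completions.create(model="gpt-5.5", messages=messages)
    return resp.choices[0].message.content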

This pattern cuts average cost 2-3× vs always running high.
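
Whether escalation actually saves money depends on how often the cheap levels pass validation, because every rejected attempt is billed too. The sketch below estimates expected tokens per task for a ladder of levels. `expected_tokens` is a hypothetical helper, and the token counts and pass probabilities are made-up inputs, not measurements.

```python
def expected_tokens(levels):
    """Expected tokens per task for an escalation ladder.

    `levels` is a list of (tokens_per_call, pass_probability) pairs in the
    order they are tried. pass_probability is the chance that validation
    accepts the answer at that level, given that earlier levels failed.
    Every attempted call is billed, including the rejected ones.
    """
    total = 0.0
    p_reach = 1.0  # probability that this level is attempted at all
    for tokens, p_pass in levels:
        total += p_reach * tokens
        p_reach *= 1.0 - p_pass
    return total


# Hypothetical workloads, not measurements.
mostly_easy = [(150, 0.6), (500, 0.5), (1200, 1.0)]
mostly_hard = [(150, 0.1), (500, 0.2), (1200, 1.0)]
always_high = [(1200, 1.0)]

for name, ladder in [("easy", mostly_easy), ("hard", mostly_hard), ("always high", always_high)]:
    print(name, round(expected_tokens(ladder)))
```

With the easy workload the ladder averages about 590 tokens against 1200 for always running high, roughly a 2× saving. With the hard workload it averages about 1464, which is worse than going straight to high. Escalation pays off only when a large share of traffic passes at low or medium.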

Bottom line

low for routine and routing
medium for the default
high when medium clearly underperforms
xhigh for research only

Measure with your own evals. Don't trust intuition: the effort level a task actually needs is frequently overestimated.
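
A minimal sketch of such an eval, assuming exact-match scoring is good enough for your task. `compare_efforts` and `fake_ask` are hypothetical names. In real use, `ask(prompt, effort)` would be your own wrapper around the API call, and `cases` would come from your eval set.

```python
from typing import Callable, Iterable


def compare_efforts(
    ask: Callable[[str, str], str],
    cases: Iterable[tuple[str, str]],
    efforts: tuple[str, ...] = ("low", "medium", "high"),
) -> dict[str, float]:
    """Return the share of cases answered correctly at each effort level.

    `ask(prompt, effort)` returns the model's answer as text;
    `cases` holds (prompt, expected_answer) pairs.
    """
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one eval case")
    scores = {}
    for effort in efforts:
        correct = sum(
            ask(prompt, effort).strip().lower() == expected.lower()
            for prompt, expected in cases
        )
        scores[effort] = correct / len(cases)
    return scores


# Stand-in for a real API wrapper so the sketch runs offline.
def fake_ask(prompt: str, effort: str) -> str:
    return "refund" if effort != "low" or "refund" in prompt else "help"


cases = [("I want a refund", "refund"), ("money back please", "refund")]
print(compare_efforts(fake_ask, cases))
```

If the scores for two adjacent levels are within noise of each other on a representative set, the lower level is the one to ship.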