API GUIDE

Reasoning effort: low / medium / high / xhigh — picking the right level

When to use low, when medium, when high, and what xhigh is even for. Real scenarios, quality and cost measurements through Codex Key.

·reasoning · api · optimization · gpt-5

Every GPT-5 model on Codex Key accepts a reasoning_effort parameter. It does not change the tariff coefficient, but directly affects answer length, quality and latency. Here's when to use which.

TL;DR

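| Level            | Tasks                                   | Avg output tokens | Avg latency | Quality  |
|------------------|-----------------------------------------|-------------------|-------------|----------|
| low              | Classification, short reply, chat       | 80-200            | 0.5-1.5s    | Baseline |
| medium (default) | Code gen, normal chat, summarization    | 300-800           | 1.5-4s      | Good     |
| high             | Multi-step reasoning, hard code, review | 800-2500          | 4-12s       | High     |
| xhigh            | Research, proofs, deep analysis         | 2000-8000         | 12-40s      | Max      |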

How it works

reasoning_effort controls the model's internal chain of thought. On high and xhigh the model spends more tokens on "thinking" (some visible, some hidden depending on the model) before emitting the final answer.

Important: you pay for all reasoning tokens, including hidden ones. An xhigh call can burn 5-10× more tokens than low.
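
To see how much of a call's output went to reasoning, you can inspect the usage data on the response. The sketch below assumes Codex Key returns OpenAI-style usage, where `completion_tokens` includes reasoning tokens and `completion_tokens_details.reasoning_tokens` reports them separately. Those field names come from the OpenAI API and are not confirmed for Codex Key here, and `split_output_tokens` is a hypothetical helper.

```python
from typing import Optional


def split_output_tokens(usage: dict) -> tuple[int, Optional[int]]:
    """Return (visible_tokens, reasoning_tokens) from an OpenAI-style usage dict.

    reasoning_tokens is None when the provider does not report it.
    """
    total = usage["completion_tokens"]
    details = usage.get("completion_tokens_details") or {}
    reasoning = details.get("reasoning_tokens")
    if reasoning is None:
        return total, None
    return total - reasoning, reasoning


# Hypothetical usage payload, for illustration only.
usage = {
    "prompt_tokens": 120,
    "completion_tokens": 1850,
    "completion_tokens_details": {"reasoning_tokens": 1400},
}
visible, hidden = split_output_tokens(usage)
print(f"visible={visible} reasoning={hidden}")  # visible=450 reasoning=1400
```

With the OpenAI Python SDK, `resp.usage.model_dump()` yields a dict of this shape.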

When to use low

  • Classification: "spam / not spam?", "intent: search / buy / help?"
  • Structured extract: name, email, date from text
  • Short chat reply: "rephrase politely", "translate this string"
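  • Routing: decide which subsystem handles the request

from openai import OpenAI

# Assumes the API key and the Codex Key base URL are supplied through the
# OPENAI_API_KEY and OPENAI_BASE_URL environment variables.
client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "intent: 'I want a refund'"}],
    extra_body={"reasoning_effort": "low"},
    max_tokens=50,
)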

On these tasks low gives quality within 1-2% of medium at 3-5× lower cost.

When to use medium

The default. Use it if you don't know which level fits.

  • Generating functions, tests, migrations
  • Normal user-facing chat
  • Code completion (though IDEs usually run low)
  • Document summarization
  • RAG answers over context

When to use high

Switch to high when medium falls short:

  • Multi-file refactor
  • Architecture decisions with trade-off analysis
  • SQL planning with joins and subqueries
  • Code review hunting edge cases
  • Hard debugging

# client is an OpenAI-compatible client configured for Codex Key
resp = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": "find the race condition in this code: ..."}],
    extra_body={"reasoning_effort": "high"},
)

In our runs, moving from medium → high adds 10-25% accuracy on reasoning tasks at the cost of 50-100% more answer tokens.

When to use xhigh

Rarely. Genuinely useful only for:

  • Research-grade: derive a proof, design an algorithm from scratch
  • Long-document analysis hunting non-obvious links
  • Decompilation, reverse engineering, security analysis
  • Multi-step planning with depth 10+

For most production scenarios xhigh is overkill. If you don't see a clear quality jump vs high, stay on high.

Combining with model choice

reasoning_effort multiplies the effect of model selection:

| Model × Effort   | When                                         |
|------------------|----------------------------------------------|
| gpt-5.4 + low    | Cheap router, classifier                     |
| gpt-5.4 + medium | Default for 80% of tasks                     |
| gpt-5.4 + high   | Quality matters but gpt-5.5 is too expensive |
| gpt-5.5 + medium | Hard tasks without overshooting              |
| gpt-5.5 + high   | Hardcore reasoning, architecture review      |
| gpt-5.5 + xhigh  | Research only                                |
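
One way to apply these pairings is a small lookup that turns a task category into request arguments. This is a sketch only: the category names and the `request_kwargs` helper are hypothetical, while the model and effort pairs follow the table above.

```python
# Category names are hypothetical labels; the (model, effort) pairs
# mirror the Model × Effort table.
ROUTES = {
    "classify": ("gpt-5.4", "low"),
    "default": ("gpt-5.4", "medium"),
    "quality_on_budget": ("gpt-5.4", "high"),
    "hard": ("gpt-5.5", "medium"),
    "architecture_review": ("gpt-5.5", "high"),
    "research": ("gpt-5.5", "xhigh"),
}


def request_kwargs(category: str, prompt: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    # Unknown categories fall back to the default pairing.
    model, effort = ROUTES.get(category, ROUTES["default"])
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"reasoning_effort": effort},
    }
```

A call then looks like `client.chat.completions.create(**request_kwargs("classify", "intent: 'I want a refund'"))`, which keeps the model and effort decision in one place.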

Real measurements on one task

Task: "find and fix the memory leak in this 800-line Go service".

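| Configuration    | Output tokens | Latency | Correctness |
|------------------|---------------|---------|-------------|
| gpt-5.4 + low    | 180           | 1.1s    | 22%         |
| gpt-5.4 + medium | 520           | 3.4s    | 58%         |
| gpt-5.4 + high   | 1240          | 8.7s    | 74%         |
| gpt-5.5 + medium | 680           | 6.2s    | 79%         |
| gpt-5.5 + high   | 1850          | 14.3s   | 91%         |
| gpt-5.5 + xhigh  | 4200          | 31.2s   | 93%         |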

The jump from high → xhigh on gpt-5.5: 2 percentage points of correctness (91% → 93%) for 2.3× the output tokens and about 2.2× the latency. Not worth it.
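
The same comparison can be made for every step by computing how many extra output tokens each additional percentage point of correctness costs. The figures below are copied from the measurements above; `marginal_cost` is a hypothetical helper name.

```python
# (configuration, output tokens, correctness %) from the measurements above.
RUNS = [
    ("gpt-5.4 + low", 180, 22),
    ("gpt-5.4 + medium", 520, 58),
    ("gpt-5.4 + high", 1240, 74),
    ("gpt-5.5 + medium", 680, 79),
    ("gpt-5.5 + high", 1850, 91),
    ("gpt-5.5 + xhigh", 4200, 93),
]


def marginal_cost(prev: tuple, nxt: tuple) -> float:
    """Extra output tokens spent per extra percentage point of correctness."""
    _, tokens_a, acc_a = prev
    _, tokens_b, acc_b = nxt
    return (tokens_b - tokens_a) / (acc_b - acc_a)


by_name = {run[0]: run for run in RUNS}
steps = [
    ("gpt-5.4 + medium", "gpt-5.4 + high"),
    ("gpt-5.5 + medium", "gpt-5.5 + high"),
    ("gpt-5.5 + high", "gpt-5.5 + xhigh"),
]
for a, b in steps:
    cost = marginal_cost(by_name[a], by_name[b])
    print(f"{a} -> {b}: {cost:.1f} tokens per point")
```

On these numbers the high → xhigh step costs 1175 tokens per point, against 97.5 for medium → high on gpt-5.5 and 45 for medium → high on gpt-5.4.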

Cheap pattern: escalation

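from openai import OpenAI

client = OpenAI()  # assumes API key and base URL are set in the environment

def solve(task: str) -> str:
    messages = [{"role": "user", "content": task}]
    # Try the cheaper model at increasing effort; stop at the first answer that passes.
    for effort in ("low", "medium", "high"):
        resp = client.chat.completions.create(
            model="gpt-5.4",
            messages=messages,
            extra_body={"reasoning_effort": effort},
        )
        if validate(resp):    # your eval function
            return resp.choices[0].message.content
    # Last resort: the stronger model (effort left at its default here; set it as needed)
    resp = client.chat.completions.create(model="gpt-5.5", messages=messages)
    return resp.choices[0].message.content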

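This pattern cuts average cost by 2-3× compared with always running high, provided most tasks pass validation at low or medium; every failed attempt is still billed.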
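
Whether escalation pays off depends on how often the cheap attempts pass. A rough way to estimate it is to compute the expected output tokens per task for the whole ladder. The sketch below assumes that any task passing at a lower level would also pass at a higher one; `expected_tokens` is a hypothetical helper and the pass rates are made-up illustration values, not measurements.

```python
def expected_tokens(stages: list[tuple[float, float]]) -> float:
    """Expected output tokens per task for an escalation ladder.

    stages: (avg_tokens, cumulative_pass_rate) for each attempt, in order.
    A stage runs only if every earlier stage failed validation.
    """
    total = 0.0
    reach = 1.0  # probability that the current stage is reached
    for tokens, cumulative_pass in stages:
        total += reach * tokens
        reach = 1.0 - cumulative_pass
    return total


# Hypothetical numbers for illustration only.
ladder = [(150, 0.60), (500, 0.85), (1200, 0.95)]
always_high = [(1200, 0.95)]

print(round(expected_tokens(ladder)))       # low -> medium -> high
print(round(expected_tokens(always_high)))  # straight to high
```

With these made-up pass rates the ladder averages about 530 tokens per task against 1200 for always running high. When few tasks pass at low or medium, the ladder can cost more than going straight to high, because failed attempts add up.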

Bottom line

  • low for routine and routing
  • medium for the default
  • high when medium clearly underperforms
  • xhigh for research only

Measure with your own evals. Don't trust intuition: the effort level a task actually needs is frequently overestimated.
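
A minimal eval loop for this could look like the sketch below. It assumes exact-match scoring and a caller-supplied `ask(prompt, effort)` function that wraps the API call; both `accuracy_by_effort` and `fake_ask` are hypothetical names, and `fake_ask` is only a stand-in so the example runs offline.

```python
from typing import Callable


def accuracy_by_effort(
    ask: Callable[[str, str], str],
    cases: list[tuple[str, str]],
    efforts: tuple[str, ...] = ("low", "medium", "high"),
) -> dict[str, float]:
    """Fraction of cases answered correctly at each effort level.

    ask(prompt, effort) returns the model's answer as text;
    cases are (prompt, expected_answer) pairs compared by exact match.
    """
    results = {}
    for effort in efforts:
        correct = sum(
            ask(prompt, effort).strip().lower() == expected.lower()
            for prompt, expected in cases
        )
        results[effort] = correct / len(cases)
    return results


# Stand-in for a real API call: at "low" it only catches the literal keyword.
def fake_ask(prompt: str, effort: str) -> str:
    return "refund" if effort != "low" or "refund" in prompt else "help"


cases = [
    ("intent: 'I want a refund'", "refund"),
    ("intent: 'money back please'", "refund"),
]
print(accuracy_by_effort(fake_ask, cases))
# {'low': 0.5, 'medium': 1.0, 'high': 1.0}
```

In real use, `ask` would call `client.chat.completions.create` with the given effort and return the message content, and the case list should come from your own traffic rather than synthetic prompts.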