GPT-5.4 vs GPT-5.5 vs Codex: which model to pick
Compare GPT-5.4, GPT-5.5 and Codex on speed, price and quality. When to use the flagship, when to use mini, when to use Codex — real numbers and real scenarios.
GPT-5.4 vs GPT-5.5 vs Codex: which model to pick
Three models on one key — three different scenarios. Here's when each one is the better economic choice, based on real Codex Key tariff coefficients.
TL;DR
| Model | Coef. | Per 1M tokens* | When to use |
|---|---|---|---|
Codex (gpt-5.4-mini) | ×0.9 | ~$0.024 | Autocomplete, refactor, code review |
| GPT-5.4 | ×1.0 | ~$0.027 | Universal default, chat, agents |
| GPT-5.5 | ×4.5 | ~$0.12 | Hard reasoning, multi-step planning |
*Team plan ($90). Starter is slightly higher per million.
GPT-5.4 — the workhorse
When: 80% of tasks. Chat, code generation, agents, RAG, summarization.
Pros:
- Cheapest flagship-class model
- Stable 800–1500ms streaming latency
- Supports Fast / Priority modes (×2 coefficient, +30% speed)
Cons:
- Loses to GPT-5.5 by ~15% on multi-step reasoning
- 200k context window — smaller than GPT-5.5
GPT-5.5 — for hard reasoning
When: SQL planning, agents with heavy tool use, long-context (400k+), nuanced code review.
Pros:
- Best reasoning bench scores (HumanEval, MBPP)
- 400k context window
- Holds multi-turn dialog without hallucinating
Cons:
- ×4.5 — 4.5× the cost of GPT-5.4
- 1500–3000ms latency — heavy for interactive UI
Codex — for code
When: IDE autocomplete, refactoring, boilerplate generation, code-review comments.
Pros:
- ×0.9 — cheaper than GPT-5.4
- Code-tuned: understands file context, imports, types better
- Faster than GPT-5.4 on code (~500–1000ms)
Cons:
- Weaker on prose, explanations, documentation
- Less suitable for tool-use agents
Simple picker
- Default —
gpt-5.4. Cheapest and fastest for 80% of work. - Code in IDE — switch to
codexto save 10%. - Reasoning, planning, hard code review —
gpt-5.5, but only whengpt-5.4actually underperforms.
Reasoning effort: low / medium / high / xhigh
All models accept reasoning_effort. It doesn't change the token coefficient but affects answer quality and length:
- low — short answers, minimal reasoning. Best for chat.
- medium (default) — balanced.
- high — deep reasoning, +30–50% answer tokens.
- xhigh — max. For research-grade prompts.
Fast / Priority
Pay ×2 per token, get queue priority and +30% speed. Makes sense for realtime (voice, live-completion).
Drop-in router template
def pick_model(task: str) -> str:
if task in ("code-completion", "refactor", "review-comment"):
return "codex"
if task in ("planning", "multi-step-reasoning", "long-context"):
return "gpt-5.5"
return "gpt-5.4"
Start with gpt-5.4, measure quality via your evals, escalate to gpt-5.5 only where the diff is visible.