MODEL COMPARISON

GPT-5.4 vs GPT-5.5 vs Codex: which model to pick

Compare GPT-5.4, GPT-5.5 and Codex on speed, price and quality. When to use the flagship, when to use mini, when to use Codex — real numbers and real scenarios.

May 18, 2026·Updated May 19, 2026·gpt-5 · comparison · models · pricing

GPT-5.4 vs GPT-5.5 vs Codex: which model to pick

Three models on one key — three different scenarios. Here's when each one is the better economic choice, based on real Codex Key tariff coefficients.

TL;DR

Model	Coef.	Per 1M tokens*	When to use
Codex (`gpt-5.4-mini`)	×0.9	~$0.024	Autocomplete, refactor, code review
GPT-5.4	×1.0	~$0.027	Universal default, chat, agents
GPT-5.5	×4.5	~$0.12	Hard reasoning, multi-step planning

*Team plan ($90). Starter is slightly higher per million.

GPT-5.4 — the workhorse

When: 80% of tasks. Chat, code generation, agents, RAG, summarization.

Pros:

Cheapest flagship-class model
Stable 800–1500ms streaming latency
Supports Fast / Priority modes (×2 coefficient, +30% speed)

Cons:

Loses to GPT-5.5 by ~15% on multi-step reasoning
200k context window — smaller than GPT-5.5

GPT-5.5 — for hard reasoning

When: SQL planning, agents with heavy tool use, long-context (400k+), nuanced code review.

Pros:

Best reasoning bench scores (HumanEval, MBPP)
400k context window
Holds multi-turn dialog without hallucinating

Cons:

×4.5 — 4.5× the cost of GPT-5.4
1500–3000ms latency — heavy for interactive UI

Codex — for code

When: IDE autocomplete, refactoring, boilerplate generation, code-review comments.

Pros:

×0.9 — cheaper than GPT-5.4
Code-tuned: understands file context, imports, types better
Faster than GPT-5.4 on code (~500–1000ms)

Cons:

Weaker on prose, explanations, documentation
Less suitable for tool-use agents

Simple picker

Default — gpt-5.4. Cheapest and fastest for 80% of work.
Code in IDE — switch to codex to save 10%.
Reasoning, planning, hard code review — gpt-5.5, but only when gpt-5.4 actually underperforms.

Reasoning effort: low / medium / high / xhigh

All models accept reasoning_effort. It doesn't change the token coefficient but affects answer quality and length:

low — short answers, minimal reasoning. Best for chat.
medium (default) — balanced.
high — deep reasoning, +30–50% answer tokens.
xhigh — max. For research-grade prompts.

Fast / Priority

Pay ×2 per token, get queue priority and +30% speed. Makes sense for realtime (voice, live-completion).

Drop-in router template

def pick_model(task: str) -> str:
    if task in ("code-completion", "refactor", "review-comment"):
        return "codex"
    if task in ("planning", "multi-step-reasoning", "long-context"):
        return "gpt-5.5"
    return "gpt-5.4"

Start with gpt-5.4, measure quality via your evals, escalate to gpt-5.5 only where the diff is visible.