Codex Key token economy: what's counted and how to pay less

How Codex Key meters tokens, what the tariff coefficients mean, which optimizations actually cut your bill, and which are myths.

Short version: you pay for tokens, not for requests or minutes. This post unpacks how each request is counted and where the actual savings hide.

What a token even is

A token is a slice of text the model chops your input and output into. English text: ~1 token per 4 characters. Russian: ~1 token per 2-3 characters. Code: denser, usually ~1 token per 3-4 characters.

Rough planning numbers:

  • One A4 page of prose ≈ 400-500 tokens
  • 100 lines of Python ≈ 800-1200 tokens
  • One SWE-bench ticket (input + output) ≈ 15-40k tokens
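For quick budgeting, the character ratios above can be turned into a rough estimator. This is a minimal sketch: the function name and the midpoint ratios are illustrative assumptions, and a real tokenizer will give different counts.

```python
# Characters per token, using midpoints of the ranges quoted above.
# These are planning approximations, not tokenizer output.
CHARS_PER_TOKEN = {"english": 4.0, "russian": 2.5, "code": 3.5}


def estimate_tokens(text: str, kind: str = "english") -> int:
    """Return a rough token estimate for `text` based on its length."""
    return round(len(text) / CHARS_PER_TOKEN[kind])


print(estimate_tokens("a" * 2000))           # 500, about one page of prose
print(estimate_tokens("x" * 3500, "code"))   # 1000
```

Use it for order-of-magnitude planning only; the Usage data is the source of truth for what was actually billed.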

Codex Key billing formula

billed_tokens = (input_tokens + output_tokens) × model_coef × mode_coef

Multiplier    Values
model_coef    codex-5.3 ×0.9 · gpt-5.4 ×1.0 · gpt-5.5 ×4.5
mode_coef     standard ×1.0 · fast ×2.0 · priority ×2.0

Example. A gpt-5.5 call in Priority with 3000 input and 800 output tokens:

(3000 + 800) × 4.5 × 2.0 = 34,200 billed tokens

On the Team plan (~3.4B tokens for $90) that's ~$0.001 per request.
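The formula and the worked example can be reproduced in a few lines. This is a sketch that hard-codes the coefficients and the Team plan figures quoted in this post; the function names are illustrative, not part of any Codex Key SDK.

```python
# Coefficients as listed in the multiplier table in this post.
MODEL_COEF = {"codex-5.3": 0.9, "gpt-5.4": 1.0, "gpt-5.5": 4.5}
MODE_COEF = {"standard": 1.0, "fast": 2.0, "priority": 2.0}


def billed_tokens(input_tokens, output_tokens, model, mode="standard"):
    """Apply the billing formula: raw tokens times model and mode coefficients."""
    return (input_tokens + output_tokens) * MODEL_COEF[model] * MODE_COEF[mode]


def request_cost_usd(billed, plan_tokens=3.4e9, plan_price_usd=90.0):
    """Approximate dollar cost, assuming the ~3.4B tokens for $90 Team plan."""
    return billed / plan_tokens * plan_price_usd


billed = billed_tokens(3000, 800, "gpt-5.5", "priority")
print(billed)                               # 34200.0
print(round(request_cost_usd(billed), 6))   # 0.000905
```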

What actually cuts your bill

1. Right model per task (up to ×5 savings)

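Moving traffic from gpt-5.5 to gpt-5.4 makes that traffic 4.5× cheaper. If 80% of your gpt-5.5 token volume moves, the overall bill drops about 2.6× (blended coefficient 0.2 × 4.5 + 0.8 × 1.0 = 1.7 instead of 4.5). Only escalate to 5.5 where the quality delta is visible.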
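The effect of a partial migration can be checked with a blended coefficient. The sketch below assumes the shares are measured in raw tokens (not request counts) and that the mode stays the same; `blended_coef` is an illustrative helper name.

```python
MODEL_COEF = {"codex-5.3": 0.9, "gpt-5.4": 1.0, "gpt-5.5": 4.5}


def blended_coef(token_share):
    """Average model coefficient for a traffic mix.

    `token_share` maps model name -> share of raw tokens; shares should sum to 1.
    """
    return sum(MODEL_COEF[model] * share for model, share in token_share.items())


before = blended_coef({"gpt-5.5": 1.0})
after = blended_coef({"gpt-5.5": 0.2, "gpt-5.4": 0.8})
print(round(before / after, 2))  # 2.65
```

A full 4.5× reduction only appears when all gpt-5.5 traffic moves to gpt-5.4; the part left on gpt-5.5 still dominates the bill.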

2. Short system prompts (×1.3-2.0)

A long system prompt ships with every request. 2,000 system tokens × 100 requests = 200k tokens before the user typed anything. Trim to 500 — save 150k.
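The numbers above are raw tokens; the coefficients from the billing formula multiply the saving as well. A small sketch (the helper name is an assumption for illustration):

```python
def trim_saving(old_tokens, new_tokens, requests, model_coef=1.0, mode_coef=1.0):
    """Billed tokens saved by shortening a system prompt sent with every request."""
    return (old_tokens - new_tokens) * requests * model_coef * mode_coef


print(trim_saving(2000, 500, 100))                                 # 150000.0
print(trim_saving(2000, 500, 100, model_coef=4.5, mode_coef=2.0))  # 1350000.0
```

The second line shows the same trim on gpt-5.5 in Priority: the expensive endpoints are where a shorter system prompt pays off most.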

3. Truncate history intelligently

Chats by default send the whole transcript. After 20 turns that's 30-50k input tokens. Strategies:

  • Sliding window of the last N messages
  • Summarize older turns via gpt-5.4 every N iterations
  • Tool-aware compaction: drop raw tool outputs after you've used them
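Two of these strategies can be sketched on a plain list of chat messages. The code assumes OpenAI-style message dicts with "role" and "content" keys; the function names and the placeholder text are illustrative. Tool outputs are replaced with a stub rather than deleted, on the assumption that the API expects every tool call to keep a matching tool message.

```python
def sliding_window(messages, keep_last=6):
    """Keep all system messages plus the last `keep_last` other messages.

    Caveat: a naive window can separate a tool call from its tool result;
    pick a window size that keeps such pairs together.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + (rest[-keep_last:] if keep_last > 0 else [])


def stub_old_tool_outputs(messages, keep_last=2, placeholder="[tool output omitted]"):
    """Replace the content of all but the most recent tool outputs with a stub."""
    tool_idx = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_stub = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in to_stub else m
        for i, m in enumerate(messages)
    ]
```

Neither function mutates its input, so the full transcript can still be stored for the user while only the compacted version is sent to the model.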

4. Stop sequences and max_tokens

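from openai import OpenAI

# Assumes an OpenAI-compatible client; API key and base URL come from your Codex Key setup.
client = OpenAI()

client.chat.completions.create(
    model="gpt-5.4",
    messages=[...],                 # your conversation goes here
    max_tokens=400,                 # cap the answer length
    stop=["\n\n---", "</answer>"],  # stop generating at these markers
)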

Without max_tokens the model can casually emit 2-3k tokens unprompted.

5. Reasoning effort

reasoning_effort: low produces answers 30-50% shorter than medium. For simple work (classification, short answer) use low.

6. Streaming + early break

If your app can abort the stream once a condition is met (e.g. the closing } of a JSON object), you stop paying for the output tokens the model would otherwise keep generating.
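A minimal sketch of the early-break idea, written against any iterable of text chunks so it does not depend on a specific SDK. The function name is illustrative. It stops reading as soon as the first top-level JSON object closes; with a real streaming response you would also close the stream at that point, and the saving assumes the provider stops generating (and billing) once the stream is closed.

```python
import json


def read_first_json_object(chunks):
    """Consume text chunks until the first top-level JSON object is complete."""
    buf = []
    depth = 0
    started = False      # True once the opening "{" has been seen
    in_string = False    # True while inside a JSON string literal
    escaped = False      # True right after a backslash inside a string
    for chunk in chunks:
        for ch in chunk:
            if started or ch == "{":
                buf.append(ch)
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"' and started:
                in_string = True
            elif ch == "{":
                depth += 1
                started = True
            elif ch == "}" and started:
                depth -= 1
                if depth == 0:
                    # Object is complete: stop reading the stream here.
                    return json.loads("".join(buf))
    raise ValueError("stream ended before a complete JSON object arrived")
```

Braces inside string values are ignored, so a payload like {"text": "}"} is not cut short.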

What does not work

  • "Prompt compression" via GPT: the compression call usually costs more than it saves.
  • Replacing words with emoji: a single emoji often costs several tokens, so the swap rarely makes a prompt cheaper.
  • Translating to English: ~20% fewer tokens, but quality on Russian domain tasks tends to degrade by more than the saving is worth. Do the math.

How to inspect billing

In your Codex Key account dashboard, the Usage section shows model × mode × day breakdowns. Each request is recorded with a request_id (also returned in the x-request-id response header). If something looks off, send that ID to support.

Example: bill refactor on a real team

8-developer team, ~2000 requests/day:

Change                                               Monthly savings
Moved autocomplete from gpt-5.4 to codex-5.3         ~10%
Trimmed system prompt from 1800 to 600 tokens        ~22%
Added history summarization in chats > 15 turns      ~18%
Set max_tokens: 600 on classification handlers       ~7%
Total                                                ~50%
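The individual rows sum to 57%, not 50%. One plausible reading, and it is only an assumption, is that each saving applies to the bill that remains after the previous change, so the percentages compound rather than add. A quick check:

```python
def combined_saving(savings):
    """Total saving when each fractional saving applies to the remaining bill."""
    remaining = 1.0
    for s in savings:
        remaining *= 1 - s
    return 1 - remaining


rows = [0.10, 0.22, 0.18, 0.07]
print(round(sum(rows), 2))              # 0.57 if the savings simply added up
print(round(combined_saving(rows), 3))  # 0.465 if they compound
```

Under the compounding reading the total comes out near 46%, which is in line with the rounded ~50% in the table.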

The team dropped from the Team plan to Pro, saving $360/year with no quality regression.

Bottom line

The biggest lever is picking the right model per task. Second biggest is system prompt and history hygiene. Everything else is single-digit-percent tuning.

Start by labeling the 5 most frequent endpoints in your app: which model, which reasoning_effort, which max_tokens. That gives you 80% of the savings in one evening.
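One way to make that labeling concrete is a small per-endpoint table that every call site reads from. The endpoint names and settings below are hypothetical examples, and passing reasoning_effort as a request parameter assumes your Codex Key client accepts it in that form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointProfile:
    model: str
    reasoning_effort: str
    max_tokens: int


# Hypothetical endpoints; replace with the five most frequent ones in your app.
PROFILES = {
    "autocomplete": EndpointProfile("codex-5.3", "low", 200),
    "classify": EndpointProfile("gpt-5.4", "low", 600),
    "code_review": EndpointProfile("gpt-5.5", "medium", 2000),
}


def request_kwargs(endpoint, messages):
    """Build request parameters for an endpoint from its profile."""
    profile = PROFILES[endpoint]
    return {
        "model": profile.model,
        "reasoning_effort": profile.reasoning_effort,
        "max_tokens": profile.max_tokens,
        "messages": messages,
    }
```

A call site would then look like client.chat.completions.create(**request_kwargs("classify", messages)), so changing a model or a cap is a one-line edit in the table rather than a search through the codebase.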