Codex Key token economy: what's counted and how to pay less
How Codex Key meters tokens, what the tariff coefficients mean, which optimizations actually cut your bill, and which are myths.
Short version: you pay for tokens, not for requests or minutes. This post unpacks how each request is counted and where the actual savings hide.
What a token even is
A token is a chunk of text that the model splits your input and output into. English text: ~1 token per 4 characters. Russian: ~1 token per 2-3 characters. Code: slightly more token-heavy than English prose, usually ~1 token per 3-4 characters.
Rough planning numbers:
- One A4 page of prose ≈ 400-500 tokens
- 100 lines of Python ≈ 800-1200 tokens
- One SWE-bench ticket (input + output) ≈ 15-40k tokens
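For quick budgeting, the ratios above can be turned into a rough estimator. This is only a sketch: the function name is made up for illustration, the ratios are midpoints of the ranges quoted above (an assumption), and a real tokenizer will give different counts.

```python
# Rough characters-per-token ratios: midpoints of the ranges quoted above.
# These are planning assumptions, not tokenizer output.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "russian": 2.5,
    "code": 3.5,
}


def estimate_tokens(text: str, kind: str = "english") -> int:
    """Return a rough token estimate for budgeting purposes."""
    if not text:
        return 0
    # Never report zero tokens for non-empty text.
    return max(1, round(len(text) / CHARS_PER_TOKEN[kind]))
```

Use it for order-of-magnitude planning only; for anything that feeds into billing alerts, rely on the token counts the API reports.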
Codex Key billing formula
billed_tokens = (input_tokens + output_tokens) × model_coef × mode_coef
| Multiplier | Values |
|---|---|
| model_coef | codex-5.3 ×0.9 · gpt-5.4 ×1.0 · gpt-5.5 ×4.5 |
| mode_coef | standard ×1.0 · fast ×2.0 · priority ×2.0 |
Example. A gpt-5.5 call in Priority with 3000 input and 800 output tokens:
(3000 + 800) × 4.5 × 2.0 = 34,200 billed tokens
On the Team plan (~3.4B tokens for $90) that's ~$0.001 per request.
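The formula and the worked example translate directly into a few lines of Python. Treat this as a sketch: the function names are hypothetical, the coefficients are the ones from the table, and the default plan figures are the approximate Team plan numbers quoted above.

```python
# Coefficients as listed in the tariff table.
MODEL_COEF = {"codex-5.3": 0.9, "gpt-5.4": 1.0, "gpt-5.5": 4.5}
MODE_COEF = {"standard": 1.0, "fast": 2.0, "priority": 2.0}


def billed_tokens(input_tokens: int, output_tokens: int,
                  model: str, mode: str = "standard") -> float:
    """Apply the billing formula: (input + output) x model_coef x mode_coef."""
    return (input_tokens + output_tokens) * MODEL_COEF[model] * MODE_COEF[mode]


def cost_usd(billed: float, plan_price_usd: float = 90.0,
             plan_tokens: float = 3.4e9) -> float:
    """Convert billed tokens to dollars at the plan's effective per-token rate."""
    return billed * plan_price_usd / plan_tokens


# The worked example: gpt-5.5 in priority mode, 3000 input + 800 output tokens.
example = billed_tokens(3000, 800, "gpt-5.5", "priority")
example_cost = cost_usd(example)
```

Here `example` comes out at 34,200 billed tokens and `example_cost` at roughly $0.0009, matching the numbers above.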
What actually cuts your bill
1. Right model per task (up to ×5 savings)
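Traffic moved from gpt-5.5 to gpt-5.4 becomes 4.5× cheaper. If 80% of your gpt-5.5 tokens move, the overall bill drops about 2.6× (0.2 × 4.5 + 0.8 × 1.0 = 1.7 versus 4.5). Only escalate to 5.5 where the quality delta is visible.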
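To estimate what a partial migration is worth, it helps to compute a blended coefficient across models. A minimal sketch, assuming token volume per request stays the same after the switch and that shares are expressed as fractions of total tokens; `blended_coef` is a hypothetical helper name.

```python
MODEL_COEF = {"codex-5.3": 0.9, "gpt-5.4": 1.0, "gpt-5.5": 4.5}


def blended_coef(traffic_share: dict) -> float:
    """Weighted average model coefficient.

    traffic_share maps model name -> fraction of total tokens; must sum to 1.
    """
    total = sum(traffic_share.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(MODEL_COEF[m] * share for m, share in traffic_share.items())


before = blended_coef({"gpt-5.5": 1.0})                  # everything on gpt-5.5
after = blended_coef({"gpt-5.5": 0.2, "gpt-5.4": 0.8})   # 80% moved to gpt-5.4
reduction = before / after                               # bill shrinks by this factor
```

With these shares the blended coefficient falls from 4.5 to 1.7, so the bill shrinks by a factor of about 2.6.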
2. Short system prompts (×1.3-2.0)
A long system prompt ships with every request. 2,000 system tokens × 100 requests = 200k tokens before the user has typed anything. Trim the prompt to 500 tokens and you save 150k.
3. Truncate history intelligently
Chats by default send the whole transcript. After 20 turns that's 30-50k input tokens. Strategies:
- Sliding window of the last N messages
- Summarize older turns via gpt-5.4 every N iterations
- Tool-aware compaction: drop raw tool outputs after you've used them
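Two of these strategies, the sliding window and tool-aware compaction, fit in one small function. The sketch below assumes OpenAI-style message dicts with `role` and `content` keys; the function name, the defaults, and the placeholder text are illustrative choices. Summarization is left out because it needs a model call.

```python
def compact_history(messages, keep_last=10, keep_tool_outputs=1,
                    placeholder="[tool output omitted]"):
    """Return a shorter copy of the chat history.

    - System messages are always kept.
    - Only the last `keep_last` non-system messages survive (sliding window).
    - Within that window, all but the newest `keep_tool_outputs` tool
      messages have their content replaced by a short placeholder.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    window = rest[-keep_last:] if keep_last > 0 else []

    # Positions of tool messages inside the window, oldest first.
    tool_idx = [i for i, m in enumerate(window) if m["role"] == "tool"]
    if keep_tool_outputs > 0:
        stale = set(tool_idx[:-keep_tool_outputs])
    else:
        stale = set(tool_idx)

    compacted = []
    for i, m in enumerate(window):
        if i in stale:
            # Copy the message so the caller's history is not mutated.
            m = {**m, "content": placeholder}
        compacted.append(m)
    return system + compacted
```

One caveat: some APIs require every tool message to follow the assistant message that requested it, so a production version may need to cut the window on a turn boundary rather than at a fixed message count.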
4. Stop sequences and max_tokens
client.chat.completions.create(
    model="gpt-5.4",
    messages=[...],
    max_tokens=400,  # cap the answer
    stop=["\n\n---", "</answer>"],  # generation halts at the first marker hit
)
Without max_tokens the model can casually emit 2-3k tokens unprompted.
5. Reasoning effort
reasoning_effort: low produces answers 30-50% shorter than medium. For simple work (classification, short answer) use low.
6. Streaming + early break
If your app can abort the stream once a condition is met (e.g. the closing } of a JSON object), you skip the tokens that would have been generated after that point.
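The early-break condition for JSON can be sketched as a small brace counter over the incoming text chunks. This assumes the response is a single JSON object (optionally followed by chatter) and that `chunks` is any iterable of strings; `read_first_json_object` is a hypothetical helper, not part of any SDK.

```python
def read_first_json_object(chunks):
    """Consume text chunks until the first top-level JSON object closes.

    Returns the text up to and including the closing brace. Braces inside
    JSON strings are ignored, including strings split across chunks.
    """
    buf = []
    depth = 0
    started = False
    in_string = False
    escaped = False
    for chunk in chunks:
        for pos, ch in enumerate(chunk):
            if in_string:
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
                started = True
            elif ch == "}":
                depth -= 1
                if started and depth == 0:
                    # Object complete: stop reading, drop the rest of the chunk.
                    buf.append(chunk[:pos + 1])
                    return "".join(buf)
        buf.append(chunk)
    return "".join(buf)  # stream ended before the object closed
```

After the function returns, close the underlying stream or connection; simply ceasing to read may not be enough to stop generation on the server side.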
What does not work
- "Prompt compression" via GPT: usually costs more than it saves.
- Replacing words with emoji: an emoji usually costs more tokens than the word it replaces, not fewer.
- Translating to English: ~20% savings, but quality on Russian domain tasks degrades more. Do the math.
How to inspect billing
In the Codex Key cabinet, the Usage section shows model × mode × day breakdowns. Each request is recorded with a request_id (also returned in the x-request-id response header). If something looks off — send support that ID.
Example: bill refactor on a real team
8-developer team, ~2000 requests/day:
| Change | Monthly savings |
|---|---|
| Moved autocomplete from gpt-5.4 to codex-5.3 | ~10% |
| Trimmed system prompt from 1800 to 600 tokens | ~22% |
| Added history summarization in chats > 15 turns | ~18% |
| Set max_tokens: 600 on classification handlers | ~7% |
| Total | ~50% |
Dropped from the Team plan to Pro — $360/year saved with no quality regression.
Bottom line
The biggest lever is picking the right model per task. Second biggest is system prompt and history hygiene. Everything else is half-percent tuning.
Start by labeling the 5 most frequent endpoints in your app: which model, which reasoning_effort, which max_tokens. That gives you 80% of the savings in one evening.
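One lightweight way to do that labeling is a single table of per-endpoint settings that every call site reads from. The sketch below is illustrative only: the endpoint names and the specific values are made up, and `request_kwargs` is a hypothetical helper whose output you would pass to your client call.

```python
# Hypothetical endpoint profiles; replace names and values with your own.
ENDPOINT_PROFILES = {
    "autocomplete": {"model": "codex-5.3", "reasoning_effort": "low", "max_tokens": 200},
    "classify": {"model": "gpt-5.4", "reasoning_effort": "low", "max_tokens": 600},
    "code_review": {"model": "gpt-5.5", "reasoning_effort": "medium", "max_tokens": 2000},
}

# Fallback for endpoints that have not been labeled yet.
DEFAULT_PROFILE = {"model": "gpt-5.4", "reasoning_effort": "medium", "max_tokens": 800}


def request_kwargs(endpoint: str, messages: list) -> dict:
    """Build the keyword arguments for a completion call on this endpoint."""
    profile = ENDPOINT_PROFILES.get(endpoint, DEFAULT_PROFILE)
    # Copy the profile so callers cannot mutate the shared table.
    return {**profile, "messages": messages}
```

Keeping the settings in one place makes the expensive choices visible in code review: any endpoint that asks for gpt-5.5 or a large max_tokens has to justify it in a single diff.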