OpenAI’s model lineup in May 2026 is a mess — and I mean that as a technical descriptor, not a complaint.
You’ve got o3 and o4-mini as the reasoning series, GPT-5.5 and GPT-5.4 as the flagship conversational models, and somewhere in the middle there’s a pricing structure that requires a spreadsheet to decode. The naming convention that made sense three years ago has completely broken down, and the capability differences between adjacent models are genuinely hard to pin down without running your own benchmarks.
Let’s sort through it.
Based on what we know from pricing guides and benchmark data as of May 2026:
| Model | Context | Pricing (input/output per 1M tokens) |
|---|---|---|
| GPT-5.5 | 128K | $5 / $30 |
| GPT-5.4 | 128K | $2.50 / $15 |
| o4-mini | 128K | $0.55 / $2.20 |
| o3 | 128K | Varies (high) |
| GPT-4o (legacy) | Retired | — |
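To make the price gaps in the table concrete, here is a minimal sketch that turns the listed per-1M-token prices into a per-request cost. The function name `estimate_cost` and the example token counts are assumptions for illustration. o3 is left out because the table gives no fixed price for it.

```python
# Prices copied from the table above, in USD per 1M tokens: (input, output).
# o3 is omitted because the table lists its pricing only as "Varies (high)".
PRICES_PER_MILLION = {
    "GPT-5.5": (5.00, 30.00),
    "GPT-5.4": (2.50, 15.00),
    "o4-mini": (0.55, 2.20),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the table's list prices."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 10,000 input tokens and 2,000 output tokens per request.
for name in PRICES_PER_MILLION:
    print(f"{name}: ${estimate_cost(name, 10_000, 2_000):.4f}")
```

Treat the o4-mini figure as a lower bound: reasoning models usually generate extra reasoning tokens that count toward output, so the real output token count for the same task may be higher than for a GPT-5 model.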
The o3 and o4-mini models use the “thinking” architecture: extended reasoning loops that let the model deliberate longer before responding. The GPT-5 family is the standard autocomplete-style line. The two lines serve different use cases even when their benchmark scores overlap.
Here’s something the marketing doesn’t tell you: Suprmind’s May 2026 hallucination benchmark shows o3 hallucinates on 33% of PersonQA questions (factual questions about real people). That’s roughly double o1’s rate of 16%. o4-mini is worse at 48%.
This matters for production applications. If you’re building something where factual accuracy on real-world entities is critical (legal research, journalism, medical information), o3’s higher reasoning capability comes with a meaningful accuracy tradeoff on basic facts. The model thinks harder, but the extra deliberation doesn’t make it more reliable on simple factual recall.
o3 leads on AIME math benchmarks (96.7% on AIME 2024) and SWE-bench (software engineering), which makes sense for technical domains where there’s a verifiable correct answer. But for open-ended factual recall, the newer models are regressing.
The o-mega.ai May 2026 benchmark roundup shows a tight race at the top.
The interesting story here is that open-source models (Llama 4) have basically caught the proprietary frontier on math benchmarks. That’s a meaningful shift from 18 months ago.
On Codeforces and SWE-bench, o3 sets a new state of the art. OpenAI’s own help documentation claims it “sets a new SOTA on benchmarks including Codeforces, SWE-bench (without building a custom model-specific scaffold), and MMMU.” The “without building a custom model-specific scaffold” qualifier is notable: o3 doesn’t need the scaffolding trick that previous models required to get good software engineering results.
Use o3 for: Complex software engineering tasks, multi-step math, problems where you can verify the answer. The extended reasoning pays off when there’s a right answer to find.
Use GPT-5.4 for: General conversation, content generation, applications where hallucination rate matters more than raw reasoning power. It sits between Gemini and Claude on both price and performance, which makes it a solid middle-tier flagship.
Be careful with o4-mini: The 48% hallucination rate on PersonQA is a red flag for any factual application. It’s cheap ($0.55/$2.20 per 1M tokens) but the accuracy tradeoff is real.
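The three rules of thumb above can be sketched as a small routing function. This is only an illustration: the `Task` fields and `pick_model` are hypothetical names, and giving factual sensitivity priority over verifiability is an assumption of this sketch, not something the benchmarks dictate.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task descriptor used only for this sketch."""
    verifiable_answer: bool          # e.g. code with tests, math with a checkable result
    facts_about_real_entities: bool  # e.g. legal research, journalism, medical info


def pick_model(task: Task) -> str:
    """Apply the rules of thumb above, checking factual sensitivity first."""
    if task.facts_about_real_entities:
        # Hallucination rate matters most here, so skip o3 (33% on PersonQA)
        # and o4-mini (48% on PersonQA).
        return "GPT-5.4"
    if task.verifiable_answer:
        # Extended reasoning pays off when there is a right answer to find.
        return "o3"
    # General conversation and content generation.
    return "GPT-5.4"


print(pick_model(Task(verifiable_answer=True, facts_about_real_entities=False)))
print(pick_model(Task(verifiable_answer=False, facts_about_real_entities=True)))
```

o4-mini is deliberately never returned here, because the guidance above only cautions against it and does not recommend it for a specific workload.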
OpenAI has essentially created two parallel product lines that are starting to compete with each other. o3 and o4-mini are marketed as reasoning models, but they overlap significantly with GPT-5 capabilities. The pricing is also structured differently (per-token for o3 and o4-mini versus subscription plus tokens for ChatGPT), which makes direct comparison harder.
The biggest change in the 2026 lineup: every pre-GPT-5 model family has been retired. GPT-4, GPT-4o, GPT-4o-mini are all gone from the API. If you’re still making GPT-4-family calls, you’re depending on deprecated endpoints that may not be supported much longer.
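If you want to check your own services for retired models, a minimal sketch might look like the following. The `find_retired` helper and the service-to-model mapping are hypothetical, and `gpt-5.4` is an assumed identifier spelling. The match is exact, so dated snapshot names would need their own entries.

```python
# Model names listed above as retired from the API.
RETIRED_MODELS = {"gpt-4", "gpt-4o", "gpt-4o-mini"}


def find_retired(configured_models: dict[str, str]) -> dict[str, str]:
    """Return the service -> model entries whose model name is retired."""
    return {
        service: model
        for service, model in configured_models.items()
        if model.lower() in RETIRED_MODELS
    }


# Hypothetical configuration mapping internal services to model names.
config = {
    "support-bot": "gpt-4o",
    "summarizer": "gpt-5.4",   # assumed identifier spelling
    "classifier": "gpt-4o-mini",
}

for service, model in find_retired(config).items():
    print(f"{service} still uses retired model {model}")
```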
OpenAI’s model lineup in 2026 is more capable than ever but significantly harder to navigate. The benchmark table matters more than the marketing.