Claude Fable 5 Dropped Yesterday. The Benchmark Table Is Brutal.

Anthropic's Claude Fable 5 launched June 9 with the clearest agentic coding lead we've seen in a single model generation. Here's the full benchmark picture — including the asterisks.

Anthropic shipped Claude Fable 5 on June 9. The benchmark table it published the same day tells a blunt story: for agentic coding and long-horizon autonomous work, this is a different class of model.

Fable 5 scores 80.3% on SWE-Bench Pro, the industry’s gold-standard coding evaluation. The next-best deployable model, Anthropic’s own Opus 4.8, sits at 69.2%. GPT-5.5 manages 58.6%. Gemini 3.1 Pro comes in at 54.2%. That’s an 11-point gap to the next-best model, nearly 22 points ahead of OpenAI’s best shipping model, and 26 points ahead of Gemini 3.1 Pro.

The Benchmark Table That Should Inform Your Stack Decision

Anthropic published the full comparison on June 9 covering Fable 5, Opus 4.8, GPT-5.5, and Gemini 3.1 Pro across eight agentic benchmarks. Here are the rows that matter for production decisions:

| Benchmark | Fable 5 | Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-Bench Pro (coding) | 80.3% | 69.2% | 58.6% | 54.2% |
| FrontierCode Diamond (hard) | 29.3% | 13.4% | 5.7% | - |
| Terminal-Bench 2.1 | 88.0%* | 82.7% | 83.4% | 70.7% |
| GDPval-AA (knowledge, ELO) | 1932 | 1890 | 1769 | 1314 |
| OSWorld-Verified (computer use) | 85.0% | 83.4% | 78.7% | 76.2% |
| AutomationBench (tool use) | 17.4% | 15.5% | 12.9% | 9.6% |

*The starred Terminal-Bench 2.1 score is for Mythos 5 (restricted tier), not Fable 5. GPT-5.5’s 83.4% was obtained in its native Codex CLI harness.
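To make the size of each lead concrete, here is a minimal sketch that recomputes Fable 5's margin over the runner-up from the percentage rows of the table. The dictionary layout and the helper name `lead_over_runner_up` are illustrative choices, not anything Anthropic publishes. Terminal-Bench 2.1 is left out because the starred figure is not a Fable 5 score, and GDPval-AA is left out because it is an ELO rating rather than a percentage.

```python
# Scores copied from the table; None marks a cell the table leaves empty.
SCORES = {
    "SWE-Bench Pro": {
        "Fable 5": 80.3, "Opus 4.8": 69.2, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2,
    },
    "FrontierCode Diamond": {
        "Fable 5": 29.3, "Opus 4.8": 13.4, "GPT-5.5": 5.7, "Gemini 3.1 Pro": None,
    },
    "OSWorld-Verified": {
        "Fable 5": 85.0, "Opus 4.8": 83.4, "GPT-5.5": 78.7, "Gemini 3.1 Pro": 76.2,
    },
    "AutomationBench": {
        "Fable 5": 17.4, "Opus 4.8": 15.5, "GPT-5.5": 12.9, "Gemini 3.1 Pro": 9.6,
    },
}


def lead_over_runner_up(benchmark, model="Fable 5"):
    """Return (best rival, lead in percentage points) for `model` on one row."""
    row = {m: s for m, s in SCORES[benchmark].items() if s is not None}
    rivals = {m: s for m, s in row.items() if m != model}
    runner_up = max(rivals, key=rivals.get)
    return runner_up, round(row[model] - rivals[runner_up], 1)


for name in SCORES:
    rival, gap = lead_over_runner_up(name)
    print(f"{name}: +{gap} points over {rival}")
```

On these numbers the lead is double-digit on the two coding rows and under two points on computer use and tool use, which is worth keeping in mind when reading the rest of the table.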

Where the Gap Is Widest

SWE-Bench Pro is the number that should get your attention. It tests whether a model can resolve real GitHub issues: multi-file code changes, dependency reasoning, and getting the tests to pass. 80.3% is not a marginal improvement. It’s a generational step. Opus 4.8 at 69.2% was already considered state-of-the-art as recently as Q1 2026.

On FrontierCode Diamond — the hardest tier of the FrontierCode benchmark — the separation is even starker. Fable 5 at 29.3% versus Opus 4.8’s 13.4% and GPT-5.5’s 5.7% suggests the difficulty ceiling has moved substantially. If you’re building agents that tackle novel, multi-step engineering problems (not just autocomplete or single-file edits), this is the row that predicts how far you can push the system.

One exception worth knowing: on Terminal-Bench 2.1, GPT-5.5 running in its native Codex CLI harness scores 83.4%, narrowly ahead of Opus 4.8 at 82.7%. The 88.0% in the Fable 5 column on that row belongs to the restricted Mythos 5 tier, not the model you can deploy today. The practical implication: once the agent harness is taken into account, GPT-5.5 and Opus 4.8 are closer than the SWE-Bench Pro gap suggests. For terminal-based coding agents, test both in your own harness before routing everything to Fable 5.

Pricing: Twice the Cost, But the Math Changes With Cache Hits

Fable 5 runs $10 per million input tokens and $50 per million output tokens — exactly double Opus 4.8’s $5/$25 rates. For a task consuming 200,000 input and 50,000 output tokens, that’s $4.50 versus $2.25 per task.
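The per-task figures follow directly from the list prices. A minimal sketch of the arithmetic, assuming flat per-token billing with no caching or batch discounts; the `Pricing` class and `task_cost` helper are hypothetical names used only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_mtok: float   # USD per million input tokens
    output_per_mtok: float  # USD per million output tokens


# List prices quoted in the article.
FABLE_5 = Pricing(input_per_mtok=10.0, output_per_mtok=50.0)
OPUS_4_8 = Pricing(input_per_mtok=5.0, output_per_mtok=25.0)


def task_cost(pricing, input_tokens, output_tokens):
    """Cost in USD of one task at list price, with no discounts applied."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


print(task_cost(FABLE_5, 200_000, 50_000))   # 4.5
print(task_cost(OPUS_4_8, 200_000, 50_000))  # 2.25
```

Note that output tokens account for more than half of the bill in this example ($2.50 of the $4.50 on Fable 5), even though there are four times as many input tokens.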

The discount that changes the equation: Anthropic’s 90% prompt-caching discount applies to input tokens on repeated context. For agents that reuse a large system prompt or codebase across many turns, Fable 5’s effective input cost drops from $2.00 to as little as $0.20 per task if every input token is a cache hit. At that point the task costs roughly $2.70 on Fable 5 versus $1.35 on Opus 4.8 with the same caching. The ratio is still 2x, but the absolute premium shrinks from $2.25 to $1.35 per task, a gap that looks different when the task quality delta is 11 SWE-Bench Pro points.
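Real agents rarely hit the cache on every input token, so it can help to treat the hit rate as a parameter. The sketch below is one way to model that, assuming the 90% discount applies only to the cached share of input tokens and ignoring any cost of writing the cache in the first place. The function name and its parameters are illustrative, not part of any vendor API.

```python
def cached_task_cost(input_price, output_price, input_tokens, output_tokens,
                     cache_hit_fraction=1.0, cache_discount=0.90):
    """Cost in USD of one task when part of the input is served from cache.

    Prices are USD per million tokens. `cache_hit_fraction` is the share of
    input tokens billed at the discounted rate (an assumption of this sketch).
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached = input_tokens * cache_hit_fraction
    uncached = input_tokens - cached
    input_cost = (uncached * input_price
                  + cached * input_price * (1.0 - cache_discount)) / 1_000_000
    output_cost = output_tokens * output_price / 1_000_000
    return input_cost + output_cost


# Sweep the hit rate for the article's 200k-input / 50k-output task.
for hit in (0.0, 0.5, 1.0):
    fable = cached_task_cost(10.0, 50.0, 200_000, 50_000, hit)
    opus = cached_task_cost(5.0, 25.0, 200_000, 50_000, hit)
    print(f"hit rate {hit:.0%}: Fable 5 ${fable:.2f}, Opus 4.8 ${opus:.2f}, "
          f"premium ${fable - opus:.2f}")
```

Because both models get the same discount on the same tokens, caching shrinks the dollar premium but never the ratio. Output tokens are not discounted, so they set the floor on what each task costs.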

What This Means for Your Stack

The routing logic is becoming clearer. Fable 5 earns its premium on hard, multi-file, long-horizon coding tasks: the work where an 80% versus 69% SWE-Bench Pro score actually shows up in the PR as fewer round trips and less human rescue. For commodity tasks, or for agents where the harness matters more than the base model, Opus 4.8 or GPT-5.5 may be the better economic choice.
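As a rough illustration of that routing rule, here is a minimal sketch. Everything in it is an assumption made for the example: the `Task` fields, the thresholds, and the model labels (plain strings, not real API model identifiers). Tune the thresholds against your own task data before trusting them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    files_touched: int           # how many files the change is expected to span
    expected_steps: int          # rough count of agent turns or tool calls
    harness_bound: bool = False  # True if results depend mostly on the harness


def choose_model(task, multi_file_threshold=2, long_horizon_steps=20):
    """Pick a model tier for a task. Thresholds are placeholders."""
    if task.harness_bound:
        # The article's advice for terminal-style agents: don't pay the
        # premium by default; compare cheaper models in your own harness.
        return "opus-4.8-or-gpt-5.5"
    is_multi_file = task.files_touched >= multi_file_threshold
    is_long_horizon = task.expected_steps >= long_horizon_steps
    if is_multi_file and is_long_horizon:
        return "fable-5"
    return "opus-4.8"


print(choose_model(Task(files_touched=6, expected_steps=40)))   # fable-5
print(choose_model(Task(files_touched=1, expected_steps=3)))    # opus-4.8
print(choose_model(Task(files_touched=6, expected_steps=40,
                        harness_bound=True)))                   # opus-4.8-or-gpt-5.5
```

A static rule like this is only a starting point. Logging which tasks needed human rescue on the cheaper tier gives you the data to move the thresholds.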

The benchmark table is also a reminder: starred numbers and restricted-tier scores belong in your awareness, not your procurement sheet. The 78.0% ExploitBench score in Anthropic’s table? That belongs to Mythos 5, the restricted model. Fable 5 in blocking mode made 0% progress on offensive cyber tasks. Don’t benchmark-shop on numbers you can’t deploy.

If you’re running agentic coding in production today, the honest move is to test Fable 5 against your own task distribution — not rely on a single benchmark row. But the SWE-Bench gap is real, and it’s large.

