Claude Mythos Found Thousands of Zero-Days — The Benchmark That Actually Matters
Claude Mythos — Anthropic’s preview release currently deployed with 12 security-focused partner organizations — scored 93.9% on SWE-bench Verified and 94.6% on GPQA Diamond. Those are remarkable numbers on their own. But the more striking claim from the disclosure: Mythos found thousands of zero-day vulnerabilities across every major operating system and browser during its evaluation period.
That’s not a benchmark. That’s a live-fire test.
Why Zero-Days Are the Benchmark That Matters
SWE-bench Verified tests whether a model can fix known bugs in real open-source codebases. GPQA Diamond tests graduate-level scientific reasoning. Both are useful proxies for whether a model can actually reason about hard, verifiable problems.
But neither measures what Mythos appears to have done: find previously unknown vulnerabilities in production software used by hundreds of millions of people. That’s a different capability entirely — not just fixing what’s broken, but identifying what no one knew was broken.
The standard critique of AI security claims is that LLMs hallucinate plausible-sounding but non-existent vulnerabilities. “Thousands of zero-days” would normally fall into that bucket. But the evaluation methodology here matters: these findings were validated by Anthropic’s red team and the partner organizations running Mythos in controlled environments.
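Anthropic hasn’t published what that validation pipeline looks like. But in vulnerability research, “validated” has a concrete operational meaning: a finding counts only if its proof of concept actually reproduces the failure. Here’s a minimal sketch of that gate in Python; the `Finding` fields, the `sandbox-run` harness, and the exit-code convention are all hypothetical stand-ins, not anything from the disclosure:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Finding:
    target: str    # e.g. a browser component or kernel subsystem
    claim: str     # the model's description of the vulnerability
    poc_path: str  # path to a reproduction script the model produced

def reproduces(finding: Finding, timeout_s: int = 60) -> bool:
    """Run the proof of concept inside an isolated harness and report
    whether it actually triggered the failure."""
    try:
        result = subprocess.run(
            ["./sandbox-run", finding.poc_path],  # hypothetical sandbox wrapper
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a PoC that hangs is not a confirmed finding
    # Assumed convention: the harness exits non-zero only when the PoC
    # crashed the target or tripped a sanitizer.
    return result.returncode != 0

def triage(findings: list[Finding]) -> list[Finding]:
    """Keep only findings that reproduce; everything else is treated as
    a possible hallucination until a human analyst says otherwise."""
    return [f for f in findings if reproduces(f)]
```

The point of the sketch: “thousands of zero-days” is only a meaningful count if every item survived a gate like this, and that gate is exactly the part outsiders can’t currently inspect.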
What the Numbers Actually Mean
The benchmark scores are concrete:
| Metric | Result | Context |
|---|---|---|
| SWE-bench Verified | 93.9% | Fixes real GitHub issues |
| GPQA Diamond | 94.6% | Graduate-level scientific reasoning |
| Zero-days found | Thousands (claimed) | Across major OS and browser targets |
For context: GPT-5.4 sits at ~74.9% on SWE-bench Verified, and Claude Opus 4.6 is in the same range. A jump to 93.9% isn’t an incremental improvement; it’s a capability step change.
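One way to see why, assuming the reported scores are directly comparable: look at the rate of *unsolved* tasks rather than solved ones. A quick back-of-the-envelope check using the figures above:

```python
# SWE-bench Verified solve rates, as quoted above
gpt_5_4 = 0.749
mythos = 0.939

# The failure rate is what actually shrinks as models improve
fail_gpt = 1 - gpt_5_4    # ~0.251: about one task in four unsolved
fail_mythos = 1 - mythos  # ~0.061: about one task in sixteen unsolved

print(f"{fail_gpt / fail_mythos:.1f}x fewer unsolved tasks")  # ~4.1x
```

Shrinking the unsolved pool by roughly 4x is the kind of gap that usually separates model generations, not point releases.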
The question is how much of that improvement transfers to real-world security work versus benchmark-specific pattern matching.
The Deployment Model Problem
Here’s the tension: Claude Mythos is impressive in the numbers, impressive in the zero-day findings, and completely unavailable for local deployment or independent testing.
Anthropic is running it through a controlled deployment program: 12 partner organizations, specific use cases, no public API access. That’s a reasonable approach for a model with genuine security implications, since a capability that finds zero-days for defenders finds them for attackers just as readily. But it also means the benchmarks are coming from a lab with a vested interest in strong results.
The SWE-bench and GPQA numbers at least come from public benchmarks, so they could in principle be reproduced by an outside evaluator. The zero-day claim can’t be checked at all without access to the evaluation methodology and the findings themselves.
What This Means for the Rankings
For TopClanker’s purposes: Claude Mythos would rank #1 on every benchmark it has been measured on — decisively. The SWE-bench score alone would place it above GLM-5.1, GPT-5.4, and every other model in our tracking universe.
But we don’t have an SWE-bench Verified score from an independent evaluator, and we don’t have the zero-day findings available for community verification. The numbers are impressive enough that they’re worth reporting. They’re also incomplete enough that we’re treating them as preliminary until the evaluation methodology is public.
The broader signal: the frontier model capability gap is widening again after a period of relative convergence between GPT-5.4, Claude Opus 4.6, and the open-weight models. If Mythos’s numbers hold up under independent scrutiny, the local AI ecosystem will have to reckon with a new capability ceiling — one they can’t access without a cloud API.