Ornith-1.0 Beats Opus 4.7 on Coding — and It Writes Its Own Training Harness

Open-source coding models spent 2025 closing the gap with frontier closed-source systems. DeepReinforce’s new Ornith-1.0 family didn’t just close the gap — it walked past it. The 397B flagship scores 82.4 on SWE-Bench Verified and 77.5 on Terminal-Bench 2.1, beating Anthropic’s Claude Opus 4.7 (80.8 and 70.3) on both. It’s MIT-licensed, weights are on Hugging Face, and the trick that got it there is genuinely novel: the model learns to author its own reinforcement-learning training scaffolds at training time.

That last sentence deserves unpacking, because it’s the story that matters.

The Headline Numbers (Vendor-Reported)

All Ornith numbers below come from DeepReinforce’s announcement and were produced on the standard OpenHands eval harness — temperature 1.0, top_p 0.95, 256k context window. Independent verification has not yet been published, and you should weight vendor-reported benchmarks accordingly. The methodology is at least reproducible in principle, which is more than most labs publish.

Model                               SWE-Bench Verified   Terminal-Bench 2.1
Ornith-1.0-397B                     82.4                 77.5
Claude Opus 4.7                     80.8                 70.3
DeepSeek-V4-Pro                     80.6                 67.9
MiniMax M3                          80.5                 66.0
Ornith-1.0-35B (MoE, ~3B active)    -                    64.4
Qwen 3.5-397B                       -                    53.5
Ornith-1.0-9B (Dense)               69.4                 43.1

Two things jump off this table. First, the 397B flagship didn’t just match Opus 4.7 — it cleared it on both benchmarks. Second, the 35B MoE variant sits above a 397B Qwen baseline on Terminal-Bench 2.1, with roughly an order of magnitude fewer total parameters. That second result is the more interesting one. We’ll come back to it.

The Model Family

Ornith-1.0 ships in four sizes under MIT on Hugging Face:

  • 9B Dense — fits in ~19GB at bf16, runs on a single 80GB GPU, hits 69.4 on SWE-Bench Verified. This is the local-LLM story: it outperforms Gemma 4-31B on the same benchmark at roughly a third the parameter count.
  • 31B Dense — the larger dense sibling.
  • 35B MoE (~3B active) — the punchline of the family. Detailed below.
  • 397B MoE — the flagship, built on top of pretrained Gemma 4 and Qwen 3.5.
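The memory figures above follow from simple arithmetic: bf16 stores each parameter in 2 bytes. The sketch below is a back-of-the-envelope estimate for weights only, using the nominal parameter counts from the list. It ignores KV cache, activations, and runtime overhead, and the helper name is made up for illustration. The nominal 9B count gives about 18GB, in line with the ~19GB quoted above.

```python
def weight_footprint_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    bytes_per_param defaults to 2 because bf16 uses 16 bits per parameter.
    """
    return params_billion * 1e9 * bytes_per_param / 1e9


# Nominal sizes taken from the model list; real checkpoints may differ slightly.
family = [("9B Dense", 9), ("31B Dense", 31), ("35B MoE", 35), ("397B MoE", 397)]

for name, size_b in family:
    print(f"{name}: ~{weight_footprint_gb(size_b):.0f} GB of weights at bf16")
```

Note that an MoE model normally needs all of its experts resident in memory, so the 35B variant is sized by its 35B total parameters here, not by its ~3B active parameters. The active count mainly affects compute per token.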

The license is the headline most open-source coverage will lead with, and rightly so: MIT means commercial use, fine-tuning, redistribution, the works. A 397B model that beats Opus 4.7 on agentic coding benchmarks with no usage gate is a real shift in the deployment economics for code-agent products.

What “Self-Scaffolding” Actually Means

Most RL-trained coding models train against a fixed, human-written harness — a script that defines how the agent invokes tools, parses outputs, and assembles trajectories. The harness is part of the ceiling. You can scale the model and the data, but if the orchestration logic is wrong, the model learns to compensate inside a flawed structure.

Ornith-1.0 flips this. At each RL step, the model itself proposes a refined per-task scaffold — a tailored orchestration plan for the specific coding task at hand — and then generates a solution rollout conditioned on that scaffold. The reward signal propagates back through both stages. The model is graded not just on whether the code works, but on whether the scaffold it authored was any good.
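The two-stage structure is easier to see in code. What follows is a minimal sketch of one rollout, not DeepReinforce’s implementation: every name in it (Rollout, scaffold_then_solve, and the toy_* functions) is hypothetical, and the toy functions stand in for calls to the policy model and to a real test runner.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    task: str
    scaffold: str   # stage 1 output: the per-task orchestration plan
    solution: str   # stage 2 output: generated conditioned on the scaffold
    reward: float   # one outcome reward, credited to the tokens of both stages


def scaffold_then_solve(
    task: str,
    propose_scaffold: Callable[[str], str],
    solve: Callable[[str, str], str],
    grade: Callable[[str, str], float],
) -> Rollout:
    scaffold = propose_scaffold(task)      # stage 1: the model writes its own plan
    solution = solve(task, scaffold)       # stage 2: the rollout sees that plan
    reward = grade(task, solution)         # outcome is graded once, at the end
    return Rollout(task, scaffold, solution, reward)


# Toy stand-ins (assumptions) so the sketch runs end to end.
def toy_propose(task: str) -> str:
    return f"1) reproduce '{task}' 2) patch 3) run tests"


def toy_solve(task: str, scaffold: str) -> str:
    return f"patch for '{task}' following: {scaffold}"


def toy_grade(task: str, solution: str) -> float:
    return 1.0 if "patch" in solution else 0.0


rollout = scaffold_then_solve("off-by-one in pager", toy_propose, toy_solve, toy_grade)
print(rollout.scaffold)
print(rollout.reward)
```

The design point is that a single reward is attached to a record containing both the scaffold and the solution, so a training step can reinforce or penalize the scaffold tokens along with the code tokens.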

The optimization objective is token-level GRPO with asynchronous pipeline-RL and staleness-weighted off-policy tokens. You don’t need to grok every word of that to get the gist: rollout generation and policy updates run concurrently instead of taking turns, so some tokens in each batch were sampled by an older copy of the policy. Those off-policy tokens are down-weighted according to how stale they are rather than thrown away.
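To make those terms concrete, here is a small sketch of the two ingredients. The group-relative advantage is the standard GRPO idea: each rollout is scored against the other rollouts sampled for the same task. The staleness weight is purely an assumption for illustration. The announcement does not give the weighting function, so the exponential decay, the decay value, and all function names below are invented.

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the group sampled for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; a choice made for this sketch
    if std == 0.0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


def staleness_weight(current_version: int, sampled_version: int, decay: float = 0.5) -> float:
    """Assumed form: shrink the weight for each policy update the tokens lag behind."""
    lag = max(0, current_version - sampled_version)
    return decay ** lag


def token_weights(
    rewards: list[float],
    token_counts: list[int],
    sampled_versions: list[int],
    current_version: int,
) -> list[list[float]]:
    """Per-token loss weights, normalized over all tokens in the group
    (token-level) rather than averaged per sequence."""
    advantages = group_relative_advantages(rewards)
    total_tokens = sum(token_counts)
    weights = []
    for adv, n_tokens, version in zip(advantages, token_counts, sampled_versions):
        w = adv * staleness_weight(current_version, version) / total_tokens
        weights.append([w] * n_tokens)
    return weights


print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
print(token_weights([1.0, 0.0], [2, 3], [5, 4], current_version=5))
```

In a real trainer these weights would multiply each token’s policy-gradient term. The sketch stops short of that, since the clipping and importance-ratio details are not described in the announcement.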

This is, to put it bluntly, a real training-time idea. It’s not a benchmark-rigging trick, not a prompt-engineering trick, and not the kind of file-system scaffolding that some agentic coding setups bolt on with AGENTS.md context layers — that concept shows up in different corners of the field and is unrelated to what Ornith is doing here. The scaffold in Ornith-1.0 is internal to the model’s reasoning, generated at training time, and reinforced end-to-end. That distinction matters.

The 35B-MoE Data Point That Should Change Your Deployment Plan

With roughly 3B active parameters, Ornith-1.0-35B scores 64.4 on Terminal-Bench 2.1 — beating Qwen 3.5-397B’s 53.5 on the same benchmark. That’s an ~11x reduction in total parameters for a +10.9-point improvement, and it runs on a single high-end GPU.
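The two figures in that paragraph can be checked directly from the table values. This is only the arithmetic, with the dictionary layout chosen for illustration:

```python
# Vendor-reported values from the benchmark table above.
ornith_35b = {"total_params_b": 35, "terminal_bench_2_1": 64.4}
qwen_397b = {"total_params_b": 397, "terminal_bench_2_1": 53.5}

param_ratio = qwen_397b["total_params_b"] / ornith_35b["total_params_b"]
point_gain = round(ornith_35b["terminal_bench_2_1"] - qwen_397b["terminal_bench_2_1"], 1)

print(f"total-parameter ratio: {param_ratio:.1f}x")
print(f"Terminal-Bench 2.1 gain: +{point_gain} points")
```

The ratio uses total parameters on both sides. No active-parameter comparison is computed here, because the article gives an active count only for the Ornith model.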

If that holds up under independent replication (the published eval harness at least makes replication feasible), this is the strongest current evidence that the parameter-efficiency gap in agentic coding is closing fast. The deployment calculus for code-agent products is about to shift. A team that today pays for a frontier closed-source API, or provisions a multi-GPU cluster to self-host a large open model, can plausibly run something in the same neighborhood on a single high-end GPU.

Reward Hacking: How DeepReinforce Defends It (and What It Doesn’t Prove)

Letting a model author its own training scaffold raises the obvious question: how do you stop it from gaming the reward signal? DeepReinforce’s answer is three layers of defense:

  1. Immutable trust boundary. The environment, tools, and test isolation sit outside the model’s reach. The model can propose scaffolds, but it can’t rewrite the rules of the evaluation.
  2. Deterministic monitor. Any trajectory that reads withheld paths or modifies verification scripts gets zero reward. Hard kill, no negotiation.
  3. Frozen LLM judge as veto. A separate, frozen model inspects trajectories for intent-level gaming within the permitted tool surface and can veto suspicious successes.
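As a rough illustration of how layers 2 and 3 could gate the reward, consider the sketch below. It is an assumption-heavy toy, not DeepReinforce’s code: the path prefixes, the TrajectoryLog fields, and the choice to map a judge veto to zero reward are all invented for the example. Layer 1, the trust boundary, lives in the sandbox itself and so has no counterpart in this function.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical locations; the real withheld and verifier paths are not published here.
WITHHELD_PREFIXES = ("/eval/hidden/",)
VERIFIER_PREFIXES = ("/eval/verify/",)


@dataclass
class TrajectoryLog:
    paths_read: list[str]
    paths_written: list[str]
    tests_passed: bool


def monitor_allows(log: TrajectoryLog) -> bool:
    """Layer 2: deterministic rules, no model involved."""
    read_withheld = any(p.startswith(WITHHELD_PREFIXES) for p in log.paths_read)
    touched_verifier = any(p.startswith(VERIFIER_PREFIXES) for p in log.paths_written)
    return not (read_withheld or touched_verifier)


def final_reward(log: TrajectoryLog, judge_vetoes: Callable[[TrajectoryLog], bool]) -> float:
    if not monitor_allows(log):
        return 0.0                      # rule violation: zero reward regardless of tests
    if not log.tests_passed:
        return 0.0
    if judge_vetoes(log):               # Layer 3: frozen judge reviews apparent successes
        return 0.0                      # assumed outcome of a veto
    return 1.0


clean = TrajectoryLog(["/repo/app.py"], ["/repo/app.py"], tests_passed=True)
peeked = TrajectoryLog(["/eval/hidden/answers.json"], ["/repo/app.py"], tests_passed=True)

print(final_reward(clean, judge_vetoes=lambda log: False))
print(final_reward(peeked, judge_vetoes=lambda log: False))
```

The ordering reflects the auditability argument: the cheap deterministic check runs first and cannot be overridden, while the judge is consulted only for trajectories that already look like successes.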

This is a defensible architecture. The trust boundary and the monitor are auditable in principle — you can read the code and verify what the model can’t touch. The frozen-judge veto is harder to audit, because the judge is itself a language model with its own failure modes. DeepReinforce hasn’t published independent red-teaming results, and you should not assume the model can’t find a way to game its own training until third parties have tried.

For high-stakes deployment — anything where the cost of a compromised training signal is meaningful — treat the defenses as a working hypothesis, not a guarantee.

Verdict

Three things, in order of confidence:

  1. Open-source coding models are now genuinely competitive with frontier closed-source systems on agentic benchmarks. Ornith-1.0-397B beating Opus 4.7 on both SWE-Bench Verified and Terminal-Bench 2.1 is a real data point, even with the vendor-reporting caveat. DeepSeek-V4-Pro and MiniMax M3 are in the same neighborhood. The open-source frontier has caught up, and on some axes it has pulled ahead.

  2. Parameter efficiency is the more important story. The 35B-MoE result is the one that should change deployment planning. A single high-end GPU running a coding model within about six points of Opus 4.7 on Terminal-Bench 2.1 (64.4 versus 70.3) is a configuration most teams can actually afford. That’s the shift.

  3. Self-scaffolding is an idea worth watching. Even if Ornith-1.0 weren’t a benchmark winner, the training-time innovation would be the part to study. A model that learns to author better orchestration logic alongside better code is doing something structurally different from the rest of the field. Expect this idea to get pressure-tested, replicated, or stolen within the next two quarters.

The open question is the usual one: do the benchmarks hold up under independent replication? DeepReinforce published hyperparameters, used the standard OpenHands harness, and put weights on Hugging Face under MIT. That’s the minimum bar for verifiability. The rest is up to the community.

