GLM-5.1 Just Beat GPT-5.4 and Claude Opus 4.6 on SWE-bench Pro — And It's Open Source

GLM-5.1 dropped yesterday (April 8, 2026), and the benchmark tables just got interesting.

Z.ai (formerly Zhipu AI) released its flagship open-source coding agent model, and on the SWE-bench Pro benchmark — the hard one, the one designed to separate frontier models on real software engineering work — GLM-5.1 scored 58.4%, outperforming GPT-5.4 (57.7%) and Claude Opus 4.6 (~54%).

The leaderboard as it stands right now (per BenchLM.ai):

Model                   SWE-bench Pro
---------------------   -------------
Claude Mythos Preview   77.8%
GLM-5.1                 58.4%
GPT-5.4                 57.7%
Claude Opus 4.6         ~54%
Gemini 3.1 Pro          ~50%

That’s not a rounding error. That’s a gap that matters when you’re routing production coding tasks.

The Numbers That Stand Out

  • 58.4% on SWE-bench Pro — beats every closed-source competitor except Claude Mythos Preview
  • 94.6% of Claude Opus 4.6’s broader coding score — their own internal framing, which tells you they’re aware there’s still a gap above them
  • 8-hour autonomous execution — sustained “experiment–analyze–optimize” loop without human intervention
  • 655 iterations in a single demo, building a Linux desktop system from scratch
  • 3.6x geometric mean speedup on KernelBench Level 3 (real ML workloads)
  • 200K context window, 128K output tokens
  • MIT license — fully open weights
  • Trained on Huawei Ascend chips — no Nvidia dependency
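
A note on the KernelBench figure above: a "geometric mean speedup" is the nth root of the product of the per-kernel speedups, which keeps one outlier kernel from dominating the average. A quick sketch with made-up per-kernel numbers (illustrative only, not Z.ai's actual results):

```python
import math

# Hypothetical per-kernel speedups — NOT real KernelBench data,
# just chosen so the aggregate works out to 3.6x.
speedups = [2.0, 4.0, 5.832]

# Geometric mean: nth root of the product. Less skewed by a single
# outlier kernel than an arithmetic mean would be.
geo_mean = math.prod(speedups) ** (1 / len(speedups))
print(round(geo_mean, 2))  # → 3.6
```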

The pricing is worth noting too: $1.00/M input tokens, $3.20/M output tokens via Z.ai’s API. For context, that’s competitive with the current GPT-5.4 pricing.
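
To make those rates concrete, here's the arithmetic for a single large agentic coding turn (the 150K-input / 8K-output request below is a hypothetical workload, not a published figure):

```python
# Published GLM-5.1 API rates via Z.ai
INPUT_PER_M = 1.00   # $ per 1M input tokens
OUTPUT_PER_M = 3.20  # $ per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API request at the rates above."""
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical agentic turn: 150K tokens of context in, 8K tokens out
print(f"${request_cost(150_000, 8_000):.4f}")  # → $0.1756
```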

What This Means for Local LLM Users

Here’s the part that matters for this audience: GLM-5.1 is designed for agentic coding workflows and is explicitly compatible with tools like Claude Code and OpenClaw. The weights are MIT-licensed. If you’re running a local setup — LM Studio, Ollama, whatever your poison — this is a model worth benchmarking locally.
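
If your local server exposes an OpenAI-compatible chat endpoint (LM Studio and Ollama both do), wiring GLM-5.1 into a script is a few lines. A minimal sketch — the base URL and the `glm-5.1` model tag are assumptions, so match them to whatever your server actually reports:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's default port — assumed, check yours
MODEL = "glm-5.1"                       # hypothetical local model tag

def ask_local_model(prompt: str) -> str:
    """Send one chat turn to a locally served model via an OpenAI-compatible API."""
    payload = {"model": MODEL, "messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Usage (requires a running local server with the model pulled):
# print(ask_local_model("Write a binary search in Python."))
```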

The 8-hour autonomous execution claim is the headline, but the practical upside is the combination of a large context window, high output token limits, and an agentic design that plays nice with tool-use workflows. That’s the exact profile of what TopClanker readers are running on their own hardware.

The Caveat

Claude Mythos Preview still leads at 77.8% — and that’s a meaningful gap. Z.ai itself acknowledges GLM-5.1 is at ~94.6% of Claude Opus 4.6’s broader coding score, which suggests the remaining gap is in reasoning and creative tasks, not just benchmark mechanics. SWE-bench Pro is a strong signal, but it’s not the whole picture.

The Take

Open-source coding agents just crossed a threshold. GLM-5.1 isn’t beating the frontier on every axis, but it’s beating the closed-source incumbents on the benchmark that supposedly separates frontier models — and it’s MIT-licensed, trained on non-Nvidia hardware, and already compatible with the tools you’re using.

Run it locally. The weights are out. Benchmark it against whatever you’re currently using. The leaderboard is moving.

