GLM-5.1 Just Beat GPT-5.4 and Claude Opus 4.6 on SWE-bench Pro — And It's Open Source
GLM-5.1 dropped yesterday (April 8, 2026), and the benchmark tables just got interesting.
Z.ai (formerly Zhipu AI) released their flagship open-source coding agent model, and on the SWE-bench Pro benchmark — the hard one, the one designed to separate frontier models on real software engineering work — GLM-5.1 scored 58.4%, outperforming GPT-5.4 (57.7%) and Claude Opus 4.6.
The leaderboard as it stands right now (per BenchLM.ai):
| Model | SWE-bench Pro |
|---|---|
| Claude Mythos Preview | 77.8% |
| GLM-5.1 | 58.4% |
| GPT-5.4 | 57.7% |
| Claude Opus 4.6 | ~54% |
| Gemini 3.1 Pro | ~50% |
That’s not a rounding error. That’s a gap that matters when you’re routing production coding tasks.
The Numbers That Stand Out
- 58.4% on SWE-bench Pro — beats every closed-source competitor except Claude Mythos Preview
- 94.6% of Claude Opus 4.6’s broader coding score — their own internal framing, which tells you they’re aware there’s still a gap above them
- 8-hour autonomous execution — sustained “experiment–analyze–optimize” loop without human intervention
- 655 iterations in a single demo, building a Linux desktop system from scratch
- 3.6x geometric mean speedup on KernelBench Level 3 (real ML workloads)
- 200K context window, 128K output tokens
- MIT license — fully open weights
- Trained on Huawei Ascend chips — no Nvidia dependency
The pricing is worth noting too: $1.00/M input tokens, $3.20/M output tokens via Z.ai’s API. For context, that’s competitive with the current GPT-5.4 pricing.
What This Means for Local LLM Users
Here’s the part that matters for this audience: GLM-5.1 is designed for agentic coding workflows and is explicitly compatible with tools like Claude Code and OpenClaw. The weights are MIT-licensed. If you’re running a local setup — LM Studio, Ollama, whatever your poison — this is a model worth benchmarking locally.
The 8-hour autonomous execution claim is the headline, but the practical upside is the combination of a large context window, high output token limits, and an agentic design that plays nice with tool-use workflows. That’s the exact profile of what TopClanker readers are running on their own hardware.
The Caveat
Claude Mythos Preview still leads at 77.8% — and that’s a meaningful gap. Z.ai itself acknowledges GLM-5.1 is at ~94.6% of Claude Opus 4.6’s broader coding score, which suggests the remaining gap is in reasoning and creative tasks, not just benchmark mechanics. SWE-bench Pro is a strong signal, but it’s not the whole picture.
The Take
Open-source coding agents just crossed a threshold. GLM-5.1 isn’t beating the frontier on every axis, but it’s beating the closed-source incumbents on the benchmark that supposedly separates frontier models — and it’s MIT-licensed, trained on non-Nvidia hardware, and already compatible with the tools you’re using.
Run it locally. The weights are out. Benchmark it against whatever you’re currently using. The leaderboard is moving.
Sources