Why Every Vibe Coding Benchmark Is Missing the Point
Vals AI showed us how to measure if models can build apps. But 'it works' and 'it's built right' are different problems. Here's the evaluation framework that measures both.
Last month, Vals AI published Vibe Code Bench, a benchmark that hands models 100 web application specs and measures whether they can turn natural language into working software. The methodology was rigorous: 964 browser-based workflows, 10,131 substeps, and autonomous browser agents evaluating whether the output actually worked.
The top model, GPT-5.3-Codex, passed 61.8% of workflows. Even the best model still failed roughly 4 in 10 of them.
This is a genuinely useful signal. But it only measures one axis: does the output work?
It tells us nothing about whether the model built the app the right way: whether it followed conventions, used appropriate patterns, structured the codebase for maintainability, or understood the context it was working in. “It works” and “it’s built correctly” have been treated as separate problems in AI coding evals. They shouldn’t be.
Source: Vals AI — Vibe Code Bench v1.1
What Existing Benchmarks Miss
SWE-bench tests whether a model can resolve a GitHub issue. LiveCodeBench tests competitive programming. Vibe Code Bench tests whether a model can build a working app from a prompt.
None of them test whether an agent can operate within a codebase’s conventions — understanding the existing patterns, following team-specific rules, maintaining the kind of institutional knowledge that makes a codebase sustainable over time.
This matters because real teams don’t want an agent that can build an app. They want an agent that can build apps like their team builds apps.
The difference is the context. The conventions. The “this is how we do auth here” rules that live in people’s heads or, at best, scattered across a README.
Layered AGENTS.md as Institutional Memory
ActionsCI’s node-agentic-scaffold takes a different approach. Instead of evaluating a model in isolation, it gives agents a layered context system:
```
AGENTS.md (root)                  ← Org-wide golden rules, tech stack, conventions
├── services/auth/AGENTS.md      ← Auth-specific rules, frozen files, error codes
├── services/payments/AGENTS.md  ← Payments-specific rules, idempotency, DAL
└── shared/AGENTS.md             ← Rules for shared utilities
```
Rules cascade downward. A service-level AGENTS.md can add stricter constraints but cannot relax a root-level rule. When there’s a conflict, the root wins.
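Here is a minimal sketch of that cascade, assuming a rule is just a named constraint and that a name collision across layers counts as a conflict. The `Rule` shape and `resolveRules` helper are illustrative, not part of the scaffold’s actual API:

```typescript
// Minimal sketch of the layered-rules cascade. A rule is a named constraint;
// a name collision across layers is treated as a conflict.

interface Rule {
  name: string;        // e.g. "auth-middleware"
  constraint: string;  // the convention the agent must follow
}

// Service layers may add rules, but on a conflict the root rule wins,
// mirroring "stricter is allowed, relaxing is not".
function resolveRules(root: Rule[], service: Rule[]): Rule[] {
  const merged = new Map<string, Rule>();
  for (const rule of service) merged.set(rule.name, rule); // service additions first
  for (const rule of root) merged.set(rule.name, rule);    // root overrides on conflict
  return [...merged.values()];
}

const rootRules: Rule[] = [
  { name: "auth-middleware", constraint: "Every route goes through the shared auth middleware" },
];
const authServiceRules: Rule[] = [
  { name: "auth-middleware", constraint: "Health-check routes may skip auth" },     // dropped: conflicts with root
  { name: "error-codes", constraint: "Auth errors use the AUTH_* code namespace" }, // kept: new, stricter
];

console.log(resolveRules(rootRules, authServiceRules));
// -> the root auth-middleware rule plus the service-specific error-codes rule
```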
This is designed for human teams working with AI coding agents. Fill out the layered AGENTS.md files with your actual conventions, give the agent the relevant context, and it builds the way your team builds.
But there’s a second benefit: the AGENTS.md structure is also an evaluation rubric.
Source: ActionsCI/node-agentic-scaffold — GitHub
The Scaffold-Based Eval Format
Here’s the core idea we’re developing for TopClanker’s eval framework:
The spec defines what to build. “Build a Zeeter-style social app — authentication, posts, follows, likes.”
The AGENTS.md layers define how to evaluate whether it was built correctly. Not just “does it function” but:
- Did the agent put auth behind the right middleware?
- Did it follow the DAL conventions for database access?
- Did it handle idempotency correctly in the payments flow?
- Are the tests structured according to the org’s testing standards?
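To make that concrete, here is a rough sketch of how those rubric items could become executable checks. The file layout, check names, and markers like `requireAuth` and `pg.Pool` are assumptions for illustration, not the scaffold’s real conventions:

```typescript
// Sketch: AGENTS.md rules doubling as an eval rubric over a generated repo.
// Uses Node's fs/path; the checks below are hypothetical examples.

import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

interface ConventionCheck {
  id: string;
  description: string;
  passes: (repoRoot: string) => boolean;
}

// Recursively collect TypeScript source files under a directory.
function sourceFiles(dir: string): string[] {
  return readdirSync(dir, { withFileTypes: true }).flatMap((entry) => {
    const full = join(dir, entry.name);
    if (entry.isDirectory()) return sourceFiles(full);
    return full.endsWith(".ts") ? [full] : [];
  });
}

// Hypothetical checks derived from a service-level AGENTS.md.
const checks: ConventionCheck[] = [
  {
    id: "auth-middleware",
    description: "Route files in services/auth import the shared auth middleware",
    passes: (root) =>
      sourceFiles(join(root, "services/auth"))
        .filter((f) => f.includes("routes"))
        .every((f) => readFileSync(f, "utf8").includes("requireAuth")),
  },
  {
    id: "dal-only-db-access",
    description: "Only the DAL directory talks to the database driver directly",
    passes: (root) =>
      sourceFiles(root)
        .filter((f) => !f.includes("/dal/"))
        .every((f) => !readFileSync(f, "utf8").includes("pg.Pool")),
  },
];

// Run the rubric and report a convention score alongside the functional result.
export function runRubric(repoRoot: string) {
  return checks.map((c) => ({ id: c.id, passed: c.passes(repoRoot) }));
}
```

Each check maps back to a specific AGENTS.md rule, so a failing check points at exactly which convention the agent missed rather than just a pass/fail on the whole app.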
The difference between Vals’ approach and this one:
| Dimension | Vals Vibe Code Bench | Scaffold-Based Eval |
|---|---|---|
| What it measures | Does the app work? | Does the agent understand the codebase? |
| Context | None — bare prompt | Layered conventions (AGENTS.md) |
| Evaluation | Browser agent clicks through UI | Convention checks + functional tests |
| What it’s testing | Model capability in isolation | Agent + context alignment |
The scaffold-based format doesn’t replace Vibe Code Bench — it extends it. A model might build a working Zeeter clone. But did it build it using the team’s auth patterns? Did it use the right database access layer? That’s what this format measures.
Why This Matters for Local AI
The local AI audience has an additional constraint: these models run on your own hardware. The question isn’t just “can the model build this” — it’s “can the model build this within the resource constraints of a local setup.”
A quantized 7B model might pass a simple scaffold-based eval at Tier 1 (single-page apps, basic forms). Tier 3 (full apps with auth, payments, external APIs) might require a 35B+ model with GPU offload.
The scaffold framework lets us measure the capability gap across model sizes, quantization levels, and hardware configurations — giving the local AI community concrete data on what works and what doesn’t.
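One way to make that capability map concrete is to record each run as a (model config, tier) pair. The sketch below shows a hypothetical shape for those records; the field names, quantization labels, and tier scopes are placeholders, not published definitions:

```typescript
// Hypothetical record shapes for tracking the local-model capability gap.

interface LocalRunConfig {
  model: string;        // e.g. a 7B model in GGUF form
  quantization: string; // e.g. "Q4_K_M"
  gpuLayers: number;    // layers offloaded to GPU; 0 means CPU-only
}

interface EvalTier {
  tier: number;
  scope: string;              // what the spec asks for at this tier
  conventionChecks: string[]; // AGENTS.md-derived checks that apply
}

const tiers: EvalTier[] = [
  { tier: 1, scope: "single-page apps, basic forms", conventionChecks: ["project-layout"] },
  { tier: 2, scope: "multi-page apps with auth", conventionChecks: ["project-layout", "auth-middleware"] },
  { tier: 3, scope: "full apps: auth, payments, external APIs", conventionChecks: ["auth-middleware", "dal-only-db-access", "idempotency"] },
];

// Each benchmark run pairs one LocalRunConfig with one tier; aggregating the
// results over many configs is what produces the capability-gap curves.
```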
What’s Next
This post is the first in a series documenting our eval framework development. Next up:
- Post 2: Building the harness — how we’re automating scaffold-based evals with Playwright for browser testing and convention checking for AGENTS.md compliance (a rough sketch of the browser-test side follows this list)
- Post 3: First benchmark results — which local models pass which tier, and what the quantization tradeoff curves look like
- Ongoing: Expanding the spec library, publishing results as we generate them
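As a preview of the Playwright side mentioned in Post 2, here is a sketch of a single workflow check against a generated app running locally. The URL, selectors, and routes are placeholders from the hypothetical Zeeter spec, not the real harness:

```typescript
// Sketch of one functional workflow check with Playwright, assuming the
// generated app is already serving at baseUrl.

import { chromium } from "playwright";

async function checkSignupFlow(baseUrl: string): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto(`${baseUrl}/signup`);
    await page.fill("input[name=email]", "eval@example.com");
    await page.fill("input[name=password]", "correct-horse-battery");
    await page.click("button[type=submit]");
    // A successful signup should land on the feed; anything else fails the workflow.
    await page.waitForURL(`${baseUrl}/feed`, { timeout: 5000 });
    return true;
  } catch {
    return false;
  } finally {
    await browser.close();
  }
}

// Combined with the convention rubric above, each run yields two scores:
// "does it work" (workflows passed) and "is it built right" (conventions passed).
```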
The goal: an open, reproducible eval suite for local AI coding agents. Not a one-time benchmark — an ongoing capability tracking system the community can use and contribute to.
If you’re building with local models and care whether your agents actually follow your conventions, this is the framework that makes it measurable.