Here’s a problem nobody in the AI coding agent space is being honest about: we keep measuring whether the output app works, not whether it follows our conventions.
You can get a working habit tracker app in ten minutes. That tells you nothing about whether the agent absorbed your team’s standards. That’s a different problem — and that’s why we built the eval harness.
Traditional benchmarks for AI coding agents focus on functional output: Does the app boot? Do the tests pass? Does it survive a smoke test?
Those are fine starting points. They’re also completely insufficient if you’re trying to run agent output through a consistent pipeline. A working app can still ship with security debt, missing coverage files, half-absorbed scaffold conventions, and zero visibility into any of it.
The question we needed to answer wasn’t “does it work?” It was “does it work the way we want it to work?”
We’ve been building github.com/burninmedia/eval-harness to answer that question. The harness runs five distinct checks against agent-produced repos, including tests, coverage, scaffold conventions, and security.
Each check is discrete. An agent can ace functional and fail on conventions. That’s the point. We want granular signal, not a binary pass/fail that obscures where things went sideways.
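That "granular signal, not binary pass/fail" design can be sketched in a few lines. This is not the harness's actual code, just a minimal illustration of the idea: every check runs to completion and reports independently, and the example results mirror the run described in this post.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""

def run_all(checks: list[Callable[[], CheckResult]]) -> list[CheckResult]:
    # Deliberately no early exit: a functional pass and a conventions
    # failure should both show up in the same report.
    return [check() for check in checks]

results = run_all([
    lambda: CheckResult("functional", True, "67/67 tests passing"),
    lambda: CheckResult("conventions", False, "missing SESSION.md"),
])
assert [r.passed for r in results] == [True, False]
```

The point of returning a list rather than a single boolean is exactly the one above: you can see *which* dimension failed, not just *that* something failed.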
We pointed the harness at a Claude Code run on the claude/habit-tracker-app-U8J2t branch. Here’s what came back:
| Check | Result |
|---|---|
| Tests | 67/67 passing |
| Coverage | 97.77% |
| Scaffold Conventions | 4/5 (missing SESSION.md) |
| Security | 3 high-severity issues |
67/67 tests passing. 97.77% line coverage. Way above our 80% bar. The agent wrote the tests, and they work. No complaints here — this is exactly what you want to see on the functional side.
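A coverage gate like the one behind that 80% bar is a small check. Here is a sketch assuming an Istanbul/nyc-style `coverage-summary.json`, whose `total.lines.pct` field carries the line-coverage percentage; the sample JSON below is fabricated to match the numbers in this run.

```python
import json

def line_coverage_pct(summary_json: str) -> float:
    # Assumes the Istanbul/nyc coverage-summary.json shape:
    # a "total" rollup with "lines.pct" as the line-coverage percentage.
    data = json.loads(summary_json)
    return data["total"]["lines"]["pct"]

def coverage_gate(pct: float, threshold: float = 80.0) -> bool:
    # Pass when measured coverage meets the configured bar.
    return pct >= threshold

# Fabricated sample matching the run described above.
sample = '{"total": {"lines": {"total": 450, "covered": 440, "pct": 97.77}}}'
pct = line_coverage_pct(sample)
assert coverage_gate(pct)
```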
This is where the cracks started showing.
The agent missed SESSION.md. That’s one of the five scaffold files in our standard layout. The other four were present and correct, but missing one file means the repo doesn’t fully conform to what we expect from agent output.
That’s a real signal. It tells us the agent didn’t fully absorb our scaffold before starting. It’s not catastrophic — SESSION.md is relatively easy to add retroactively — but it’s the kind of gap that compounds across dozens of agent runs if you’re not catching it.
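A scaffold-conventions check reduces to "are the expected files present?". The sketch below reproduces the 4/5 result from this run. SESSION.md is the only scaffold file named in this post; the other four filenames are hypothetical placeholders, not our actual layout.

```python
from pathlib import Path
import tempfile

# SESSION.md is real; the other four names are hypothetical placeholders.
EXPECTED_SCAFFOLD = ["SESSION.md", "README.md", "AGENTS.md", "TODO.md", "ARCHITECTURE.md"]

def scaffold_check(repo: Path, expected: list[str] = EXPECTED_SCAFFOLD):
    """Return (present_count, expected_count, missing_names)."""
    missing = [name for name in expected if not (repo / name).exists()]
    return len(expected) - len(missing), len(expected), missing

# Simulate the run described above: four of five files present.
with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp)
    for name in EXPECTED_SCAFFOLD[1:]:
        (repo / name).touch()
    score = scaffold_check(repo)

assert score == (4, 5, ["SESSION.md"])
```

Returning the missing names, not just a score, is what makes the gap actionable across dozens of runs.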
Here’s the one that should get attention.
The three high-severity issues trace back through bcrypt → node-tar as a transitive dependency chain. node-tar has had known high-severity CVEs. bcrypt pulls it in transitively, meaning the agent never explicitly chose a vulnerable dependency — it inherited one.
This is the failure mode that functional tests don’t catch. The app works. The tests pass. But there’s a CVE sitting in your node_modules that your test suite never touched.
Our harness catches this. The functional run doesn’t — and shouldn’t — care about transitive dependency health. That’s a different check. We now run it automatically.
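Surfacing the transitive cases can be as simple as filtering audit output. The sketch below assumes the npm 7+ `npm audit --json` shape, where each entry under `vulnerabilities` carries a `severity` and an `isDirect` flag; the sample payload is fabricated for illustration, not real audit output for this repo.

```python
import json

def high_severity_transitives(audit_json: str) -> list[str]:
    # Assumes npm 7+ audit JSON: "vulnerabilities" keyed by package name,
    # each with "severity" and "isDirect". Keep only the inherited ones.
    data = json.loads(audit_json)
    return sorted(
        name for name, vuln in data.get("vulnerabilities", {}).items()
        if vuln.get("severity") == "high" and not vuln.get("isDirect", True)
    )

# Fabricated sample: a direct dep (bcrypt) and the transitive it pulls in.
sample = json.dumps({
    "vulnerabilities": {
        "node-tar": {"severity": "high", "isDirect": False},
        "bcrypt":   {"severity": "high", "isDirect": True},
    }
})
assert high_severity_transitives(sample) == ["node-tar"]
```

Filtering on `isDirect` is the key move: it separates "the agent picked a bad dependency" from "the agent inherited one," which are different failure modes to track.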
This first run is a data point, not a verdict. But it’s an instructive one.
Functional output is not the hard problem. Claude Code shipped a habit tracker app with 67 passing tests and near-perfect coverage. That’s solved.
Convention absorption is the unsolved part. Missing SESSION.md is a scaffold problem, not a logic problem. The agent produced working code that didn’t fully conform to our layout expectations.
Security debt hides in transitive deps. This is the most underappreciated risk in AI-generated code. The agent didn’t pick a bad dependency — it picked a legitimate one (bcrypt) that pulls in a problematic transitive (node-tar). Nobody caught that in the functional run.
The harness is live. Every agent run we do from here forward goes through it automatically. We’re tracking these metrics over time — not just “did it pass” but “where did it fail” and “is that failure mode getting better or worse.”
We’ll publish results as we accumulate them. The goal isn’t to catch bad output — it’s to build a feedback loop that makes the next run better.