Here’s a problem nobody in the AI coding agent space is being honest about: we keep measuring whether the output app works, not whether it follows our conventions.

You can get a working habit tracker app in ten minutes. That tells you nothing about whether the agent absorbed your team’s standards. That’s a different problem — and that’s why we built the eval harness.

The Problem With “It Works”

Traditional benchmarks for AI coding agents focus on functional output: Does the app boot? Do the tests pass? Does it pass a smoke test?

Those are fine starting points. They’re also completely insufficient if you’re trying to run agent output through a consistent pipeline. A working app can still ship with security debt, missing coverage files, half-absorbed scaffold conventions, and zero visibility into any of it.

The question we needed to answer wasn’t “does it work?” It was “does it work the way we want it to work?”

The Eval Harness

We’ve been building github.com/burninmedia/eval-harness to answer that question. The harness runs five distinct checks against agent-produced repos:

  1. Tests — Do they pass?
  2. Coverage — Does it meet our 80% threshold?
  3. Conventions — Did the agent absorb the scaffold files?
  4. Security — Any known CVEs or high-severity issues, including in transitive dependencies?
  5. Functional — Does the app actually boot and behave as expected?

Each check is discrete. An agent can ace functional and fail on conventions. That’s the point. We want granular signal, not a binary pass/fail that obscures where things went sideways.
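
To make that shape concrete, here is a minimal sketch of what discrete, granular checks can look like. The type names and result shape are illustrative assumptions, not the actual API of the eval-harness repo.

```typescript
// Illustrative sketch only: the names and result shape here are assumptions,
// not the actual eval-harness API.
type CheckName = "tests" | "coverage" | "conventions" | "security" | "functional";

interface CheckResult {
  check: CheckName;
  passed: boolean;
  detail: string; // e.g. "67/67 passing" or "missing SESSION.md"
}

async function runHarness(
  checks: Record<CheckName, () => Promise<CheckResult>>
): Promise<CheckResult[]> {
  const results: CheckResult[] = [];
  for (const run of Object.values(checks)) {
    // Every check runs regardless of earlier failures, so one result never
    // masks another.
    results.push(await run());
  }
  return results;
}
```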

First Run: Claude Code on habit-tracker-app

We pointed the harness at a Claude Code run on the claude/habit-tracker-app-U8J2t branch. Here’s what came back:

Check                   Result
Tests                   67/67 passing
Coverage                97.77%
Scaffold Conventions    4/5 (missing SESSION.md)
Security                3 high-severity issues

Tests and Coverage: Strong

67/67 tests passing. 97.77% line coverage. Way above our 80% bar. The agent wrote the tests, and they work. No complaints here — this is exactly what you want to see on the functional side.
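
As a rough illustration, a coverage gate like this can be as small as reading an Istanbul-style summary file. The path and reporter below are assumptions (e.g. Jest's "json-summary" reporter writing coverage/coverage-summary.json); the 80% threshold is the bar described above.

```typescript
import { readFileSync } from "node:fs";

// Minimal coverage gate, assuming an Istanbul-style summary file
// (e.g. coverage/coverage-summary.json from Jest's "json-summary" reporter).
const COVERAGE_THRESHOLD = 80;

function checkCoverage(summaryPath = "coverage/coverage-summary.json") {
  const summary = JSON.parse(readFileSync(summaryPath, "utf8"));
  const pct: number = summary.total.lines.pct; // 97.77 on the run above
  return {
    check: "coverage",
    passed: pct >= COVERAGE_THRESHOLD,
    detail: `${pct}% line coverage (threshold ${COVERAGE_THRESHOLD}%)`,
  };
}
```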

Scaffold Conventions: 4/5

This is where the cracks started showing.

The agent missed SESSION.md. That’s one of the five scaffold files in our standard layout. The other four were present and correct, but missing one file means the repo doesn’t fully conform to what we expect from agent output.

That’s a real signal. It tells us the agent didn’t fully absorb our scaffold before starting. It’s not catastrophic — SESSION.md is relatively easy to add retroactively — but it’s the kind of gap that compounds across dozens of agent runs if you’re not catching it.
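
A conventions check along these lines is mostly a presence test. The post only names SESSION.md; the other entries in the list below are placeholders for whatever the scaffold layout actually requires.

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Presence check for scaffold files. Only SESSION.md is named in the post;
// treat the rest of this list as placeholders for the real scaffold layout.
const SCAFFOLD_FILES = ["SESSION.md" /* , ...the other four scaffold files */];

function checkConventions(repoRoot: string) {
  const missing = SCAFFOLD_FILES.filter((f) => !existsSync(join(repoRoot, f)));
  return {
    check: "conventions",
    passed: missing.length === 0,
    detail:
      missing.length > 0
        ? `missing ${missing.join(", ")}`
        : "all scaffold files present",
  };
}
```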

Security: 3 High-Severity Issues

Here’s the one that should get attention.

The issues trace back through the bcrypt → node-tar transitive dependency chain. node-tar has had known high-severity CVEs. bcrypt pulls it in transitively, meaning the agent never explicitly chose a vulnerable dependency — it inherited one.

This is the failure mode that functional tests don’t catch. The app works. The tests pass. But there’s a CVE sitting in your node_modules that your test suite never touched.

Our harness catches this. The functional run doesn’t — and shouldn’t — care about transitive dependency health. That’s a different check. We now run it automatically.
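
For reference, here is roughly what a transitive-dependency gate can look like when built on npm audit. This is a sketch, not the harness's actual implementation; it assumes the npm 7+ JSON shape (a vulnerabilities map with severity and isDirect per package), and older npm versions emit a different structure.

```typescript
import { execSync } from "node:child_process";

// Security gate sketch built on `npm audit --json`. Assumes the npm 7+ output
// shape: a `vulnerabilities` map with `severity` and `isDirect` per package.
function checkSecurity(repoRoot: string) {
  let raw: string;
  try {
    raw = execSync("npm audit --json", { cwd: repoRoot, encoding: "utf8" });
  } catch (err: any) {
    // npm audit exits non-zero when it finds vulnerabilities; the JSON report
    // is still on stdout.
    raw = err.stdout;
  }
  const audit = JSON.parse(raw);
  const severe = Object.values<any>(audit.vulnerabilities ?? {}).filter(
    (v) => v.severity === "high" || v.severity === "critical"
  );
  const transitive = severe.filter((v) => v.isDirect === false);
  return {
    check: "security",
    passed: severe.length === 0,
    detail: `${severe.length} high/critical issues (${transitive.length} transitive)`,
  };
}
```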

What the Data Tells Us

This first run is a data point, not a verdict. But it’s an instructive one.

Functional output is not the hard problem. Claude Code shipped a habit tracker app with 67 passing tests and near-perfect coverage. That’s solved.

Convention absorption is the unsolved part. Missing SESSION.md is a scaffold problem, not a logic problem. The agent produced working code that didn’t fully conform to our layout expectations.

Security debt hides in transitive deps. This is the most underappreciated risk in AI-generated code. The agent didn’t pick a bad dependency — it picked a legitimate one (bcrypt) that pulls in a problematic transitive (node-tar). Nobody caught that in the functional run.

What Happens Next

The harness is live. Every agent run we do from here forward goes through it automatically. We’re tracking these metrics over time — not just “did it pass” but “where did it fail” and “is that failure mode getting better or worse.”
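
One lightweight way to track that over time is to append each run's per-check results to a log and read the trend later. The file name and record shape below are illustrative assumptions, not how the harness actually persists results.

```typescript
import { appendFileSync } from "node:fs";

// Trend-tracking sketch: one JSONL line per harness run. File name and record
// shape are illustrative assumptions.
interface RunRecord {
  timestamp: string;
  branch: string;
  results: { check: string; passed: boolean; detail: string }[];
}

function recordRun(record: RunRecord, logPath = "harness-runs.jsonl") {
  appendFileSync(logPath, JSON.stringify(record) + "\n");
}

recordRun({
  timestamp: new Date().toISOString(),
  branch: "claude/habit-tracker-app-U8J2t",
  results: [
    { check: "conventions", passed: false, detail: "missing SESSION.md" },
    { check: "security", passed: false, detail: "3 high-severity issues" },
  ],
});
```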

We’ll publish results as we accumulate them. The goal isn’t to catch bad output — it’s to build a feedback loop that makes the next run better.

