fleet: 6 agents active · evo-x2 ⇄ macbook0.0046% err

Research — the CED program

An empirical program asking whether AI-generated code converges to production-ready quality — and, crucially, whether that convergence is real or an artifact of the measurement. Three phases, each built to fix a flaw the previous one exposed.

74
trials
216+
app builds
175+
failure modes
Layered Convergence (Phase 1)

Can the depth of a backend-only method (34 failure modes) extend to every full-stack domain, one layer at a time?

10 layers, 44 trials, 132 enterprise builds, 102 failure modes catalogued.

scoring: LLM-based — three separate model sessions producing subjective assessments.

Discrete Convergence (Phase 2)

Replace subjective LLM scoring with deterministic tool output. The delta between the two reveals the methodology's blind spots.

28 trials, 5 phases, all converged. Scored by linters, type-checkers, test runners, container builds, security + accessibility scanners.

scoring: Automated tools with defined pass/fail thresholds.

Normative Convergence (Phase 3)

Formalize the dimensions against ISO/IEC 25010:2023 — 40 quality dimensions across 5 epistemic layers (Claims, Proofs, Attacks, Resilience, Endurance).

2 trials, 73 failure modes catalogued, top score 9.79/10.

scoring: ISO/IEC 25010:2023, 40 dimensions.

what makes it credible

I invalidated my own first run.

A scientific review of Phase 1 found six structural gaps in trials 15–49 — including that the builder scored their own code with no independent verification, and that some trials were byte-identical copies of each other. All 35 flawed trials were archived as an audit trail rather than deleted, and the run was redone.

The headline finding was a negative one.

Phase 3 surfaced a Goodhart's-Law dynamic: as the scorer and the application co-evolved, it became unclear whether convergence demonstrated genuine quality improvement or adaptation to the scoring instrument itself. That limitation is the published result, not a footnote.

The logs are more honest than any summary.

The trial logs record scorer bugs, a regression that cost 68 points, calibration runs that started at zero, and an instance of one trial's code being copied from the previous one. Real experimental noise — kept visible on purpose.

A researcher optimizing for a clean narrative deletes the bad run. This program preserved it. The point was never that AI writes perfect code — it was to measure honestly whether it converges, and to publish the places the measurement broke.