Research — the CED program
An empirical program asking whether AI-generated code converges to production-ready quality — and, crucially, whether that convergence is real or an artifact of the measurement. Three phases, each built to fix a flaw the previous one exposed.
Can the depth of a backend-only method (34 failure modes) extend to every full-stack domain, one layer at a time?
10 layers, 44 trials, 132 enterprise builds, 102 failure modes catalogued.
scoring: LLM-based — three separate model sessions producing subjective assessments.
Replace subjective LLM scoring with deterministic tool output. The delta between the LLM scores and the tool scores reveals the methodology's blind spots.
28 trials, 5 phases, all converged. Scored by linters, type-checkers, test runners, container builds, security + accessibility scanners.
scoring: Automated tools with defined pass/fail thresholds.
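As a rough illustration of what "defined pass/fail thresholds" and "the delta" could look like in practice, here is a minimal sketch. It is not the program's actual scorer: the check names, the thresholds, the 0-10 scale, and the equal weighting of checks are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCheck:
    """One deterministic check. All names and thresholds here are hypothetical."""
    name: str                      # e.g. "lint_errors", "test_pass_rate"
    measured: float                # value reported by the tool
    threshold: float               # pass/fail cut-off
    higher_is_better: bool = True  # False for counts of errors or findings

    def passed(self) -> bool:
        if self.higher_is_better:
            return self.measured >= self.threshold
        return self.measured <= self.threshold


def tool_score(checks: list[ToolCheck]) -> float:
    """Fraction of checks passed, scaled to 0-10 (equal weights assumed)."""
    if not checks:
        raise ValueError("at least one check is required")
    return 10.0 * sum(c.passed() for c in checks) / len(checks)


def scoring_delta(llm_score: float, checks: list[ToolCheck]) -> float:
    """Positive result: the LLM assessment was more generous than the tools."""
    return llm_score - tool_score(checks)


if __name__ == "__main__":
    checks = [
        ToolCheck("lint_errors", measured=0, threshold=0, higher_is_better=False),
        ToolCheck("test_pass_rate", measured=0.97, threshold=1.0),
        ToolCheck("type_errors", measured=0, threshold=0, higher_is_better=False),
        ToolCheck("coverage", measured=0.85, threshold=0.80),
    ]
    print("tool score:", tool_score(checks))
    print("delta vs. LLM score of 9.0:", scoring_delta(9.0, checks))
```

The design point is that every input to tool_score comes from a tool run that can be repeated, so any gap against the LLM-based score can be traced back to a specific failed check rather than to a difference of opinion.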
Formalize the dimensions against ISO/IEC 25010:2023 — 40 quality dimensions across 5 epistemic layers (Claims, Proofs, Attacks, Resilience, Endurance).
2 trials, 73 failure modes catalogued, top score 9.79/10.
scoring: ISO/IEC 25010:2023, 40 dimensions.
I invalidated my own first run.
A scientific review of Phase 1 found six structural gaps in trials 15–49 — including that the builder scored their own code with no independent verification, and that some trials were byte-identical copies of each other. All 35 flawed trials were archived as an audit trail rather than deleted, and the run was redone.
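One of the gaps above, byte-identical trials, is the kind of thing a short script can catch before scoring. The sketch below is an assumption about how such a check might be written, not the review's actual tooling: it assumes each trial lives in its own directory, and the function names and directory layout are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def tree_digest(root: Path) -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    h = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix().encode("utf-8")
        data = path.read_bytes()
        # Length prefixes keep path/content boundaries unambiguous.
        h.update(len(rel).to_bytes(8, "big"))
        h.update(rel)
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def find_identical_trials(trial_dirs: list[Path]) -> list[list[str]]:
    """Group trial directories whose contents are byte-for-byte identical."""
    groups: dict[str, list[str]] = defaultdict(list)
    for trial_dir in trial_dirs:
        groups[tree_digest(trial_dir)].append(trial_dir.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]
```

Calling find_identical_trials on the list of trial directories returns one group per set of duplicates and an empty list when every trial is distinct. A check like this only catches exact copies; near-copies with trivial edits would need a diff-based or similarity-based comparison instead.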
The headline finding was a negative one.
Phase 3 surfaced a Goodhart's law dynamic: as the scorer and the application co-evolved, it became unclear whether convergence demonstrated genuine quality improvement or adaptation to the scoring instrument itself. That limitation is the published result, not a footnote.
The logs are more honest than any summary.
The trial logs record scorer bugs, a regression that cost 68 points, calibration runs that started at zero, and an instance of one trial's code being copied from the previous one. Real experimental noise — kept visible on purpose.
A researcher optimizing for a clean narrative deletes the bad run. This program preserved it. The point was never that AI writes perfect code — it was to measure honestly whether it converges, and to publish the places the measurement broke.