fleet: 6 agents active · evo-x2 ⇄ macbook0.0046% err
2026-06-01

The Levels of AI Orchestration (L0–L7)


title: The Levels of AI Orchestration (L0–L7) date: 2026-06-01 summary: A ladder for AI-native engineering — from chat-assistant to self-improving control plane — and an honest read on where the industry actually sits.

Why "AI-native" is illegible

"AI-native" has stopped carrying information. A senior engineer who keeps a chat tab open for tricky regexes says it. So does someone running a fleet of autonomous agents across machines while they sleep. Both are telling the truth, and the term flattens a roughly hundred-fold difference in how the work actually gets done into a single adjective. When everyone claims the same label, the label is no longer a claim — it's noise.

The problem is that we measure adoption when we should be measuring orchestration. Counting seats — how many developers have Copilot or Cursor enabled — tells you almost nothing about how AI changes the shape of the work. The interesting variable is not whether a model is in the loop but how much of the loop the human still occupies, and what they had to build to safely hand the rest away.

So here is a ladder. It runs from L0 to L7 and it tries to make the differences legible: at each rung, what does the human actually do, and what would have to be true to climb to the next. I want to be clear up front that this is a model, not gospel. The boundaries are fuzzy, real practices straddle rungs, and the numbering is descriptive, not a scoreboard. Its only job is to give us words precise enough to disagree with.

The ladder, L0 to L7

| Level | Name | What the human does | |---|---|---| | L0 | Chat / copy-paste | Open a model in a browser tab, paste code in, paste the answer back. The IDE has no idea the model exists. | | L1 | Autocomplete & approve-each-step | Inline suggestions you accept or reject, or an agent that asks permission before every action. The human drives every keystroke; the model shortens them. | | L2 | One autonomous agent, supervised | You hand an agent a whole task and let it run — edit files, run commands, iterate — while you watch. You intervene on drift, but you are not approving each step. | | L3 | A few parallel agents | Two or three sessions running at once. You context-switch between them, reviewing and unblocking. Your scarce resource becomes attention, not typing. | | L4 | Multi-machine fleet, away-from-keyboard | Many autonomous agents across machines, controllable remotely. You behave like an air-traffic controller — routing, prioritizing, resolving conflicts — not a typist. The work continues when you step away. | | L5 | Platform-builder | You no longer just run the fleet; you engineer the substrate it runs on. Hooks that gate work, coordination so sessions don't collide, telemetry to see what happened, independent reviewers to catch what the implementer missed. | | L6 | Autonomy-platform owner | The control plane becomes something other humans can trust and operate. Trust stops being personal — "it's fine because I watched it" — and becomes institutional: the guarantees hold regardless of who is at the console. | | L7 | Recursive / self-directing | The system improves its own orchestration. Humans set objectives and constraints; the system proposes and adjusts how it pursues them. |

On L7 I'll be blunt: nobody is durably here in mid-2026. There are demos, there are narrow loops that tune a prompt or pick a tool, and there is a great deal of marketing that rounds those up to "self-improving." A scripted demo that works on a curated task is not a system you can leave running against your codebase for a month. When someone claims L7, the right reflex is to ask for the month of logs, not the keynote.

Where the industry actually is

This next part is an estimate, and I want it labeled as one — it is my read from working in this space and talking to people who do, not a survey with error bars. Treat it as a calibrated guess.

The median professional developer sits around L1. AI is a smarter autocomplete and a faster Stack Overflow: it finishes the line, drafts the function, answers the question. That is genuinely useful and it is also not orchestration. Crucially, this is true even at shops that have rolled out Copilot or Cursor to everyone. A seat is not a workflow. Adoption gets you to L1; it does not carry you past it.

L2 — letting a single agent own a task end to end — is a fast-growing minority and, I'd guess, where the most movement is happening right now. L3 and above thin out quickly. L4 and L5, the away-from-keyboard fleet and the people building the substrate underneath it, are a thin slice of practitioners.

The reason for the cliff is worth stating plainly, because it's the whole thesis: the gap between L1 and L4-plus is mostly trust engineering, not tool access. Everyone can install the tools. The tools are commoditizing fast and will keep doing so. What almost no one builds is the verification around fallible automation — the scaffolding that lets you believe an agent's output without re-reading every line yourself. That scaffolding is the actual product of the upper rungs, and it does not come in a download.

What L4–L5 looks like in practice

Rather than describe it abstractly, here is what one operator's fleet actually produces, with real numbers — and an honest account of what they do and don't mean. I keep these numbers because the point of this essay is that claims should be auditable, and I'd rather show receipts than adjectives. But a receipt is only honest if you are clear about what it measures.

Over one roughly four-week window (May 5 to June 1, 2026), the fleet logged 21,875 tool calls across 24 active days and 46 sessions, peaking at 4,004 in a single day. Be precise about what that is: those are actions the agents took — files read, edits written, shell commands run — not my own keystrokes. I directed the work; the agents did the volume. That distinction is the entire point of orchestration, and collapsing it into "look how much I shipped" would be the exact overclaim this essay argues against. Of those calls, 0.0046% failed outright — one in 21,875. That is a property of the harness executing cleanly, not a claim that the code it produced was defect-free; a low tool-call error rate means the machinery ran, not that the software was correct. Correctness is a separate question, and answering it is what the verification layer below exists for. Alongside the main sessions I dispatched 128 autonomous subagents for research and review that didn't need my main context. The hardware is a multi-machine arrangement: a Linux workstation orchestrating a remote Mac when iOS builds need a macOS toolchain, controlled remotely so the work doesn't stop when I leave the desk.

None of that is the interesting part, though. Volume is the easy thing to brag about and the least meaningful. The interesting part is the verification.

The practice I'd point to first is an independent local code-reviewer that runs a different model family from the one writing the code. The reason is specific: a model and its reviewer drawn from the same family tend to share blind spots. They make correlated mistakes — the implementer's failure modes are exactly the ones the same-family reviewer is least likely to flag, because it would have made them too. Putting a different lineage in the reviewer seat doesn't eliminate that, but it decorrelates it. The general principle, which I think matters more than the specific setup, is this: build verification that does not share the implementer's failure modes. A check that fails in the same way as the thing it's checking is not a check.

Underneath sits the substrate that makes L5 distinct from "I run a lot of agents." Hooks gate work on real test exit codes — a task is not done because a model says it's done; it's done because a named command exited 0, and the gate reads the exit code, not the model's summary of it. Cross-machine task-locking keeps concurrent sessions from editing the same files out from under each other. Telemetry records everything, which is why I can quote the numbers above at all rather than estimate them. The substrate is the work. The agents are just what runs on it.

The honest part: higher isn't "better"

The ladder is not a quality ranking, and I want to push back on my own framing before someone else does. Climbing trades something away. L6 and L7 buy you leverage and legibility, but they buy it with hands-on judgment — the felt sense of the code that comes from being close to it. A solo founder who needs to feel the system in their hands, who is making product calls that depend on intuition about the code's texture, may rationally choose to stay at L4 or L5. Staying put can be the correct decision, not a failure to advance.

So the real frontier skill is not running more agents. More agents is a throughput question, and throughput is the part the tooling will eventually hand you for free. The frontier skill is making trust transferable — getting the guarantees out of your own head and into a form another person can rely on without your standing over their shoulder.

By that measure I have a gap, and I'll name it. My trust loop is still personal. The reason I believe my fleet's output is, ultimately, that I built the checks and I understand their limits — I am the system's last line of defense. That is L5, honestly assessed. The jump to L6 is making that trust institutional: encoding the judgment well enough that someone who is not me can run this control plane and get the same guarantees, without inheriting my context. I haven't done that. It's the next problem, and it's harder than any amount of additional throughput.

Close

Two things are happening at once. Tooling is commoditizing — the models, the agents, the IDE integrations are getting cheaper, better, and more uniform, and that trend will not reverse. And the durable skill is moving somewhere the tooling doesn't reach: designing verification for systems that are powerful, fast, and reliably wrong some fraction of the time. The leverage isn't in having the agents. It's in being able to trust what they produce, and in being able to hand that trust to someone else.

Which brings me back to the only belief I'd defend without hedging: the unit of trust is an artifact, not an assertion. A model's confidence is not evidence. A summary that says "done" is not done. That is exactly why every number in this essay is a receipt — a count from telemetry I could put in front of you, not a claim you have to take on faith. If I've done my job, you shouldn't believe any of it because I said it. You should believe it because, in principle, you could check.