A 6-level maturity model for how software teams operationalize AI — moving from keystroke assistance to an autonomous software factory.
The levels describe who does the work (human vs AI) and where human attention goes (code vs specs vs outcomes).
This isn't about which AI model or brand you use. It's about your engineering workflow and control systems.
Higher levels require stronger controls. More AI autonomy without better gates is a risk, not progress.
The "dark factory" is an engineered production system — specs, evaluation, CI/CD gates, and simulation — not just better prompts.
Level 0: AI suggests the next line; the human accepts or rejects. A faster Tab key.
Level 1: The human hands the AI a discrete, well-scoped task but still owns architecture and integration.
Level 2: AI handles multi-file changes and navigates the codebase, but humans still read all the code.
Level 3: The relationship flips: you direct the AI and review at the feature/PR level. The AI submits PRs for review.
Level 4: Write a spec, leave, come back and check whether the tests pass. Code becomes a black box; you evaluate outcomes.
Level 5: A lights-out pipeline: humans own the what, machines own the how.
Key safeguard: the "Holdout Eval" scenarios live outside the codebase — the AI never sees the test criteria, so it cannot game them.
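One way to realize this safeguard in CI, sketched below. The scenario directory, JSON format, and `run_scenario` hook are assumptions for illustration; the only load-bearing idea is that the scenarios live on a path the AI's repository never contains.

```python
import json
from pathlib import Path

# Hypothetical location OUTSIDE the AI-visible repo, mounted only into CI.
HOLDOUT_DIR = Path("/ci/holdout-scenarios")

def run_scenario(system_under_test, scenario: dict) -> bool:
    """Drive the built artifact with the scenario's input and compare
    against the expected outcome. `system_under_test` is whatever
    callable wraps your deployed build."""
    actual = system_under_test(scenario["input"])
    return actual == scenario["expected"]

def holdout_gate(system_under_test, holdout_dir: Path = HOLDOUT_DIR) -> bool:
    """Fail the pipeline unless every holdout scenario passes."""
    scenarios = [json.loads(p.read_text())
                 for p in sorted(holdout_dir.glob("*.json"))]
    results = [run_scenario(system_under_test, s) for s in scenarios]
    # An empty suite must not pass: missing scenarios is itself a failure.
    return bool(results) and all(results)
```

Because the gate only reads scenarios at evaluation time from a mount the agent never sees, the AI can optimize for the spec but not for the grader.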
Digital Twins simulate real services (Okta, Jira, Slack) so AI agents can run full integration tests without touching production.
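A minimal sketch of what such a twin looks like in practice: an in-process fake of a Jira-like issue API that agent integration tests can create and query tickets against. The endpoint shapes and field names here are invented for illustration and do not match the real Jira API; a production twin would mirror the real service's contract.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class JiraTwin(BaseHTTPRequestHandler):
    """Tiny 'digital twin' of an issue tracker: POST /issues creates a
    ticket, GET /issues/<id> reads it back. State is in-memory only."""
    issues = {}      # shared across requests via the class object
    next_id = [1]

    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        issue_id = f"TWIN-{self.next_id[0]}"
        self.next_id[0] += 1
        self.issues[issue_id] = body
        self._reply(201, {"id": issue_id})

    def do_GET(self):
        issue_id = self.path.rsplit("/", 1)[-1]
        if issue_id in self.issues:
            self._reply(200, self.issues[issue_id])
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, status, payload):
        data = json.dumps(payload).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep test output quiet
        pass

def start_twin(port=0):
    """Start the twin on a background thread; returns (server, bound port)."""
    server = HTTPServer(("127.0.0.1", port), JiraTwin)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]
```

An agent pointed at `http://127.0.0.1:<port>` can then exercise a full create-then-read workflow with zero risk to a production tracker.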
The Levels 0–5 we just covered measure one thing: how much autonomy you give the AI. But autonomy alone isn't maturity; two more axes need scoring on the same 0–5 scale.
We call the three axes A (Autonomy), C (Controls), and G (Governance), each scored 0–5.
The danger zone: a team at A3 but only C1 — high AI autonomy gated by weak controls.
Dark Factory = A5 + C5 + G5 — Level 5 on all three axes.
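The axis model can be expressed as a small check. The danger-zone rule here ("autonomy outruns controls by two or more levels") is one reading of the A3/C1 example above, not a canonical threshold:

```python
from dataclasses import dataclass

@dataclass
class MaturityScore:
    autonomy: int    # A, 0-5
    controls: int    # C, 0-5
    governance: int  # G, 0-5

    def is_dark_factory(self) -> bool:
        # Dark Factory = A5 + C5 + G5: Level 5 on all three axes.
        return self.autonomy == self.controls == self.governance == 5

    def in_danger_zone(self) -> bool:
        # e.g. A3 with only C1: autonomy running well ahead of controls.
        return self.autonomy - self.controls >= 2
```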
Don't race to Level 5 everywhere. Match your A (autonomy) and C (controls) targets to system criticality.
| Tier | Systems | Autonomy (A) | Controls (C) |
|---|---|---|---|
| Tier 1 — High Risk | Regulated, money-moving, identity & access | A2–A3 | C3–C4 (very strong) |
| Tier 2 — Medium Risk | Internal platforms, data pipelines, ops tooling | A3–A4 | C4 |
| Tier 3 — Low Risk | Front-ends, prototypes, internal productivity apps | A4–A5 | Proving ground for scenarios & twins |
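The tier table can double as an enforceable policy gate. A sketch: the caps mirror the upper bound of each tier's autonomy range in the table, while the tier names and the idea of classifying systems this way are illustrative assumptions.

```python
# Maximum autonomy level allowed per risk tier, mirroring the table above.
TIER_AUTONOMY_CAP = {
    "tier1_high_risk": 3,    # regulated, money-moving, identity & access: A2-A3
    "tier2_medium_risk": 4,  # internal platforms, pipelines, ops tooling: A3-A4
    "tier3_low_risk": 5,     # front-ends, prototypes, productivity apps: A4-A5
}

def autonomy_allowed(tier: str, requested_level: int) -> bool:
    """Gate an AI agent's requested autonomy against the system's risk tier."""
    return requested_level <= TIER_AUTONOMY_CAP[tier]
```

A pipeline could call this before dispatching work, refusing to run an A4 agent against a Tier 1 system.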
A practical migration path from standardized AI-assisted development to autonomous delivery.
"Maturity is not how much code AI writes.
It's how confidently you can ship without reading code — which depends on external scenario evaluations and safe simulation environments."