## Why an AI agent governance maturity model matters
Governance is not binary. You do not go from "no governance" to "fully governed" in a single sprint. Organizations need a roadmap that tells them where they are, what to build next, and what each step gets them.
Recent data supports that. McKinsey's March 2026 survey of approximately 500 organizations reports average maturity at 2.3 (up from 2.0 in 2024), but only about 30% reached level 3 or above in strategy, governance, and agentic controls. Nearly two-thirds cited security and risk concerns as the top blocker. In parallel, NIST-based maturity research highlights that implementation is often selective and inconsistent, even when teams publicly claim "responsible AI" readiness.
A maturity model gives you a shared vocabulary. Instead of arguing about whether your governance is "good enough," you can say "we are at Level 2 and need to reach Level 3 before production."
## What top-ranking maturity-model articles miss
Most maturity model content is strong on governance taxonomy and weak on runtime enforcement. That gap matters for autonomous agents because incidents happen at dispatch time, not in policy slide decks.
| Source | What it covers well | What it misses for agent ops |
|---|---|---|
| McKinsey AI Trust Maturity Survey (Mar 2026) | Recent enterprise signal: maturity scores, incident confidence trends, and key scaling barriers across approximately 500 organizations. | No implementation contract for pre-dispatch policy decisions or approval-binding artifacts per autonomous action. |
| NIST AI RMF 1.0 + Playbook | Canonical risk management baseline for trustworthy AI with implementation support via playbook, roadmap, and gen-AI profile. | Not an agent runtime maturity contract. It does not define dispatch-time decisions like REQUIRE_APPROVAL or ALLOW_WITH_CONSTRAINTS. |
| NIST-based maturity research (arXiv 2401.15229) | Strong argument for operationalizable practices and a maturity model anchored in sociotechnical harm mitigation. | Stops short of a concrete control-plane schema for action-level decision records and approval lineage at runtime. |
This is where agent-specific maturity needs stricter criteria. Level 3 should require pre-dispatch decisions, approval binding, and immutable run evidence. Without those three artifacts, you do not have production-grade governance. You have documentation. A decision record that carries all three artifacts looks like this:
```json
{
  "run_id": "run_01JTRH4JQ4V6F4Y3S8N4G6C6D1",
  "job_id": "job_9b1f7e",
  "decision": "require_approval",
  "policy_version": "v1.12.0",
  "rule_id": "require-approval-writes",
  "approval": {
    "status": "approved",
    "approved_by": "[email protected]",
    "approved_at": "2026-04-01T08:33:14Z"
  },
  "constraints": {
    "max_runtime_seconds": 120,
    "allowed_topics": ["job.db.write"]
  }
}
```

## The five levels of AI agent governance maturity
### Level 0: Ungoverned

**What you have:** Agents run with developer credentials. No policies. No logs beyond what the LLM provider captures. No visibility into what agents are doing.

**What is missing:** Everything. You cannot answer any of the four questions: what did it do, what policy allowed it, who approved it, where is the proof.

**Risk:** You will not know about incidents until the damage is visible. Most organizations start here.
### Level 1: Observability

**What you have:** Logging of agent actions after the fact. Alerts on anomalies. A dashboard showing what happened yesterday. Provider-level spend alerts.

**What is missing:** Pre-execution policy. Agents still act first, and you review later. Approvals are ad-hoc Slack messages, not structured workflows.

**Risk:** You catch incidents faster but cannot prevent them. Good enough for internal tools, not for production.
### Level 2: Documented policies

**What you have:** Written policies for what agents can and cannot do. Risk categorization of agent actions. Manual review processes for high-risk operations.

**What is missing:** Automated enforcement. Policies exist in documents but are not evaluated at runtime. Compliance depends on people remembering to follow the rules.

**Risk:** Satisfies initial audits but breaks at scale. You can show auditors a policy document but cannot prove it was followed for any specific action.
### Level 3: Runtime enforcement

**What you have:** Policy-as-code evaluated before every agent action. Approval gates for high-risk operations. A structured audit trail of every decision. Deterministic enforcement.

**What is missing:** Fleet-wide visibility. You govern individual agents well but cannot manage 50 agents across 10 teams from a single pane.

This is the production bar. You can answer all four questions for any action. Auditors get proof, not promises.
### Level 4: Fleet governance

**What you have:** Centralized governance across all agents, teams, and workflows. Budget enforcement. Cross-agent policy consistency. Organization-wide audit trails. Anomaly detection across the fleet.

**What is missing:** Nothing structural. At this level, governance improvements are optimizations: better policies, faster approvals, more granular budget controls.

**Scale:** This is what 100+ agent deployments require. Individual agent governance does not scale; fleet governance does.
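The Level 3 mechanics, check before dispatch and pause for approval, can be sketched in a few lines of Python. This is an illustrative shape only, not any product's API; `evaluate`, `request_approval`, and `execute` are hypothetical callbacks you would wire to your own stack.

```python
def dispatch(action, evaluate, request_approval, execute):
    """Gate an agent action: evaluate policy first, pause for approval if needed."""
    decision = evaluate(action)
    if decision == "deny":
        raise PermissionError(f"blocked by policy: {action['topic']}")
    if decision == "require_approval" and not request_approval(action):
        raise PermissionError(f"approval denied: {action['topic']}")
    return execute(action)  # only reached when policy and approvals allow it

# Example wiring with stub callbacks:
result = dispatch(
    {"topic": "job.db.write"},
    evaluate=lambda a: "require_approval" if "write" in a["topic"] else "allow",
    request_approval=lambda a: True,   # stand-in for a real human-in-the-loop gate
    execute=lambda a: "done",
)
```

The key property is ordering: policy evaluation and approval happen strictly before execution, so a denied action never runs at all.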
## Assessment checklist: 10 questions to find your level
Answer yes or no to each question. Your level is determined by the longest unbroken run of yes answers, starting from question 1.
1. Can you list every action your agents took yesterday?
2. Do you have alerts for unusual agent behavior (cost spikes, error rates, access patterns)?
3. Do you have written policies defining what agents can and cannot do?
4. Are agent actions categorized by risk level (read/write/destructive)?
5. Are agent actions evaluated against policy before execution, not after?
6. Do high-risk actions pause for human approval before proceeding?
7. Can you produce an audit trail linking any action to the policy that allowed it and the human who approved it?
8. Can you enforce budget limits across your entire agent fleet from a single control plane?
9. Do you have organization-wide visibility into all agent actions across every team?
10. Can you update a policy once and have it apply to every agent in your fleet instantly?
If you answered no to questions 1-2, you are at Level 0. Yes to 1-2 but no to 3-4 puts you at Level 1. Yes through 4 but no to any of 5-7 means Level 2. Yes through 7 but no to any of 8-10 is Level 3. All ten yes is Level 4.
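The scoring rule can be expressed as a short function. A sketch, with the question groupings mirroring the level descriptions in this article:

```python
def maturity_level(answers):
    """answers: ten booleans for questions 1-10, in order.

    Question groups map to levels: 1-2 (Level 1), 3-4 (Level 2),
    5-7 (Level 3), 8-10 (Level 4). Your level is the highest group
    reached without a single "no" along the way.
    """
    groups = [(0, 2), (2, 4), (4, 7), (7, 10)]  # zero-based index ranges
    level = 0
    for next_level, (start, end) in enumerate(groups, start=1):
        if all(answers[start:end]):
            level = next_level
        else:
            break
    return level

print(maturity_level([True] * 7 + [False] * 3))  # 7 consecutive yeses -> 3
```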
## ROI of each governance level
Each level delivers specific, measurable value. Governance is not a cost center. It is an enabler.
Level 1 saves you from blind incidents. You discover problems in hours instead of weeks. A monitoring system that catches a runaway agent burning $2,000/day on day one instead of day fourteen saves $26,000 (13 days of avoided burn at $2,000/day).
Level 2 satisfies auditors. You can show written policies and risk categorizations. This is enough for early control discussions. It is not enough when reviewers ask for evidence that policies were actually enforced at execution time.
Level 3 gets you to production. This is the point where trust moves from process to mechanism. Actions are checked before dispatch, risky actions pause for approval, and evidence is queryable after incidents. McKinsey also reports stronger realized business value in organizations that invest deeper in trust maturity, especially at higher spend levels.
Level 4 scales to 100+ agents. When you run dozens of agents across multiple teams, individual governance per agent becomes a bottleneck. Fleet governance gives you one control plane for all agents: unified policies, centralized audit, budget enforcement across the organization.
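One piece of that control plane, organization-wide budget enforcement, can be sketched as a simple gate. This is a toy in-memory version for illustration; a real implementation would be shared, persistent, and scoped per team.

```python
class FleetBudget:
    """Toy org-wide spend gate, checked before each agent action is dispatched."""

    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self.spent_today_usd = 0.0

    def authorize(self, estimated_cost_usd):
        """Reserve budget for an action; deny if it would exceed the fleet limit."""
        if self.spent_today_usd + estimated_cost_usd > self.daily_limit_usd:
            return False  # one limit applies to every agent on every team
        self.spent_today_usd += estimated_cost_usd
        return True

budget = FleetBudget(daily_limit_usd=100.0)
print(budget.authorize(60.0))  # True
print(budget.authorize(50.0))  # False: 60 + 50 would exceed the 100 limit
```

The point of the sketch is the single shared limit: no individual agent or team can exhaust the organization's budget, because every dispatch passes through the same gate.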
## How to move up: concrete steps between levels
Level 0 to Level 1: Add logging and alerting. Instrument your agent framework to record every action with timestamp, input, output, and duration. Set up cost alerts with your LLM provider. This takes a day, not a sprint.
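A minimal version of that instrumentation, as a decorator you might drop into an existing Python agent framework (the function names and log shape here are illustrative, not any framework's API):

```python
import functools
import json
import time

def log_action(fn):
    """Record timestamp, input, output, and duration for every agent action."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        started_at = time.time()
        result = fn(*args, **kwargs)
        print(json.dumps({              # swap print for your real log sink
            "action": fn.__name__,
            "timestamp": started_at,
            "input": repr((args, kwargs)),
            "output": repr(result),
            "duration_seconds": round(time.time() - started_at, 3),
        }))
        return result
    return wrapper

@log_action
def fetch_report(report_id):            # hypothetical agent action
    return {"report_id": report_id, "status": "ok"}

fetch_report("r-42")
```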
Level 1 to Level 2: Write your agent policies. Define what actions are read-only (safe), which are write (need review), and which are destructive (blocked). Categorize every agent topic by risk level. Put this in a document your security team signs off on.
Level 2 to Level 3: This is the big jump. Turn your documents into code. Instead of a policy that says "agents should not delete production data," you need a rule that blocks it at runtime. Here is what that transition looks like:
```text
# Level 2: Written policies (not enforced)
# This is a document, not code. It cannot prevent anything.
#
# Agent Policy v1 (Google Doc)
# - Agents should not access production databases
# - Agents should not send emails without review
# - High-risk actions require manager approval
# (Nobody checks. Nobody enforces. Nobody audits.)
```
```yaml
# Level 3: Pre-dispatch governance (enforced)
# This is code. It prevents violations before they happen.
version: v1
rules:
  - id: allow-read-ops
    match:
      topics: ["job.*.read", "job.*.list"]
      risk_tags: []
    decision: allow
    reason: "Read operations safe by default"
  - id: require-approval-writes
    match:
      topics: ["job.*.write", "job.*.update"]
      risk_tags: ["data-mutation"]
    decision: require_approval
    reason: "Write operations need human review"
  - id: deny-production-delete
    match:
      topics: ["job.*.delete", "job.*.drop"]
      risk_tags: ["destructive"]
    decision: deny
    reason: "Destructive production ops blocked"
```

The Level 3 policy is evaluated by a Safety Kernel before every action. Read operations pass through, write operations pause for human review, and destructive operations are blocked. The rules are version-controlled, testable, and auditable. Read more about this transition in our policy-as-code guide.
Level 3 to Level 4: Add fleet-wide visibility and cross-team governance. Centralize policy management so a single update applies across all agents. Add budget enforcement and anomaly detection at the organization level. This is infrastructure work, not policy work.
## Where Cordum fits
Cordum is Level 3-4 infrastructure. The Safety Kernel provides pre-dispatch policy evaluation (Level 3). Approval workflows provide structured human-in-the-loop gates (Level 3). Fleet governance features, budget enforcement, and organization-wide audit trails provide Level 4 capabilities.
We built Cordum because we kept seeing teams stuck at Level 2. They had policies, they had good intentions, but they had no enforcement layer. The jump from "we wrote down the rules" to "the system enforces the rules" requires infrastructure that most teams do not have time to build. Read more about our governance architecture and the five-minute quickstart.
This maturity model is useful regardless of whether you use Cordum. The assessment, the levels, and the steps between them apply to any governance implementation. What matters is knowing where you stand and having a plan to get to Level 3 before your first production incident forces the conversation.