AI Governance in Production

Policy-first control planes for autonomous AI agents, with approvals and evidence that hold up under incident pressure.

Governance | 15 min read | Updated Apr 2026
TL;DR
  • Governance that happens after dispatch is reporting, not control.
  • Production systems need deterministic outcomes: ALLOW, DENY, REQUIRE_APPROVAL, THROTTLE, or ALLOW_WITH_CONSTRAINTS.
  • Approvals are useful only when tied to a policy snapshot and the job's intent.
  • Audit trails should answer what happened in minutes, not after a multi-day incident thread.
Risk Reality

The failure mode is usually not “bad intent.” It is valid automation executed without correct gates.

Determinism

Governance controls must return explicit, enforceable outcomes instead of advisory recommendations.

Auditability

Good governance answers "who approved what and why" quickly, without manual log reconstruction.

Scope

This guide focuses on runtime enforcement patterns for production AI governance, not policy prose. If a control cannot alter dispatch behavior, it does not reduce blast radius.

The production problem

Most teams can explain governance principles. Fewer can show where those principles execute in real request paths. That gap is where incidents happen.

In production, autonomous AI agents can touch tickets, code, cloud infrastructure, and customer data. A governance layer has to decide before dispatch whether the action is allowed, blocked, gated, or constrained.
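
One way to make the pre-dispatch outcome enforceable is to model it as a closed set of values, so a caller cannot act on a result the gate never defined. The Go sketch below is illustrative only: the names `Decision`, `ParseDecision`, and `MayDispatch` are hypothetical, and mapping unknown values to deny is an assumed fail-closed choice rather than a documented behavior of any product.

```go
package main

import (
	"fmt"
	"strings"
)

// Decision is a hypothetical closed set of pre-dispatch outcomes.
type Decision int

const (
	Deny Decision = iota // zero value is Deny, so an unset Decision fails closed
	Allow
	RequireApproval
	AllowWithConstraints
	Throttle
)

// ParseDecision maps a wire string to a Decision.
// Unknown or empty values return Deny plus an error (assumed fail-closed behavior).
func ParseDecision(s string) (Decision, error) {
	switch strings.ToUpper(strings.TrimSpace(s)) {
	case "ALLOW":
		return Allow, nil
	case "DENY":
		return Deny, nil
	case "REQUIRE_APPROVAL":
		return RequireApproval, nil
	case "ALLOW_WITH_CONSTRAINTS":
		return AllowWithConstraints, nil
	case "THROTTLE":
		return Throttle, nil
	default:
		return Deny, fmt.Errorf("unknown decision %q", s)
	}
}

// MayDispatch reports whether the job may be handed to the scheduler right now.
func (d Decision) MayDispatch() bool {
	return d == Allow || d == AllowWithConstraints
}

func main() {
	for _, s := range []string{"allow", "require_approval", "maybe"} {
		d, err := ParseDecision(s)
		fmt.Println(s, d.MayDispatch(), err)
	}
}
```

Making the zero value `Deny` means a forgotten assignment cannot turn into an accidental allow.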

What top-ranking sources cover and what they miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| IBM: Implementing AI Governance | Strong organizational model: roles, lifecycle ownership, and continuous oversight framing. | Limited wire-level control semantics for pre-dispatch decisions and approval binding. |
| AWS Prescriptive Guidance | Detailed risk and compliance controls, including audit posture and cross-functional governance. | No unified protocol-level decision contract for autonomous action gating. |
| COSO GenAI Internal Control | Audit-oriented control framing and internal-control emphasis for GenAI risk management. | Operational implementation details for high-frequency autonomous agent dispatch paths. |

Runtime control model

Governance controls should map directly to execution points. The table below is the practical model we see working in production systems.

| Control | Objective | Implementation | Evidence |
| --- | --- | --- | --- |
| Pre-dispatch policy check | Block unsafe actions before execution | Evaluate each job request synchronously and return deterministic decision type | Decision record with rule id, reason, and policy snapshot |
| Approval binding | Require human review for high-risk actions | Queue REQUIRE_APPROVAL jobs and bind approval to current job hash | Resolver identity, timestamp, note, and linked policy data |
| Constraint injection | Allow useful actions while reducing blast radius | Apply max runtime, retry, path/network constraints before dispatch | Constraint set included in job metadata and execution logs |
| Output safety | Prevent sensitive output from flowing downstream | Apply post-execution checks with redact/quarantine paths | Scanner findings and resulting action state |
| Append-only audit trail | Enable fast, defensible incident reconstruction | Persist state transitions and decision events with trace id | Timeline with actor, decision, and terminal status |
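
The approval-binding row can be sketched as follows: hash the job at review time, store the hash alongside the policy snapshot id, and re-check both before dispatch. This is a minimal sketch under stated assumptions. The `Job` and `Approval` types, the `HashJob` helper, and the use of SHA-256 over the JSON encoding are hypothetical choices for illustration; the source does not specify how the job hash is computed.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Job is a hypothetical job payload; the field names are illustrative.
type Job struct {
	Topic    string   `json:"topic"`
	TenantID string   `json:"tenant_id"`
	RiskTags []string `json:"risk_tags"`
}

// Approval records what a reviewer actually approved.
type Approval struct {
	JobHash        string
	PolicySnapshot string
	Resolver       string
}

// HashJob returns a SHA-256 hex digest of the job's JSON encoding.
// encoding/json emits struct fields in declaration order, so the digest
// is stable for a given Job value.
func HashJob(j Job) (string, error) {
	b, err := json.Marshal(j)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// Valid reports whether the approval still covers this exact job
// under this exact policy snapshot.
func (a Approval) Valid(j Job, snapshot string) bool {
	h, err := HashJob(j)
	return err == nil && h == a.JobHash && snapshot == a.PolicySnapshot
}

func main() {
	job := Job{Topic: "job.mcp-bridge.write.update_ticket", TenantID: "default", RiskTags: []string{"prod"}}
	h, _ := HashJob(job)
	ap := Approval{JobHash: h, PolicySnapshot: "snap-1", Resolver: "reviewer@example.com"}

	fmt.Println("unchanged job:", ap.Valid(job, "snap-1"))

	// Mutating the job after review must invalidate the approval.
	job.Topic = "job.mcp-bridge.write.delete_ticket"
	fmt.Println("modified job:", ap.Valid(job, "snap-1"))
}
```

Binding to both values means an approval cannot be replayed against a modified job or carried across a policy change.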

Implementation details

Policy should be readable by humans and executable by infrastructure. Dispatch handlers should reflect decision outcomes exactly, without interpretation layers.

governance-policy.yaml
YAML
version: v1
rules:
  - id: allow-low-risk-read
    match:
      topics: ["job.mcp-bridge.read.*"]
      risk_tags: []
    decision: allow

  - id: require-approval-prod-write
    match:
      topics: ["job.mcp-bridge.write.*"]
      risk_tags: ["prod"]
    decision: require_approval
    reason: "Production writes require human review"

  - id: constrain-medium-risk
    match:
      topics: ["job.agent.exec.*"]
      risk_tags: ["medium"]
    decision: allow_with_constraints
    constraints:
      max_runtime_sec: 45
      max_retries: 1

  - id: deny-destructive
    match:
      risk_tags: ["destructive"]
    decision: deny
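
The policy file above does not state how rules are matched or combined, so the Go sketch below fills that in with clearly labeled assumptions: topic patterns are matched with `path.Match` globbing, a rule's `risk_tags` must all be present on the job, an empty topic list matches any topic, the most restrictive matching decision wins, and a job that matches nothing is denied. The `Rule` type and `Evaluate` function are hypothetical names, not part of any referenced engine.

```go
package main

import (
	"fmt"
	"path"
)

// Rule mirrors one entry of the policy file (constraints omitted for brevity).
type Rule struct {
	ID       string
	Topics   []string // glob patterns; empty means any topic (assumption)
	RiskTags []string // every listed tag must be on the job (assumption)
	Decision string
}

// severity ranks decisions so the most restrictive matching rule wins (assumption).
var severity = map[string]int{
	"allow": 1, "allow_with_constraints": 2, "require_approval": 3, "deny": 4,
}

func topicMatches(patterns []string, topic string) bool {
	if len(patterns) == 0 {
		return true
	}
	for _, p := range patterns {
		if ok, err := path.Match(p, topic); err == nil && ok {
			return true
		}
	}
	return false
}

func hasAll(required, have []string) bool {
	set := make(map[string]bool, len(have))
	for _, t := range have {
		set[t] = true
	}
	for _, t := range required {
		if !set[t] {
			return false
		}
	}
	return true
}

// Evaluate returns the most restrictive decision among matching rules,
// or "deny" with an empty rule id when nothing matches (default deny).
func Evaluate(rules []Rule, topic string, tags []string) (decision, ruleID string) {
	decision = "deny"
	best := 0
	for _, r := range rules {
		if topicMatches(r.Topics, topic) && hasAll(r.RiskTags, tags) && severity[r.Decision] > best {
			best, decision, ruleID = severity[r.Decision], r.Decision, r.ID
		}
	}
	return decision, ruleID
}

func main() {
	rules := []Rule{
		{ID: "allow-low-risk-read", Topics: []string{"job.mcp-bridge.read.*"}, Decision: "allow"},
		{ID: "require-approval-prod-write", Topics: []string{"job.mcp-bridge.write.*"}, RiskTags: []string{"prod"}, Decision: "require_approval"},
		{ID: "deny-destructive", RiskTags: []string{"destructive"}, Decision: "deny"},
	}
	fmt.Println(Evaluate(rules, "job.mcp-bridge.write.update_ticket", []string{"prod"}))
	fmt.Println(Evaluate(rules, "job.mcp-bridge.read.list", []string{"destructive"}))
}
```

Most-restrictive-wins keeps the outcome independent of rule order, which matters here because the deny rule is listed last. A first-match engine would need the deny rule moved to the top to get the same result.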
decision-gate.go
Go
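// EvaluateAndDispatch runs the pre-dispatch policy check and acts on the result.
// Fragment: JobRequest, JobState, safetyClient, scheduler, setState, enqueueApproval,
// and applyConstraints are defined elsewhere, and "fmt" must be imported.
func EvaluateAndDispatch(req *JobRequest) (*JobState, error) {
  decision, err := safetyClient.Check(req)
  if err != nil {
    // Policy check failed or timed out: return the error so nothing is dispatched.
    return nil, err
  }

  switch decision.Type {
  case "ALLOW":
    return scheduler.Dispatch(req)
  case "DENY":
    return setState(req.ID, "DENIED", decision.Reason), nil
  case "REQUIRE_APPROVAL":
    // Park the job until a reviewer resolves it.
    return enqueueApproval(req, decision), nil
  case "ALLOW_WITH_CONSTRAINTS":
    constrained := applyConstraints(req, decision.Constraints)
    return scheduler.Dispatch(constrained)
  case "THROTTLE":
    return setState(req.ID, "THROTTLED", decision.Reason), nil
  default:
    // An unrecognized decision type must never fall through to dispatch.
    return nil, fmt.Errorf("unrecognized decision type %q for job %v", decision.Type, req.ID)
  }
}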
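
The gate calls `applyConstraints` without showing it. A minimal sketch of what such a helper might do follows, using the two constraint keys from the policy file. The types and clamping rules here are assumptions for illustration (including value semantics rather than the pointer used in the gate), not the original implementation.

```go
package main

import "fmt"

// Constraints mirrors the policy's constraint keys.
type Constraints struct {
	MaxRuntimeSec int
	MaxRetries    int
}

// JobRequest is a hypothetical request shape.
type JobRequest struct {
	ID         string
	RuntimeSec int          // 0 means the caller asked for no explicit limit
	Retries    int
	Applied    *Constraints // recorded so execution logs can show what was enforced
}

// applyConstraints returns a copy of req with limits clamped to the policy's bounds.
// It never raises a limit the caller already set lower.
func applyConstraints(req JobRequest, c Constraints) JobRequest {
	out := req
	if out.RuntimeSec == 0 || out.RuntimeSec > c.MaxRuntimeSec {
		out.RuntimeSec = c.MaxRuntimeSec
	}
	if out.Retries > c.MaxRetries {
		out.Retries = c.MaxRetries
	}
	out.Applied = &c
	return out
}

func main() {
	req := JobRequest{ID: "job-1", RuntimeSec: 300, Retries: 5}
	out := applyConstraints(req, Constraints{MaxRuntimeSec: 45, MaxRetries: 1})
	fmt.Printf("runtime=%ds retries=%d\n", out.RuntimeSec, out.Retries)
}
```

Returning a copy leaves the original request intact, so the audit record can show both what was asked for and what was enforced.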
approval-ops.sh
Bash
# Submit a high-risk job
curl -sS -X POST http://localhost:8081/api/v1/jobs \
  -H "Content-Type: application/json" \
  -d '{
    "topic":"job.mcp-bridge.write.update_ticket",
    "tenant_id":"default",
    "risk_tags":["prod"],
    "labels":{"mcp.action":"write"}
  }'

# Verify approval queue
curl -sS "http://localhost:8081/api/v1/approvals?include_resolved=false"

# Approve after review
curl -sS -X POST "http://localhost:8081/api/v1/approvals/<job_id>/approve" \
  -H "Content-Type: application/json" \
  -d '{"note":"approved after policy review"}'

Concrete runtime numbers matter. For example, a short safety-client deadline (2s in current safety-kernel references) keeps the policy check synchronous while ensuring that a slow or unavailable policy service cannot stall the scheduler.

Governance checklist

| Checklist item | Why it matters | Pass condition |
| --- | --- | --- |
| Every dispatch path calls policy | Closes bypass route from scripts or ad-hoc workers | Pass when no job reaches worker subject without decision record |
| Approval queue is bounded | Prevents silent backlog growth during incidents | Pass when queue age and timeout behavior are monitored |
| Fail mode is explicit | Safety-unavailable behavior must be intentional | Pass when environment has documented fail-open/fail-closed policy |
| Audit data is queryable | Incident reviews should not depend on manual log archaeology | Pass when trace id returns full decision and execution timeline |
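
The last pass condition, a trace id returning the full timeline, can be illustrated with a small in-memory append-only log. This Go sketch is an assumption-laden toy: `Event`, `AuditLog`, `Timeline`, and `HasDecision` are hypothetical names, and a production system would back this with durable, tamper-evident storage rather than a slice.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Event is one immutable entry in the audit trail.
type Event struct {
	TraceID string
	Actor   string
	Kind    string // e.g. "DECISION", "APPROVED", "DISPATCHED"
	Detail  string
	At      time.Time
}

// AuditLog is an in-memory, append-only log. It exposes no update or delete.
type AuditLog struct {
	mu     sync.Mutex
	events []Event
}

// Append stores the event, stamping it with the current UTC time if unset.
func (l *AuditLog) Append(e Event) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if e.At.IsZero() {
		e.At = time.Now().UTC()
	}
	l.events = append(l.events, e)
}

// Timeline returns a copy of all events for a trace, in append order.
func (l *AuditLog) Timeline(traceID string) []Event {
	l.mu.Lock()
	defer l.mu.Unlock()
	var out []Event
	for _, e := range l.events {
		if e.TraceID == traceID {
			out = append(out, e)
		}
	}
	return out
}

// HasDecision reports whether a trace carries a decision record.
func (l *AuditLog) HasDecision(traceID string) bool {
	for _, e := range l.Timeline(traceID) {
		if e.Kind == "DECISION" {
			return true
		}
	}
	return false
}

func main() {
	al := &AuditLog{}
	al.Append(Event{TraceID: "trace-1", Actor: "policy-engine", Kind: "DECISION", Detail: "require_approval"})
	al.Append(Event{TraceID: "trace-1", Actor: "reviewer@example.com", Kind: "APPROVED", Detail: "approved after policy review"})
	al.Append(Event{TraceID: "trace-1", Actor: "scheduler", Kind: "DISPATCHED"})
	for _, e := range al.Timeline("trace-1") {
		fmt.Println(e.At.Format(time.RFC3339), e.Actor, e.Kind, e.Detail)
	}
}
```

A check like `HasDecision` can also back the first checklist item: a worker that sees a trace without a decision record can refuse the job.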

Limitations and tradeoffs

Additional platform complexity

Policy engines, approval queues, and audit services add operational surface area.

Approval latency

High-risk write paths take longer when human review is mandatory. This is the cost of control.

Policy tuning work

Risk tags and constraint sets need continuous calibration as workflows evolve.

Frequently Asked Questions

What is the difference between governance and observability for AI agents?
Observability tells you what happened. Governance decides what is allowed to happen before execution. You need both.
Do approvals need to exist for every action?
No. Approvals are usually reserved for high-impact writes or destructive actions. Over-approving everything creates operational drag.
Can we do governance with prompt instructions only?
Prompt instructions can help behavior but they are not enforcement. Runtime policy checks and dispatch controls provide enforceable guarantees.
What is the minimum viable governance baseline?
At minimum: pre-dispatch policy, explicit risk tiers, approval gates for high-risk actions, and append-only audit logs tied to execution traces.
Next step

Choose one high-risk autonomous workflow this week and run a controlled drill. Verify policy decision, approval binding, constrained execution, and audit evidence in one trace.

Sources