Production Guide

Deploy AI Agents in Production

Your agent works in the demo. Now it needs to survive 10,000 runs without breaking production, leaking data, or executing actions nobody approved. This guide covers the exact infrastructure, rollout phases, and governance controls that production teams ship before going live.

Guide · 14 min read · Updated Apr 2026
TL;DR
  • Most production failures happen after a successful demo — teams skip decision control and rollback drills.
  • Deployment is a staged migration: 5% → 25% → 50% → 100%, with objective gates at each step.
  • The highest-value control is policy-before-dispatch: evaluate risk before the worker executes any side effect.
  • Output safety is required but runs after execution. Pre-dispatch policy runs before. You need both.
The Reality

Why demos succeed and production fails

You are no longer grading prompt quality. You are operating a distributed system that writes to infrastructure, tickets, code repositories, and customer-facing channels. Three failure modes show up repeatedly.

No decision boundary

Your agent auto-resolved 200 tickets perfectly in staging. In production, it encounters a ticket containing an SQL injection payload and executes that payload against your database — because no policy check evaluated the input before dispatch.

No staged rollout

Traffic jumps from zero to 100%. At 2x expected load, your scheduler queue backs up, retry storms trigger duplicate executions, and three customers receive conflicting automated responses. You needed canary gates.

No incident narrative

Something went wrong at 3 AM. Your logs show the action happened, but not who approved it, which policy was active, or what the full input context was. The postmortem starts with "we think" — and your auditor is unimpressed.

Production sanity check

If you cannot answer who approved a risky action, which policy snapshot allowed it, and what exact output was returned — you are not in production mode yet.

Architecture

What do AI agents need for production deployment?

The stack is smaller than most architecture diagrams suggest. Six layers, each explicit. If any layer is missing, you'll hit the failure mode listed — usually during your first real incident.

Execution Bus

NATS or Kafka with explicit subject/topic routing

If missing: Backpressure is invisible, retries become guesswork, and jobs silently pile up.

State & Pointers

Redis or Postgres for job state, context pointers, result pointers

If missing: No reproducible run history. Incidents become archaeology.

Scheduler

Deterministic retries, timeouts, dead-letter handling

If missing: Orphaned runs and duplicate execution under retry pressure.

Policy Decision Point

Pre-dispatch policy checks and approval queue for risky actions

If missing: Agent reaches side effects before anyone can intervene.

Output Safety

Allow, redact, or quarantine result pipeline

If missing: PII and secrets leak in outputs even when input controls look fine.

Audit Trail

Immutable action + decision + approver timeline

If missing: No defensible post-incident narrative, weak compliance evidence.
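
The audit-trail layer above is easiest to reason about as a hash chain. Below is a minimal sketch (field names like `entry_hash` and `prev_hash` are illustrative assumptions, not a fixed schema): each entry hashes its predecessor, so any retroactive edit breaks the chain and is detectable.

```python
import hashlib
import json
import time

def append_audit_entry(log, action, decision, approver=None):
    """Append a tamper-evident entry. Each record includes the hash of
    its predecessor, so rewriting history invalidates every later entry.
    (Illustrative sketch; field names are assumptions.)"""
    prev_hash = log[-1]["entry_hash"] if log else "genesis"
    body = {
        "ts": time.time(),
        "action": action,       # e.g. "job.deploy.apply"
        "decision": decision,   # ALLOW / DENY / REQUIRE_APPROVAL / ...
        "approver": approver,   # who signed off, if anyone
        "prev_hash": prev_hash,
    }
    body["entry_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()
    ).hexdigest()
    log.append(body)
    return body

def verify_chain(log):
    """Recompute every hash; returns False if any entry was altered."""
    prev = "genesis"
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["prev_hash"] != prev:
            return False
        recomputed = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if recomputed != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True

log = []
append_audit_entry(log, "job.exec.shell", "DENY")
append_audit_entry(log, "job.deploy.apply", "REQUIRE_APPROVAL", "[email protected]")
print(verify_chain(log))      # True: chain intact
log[0]["decision"] = "ALLOW"  # tamper with history
print(verify_chain(log))      # False: tampering detected
```

In practice you would back this with write-once storage rather than a Python list, but the invariant is the same: an approval or decision, once written, cannot be quietly rewritten.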

Infrastructure

Options for hosting AI agents in production

The right hosting choice depends on your operational capacity, security posture, and scale requirements.

Managed Runtime (PaaS)

Teams optimizing for speed and low ops load
Platform limits and less low-level control

Kubernetes

High-volume, strict networking, custom runtime controls
Higher operational overhead and on-call complexity

VM-Based Deployment

Legacy integration and simple small-scale workloads
Manual scaling and weaker isolation patterns
Deployment Plan

Step-by-step phased deployment

Do not jump from staging to 100%. Use explicit traffic gates and require each gate to pass objective metrics before promoting. If a gate fails, roll back immediately.

Phase 0: Synthetic + replay
3–7 days

Replay real production requests through your policy engine without executing side effects. Every denied action should produce a correct audit entry. This catches policy misconfigurations before any real traffic flows.

Gate: 0 critical policy bypasses in replay set
Phase 1: 5% low-risk jobs
3–5 days

Route only low-risk, read-only operations. Monitor success rate, latency, and approval queue. If any high-risk action executes without approval, roll back immediately — your routing labels are wrong.

Gate: Success rate >= 99%, no unapproved high-risk action
Phase 2: 25% mixed workload
5–7 days

Introduce write operations under approval gates. Watch for approval queue depth spikes — they indicate either too many risky actions or too few reviewers. Both are problems to fix before scaling further.

Gate: P95 latency within +20% baseline, stable approval queue
Phase 3: 50–100%
7–14 days

Full traffic only after two consecutive clean review cycles and a successful rollback drill. The drill is non-negotiable — if you haven't tested rollback, you don't have rollback.

Gate: Two clean weekly reviews, rollback drill passed
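
The Phase 0 replay step can be sketched as a dry-run harness. Here `evaluate_policy` is a toy stand-in for your real policy engine (its single rule mirrors the destructive-shell example in the Governance section); the point is the shape: evaluate, record, never dispatch.

```python
def evaluate_policy(request):
    """Toy pre-dispatch check; replace with your real policy engine."""
    labels = request.get("labels", {})
    if (request["topic"] == "job.exec.shell"
            and labels.get("env") == "prod"
            and labels.get("command_class") == "destructive"):
        return "DENY"
    return "ALLOW"

def replay(requests):
    """Dry-run every recorded request: evaluate and audit, never execute.
    A critical bypass is a request we already know is dangerous that the
    policy engine nonetheless allows."""
    audit, bypasses = [], 0
    for req in requests:
        decision = evaluate_policy(req)
        audit.append({"topic": req["topic"], "decision": decision})
        if req.get("expected") == "DENY" and decision != "DENY":
            bypasses += 1
    return audit, bypasses

# Recorded production requests with expected outcomes (hypothetical data)
recorded = [
    {"topic": "job.repo.read", "labels": {}, "expected": "ALLOW"},
    {"topic": "job.exec.shell",
     "labels": {"env": "prod", "command_class": "destructive"},
     "expected": "DENY"},
]
audit, bypasses = replay(recorded)
print(bypasses)  # the Phase 0 gate passes only at 0
```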

Each gate produces an evidence object. Here is the contract used in production reviews — copy this into your CI pipeline or deployment checklist:

rollout-gate-evidence.json
JSON
{
  "gate_id": "deploy-phase-2-2026-04-01",
  "traffic_slice": "25%",
  "checks": {
    "success_rate_pct": { "value": 99.34, "target_gte": 99.00, "pass": true },
    "p95_latency_ms": { "value": 812, "target_lte": 840, "pass": true },
    "approval_queue_p95_sec": { "value": 55, "target_lte": 120, "pass": true },
    "policy_bypass_incidents": { "value": 0, "target_eq": 0, "pass": true },
    "rollback_drill_passed": { "value": true, "target_eq": true, "pass": true }
  },
  "policy_snapshot": "v1:7f93d2c",
  "reviewed_by": "[email protected]",
  "decision": "promote_to_phase_3"
}
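
The evidence object lends itself to mechanical verification. Below is a sketch of a gate checker, assuming the field shapes shown above; re-deriving each `pass` flag from `value` and target catches hand-edited evidence files.

```python
import json

def gate_passes(evidence: dict) -> bool:
    """Promote only if every check passed AND each recorded value
    actually satisfies its stated target. Trust the data, not the flag."""
    for name, check in evidence["checks"].items():
        value = check["value"]
        ok = True
        if "target_gte" in check:
            ok = value >= check["target_gte"]
        elif "target_lte" in check:
            ok = value <= check["target_lte"]
        elif "target_eq" in check:
            ok = value == check["target_eq"]
        if not ok or not check["pass"]:
            return False
    return True

evidence = json.loads("""{
  "checks": {
    "success_rate_pct": { "value": 99.34, "target_gte": 99.00, "pass": true },
    "p95_latency_ms": { "value": 812, "target_lte": 840, "pass": true },
    "policy_bypass_incidents": { "value": 0, "target_eq": 0, "pass": true }
  }
}""")
print(gate_passes(evidence))  # True
```

Wire this into CI so a failed gate blocks the promotion job rather than relying on someone reading a dashboard.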
Governance

Policy gates and approvals

A deployment-ready agent stack needs deterministic decisions before dispatch. Four decision types cover the full spectrum of production actions.

ALLOW

Read-only ops, safe actions

DENY

Destructive shell commands in prod

REQUIRE_APPROVAL

Production deploys, permission changes

ALLOW_WITH_CONSTRAINTS

External calls with host allowlist

Two implementation rules matter most:

Run policy at submit time (before the job is persisted) and again at dispatch. Two checks are safer than one.
Bind approvals to a policy snapshot and job hash so evidence survives audits.
safety-policy.yaml
YAML
# Copy-paste ready. Works with any agent framework.
version: v1
rules:
  # Block destructive shell commands in production
  - id: deny-destructive-prod-shell
    match:
      topic: "job.exec.shell"
      labels:
        env: prod
        command_class: destructive
    decision: DENY
    reason: "destructive shell action blocked in production"

  # Production deploys need human sign-off
  - id: require-approval-prod-deploy
    match:
      topic: "job.deploy.apply"
      labels:
        env: prod
    decision: REQUIRE_APPROVAL
    reason: "production deploy needs human sign-off"

  # External API calls restricted to approved endpoints
  - id: constrain-external-egress
    match:
      topic: "job.integrations.call"
      risk_tags: ["egress"]
    decision: ALLOW_WITH_CONSTRAINTS
    constraints:
      allowed_hosts: ["api.github.com", "api.slack.com"]
      timeout_ms: 15000
    reason: "external calls restricted to approved endpoints"

  # Read-only operations pass through
  - id: allow-read-only-ops
    match:
      topic: "job.repo.read"
    decision: ALLOW
    reason: "read-only operation"
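
A policy decision point for rules in this shape can be surprisingly small. The sketch below is not a reference implementation (rules are inlined as Python dicts rather than parsed from YAML, to stay dependency-free), but it shows the two properties that matter: first match wins, and the default is fail-closed.

```python
def match_rule(rule, job):
    """A rule matches when its topic, labels, and risk tags all match."""
    m = rule["match"]
    if m.get("topic") and m["topic"] != job["topic"]:
        return False
    for key, want in m.get("labels", {}).items():
        if job.get("labels", {}).get(key) != want:
            return False
    if m.get("risk_tags"):
        if not set(m["risk_tags"]) <= set(job.get("risk_tags", [])):
            return False
    return True

def decide(rules, job):
    """First matching rule wins; deny when nothing matches (fail closed)."""
    for rule in rules:
        if match_rule(rule, job):
            return rule["decision"], rule.get("constraints")
    return "DENY", None

# Rules from the YAML above, inlined for the sketch
rules = [
    {"id": "deny-destructive-prod-shell",
     "match": {"topic": "job.exec.shell",
               "labels": {"env": "prod", "command_class": "destructive"}},
     "decision": "DENY"},
    {"id": "constrain-external-egress",
     "match": {"topic": "job.integrations.call", "risk_tags": ["egress"]},
     "decision": "ALLOW_WITH_CONSTRAINTS",
     "constraints": {"allowed_hosts": ["api.github.com", "api.slack.com"],
                     "timeout_ms": 15000}},
    {"id": "allow-read-only-ops",
     "match": {"topic": "job.repo.read"},
     "decision": "ALLOW"},
]

print(decide(rules, {"topic": "job.exec.shell",
                     "labels": {"env": "prod",
                                "command_class": "destructive"}}))
# ('DENY', None)
```

The fail-closed default is deliberate: a job type nobody anticipated should queue for review, not execute.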
Compliance

Compliance requirements for deploying AI agents

Compliance is not a separate project. It is a natural output of the controls above — if you build them with audit evidence in mind. Most teams map to SOC 2 Type II, ISO 27001, or NIST AI RMF.

Least privilege

Auditors look for

Per-agent credential scoping, rotation evidence

How to satisfy it

Issue scoped API keys per agent. Rotate on a 90-day cycle. Log rotation events.

Approval evidence

Auditors look for

Who approved what, when, under which policy

How to satisfy it

Bind approvals to policy snapshot hash + job ID. Store in append-only audit log.

Immutable audit trail

Auditors look for

Tamper-evident logs with retention policy

How to satisfy it

Write-once storage. Define retention (1–7 years). Export to SIEM for correlation.

Incident timeline

Auditors look for

Reproducible sequence from trigger to resolution

How to satisfy it

Link every decision, action, and output in a queryable timeline per run ID.

Change control

Auditors look for

Policy changes reviewed before deployment

How to satisfy it

Version policy files in git. Require PR review. Tag deployed versions.

The key insight: if you have pre-dispatch policy, approval routing, and an immutable audit trail, you already satisfy the core evidence requirements. Compliance becomes documentation of controls you already run — not a bolt-on project.
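
The approval-binding rule ("policy snapshot hash + job ID") can be made concrete. Field names in this sketch are illustrative assumptions; the mechanism is what matters: hash the job payload at approval time, store it alongside the policy snapshot ID, and refuse to dispatch if either has drifted since sign-off.

```python
import hashlib
import json

def approval_record(job_id, job_payload, policy_snapshot, approver):
    """Bind an approval to the exact job content and policy version it
    covered, so it cannot be silently reused after either changes."""
    job_hash = hashlib.sha256(
        json.dumps(job_payload, sort_keys=True).encode()
    ).hexdigest()
    return {
        "job_id": job_id,
        "job_hash": job_hash,
        "policy_snapshot": policy_snapshot,  # e.g. "v1:7f93d2c"
        "approved_by": approver,
    }

def approval_valid(record, job_payload, active_snapshot):
    """Reject the approval if the job body or policy version drifted."""
    job_hash = hashlib.sha256(
        json.dumps(job_payload, sort_keys=True).encode()
    ).hexdigest()
    return (record["job_hash"] == job_hash
            and record["policy_snapshot"] == active_snapshot)

payload = {"topic": "job.deploy.apply", "labels": {"env": "prod"}}
rec = approval_record("job-123", payload, "v1:7f93d2c", "[email protected]")
print(approval_valid(rec, payload, "v1:7f93d2c"))  # True
print(approval_valid(rec, payload, "v1:9a00c41"))  # False: policy changed
```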

Observability

Monitoring and SLO baselines

Start with a small baseline set. You can always add more dashboards after the first week. You cannot retroactively add yesterday's missing data.

Reliability

Success rate target: >= 99% for low-risk automated jobs. Alert on any drop below 97%.

Latency

Track P95 end-to-end latency. Enforce a max +20% drift budget during each rollout phase.

Governance

Monitor approval queue depth and decision mix. Sudden spikes in denies or quarantines signal policy drift.

Also track cost per completed workflow. Token spend without completion context is a finance horror story, not an engineering metric.
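
These baselines are cheap to compute from raw run records. A sketch, using the thresholds from this section and the simple nearest-rank method for the percentile:

```python
import math

def p95(latencies_ms):
    """Nearest-rank 95th percentile: the ceil(0.95 * n)-th ordered value."""
    ordered = sorted(latencies_ms)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]

def slo_report(runs, baseline_p95_ms):
    """Check this guide's baselines: >= 99% success (alert below 97%)
    and P95 latency within a +20% drift budget of the rollout baseline."""
    ok = [r for r in runs if r["success"]]
    success_rate = 100.0 * len(ok) / len(runs)
    current_p95 = p95([r["latency_ms"] for r in runs])
    return {
        "success_rate_pct": round(success_rate, 2),
        "alert": success_rate < 97.0,
        "p95_ms": current_p95,
        "p95_within_budget": current_p95 <= 1.20 * baseline_p95_ms,
    }

# Hypothetical run records: 99 successes plus one slow failure
runs = [{"success": True, "latency_ms": 500 + i} for i in range(99)]
runs.append({"success": False, "latency_ms": 900})
report = slo_report(runs, baseline_p95_ms=700)
print(report)
```

The same function works at every rollout phase; only `baseline_p95_ms` and the run window change.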

Resilience

Failure drills and rollback

The safest rollback path is the one you tested last week — not the one in a doc from last quarter.

Minimum drill cadence

Weekly: Deny and approval flow verification — submit blocked actions, verify correct audit entries
Bi-weekly: Worker crash and retry validation — kill a worker mid-run, verify no duplicate side effects
Monthly: Full rollback simulation with stakeholder comms timing — measure how long recovery actually takes
deployment-drill.sh
Bash
# Run these drills BEFORE each traffic increase.
# If any drill fails, do not promote to the next phase.

# 1) Submit a denied action — verify hard block + audit entry
curl -sS -X POST "$API/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $KEY" -H "X-Tenant-ID: default" \
  -d '{"topic":"job.exec.shell","labels":{"env":"prod","command_class":"destructive"}}'
# Expected: 403 with decision=DENY

# 2) Submit approval-required action — verify queue state
curl -sS -X POST "$API/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $KEY" -H "X-Tenant-ID: default" \
  -d '{"topic":"job.deploy.apply","labels":{"env":"prod"}}'
# Expected: 202 with status=pending_approval

# 3) Kill one worker mid-run — verify retry without duplicate side effect
# docker kill cordum-worker-1 && sleep 5 && docker logs cordum-scheduler

# 4) Trigger rollback — confirm compensating action timeline is clean
# Follow your rollback runbook. Time it. Record the duration.
Go-Live

Best practices checklist: 10 items before production

Do not ship until every item is checked. Each one has prevented a real production incident.

1. Policy-before-dispatch gate is active in both submit and dispatch paths
2. High-risk actions route to approval with immutable policy snapshot reference
3. Run state is durable, queryable, and mapped to pointers for context/results
4. Retry and timeout behavior is tested with at least one worker kill drill
5. Output safety decisions (ALLOW, REDACT, QUARANTINE) are visible in logs and metrics
6. Dead-letter queue has clear ownership and documented replay procedure
7. Per-agent credentials are scoped and rotation process is documented
8. Canary gates are explicit and automated: success rate, latency, queue depth, policy anomalies
9. Rollback runbook includes both technical rollback and stakeholder communication path
10. On-call team can answer: what executed, why it executed, and who approved it

Frequently Asked Questions

What do AI agents need for production deployment?
Six infrastructure layers: a message bus (NATS or Kafka) for job routing, durable state store (Redis or Postgres) for run history, a scheduler with retry/timeout/dead-letter handling, a pre-dispatch policy engine that evaluates every action before execution, an output safety pipeline (allow/redact/quarantine), and an immutable audit trail. You also need a phased rollout plan — never jump from staging to 100% traffic.
What are compliance requirements for deploying AI agents in production?
Most teams map to SOC 2 Type II, ISO 27001, or NIST AI RMF. Auditors look for: per-agent credential scoping with rotation evidence, approval records bound to policy snapshots, immutable append-only audit logs with defined retention, reproducible incident timelines from trigger to resolution, and versioned policy files with PR review before deployment. The key insight: if you already have pre-dispatch policy, approval routing, and an immutable audit trail, compliance becomes documentation of controls you already run.
How do you reduce blast radius during AI agent rollout?
Use staged traffic percentages with objective promotion gates. Phase 0: synthetic replay (0 policy bypasses). Phase 1: 5% low-risk traffic (success rate >= 99%). Phase 2: 25% mixed workload (P95 latency within +20% baseline). Phase 3: 50-100% (two clean weekly reviews + passed rollback drill). If any gate fails, roll back immediately — don't spend six hours debating ownership.
What hosting options exist for AI agents in production?
Three main options: Managed runtimes (PaaS) for teams optimizing speed and low ops load — tradeoff is platform limits. Kubernetes for high-volume workloads needing strict networking and custom runtime controls — tradeoff is operational complexity. VM-based deployment for legacy integration and simple small-scale workloads — tradeoff is manual scaling. The right choice depends on your operational capacity, security posture, and scale requirements.
Is output filtering alone enough for production safety?
No. Output filtering runs after execution — the agent has already performed the action. You also need pre-dispatch policy checks so dangerous actions are blocked before any side effects happen. Think of it this way: output safety catches data leaks in responses, but pre-dispatch policy prevents the agent from executing a destructive database query in the first place. You need both.
What steps should I follow to deploy an AI agent to production?
Phase 0 (3-7 days): Run synthetic replay with policy checks, targeting 0 critical bypasses. Phase 1 (3-5 days): Route 5% of low-risk jobs, require >= 99% success rate with no unapproved high-risk actions. Phase 2 (5-7 days): Expand to 25% mixed workload, enforce P95 latency within +20% baseline and stable approval queue. Phase 3 (7-14 days): Scale to 50-100% after two clean weekly reviews and a passed rollback drill. Each phase has explicit promotion gates — no subjective assessments.
How do I test rollback before going to production?
Run four drills before each traffic increase: (1) Submit a denied action and verify hard block with correct audit entry, (2) Submit an approval-required action and verify it enters the queue correctly, (3) Kill a worker process mid-run and verify retry without duplicate side effects, (4) Trigger the rollback path and confirm the compensating action timeline is clean. The safest rollback is the one you tested last week, not the one in a doc from last quarter.