Deploy AI Agents in Production
Your agent works in the demo. Now it needs to survive 10,000 runs without breaking production, leaking data, or executing actions nobody approved. This guide covers the exact infrastructure, rollout phases, and governance controls that production teams ship before going live.
- Most production failures happen after a successful demo — teams skip decision control and rollback drills.
- Deployment is a staged migration: 5% → 25% → 50% → 100%, with objective gates at each step.
- The highest-value control is policy-before-dispatch: evaluate risk before the worker executes any side effect.
- Output safety is required but runs after execution. Pre-dispatch policy runs before. You need both.
Why demos succeed and production fails
You are no longer grading prompt quality. You are operating a distributed system that writes to infrastructure, tickets, code repositories, and customer-facing channels. Three failure modes show up repeatedly.
No decision boundary
Your agent auto-resolved 200 tickets perfectly in staging. In production, it encounters a ticket containing a SQL injection payload and executes it against your database, because no policy check evaluated the input before dispatch.
No staged rollout
Traffic jumps from zero to 100%. At 2x expected load, your scheduler queue backs up, retry storms trigger duplicate executions, and three customers receive conflicting automated responses. You needed canary gates.
No incident narrative
Something went wrong at 3 AM. Your logs show the action happened, but not who approved it, which policy was active, or what the full input context was. The postmortem starts with "we think" — and your auditor is unimpressed.
Production sanity check
If you cannot answer who approved a risky action, which policy snapshot allowed it, and what exact output was returned — you are not in production mode yet.
What do AI agents need for production deployment?
The stack is smaller than most architecture diagrams suggest. Six layers, each explicit. If any layer is missing, you'll hit the failure mode listed — usually during your first real incident.
Execution Bus
NATS or Kafka with explicit subject/topic routing
If missing: Backpressure is invisible, retries become guesswork, and jobs silently pile up.
State & Pointers
Redis or Postgres for job state, context pointers, result pointers
If missing: No reproducible run history. Incidents become archaeology.
Scheduler
Deterministic retries, timeouts, dead-letter handling
If missing: Orphaned runs and duplicate execution under retry pressure.
Policy Decision Point
Pre-dispatch policy checks and approval queue for risky actions
If missing: Agent reaches side effects before anyone can intervene.
Output Safety
Allow, redact, or quarantine result pipeline
If missing: PII and secrets leak in outputs even when input controls look fine.
Audit Trail
Immutable action + decision + approver timeline
If missing: No defensible post-incident narrative, weak compliance evidence.
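To make the scheduler's "retry without duplicate execution" requirement concrete, here is a minimal sketch of an idempotency guard. The `IdempotentExecutor` class, the key format, and the in-memory dictionary are illustrative assumptions, not part of any specific framework; a real deployment would presumably keep this state in the Redis or Postgres layer listed above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class IdempotentExecutor:
    """Runs each side effect at most once per idempotency key (hypothetical helper)."""

    # Assumption: an in-memory dict stands in for Redis or Postgres job state.
    results: Dict[str, str] = field(default_factory=dict)

    def run(self, idempotency_key: str, side_effect: Callable[[], str]) -> str:
        # A retry of an already-completed job returns the stored result
        # instead of executing the side effect a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        result = side_effect()
        self.results[idempotency_key] = result
        return result


sent = []  # records every real execution of the side effect


def send_reply() -> str:
    sent.append("reply")
    return "sent"


executor = IdempotentExecutor()
first = executor.run("job-123:send-reply", send_reply)
retry = executor.run("job-123:send-reply", send_reply)  # simulated scheduler retry
```

The sketch is single-process and not safe under concurrent workers; a shared store would need an atomic set-if-absent operation to give the same guarantee.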
Options for hosting AI agents in production
The right hosting choice depends on your operational capacity, security posture, and scale requirements.
Managed Runtime (PaaS)
Kubernetes
VM-Based Deployment
Step-by-step phased deployment
Do not jump from staging to 100%. Use explicit traffic gates and require each gate to pass objective metrics before promoting. If a gate fails, roll back immediately.
Replay real production requests through your policy engine without executing side effects. Every denied action should produce a correct audit entry. This catches policy misconfigurations before any real traffic flows.
Route only low-risk, read-only operations. Monitor success rate, latency, and approval queue depth. If any high-risk action executes without approval, roll back immediately, because your routing labels are wrong.
Introduce write operations under approval gates. Watch for approval queue depth spikes — they indicate either too many risky actions or too few reviewers. Both are problems to fix before scaling further.
Full traffic only after two consecutive clean review cycles and a successful rollback drill. The drill is non-negotiable — if you haven't tested rollback, you don't have rollback.
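The promote-or-roll-back rule in this section can be sketched as a small function. This is an illustration under assumptions: the comparison keys `target_gte`, `target_lte`, and `target_eq` mirror the gate evidence contract in this guide, while the function names and the `"promote"` / `"rollback"` return values are hypothetical.

```python
from typing import Any, Dict


def check_passes(check: Dict[str, Any]) -> bool:
    """Compare a measured value against whichever target key the check carries."""
    value = check["value"]
    if "target_gte" in check:
        return value >= check["target_gte"]
    if "target_lte" in check:
        return value <= check["target_lte"]
    if "target_eq" in check:
        return value == check["target_eq"]
    raise ValueError("check has no target_gte, target_lte, or target_eq key")


def gate_decision(checks: Dict[str, Dict[str, Any]]) -> str:
    """Promote only when every check passes; any failure means roll back."""
    failed = [name for name, check in checks.items() if not check_passes(check)]
    return "promote" if not failed else "rollback"


passing_checks = {
    "success_rate_pct": {"value": 99.34, "target_gte": 99.00},
    "p95_latency_ms": {"value": 812, "target_lte": 840},
    "policy_bypass_incidents": {"value": 0, "target_eq": 0},
}
# Same gate, but one high-risk action slipped past policy.
failing_checks = dict(
    passing_checks,
    policy_bypass_incidents={"value": 1, "target_eq": 0},
)

decision_ok = gate_decision(passing_checks)
decision_bad = gate_decision(failing_checks)
```

Computing the result from raw values, rather than trusting a stored `pass` flag, keeps the gate decision reproducible from the evidence itself.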
Each gate produces an evidence object. Here is the contract used in production reviews — copy this into your CI pipeline or deployment checklist:
{
  "gate_id": "deploy-phase-2-2026-04-01",
  "traffic_slice": "25%",
  "checks": {
    "success_rate_pct": { "value": 99.34, "target_gte": 99.00, "pass": true },
    "p95_latency_ms": { "value": 812, "target_lte": 840, "pass": true },
    "approval_queue_p95_sec": { "value": 55, "target_lte": 120, "pass": true },
    "policy_bypass_incidents": { "value": 0, "target_eq": 0, "pass": true },
    "rollback_drill_passed": { "value": true, "target_eq": true, "pass": true }
  },
  "policy_snapshot": "v1:7f93d2c",
  "reviewed_by": "[email protected]",
  "decision": "promote_to_phase_3"
}
Policy gates and approvals
A deployment-ready agent stack needs deterministic decisions before dispatch. Four decision types cover the full spectrum of production actions.
ALLOW: Read-only ops, safe actions
DENY: Destructive shell commands in prod
REQUIRE_APPROVAL: Production deploys, permission changes
ALLOW_WITH_CONSTRAINTS: External calls with host allowlist
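A pre-dispatch check can be sketched as a function that maps a job to a decision before any worker sees it. The decision names, topics, and labels below come from this guide's policy file; the `Rule` and `Job` classes, the first-match ordering, and the default-deny fallback are assumptions made for illustration, not the behavior of a specific policy engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rule:
    rule_id: str
    topic: str
    decision: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Job:
    topic: str
    labels: Dict[str, str] = field(default_factory=dict)


def evaluate(job: Job, rules: List[Rule]) -> str:
    """Return the decision of the first rule whose topic and labels all match."""
    for rule in rules:
        topic_matches = rule.topic == job.topic
        labels_match = all(job.labels.get(k) == v for k, v in rule.labels.items())
        if topic_matches and labels_match:
            return rule.decision
    # Assumption: anything no rule covers is denied rather than allowed.
    return "DENY"


RULES = [
    Rule("deny-destructive-prod-shell", "job.exec.shell", "DENY",
         {"env": "prod", "command_class": "destructive"}),
    Rule("require-approval-prod-deploy", "job.deploy.apply", "REQUIRE_APPROVAL",
         {"env": "prod"}),
    Rule("allow-read-only-ops", "job.repo.read", "ALLOW"),
]

shell_decision = evaluate(
    Job("job.exec.shell", {"env": "prod", "command_class": "destructive"}), RULES
)
deploy_decision = evaluate(Job("job.deploy.apply", {"env": "prod"}), RULES)
read_decision = evaluate(Job("job.repo.read"), RULES)
unknown_decision = evaluate(Job("job.unknown.topic"), RULES)
```

The point of the sketch is ordering: `evaluate` runs before dispatch, so a `DENY` or `REQUIRE_APPROVAL` result stops the job before any side effect exists.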
Two implementation rules matter most:
# Copy-paste ready. Works with any agent framework.
version: v1
rules:
  # Block destructive shell commands in production
  - id: deny-destructive-prod-shell
    match:
      topic: "job.exec.shell"
      labels:
        env: prod
        command_class: destructive
    decision: DENY
    reason: "destructive shell action blocked in production"

  # Production deploys need human sign-off
  - id: require-approval-prod-deploy
    match:
      topic: "job.deploy.apply"
      labels:
        env: prod
    decision: REQUIRE_APPROVAL
    reason: "production deploy needs human sign-off"

  # External API calls restricted to approved endpoints
  - id: constrain-external-egress
    match:
      topic: "job.integrations.call"
      risk_tags: ["egress"]
    decision: ALLOW_WITH_CONSTRAINTS
    constraints:
      allowed_hosts: ["api.github.com", "api.slack.com"]
      timeout_ms: 15000
    reason: "external calls restricted to approved endpoints"

  # Read-only operations pass through
  - id: allow-read-only-ops
    match:
      topic: "job.repo.read"
    decision: ALLOW
    reason: "read-only operation"
Compliance requirements for deploying AI agents
Compliance is not a separate project. It is a natural output of the controls above — if you build them with audit evidence in mind. Most teams map to SOC 2 Type II, ISO 27001, or NIST AI RMF.
Least privilege
Auditors look for
Per-agent credential scoping, rotation evidence
How to satisfy it
Issue scoped API keys per agent. Rotate on a 90-day cycle. Log rotation events.
Approval evidence
Auditors look for
Who approved what, when, under which policy
How to satisfy it
Bind approvals to policy snapshot hash + job ID. Store in append-only audit log.
Immutable audit trail
Auditors look for
Tamper-evident logs with retention policy
How to satisfy it
Write-once storage. Define retention (1–7 years). Export to SIEM for correlation.
Incident timeline
Auditors look for
Reproducible sequence from trigger to resolution
How to satisfy it
Link every decision, action, and output in a queryable timeline per run ID.
Change control
Auditors look for
Policy changes reviewed before deployment
How to satisfy it
Version policy files in git. Require PR review. Tag deployed versions.
The key insight: if you have pre-dispatch policy, approval routing, and an immutable audit trail, you already satisfy the core evidence requirements. Compliance becomes documentation of controls you already run — not a bolt-on project.
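One common way to make an append-only log tamper-evident is a hash chain, where each entry commits to the one before it. The sketch below assumes that technique; the record fields (`job_id`, `event`, `policy_snapshot`, `approver`) and the function names are hypothetical, and write-once storage would still be needed underneath.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def entry_hash(prev_hash: str, record: Dict[str, str]) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append(log: List[Dict[str, Any]], record: Dict[str, str]) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": entry_hash(prev_hash, record),
    })


def verify(log: List[Dict[str, Any]]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict[str, Any]] = []
# Each approval is bound to the job ID and the policy snapshot that required it.
append(audit_log, {"job_id": "job-123", "event": "approval_requested",
                   "policy_snapshot": "v1:7f93d2c", "approver": ""})
append(audit_log, {"job_id": "job-123", "event": "approval_granted",
                   "policy_snapshot": "v1:7f93d2c", "approver": "reviewer-a"})
intact = verify(audit_log)
```

Because every entry stores the policy snapshot and job ID alongside the approver, the same records answer the "who approved what, under which policy" question that auditors ask.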
Monitoring and SLO baselines
Start with a small baseline set. You can always add more dashboards after the first week. You cannot retroactively add yesterday's missing data.
Reliability
Success rate target: >= 99% for low-risk automated jobs. Alert on any drop below 97%.
Latency
Track P95 end-to-end latency. Enforce a max +20% drift budget during each rollout phase.
Governance
Monitor approval queue depth and decision mix. Sudden spikes in denies or quarantines signal policy drift.
Also track cost per completed workflow. Token spend without completion context is a finance horror story, not an engineering metric.
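The reliability and latency baselines above reduce to a few comparisons. This sketch uses the thresholds stated in this section (99% target, 97% alert, +20% P95 drift); the function names, status strings, and the 700 ms baseline in the example are assumptions for illustration.

```python
SUCCESS_TARGET_PCT = 99.0    # target for low-risk automated jobs
SUCCESS_ALERT_PCT = 97.0     # alert on any drop below this
LATENCY_DRIFT_BUDGET = 0.20  # max +20% P95 drift per rollout phase


def success_status(success_rate_pct: float) -> str:
    """Classify a success rate against the target and alert thresholds."""
    if success_rate_pct < SUCCESS_ALERT_PCT:
        return "alert"
    if success_rate_pct < SUCCESS_TARGET_PCT:
        return "below_target"
    return "ok"


def latency_within_budget(baseline_p95_ms: float, current_p95_ms: float) -> bool:
    """True when P95 latency has drifted no more than the budget above baseline."""
    drift = (current_p95_ms - baseline_p95_ms) / baseline_p95_ms
    return drift <= LATENCY_DRIFT_BUDGET


# Hypothetical pre-rollout baseline of 700 ms P95.
healthy = success_status(99.34)
degraded = success_status(98.2)
paging = success_status(96.5)
latency_ok = latency_within_budget(700, 812)
latency_bad = latency_within_budget(700, 900)
```

Capturing the baseline before the phase starts matters: the drift budget is only meaningful relative to a number recorded ahead of the traffic increase.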
Failure drills and rollback
The safest rollback path is the one you tested last week — not the one in a doc from last quarter.
Minimum drill cadence
# Run these drills BEFORE each traffic increase.
# If any drill fails, do not promote to the next phase.

# 1) Submit a denied action — verify hard block + audit entry
curl -sS -X POST "$API/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $KEY" -H "X-Tenant-ID: default" \
  -d '{"topic":"job.exec.shell","labels":{"env":"prod","command_class":"destructive"}}'
# Expected: 403 with decision=DENY

# 2) Submit approval-required action — verify queue state
curl -sS -X POST "$API/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -H "X-API-Key: $KEY" -H "X-Tenant-ID: default" \
  -d '{"topic":"job.deploy.apply","labels":{"env":"prod"}}'
# Expected: 202 with status=pending_approval

# 3) Kill one worker mid-run — verify retry without duplicate side effect
# docker kill cordum-worker-1 && sleep 5 && docker logs cordum-scheduler

# 4) Trigger rollback — confirm compensating action timeline is clean
# Follow your rollback runbook. Time it. Record the duration.
Best practices checklist: 10 items before production
Do not ship until every item is checked. Each one has prevented a real production incident.
Ready to deploy your first governed agent?
Pick one high-impact workflow and run this rollout plan on it this week. Start with the quickstart and wire your first approval-gated action before expanding scope.