
AI Agent Production Deployment Checklist

If your launch gate cannot fail a release, it is not a gate.

Checklist · 13 min read · Updated Apr 2026
TL;DR
  • A checklist without pass/fail thresholds is not a launch gate. It is a to-do list.
  • The highest-value control is policy before dispatch. If you check only after execution, side effects already happened.
  • Launch in staged percentages and stop promotion automatically when reliability or governance signals degrade.
  • Run incident replay drills before launch day, not during your first incident.
How to use this checklist

Treat every row as a pass/fail test. If a launch-blocking row fails, stop promotion and fix it before adding more traffic. This guide is designed for engineering teams running autonomous workflows, not slide decks.

The production problem

Teams usually do this in the wrong order. They optimize prompt quality first, then bolt on controls when the first incident happens.

Production flips the order: define control boundaries first, then optimize behavior inside those boundaries.

Launch anti-pattern

“We will watch dashboards and roll back if needed” is not a control strategy. It is hope with a Slack channel.

What top checklist posts cover vs miss

Source | Strong coverage | Missing piece
TheAgentLabs: AI Agent Deployment Checklist | Clear release-readiness framing across objectives, governance, observability, and stakeholder alignment. | No operator-grade launch blockers (SLO windows, hard stop thresholds, or accountable escalation owners).
InfoWorld: 10 release criteria for AI agents | Strong cross-functional release criteria: value metrics, trust factors, data quality, and compliance expectations. | No executable gate contract tying criteria to dispatch-time policy checks and auto-block behavior.
Field Guide to AI: Deployment Lifecycle | Useful canary sequencing, rollback triggers, and lifecycle checklist structure. | No control-plane binding for approvals, policy snapshots, and idempotency evidence in production audits.

Gap summary: most content explains what to have, but not what should block launch. This checklist adds pass/fail criteria and runnable gate checks.

launch-gate-contract.json
JSON
{
  "gate_id": "production-launch-2026-04-01",
  "window": "10m",
  "blockers": {
    "unapproved_high_risk_actions": { "target_eq": 0, "value": 0, "pass": true },
    "workflow_success_rate": { "target_gte": 0.99, "value": 0.994, "pass": true },
    "p95_latency_ms": { "target_lte": 1200, "value": 1014, "pass": true },
    "approval_queue_median_sec": { "target_lte": 900, "value": 520, "pass": true },
    "rollback_drill_passed": { "target_eq": true, "value": true, "pass": true }
  },
  "owners": {
    "reliability": "sre-oncall",
    "governance": "ai-governance",
    "cost": "finops"
  },
  "decision": "allow_promotion_to_next_stage"
}
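
The contract above is only useful if something machine-checks it. A minimal sketch of an evaluator, assuming the field names from the example JSON (target_eq / target_gte / target_lte per blocker; this is our shape, not a specific product's API):

```python
# Sketch: evaluate a launch-gate contract and block on any failed threshold.
# Unknown threshold types block by default (fail closed).
import json

def evaluate_gate(contract: dict) -> bool:
    """Return True only when every blocker passes its threshold."""
    for name, blocker in contract["blockers"].items():
        if "target_eq" in blocker:
            ok = blocker["value"] == blocker["target_eq"]
        elif "target_gte" in blocker:
            ok = blocker["value"] >= blocker["target_gte"]
        elif "target_lte" in blocker:
            ok = blocker["value"] <= blocker["target_lte"]
        else:
            ok = False  # unrecognized threshold: fail closed
        if not ok:
            return False
    return True

contract = json.loads("""{
  "blockers": {
    "unapproved_high_risk_actions": {"target_eq": 0, "value": 0},
    "workflow_success_rate": {"target_gte": 0.99, "value": 0.994},
    "p95_latency_ms": {"target_lte": 1200, "value": 1014}
  }
}""")
print("PASS" if evaluate_gate(contract) else "BLOCK")
```

Recomputing pass/fail from target and value, instead of trusting a stored "pass" flag, keeps the gate honest if the contract file is hand-edited.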

20 controls with pass/fail gates

# | Area | Control | Pass signal | Block signal
1 | Governance | Policy check on submit path | Deny/approval decisions returned before job persistence | Job is persisted or queued before policy evaluation
2 | Governance | Policy check on dispatch path | Scheduler re-evaluates policy before worker routing | Queued jobs dispatch with stale policy assumptions
3 | Governance | Approval routing for high-risk actions | Prod writes, destructive ops, and financial actions require explicit approval | Risky actions can execute without human checkpoint
4 | Governance | Approval binding to policy snapshot | Approval record stores policy hash + job hash | Approval exists but cannot prove what policy version was approved
5 | Governance | Immutable decision timeline | Request, decision, approver, and outcome are linked by ID | Partial logs cannot reconstruct incidents
6 | Security | Per-agent identity isolation | Each agent or worker has a unique principal | Shared credential used by multiple agent paths
7 | Security | Credential rotation process | Rotation tested and documented with zero-downtime fallback | Static keys and manual break-glass updates
8 | Security | Outbound network allowlist | Agent egress constrained to explicit host list | Unbounded outbound calls to internet targets
9 | Security | Input schema validation | Invalid requests rejected at API boundary | Free-form input reaches tool execution directly
10 | Security | Output safety decisions | ALLOW, REDACT, QUARANTINE outcomes are enforced and logged | Sensitive output reaches downstream systems unfiltered
11 | Reliability | Retry class definition | Transient vs terminal errors are separated and tested | All failures retried identically
12 | Reliability | Timeout budget per job class | Timeouts mapped to workload type and queue latency profile | Single global timeout for all job types
13 | Reliability | Dead-letter queue ownership | Team, SLA, and replay process defined | DLQ exists but nobody owns triage
14 | Reliability | Worker heartbeat watchdog | Stale worker detection triggers requeue or fail-safe transition | Lost worker leaves job in undefined state
15 | Reliability | Idempotency on side effects | Duplicate dispatch does not duplicate external side effects | Retries can create duplicate writes or payments
16 | Operations | Canary rollout gates | Promotion blocked automatically when thresholds fail | Manual promotion despite failing indicators
17 | Operations | Rollback trigger matrix | Latency, error, and governance thresholds mapped to rollback actions | Rollback criteria are subjective and ad hoc
18 | Operations | Workflow cost budgets | Per-workflow token/API budget and alerts configured | No ceiling on autonomous spend
19 | Operations | On-call runbook | Runbook covers stop, isolate, replay, and comms steps | First incident requires inventing process live
20 | Compliance | Incident replay drill | Team can replay one historical incident end-to-end in staging | Replay is theoretical and not tested
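
Control 15 is the one teams most often hand-wave. A minimal sketch of idempotency on side effects, assuming a deterministic key derived from the job payload (the names seen_keys, execute_side_effect are illustrative, and a real system would back the key set with a durable store, not process memory):

```python
# Sketch of control 15: a duplicate dispatch must not duplicate the side effect.
import hashlib
import json

seen_keys: set[str] = set()  # stand-in for a durable idempotency store

def idempotency_key(job: dict) -> str:
    # Canonical JSON so key order in the payload does not change the key.
    canonical = json.dumps(job, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def execute_side_effect(job: dict) -> str:
    key = idempotency_key(job)
    if key in seen_keys:
        return "skipped-duplicate"  # retry or double-dispatch: no second write
    seen_keys.add(key)
    return "executed"

job = {"id": "pay-42", "action": "charge", "amount_cents": 1999}
print(execute_side_effect(job))  # executed
print(execute_side_effect(job))  # skipped-duplicate
```

The pass signal in row 15 is exactly this behavior: re-running the dispatch path produces one external write, not two.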

Canary rollout thresholds

You need launch math, not vibes. Use fixed promotion criteria and fixed rollback triggers.

Phase | Traffic | Promote when | Rollback when
Replay Gate | 0% live traffic | 30-scenario replay suite has zero critical bypass | Any critical bypass or non-idempotent duplicate
Canary 1 | 1-5% low-risk traffic | Success rate >= 99%, P95 latency <= 1.2x baseline | Error rate > 2x baseline for 10 minutes
Canary 2 | 25% mixed traffic | Approval queue median wait <= 15 min, no unapproved high-risk action | Approval backlog exceeds SLA for 30 minutes
General | 50-100% traffic | Two weekly reliability reviews pass with no critical incidents | Any policy bypass or output safety quarantine spike > 3x
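
The Canary 1 row reduces to a three-way decision. A sketch using the thresholds from the table (success rate >= 99%, P95 <= 1.2x baseline, rollback at error rate > 2x baseline; the function name and the "hold" state are ours):

```python
# Sketch: fixed promote/hold/rollback decision for the Canary 1 phase.
def canary1_decision(success_rate: float, p95_ms: float, baseline_p95_ms: float,
                     error_rate: float, baseline_error_rate: float) -> str:
    # Rollback triggers are checked first: they win over promotion criteria.
    if error_rate > 2 * baseline_error_rate:
        return "rollback"
    if success_rate >= 0.99 and p95_ms <= 1.2 * baseline_p95_ms:
        return "promote"
    return "hold"  # neither promotable nor bad enough to roll back yet

print(canary1_decision(0.994, 1014, 900, 0.004, 0.003))  # promote
```

Evaluating rollback triggers before promotion criteria is the point: a phase can satisfy its promote condition and still be rolled back on an error spike.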

Policy and gate config examples

Start with a small policy bundle and a small gate script. Expand only when you have real incident data that requires new rules.

production-policy.yaml
YAML
version: v1
rules:
  - id: deny-destructive-prod
    match:
      topic: "job.exec.shell"
      labels:
        env: prod
        command_class: destructive
    decision: DENY
    reason: "destructive production command blocked"

  - id: approval-prod-write
    match:
      topic: "job.deploy.apply"
      labels:
        env: prod
    decision: REQUIRE_APPROVAL
    reason: "production deployment requires human approval"

  - id: constrain-external-egress
    match:
      topic: "job.integrations.call"
      risk_tags: ["egress"]
    decision: ALLOW_WITH_CONSTRAINTS
    constraints:
      allowed_hosts: ["api.github.com", "api.slack.com"]
      timeout_ms: 15000

  - id: allow-read-only
    match:
      topic: "job.repo.read"
    decision: ALLOW
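
A first-match evaluator over rules shaped like production-policy.yaml can be sketched in a few lines (inline dicts stand in for YAML parsing; only topic and label matching is implemented, risk_tags and constraints are omitted, and defaulting unmatched requests to DENY is our assumption, not something the bundle states):

```python
# Sketch: first-match policy evaluation over rules like production-policy.yaml.
def evaluate(rules: list[dict], request: dict) -> tuple[str, str]:
    for rule in rules:
        match = rule["match"]
        if match.get("topic") and match["topic"] != request["topic"]:
            continue  # topic mismatch: try the next rule
        labels = match.get("labels", {})
        if any(request.get("labels", {}).get(k) != v for k, v in labels.items()):
            continue  # a required label is missing or different
        return rule["decision"], rule.get("reason", "")
    return "DENY", "no matching rule (default-deny)"

rules = [
    {"id": "deny-destructive-prod",
     "match": {"topic": "job.exec.shell",
               "labels": {"env": "prod", "command_class": "destructive"}},
     "decision": "DENY", "reason": "destructive production command blocked"},
    {"id": "allow-read-only",
     "match": {"topic": "job.repo.read"},
     "decision": "ALLOW"},
]

print(evaluate(rules, {"topic": "job.repo.read"}))
print(evaluate(rules, {"topic": "job.exec.shell",
                       "labels": {"env": "prod", "command_class": "destructive"}}))
```

Rule order matters in a first-match scheme: put the DENY rules before broader ALLOW rules, and keep default-deny as the fallthrough.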
gate-check.sh
Bash
#!/usr/bin/env bash
# gate-check.sh — block canary promotion when launch SLOs fail
set -euo pipefail

: "${API:?set API to the metrics base URL}"

# Fail if any unapproved high-risk execution appears in the last 10m
HIGH_RISK_UNAPPROVED=$(curl -fsS "$API/metrics/high-risk-unapproved?window=10m")
if [ "$HIGH_RISK_UNAPPROVED" -gt 0 ]; then
  echo "BLOCK: unapproved high-risk action detected"
  exit 1
fi

# Fail if reliability drifts outside launch SLO
SUCCESS_RATE=$(curl -fsS "$API/metrics/success-rate?window=10m")
P95_LATENCY_MS=$(curl -fsS "$API/metrics/p95-latency-ms?window=10m")
if (( $(echo "$SUCCESS_RATE < 0.99" | bc -l) )); then
  echo "BLOCK: success rate below 99%"
  exit 1
fi
if [ "$P95_LATENCY_MS" -gt 1200 ]; then
  echo "BLOCK: P95 latency above threshold"
  exit 1
fi

echo "PASS: promote canary step"

Limitations and tradeoffs

More controls, more latency

Pre-dispatch checks and approvals add overhead. Keep low-risk actions constrained but autonomous.

Approvals can bottleneck throughput

If everything needs approval, nothing scales. Route approvals only for truly high-risk categories.

Checklists decay without ownership

Assign owners and review cadence. A static checklist from six months ago is operational debt.

Frequently Asked Questions

What is the minimum AI agent production checklist?
Minimum does not mean small. You still need policy-before-dispatch checks, approval gates for high-risk actions, immutable decision logs, DLQ ownership, staged rollout gates, and rollback drills.
How many checklist items should block launch?
At least the critical set: controls 1-5, 6, 10, 13, 16, and 17. If one of those fails, do not launch.
How do I avoid approval fatigue?
Reserve approvals for high-risk actions only. Low-risk actions should auto-pass with constraints. Track approval queue latency and auto-adjust thresholds if reviewers are overloaded.
Is output filtering enough for production safety?
No. Output filtering runs after execution. You still need pre-dispatch controls so dangerous actions are blocked before side effects happen.
What should I test in an incident replay drill?
Test the full sequence: detect, stop, isolate, replay, verify idempotency, and produce an audit timeline. If any step requires guesswork, your runbook is incomplete.
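
A drill harness can make that sequence executable rather than theoretical. A minimal sketch, assuming each step is a callable returning pass/fail (the step names follow the answer above; the handlers here are stubs you would replace with real checks):

```python
# Sketch: run an incident-replay drill as ordered steps that fail fast.
DRILL_STEPS = ["detect", "stop", "isolate", "replay",
               "verify_idempotency", "audit_timeline"]

def run_drill(handlers: dict) -> tuple[bool, list[str]]:
    """Return (passed, completed_steps); a missing or failing step stops the drill."""
    completed: list[str] = []
    for step in DRILL_STEPS:
        handler = handlers.get(step)
        if handler is None or not handler():
            return False, completed  # this step is the runbook gap to fix
        completed.append(step)
    return True, completed

ok, done = run_drill({step: (lambda: True) for step in DRILL_STEPS})
print(ok, done)  # True and all six steps
```

The value is the failure report: the first step that returns False names exactly where the runbook still relies on guesswork.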
Next step

Pick the first ten controls, map each one to an owner, and run one replay-gate drill this week. Do not start canary traffic before that drill passes.