The production problem
Teams usually ship agents to production in the wrong order: they optimize prompt quality first, then bolt on controls when the first incident happens.
Production flips the order: define control boundaries first, then optimize behavior inside those boundaries.
Launch anti-pattern
“We will watch dashboards and roll back if needed” is not a control strategy. It is hope with a Slack channel.
What top checklist posts cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| TheAgentLabs: AI Agent Deployment Checklist | Clear release-readiness framing across objectives, governance, observability, and stakeholder alignment. | No operator-grade launch blockers (SLO windows, hard stop thresholds, or accountable escalation owners). |
| InfoWorld: 10 release criteria for AI agents | Strong cross-functional release criteria: value metrics, trust factors, data quality, and compliance expectations. | No executable gate contract tying criteria to dispatch-time policy checks and auto-block behavior. |
| Field Guide to AI: Deployment Lifecycle | Useful canary sequencing, rollback triggers, and lifecycle checklist structure. | No control-plane binding for approvals, policy snapshots, and idempotency evidence in production audits. |
Gap summary: most content explains what to have, but not what should block launch. This checklist adds pass/fail criteria and runnable gate checks.
A gate record makes those checks concrete: each blocker carries a target, the measured value, and a pass/fail result, and the record names who owns each area.

```json
{
  "gate_id": "production-launch-2026-04-01",
  "window": "10m",
  "blockers": {
    "unapproved_high_risk_actions": { "target_eq": 0, "value": 0, "pass": true },
    "workflow_success_rate": { "target_gte": 0.99, "value": 0.994, "pass": true },
    "p95_latency_ms": { "target_lte": 1200, "value": 1014, "pass": true },
    "approval_queue_median_sec": { "target_lte": 900, "value": 520, "pass": true },
    "rollback_drill_passed": { "target_eq": true, "value": true, "pass": true }
  },
  "owners": {
    "reliability": "sre-oncall",
    "governance": "ai-governance",
    "cost": "finops"
  },
  "decision": "allow_promotion_to_next_stage"
}
```

20 controls with pass/fail gates
| # | Area | Control | Pass signal | Block signal |
|---|---|---|---|---|
| 1 | Governance | Policy check on submit path | Deny/approval decisions returned before job persistence | Job is persisted or queued before policy evaluation |
| 2 | Governance | Policy check on dispatch path | Scheduler re-evaluates policy before worker routing | Queued jobs dispatch with stale policy assumptions |
| 3 | Governance | Approval routing for high-risk actions | Prod writes, destructive ops, and financial actions require explicit approval | Risky actions can execute without human checkpoint |
| 4 | Governance | Approval binding to policy snapshot | Approval record stores policy hash + job hash (see the sketch after this table) | Approval exists but cannot prove what policy version was approved |
| 5 | Governance | Immutable decision timeline | Request, decision, approver, and outcome are linked by ID | Partial logs cannot reconstruct incidents |
| 6 | Security | Per-agent identity isolation | Each agent or worker has a unique principal | Shared credential used by multiple agent paths |
| 7 | Security | Credential rotation process | Rotation tested and documented with zero-downtime fallback | Static keys and manual break-glass updates |
| 8 | Security | Outbound network allowlist | Agent egress constrained to explicit host list | Unbounded outbound calls to internet targets |
| 9 | Security | Input schema validation | Invalid requests rejected at API boundary | Free-form input reaches tool execution directly |
| 10 | Security | Output safety decisions | ALLOW, REDACT, QUARANTINE outcomes are enforced and logged | Sensitive output reaches downstream systems unfiltered |
| 11 | Reliability | Retry class definition | Transient vs terminal errors are separated and tested | All failures retried identically |
| 12 | Reliability | Timeout budget per job class | Timeouts mapped to workload type and queue latency profile | Single global timeout for all job types |
| 13 | Reliability | Dead-letter queue ownership | Team, SLA, and replay process defined | DLQ exists but nobody owns triage |
| 14 | Reliability | Worker heartbeat watchdog | Stale worker detection triggers requeue or fail-safe transition | Lost worker leaves job in undefined state |
| 15 | Reliability | Idempotency on side effects | Duplicate dispatch does not duplicate external side effects | Retries can create duplicate writes or payments |
| 16 | Operations | Canary rollout gates | Promotion blocked automatically when thresholds fail | Manual promotion despite failing indicators |
| 17 | Operations | Rollback trigger matrix | Latency, error, and governance thresholds mapped to rollback actions | Rollback criteria are subjective and ad hoc |
| 18 | Operations | Workflow cost budgets | Per-workflow token/API budget and alerts configured | No ceiling on autonomous spend |
| 19 | Operations | On-call runbook | Runbook covers stop, isolate, replay, and comms steps | First incident requires inventing process live |
| 20 | Compliance | Incident replay drill | Team can replay one historical incident end-to-end in staging | Replay is theoretical and not tested |
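For control #4, the evidence is the approval record itself. A minimal sketch of what such a record might look like, assuming content-addressed hashes of both the approved job payload and the policy bundle in force at decision time (field names are illustrative, not from any specific tool):

```json
{
  "approval_id": "apr-2026-0412",
  "job_hash": "sha256:<hash-of-job-payload>",
  "policy_hash": "sha256:<hash-of-policy-bundle>",
  "requested_topic": "job.deploy.apply",
  "decision": "APPROVED",
  "approver": "release-manager",
  "decided_at": "2026-04-01T14:22:05Z"
}
```

At dispatch time the scheduler can recompute both hashes and refuse to route the job if either one no longer matches, which is what lets controls #2 and #4 reconstruct exactly what was approved and under which policy version.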
Canary rollout thresholds
You need launch math, not vibes. Use fixed promotion criteria and fixed rollback triggers; the table below defines them per phase, and a config sketch follows it.
| Phase | Traffic | Promote when | Rollback when |
|---|---|---|---|
| Replay Gate | 0% live traffic | 30-scenario replay suite has zero critical bypass | Any critical bypass or non-idempotent duplicate |
| Canary 1 | 1-5% low-risk traffic | Success rate >= 99%, P95 latency <= 1.2x baseline | Error rate > 2x baseline for 10 minutes |
| Canary 2 | 25% mixed traffic | Approval queue median wait <= 15 min, no unapproved high-risk action | Approval backlog exceeds SLA for 30 minutes |
| General | 50-100% traffic | Two weekly reliability reviews pass with no critical incidents | Any policy bypass or output safety quarantine spike > 3x baseline |
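A minimal sketch of those phases written as config that a promotion job could read, assuming a simple threshold-per-metric structure (keys and layout are illustrative, not tied to any specific rollout tool):

```yaml
# Illustrative gate config mirroring the table above; adapt names to your rollout tooling.
phases:
  - name: replay-gate
    traffic_percent: 0
    promote_when:
      replay_critical_bypasses: { eq: 0 }        # 30-scenario suite, zero critical bypass
    rollback_when:
      replay_critical_bypasses: { gt: 0 }
      non_idempotent_duplicates: { gt: 0 }
  - name: canary-1
    traffic_percent: 5                            # 1-5% low-risk traffic
    promote_when:
      workflow_success_rate: { gte: 0.99 }
      p95_latency_vs_baseline: { lte: 1.2 }
    rollback_when:
      error_rate_vs_baseline: { gt: 2.0, sustained_minutes: 10 }
  - name: canary-2
    traffic_percent: 25
    promote_when:
      approval_queue_median_sec: { lte: 900 }     # 15 minutes
      unapproved_high_risk_actions: { eq: 0 }
    rollback_when:
      approval_backlog_over_sla_minutes: { gte: 30 }
  - name: general
    traffic_percent: 100                          # ramp 50-100%
    promote_when:
      weekly_reviews_passed: { gte: 2 }
      critical_incidents: { eq: 0 }
    rollback_when:
      policy_bypasses: { gt: 0 }
      quarantine_rate_vs_baseline: { gt: 3.0 }
```

Keeping the thresholds in config rather than in someone's head is what makes control #16 (automatic promotion blocking) and control #17 (rollback trigger matrix) enforceable.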
Policy and gate config examples
Start with a small policy bundle and a small gate script. Expand only when you have real incident data that requires new rules.
```yaml
version: v1
rules:
  - id: deny-destructive-prod
    match:
      topic: "job.exec.shell"
      labels:
        env: prod
        command_class: destructive
    decision: DENY
    reason: "destructive production command blocked"
  - id: approval-prod-write
    match:
      topic: "job.deploy.apply"
      labels:
        env: prod
    decision: REQUIRE_APPROVAL
    reason: "production deployment requires human approval"
  - id: constrain-external-egress
    match:
      topic: "job.integrations.call"
      risk_tags: ["egress"]
    decision: ALLOW_WITH_CONSTRAINTS
    constraints:
      allowed_hosts: ["api.github.com", "api.slack.com"]
      timeout_ms: 15000
  - id: allow-read-only
    match:
      topic: "job.repo.read"
    decision: ALLOW
```

```bash
# gate-check.sh
set -euo pipefail

# Fail if any unapproved high-risk execution appears in the last 10m
HIGH_RISK_UNAPPROVED=$(curl -s "$API/metrics/high-risk-unapproved?window=10m")
if [ "$HIGH_RISK_UNAPPROVED" -gt 0 ]; then
  echo "BLOCK: unapproved high-risk action detected"
  exit 1
fi

# Fail if reliability drifts outside launch SLO
SUCCESS_RATE=$(curl -s "$API/metrics/success-rate?window=10m")
P95_LATENCY_MS=$(curl -s "$API/metrics/p95-latency-ms?window=10m")

if (( $(echo "$SUCCESS_RATE < 0.99" | bc -l) )); then
  echo "BLOCK: success rate below 99%"
  exit 1
fi

if [ "$P95_LATENCY_MS" -gt 1200 ]; then
  echo "BLOCK: P95 latency above threshold"
  exit 1
fi

echo "PASS: promote canary step"
```
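To wire these together, evaluate the policy bundle on both the submit and dispatch paths (controls #1 and #2), and run the gate script from the deployment pipeline before each promotion step in the canary table: a zero exit allows the step, a non-zero exit blocks it. The script assumes a metrics endpoint reachable at `$API` plus `curl` and `bc` on the runner, so substitute your own metrics source and tighten or loosen the thresholds to match each phase.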
Limitations and tradeoffs
More controls, more latency
Pre-dispatch checks and approvals add overhead. Keep low-risk actions constrained but autonomous.
Approvals can bottleneck throughput
If everything needs approval, nothing scales. Route approvals only for truly high-risk categories.
Checklists decay without ownership
Assign owners and review cadence. A static checklist from six months ago is operational debt.