## The production problem
Agent releases often fail for one reason: teams validate prompts, then skip runtime rollout discipline. The first real traffic spike becomes the experiment.
Shadow traffic catches some logic regressions. It does not validate irreversible side effects, approval latency, or policy-deny drift under live constraints.
Canary without measurable gates has the same flaw. You get staged exposure, but no deterministic promotion rule.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Argo Rollouts Canary | Concrete canary mechanics (`setWeight`, `pause`, surge/unavailable controls, and optional traffic routing). | No policy-risk decision layer for autonomous agent side effects. |
| AWS CodeDeploy TimeBasedCanary | Precise canary parameters (`CanaryPercentage`, `CanaryInterval`) for phased rollout timing. | No guidance for agent-specific approval and replay controls during promotion. |
| Spinnaker Canary Best Practices | Useful scoring discipline: 3-hour canary windows, 1-hour intervals, >=50 data points, and score thresholds. | No control-plane model for policy simulation and safety-gated autonomous execution. |
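The Spinnaker-style scoring discipline above reduces to a small decision function. A minimal sketch, using the thresholds cited in the table (score >= 95 promotes, 75-94 holds for manual review, below 75 rolls back, with at least 50 data points per metric); the function name and signature are illustrative:

```python
def canary_decision(score: float, data_points: int,
                    pass_threshold: float = 95.0,
                    marginal_threshold: float = 75.0,
                    min_data_points: int = 50) -> str:
    """Map a canary score to a promotion decision.

    Returns one of "promote", "manual_hold", or "rollback".
    Thresholds follow the Spinnaker guidance cited above.
    """
    if data_points < min_data_points:
        # Too few samples to trust the score: hold rather than promote.
        return "manual_hold"
    if score >= pass_threshold:
        return "promote"
    if score >= marginal_threshold:
        return "manual_hold"
    return "rollback"
```

The key design point is that an under-sampled score is treated as a hold, never a pass: a high score on thin data is noise, not evidence.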
## Rollout model
Promotion should be a state machine with explicit entry and exit conditions. If a stage fails, rollback must be mechanical.
| Stage | Traffic profile | Promotion gate | Rollback trigger |
|---|---|---|---|
| Shadow stage | 0% user-facing actions, mirrored evaluation workload | Policy simulation pass + no critical schema/output violations | Any safety deny spike or parser failure over baseline |
| Canary stage 1 | 5-10% live traffic | Score >=95 or manual hold if 75-94 with >=50 data points | Error rate delta >2x baseline or approval backlog breach |
| Canary stage 2 | 25-50% live traffic | Stable latency + policy deny rate within allowed threshold | Sustained overload reason codes (`pool_overloaded`, `no_workers`) |
| Full promotion | 100% | No critical incidents through one full business cycle | Immediate rerun-from-step/rollback workflow on severe regression |
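The stage table above can be sketched as an explicit state machine with mechanical transitions. This is a sketch, not a definitive implementation: stage names and numeric thresholds come from the table, while the metric keys and the gate/rollback callables are illustrative stubs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    traffic_pct: int
    gate: Callable[[dict], bool]      # promotion gate: entry condition for next stage
    rollback: Callable[[dict], bool]  # rollback trigger: mechanical, no judgment call

def advance(stages: list[Stage], current: int, metrics: dict) -> tuple[int, str]:
    """Evaluate the current stage against metrics; return (new_index, action)."""
    stage = stages[current]
    if stage.rollback(metrics):
        return 0, "rollback"  # mechanical rollback: back to shadow
    if stage.gate(metrics) and current + 1 < len(stages):
        return current + 1, "promote"
    return current, "hold"

# Illustrative gates using the numeric thresholds from the table above.
stages = [
    Stage("shadow", 0,
          gate=lambda m: m["policy_sim_pass"] and m["schema_violations"] == 0,
          rollback=lambda m: m["deny_spike"]),
    Stage("canary_1", 10,
          gate=lambda m: m["score"] >= 95 and m["data_points"] >= 50,
          rollback=lambda m: m["error_rate_delta"] > 2.0),
    Stage("canary_2", 50,
          gate=lambda m: m["latency_stable"] and m["deny_rate"] <= 0.03,
          rollback=lambda m: m["overloaded"]),
    Stage("full", 100,
          gate=lambda m: False,  # terminal stage: nothing to promote to
          rollback=lambda m: m["critical_incident"]),
]
```

Because every transition is a pure function of metrics, the same inputs always produce the same promotion decision, which is the deterministic promotion rule the canary-without-gates approach lacks.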
## Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Pre-deploy policy validation | `POST /api/v1/policy/simulate` and `POST /api/v1/policy/bundles/{id}/simulate` | Canary candidates can be evaluated without side effects before traffic shift. |
| Workflow dry-run | `POST /api/v1/workflows/{id}/dry-run` | Rollout steps can be tested with environment context before live dispatch. |
| Safe rerun path | Workflow supports rerun-from-step and dry-run mode | Rollback and corrective rollout can resume from known safe boundaries. |
| Run idempotency | Runs support `Idempotency-Key` on creation | Promotion retries do not create duplicate rollout runs. |
| Approval gate support | Unified approvals endpoint for workflow and policy approvals | High-risk rollout stages can require explicit human authorization. |
## Implementation examples

### Canary traffic stages (YAML)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 20m }
        - setWeight: 25
        - pause: { duration: 30m }
        - setWeight: 50
        - pause: { duration: 30m }
```

### Promotion scoring policy (YAML)
```yaml
rollout_gates:
  min_data_points_per_metric: 50
  canary_lifetime_hours: 3
  metric_interval_hours: 1
  score:
    pass: 95
    marginal: 75
  rollback_triggers:
    max_error_rate_delta: 2.0
    max_policy_deny_rate: 0.03
```

### Cordum pre-promotion checks (bash)
```bash
# 1) Simulate policy on rollout candidate
curl -sS -X POST http://localhost:8081/api/v1/policy/simulate \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"topic":"job.prod.deploy","tenant":"default"}'

# 2) Dry-run workflow before canary shift
curl -sS -X POST http://localhost:8081/api/v1/workflows/WF_ID/dry-run \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"input":{"release":"v1.8.0"},"environment":"staging"}'
```

## Limitations and tradeoffs
- Shadow traffic increases infra cost and does not fully validate side-effect safety.
- Slower canary stages reduce blast radius but delay feature delivery.
- Strict promotion gates reduce bad releases but may produce false positives on noisy metrics.
- Manual approval steps reduce risk but add coordination latency.
## Next step
Run this in one sprint:
1. Define rollout stages (shadow, 10%, 25%, 50%, 100%) per critical workflow.
2. Set numeric gates (sample size, score thresholds, rollback deltas) in policy.
3. Wire policy simulation and workflow dry-run into your release pipeline.
4. Require explicit approval for high-risk promotion steps.
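The ordering of those steps can be sketched as one gate function. The callables stand in for the policy-simulation, dry-run, approval, and promotion calls; the contract here is the check order and fail-fast behavior, not any specific SDK:

```python
from typing import Callable

def release_gate(simulate: Callable[[], bool],
                 dry_run: Callable[[], bool],
                 approve: Callable[[], bool],
                 promote: Callable[[], None],
                 high_risk: bool) -> str:
    """Run pre-promotion checks in order; stop at the first failure.

    Each callable returns True on success. Promotion only fires after
    simulation, dry-run, and (for high-risk stages) explicit approval.
    """
    if not simulate():
        return "blocked:policy"
    if not dry_run():
        return "blocked:dry_run"
    if high_risk and not approve():
        return "blocked:approval"
    promote()
    return "promoted"
```

Keeping approval as the last gate means a human is never asked to sign off on a candidate that would have failed an automated check anyway.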
Continue with AI Agent Policy Simulation and AI Agent Fail-Open vs Fail-Closed.