## The production problem
Agent releases often fail for one reason: teams validate prompts, then skip runtime rollout discipline. The first real traffic spike becomes the experiment.
Shadow traffic catches some logic regressions. It does not validate irreversible side effects, approval latency, or policy-deny drift under live constraints.
Canary without measurable gates has the same flaw. You get staged exposure, but no deterministic promotion rule.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Argo Rollouts Canary | Concrete canary mechanics (`setWeight`, `pause`, surge/unavailable controls, and optional traffic routing). | No policy-risk decision layer for autonomous agent side effects. |
| AWS CodeDeploy TimeBasedCanary | Precise canary parameters (`CanaryPercentage`, `CanaryInterval`) for phased rollout timing. | No guidance for agent-specific approval and replay controls during promotion. |
| Spinnaker Canary Best Practices | Useful scoring discipline: 3-hour canary windows, 1-hour intervals, >=50 data points, and score thresholds. | No control-plane model for policy simulation and safety-gated autonomous execution. |
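The Spinnaker-style scoring discipline above reduces to a small decision function. A minimal sketch, using the thresholds cited in the table (score >= 95 promotes, 75-94 holds for manual review, below 75 rolls back, with at least 50 data points per metric); the function name and signature are illustrative:

```python
def canary_decision(score: float, data_points: int,
                    pass_threshold: float = 95.0,
                    marginal_threshold: float = 75.0,
                    min_data_points: int = 50) -> str:
    """Map a canary score to a promotion decision.

    Returns one of "promote", "manual_hold", or "rollback".
    Thresholds follow the Spinnaker guidance cited above.
    """
    if data_points < min_data_points:
        # Too few samples to trust the score: hold rather than promote.
        return "manual_hold"
    if score >= pass_threshold:
        return "promote"
    if score >= marginal_threshold:
        return "manual_hold"
    return "rollback"
```

The key design point is that an under-sampled score is treated as a hold, never a pass: a high score on thin data is noise, not evidence.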
## Rollout model
Promotion should be a state machine with explicit entry and exit conditions. If a stage fails, rollback must be mechanical.
| Stage | Traffic profile | Promotion gate | Rollback trigger |
|---|---|---|---|
| Shadow stage | 0% user-facing actions, mirrored evaluation workload | Policy simulation pass + no critical schema/output violations | Any safety deny spike or parser failure over baseline |
| Canary stage 1 | 5-10% live traffic | Score >=95 or manual hold if 75-94 with >=50 data points | Error rate delta >2x baseline or approval backlog breach |
| Canary stage 2 | 25-50% live traffic | Stable latency + policy deny rate within allowed threshold | Sustained overload reason codes (`pool_overloaded`, `no_workers`) |
| Full promotion | 100% | No critical incidents through one full business cycle | Immediate rerun-from-step/rollback workflow on severe regression |
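The stage table above can be sketched as an explicit state machine with mechanical transitions. This is a sketch, not a definitive implementation: stage names and numeric thresholds come from the table, while the metric keys and the gate/rollback callables are illustrative stubs:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Stage:
    name: str
    traffic_pct: int
    gate: Callable[[dict], bool]      # promotion gate: entry condition for next stage
    rollback: Callable[[dict], bool]  # rollback trigger: mechanical, no judgment call

def advance(stages: list[Stage], current: int, metrics: dict) -> tuple[int, str]:
    """Evaluate the current stage against metrics; return (new_index, action)."""
    stage = stages[current]
    if stage.rollback(metrics):
        return 0, "rollback"  # mechanical rollback: back to shadow
    if stage.gate(metrics) and current + 1 < len(stages):
        return current + 1, "promote"
    return current, "hold"

# Illustrative gates using the numeric thresholds from the table above.
stages = [
    Stage("shadow", 0,
          gate=lambda m: m["policy_sim_pass"] and m["schema_violations"] == 0,
          rollback=lambda m: m["deny_spike"]),
    Stage("canary_1", 10,
          gate=lambda m: m["score"] >= 95 and m["data_points"] >= 50,
          rollback=lambda m: m["error_rate_delta"] > 2.0),
    Stage("canary_2", 50,
          gate=lambda m: m["latency_stable"] and m["deny_rate"] <= 0.03,
          rollback=lambda m: m["overloaded"]),
    Stage("full", 100,
          gate=lambda m: False,  # terminal stage: nothing to promote to
          rollback=lambda m: m["critical_incident"]),
]
```

Because every transition is a pure function of metrics, the same inputs always produce the same promotion decision, which is the deterministic promotion rule the canary-without-gates approach lacks.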
## Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Pre-deploy policy validation | `POST /api/v1/policy/simulate` and `POST /api/v1/policy/bundles/{id}/simulate` | Canary candidates can be evaluated without side effects before traffic shift. |
| Workflow dry-run | `POST /api/v1/workflows/{id}/dry-run` | Rollout steps can be tested with environment context before live dispatch. |
| Safe rerun path | Workflow supports rerun-from-step and dry-run mode | Rollback and corrective rollout can resume from known safe boundaries. |
| Run idempotency | Runs support `Idempotency-Key` on creation | Promotion retries do not create duplicate rollout runs. |
| Approval gate support | Unified approvals endpoint for workflow and policy approvals | High-risk rollout stages can require explicit human authorization. |
## Implementation examples

### Canary traffic stages (YAML)
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 20m }
        - setWeight: 25
        - pause: { duration: 30m }
        - setWeight: 50
        - pause: { duration: 30m }
```

### Promotion scoring policy (YAML)
```yaml
rollout_gates:
  min_data_points_per_metric: 50
  canary_lifetime_hours: 3
  metric_interval_hours: 1
  score:
    pass: 95
    marginal: 75
  rollback_triggers:
    max_error_rate_delta: 2.0
    max_policy_deny_rate: 0.03
```

### Cordum pre-promotion checks (bash)
```bash
# 1) Simulate policy on rollout candidate
curl -sS -X POST http://localhost:8081/api/v1/policy/simulate \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"topic":"job.prod.deploy","tenant":"default"}'

# 2) Dry-run workflow before canary shift
curl -sS -X POST http://localhost:8081/api/v1/workflows/WF_ID/dry-run \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"input":{"release":"v1.8.0"},"environment":"staging"}'
```

## Limitations and tradeoffs
- Shadow traffic increases infra cost and does not fully validate side-effect safety.
- Slower canary stages reduce blast radius but delay feature delivery.
- Strict promotion gates reduce bad releases but may produce false positives on noisy metrics.
- Manual approval steps reduce risk but add coordination latency.
## Next step
Run this in one sprint:
1. Define rollout stages (shadow, 10%, 25%, 50%, 100%) per critical workflow.
2. Set numeric gates (sample size, score thresholds, rollback deltas) in policy.
3. Wire policy simulation and workflow dry-run into your release pipeline.
4. Require explicit approval for high-risk promotion steps.
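The ordering of those steps can be sketched as one gate function. The callables stand in for the policy-simulation, dry-run, approval, and promotion calls; the contract here is the check order and fail-fast behavior, not any specific SDK:

```python
from typing import Callable

def release_gate(simulate: Callable[[], bool],
                 dry_run: Callable[[], bool],
                 approve: Callable[[], bool],
                 promote: Callable[[], None],
                 high_risk: bool) -> str:
    """Run pre-promotion checks in order; stop at the first failure.

    Each callable returns True on success. Promotion only fires after
    simulation, dry-run, and (for high-risk stages) explicit approval.
    """
    if not simulate():
        return "blocked:policy"
    if not dry_run():
        return "blocked:dry_run"
    if high_risk and not approve():
        return "blocked:approval"
    promote()
    return "promoted"
```

Keeping approval as the last gate means a human is never asked to sign off on a candidate that would have failed an automated check anyway.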
Continue with AI Agent Policy Simulation and AI Agent Fail-Open vs Fail-Closed.