The production problem
Many teams run chaos testing as a demo event. They inject one fault, watch a dashboard, and call it resilience.
Autonomous agent systems fail differently. Policy dependencies can degrade safely or unsafely. Replay paths can recover or duplicate side effects. Lock behavior can protect consistency or silently drift.
A useful chaos program must test these behaviors explicitly with measurable success and abort criteria.
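The replay risk above is concrete: with at-least-once delivery, a replayed message re-executes its side effect unless the consumer deduplicates. A minimal sketch of that check, where `SeenStore` and `handle` are illustrative names (a real system would use a durable store such as Redis `SETNX`, not an in-memory set):

```python
# Sketch: idempotency-key dedup so replays do not duplicate side effects.
# SeenStore and handle are hypothetical names, not a specific runtime's API.

class SeenStore:
    """In-memory stand-in for a durable idempotency-key store."""
    def __init__(self):
        self._seen = set()

    def first_time(self, key: str) -> bool:
        """Record the key; return True only on its first appearance."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True

def handle(store: SeenStore, job_id: str, side_effects: list) -> str:
    """Process one delivery; a replay becomes a no-op instead of a duplicate."""
    if not store.first_time(job_id):
        return "duplicate-suppressed"
    side_effects.append(job_id)  # the real side effect would run here
    return "executed"
```

A replay-path chaos experiment then has a crisp pass criterion: after injected redelivery, the side-effect count stays equal to the unique-job count.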
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Prescriptive Guidance: Chaos engineering on AWS | Strong workflow for planning, scoping, and running safe experiments. | No pre-dispatch governance or policy fail-mode validation for AI agent execution paths. |
| Google Cloud Blog: Getting started with chaos engineering | Clear intro to steady-state hypotheses and progressive fault injection. | No queue replay integrity checks for autonomous workflows with at-least-once delivery. |
| Principles of Chaos Engineering | Foundational method: define steady state, vary real-world events, minimize blast radius. | No practical mapping to policy-denied/deferred/quarantined agent outcomes. |
Experiment model
Keep experiments narrow. One fault class, one hypothesis, one blast radius. Broader experiments make results hard to attribute.
| Experiment class | Injected fault | Hypothesis | Abort guard |
|---|---|---|---|
| Policy dependency outage | Safety kernel unavailable for 5-10 minutes | Jobs requeue safely (fail-closed default) without unsafe bypass | User-facing critical workflow misses SLO for >10 minutes |
| Worker capacity exhaustion | Temporarily remove one worker pool | Retry/backoff absorbs pressure without infinite hot loops | Failed completion ratio > 10% for 10 minutes |
| Scheduler lock contention | Inject Redis latency/lock acquisition stress | Single-writer reconciler behavior remains consistent | Stale jobs > 50 and rising for 15 minutes |
| Output policy noise spike | Introduce synthetic high-risk outputs in test traffic | Quarantine path catches outputs without full pipeline collapse | Quarantine rate > 1/s for >10 minutes in mixed workload |
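Each abort guard above is a threshold plus a hold duration: the experiment aborts only when the metric stays past the threshold for the whole window, not on a single spike. A minimal sketch of that evaluation, assuming the harness samples the metric once per fixed interval (function name is illustrative):

```python
def guard_tripped(samples, threshold, hold_samples):
    """Abort when the metric exceeds threshold for hold_samples consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0  # reset on any recovery
        if run >= hold_samples:
            return True
    return False
```

With one-minute sampling, the guard "dispatch p99 > 2.0 for 5m" becomes `guard_tripped(samples, 2.0, 5)`; a brief spike that recovers resets the counter and keeps the experiment running.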
Cordum runtime mapping
| Implication | Current behavior | Why it matters |
|---|---|---|
| Retry envelope | Max scheduling retries is 50 with backoff 1s-30s (`retryDelayNoWorkers` 2s) | Defines expected failure amplification behavior during capacity chaos tests. |
| Policy fail mode | `POLICY_CHECK_FAIL_MODE` defaults to `closed` | Chaos tests should verify safe degradation path when policy dependency fails. |
| Steady-state latency guard | Dispatch p99 warning threshold is 1s | Useful fast signal for experiment abort or rollback. |
| Consistency debt guard | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Measures whether the system is recovering safely after injected faults. |
| Governance behavior signal | `cordum_safety_unavailable_total` and `cordum_output_policy_quarantined_total` | Verifies that governance controls remain visible and measurable under stress. |
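The fail-closed default in the table has a simple decision shape: if the policy dependency cannot answer, the job is requeued rather than dispatched. A sketch under that assumption; the `fail_mode` values mirror the `POLICY_CHECK_FAIL_MODE` semantics described above, but the function names are hypothetical:

```python
def decide(policy_check, job, fail_mode="closed"):
    """Return 'dispatch', 'requeue', or 'deny' for one job.

    policy_check is a callable returning True/False, raising
    ConnectionError when the policy dependency is unreachable.
    """
    try:
        allowed = policy_check(job)
    except ConnectionError:
        # Dependency down: fail-closed requeues; fail-open dispatches unchecked.
        return "requeue" if fail_mode == "closed" else "dispatch"
    return "dispatch" if allowed else "deny"
```

The policy-outage experiment then reduces to asserting that, with the dependency down, every decision is `requeue` and `unsafe_dispatch_count` stays at zero.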
Implementation examples
Chaos experiment plan (YAML)
```yaml
experiment:
  name: safety-kernel-unavailable
  environment: staging-prod-mirror
  duration: 10m
  blast_radius:
    tenants: ["internal-test-tenant"]
    topics: ["job.remediation.execute"]
    traffic_share_percent: 5
  hypothesis:
    steady_state:
      dispatch_p99_seconds: "<= 1.0"
      failed_ratio_5m: "<= 0.10"
    degraded_state:
      safety_unavailable_rate_5m: "> 0"
      unsafe_dispatch_count: "== 0"
  abort_guards:
    - metric: dispatch_p99_seconds
      condition: "> 2.0 for 5m"
    - metric: failed_ratio_5m
      condition: "> 0.15 for 5m"
  rollback:
    - restore_safety_kernel
    - verify_reconciler_lock
    - confirm_orphan_replay_progress
```

Safety-kernel outage injection script (Bash)
```bash
#!/usr/bin/env bash
set -euo pipefail

# Example: staging only
NAMESPACE="${NAMESPACE:-cordum-staging}"

echo "[1/4] scale safety kernel down"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=0

echo "[2/4] wait 180s and sample key metrics"
sleep 180
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_safety_unavailable_total[5m])' | jq .

echo "[3/4] restore safety kernel"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=2

echo "[4/4] verify reconciler lock and replay activity"
redis-cli GET "cordum:reconciler:default"
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_scheduler_orphan_replayed_total[5m])' | jq .
```

PromQL abort guards
```promql
# Abort guard A: dispatch latency runaway
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 2

# Abort guard B: failed ratio runaway
(
  rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
) > 0.15

# Abort guard C: stale jobs runaway
cordum_scheduler_stale_jobs > 50
```

Limitations and tradeoffs
- Frequent experiments improve confidence but consume on-call and platform capacity.
- Highly synthetic tests can miss real multi-factor failure chains.
- Tight abort guards improve safety but may stop experiments before useful data appears.
- Testing in production-like staging reduces risk but cannot perfectly mirror real tenant behavior.
Next step
Run this in one sprint:
1. Pick one experiment class with a 5% traffic blast radius and explicit abort guards.
2. Record a steady-state baseline for one week before injection.
3. Run one 10-minute controlled experiment and collect recovery evidence.
4. Convert one failure finding into a tracked corrective action with a due date and metric target.
Continue with AI Agent Incident Response Runbook and AI Agent Blameless Postmortem Template.