
AI Agent Chaos Engineering Playbook

Inject failures safely and verify recovery behavior before real outages do it for you.

Guide · 12 min read · Mar 2026
TL;DR
  • Chaos tests without explicit abort criteria are incidents waiting to happen.
  • Agent systems need policy-path failure tests, not only infra-level fault tests.
  • Measure recovery behavior, not just failure behavior.
  • A good experiment has one hypothesis, one blast radius, and one owner.
Hypothesis first

Start with expected steady-state behavior and expected degradation envelope.

Abort discipline

Predefine stop conditions before injecting any failure.

Policy-aware validation

Verify governance behavior under stress, not only request throughput.

Scope

This guide targets autonomous AI agent control planes with queue-based dispatch, policy checks, and distributed recovery components.

The production problem

Many teams run chaos testing as a demo event. They inject one fault, watch a dashboard, and call it resilience.

Autonomous agent systems fail differently. Policy dependencies can degrade safely or unsafely. Replay paths can recover or duplicate side effects. Lock behavior can protect consistency or silently drift.

A useful chaos program must test these behaviors explicitly with measurable success and abort criteria.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Prescriptive Guidance: Chaos engineering on AWS | Strong workflow for planning, scoping, and running safe experiments. | No pre-dispatch governance or policy fail-mode validation for AI agent execution paths. |
| Google Cloud Blog: Getting started with chaos engineering | Clear intro to steady-state hypotheses and progressive fault injection. | No queue replay integrity checks for autonomous workflows with at-least-once delivery. |
| Principles of Chaos Engineering | Foundational method: define steady state, vary real-world events, minimize blast radius. | No practical mapping to policy-denied/deferred/quarantined agent outcomes. |

Experiment model

Keep experiments narrow. One fault class, one hypothesis, one blast radius. Broader experiments make results hard to attribute.

| Experiment class | Injected fault | Hypothesis | Abort guard |
| --- | --- | --- | --- |
| Policy dependency outage | Safety kernel unavailable for 5-10 minutes | Jobs requeue safely (fail-closed default) without unsafe bypass | User-facing critical workflow misses SLO for >10 minutes |
| Worker capacity exhaustion | Temporarily remove one worker pool | Retry/backoff absorbs pressure without infinite hot loops | Failed completion ratio > 10% for 10 minutes |
| Scheduler lock contention | Inject Redis latency/lock acquisition stress | Single-writer reconciler behavior remains consistent | Stale jobs > 50 and rising for 15 minutes |
| Output policy noise spike | Introduce synthetic high-risk outputs in test traffic | Quarantine path catches outputs without full pipeline collapse | Quarantine rate > 1/s for >10 minutes in mixed workload |
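
The one-fault, one-hypothesis, one-guard discipline above can be enforced mechanically by a small experiment driver that injects exactly one fault, watches one abort guard, and always rolls back. A minimal Bash sketch; `inject_fault`, `rollback`, and `guard_breached` are hypothetical hooks the operator supplies, stubbed here so the control loop can be exercised:

```shell
#!/usr/bin/env bash
# One-fault / one-hypothesis / one-guard experiment driver (sketch).
# The three hooks below are hypothetical stand-ins for real injection,
# rollback, and guard-evaluation commands.
set -euo pipefail

DURATION_S="${DURATION_S:-600}"   # 10-minute experiment window
POLL_S="${POLL_S:-30}"            # guard evaluation interval

inject_fault()   { echo "inject: one fault class only"; }
rollback()       { echo "rollback: restore steady state"; }
guard_breached() { return 1; }    # stub: guard never fires in this sketch

run_experiment() {
  local elapsed=0 result="completed"
  inject_fault
  while (( elapsed < DURATION_S )); do
    if guard_breached; then result="aborted"; break; fi
    sleep "$POLL_S"
    elapsed=$(( elapsed + POLL_S ))
  done
  rollback   # rollback runs on both completion and abort
  echo "experiment ${result} after ${elapsed}s"
}
```

The point of the structure is that rollback is unconditional: whether the guard fires or the window expires, the system is restored before anyone interprets results.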

Cordum runtime mapping

| Mechanism | Current behavior | Why it matters |
| --- | --- | --- |
| Retry envelope | Max scheduling retries is 50 with backoff 1s-30s (`retryDelayNoWorkers` 2s) | Defines expected failure amplification behavior during capacity chaos tests. |
| Policy fail mode | `POLICY_CHECK_FAIL_MODE` defaults to `closed` | Chaos tests should verify the safe degradation path when the policy dependency fails. |
| Steady-state latency guard | Dispatch p99 warning threshold is 1s | Useful fast signal for experiment abort or rollback. |
| Consistency debt guard | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Measures whether the system is recovering safely after injected faults. |
| Governance behavior signal | `cordum_safety_unavailable_total` and `cordum_output_policy_quarantined_total` | Verifies that governance controls remain visible and measurable under stress. |
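
The retry envelope bounds how long a job can sit in requeue during a capacity experiment, which in turn sets the minimum observation window. A back-of-the-envelope check, assuming the delay doubles from the 1s floor up to the 30s cap (the doubling curve is an assumption; only the 1s-30s range and the 50-retry cap come from the configuration above):

```shell
#!/usr/bin/env bash
# Worst-case requeue window for 50 retries, assuming exponential doubling
# from 1s capped at 30s. The curve is an assumption; the 1s-30s range and
# the 50-retry maximum come from the runtime configuration.
set -euo pipefail

base=1 cap=30 max_retries=50
delay=$base total=0
for (( i = 1; i <= max_retries; i++ )); do
  total=$(( total + delay ))
  delay=$(( delay * 2 ))
  if (( delay > cap )); then delay=$cap; fi
done
echo "worst-case requeue window: ${total}s (~$(( total / 60 )) min)"
```

Under these assumptions the window is 1381s, roughly 23 minutes: a capacity experiment that only observes for 10 minutes can declare success before the retry envelope has fully played out.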

Implementation examples

Chaos experiment plan (YAML)

chaos-experiment.yaml
YAML
experiment:
  name: safety-kernel-unavailable
  environment: staging-prod-mirror
  duration: 10m
  blast_radius:
    tenants: ["internal-test-tenant"]
    topics: ["job.remediation.execute"]
    traffic_share_percent: 5
  hypothesis:
    steady_state:
      dispatch_p99_seconds: "<= 1.0"
      failed_ratio_5m: "<= 0.10"
    degraded_state:
      safety_unavailable_rate_5m: "> 0"
      unsafe_dispatch_count: "== 0"
  abort_guards:
    - metric: dispatch_p99_seconds
      condition: "> 2.0 for 5m"
    - metric: failed_ratio_5m
      condition: "> 0.15 for 5m"
  rollback:
    - restore_safety_kernel
    - verify_reconciler_lock
    - confirm_orphan_replay_progress
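
The `for 5m` clauses in `abort_guards` mean the threshold must hold continuously, not merely spike once. One way a runner can implement that semantics (a sketch, assuming it samples the metric on a fixed interval, e.g. every 30s) is a consecutive-breach counter:

```shell
#!/usr/bin/env bash
# Consecutive-breach counter for "X for 5m" abort guards (sketch).
# sustained_breach succeeds only if the most recent `needed` samples all
# exceeded the threshold; any sub-threshold sample resets the streak.
set -euo pipefail

sustained_breach() {
  local threshold=$1 needed=$2; shift 2
  local streak=0 sample
  for sample in "$@"; do
    # awk handles the float comparison; exit 0 means "breached"
    if awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }'; then
      streak=$(( streak + 1 ))
    else
      streak=0
    fi
  done
  (( streak >= needed ))
}
```

For example, with 30s samples a 5m guard needs 10 consecutive breaches: `sustained_breach 2.0 10 <samples...>`. A single healthy sample resets the streak, which is exactly why a one-off latency spike should not abort the experiment.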

Safety-kernel outage injection script (Bash)

chaos-safety-kernel.sh
Bash
#!/usr/bin/env bash
set -euo pipefail

# Safety: this example is for staging only; refuse anything else
NAMESPACE="${NAMESPACE:-cordum-staging}"
case "$NAMESPACE" in
  *staging*) ;;
  *) echo "refusing to run outside a staging namespace: $NAMESPACE" >&2; exit 1 ;;
esac

echo "[1/4] scale safety kernel down"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=0

echo "[2/4] wait 180s and sample key metrics"
sleep 180
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_safety_unavailable_total[5m])' | jq .

echo "[3/4] restore safety kernel"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=2

echo "[4/4] verify reconciler lock and replay activity"
redis-cli GET "cordum:reconciler:default"
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_scheduler_orphan_replayed_total[5m])' | jq .
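
The final step above only eyeballs the replay rate with `jq .`; for an automated pass/fail, the same response can be asserted on directly. A sketch of that check, assuming the standard Prometheus v1 instant-query JSON shape (read from stdin):

```shell
#!/usr/bin/env bash
# Assert that orphan replay is progressing, given a Prometheus instant-query
# response on stdin. Succeeds only if the first sample's value is > 0.
set -euo pipefail

replay_progressing() {
  local value
  # Standard v1 shape: .data.result[0].value == [timestamp, "value"];
  # default to "0" when the result set is empty.
  value=$(jq -r '.data.result[0].value[1] // "0"')
  awk -v v="$value" 'BEGIN { exit !(v > 0) }'
}
```

Wired into the script: `curl -sG ... | replay_progressing || echo "replay stalled" >&2`.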

PromQL abort guards

chaos-abort-guards.promql
PromQL
# Abort guard A: dispatch latency runaway
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 2

# Abort guard B: failed ratio runaway
(
  rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
) > 0.15

# Abort guard C: stale jobs runaway
cordum_scheduler_stale_jobs > 50
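
To get paged when a guard fires instead of watching a dashboard, the same expressions can be wrapped as Prometheus alerting rules, where the `for:` field provides the sustained-breach semantics natively. A sketch; the rule names and the `severity` label are illustrative, not part of the runtime:

```yaml
groups:
  - name: chaos-abort-guards
    rules:
      - alert: ChaosAbortDispatchLatency
        expr: histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: abort-experiment
        annotations:
          summary: "Abort guard A: dispatch p99 above 2s for 5m"
      - alert: ChaosAbortFailedRatio
        expr: |
          (
            rate(cordum_jobs_completed_total{status="failed"}[5m])
            / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
          ) > 0.15
        for: 5m
        labels:
          severity: abort-experiment
        annotations:
          summary: "Abort guard B: failed ratio above 15% for 5m"
```

Routing `severity: abort-experiment` to the experiment owner's pager keeps the abort decision with the single owner the experiment model calls for.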

Limitations and tradeoffs

  • Frequent experiments improve confidence but consume on-call and platform capacity.
  • Highly synthetic tests can miss real multi-factor failure chains.
  • Tight abort guards improve safety but may stop experiments before useful data appears.
  • Testing in production-like staging reduces risk but cannot perfectly mirror real tenant behavior.

Next step

Run this in one sprint:

  1. Pick one experiment class with 5% traffic blast radius and explicit abort guards.
  2. Record steady-state baseline for one week before injection.
  3. Run one 10-minute controlled experiment and collect recovery evidence.
  4. Convert one failure finding into a tracked corrective action with due date and metric target.

Continue with AI Agent Incident Response Runbook and AI Agent Blameless Postmortem Template.

Chaos engineering is a safety skill

If experiments do not improve incident response and recovery quality, they are load tests with better branding.