
AI Agent Chaos Engineering Playbook

Inject failures safely and verify recovery behavior before real outages do it for you.

Guide · 12 min read · Mar 2026
TL;DR
  • Chaos tests without explicit abort criteria are incidents waiting to happen.
  • Agent systems need policy-path failure tests, not only infra-level fault tests.
  • Measure recovery behavior, not just failure behavior.
  • A good experiment has one hypothesis, one blast radius, and one owner.
Hypothesis first

Start with expected steady-state behavior and expected degradation envelope.

Abort discipline

Predefine stop conditions before injecting any failure.

Policy-aware validation

Verify governance behavior under stress, not only request throughput.

Scope

This guide targets autonomous AI agent control planes with queue-based dispatch, policy checks, and distributed recovery components.

The production problem

Many teams run chaos testing as a demo event. They inject one fault, watch a dashboard, and call it resilience.

Autonomous agent systems fail differently. Policy dependencies can degrade safely or unsafely. Replay paths can recover or duplicate side effects. Lock behavior can protect consistency or silently drift.

A useful chaos program must test these behaviors explicitly with measurable success and abort criteria.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Prescriptive Guidance: Chaos engineering on AWS | Strong workflow for planning, scoping, and running safe experiments. | No pre-dispatch governance or policy fail-mode validation for AI agent execution paths. |
| Google Cloud Blog: Getting started with chaos engineering | Clear intro to steady-state hypotheses and progressive fault injection. | No queue replay integrity checks for autonomous workflows with at-least-once delivery. |
| Principles of Chaos Engineering | Foundational method: define steady state, vary real-world events, minimize blast radius. | No practical mapping to policy-denied/deferred/quarantined agent outcomes. |

Experiment model

Keep experiments narrow. One fault class, one hypothesis, one blast radius. Broader experiments make results hard to attribute.

| Experiment class | Injected fault | Hypothesis | Abort guard |
| --- | --- | --- | --- |
| Policy dependency outage | Safety kernel unavailable for 5-10 minutes | Jobs requeue safely (fail-closed default) without unsafe bypass | User-facing critical workflow misses SLO for >10 minutes |
| Worker capacity exhaustion | Temporarily remove one worker pool | Retry/backoff absorbs pressure without infinite hot loops | Failed completion ratio > 10% for 10 minutes |
| Scheduler lock contention | Inject Redis latency/lock acquisition stress | Single-writer reconciler behavior remains consistent | Stale jobs > 50 and rising for 15 minutes |
| Output policy noise spike | Introduce synthetic high-risk outputs in test traffic | Quarantine path catches outputs without full pipeline collapse | Quarantine rate > 1/s for >10 minutes in mixed workload |
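
The one-fault, one-hypothesis, one-guard discipline above can be enforced mechanically by a small experiment driver that injects exactly one fault, watches one abort guard, and always rolls back. A minimal Bash sketch; `inject_fault`, `rollback`, and `guard_breached` are hypothetical hooks the operator supplies, stubbed here so the control loop can be exercised:

```shell
#!/usr/bin/env bash
# One-fault / one-hypothesis / one-guard experiment driver (sketch).
# The three hooks below are hypothetical stand-ins for real injection,
# rollback, and guard-evaluation commands.
set -euo pipefail

DURATION_S="${DURATION_S:-600}"   # 10-minute experiment window
POLL_S="${POLL_S:-30}"            # guard evaluation interval

inject_fault()   { echo "inject: one fault class only"; }
rollback()       { echo "rollback: restore steady state"; }
guard_breached() { return 1; }    # stub: guard never fires in this sketch

run_experiment() {
  local elapsed=0 result="completed"
  inject_fault
  while (( elapsed < DURATION_S )); do
    if guard_breached; then result="aborted"; break; fi
    sleep "$POLL_S"
    elapsed=$(( elapsed + POLL_S ))
  done
  rollback   # rollback runs on both completion and abort
  echo "experiment ${result} after ${elapsed}s"
}
```

The point of the structure is that rollback is unconditional: whether the guard fires or the window expires, the system is restored before anyone interprets results.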

Cordum runtime mapping

| Mechanism | Current behavior | Why it matters |
| --- | --- | --- |
| Retry envelope | Max scheduling retries is 50 with backoff 1s-30s (`retryDelayNoWorkers` 2s) | Defines expected failure amplification behavior during capacity chaos tests. |
| Policy fail mode | `POLICY_CHECK_FAIL_MODE` defaults to `closed` | Chaos tests should verify the safe degradation path when the policy dependency fails. |
| Steady-state latency guard | Dispatch p99 warning threshold is 1s | Useful fast signal for experiment abort or rollback. |
| Consistency debt guard | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Measures whether the system is recovering safely after injected faults. |
| Governance behavior signal | `cordum_safety_unavailable_total` and `cordum_output_policy_quarantined_total` | Verifies that governance controls remain visible and measurable under stress. |
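
The retry envelope bounds how long a job can sit in requeue during a capacity experiment, which in turn sets the minimum observation window. A back-of-the-envelope check, assuming the delay doubles from the 1s floor up to the 30s cap (the doubling curve is an assumption; only the 1s-30s range and the 50-retry cap come from the configuration above):

```shell
#!/usr/bin/env bash
# Worst-case requeue window for 50 retries, assuming exponential doubling
# from 1s capped at 30s. The curve is an assumption; the 1s-30s range and
# the 50-retry maximum come from the runtime configuration.
set -euo pipefail

base=1 cap=30 max_retries=50
delay=$base total=0
for (( i = 1; i <= max_retries; i++ )); do
  total=$(( total + delay ))
  delay=$(( delay * 2 ))
  if (( delay > cap )); then delay=$cap; fi
done
echo "worst-case requeue window: ${total}s (~$(( total / 60 )) min)"
```

Under these assumptions the window is 1381s, roughly 23 minutes: a capacity experiment that only observes for 10 minutes can declare success before the retry envelope has fully played out.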

Implementation examples

Chaos experiment plan (YAML)

chaos-experiment.yaml
YAML
experiment:
  name: safety-kernel-unavailable
  environment: staging-prod-mirror
  duration: 10m
  blast_radius:
    tenants: ["internal-test-tenant"]
    topics: ["job.remediation.execute"]
    traffic_share_percent: 5
  hypothesis:
    steady_state:
      dispatch_p99_seconds: "<= 1.0"
      failed_ratio_5m: "<= 0.10"
    degraded_state:
      safety_unavailable_rate_5m: "> 0"
      unsafe_dispatch_count: "== 0"
  abort_guards:
    - metric: dispatch_p99_seconds
      condition: "> 2.0 for 5m"
    - metric: failed_ratio_5m
      condition: "> 0.15 for 5m"
  rollback:
    - restore_safety_kernel
    - verify_reconciler_lock
    - confirm_orphan_replay_progress
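
The `for 5m` clauses in `abort_guards` mean the threshold must hold continuously, not merely spike once. One way a runner can implement that semantics (a sketch, assuming it samples the metric on a fixed interval, e.g. every 30s) is a consecutive-breach counter:

```shell
#!/usr/bin/env bash
# Consecutive-breach counter for "X for 5m" abort guards (sketch).
# sustained_breach succeeds only if the most recent `needed` samples all
# exceeded the threshold; any sub-threshold sample resets the streak.
set -euo pipefail

sustained_breach() {
  local threshold=$1 needed=$2; shift 2
  local streak=0 sample
  for sample in "$@"; do
    # awk handles the float comparison; exit 0 means "breached"
    if awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s > t) }'; then
      streak=$(( streak + 1 ))
    else
      streak=0
    fi
  done
  (( streak >= needed ))
}
```

For example, with 30s samples a 5m guard needs 10 consecutive breaches: `sustained_breach 2.0 10 <samples...>`. A single healthy sample resets the streak, which is exactly why a one-off latency spike should not abort the experiment.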

Safety-kernel outage injection script (Bash)

chaos-safety-kernel.sh
Bash
#!/usr/bin/env bash
set -euo pipefail

# Safety: this example is for staging only; refuse anything else
NAMESPACE="${NAMESPACE:-cordum-staging}"
case "$NAMESPACE" in
  *staging*) ;;
  *) echo "refusing to run outside a staging namespace: $NAMESPACE" >&2; exit 1 ;;
esac

echo "[1/4] scale safety kernel down"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=0

echo "[2/4] wait 180s and sample key metrics"
sleep 180
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_safety_unavailable_total[5m])' | jq .

echo "[3/4] restore safety kernel"
kubectl -n "$NAMESPACE" scale deploy/cordum-safety-kernel --replicas=2

echo "[4/4] verify reconciler lock and replay activity"
redis-cli GET "cordum:reconciler:default"
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=rate(cordum_scheduler_orphan_replayed_total[5m])' | jq .
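
The final step above only eyeballs the replay rate with `jq .`; for an automated pass/fail, the same response can be asserted on directly. A sketch of that check, assuming the standard Prometheus v1 instant-query JSON shape (read from stdin):

```shell
#!/usr/bin/env bash
# Assert that orphan replay is progressing, given a Prometheus instant-query
# response on stdin. Succeeds only if the first sample's value is > 0.
set -euo pipefail

replay_progressing() {
  local value
  # Standard v1 shape: .data.result[0].value == [timestamp, "value"];
  # default to "0" when the result set is empty.
  value=$(jq -r '.data.result[0].value[1] // "0"')
  awk -v v="$value" 'BEGIN { exit !(v > 0) }'
}
```

Wired into the script: `curl -sG ... | replay_progressing || echo "replay stalled" >&2`.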

PromQL abort guards

chaos-abort-guards.promql
PromQL
# Abort guard A: dispatch latency runaway
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 2

# Abort guard B: failed ratio runaway
(
  rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
) > 0.15

# Abort guard C: stale jobs runaway
cordum_scheduler_stale_jobs > 50
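
To get paged when a guard fires instead of watching a dashboard, the same expressions can be wrapped as Prometheus alerting rules, where the `for:` field provides the sustained-breach semantics natively. A sketch; the rule names and the `severity` label are illustrative, not part of the runtime:

```yaml
groups:
  - name: chaos-abort-guards
    rules:
      - alert: ChaosAbortDispatchLatency
        expr: histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: abort-experiment
        annotations:
          summary: "Abort guard A: dispatch p99 above 2s for 5m"
      - alert: ChaosAbortFailedRatio
        expr: |
          (
            rate(cordum_jobs_completed_total{status="failed"}[5m])
            / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
          ) > 0.15
        for: 5m
        labels:
          severity: abort-experiment
        annotations:
          summary: "Abort guard B: failed ratio above 15% for 5m"
```

Routing `severity: abort-experiment` to the experiment owner's pager keeps the abort decision with the single owner the experiment model calls for.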

Limitations and tradeoffs

  • Frequent experiments improve confidence but consume on-call and platform capacity.
  • Highly synthetic tests can miss real multi-factor failure chains.
  • Tight abort guards improve safety but may stop experiments before useful data appears.
  • Testing in production-like staging reduces risk but cannot perfectly mirror real tenant behavior.

Next step

Run this in one sprint:

  1. Pick one experiment class with 5% traffic blast radius and explicit abort guards.
  2. Record steady-state baseline for one week before injection.
  3. Run one 10-minute controlled experiment and collect recovery evidence.
  4. Convert one failure finding into a tracked corrective action with due date and metric target.

Continue with AI Agent Incident Response Runbook and AI Agent Blameless Postmortem Template.

Chaos engineering is a safety skill

If experiments do not improve incident response and recovery quality, they are load tests with better branding.