
AI Agent PodDisruptionBudget Strategy

If your disruption budget is wrong, maintenance turns into outage rehearsal.

Guide · 11 min read · Apr 2026
TL;DR
  • A PodDisruptionBudget is an availability contract, not a copy-paste YAML snippet.
  • Kubernetes PDBs guard against voluntary disruptions only, and only when evictions go through the Eviction API.
  • For Cordum, stateful quorum services need stricter budgets than stateless API workers.
  • PDB sizing must be validated against lock-takeover and retry behavior, not rollout status alone.
Quorum math

Stateful control-plane services fail hard when budgets ignore consensus thresholds.

Eviction limits

`minAvailable` and `maxUnavailable` define how much disruption automation can apply.

Recovery checks

A safe budget is one that still meets takeover and retry SLOs during maintenance.

Scope

This guide is for platform teams running autonomous AI control planes on Kubernetes with mixed workload tiers: stateless APIs, scheduler services, and quorum-dependent stateful components.

The production problem

Most PDB outages start with good intentions: keep upgrades safe, reduce risk, automate maintenance. Then one bad budget blocks deploys or allows too many evictions at once.

For AI control planes, that mistake is expensive. You are not only protecting HTTP availability. You are protecting lock ownership, scheduler continuity, and message flow.

A wrong budget either freezes operations or quietly accrues reliability debt that surfaces after rollout.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Disruptions docs | Voluntary vs involuntary disruptions and where PDBs apply. | No guidance on lock-backed scheduler recovery behavior for AI control planes. |
| Kubernetes PDB task docs | `minAvailable`/`maxUnavailable` semantics and budget examples. | No service-tier mapping for mixed stateless + quorum stateful architectures. |
| GKE cluster upgrade best practices | Upgrade sequencing guidance and the role of PDBs in limiting voluntary disruption during maintenance. | No control-plane-specific validation for lock handoff, queue drain continuity, and retry surge behavior. |

The gap is service-tier translation: teams need a concrete mapping from budget fields to real control-plane failure modes.

PDB math that matters

Start with failure tolerance by service tier. Then derive budget values. Never start with someone else's YAML.

pdb_budget_math.txt
Text
# Quick budget checks
# N = replicas
# M = minAvailable
# U = maxUnavailable

# Rule A (minAvailable): evictions_allowed = N - M
# Rule B (maxUnavailable): evictions_allowed = U

# Stateful quorum example (3-node cluster):
# majority = floor(3/2)+1 = 2
# choose M >= 2  => at most 1 voluntary eviction

# If one pod is already unhealthy, effective eviction budget shrinks to zero.
# Automation must handle this and defer disruption until health recovers.
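
The controller exposes the same arithmetic on each PDB's status, including the unhealthy-pod shrinkage. A quick way to read it live, using the `cordum-nats-pdb` object from the implementation section below:

pdb_status_check.sh
Bash
# disruptionsAllowed is the API server's live answer to Rule A/B,
# already reduced by any currently unhealthy pods.
kubectl get pdb cordum-nats-pdb -n cordum \
  -o jsonpath='{.status.currentHealthy}/{.status.desiredHealthy} healthy, {.status.disruptionsAllowed} eviction(s) allowed{"\n"}'
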
| Service tier | Budget rule | Risk if wrong | Verification |
| --- | --- | --- | --- |
| Stateless API / gateway | Use PDB `maxUnavailable: 1` for small replica sets; keep rollout waves narrow. | Large simultaneous evictions can spike 5xx and retry traffic. | Watch unavailable replicas and P99 latency through each rollout wave. |
| Scheduler / control workers | Keep at least one active scheduler during voluntary disruption windows. | No active dispatcher means queue backlog growth and delayed lock turnover. | Measure dispatch continuity and lock takeover lag under kill drills. |
| Consensus-backed NATS cluster | Set `minAvailable` to preserve Raft majority (2 of 3 nodes). | Leader loss can halt writes and destabilize event flow. | Check cluster quorum before and during node drain. |
| Redis cluster (3 primary + 3 replica) | Set `minAvailable: 4` to preserve data availability during upgrades. | Losing too many nodes can block writes and break lock coordination. | Check slot coverage, primary health, and lock/read latency during maintenance. |
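
The verification column reduces to a few direct probes. A pre-drain sketch, assuming a `nats` Service exposing the standard 8222 monitoring port and a `redis-0` StatefulSet pod; adjust names to your deployment:

quorum_predrain_check.sh
Bash
# NATS: the /jsz monitoring endpoint reports JetStream meta-cluster (Raft) state.
kubectl port-forward -n cordum svc/nats 8222:8222 &
pf=$!
sleep 2
curl -s http://127.0.0.1:8222/jsz | head -c 400; echo
kill "$pf"

# Redis: confirm cluster health and full slot coverage before draining a node.
kubectl exec -n cordum redis-0 -- redis-cli cluster info | grep -E 'cluster_state|cluster_slots_ok'
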

Cordum baseline values

These values are pulled from current Cordum docs and runtime behavior. Use them as a starting point, then tune with environment-specific load tests.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Application services | Cordum docs use PDB `maxUnavailable: 1` for gateway, scheduler, and other app services. | Limits blast radius of voluntary evictions while keeping rollout progress. |
| NATS StatefulSet | Recommended `minAvailable: 2` out of 3 to keep Raft quorum during updates. | Protects consensus so event transport remains writable during maintenance. |
| Redis StatefulSet | Recommended `minAvailable: 4` out of 6 (3 primary + 3 replica). | Maintains write/data availability and prevents coordination collapse. |
| Graceful shutdown envelope | Services target 15s shutdown; `terminationGracePeriodSeconds: 30` is recommended in docs. | Provides headroom for clean drain before forced pod kill. |
| Forced-kill fallback | Scheduler lock TTL is 60s with renewal every 20s; the surviving replica takes over after expiry. | Puts an upper bound on disruption recovery but can add temporary queue latency. |
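
The takeover row implies an observable bound: a 60s TTL renewed every 20s means a healthy lock holder never lets the remaining TTL drop much below 40s, and a forced kill delays takeover by at most one full TTL. A quick probe, reusing the reconciler lock key from the checklist below and a `redis-0` pod name as assumptions:

lock_ttl_probe.sh
Bash
# Remaining TTL in milliseconds. Healthy renewal keeps this in roughly
# the 40000-60000 range; after a forced kill it counts down to expiry
# (worst case ~60s) before the surviving replica re-acquires the lock.
kubectl exec -n cordum redis-0 -- redis-cli PTTL "cordum:reconciler:default"
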

Implementation examples

Tiered PDB configuration (YAML)

cordum_pdbs.yaml
YAML
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-api-gateway-pdb
  namespace: cordum
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cordum-api-gateway
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-nats-pdb
  namespace: cordum
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-redis-pdb
  namespace: cordum
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: redis

Pre-rollout disruption checklist (Bash)

pdb_rollout_check.sh
Bash
# 1) Check disruption headroom before rollout
kubectl get pdb -n cordum
kubectl describe pdb cordum-api-gateway-pdb -n cordum
kubectl describe pdb cordum-nats-pdb -n cordum
kubectl describe pdb cordum-redis-pdb -n cordum

# 2) Run the rollout and confirm it progresses within PDB constraints
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# 3) Verify lock ownership continuity during restart
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 4) Roll back if disruption error budget is breached
kubectl rollout undo deployment/cordum-api-gateway -n cordum
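
To turn step 3 into a number rather than a spot check, poll the lock during a kill drill; the gap between the killed replica's last value and its successor's first value is the takeover lag. A minimal sketch, reusing the reconciler key above:

lock_takeover_watch.sh
Bash
# Print the lock value once per second during the drill. The silent gap
# between owners is the takeover lag, bounded by the 60s lock TTL.
while true; do
  printf '%s %s\n' "$(date +%T)" "$(redis-cli GET 'cordum:reconciler:default')"
  sleep 1
done
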

Post-maintenance regression signals (PromQL)

pdb_regression.promql
PromQL
# Unavailable replicas during maintenance window
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[15m])

# Pod restart spike can indicate disruption pressure
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[15m])

# Queue lock contention signal (if exported)
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

Limitations and tradeoffs

  • PDBs do not prevent involuntary disruptions like node crashes.
  • PDBs can block maintenance if baseline health is already degraded.
  • Tight budgets improve availability but can slow upgrades and security patching.
  • `kubectl delete pod` and direct workload deletion bypass PDBs entirely.

If disruption automation does not use the Eviction API path, your PDB is advisory text, not a safety control.
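
One way to confirm the Eviction API path end to end is to issue an eviction by hand: when the budget has no headroom, the API server rejects the request instead of deleting the pod. A sketch, with `POD_NAME` as a placeholder:

manual_eviction.sh
Bash
# Build a policy/v1 Eviction request body.
cat > eviction.json <<'EOF'
{
  "apiVersion": "policy/v1",
  "kind": "Eviction",
  "metadata": { "name": "POD_NAME", "namespace": "cordum" }
}
EOF

# POST to the pod's eviction subresource; the API server enforces the PDB.
# A blocked budget fails with "Cannot evict pod as it would violate the
# pod's disruption budget" rather than removing the pod.
kubectl create --raw /api/v1/namespaces/cordum/pods/POD_NAME/eviction -f eviction.json
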

Next step

Run this in one sprint:

  1. Classify every control-plane workload into a stateless, scheduler, or quorum tier.
  2. Define one PDB per tier with explicit rationale and an owner.
  3. Add a pre-rollout gate that fails if current healthy pods leave zero eviction headroom (a sketch follows this list).
  4. Run one forced-kill game day and measure lock takeover plus retry surge.
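
A minimal version of the step-3 gate, assuming every relevant PDB lives in the `cordum` namespace:

pdb_gate.sh
Bash
# Fail fast if any PDB currently allows zero voluntary evictions.
ns=cordum
for pdb in $(kubectl get pdb -n "$ns" -o name); do
  allowed=$(kubectl get "$pdb" -n "$ns" -o jsonpath='{.status.disruptionsAllowed}')
  if [ "${allowed:-0}" -eq 0 ]; then
    echo "BLOCK: $pdb reports no eviction headroom" >&2
    exit 1
  fi
done
echo "OK: every PDB has disruption headroom"
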

Continue with AI Agent Rolling Restart Playbook and AI Agent Graceful Shutdown.

Budgets fail quietly, then loudly

Treat disruption budgets as release policy inputs and verify them in every maintenance window.