The production problem
Most PDB-related outages start with good intentions: keep upgrades safe, reduce risk, automate maintenance. Then one bad budget either blocks deploys or allows too many evictions at once.
For AI control planes, that mistake is expensive. You are not only protecting HTTP availability. You are protecting lock ownership, scheduler continuity, and message flow.
A wrong budget either freezes operations or accrues reliability debt that only surfaces after rollout.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Disruptions docs | Voluntary vs involuntary disruptions and where PDBs apply. | No guidance on lock-backed scheduler recovery behavior for AI control planes. |
| Kubernetes PDB task docs | `minAvailable`/`maxUnavailable` semantics and budget examples. | No service-tier mapping for mixed stateless + quorum stateful architectures. |
| GKE cluster upgrade best practices | Upgrade sequencing guidance and the role of PDBs in limiting voluntary disruption during maintenance. | No control-plane-specific validation for lock handoff, queue drain continuity, and retry surge behavior. |
The gap is service-tier translation. Teams need concrete mapping from budget fields to real control-plane failure modes.
PDB math that matters
Start with failure tolerance by service tier. Then derive budget values. Never start with someone else's YAML.
# Quick budget checks
# N = replicas
# M = minAvailable
# U = maxUnavailable
#
# Rule A (minAvailable):    evictions_allowed = N - M
# Rule B (maxUnavailable):  evictions_allowed = U
#
# Stateful quorum example (3-node cluster):
#   majority = floor(3/2) + 1 = 2
#   choose M >= 2  =>  at most 1 voluntary eviction
#
# If one pod is already unhealthy, the effective eviction budget shrinks to zero.
# Automation must handle this and defer disruption until health recovers.
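The rules above can be sketched as a small helper. This is illustrative code, not part of Cordum; the function names (`eviction_headroom`, `quorum_min_available`) are hypothetical:

```python
import math
from typing import Optional

def eviction_headroom(replicas: int, healthy: int,
                      min_available: Optional[int] = None,
                      max_unavailable: Optional[int] = None) -> int:
    """Voluntary evictions currently allowed under a PDB.

    Rule A (minAvailable):   headroom = healthy - min_available
    Rule B (maxUnavailable): headroom = max_unavailable - (replicas - healthy)
    Unhealthy pods consume budget, so headroom can hit zero before
    any eviction is even requested.
    """
    if min_available is not None:
        return max(0, healthy - min_available)
    if max_unavailable is not None:
        return max(0, max_unavailable - (replicas - healthy))
    raise ValueError("set exactly one of min_available / max_unavailable")

def quorum_min_available(cluster_size: int) -> int:
    """Smallest minAvailable that preserves a Raft majority."""
    return math.floor(cluster_size / 2) + 1

# 3-node cluster, all healthy: exactly one voluntary eviction is safe.
print(eviction_headroom(3, 3, min_available=quorum_min_available(3)))  # 1
# Same cluster with one pod already unhealthy: budget is zero.
print(eviction_headroom(3, 2, min_available=2))  # 0
```

Note how the second call returns zero without any eviction having happened: degraded baseline health alone exhausts the budget, which is why automation must check headroom before draining.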
| Service tier | Budget rule | Risk if wrong | Verification |
|---|---|---|---|
| Stateless API / gateway | Use PDB `maxUnavailable: 1` for small replica sets; keep rollout waves narrow. | Large simultaneous evictions can spike 5xx and retry traffic. | Watch unavailable replicas and P99 latency through each rollout wave. |
| Scheduler / control workers | Keep at least one active scheduler during voluntary disruption windows. | No active dispatcher means queue backlog growth and delayed lock turnover. | Measure dispatch continuity and lock takeover lag under kill drills. |
| Consensus-backed NATS cluster | Set `minAvailable` to preserve Raft majority (2 of 3 nodes). | Leader loss can halt writes and destabilize event flow. | Check cluster quorum before and during node drain. |
| Redis cluster (3 primary + 3 replica) | Set `minAvailable: 4` to preserve data availability during upgrades. | Losing too many nodes can block writes and break lock coordination. | Check slot coverage, primary health, and lock/read latency during maintenance. |
Cordum baseline values
These values are pulled from current Cordum docs and runtime behavior. Use them as a starting point, then tune with environment-specific load tests.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Application services | Cordum docs use PDB `maxUnavailable: 1` for gateway, scheduler, and other app services. | Limits blast radius of voluntary evictions while keeping rollout progress. |
| NATS StatefulSet | Recommended `minAvailable: 2` out of 3 to keep Raft quorum during updates. | Protects consensus so event transport remains writable during maintenance. |
| Redis StatefulSet | Recommended `minAvailable: 4` out of 6 (3 primary + 3 replica). | Maintains write/data availability and prevents coordination collapse. |
| Graceful shutdown envelope | Services target 15s shutdown; `terminationGracePeriodSeconds: 30` is recommended in docs. | Provides headroom for clean drain before forced pod kill. |
| Forced-kill fallback | Scheduler lock TTL is 60s with renewal every 20s; surviving replica takes over after expiry. | Puts an upper bound on disruption recovery but can add temporary queue latency. |
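The forced-kill row above bounds recovery with simple arithmetic. A sketch of the worst case, assuming the holder dies immediately after renewing; `takeover_poll_s` is an assumed parameter, not a documented Cordum setting:

```python
def worst_case_takeover_s(lock_ttl_s: float, renew_every_s: float,
                          takeover_poll_s: float = 5.0) -> float:
    """Upper bound on lock handoff after a forced kill.

    Worst case: the holder dies right after renewing, so the lock
    lives for a full TTL, and the survivor only notices at its next
    poll after expiry.
    """
    assert renew_every_s < lock_ttl_s, "renewal must outpace expiry"
    return lock_ttl_s + takeover_poll_s

# Cordum baseline: 60s TTL, 20s renewal, assumed 5s takeover poll.
print(worst_case_takeover_s(60, 20))  # 65.0
```

That bound is what the table means by "upper bound on disruption recovery": queue latency can grow for up to roughly one TTL before the surviving replica takes over.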
Implementation examples
Tiered PDB configuration (YAML)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-api-gateway-pdb
  namespace: cordum
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cordum-api-gateway
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-nats-pdb
  namespace: cordum
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-redis-pdb
  namespace: cordum
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: redis
Pre-rollout disruption checklist (Bash)
# 1) Check disruption headroom before rollout
kubectl get pdb -n cordum
kubectl describe pdb cordum-api-gateway-pdb -n cordum
kubectl describe pdb cordum-nats-pdb -n cordum
kubectl describe pdb cordum-redis-pdb -n cordum

# 2) Validate rollout plus PDB constraints
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# 3) Verify lock ownership continuity during restart
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 4) Roll back if disruption error budget is breached
kubectl rollout undo deployment/cordum-api-gateway -n cordum
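The headroom check in step 1 can be turned into a hard CI gate. This sketch assumes the JSON shape returned by `kubectl get pdb -o json` (where the controller maintains `status.disruptionsAllowed`); the helper name `pdb_gate` is illustrative:

```python
import json
import shutil
import subprocess
import sys

def pdb_gate(pdb_items: list) -> list:
    """Return names of PDBs with zero eviction headroom.

    status.disruptionsAllowed == 0 means any voluntary eviction will
    be refused, so drain automation would stall or retry in a loop.
    """
    return [
        item["metadata"]["name"]
        for item in pdb_items
        if item.get("status", {}).get("disruptionsAllowed", 0) == 0
    ]

if __name__ == "__main__" and shutil.which("kubectl"):
    out = subprocess.run(
        ["kubectl", "get", "pdb", "-n", "cordum", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    blocked = pdb_gate(json.loads(out.stdout)["items"])
    if blocked:
        print("no eviction headroom: " + ", ".join(blocked))
        sys.exit(1)  # fail the rollout before any pod is touched
```

Failing closed here is deliberate: a zero-headroom PDB usually signals that baseline health is already degraded, which is exactly when a rollout should not start.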
Post-maintenance regression signals (PromQL)
# Unavailable replicas during maintenance window
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[15m])
# Pod restart spike can indicate disruption pressure
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[15m])
# Queue lock contention signal (if exported)
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))
Limitations and tradeoffs
- PDBs do not prevent involuntary disruptions like node crashes.
- PDBs can block maintenance if baseline health is already degraded.
- Tight budgets improve availability but can slow upgrades and security patching.
- `kubectl delete pod` and direct workload deletion can bypass PDB intent.
If disruption automation does not use the Eviction API path, your PDB is advisory text, not a safety control.
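For automation, the difference is which API you call. A PDB-respecting eviction POSTs an `Eviction` object to the pod's `eviction` subresource, and the API server answers 429 when the budget is exhausted; a bare DELETE skips that check entirely. A minimal sketch of the request body (the helper name is illustrative):

```python
import json

def eviction_body(pod: str, namespace: str) -> dict:
    """Body for POST /api/v1/namespaces/{ns}/pods/{pod}/eviction.

    Unlike a plain pod DELETE, this path is checked against the PDB:
    the API server returns 429 instead of evicting when the
    disruption budget has no headroom.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }

print(json.dumps(eviction_body("cordum-api-gateway-0", "cordum"), indent=2))
```

`kubectl drain` uses this path for you; custom node-maintenance tooling has to opt in, which is the easiest place for PDB intent to silently leak away.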
Next step
Run this in one sprint:
1. Classify every control-plane workload into stateless, scheduler, or quorum tier.
2. Define one PDB per tier with explicit rationale and owner.
3. Add a pre-rollout gate that fails if current healthy pods leave zero eviction headroom.
4. Run one forced-kill game day and measure lock takeover plus retry surge.
Continue with AI Agent Rolling Restart Playbook and AI Agent Graceful Shutdown.