The production problem
Most PDB-related outages start with good intentions: keep upgrades safe, reduce risk, automate maintenance. Then one bad budget either blocks deploys or allows too many evictions at once.
For AI control planes, that mistake is expensive. You are not only protecting HTTP availability. You are protecting lock ownership, scheduler continuity, and message flow.
A wrong budget either freezes operations or accrues reliability debt that only surfaces after rollout.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Disruptions docs | Voluntary vs involuntary disruptions and where PDBs apply. | No guidance on lock-backed scheduler recovery behavior for AI control planes. |
| Kubernetes PDB task docs | `minAvailable`/`maxUnavailable` semantics and budget examples. | No service-tier mapping for mixed stateless + quorum stateful architectures. |
| GKE cluster upgrade best practices | Upgrade sequencing guidance and the role of PDBs in limiting voluntary disruption during maintenance. | No control-plane-specific validation for lock handoff, queue drain continuity, and retry surge behavior. |
The gap is service-tier translation. Teams need concrete mapping from budget fields to real control-plane failure modes.
PDB math that matters
Start with failure tolerance by service tier. Then derive budget values. Never start with someone else's YAML.
# Quick budget checks
# N = replicas
# M = minAvailable
# U = maxUnavailable
#
# Rule A (minAvailable):    evictions_allowed = N - M
# Rule B (maxUnavailable):  evictions_allowed = U
#
# Stateful quorum example (3-node cluster):
#   majority = floor(3/2) + 1 = 2
#   choose M >= 2  =>  at most 1 voluntary eviction
#
# If one pod is already unhealthy, the effective eviction budget shrinks to zero.
# Automation must handle this and defer disruption until health recovers.
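The rules above can be sketched as a small helper. This is illustrative code, not part of Cordum; the function names (`eviction_headroom`, `quorum_min_available`) are hypothetical:

```python
import math
from typing import Optional

def eviction_headroom(replicas: int, healthy: int,
                      min_available: Optional[int] = None,
                      max_unavailable: Optional[int] = None) -> int:
    """Voluntary evictions currently allowed under a PDB.

    Rule A (minAvailable):   headroom = healthy - min_available
    Rule B (maxUnavailable): headroom = max_unavailable - (replicas - healthy)
    Unhealthy pods consume budget, so headroom can hit zero before
    any eviction is even requested.
    """
    if min_available is not None:
        return max(0, healthy - min_available)
    if max_unavailable is not None:
        return max(0, max_unavailable - (replicas - healthy))
    raise ValueError("set exactly one of min_available / max_unavailable")

def quorum_min_available(cluster_size: int) -> int:
    """Smallest minAvailable that preserves a Raft majority."""
    return math.floor(cluster_size / 2) + 1

# 3-node cluster, all healthy: exactly one voluntary eviction is safe.
print(eviction_headroom(3, 3, min_available=quorum_min_available(3)))  # 1
# Same cluster with one pod already unhealthy: budget is zero.
print(eviction_headroom(3, 2, min_available=2))  # 0
```

Note how the second call returns zero without any eviction having happened: degraded baseline health alone exhausts the budget, which is why automation must check headroom before draining.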
| Service tier | Budget rule | Risk if wrong | Verification |
|---|---|---|---|
| Stateless API / gateway | Use PDB `maxUnavailable: 1` for small replica sets; keep rollout waves narrow. | Large simultaneous evictions can spike 5xx and retry traffic. | Watch unavailable replicas and P99 latency through each rollout wave. |
| Scheduler / control workers | Keep at least one active scheduler during voluntary disruption windows. | No active dispatcher means queue backlog growth and delayed lock turnover. | Measure dispatch continuity and lock takeover lag under kill drills. |
| Consensus-backed NATS cluster | Set `minAvailable` to preserve Raft majority (2 of 3 nodes). | Leader loss can halt writes and destabilize event flow. | Check cluster quorum before and during node drain. |
| Redis cluster (3 primary + 3 replica) | Set `minAvailable: 4` to preserve data availability during upgrades. | Losing too many nodes can block writes and break lock coordination. | Check slot coverage, primary health, and lock/read latency during maintenance. |
Cordum baseline values
These values are pulled from current Cordum docs and runtime behavior. Use them as a starting point, then tune with environment-specific load tests.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Application services | Cordum docs use PDB `maxUnavailable: 1` for gateway, scheduler, and other app services. | Limits blast radius of voluntary evictions while keeping rollout progress. |
| NATS StatefulSet | Recommended `minAvailable: 2` out of 3 to keep Raft quorum during updates. | Protects consensus so event transport remains writable during maintenance. |
| Redis StatefulSet | Recommended `minAvailable: 4` out of 6 (3 primary + 3 replica). | Maintains write/data availability and prevents coordination collapse. |
| Graceful shutdown envelope | Services target 15s shutdown; `terminationGracePeriodSeconds: 30` is recommended in docs. | Provides headroom for clean drain before forced pod kill. |
| Forced-kill fallback | Scheduler lock TTL is 60s with renewal every 20s; surviving replica takes over after expiry. | Puts an upper bound on disruption recovery but can add temporary queue latency. |
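The forced-kill row above bounds recovery with simple arithmetic. A sketch of the worst case, assuming the holder dies immediately after renewing; `takeover_poll_s` is an assumed parameter, not a documented Cordum setting:

```python
def worst_case_takeover_s(lock_ttl_s: float, renew_every_s: float,
                          takeover_poll_s: float = 5.0) -> float:
    """Upper bound on lock handoff after a forced kill.

    Worst case: the holder dies right after renewing, so the lock
    lives for a full TTL, and the survivor only notices at its next
    poll after expiry.
    """
    assert renew_every_s < lock_ttl_s, "renewal must outpace expiry"
    return lock_ttl_s + takeover_poll_s

# Cordum baseline: 60s TTL, 20s renewal, assumed 5s takeover poll.
print(worst_case_takeover_s(60, 20))  # 65.0
```

That bound is what the table means by "upper bound on disruption recovery": queue latency can grow for up to roughly one TTL before the surviving replica takes over.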
Implementation examples
Tiered PDB configuration (YAML)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-api-gateway-pdb
  namespace: cordum
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: cordum-api-gateway
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-nats-pdb
  namespace: cordum
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nats
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cordum-redis-pdb
  namespace: cordum
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: redis
Pre-rollout disruption checklist (Bash)
# 1) Check disruption headroom before rollout
kubectl get pdb -n cordum
kubectl describe pdb cordum-api-gateway-pdb -n cordum
kubectl describe pdb cordum-nats-pdb -n cordum
kubectl describe pdb cordum-redis-pdb -n cordum

# 2) Validate rollout plus PDB constraints
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# 3) Verify lock ownership continuity during restart
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 4) Roll back if disruption error budget is breached
kubectl rollout undo deployment/cordum-api-gateway -n cordum
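The headroom check in step 1 can be turned into a hard CI gate. This sketch assumes the JSON shape returned by `kubectl get pdb -o json` (where the controller maintains `status.disruptionsAllowed`); the helper name `pdb_gate` is illustrative:

```python
import json
import shutil
import subprocess
import sys

def pdb_gate(pdb_items: list) -> list:
    """Return names of PDBs with zero eviction headroom.

    status.disruptionsAllowed == 0 means any voluntary eviction will
    be refused, so drain automation would stall or retry in a loop.
    """
    return [
        item["metadata"]["name"]
        for item in pdb_items
        if item.get("status", {}).get("disruptionsAllowed", 0) == 0
    ]

if __name__ == "__main__" and shutil.which("kubectl"):
    out = subprocess.run(
        ["kubectl", "get", "pdb", "-n", "cordum", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    blocked = pdb_gate(json.loads(out.stdout)["items"])
    if blocked:
        print("no eviction headroom: " + ", ".join(blocked))
        sys.exit(1)  # fail the rollout before any pod is touched
```

Failing closed here is deliberate: a zero-headroom PDB usually signals that baseline health is already degraded, which is exactly when a rollout should not start.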
Post-maintenance regression signals (PromQL)
# Unavailable replicas during maintenance window
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[15m])
# Pod restart spike can indicate disruption pressure
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[15m])
# Queue lock contention signal (if exported)
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))
Limitations and tradeoffs
- PDBs do not prevent involuntary disruptions like node crashes.
- PDBs can block maintenance if baseline health is already degraded.
- Tight budgets improve availability but can slow upgrades and security patching.
- `kubectl delete pod` and direct workload deletion can bypass PDB intent.
If disruption automation does not use the Eviction API path, your PDB is advisory text, not a safety control.
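For automation, the difference is which API you call. A PDB-respecting eviction POSTs an `Eviction` object to the pod's `eviction` subresource, and the API server answers 429 when the budget is exhausted; a bare DELETE skips that check entirely. A minimal sketch of the request body (the helper name is illustrative):

```python
import json

def eviction_body(pod: str, namespace: str) -> dict:
    """Body for POST /api/v1/namespaces/{ns}/pods/{pod}/eviction.

    Unlike a plain pod DELETE, this path is checked against the PDB:
    the API server returns 429 instead of evicting when the
    disruption budget has no headroom.
    """
    return {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }

print(json.dumps(eviction_body("cordum-api-gateway-0", "cordum"), indent=2))
```

`kubectl drain` uses this path for you; custom node-maintenance tooling has to opt in, which is the easiest place for PDB intent to silently leak away.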
Next step
Run this in one sprint:
1. Classify every control-plane workload into stateless, scheduler, or quorum tier.
2. Define one PDB per tier with explicit rationale and owner.
3. Add a pre-rollout gate that fails if current healthy pods leave zero eviction headroom.
4. Run one forced-kill game day and measure lock takeover plus retry surge.
Continue with AI Agent Rolling Restart Playbook and AI Agent Graceful Shutdown.