The production problem
Teams often set probes to “something that passes” and move on. That works until a dependency blips, a rollout starts, and half the pods restart in sync.
For AI control planes, the blast radius is worse than ordinary web-traffic disruption: a synchronized restart can interrupt in-flight work, orphan lock ownership, and break event-processing continuity.
Probe design is a reliability policy. It should be reviewed with the same rigor as retry and timeout settings.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes probe configuration docs | Probe types, fields, and semantics for liveness/readiness/startup. | No control-plane-specific dependency strategy for queue/lock-driven services. |
| Kubernetes Pod lifecycle | How readiness affects endpoint routing and pod serving state. | No guidance for readiness gating based on scheduler lock and reconciliation status. |
| Kubernetes rolling update tutorial | Rolling update flow and why healthy new pods matter before replacement. | No probe tuning method for preventing restart loops during dependency cold starts. |
The gap is operational semantics: probe roles should reflect control-plane behavior, not only HTTP endpoint availability.
Probe role contract
| Probe | Question it answers | Include | Avoid |
|---|---|---|---|
| Readiness | Should this pod receive new traffic right now? | Dependency reachability, local warm state, and drain-mode flag. | Permanent process health assumptions. |
| Liveness | Is this process irrecoverably stuck and in need of a restart? | Deadlock/hang detection that local retries cannot fix. | Transient network dependency failures. |
| Startup | Has slow initialization completed? | One-time bootstrap checks before liveness kicks in. | Continuous runtime dependency checks. |
Cordum probe baseline
Current manifests use `/health` for both liveness and readiness across core services, with initial delay 5s and period 10s.
| Service | Current behavior | Operational impact |
|---|---|---|
| Scheduler | `/health` on port 9090 for both liveness and readiness (initialDelay 5s, period 10s). | Simple baseline, but same endpoint means fewer differentiated signals. |
| API Gateway | `/health` on port 8081 for liveness and readiness (initialDelay 5s, period 10s). | Pod is pulled from service endpoints quickly when health fails. |
| Workflow Engine | `/health` on port 9093 for liveness and readiness (initialDelay 5s, period 10s). | Keeps rollout gating consistent across core services. |
| Platform hardening guidance | Production guide requires readiness/liveness probes on every workload. | Probe coverage is treated as a baseline production control. |
Implementation examples
Split probe endpoints by role (YAML)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-api-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          image: ghcr.io/cordum-io/cordum-api-gateway:latest
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8081
            periodSeconds: 5
            failureThreshold: 24 # 120s max startup budget
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8081
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8081
            periodSeconds: 10
            failureThreshold: 3
```

Role-specific health handlers (Go)
```go
var draining atomic.Bool

mux.HandleFunc("/health/live", func(w http.ResponseWriter, _ *http.Request) {
	// Process is alive and the event loop is responsive; no dependency
	// checks here, so a slow Redis can never trigger a restart.
	w.WriteHeader(http.StatusOK)
})

mux.HandleFunc("/health/ready", func(w http.ResponseWriter, _ *http.Request) {
	// Fail readiness while draining, while dependencies are unreachable,
	// or before local caches are warm.
	if draining.Load() || !depsHealthy() || !cacheWarm() {
		http.Error(w, "not ready", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})

mux.HandleFunc("/health/startup", func(w http.ResponseWriter, _ *http.Request) {
	// One-time bootstrap gate; once this passes, liveness takes over.
	if !bootstrapComplete() {
		http.Error(w, "starting", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})
```

Probe-focused rollout runbook (Bash)
```bash
# Check probe-related restarts and readiness failures
kubectl get pods -n cordum
kubectl describe pod <pod-name> -n cordum | grep -E "Liveness|Readiness|Startup|Failed"

# Watch rollout readiness progression
kubectl rollout status deployment/cordum-api-gateway -n cordum
kubectl get endpoints cordum-api-gateway -n cordum -w

# Validate service health endpoints directly
kubectl port-forward deploy/cordum-api-gateway 18081:8081 -n cordum
curl -i http://127.0.0.1:18081/health/ready
curl -i http://127.0.0.1:18081/health/live
```
Limitations and tradeoffs
- Separate probe endpoints improve signal quality but increase implementation complexity.
- Aggressive liveness checks recover faster but raise restart-loop risk under transient failures.
- Conservative readiness checks protect traffic but can prolong rollout duration.
- Startup probes prevent false early restarts but must be tuned to real bootstrap times.
If liveness fails because Redis was slow for two seconds, you did not detect a dead process. You created one.
Next step
Run this in one sprint:
1. Define probe contracts for each control-plane service in one table.
2. Split `/health` into `/health/live`, `/health/ready`, and `/health/startup` where needed.
3. Run a rollout game day with dependency latency injection and track restart counts.
4. Lock in probe thresholds only after observing p95/p99 startup and warm-up timings.
Continue with AI Agent Rolling Restart Playbook and AI Agent PodDisruptionBudget Strategy.