
AI Agent Health Checks

Most restart storms are probe design bugs, not app bugs.

Guide · 11 min read · Mar 2026
TL;DR
  • Liveness, readiness, and startup probes solve different failure classes. Mixing them creates false recoveries.
  • Readiness is traffic control. Liveness is process survival control.
  • Using the same endpoint for both probes is common, but it can hide useful failure signals.
  • Control-plane services should gate readiness on dependency reachability and local consistency state.
Role clarity

Each probe should answer one narrow question.

Traffic safety

Readiness decides when traffic enters a pod during rollouts and drains.

Restart control

Liveness should trigger only for irrecoverable local dead states.

Scope

This guide covers health probe design for autonomous AI control-plane services running in Kubernetes, especially scheduler/gateway/workflow components with external dependencies.

The production problem

Teams often set probes to “something that passes” and move on. That works until a dependency blips during a rollout and half the pods restart in sync.

For AI control planes, the blast radius is worse than a basic web-traffic disruption: a synchronized restart can interrupt in-flight work, break scheduler lock ownership, and cut event-processing continuity.

Probe design is a reliability policy. It should be reviewed with the same rigor as retry and timeout settings.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes probe configuration docs | Probe types, fields, and semantics for liveness/readiness/startup. | No control-plane-specific dependency strategy for queue/lock-driven services. |
| Kubernetes Pod lifecycle | How readiness affects endpoint routing and pod serving state. | No guidance for readiness gating based on scheduler lock and reconciliation status. |
| Kubernetes rolling update tutorial | Rolling update flow and why healthy new pods matter before replacement. | No probe tuning method for preventing restart loops during dependency cold starts. |

The gap is operational semantics: probe roles should reflect control-plane behavior, not only HTTP endpoint availability.

Probe role contract

| Probe | Question it answers | Include | Avoid |
| --- | --- | --- | --- |
| Readiness | Should this pod receive new traffic right now? | Dependency reachability, local warm state, and drain-mode flag. | Permanent process health assumptions. |
| Liveness | Is this process irrecoverably stuck and needs restart? | Deadlock/hang detection that local retries cannot fix. | Transient network dependency failures. |
| Startup | Has slow initialization completed? | One-time bootstrap checks before liveness kicks in. | Continuous runtime dependency checks. |

Cordum probe baseline

Current manifests use `/health` for both liveness and readiness across core services, with initial delay 5s and period 10s.

| Service | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler | `/health` on port 9090 for both liveness and readiness (initialDelay 5s, period 10s). | Simple baseline, but same endpoint means fewer differentiated signals. |
| API Gateway | `/health` on port 8081 for liveness and readiness (initialDelay 5s, period 10s). | Pod is pulled from service endpoints quickly when health fails. |
| Workflow Engine | `/health` on port 9093 for liveness and readiness (initialDelay 5s, period 10s). | Keeps rollout gating consistent across core services. |
| Platform hardening guidance | Production guide requires readiness/liveness probes on every workload. | Probe coverage is treated as a baseline production control. |
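For reference, the shared-endpoint baseline looks roughly like this. This is a sketch reconstructed from the table above, not a copy of the Cordum manifests, and the exact field layout may differ:

```yaml
# Current pattern: one endpoint serves both probe roles.
livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health   # same endpoint: liveness and readiness failures look identical
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

With this shape, a readiness-worthy condition (a dependency blip) and a liveness-worthy condition (a hung process) produce the same signal, which is the differentiation gap the table notes.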

Implementation examples

Split probe endpoints by role (YAML)

probe_strategy.yaml
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-api-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          image: ghcr.io/cordum-io/cordum-api-gateway:latest
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8081
            periodSeconds: 5
            failureThreshold: 24   # 120s max startup budget
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8081
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8081
            periodSeconds: 10
            failureThreshold: 3

Role-specific health handlers (Go)

health_handlers.go
Go
package main

import (
	"net/http"
	"sync/atomic"
)

// draining is flipped by the shutdown path before the pod drains,
// so readiness fails while liveness keeps passing.
var draining atomic.Bool

func registerHealthRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, _ *http.Request) {
		// Process alive and event loop responsive; no dependency checks here.
		w.WriteHeader(http.StatusOK)
	})

	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, _ *http.Request) {
		// Gate traffic on drain state, dependency reachability, and warm caches.
		if draining.Load() || !depsHealthy() || !cacheWarm() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	mux.HandleFunc("/health/startup", func(w http.ResponseWriter, _ *http.Request) {
		// One-time bootstrap gate; liveness takes over once this succeeds.
		if !bootstrapComplete() {
			http.Error(w, "starting", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

Probe-focused rollout runbook (Bash)

probe_rollout_check.sh
Bash
# Check probe-related restarts and readiness failures
kubectl get pods -n cordum
kubectl describe pod <pod-name> -n cordum | grep -E "Liveness|Readiness|Startup|Failed"

# Watch rollout readiness progression
kubectl rollout status deployment/cordum-api-gateway -n cordum
kubectl get endpoints cordum-api-gateway -n cordum -w

# Validate service health endpoints directly
kubectl port-forward deploy/cordum-api-gateway 18081:8081 -n cordum
curl -i http://127.0.0.1:18081/health/ready
curl -i http://127.0.0.1:18081/health/live

Limitations and tradeoffs

  • Separate probe endpoints improve signal quality but increase implementation complexity.
  • Aggressive liveness checks recover faster but raise restart-loop risk under transient failures.
  • Conservative readiness checks protect traffic but can prolong rollout duration.
  • Startup probes prevent false early restarts but must be tuned to real bootstrap times.

If liveness fails because Redis was slow for two seconds, you did not detect a dead process. You created one.

Next step

Run this in one sprint:

  1. Define probe contracts for each control-plane service in one table.
  2. Split `/health` into `/health/live`, `/health/ready`, and `/health/startup` where needed.
  3. Run a rollout game day with dependency latency injection and track restart counts.
  4. Lock in probe thresholds only after observing p95/p99 startup and warm-up timings.

Continue with AI Agent Rolling Restart Playbook and AI Agent PodDisruptionBudget Strategy.

Probes are policy, not plumbing

Probe behavior controls availability decisions during every rollout and disruption event.