
AI Agent Health Checks

Most restart storms are probe design bugs, not app bugs.

Guide · 11 min read · Mar 2026
TL;DR
  • Liveness, readiness, and startup probes solve different failure classes. Mixing them creates false recoveries.
  • Readiness is traffic control. Liveness is process survival control.
  • Using the same endpoint for both probes is common, but it can hide useful failure signals.
  • Control-plane services should gate readiness on dependency reachability and local consistency state.
Role clarity

Each probe should answer one narrow question.

Traffic safety

Readiness decides when traffic enters a pod during rollouts and drains.

Restart control

Liveness should trigger only for irrecoverable local dead states.

Scope

This guide covers health probe design for autonomous AI control-plane services running in Kubernetes, especially scheduler/gateway/workflow components with external dependencies.

The production problem

Teams often set probes to “something that passes” and move on. That works until a dependency blips during a rollout and half the pods restart in sync.

For AI control planes, the blast radius is worse than a basic web-traffic disruption: a synchronized restart can interrupt in-flight work, break scheduler lock ownership, and cut event-processing continuity.

Probe design is a reliability policy. It should be reviewed with the same rigor as retry and timeout settings.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes probe configuration docs | Probe types, fields, and semantics for liveness/readiness/startup. | No control-plane-specific dependency strategy for queue/lock-driven services. |
| Kubernetes Pod lifecycle | How readiness affects endpoint routing and pod serving state. | No guidance for readiness gating based on scheduler lock and reconciliation status. |
| Kubernetes rolling update tutorial | Rolling update flow and why healthy new pods matter before replacement. | No probe tuning method for preventing restart loops during dependency cold starts. |

The gap is operational semantics: probe roles should reflect control-plane behavior, not only HTTP endpoint availability.

Probe role contract

| Probe | Question it answers | Include | Avoid |
| --- | --- | --- | --- |
| Readiness | Should this pod receive new traffic right now? | Dependency reachability, local warm state, and drain-mode flag. | Permanent process health assumptions. |
| Liveness | Is this process irrecoverably stuck and needs restart? | Deadlock/hang detection that local retries cannot fix. | Transient network dependency failures. |
| Startup | Has slow initialization completed? | One-time bootstrap checks before liveness kicks in. | Continuous runtime dependency checks. |

Cordum probe baseline

Current manifests use `/health` for both liveness and readiness across core services, with initial delay 5s and period 10s.

| Service | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler | `/health` on port 9090 for both liveness and readiness (initialDelay 5s, period 10s). | Simple baseline, but same endpoint means fewer differentiated signals. |
| API Gateway | `/health` on port 8081 for liveness and readiness (initialDelay 5s, period 10s). | Pod is pulled from service endpoints quickly when health fails. |
| Workflow Engine | `/health` on port 9093 for liveness and readiness (initialDelay 5s, period 10s). | Keeps rollout gating consistent across core services. |
| Platform hardening guidance | Production guide requires readiness/liveness probes on every workload. | Probe coverage is treated as a baseline production control. |
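For reference, the shared-endpoint baseline looks roughly like this. This is a sketch reconstructed from the table above, not a copy of the Cordum manifests, and the exact field layout may differ:

```yaml
# Current pattern: one endpoint serves both probe roles.
livenessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health   # same endpoint: liveness and readiness failures look identical
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
```

With this shape, a readiness-worthy condition (a dependency blip) and a liveness-worthy condition (a hung process) produce the same signal, which is the differentiation gap the table notes.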

Implementation examples

Split probe endpoints by role (YAML)

probe_strategy.yaml
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-api-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          image: ghcr.io/cordum-io/cordum-api-gateway:latest
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8081
            periodSeconds: 5
            failureThreshold: 24   # 120s max startup budget
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8081
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8081
            periodSeconds: 10
            failureThreshold: 3

Role-specific health handlers (Go)

health_handlers.go
Go
package main

import (
	"net/http"
	"sync/atomic"
)

// draining is flipped by the shutdown path before the pod drains,
// so readiness fails while liveness keeps passing.
var draining atomic.Bool

func registerHealthRoutes(mux *http.ServeMux) {
	mux.HandleFunc("/health/live", func(w http.ResponseWriter, _ *http.Request) {
		// Process alive and event loop responsive; no dependency checks here.
		w.WriteHeader(http.StatusOK)
	})

	mux.HandleFunc("/health/ready", func(w http.ResponseWriter, _ *http.Request) {
		// Gate traffic on drain state, dependency reachability, and warm caches.
		if draining.Load() || !depsHealthy() || !cacheWarm() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	mux.HandleFunc("/health/startup", func(w http.ResponseWriter, _ *http.Request) {
		// One-time bootstrap gate; liveness takes over once this succeeds.
		if !bootstrapComplete() {
			http.Error(w, "starting", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}

Probe-focused rollout runbook (Bash)

probe_rollout_check.sh
Bash
# Check probe-related restarts and readiness failures
kubectl get pods -n cordum
kubectl describe pod <pod-name> -n cordum | grep -E "Liveness|Readiness|Startup|Failed"

# Watch rollout readiness progression
kubectl rollout status deployment/cordum-api-gateway -n cordum
kubectl get endpoints cordum-api-gateway -n cordum -w

# Validate service health endpoints directly
kubectl port-forward deploy/cordum-api-gateway 18081:8081 -n cordum
curl -i http://127.0.0.1:18081/health/ready
curl -i http://127.0.0.1:18081/health/live

Limitations and tradeoffs

  • Separate probe endpoints improve signal quality but increase implementation complexity.
  • Aggressive liveness checks recover faster but raise restart-loop risk under transient failures.
  • Conservative readiness checks protect traffic but can prolong rollout duration.
  • Startup probes prevent false early restarts but must be tuned to real bootstrap times.

If liveness fails because Redis was slow for two seconds, you did not detect a dead process. You created one.

Next step

Run this in one sprint:

  1. Define probe contracts for each control-plane service in one table.
  2. Split `/health` into `/health/live`, `/health/ready`, and `/health/startup` where needed.
  3. Run a rollout game day with dependency latency injection and track restart counts.
  4. Lock in probe thresholds only after observing p95/p99 startup and warm-up timings.

Continue with AI Agent Rolling Restart Playbook and AI Agent PodDisruptionBudget Strategy.

Probes are policy, not plumbing

Probe behavior controls availability decisions during every rollout and disruption event.