
AI Agent Config Drift Detection

Replica mismatch is expensive because it looks like random behavior until you trace the config path.

Guide · 11 min read · Mar 2026
TL;DR
  • Config drift in AI systems often appears as behavior mismatch, not process crashes.
  • Fast notifications need a slower polling fallback, or missed events become silent divergence.
  • Hash-gated reloads prevent unnecessary churn and make config updates deterministic across replicas.
  • Your runbook should verify source-of-truth data, not only service logs.
Source of truth

All replicas should reload from one durable config document, never from ad-hoc message payloads.
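As a minimal sketch of this rule (the store interface and names below are illustrative, not the actual Cordum API), a change handler can treat the notification purely as a trigger and always read the durable document, so a stale or skewed message body can never become runtime state:

```go
package main

import "fmt"

// ConfigStore abstracts the durable config document (in this guide, the
// Redis key cfg:system:default). The interface is a stand-in for
// illustration only.
type ConfigStore interface {
	Load() (string, error)
}

type memStore struct{ doc string }

func (m *memStore) Load() (string, error) { return m.doc, nil }

// onConfigChanged ignores the notification payload entirely and reloads
// from the durable store: the message is a trigger, not a source of truth.
func onConfigChanged(_ []byte, store ConfigStore) (string, error) {
	return store.Load()
}

func main() {
	store := &memStore{doc: `{"pools":["fast","bulk"]}`}
	// Even if the bus delivers an outdated payload, the applied config
	// is whatever the store currently holds.
	applied, _ := onConfigChanged([]byte(`{"pools":["old"]}`), store)
	fmt.Println(applied)
}
```

The payload parameter exists only so bursty or reordered messages are harmless; every reload path converges on the same document.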

Reload discipline

Notification-triggered reload plus polling fallback creates bounded consistency windows.

Operational proof

Use lock and queue signals to catch drift side effects before user-facing failures.

Scope

This guide focuses on multi-replica AI control planes where config changes affect routing, retries, and timeout behavior at runtime.

The production problem

Config drift rarely announces itself with a crash. One replica routes jobs with new pools while another keeps old limits, and the incident looks like random queue behavior.

Teams often ship notification-only reload logic. That works until one message is dropped. Then drift persists until a restart or manual intervention.

You need deterministic reload rules, bounded convergence windows, and post-change verification checks.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Argo CD automated sync and self-heal | Desired-vs-live diffing with optional self-heal and prune controls. | No guidance for non-Kubernetes control planes where config drives scheduler routing and timeout policies. |
| Terraform health assessments and drift checks | Scheduled and on-demand drift checks with explicit reconciliation flows. | No replica-level runtime propagation model for live agent orchestrators. |
| AWS CloudFormation drift detection | Formal drift status model (`IN_SYNC`, `DRIFTED`, `NOT_CHECKED`) and property-level diff semantics. | No low-latency config notification pattern with poll fallback for multi-replica services. |

The gap is runtime consistency engineering: how replicas detect, apply, and verify the same config change under transport faults.

Drift detection model

| Stage | Mechanism | Risk | Mitigation |
| --- | --- | --- | --- |
| Detect | Trigger reload on config-change notification and periodic poll. | Notification missed or delayed during transport issues. | Keep a bounded fallback interval and always reload from the durable store. |
| Validate | Compute a deterministic hash of effective config sections. | Noisy updates and unnecessary reload churn. | Apply only when the hash changes; keep section-level hashes. |
| Apply | Update routing/timeouts atomically per replica. | Partial application and temporary behavior divergence. | Use explicit update paths and idempotent setter functions. |
| Verify | Observe queue pressure and stale-backlog signals after the config change. | Drift remains undetected until customer impact. | Attach post-change checks and rollback trigger conditions. |
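The Validate stage depends on a stable canonical byte form, or identical configs will hash differently across replicas. A minimal sketch, assuming map-shaped config sections (Go's `json.Marshal` sorts map keys, which gives a deterministic canonical encoding; struct-shaped config would need a fixed field order instead):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// sectionHash produces a deterministic hash for one config section.
// json.Marshal emits map keys in sorted order, so two semantically equal
// sections always canonicalize to the same bytes.
func sectionHash(section map[string]any) (string, error) {
	canon, err := json.Marshal(section)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canon)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Same content, different insertion order: the hashes must match,
	// or every poll would trigger a spurious reload.
	a, _ := sectionHash(map[string]any{"dispatch": "5s", "running": "300s"})
	b, _ := sectionHash(map[string]any{"running": "300s", "dispatch": "5s"})
	fmt.Println(a == b)
}
```

Keeping one hash per section (pools, timeouts) rather than one for the whole document lets a replica reload only the part that actually changed.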

Cordum runtime behavior

These behaviors are validated against current scheduler code paths and troubleshooting guidance, not inferred from generic platform docs.

| Behavior | Current implementation | Operational impact |
| --- | --- | --- |
| Notification subject | `sys.config.changed` broadcast triggers immediate scheduler reload. | Minimizes propagation lag for config updates. |
| Fallback poll interval | Default `30s` (`SCHEDULER_CONFIG_RELOAD_INTERVAL` override available). | Covers missed notifications and network partitions. |
| Source of truth | Replicas reload from Redis config document (`cfg:system:default`), not from message payload. | Prevents message skew from creating divergent runtime state. |
| Hash-gated apply | Routing/timeouts reload only when computed section hashes change. | Avoids unnecessary churn and keeps updates deterministic. |
| Expected consistency window | During notification loss, convergence is bounded by one poll cycle (about 30s by default). | Turns a vague eventual-consistency claim into an explicit operational budget. |

Implementation examples

Hash-gated reload loop with notification fallback (Go)

config_reload_loop.go
Go
func watchConfigChanges(
  ctx context.Context,
  svc *configsvc.Service,
  strategy *scheduler.LeastLoadedStrategy,
  reconciler *scheduler.Reconciler,
  natsBus *bus.NatsBus,
) {
  interval := 30 * time.Second
  notifyCh := make(chan struct{}, 1)

  _ = natsBus.Subscribe("sys.config.changed", "", func(_ *pb.BusPacket) error {
    select {
    case notifyCh <- struct{}{}:
    default: // coalesce bursts
    }
    return nil
  })

  ticker := time.NewTicker(interval)
  defer ticker.Stop()
  var lastPoolsHash, lastTimeoutsHash string

  reload := func(trigger string) { // trigger is "poll" or "notification"; keep it for logging hooks
    snap, err := loadConfigSnapshot(ctx, svc)
    if err != nil {
      return // keep the last known-good config; the next tick or notification retries
    }
    if snap.PoolsHash != "" && snap.PoolsHash != lastPoolsHash {
      strategy.UpdateRouting(buildRouting(snap.Pools))
      lastPoolsHash = snap.PoolsHash
    }
    if snap.TimeoutsHash != "" && snap.TimeoutsHash != lastTimeoutsHash {
      dispatch, running, _ := reconcilerTimeouts(snap.Timeouts)
      reconciler.UpdateTimeouts(dispatch, running)
      lastTimeoutsHash = snap.TimeoutsHash
    }
  }

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      reload("poll")
    case <-notifyCh:
      reload("notification")
    }
  }
}

Config drift policy sheet (YAML)

config_drift_policy.yaml
YAML
drift_control:
  source_of_truth: redis
  config_key: cfg:system:default
  notifications:
    subject: sys.config.changed
    mode: broadcast
  polling:
    interval: 30s
    env_override: SCHEDULER_CONFIG_RELOAD_INTERVAL
  hash_gate:
    sections:
      - pools
      - timeouts
  convergence_slo:
    notification_ok: "<= 5s"
    notification_lost: "<= 30s"

Replica drift diagnostics (Bash)

config_drift_runbook.sh
Bash
# 1) Verify source-of-truth config payload
redis-cli GET "cfg:system:default"

# 2) Verify notification path
docker compose logs scheduler 2>&1 | grep "config change notification"

# 3) Verify drift symptoms in scheduler signals
curl -s http://localhost:9090/metrics | grep job_lock_wait
curl -s http://localhost:9090/metrics | grep stale_jobs

# 4) Verify routing convergence from API replicas
curl -s http://localhost:8080/api/v1/workers | jq '.workers | length'
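Step 4's convergence check reduces to a hash comparison across replicas. A hypothetical Go sketch of that comparison, assuming each replica reports the hash of its applied config somewhere (the reporting endpoint and names here are not part of the runbook above):

```go
package main

import (
	"fmt"
	"sort"
)

// driftedReplicas compares each replica's reported section hash against
// the hash computed from the source-of-truth document and returns the
// replicas that have not converged yet.
func driftedReplicas(truthHash string, reported map[string]string) []string {
	var drifted []string
	for replica, h := range reported {
		if h != truthHash {
			drifted = append(drifted, replica)
		}
	}
	sort.Strings(drifted) // deterministic ordering for runbooks and alerts
	return drifted
}

func main() {
	reported := map[string]string{
		"scheduler-0": "abc123",
		"scheduler-1": "abc123",
		"scheduler-2": "ffe001", // still on the old document
	}
	fmt.Println(driftedReplicas("abc123", reported))
}
```

An empty result within the convergence SLO means the poll fallback is doing its job; a non-empty result past the SLO is a drift incident, not eventual consistency.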

Post-change drift signals (PromQL)

config_drift_health.promql
PromQL
# Queueing pressure after config changes
histogram_quantile(0.99, rate(dispatch_latency_bucket[5m]))

# Lock contention side effect
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])

Limitations and tradeoffs

  • A 30s polling fallback is safe but may be too slow for high-risk policy updates.
  • Notification coalescing reduces load but can hide intermediate config transitions in logs.
  • Hash-based gating prevents noise but requires stable canonicalization to avoid false deltas.
  • Redis-backed source-of-truth centralizes consistency and also centralizes failure risk.

If a replica only converges after restart, that is not eventual consistency. That is deferred outage repayment.

Next step

Run this in one sprint:

  1. Document your config source-of-truth key and notification subject for every control-plane service.
  2. Add hash-gated reloads for each config section that impacts runtime behavior.
  3. Set an explicit convergence SLO for missed-notification scenarios (for example, 30s).
  4. Simulate notification loss and verify all replicas converge via polling within the SLO.
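The notification-loss drill in step 4 can be rehearsed deterministically before touching real transport. A toy simulation (names and integer-second timing are illustrative) replays a poll-only timeline: the change lands between polls, the notification is dropped, and the replica must still converge within one poll interval:

```go
package main

import "fmt"

// convergeAt replays a poll-only timeline with the notification dropped
// and returns the time, in seconds, at which a replica first polls the
// durable store after the change at changeAt. pollEvery is the fallback
// interval; polls happen at t = 0, pollEvery, 2*pollEvery, ...
func convergeAt(changeAt, pollEvery int) int {
	for t := 0; ; t += pollEvery {
		if t >= changeAt { // this poll sees the new document
			return t
		}
	}
}

func main() {
	// Change at t=5s, 30s poll interval, notification lost:
	// the replica converges at the t=30s poll, within the 30s SLO.
	converged := convergeAt(5, 30)
	fmt.Println(converged, converged-5 <= 30)
}
```

The worst case is a change landing immediately after a poll, so the lag budget for the lost-notification path is exactly one poll interval, which is the number your SLO should encode.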

Continue with AI Agent Leader Election and AI Agent Incident Response Runbook.

Drift is a latency problem in disguise

Treat config propagation delay as a first-class reliability budget, and test it the same way you test failover.