## The production problem
Config drift rarely announces itself with a crash. One replica routes jobs with new pools while another keeps old limits, and the incident looks like random queue behavior.
Teams often ship notification-only reload logic. That works until one message is dropped. Then drift persists until a restart or manual intervention.
You need deterministic reload rules, bounded convergence windows, and post-change verification checks.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Argo CD automated sync and self-heal | Desired-vs-live diffing with optional self-heal and prune controls. | No guidance for non-Kubernetes control planes where config drives scheduler routing and timeout policies. |
| Terraform health assessments and drift checks | Scheduled and on-demand drift checks with explicit reconciliation flows. | No replica-level runtime propagation model for live agent orchestrators. |
| AWS CloudFormation drift detection | Formal drift status model (`IN_SYNC`, `DRIFTED`, `NOT_CHECKED`) and property-level diff semantics. | No low-latency config notification pattern with poll fallback for multi-replica services. |
The gap is runtime consistency engineering: how replicas detect, apply, and verify the same config change under transport faults.
## Drift detection model
| Stage | Mechanism | Risk | Mitigation |
|---|---|---|---|
| Detect | Trigger reload on config-change notification and periodic poll. | Notification missed or delayed during transport issues. | Keep bounded fallback interval and always reload from durable store. |
| Validate | Compute deterministic hash of effective config sections. | Noisy updates and unnecessary reload churn. | Apply only when hash changes; keep section-level hashes. |
| Apply | Update routing/timeouts atomically per replica. | Partial application and temporary behavior divergence. | Use explicit update paths and idempotent setter functions. |
| Verify | Observe queue pressure and stale backlog signals after config change. | Drift remains undetected until customer impact. | Attach post-change checks and rollback trigger conditions. |
## Cordum runtime behavior
These behaviors are validated against current scheduler code paths and troubleshooting guidance, not inferred from generic platform docs.
| Behavior | Current implementation | Operational impact |
|---|---|---|
| Notification subject | `sys.config.changed` broadcast triggers immediate scheduler reload. | Minimizes propagation lag for config updates. |
| Fallback poll interval | Default `30s` (`SCHEDULER_CONFIG_RELOAD_INTERVAL` override available). | Covers missed notifications and network partitions. |
| Source of truth | Replicas reload from Redis config document (`cfg:system:default`), not from message payload. | Prevents message skew from creating divergent runtime state. |
| Hash-gated apply | Routing/timeouts reload only when computed section hashes change. | Avoids unnecessary churn and keeps updates deterministic. |
| Expected consistency window | During notification loss, convergence is bounded by one poll cycle (about 30s by default). | Turns a vague eventual-consistency claim into an explicit operational budget. |
## Implementation examples
### Hash-gated reload loop with notification fallback (Go)
```go
func watchConfigChanges(
	ctx context.Context,
	svc *configsvc.Service,
	strategy *scheduler.LeastLoadedStrategy,
	reconciler *scheduler.Reconciler,
	natsBus *bus.NatsBus,
) {
	interval := 30 * time.Second
	notifyCh := make(chan struct{}, 1)
	_ = natsBus.Subscribe("sys.config.changed", "", func(_ *pb.BusPacket) error {
		select {
		case notifyCh <- struct{}{}:
		default: // coalesce bursts: one pending signal is enough
		}
		return nil
	})

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastPoolsHash, lastTimeoutsHash string
	reload := func(trigger string) {
		// Always reload from the durable store, never from the
		// notification payload, so message skew cannot diverge replicas.
		snap, err := loadConfigSnapshot(ctx, svc)
		if err != nil {
			return
		}
		if snap.PoolsHash != "" && snap.PoolsHash != lastPoolsHash {
			strategy.UpdateRouting(buildRouting(snap.Pools))
			lastPoolsHash = snap.PoolsHash
		}
		if snap.TimeoutsHash != "" && snap.TimeoutsHash != lastTimeoutsHash {
			dispatch, running, _ := reconcilerTimeouts(snap.Timeouts)
			reconciler.UpdateTimeouts(dispatch, running)
			lastTimeoutsHash = snap.TimeoutsHash
		}
	}

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reload("poll") // bounded fallback covers lost notifications
		case <-notifyCh:
			reload("notification") // fast path after sys.config.changed
		}
	}
}
```

### Config drift policy sheet (YAML)
```yaml
drift_control:
  source_of_truth: redis
  config_key: cfg:system:default
  notifications:
    subject: sys.config.changed
    mode: broadcast
  polling:
    interval: 30s
    env_override: SCHEDULER_CONFIG_RELOAD_INTERVAL
  hash_gate:
    sections:
      - pools
      - timeouts
  convergence_slo:
    notification_ok: "<= 5s"
    notification_lost: "<= 30s"
```

### Replica drift diagnostics (Bash)
```bash
# 1) Verify source-of-truth config payload
redis-cli GET "cfg:system:default"

# 2) Verify notification path
docker compose logs scheduler 2>&1 | grep "config change notification"

# 3) Verify drift symptoms in scheduler signals
curl -s http://localhost:9090/metrics | grep job_lock_wait
curl -s http://localhost:9090/metrics | grep stale_jobs

# 4) Verify routing convergence from API replicas
curl -s http://localhost:8080/api/v1/workers | jq '.workers | length'
```
### Post-change drift signals (PromQL)
```promql
# Queueing pressure after config changes
histogram_quantile(0.99, rate(dispatch_latency_bucket[5m]))

# Lock contention side effect
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])
```
## Limitations and tradeoffs
- A 30s polling fallback is safe but may be too slow for high-risk policy updates.
- Notification coalescing reduces load but can hide intermediate config transitions in logs.
- Hash-based gating prevents noise but requires stable canonicalization to avoid false deltas.
- A Redis-backed source of truth centralizes consistency and also centralizes failure risk.
If a replica only converges after restart, that is not eventual consistency. That is deferred outage repayment.
## Next step
Run this in one sprint:
1. Document your config source-of-truth key and notification subject for every control-plane service.
2. Add hash-gated reloads for each config section that impacts runtime behavior.
3. Set an explicit convergence SLO for missed-notification scenarios (for example, 30s).
4. Simulate notification loss and verify all replicas converge via polling within the SLO.
Continue with AI Agent Leader Election and AI Agent Incident Response Runbook.