## The production problem
Config drift rarely announces itself with a crash. One replica routes jobs with new pools while another keeps old limits, and the incident looks like random queue behavior.
Teams often ship notification-only reload logic. That works until one message is dropped. Then drift persists until a restart or manual intervention.
You need deterministic reload rules, bounded convergence windows, and post-change verification checks.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Argo CD automated sync and self-heal | Desired-vs-live diffing with optional self-heal and prune controls. | No guidance for non-Kubernetes control planes where config drives scheduler routing and timeout policies. |
| Terraform health assessments and drift checks | Scheduled and on-demand drift checks with explicit reconciliation flows. | No replica-level runtime propagation model for live agent orchestrators. |
| AWS CloudFormation drift detection | Formal drift status model (`IN_SYNC`, `DRIFTED`, `NOT_CHECKED`) and property-level diff semantics. | No low-latency config notification pattern with poll fallback for multi-replica services. |
The gap is runtime consistency engineering: how replicas detect, apply, and verify the same config change under transport faults.
## Drift detection model
| Stage | Mechanism | Risk | Mitigation |
|---|---|---|---|
| Detect | Trigger reload on config-change notification and periodic poll. | Notification missed or delayed during transport issues. | Keep bounded fallback interval and always reload from durable store. |
| Validate | Compute deterministic hash of effective config sections. | Noisy updates and unnecessary reload churn. | Apply only when hash changes; keep section-level hashes. |
| Apply | Update routing/timeouts atomically per replica. | Partial application and temporary behavior divergence. | Use explicit update paths and idempotent setter functions. |
| Verify | Observe queue pressure and stale backlog signals after config change. | Drift remains undetected until customer impact. | Attach post-change checks and rollback trigger conditions. |
## Cordum runtime behavior
These behaviors are validated against current scheduler code paths and troubleshooting guidance, not inferred from generic platform docs.
| Behavior | Current implementation | Operational impact |
|---|---|---|
| Notification subject | `sys.config.changed` broadcast triggers immediate scheduler reload. | Minimizes propagation lag for config updates. |
| Fallback poll interval | Default `30s` (`SCHEDULER_CONFIG_RELOAD_INTERVAL` override available). | Covers missed notifications and network partitions. |
| Source of truth | Replicas reload from Redis config document (`cfg:system:default`), not from message payload. | Prevents message skew from creating divergent runtime state. |
| Hash-gated apply | Routing/timeouts reload only when computed section hashes change. | Avoids unnecessary churn and keeps updates deterministic. |
| Expected consistency window | During notification loss, convergence is bounded by one poll cycle (about 30s by default). | Turns a vague eventual-consistency claim into an explicit operational budget. |
## Implementation examples
### Hash-gated reload loop with notification fallback (Go)
```go
func watchConfigChanges(
	ctx context.Context,
	svc *configsvc.Service,
	strategy *scheduler.LeastLoadedStrategy,
	reconciler *scheduler.Reconciler,
	natsBus *bus.NatsBus,
) {
	interval := 30 * time.Second
	notifyCh := make(chan struct{}, 1)
	_ = natsBus.Subscribe("sys.config.changed", "", func(_ *pb.BusPacket) error {
		select {
		case notifyCh <- struct{}{}:
		default: // coalesce bursts: one pending signal is enough
		}
		return nil
	})

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	var lastPoolsHash, lastTimeoutsHash string
	reload := func(trigger string) {
		// Always reload from the durable store, never from the
		// notification payload, so message skew cannot diverge replicas.
		snap, err := loadConfigSnapshot(ctx, svc)
		if err != nil {
			return
		}
		if snap.PoolsHash != "" && snap.PoolsHash != lastPoolsHash {
			strategy.UpdateRouting(buildRouting(snap.Pools))
			lastPoolsHash = snap.PoolsHash
		}
		if snap.TimeoutsHash != "" && snap.TimeoutsHash != lastTimeoutsHash {
			dispatch, running, _ := reconcilerTimeouts(snap.Timeouts)
			reconciler.UpdateTimeouts(dispatch, running)
			lastTimeoutsHash = snap.TimeoutsHash
		}
	}

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reload("poll") // bounded fallback covers lost notifications
		case <-notifyCh:
			reload("notification") // fast path after sys.config.changed
		}
	}
}
```

### Config drift policy sheet (YAML)
```yaml
drift_control:
  source_of_truth: redis
  config_key: cfg:system:default
  notifications:
    subject: sys.config.changed
    mode: broadcast
  polling:
    interval: 30s
    env_override: SCHEDULER_CONFIG_RELOAD_INTERVAL
  hash_gate:
    sections:
      - pools
      - timeouts
  convergence_slo:
    notification_ok: "<= 5s"
    notification_lost: "<= 30s"
```

### Replica drift diagnostics (Bash)
```bash
# 1) Verify source-of-truth config payload
redis-cli GET "cfg:system:default"

# 2) Verify notification path
docker compose logs scheduler 2>&1 | grep "config change notification"

# 3) Verify drift symptoms in scheduler signals
curl -s http://localhost:9090/metrics | grep job_lock_wait
curl -s http://localhost:9090/metrics | grep stale_jobs

# 4) Verify routing convergence from API replicas
curl -s http://localhost:8080/api/v1/workers | jq '.workers | length'
```
### Post-change drift signals (PromQL)
```promql
# Queueing pressure after config changes
histogram_quantile(0.99, rate(dispatch_latency_bucket[5m]))

# Lock contention side effect
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])
```
## Limitations and tradeoffs
- A 30s polling fallback is safe but may be too slow for high-risk policy updates.
- Notification coalescing reduces load but can hide intermediate config transitions in logs.
- Hash-based gating prevents noise but requires stable canonicalization to avoid false deltas.
- A Redis-backed source of truth centralizes consistency and also centralizes failure risk.
If a replica only converges after restart, that is not eventual consistency. That is deferred outage repayment.
## Next step
Run this in one sprint:
1. Document your config source-of-truth key and notification subject for every control-plane service.
2. Add hash-gated reloads for each config section that impacts runtime behavior.
3. Set an explicit convergence SLO for missed-notification scenarios (for example, 30s).
4. Simulate notification loss and verify all replicas converge via polling within the SLO.
Continue with AI Agent Leader Election and AI Agent Incident Response Runbook.