
AI Agent Config Reload Convergence

One missed reload can quietly shift safety posture across replicas.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • Hot reload is a convergence problem, not only a parsing problem.
  • The Cordum scheduler combines immediate `sys.config.changed` notifications with a 30s polling fallback.
  • Reload applies only when content hashes change, reducing no-op churn across replicas.
  • Live timeout reload currently updates dispatch and running windows; scan-interval changes require a restart.
Two reload triggers

Broadcast notification for speed, poll loop for missed events.

Hash gating

Pools and timeouts apply only when normalized content hash actually changes.

Known caveat

The reconciler scan interval is not updated by the live timeout reload path.

Scope

This guide focuses on scheduler-side config reload behavior: routing, timeout, and fail-mode convergence across replicas. It does not cover policy authoring itself.

The production problem

Dynamic config updates fail in two common ways: no update arrives, or every update arrives and still causes churn. Both are expensive. One causes stale behavior. The other burns CPU and operator trust.

In agent control planes, this is more than inconvenience. A stale scheduler config can keep the wrong fail mode, wrong routing, or wrong timeout profile active on part of the fleet.

You need fast convergence and low-noise apply logic at the same time. Getting only one of those is how incident retrospectives get longer than the outage.

What top results miss

Source | Strong coverage | Missing piece
Kubernetes tutorial: updating config via ConfigMap | Mounted ConfigMap updates and eventual propagation semantics. | No scheduler-specific apply gating or governance-mode reload behavior.
etcd API docs: Watch API | Event stream watches, revision-based updates, and long-running watch streams. | No direct guidance for mixed watch-plus-poll convergence in job schedulers.
Envoy xDS protocol docs | ACK/NACK semantics, nonce handling, and eventual consistency constraints. | No pre-dispatch policy-mode reload path where a stale fail mode can change risk posture.

The gap is governance-sensitive convergence: how to reload distributed scheduler behavior quickly while suppressing no-op updates and preserving replica consistency.

Cordum runtime behavior

Boundary | Current behavior | Operational impact
Immediate reload path | Scheduler subscribes to `sys.config.changed` and triggers reload on notification. | Fast cross-replica convergence after gateway config writes.
Fallback reload path | Polling loop runs every `30s` by default, configurable via `SCHEDULER_CONFIG_RELOAD_INTERVAL`. | Converges even if notifications are dropped or delayed.
Apply gating | Pools/timeouts are re-applied only when the snapshot hash differs from the previous hash. | Avoids no-op reload work and routing churn.
Timeout live update | Reload path recalculates dispatch/running timeouts and calls `reconciler.UpdateTimeouts(dispatch, running)`. | Dispatch/running windows change live; scan interval does not.
Fail-mode reload | Reload reads scheduler config keys: `input_fail_mode`, `output_fail_mode`, `output_policy_enabled`. | Risk posture can change at runtime without a scheduler restart.
Bootstrap file tracking | Bootstrap stores `_poolsFileHash` and `_timeoutsFileHash` in config document metadata. | File changes are detected and merged/reset deterministically.
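The fail-mode reload row deserves a concrete shape. Here is a minimal sketch of applying governance keys at runtime; the `SchedulerConfig` type and `applyFailModes` helper are illustrative names, not Cordum's actual API, and only the three config keys named above are assumed to exist.

```go
package main

// SchedulerConfig mirrors the governance keys read on reload.
// Field names here are illustrative, not the real Cordum types.
type SchedulerConfig struct {
	InputFailMode       string // e.g. "open" or "closed"
	OutputFailMode      string
	OutputPolicyEnabled bool
}

// applyFailModes swaps risk posture in place without a restart and
// returns a change list suitable for audit logging.
func applyFailModes(cur *SchedulerConfig, next SchedulerConfig) []string {
	var changes []string
	if cur.InputFailMode != next.InputFailMode {
		changes = append(changes, "input_fail_mode: "+cur.InputFailMode+" -> "+next.InputFailMode)
		cur.InputFailMode = next.InputFailMode
	}
	if cur.OutputFailMode != next.OutputFailMode {
		changes = append(changes, "output_fail_mode: "+cur.OutputFailMode+" -> "+next.OutputFailMode)
		cur.OutputFailMode = next.OutputFailMode
	}
	if cur.OutputPolicyEnabled != next.OutputPolicyEnabled {
		changes = append(changes, "output_policy_enabled toggled")
		cur.OutputPolicyEnabled = next.OutputPolicyEnabled
	}
	return changes
}
```

Returning the change list, rather than applying silently, matters in a governance context: a fail-mode flip is exactly the kind of event you want in the audit trail.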

Code-level mechanics

Dual-trigger reload loop (Go)

watch_config_changes.go
Go
interval := 30 * time.Second
ticker := time.NewTicker(interval) // fallback poll timer
defer ticker.Stop()
notifyCh := make(chan struct{}, 1)

natsBus.Subscribe("sys.config.changed", "", func(_ *pb.BusPacket) error {
  select {
  case notifyCh <- struct{}{}:
  default: // coalesce bursts into one pending reload
  }
  return nil
})

for {
  select {
  case <-ticker.C:
    reload("poll")
  case <-notifyCh:
    reload("notification")
  }
}

Hash-based apply gating (Go)

hash_gated_apply.go
Go
if snap.PoolsHash != "" && snap.PoolsHash != lastPoolsHash {
  strategy.UpdateRouting(buildRouting(snap.Pools))
  lastPoolsHash = snap.PoolsHash
}

if snap.TimeoutsHash != "" && snap.TimeoutsHash != lastTimeoutsHash {
  dispatch, running, _ := reconcilerTimeouts(snap.Timeouts)
  reconciler.UpdateTimeouts(dispatch, running)
  lastTimeoutsHash = snap.TimeoutsHash
}

Hash gating is the quiet hero here. It blocks repeated apply work when documents change revision but not effective content.
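What "normalized content hash" can look like in practice: a sketch assuming JSON config documents, where `contentHash` is a hypothetical helper, not Cordum's actual normalizer.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// contentHash hashes a config document by its canonical JSON encoding.
// Go's encoding/json sorts map keys when marshaling maps, so two
// documents with the same effective content produce identical bytes,
// and revision bumps without content changes yield the same hash.
func contentHash(doc map[string]any) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```

This is also where the "weak normalization" caveat from the limitations section bites: if the hash input includes revision counters or timestamps, every write looks like a content change and the gating buys you nothing.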

Timeout reload caveat (Go)

reconciler_timeouts.go
Go
func reconcilerTimeouts(cfg *config.TimeoutsConfig) (time.Duration, time.Duration, time.Duration) {
  dispatchTimeout := time.Duration(cfg.Reconciler.DispatchTimeoutSeconds) * time.Second
  if dispatchTimeout == 0 { dispatchTimeout = 2 * time.Minute }

  runningTimeout := time.Duration(cfg.Reconciler.RunningTimeoutSeconds) * time.Second
  if runningTimeout == 0 { runningTimeout = 5 * time.Minute }

  scanInterval := time.Duration(cfg.Reconciler.ScanIntervalSeconds) * time.Second
  if scanInterval == 0 { scanInterval = 30 * time.Second }
  return dispatchTimeout, runningTimeout, scanInterval
}

func (r *Reconciler) UpdateTimeouts(dispatchTimeout, runningTimeout time.Duration) {
  if dispatchTimeout > 0 { r.dispatchTimeout = dispatchTimeout }
  if runningTimeout > 0 { r.runningTimeout = runningTimeout }
}

Notice the mismatch: `reconcilerTimeouts` computes `scanInterval`, but live update applies only dispatch and running timeouts. If you change `scan_interval_seconds`, plan a restart window.

Operator runbook

Use a repeatable convergence check after every scheduler config rollout.

config_reload_convergence_runbook.sh
Bash
# 1) Apply a small config change via API
curl -X PUT "$CORDUM_URL/api/v1/config" \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"scheduler":{"input_fail_mode":"closed"}}'

# 2) Watch scheduler logs for notification-triggered reload
# expected: "config change notification received, reloading"

# 3) Confirm no-op changes do not re-apply
# expected: no repeated "routing updated" if hash unchanged

# 4) For timeout changes, verify dispatch/running effect live
# and schedule restart window if scan_interval_seconds changed

Limitations and tradeoffs

  • Lower poll intervals improve fallback speed but increase background Redis reads.
  • Broadcast notifications improve speed but still need polling as insurance.
  • Hash gating reduces churn, but weak normalization can still produce false positives.
  • Live timeout reload does not currently adjust the reconciler scan interval.

Fast reload without apply gating creates a reload storm. Slow reload without fallback creates config drift. You need both controls, or you get one very long on-call night.

Next step

Run a rollout drill this week:

  1. Apply a routing-only change and measure convergence across all scheduler replicas.
  2. Apply a timeout-only change and verify dispatch/running behavior updates live.
  3. Apply a scan-interval change and verify the restart requirement in your runbook.
  4. Capture max convergence latency as an SLO and alert on deviations.

Continue with Config Drift Detection and Stuck Job Recovery.

Consistency is a feature

A scheduler that converges quickly and quietly is not glamorous. It is, however, very good at preventing surprise incidents.