
AI Agent Config Reload Convergence

One missed reload can quietly shift safety posture across replicas.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • Hot reload is a convergence problem, not only a parsing problem.
  • The Cordum scheduler combines immediate `sys.config.changed` notifications with a 30s polling fallback.
  • Reload applies only when content hashes change, reducing no-op churn across replicas.
  • Live timeout reload currently updates dispatch and running windows; scan-interval changes require a restart.
Two reload triggers

Broadcast notification for speed, poll loop for missed events.

Hash gating

Pools and timeouts apply only when normalized content hash actually changes.

Known caveat

The reconciler scan interval is not updated by the live timeout reload path.

Scope

This guide focuses on scheduler-side config reload behavior: routing, timeout, and fail-mode convergence across replicas. It does not cover policy authoring itself.

The production problem

Dynamic config updates fail in two common ways: no update arrives, or every update arrives and still causes churn. Both are expensive. One causes stale behavior. The other burns CPU and operator trust.

In agent control planes, this is more than inconvenience. A stale scheduler config can keep the wrong fail mode, wrong routing, or wrong timeout profile active on part of the fleet.

You need fast convergence and low-noise apply logic at the same time. Getting only one of those is how incident retrospectives get longer than the outage.

What top results miss

Source | Strong coverage | Missing piece
Kubernetes tutorial: updating config via ConfigMap | Mounted ConfigMap updates and eventual propagation semantics. | No scheduler-specific apply gating or governance-mode reload behavior.
etcd API docs: Watch API | Event stream watches, revision-based updates, and long-running watch streams. | No direct guidance for mixed watch-plus-poll convergence in job schedulers.
Envoy xDS protocol docs | ACK/NACK semantics, nonce handling, and eventual consistency constraints. | No pre-dispatch policy-mode reload path where a stale fail mode can change risk posture.

The gap is governance-sensitive convergence: how to reload distributed scheduler behavior quickly while suppressing no-op updates and preserving replica consistency.

Cordum runtime behavior

Boundary | Current behavior | Operational impact
Immediate reload path | Scheduler subscribes to `sys.config.changed` and triggers reload on notification. | Fast cross-replica convergence after gateway config writes.
Fallback reload path | Polling loop runs every `30s` by default, configurable via `SCHEDULER_CONFIG_RELOAD_INTERVAL`. | Converges even if notifications are dropped or delayed.
Apply gating | Pools/timeouts are re-applied only when the snapshot hash differs from the previous hash. | Avoids no-op reload work and routing churn.
Timeout live update | Reload path recalculates dispatch/running timeouts and calls `reconciler.UpdateTimeouts(dispatch, running)`. | Dispatch/running windows change live; scan interval does not.
Fail-mode reload | Reload reads scheduler config keys: `input_fail_mode`, `output_fail_mode`, `output_policy_enabled`. | Risk posture can change at runtime without a scheduler restart.
Bootstrap file tracking | Bootstrap stores `_poolsFileHash` and `_timeoutsFileHash` in config document metadata. | File changes are detected and merged/reset deterministically.
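The fail-mode reload row deserves a concrete shape. Here is a minimal sketch of applying governance keys at runtime; the `SchedulerConfig` type and `applyFailModes` helper are illustrative names, not Cordum's actual API, and only the three config keys named above are assumed to exist.

```go
package main

// SchedulerConfig mirrors the governance keys read on reload.
// Field names here are illustrative, not the real Cordum types.
type SchedulerConfig struct {
	InputFailMode       string // e.g. "open" or "closed"
	OutputFailMode      string
	OutputPolicyEnabled bool
}

// applyFailModes swaps risk posture in place without a restart and
// returns a change list suitable for audit logging.
func applyFailModes(cur *SchedulerConfig, next SchedulerConfig) []string {
	var changes []string
	if cur.InputFailMode != next.InputFailMode {
		changes = append(changes, "input_fail_mode: "+cur.InputFailMode+" -> "+next.InputFailMode)
		cur.InputFailMode = next.InputFailMode
	}
	if cur.OutputFailMode != next.OutputFailMode {
		changes = append(changes, "output_fail_mode: "+cur.OutputFailMode+" -> "+next.OutputFailMode)
		cur.OutputFailMode = next.OutputFailMode
	}
	if cur.OutputPolicyEnabled != next.OutputPolicyEnabled {
		changes = append(changes, "output_policy_enabled toggled")
		cur.OutputPolicyEnabled = next.OutputPolicyEnabled
	}
	return changes
}
```

Returning the change list, rather than applying silently, matters in a governance context: a fail-mode flip is exactly the kind of event you want in the audit trail.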

Code-level mechanics

Dual-trigger reload loop (Go)

watch_config_changes.go
Go
interval := 30 * time.Second
ticker := time.NewTicker(interval) // fallback poll timer
defer ticker.Stop()
notifyCh := make(chan struct{}, 1)

natsBus.Subscribe("sys.config.changed", "", func(_ *pb.BusPacket) error {
  select {
  case notifyCh <- struct{}{}:
  default: // coalesce bursts into one pending reload
  }
  return nil
})

for {
  select {
  case <-ticker.C:
    reload("poll")
  case <-notifyCh:
    reload("notification")
  }
}

Hash-based apply gating (Go)

hash_gated_apply.go
Go
if snap.PoolsHash != "" && snap.PoolsHash != lastPoolsHash {
  strategy.UpdateRouting(buildRouting(snap.Pools))
  lastPoolsHash = snap.PoolsHash
}

if snap.TimeoutsHash != "" && snap.TimeoutsHash != lastTimeoutsHash {
  dispatch, running, _ := reconcilerTimeouts(snap.Timeouts)
  reconciler.UpdateTimeouts(dispatch, running)
  lastTimeoutsHash = snap.TimeoutsHash
}

Hash gating is the quiet hero here. It blocks repeated apply work when documents change revision but not effective content.
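What "normalized content hash" can look like in practice: a sketch assuming JSON config documents, where `contentHash` is a hypothetical helper, not Cordum's actual normalizer.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// contentHash hashes a config document by its canonical JSON encoding.
// Go's encoding/json sorts map keys when marshaling maps, so two
// documents with the same effective content produce identical bytes,
// and revision bumps without content changes yield the same hash.
func contentHash(doc map[string]any) (string, error) {
	b, err := json.Marshal(doc)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```

This is also where the "weak normalization" caveat from the limitations section bites: if the hash input includes revision counters or timestamps, every write looks like a content change and the gating buys you nothing.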

Timeout reload caveat (Go)

reconciler_timeouts.go
Go
func reconcilerTimeouts(cfg *config.TimeoutsConfig) (time.Duration, time.Duration, time.Duration) {
  dispatchTimeout := time.Duration(cfg.Reconciler.DispatchTimeoutSeconds) * time.Second
  if dispatchTimeout == 0 { dispatchTimeout = 2 * time.Minute }

  runningTimeout := time.Duration(cfg.Reconciler.RunningTimeoutSeconds) * time.Second
  if runningTimeout == 0 { runningTimeout = 5 * time.Minute }

  scanInterval := time.Duration(cfg.Reconciler.ScanIntervalSeconds) * time.Second
  if scanInterval == 0 { scanInterval = 30 * time.Second }
  return dispatchTimeout, runningTimeout, scanInterval
}

func (r *Reconciler) UpdateTimeouts(dispatchTimeout, runningTimeout time.Duration) {
  if dispatchTimeout > 0 { r.dispatchTimeout = dispatchTimeout }
  if runningTimeout > 0 { r.runningTimeout = runningTimeout }
}

Notice the mismatch: `reconcilerTimeouts` computes `scanInterval`, but live update applies only dispatch and running timeouts. If you change `scan_interval_seconds`, plan a restart window.

Operator runbook

Use a repeatable convergence check after every scheduler config rollout.

config_reload_convergence_runbook.sh
Bash
# 1) Apply a small config change via API
curl -X PUT "$CORDUM_URL/api/v1/config" \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"scheduler":{"input_fail_mode":"closed"}}'

# 2) Watch scheduler logs for notification-triggered reload
# expected: "config change notification received, reloading"

# 3) Confirm no-op changes do not re-apply
# expected: no repeated "routing updated" if hash unchanged

# 4) For timeout changes, verify dispatch/running effect live
# and schedule restart window if scan_interval_seconds changed

Limitations and tradeoffs

  • Lower poll intervals improve fallback speed but increase background Redis reads.
  • Broadcast notifications improve speed but still need polling as insurance.
  • Hash gating reduces churn, but weak normalization can still produce false positives.
  • Live timeout reload does not currently adjust the reconciler scan interval.

Fast reload without apply gating creates a reload storm. Slow reload without fallback creates config drift. You need both controls, or you get one very long on-call night.

Next step

Run a rollout drill this week:

  1. Apply a routing-only change and measure convergence across all scheduler replicas.
  2. Apply a timeout-only change and verify dispatch/running behavior updates live.
  3. Apply a scan-interval change and verify the restart requirement in your runbook.
  4. Capture max convergence latency as an SLO and alert on deviations.

Continue with Config Drift Detection and Stuck Job Recovery.

Consistency is a feature

A scheduler that converges quickly and quietly is not glamorous. It is, however, very good at preventing surprise incidents.