The production problem
Teams rotate broker certificates and send `SIGHUP` to `nats-server`. Everything looks clean on paper.
Then control-plane pods start reconnecting with stale client certs and auth failures. Nothing says pager fatigue like certificate expiry at 03:00.
The failure mode is simple: server cert lifecycle and client cert lifecycle are separate jobs. Many runbooks treat them as one.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Enabling TLS | Server TLS fields (`cert_file`, `key_file`, `ca_file`, `min_version`, client verify settings). | No end-to-end rollout sequence for rotating mTLS client certs across long-lived control-plane processes. |
| NATS docs: Signals | `--signal reload` (`SIGHUP`) reloads server configuration. | Does not solve client-side cert material already loaded in running applications. |
| nats.go package docs | `ClientCert`, `RootCAs`, and callback-based `ClientTLSConfig` for cert/CA material. | No production migration playbook from file-loaded static certs to callback-based reload behavior. |
Cordum runtime reality
In `core/infra/bus/nats.go`, Cordum builds options at startup, creates TLS config from environment, and then dials NATS.
That is deterministic and easy to reason about. It also means cert/key changes on disk are not automatically applied to already-running processes.
| Area | Current behavior | Operational impact |
|---|---|---|
| Connection setup | Cordum builds NATS options once in `NewNatsBus` and calls `nats.Connect(url, opts...)`. | TLS config shape is decided at process startup. |
| TLS material loading | `natsTLSConfigFromEnv()` reads `NATS_TLS_CA`, `NATS_TLS_CERT`, `NATS_TLS_KEY` and calls `tls.LoadX509KeyPair`. | Certificate/key are loaded from file during startup path, not by file watchers. |
| Reconnect behavior | Cordum uses `MaxReconnects(-1)` with `ReconnectWait(2 * time.Second)` and reconnect/disconnect callbacks. | If certificate auth fails, reconnect attempts continue indefinitely; the loop only ends when the process restarts or valid TLS material appears. |
| Default timing | NATS docs default TLS handshake timeout is 2 seconds; Cordum reconnect wait is 2 seconds. | Each failed attempt can burn roughly 4 seconds before the next cycle under cert mismatch conditions. |
```go
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
}
if strings.HasPrefix(url, "tls://") {
	tlsConfig, err := natsTLSConfigFromEnv()
	if err != nil {
		return nil, fmt.Errorf("nats tls config: %w", err)
	}
	if tlsConfig != nil {
		opts = append(opts, nats.Secure(tlsConfig))
	}
}
nc, err := nats.Connect(url, opts...)
```
```go
func natsTLSConfigFromEnv() (*tls.Config, error) {
	certPath := strings.TrimSpace(os.Getenv("NATS_TLS_CERT"))
	keyPath := strings.TrimSpace(os.Getenv("NATS_TLS_KEY"))
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if certPath != "" || keyPath != "" {
		if certPath == "" || keyPath == "" {
			return nil, fmt.Errorf("nats tls cert/key must be set together")
		}
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("nats tls keypair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return cfg, nil
}
```

Rotation rollout that holds up
Use overlap windows for old and new client cert trust. Rotate workload pods in small waves. Validate reconnect/auth logs before each wave.
Keep the math visible. With 2-second reconnect wait and 2-second handshake timeout defaults, your reconnect cycle budget is not infinite.
```bash
# Rotation budget inputs
RECONNECT_WAIT_SEC=2
TLS_HANDSHAKE_TIMEOUT_SEC=2
PODS=24
MAX_UNAVAILABLE=3

# Worst-case reconnect cycle per pod (simple model)
CYCLE_SEC=$((RECONNECT_WAIT_SEC + TLS_HANDSHAKE_TIMEOUT_SEC))

# Estimated per-wave exposure window
WAVE_WINDOW_SEC=$((CYCLE_SEC * 2))
echo "cycle=${CYCLE_SEC}s wave_window=${WAVE_WINDOW_SEC}s"

# Rollout in small waves
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum rollout status deploy/cordum-scheduler --timeout=10m

# Validate reconnect and auth errors between waves
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|tls|certificate|auth"
```

If your SLA budget is tight, reduce `maxUnavailable`, shrink each wave, and finish rotation faster by parallelizing across independent control-plane roles.
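The same budget math works as a Go preflight check if you prefer it in code. This is a sketch of the rough worst-case model from the table above, not a guarantee; the inputs (24 pods, `maxUnavailable` of 3, 2-second timings) are illustrative:

```go
package main

import "fmt"

// rotationBudget mirrors the shell math above: a simple worst-case
// exposure model for a staged restart, under illustrative assumptions.
func rotationBudget(pods, maxUnavailable, reconnectWaitSec, handshakeTimeoutSec int) (waves, totalSec int) {
	cycle := reconnectWaitSec + handshakeTimeoutSec // one failed connect cycle
	waveWindow := cycle * 2                         // per-wave exposure estimate

	// Ceiling division: how many restart waves the rollout needs.
	waves = (pods + maxUnavailable - 1) / maxUnavailable
	return waves, waves * waveWindow
}

func main() {
	waves, total := rotationBudget(24, 3, 2, 2)
	fmt.Printf("waves=%d worst_case_exposure=%ds\n", waves, total) // waves=8 worst_case_exposure=64s
}
```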
Optional hot-reload design
`nats.go` exposes `ClientTLSConfig` with certificate and root-CA callbacks. That can support a no-restart model if your TLS material source is reloadable.
This is more code and more failure modes. Only adopt it after proving reconnect semantics in staging under forced certificate swaps.
```go
// Optional design for no-restart cert rotation.
// Keep current static approach unless you can validate this behavior in staging.
type TLSMaterial interface {
	Cert() (tls.Certificate, error)
	RootCAs() (*x509.CertPool, error)
}

func NatsOptionsFromMaterial(m TLSMaterial) []nats.Option {
	return []nats.Option{
		nats.ClientTLSConfig(
			func() (tls.Certificate, error) { return m.Cert() },
			func() (*x509.CertPool, error) { return m.RootCAs() },
		),
	}
}

// Rotation note:
// replace m's backing cert/CA atomically,
// then force a reconnect test to verify the new material is used.
```

Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Overlap + staged restarts (current Cordum-friendly) | Low code risk, predictable rollout, easy rollback by redeploying previous secret version. | Requires restart orchestration and cert overlap window discipline. |
| Callback-based client TLS material | Potentially reduces restart dependency for cert changes. | More implementation complexity and stricter staging validation requirements. |
| Big-bang replacement | Operationally simple on paper. | Highest blast radius. Usually becomes a pager event. |
If you are on static startup loading today, that is fine. Just design the operational process around it instead of assuming runtime magic exists.
Next step
Run one controlled rotation drill in staging this week: new cert issuance, staggered scheduler restarts, reconnect/auth log checks, and old-cert revocation only after full convergence.
Then automate the same sequence in your release pipeline so certificate rotation is a routine change, not an incident template.