The production problem
Teams rotate broker certificates and send `SIGHUP` to `nats-server`. Everything looks clean on paper.
Then control-plane pods start reconnecting with stale client certs and auth failures. Nothing says pager fatigue like certificate expiry at 03:00.
The failure mode is simple: server cert lifecycle and client cert lifecycle are separate jobs. Many runbooks treat them as one.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Enabling TLS | Server TLS fields (`cert_file`, `key_file`, `ca_file`, `min_version`, client verify settings), including bundled CA files. | No strict rollout order for dual-CA trust windows with long-lived control-plane clients. |
| NATS docs: Signals | `--signal reload` (`SIGHUP`) reloads server configuration. | Does not solve client-side cert material already loaded in running applications. |
| nats.go package docs | `ClientCert`, `RootCAs`, and callback-based `ClientTLSConfig` for cert/CA material. | No production migration playbook from file-loaded static certs to callback-based reload behavior. |
Cordum runtime reality
In `core/infra/bus/nats.go`, Cordum builds options at startup, creates TLS config from environment, and then dials NATS.
That is deterministic and easy to reason about. It also means cert/key changes on disk are not automatically applied to already-running processes.
| Area | Current behavior | Operational impact |
|---|---|---|
| Connection setup | Cordum builds NATS options once in `NewNatsBus` and calls `nats.Connect(url, opts...)`. | TLS config shape is decided at process startup. |
| TLS material loading | `natsTLSConfigFromEnv()` reads `NATS_TLS_CA`, `NATS_TLS_CERT`, `NATS_TLS_KEY` and calls `tls.LoadX509KeyPair`. | Certificate/key are loaded from file during startup path, not by file watchers. |
| Trust anchor updates | NATS server `ca_file` can include multiple CAs, and `--signal reload` refreshes server config. | Use this to run a temporary old+new trust overlap while clients rotate. |
| Reconnect behavior | Cordum uses `MaxReconnects(-1)` with `ReconnectWait(2 * time.Second)` and reconnect/disconnect callbacks. | If certificate auth is invalid, reconnect attempts continue indefinitely until the process is restarted or its TLS material changes. |
| Default timing | The NATS server's default TLS handshake timeout is 2 seconds; Cordum's reconnect wait is 2 seconds. | Under a cert mismatch, each failed attempt can burn roughly 4 seconds before the next cycle. |
```go
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
}
if strings.HasPrefix(url, "tls://") {
	tlsConfig, err := natsTLSConfigFromEnv()
	if err != nil {
		return nil, fmt.Errorf("nats tls config: %w", err)
	}
	if tlsConfig != nil {
		opts = append(opts, nats.Secure(tlsConfig))
	}
}
nc, err := nats.Connect(url, opts...)
```
```go
func natsTLSConfigFromEnv() (*tls.Config, error) {
	certPath := strings.TrimSpace(os.Getenv("NATS_TLS_CERT"))
	keyPath := strings.TrimSpace(os.Getenv("NATS_TLS_KEY"))
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if certPath != "" || keyPath != "" {
		if certPath == "" || keyPath == "" {
			return nil, fmt.Errorf("nats tls cert/key must be set together")
		}
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("nats tls keypair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return cfg, nil
}
```

Rotation rollout that holds up
Use overlap windows for old and new client cert trust. Rotate workload pods in small waves, and validate reconnect/auth logs after each wave before starting the next.
The order is the key detail many guides skip: broaden trust first, rotate clients second, remove old trust last.
```sh
# Build dual-CA bundle for temporary overlap
cat old-client-ca.pem new-client-ca.pem > client-ca-bundle.pem
```
```conf
tls: {
  cert_file: "./server-cert.pem"
  key_file: "./server-key.pem"
  ca_file: "./client-ca-bundle.pem"
  verify: true
}
```

Keep rollout windows bounded. If auth errors rise between waves, stop and fix trust material before continuing.
```sh
# 1) Broaden server trust first (old + new client CA), then reload server
cat old-client-ca.pem new-client-ca.pem > /etc/nats/client-ca-bundle.pem
nats-server --signal reload=/var/run/nats/nats-server.pid

# 2) Publish new client cert secret for Cordum workloads
kubectl -n cordum apply -f scheduler-nats-mtls-secret.yaml

# 3) Restart workloads in small waves
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum rollout status deploy/cordum-scheduler --timeout=10m

# 4) Validate reconnect/auth logs after each wave
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|tls|certificate|auth"

# 5) After full convergence, remove old CA and reload server again
cat new-client-ca.pem > /etc/nats/client-ca-bundle.pem
nats-server --signal reload=/var/run/nats/nats-server.pid
```
If your SLA budget is tight, reduce `maxUnavailable`, shrink each wave, and finish rotation faster by parallelizing across independent control-plane roles.
Optional hot-reload design
`nats.go` exposes `ClientTLSConfig` with certificate and root-CA callbacks. That can support a no-restart model if your TLS material source is reloadable.
This is more code and more failure modes. Only adopt it after proving reconnect semantics in staging under forced certificate swaps.
```go
// Optional design for no-restart cert rotation.
// Keep current static approach unless you can validate this behavior in staging.
type TLSMaterial interface {
	Cert() (tls.Certificate, error)
	RootCAs() (*x509.CertPool, error)
}

func NatsOptionsFromMaterial(m TLSMaterial) []nats.Option {
	return []nats.Option{
		nats.ClientTLSConfig(
			func() (tls.Certificate, error) { return m.Cert() },
			func() (*x509.CertPool, error) { return m.RootCAs() },
		),
	}
}

// Rotation note:
// replace m's backing cert/CA atomically,
// then force a reconnect test to verify the new material is used.
```

Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Dual-CA overlap + staged restarts (current Cordum-friendly) | Low code risk, predictable rollout, easy rollback by redeploying previous secret version. | Requires restart orchestration and cert overlap window discipline. |
| Callback-based client TLS material | Potentially reduces restart dependency for cert changes. | More implementation complexity and stricter staging validation requirements. |
| Big-bang replacement | Operationally simple on paper. | Highest blast radius. Usually becomes a pager event. |
If you are on static startup loading today, that is fine. Just design the operational process around it instead of assuming runtime magic exists.
Next step
Run one controlled rotation drill in staging this week: new cert issuance, staggered scheduler restarts, reconnect/auth log checks, and old-cert revocation only after full convergence.
Then automate the same sequence in your release pipeline so certificate rotation is a routine change, not an incident template.