The production problem
Cert expiry incidents are boring until they happen at 2 AM. Then they become very educational.
Rotating a Safety Kernel cert is high-impact because the scheduler and gateway depend on that channel for pre-dispatch decisions. A broken handshake path can quickly degrade governance availability.
The failure mode is usually not cryptography. It is rollout sequencing, reconnect assumptions, and missing rollback criteria.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Authentication Guide | TLS/mTLS channel authentication fundamentals and credential API model. | No service-specific certificate reload and reconnect runbook for long-lived control-plane channels. |
| Kubernetes kubelet certificate rotation | Automatic renewal flow and reconnect behavior as cert expiry approaches. | No app-level `GetCertificate` callback strategy for gRPC listeners with custom keypairs. |
| SPIRE use cases | Short-lived, automatically rotated workload certificates for mTLS. | No direct guidance on mixed static CA files plus app-managed cert reload logic. |
The gap is application-level rotation choreography: watcher cadence, handshake cutover timing, and client reconnect policy.
Rotation model
| Phase | Objective | Failure mode |
|---|---|---|
| Pre-rotation validation | Confirm current cert CN/SAN, expiry horizon, and chain trust | Rotate blindly and discover trust mismatch only after production reconnect |
| Write new cert+key atomically | Update disk artifacts in one deploy step | Split writes produce temporary invalid keypair and reload errors |
| Wait for reload window | Allow watcher tick to pick up file modifications (<=30s default) | Immediate validation before watch tick creates false incident signal |
| Force controlled reconnects | Move existing channels onto new server cert | Long-lived channels continue old handshake state longer than expected |
| Rollback readiness | Keep previous keypair available until post-rotation checks pass | No rollback path during trust-chain or hostname validation failures |
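The pre-rotation validation phase can be approximated in a few lines of Go. This is a minimal sketch, not Cordum tooling: `validateLeaf`, `selfSignedPEM`, `checkScenarios`, and the `safety-kernel.example.internal` hostname are hypothetical names chosen for illustration. It checks chain trust, hostname match, and remaining lifetime before anything is written to disk. One detail worth knowing: Go's verifier matches hostnames against SANs only and ignores the CN.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// validateLeaf checks chain trust, SAN match, and remaining lifetime for a
// PEM-encoded leaf certificate before it is rolled out.
func validateLeaf(leafPEM, caPEM []byte, dnsName string, minRemaining time.Duration) error {
	block, _ := pem.Decode(leafPEM)
	if block == nil || block.Type != "CERTIFICATE" {
		return errors.New("leaf: no CERTIFICATE PEM block found")
	}
	leaf, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		return fmt.Errorf("leaf: parse: %w", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("ca: no certificates found in bundle")
	}
	// Verify covers chain trust, the validity window, and the hostname (SAN) match.
	opts := x509.VerifyOptions{Roots: roots, DNSName: dnsName}
	if _, err := leaf.Verify(opts); err != nil {
		return fmt.Errorf("leaf: verify: %w", err)
	}
	if remaining := time.Until(leaf.NotAfter); remaining < minRemaining {
		return fmt.Errorf("leaf: expires in %s, want at least %s",
			remaining.Round(time.Second), minRemaining)
	}
	return nil
}

// selfSignedPEM builds a throwaway self-signed certificate for the demo only.
func selfSignedPEM(dnsName string, ttl time.Duration) ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: dnsName},
		DNSNames:              []string{dnsName},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(ttl),
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

// checkScenarios validates one demo cert three ways: the expected name,
// a wrong name, and an expiry horizon longer than the cert's lifetime.
func checkScenarios() (okErr, nameErr, expiryErr error) {
	certPEM, err := selfSignedPEM("safety-kernel.example.internal", 24*time.Hour)
	if err != nil {
		return err, err, err
	}
	// The demo cert is self-signed, so it doubles as its own CA bundle.
	okErr = validateLeaf(certPEM, certPEM, "safety-kernel.example.internal", time.Hour)
	nameErr = validateLeaf(certPEM, certPEM, "other.example.internal", time.Hour)
	expiryErr = validateLeaf(certPEM, certPEM, "safety-kernel.example.internal", 48*time.Hour)
	return okErr, nameErr, expiryErr
}

func main() {
	okErr, nameErr, expiryErr := checkScenarios()
	fmt.Println("valid cert:", okErr)
	fmt.Println("wrong SAN:", nameErr)
	fmt.Println("short horizon:", expiryErr)
}
```

In a real rehearsal the self-signed demo cert would be replaced by the staged `server-new.crt` and by the CA bundle the scheduler and gateway actually load, so a trust mismatch shows up before any client reconnects.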
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Initial load gate | `NewCertReloader` loads cert/key at startup and fails fast on invalid keypair. | Prevents boot with broken TLS material. |
| Reload trigger | `WatchLoop` polls cert/key file mtimes; interval defaults to `30s` when non-positive. | Predictable reload latency bound after secret update. |
| Handshake behavior | `tls.Config.GetCertificate` returns latest in-memory cert from reloader. | New inbound connections pick up rotated cert without process restart. |
| Existing connections | Reloader swaps cert for future handshakes; established channels are not forcibly reset. | You need reconnect strategy to complete migration quickly. |
| Client trust material | Scheduler/gateway TLS CA file is read when creating transport credentials. | CA bundle changes require redial path, restart, or controlled reconnection. |
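The handshake and existing-connection rows can be reproduced locally. The following is a sketch under stated assumptions, not Cordum's `CertReloader`: `certHolder`, `newCert`, and `rotateDemo` are hypothetical names, the certs are throwaway self-signed keypairs, and the client skips verification purely so the demo stays short. It swaps the in-memory cert between two dials and reports which serial number each connection negotiated.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io"
	"math/big"
	"sync/atomic"
	"time"
)

// certHolder is a hypothetical stand-in for the in-memory half of a reloader.
type certHolder struct {
	cert atomic.Pointer[tls.Certificate]
}

// GetCertificate has the signature that tls.Config.GetCertificate expects.
func (h *certHolder) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return h.cert.Load(), nil
}

// newCert builds a throwaway self-signed keypair identified by its serial number.
func newCert(serial int64) (*tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		DNSNames:     []string{"localhost"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return &tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, nil
}

// rotateDemo reports the server cert serial seen by a connection opened before
// the swap, by one opened after it, and by the first connection re-inspected.
func rotateDemo() (before, after, beforeAgain int64, err error) {
	var certs [2]*tls.Certificate
	for i := range certs {
		if certs[i], err = newCert(int64(i + 1)); err != nil {
			return
		}
	}
	holder := &certHolder{}
	holder.cert.Store(certs[0])
	ln, err := tls.Listen("tcp", "127.0.0.1:0", &tls.Config{GetCertificate: holder.GetCertificate})
	if err != nil {
		return
	}
	defer ln.Close()
	go func() {
		for {
			c, acceptErr := ln.Accept()
			if acceptErr != nil {
				return // listener closed
			}
			go func() {
				defer c.Close()
				io.Copy(io.Discard, c) // reading drives the server-side handshake
			}()
		}
	}()
	serial := func(c *tls.Conn) int64 {
		return c.ConnectionState().PeerCertificates[0].SerialNumber.Int64()
	}
	// Demo only: verification is skipped because the certs are self-signed throwaways.
	clientCfg := &tls.Config{InsecureSkipVerify: true}
	first, err := tls.Dial("tcp", ln.Addr().String(), clientCfg)
	if err != nil {
		return
	}
	defer first.Close()
	before = serial(first)
	holder.cert.Store(certs[1]) // the "reload": future handshakes get the new cert
	second, err := tls.Dial("tcp", ln.Addr().String(), clientCfg)
	if err != nil {
		return
	}
	defer second.Close()
	return before, serial(second), serial(first), nil
}

func main() {
	fmt.Println(rotateDemo())
}
```

The connection opened before the swap should still report serial 1 after the swap, while the second dial sees serial 2. That is the behavior a reconnect strategy has to account for: the server has moved on, but an established channel only picks up the new cert on its next handshake.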
Implementation examples
Reloader watch loop (Go)

    func (r *CertReloader) WatchLoop(ctx context.Context, interval time.Duration) {
        // Fall back to the 30s default for zero or negative intervals.
        if interval <= 0 {
            interval = 30 * time.Second
        }
        lastCertMod, lastKeyMod := r.modTimes()
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                certMod, keyMod := r.modTimes()
                // Skip the tick when neither file has changed.
                if certMod.Equal(lastCertMod) && keyMod.Equal(lastKeyMod) {
                    continue
                }
                // On failure, keep the old mtimes so the next tick retries.
                if err := r.reload(); err != nil {
                    slog.Error("tls cert reload failed", "label", r.label, "error", err)
                    continue
                }
                lastCertMod, lastKeyMod = certMod, keyMod
                slog.Info("tls cert reloaded", "label", r.label)
            }
        }
    }

Safety Kernel integration (Go)
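
    // Fail fast at startup if the keypair on disk is invalid.
    reloader, err := tlsreload.NewCertReloader(certPath, keyPath, "safety-kernel")
    if err != nil {
        return fmt.Errorf("safety kernel tls keypair: %w", err)
    }
    // Poll the cert and key files for changes in the background.
    go reloader.WatchLoop(context.Background(), 30*time.Second)
    tlsCfg := &tls.Config{
        // Called on every handshake, so new connections get the latest cert.
        GetCertificate: reloader.GetCertificate,
        MinVersion:     tls.VersionTLS12,
    }
    serverCreds := grpc.Creds(credentials.NewTLS(tlsCfg))

Rotation runbook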

    # 1) Stage new cert and key files
    kubectl -n cordum create secret tls safety-kernel-tls \
      --cert=server-new.crt --key=server-new.key \
      --dry-run=client -o yaml | kubectl apply -f -

    # 2) Confirm reload log appears (watch loop is 30s by default)
    kubectl logs -n cordum deploy/cordum-safety-kernel | grep "tls cert reloaded"

    # 3) Force controlled reconnect from clients (example: rollout restart)
    kubectl rollout restart deploy/cordum-scheduler -n cordum
    kubectl rollout restart deploy/cordum-gateway -n cordum

    # 4) Verify handshake uses new cert fingerprint before removing old material
    # (use your normal TLS probe / openssl check against SAFETY_KERNEL_ADDR)

Limitations and tradeoffs
- Polling reload (30s default) is simple and robust, but not instant.
- Hot reload avoids a full server restart, but does not migrate active client channels automatically.
- Frequent cert turnover makes operational debugging harder if fingerprints and expiry are not logged.
- CA rotation is a separate concern from leaf cert rotation and needs explicit client lifecycle handling.
If you rotate cert files but never recycle long-lived clients, you can pass superficial checks while old channels keep running on stale handshake state.
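Logging the fingerprint and expiry on every reload is cheap and makes stale channels easier to spot. The sketch below assumes a hypothetical `logLeaf` helper called after each successful reload; the `tls cert active` message and its field names are illustrative and not existing Cordum log lines.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"log/slog"
)

// fingerprint returns the lowercase hex SHA-256 digest of DER-encoded bytes.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// logLeaf records which leaf certificate is now being served.
// Hypothetical helper: intended to run after each successful reload.
func logLeaf(label string, cert *tls.Certificate) error {
	if cert == nil || len(cert.Certificate) == 0 {
		return errors.New("keypair contains no certificate")
	}
	// The first entry in the chain is the leaf.
	leaf, err := x509.ParseCertificate(cert.Certificate[0])
	if err != nil {
		return err
	}
	slog.Info("tls cert active",
		"label", label,
		"sha256", fingerprint(leaf.Raw),
		"not_after", leaf.NotAfter.UTC(),
		"dns_names", leaf.DNSNames,
	)
	return nil
}

func main() {
	// An empty keypair is rejected instead of logging a misleading fingerprint.
	if err := logLeaf("safety-kernel", &tls.Certificate{}); err != nil {
		slog.Error("cannot describe cert", "error", err)
	}
}
```

The digest is computed over the DER bytes of the leaf, so it should match `openssl x509 -noout -fingerprint -sha256` output once colons and letter case are normalized.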
Next step
Run one full rotation rehearsal this week:
1. Rotate Safety Kernel cert/key in staging and confirm `tls cert reloaded` log appears.
2. Measure cutover delay from secret update to first successful new-handshake probe.
3. Trigger controlled scheduler/gateway reconnect and verify new cert fingerprint in traffic checks.
4. Execute rollback once to verify old keypair reactivation path and alert noise profile.
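The cutover measurement in the rehearsal can be scripted with a small probe. The following is a sketch, not an existing Cordum tool: `servedFingerprint`, `waitForCutover`, and `probeDemo` are hypothetical names, and a local `httptest` TLS server stands in for the Safety Kernel address. The probe skips chain verification on purpose, because its only job is to report which cert is being served; trust is checked separately.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"net/http/httptest"
	"time"
)

// servedFingerprint dials addr and returns the SHA-256 fingerprint of the
// leaf certificate the server presents during the handshake.
func servedFingerprint(addr string, timeout time.Duration) (string, error) {
	dialer := &net.Dialer{Timeout: timeout}
	// Identification only: this probe does not validate the chain or hostname.
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", errors.New("server presented no certificate")
	}
	sum := sha256.Sum256(certs[0].Raw)
	return hex.EncodeToString(sum[:]), nil
}

// waitForCutover polls addr until it serves wantFP and returns the elapsed
// time, or an error once the deadline has passed.
func waitForCutover(addr, wantFP string, every, deadline time.Duration) (time.Duration, error) {
	start := time.Now()
	for {
		got, err := servedFingerprint(addr, 2*time.Second)
		if err == nil && got == wantFP {
			return time.Since(start), nil
		}
		if time.Since(start) >= deadline {
			return time.Since(start), fmt.Errorf(
				"fingerprint %s not served within %s (last error: %v)", wantFP, deadline, err)
		}
		time.Sleep(every)
	}
}

// probeDemo runs the probe against a local test server that stands in for
// the real Safety Kernel listener.
func probeDemo() (time.Duration, error) {
	srv := httptest.NewTLSServer(nil)
	defer srv.Close()
	sum := sha256.Sum256(srv.Certificate().Raw)
	want := hex.EncodeToString(sum[:])
	return waitForCutover(srv.Listener.Addr().String(), want, 100*time.Millisecond, 2*time.Second)
}

func main() {
	elapsed, err := probeDemo()
	fmt.Println("cutover observed after", elapsed, "error:", err)
}
```

For a real measurement, start `waitForCutover` immediately after the secret update, pointing it at the Safety Kernel address with the fingerprint of the new cert. The returned duration then covers secret propagation plus the watch-loop interval.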
Continue with Safety Kernel TLS Hardening and Policy URL SSRF Hardening.