## The production problem
Reconnect incidents happen at 03:00. Teams discover them by reading logs after customer impact starts.
The system had the callback signals all along; they were just never promoted to metrics and alert thresholds.
Logs are evidence. Metrics are control.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Listening for Reconnect Events | Callback hooks for disconnect/reconnect/close events. | No metric model or alert thresholds for large multi-replica control-plane deployments. |
| NATS docs: Automatic Reconnections | Reconnect flow and option-level behavior. | Does not define incident-oriented observability KPIs or SLO mappings. |
| nats.go package docs | Handlers and status-change surfaces exposed by the Go client. | No operator playbook for combining callbacks, metrics, and paging policy. |
## Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Disconnect callback | Cordum logs disconnect errors via `nats.DisconnectErrHandler`. | Good raw signal. Hard to aggregate without structured counters. |
| Reconnect callback | Cordum logs reconnect with connected URL via `nats.ReconnectHandler`. | Useful for incident timeline reconstruction and broker failover visibility. |
| Closed callback | Cordum logs connection closure through `nats.ClosedHandler`. | Captures terminal transitions but lacks an objective outage-duration metric. |
| Metrics surface | No dedicated reconnect counters/histograms are emitted in current bus setup. | Alerting depends on log search, which is slower and noisier under pressure. |
## Code-level mechanics

### 1) Existing callback hooks
```go
opts := []nats.Option{
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Info("bus: disconnected from nats", "err", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
	}),
	nats.ClosedHandler(func(nc *nats.Conn) {
		slog.Info("bus: connection closed")
	}),
}
```

### 2) Minimal metrics augmentation
```go
// Example metric wiring
var (
	natsDisconnects = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "cordum_nats_disconnect_total",
		Help: "Total NATS disconnect events",
	}, []string{"component"})
	natsReconnects = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "cordum_nats_reconnect_total",
		Help: "Total NATS reconnect events",
	}, []string{"component"})
)
```
```go
nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
	natsDisconnects.WithLabelValues("scheduler").Inc()
	slog.Info("bus: disconnected from nats", "err", err)
})
nats.ReconnectHandler(func(nc *nats.Conn) {
	natsReconnects.WithLabelValues("scheduler").Inc()
	slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
})
```

## Operator runbook
Prove alerting works before production by forcing reconnect events in staging and checking both logs and counters.
```shell
# 1) Baseline reconnect event rates
kubectl -n cordum logs deploy/cordum-scheduler \
  | rg "disconnected from nats|reconnected to nats|connection closed"

# 2) Trigger controlled broker turbulence in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Validate metric counters increment as expected
#    cordum_nats_disconnect_total
#    cordum_nats_reconnect_total

# 4) Add alerting
#    page if disconnect_total increases and reconnect_total lags over 5m
#    page if reconnect churn exceeds threshold for 15m
```
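Step 4 can be sketched as Prometheus alerting rules. This is a hedged starting point under stated assumptions, not tuned policy: the rule names, the `5m`/`15m` windows, and the `> 3` churn threshold are placeholders to adjust per environment, and the expressions assume the counter names from the metrics wiring above.

```yaml
groups:
  - name: cordum-nats-reconnect
    rules:
      # Disconnects not matched by reconnects over 5m: likely an ongoing outage.
      - alert: CordumNatsReconnectLagging
        expr: >
          increase(cordum_nats_disconnect_total[5m])
          - increase(cordum_nats_reconnect_total[5m]) > 0
        for: 5m
        labels:
          severity: page
      # Sustained reconnect churn: a flapping broker or unstable network path.
      - alert: CordumNatsReconnectChurn
        expr: increase(cordum_nats_reconnect_total[15m]) > 3
        for: 15m
        labels:
          severity: page
```

Validate the fragment with `promtool check rules` before loading it, and run the staging drill above to confirm both alerts actually fire.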
## Limitations and tradeoffs
| Approach | Benefit | Cost |
|---|---|---|
| Logs only | No code changes and immediate local visibility. | Slow incident detection and weak trend analysis. |
| Counters + logs | Fast alerting with retained forensic detail. | Requires label discipline and alert tuning to avoid noise. |
| Full SLO burn-rate model | Clear error-budget policy and escalation thresholds. | More upfront instrumentation and dashboard work. |
## Next step
Add reconnect counters for one component this week, run one broker-failure drill, and wire a 5-minute burn alert before expanding to the rest of the control plane.