Deep Dive

AI Agent NATS Reconnect Observability

Reconnect callbacks are necessary. They are not enough until they are measurable.

10 min read · Mar 2026
TL;DR
  • Cordum already hooks `DisconnectErrHandler`, `ReconnectHandler`, and `ClosedHandler` in its NATS client options.
  • The current handlers log useful events, but logs alone are weak for automated incident detection.
  • Add reconnect counters, outage duration histograms, and burn-rate alerts tied to SLO budgets.
  • Keep logs for forensics, but promote callbacks to first-class metrics for operational control.
Signals exist

Reconnect callbacks are already wired. The missing piece is quantitative aggregation.

Alerting gap

Without counters and durations, reconnect storms are discovered by humans too late.

SLO path

Callback events can map directly to error-budget burn alerts and recovery-time objectives.

Scope

This guide focuses on reconnect telemetry for Cordum bus connectivity, not full broker-level NATS server monitoring configuration.

The production problem

Reconnect incidents happen at 03:00. Teams discover them by reading logs after customer impact starts.

The system had callback signals all along. They were not promoted to metrics and alert thresholds.

Logs are evidence. Metrics are control.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Listening for Reconnect Events | Callback hooks for disconnect/reconnect/close events. | No metric model or alert thresholds for large multi-replica control-plane deployments. |
| NATS docs: Automatic Reconnections | Reconnect flow and option-level behavior. | Does not define incident-oriented observability KPIs or SLO mappings. |
| nats.go package docs | Handlers and status-change surfaces exposed by the Go client. | No operator playbook for combining callbacks, metrics, and paging policy. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Disconnect callback | Cordum logs disconnect errors via `nats.DisconnectErrHandler`. | Good raw signal, but hard to aggregate without structured counters. |
| Reconnect callback | Cordum logs reconnect with the connected URL via `nats.ReconnectHandler`. | Useful for incident timeline reconstruction and broker failover visibility. |
| Closed callback | Cordum logs connection closure through `nats.ClosedHandler`. | Captures terminal transitions, but lacks an objective outage duration metric. |
| Metrics surface | No dedicated reconnect counters or histograms are emitted in the current bus setup. | Alerting depends on log search, which is slower and noisier under pressure. |

Code-level mechanics

1) Existing callback hooks

core/infra/bus/nats.go
```go
opts := []nats.Option{
  nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Info("bus: disconnected from nats", "err", err)
  }),
  nats.ReconnectHandler(func(nc *nats.Conn) {
    slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
  }),
  nats.ClosedHandler(func(nc *nats.Conn) {
    slog.Info("bus: connection closed")
  }),
}
```

2) Minimal metrics augmentation

metrics patch example
```go
// Example metric wiring
import (
  "log/slog"

  "github.com/nats-io/nats.go"
  "github.com/prometheus/client_golang/prometheus"
  "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
  natsDisconnects = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "cordum_nats_disconnect_total",
    Help: "Total NATS disconnect events",
  }, []string{"component"})

  natsReconnects = promauto.NewCounterVec(prometheus.CounterOpts{
    Name: "cordum_nats_reconnect_total",
    Help: "Total NATS reconnect events",
  }, []string{"component"})
)

nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
  natsDisconnects.WithLabelValues("scheduler").Inc()
  slog.Info("bus: disconnected from nats", "err", err)
})

nats.ReconnectHandler(func(nc *nats.Conn) {
  natsReconnects.WithLabelValues("scheduler").Inc()
  slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
})
```

Operator runbook

Prove alerting works before production by forcing reconnect events in staging and checking both logs and counters.

staging-runbook.sh
```bash
# 1) Baseline reconnect event rates
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|connection closed"

# 2) Trigger controlled broker turbulence in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Validate metric counters increment as expected
#    cordum_nats_disconnect_total
#    cordum_nats_reconnect_total

# 4) Add alerting
#    page if disconnect_total increases and reconnect_total lags over 5m
#    page if reconnect churn exceeds threshold for 15m
```
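The paging conditions in step 4 translate into Prometheus alerting rules along these lines. This is a sketch under stated assumptions: the metric names match the counters shown earlier, but the 5m/15m windows and the churn threshold are illustrative and need tuning per deployment.

```yaml
groups:
  - name: cordum-nats-connectivity
    rules:
      - alert: CordumNatsReconnectLagging
        # Disconnects are occurring faster than reconnects are recovering.
        expr: >
          increase(cordum_nats_disconnect_total[5m])
            - increase(cordum_nats_reconnect_total[5m]) > 0
        for: 5m
        labels:
          severity: page
      - alert: CordumNatsReconnectChurn
        # Sustained reconnect flapping over 15 minutes (threshold illustrative).
        expr: rate(cordum_nats_reconnect_total[15m]) > 0.05
        for: 15m
        labels:
          severity: page
```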

Limitations and tradeoffs

| Approach | Benefit | Cost |
| --- | --- | --- |
| Logs only | No code changes and immediate local visibility. | Slow incident detection and weak trend analysis. |
| Counters + logs | Fast alerting with retained forensic detail. | Requires label discipline and alert tuning to avoid noise. |
| Full SLO burn-rate model | Clear error-budget policy and escalation thresholds. | More upfront instrumentation and dashboard work. |

Next step

Add reconnect counters for one component this week, run one broker-failure drill, and wire a 5-minute burn alert before expanding to the rest of the control plane.
