The production problem
A broker outage starts. Control-plane replicas keep trying to publish operational events.
The reconnect buffer fills, and publish calls begin returning errors. If those errors are ignored upstream, state signals vanish and operators only see the symptoms later.
This is not rare edge behavior. It is the default reconnect behavior under a sustained disconnect.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and the risk that app-level send appears successful while delivery fails. | No concrete sizing process for multi-component control planes with different message criticality classes. |
| NATS docs: Automatic Reconnections | Reconnect lifecycle, callback hooks, and high-level reliability caveats. | Does not map buffer boundaries to mixed Core NATS vs JetStream publish paths. |
| nats.go package docs | `ReconnectBufSize` semantics, default 8MB, and overflow behavior via publish errors. | No workload-specific guidance for choosing buffer size versus memory headroom in orchestrated replicas. |
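The sizing gap called out above can be closed with simple arithmetic: estimate how long a given reconnect buffer survives at a sustained publish rate before the overflow error appears. A minimal sketch; the message sizes and publish rates below are illustrative assumptions, not measured Cordum numbers:

```go
package main

import "fmt"

// outageHeadroom returns how many seconds a reconnect buffer of
// bufBytes can absorb publishes of msgBytes each at msgsPerSec
// before publish calls start returning errors.
func outageHeadroom(bufBytes, msgBytes, msgsPerSec int) float64 {
	return float64(bufBytes) / float64(msgBytes*msgsPerSec)
}

func main() {
	// Default 8MB buffer, 1KB packets, 200 publishes/sec per replica.
	fmt.Printf("%.1fs\n", outageHeadroom(8*1024*1024, 1024, 200)) // → 41.0s
	// A 32MB buffer under the same assumed load.
	fmt.Printf("%.1fs\n", outageHeadroom(32*1024*1024, 1024, 200)) // → 163.8s
}
```

Run this per component and per criticality class: a buffer that outlives your longest tolerated outage window for one component may be far too small for a chattier one.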
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Reconnect buffer default | Cordum bus options do not set `ReconnectBufSize`; client default applies. | Queueing limit during disconnect is implicit unless operator changes code. |
| Overflow behavior | When reconnect buffer fills, publish returns an error. | Callers must handle publish failures or lose control signals under outage stress. |
| Durable subjects | With JetStream enabled, durable subjects publish through `b.js.Publish(...)`. | Durable path behavior differs from core publish buffering and should be monitored separately. |
| Core broadcast subjects | Heartbeat/handshake/config-change style subjects intentionally stay on Core NATS. | These paths are exposed to reconnect buffer limits and at-most-once semantics. |
Code-level mechanics
1) Core vs durable publish path
```go
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
	...
	if b != nil && b.jsEnabled && isDurableSubject(subject) {
		_, err = b.js.Publish(subject, data)
		if err != nil {
			return fmt.Errorf("publish %s: %w", subject, err)
		}
		return nil
	}
	if err := b.nc.Publish(subject, data); err != nil {
		return fmt.Errorf("publish %s: %w", subject, err)
	}
	return nil
}
```
2) Effective default reconnect buffer
```go
// nats.go option docs:
// ReconnectBufSize sets the buffer size of messages kept while busy reconnecting.
// Defaults to 8388608 bytes (8MB).
// Once exhausted, publish operations return an error.
// Cordum currently sets:
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	// no ReconnectBufSize override yet
}
```
3) Explicit buffer sizing example
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.ReconnectBufSize(32 * 1024 * 1024), // 32MB example
	nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Warn("bus: reconnect attempt failed", "err", err)
	}),
}
```
Size this against memory budgets and the expected publish burst during outage windows. Bigger is not always safer if it hides persistent disconnects.
Operator runbook
Test reconnect buffer behavior under realistic outage conditions before tuning in production.
```shell
# 1) Measure publish error baseline
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 2) Simulate short broker outage
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch for publish errors while disconnected
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"
kubectl -n cordum logs deploy/cordum-gateway | rg "publish .*:"

# 4) If errors spike, trial explicit ReconnectBufSize in staging

# 5) Re-run outage drill and compare:
#    - publish error count
#    - process RSS growth
#    - reconnect recovery time
```
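The before/after comparison in step 5 can be automated by counting publish-error lines in each captured log. A small sketch, assuming the log lines carry the `publish <subject>: <cause>` wrapping that `Publish` produces (the same pattern the `rg` commands grep):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// publishErrRe matches the wrapped errors emitted by Publish
// ("publish <subject>: <cause>"), mirroring the runbook's rg pattern.
var publishErrRe = regexp.MustCompile(`publish \S+:`)

// countPublishErrors counts matching lines in one captured log.
func countPublishErrors(log string) int {
	n := 0
	for _, line := range strings.Split(log, "\n") {
		if publishErrRe.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	before := "ok\npublish cordum.heartbeat: nats: outgoing buffer limit hit\n"
	after := "ok\nok\n"
	fmt.Println(countPublishErrors(before), countPublishErrors(after)) // → 1 0
}
```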
Limitations and tradeoffs
| Option | Benefit | Cost |
|---|---|---|
| Keep default 8MB | No extra memory planning required. | May hit publish-error boundary sooner during long disconnect windows. |
| Increase reconnect buffer | Absorbs larger outage bursts before publish failures surface. | Higher process memory footprint and delayed visibility of prolonged outages. |
| Disable reconnect buffering (-1) | Immediate publish failure signal during disconnect. | Higher error volume; requires strict retry policy in calling code. |
Next step
Run one outage drill this week, record publish error counts and memory use, then pick an explicit `ReconnectBufSize` policy for each control-plane component.
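One way to record the per-component outcome is to make the policy explicit in code rather than leaving it to the client default. A sketch with hypothetical component names and sizes, to be replaced by the numbers your drill produces:

```go
package main

import "fmt"

// reconnectBufPolicy records an explicit per-component ReconnectBufSize
// choice so the limit is visible in code review instead of being an
// implicit client default. Names and sizes here are illustrative.
var reconnectBufPolicy = map[string]int{
	"scheduler": 32 * 1024 * 1024, // absorbs bursty operational events
	"gateway":   8 * 1024 * 1024,  // default suffices for low-rate traffic
	"worker":    -1,               // fail fast; callers retry explicitly
}

func main() {
	for component, size := range reconnectBufPolicy {
		fmt.Println(component, size)
	}
}
```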