The production problem
An incident report says, "Publish succeeded." Later, a downstream component never observed the event.
Usually both statements are true. They reference different boundaries.
Without an explicit subject-level policy, teams overestimate what a successful publish call guarantees.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and why app-level send can appear successful before eventual failure. | Does not show how to split subject classes between durable and non-durable paths in a control plane. |
| NATS docs: Automatic Reconnections | Client reconnect lifecycle and callback model. | No concrete mapping between reconnect semantics and publish confirmation guarantees per message class. |
| nats.go package docs | Reconnect buffer defaults, publish error behavior, and reconnect options. | No architecture guidance for hybrid Core+JetStream publish policy. |
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Durable subject path | If JetStream is enabled and subject is durable, Cordum uses `b.js.Publish(...)`. | Publish has stronger broker-backed confirmation semantics than Core fire-and-forget. |
| Core subject path | Non-durable subjects use `b.nc.Publish(...)`. | Success is local-client acceptance; delivery can still fail later under disconnect pressure. |
| Reconnect buffer | While disconnected, nats.go holds outgoing publishes in a reconnect buffer that defaults to 8 MB unless overridden. | Once the buffer is exhausted, publish starts returning errors, so caller-side error handling becomes decisive. |
| Broadcast subjects | Heartbeat, handshake, and config-changed style broadcasts are intentionally non-durable in Cordum. | At-most-once behavior is accepted where protocol-level self-healing exists. |
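The Core path and reconnect buffer rows can be illustrated with a toy model. This is only a sketch under stated assumptions: `toyClient` and `errBufferExceeded` are hypothetical names, the byte accounting ignores protocol framing, and it does not reproduce nats.go internals. It shows the one property that matters here: a publish can return nil while nothing has reached the broker.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferExceeded is a hypothetical stand-in for the error a client returns
// once its reconnect buffer is full. It is not the nats.go sentinel itself.
var errBufferExceeded = errors.New("reconnect buffer exceeded")

// toyClient is a simplified model of a client that buffers while disconnected.
type toyClient struct {
	connected bool
	limit     int // reconnect buffer capacity in bytes
	pending   int // bytes accepted locally but not yet sent
	queued    int // messages waiting in the buffer
	sent      int // messages actually written to the broker
}

// Publish returns nil when the message was sent or accepted into the buffer.
// A nil error while disconnected therefore means local acceptance only.
func (c *toyClient) Publish(payload []byte) error {
	if c.connected {
		c.sent++
		return nil
	}
	if c.pending+len(payload) > c.limit {
		return errBufferExceeded
	}
	c.pending += len(payload)
	c.queued++
	return nil
}

func main() {
	// Disconnected client with a 10-byte buffer and 4-byte payloads.
	c := &toyClient{limit: 10}
	payload := []byte("1234")
	for i := 1; i <= 4; i++ {
		err := c.Publish(payload)
		fmt.Printf("publish %d: err=%v queued=%d sent=%d\n", i, err, c.queued, c.sent)
	}
}
```

In this run the first two calls return nil with `sent=0`, and the third and fourth fail because the buffer is full. The caller sees "success" twice for messages that may never be delivered if the connection does not come back.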
Code-level mechanics
1) Publish path split
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
	...
	// Durable path: JetStream publish, used only when JetStream is enabled.
	if b != nil && b.jsEnabled && isDurableSubject(subject) {
		_, err = b.js.Publish(subject, data)
		if err != nil {
			return fmt.Errorf("publish %s: %w", subject, err)
		}
		return nil
	}
	// Core path: success here means the client accepted the message locally.
	if err := b.nc.Publish(subject, data); err != nil {
		return fmt.Errorf("publish %s: %w", subject, err)
	}
	return nil
}

2) Durable subject classifier
func isDurableSubject(subject string) bool {
	switch subject {
	case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

3) Minimal publish policy rubric
// Suggested policy sketch
// 1) Must-not-drop domain transitions: durable (JetStream)
// 2) Self-healing chatter: core (heartbeat, handshake, config-change notice)
// 3) All publish errors: counted + alerted
// 4) Reconnect drills: validate behavior every release
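Rubric item 3 (count every publish error) could be approached with a thin wrapper around the publish call. The sketch below is one possible shape, not Cordum's implementation: `publishFunc`, `errorCounter`, `failOn`, and the subject names are hypothetical, and the payload is plain bytes rather than a `*pb.BusPacket`. A real deployment would more likely export these counts as metrics and alert on them.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// publishFunc is a hypothetical stand-in for a bus publish call.
type publishFunc func(subject string, data []byte) error

// errSimulated marks failures injected by failOn.
var errSimulated = errors.New("simulated outage")

// errorCounter tallies publish failures per subject.
type errorCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newErrorCounter() *errorCounter {
	return &errorCounter{counts: make(map[string]int)}
}

// wrap returns a publishFunc that records every failure before passing the
// error back, so a caller that ignores the error still leaves a trace.
func (c *errorCounter) wrap(next publishFunc) publishFunc {
	return func(subject string, data []byte) error {
		err := next(subject, data)
		if err != nil {
			c.mu.Lock()
			c.counts[subject]++
			c.mu.Unlock()
		}
		return err
	}
}

// count reports how many failures were recorded for one subject.
func (c *errorCounter) count(subject string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[subject]
}

// failOn builds a fake publisher that fails only for the target subject.
func failOn(target string) publishFunc {
	return func(subject string, _ []byte) error {
		if subject == target {
			return errSimulated
		}
		return nil
	}
}

func main() {
	counter := newErrorCounter()
	publish := counter.wrap(failOn("sys.heartbeat")) // hypothetical subject name
	_ = publish("sys.heartbeat", nil)
	_ = publish("job.123", nil)
	fmt.Println(counter.count("sys.heartbeat"), counter.count("job.123"))
}
```

Counting per subject rather than globally keeps the durable and core classes separable in dashboards, which is what the reconnect drills in item 4 need in order to be interpreted.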
Operator runbook
Validate subject policies in staging with explicit outage drills and per-subject error counters.
# 1) Identify subject class
# durable candidates: job submit/result, DLQ, audit export
# core candidates: heartbeat, handshake, config changed

# 2) Inject short broker outage in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch publish errors by subject
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 4) Confirm durable subjects recover via JetStream path
# 5) Confirm core subjects self-heal via periodic resend mechanisms
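Step 3 produces raw log lines. To turn them into per-subject counts, a small filter can key on the `publish <subject>: <cause>` shape that the `fmt.Errorf` calls in the publish path produce. The following is a rough sketch: the surrounding log format is an assumption, the helper names are hypothetical, and any unrelated line containing the same shape would also be counted.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

// publishErrPattern captures the subject from "publish <subject>: <cause>".
var publishErrPattern = regexp.MustCompile(`publish (\S+): `)

// publishErrorSubject extracts the subject from one log line, if present.
func publishErrorSubject(line string) (string, bool) {
	m := publishErrPattern.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	return m[1], true
}

// tally counts publish-error lines per subject.
func tally(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		if subject, ok := publishErrorSubject(line); ok {
			counts[subject]++
		}
	}
	return counts
}

func main() {
	// Read log lines from standard input.
	var lines []string
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		lines = append(lines, sc.Text())
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read logs:", err)
		os.Exit(1)
	}
	// Print subjects in a stable order so runs are easy to compare.
	counts := tally(lines)
	subjects := make([]string, 0, len(counts))
	for s := range counts {
		subjects = append(subjects, s)
	}
	sort.Strings(subjects)
	for _, s := range subjects {
		fmt.Printf("%-40s %d\n", s, counts[s])
	}
}
```

Saved under a name of your choosing (for example `tally.go`), it can sit at the end of the step 3 pipeline in place of `rg`: `kubectl -n cordum logs deploy/cordum-scheduler | go run tally.go`.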
Limitations and tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| Favor Core NATS | Lower overhead and fast fanout for chatter-style signals. | Weaker delivery confirmation boundary for critical events. |
| Favor JetStream durable publish | Stronger persistence and recovery semantics. | Higher operational complexity and storage costs. |
| Hybrid per subject | Matches reliability cost to business impact. | Requires discipline in subject taxonomy and documentation. |
Next step
Classify every active bus subject by business impact this week and validate that each class uses the intended publish path.
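One way to make that validation repeatable is a small audit that compares intended classes against what a classifier actually returns. The sketch below rests on assumptions: `isDurable` mirrors only the two prefix rules of `isDurableSubject` (the exact-match `capsdk` constants are left out because their values are not shown here), and every subject name in the table is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// expectation records the publish path a subject is intended to use.
type expectation struct {
	subject     string
	wantDurable bool
}

// isDurable mirrors only the prefix rules of the classifier shown earlier.
func isDurable(subject string) bool {
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	return strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs")
}

// audit returns one message per subject whose classification differs from
// the intended class.
func audit(classify func(string) bool, want []expectation) []string {
	var mismatches []string
	for _, e := range want {
		if got := classify(e.subject); got != e.wantDurable {
			mismatches = append(mismatches,
				fmt.Sprintf("%s: durable=%t, want %t", e.subject, got, e.wantDurable))
		}
	}
	return mismatches
}

func main() {
	// Hypothetical taxonomy; replace with the subjects your bus really uses.
	want := []expectation{
		{"job.42.status", true},
		{"worker.w1.jobs", true},
		{"worker.w1.heartbeat", false},
		{"audit.export", true}, // intended durable, but no prefix rule matches
	}
	for _, m := range audit(isDurable, want) {
		fmt.Println("MISMATCH", m)
	}
}
```

Here the audit flags `audit.export`, a subject that was intended to be durable but falls through to the Core path. Running a check like this in CI keeps the subject taxonomy and the classifier from drifting apart.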