The production problem
An incident report says, "Publish succeeded." Later, a downstream component never observed the event.
Usually both statements are true; they describe different boundaries. The publish call reports what the client or the broker accepted, not what a subscriber received.
Without explicit subject-level policy, teams overestimate what a successful publish call guarantees.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and why app-level send can appear successful before eventual failure. | Does not show how to split subject classes between durable and non-durable paths in a control plane. |
| NATS docs: JetStream Model Deep Dive | Deduplication model (`Nats-Msg-Id`) and duplicate-window semantics. | No policy guidance for interpreting deduped publish acks in hybrid Core+JetStream control planes. |
| nats.go package docs | Reconnect buffer defaults, publish error behavior, and reconnect options. | No architecture guidance for hybrid Core+JetStream publish policy. |
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Durable subject path | If JetStream is enabled and subject is durable, Cordum uses `b.js.Publish(...)`. | Publish has stronger broker-backed confirmation semantics than Core fire-and-forget. |
| Core subject path | Non-durable subjects use `b.nc.Publish(...)`. | Success is local-client acceptance; delivery can still fail later under disconnect pressure. |
| Reconnect buffer | nats.go defaults the reconnect buffer to 8 MB unless overridden. | Once the buffer is exhausted, publish starts returning errors; caller handling becomes decisive. |
| Deduplication boundary | Cordum can attach `nats.MsgId(...)` on durable publishes, and the stream duplicate window is configured at 2 minutes. | A JetStream ack can succeed while flagging the message as a duplicate, so "success" does not always mean a new event was persisted. |
| Broadcast subjects | Heartbeat, handshake, and config-changed style broadcasts are intentionally non-durable in Cordum. | At-most-once behavior is accepted where protocol-level self-healing exists. |
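The Core path and reconnect buffer rows are easier to reason about with a toy model. The sketch below is not nats.go code: `toyClient`, `errBufferExceeded`, and the 10-byte limit are made up for illustration. It only shows the shape of the boundary, where a nil error while disconnected means "buffered locally", and errors begin once the buffer is full.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferExceeded is a stand-in for the error a client returns once its
// reconnect buffer is full. It is not the nats.go error value.
var errBufferExceeded = errors.New("toy client: reconnect buffer exceeded")

// toyClient models only the local buffering boundary of a Core publish.
type toyClient struct {
	connected bool
	limit     int // reconnect buffer size in bytes (hypothetical value)
	buffered  int // bytes waiting for a reconnect
	delivered int // messages that reached the simulated broker
}

// Publish returns nil when the message was sent or merely buffered locally.
func (c *toyClient) Publish(data []byte) error {
	if c.connected {
		c.delivered++
		return nil
	}
	if c.buffered+len(data) > c.limit {
		return errBufferExceeded
	}
	c.buffered += len(data)
	return nil
}

func main() {
	c := &toyClient{connected: false, limit: 10}
	fmt.Println(c.Publish([]byte("12345"))) // <nil>: buffered, not delivered
	fmt.Println(c.Publish([]byte("12345"))) // <nil>: buffer is now full
	fmt.Println(c.Publish([]byte("1")))     // error: buffer exceeded
	fmt.Println(c.delivered)                // 0
}
```

Two of the three calls report success although nothing was delivered, which is the gap between the incident report and the downstream observation.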
Code-level mechanics
1) Publish path split

    func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
    	...
    	// Durable subjects go through JetStream, which waits for a broker ack.
    	if b != nil && b.jsEnabled && isDurableSubject(subject) {
    		msgID := computeMsgID(subject, packet)
    		if msgID != "" {
    			// A message ID lets the stream deduplicate replays inside its duplicate window.
    			_, err = b.js.Publish(subject, data, nats.MsgId(msgID))
    		} else {
    			_, err = b.js.Publish(subject, data)
    		}
    		if err != nil {
    			return fmt.Errorf("publish %s: %w", subject, err)
    		}
    		return nil
    	}
    	// Non-durable subjects use Core NATS; a nil error only means the client accepted the message.
    	if err := b.nc.Publish(subject, data); err != nil {
    		return fmt.Errorf("publish %s: %w", subject, err)
    	}
    	return nil
    }

2) Durable subject classifier

    func isDurableSubject(subject string) bool {
    	switch subject {
    	case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
    		return true
    	}
    	if strings.HasPrefix(subject, "job.") {
    		return true
    	}
    	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
    		return true
    	}
    	return false
    }

3) Minimal publish policy rubric

    // Suggested policy sketch
    // 1) Must-not-drop domain transitions: durable (JetStream)
    // 2) Self-healing chatter: core (heartbeat, handshake, config-change notice)
    // 3) All publish errors: counted + alerted
    // 4) Track JetStream duplicate acks separately from hard failures
    // 5) Reconnect drills: validate behavior every release
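Rubric items 3 and 4 could be sketched as a small per-subject outcome counter. This is an illustration under assumptions, not Cordum code: `pubAck` is a local stand-in that mirrors the `Duplicate` flag on the nats.go publish ack, and `classify`, `publishCounters`, and the subject name `job.submit` are hypothetical. The map is not safe for concurrent use; a real implementation would likely sit behind a metrics library.

```go
package main

import (
	"errors"
	"fmt"
)

// pubAck is a local stand-in for a JetStream publish ack. Only the
// Duplicate flag matters for this sketch.
type pubAck struct {
	Duplicate bool
}

type outcome string

const (
	outcomePersisted outcome = "persisted" // new event stored by the stream
	outcomeDuplicate outcome = "duplicate" // ack succeeded, message was deduplicated
	outcomeFailed    outcome = "failed"    // hard failure, should be alerted
)

// errExample is a placeholder publish error used by the usage example.
var errExample = errors.New("example: broker unavailable")

// classify separates duplicate acks from hard failures. A missing ack
// without an error is conservatively treated as a failure.
func classify(ack *pubAck, err error) outcome {
	switch {
	case err != nil || ack == nil:
		return outcomeFailed
	case ack.Duplicate:
		return outcomeDuplicate
	default:
		return outcomePersisted
	}
}

// publishCounters maps subject -> outcome -> count.
type publishCounters map[string]map[outcome]int

// record increments the counter for one publish attempt.
func (c publishCounters) record(subject string, o outcome) {
	if c[subject] == nil {
		c[subject] = make(map[outcome]int)
	}
	c[subject][o]++
}

func main() {
	counters := publishCounters{}
	counters.record("job.submit", classify(&pubAck{}, nil))
	counters.record("job.submit", classify(&pubAck{Duplicate: true}, nil))
	counters.record("job.submit", classify(nil, errExample))

	fmt.Println(counters["job.submit"][outcomePersisted]) // 1
	fmt.Println(counters["job.submit"][outcomeDuplicate]) // 1
	fmt.Println(counters["job.submit"][outcomeFailed])    // 1
}
```

Keeping duplicates in their own bucket means a burst of retries does not inflate the failure alert, and it does not hide inside the success count either.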
4) Duplicate-window config boundary

    // Stream config in Cordum JetStream setup
    js.AddStream(&nats.StreamConfig{
    	Name:       name,
    	Subjects:   subjects,
    	Retention:  nats.LimitsPolicy,
    	Storage:    nats.FileStorage,
    	MaxAge:     maxAge,
    	Replicas:   replicas,
    	Duplicates: 2 * time.Minute, // duplicate tracking window
    })

Operator runbook
Validate subject policies in staging with explicit outage drills and per-subject error counters.

    # 1) Identify subject class
    #    durable candidates: job submit/result, DLQ, audit export
    #    core candidates: heartbeat, handshake, config changed
    # 2) Inject short broker outage in staging
    kubectl -n cordum rollout restart statefulset/nats
    # 3) Watch publish errors by subject
    kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"
    # 4) For durable subjects, test duplicate MsgId behavior in staging
    #    Expect ack success but duplicate=true for replayed MsgId within window
    # 5) Confirm durable subjects recover via JetStream path
    # 6) Confirm core subjects self-heal via periodic resend mechanisms
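Before running step 4 it can help to write down what the drill should show. The following is a toy model of a duplicate-tracking window, assuming the 2-minute setting quoted earlier; it is not the JetStream implementation, and `dedupWindow`, `offer`, `at`, and the message ID `job-42` are hypothetical names. Behavior exactly at the window boundary is left untested on purpose.

```go
package main

import (
	"fmt"
	"time"
)

// exampleWindow matches the 2-minute duplicate window from the stream config.
const exampleWindow = 2 * time.Minute

// drillStart is an arbitrary fixed instant so the example is deterministic.
var drillStart = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// at returns the instant the given number of seconds after drillStart.
func at(seconds int) time.Time {
	return drillStart.Add(time.Duration(seconds) * time.Second)
}

// dedupWindow is a toy duplicate tracker keyed by message ID.
type dedupWindow struct {
	window time.Duration
	seen   map[string]time.Time // message ID -> time it was first accepted
}

func newDedupWindow(window time.Duration) *dedupWindow {
	return &dedupWindow{window: window, seen: make(map[string]time.Time)}
}

// offer reports whether msgID would be treated as a duplicate at time now.
// A duplicate does not refresh the window; an expired ID is accepted as new.
func (d *dedupWindow) offer(msgID string, now time.Time) bool {
	if first, ok := d.seen[msgID]; ok && now.Sub(first) < d.window {
		return true
	}
	d.seen[msgID] = now
	return false
}

func main() {
	w := newDedupWindow(exampleWindow)
	fmt.Println(w.offer("job-42", at(0)))   // false: first time this ID is seen
	fmt.Println(w.offer("job-42", at(30)))  // true: replay inside the window
	fmt.Println(w.offer("job-42", at(180))) // false: window has elapsed
}
```

The third call is the case worth rehearsing: a replay that arrives after the window is stored again, so deduplication only protects retries that happen quickly.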
Limitations and tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| Favor Core NATS | Lower overhead and fast fanout for chatter-style signals. | Weaker delivery confirmation boundary for critical events. |
| Favor JetStream durable publish | Stronger persistence and recovery semantics. | Higher operational complexity and storage costs. |
| Hybrid per subject | Matches reliability cost to business impact. | Requires discipline in subject taxonomy and documentation. |
Next step
Classify every active bus subject by business impact this week and validate that each class uses the intended publish path.
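One way to make that check repeatable is to keep the intended class of each subject in a table and compare it against the classifier. The sketch below is an assumption-laden illustration: `isDurable` copies only the two prefix rules quoted from `isDurableSubject` (the exact-match `capsdk` constants are omitted because their values are not shown here), and every subject name in the inventory is made up.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isDurable mirrors only the prefix rules of the classifier shown earlier.
func isDurable(subject string) bool {
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

// mismatches returns, in sorted order, the subjects whose intended class
// (true = durable) disagrees with what the classifier would choose.
func mismatches(intendedDurable map[string]bool, classify func(string) bool) []string {
	var out []string
	for subject, want := range intendedDurable {
		if classify(subject) != want {
			out = append(out, subject)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical inventory: subject -> intended durability.
	inventory := map[string]bool{
		"job.42.state":            true,
		"worker.w1.jobs":          true,
		"worker.w1.heartbeat":     false,
		"billing.invoice.created": true, // intended durable, but no rule matches
	}
	fmt.Println(mismatches(inventory, isDurable)) // [billing.invoice.created]
}
```

Run as a unit test in CI, a non-empty result would flag a subject that was classified on paper but still travels the Core path.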