AI Agent NATS Publish Confirmation

Core Publish and JetStream Publish do not confirm the same thing.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • Core NATS `Publish()` success means the client accepted the bytes locally. It is not an end-to-end delivery confirmation.
  • During a reconnect, publishes are buffered client-side; once the buffer limit is hit, further publishes fail.
  • Cordum routes durable subjects to the JetStream publish path and keeps broadcast/control subjects on Core NATS.
  • Subject criticality should drive transport choice: the durable path for must-not-drop events, the core path for self-healing signals.
Confirmation boundary

A successful call is not always a delivered message. What "success" confirms depends on the publish path.

Dual transport

Cordum intentionally mixes JetStream and Core NATS based on subject type.

Operational fit

Self-healing signals can stay Core. Business-critical transitions should stay durable.

Scope

This guide is about publish confirmation semantics in Cordum bus routing. It is not a full JetStream stream-retention design tutorial.

The production problem

An incident report says, "Publish succeeded." Later, a downstream component never observed the event.

Usually both statements are true. They reference different boundaries.

Without explicit subject-level policy, teams overestimate what a successful publish call guarantees.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and why app-level send can appear successful before eventual failure. | Does not show how to split subject classes between durable and non-durable paths in a control plane. |
| NATS docs: Automatic Reconnections | Client reconnect lifecycle and callback model. | No concrete mapping between reconnect semantics and publish confirmation guarantees per message class. |
| nats.go package docs | Reconnect buffer defaults, publish error behavior, and reconnect options. | No architecture guidance for hybrid Core+JetStream publish policy. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Durable subject path | If JetStream is enabled and subject is durable, Cordum uses `b.js.Publish(...)`. | Publish has stronger broker-backed confirmation semantics than Core fire-and-forget. |
| Core subject path | Non-durable subjects use `b.nc.Publish(...)`. | Success is local-client acceptance; delivery can still fail later under disconnect pressure. |
| Reconnect buffer | nats.go defaults reconnect buffer to 8MB unless overridden. | Once exhausted, publish starts returning errors; caller handling becomes decisive. |
| Broadcast subjects | Heartbeat, handshake, and config-changed style broadcasts are intentionally non-durable in Cordum. | At-most-once behavior is accepted where protocol-level self-healing exists. |

Code-level mechanics

1) Publish path split

core/infra/bus/nats.go
go
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
  ...
  if b != nil && b.jsEnabled && isDurableSubject(subject) {
    _, err = b.js.Publish(subject, data)
    if err != nil {
      return fmt.Errorf("publish %s: %w", subject, err)
    }
    return nil
  }

  if err := b.nc.Publish(subject, data); err != nil {
    return fmt.Errorf("publish %s: %w", subject, err)
  }
  return nil
}
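
One consequence of the condition above is worth testing: when JetStream is disabled, a durable subject falls through to the Core path and still returns nil. The following is a minimal, standard-library sketch of that branch, not Cordum's code. `router`, `publishFunc`, and `newDemoRouter` are hypothetical, and the `job.` prefix check stands in for the full classifier.

```go
package main

import (
	"fmt"
	"strings"
)

// publishFunc is a hypothetical stand-in for a transport's publish call.
type publishFunc func(subject string, data []byte) error

// router mirrors the shape of the path split: durable subjects go to the
// durable transport only when JetStream is enabled.
type router struct {
	jsEnabled bool
	isDurable func(subject string) bool
	durable   publishFunc
	core      publishFunc
}

// Publish returns the name of the path that was used, so tests can assert
// on routing and not only on the error value.
func (r *router) Publish(subject string, data []byte) (string, error) {
	if r.jsEnabled && r.isDurable(subject) {
		if err := r.durable(subject, data); err != nil {
			return "jetstream", fmt.Errorf("publish %s: %w", subject, err)
		}
		return "jetstream", nil
	}
	if err := r.core(subject, data); err != nil {
		return "core", fmt.Errorf("publish %s: %w", subject, err)
	}
	return "core", nil
}

// newDemoRouter wires both paths to no-op transports for illustration.
func newDemoRouter(jsEnabled bool) *router {
	ok := func(string, []byte) error { return nil }
	return &router{
		jsEnabled: jsEnabled,
		isDurable: func(s string) bool { return strings.HasPrefix(s, "job.") },
		durable:   ok,
		core:      ok,
	}
}

func main() {
	for _, enabled := range []bool{true, false} {
		path, err := newDemoRouter(enabled).Publish("job.1", nil)
		fmt.Printf("jsEnabled=%v path=%s err=%v\n", enabled, path, err)
	}
}
```

Returning the path name is a choice made for this sketch; in a real bus the same signal could come from a log field or a metric label.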

2) Durable subject classifier

core/infra/bus/nats.go
go
func isDurableSubject(subject string) bool {
  switch subject {
  case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
    return true
  }
  if strings.HasPrefix(subject, "job.") {
    return true
  }
  if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
    return true
  }
  return false
}

3) Minimal publish policy rubric

policy-notes.md
md
// Suggested policy sketch
// 1) Must-not-drop domain transitions: durable (JetStream)
// 2) Self-healing chatter: core (heartbeat, handshake, config-change notice)
// 3) All publish errors: counted + alerted
// 4) Reconnect drills: validate behavior every release
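
Point 3 of the rubric (count every publish error) can be sketched as a thin wrapper. This is an illustrative, standard-library example under stated assumptions: the `Publisher` interface, `countingPublisher`, `failingPublisher`, and the subject names are all hypothetical and not part of Cordum or nats.go. A real deployment would export the counts to its metrics system and alert on them.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Publisher is a hypothetical minimal publish interface for this sketch.
type Publisher interface {
	Publish(subject string, data []byte) error
}

// countingPublisher wraps a Publisher and counts failures per subject.
type countingPublisher struct {
	next     Publisher
	mu       sync.Mutex
	failures map[string]int
}

func newCountingPublisher(next Publisher) *countingPublisher {
	return &countingPublisher{next: next, failures: make(map[string]int)}
}

// Publish forwards to the wrapped publisher and records any error against
// the subject before returning it unchanged to the caller.
func (p *countingPublisher) Publish(subject string, data []byte) error {
	err := p.next.Publish(subject, data)
	if err != nil {
		p.mu.Lock()
		p.failures[subject]++
		p.mu.Unlock()
	}
	return err
}

// ErrorCount returns how many publishes have failed for a subject.
func (p *countingPublisher) ErrorCount(subject string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.failures[subject]
}

// failingPublisher simulates a transport that rejects one subject.
type failingPublisher struct{ failSubject string }

func (f failingPublisher) Publish(subject string, _ []byte) error {
	if subject == f.failSubject {
		return errors.New("simulated publish failure")
	}
	return nil
}

func main() {
	p := newCountingPublisher(failingPublisher{failSubject: "demo.heartbeat"})
	for _, s := range []string{"demo.heartbeat", "job.1", "demo.heartbeat"} {
		_ = p.Publish(s, nil) // errors are counted by the wrapper
	}
	fmt.Println("demo.heartbeat failures:", p.ErrorCount("demo.heartbeat"))
	fmt.Println("job.1 failures:", p.ErrorCount("job.1"))
}
```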

Operator runbook

Validate subject policies in staging with explicit outage drills and per-subject error counters.

staging-runbook.sh
bash
# 1) Identify subject class
#    durable candidates: job submit/result, DLQ, audit export
#    core candidates: heartbeat, handshake, config changed

# 2) Inject short broker outage in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch publish errors by subject
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 4) Confirm durable subjects recover via JetStream path
# 5) Confirm core subjects self-heal via periodic resend mechanisms

Limitations and tradeoffs

| Choice | Benefit | Cost |
| --- | --- | --- |
| Favor Core NATS | Lower overhead and fast fanout for chatter-style signals. | Weaker delivery confirmation boundary for critical events. |
| Favor JetStream durable publish | Stronger persistence and recovery semantics. | Higher operational complexity and storage costs. |
| Hybrid per subject | Matches reliability cost to business impact. | Requires discipline in subject taxonomy and documentation. |

Next step

Classify every active bus subject by business impact this week and validate that each class uses the intended publish path.
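
One way to run that classification is to write down the intended class for each subject and compare it with what the classifier actually decides. The sketch below is illustrative only: `auditSubjects` and the example subjects are hypothetical, and `isDurable` reproduces just the prefix and suffix rules from the classifier shown earlier, leaving out the exact-match `capsdk` constants because their values are not given in this article.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isDurable covers only the prefix/suffix rules of the classifier shown
// earlier. The exact-match capsdk subjects are deliberately omitted.
func isDurable(subject string) bool {
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	return strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs")
}

// auditSubjects takes the intended class per subject (true = durable) and
// returns, in sorted order, the subjects the classifier routes differently.
func auditSubjects(intended map[string]bool) []string {
	var mismatches []string
	for subject, wantDurable := range intended {
		if isDurable(subject) != wantDurable {
			mismatches = append(mismatches, subject)
		}
	}
	sort.Strings(mismatches)
	return mismatches
}

func main() {
	// Hypothetical inventory: subject -> intended durability.
	intended := map[string]bool{
		"job.42.state":            true,
		"worker.w1.jobs":          true,
		"worker.w1.heartbeat":     false,
		"billing.invoice.created": true, // intended durable, but no rule matches it
	}
	for _, subject := range auditSubjects(intended) {
		fmt.Println("intent and publish path disagree:", subject)
	}
}
```

A check like this fits naturally into a unit test next to the classifier, so a new subject that is meant to be durable cannot ship on the Core path unnoticed.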
