Deep Dive

AI Agent NATS Publish Confirmation

Core Publish and JetStream Publish do not confirm the same thing.

Deep Dive · 10 min read · Apr 2026
TL;DR
  • Core NATS `Publish()` success means the client accepted the bytes locally. It is not end-to-end delivery confirmation.
  • During reconnect, publishes are buffered in the client. Once the buffer limit is hit, further publishes return errors.
  • JetStream `Publish()` waits for a `PubAck`, but `PubAck.Duplicate=true` means the write was deduplicated, not that a new event was stored.
  • Cordum routes durable subjects to the JetStream publish path and keeps broadcast/control subjects on Core NATS.
  • Subject criticality should drive transport choice: the durable path for must-not-drop events, the Core path for self-healing signals.
Confirmation boundary

A successful call is not always a delivered message. The boundary depends on publish path.

Dual transport

Cordum intentionally mixes JetStream and Core NATS based on subject type.

Operational fit

Self-healing signals can stay Core. Business-critical transitions should stay durable.

Scope

This guide is about publish confirmation semantics in Cordum bus routing. It is not a full JetStream stream-retention design tutorial.

The production problem

An incident report says, "Publish succeeded." Later, a downstream component never observed the event.

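Usually both statements are true. They refer to different boundaries: the publisher's client accepted the message, but the message never reached the downstream component.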

Without explicit subject-level policy, teams overestimate what a successful publish call guarantees.
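To make the two boundaries concrete, here is a minimal sketch that reports local acceptance and a completed server round trip as separate results. It assumes a small interface, `corePublisher`, whose two methods match the `Publish` and `FlushTimeout` signatures on `*nats.Conn`. The `fakeConn` type and the two-second timeout are hypothetical and exist only so the sketch runs without a broker.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// corePublisher is the subset of a Core NATS connection used here.
// *nats.Conn in nats.go has methods with these signatures.
type corePublisher interface {
	Publish(subject string, data []byte) error
	FlushTimeout(timeout time.Duration) error
}

// fakeConn is a hypothetical stand-in for a client that may be reconnecting:
// Publish always buffers locally and succeeds, FlushTimeout needs a live server.
type fakeConn struct {
	connected bool
	buffered  int
}

var errFlushTimeout = errors.New("flush timed out")

func (c *fakeConn) Publish(subject string, data []byte) error {
	c.buffered++ // accepted locally, nothing confirmed yet
	return nil
}

func (c *fakeConn) FlushTimeout(timeout time.Duration) error {
	if !c.connected {
		return errFlushTimeout
	}
	c.buffered = 0
	return nil
}

// publishAndFlush reports local acceptance and a server round trip separately.
// Even flushed=true does not prove that any subscriber received the message.
func publishAndFlush(p corePublisher, subject string, data []byte) (accepted, flushed bool, err error) {
	if err = p.Publish(subject, data); err != nil {
		return false, false, fmt.Errorf("publish %s: %w", subject, err)
	}
	if err = p.FlushTimeout(2 * time.Second); err != nil {
		return true, false, fmt.Errorf("flush %s: %w", subject, err)
	}
	return true, true, nil
}

func main() {
	accepted, flushed, err := publishAndFlush(&fakeConn{connected: false}, "example.heartbeat", []byte("hb"))
	fmt.Println(accepted, flushed, err)
}
```

An incident note that records `accepted` and `flushed` separately is harder to misread than a single "publish succeeded".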

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and why an app-level send can appear successful before eventual failure. | Does not show how to split subject classes between durable and non-durable paths in a control plane. |
| NATS docs: JetStream Model Deep Dive | Deduplication model (`Nats-Msg-Id`) and duplicate-window semantics. | No policy guidance for interpreting deduped publish acks in hybrid Core+JetStream control planes. |
| nats.go package docs | Reconnect buffer defaults, publish error behavior, and reconnect options. | No architecture guidance for hybrid Core+JetStream publish policy. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Durable subject path | If JetStream is enabled and the subject is durable, Cordum uses `b.js.Publish(...)`. | Publish has stronger, broker-backed confirmation semantics than Core fire-and-forget. |
| Core subject path | Non-durable subjects use `b.nc.Publish(...)`. | Success is local-client acceptance; delivery can still fail later under disconnect pressure. |
| Reconnect buffer | nats.go defaults the reconnect buffer to 8MB unless overridden. | Once exhausted, publish starts returning errors; caller handling becomes decisive. |
| Deduplication boundary | Cordum can attach `nats.MsgId(...)` on durable publishes, and the stream duplicate window is configured at 2 minutes. | A JetStream ack can succeed while marking the message as a duplicate, so "success" is not always a new persisted event. |
| Broadcast subjects | Heartbeat, handshake, and config-changed style broadcasts are intentionally non-durable in Cordum. | At-most-once behavior is accepted where protocol-level self-healing exists. |

Code-level mechanics

1) Publish path split

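core/infra/bus/nats.go

```go
// Publish routes durable subjects through JetStream and everything else through Core NATS.
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
  ...
  if b != nil && b.jsEnabled && isDurableSubject(subject) {
    // Durable path: the call returns once the stream sends a PubAck or an error.
    msgID := computeMsgID(subject, packet)
    if msgID != "" {
      // A message ID enables deduplication within the stream's duplicate window.
      // The PubAck, including its Duplicate flag, is discarded here.
      _, err = b.js.Publish(subject, data, nats.MsgId(msgID))
    } else {
      _, err = b.js.Publish(subject, data)
    }
    if err != nil {
      return fmt.Errorf("publish %s: %w", subject, err)
    }
    return nil
  }

  // Core path: a nil error means the client accepted the message locally.
  if err := b.nc.Publish(subject, data); err != nil {
    return fmt.Errorf("publish %s: %w", subject, err)
  }
  return nil
}
```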

2) Durable subject classifier

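core/infra/bus/nats.go

```go
// isDurableSubject reports whether a subject must go through the JetStream path.
func isDurableSubject(subject string) bool {
  // Exact matches: well-known durable subjects.
  switch subject {
  case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
    return true
  }
  // Any subject under the job. prefix.
  if strings.HasPrefix(subject, "job.") {
    return true
  }
  // Per-worker job queues: worker.<id>.jobs.
  if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
    return true
  }
  return false
}
```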

3) Minimal publish policy rubric

policy-notes.md

```md
Suggested policy sketch

1. Must-not-drop domain transitions: durable (JetStream)
2. Self-healing chatter: core (heartbeat, handshake, config-change notice)
3. All publish errors: counted + alerted
4. Track JetStream duplicate acks separately from hard failures
5. Reconnect drills: validate behavior every release
```

4) Duplicate-window config boundary

core/infra/bus/nats.go

```go
// Stream config in Cordum JetStream setup
js.AddStream(&nats.StreamConfig{
  Name:       name,
  Subjects:   subjects,
  Retention:  nats.LimitsPolicy, // messages are kept until stream limits remove them
  Storage:    nats.FileStorage,  // persisted to disk
  MaxAge:     maxAge,
  Replicas:   replicas,
  Duplicates: 2 * time.Minute, // duplicate tracking window for message IDs
})
```

Operator runbook

Validate subject policies in staging with explicit outage drills and per-subject error counters.

staging-runbook.sh

```bash
# 1) Identify subject class
#    durable candidates: job submit/result, DLQ, audit export
#    core candidates: heartbeat, handshake, config changed

# 2) Inject short broker outage in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch publish errors by subject (rg is ripgrep)
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 4) For durable subjects, test duplicate MsgId behavior in staging
#    Expect ack success but duplicate=true for replayed MsgId within window

# 5) Confirm durable subjects recover via JetStream path
# 6) Confirm core subjects self-heal via periodic resend mechanisms
```

Limitations and tradeoffs

| Choice | Benefit | Cost |
| --- | --- | --- |
| Favor Core NATS | Lower overhead and fast fanout for chatter-style signals. | Weaker delivery confirmation boundary for critical events. |
| Favor JetStream durable publish | Stronger persistence and recovery semantics. | Higher operational complexity and storage costs. |
| Hybrid per subject | Matches reliability cost to business impact. | Requires discipline in subject taxonomy and documentation. |

Next step

Classify every active bus subject by business impact this week and validate that each class uses the intended publish path.
