AI Agent NATS Publish Confirmation: Core Publish vs JetStream Ack in Control Planes (2026)

The production problem

An incident report says, "Publish succeeded." Later, a downstream component never observed the event.

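Usually both statements are true, because they describe different boundaries. The publish call reports what the local client or the broker accepted; the downstream component reports what was actually delivered.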

Without explicit subject-level policy, teams overestimate what a successful publish call guarantees.

What top results miss

Source	Strong coverage	Missing piece
NATS docs: Buffering Messages During Reconnect Attempts	Reconnect buffer behavior and why app-level send can appear successful before eventual failure.	Does not show how to split subject classes between durable and non-durable paths in a control plane.
NATS docs: JetStream Model Deep Dive	Deduplication model (`Nats-Msg-Id`) and duplicate-window semantics.	No policy guidance for interpreting deduped publish acks in hybrid Core+JetStream control planes.
nats.go package docs	Reconnect buffer defaults, publish error behavior, and reconnect options.	No architecture guidance for hybrid Core+JetStream publish policy.

Cordum runtime behavior

Boundary	Observed behavior	Operational impact
Durable subject path	If JetStream is enabled and the subject is durable, Cordum uses `b.js.Publish(...)`.	The call returns only after the broker acknowledges the publish or the call fails, which is a stronger confirmation than Core fire-and-forget.
Core subject path	Non-durable subjects use `b.nc.Publish(...)`.	Success means the local client accepted the message; delivery can still fail later under disconnect pressure.
Reconnect buffer	nats.go defaults the reconnect buffer to 8 MB unless overridden with `nats.ReconnectBufSize(...)`.	Once the buffer is exhausted, publish calls return `nats.ErrReconnectBufExceeded`, so whether the event is lost depends on how the caller handles that error.
Deduplication boundary	Cordum can attach `nats.MsgId(...)` on durable publishes, and the stream duplicate window is configured at 2 minutes.	A JetStream ack can report success while flagging the message as a duplicate, so a successful publish is not always a newly persisted event.
Broadcast subjects	Heartbeat, handshake, and config-changed style broadcasts are intentionally non-durable in Cordum.	At-most-once behavior is accepted where protocol-level self-healing exists.
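
To make the reconnect-buffer row concrete, here is a minimal sketch of a bounded outbound buffer. It is a toy model, not the nats.go implementation: the `bufferedClient` type, the `errBufferExceeded` name, and the payload-only accounting (real clients also buffer protocol overhead) are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errBufferExceeded stands in for the error a client returns once its
// reconnect buffer is full. The name is hypothetical.
var errBufferExceeded = errors.New("reconnect buffer exceeded")

// bufferedClient is a toy model of a disconnected client that queues
// outbound bytes up to a fixed limit.
type bufferedClient struct {
	limit   int // maximum bytes held while disconnected
	pending int // bytes currently queued
}

// Publish accepts the payload if it still fits in the buffer.
// A nil error means "queued locally", not "delivered".
func (c *bufferedClient) Publish(payload []byte) error {
	if c.pending+len(payload) > c.limit {
		return errBufferExceeded
	}
	c.pending += len(payload)
	return nil
}

// acceptedBeforeError counts how many equally sized publishes succeed
// before the first error. msgSize must be greater than zero.
func acceptedBeforeError(limit, msgSize int) int {
	c := &bufferedClient{limit: limit}
	n := 0
	for c.Publish(make([]byte, msgSize)) == nil {
		n++
	}
	return n
}

func main() {
	// An 8 MiB buffer filled with 1 KiB payloads.
	fmt.Println(acceptedBeforeError(8*1024*1024, 1024))
}
```

In this model every accepted publish returned nil even though nothing reached a broker, which is the gap between "publish succeeded" and "event observed" described above.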

Code-level mechanics

1) Publish path split

core/infra/bus/nats.go

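func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
  ...
  // Durable path: js.Publish blocks until JetStream acks or the call fails.
  if b != nil && b.jsEnabled && isDurableSubject(subject) {
    msgID := computeMsgID(subject, packet)
    if msgID != "" {
      // MsgId enables server-side deduplication within the duplicate window.
      // The returned ack is discarded, so a duplicate ack looks like any other success.
      _, err = b.js.Publish(subject, data, nats.MsgId(msgID))
    } else {
      _, err = b.js.Publish(subject, data)
    }
    if err != nil {
      return fmt.Errorf("publish %s: %w", subject, err)
    }
    return nil
  }

  // Core path: a nil error means the client accepted the message,
  // not that any subscriber received it.
  if err := b.nc.Publish(subject, data); err != nil {
    return fmt.Errorf("publish %s: %w", subject, err)
  }
  return nil
}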

2) Durable subject classifier

core/infra/bus/nats.go

func isDurableSubject(subject string) bool {
  switch subject {
  case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
    return true
  }
  if strings.HasPrefix(subject, "job.") {
    return true
  }
  if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
    return true
  }
  return false
}
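
A classifier like this is easy to pin down with a table of subjects and expected classes. The sketch below copies the function into a standalone program. The four constant values are placeholders, since the real `capsdk` values are not shown in this article, and the sample subjects are invented for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder values: the real capsdk constants may differ.
const (
	subjectSubmit      = "sys.job.submit"
	subjectResult      = "sys.job.result"
	subjectDLQ         = "sys.job.dlq"
	subjectAuditExport = "sys.audit.export"
)

// isDurableSubject follows the same rules as the classifier shown above.
func isDurableSubject(subject string) bool {
	switch subject {
	case subjectSubmit, subjectResult, subjectDLQ, subjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

func main() {
	cases := []struct {
		subject string
		want    bool
	}{
		{subjectSubmit, true},
		{"job.123.progress", true},
		{"worker.w1.jobs", true},
		{"worker.w1.heartbeat", false}, // prefix matches, suffix does not
		{"jobs.summary", false},        // "jobs." is not the "job." prefix
		{"sys.heartbeat", false},
	}
	for _, c := range cases {
		if got := isDurableSubject(c.subject); got != c.want {
			fmt.Printf("MISMATCH %q: got %v, want %v\n", c.subject, got, c.want)
		}
	}
	fmt.Println("checked", len(cases), "subjects")
}
```

A table like this doubles as documentation of the subject taxonomy: a new subject that silently falls through to the Core path shows up as a failing row instead of a production surprise.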

3) Minimal publish policy rubric

policy-notes.md

// Suggested policy sketch
// 1) Must-not-drop domain transitions: durable (JetStream)
// 2) Self-healing chatter: core (heartbeat, handshake, config-change notice)
// 3) All publish errors: counted + alerted
// 4) Track JetStream duplicate acks separately from hard failures
// 5) Reconnect drills: validate behavior every release

4) Duplicate-window config boundary

core/infra/bus/nats.go

// Stream config in Cordum JetStream setup
js.AddStream(&nats.StreamConfig{
  Name:       name,
  Subjects:   subjects,
  Retention:  nats.LimitsPolicy,
  Storage:    nats.FileStorage,
  MaxAge:     maxAge,
  Replicas:   replicas,
  Duplicates: 2 * time.Minute, // duplicate tracking window
})

Operator runbook

Validate subject policies in staging with explicit outage drills and per-subject error counters.

staging-runbook.sh

# 1) Identify subject class
#    durable candidates: job submit/result, DLQ, audit export
#    core candidates: heartbeat, handshake, config changed

# 2) Inject short broker outage in staging
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch publish errors by subject
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 4) For durable subjects, test duplicate MsgId behavior in staging
#    Expect ack success but duplicate=true for replayed MsgId within window

# 5) Confirm durable subjects recover via JetStream path
# 6) Confirm core subjects self-heal via periodic resend mechanisms
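
Step 3 greps for the `publish <subject>: <cause>` error format produced by the publish path. During a drill it can help to tally those lines per subject rather than read them one by one. The sketch below does that for log text held in a string; the sample log lines are invented for illustration and the helper name `tallyPublishErrors` is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// publishErr captures the subject from messages shaped like
// "publish <subject>: <cause>".
var publishErr = regexp.MustCompile(`publish (\S+?): `)

// tallyPublishErrors counts matching log lines per subject.
func tallyPublishErrors(logs string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(logs))
	for sc.Scan() {
		if m := publishErr.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

func main() {
	// Invented sample lines; real log prefixes and causes will differ.
	logs := `level=error msg="publish sys.heartbeat: buffer full"
level=info msg="scheduler tick"
level=error msg="publish sys.job.submit: timeout"
level=error msg="publish sys.heartbeat: buffer full"`

	counts := tallyPublishErrors(logs)
	subjects := make([]string, 0, len(counts))
	for s := range counts {
		subjects = append(subjects, s)
	}
	sort.Strings(subjects) // stable output order
	for _, s := range subjects {
		fmt.Printf("%s %d\n", s, counts[s])
	}
}
```

Errors on durable subjects during the outage are the ones to chase first; errors on core subjects are expected as long as the periodic resend in step 6 closes the gap.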

Limitations and tradeoffs

Choice	Benefit	Cost
Favor Core NATS	Lower overhead and fast fanout for chatter-style signals.	Weaker delivery confirmation boundary for critical events.
Favor JetStream durable publish	Stronger persistence and recovery semantics.	Higher operational complexity and storage costs.
Hybrid per subject	Matches reliability cost to business impact.	Requires discipline in subject taxonomy and documentation.

Next step

Classify every active bus subject by business impact this week and validate that each class uses the intended publish path.


