Deep Dive

AI Agent NATS Subject Durability Map

Decide durability per subject family, not by default assumptions.

10 min read · Mar 2026
TL;DR
  • Core NATS is best-effort, at-most-once. JetStream adds persistence and acknowledged, at-least-once delivery.
  • Cordum already implements an explicit durability map in `isDurableSubject(subject)`.
  • Not every subject should be durable. Some broadcast events are intentionally best-effort and self-healing.
  • The real risk is hidden assumptions. Teams often treat all events as durable even when the code does not.
Frequent mistake

Assuming all bus traffic is persisted creates false confidence during restarts and failovers.

Code truth

Cordum classifies durability by subject pattern, not by wishful thinking.

Operator win

A written durability map shortens incident triage and avoids noisy "data loss" confusion.

Scope

This guide maps Cordum bus subjects to durability expectations. It does not replace full incident-response procedure design.

The production problem

Incidents get noisy when teams cannot answer one simple question: was this event supposed to be durable?

If the answer lives only in code comments, operators will guess under pressure.

Guessing under pressure is how expected best-effort behavior turns into a fake data-loss incident.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Core NATS | Core NATS provides best-effort, at-most-once delivery. | No practical guidance on classifying control-plane subjects into durable vs best-effort buckets. |
| NATS docs: JetStream Consumers | Durable vs ephemeral consumer semantics and ack-driven behavior. | Does not provide an event-taxonomy method for platform teams deciding what must survive restarts. |
| NATS docs: FAQ | Clear statement of delivery guarantees and reliability boundaries. | No rollout checklist for mixed-mode systems where only some subjects are persisted. |

Cordum runtime map

Cordum has an explicit durability classifier. That is good. Most teams still need it translated into an operator-facing matrix.

| Area | Current behavior | Operational impact |
| --- | --- | --- |
| Durable subjects | Cordum marks `sys.job.submit`, `sys.job.result`, `sys.job.dlq`, and `sys.audit.export` as durable. | State-changing job lifecycle events survive broker restarts in JetStream mode. |
| Pattern durability | `job.*` and `worker.*.jobs` subjects are durable. The classifier uses prefix and suffix checks, so deeper subjects such as `job.a.b` also match. | Dispatch and targeted worker job traffic stays in the durable path. |
| Best-effort broadcasts | Heartbeat, handshake, config-changed, alert, progress, and workflow-event broadcasts are intentionally sent over core NATS. | Some transient events can be missed during disconnect windows by design. |
| Reasoning in code | Comments note self-healing or informational semantics for best-effort subjects. | Durability cost is reserved for events that change authoritative state. |

Durability classifier
```go
// core/infra/bus/nats.go
func isDurableSubject(subject string) bool {
	switch subject {
	case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}
```
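
A table-driven check is one way to keep the classifier and its documentation aligned. The sketch below is a standalone copy of the logic above, not Cordum's test suite. The four string constants are assumptions standing in for the `capsdk` values, inferred from the subject names in the runtime map, and the sample tokens `w1` and `abc` are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values for the capsdk constants, inferred from the subject names
// listed in this article. The real constants live in Cordum's SDK.
const (
	subjectSubmit      = "sys.job.submit"
	subjectResult      = "sys.job.result"
	subjectDLQ         = "sys.job.dlq"
	subjectAuditExport = "sys.audit.export"
)

// isDurableSubject mirrors the classifier shown above.
func isDurableSubject(subject string) bool {
	switch subject {
	case subjectSubmit, subjectResult, subjectDLQ, subjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

func main() {
	// Hypothetical sample subjects covering each branch of the classifier.
	samples := []string{
		"sys.job.submit", "job.abc", "worker.w1.jobs",
		"sys.heartbeat", "sys.job.progress", "worker.w1.status",
	}
	for _, s := range samples {
		fmt.Printf("%-18s durable=%v\n", s, isDurableSubject(s))
	}
}
```

Note that `sys.job.progress` is not durable even though it contains `job`: the prefix check only matches subjects that start with `job.`.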
Best-effort rationale in code comments
```go
// Broadcast subjects intentionally stay core NATS (best-effort):
// - sys.heartbeat: periodic, self-healing
// - sys.config.changed: poll fallback catches missed notices
// - sys.handshake: workers re-register on next cycle
// - sys.alert/sys.job.progress/sys.workflow.event: informational, no hard state dependency
```

Subject durability matrix

| Subject family | Delivery mode | Why this choice works |
| --- | --- | --- |
| `sys.job.submit`, `sys.job.result`, `sys.job.dlq`, `sys.audit.export` | Durable (JetStream) | Changes authoritative job state or audit evidence. |
| `job.*`, `worker.*.jobs` | Durable (JetStream) | Carries dispatch and worker-targeted execution messages. |
| `sys.heartbeat`, `sys.handshake` | Best-effort (Core NATS) | Periodic/self-healing presence signals. |
| `sys.config.changed` | Best-effort (Core NATS) | Config polling fallback recovers missed events. |
| `sys.alert`, `sys.job.progress`, `sys.workflow.event` | Best-effort (Core NATS) | Informational streams where occasional transient misses are tolerated. |
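
The matrix can also be kept as data, so an operator or a script can ask "was this subject supposed to be durable?" directly. The following is a minimal sketch under stated assumptions: the `Rule` type, `matches`, and `lookup` are hypothetical helpers, the rationale strings are shortened from the table, and patterns are read with NATS wildcard semantics (`*` matches exactly one token, `>` matches one or more trailing tokens).

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is one row of the durability matrix.
type Rule struct {
	Pattern string // NATS-style pattern: "*" is one token, ">" is the remaining tokens
	Durable bool
	Why     string
}

// matrix restates the table above as data, in the same order.
var matrix = []Rule{
	{"sys.job.submit", true, "authoritative job state"},
	{"sys.job.result", true, "authoritative job state"},
	{"sys.job.dlq", true, "authoritative job state"},
	{"sys.audit.export", true, "audit evidence"},
	{"job.*", true, "dispatch messages"},
	{"worker.*.jobs", true, "worker-targeted execution messages"},
	{"sys.heartbeat", false, "periodic, self-healing"},
	{"sys.handshake", false, "periodic, self-healing"},
	{"sys.config.changed", false, "config polling fallback"},
	{"sys.alert", false, "informational"},
	{"sys.job.progress", false, "informational"},
	{"sys.workflow.event", false, "informational"},
}

// matches reports whether subject fits a NATS-style pattern.
func matches(pattern, subject string) bool {
	p := strings.Split(pattern, ".")
	s := strings.Split(subject, ".")
	for i, tok := range p {
		if tok == ">" {
			// ">" is only valid as the last token and needs at least one token to consume.
			return i == len(p)-1 && len(s) > i
		}
		if i >= len(s) {
			return false
		}
		if tok != "*" && tok != s[i] {
			return false
		}
	}
	return len(p) == len(s)
}

// lookup returns the first matching rule; ok is false for unmapped subjects.
func lookup(subject string) (rule Rule, ok bool) {
	for _, r := range matrix {
		if matches(r.Pattern, subject) {
			return r, true
		}
	}
	return Rule{}, false
}

func main() {
	// "w1" and "sys.metrics" are made-up examples, not Cordum subjects.
	for _, s := range []string{"worker.w1.jobs", "sys.heartbeat", "sys.metrics"} {
		if r, ok := lookup(s); ok {
			fmt.Printf("%s: durable=%v (%s)\n", s, r.Durable, r.Why)
		} else {
			fmt.Printf("%s: UNMAPPED, add it to the policy\n", s)
		}
	}
}
```

Read literally as a NATS wildcard, `job.*` does not cover a deeper subject such as `job.a.b`, while the prefix check in `isDurableSubject` does. A written policy should say which reading is intended, for example by writing `job.>` if every depth is meant to be durable.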

Validation runbook

Treat durability mapping as testable behavior. Run restart drills and confirm that only expected best-effort gaps appear.

Restart validation steps
```bash
# 1) List subject classes used by each service
# 2) Simulate rolling restart with JetStream enabled
kubectl -n cordum rollout restart deploy/cordum-scheduler
# Wait for the restart to finish before checking continuity
kubectl -n cordum rollout status deploy/cordum-scheduler

# 3) Verify durable subject continuity
#    sys.job.submit, sys.job.result, job.*, worker.*.jobs

# 4) Verify best-effort behavior is acceptable
#    sys.heartbeat, sys.job.progress, sys.workflow.event

# 5) Document one-page durability policy per subject family
```
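
Steps 3 and 4 can be made mechanical by comparing what was published during the drill with what consumers received. The sketch below assumes a hypothetical probe that publishes messages numbered 1..N on each subject and records the numbers that arrive; the `missing` and `verdict` helpers and the sample data are illustrative, not part of Cordum.

```go
package main

import "fmt"

// missing returns the sequence numbers in 1..sent that are absent from received.
// Duplicates in received are tolerated, since at-least-once delivery can redeliver.
func missing(sent int, received []int) []int {
	seen := make(map[int]bool, len(received))
	for _, n := range received {
		seen[n] = true
	}
	var gaps []int
	for n := 1; n <= sent; n++ {
		if !seen[n] {
			gaps = append(gaps, n)
		}
	}
	return gaps
}

// verdict classifies a drill result according to the subject's durability class.
func verdict(durable bool, gaps []int) string {
	switch {
	case len(gaps) == 0:
		return "ok"
	case durable:
		return "FAIL: durable subject lost messages"
	default:
		return "expected: best-effort gap"
	}
}

func main() {
	// Made-up drill results: 5 messages published per subject.
	fmt.Println("sys.job.submit:", verdict(true, missing(5, []int{1, 2, 2, 3, 4, 5})))
	fmt.Println("sys.heartbeat: ", verdict(false, missing(5, []int{1, 2, 5})))
	fmt.Println("job.abc:       ", verdict(true, missing(5, []int{1, 2, 5})))
}
```

Only the third line should fail a drill: a gap on a durable subject is a real finding, while the same gap on a best-effort subject is the documented behavior.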

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Make all subjects durable | Fewer ambiguity debates during incidents. | Higher storage/consumer overhead and more operational complexity for low-value transient events. |
| Keep mixed-mode durability (Cordum default style) | Durability budget focused on state-critical traffic. | Requires clear documentation so teams do not misread expected best-effort gaps as bugs. |
| Mostly best-effort | Lower persistence overhead. | Poor fit for control planes where job lifecycle history must be trustworthy. |

Next step

Publish your own subject-family durability policy as an internal one-pager and link it in every incident template. Remove ambiguity before the next restart event.
