Name: Cordum
Author: Cordum

The production problem

Incidents get noisy when teams cannot answer one simple question: was this event supposed to be durable?

If the answer lives only in code comments, operators will guess under pressure.

Guessing under pressure is how expected best-effort behavior turns into a fake data-loss incident.

What top results cover and miss

Source	Strong coverage	Missing piece
NATS docs: Core NATS	Core NATS provides best-effort, at-most-once delivery.	No practical guidance on classifying control-plane subjects into durable vs best-effort buckets.
NATS docs: JetStream Consumers	Durable vs ephemeral consumer semantics and ack-driven behavior.	Does not provide an event-taxonomy method for platform teams deciding what must survive restarts.
NATS docs: FAQ	Clear statement of delivery guarantees and reliability boundaries.	No rollout checklist for mixed-mode systems where only some subjects are persisted.

Cordum runtime map

Cordum has an explicit durability classifier. That is good. Most teams still need it translated into an operator-facing matrix.

Area	Current behavior	Operational impact
Durable subjects	Cordum marks `sys.job.submit`, `sys.job.result`, `sys.job.dlq`, and `sys.audit.export` as durable.	State-changing job lifecycle events survive broker restarts in JetStream mode.
Pattern durability	Subjects matching `job.*` and `worker.*.jobs` are durable.	Dispatch and targeted worker job traffic stays in the durable path.
Best-effort broadcasts	Heartbeat, handshake, config-changed, alert, progress, and workflow-event broadcasts intentionally stay on core NATS.	Some transient events can be missed during disconnect windows by design.
Reasoning in code	Comments note self-healing or informational semantics for best-effort subjects.	Durability cost is reserved for events that change authoritative state.

Durability classifier

// core/infra/bus/nats.go
func isDurableSubject(subject string) bool {
  switch subject {
  case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
    return true
  }
  if strings.HasPrefix(subject, "job.") {
    return true
  }
  if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
    return true
  }
  return false
}
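
To see how the classifier treats concrete subjects, here is a minimal standalone sketch. It assumes the `capsdk` constants resolve to the four `sys.*` subject names listed in the runtime map (the SDK source is not shown here), and the sample subjects such as `job.default` and `worker.w1.jobs` are hypothetical. One thing it makes visible: the prefix check is broader than the NATS single-token wildcard `job.*`, because `strings.HasPrefix` also accepts multi-token subjects like `job.a.b`.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values for capsdk.SubjectSubmit and friends, taken from the
// runtime map in this article rather than from the SDK source.
const (
	subjectSubmit      = "sys.job.submit"
	subjectResult      = "sys.job.result"
	subjectDLQ         = "sys.job.dlq"
	subjectAuditExport = "sys.audit.export"
)

// isDurableSubject mirrors the classifier shown above, with the
// capsdk constants replaced by the assumed literals.
func isDurableSubject(subject string) bool {
	switch subject {
	case subjectSubmit, subjectResult, subjectDLQ, subjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

func main() {
	// Hypothetical sample subjects for illustration only.
	samples := []string{
		"sys.job.submit", "job.default", "job.a.b", "worker.w1.jobs",
		"sys.heartbeat", "sys.job.progress", "worker.w1.status",
	}
	for _, s := range samples {
		fmt.Printf("%-20s durable=%v\n", s, isDurableSubject(s))
	}
}
```

Note that `sys.job.progress` is not durable even though it contains `job.`: the check is on the prefix, and the subject starts with `sys.`.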

Best-effort rationale in code comments

// Broadcast subjects intentionally stay core NATS (best-effort):
// - sys.heartbeat: periodic, self-healing
// - sys.config.changed: poll fallback catches missed notices
// - sys.handshake: workers re-register on next cycle
// - sys.alert/sys.job.progress/sys.workflow.event: informational, no hard state dependency

Subject durability matrix

Subject family	Delivery mode	Why this choice works
`sys.job.submit`, `sys.job.result`, `sys.job.dlq`, `sys.audit.export`	Durable (JetStream)	Changes authoritative job state or audit evidence.
`job.*`, `worker.*.jobs`	Durable (JetStream)	Carries dispatch and worker-targeted execution messages.
`sys.heartbeat`, `sys.handshake`	Best-effort (Core NATS)	Periodic/self-healing presence signals.
`sys.config.changed`	Best-effort (Core NATS)	Config polling fallback recovers missed events.
`sys.alert`, `sys.job.progress`, `sys.workflow.event`	Best-effort (Core NATS)	Informational streams where occasional transient misses are tolerated.
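
One way to make the matrix usable during an incident is to encode it as data and look subjects up instead of reading code comments. The sketch below is an illustration, not part of Cordum: the `Mode` type, the `rule` struct, and the `Lookup` function are hypothetical names, and it assumes the best-effort subjects are matched exactly (Cordum may use longer subject names with extra tokens). Subjects that match no row are reported as unclassified rather than guessed.

```go
package main

import (
	"fmt"
	"strings"
)

// Mode is the delivery mode an operator should expect for a subject.
type Mode string

const (
	Durable    Mode = "durable (JetStream)"
	BestEffort Mode = "best-effort (core NATS)"
	Unknown    Mode = "unclassified"
)

// rule is one row of the matrix: a matcher, a mode, and the rationale.
type rule struct {
	match func(subject string) bool
	mode  Mode
	why   string
}

// exact builds a matcher for a fixed list of subject names.
func exact(names ...string) func(string) bool {
	return func(s string) bool {
		for _, n := range names {
			if s == n {
				return true
			}
		}
		return false
	}
}

// dispatch matches the job.* and worker.*.jobs families the same way
// the classifier in this article does (prefix and suffix checks).
func dispatch(s string) bool {
	return strings.HasPrefix(s, "job.") ||
		(strings.HasPrefix(s, "worker.") && strings.HasSuffix(s, ".jobs"))
}

// matrix transcribes the subject durability matrix, row by row.
var matrix = []rule{
	{exact("sys.job.submit", "sys.job.result", "sys.job.dlq", "sys.audit.export"),
		Durable, "changes authoritative job state or audit evidence"},
	{dispatch, Durable, "carries dispatch and worker-targeted execution messages"},
	{exact("sys.heartbeat", "sys.handshake"),
		BestEffort, "periodic, self-healing presence signals"},
	{exact("sys.config.changed"),
		BestEffort, "config polling fallback recovers missed events"},
	{exact("sys.alert", "sys.job.progress", "sys.workflow.event"),
		BestEffort, "informational; occasional transient misses are tolerated"},
}

// Lookup returns the mode and rationale of the first matching row.
func Lookup(subject string) (Mode, string) {
	for _, r := range matrix {
		if r.match(subject) {
			return r.mode, r.why
		}
	}
	return Unknown, "not in the durability matrix; classify it before relying on it"
}

func main() {
	// Hypothetical subjects an operator might see in an incident log.
	for _, s := range []string{"sys.job.result", "worker.w7.jobs", "sys.heartbeat", "sys.custom.topic"} {
		mode, why := Lookup(s)
		fmt.Printf("%s -> %s: %s\n", s, mode, why)
	}
}
```

Returning an explicit unclassified mode is a deliberate choice here: a new subject that nobody has classified should surface as a gap in the policy, not silently default to either bucket.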

Validation runbook

Treat durability mapping as testable behavior. Run restart drills and confirm that only expected best-effort gaps appear.

Restart validation steps

# 1) List subject classes used by each service
# 2) Simulate rolling restart with JetStream enabled
kubectl -n cordum rollout restart deploy/cordum-scheduler

# 3) Verify durable subject continuity
#    sys.job.submit, sys.job.result, job.*, worker.*.jobs

# 4) Verify best-effort behavior is acceptable
#    sys.heartbeat, sys.job.progress, sys.workflow.event

# 5) Document one-page durability policy per subject family
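
Steps 3 and 4 of the drill can be turned into a pass/fail check. The sketch below assumes the drill publishes test messages carrying an application-level sequence number and records which numbers each subscriber received; that harness, the `observation` type, and the sample numbers are all hypothetical. A gap on a durable subject fails the drill, while a gap on a best-effort subject is only reported.

```go
package main

import (
	"fmt"
	"sort"
)

// missing returns the sequence numbers that were published but never
// received, in ascending order.
func missing(published, received []uint64) []uint64 {
	seen := make(map[uint64]bool, len(received))
	for _, seq := range received {
		seen[seq] = true
	}
	var gaps []uint64
	for _, seq := range published {
		if !seen[seq] {
			gaps = append(gaps, seq)
		}
	}
	sort.Slice(gaps, func(i, j int) bool { return gaps[i] < gaps[j] })
	return gaps
}

// observation is what the drill recorded for one subject.
type observation struct {
	subject   string
	durable   bool
	published []uint64
	received  []uint64
}

// check returns an error only when a durable subject lost messages.
// Gaps on best-effort subjects are expected and do not fail the drill.
func check(o observation) error {
	gaps := missing(o.published, o.received)
	if o.durable && len(gaps) > 0 {
		return fmt.Errorf("durable subject %s lost %d message(s): %v", o.subject, len(gaps), gaps)
	}
	return nil
}

func main() {
	// Made-up drill data: the durable subject is complete, the
	// best-effort subject missed two updates during the restart.
	drill := []observation{
		{"sys.job.submit", true, []uint64{1, 2, 3, 4}, []uint64{1, 2, 3, 4}},
		{"sys.job.progress", false, []uint64{1, 2, 3, 4}, []uint64{1, 4}},
	}
	for _, o := range drill {
		if err := check(o); err != nil {
			fmt.Println("FAIL:", err)
			continue
		}
		fmt.Printf("ok: %s (gaps: %v)\n", o.subject, missing(o.published, o.received))
	}
}
```

Logging the best-effort gaps even when the drill passes gives the team a baseline for what a normal restart looks like, which is the number to compare against in the next incident.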

Limitations and tradeoffs

Approach	Upside	Downside
Make all subjects durable	Fewer ambiguity debates during incidents.	Higher storage/consumer overhead and more operational complexity for low-value transient events.
Keep mixed-mode durability (Cordum default style)	Durability budget focused on state-critical traffic.	Requires clear documentation so teams do not misread expected best-effort gaps as bugs.
Mostly best-effort	Lower persistence overhead.	Poor fit for control planes where job lifecycle history must be trustworthy.

Next step

Publish your own subject-family durability policy as an internal one-pager and link it in every incident template. Remove ambiguity before the next restart event.

AI Agent NATS Subject Durability Map

