The production problem
Incidents get noisy when teams cannot answer one simple question: was this event supposed to be durable?
If the answer lives only in code comments, operators will guess under pressure.
Guessing under pressure is how expected best-effort behavior turns into a fake data-loss incident.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Core NATS | Core NATS provides best-effort, at-most-once delivery. | No practical guidance on classifying control-plane subjects into durable vs best-effort buckets. |
| NATS docs: JetStream Consumers | Durable vs ephemeral consumer semantics and ack-driven behavior. | Does not provide an event-taxonomy method for platform teams deciding what must survive restarts. |
| NATS docs: FAQ | Clear statement of delivery guarantees and reliability boundaries. | No rollout checklist for mixed-mode systems where only some subjects are persisted. |
Cordum runtime map
Cordum has an explicit durability classifier. That is good. Most teams still need it translated into an operator-facing matrix.
| Area | Current behavior | Operational impact |
|---|---|---|
| Durable subjects | Cordum marks `sys.job.submit`, `sys.job.result`, `sys.job.dlq`, and `sys.audit.export` as durable. | State-changing job lifecycle events survive broker restarts in JetStream mode. |
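| Pattern durability | Subjects that begin with `job.`, and subjects that begin with `worker.` and end with `.jobs`, are durable. The article writes these families as `job.*` and `worker.*.jobs`. | Dispatch and targeted worker job traffic stays in the durable path. |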
| Best-effort broadcasts | Heartbeat, handshake, config-changed, alert, progress, and workflow-event broadcasts are intentionally core NATS. | Some transient events can be missed during disconnect windows by design. |
| Reasoning in code | Comments note self-healing or informational semantics for best-effort subjects. | Durability cost is reserved for events that change authoritative state. |
// core/infra/bus/nats.go
func isDurableSubject(subject string) bool {
	switch subject {
	case capsdk.SubjectSubmit, capsdk.SubjectResult, capsdk.SubjectDLQ, capsdk.SubjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

// Broadcast subjects intentionally stay core NATS (best-effort):
// - sys.heartbeat: periodic, self-healing
// - sys.config.changed: poll fallback catches missed notices
// - sys.handshake: workers re-register on next cycle
// - sys.alert/sys.job.progress/sys.workflow.event: informational, no hard state dependency
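To see how the classifier behaves on concrete subjects, here is a minimal standalone sketch that mirrors the logic above. The string values standing in for the `capsdk` constants are assumptions taken from the runtime map table, not from the `capsdk` source, and the sample subjects such as `job.default` and `worker.w1.jobs` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed values for the capsdk constants, taken from the runtime map
// table in this article rather than from the capsdk source.
const (
	subjectSubmit      = "sys.job.submit"
	subjectResult      = "sys.job.result"
	subjectDLQ         = "sys.job.dlq"
	subjectAuditExport = "sys.audit.export"
)

// isDurableSubject mirrors the classifier shown above: exact matches first,
// then a prefix check for job traffic, then a prefix+suffix check for
// worker-targeted job traffic.
func isDurableSubject(subject string) bool {
	switch subject {
	case subjectSubmit, subjectResult, subjectDLQ, subjectAuditExport:
		return true
	}
	if strings.HasPrefix(subject, "job.") {
		return true
	}
	if strings.HasPrefix(subject, "worker.") && strings.HasSuffix(subject, ".jobs") {
		return true
	}
	return false
}

func main() {
	// Hypothetical sample subjects covering both delivery modes.
	subjects := []string{
		"sys.job.submit", "job.default", "worker.w1.jobs", "job.a.b",
		"sys.heartbeat", "sys.job.progress", "sys.config.changed",
	}
	for _, s := range subjects {
		mode := "best-effort (core NATS)"
		if isDurableSubject(s) {
			mode = "durable (JetStream)"
		}
		fmt.Printf("%-20s %s\n", s, mode)
	}
}
```

Two details are worth noticing. `sys.job.progress` stays best-effort because the prefix check looks for `job.` at the start of the subject, not anywhere inside it. And the prefix check is broader than the NATS wildcard `job.*`, which matches exactly one token: a multi-token subject such as `job.a.b` is also classified as durable here.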
Subject durability matrix
| Subject family | Delivery mode | Why this choice works |
|---|---|---|
| `sys.job.submit`, `sys.job.result`, `sys.job.dlq`, `sys.audit.export` | Durable (JetStream) | Changes authoritative job state or audit evidence. |
| `job.*`, `worker.*.jobs` | Durable (JetStream) | Carries dispatch and worker-targeted execution messages. |
| `sys.heartbeat`, `sys.handshake` | Best-effort (Core NATS) | Periodic/self-healing presence signals. |
| `sys.config.changed` | Best-effort (Core NATS) | Config polling fallback recovers missed events. |
| `sys.alert`, `sys.job.progress`, `sys.workflow.event` | Best-effort (Core NATS) | Informational streams where occasional transient misses are tolerated. |
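The `sys.config.changed` row depends on a fallback: a missed notice is harmless only if something else eventually reconciles state. The sketch below illustrates that pattern in isolation. The `configStore` and `watcher` types are hypothetical and are not Cordum code; the point is that one reconcile function serves both the notice handler and the poll tick.

```go
package main

import "fmt"

// configStore is a hypothetical authoritative source of configuration.
type configStore struct{ version int }

// watcher is a hypothetical consumer that tracks the last version it applied.
type watcher struct{ applied int }

// sync reloads from the store if the watcher is behind and reports whether
// anything changed. It is meant to be called both when a change notice
// arrives and on every poll tick, so a missed notice only delays
// convergence until the next poll.
func (w *watcher) sync(s *configStore) bool {
	if w.applied == s.version {
		return false
	}
	w.applied = s.version
	return true
}

func main() {
	store := &configStore{version: 1}
	w := &watcher{}
	w.sync(store) // initial load

	// A change is published while the watcher is disconnected,
	// so the best-effort notice never arrives.
	store.version = 2
	fmt.Println("stale after missed notice:", w.applied != store.version)

	// The next poll tick reconciles without any notice.
	changed := w.sync(store)
	fmt.Println("poll reloaded:", changed, "applied version:", w.applied)
}
```

If a subject has no equivalent reconcile path, that is a hint it may belong in the durable bucket instead.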
Validation runbook
Treat durability mapping as testable behavior. Run restart drills and confirm that only expected best-effort gaps appear.
# 1) List subject classes used by each service
# 2) Simulate rolling restart with JetStream enabled
kubectl -n cordum rollout restart deploy/cordum-scheduler
# 3) Verify durable subject continuity
#    sys.job.submit, sys.job.result, job.*, worker.*.jobs
# 4) Verify best-effort behavior is acceptable
#    sys.heartbeat, sys.job.progress, sys.workflow.event
# 5) Document one-page durability policy per subject family
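One way to make steps 3 and 4 measurable is to publish numbered probe messages during the drill and then check the received sequence numbers for gaps. The sketch below covers only the gap check; publishing and collecting the probes is left out, and the function name and sample data are hypothetical.

```go
package main

import "fmt"

// missingSeqs returns the sequence numbers in [first, last] that do not
// appear in received. Duplicates and out-of-order arrivals are tolerated,
// since a durable consumer may see redeliveries after a restart.
func missingSeqs(first, last int, received []int) []int {
	seen := make(map[int]bool, len(received))
	for _, s := range received {
		seen[s] = true
	}
	var missing []int
	for s := first; s <= last; s++ {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	// Hypothetical drill results: probes 1..6 were published on each subject.
	durable := []int{1, 2, 3, 3, 5, 4, 6} // redelivery and reordering, no loss
	bestEffort := []int{1, 2, 5, 6}       // probes 3 and 4 fell in the restart window

	fmt.Println("durable gaps:", missingSeqs(1, 6, durable))
	fmt.Println("best-effort gaps:", missingSeqs(1, 6, bestEffort))
}
```

For a durable subject, any gap is a finding to investigate. For a best-effort subject, a gap confined to the restart window is the expected result, and the drill only needs to confirm that consumers recovered afterward.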
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Make all subjects durable | Fewer ambiguity debates during incidents. | Higher storage/consumer overhead and more operational complexity for low-value transient events. |
| Keep mixed-mode durability (Cordum default style) | Durability budget focused on state-critical traffic. | Requires clear documentation so teams do not misread expected best-effort gaps as bugs. |
| Mostly best-effort | Lower persistence overhead. | Poor fit for control planes where job lifecycle history must be trustworthy. |
Next step
Publish your own subject-family durability policy as an internal one-pager and link it in every incident template. Remove ambiguity before the next restart event.