The production problem
A poison payload hits max deliveries. The handler calls `Term()` and moves on. Then the process crashes before writing DLQ evidence.
Queue pressure goes down, but your forensic trail disappears.
That is the wrong trade. You want both queue health and evidence durability.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: JetStream Consumers | Ack semantics, redelivery, and max-delivery behavior for consumers. | No explicit crash-window discussion for DLQ ordering around terminal ack paths. |
| NATS docs: JetStream Model Deep Dive | `AckTerm` (`+TERM`) protocol meaning and acknowledgment models. | Does not provide a production-safe DLQ-first sequencing pattern. |
| NATS docs: JetStream Overview | At-least-once reliability model and acknowledgment caveats. | No code-level poison-message handling template with durable forensic trace guarantees. |
Cordum runtime mechanics
Cordum closes this failure window by writing to the DLQ first and calling `Term()` only after the DLQ write succeeds.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Delivery threshold | Cordum inspects delivery metadata and terminates when redelivery count reaches configured max. | Poison pills stop blocking queue progress. |
| Ordering | Cordum attempts DLQ write before `msg.Term()`. | If DLQ write fails, message is NAKed with delay and retried. |
| Handler-level corruption | Unmarshal failures beyond threshold also route to termination path. | Corrupt payloads do not loop forever. |
| Observability | Warnings log delivery counts and poison termination events. | Operators can detect repeat offenders and scope replay. |
```go
// core/infra/bus/nats.go (excerpt)
if numDelivered >= uint64(maxJSRedeliveries) {
	slog.Warn("bus: terminating poison message", ...)
	// DLQ write BEFORE Term — prevents data loss if we crash between Term and DLQ write.
	if b.OnMessageTerminated != nil {
		if dlqErr := b.OnMessageTerminated(subject, msg.Data, numDelivered); dlqErr != nil {
			slog.Error("bus: dlq write failed, nak-ing for retry", ...)
			_ = msg.NakWithDelay(5 * time.Second)
			return
		}
	}
	if termErr := msg.Term(); termErr != nil {
		slog.Error("bus: term failed", ...)
	}
	return
}
```
Why ordering matters
`AckTerm` is terminal for redelivery. After that, you cannot rely on JetStream to hand you the message again.
If evidence persistence happens after terminal ack, one crash can convert a known poison event into a silent hole.
```go
// Risky ordering (do not use)
_ = msg.Term()
_ = writeToDLQ(data) // if process crashes before this line, failure evidence is gone
```
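To make the crash window concrete, here is a minimal, self-contained simulation of both orderings. It is a sketch and not Cordum code: the `outcome` type, the step names, and the `run` and `evidenceLost` helpers are all hypothetical, and a "crash" is modeled simply as stopping after a fixed number of completed steps.

```go
package main

import "fmt"

// outcome records what survives a simulated handler run.
// (Hypothetical type, used only for this illustration.)
type outcome struct {
	dlqWritten bool // evidence persisted to the DLQ
	terminated bool // message acked with Term, so no further redelivery
}

// run executes the steps in order and stops once crashAfter steps have
// completed, simulating a process crash at that point.
func run(steps []string, crashAfter int) outcome {
	var o outcome
	for i, step := range steps {
		if i == crashAfter {
			break // simulated crash: remaining steps never execute
		}
		switch step {
		case "dlq":
			o.dlqWritten = true
		case "term":
			o.terminated = true
		}
	}
	return o
}

// evidenceLost reports the silent-hole case: the message will not be
// redelivered, and nothing was written to the DLQ.
func evidenceLost(o outcome) bool {
	return o.terminated && !o.dlqWritten
}

func main() {
	termFirst := []string{"term", "dlq"}
	dlqFirst := []string{"dlq", "term"}

	// Crash after exactly one step has completed.
	fmt.Println("term-first lost evidence:", evidenceLost(run(termFirst, 1))) // true
	fmt.Println("dlq-first lost evidence:", evidenceLost(run(dlqFirst, 1)))   // false
}
```

With the DLQ-first ordering, a crash at the same point leaves the message un-terminated, so it stays eligible for redelivery and the evidence is already on disk.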
Validation runbook
Validate this in staging with forced restarts. A correctness guarantee that is not tested under crash timing is just optimism.
```bash
# 1) Inject known poison payload into staging stream
# 2) Confirm delivery count reaches max threshold
# 3) Validate DLQ entry exists BEFORE termination
# 4) Force process restart during termination path to test crash resilience
# 5) Verify replay workflow can consume DLQ evidence
```
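Step 3 can be checked mechanically if the worker's DLQ writes and terminations are available as an ordered event list, for example parsed from logs or captured by test hooks. The sketch below assumes exactly that; the `Event` type, its `Kind` values, and `termWithoutDLQ` are hypothetical names, not part of Cordum or NATS.

```go
package main

import "fmt"

// Event is a hypothetical record parsed from worker logs or test hooks.
type Event struct {
	MsgID string // identifier of the poison message
	Kind  string // "dlq_write" or "term"
}

// termWithoutDLQ returns the IDs of messages that were terminated before any
// DLQ write was observed for them, in the order the violations occurred.
func termWithoutDLQ(events []Event) []string {
	written := make(map[string]bool)
	var violations []string
	for _, e := range events {
		switch e.Kind {
		case "dlq_write":
			written[e.MsgID] = true
		case "term":
			if !written[e.MsgID] {
				violations = append(violations, e.MsgID)
			}
		}
	}
	return violations
}

func main() {
	events := []Event{
		{MsgID: "a", Kind: "dlq_write"},
		{MsgID: "a", Kind: "term"},
		{MsgID: "b", Kind: "term"}, // terminated with no prior DLQ write
		{MsgID: "b", Kind: "dlq_write"},
	}
	fmt.Println(termWithoutDLQ(events)) // prints [b]
}
```

An empty result after the forced restarts in step 4 is the signal that the DLQ-first ordering held under crash timing.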
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| DLQ-first then Term (Cordum pattern) | Best forensic integrity and replay support. | Extra write in hot path and more failure handling branches. |
| Term-first then DLQ | Slightly shorter code path in success case. | Crash window can lose poison message evidence. |
| Never Term, only NAK | No terminal drops. | Queue starvation and retry storms for permanently bad payloads. |
Next step
Add a poison-message integration test that kills the worker between the DLQ write and the `Term()` call, then asserts that no evidence is lost. Run it before promoting the rollout.
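One possible shape for that test, reduced to an in-memory model so it runs without a NATS server. Everything here is an assumption made for illustration: `fakeStream`, `handle`, and `runUntilTerminated` are hypothetical, and the model assumes the server keeps redelivering a message that was never terminated. A real integration test would kill an actual worker process against a staging stream.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrash stands in for the worker process dying mid-handler.
var errCrash = errors.New("simulated crash")

// fakeStream models just enough behavior for this sketch: a message keeps
// being redelivered until it is terminated.
type fakeStream struct {
	terminated bool
	deliveries int
}

// handle follows the DLQ-first ordering. crashBeforeTerm injects a failure
// at the boundary between the DLQ write and the terminal ack.
func handle(s *fakeStream, dlq *[]string, payload string, crashBeforeTerm bool) error {
	*dlq = append(*dlq, payload) // evidence first
	if crashBeforeTerm {
		return errCrash // process dies; Term is never sent
	}
	s.terminated = true // terminal ack only after the DLQ write succeeded
	return nil
}

// runUntilTerminated redelivers the payload until the handler terminates it,
// crashing on the first attempt only. It returns the DLQ contents.
func runUntilTerminated(s *fakeStream, payload string) []string {
	var dlq []string
	for !s.terminated {
		s.deliveries++
		crash := s.deliveries == 1
		_ = handle(s, &dlq, payload, crash)
	}
	return dlq
}

func main() {
	s := &fakeStream{}
	dlq := runUntilTerminated(s, "poison")
	fmt.Println("terminated:", s.terminated) // true
	fmt.Println("dlq records:", len(dlq))    // 2
	fmt.Println("deliveries:", s.deliveries) // 2
}
```

In this model the evidence is never lost, but the crash produces a second DLQ record for the same payload. A real test should therefore assert "at least one DLQ record" rather than "exactly one", and replay tooling should tolerate duplicates.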