AI Agent NATS JetStream Poison Message Termination: DLQ-First Ordering That Avoids Crash Windows (2026)

The production problem

A poison payload hits its max-delivery limit. The handler calls `Term()` and moves on. Then the process crashes before it writes any DLQ evidence.

Queue pressure goes down, but your forensic trail disappears.

That is the wrong trade. You want both queue health and evidence durability.

What top results cover and miss

Source	Strong coverage	Missing piece
NATS docs: JetStream Consumers	Ack semantics, redelivery, and max-delivery behavior for consumers.	No explicit crash-window discussion for DLQ ordering around terminal ack paths or per-error-class termination thresholds.
NATS docs: JetStream Model Deep Dive	`AckTerm` (`+TERM`) protocol meaning and acknowledgment models.	Does not provide a production-safe DLQ-first sequencing pattern.
NATS docs: JetStream Overview	At-least-once reliability model and acknowledgment caveats.	No code-level poison-message handling template with durable forensic trace guarantees.

Cordum runtime mechanics

Cordum explicitly handles this failure window by writing DLQ first and terminating only after DLQ success.

Boundary	Current behavior	Operational impact
Delivery threshold	Cordum inspects delivery metadata and terminates when the delivery count reaches the configured maximum (`maxJSRedeliveries`).	Poison pills stop blocking queue progress.
Ordering	Cordum attempts the DLQ write before `msg.Term()`.	If the DLQ write fails, the message is NAKed with a 5-second delay and retried.
Handler-level corruption	Corrupt protobuf payloads use a dedicated cutoff (`poisonUnmarshalThreshold = 3`) before the term path.	Bad wire payloads terminate on the 4th delivery (the check is `numDelivered > 3`) instead of consuming all 100 attempts.
Observability	Warnings log delivery counts and poison termination events.	Operators can detect repeat offenders and scope replay.

Cordum poison-pill path

// core/infra/bus/nats.go (excerpt)
if numDelivered >= uint64(maxJSRedeliveries) {
  slog.Warn("bus: terminating poison message", ...)

  // DLQ write BEFORE Term — prevents data loss if we crash between Term and DLQ write.
  if b.OnMessageTerminated != nil {
    if dlqErr := b.OnMessageTerminated(subject, msg.Data, numDelivered); dlqErr != nil {
      slog.Error("bus: dlq write failed, nak-ing for retry", ...)
      _ = msg.NakWithDelay(5 * time.Second)
      return
    }
  }

  if termErr := msg.Term(); termErr != nil {
    slog.Error("bus: term failed", ...)
  }
  return
}
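
The branch structure in the excerpt can be exercised without a NATS server. The following is a minimal sketch, not Cordum's code: `terminalMsg`, `dlqWriter`, `fakeMsg`, `alwaysFail`, and `terminatePoison` are hypothetical names introduced for illustration, and the interface is a simplified stand-in, so a real client message type would likely need a small adapter to satisfy it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// terminalMsg is a hypothetical, minimal view of a JetStream message.
// It is an assumption for this sketch, not the client library's type.
type terminalMsg interface {
	Term() error
	NakWithDelay(d time.Duration) error
}

// dlqWriter is a hypothetical hook that persists poison evidence durably.
type dlqWriter func(subject string, data []byte, numDelivered uint64) error

// fakeMsg records which ack calls were made so the ordering can be inspected.
type fakeMsg struct {
	calls []string
}

func (m *fakeMsg) Term() error {
	m.calls = append(m.calls, "term")
	return nil
}

func (m *fakeMsg) NakWithDelay(d time.Duration) error {
	m.calls = append(m.calls, "nak")
	return nil
}

// alwaysFail simulates a DLQ outage.
func alwaysFail(string, []byte, uint64) error {
	return errors.New("dlq unavailable")
}

// terminatePoison writes to the DLQ first and terminates only after the
// write succeeds. It returns true when the message was terminated.
func terminatePoison(msg terminalMsg, write dlqWriter, subject string, data []byte, numDelivered uint64) bool {
	if write != nil {
		if err := write(subject, data, numDelivered); err != nil {
			// Evidence is not durable yet, so keep the message redeliverable.
			_ = msg.NakWithDelay(5 * time.Second)
			return false
		}
	}
	return msg.Term() == nil
}

func main() {
	stored := 0
	working := func(string, []byte, uint64) error {
		stored++
		return nil
	}

	outage := &fakeMsg{}
	fmt.Println(terminatePoison(outage, alwaysFail, "jobs.run", []byte("bad"), 100), outage.calls)

	healthy := &fakeMsg{}
	fmt.Println(terminatePoison(healthy, working, "jobs.run", []byte("bad"), 100), healthy.calls, stored)
}
```

The point of the sketch is the invariant: `Term()` is never recorded on a message whose DLQ write returned an error.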

Why ordering matters

`AckTerm` is terminal for redelivery. After that, you cannot rely on JetStream to hand you the message again.

If evidence persistence happens after terminal ack, one crash can convert a known poison event into a silent hole.

Cordum also separates corruption from ordinary retry loops: malformed payloads are cut off much earlier than generic handler failures.

Risky ordering example

// Risky ordering (do not use)
_ = msg.Term()
_ = writeToDLQ(data) // if process crashes before this line, failure evidence is gone

Dual poison thresholds in Cordum

// core/infra/bus/nats.go (excerpt)
const maxJSRedeliveries = 100
const poisonUnmarshalThreshold uint64 = 3

if err := proto.Unmarshal(data, &packet); err != nil {
  if numDelivered > poisonUnmarshalThreshold {
    return msgActionTerm, 0
  }
  return msgActionNakDelay, 5 * time.Second
}
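
Because the comparison is strictly greater-than, the threshold of 3 allows three delayed retries and terminates on the fourth delivery. The sketch below walks through that arithmetic. It assumes the delivery count starts at 1 for the first attempt; the `msgAction` type, its constant values, and the `onUnmarshalError` helper are illustrative assumptions, since the excerpt shows only the names `msgActionTerm` and `msgActionNakDelay`.

```go
package main

import (
	"fmt"
	"time"
)

// msgAction and its values are assumptions; the excerpt shows only the names.
type msgAction int

const (
	msgActionNakDelay msgAction = iota
	msgActionTerm
)

const poisonUnmarshalThreshold uint64 = 3

// onUnmarshalError mirrors the comparison in the excerpt:
// terminate only when numDelivered is strictly greater than the threshold.
func onUnmarshalError(numDelivered uint64) (msgAction, time.Duration) {
	if numDelivered > poisonUnmarshalThreshold {
		return msgActionTerm, 0
	}
	return msgActionNakDelay, 5 * time.Second
}

func main() {
	// Deliveries 1 through 3 are NAKed with a delay; delivery 4 is terminated.
	for n := uint64(1); n <= 4; n++ {
		action, delay := onUnmarshalError(n)
		fmt.Printf("delivery %d: terminate=%v delay=%s\n", n, action == msgActionTerm, delay)
	}
}
```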

Validation runbook

Validate this in staging with forced restarts. A correctness guarantee that is not tested under crash timing is just optimism.

Crash-window validation steps


# 1) Inject known poison payload into staging stream
# 2) Confirm delivery count reaches max threshold
# 3) Inject malformed protobuf payload and verify termination on the 4th delivery (after 3 NAKed attempts)
# 4) Validate DLQ entry exists BEFORE termination in both paths
# 5) Force process restart during termination path to test crash resilience
# 6) Verify replay workflow can consume DLQ evidence
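
Before running step 5 against real infrastructure, the crash-window argument can be checked as a small model. The sketch below is an illustration under simplifying assumptions, not a substitute for the staging test: it treats the DLQ write and the terminal ack as two atomic steps, simulates a process kill at every boundary, and reports whether any boundary leaves a terminated message with no DLQ entry. All names (`state`, `step`, `runWithCrash`, `losesEvidence`) are hypothetical.

```go
package main

import "fmt"

// state models what survives a crash: durable DLQ entries and the
// server-side terminal flag for the message.
type state struct {
	dlqEntries int
	terminated bool
}

// step is one atomic action of the handler.
type step func(s *state)

func writeDLQ(s *state) { s.dlqEntries++ }
func term(s *state)     { s.terminated = true }

// runWithCrash executes steps in order and stops once crashAfter steps
// have completed, simulating a process kill at that boundary.
func runWithCrash(steps []step, crashAfter int) state {
	var s state
	for i, st := range steps {
		if i >= crashAfter {
			break
		}
		st(&s)
	}
	return s
}

// evidenceLost is the failure mode under test: the message is terminated
// but nothing was written to the DLQ.
func evidenceLost(s state) bool {
	return s.terminated && s.dlqEntries == 0
}

// losesEvidence checks every crash boundary for a given ordering.
func losesEvidence(steps []step) bool {
	for crashAfter := 0; crashAfter <= len(steps); crashAfter++ {
		if evidenceLost(runWithCrash(steps, crashAfter)) {
			return true
		}
	}
	return false
}

func main() {
	dlqFirst := []step{writeDLQ, term}
	termFirst := []step{term, writeDLQ}
	fmt.Println("dlq-first loses evidence:", losesEvidence(dlqFirst))
	fmt.Println("term-first loses evidence:", losesEvidence(termFirst))
}
```

The model also shows the cost of the safe ordering: a crash after the DLQ write but before the terminal ack leaves the message redeliverable, so the next delivery writes a second DLQ entry. A replay workflow built on this pattern should probably tolerate duplicate DLQ records.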

Limitations and tradeoffs

Approach	Upside	Downside
DLQ-first then Term (Cordum pattern)	Best forensic integrity and replay support.	Extra write in hot path and more failure handling branches.
Term-first then DLQ	Slightly shorter code path in success case.	Crash window can lose poison message evidence.
Never Term, only NAK	No terminal drops.	Queue starvation and retry storms for permanently bad payloads.

Next step

Add a poison-message integration test that kills the worker between the DLQ write and the `Term()` call. Before promoting the rollout, assert that the DLQ entry exists and that no evidence was lost.
