The production problem
A poison message is not just a malformed payload. It is any message that keeps failing and burns throughput while making no progress.
In autonomous agent systems, that failure can be expensive. A bad replay loop can open duplicate tickets, send repeated tool calls, and flood on-call with noise instead of signal.
Most incidents are not caused by one failure. They come from unbounded retries, weak classification, and replay paths that skip policy controls.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS dead-letter queues | Clear redrive policy guidance (`maxReceiveCount`) and retention caveats for standard vs FIFO. | No governance model for autonomous tool execution replay. |
| RabbitMQ Dead Letter Exchanges | Precise dead-letter triggers and policy-vs-argument configuration tradeoffs. | No policy-gated replay path for agent side effects outside the broker. |
| Google Pub/Sub dead-letter topics | Concrete delivery-attempt limits (5-100) and subscription-level dead lettering controls. | No run-level idempotency strategy for autonomous workflows after redelivery. |
Poison message taxonomy
Do not route all failures to one bucket. Classify by recovery probability and side-effect risk, then bind each class to one action.
| Failure class | Primary signal | Action | Risk if wrong |
|---|---|---|---|
| Transient infra failure | Timeouts, temporary dependency outage, lock contention | Retry with jittered backoff and strict max-attempt budget | Retry storm if no cap |
| Schema or payload failure | Decode error, missing required fields, malformed context pointer | Fail fast to DLQ with reason code and payload fingerprint | Infinite failures if retried blindly |
| Policy denial | Safety deny, missing approval, blocked capability/risk tag | Do not auto-retry; require policy change or explicit approval | Unsafe bypass if replay ignores policy |
| Poison side effect | Repeated external 4xx/semantic conflict despite retries | Quarantine and require idempotency/correction before replay | Duplicate tickets, PRs, or infrastructure mutations |
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Dispatch retry ceiling | 50 scheduling attempts with exponential backoff from 1s to 30s | Retry loops are bounded before terminal DLQ to prevent unbounded churn. |
| Bus-level redelivery | JetStream at-least-once with AckWait 10m and MaxDeliver 100 | Consumer code must assume duplicate delivery and remain replay-safe. |
| DLQ-first termination | DLQ write callback runs before message termination; on write error, NAK with 5s delay | Prevents message loss in crash windows between termination and persistence. |
| Replay API workflow | DLQ retry endpoint rehydrates context into a new job id and re-dispatches | Replay becomes explicit, auditable, and policy-controllable. |
Practical baseline: keep retry budgets finite, publish terminal failures to DLQ with reason codes, and force replay through policy + idempotency checks.
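That baseline replay gate can be sketched as follows, assuming a denied-reason set like the governance policy below and a content-derived idempotency key; `ReplayRequest`, `gateReplay`, and the field names are hypothetical shapes for illustration, not the actual replay API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ReplayRequest is an illustrative shape for one DLQ replay attempt.
type ReplayRequest struct {
	JobID      string
	ReasonCode string
	Payload    []byte
	Approved   bool // explicit operator approval for gated classes
}

// Reason codes that must never auto-replay without approval.
var deniedForAutoReplay = map[string]bool{
	"safety_denied":            true,
	"schema_invalid":           true,
	"payload_unmarshal_failed": true,
}

// idempotencyKey derives a stable key from the original job id and payload,
// so a redelivered or replayed copy of the same work is detectable downstream.
func idempotencyKey(r ReplayRequest) string {
	h := sha256.New()
	h.Write([]byte(r.JobID))
	h.Write(r.Payload)
	return hex.EncodeToString(h.Sum(nil))
}

// gateReplay enforces the baseline order: policy check first, then
// idempotency key issuance. No key, no replay.
func gateReplay(r ReplayRequest) (string, error) {
	if deniedForAutoReplay[r.ReasonCode] && !r.Approved {
		return "", errors.New("replay denied: requires policy change or explicit approval")
	}
	return idempotencyKey(r), nil
}

func main() {
	key, err := gateReplay(ReplayRequest{
		JobID: "job-123", ReasonCode: "timeout", Payload: []byte(`{"op":"create_ticket"}`),
	})
	fmt.Println(key[:8], err)

	_, err = gateReplay(ReplayRequest{JobID: "job-456", ReasonCode: "safety_denied"})
	fmt.Println(err)
}
```

Deriving the key from job identity plus payload (rather than a random UUID per delivery) is what makes it survive redelivery: the same work always maps to the same key.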
Implementation examples
Failure classifier (Go)

```go
type Action string

const (
	ActionRetry      Action = "retry"
	ActionToDLQ      Action = "dlq"
	ActionNeedReview Action = "manual_review"
)

func classifyFailure(code string, attempts int, maxAttempts int) Action {
	switch code {
	case "timeout", "dependency_unavailable", "store_lock_busy":
		if attempts < maxAttempts {
			return ActionRetry
		}
		return ActionToDLQ
	case "schema_invalid", "payload_unmarshal_failed", "no_pool_mapping":
		return ActionToDLQ
	case "safety_denied", "approval_required":
		return ActionNeedReview
	default:
		return ActionToDLQ
	}
}
```

Replay governance policy (YAML)
```yaml
replay_controls:
  max_retries_per_message: 3
  require_policy_check: true
  require_idempotency_key: true
  auto_retry:
    allowed_reason_codes:
      - timeout
      - dependency_unavailable
      - store_lock_busy
    denied_reason_codes:
      - safety_denied
      - schema_invalid
      - payload_unmarshal_failed
```

DLQ operations (cURL)
```bash
# List DLQ entries
curl -sS http://localhost:8081/api/v1/dlq \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"

# Retry one entry (creates a new job id)
curl -sS -X POST http://localhost:8081/api/v1/dlq/JOB_ID/retry \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"
```

Limitations and tradeoffs
- Aggressive fail-fast classification can move recoverable messages to DLQ too early.
- Long retry windows reduce DLQ noise but increase queue latency under partial outages.
- Manual replay reviews reduce blast radius but add operational overhead.
- Replay safety depends on idempotent external systems, not just broker semantics.
Next step
Run this in one sprint:
1. Define 4-6 terminal reason codes and map each to retry, DLQ, or manual review.
2. Enforce max-attempt budgets per class instead of one global retry policy.
3. Add a replay checklist: policy check, idempotency key, side-effect simulation.
4. Track DLQ depth, replay success rate, and duplicate-detected rate as release gates.
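Since the bus is at-least-once, step 4's duplicate-detected rate falls out naturally from the dedup check every consumer needs anyway. A minimal in-memory sketch (a production version would use a TTL'd shared store; `DedupWindow` and its method names are illustrative, not the runtime's):

```go
package main

import (
	"fmt"
	"sync"
)

// DedupWindow tracks idempotency keys already processed and counts
// how often a key arrives again (i.e., a detected duplicate delivery).
type DedupWindow struct {
	mu        sync.Mutex
	seen      map[string]bool
	total     int
	duplicate int
}

func NewDedupWindow() *DedupWindow {
	return &DedupWindow{seen: make(map[string]bool)}
}

// FirstDelivery reports whether this idempotency key is new. Duplicates
// are counted but not reprocessed, keeping the consumer replay-safe.
func (d *DedupWindow) FirstDelivery(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.total++
	if d.seen[key] {
		d.duplicate++
		return false
	}
	d.seen[key] = true
	return true
}

// DuplicateRate is the release-gate metric: duplicates / total deliveries.
func (d *DedupWindow) DuplicateRate() float64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.total == 0 {
		return 0
	}
	return float64(d.duplicate) / float64(d.total)
}

func main() {
	w := NewDedupWindow()
	for _, k := range []string{"key-a", "key-b", "key-a"} { // "key-a" redelivered
		fmt.Println(k, w.FirstDelivery(k))
	}
	fmt.Printf("duplicate rate: %.2f\n", w.DuplicateRate())
}
```

A rising duplicate rate is usually a signal about ack timeouts or consumer latency, not data corruption, which is why it belongs on a dashboard rather than in an alert that pages someone.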
Continue with "AI Agent DLQ and Replay Patterns" and "AI Agent Idempotency Keys".