The production problem
A poison message is not just a malformed payload. It is any message that keeps failing and burns throughput while making no progress.
In autonomous agent systems, that failure can be expensive. A bad replay loop can open duplicate tickets, send repeated tool calls, and flood on-call with noise instead of signal.
Most incidents are not caused by one failure. They come from unbounded retries, weak classification, and replay paths that skip policy controls.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS dead-letter queues | Clear redrive policy guidance (`maxReceiveCount`) and retention caveats for standard vs FIFO. | No governance model for autonomous tool execution replay. |
| RabbitMQ Dead Letter Exchanges | Precise dead-letter triggers and policy-vs-argument configuration tradeoffs. | No policy-gated replay path for agent side effects outside the broker. |
| Google Pub/Sub dead-letter topics | Concrete delivery-attempt limits (5-100) and subscription-level dead lettering controls. | No run-level idempotency strategy for autonomous workflows after redelivery. |
Poison message taxonomy
Do not route all failures to one bucket. Classify by recovery probability and side-effect risk, then bind each class to one action.
| Failure class | Primary signal | Action | Risk if wrong |
|---|---|---|---|
| Transient infra failure | Timeouts, temporary dependency outage, lock contention | Retry with jittered backoff and strict max-attempt budget | Retry storm if no cap |
| Schema or payload failure | Decode error, missing required fields, malformed context pointer | Fail fast to DLQ with reason code and payload fingerprint | Infinite failures if retried blindly |
| Policy denial | Safety deny, missing approval, blocked capability/risk tag | Do not auto-retry; require policy change or explicit approval | Unsafe bypass if replay ignores policy |
| Poison side effect | Repeated external 4xx/semantic conflict despite retries | Quarantine and require idempotency/correction before replay | Duplicate tickets, PRs, or infrastructure mutations |
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Dispatch retry ceiling | 50 scheduling attempts with exponential backoff from 1s to 30s | Retry loops are bounded before terminal DLQ to prevent unbounded churn. |
| Bus-level redelivery | JetStream at-least-once with AckWait 10m and MaxDeliver 100 | Consumer code must assume duplicate delivery and remain replay-safe. |
| DLQ-first termination | DLQ write callback runs before message termination; on write error, NAK with 5s delay | Prevents message loss in crash windows between termination and persistence. |
| Replay API workflow | DLQ retry endpoint rehydrates context into a new job id and re-dispatches | Replay becomes explicit, auditable, and policy-controllable. |
Practical baseline: keep retry budgets finite, publish terminal failures to DLQ with reason codes, and force replay through policy + idempotency checks.
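That baseline replay gate can be sketched as follows, assuming a denied-reason set like the governance policy below and a content-derived idempotency key; `ReplayRequest`, `gateReplay`, and the field names are hypothetical shapes for illustration, not the actual replay API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ReplayRequest is an illustrative shape for one DLQ replay attempt.
type ReplayRequest struct {
	JobID      string
	ReasonCode string
	Payload    []byte
	Approved   bool // explicit operator approval for gated classes
}

// Reason codes that must never auto-replay without approval.
var deniedForAutoReplay = map[string]bool{
	"safety_denied":            true,
	"schema_invalid":           true,
	"payload_unmarshal_failed": true,
}

// idempotencyKey derives a stable key from the original job id and payload,
// so a redelivered or replayed copy of the same work is detectable downstream.
func idempotencyKey(r ReplayRequest) string {
	h := sha256.New()
	h.Write([]byte(r.JobID))
	h.Write(r.Payload)
	return hex.EncodeToString(h.Sum(nil))
}

// gateReplay enforces the baseline order: policy check first, then
// idempotency key issuance. No key, no replay.
func gateReplay(r ReplayRequest) (string, error) {
	if deniedForAutoReplay[r.ReasonCode] && !r.Approved {
		return "", errors.New("replay denied: requires policy change or explicit approval")
	}
	return idempotencyKey(r), nil
}

func main() {
	key, err := gateReplay(ReplayRequest{
		JobID: "job-123", ReasonCode: "timeout", Payload: []byte(`{"op":"create_ticket"}`),
	})
	fmt.Println(key[:8], err)

	_, err = gateReplay(ReplayRequest{JobID: "job-456", ReasonCode: "safety_denied"})
	fmt.Println(err)
}
```

Deriving the key from job identity plus payload (rather than a random UUID per delivery) is what makes it survive redelivery: the same work always maps to the same key.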
Implementation examples
Failure classifier (Go)

```go
type Action string

const (
	ActionRetry      Action = "retry"
	ActionToDLQ      Action = "dlq"
	ActionNeedReview Action = "manual_review"
)

func classifyFailure(code string, attempts int, maxAttempts int) Action {
	switch code {
	case "timeout", "dependency_unavailable", "store_lock_busy":
		if attempts < maxAttempts {
			return ActionRetry
		}
		return ActionToDLQ
	case "schema_invalid", "payload_unmarshal_failed", "no_pool_mapping":
		return ActionToDLQ
	case "safety_denied", "approval_required":
		return ActionNeedReview
	default:
		return ActionToDLQ
	}
}
```

Replay governance policy (YAML)
```yaml
replay_controls:
  max_retries_per_message: 3
  require_policy_check: true
  require_idempotency_key: true
  auto_retry:
    allowed_reason_codes:
      - timeout
      - dependency_unavailable
      - store_lock_busy
    denied_reason_codes:
      - safety_denied
      - schema_invalid
      - payload_unmarshal_failed
```

DLQ operations (cURL)
```bash
# List DLQ entries
curl -sS http://localhost:8081/api/v1/dlq \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"

# Retry one entry (creates a new job id)
curl -sS -X POST http://localhost:8081/api/v1/dlq/JOB_ID/retry \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"
```

Limitations and tradeoffs
- Aggressive fail-fast classification can move recoverable messages to DLQ too early.
- Long retry windows reduce DLQ noise but increase queue latency under partial outages.
- Manual replay reviews reduce blast radius but add operational overhead.
- Replay safety depends on idempotent external systems, not just broker semantics.
Next step
Run this in one sprint:
1. Define 4-6 terminal reason codes and map each to retry, DLQ, or manual review.
2. Enforce max-attempt budgets per class instead of one global retry policy.
3. Add a replay checklist: policy check, idempotency key, side-effect simulation.
4. Track DLQ depth, replay success rate, and duplicate-detected rate as release gates.
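Since the bus is at-least-once, step 4's duplicate-detected rate falls out naturally from the dedup check every consumer needs anyway. A minimal in-memory sketch (a production version would use a TTL'd shared store; `DedupWindow` and its method names are illustrative, not the runtime's):

```go
package main

import (
	"fmt"
	"sync"
)

// DedupWindow tracks idempotency keys already processed and counts
// how often a key arrives again (i.e., a detected duplicate delivery).
type DedupWindow struct {
	mu        sync.Mutex
	seen      map[string]bool
	total     int
	duplicate int
}

func NewDedupWindow() *DedupWindow {
	return &DedupWindow{seen: make(map[string]bool)}
}

// FirstDelivery reports whether this idempotency key is new. Duplicates
// are counted but not reprocessed, keeping the consumer replay-safe.
func (d *DedupWindow) FirstDelivery(key string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.total++
	if d.seen[key] {
		d.duplicate++
		return false
	}
	d.seen[key] = true
	return true
}

// DuplicateRate is the release-gate metric: duplicates / total deliveries.
func (d *DedupWindow) DuplicateRate() float64 {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.total == 0 {
		return 0
	}
	return float64(d.duplicate) / float64(d.total)
}

func main() {
	w := NewDedupWindow()
	for _, k := range []string{"key-a", "key-b", "key-a"} { // "key-a" redelivered
		fmt.Println(k, w.FirstDelivery(k))
	}
	fmt.Printf("duplicate rate: %.2f\n", w.DuplicateRate())
}
```

A rising duplicate rate is usually a signal about ack timeouts or consumer latency, not data corruption, which is why it belongs on a dashboard rather than in an alert that pages someone.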
Continue with "AI Agent DLQ and Replay Patterns" and "AI Agent Idempotency Keys".