The production problem
Most teams treat the dead-letter queue (DLQ) as a final bucket. Messages go in, dashboards go red, and operators click retry by hand until the queue drains.
That approach fails for autonomous agents. Replaying a side-effecting action without policy review and idempotency can repeat damage, not fix it.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS dead-letter queues | Clear redrive configuration and retention guidance (`maxReceiveCount`, retention windows). | No agent-specific replay governance model for side-effecting autonomous actions. |
| Azure Service Bus dead lettering | Practical enablement paths (Portal, CLI, PowerShell, ARM/Bicep). | Limited runtime triage patterns for safe replay and incident response. |
| RabbitMQ Dead Letter Exchanges | Strong low-level DLX routing details, cycle risks, and policy controls. | No control-plane guidance for replay approvals and cross-system audit evidence. |
Failure triage model
| Failure class | Signal | Replay policy |
|---|---|---|
| Transient infrastructure | timeout, dependency_unavailable, no_workers | Auto replay with exponential backoff and same idempotency key |
| Poison payload | schema_invalid, deserialization_error | Do not replay until payload is fixed or transformed |
| Policy/governance | denied, approval_required, policy_snapshot_mismatch | Re-evaluate against latest policy and require human approval if risk is high |
| Side-effect uncertainty | unknown_commit_state, partial_external_write | Run compensating check first, then replay only with manual override |
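
The triage table above can be read as a lookup from failure signal to replay action. The following Go sketch shows one way to encode it. The `ReplayAction` type, the action names, and the `ManualReview` fallback for unrecognized signals are illustrative assumptions, not part of any Cordum API.

```go
package main

import "fmt"

// ReplayAction is a hypothetical enum for the replay policies in the triage table.
type ReplayAction string

const (
	AutoReplay        ReplayAction = "auto_replay"        // exponential backoff, same idempotency key
	Quarantine        ReplayAction = "quarantine"         // hold until the payload is fixed or transformed
	ReevaluatePolicy  ReplayAction = "reevaluate_policy"  // re-check against latest policy, human approval if high risk
	CompensatingCheck ReplayAction = "compensating_check" // verify external state, then manual override only
	ManualReview      ReplayAction = "manual_review"      // assumed fallback for signals missing from the table
)

// actionBySignal mirrors the Signal and Replay policy columns of the table.
var actionBySignal = map[string]ReplayAction{
	"timeout":                  AutoReplay,
	"dependency_unavailable":   AutoReplay,
	"no_workers":               AutoReplay,
	"schema_invalid":           Quarantine,
	"deserialization_error":    Quarantine,
	"denied":                   ReevaluatePolicy,
	"approval_required":        ReevaluatePolicy,
	"policy_snapshot_mismatch": ReevaluatePolicy,
	"unknown_commit_state":     CompensatingCheck,
	"partial_external_write":   CompensatingCheck,
}

// Classify returns the replay action for an error code. Unknown codes fall
// through to ManualReview so that nothing is auto-replayed by accident.
func Classify(errorCode string) ReplayAction {
	if action, ok := actionBySignal[errorCode]; ok {
		return action
	}
	return ManualReview
}

func main() {
	codes := []string{"timeout", "schema_invalid", "denied", "partial_external_write", "oom_killed"}
	for _, code := range codes {
		fmt.Printf("%-24s -> %s\n", code, Classify(code))
	}
}
```

Defaulting unknown signals to the most conservative action is a design choice: a new error code that nobody has classified yet should wait for a human rather than be replayed.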
Cordum DLQ runtime behavior
Cordum stores structured DLQ entries and prioritizes DLQ persistence before message termination. That design avoids the worst-case failure mode: lost message with no forensic trail.
| Control | Current behavior | Why it matters |
|---|---|---|
| DLQ emission | Scheduler emits DLQ for terminal failures except `FAILED_RETRYABLE` | Avoids polluting DLQ with failures still eligible for retry. |
| Retry boundary | 50 scheduling attempts, 1s-30s exponential backoff, then DLQ | Provides deterministic cutoff before human/operator triage. |
| DLQ-first termination | Write DLQ entry before `msg.Term()`; on write failure use `NakWithDelay(5s)` | Prevents silent message loss when termination succeeds but persistence fails. |
| Storage model | Redis keys `dlq:entry:<job_id>` plus sorted index `dlq:index` | Supports fast querying and replay pipelines by status/reason/time. |
Implementation examples
DLQ triage policy (YAML)

    version: v1
    dlq_triage:
      - match:
          error_code: timeout
        action: auto_replay
        backoff: exponential
        max_replays: 3
      - match:
          error_code: schema_invalid
        action: quarantine
      - match:
          error_code: denied
        action: require_human

Replay worker skeleton (Go)
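
    // Skeleton only: DLQEntry, wasAlreadyProcessed, policyClient, and dispatch
    // are assumed to be defined elsewhere. Requires the "context" and "fmt" packages.
    func ReplayDLQEntry(ctx context.Context, entry DLQEntry) error {
        // Skip entries whose original attempt already completed.
        if wasAlreadyProcessed(entry.IdempotencyKey) {
            return nil
        }
        // Re-evaluate the request against policy before any replay.
        decision, err := policyClient.Evaluate(ctx, entry.JobRequest)
        if err != nil {
            return err
        }
        if decision == "deny" {
            return fmt.Errorf("replay denied by policy")
        }
        return dispatch(entry.JobRequest) // keep original idempotency key
    }

DLQ entry shape for triage (JSON)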

    {
      "job_id": "job_74c2",
      "topic": "tool.github.pr.create",
      "last_state": "FAILED",
      "error_code": "dependency_unavailable",
      "attempts": 50,
      "idempotency_key": "run_2f91:step_3",
      "policy_snapshot": "sha256:ab91...",
      "replay_status": "pending_review"
    }

Limitations and tradeoffs
- Aggressive auto-replay reduces operator load but can hide systemic defects longer.
- Strict replay approvals improve safety but increase mean time to recovery for low-risk failures.
- DLQ reason-code quality depends on producer discipline; vague error messages destroy triage speed.
- Replay policy drift between teams can reintroduce manual, inconsistent incident handling.
Next step
Run this in one sprint:
1. Define a DLQ reason-code taxonomy with replay actions per code.
2. Require policy evaluation and idempotency checks in the replay pipeline.
3. Add an alert on DLQ growth rate, not only absolute depth.
4. Rehearse one incident where replay is intentionally denied by policy.
Continue with AI Agent Rollback and Compensation and AI Agent Circuit Breaker Pattern.