The production problem
A generic DLQ setup assumes retrying is cheap. Autonomous agents break that assumption. A replayed tool call can create a second ticket, send a second email, or push a second config change.
Example: 1,000 nightly jobs with a 2% failure rate. Blind replay of the 20 failed jobs is manageable when every failure is transient; it is dangerous when even two or three of them are in an unknown commit state.
The goal is not “empty DLQ quickly.” The goal is “recover safely with an audit trail.”
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS DLQ redrive (Developer Guide) | Operational redrive controls: destination queue choice, velocity control, task limits, and queue permissions. | No policy re-evaluation model for autonomous agents before replaying side-effecting actions. |
| AWS Compute Blog: replay with backoff | Practical replay counter and exponential backoff with jitter, plus a final human-operated queue. | No cross-system idempotency contract or governance gate tied to replay risk class. |
| RabbitMQ at-least-once dead lettering | Critical durability tradeoff between at-most-once and at-least-once dead lettering. | No control-plane workflow for approval-required replay when commit state is uncertain. |
Failure triage model
Treat replay as a policy decision, not a queue operation.
| Failure class | Signal | Replay policy |
|---|---|---|
| Transient infrastructure | timeout, dependency_unavailable, no_workers | Auto replay with capped exponential backoff using the same idempotency key |
| Poison payload | schema_invalid, parse_error, tool_contract_mismatch | Do not replay until payload or adapter mapping is fixed |
| Policy and governance | denied, approval_required, policy_snapshot_mismatch | Re-evaluate against latest policy and require explicit approval for high risk |
| Unknown commit state | partial_external_write, unknown_commit_state | Run compensating read/check first; default to manual decision |
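To make the "policy decision, not queue operation" framing concrete, here is a minimal table-driven triage sketch. The `ReplayAction` values and the `Triage` function are illustrative names for this example, not part of any documented API; the mapping simply mirrors the table above, with unknown reason codes defaulting to manual review.

```go
package triage

// ReplayAction is what the control plane should do with a DLQ entry.
// These values mirror the triage table above and are illustrative only.
type ReplayAction string

const (
	AutoReplay      ReplayAction = "auto_replay"
	Quarantine      ReplayAction = "quarantine"
	RequireApproval ReplayAction = "require_approval"
	ManualReview    ReplayAction = "manual_review"
)

// defaultPolicy maps reason codes to replay actions, following the table above.
var defaultPolicy = map[string]ReplayAction{
	// Transient infrastructure: safe to auto-replay with the same idempotency key.
	"timeout":                AutoReplay,
	"dependency_unavailable": AutoReplay,
	"no_workers":             AutoReplay,
	// Poison payload: fix the payload or adapter mapping before any replay.
	"schema_invalid":         Quarantine,
	"parse_error":            Quarantine,
	"tool_contract_mismatch": Quarantine,
	// Policy and governance: re-evaluate against latest policy and gate on approval.
	"denied":                   RequireApproval,
	"approval_required":        RequireApproval,
	"policy_snapshot_mismatch": RequireApproval,
	// Unknown commit state: never replay automatically.
	"partial_external_write": ManualReview,
	"unknown_commit_state":   ManualReview,
}

// Triage resolves a reason code to a replay action, defaulting to manual review
// so an unrecognized or misspelled reason code can never trigger an automatic
// side-effecting replay.
func Triage(reasonCode string) ReplayAction {
	if action, ok := defaultPolicy[reasonCode]; ok {
		return action
	}
	return ManualReview
}
```

Defaulting to manual review is the safety valve: a new reason code added upstream degrades to a human decision, not to a blind replay.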
Cordum runtime evidence
These controls are verified against current source and docs, not inferred from marketing copy.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Retry ceiling before DLQ | Scheduler caps scheduling attempts at 50 with 1s-30s backoff (~25 min). | core/controlplane/scheduler/engine.go | Puts a deterministic boundary on auto-retry loops. |
| Retry exhaustion reason code | After ceiling, state moves to FAILED and DLQ emit uses reason `max_scheduling_retries`. | core/controlplane/scheduler/engine.go | Makes replay routing rule-based instead of log scraping. |
| DLQ persistence ordering | Bus writes DLQ before `msg.Term()`. DLQ write failure triggers `NakWithDelay(5s)`. | core/infra/bus/nats.go | Reduces message-loss risk between termination and DLQ persistence. |
| DLQ storage layout | Redis stores `dlq:entry:<jobID>` plus `dlq:index`; default entry TTL is 30 days. | docs/redis-operations.md | Enables fast triage queries while bounding storage growth. |
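The retry ceiling and backoff window above are easy to mirror in your own consumers. The sketch below is not the Cordum scheduler; it is only an illustration of the documented shape (50-attempt cap, backoff clamped to 1s-30s) with full jitter added, using hypothetical names.

```go
package replaybackoff

import (
	"math/rand"
	"time"
)

const (
	maxAttempts = 50               // documented scheduling-attempt ceiling
	baseDelay   = 1 * time.Second  // documented lower bound of the backoff window
	maxDelay    = 30 * time.Second // documented upper bound of the backoff window
)

// NextDelay returns the delay before the given attempt (1-based), and false once
// the ceiling is reached, at which point the job should move to FAILED and be
// emitted to the DLQ with a machine-readable reason code.
func NextDelay(attempt int) (time.Duration, bool) {
	if attempt >= maxAttempts {
		return 0, false
	}
	// Double the delay per attempt, clamped to the 1s-30s window.
	delay := baseDelay
	for i := 1; i < attempt && delay < maxDelay; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	// Full jitter in [baseDelay, delay] keeps correlated failures from
	// hammering a recovering dependency in lockstep.
	return baseDelay + time.Duration(rand.Int63n(int64(delay-baseDelay)+1)), true
}
```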
Known caveat (important)
`TestDLQEmitFailureDoesNotBlockStateTransition` documents a bug path: the terminal state can be written before the DLQ emit succeeds. If the emit fails and a redelivery then sees the terminal state, the DLQ entry can be missed entirely. Plan replay runbooks accordingly.
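A cheap mitigation is a periodic reconciliation pass that flags terminal jobs with no matching DLQ record, so a lost emit surfaces as an alert instead of a silent gap. The following is a sketch only: `Store`, `ListFailedJobs`, `HasDLQEntry`, and the `Job` shape are hypothetical hooks into your job store and DLQ store, not Cordum APIs.

```go
package reconcile

import (
	"context"
	"fmt"
	"time"
)

// Job is a minimal view of a terminal job record (hypothetical shape).
type Job struct {
	ID string
}

// Store abstracts the job store and DLQ store; both methods are hypothetical.
type Store interface {
	ListFailedJobs(ctx context.Context, since time.Time) ([]Job, error)
	HasDLQEntry(ctx context.Context, jobID string) (bool, error) // e.g. an existence check on the DLQ entry key
}

// ReconcileMissingDLQEntries flags FAILED jobs that have no DLQ record, covering
// the emit-after-terminal-state path described above. report is whatever alerting
// hook the runbook uses.
func ReconcileMissingDLQEntries(ctx context.Context, s Store, since time.Time, report func(Job)) error {
	jobs, err := s.ListFailedJobs(ctx, since)
	if err != nil {
		return fmt.Errorf("list failed jobs: %w", err)
	}
	for _, job := range jobs {
		ok, err := s.HasDLQEntry(ctx, job.ID)
		if err != nil {
			return fmt.Errorf("check dlq entry for %s: %w", job.ID, err)
		}
		if !ok {
			// Terminal state with no DLQ record: surface it for manual triage
			// rather than letting the failure vanish from replay tooling.
			report(job)
		}
	}
	return nil
}
```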
Implementation examples
DLQ triage policy (YAML)
```yaml
version: v1
dlq_rules:
  - match:
      reason_code: max_scheduling_retries
    action: auto_replay
    max_replays: 2
    backoff: exponential_jitter
  - match:
      reason_code: schema_invalid
    action: quarantine
    owner: integration-team
  - match:
      reason_code: denied
    action: require_approval
    approver_group: platform-governance
  - match:
      reason_code: unknown_commit_state
    action: manual_review
```
Replay worker skeleton (Go)
```go
func ReplayDLQ(ctx context.Context, entry DLQEntry) error {
	// Skip work that already committed under this idempotency key.
	if AlreadyCommitted(entry.IdempotencyKey) {
		return nil
	}
	// Re-evaluate against the latest policy before any side-effecting dispatch.
	decision, err := policyClient.Evaluate(ctx, entry.Request)
	if err != nil {
		return fmt.Errorf("policy evaluate: %w", err)
	}
	if decision == "deny" {
		return fmt.Errorf("replay denied by policy")
	}
	// Preserve the original idempotency key to avoid duplicate side effects.
	entry.Request.IdempotencyKey = entry.IdempotencyKey
	return dispatcher.Dispatch(ctx, entry.Request)
}
```
DLQ entry shape for triage (JSON)
```json
{
  "job_id": "job_74c2",
  "topic": "tool.github.pr.create",
  "status": "FAILED",
  "reason_code": "max_scheduling_retries",
  "reason": "max scheduling retries exceeded (attempts=50)",
  "attempts": 50,
  "idempotency_key": "run_2f91:step_3",
  "policy_snapshot": "sha256:ab91...",
  "replay_status": "pending_review"
}
```
Limitations and tradeoffs
- Auto replay lowers pager load, but it can delay detection of systemic defects.
- Strict approval gates improve safety, but they increase MTTR for low-risk transient failures.
- Reason-code quality is a hard dependency; vague errors turn replay into guesswork.
- Current consistency tests document a DLQ caveat: if DLQ emit fails after the terminal state transition, the job can stay terminal while the DLQ record is lost (`engine_consistency_test.go`, BUG-8).
- DLQ add/trim is not fully transactional in documented tests; TTL mitigates orphaned entries but does not remove the risk entirely.
Next step
Run this as a one-sprint reliability upgrade:
1. Define a reason-code taxonomy and a default replay action per reason.
2. Enforce policy re-evaluation plus idempotency on every replay request.
3. Alert on DLQ growth rate and replay-denial rate, not only queue depth (see the metrics sketch after this list).
4. Rehearse one incident where replay is intentionally denied by policy.
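For step 3, two counters are usually enough to alert on rate rather than depth. The sketch below assumes a Prometheus-style metrics stack, which is an assumption of this example and not something the Cordum docs prescribe; all metric and function names are illustrative.

```go
package dlqmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// dlqEntriesTotal counts DLQ writes by reason code so growth rate can be
	// alerted on per failure class, not just on total queue depth.
	dlqEntriesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "dlq_entries_total",
		Help: "DLQ entries written, labeled by reason code.",
	}, []string{"reason_code"})

	// replayDeniedTotal counts replays blocked by policy re-evaluation; a rising
	// rate usually means policy drift between enqueue time and replay time.
	replayDeniedTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "dlq_replay_denied_total",
		Help: "Replay requests denied by policy re-evaluation.",
	})
)

// RecordDLQEntry is called after a DLQ write succeeds.
func RecordDLQEntry(reasonCode string) {
	dlqEntriesTotal.WithLabelValues(reasonCode).Inc()
}

// RecordReplayDenied is called when a replay is denied by policy.
func RecordReplayDenied() {
	replayDeniedTotal.Inc()
}
```

With counters like these, alert on windowed rates (for example, the 15-minute rate of each metric) rather than on absolute counts, so a systemic defect shows up as a spike even while replay keeps the queue shallow.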
Continue with AI Agent Rollback and Compensation, then AI Agent Circuit Breaker Pattern.