The production problem
Most teams treat DLQ as the end of a reliability story. It is usually the middle.
A failed job only becomes actionable if failure context survives transport incidents, broker hiccups, and store timeouts.
If DLQ emission fails during an outage, incident responders lose the evidence trail and spend the next hour reconstructing state from logs.
That reconstruction step is where MTTR gets expensive.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS SQS dead-letter queues | How messages move to a DLQ after `maxReceiveCount`, plus retention and redrive guidance. | Does not cover the reliability of the DLQ emission step itself when your own control plane publishes DLQ events. |
| Google Pub/Sub dead-letter topics | Configuring dead-letter topics and max delivery attempts for undeliverable messages. | No design pattern for dual-write DLQ persistence when bus and store can fail independently. |
| RabbitMQ dead-letter exchanges | DLX routing mechanics and policy configuration details. | Focuses on broker routing semantics, not application-level retry and telemetry when DLQ publishing fails. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| DLQ write order | `emitDLQ` writes to `DLQSink` first (if configured), then publishes `JobResult` to `sys.job.dlq`. | Durable store can still capture failures when bus publish is unavailable. |
| Retry policy | `emitDLQWithRetry` retries once after `500ms` and then stops. | Simple and bounded, but short outages can still drop bus-level DLQ events. |
| Observability | Final retry failure increments `IncDLQEmitFailure(topic)` metric. | You get a signal for DLQ-emission degradation instead of silent loss. |
| Timeout budget | Sink add uses `storeOpTimeout = 2s` context deadline. | Prevents long blocking in failure paths but can reject slow storage during incident conditions. |
| Test coverage | Hardening tests verify success-on-second-try, permanent-failure metric increment, and sink persistence when bus is down. | Behavior is intentional and regression-protected. |
DLQ code paths
Sink-first emit path
```go
// core/controlplane/scheduler/engine.go (excerpt)
// dlqSubject ("sys.job.dlq") and packet (the marshaled JobResult) are defined
// elsewhere in the file.
func (e *Engine) emitDLQ(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	// Sink-first: persist to the durable DLQ store before touching the bus.
	if e.dlqSink != nil {
		ctx, cancel := context.WithTimeout(e.ctx, storeOpTimeout)
		err := e.dlqSink.Add(ctx, DLQEntry{JobID: jobID, Topic: topic, Status: status.String(), Reason: reason, ReasonCode: reasonCode})
		cancel()
		if err != nil {
			return fmt.Errorf("dlq sink add failed: %w", err)
		}
	}
	// No bus configured: the sink write alone counts as success.
	if e.bus == nil {
		return nil
	}
	return e.bus.Publish(dlqSubject, packet)
}
```
Single-retry wrapper
```go
// core/controlplane/scheduler/engine.go (excerpt)
func (e *Engine) emitDLQWithRetry(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	err := e.emitDLQ(jobID, topic, status, reason, reasonCode)
	if err == nil {
		return nil
	}
	// One bounded retry after 500ms; respect engine shutdown while waiting.
	retryTimer := time.NewTimer(500 * time.Millisecond)
	select {
	case <-e.ctx.Done():
		retryTimer.Stop()
		return e.ctx.Err()
	case <-retryTimer.C:
	}
	err = e.emitDLQ(jobID, topic, status, reason, reasonCode)
	// Final failure is surfaced as a metric rather than silently dropped.
	if err != nil && e.metrics != nil {
		e.metrics.IncDLQEmitFailure(topic)
	}
	return err
}
```
Hardening tests that lock behavior
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestDLQEmitRetrySucceedsOnSecondAttempt(t *testing.T) {
	bus := &failNBus{failCount: 1} // first publish fails, second succeeds
	err := engine.emitDLQWithRetry("job-retry", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect success after one retry
}

func TestDLQEmitRetryPermanentFailureMetric(t *testing.T) {
	bus := &failNBus{failCount: 999} // publish never succeeds
	err := engine.emitDLQWithRetry("job-perm", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect IncDLQEmitFailure called once
}

func TestDLQEmitPersistsToSinkWhenBusUnavailable(t *testing.T) {
	// bus publish fails, sink still stores entry
}
```
Validation runbook
Test DLQ-emission reliability with fault injection before you need it during an incident.
```shell
# 1) Force bus publish failure while keeping DLQ sink healthy
#    - verify DLQ API still lists new entries
# 2) Measure retry envelope
#    - first emit failure
#    - one retry after 500ms
#    - no further retries
# 3) Alerting check
#    - confirm dlq emit failure metric increments on final failure
# 4) Incident triage rule
#    - if sink writes succeed but bus publish fails: use API/Redis-based DLQ triage
#    - if sink and bus both fail: escalate as control-plane visibility incident
# 5) Recovery test
#    - restore bus
#    - verify new DLQ events resume without backlog corruption
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| One retry at 500ms (current) | Bounded latency and simple failure behavior. | Insufficient for outages longer than sub-second jitter windows. |
| Multi-attempt exponential retry | Higher success probability during brief infrastructure turbulence. | Longer scheduler critical-path latency and more retry traffic under systemic outage. |
| Sink-only durability with async bus fanout | Stronger forensic durability when bus is unstable. | Adds eventual-consistency complexity between storage and event-stream consumers. |
- A single retry keeps code simple, but it assumes DLQ transport failures are short and rare.
- Sink-first ordering helps durability, but downstream consumers that only watch `sys.job.dlq` can still miss events during bus outages.
- Current tests validate core behavior, but there is still no long-outage soak test for repeated DLQ emit failures under sustained pressure.
Next step
Implement this next:
1. Add topic-scoped `dlq_emit_retry_attempts` and `dlq_emit_retry_delay` config knobs.
2. Emit a structured audit event whenever `IncDLQEmitFailure` increments.
3. Add a periodic reconciler that re-publishes sink-only DLQ entries when bus health recovers.
4. Run chaos drills where bus is down for 5-10 minutes and measure forensic data loss.
Continue with AI Agent DLQ Replay Patterns and AI Agent Poison Message Handling.