Deep Dive

AI Agent Stuck Job Recovery

Ack deadlines prevent one class of failure. Stuck state recovery needs a second system.

Deep Dive · 11 min read · Mar 2026
TL;DR
  - Queue-level ack deadlines solve redelivery timing, but not recovery of persisted job state after scheduler failures.
  - Cordum runs both a timeout reconciler and a pending replayer, using distributed Redis locks to prevent duplicate tick execution.
  - The pending replayer scans `PENDING`, approved `APPROVAL_REQUIRED`, and `SCHEDULED` jobs older than a cutoff and re-submits them through the normal dispatch logic.
  - Recovery quality depends on timeout tuning, replay-age tuning, and alerting on stale-job and orphan-replay metrics.
Dual recovery paths

Reconciler marks stale jobs `TIMEOUT`; pending replayer re-dispatches recoverable stuck jobs.

Single active ticker

Redis lock keys (`cordum:reconciler:default`, `cordum:replayer:pending`) use TTL hold to avoid duplicate ticks.

Observable recovery

Stale job gauge and orphan replay counter make recovery behavior visible during incidents.

Scope

This guide focuses on post-dispatch recovery for stuck jobs in Cordum scheduler paths. It does not cover worker business-logic retries. It covers the engine side: replay and timeout transitions.

The production problem

Message brokers can redeliver when consumers fail. That does not guarantee your control plane can recover jobs already persisted in intermediate scheduler states.

A common outage sequence is simple: request accepted, state written, scheduler restarts, no new event arrives to advance the state. The job is now stuck but not dead.

If your only safety net is broker redelivery, these records sit until timeout or manual repair. That is expensive and boring in the way only incident calls can be.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS SQS: processing messages in a timely manner | Visibility timeout sizing, heartbeat extension, and the 12-hour max visibility boundary. | No control-plane reconciliation for persisted job state after scheduler crash windows. |
| Google Pub/Sub: lease management | Ack-deadline extension policies, redelivery tradeoffs, and lease extension knobs. | No state-machine recovery loop that revisits stuck `PENDING`/`SCHEDULED` jobs in a centralized scheduler. |
| RabbitMQ: consumers and delivery acknowledgement timeout | Ack timeout behavior, `PRECONDITION_FAILED`, and automatic requeue mechanics. | No integrated replay path that rehydrates job requests and re-enters pre-dispatch policy logic. |

These guides are correct and useful. They mostly describe queue semantics. They do not describe a pre-dispatch governance scheduler that owns a multi-state job record and must recover it across replica failure windows.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Reconciler lock model | Uses the `cordum:reconciler:default` lock with TTL `2 * pollInterval` before each tick. | Only one replica performs stale-job timeout transitions per lock window. |
| Pending replayer lock model | Uses the `cordum:replayer:pending` lock with TTL `2 * pollInterval` and the TTL-hold pattern. | Avoids duplicate replay ticks from multiple scheduler replicas. |
| Pending replay scope | Replays old `PENDING`, approved `APPROVAL_REQUIRED`, and old `SCHEDULED` jobs. | Recovers more than one stuck-state class, not only unacked messages. |
| Timeout scanner bounds | `maxIterations=100`, list batch size `200`, and `maxRetriesPerJob=3` in the timeout path. | Caps tick work and retry churn during partial store failures. |
| Default operator config file | `dispatch_timeout_seconds=300`, `running_timeout_seconds=900`, `scan_interval_seconds=30` in `config/timeouts.yaml`. | Sets stale-job detection cadence and timeout windows in typical deployments. |
| Replay visibility | `cordum_scheduler_orphan_replayed_total` counts replayed orphan jobs by topic. | Gives a direct signal that replay is active during outage recovery. |

Code-level mechanics

Timeout reconciler loop (Go)

reconciler.go
Go
func (r *Reconciler) tick(ctx context.Context) {
  now := time.Now()
  dispatchTimeout, runningTimeout, scheduledTimeout := r.currentTimeouts()
  r.handleTimeouts(ctx, JobStateScheduled, now.Add(-scheduledTimeout), scheduledTimeout)
  r.handleTimeouts(ctx, JobStateDispatched, now.Add(-dispatchTimeout), dispatchTimeout)
  r.handleTimeouts(ctx, JobStateRunning, now.Add(-runningTimeout), runningTimeout)
  r.handleDeadlineExpirations(ctx, now)
}

// Bounds applied inside the timeout scan path (excerpt):
const maxIterations = 100
const maxRetriesPerJob = 3

// List at most 200 stale jobs per state per pass:
records, err := r.store.ListJobsByState(ctx, state, cutoffMicros, 200)

The reconciler path is explicit: scan stale jobs by state, mark `TIMEOUT`, write reason, and keep the loop bounded. Bounded loops matter under store pressure because outages are rarely kind enough to fail one subsystem at a time.
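That bounded scan can be sketched with a fake store standing in for `ListJobsByState`; the helper names and the in-memory store are illustrative, not Cordum's actual types:

```go
package main

import "fmt"

const (
	maxIterations = 100 // hard cap on scan passes per tick
	batchSize     = 200 // jobs fetched per store call
)

type job struct{ ID string }

// fakeStore hands out stale jobs in batches, as ListJobsByState would.
type fakeStore struct{ stale []job }

func (s *fakeStore) listStale(limit int) []job {
	if len(s.stale) == 0 {
		return nil
	}
	n := limit
	if n > len(s.stale) {
		n = len(s.stale)
	}
	batch := s.stale[:n]
	s.stale = s.stale[n:]
	return batch
}

// sweep marks stale jobs TIMEOUT but never exceeds maxIterations batches,
// so a misbehaving store cannot pin the reconciler inside a single tick.
func sweep(s *fakeStore) int {
	marked := 0
	for i := 0; i < maxIterations; i++ {
		batch := s.listStale(batchSize)
		if len(batch) == 0 {
			break
		}
		marked += len(batch) // real path: mark TIMEOUT and write a reason
	}
	return marked
}

func main() {
	store := &fakeStore{stale: make([]job, 450)}
	fmt.Println(sweep(store)) // 450: batches of 200, 200, 50
}
```

Anything beyond `maxIterations * batchSize` jobs waits for the next tick, trading completeness per tick for a predictable worst-case tick duration.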

Pending replay loop (Go)

pending_replayer.go
Go
func NewPendingReplayer(engine *Engine, store JobStore, pendingAge, pollInterval time.Duration) *PendingReplayer {
  if pollInterval <= 0 {
    pollInterval = 30 * time.Second
  }
  if pendingAge <= 0 {
    pendingAge = 2 * time.Minute
  }
  return &PendingReplayer{
    engine:       engine,
    store:        store,
    pendingAge:   pendingAge,
    pollInterval: pollInterval,
    lockKey:      "cordum:replayer:pending",
    lockTTL:      pollInterval * 2,
  }
}

func (r *PendingReplayer) tick(ctx context.Context) {
  cutoff := time.Now().Add(-r.pendingAge)
  r.replayPending(ctx, cutoff)
  r.replayApproved(ctx, cutoff)
  r.replayScheduled(ctx, cutoff)
}

Replay is not a blind state flip. It rehydrates the stored request and re-enters the normal request handling path, so policy checks and routing are still applied.
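A hedged sketch of that rehydrate-and-resubmit flow, with a toy `handleRequest` standing in for Cordum's actual request entry point (the names and the policy check are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

type jobRequest struct {
	ID    string
	Topic string
}

// engine stands in for the normal request-handling path: policy checks and
// routing run again on replay, exactly as they do on first submission.
type engine struct{ dispatched []string }

func (e *engine) handleRequest(req jobRequest) error {
	if req.Topic == "" {
		// replays fail policy the same way brand-new submissions do
		return errors.New("policy: topic required")
	}
	e.dispatched = append(e.dispatched, req.ID)
	return nil
}

// replayJob rehydrates the stored request and re-enters dispatch, rather
// than flipping the persisted state directly.
func replayJob(e *engine, stored jobRequest) error {
	return e.handleRequest(stored)
}

func main() {
	e := &engine{}
	fmt.Println(replayJob(e, jobRequest{ID: "job-1", Topic: "billing"})) // <nil>
	fmt.Println(replayJob(e, jobRequest{ID: "job-2"}))                   // policy: topic required
	fmt.Println(e.dispatched)                                            // [job-1]
}
```

The design choice matters: a replayed job that would now fail policy is rejected, not silently advanced, so replay cannot bypass governance that has changed since the job was first accepted.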

Default timeout profile (YAML)

config/timeouts.yaml
YAML
# config/timeouts.yaml
workflows: {}
topics: {}
reconciler:
  dispatch_timeout_seconds: 300
  running_timeout_seconds: 900
  scan_interval_seconds: 30

In scheduler startup, `dispatch_timeout_seconds` is also used as pending replay age input. That means changing dispatch timeout changes replay sensitivity. Treat it as a recovery control, not only a timeout control.

Operator runbook

During incidents, watch stale-state growth and replay activity together. Looking at one metric alone produces confident but wrong conclusions.

stuck_job_recovery_runbook.sh
Bash
# 1) Watch stale jobs by state (PromQL, run in your metrics UI):
#    sum by (state) (cordum_scheduler_stale_jobs)

# 2) Confirm replay activity by topic (PromQL):
#    sum by (topic) (increase(cordum_scheduler_orphan_replayed_total[15m]))

# 3) If stale jobs rise and replay stays flat:
# - verify scheduler replicas see the same Redis
# - inspect lock keys: cordum:reconciler:default, cordum:replayer:pending
# - validate dispatch/running timeout settings in config/timeouts.yaml

# 4) During incident review, correlate:
# stale_jobs + orphan_replayed + dispatch latency + DLQ depth

Limitations and tradeoffs

  - More aggressive replay age recovers faster but can re-submit jobs that were merely slow.
  - Longer timeouts reduce false positives but keep real failures hidden for longer.
  - TTL-held lock windows reduce duplicate ticks but also limit how often another replica can run recovery.
  - Recovery loops help scheduler state, not worker-side idempotency. Handlers still need idempotent design.

A replay loop without idempotent handlers is a very efficient duplicate generator. It will do exactly what you asked, repeatedly.
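Worker-side idempotency can be as simple as deduplicating on job ID before running side effects. A minimal sketch, with an in-memory dedupe store standing in for whatever durable store a real handler would use:

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentHandler runs the side effect at most once per job ID, so a
// replayed job becomes a no-op instead of a duplicate.
type idempotentHandler struct {
	mu   sync.Mutex
	seen map[string]bool
	runs int // counts real side-effect executions
}

func (h *idempotentHandler) handle(jobID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.seen[jobID] {
		return // already processed: the replay is safe to drop
	}
	h.seen[jobID] = true
	h.runs++ // real side effect would happen here
}

func main() {
	h := &idempotentHandler{seen: map[string]bool{}}
	h.handle("job-1")
	h.handle("job-1") // replayed delivery, deduplicated
	h.handle("job-2")
	fmt.Println(h.runs) // 2
}
```

In production the dedupe record must be durable and shared across workers; an in-process map only protects a single replica's lifetime.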

Next step

Run one controlled stuck-job simulation this week:

  1. Submit representative jobs, then restart the scheduler between `SCHEDULED` and `DISPATCHED` paths.
  2. Measure replay lag and stale-job decay with the current timeout profile.
  3. Tune `dispatch_timeout_seconds` and verify replay speed vs. false replay rate.
  4. Keep an incident dashboard that includes stale jobs, orphan replays, dispatch latency, and DLQ depth.

Continue with Safety Unavailable Retry Strategy and Backpressure and Queue Drain.

Recovery is a first-class feature

Fast dispatch is visible. Slow recovery is usually invisible until someone gets paged. Design for both.