Deep Dive

AI Agent Stuck Job Recovery

Ack deadlines prevent one class of failure. Stuck state recovery needs a second system.

Deep Dive · 11 min read · Mar 2026
TL;DR
  - Queue-level ack deadlines solve redelivery timing, but not recovery of persisted job state after scheduler failures.
  - Cordum runs both a timeout reconciler and a pending replayer, using distributed Redis locks to prevent duplicate tick execution.
  - The pending replayer scans `PENDING`, approved `APPROVAL_REQUIRED`, and `SCHEDULED` jobs older than a cutoff and re-submits them through the normal dispatch logic.
  - Recovery quality depends on timeout tuning, replay-age tuning, and alerting on stale-job and orphan-replay metrics.
Dual recovery paths

Reconciler marks stale jobs `TIMEOUT`; pending replayer re-dispatches recoverable stuck jobs.

Single active ticker

Redis lock keys (`cordum:reconciler:default`, `cordum:replayer:pending`) use TTL hold to avoid duplicate ticks.

Observable recovery

Stale job gauge and orphan replay counter make recovery behavior visible during incidents.

Scope

This guide focuses on post-dispatch recovery for stuck jobs in Cordum scheduler paths. It does not cover worker business-logic retries. It covers the engine side: replay and timeout transitions.

The production problem

Message brokers can redeliver when consumers fail. That does not guarantee your control plane can recover jobs already persisted in intermediate scheduler states.

A common outage sequence is simple: request accepted, state written, scheduler restarts, no new event arrives to advance the state. The job is now stuck but not dead.

If your only safety net is broker redelivery, these records sit until timeout or manual repair. That is expensive and boring in the way only incident calls can be.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS SQS: processing messages in a timely manner | Visibility timeout sizing, heartbeat extension, and the 12-hour max visibility boundary. | No control-plane reconciliation for persisted job state after scheduler crash windows. |
| Google Pub/Sub: lease management | Ack-deadline extension policies, redelivery tradeoffs, and lease extension knobs. | No state-machine recovery loop that revisits stuck `PENDING`/`SCHEDULED` jobs in a centralized scheduler. |
| RabbitMQ: consumers and delivery acknowledgement timeout | Ack timeout behavior, `PRECONDITION_FAILED`, and automatic requeue mechanics. | No integrated replay path that rehydrates job requests and re-enters pre-dispatch policy logic. |

These guides are correct and useful. They mostly describe queue semantics. They do not describe a pre-dispatch governance scheduler that owns a multi-state job record and must recover it across replica failure windows.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Reconciler lock model | Uses the `cordum:reconciler:default` lock with TTL `2 * pollInterval` before each tick. | Only one replica performs stale-job timeout transitions per lock window. |
| Pending replayer lock model | Uses the `cordum:replayer:pending` lock with TTL `2 * pollInterval` and the TTL-hold pattern. | Avoids duplicate replay ticks from multiple scheduler replicas. |
| Pending replay scope | Replays old `PENDING`, approved `APPROVAL_REQUIRED`, and old `SCHEDULED` jobs. | Recovers more than one stuck-state class, not only unacked messages. |
| Timeout scanner bounds | `maxIterations=100`, list batch size `200`, and `maxRetriesPerJob=3` in the timeout path. | Caps tick work and retry churn during partial store failures. |
| Default operator config file | `dispatch_timeout_seconds=300`, `running_timeout_seconds=900`, `scan_interval_seconds=30` in `config/timeouts.yaml`. | Sets stale-job detection cadence and timeout windows in typical deployments. |
| Replay visibility | `cordum_scheduler_orphan_replayed_total` counts replayed orphan jobs by topic. | Gives a direct signal that replay is active during outage recovery. |

Code-level mechanics

Timeout reconciler loop (Go)

reconciler.go
Go
func (r *Reconciler) tick(ctx context.Context) {
  now := time.Now()
  dispatchTimeout, runningTimeout, scheduledTimeout := r.currentTimeouts()
  r.handleTimeouts(ctx, JobStateScheduled, now.Add(-scheduledTimeout), scheduledTimeout)
  r.handleTimeouts(ctx, JobStateDispatched, now.Add(-dispatchTimeout), dispatchTimeout)
  r.handleTimeouts(ctx, JobStateRunning, now.Add(-runningTimeout), runningTimeout)
  r.handleDeadlineExpirations(ctx, now)
}

// Bounds applied inside the timeout scan path (excerpt):
const maxIterations = 100
const maxRetriesPerJob = 3

// List at most 200 stale jobs per state per pass:
records, err := r.store.ListJobsByState(ctx, state, cutoffMicros, 200)

The reconciler path is explicit: scan stale jobs by state, mark `TIMEOUT`, write reason, and keep the loop bounded. Bounded loops matter under store pressure because outages are rarely kind enough to fail one subsystem at a time.
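That bounded scan can be sketched with a fake store standing in for `ListJobsByState`; the helper names and the in-memory store are illustrative, not Cordum's actual types:

```go
package main

import "fmt"

const (
	maxIterations = 100 // hard cap on scan passes per tick
	batchSize     = 200 // jobs fetched per store call
)

type job struct{ ID string }

// fakeStore hands out stale jobs in batches, as ListJobsByState would.
type fakeStore struct{ stale []job }

func (s *fakeStore) listStale(limit int) []job {
	if len(s.stale) == 0 {
		return nil
	}
	n := limit
	if n > len(s.stale) {
		n = len(s.stale)
	}
	batch := s.stale[:n]
	s.stale = s.stale[n:]
	return batch
}

// sweep marks stale jobs TIMEOUT but never exceeds maxIterations batches,
// so a misbehaving store cannot pin the reconciler inside a single tick.
func sweep(s *fakeStore) int {
	marked := 0
	for i := 0; i < maxIterations; i++ {
		batch := s.listStale(batchSize)
		if len(batch) == 0 {
			break
		}
		marked += len(batch) // real path: mark TIMEOUT and write a reason
	}
	return marked
}

func main() {
	store := &fakeStore{stale: make([]job, 450)}
	fmt.Println(sweep(store)) // 450: batches of 200, 200, 50
}
```

Anything beyond `maxIterations * batchSize` jobs waits for the next tick, trading completeness per tick for a predictable worst-case tick duration.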

Pending replay loop (Go)

pending_replayer.go
Go
func NewPendingReplayer(engine *Engine, store JobStore, pendingAge, pollInterval time.Duration) *PendingReplayer {
  if pollInterval <= 0 {
    pollInterval = 30 * time.Second
  }
  if pendingAge <= 0 {
    pendingAge = 2 * time.Minute
  }
  return &PendingReplayer{
    engine:       engine,
    store:        store,
    pendingAge:   pendingAge,
    pollInterval: pollInterval,
    lockKey:      "cordum:replayer:pending",
    lockTTL:      pollInterval * 2,
  }
}

func (r *PendingReplayer) tick(ctx context.Context) {
  cutoff := time.Now().Add(-r.pendingAge)
  r.replayPending(ctx, cutoff)
  r.replayApproved(ctx, cutoff)
  r.replayScheduled(ctx, cutoff)
}

Replay is not a blind state flip. It rehydrates the stored request and re-enters the normal request handling path, so policy checks and routing are still applied.
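A hedged sketch of that rehydrate-and-resubmit flow, with a toy `handleRequest` standing in for Cordum's actual request entry point (the names and the policy check are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

type jobRequest struct {
	ID    string
	Topic string
}

// engine stands in for the normal request-handling path: policy checks and
// routing run again on replay, exactly as they do on first submission.
type engine struct{ dispatched []string }

func (e *engine) handleRequest(req jobRequest) error {
	if req.Topic == "" {
		// replays fail policy the same way brand-new submissions do
		return errors.New("policy: topic required")
	}
	e.dispatched = append(e.dispatched, req.ID)
	return nil
}

// replayJob rehydrates the stored request and re-enters dispatch, rather
// than flipping the persisted state directly.
func replayJob(e *engine, stored jobRequest) error {
	return e.handleRequest(stored)
}

func main() {
	e := &engine{}
	fmt.Println(replayJob(e, jobRequest{ID: "job-1", Topic: "billing"})) // <nil>
	fmt.Println(replayJob(e, jobRequest{ID: "job-2"}))                   // policy: topic required
	fmt.Println(e.dispatched)                                            // [job-1]
}
```

The design choice matters: a replayed job that would now fail policy is rejected, not silently advanced, so replay cannot bypass governance that has changed since the job was first accepted.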

Default timeout profile (YAML)

config/timeouts.yaml
YAML
# config/timeouts.yaml
workflows: {}
topics: {}
reconciler:
  dispatch_timeout_seconds: 300
  running_timeout_seconds: 900
  scan_interval_seconds: 30

In scheduler startup, `dispatch_timeout_seconds` is also used as pending replay age input. That means changing dispatch timeout changes replay sensitivity. Treat it as a recovery control, not only a timeout control.

Operator runbook

During incidents, watch stale-state growth and replay activity together. Looking at one metric alone produces confident but wrong conclusions.

stuck_job_recovery_runbook.sh
Bash
# 1) Watch stale jobs by state (PromQL, run in your metrics UI):
#    sum by (state) (cordum_scheduler_stale_jobs)

# 2) Confirm replay activity by topic (PromQL):
#    sum by (topic) (increase(cordum_scheduler_orphan_replayed_total[15m]))

# 3) If stale jobs rise and replay stays flat:
# - verify scheduler replicas see the same Redis
# - inspect lock keys: cordum:reconciler:default, cordum:replayer:pending
# - validate dispatch/running timeout settings in config/timeouts.yaml

# 4) During incident review, correlate:
# stale_jobs + orphan_replayed + dispatch latency + DLQ depth

Limitations and tradeoffs

  - More aggressive replay age recovers faster but can re-submit jobs that were merely slow.
  - Longer timeouts reduce false positives but keep real failures hidden for longer.
  - TTL-held lock windows reduce duplicate ticks but also limit how often another replica can run recovery.
  - Recovery loops help scheduler state, not worker-side idempotency. Handlers still need idempotent design.

A replay loop without idempotent handlers is a very efficient duplicate generator. It will do exactly what you asked, repeatedly.
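Worker-side idempotency can be as simple as deduplicating on job ID before running side effects. A minimal sketch, with an in-memory dedupe store standing in for whatever durable store a real handler would use:

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentHandler runs the side effect at most once per job ID, so a
// replayed job becomes a no-op instead of a duplicate.
type idempotentHandler struct {
	mu   sync.Mutex
	seen map[string]bool
	runs int // counts real side-effect executions
}

func (h *idempotentHandler) handle(jobID string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.seen[jobID] {
		return // already processed: the replay is safe to drop
	}
	h.seen[jobID] = true
	h.runs++ // real side effect would happen here
}

func main() {
	h := &idempotentHandler{seen: map[string]bool{}}
	h.handle("job-1")
	h.handle("job-1") // replayed delivery, deduplicated
	h.handle("job-2")
	fmt.Println(h.runs) // 2
}
```

In production the dedupe record must be durable and shared across workers; an in-process map only protects a single replica's lifetime.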

Next step

Run one controlled stuck-job simulation this week:

  1. Submit representative jobs, then restart the scheduler between `SCHEDULED` and `DISPATCHED` paths.
  2. Measure replay lag and stale-job decay with the current timeout profile.
  3. Tune `dispatch_timeout_seconds` and verify replay speed vs. false replay rate.
  4. Keep an incident dashboard that includes stale jobs, orphan replays, dispatch latency, and DLQ depth.

Continue with Safety Unavailable Retry Strategy and Backpressure and Queue Drain.

Recovery is a first-class feature

Fast dispatch is visible. Slow recovery is usually invisible until someone gets paged. Design for both.