## The production problem
Message brokers can redeliver when consumers fail. That does not guarantee your control plane can recover jobs already persisted in intermediate scheduler states.
A common outage sequence is simple: request accepted, state written, scheduler restarts, no new event arrives to advance the state. The job is now stuck but not dead.
If your only safety net is broker redelivery, these records sit until timeout or manual repair. That is expensive and boring in the way only incident calls can be.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS SQS: processing messages in a timely manner | Visibility timeout sizing, heartbeat extension, and 12-hour max visibility boundary. | No control-plane reconciliation for persisted job state after scheduler crash windows. |
| Google Pub/Sub: lease management | Ack-deadline extension policies, redelivery tradeoffs, and lease extension knobs. | No state-machine recovery loop that revisits stuck `PENDING`/`SCHEDULED` jobs in a centralized scheduler. |
| RabbitMQ: consumers and delivery acknowledgement timeout | Ack timeout behavior, `PRECONDITION_FAILED`, and automatic requeue mechanics. | No integrated replay path that rehydrates job requests and re-enters pre-dispatch policy logic. |
These guides are correct and useful, but they describe queue semantics. They do not describe a pre-dispatch governance scheduler that owns a multi-state job record and must recover it across replica failure windows.
## Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Reconciler lock model | Uses `cordum:reconciler:default` lock with TTL `2 x pollInterval` before each tick. | Only one replica performs stale-job timeout transitions per lock window. |
| Pending replayer lock model | Uses `cordum:replayer:pending` lock with TTL `2 x pollInterval` and TTL-hold pattern. | Avoids duplicate replay ticks from multiple scheduler replicas. |
| Pending replay scope | Replays old `PENDING`, approved `APPROVAL_REQUIRED`, and old `SCHEDULED` jobs. | Recovers more than one stuck-state class, not only unacked messages. |
| Timeout scanner bounds | `maxIterations=100`, list batch size `200`, and `maxRetriesPerJob=3` in timeout path. | Caps tick work and retry churn during partial store failures. |
| Default operator config file | `dispatch_timeout_seconds=300`, `running_timeout_seconds=900`, `scan_interval_seconds=30` in `config/timeouts.yaml`. | Sets stale-job detection cadence and timeout windows in typical deployments. |
| Replay visibility | `cordum_scheduler_orphan_replayed_total` counts replayed orphan jobs by topic. | Gives direct signal that replay is active during outage recovery. |
## Code-level mechanics
### Timeout reconciler loop (Go)

```go
func (r *Reconciler) tick(ctx context.Context) {
	now := time.Now()
	dispatchTimeout, runningTimeout, scheduledTimeout := r.currentTimeouts()
	r.handleTimeouts(ctx, JobStateScheduled, now.Add(-scheduledTimeout), scheduledTimeout)
	r.handleTimeouts(ctx, JobStateDispatched, now.Add(-dispatchTimeout), dispatchTimeout)
	r.handleTimeouts(ctx, JobStateRunning, now.Add(-runningTimeout), runningTimeout)
	r.handleDeadlineExpirations(ctx, now)
}

// Bounds applied on the timeout path:
const maxIterations = 100
const maxRetriesPerJob = 3

// Stale jobs are listed in batches of 200:
records, err := r.store.ListJobsByState(ctx, state, cutoffMicros, 200)
```

The reconciler path is explicit: scan stale jobs by state, mark `TIMEOUT`, write the reason, and keep the loop bounded. Bounded loops matter under store pressure because outages are rarely kind enough to fail one subsystem at a time.
### Pending replay loop (Go)

```go
func NewPendingReplayer(engine *Engine, store JobStore, pendingAge, pollInterval time.Duration) *PendingReplayer {
	if pollInterval <= 0 {
		pollInterval = 30 * time.Second
	}
	if pendingAge <= 0 {
		pendingAge = 2 * time.Minute
	}
	return &PendingReplayer{
		pendingAge:   pendingAge,
		pollInterval: pollInterval,
		lockKey:      "cordum:replayer:pending",
		lockTTL:      pollInterval * 2,
	}
}

func (r *PendingReplayer) tick(ctx context.Context) {
	cutoff := time.Now().Add(-r.pendingAge)
	r.replayPending(ctx, cutoff)
	r.replayApproved(ctx, cutoff)
	r.replayScheduled(ctx, cutoff)
}
```

Replay is not a blind state flip. It rehydrates the stored request and re-enters the normal request handling path, so policy checks and routing are still applied.
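A sketch of that rehydrate-and-resubmit shape, with hypothetical `storedJob`, `jobRequest`, and `submit` names standing in for Cordum's real types:

```go
package main

import "fmt"

// Hypothetical stored record and request types for illustration only.
type storedJob struct {
	ID    string
	Topic string
	State string
}

type jobRequest struct {
	Topic    string
	ReplayOf string
}

// submit stands in for the normal request-handling path: policy checks
// and routing run here, exactly as for a fresh submission.
func submit(req jobRequest) string {
	return fmt.Sprintf("policy+route applied for topic=%s replayOf=%s", req.Topic, req.ReplayOf)
}

// replayOne rehydrates the stored request and re-enters submit, rather
// than flipping the persisted state field directly.
func replayOne(j storedJob) string {
	return submit(jobRequest{Topic: j.Topic, ReplayOf: j.ID})
}

func main() {
	fmt.Println(replayOne(storedJob{ID: "job-42", Topic: "payments", State: "PENDING"}))
}
```

The design consequence is that a replayed job can still be rejected or rerouted by current policy, which is exactly what you want when the outage also changed the routing picture.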
### Default timeout profile (YAML)

```yaml
# config/timeouts.yaml
workflows: {}
topics: {}
reconciler:
  dispatch_timeout_seconds: 300
  running_timeout_seconds: 900
  scan_interval_seconds: 30
```

At scheduler startup, `dispatch_timeout_seconds` is also used as the pending replay age input. That means changing the dispatch timeout changes replay sensitivity. Treat it as a recovery control, not only a timeout control.
## Operator runbook
During incidents, watch stale-state growth and replay activity together. Looking at one metric alone produces confident but wrong conclusions.
```shell
# 1) Watch stale jobs by state
sum by (state) (cordum_scheduler_stale_jobs)

# 2) Confirm replay activity by topic
sum by (topic) (increase(cordum_scheduler_orphan_replayed_total[15m]))

# 3) If stale jobs rise and replay stays flat:
#    - verify scheduler replicas see the same Redis
#    - inspect lock keys: cordum:reconciler:default, cordum:replayer:pending
#    - validate dispatch/running timeout settings in config/timeouts.yaml

# 4) During incident review, correlate:
#    stale_jobs + orphan_replayed + dispatch latency + DLQ depth
```
## Limitations and tradeoffs

- More aggressive replay age recovers faster but can re-submit jobs that were merely slow.
- Longer timeouts reduce false positives but keep real failures hidden for longer.
- TTL-held lock windows reduce duplicate ticks but also limit how often another replica can run recovery.
- Recovery loops help scheduler state, not worker-side idempotency. Handlers still need idempotent design.
A replay loop without idempotent handlers is a very efficient duplicate generator. It will do exactly what you asked, repeatedly.
## Next step
Run one controlled stuck-job simulation this week:
1. Submit representative jobs, then restart the scheduler between the `SCHEDULED` and `DISPATCHED` paths.
2. Measure replay lag and stale-job decay with the current timeout profile.
3. Tune `dispatch_timeout_seconds` and verify replay speed vs false replay rate.
4. Keep an incident dashboard that includes stale jobs, orphan replay, dispatch latency, and DLQ depth.
Continue with *Safety Unavailable Retry Strategy* and *Backpressure and Queue Drain*.