
AI Agent State-Read Fail-Closed Dispatch

If scheduler state is unknown, dispatching is a guess. Guesses are expensive.

Deep Dive · 10 min read · Apr 2026
TL;DR
- Cordum reads the current job state before dispatch and refuses to dispatch when the state read returns a non-`redis.Nil` error.
- That fail-closed branch returns `RetryAfter`, preventing duplicate work when the actual state is unknown.
- Regression tests verify zero publishes on transient state-read errors and no duplicate dispatch after recovery.
- The tradeoff is a temporary availability loss during Redis turbulence; the alternative is uncontrolled duplicate side effects.
Failure mode

A store read fails. The scheduler cannot prove whether the job was already dispatched. Dispatching anyway risks duplicate actions.

Current behavior

On state-read errors, the scheduler fails closed and retries later instead of dispatching blindly.

Operational payoff

Duplicate-dispatch probability drops sharply under transient Redis faults.

Scope

This guide covers one dispatch invariant: non-`redis.Nil` state-read errors must block dispatch until the scheduler can prove current job state.

The production problem

At-least-once delivery means replay is normal.

Replay is harmless only when duplicate dispatch is impossible or harmless.

If store reads fail and the scheduler dispatches anyway, the system can execute duplicate side effects while pretending availability is healthy.

That is an expensive illusion.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Idempotent Consumer Pattern | Why at-least-once messaging requires duplicate-safe handlers. | No guidance on scheduler-side fail-closed gates before dispatch enters the worker plane. |
| Confluent Idempotent Reader Pattern | Reader-side dedupe strategy in event pipelines. | No per-job state-machine rule for unknown-state reads in an orchestration scheduler. |
| gRPC Status Codes | `FAILED_PRECONDITION`-style principle: do not continue until state is valid. | No concrete runbook for dispatch suppression when backing-store health is uncertain. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| State lookup | `handleJobRequest` calls `jobStore.GetState` under a per-job lock (`jobLockTTL = 60s`). | The scheduler has a serialized decision point before dispatch side effects. |
| Fail-closed branch | If `GetState` returns a non-`redis.Nil` error, the scheduler logs `state read failed, failing closed` and returns `RetryAfter`. | Unknown state never proceeds to a bus publish. |
| Known in-flight skip | If state is `DISPATCHED` or `RUNNING`, request redelivery becomes a no-op. | Replay traffic does not duplicate active work. |
| Missing-key handling | `redis.Nil` is treated as absent state, allowing normal new-job progression. | The system stays available for legitimate first-time dispatch. |
| Retry cadence | The store-path retry delay defaults to `1s` (`retryDelayStore`). | Fast recovery under short Redis blips, with a bounded dispatch pause. |

Scheduler code paths

Fail-closed branch on state-read errors

```go
// core/controlplane/scheduler/engine.go (excerpt)
state, err := e.jobStore.GetState(ctx, jobID)
if err == nil {
	if state == JobStateDispatched || state == JobStateRunning {
		return nil // known in-flight: replay is a no-op
	}
} else if !errors.Is(err, redis.Nil) {
	slog.Error("state read failed, failing closed", "job_id", jobID, "error", err)
	return RetryAfter(err, retryDelayStore)
}
```

Lock and timeout constants around the guard

```go
// core/controlplane/scheduler/engine.go (constants)
const (
	storeOpTimeout  = 2 * time.Second
	jobLockTTL      = 60 * time.Second
	retryDelayStore = 1 * time.Second
)

return e.withJobLock(jobID, jobLockTTL, func(lockCtx context.Context) error {
	// state read + dispatch decision happen inside the lock
})
```

Regression proof for duplicate-dispatch prevention

```go
// core/controlplane/scheduler/engine_lifecycle_regression_test.go (excerpt)
func TestHandleJobRequestStateReadErrorDoesNotDispatchDuplicate(t *testing.T) {
	// First read fails -> retryable error, no publish.
	err := engine.handleJobRequest(req, "trace-state-read")
	if _, ok := err.(*retryableError); !ok {
		t.Fatalf("expected retryableError, got %v", err)
	}
	if len(bus.published) != 0 {
		t.Fatalf("expected no dispatch, got %d publishes", len(bus.published))
	}

	// Second read succeeds with RUNNING -> no-op, still no publish.
	err = engine.handleJobRequest(req, "trace-state-read-2")
	if err != nil {
		t.Fatalf("second call should be a no-op, got %v", err)
	}
	if len(bus.published) != 0 {
		t.Fatalf("expected no duplicate dispatch, got %d publishes", len(bus.published))
	}
}
```
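The test leans on two doubles the excerpt omits: a store whose first read fails and a bus that records publishes. Hypothetical shapes, assuming interfaces the engine might define (the real doubles may differ):

```go
package main

import (
	"errors"
	"fmt"
)

// flakyStore is a hypothetical test double: the first GetState call fails
// transiently, and later calls report the job as RUNNING.
type flakyStore struct{ calls int }

func (s *flakyStore) GetState(jobID string) (string, error) {
	s.calls++
	if s.calls == 1 {
		return "", errors.New("redis: connection refused")
	}
	return "RUNNING", nil
}

// recordingBus captures publishes so the test can assert on exactly zero.
type recordingBus struct{ published []string }

func (b *recordingBus) Publish(topic string) {
	b.published = append(b.published, topic)
}

func main() {
	store := &flakyStore{}
	if _, err := store.GetState("job-1"); err != nil {
		fmt.Println("first read:", err)
	}
	state, _ := store.GetState("job-1")
	fmt.Println("second read:", state)
}
```

The double deliberately recovers into `RUNNING`, not absence, so the second assertion exercises the known-in-flight skip rather than a fresh dispatch.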

Validation runbook

Validate this path in tests and logs. Assume state-read errors are dispatch-critical, not cosmetic.

```bash
# state-read-fail-closed-runbook.sh

# 1) Validate the invariant test.
go test ./core/controlplane/scheduler -run TestHandleJobRequestStateReadErrorDoesNotDispatchDuplicate -count=1

# 2) Watch scheduler logs for the fail-closed branch.
rg "state read failed, failing closed" /var/log/cordum/scheduler.log

# 3) Submit a probe job and confirm no duplicate dispatch side effects.
JOB_ID=$(cordumctl job submit --topic job.default --prompt "state-read-fail-closed probe")
cordumctl job status "$JOB_ID" --json

# 4) If this error spikes, treat Redis health as a dispatch-critical dependency.
cordumctl status
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fail closed on state-read errors (current) | Protects correctness by blocking unknown-state dispatch. | Short-term throughput reduction during store outages. |
| Fail open on state-read errors | Higher apparent availability under store turbulence. | High risk of duplicate external side effects and reconciliation incidents. |
| Cached-state fallback, then dispatch | Can reduce outage impact if the cache is fresh and bounded. | Cache staleness can reintroduce duplicate-dispatch risk with false confidence. |
- This strategy protects correctness in scheduler dispatch. Worker-side handlers still need idempotency.
- If Redis instability is frequent, fail-closed will surface broader platform reliability debt quickly.
- Retrying fast (`1s`) is useful for blips, but persistent store faults still require operational escalation.
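The first point deserves emphasis: the scheduler gate lowers duplicate probability but does not replace worker-side idempotency. A minimal dedupe sketch follows; the in-memory set is illustrative only, and a production worker would key side effects on job ID in durable storage.

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentHandler sketches worker-side dedupe: side effects are keyed
// on jobID, so a redelivered job becomes a no-op even if a duplicate ever
// slips past the scheduler. The map stands in for durable storage.
type idempotentHandler struct {
	mu   sync.Mutex
	done map[string]bool
}

// Handle runs sideEffect at most once per jobID and reports whether it ran.
func (h *idempotentHandler) Handle(jobID string, sideEffect func()) bool {
	h.mu.Lock()
	if h.done[jobID] {
		h.mu.Unlock()
		return false // duplicate delivery: skip
	}
	h.done[jobID] = true
	h.mu.Unlock()
	sideEffect()
	return true
}

func main() {
	h := &idempotentHandler{done: map[string]bool{}}
	effects := 0
	h.Handle("job-1", func() { effects++ })
	h.Handle("job-1", func() { effects++ }) // redelivery: skipped
	fmt.Println(effects)                    // 1
}
```

Scheduler gate and worker dedupe are complementary layers: the gate suppresses duplicates at the source, the handler absorbs any that survive.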

Next step

Implement this next:

  1. Add a dedicated metric for state-read fail-closed events (count + topic labels).
  2. Define an SLO for the maximum consecutive fail-closed dispatch retries per topic.
  3. Add dashboard panels that correlate fail-closed spikes with Redis latency/error rates.
  4. Keep this regression test in every release gate.
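Step 1 can start as small as a process-local counter. A stdlib-only sketch using `expvar` (a real deployment would more likely export a Prometheus counter with a `topic` label; the metric name here is illustrative):

```go
package main

import (
	"expvar"
	"fmt"
)

// failClosed counts fail-closed events per topic. expvar publishes it
// automatically on /debug/vars if an HTTP server is running.
var failClosed = expvar.NewMap("scheduler_state_read_fail_closed_total")

// recordFailClosed increments the per-topic counter; call it from the
// fail-closed branch right before returning RetryAfter.
func recordFailClosed(topic string) {
	failClosed.Add(topic, 1)
}

func main() {
	recordFailClosed("job.default")
	recordFailClosed("job.default")
	fmt.Println(failClosed.Get("job.default")) // 2
}
```

Per-topic labels matter for step 2: a fail-closed spike confined to one topic points at that topic's job volume, while a spike across all topics points at Redis itself.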

Continue with AI Agent Dispatch Rollback Consistency and AI Agent Exactly-Once Myth.

Correctness first, then throughput

A short dispatch pause is cheaper than duplicated autonomous actions in production.