
AI Agent State-Read Fail-Closed Dispatch

If scheduler state is unknown, dispatching is a guess. Guesses are expensive.

Deep Dive · 10 min read · Apr 2026
TL;DR
- Cordum reads the current job state before dispatch and refuses to dispatch when the state read returns a non-`redis.Nil` error.
- That fail-closed branch returns `RetryAfter`, preventing duplicate work when the actual state is unknown.
- Regression tests verify zero publishes on transient state-read errors and no duplicate dispatch after recovery.
- The tradeoff is a temporary availability loss during Redis turbulence; the alternative is uncontrolled duplicate side effects.
Failure mode

A store read fails. The scheduler cannot prove whether the job was already dispatched. Dispatching anyway risks duplicate actions.

Current behavior

On state-read errors, the scheduler fails closed and retries later instead of dispatching blindly.

Operational payoff

Duplicate-dispatch probability drops sharply under transient Redis faults.

Scope

This guide covers one dispatch invariant: non-`redis.Nil` state-read errors must block dispatch until the scheduler can prove current job state.

The production problem

At-least-once delivery means replay is normal.

Replay is harmless only when duplicate dispatch is impossible or harmless.

If store reads fail and the scheduler dispatches anyway, the system can execute duplicate side effects while pretending availability is healthy.

That is an expensive illusion.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Idempotent Consumer Pattern | Why at-least-once messaging requires duplicate-safe handlers. | No guidance on scheduler-side fail-closed gates before dispatch enters the worker plane. |
| Confluent Idempotent Reader Pattern | Reader-side dedupe strategy in event pipelines. | No per-job state-machine rule for unknown-state reads in an orchestration scheduler. |
| gRPC Status Codes | `FAILED_PRECONDITION`-style principle: do not continue until state is valid. | No concrete runbook for dispatch suppression when backing-store health is uncertain. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| State lookup | `handleJobRequest` calls `jobStore.GetState` under a per-job lock (`jobLockTTL = 60s`). | The scheduler has a serialized decision point before dispatch side effects. |
| Fail-closed branch | If `GetState` returns a non-`redis.Nil` error, the scheduler logs `state read failed, failing closed` and returns `RetryAfter`. | Unknown state never proceeds to a bus publish. |
| Known in-flight skip | If state is `DISPATCHED` or `RUNNING`, request redelivery becomes a no-op. | Replay traffic does not duplicate active work. |
| Missing-key handling | `redis.Nil` is treated as absent state, allowing normal new-job progression. | The system stays available for legitimate first-time dispatch. |
| Retry cadence | The store-path retry delay defaults to `1s` (`retryDelayStore`). | Fast recovery under short Redis blips, with a bounded dispatch pause. |

Scheduler code paths

Fail-closed branch on state-read errors

```go
// core/controlplane/scheduler/engine.go (excerpt)
state, err := e.jobStore.GetState(ctx, jobID)
if err == nil {
	if state == JobStateDispatched || state == JobStateRunning {
		return nil // known in-flight: replay is a no-op
	}
} else if !errors.Is(err, redis.Nil) {
	slog.Error("state read failed, failing closed", "job_id", jobID, "error", err)
	return RetryAfter(err, retryDelayStore)
}
```

Lock and timeout constants around the guard

```go
// core/controlplane/scheduler/engine.go (constants)
const (
	storeOpTimeout  = 2 * time.Second
	jobLockTTL      = 60 * time.Second
	retryDelayStore = 1 * time.Second
)

return e.withJobLock(jobID, jobLockTTL, func(lockCtx context.Context) error {
	// state read + dispatch decision happen inside the lock
})
```

Regression proof for duplicate-dispatch prevention

```go
// core/controlplane/scheduler/engine_lifecycle_regression_test.go (excerpt)
func TestHandleJobRequestStateReadErrorDoesNotDispatchDuplicate(t *testing.T) {
	// First read fails -> retryable error, no publish.
	err := engine.handleJobRequest(req, "trace-state-read")
	if _, ok := err.(*retryableError); !ok {
		t.Fatalf("expected retryableError, got %v", err)
	}
	if len(bus.published) != 0 {
		t.Fatalf("expected no dispatch, got %d publishes", len(bus.published))
	}

	// Second read succeeds with RUNNING -> no-op, still no publish.
	err = engine.handleJobRequest(req, "trace-state-read-2")
	if err != nil {
		t.Fatalf("second call should be a no-op, got %v", err)
	}
	if len(bus.published) != 0 {
		t.Fatalf("expected no duplicate dispatch, got %d publishes", len(bus.published))
	}
}
```
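The test leans on two doubles the excerpt omits: a store whose first read fails and a bus that records publishes. Hypothetical shapes, assuming interfaces the engine might define (the real doubles may differ):

```go
package main

import (
	"errors"
	"fmt"
)

// flakyStore is a hypothetical test double: the first GetState call fails
// transiently, and later calls report the job as RUNNING.
type flakyStore struct{ calls int }

func (s *flakyStore) GetState(jobID string) (string, error) {
	s.calls++
	if s.calls == 1 {
		return "", errors.New("redis: connection refused")
	}
	return "RUNNING", nil
}

// recordingBus captures publishes so the test can assert on exactly zero.
type recordingBus struct{ published []string }

func (b *recordingBus) Publish(topic string) {
	b.published = append(b.published, topic)
}

func main() {
	store := &flakyStore{}
	if _, err := store.GetState("job-1"); err != nil {
		fmt.Println("first read:", err)
	}
	state, _ := store.GetState("job-1")
	fmt.Println("second read:", state)
}
```

The double deliberately recovers into `RUNNING`, not absence, so the second assertion exercises the known-in-flight skip rather than a fresh dispatch.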

Validation runbook

Validate this path in tests and logs. Assume state-read errors are dispatch-critical, not cosmetic.

```bash
# state-read-fail-closed-runbook.sh

# 1) Validate the invariant test.
go test ./core/controlplane/scheduler -run TestHandleJobRequestStateReadErrorDoesNotDispatchDuplicate -count=1

# 2) Watch scheduler logs for the fail-closed branch.
rg "state read failed, failing closed" /var/log/cordum/scheduler.log

# 3) Submit a probe job and confirm no duplicate dispatch side effects.
JOB_ID=$(cordumctl job submit --topic job.default --prompt "state-read-fail-closed probe")
cordumctl job status "$JOB_ID" --json

# 4) If this error spikes, treat Redis health as a dispatch-critical dependency.
cordumctl status
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fail closed on state-read errors (current) | Protects correctness by blocking unknown-state dispatch. | Short-term throughput reduction during store outages. |
| Fail open on state-read errors | Higher apparent availability under store turbulence. | High risk of duplicate external side effects and reconciliation incidents. |
| Cached-state fallback, then dispatch | Can reduce outage impact if the cache is fresh and bounded. | Cache staleness can reintroduce duplicate-dispatch risk with false confidence. |
- This strategy protects correctness in scheduler dispatch. Worker-side handlers still need idempotency.
- If Redis instability is frequent, fail-closed will surface broader platform reliability debt quickly.
- Retrying fast (`1s`) is useful for blips, but persistent store faults still require operational escalation.
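The first point deserves emphasis: the scheduler gate lowers duplicate probability but does not replace worker-side idempotency. A minimal dedupe sketch follows; the in-memory set is illustrative only, and a production worker would key side effects on job ID in durable storage.

```go
package main

import (
	"fmt"
	"sync"
)

// idempotentHandler sketches worker-side dedupe: side effects are keyed
// on jobID, so a redelivered job becomes a no-op even if a duplicate ever
// slips past the scheduler. The map stands in for durable storage.
type idempotentHandler struct {
	mu   sync.Mutex
	done map[string]bool
}

// Handle runs sideEffect at most once per jobID and reports whether it ran.
func (h *idempotentHandler) Handle(jobID string, sideEffect func()) bool {
	h.mu.Lock()
	if h.done[jobID] {
		h.mu.Unlock()
		return false // duplicate delivery: skip
	}
	h.done[jobID] = true
	h.mu.Unlock()
	sideEffect()
	return true
}

func main() {
	h := &idempotentHandler{done: map[string]bool{}}
	effects := 0
	h.Handle("job-1", func() { effects++ })
	h.Handle("job-1", func() { effects++ }) // redelivery: skipped
	fmt.Println(effects)                    // 1
}
```

Scheduler gate and worker dedupe are complementary layers: the gate suppresses duplicates at the source, the handler absorbs any that survive.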

Next step

Implement this next:

  1. Add a dedicated metric for state-read fail-closed events (count + topic labels).
  2. Define an SLO for the maximum consecutive fail-closed dispatch retries per topic.
  3. Add dashboard panels that correlate fail-closed spikes with Redis latency/error rates.
  4. Keep this regression test in every release gate.
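Step 1 can start as small as a process-local counter. A stdlib-only sketch using `expvar` (a real deployment would more likely export a Prometheus counter with a `topic` label; the metric name here is illustrative):

```go
package main

import (
	"expvar"
	"fmt"
)

// failClosed counts fail-closed events per topic. expvar publishes it
// automatically on /debug/vars if an HTTP server is running.
var failClosed = expvar.NewMap("scheduler_state_read_fail_closed_total")

// recordFailClosed increments the per-topic counter; call it from the
// fail-closed branch right before returning RetryAfter.
func recordFailClosed(topic string) {
	failClosed.Add(topic, 1)
}

func main() {
	recordFailClosed("job.default")
	recordFailClosed("job.default")
	fmt.Println(failClosed.Get("job.default")) // 2
}
```

Per-topic labels matter for step 2: a fail-closed spike confined to one topic points at that topic's job volume, while a spike across all topics points at Redis itself.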

Continue with AI Agent Dispatch Rollback Consistency and AI Agent Exactly-Once Myth.

Correctness first, then throughput

A short dispatch pause is cheaper than duplicated autonomous actions in production.