The production problem
Duplicate dispatch bugs are quiet until they are expensive.
The bug pattern is simple: state write and message publish do not fail together.
If publish succeeds but state progression does not, redelivery can execute a second dispatch for the same job.
In autonomous systems, that means duplicate side effects. Think duplicate API calls, duplicate writes, duplicate agent actions.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Idempotent Consumer Pattern | Why at-least-once delivery causes duplicates and how consumer idempotency protects side effects. | No scheduler-state ordering details for `SCHEDULED`/`DISPATCHED`/`RUNNING` transitions. |
| Transactional Outbox Pattern | Atomic DB+message intent and relay model for consistency between state and publication. | No direct guidance for in-process dispatch rollback when publish fails after state transition. |
| Kafka Exactly-Once Semantics | How retries create duplicates and where exactly-once boundaries end for external side effects. | No per-job scheduler lock/state-machine strategy for replay-safe dispatch in control planes. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Redelivery gate | `handleJobRequest` skips processing when current state is `DISPATCHED` or `RUNNING`. | Replay messages do not re-dispatch already in-flight jobs. |
| Dispatch ordering | State transition to `DISPATCHED` happens before bus publish. | A failed state write blocks publish, which prevents duplicate external side effects. |
| Publish failure recovery | On publish error, the engine rolls state back to `SCHEDULED` (logging if the rollback itself fails), increments the rollback metric, and returns a retryable error. | The system can replay safely while preserving an audit trail of rollback pressure. |
| Attempt accounting | Lifecycle regression test shows each publish-failure replay increments attempts twice (`SCHEDULED` then rollback `SCHEDULED`). | Retry budgets drain faster than naive one-attempt-per-replay assumptions. |
| Residual risk window | If rollback set-state fails, job may remain `DISPATCHED`; reconciler timeout defaults to 300s from config. | Potential blind period before timeout recovery unless alerts catch rollback failures quickly. |
Scheduler code paths
Redelivery no-op guard
```go
// core/controlplane/scheduler/engine.go (excerpt)
state, err := e.jobStore.GetState(ctx, jobID)
if err == nil {
	if state == JobStateDispatched || state == JobStateRunning {
		return nil // already in flight: redelivery is a no-op
	}
	if terminalStates[state] {
		return nil // job already finished: nothing to dispatch
	}
}
```

State-before-publish with rollback
```go
// core/controlplane/scheduler/engine.go (excerpt)
// Set DISPATCHED before publish to prevent duplicate dispatch on redelivery.
if err := e.setJobState(jobID, JobStateDispatched); err != nil {
	return RetryAfter(err, retryDelayStore)
}
if err := e.bus.Publish(subject, packet); err != nil {
	// Publish failed after the state write: roll back to SCHEDULED so a
	// replay retries the dispatch instead of skipping it as in-flight.
	if rbErr := e.setJobState(jobID, JobStateScheduled); rbErr != nil {
		slog.Error("dispatch rollback failed", "job_id", jobID, "error", rbErr)
	}
	e.metrics.IncDispatchRollback(topic)
	return RetryAfter(err, backoffDelay(attempts, backoffBase, backoffMax))
}
```

Regression tests that lock behavior
```go
// core/controlplane/scheduler/engine_consistency_test.go (excerpt)
func TestDuplicateDispatchOnDispatchedStateFailure(t *testing.T) {
	// First call: DISPATCHED write fails -> no publish
	err := engine.processJob(ctx, req, "trace-1")
	require.Error(t, err)
	assert.Equal(t, 0, bus.publishCount("job.default"))

	// Replay: succeeds, exactly one publish total
	err = engine.processJob(ctx, req, "trace-1")
	require.NoError(t, err)
	assert.Equal(t, 1, bus.publishCount("job.default"))
}
```

```go
// engine_lifecycle_regression_test.go (excerpt)
// Two failed publish replays -> 4 attempts due to SCHEDULED + rollback SCHEDULED
if attempts != 4 {
	t.Fatalf("expected 4 scheduling attempts, got %d", attempts)
}
```

Validation runbook
Validate dispatch consistency with automated tests first. Then run controlled publish-failure drills in staging.
```sh
# 1) Verify invariants in CI/staging
go test ./core/controlplane/scheduler -run TestDuplicateDispatchOnDispatchedStateFailure -count=1
go test ./core/controlplane/scheduler -run TestProcessJobPublishFailureScheduledReplayIncrementsAttempts -count=1

# 2) Trigger controlled publish failure in staging, then inspect rollback metric
curl -s http://localhost:2112/metrics | rg scheduler_dispatch_rollback_total

# 3) Submit a probe job and confirm it does not duplicate-dispatch under replay
JOB_ID=$(cordumctl job submit --topic job.default --prompt "dispatch rollback probe")
cordumctl job status "$JOB_ID" --json

# 4) If DISPATCHED jobs stall, validate reconciler timeout config (default dispatch: 300s)
cat config/timeouts.yaml
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| State-before-publish + rollback (current) | Strong replay safety with explicit rollback metric and predictable state machine behavior. | Requires robust store writes; rollback failure creates a temporary blind window. |
| Publish-before-state | Slightly simpler write sequence. | High duplicate-dispatch risk under redelivery after partial failures. |
| Outbox + relay + idempotent consumer | Excellent cross-system durability and auditability at scale. | Higher implementation complexity and operational overhead for smaller control planes. |
- This analysis targets scheduler request dispatch, not downstream worker business idempotency.
- Exactly-once transport guarantees do not remove the need for state-machine correctness in external stores.
- Timeout-based recovery works, but alerting on rollback anomalies is still the faster control loop.
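For contrast with the current approach, the outbox row of the table can be sketched in a few lines. This is a hypothetical in-memory illustration, not a real outbox implementation: the state transition and the publish intent commit atomically, and a relay drains the recorded messages later.

```go
package main

import "fmt"

// Hypothetical sketch of the transactional outbox pattern: the state write
// and the message intent take effect together or not at all.
type store struct {
	state  map[string]string
	outbox []string // messages awaiting the relay
}

// dispatchTx atomically marks the job DISPATCHED and records the message.
// A failed transaction leaves neither the state nor the outbox changed.
func (s *store) dispatchTx(jobID, msg string, fail bool) error {
	if fail {
		return fmt.Errorf("tx aborted for %s", jobID) // nothing written
	}
	s.state[jobID] = "DISPATCHED"
	s.outbox = append(s.outbox, msg)
	return nil
}

// relay drains the outbox. Delivery is at-least-once, so downstream
// consumers must still be idempotent.
func (s *store) relay(publish func(string)) {
	for _, m := range s.outbox {
		publish(m)
	}
	s.outbox = nil
}

func main() {
	s := &store{state: map[string]string{}}
	_ = s.dispatchTx("job-1", "dispatch job-1", false)
	err := s.dispatchTx("job-2", "dispatch job-2", true) // aborted: no state, no message
	fmt.Println("job-2 written:", s.state["job-2"] != "", "err:", err != nil)

	var published []string
	s.relay(func(m string) { published = append(published, m) })
	fmt.Println(published) // only job-1's message was ever recorded
}
```

The atomicity removes the rollback path entirely, which is why the pattern scores well on durability; the relay process and idempotent consumers are the operational overhead the table notes.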
Next step
Implement this next:
1. Add a dedicated metric for rollback failures, not just rollback count.
2. Tag retry events with dispatch phase (`dispatched_write`, `publish`, `rollback_write`) for cleaner incident triage.
3. Tune dispatch timeout per topic instead of relying on a global 300s default for all workloads.
4. Keep regression tests for replay semantics mandatory in release gates.
Continue with AI Agent Exactly-Once Myth and AI Agent Transactional Outbox.