Deep Dive

AI Agent Dispatch Rollback Consistency

At-least-once delivery is fine until state and publish ordering drift apart.

11 min read · Apr 2026
TL;DR
- Cordum sets `DISPATCHED` before bus publish, so redelivery cannot silently double-dispatch a job.
- If publish fails, the scheduler rolls state back to `SCHEDULED`, increments `scheduler_dispatch_rollback_total`, and retries with backoff.
- Regression tests verify exactly one publish when the `DISPATCHED` write fails and the request is replayed.
- If rollback itself fails, jobs can remain `DISPATCHED` and get skipped until the reconciler timeout fires (default dispatch timeout: 300s).
Failure mode

A message is published, then state persistence fails. Redelivery can send it again unless ordering is strict.

Current behavior

Scheduler writes `DISPATCHED` first, then publishes. Publish failures trigger explicit rollback to `SCHEDULED`.

Operational payoff

Replay-safe dispatch semantics under at-least-once delivery without requiring global exactly-once guarantees.

Scope

This guide covers one specific scheduler invariant: preserving exactly-one dispatch side effects under at-least-once request redelivery.

The production problem

Duplicate dispatch bugs are quiet until they are expensive.

The bug pattern is simple: state write and message publish do not fail together.

If publish succeeds but state progression does not, redelivery can execute a second dispatch for the same job.

In autonomous systems, that means duplicate side effects. Think duplicate API calls, duplicate writes, duplicate agent actions.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Idempotent Consumer Pattern | Why at-least-once delivery causes duplicates and how consumer idempotency protects side effects. | No scheduler-state ordering details for `SCHEDULED`/`DISPATCHED`/`RUNNING` transitions. |
| Transactional Outbox Pattern | Atomic DB+message intent and relay model for consistency between state and publication. | No direct guidance for in-process dispatch rollback when publish fails after a state transition. |
| Kafka Exactly-Once Semantics | How retries create duplicates and where exactly-once boundaries end for external side effects. | No per-job scheduler lock/state-machine strategy for replay-safe dispatch in control planes. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Redelivery gate | `handleJobRequest` skips processing when current state is `DISPATCHED` or `RUNNING`. | Replay messages do not re-dispatch already in-flight jobs. |
| Dispatch ordering | State transition to `DISPATCHED` happens before bus publish. | A failed state write blocks publish, which prevents duplicate external side effects. |
| Publish failure recovery | Publish error logs rollback, sets state back to `SCHEDULED`, increments rollback metric, returns retryable error. | System can replay safely while preserving an audit trail of rollback pressure. |
| Attempt accounting | Lifecycle regression test shows each publish-failure replay increments attempts twice (`SCHEDULED` then rollback `SCHEDULED`). | Retry budgets drain faster than naive one-attempt-per-replay assumptions. |
| Residual risk window | If rollback set-state fails, job may remain `DISPATCHED`; reconciler timeout defaults to 300s from config. | Potential blind period before timeout recovery unless alerts catch rollback failures quickly. |

Scheduler code paths

Redelivery no-op guard

core/controlplane/scheduler/engine.go
```go
// core/controlplane/scheduler/engine.go (excerpt)
state, err := e.jobStore.GetState(ctx, jobID)
if err == nil {
  if state == JobStateDispatched || state == JobStateRunning {
    // Replay of an in-flight job is a deliberate no-op.
    return nil
  }
  if terminalStates[state] {
    // Finished jobs ignore redelivery entirely.
    return nil
  }
}
```

State-before-publish with rollback

core/controlplane/scheduler/engine.go
```go
// core/controlplane/scheduler/engine.go (excerpt)
// Set DISPATCHED before publish to prevent duplicate dispatch on redelivery.
if err := e.setJobState(jobID, JobStateDispatched); err != nil {
  return RetryAfter(err, retryDelayStore)
}

if err := e.bus.Publish(subject, packet); err != nil {
  // Roll back so a later replay can dispatch cleanly.
  if rbErr := e.setJobState(jobID, JobStateScheduled); rbErr != nil {
    slog.Error("dispatch rollback failed", "job_id", jobID, "error", rbErr)
  }
  e.metrics.IncDispatchRollback(topic)
  return RetryAfter(err, backoffDelay(attempts, backoffBase, backoffMax))
}
```

Regression tests that lock behavior

core/controlplane/scheduler/*_test.go
```go
// core/controlplane/scheduler/engine_consistency_test.go (excerpt)
func TestDuplicateDispatchOnDispatchedStateFailure(t *testing.T) {
  // First call: DISPATCHED write fails -> no publish
  err := engine.processJob(ctx, req, "trace-1")
  require.Error(t, err)
  assert.Equal(t, 0, bus.publishCount("job.default"))

  // Replay: succeeds, exactly one publish total
  err = engine.processJob(ctx, req, "trace-1")
  require.NoError(t, err)
  assert.Equal(t, 1, bus.publishCount("job.default"))
}

// engine_lifecycle_regression_test.go (excerpt)
// Two failed publish replays -> 4 attempts due to SCHEDULED + rollback SCHEDULED
if attempts != 4 {
  t.Fatalf("expected 4 scheduling attempts, got %d", attempts)
}
```

Validation runbook

Validate dispatch consistency with automated tests first. Then run controlled publish-failure drills in staging.

dispatch-rollback-runbook.sh
```bash
# 1) Verify invariants in CI/staging
go test ./core/controlplane/scheduler -run TestDuplicateDispatchOnDispatchedStateFailure -count=1
go test ./core/controlplane/scheduler -run TestProcessJobPublishFailureScheduledReplayIncrementsAttempts -count=1

# 2) Trigger controlled publish failure in staging, then inspect rollback metric
curl -s http://localhost:2112/metrics | rg scheduler_dispatch_rollback_total

# 3) Submit a probe job and confirm it does not duplicate-dispatch under replay
JOB_ID=$(cordumctl job submit --topic job.default --prompt "dispatch rollback probe")
cordumctl job status "$JOB_ID" --json

# 4) If DISPATCHED jobs stall, validate reconciler timeout config (default dispatch: 300s)
cat config/timeouts.yaml
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| State-before-publish + rollback (current) | Strong replay safety with explicit rollback metric and predictable state machine behavior. | Requires robust store writes; rollback failure creates a temporary blind window. |
| Publish-before-state | Slightly simpler write sequence. | High duplicate-dispatch risk under redelivery after partial failures. |
| Outbox + relay + idempotent consumer | Excellent cross-system durability and auditability at scale. | Higher implementation complexity and operational overhead for smaller control planes. |
- This analysis targets scheduler request dispatch, not downstream worker business idempotency.
- Exactly-once transport guarantees do not remove the need for state-machine correctness in external stores.
- Timeout-based recovery works, but alerting on rollback anomalies is still the faster control loop.

Next step

Implement this next:

1. Add a dedicated metric for rollback failures, not just rollback count.
2. Tag retry events with dispatch phase (`dispatched_write`, `publish`, `rollback_write`) for cleaner incident triage.
3. Tune dispatch timeout per topic instead of relying on a global 300s default for all workloads.
4. Keep regression tests for replay semantics mandatory in release gates.

Continue with AI Agent Exactly-Once Myth and AI Agent Transactional Outbox.

State-machine order is a product decision

If the scheduler state and publish path disagree under failure, your agents will do real work twice.