The production problem
Duplicate dispatch bugs are quiet until they are expensive.
The bug pattern is simple: state write and message publish do not fail together.
If publish succeeds but state progression does not, redelivery can execute a second dispatch for the same job.
In autonomous systems, that means duplicate side effects. Think duplicate API calls, duplicate writes, duplicate agent actions.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Idempotent Consumer Pattern | Why at-least-once delivery causes duplicates and how consumer idempotency protects side effects. | No scheduler-state ordering details for `SCHEDULED`/`DISPATCHED`/`RUNNING` transitions. |
| Transactional Outbox Pattern | Atomic DB+message intent and relay model for consistency between state and publication. | No direct guidance for in-process dispatch rollback when publish fails after state transition. |
| Kafka Exactly-Once Semantics | How retries create duplicates and where exactly-once boundaries end for external side effects. | No per-job scheduler lock/state-machine strategy for replay-safe dispatch in control planes. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Redelivery gate | `handleJobRequest` skips processing when current state is `DISPATCHED` or `RUNNING`. | Replay messages do not re-dispatch already in-flight jobs. |
| Dispatch ordering | State transition to `DISPATCHED` happens before bus publish. | A failed state write blocks publish, which prevents duplicate external side effects. |
| Publish failure recovery | On publish error, the engine rolls state back to `SCHEDULED` (logging if the rollback itself fails), increments the rollback metric, and returns a retryable error. | The system can replay safely while preserving an audit trail of rollback pressure. |
| Attempt accounting | Lifecycle regression test shows each publish-failure replay increments attempts twice (`SCHEDULED` then rollback `SCHEDULED`). | Retry budgets drain faster than naive one-attempt-per-replay assumptions. |
| Residual risk window | If rollback set-state fails, job may remain `DISPATCHED`; reconciler timeout defaults to 300s from config. | Potential blind period before timeout recovery unless alerts catch rollback failures quickly. |
Scheduler code paths
Redelivery no-op guard
```go
// core/controlplane/scheduler/engine.go (excerpt)
state, err := e.jobStore.GetState(ctx, jobID)
if err == nil {
	if state == JobStateDispatched || state == JobStateRunning {
		return nil // already in flight: redelivery is a no-op
	}
	if terminalStates[state] {
		return nil // job already finished: nothing to dispatch
	}
}
```

State-before-publish with rollback
```go
// core/controlplane/scheduler/engine.go (excerpt)
// Set DISPATCHED before publish to prevent duplicate dispatch on redelivery.
if err := e.setJobState(jobID, JobStateDispatched); err != nil {
	return RetryAfter(err, retryDelayStore)
}
if err := e.bus.Publish(subject, packet); err != nil {
	// Publish failed after the state write: roll back to SCHEDULED so a
	// replay retries the dispatch instead of skipping it as in-flight.
	if rbErr := e.setJobState(jobID, JobStateScheduled); rbErr != nil {
		slog.Error("dispatch rollback failed", "job_id", jobID, "error", rbErr)
	}
	e.metrics.IncDispatchRollback(topic)
	return RetryAfter(err, backoffDelay(attempts, backoffBase, backoffMax))
}
```

Regression tests that lock behavior
```go
// core/controlplane/scheduler/engine_consistency_test.go (excerpt)
func TestDuplicateDispatchOnDispatchedStateFailure(t *testing.T) {
	// First call: DISPATCHED write fails -> no publish
	err := engine.processJob(ctx, req, "trace-1")
	require.Error(t, err)
	assert.Equal(t, 0, bus.publishCount("job.default"))

	// Replay: succeeds, exactly one publish total
	err = engine.processJob(ctx, req, "trace-1")
	require.NoError(t, err)
	assert.Equal(t, 1, bus.publishCount("job.default"))
}
```

```go
// engine_lifecycle_regression_test.go (excerpt)
// Two failed publish replays -> 4 attempts due to SCHEDULED + rollback SCHEDULED
if attempts != 4 {
	t.Fatalf("expected 4 scheduling attempts, got %d", attempts)
}
```

Validation runbook
Validate dispatch consistency with automated tests first. Then run controlled publish-failure drills in staging.
```sh
# 1) Verify invariants in CI/staging
go test ./core/controlplane/scheduler -run TestDuplicateDispatchOnDispatchedStateFailure -count=1
go test ./core/controlplane/scheduler -run TestProcessJobPublishFailureScheduledReplayIncrementsAttempts -count=1

# 2) Trigger controlled publish failure in staging, then inspect rollback metric
curl -s http://localhost:2112/metrics | rg scheduler_dispatch_rollback_total

# 3) Submit a probe job and confirm it does not duplicate-dispatch under replay
JOB_ID=$(cordumctl job submit --topic job.default --prompt "dispatch rollback probe")
cordumctl job status "$JOB_ID" --json

# 4) If DISPATCHED jobs stall, validate reconciler timeout config (default dispatch: 300s)
cat config/timeouts.yaml
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| State-before-publish + rollback (current) | Strong replay safety with explicit rollback metric and predictable state machine behavior. | Requires robust store writes; rollback failure creates a temporary blind window. |
| Publish-before-state | Slightly simpler write sequence. | High duplicate-dispatch risk under redelivery after partial failures. |
| Outbox + relay + idempotent consumer | Excellent cross-system durability and auditability at scale. | Higher implementation complexity and operational overhead for smaller control planes. |
- This analysis targets scheduler request dispatch, not downstream worker business idempotency.
- Exactly-once transport guarantees do not remove the need for state-machine correctness in external stores.
- Timeout-based recovery works, but alerting on rollback anomalies is still the faster control loop.
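For contrast with the current approach, the outbox row of the table can be sketched in a few lines. This is a hypothetical in-memory illustration, not a real outbox implementation: the state transition and the publish intent commit atomically, and a relay drains the recorded messages later.

```go
package main

import "fmt"

// Hypothetical sketch of the transactional outbox pattern: the state write
// and the message intent take effect together or not at all.
type store struct {
	state  map[string]string
	outbox []string // messages awaiting the relay
}

// dispatchTx atomically marks the job DISPATCHED and records the message.
// A failed transaction leaves neither the state nor the outbox changed.
func (s *store) dispatchTx(jobID, msg string, fail bool) error {
	if fail {
		return fmt.Errorf("tx aborted for %s", jobID) // nothing written
	}
	s.state[jobID] = "DISPATCHED"
	s.outbox = append(s.outbox, msg)
	return nil
}

// relay drains the outbox. Delivery is at-least-once, so downstream
// consumers must still be idempotent.
func (s *store) relay(publish func(string)) {
	for _, m := range s.outbox {
		publish(m)
	}
	s.outbox = nil
}

func main() {
	s := &store{state: map[string]string{}}
	_ = s.dispatchTx("job-1", "dispatch job-1", false)
	err := s.dispatchTx("job-2", "dispatch job-2", true) // aborted: no state, no message
	fmt.Println("job-2 written:", s.state["job-2"] != "", "err:", err != nil)

	var published []string
	s.relay(func(m string) { published = append(published, m) })
	fmt.Println(published) // only job-1's message was ever recorded
}
```

The atomicity removes the rollback path entirely, which is why the pattern scores well on durability; the relay process and idempotent consumers are the operational overhead the table notes.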
Next step
Implement this next:
1. Add a dedicated metric for rollback failures, not just rollback count.
2. Tag retry events with dispatch phase (`dispatched_write`, `publish`, `rollback_write`) for cleaner incident triage.
3. Tune dispatch timeout per topic instead of relying on a global 300s default for all workloads.
4. Keep regression tests for replay semantics mandatory in release gates.
Continue with AI Agent Exactly-Once Myth and AI Agent Transactional Outbox.