Deep Dive

AI Agent Workflow Idempotency Reservation

Retries should reuse intent, not inherit stale admission state.

10 min read · Mar 2026
TL;DR
- Run idempotency must include cleanup logic when admission fails after key reservation.
- Cordum reserves `Idempotency-Key` before active-run counting, then deletes it on count failure, concurrency rejection, and run-create failure.
- `wf:run:idempotency:<key>` is stored with no TTL, so cleanup paths are not optional operationally.
- A test in the gateway suite verifies that a 429 rejection does not permanently poison the key.
Failure mode

Reserved keys that are never cleaned can block legitimate retries after capacity recovers.

Control point

Cordum explicitly calls idempotency cleanup on pre-create failure paths in `handleStartRun`.

Operational payoff

Retrying with the same key after a temporary 429 can succeed once run slots open.

Scope

This guide focuses on workflow run admission in the gateway, not generic API idempotency theory.

The production problem

Your caller sends `POST /workflows/:id/runs` with an idempotency key and times out. Retry should be safe.

The hard part is the admission pipeline. The key can be reserved before capacity checks and before run persistence.

If cleanup is missing, short spikes can poison keys: at 40,000 run starts/day, a 0.2% rate of temporary `429` rejections leaves 80 keys/day permanently unusable.
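
The failure mode is easy to reproduce with an in-memory stand-in for the reservation store (the real store is Redis `SetNX` with no TTL). The handler below is illustrative, not Cordum's code, and deliberately omits cleanup to show how one temporary 429 poisons the key:

```go
package main

import "fmt"

// In-memory stand-in for the no-TTL reservation store.
var reserved = map[string]string{}

// trySetKey mimics SetNX: reserve only if the key is not already held.
func trySetKey(key, runID string) bool {
	if _, held := reserved[key]; held {
		return false
	}
	reserved[key] = runID
	return true
}

// startRunNoCleanup reserves the key, then rejects on capacity —
// BUG: the reservation is never released on rejection.
func startRunNoCleanup(key, runID string, atCapacity bool) string {
	if !trySetKey(key, runID) {
		return "409 conflict: idempotency key already used"
	}
	if atCapacity {
		return "429 max concurrent runs reached"
	}
	return "200 ok"
}

func main() {
	// 429 during a spike reserves the key...
	fmt.Println(startRunNoCleanup("retry-after-limit", "run-1", true))
	// ...so the retry fails even after capacity recovers.
	fmt.Println(startRunNoCleanup("retry-after-limit", "run-2", false))
}
```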

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Builders Library: Making retries safe with idempotent APIs | Client request identifiers, semantic equivalence, and why retries need explicit API contracts. | No concrete guidance for reservation cleanup when a workflow admission gate rejects before run creation. |
| Stripe Docs: Idempotent requests | Result replay behavior, parameter mismatch handling, and caveats where validation/concurrency conflicts are not saved. | No scheduler-level pattern for releasing reserved keys after tenant concurrency limits reject a run. |
| PayPal Docs: API requests (`PayPal-Request-Id`) | Idempotent POST retries with user-generated request IDs and long retention windows for retry safety. | No control-plane internals for pre-dispatch reservation lifecycle, cleanup observability, or admission locking. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Admission lock | Gateway uses `workflowAdmissionLockTTL=10s`, retry delay `10ms`, and max wait `2s` per org when max concurrency is enabled. | Concurrent run submissions serialize around admission checks and reduce race windows. |
| Reservation timing | If `Idempotency-Key` exists, `TrySetRunIdempotencyKey` runs before `CountActiveRuns` and before `CreateRun`. | Duplicate concurrent requests coalesce early, but cleanup is required on downstream rejection paths. |
| Cleanup paths | Reservation cleanup runs on active-run count failure, concurrency limit rejection (`429`), and run creation failure. | Keys do not remain blocked after temporary admission failures. |
| Redis storage semantics | `TrySetRunIdempotencyKey` uses `SetNX(..., 0)` on `wf:run:idempotency:<key>` (no TTL). | Reservation leaks persist until explicit delete, so cleanup reliability is critical. |

Reservation lifecycle in code

Reservation and cleanup on `429` path

core/controlplane/gateway/handlers_workflows.go

```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
idempotencyKey := idempotencyKeyFromRequest(r)
runID := uuid.NewString()
reservedKey := false

if idempotencyKey != "" {
  ok, err := s.workflowStore.TrySetRunIdempotencyKey(r.Context(), idempotencyKey, runID)
  if err != nil {
    writeErrorJSON(w, http.StatusInternalServerError, "idempotency reservation failed")
    return
  }
  if !ok {
    if existingID, err := s.workflowStore.GetRunByIdempotencyKey(r.Context(), idempotencyKey); err == nil && existingID != "" {
      writeJSON(w, map[string]string{"run_id": existingID})
      return
    }
    writeErrorJSON(w, http.StatusConflict, "idempotency key already used")
    return
  }
  reservedKey = true
}

// ... CountActiveRuns (populating count) and the tenant's limit lookup are elided ...

if count >= limit {
  if reservedKey && idempotencyKey != "" {
    cleanupRunIdempotencyReservation(r.Context(), idempotencyKey, runID,
      "failed to cleanup idempotency key after concurrency limit rejection",
      s.workflowStore.DeleteRunIdempotencyKey)
  }
  writeErrorJSON(w, http.StatusTooManyRequests, "max concurrent runs reached")
  return
}
```

Redis semantics (`SetNX` with no TTL)

core/workflow/store_redis.go

```go
// core/workflow/store_redis.go (excerpt)
func (s *RedisStore) TrySetRunIdempotencyKey(ctx context.Context, key, runID string) (bool, error) {
  if key == "" || runID == "" {
    return false, fmt.Errorf("idempotency key and run id required")
  }
  return s.client.SetNX(ctx, runIdempotencyKey(key), runID, 0).Result() // no TTL
}

func (s *RedisStore) DeleteRunIdempotencyKey(ctx context.Context, key string) error {
  if key == "" {
    return fmt.Errorf("idempotency key required")
  }
  return s.client.Del(ctx, runIdempotencyKey(key)).Err()
}

func runIdempotencyKey(key string) string {
  return "wf:run:idempotency:" + key
}
```
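
The contract these two methods implement is simple to state: a key stays reserved until explicitly deleted, and delete re-opens it. An in-memory analogue (illustration only, not the Redis store) makes the lifecycle concrete:

```go
package main

import (
	"fmt"
	"sync"
)

// memStore mimics the SetNX-with-no-TTL contract of the Redis store:
// reservation is first-writer-wins and persists until an explicit delete.
type memStore struct {
	mu   sync.Mutex
	keys map[string]string
}

func newMemStore() *memStore { return &memStore{keys: map[string]string{}} }

func (s *memStore) TrySetRunIdempotencyKey(key, runID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.keys[key]; held {
		return false
	}
	s.keys[key] = runID
	return true
}

func (s *memStore) DeleteRunIdempotencyKey(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.keys, key)
}

func main() {
	s := newMemStore()
	fmt.Println(s.TrySetRunIdempotencyKey("k1", "run-a")) // true: reserved
	fmt.Println(s.TrySetRunIdempotencyKey("k1", "run-b")) // false: still held
	s.DeleteRunIdempotencyKey("k1")                       // explicit cleanup
	fmt.Println(s.TrySetRunIdempotencyKey("k1", "run-b")) // true: retry proceeds
}
```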

Regression test that protects retry behavior

core/controlplane/gateway/workflow_runs_test.go

```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunRejectedByConcurrencyLimitReleasesIdempotencyReservation(t *testing.T) {
  // 1) max_concurrent_runs = 1
  // 2) existing running run fills the slot
  // 3) request with Idempotency-Key gets 429
  // 4) complete blocking run
  // 5) retry with same key succeeds and persists a new run
}
```

Validation runbook

Validate this in staging before changing concurrency limits or idempotency retention behavior.

runbook.sh

```bash
# Validation runbook. $GATEWAY and $WF_ID are placeholders for your
# environment; the endpoint shape follows POST /workflows/:id/runs.
# 1) Set max_concurrent_runs=1 for the test tenant
# 2) Start one long-running run to fill capacity
curl -s -X POST "$GATEWAY/workflows/$WF_ID/runs"
# 3) Submit another run with the idempotency key; expect HTTP 429
curl -s -o /dev/null -w '%{http_code}\n' -X POST "$GATEWAY/workflows/$WF_ID/runs" \
  -H "Idempotency-Key: retry-after-limit"
# 4) Complete the blocking run
# 5) Retry the same request with the same key; expect HTTP 200 and a valid run_id
curl -s -X POST "$GATEWAY/workflows/$WF_ID/runs" \
  -H "Idempotency-Key: retry-after-limit"
# 6) Verify the Redis key maps to the new run
redis-cli GET "wf:run:idempotency:retry-after-limit"
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Reserve key but never clean on admission failure | Simple logic in the happy path. | Temporary 429 can poison a key permanently. |
| Short TTL on idempotency keys | Self-heals stale reservations over time. | Late retries after TTL can duplicate intent. |
| No TTL + explicit cleanup (Cordum pattern) | Stable replay behavior across long retry windows. | Cleanup delete failures need monitoring and remediation. |
- Cleanup failures are logged but not retried inline. A Redis outage during delete can still leave stale reservations.
- Reusing a key with a different payload does not currently trigger payload mismatch validation in this admission path.
- No TTL on `wf:run:idempotency:<key>` means operational hygiene matters more than in 24-hour key-retention models.

Next step

Do this in one iteration:

  1. Add a metric for idempotency cleanup failures by failure context.
  2. Add an alert when cleanup failures exceed 1 per 1,000 run starts.
  3. Decide whether this endpoint should enforce payload fingerprint checks on key reuse.
  4. Run the admission runbook after each concurrency policy change.
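
Step 1 can start as an in-process counter keyed by failure context; the labels below match the three documented cleanup paths, but the type and names are a sketch (a real deployment would export these through its metrics library):

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupMetrics counts idempotency cleanup failures per failure context,
// e.g. "count_failure", "concurrency_rejection", "run_create_failure".
type cleanupMetrics struct {
	mu       sync.Mutex
	failures map[string]int
}

func newCleanupMetrics() *cleanupMetrics {
	return &cleanupMetrics{failures: map[string]int{}}
}

func (m *cleanupMetrics) recordFailure(failureContext string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.failures[failureContext]++
}

// failuresPerThousand supports the step-2 alert threshold of
// 1 cleanup failure per 1,000 run starts.
func (m *cleanupMetrics) failuresPerThousand(runStarts int) float64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	total := 0
	for _, n := range m.failures {
		total += n
	}
	return float64(total) / float64(runStarts) * 1000
}

func main() {
	m := newCleanupMetrics()
	m.recordFailure("concurrency_rejection")
	m.recordFailure("run_create_failure")
	fmt.Println(m.failuresPerThousand(1000)) // 2
}
```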

Continue with "AI Agent Idempotency Keys" and "AI Agent Rate Limiting and Overload Control".

Operational follow-through

Idempotency is a lifecycle, not a header. Admission, persistence, and cleanup must agree under failure.