## The production problem

Your caller sends `POST /workflows/:id/runs` with an idempotency key and the request times out. The retry should be safe.

The hard part is the admission pipeline: the key is reserved before capacity checks and before the run is persisted. If cleanup is missing, short spikes can poison keys. Example: at 40,000 run starts/day, a 0.2% temporary `429` rejection rate means 80 keys/day can become permanently unusable.
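A minimal in-memory sketch of the failure mode (hypothetical stand-in: the real reservation store is Redis, and `tryReserve` here only mimics `SetNX` first-writer-wins semantics):

```go
package main

import "fmt"

// reservations simulates the idempotency store (in-memory stand-in;
// the real store is Redis).
var reservations = map[string]string{}

// tryReserve mimics SetNX semantics: the first writer wins.
func tryReserve(key, runID string) bool {
	if _, exists := reservations[key]; exists {
		return false
	}
	reservations[key] = runID
	return true
}

func main() {
	// First attempt: the key is reserved, then admission rejects with 429.
	ok := tryReserve("retry-after-limit", "run-1")
	fmt.Println("first attempt reserved:", ok) // true
	// Without cleanup, every retry finds the key taken: the key is poisoned.
	ok = tryReserve("retry-after-limit", "run-2")
	fmt.Println("retry reserved:", ok) // false
}
```

This is exactly the gap the cleanup paths below exist to close: the reservation must be released on any rejection that happens before a run is persisted.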
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: Making retries safe with idempotent APIs | Client request identifiers, semantic equivalence, and why retries need explicit API contracts. | No concrete guidance for reservation cleanup when a workflow admission gate rejects before run creation. |
| Stripe Docs: Idempotent requests | Result replay behavior, parameter mismatch handling, and caveats where validation/concurrency conflicts are not saved. | No scheduler-level pattern for releasing reserved keys after tenant concurrency limits reject a run. |
| PayPal Docs: API requests (`PayPal-Request-Id`) | Idempotent POST retries with user-generated request IDs and long retention windows for retry safety. | No control-plane internals for pre-dispatch reservation lifecycle, cleanup observability, or admission locking. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Admission lock | Gateway uses `workflowAdmissionLockTTL=10s`, retry delay `10ms`, and max wait `2s` per org when max concurrency is enabled. | Concurrent run submissions serialize around admission checks and reduce race windows. |
| Reservation timing | If `Idempotency-Key` exists, `TrySetRunIdempotencyKey` runs before `CountActiveRuns` and before `CreateRun`. | Duplicate concurrent requests coalesce early, but cleanup is required on downstream rejection paths. |
| Cleanup paths | Reservation cleanup runs on active-run count failure, concurrency limit rejection (`429`), and run creation failure. | Keys do not remain blocked after temporary admission failures. |
| Redis storage semantics | `TrySetRunIdempotencyKey` uses `SetNX(..., 0)` on `wf:run:idempotency:<key>` (no TTL). | Reservation leaks persist until explicit delete, so cleanup reliability is critical. |
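The admission-lock row can be sketched as a poll loop. The constants mirror the values in the table; `memLock` and `acquireAdmissionLock` are hypothetical stand-ins, since the real per-org lock is Redis-backed:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const (
	admissionLockTTL = 10 * time.Second      // workflowAdmissionLockTTL
	lockRetryDelay   = 10 * time.Millisecond // retry delay between attempts
	lockMaxWait      = 2 * time.Second       // max wait per org
)

// memLock is an in-memory stand-in for the per-org admission lock.
type memLock struct {
	mu   sync.Mutex
	held map[string]bool
}

func (l *memLock) tryAcquire(org string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[org] {
		return false
	}
	l.held[org] = true
	return true
}

func (l *memLock) release(org string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, org)
}

// acquireAdmissionLock polls until the lock is free or lockMaxWait
// elapses, mirroring the gateway's retry-delay/max-wait behavior.
func acquireAdmissionLock(l *memLock, org string) error {
	deadline := time.Now().Add(lockMaxWait)
	for {
		if l.tryAcquire(org) {
			return nil
		}
		if time.Now().After(deadline) {
			return errors.New("admission lock wait exceeded")
		}
		time.Sleep(lockRetryDelay)
	}
}

func main() {
	l := &memLock{held: map[string]bool{}}
	if err := acquireAdmissionLock(l, "org-1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("lock acquired for org-1")
	l.release("org-1")
}
```

Serializing admission per org this way is what keeps the reservation, count, and create steps from racing each other across concurrent submissions.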
## Reservation lifecycle in code

### Reservation and cleanup on the `429` path

```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
idempotencyKey := idempotencyKeyFromRequest(r)
runID := uuid.NewString()
reservedKey := false
if idempotencyKey != "" {
	ok, err := s.workflowStore.TrySetRunIdempotencyKey(r.Context(), idempotencyKey, runID)
	if err != nil {
		writeErrorJSON(w, http.StatusInternalServerError, "idempotency reservation failed")
		return
	}
	if !ok {
		if existingID, err := s.workflowStore.GetRunByIdempotencyKey(r.Context(), idempotencyKey); err == nil && existingID != "" {
			writeJSON(w, map[string]string{"run_id": existingID})
			return
		}
		writeErrorJSON(w, http.StatusConflict, "idempotency key already used")
		return
	}
	reservedKey = true
}
// count and limit come from the earlier active-run count check (not shown).
if count >= limit {
	if reservedKey && idempotencyKey != "" {
		cleanupRunIdempotencyReservation(r.Context(), idempotencyKey, runID,
			"failed to cleanup idempotency key after concurrency limit rejection",
			s.workflowStore.DeleteRunIdempotencyKey)
	}
	writeErrorJSON(w, http.StatusTooManyRequests, "max concurrent runs reached")
	return
}
```

### Redis semantics (`SetNX` with no TTL)

```go
// core/workflow/store_redis.go (excerpt)
func (s *RedisStore) TrySetRunIdempotencyKey(ctx context.Context, key, runID string) (bool, error) {
	if key == "" || runID == "" {
		return false, fmt.Errorf("idempotency key and run id required")
	}
	return s.client.SetNX(ctx, runIdempotencyKey(key), runID, 0).Result() // no TTL
}

func (s *RedisStore) DeleteRunIdempotencyKey(ctx context.Context, key string) error {
	if key == "" {
		return fmt.Errorf("idempotency key required")
	}
	return s.client.Del(ctx, runIdempotencyKey(key)).Err()
}

func runIdempotencyKey(key string) string {
	return "wf:run:idempotency:" + key
}
```

### Regression test that protects retry behavior

```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunRejectedByConcurrencyLimitReleasesIdempotencyReservation(t *testing.T) {
	// 1) max_concurrent_runs = 1
	// 2) existing running run fills the slot
	// 3) request with Idempotency-Key gets 429
	// 4) complete blocking run
	// 5) retry with same key succeeds and persists a new run
}
```

## Validation runbook
Validate this in staging before changing concurrency limits or idempotency retention behavior.
1. Set `max_concurrent_runs=1` for the test tenant.
2. Start one long-running run to fill capacity.
3. Submit another run with `Idempotency-Key: retry-after-limit`.
4. Expect HTTP `429`.
5. Complete the blocking run.
6. Retry the same request with the same key.
7. Expect HTTP `200` and a valid `run_id`.
8. Verify the Redis key `wf:run:idempotency:retry-after-limit` maps to the new run.
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Reserve key but never clean on admission failure | Simple logic in happy path. | Temporary 429 can poison a key permanently. |
| Short TTL on idempotency keys | Self-heals stale reservations over time. | Late retries after TTL can duplicate intent. |
| No TTL + explicit cleanup (Cordum pattern) | Stable replay behavior across long retry windows. | Cleanup delete failures need monitoring and remediation. |
- Cleanup failures are logged but not retried inline; a Redis outage during the delete can still leave stale reservations.
- Reusing a key with a different payload does not currently trigger payload mismatch validation in this admission path.
- No TTL on `wf:run:idempotency:<key>` means operational hygiene matters more than in 24-hour key-retention models.
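The second limitation could be addressed with a payload fingerprint. This is a hypothetical extension, not current behavior: hash the request body at reservation time, store the digest alongside the run ID, and compare on key reuse.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes the request body so key reuse with a different
// payload can be detected (hypothetical extension; the current
// admission path does not do this).
func fingerprint(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// record is what the store would keep per idempotency key.
type record struct {
	runID string
	fp    string
}

// checkReuse reports whether a retry carries the original payload.
func checkReuse(stored record, body []byte) bool {
	return stored.fp == fingerprint(body)
}

func main() {
	orig := []byte(`{"workflow":"wf-123","input":"a"}`)
	stored := record{runID: "run-1", fp: fingerprint(orig)}
	fmt.Println("same payload ok:", checkReuse(stored, orig)) // true
	changed := []byte(`{"workflow":"wf-123","input":"b"}`)
	fmt.Println("changed payload ok:", checkReuse(stored, changed)) // false
}
```

On mismatch the endpoint would return an error (Stripe returns a `409`-style conflict in the analogous case) instead of replaying a result that does not match the caller's intent.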
## Next step
Do this in one iteration:
1. Add a metric for idempotency cleanup failures by failure context.
2. Add an alert when cleanup failures exceed 1 per 1,000 run starts.
3. Decide whether this endpoint should enforce payload fingerprint checks on key reuse.
4. Run the admission runbook after each concurrency policy change.
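Steps 1 and 2 can be sketched together. This is an in-memory counter for illustration; a real deployment would more likely use a Prometheus `CounterVec` with a failure-context label, and the threshold here is the proposed 1-per-1,000 ratio, not an existing alert.

```go
package main

import (
	"fmt"
	"sync"
)

// cleanupFailures counts idempotency cleanup failures by failure
// context, e.g. "concurrency_limit_rejection" or "run_creation_failure"
// (label names are illustrative).
type cleanupFailures struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *cleanupFailures) inc(context string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[context]++
}

// shouldAlert applies the proposed threshold: strictly more than
// 1 cleanup failure per 1,000 run starts.
func (c *cleanupFailures) shouldAlert(runStarts int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	total := 0
	for _, n := range c.counts {
		total += n
	}
	return float64(total)/float64(runStarts) > 1.0/1000.0
}

func main() {
	m := &cleanupFailures{counts: map[string]int{}}
	m.inc("concurrency_limit_rejection")
	m.inc("run_creation_failure")
	fmt.Println("alert at 1000 starts:", m.shouldAlert(1000)) // true: 2/1000 > 1/1000
}
```

Splitting the count by failure context matters because the remediation differs: a delete failure after a `429` is routine noise under load, while repeated failures on the run-creation path point at a store problem.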
Continue with *AI Agent Idempotency Keys* and *AI Agent Rate Limiting and Overload Control*.