The production problem
Distributed failures are ambiguous. A request can time out at the caller and still commit on the server.
In autonomous systems, retries are often automatic. Without idempotency, one ambiguous timeout can create duplicate incident tickets, duplicate PRs, or duplicate rollback jobs.
This is not edge-case theater. If your control plane runs at scale, retries happen continuously. The only question is whether retries are replay-safe.
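A minimal, self-contained sketch of the failure mode above: the server commits the write, the caller sees only a timeout and retries, and one intent produces two side effects. All names here (`submitWithRetry`, `createTicket`) are illustrative, not Cordum APIs.

```go
package main

import (
	"errors"
	"fmt"
)

// submitWithRetry simulates an ambiguous timeout: the server commits the
// write, but the caller sees a timeout and retries once. It returns how
// many tickets were actually created for a single business intent.
func submitWithRetry() int {
	ticketsCreated := 0
	createTicket := func() error {
		ticketsCreated++             // the write commits on the server...
		return errors.New("timeout") // ...but the caller only sees a timeout
	}
	for attempt := 0; attempt < 2; attempt++ {
		if err := createTicket(); err == nil {
			break
		}
	}
	return ticketsCreated
}

func main() {
	fmt.Println(submitWithRetry()) // one intent, two tickets
}
```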
What top sources cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Stripe docs: idempotent requests | Clear endpoint-level behavior: same key returns first status and body, including failures; mismatched parameters are rejected; pruning after 24h+ is documented. | No detailed guidance for multi-step agent workflows where run admission, queue delivery, and compensation all need aligned idempotency semantics. |
| AWS Builders Library: making retries safe with idempotent APIs | Strong API contract model: caller-provided request ID, semantic equivalence, and atomic token + mutation handling. | Does not provide a concrete control-plane blueprint for autonomous agents that combine policy gates, distributed queues, and workflow rollback logic. |
| AWS Lambda durable execution idempotency | Practical matrix for key reuse vs payload mismatch and clear distinction between run-level and step-level idempotency. | Focused on Lambda durable executions. It does not address cross-service idempotency alignment when one agent workflow dispatches into external systems. |
Idempotency contract design
| Contract field | Required behavior | Failure mode if missing |
|---|---|---|
| Intent key | Stable per business intent. Example: `workflow_id:step_id:operation_id`. | Random-per-retry keys convert retries into duplicate writes. |
| Payload guard | Persist request fingerprint with key and reject key reuse with different payload. | Cross-intent replay bug: same key can point at different work. |
| Replay response | Return semantically equivalent result for duplicate request. | Caller flow diverges between first request and retry path. |
| Reservation lifecycle | Reserve key before mutating; clean reservation on admission failures. | Poisoned key blocks valid retries after transient rejection. |
| Retention policy | Define expiration based on retry horizon and replay lag. | Too short: duplicate effects. Too long: unbounded key cardinality. |
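The contract rows above can be sketched as one in-memory handler that reserves the key, rejects payload mismatch, and replays the stored result. `store`, `record`, and `Execute` are hypothetical names, and a map stands in for Redis; this is a sketch of the contract, not the Cordum implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// record holds the payload guard and replay response for one intent key.
type record struct {
	payloadHash string
	result      string
}

// store is a minimal in-memory stand-in for a real key store.
type store struct{ records map[string]record }

var errPayloadMismatch = errors.New("key reused with different payload")

// Execute applies the contract: reject key reuse with a different payload,
// replay the stored result for duplicates, otherwise run the operation once.
func (s *store) Execute(key, payload string, op func() string) (string, error) {
	sum := sha256.Sum256([]byte(payload))
	hash := hex.EncodeToString(sum[:])
	if rec, ok := s.records[key]; ok {
		if rec.payloadHash != hash {
			return "", errPayloadMismatch
		}
		return rec.result, nil // replay: semantically equivalent result
	}
	result := op()
	s.records[key] = record{payloadHash: hash, result: result}
	return result, nil
}

func main() {
	s := &store{records: map[string]record{}}
	op := func() string { return "run-1" }

	first, _ := s.Execute("wf:step:op", `{"target":"db"}`, op)
	replay, _ := s.Execute("wf:step:op", `{"target":"db"}`, op)
	_, err := s.Execute("wf:step:op", `{"target":"cache"}`, op)

	fmt.Println(first, replay, err != nil)
}
```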
Cordum runtime behavior
| Control | Current behavior | Why it matters |
|---|---|---|
| Run create key inputs | Run creation accepts `Idempotency-Key`, `X-Idempotency-Key`, and query alternatives. | Client and proxy retries can preserve the same intent key across transport variants. |
| Run mapping storage | Workflow run dedupe is stored under `wf:run:idempotency:<key>` using Redis `SetNX`. | Concurrent submissions with one key race to one winner. |
| Concurrent replay behavior | Gateway test uses 10 concurrent requests with one key and persists exactly one run. | Dedupe is not theoretical; it is covered by concurrency tests. |
| Admission rejection cleanup | If concurrency gate rejects (`429`), reservation cleanup allows later retry with same key. | Prevents poisoned keys after temporary admission pressure. |
| Bus idempotency guard | JetStream handlers use Redis processed keys (`10m` TTL) and NAK-with-delay (`2s`) on guard failures. | Keeps at-least-once delivery from turning into duplicate processing under crash windows. |
| Compensation idempotency | Saga rollback auto-generates key from `workflow_id|job_id|topic|capability|step_index` hash. | Rollback retries avoid duplicate compensation side effects. |
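The compensation row describes a key derived from a hash over `workflow_id|job_id|topic|capability|step_index`. A hedged sketch of that derivation follows; the hash function, truncation, and `comp:` prefix are assumptions for illustration, not the actual Cordum code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// compensationKey derives a deterministic idempotency key from rollback
// metadata, mirroring the workflow_id|job_id|topic|capability|step_index
// shape described above. Same inputs always yield the same key, so a
// retried rollback dedupes against the first attempt.
func compensationKey(workflowID, jobID, topic, capability string, stepIndex int) string {
	input := strings.Join([]string{
		workflowID, jobID, topic, capability, fmt.Sprintf("%d", stepIndex),
	}, "|")
	sum := sha256.Sum256([]byte(input))
	return "comp:" + hex.EncodeToString(sum[:8]) // short prefix for readability
}

func main() {
	k1 := compensationKey("wf-1", "job-9", "deploy", "rollback", 3)
	k2 := compensationKey("wf-1", "job-9", "deploy", "rollback", 3)
	fmt.Println(k1 == k2) // deterministic: retries derive the same key
}
```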
Implementation examples
1) Stable intent key generation (Go)
```go
func BuildRunIntentKey(workflowID, stepID, operationID string) string {
	// Stable for the same business intent across retries.
	return fmt.Sprintf("%s:%s:%s", workflowID, stepID, operationID)
}

func SubmitWithIntentKey(ctx context.Context, req StartRunRequest) error {
	key := BuildRunIntentKey(req.WorkflowID, req.StepID, req.OperationID)
	req.Headers["Idempotency-Key"] = key
	return gateway.StartRun(ctx, req)
}
```
2) Concurrency regression test pattern (Go)
```go
const workers = 10
// Every worker submits with the same key.
req.Header.Set("Idempotency-Key", "same-key")

// Expect all responses to carry the same run_id.
for _, runID := range runIDs[1:] {
	if runID != runIDs[0] {
		t.Fatalf("expected one run id, got %v", runIDs)
	}
}

// Expect one persisted run.
if len(runs) != 1 {
	t.Fatalf("expected exactly 1 persisted run, got %d", len(runs))
}
```
This pattern catches a common regression: key reservation works in serial tests but fails under parallel submit pressure.
3) Payload mismatch guard (Go)
```go
type IdemRecord struct {
	Key         string
	RequestHash string
	RunID       string
	CreatedAt   time.Time
}

func ValidateReplay(existing IdemRecord, incomingHash string) error {
	if existing.RequestHash != incomingHash {
		return errors.New("idempotency key reused with different payload")
	}
	return nil
}
```
If you skip this check, key reuse across different payloads can return a stale run reference and hide caller mistakes.
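One way to produce the `RequestHash` that the payload guard compares is hashing a canonical encoding of the request. A sketch assuming JSON canonicalization via Go's `encoding/json`, which sorts map keys during marshaling; `requestHash` is an illustrative helper, not part of the codebase.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// requestHash produces a canonical fingerprint for the payload guard.
// encoding/json sorts map keys, so field order in the input map does not
// change the hash.
func requestHash(payload map[string]any) (string, error) {
	canonical, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canonical)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	a, _ := requestHash(map[string]any{"step": "deploy", "target": "prod"})
	b, _ := requestHash(map[string]any{"target": "prod", "step": "deploy"})
	fmt.Println(a == b) // true: key order does not affect the fingerprint
}
```

Canonicalization matters: hashing raw request bytes would treat `{"a":1,"b":2}` and `{"b":2,"a":1}` as different intents.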
4) Retention caveat to watch
```go
// Current run idempotency set path in Cordum workflow store
// (SetNX with no expiration)
ok, err := redis.SetNX(ctx, "wf:run:idempotency:"+key, runID, 0).Result()
```
Zero TTL means no automatic expiration. That can be exactly right for some audit workloads and exactly wrong for high-cardinality traffic.
Limitations and tradeoffs
- Verified from the current Cordum run-start path: replay behavior is keyed by idempotency-key lookup, not payload-hash comparison. If your client might reuse a key with changed input, add a payload guard.
- Bus processed-key TTL is bounded (10 minutes). That aligns with redelivery windows, not long-horizon replay dedupe.
- Long retention windows improve safety against late retries but increase key cardinality and storage costs.
- Compensation keys are hash-derived and deterministic, but they still rely on the quality of their input metadata.
- Idempotency prevents duplicate execution of an intent. It does not correct wrong intent or bad authorization.
Next step
Run this in one sprint:
1. Define intent-key templates for your top five side-effecting operations.
2. Add payload fingerprint checks to duplicate-request handling paths.
3. Decide retention policy explicitly per surface (run submit, queue dedupe, compensation).
4. Add a concurrency test with at least 10 parallel submits on the same key.
5. Drill one timeout + retry scenario in staging and verify exactly one side effect.
Continue with AI Agent Idempotency Payload Mismatch and AI Agent Timeouts, Retries, and Backoff.