The production problem
A timeout at the client does not mean the server did nothing. Autonomous agents hit that ambiguity many times per hour.
Without idempotency keys, one ambiguous timeout can create duplicate tickets, duplicate pull requests, or duplicate infrastructure changes.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Stripe idempotent requests | Excellent concrete behavior: key reuse returns first response, parameter mismatch is rejected, and key retention guidance. | Not focused on multi-step autonomous workflows with delegated agent actions. |
| AWS Builders Library: making retries safe with idempotent APIs | Strong API contract design and client request identifiers for at-most-once intent. | No opinionated governance model for policy-gated replays in agent control planes. |
| AWS Lambda durable execution idempotency | Execution-name idempotency and step replay behavior in durable executions. | Limited cross-system guidance when one workflow calls multiple external tools. |
Key design model
| Layer | Required design | Failure if missing |
|---|---|---|
| Key format | Derive from business intent (`run_id:step_id:operation_id`), not random per retry. | Each retry looks new and duplicates side effects. |
| Parameter lock | Store request hash with key and reject mismatched replays. | Same key with changed payload mutates unexpected state. |
| Retention window | Keep key state long enough to absorb late retries and replay lag. | Expired keys allow accidental re-execution of old intents. |
| Outcome semantics | Return semantically equivalent result for duplicate requests. | Client behavior diverges between first call and retry path. |
Cordum runtime behavior
| Control | Current behavior | Why it matters |
|---|---|---|
| Run idempotency | Workflow run creation supports an `Idempotency-Key` header. | Prevents duplicate run creation when submit responses are lost. |
| State mapping | The workflow idempotency mapping is stored at `wf:run:idempotency:<key>`. | Provides deterministic lookup for duplicate submit attempts. |
| Handler safety | Bus handlers are made idempotent with Redis locks and retryable NAKs. | Protects against duplicate message handling in at-least-once delivery. |
| Compensation path | Saga compensation keys are auto-generated from workflow/job/topic/capability metadata. | Keeps rollback actions replay-safe under repeated failure conditions. |
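The handler safety row describes a general pattern for at-least-once delivery: take a lock per message, acknowledge duplicates of finished work, and NAK anything that is in flight or has failed so the bus redelivers it. The sketch below illustrates that pattern only. It is not Cordum's code, the in-memory maps stand in for Redis locks, and `Deduper`, `Handle`, `Ack`, `Nak`, and `ErrTransient` are hypothetical names.

```go
package main

import (
	"errors"
	"sync"
)

// ErrTransient is an example failure that should lead to redelivery.
var ErrTransient = errors.New("transient failure")

// Decision is what the handler tells the bus to do with a message.
type Decision int

const (
	Ack Decision = iota // message is done, do not redeliver
	Nak                 // redeliver later
)

// Deduper tracks per-message state. A real system would keep this in a
// shared store with lock expiry rather than in process memory.
type Deduper struct {
	mu       sync.Mutex
	inFlight map[string]bool
	done     map[string]bool
}

func NewDeduper() *Deduper {
	return &Deduper{inFlight: make(map[string]bool), done: make(map[string]bool)}
}

// Handle runs work at most once to completion per message ID.
func (d *Deduper) Handle(msgID string, work func() error) Decision {
	d.mu.Lock()
	if d.done[msgID] {
		d.mu.Unlock()
		return Ack // duplicate delivery of a finished message
	}
	if d.inFlight[msgID] {
		d.mu.Unlock()
		return Nak // another worker holds the lock, retry later
	}
	d.inFlight[msgID] = true
	d.mu.Unlock()

	err := work() // runs outside the mutex so other messages are not blocked

	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.inFlight, msgID)
	if err != nil {
		return Nak // release the lock and ask for redelivery
	}
	d.done[msgID] = true
	return Ack
}
```

The NAK on a held lock is what makes the retry safe: the second delivery never runs the work concurrently with the first.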
Implementation examples
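Stable key generation (Go)
```go
import (
	"context"
	"fmt"
)

// BuildIdempotencyKey returns a key that stays stable across retries
// for the same business intent.
func BuildIdempotencyKey(runID, stepID, opID string) string {
	return fmt.Sprintf("%s:%s:%s", runID, stepID, opID)
}

// ReplaySafeCall sends the request with a deterministic Idempotency-Key header.
// Request, Response, and client are assumed to be defined by the caller's package.
func ReplaySafeCall(ctx context.Context, req Request) (Response, error) {
	key := BuildIdempotencyKey(req.RunID, req.StepID, req.OperationID)
	return client.Do(req.WithHeader("Idempotency-Key", key))
}
```
Idempotency storage policy (YAML)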
```yaml
idempotency:
  retention_window: 24h
  validate_parameter_hash: true
  reject_mismatch: true
  response_mode: semantic_equivalence
  key_template: "{run_id}:{step_id}:{operation_id}"
```
Ledger entry for retries (JSON)
```json
{
  "idempotency_key": "run_93a:step_4:create_pr",
  "request_hash": "sha256:cb81...",
  "first_seen_at": "2026-03-31T16:20:44Z",
  "result_ptr": "res:job_771",
  "replay_count": 3,
  "last_replay_at": "2026-03-31T16:22:09Z"
}
```
Limitations and tradeoffs
- Longer retention windows improve replay safety but increase storage cost and lookup cardinality.
- Strict parameter mismatch checks prevent accidental misuse but can also block valid intent changes.
- Random keys are easy to generate but break semantic dedupe across services.
- Idempotency does not replace compensation. It prevents duplicate intent, not wrong intent.
Next step
Run this in one sprint:
1. Define key templates for your top five side-effecting operations.
2. Add parameter hash validation to replay paths.
3. Instrument key hit ratio and mismatch rate in dashboards.
4. Run one fault drill with forced timeout + retry to verify no duplicate side effect.
Continue with "AI Agent Timeouts, Retries, and Backoff" and "AI Agent DLQ and Replay Patterns".