AI Agent Idempotency Keys in Production

One timeout. Two retries. Three duplicated side effects. This is how you stop at one.

Guide · 11 min read · Apr 2026
TL;DR
  • A timeout after a write does not mean the write failed: it may still have committed. Retrying blindly can duplicate side effects.
  • A useful idempotency contract includes a stable intent key, a payload guard, replay response semantics, and a retention policy.
  • Cordum deduplicates workflow run creation by key and returns the existing run ID, but this path does not currently compare payload hashes.
Intent-first keys

Key structure should represent operation intent, not retry attempt.

Replay semantics

Retries should return an equivalent outcome, not create duplicate work.

Workflow lineage

Track idempotency across run submit, bus processing, and compensation.

Scope

This article focuses on autonomous workflows that mutate external state: tickets, pull requests, infra, and approvals. Read this if your retry path can create a real-world duplicate.

The production problem

Distributed failures are ambiguous. A request can time out at the caller and still commit on the server.

In autonomous systems, retries are often automatic. Without idempotency, one ambiguous timeout can create duplicate incident tickets, duplicate PRs, or duplicate rollback jobs.

This is not edge-case theater. If your control plane runs at scale, retries happen continuously. The only question is whether retries are replay-safe.
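
To make the ambiguity concrete, here is a minimal sketch of a server that commits a write and then loses the reply. Everything in it (`ticketServer`, `createWithRetry`, the key helpers) is hypothetical and exists only for illustration; it is not Cordum code.

```go
package main

import (
	"errors"
	"strconv"
	"sync"
)

// errTimeout stands in for a caller-side timeout: the outcome is unknown.
var errTimeout = errors.New("timeout: outcome unknown")

// ticketServer is a hypothetical in-memory service used only for illustration.
type ticketServer struct {
	mu        sync.Mutex
	byKey     map[string]int // idempotency key -> ticket ID
	created   int            // number of real side effects
	dropFirst bool           // drop the first reply to simulate the ambiguous timeout
}

func newTicketServer() *ticketServer {
	return &ticketServer{byKey: map[string]int{}, dropFirst: true}
}

// Create commits the ticket first and only then "sends" the reply,
// so a lost reply still leaves a committed write behind.
func (s *ticketServer) Create(key string) (int, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	id, seen := s.byKey[key]
	if !seen {
		s.created++
		id = s.created
		s.byKey[key] = id
	}
	if s.dropFirst {
		s.dropFirst = false
		return 0, errTimeout
	}
	return id, nil
}

// createWithRetry retries on error; keyFor decides which key each attempt sends.
func createWithRetry(s *ticketServer, attempts int, keyFor func(attempt int) string) (int, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		id, err := s.Create(keyFor(i))
		if err == nil {
			return id, nil
		}
		lastErr = err
	}
	return 0, lastErr
}

// stableKey represents the business intent and ignores the attempt number.
func stableKey(int) string { return "wf-1:step-2:open-ticket" }

// perAttemptKey changes on every retry, which defeats deduplication.
func perAttemptKey(attempt int) string { return "attempt-" + strconv.Itoa(attempt) }
```

With `stableKey`, the retry after the lost reply maps back to the ticket that already exists, so `created` stays at 1. With `perAttemptKey`, the same sequence leaves two tickets behind.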

What top sources cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Stripe docs: idempotent requests | Clear endpoint-level behavior: same key returns first status and body, including failures; mismatched parameters are rejected; pruning after 24h+ is documented. | No detailed guidance for multi-step agent workflows where run admission, queue delivery, and compensation all need aligned idempotency semantics. |
| AWS Builders Library: making retries safe with idempotent APIs | Strong API contract model: caller-provided request ID, semantic equivalence, and atomic token + mutation handling. | Does not provide a concrete control-plane blueprint for autonomous agents that combine policy gates, distributed queues, and workflow rollback logic. |
| AWS Lambda durable execution idempotency | Practical matrix for key reuse vs payload mismatch and clear distinction between run-level and step-level idempotency. | Focused on Lambda durable executions. It does not address cross-service idempotency alignment when one agent workflow dispatches into external systems. |

Idempotency contract design

| Contract field | Required behavior | Failure mode if missing |
| --- | --- | --- |
| Intent key | Stable per business intent. Example: `workflow_id:step_id:operation_id`. | Random-per-retry keys convert retries into duplicate writes. |
| Payload guard | Persist request fingerprint with key and reject key reuse with different payload. | Cross-intent replay bug: same key can point at different work. |
| Replay response | Return semantically equivalent result for duplicate request. | Caller flow diverges between first request and retry path. |
| Reservation lifecycle | Reserve key before mutating; clean reservation on admission failures. | Poisoned key blocks valid retries after transient rejection. |
| Retention policy | Define expiration based on retry horizon and replay lag. | Too short: duplicate effects. Too long: unbounded key cardinality. |
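
The first four contract fields can be sketched as one small store. This is an in-memory illustration under stated assumptions, not Cordum's implementation: `Store`, `Begin`, `Complete`, and `Release` are hypothetical names, results are plain strings, and retention is left out. A production version would likely sit in Redis or a database.

```go
package main

import (
	"errors"
	"sync"
)

var (
	ErrPayloadMismatch = errors.New("idempotency key reused with different payload")
	ErrInFlight        = errors.New("request with this key is still in flight")
)

// record is what gets persisted next to each key.
type record struct {
	requestHash string // payload guard
	result      string // replay response
	done        bool   // false while the key is only reserved
}

// Store is a hypothetical in-memory idempotency store.
type Store struct {
	mu      sync.Mutex
	records map[string]*record
}

func NewStore() *Store { return &Store{records: map[string]*record{}} }

// Begin reserves key for requestHash before any mutation happens.
// It returns (result, true, nil) when a completed duplicate can be replayed.
func (s *Store) Begin(key, requestHash string) (string, bool, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	rec, ok := s.records[key]
	if !ok {
		s.records[key] = &record{requestHash: requestHash}
		return "", false, nil
	}
	if rec.requestHash != requestHash {
		return "", false, ErrPayloadMismatch
	}
	if !rec.done {
		return "", false, ErrInFlight
	}
	return rec.result, true, nil
}

// Complete stores the outcome so later duplicates replay it.
func (s *Store) Complete(key, result string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rec, ok := s.records[key]; ok {
		rec.result, rec.done = result, true
	}
}

// Release drops a reservation after an admission failure,
// so a transient rejection does not poison the key.
func (s *Store) Release(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if rec, ok := s.records[key]; ok && !rec.done {
		delete(s.records, key)
	}
}
```

A caller would run `Begin`, perform the mutation only when nothing was replayed, then call `Complete` on success or `Release` when admission fails. How to treat the in-flight case (wait, reject, or poll) is a design choice this sketch leaves to the caller.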

Cordum runtime behavior

| Control | Current behavior | Why it matters |
| --- | --- | --- |
| Run create key inputs | Run creation accepts `Idempotency-Key`, `X-Idempotency-Key`, and query alternatives. | Client and proxy retries can preserve the same intent key across transport variants. |
| Run mapping storage | Workflow run dedupe is stored under `wf:run:idempotency:<key>` using Redis `SetNX`. | Concurrent submissions with one key race to one winner. |
| Concurrent replay behavior | Gateway test uses 10 concurrent requests with one key and persists exactly one run. | Dedupe is not theoretical; it is covered by concurrency tests. |
| Admission rejection cleanup | If concurrency gate rejects (`429`), reservation cleanup allows later retry with same key. | Prevents poisoned keys after temporary admission pressure. |
| Bus idempotency guard | JetStream handlers use Redis processed keys (`10m` TTL) and NAK-with-delay (`2s`) on guard failures. | Keeps at-least-once delivery from turning into duplicate processing under crash windows. |
| Compensation idempotency | Saga rollback auto-generates key from `workflow_id\|job_id\|topic\|capability\|step_index` hash. | Rollback retries avoid duplicate compensation side effects. |
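
A hash-derived compensation key of the kind described in the last row could look like the sketch below. The field list comes from the table; the SHA-256 choice, the hex encoding, and the `CompensationInput` type are assumptions for illustration, and Cordum's actual encoding may differ.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"strconv"
	"strings"
)

// CompensationInput is a hypothetical carrier for the fields named in the table.
type CompensationInput struct {
	WorkflowID string
	JobID      string
	Topic      string
	Capability string
	StepIndex  int
}

// CompensationKey derives a deterministic key: the same rollback step
// always hashes to the same value, so retries of that step dedupe.
func CompensationKey(in CompensationInput) string {
	raw := strings.Join([]string{
		in.WorkflowID,
		in.JobID,
		in.Topic,
		in.Capability,
		strconv.Itoa(in.StepIndex),
	}, "|")
	sum := sha256.Sum256([]byte(raw))
	return hex.EncodeToString(sum[:])
}
```

The key is only as good as its inputs. If a field can itself contain the `|` separator, or if `JobID` changes between rollback attempts, two attempts at the same compensation can stop sharing a key.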

Implementation examples

1) Stable intent key generation (Go)

idempotency_key.go
Go
func BuildRunIntentKey(workflowID, stepID, operationID string) string {
  // Stable for the same business intent across retries.
  return fmt.Sprintf("%s:%s:%s", workflowID, stepID, operationID)
}

func SubmitWithIntentKey(ctx context.Context, req StartRunRequest) error {
  key := BuildRunIntentKey(req.WorkflowID, req.StepID, req.OperationID)
  req.Headers["Idempotency-Key"] = key
  return gateway.StartRun(ctx, req)
}

2) Concurrency regression test pattern (Go)

workflow_runs_test.go
Go
const workers = 10
req.Header.Set("Idempotency-Key", "same-key")

// Expect all responses to carry the same run_id.
for _, runID := range runIDs[1:] {
  if runID != runIDs[0] {
    t.Fatalf("expected one run id, got %v", runIDs)
  }
}

// Expect one persisted run.
if len(runs) != 1 {
  t.Fatalf("expected exactly 1 persisted run, got %d", len(runs))
}

This pattern catches a common regression: key reservation works in serial tests but fails under parallel submit pressure.
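
The excerpt above depends on the gateway test harness. As a self-contained sketch of the same pattern, the following uses a hypothetical in-memory `runStore` in place of the real gateway and Redis; the names are illustrative only.

```go
package main

import (
	"strconv"
	"sync"
)

// runStore is a hypothetical in-memory stand-in for the run store and its key index.
type runStore struct {
	mu    sync.Mutex
	byKey map[string]string // idempotency key -> run ID
	runs  []string          // persisted runs
}

func newRunStore() *runStore { return &runStore{byKey: map[string]string{}} }

// StartRun returns the existing run ID for key or creates a new run.
// The lookup and the write share one critical section, which is the
// property a set-if-absent reservation is meant to provide.
func (s *runStore) StartRun(key string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id, ok := s.byKey[key]; ok {
		return id
	}
	id := "run-" + strconv.Itoa(len(s.runs)+1)
	s.runs = append(s.runs, id)
	s.byKey[key] = id
	return id
}

// submitConcurrently fires workers parallel submits with one key
// and collects the run ID each of them received.
func submitConcurrently(s *runStore, key string, workers int) []string {
	runIDs := make([]string, workers)
	start := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			<-start // release all workers together to maximize contention
			runIDs[i] = s.StartRun(key)
		}(i)
	}
	close(start)
	wg.Wait()
	return runIDs
}
```

Splitting `StartRun` into a separate read and write without the shared lock is the kind of regression this test shape is meant to expose, and running it under `go test -race` makes that failure easier to catch.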

3) Payload mismatch guard (Go)

idempotency_guard.go
Go
type IdemRecord struct {
  Key         string
  RequestHash string
  RunID       string
  CreatedAt   time.Time
}

func ValidateReplay(existing IdemRecord, incomingHash string) error {
  if existing.RequestHash != incomingHash {
    return errors.New("idempotency key reused with different payload")
  }
  return nil
}

If you skip this check, key reuse across different payloads can return a stale run reference and hide caller mistakes.
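
The guard needs a request hash to compare. One way to produce it, sketched here as an assumption rather than as Cordum's method, is SHA-256 over the JSON encoding of the payload. Go's `encoding/json` sorts map keys when marshaling, so two maps with the same content hash identically. `RequestHash` and `CheckReplay` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
)

// RequestHash fingerprints a payload via its JSON encoding.
// Map keys are sorted by json.Marshal, so key order does not affect the hash.
func RequestHash(payload any) (string, error) {
	b, err := json.Marshal(payload)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// CheckReplay rejects reuse of a key whose stored fingerprint differs
// from the fingerprint of the incoming request.
func CheckReplay(storedHash, incomingHash string) error {
	if storedHash != incomingHash {
		return errors.New("idempotency key reused with different payload")
	}
	return nil
}
```

Fields that legitimately change between retries, such as timestamps or trace IDs, should be excluded before hashing. Otherwise every retry looks like a mismatch.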

4) Retention caveat to watch

store_redis.go
Go
// Current run idempotency set path in Cordum workflow store
// (SetNX with no expiration)
ok, err := redis.SetNX(ctx, "wf:run:idempotency:"+key, runID, 0).Result()

Zero TTL means no automatic expiration. That can be exactly right for some audit workloads and exactly wrong for high-cardinality traffic.
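
The tradeoff can be seen in a small in-memory model of set-if-absent with expiry. This is a sketch, not the Cordum store: `ttlStore`, its fake clock, and the two TTL constants are hypothetical. It mirrors the convention in the snippet above, where a TTL of zero means the key never expires.

```go
package main

import (
	"sync"
	"time"
)

const (
	noExpiry time.Duration = 0                // key is kept until deleted explicitly
	shortTTL               = 10 * time.Minute // illustrative bounded retention
)

type entry struct {
	value     string
	expiresAt time.Time // zero value means the entry never expires
}

// ttlStore is a hypothetical in-memory model of SetNX with a TTL.
// It uses a manually advanced clock so expiry is deterministic.
type ttlStore struct {
	mu      sync.Mutex
	clock   time.Time
	entries map[string]entry
}

func newTTLStore() *ttlStore {
	return &ttlStore{clock: time.Unix(0, 0), entries: map[string]entry{}}
}

// Advance moves the fake clock forward.
func (s *ttlStore) Advance(d time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.clock = s.clock.Add(d)
}

// SetNX stores value under key only if no live entry exists.
// It reports whether this call won the reservation.
func (s *ttlStore) SetNX(key, value string, ttl time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok {
		if e.expiresAt.IsZero() || s.clock.Before(e.expiresAt) {
			return false // still reserved
		}
	}
	e := entry{value: value}
	if ttl > 0 {
		e.expiresAt = s.clock.Add(ttl)
	}
	s.entries[key] = e
	return true
}
```

A retry that arrives after `shortTTL` has elapsed wins the reservation again and creates a second run. With `noExpiry` the same retry is still deduplicated, but the entry is never reclaimed.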

Limitations and tradeoffs

Verified against the current Cordum run-start path: replay is decided by idempotency key lookup alone, with no payload hash comparison. If your client might reuse a key with changed input, add a payload guard.

  • Bus processed-key TTL is bounded (10 minutes). That aligns with redelivery windows, not long-horizon replay dedupe.
  • Long retention windows improve safety against late retries but increase key cardinality and storage costs.
  • Compensation keys are hash-derived and deterministic, but still rely on input metadata quality.
  • Idempotency prevents duplicate intent execution. It does not correct wrong intent or bad authorization.

Next step

Run this in one sprint:

  1. Define intent-key templates for your top five side-effecting operations.
  2. Add payload fingerprint checks to duplicate-request handling paths.
  3. Decide retention policy explicitly per surface (run submit, queue dedupe, compensation).
  4. Add a concurrency test with at least 10 parallel submits on the same key.
  5. Drill one timeout + retry scenario in staging and verify one side effect.

Continue with AI Agent Idempotency Payload Mismatch and AI Agent Timeouts, Retries, and Backoff.

Design for retries you did not ask for

Networks retry. SDKs retry. Proxies retry. Your system should treat that as normal behavior.