The production problem
Teams hear “exactly once” and assume duplicates disappear. Then a retry path or region boundary appears, and the same action runs twice.
In agent systems, duplicate side effects are expensive. One replay can open duplicate incidents, create duplicate PRs, or push duplicate infrastructure changes.
A small duplicate rate still hurts at scale. At 20,000 side-effecting operations per day, 0.5% duplicate execution means roughly 100 duplicate actions every day.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google Pub/Sub exactly-once delivery | Clear exactly-once semantics and constraints (pull subscriptions, regional scope). | No guidance for cross-system side effects in autonomous workflows. |
| Amazon SQS at-least-once delivery | Direct explanation that duplicates can occur and consumers must be idempotent. | No framework for policy-gated replay in agent control planes. |
| Kafka delivery semantics (Confluent docs) | Strong description of at-most-once, at-least-once, and exactly-once tradeoffs. | Does not cover external side-effect systems outside topic-to-topic transactions. |
Delivery semantics model
| Layer | Required rule | Failure if missing |
|---|---|---|
| Transport delivery | Assume at-least-once unless your exact platform mode says otherwise. | Duplicate message handling not implemented. |
| Processing semantics | Store idempotency keys and processed markers at operation boundaries. | Repeated execution of state-changing actions. |
| External side effects | Use per-action dedupe in destination system when possible. | Duplicate tickets, PRs, payments, or infrastructure changes. |
| Governance layer | Run policy checks on replay path too, not just first execution. | Unsafe retries bypass runtime controls. |
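The processing-semantics rule above depends on keys that are stable across retries. A minimal sketch of deterministic key construction, assuming keys are derived from run, step, and action identifiers (the function and field names here are illustrative, not a Cordum API):

```go
package main

import "fmt"

// IdempotencyKey builds a deterministic key for one side-effecting
// operation. The same logical operation always yields the same key,
// so a retry or redelivery maps onto the same ledger entry.
func IdempotencyKey(runID string, step int, action string) string {
	return fmt.Sprintf("%s:step_%d:%s", runID, step, action)
}

func main() {
	// A redelivery of the same step produces an identical key.
	first := IdempotencyKey("run_18", 5, "create_pr")
	retry := IdempotencyKey("run_18", 5, "create_pr")
	fmt.Println(first, first == retry)
}
```

Deriving the key from stable identifiers (rather than a random UUID per attempt) is what makes duplicate detection possible at all: a fresh key per retry defeats the ledger.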
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Message bus behavior | JetStream durable subjects with explicit ack/nak and a default 10m AckWait | Redelivery is expected when a handler fails or misses the ack deadline. |
| Scheduler retry budget | Up to 50 scheduling attempts with 1s-30s backoff before terminal DLQ | Retries are finite, so replay and dedupe need explicit operator workflows. |
| Run idempotency | Run creation supports `Idempotency-Key` header | Duplicate submission requests map to one logical run. |
| DLQ and replay | Terminal failures emit DLQ entries for controlled retry/replay | Acknowledges that failures and duplicates are operationally normal. |
| Policy checks | Submit-time and dispatch-time safety evaluation | Replay path still passes governance controls before side effects. |
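The run-idempotency row can be exercised like any HTTP header convention. A sketch of building a run-creation request with an `Idempotency-Key` header; the URL path is a placeholder, not a documented Cordum route:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// buildRunRequest prepares a run-creation request carrying an
// Idempotency-Key header so duplicate submissions collapse into one
// logical run server-side. The endpoint URL is a placeholder.
func buildRunRequest(url, key string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Idempotency-Key", key)
	return req, nil
}

func main() {
	req, err := buildRunRequest("https://example.invalid/v1/runs", "run_18:submit", []byte(`{}`))
	if err != nil {
		panic(err)
	}
	// Retried submissions reuse the same key, so the server can dedupe.
	fmt.Println(req.Header.Get("Idempotency-Key"))
}
```

The client's only obligation is to reuse the same key when it retries a submission; dedupe itself happens server-side.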
Implementation examples
Idempotent consumer skeleton (Go)

```go
func HandleMessage(msg Message) error {
	// Skip the side effect if this key already completed.
	if alreadyProcessed(msg.IdempotencyKey) {
		return nil
	}
	if err := applySideEffect(msg.Payload); err != nil {
		return err // nak; the bus will redeliver
	}
	// Note: a crash between the side effect and this marker re-runs the
	// action on redelivery, so the destination still needs its own dedupe.
	markProcessed(msg.IdempotencyKey)
	return nil
}
```

Delivery policy defaults (YAML)
```yaml
delivery_semantics:
  default: at_least_once
  replay:
    require_policy_check: true
    require_idempotency_key: true
    deny_if_missing_key: true
```

Duplicate handling audit event (JSON)
```json
{
  "message_id": "msg_9041",
  "idempotency_key": "run_18:step_5:create_pr",
  "delivery_attempt": 3,
  "duplicate_detected": true,
  "side_effect_executed": false,
  "decision": "dedupe_skip"
}
```

Limitations and tradeoffs
- Idempotency ledgers add state and storage overhead to consumers.
- Strict dedupe windows can reject legitimate replays after long outages.
- Exactly-once modes reduce duplicate work but add latency and cost on some paths.
- Cross-system side effects still need compensation logic for non-idempotent destinations.
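The ledger overhead in the first tradeoff can be made concrete. A minimal in-memory sketch of a processed-marker store (a production version would use a durable store with a TTL-based dedupe window, which is where the storage cost and the long-outage rejection risk come from):

```go
package main

import (
	"fmt"
	"sync"
)

// Ledger records which idempotency keys have completed. An in-memory
// map illustrates the shape; durability and expiry are the real cost.
type Ledger struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewLedger() *Ledger {
	return &Ledger{seen: make(map[string]bool)}
}

// MarkProcessed returns false if the key was already recorded, so the
// caller can skip the side effect on redelivery.
func (l *Ledger) MarkProcessed(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.seen[key] {
		return false
	}
	l.seen[key] = true
	return true
}

func main() {
	led := NewLedger()
	fmt.Println(led.MarkProcessed("run_18:step_5:create_pr")) // first delivery
	fmt.Println(led.MarkProcessed("run_18:step_5:create_pr")) // redelivery
}
```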
Next step
Run this in one sprint:
1. Classify each workflow edge as at-most-once, at-least-once, or scoped exactly-once.
2. Add idempotency keys to every side-effecting operation path.
3. Track duplicate-detected rate and replay success in dashboards.
4. Test one forced redelivery scenario per critical workflow.
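Step 4 can start as a unit-style check: deliver the same message twice through a deduping handler and assert the side effect ran exactly once. A sketch with illustrative types (not a Cordum API):

```go
package main

import "fmt"

// Message mirrors the consumer skeleton's input shape.
type Message struct {
	IdempotencyKey string
	Payload        string
}

// countingConsumer dedupes on the idempotency key and counts how many
// times the side effect actually runs, so a test can assert on it.
type countingConsumer struct {
	processed map[string]bool
	effects   int
}

func (c *countingConsumer) Handle(msg Message) error {
	if c.processed[msg.IdempotencyKey] {
		return nil // duplicate delivery: skip the side effect
	}
	c.effects++ // stands in for the real side effect
	c.processed[msg.IdempotencyKey] = true
	return nil
}

func main() {
	c := &countingConsumer{processed: make(map[string]bool)}
	msg := Message{IdempotencyKey: "run_18:step_5:create_pr", Payload: "{}"}
	_ = c.Handle(msg)
	_ = c.Handle(msg) // forced redelivery
	fmt.Println("side effects:", c.effects)
}
```

The same pattern extends to integration tests that nak a real message and let the bus redeliver it.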
Continue with AI Agent Idempotency Keys and AI Agent Transactional Outbox Pattern.