Name: Cordum
Author: Cordum

The production problem

Your publisher sends a job. The connection drops before it sees an ack, so it retries.

If the retry lands inside the broker dedup window with the same `Nats-Msg-Id`, you are safe.

If the retry lands after the window, the broker may accept it as new. If the operation is expensive or has side effects, that duplicate runs the work a second time.
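The timing rule above can be sketched with a toy in-memory model. This is not JetStream's implementation. The `dedupWindow` type, the `at` helper, and the assumption that the window is measured from the first accepted publish are all illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// brokerWindow matches the 2-minute duplicate window discussed in this article.
const brokerWindow = 2 * time.Minute

// dedupWindow is a toy stand-in for a broker duplicate window (hypothetical type).
type dedupWindow struct {
	window time.Duration
	seen   map[string]time.Time // Msg-Id -> time the message was first accepted
}

func newDedupWindow(window time.Duration) *dedupWindow {
	return &dedupWindow{window: window, seen: make(map[string]time.Time)}
}

// Accept reports whether a publish carrying msgID at time now is stored as a
// new message (true) or suppressed as a duplicate (false).
func (d *dedupWindow) Accept(msgID string, now time.Time) bool {
	if first, ok := d.seen[msgID]; ok && now.Sub(first) < d.window {
		return false
	}
	d.seen[msgID] = now
	return true
}

// at returns a fixed, arbitrary start time plus the given number of seconds.
func at(seconds int) time.Time {
	start := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(seconds) * time.Second)
}

func main() {
	w := newDedupWindow(brokerWindow)
	fmt.Println(w.Accept("jobreq:job-123", at(0)))   // true: first publish
	fmt.Println(w.Accept("jobreq:job-123", at(30)))  // false: retry inside the window
	fmt.Println(w.Accept("jobreq:job-123", at(150))) // true: retry after the window
}
```

The third call is the one that causes trouble: same Msg-Id, same payload, but the model (like the broker) no longer remembers it.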

What top results cover and miss

Source	Strong coverage	Missing piece
NATS docs: JetStream Streams	Duplicate window configuration (`Duplicates`) and stream-level dedup behavior.	No application-level key design guidance for approval retries, resubmits, and tenant-scoped operations.
NATS docs: JetStream Model Deep Dive	`Nats-Msg-Id` semantics and duplicate suppression model.	No control-plane pattern for combining broker dedup with longer idempotency horizons.
NATS Blog: Infinite message deduplication in JetStream	How to push dedup beyond window constraints with `DiscardNewPerSubject` patterns.	No concrete mapping for multi-step AI workflow operations where retries and approvals mutate request labels.

Cordum runtime mechanics

Cordum treats broker dedup and business idempotency as separate layers. The split is deliberate.

Boundary	Current behavior	Operational impact
Broker dedup window	Cordum configures JetStream streams with `Duplicates: 2 * time.Minute`.	Retries republished within 2 minutes with the same Msg-Id are deduplicated; later ones may be accepted as new.
Publish-time key	`computeMsgID` derives keys from typed packet fields and supports a `cordum.bus_msg_id` label override.	Stable keys are possible when the caller needs deterministic dedup for retries.
Approval requeue	The gateway approval path sets `req.Labels[cordum.bus_msg_id] = "approval:<job_id>"`.	Repeated approval publish attempts do not create duplicate jobs within the dedup window.
Business idempotency	Redis job-store idempotency defaults to 90 days (`90 * 24 * time.Hour`).	Late retries and client reconnect loops are still guarded after broker dedup has expired.

JetStream duplicate window in Cordum

// core/infra/bus/nats.go (excerpt)
_, err := js.AddStream(&nats.StreamConfig{
  Name:       name,
  Subjects:   subjects,
  Retention:  nats.LimitsPolicy,
  Storage:    nats.FileStorage,
  MaxAge:     maxAge,
  Replicas:   replicas,
  Duplicates: 2 * time.Minute,
})

Msg-Id calculation with override label

// core/infra/bus/nats.go (excerpt)
const LabelBusMsgID = "cordum.bus_msg_id"

func computeMsgID(subject string, packet *pb.BusPacket) string {
  switch payload := packet.Payload.(type) {
  case *pb.BusPacket_JobRequest:
    if payload.JobRequest != nil {
      if override := strings.TrimSpace(payload.JobRequest.Labels[LabelBusMsgID]); override != "" {
        return "jobreq:" + subject + ":" + override
      }
      return "jobreq:" + strings.TrimSpace(payload.JobRequest.JobId)
    }
  }
  return ""
}

Approval retry key stabilization

// core/controlplane/gateway/handlers_approvals.go (excerpt)
// Stable idempotency key per job so NATS dedup works on retries.
req.Labels[bus.LabelBusMsgID] = "approval:" + jobID
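To see what the two excerpts above produce together, here is a minimal standalone sketch. It mirrors the `computeMsgID` branch logic using a plain struct instead of the protobuf types. The `jobRequest` type, the `msgIDFor` name, and the subject `sys.job.submit` are assumptions made for illustration, not names taken from the Cordum codebase.

```go
package main

import (
	"fmt"
	"strings"
)

const labelBusMsgID = "cordum.bus_msg_id"

// jobRequest is a hypothetical stand-in for the protobuf job request.
type jobRequest struct {
	JobID  string
	Labels map[string]string
}

// msgIDFor follows the same two branches as the computeMsgID excerpt:
// an override label wins and is scoped by subject; otherwise the job ID is used.
func msgIDFor(subject string, req *jobRequest) string {
	if req == nil {
		return ""
	}
	if override := strings.TrimSpace(req.Labels[labelBusMsgID]); override != "" {
		return "jobreq:" + subject + ":" + override
	}
	return "jobreq:" + strings.TrimSpace(req.JobID)
}

func main() {
	plain := &jobRequest{JobID: "job-123"}
	approved := &jobRequest{
		JobID:  "job-123",
		Labels: map[string]string{labelBusMsgID: "approval:job-123"},
	}
	fmt.Println(msgIDFor("sys.job.submit", plain))    // jobreq:job-123
	fmt.Println(msgIDFor("sys.job.submit", approved)) // jobreq:sys.job.submit:approval:job-123
}
```

Two details are worth noticing. The override key includes the subject while the default key does not, so the same override on two subjects yields two distinct Msg-Ids. And in the excerpt as shown, an empty job ID with no override would produce the bare key `jobreq:`; whether upstream validation rules that out is not visible here.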

Long-horizon idempotency TTL

// core/infra/store/job_store.go (excerpt)
// Idempotency keys must outlive the job lifecycle to prevent duplicate jobs.
idempotencyTTL := 90 * 24 * time.Hour
if v := os.Getenv("CORDUM_IDEMPOTENCY_TTL"); v != "" {
  if parsed, err := time.ParseDuration(v); err == nil && parsed > 0 {
    idempotencyTTL = parsed
  }
}
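The excerpt above only shows how the TTL is chosen. As a rough illustration of what that TTL buys, the sketch below models a reserve-if-absent idempotency check in memory. It is not the Cordum job store: `idemStore`, `Reserve`, and `atHours` are hypothetical names, and a Redis-backed version would typically rely on an atomic set-if-absent with expiry rather than a Go map.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTL mirrors the 90-day default shown in the excerpt.
const defaultTTL = 90 * 24 * time.Hour

type idemEntry struct {
	jobID     string
	expiresAt time.Time
}

// idemStore is a hypothetical in-memory stand-in for the job-store layer.
type idemStore struct {
	ttl     time.Duration
	entries map[string]idemEntry
}

func newIdemStore(ttl time.Duration) *idemStore {
	return &idemStore{ttl: ttl, entries: make(map[string]idemEntry)}
}

// Reserve returns the job ID bound to key. If the key is unseen or expired,
// it binds newJobID and reports created=true; otherwise it returns the
// existing job ID and created=false.
func (s *idemStore) Reserve(key, newJobID string, now time.Time) (string, bool) {
	if e, ok := s.entries[key]; ok && now.Before(e.expiresAt) {
		return e.jobID, false
	}
	s.entries[key] = idemEntry{jobID: newJobID, expiresAt: now.Add(s.ttl)}
	return newJobID, true
}

// atHours returns a fixed, arbitrary start time plus h hours.
func atHours(h int) time.Time {
	start := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	return start.Add(time.Duration(h) * time.Hour)
}

func main() {
	s := newIdemStore(defaultTTL)
	id, created := s.Reserve("key-1", "job-1", atHours(0))
	fmt.Println(id, created) // job-1 true
	id, created = s.Reserve("key-1", "job-2", atHours(1))
	fmt.Println(id, created) // job-1 false
	id, created = s.Reserve("key-1", "job-3", atHours(91*24))
	fmt.Println(id, created) // job-3 true
}
```

The retry at +1h is far outside a 2-minute broker window, yet it still resolves to the original job. Only after the TTL lapses does the same key create a new job.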

Msg-Id design rules

Rule 1: Base Msg-Id on the semantic operation, not transport metadata.

Rule 2: For controlled replays, use an explicit override key so operators can force "same operation" versus "new operation" behavior.

Rule 3: Keep broker dedup short and predictable. Put long retry protection in your state store where you control TTL and key scope.

Rule 4: Test the boundary. Most duplicate incidents happen at time-window edges, not in the happy path.
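Rules 1 and 2 can be illustrated with a small key builder. Everything here is an assumption made for the example: the `operation` fields, the `:` delimiter, and the idea of an optional replay token are one possible design, not Cordum's actual key format.

```go
package main

import (
	"fmt"
	"strings"
)

// operation lists the fields that define "the same operation" (hypothetical).
// Transport details such as connection ID, attempt count, or timestamps are
// deliberately absent, per Rule 1.
type operation struct {
	Tenant      string
	Kind        string // for example "approval" or "submit"
	TargetID    string
	ReplayToken string // optional; set by an operator to force a new operation (Rule 2)
}

// msgID builds a deterministic key from semantic fields only.
// It assumes the fields themselves never contain ':'.
func msgID(op operation) string {
	parts := []string{op.Tenant, op.Kind, op.TargetID}
	if op.ReplayToken != "" {
		parts = append(parts, op.ReplayToken)
	}
	return strings.Join(parts, ":")
}

func main() {
	first := operation{Tenant: "tenant-a", Kind: "approval", TargetID: "job-123"}
	retry := first // a network retry carries the same semantic fields
	replay := operation{Tenant: "tenant-a", Kind: "approval", TargetID: "job-123", ReplayToken: "replay-1"}

	fmt.Println(msgID(first))                 // tenant-a:approval:job-123
	fmt.Println(msgID(first) == msgID(retry)) // true
	fmt.Println(msgID(replay))                // tenant-a:approval:job-123:replay-1
}
```

Readable keys make duplicate incidents easier to trace in logs. If field values can contain the delimiter, hashing the joined fields or escaping them would be a safer variant.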

Validation runbook

Run this in staging before rollout. If you never test the +121s retry case, you will eventually learn about it in production.

Dedup boundary validation

# 1) Publish a job packet with Nats-Msg-Id = jobreq:job-123
# 2) Republish same packet at +30s (expect dedup)
# 3) Republish same packet at +150s (outside the 2m window; expect it may be accepted as new)
# 4) Submit same API request with same Idempotency-Key at +1h (expect existing job id)
# 5) Force approval endpoint retry and verify stable cordum.bus_msg_id behavior

Limitations and tradeoffs

Approach	Upside	Downside
Rely only on JetStream duplicate window	Simple setup and low app complexity.	Late retries outside the window can create duplicate business operations.
Broker dedup + app idempotency (Cordum pattern)	Good protection for both short reconnect storms and long retry tails.	Two layers to reason about and monitor.
Infinite per-subject dedup at broker	Stronger duplicate suppression in stream state.	More stream design constraints and operational complexity for evolving workflows.

Next step

Add an automated retry-boundary test suite that replays publish attempts at +30s, +119s, +121s, and +1h with fixed Msg-Id and fixed API idempotency key, then assert exactly which layer blocks each duplicate.
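As a starting point for such a suite, the expected outcomes can be written down as a small table before any broker is involved. The sketch below is a model only, under stated assumptions: `expectedGuard` and `secs` are hypothetical helpers, the broker is treated as the first guard and the store as the backstop, and a retry exactly at the window edge is treated as outside the window, which a real broker may or may not do.

```go
package main

import (
	"fmt"
	"time"
)

const (
	brokerWindow   = 2 * time.Minute     // JetStream Duplicates setting from this article
	idempotencyTTL = 90 * 24 * time.Hour // job-store default from this article
)

// expectedGuard names the layer expected to stop a retry that arrives at the
// given offset after the first publish, assuming the Msg-Id and the API
// idempotency key are both unchanged.
func expectedGuard(offset time.Duration) string {
	switch {
	case offset < brokerWindow:
		return "broker"
	case offset < idempotencyTTL:
		return "store"
	default:
		return "none"
	}
}

// secs converts whole seconds to a time.Duration.
func secs(n int) time.Duration {
	return time.Duration(n) * time.Second
}

func main() {
	for _, s := range []int{30, 119, 121, 3600} {
		fmt.Printf("+%ds -> %s\n", s, expectedGuard(secs(s)))
	}
	// Prints:
	// +30s -> broker
	// +119s -> broker
	// +121s -> store
	// +3600s -> store
}
```

A real suite would replace `expectedGuard` with assertions against the publish ack and the API response, and keep this table as the expected side of each comparison.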

