The production problem
Your publisher sends a job. The connection drops before the publisher sees an ack, so it retries.
If the retry lands inside the broker dedup window with the same `Nats-Msg-Id`, you are safe.
If the retry lands after the window, the broker may accept it as a new message. If the operation is expensive or has side effects, it may then run twice.
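To make the two cases concrete, here is a minimal sketch of a duplicate window as a toy in-memory model. It is not the JetStream implementation: the `dedupWindow` type, the `window` constant, and the exact boundary comparison are assumptions made for illustration, and the 2-minute value is simply an example setting.

```go
package main

import (
	"fmt"
	"time"
)

// window is an assumed duplicate-window length used only for this sketch.
const window = 2 * time.Minute

// dedupWindow is a hypothetical stand-in for a broker duplicate window.
// It remembers when each message ID was last accepted as new.
type dedupWindow struct {
	window time.Duration
	seen   map[string]time.Duration // message ID -> offset of last accept
}

func newDedupWindow(w time.Duration) *dedupWindow {
	return &dedupWindow{window: w, seen: make(map[string]time.Duration)}
}

// accept reports whether a publish arriving at the given offset is stored
// as a new message. It is dropped only if the same ID was accepted less
// than one window earlier.
func (d *dedupWindow) accept(msgID string, offset time.Duration) bool {
	if first, ok := d.seen[msgID]; ok && offset-first < d.window {
		return false
	}
	d.seen[msgID] = offset
	return true
}

func main() {
	w := newDedupWindow(window)
	fmt.Println("first publish accepted:", w.accept("jobreq:job-123", 0))
	fmt.Println("retry at +30s accepted:", w.accept("jobreq:job-123", 30*time.Second))
	fmt.Println("retry at +150s accepted:", w.accept("jobreq:job-123", 150*time.Second))
}
```

In this model the +30s retry is dropped and the +150s retry is stored as new, which is the case that needs protection from a second layer.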
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: JetStream Streams | Duplicate window configuration (`Duplicates`) and stream-level dedup behavior. | No application-level key design guidance for approval retries, resubmits, and tenant-scoped operations. |
| NATS docs: JetStream Model Deep Dive | `Nats-Msg-Id` semantics and duplicate suppression model. | No control-plane pattern for combining broker dedup with longer idempotency horizons. |
| NATS Blog: Infinite message deduplication in JetStream | How to push dedup beyond window constraints with `DiscardNewPerSubject` patterns. | No concrete mapping for multi-step AI workflow operations where retries and approvals mutate request labels. |
Cordum runtime mechanics
Cordum treats broker dedup and business idempotency as separate layers. The split is deliberate.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Broker dedup window | Cordum configures JetStream streams with `Duplicates: 2 * time.Minute`. | Retries republished within 2 minutes are deduplicated; later retries may be accepted as new messages. |
| Publish-time key | `computeMsgID` derives keys from typed packet fields and supports a `cordum.bus_msg_id` label override. | Stable keys are possible when the caller needs deterministic dedup for retries. |
| Approval requeue | Gateway approval path sets `req.Labels["cordum.bus_msg_id"] = "approval:<job_id>"`. | Repeated approval publish attempts do not create duplicate jobs within the dedup window. |
| Business idempotency | Redis job-store idempotency keys default to a TTL of `90 * 24 * time.Hour` (90 days), overridable via `CORDUM_IDEMPOTENCY_TTL`. | Late retries and client reconnect loops are still guarded after broker dedup has expired. |

// core/infra/bus/nats.go (excerpt)
_, err := js.AddStream(&nats.StreamConfig{
	Name:       name,
	Subjects:   subjects,
	Retention:  nats.LimitsPolicy,
	Storage:    nats.FileStorage,
	MaxAge:     maxAge,
	Replicas:   replicas,
	Duplicates: 2 * time.Minute,
})

// core/infra/bus/nats.go (excerpt)
const LabelBusMsgID = "cordum.bus_msg_id"

func computeMsgID(subject string, packet *pb.BusPacket) string {
	switch payload := packet.Payload.(type) {
	case *pb.BusPacket_JobRequest:
		if payload.JobRequest != nil {
			if override := strings.TrimSpace(payload.JobRequest.Labels[LabelBusMsgID]); override != "" {
				return "jobreq:" + subject + ":" + override
			}
			return "jobreq:" + strings.TrimSpace(payload.JobRequest.JobId)
		}
	}
	return ""
}

// core/controlplane/gateway/handlers_approvals.go (excerpt)
// Stable idempotency key per job so NATS dedup works on retries.
req.Labels[bus.LabelBusMsgID] = "approval:" + jobID

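The effect of the override is easier to see with concrete values. The following sketch re-implements the key rule from the excerpt against a plain struct so it runs without the protobuf types. The `jobRequest` struct, the `msgIDFor` helper, and the subject name `sys.job.submit` are hypothetical stand-ins, not Cordum identifiers.

```go
package main

import (
	"fmt"
	"strings"
)

// labelBusMsgID is the label key shown in the excerpt.
const labelBusMsgID = "cordum.bus_msg_id"

// jobRequest is a hypothetical plain-struct stand-in for the protobuf
// job request; only the fields used for key derivation are modeled.
type jobRequest struct {
	JobID  string
	Labels map[string]string
}

// msgIDFor follows the same rule as the excerpt: an explicit label
// override wins and is scoped by subject, otherwise the job ID is used.
func msgIDFor(subject string, req *jobRequest) string {
	if req == nil {
		return ""
	}
	// Reading from a nil map is safe in Go and yields the empty string.
	if override := strings.TrimSpace(req.Labels[labelBusMsgID]); override != "" {
		return "jobreq:" + subject + ":" + override
	}
	return "jobreq:" + strings.TrimSpace(req.JobID)
}

func main() {
	plain := &jobRequest{JobID: "job-123"}
	approved := &jobRequest{
		JobID:  "job-123",
		Labels: map[string]string{labelBusMsgID: "approval:job-123"},
	}
	fmt.Println(msgIDFor("sys.job.submit", plain))    // jobreq:job-123
	fmt.Println(msgIDFor("sys.job.submit", approved)) // jobreq:sys.job.submit:approval:job-123
}
```

Note that the override form includes the subject while the default form does not, so the same override value published on two subjects yields two distinct keys.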
// core/infra/store/job_store.go (excerpt)
// Idempotency keys must outlive the job lifecycle to prevent duplicate jobs.
idempotencyTTL := 90 * 24 * time.Hour
if v := os.Getenv("CORDUM_IDEMPOTENCY_TTL"); v != "" {
	if parsed, err := time.ParseDuration(v); err == nil && parsed > 0 {
		idempotencyTTL = parsed
	}
}

Msg-Id design rules
Rule 1: Base Msg-Id on the semantic operation, not transport metadata.
Rule 2: For controlled replays, use an explicit override key so operators can force "same operation" versus "new operation" behavior.
Rule 3: Keep broker dedup short and predictable. Put long retry protection in your state store where you control TTL and key scope.
Rule 4: Test the boundary. Most duplicate incidents happen at time-window edges, not in the happy path.
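Rule 1 can be sketched as a key builder that hashes only the fields describing the operation and ignores everything the transport adds. This is an illustration under assumptions, not Cordum's key scheme: the `operation` struct, its fields, and `semanticMsgID` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// operation is a hypothetical description of one business operation.
type operation struct {
	Tenant  string // scope: the same job ID in two tenants is two operations
	Kind    string // e.g. "approve" or "resubmit"
	JobID   string
	Attempt int // transport metadata: deliberately excluded from the key
}

// semanticMsgID builds a key from the semantic fields only, so every
// retry of the same operation produces the same Msg-Id.
func semanticMsgID(op operation) string {
	// A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
	joined := strings.Join([]string{op.Tenant, op.Kind, op.JobID}, "\x00")
	sum := sha256.Sum256([]byte(joined))
	return op.Kind + ":" + hex.EncodeToString(sum[:8])
}

func main() {
	first := operation{Tenant: "acme", Kind: "approve", JobID: "job-123", Attempt: 1}
	retry := operation{Tenant: "acme", Kind: "approve", JobID: "job-123", Attempt: 2}
	other := operation{Tenant: "globex", Kind: "approve", JobID: "job-123", Attempt: 1}

	fmt.Println("retry keeps the key:", semanticMsgID(first) == semanticMsgID(retry))
	fmt.Println("other tenant differs:", semanticMsgID(first) != semanticMsgID(other))
}
```

Truncating the hash to 8 bytes keeps keys short for this sketch; a production scheme would need to weigh key length against collision risk.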
Validation runbook
Run this in staging before rollout. If you never test the +121s retry case, you will eventually learn about it in production.
# 1) Publish a job packet with Nats-Msg-Id = jobreq:job-123
# 2) Republish same packet at +30s (expect dedup)
# 3) Republish same packet at +150s (outside 2m window; expect possible new accept)
# 4) Submit same API request with same Idempotency-Key at +1h (expect existing job id)
# 5) Force approval endpoint retry and verify stable cordum.bus_msg_id behavior
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Rely only on JetStream duplicate window | Simple setup and low app complexity. | Late retries outside the window can create duplicate business operations. |
| Broker dedup + app idempotency (Cordum pattern) | Good protection for both short reconnect storms and long retry tails. | Two layers to reason about and monitor. |
| Infinite per-subject dedup at broker | Stronger duplicate suppression in stream state. | More stream design constraints and operational complexity for evolving workflows. |
Next step
Add an automated retry-boundary test suite that replays publish attempts at +30s, +119s, +121s, and +1h with fixed Msg-Id and fixed API idempotency key, then assert exactly which layer blocks each duplicate.
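Before wiring such a suite to a real broker, the expected outcome per offset can be written down as a small table. The sketch below assumes a simplified model in which every retry carries both the fixed Msg-Id and the fixed idempotency key, the broker layer is checked first, and both timers start at the original publish. `blockingLayer` is a hypothetical helper, and behavior exactly at the window edge should be confirmed against the real broker rather than taken from this model.

```go
package main

import (
	"fmt"
	"time"
)

const (
	brokerWindow   = 2 * time.Minute    // matches Duplicates in the stream config
	idempotencyTTL = 90 * 24 * time.Hour // matches the job-store default
)

// blockingLayer names the layer expected to stop a retry that arrives
// offset after the original publish, under the simplified model above.
func blockingLayer(offset time.Duration) string {
	switch {
	case offset < brokerWindow:
		return "broker"
	case offset < idempotencyTTL:
		return "store"
	default:
		return "none"
	}
}

func main() {
	offsets := []time.Duration{
		30 * time.Second,
		119 * time.Second,
		121 * time.Second,
		time.Hour,
	}
	for _, offset := range offsets {
		fmt.Printf("retry at +%v: expect block by %s\n", offset, blockingLayer(offset))
	}
}
```

A real test would replace `blockingLayer` with assertions on the publish ack and on the job ID returned by the API, using this table as the expected values.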