The production problem
Teams often expect broadcast behavior and queue behavior from the same subject without encoding that intent in subscription setup.
In an HA scheduler deployment, that mistake can hide in plain sight: one replica sees every signal, the others see none, and no exception is thrown.
For AI agent control planes, this can break alert fanout, progress streams, or control signals exactly when you need multi-replica visibility.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS queue groups docs | Queue-group load balancing where one consumer in group handles each message. | No app-level durable-name strategy for distinguishing queue and broadcast subscriptions. |
| NATS JetStream consumers docs | Durable vs ephemeral consumers and flow-control behavior. | No concrete HA scheduler pattern to avoid accidental one-replica delivery on broadcast channels. |
| NATS by Example: Queue Push Consumers | Practical queue push consumer behavior, deliver group mechanics, and durable conventions. | No subject classification policy for deciding which broadcast signals must avoid JetStream. |
Public docs explain primitives. The gap is implementation discipline for mixed semantics in real scheduler replicas.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Durable subject gate | Cordum enables JetStream durability for submit/result/DLQ/audit, `job.*`, and `worker.*.jobs`. | Critical delivery paths persist; informational subjects stay on core NATS. |
| Broadcast durable name | `durableName(subject, queue)` returns empty when `queue` is empty. | Each replica gets an ephemeral consumer and receives broadcast messages. |
| Ephemeral restart window | Ephemeral consumers are removed when disconnected; publish events between disconnect and reconnect are not replayed. | Broadcast-on-JetStream can drop signals during rolling restarts unless subject semantics tolerate at-most-once gaps. |
| Queue durable name | Non-empty queue yields deterministic durable name `dur_<queue>__<subject>`. | Replicas in same queue group share one consumer state and one-of-N delivery. |
| Subscribe branch | Empty queue uses `js.Subscribe`; queue mode uses `js.QueueSubscribe`. | Delivery semantics are decided by one explicit parameter. |
| Broadcast test | `TestBroadcastWithJetStream` asserts two replicas each receive the same DLQ message. | Protects against regressions that collapse fanout into one-replica delivery. |
| Queue-group test | `TestQueueGroupWithJetStream` verifies no duplicate total deliveries across replicas. | Confirms work-distribution mode stays load-balanced and non-duplicating. |
Subject classification rules
The durable-name rule works only when subject classes are defined up front. This is the real gap in most public docs: deciding what can tolerate at-most-once fanout loss during restarts.
| Subject class | Example subjects | Recommended policy | Why |
|---|---|---|---|
| Self-healing broadcast signals | `sys.heartbeat`, `sys.handshake`, `sys.config.changed` | Keep on Core NATS broadcast (non-durable). | Recovery path already exists (periodic heartbeat/poll/re-register). |
| State-carrying command/result paths | `sys.submit`, `sys.result`, `sys.dlq`, `job.*`, `worker.*.jobs` | Use JetStream durable consumers. | Message loss changes system state and must be replayable. |
| Informational fanout | `sys.alert`, `sys.job.progress`, `sys.workflow.event` | Core NATS broadcast unless requirements change. | At-most-once is acceptable and lower operational overhead. |
Quick rollout loss-budget check
```bash
# Estimate replay-gap risk for a JetStream broadcast subject:
#   risk_messages = publish_rate_per_sec * disconnect_window_sec
publish_rate_per_sec=20
disconnect_window_sec=8
risk_messages=$((publish_rate_per_sec * disconnect_window_sec))
echo "Potential dropped broadcast messages during rollout: $risk_messages"
```
Code-level mechanics
Durable-name split by queue intent (Go)

```go
func durableName(subject, queue string) string {
	name := sanitize(subject)
	if name == "" {
		return ""
	}
	if queue == "" {
		// Broadcast: ephemeral consumer per replica.
		return ""
	}
	q := sanitize(queue)
	if q == "" {
		return "dur_" + name
	}
	return "dur_" + q + "__" + name
}
```

Subscribe path selects semantics (Go)

```go
opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries),
}
if durable := durableName(subject, queue); durable != "" {
	opts = append(opts, nats.Durable(durable))
}
if queue == "" {
	sub, err = b.js.Subscribe(subject, cb, opts...)
} else {
	sub, err = b.js.QueueSubscribe(subject, queue, cb, opts...)
}
```

Tests that pin expected behavior (Go)

```go
func TestDurableName(t *testing.T) {
	// Broadcast with an empty queue must be ephemeral.
	if got := durableName("job.test.*", ""); got != "" {
		t.Fatalf("expected empty durable name for broadcast, got %s", got)
	}
}

func TestBroadcastWithJetStream(t *testing.T) {
	// Two replicas subscribe with an empty queue,
	// one DLQ message is published,
	// and both replicas must receive it.
}

func TestQueueGroupWithJetStream(t *testing.T) {
	// Queue-group mode: total deliveries == published messages (no duplication).
}
```

The tests are the real guardrail. If someone simplifies durable naming later, CI should fail before rollout.
Operator runbook
```bash
# 1) Verify subscription intent by subject class
#    - Broadcast signal? Use an empty queue.
#    - Work distribution? Use a non-empty queue group.
# 2) Estimate the restart-window loss budget for any JetStream broadcast candidate:
#    risk_messages = publish_rate_per_sec * disconnect_window_sec
# 3) Run bus semantics regression tests
cd D:/Cordum/cordum
go test ./core/infra/bus -run "TestDurableName|TestBroadcastWithJetStream|TestQueueGroupWithJetStream"
# 4) HA smoke test (broadcast path)
#    - Start 2 scheduler replicas.
#    - Publish one DLQ event.
#    - Assert both replicas log handler execution.
# 5) Queue path smoke test
#    - Publish N job messages.
#    - Assert total processed == N (not 2N).
# 6) During rollout, monitor for:
#    - a sudden drop in broadcast handler invocations per replica
#    - unexpected single-replica ownership of broadcast-only signals
```
Limitations and tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| Ephemeral broadcast consumers | Every replica receives live fanout traffic. | No replay during disconnect windows. |
| Durable queue consumers | Reliable one-of-N work distribution. | Not suitable when every replica must see each signal. |
| Subject classification policy + tests | Semantics become explicit and regression-resistant. | Requires ongoing maintenance as new subjects are added. |
If a subject must be seen by every replica, never bind it to a shared queue group. Model it as broadcast on purpose.
Next step
Audit your current subscriptions by intent:
1. Mark each subject as broadcast or queue-distributed.
2. Confirm the queue parameter matches that intent in code.
3. Add a two-replica smoke test for at least one subject in each class.
4. Fail CI if those semantics tests regress.
Continue with *MaxAckPending Tuning* and *AckWait and Dedup TTL Alignment*.