The production problem
Teams often expect both broadcast and queue semantics from the same subject without encoding that intent anywhere in the subscription setup.
In an HA scheduler deployment, that mistake can hide in plain sight: one replica sees every signal, the others see none, and no exception is thrown.
For AI agent control planes, this can break alert fanout, progress streams, or control signals exactly when you need multi-replica visibility.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS queue groups docs | Queue-group load balancing where one consumer in group handles each message. | No app-level durable-name strategy for distinguishing queue and broadcast subscriptions. |
| NATS JetStream consumers docs | Durable vs ephemeral consumers and flow-control behavior. | No concrete HA scheduler pattern to avoid accidental one-replica delivery on broadcast channels. |
| RabbitMQ exchanges docs | Fanout exchange semantics: copy to every bound queue. | No JetStream queue/durable naming split for mixed fanout and competing-consumer paths. |
Public docs explain primitives. The gap is implementation discipline for mixed semantics in real scheduler replicas.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Durable subject gate | Cordum enables JetStream durability for submit/result/DLQ/audit, `job.*`, and `worker.*.jobs`. | Critical delivery paths persist; informational subjects stay on core NATS. |
| Broadcast durable name | `durableName(subject, queue)` returns empty when `queue` is empty. | Each replica gets an ephemeral consumer and receives broadcast messages. |
| Queue durable name | Non-empty queue yields deterministic durable name `dur_<queue>__<subject>`. | Replicas in same queue group share one consumer state and one-of-N delivery. |
| Subscribe branch | Empty queue uses `js.Subscribe`; queue mode uses `js.QueueSubscribe`. | Delivery semantics are decided by one explicit parameter. |
| Broadcast test | `TestBroadcastWithJetStream` asserts two replicas each receive the same DLQ message. | Protects against regressions that collapse fanout into one-replica delivery. |
| Queue-group test | `TestQueueGroupWithJetStream` verifies no duplicate total deliveries across replicas. | Confirms work-distribution mode stays load-balanced and non-duplicating. |
Code-level mechanics
Durable-name split by queue intent (Go)
```go
func durableName(subject, queue string) string {
	name := sanitize(subject)
	if name == "" {
		return ""
	}
	if queue == "" {
		// broadcast: ephemeral consumer per replica
		return ""
	}
	q := sanitize(queue)
	if q == "" {
		return "dur_" + name
	}
	return "dur_" + q + "__" + name
}
```

Subscribe path selects semantics (Go)
```go
opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries),
}
if durable := durableName(subject, queue); durable != "" {
	opts = append(opts, nats.Durable(durable))
}
if queue == "" {
	sub, err = b.js.Subscribe(subject, cb, opts...)
} else {
	sub, err = b.js.QueueSubscribe(subject, queue, cb, opts...)
}
```

Tests that pin expected behavior (Go)
```go
func TestDurableName(t *testing.T) {
	// broadcast with empty queue must be ephemeral
	if got := durableName("job.test.*", ""); got != "" {
		t.Fatalf("expected empty durable name for broadcast, got %s", got)
	}
}

func TestBroadcastWithJetStream(t *testing.T) {
	// two replicas subscribe with empty queue
	// publish one DLQ message
	// assert both replicas receive it
}

func TestQueueGroupWithJetStream(t *testing.T) {
	// queue group mode: total deliveries == published messages (no duplication)
}
```

The tests are the real guardrail. If someone simplifies durable naming later, CI should fail before rollout.
Operator runbook
```bash
# 1) Verify subscription intent by subject class
#    - Broadcast signal? use empty queue
#    - Work distribution? use non-empty queue group

# 2) Run bus semantics regression tests
cd D:/Cordum/cordum
go test ./core/infra/bus -run "TestDurableName|TestBroadcastWithJetStream|TestQueueGroupWithJetStream"

# 3) HA smoke test (broadcast path)
#    - start 2 scheduler replicas
#    - publish one DLQ event
#    - assert both replicas log handler execution

# 4) Queue path smoke test
#    - publish N job messages
#    - assert total processed == N (not 2N)

# 5) During rollout, monitor for:
#    - sudden drop in broadcast handler invocations per replica
#    - unexpected single-replica ownership of broadcast-only signals
```
Limitations and tradeoffs
- Ephemeral broadcast consumers can miss messages during disconnect/reconnect windows.
- Durable queue consumers favor load distribution, not per-replica observability.
- Semantics bugs are configuration bugs; they rarely fail loudly.
- Tests reduce risk, but rollout smoke checks are still required in multi-replica clusters.
If a subject must be seen by every replica, never bind it to a shared queue group. Model it as broadcast on purpose.
Next step
Audit your current subscriptions by intent:
1. Mark each subject as broadcast or queue-distributed.
2. Confirm queue parameter matches that intent in code.
3. Add a two-replica smoke test for at least one subject in each class.
4. Fail CI if those semantics tests regress.
Continue with *MaxAckPending Tuning* and *AckWait and Dedup TTL Alignment*.