The production problem
Teams usually hit partitioning pain in one of two ways. Either throughput is capped by a single queue, or ordering bugs appear after aggressive parallelization.
The root cause is often hidden in key design. If keys are too coarse, hot partitions form. If keys are too fine, ordering semantics disappear.
Partitioning is not just a transport concern. It is an application correctness decision.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS Subject Mapping and Partitioning | Deterministic token partitioning (`partition(n, ...)`) and explicit note that queue groups are non-deterministic for per-subject ordering. | No guidance on mapping partition design to tenant fairness limits and scheduler reason-code operations. |
| Amazon SQS high throughput FIFO | How `MessageGroupId` hashes to partitions, ordering scope per group, and throughput boundaries per partition. | No playbook for separating key-skew issues from pool capacity saturation in multi-tenant agent schedulers. |
| Google Pub/Sub ordered messaging | Within-key ordering, across-key non-ordering, redelivery behavior, and hot-key tradeoffs for ordered subscribers. | No direct runbook for tying ordering-key hotspots to replay debt and stale-job recovery metrics. |
Partitioning model
Pick partition strategy by workflow semantics first, then tune for throughput. Performance-first partitioning that violates business ordering is a hidden correctness bug.
| Strategy | Best for | Primary risk | Mitigation |
|---|---|---|---|
| Key by tenant | Strong tenant isolation and fairness | Hot tenants become hot partitions | Add secondary key for high-volume tenant substreams |
| Key by entity/workflow ID | Per-entity ordering guarantees | Skewed entities can dominate throughput | Detect key skew and rebalance with consistent hashing versioning |
| Round-robin | Raw throughput with low ordering requirements | Ordering-sensitive tasks break | Use only for idempotent/stateless tasks |
| Priority + partition hybrid | Mixed urgency workloads | Priority inversion and starvation | Set fair-share floors and tenant caps |
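The hot-tenant mitigation in the first row can be sketched as a secondary key. Everything here is illustrative (the `hotTenants` map, the `#` substream suffix, and the hard-coded split factor are assumptions, not a Cordum API): hot tenants are split into substreams derived from the workflow ID, so per-workflow ordering survives while the tenant's load spreads.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hotTenants would normally be fed by skew detection; hard-coded here
// for illustration. The value is the number of substreams to split into.
var hotTenants = map[string]int{"acme": 4}

// PartitionKey keys by tenant for isolation, but splits known-hot tenants
// into substreams. The substream is derived from the workflow ID, so
// per-workflow ordering still holds; only cross-workflow ordering within
// the hot tenant is given up.
func PartitionKey(tenantID, workflowID string) string {
	if n, hot := hotTenants[tenantID]; hot {
		h := fnv.New32a()
		h.Write([]byte(workflowID))
		return fmt.Sprintf("%s#%d", tenantID, h.Sum32()%uint32(n))
	}
	return tenantID
}
```

Because the substream is a deterministic function of the workflow ID, the same workflow always routes to the same substream, which is what preserves its ordering through the split.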
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Fairness failure diagnostics | Scheduler reason codes include `tenant_limit`, `pool_overloaded`, `no_workers` | Helps separate partition-key issues from raw capacity exhaustion. |
| Retry pressure boundaries | Max scheduling retries 50 with 1s-30s backoff and 2s no-worker delay | Partitioning strategy must account for retry amplification under local hotspots. |
| Dispatch health signal | Dispatch p99 > 1s is an existing warning threshold | A fast indicator that partition balance is degrading. |
| Recovery debt visibility | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Shows whether partition-specific failures are resolved safely after outages. |
| Tenant guardrail enforcement | `max_concurrent_jobs` policy is enforced per tenant | Prevents a single tenant’s partition burst from consuming all dispatch capacity. |
Implementation examples
Deterministic partition key selection (Go)
```go
package partitioning

import "hash/fnv"

type Job struct {
	TenantID   string
	WorkflowID string
	Priority   int
}

// PartitionKey preserves per-tenant and per-workflow ordering.
func PartitionKey(j Job) string {
	return j.TenantID + ":" + j.WorkflowID
}

// PartitionIndex deterministically maps a key onto one of partitionCount
// partitions using an FNV-1a hash.
func PartitionIndex(key string, partitionCount int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitionCount))
}
```

Partitioning and fairness config (YAML)
```yaml
partitioning:
  partitions: 32
  key_strategy: tenant_workflow
fairness:
  max_concurrent_jobs_per_tenant: 40
  min_share_per_priority_tier:
    p0: 0.40
    p1: 0.40
    p2: 0.20
retry:
  max_scheduling_retries: 50
  backoff_base: 1s
  backoff_max: 30s
alerts:
  dispatch_p99_seconds: "> 1"
  stale_jobs: "> 50"
```

Partition health validation queries (PromQL)
```promql
# Dispatch latency signal
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Recovery debt
cordum_scheduler_stale_jobs
rate(cordum_scheduler_orphan_replayed_total[5m])

# Policy dependency stress
rate(cordum_safety_unavailable_total[5m])
```

Limitations and tradeoffs
- More partitions increase concurrency but also increase coordination and operational complexity.
- Changing key strategy later can require migration and temporary dual-write operations.
- Strict per-key ordering can cap throughput for high-cardinality hot keys.
- Queue-level partitioning alone does not solve downstream tool bottlenecks.
Next step
Run this in one sprint:
1. Choose one stable partition key for your highest-volume workflow family.
2. Define fairness limits (`max_concurrent_jobs`) and reason-code alerting.
3. Run one load test to verify dispatch p99 and failed-ratio guardrails.
4. Simulate one partition hotspot and confirm recovery via stale/replay metrics.
Continue with AI Agent Capacity Planning Model and AI Agent Multi-Tenant Isolation.