The production problem
Teams usually hit partitioning pain in one of two ways. Either throughput is capped by a single queue, or ordering bugs appear after aggressive parallelization.
The root cause is often hidden in key design. If keys are too coarse, hot partitions form. If keys are too fine, ordering semantics disappear.
Partitioning is not just a transport concern. It is an application correctness decision.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS Subject Mapping and Partitioning | Deterministic token partitioning (`partition(n, ...)`) and explicit note that queue groups are non-deterministic for per-subject ordering. | No guidance on mapping partition design to tenant fairness limits and scheduler reason-code operations. |
| Amazon SQS high throughput FIFO | How `MessageGroupId` hashes to partitions, ordering scope per group, and throughput boundaries per partition. | No playbook for separating key-skew issues from pool capacity saturation in multi-tenant agent schedulers. |
| Google Pub/Sub ordered messaging | Within-key ordering, across-key non-ordering, redelivery behavior, and hot-key tradeoffs for ordered subscribers. | No direct runbook for tying ordering-key hotspots to replay debt and stale-job recovery metrics. |
Partitioning model
Pick partition strategy by workflow semantics first, then tune for throughput. Performance-first partitioning that violates business ordering is a hidden correctness bug.
| Strategy | Best for | Primary risk | Mitigation |
|---|---|---|---|
| Key by tenant | Strong tenant isolation and fairness | Hot tenants become hot partitions | Add secondary key for high-volume tenant substreams |
| Key by entity/workflow ID | Per-entity ordering guarantees | Skewed entities can dominate throughput | Detect key skew and rebalance with consistent hashing versioning |
| Round-robin | Raw throughput with low ordering requirements | Ordering-sensitive tasks break | Use only for idempotent/stateless tasks |
| Priority + partition hybrid | Mixed urgency workloads | Priority inversion and starvation | Set fair-share floors and tenant caps |
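The hot-tenant mitigation in the first row can be sketched as a secondary key. Everything here is illustrative (the `hotTenants` map, the `#` substream suffix, and the hard-coded split factor are assumptions, not a Cordum API): hot tenants are split into substreams derived from the workflow ID, so per-workflow ordering survives while the tenant's load spreads.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hotTenants would normally be fed by skew detection; hard-coded here
// for illustration. The value is the number of substreams to split into.
var hotTenants = map[string]int{"acme": 4}

// PartitionKey keys by tenant for isolation, but splits known-hot tenants
// into substreams. The substream is derived from the workflow ID, so
// per-workflow ordering still holds; only cross-workflow ordering within
// the hot tenant is given up.
func PartitionKey(tenantID, workflowID string) string {
	if n, hot := hotTenants[tenantID]; hot {
		h := fnv.New32a()
		h.Write([]byte(workflowID))
		return fmt.Sprintf("%s#%d", tenantID, h.Sum32()%uint32(n))
	}
	return tenantID
}
```

Because the substream is a deterministic function of the workflow ID, the same workflow always routes to the same substream, which is what preserves its ordering through the split.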
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Fairness failure diagnostics | Scheduler reason codes include `tenant_limit`, `pool_overloaded`, `no_workers` | Helps separate partition-key issues from raw capacity exhaustion. |
| Retry pressure boundaries | Max scheduling retries 50 with 1s-30s backoff and 2s no-worker delay | Partitioning strategy must account for retry amplification under local hotspots. |
| Dispatch health signal | Dispatch p99 > 1s is an existing warning threshold | A fast indicator that partition balance is degrading. |
| Recovery debt visibility | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Shows whether partition-specific failures are resolved safely after outages. |
| Tenant guardrail enforcement | `max_concurrent_jobs` policy is enforced per tenant | Prevents a single tenant’s partition burst from consuming all dispatch capacity. |
Implementation examples
Deterministic partition key selection (Go)
```go
package partitioning

import "hash/fnv"

type Job struct {
	TenantID   string
	WorkflowID string
	Priority   int
}

// PartitionKey preserves per-tenant and per-workflow ordering.
func PartitionKey(j Job) string {
	return j.TenantID + ":" + j.WorkflowID
}

// PartitionIndex deterministically maps a key onto one of partitionCount
// partitions using an FNV-1a hash.
func PartitionIndex(key string, partitionCount int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitionCount))
}
```

Partitioning and fairness config (YAML)
```yaml
partitioning:
  partitions: 32
  key_strategy: tenant_workflow
fairness:
  max_concurrent_jobs_per_tenant: 40
  min_share_per_priority_tier:
    p0: 0.40
    p1: 0.40
    p2: 0.20
retry:
  max_scheduling_retries: 50
  backoff_base: 1s
  backoff_max: 30s
alerts:
  dispatch_p99_seconds: "> 1"
  stale_jobs: "> 50"
```

Partition health validation queries (PromQL)
```promql
# Dispatch latency signal
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Recovery debt
cordum_scheduler_stale_jobs
rate(cordum_scheduler_orphan_replayed_total[5m])

# Policy dependency stress
rate(cordum_safety_unavailable_total[5m])
```

Limitations and tradeoffs
- More partitions increase concurrency but also increase coordination and operational complexity.
- Changing key strategy later can require migration and temporary dual-write operations.
- Strict per-key ordering can cap throughput for high-cardinality hot keys.
- Queue-level partitioning alone does not solve downstream tool bottlenecks.
Next step
Run this in one sprint:
1. Choose one stable partition key for your highest-volume workflow family.
2. Define fairness limits (`max_concurrent_jobs`) and reason-code alerting.
3. Run one load test to verify dispatch p99 and failed-ratio guardrails.
4. Simulate one partition hotspot and confirm recovery via stale/replay metrics.
Continue with AI Agent Capacity Planning Model and AI Agent Multi-Tenant Isolation.