The production problem
Teams usually hit partitioning pain in one of two ways. Either throughput is capped by a single queue, or ordering bugs appear after aggressive parallelization.
The root cause is often hidden in key design. If keys are too coarse, hot partitions form. If keys are too fine, ordering semantics disappear.
Partitioning is not just a transport concern. It is an application correctness decision.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Apache Kafka Introduction | Clear partition-order tradeoff: total order within partition, not across partitions. | No governance-aware dispatch model for autonomous agents with policy gates. |
| NATS Subject Mapping and Partitioning | Deterministic subject partitioning concepts and ordering constraints. | No tenant-limit and policy-failure integration for agent orchestration. |
| RabbitMQ Sharding Plugin README | Practical sharding mechanics and explicit note that total ordering is sacrificed. | No cross-partition replay strategy for long-running autonomous workflows. |
Partitioning model
Pick partition strategy by workflow semantics first, then tune for throughput. Performance-first partitioning that violates business ordering is a hidden correctness bug.
| Strategy | Best for | Primary risk | Mitigation |
|---|---|---|---|
| Key by tenant | Strong tenant isolation and fairness | Hot tenants become hot partitions | Add secondary key for high-volume tenant substreams |
| Key by entity/workflow ID | Per-entity ordering guarantees | Skewed entities can dominate throughput | Detect key skew and rebalance with consistent hashing versioning |
| Round-robin | Raw throughput with low ordering requirements | Ordering-sensitive tasks break | Use only for idempotent/stateless tasks |
| Priority + partition hybrid | Mixed urgency workloads | Priority inversion and starvation | Set fair-share floors and tenant caps |
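The skew-detection mitigation in the table can start as a simple traffic-share counter. A minimal sketch (all names and the 50% threshold are illustrative, not tied to any specific runtime):

```go
package main

import "fmt"

// SkewDetector counts dispatches per partition key over a window and
// flags keys whose share of total traffic exceeds a threshold.
type SkewDetector struct {
	counts map[string]int
	total  int
}

func NewSkewDetector() *SkewDetector {
	return &SkewDetector{counts: make(map[string]int)}
}

func (d *SkewDetector) Observe(key string) {
	d.counts[key]++
	d.total++
}

// HotKeys returns keys whose traffic share exceeds maxShare.
func (d *SkewDetector) HotKeys(maxShare float64) []string {
	var hot []string
	for k, n := range d.counts {
		if float64(n)/float64(d.total) > maxShare {
			hot = append(hot, k)
		}
	}
	return hot
}

func main() {
	d := NewSkewDetector()
	for i := 0; i < 90; i++ {
		d.Observe("tenantA:wf1") // dominant key: 90% of traffic
	}
	for i := 0; i < 10; i++ {
		d.Observe("tenantB:wf2")
	}
	fmt.Println(d.HotKeys(0.5)) // flags tenantA:wf1
}
```

A flagged key is a candidate for the secondary-key mitigation above: split its substream before rebalancing partitions.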
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Fairness failure diagnostics | Scheduler reason codes include `tenant_limit`, `pool_overloaded`, `no_workers` | Helps separate partition-key issues from raw capacity exhaustion. |
| Retry pressure boundaries | Max scheduling retries 50 with 1s-30s backoff and 2s no-worker delay | Partitioning strategy must account for retry amplification under local hotspots. |
| Dispatch health signal | Dispatch p99 > 1s is an existing warning threshold | A fast indicator that partition balance is degrading. |
| Recovery debt visibility | `cordum_scheduler_stale_jobs` and `cordum_scheduler_orphan_replayed_total` | Shows whether partition-specific failures are resolved safely after outages. |
| Tenant guardrail enforcement | `max_concurrent_jobs` policy is enforced per tenant | Prevents a single tenant’s partition burst from consuming all dispatch capacity. |
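The reason codes in the first row can drive automated triage that separates key-design problems from capacity problems. A sketch of one such mapping (the remediation strings are illustrative heuristics, not Cordum output):

```go
package main

import "fmt"

// ClassifyDispatchFailure maps scheduler reason codes to a first
// diagnostic step. The codes come from the runtime; the suggested
// actions are an illustrative triage heuristic.
func ClassifyDispatchFailure(reason string) string {
	switch reason {
	case "tenant_limit":
		return "fairness guardrail hit: check tenant partition key granularity"
	case "pool_overloaded":
		return "possible hotspot: compare per-partition dispatch latency"
	case "no_workers":
		return "capacity exhaustion: scale workers before retuning keys"
	default:
		return "unknown reason: inspect scheduler logs"
	}
}

func main() {
	fmt.Println(ClassifyDispatchFailure("tenant_limit"))
}
```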
Implementation examples
Deterministic partition key selection (Go)
```go
package partition

import "hash/fnv"

type Job struct {
	TenantID   string
	WorkflowID string
	Priority   int
}

// PartitionKey preserves per-tenant and per-workflow ordering.
func PartitionKey(j Job) string {
	return j.TenantID + ":" + j.WorkflowID
}

// PartitionIndex maps a key deterministically onto [0, partitionCount).
func PartitionIndex(key string, partitionCount int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitionCount))
}
```

Partitioning and fairness config (YAML)
```yaml
partitioning:
  partitions: 32
  key_strategy: tenant_workflow
fairness:
  max_concurrent_jobs_per_tenant: 40
  min_share_per_priority_tier:
    p0: 0.40
    p1: 0.40
    p2: 0.20
retry:
  max_scheduling_retries: 50
  backoff_base: 1s
  backoff_max: 30s
alerts:
  dispatch_p99_seconds: "> 1"
  stale_jobs: "> 50"
```

Partition health validation queries (PromQL)
```promql
# Dispatch latency signal
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Recovery debt
cordum_scheduler_stale_jobs
rate(cordum_scheduler_orphan_replayed_total[5m])

# Policy dependency stress
rate(cordum_safety_unavailable_total[5m])
```

Limitations and tradeoffs
- More partitions increase concurrency but also increase coordination and operational complexity.
- Changing key strategy later can require migration and temporary dual-write operations.
- Strict per-key ordering can cap throughput for high-cardinality hot keys.
- Queue-level partitioning alone does not solve downstream tool bottlenecks.
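The migration cost of changing key strategy can be softened with versioned partition mappings: in-flight jobs keep the partition count that was active when they started, so resizing never breaks per-key ordering mid-flight. A sketch assuming FNV-based indexing (the type and field names are hypothetical):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionIndex is the same deterministic FNV mapping used earlier.
func partitionIndex(key string, partitionCount int) int {
	h := fnv.New32a()
	_, _ = h.Write([]byte(key))
	return int(h.Sum32() % uint32(partitionCount))
}

// VersionedMapping pins each job to the partition count of the mapping
// version it started under, enabling gradual repartitioning.
type VersionedMapping struct {
	countByVersion map[int]int // mapping version -> partition count
}

func (m *VersionedMapping) Index(key string, version int) int {
	return partitionIndex(key, m.countByVersion[version])
}

func main() {
	m := &VersionedMapping{countByVersion: map[int]int{1: 32, 2: 64}}
	key := "tenantA:wf42"
	// Old in-flight jobs keep their version-1 placement; new jobs use v2.
	fmt.Println(m.Index(key, 1), m.Index(key, 2))
}
```

Once all version-1 jobs drain, the old mapping entry can be retired, avoiding a hard cutover or permanent dual-write.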
Next step
Run this in one sprint:
1. Choose one stable partition key for your highest-volume workflow family.
2. Define fairness limits (`max_concurrent_jobs`) and reason-code alerting.
3. Run one load test to verify dispatch p99 and failed-ratio guardrails.
4. Simulate one partition hotspot and confirm recovery via stale/replay metrics.
Continue with AI Agent Capacity Planning Model and AI Agent Multi-Tenant Isolation.