The production problem
Application teams ask for deterministic worker routing.
Operations teams ask for overload protection and graceful fallback.
If you make hints mandatory, you get hot spots and brittle routing.
If you ignore hints completely, you lose locality and warm-cache gains.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Node Affinity | Hard vs soft scheduling preferences (`required` vs `preferred`) with explicit placement constraints. | No worker-level direct-subject routing hints inside an application scheduler. |
| gRPC Custom Load Balancing Policies | Policy-based balancing and client-side route selection behavior. | No policy for mixed hard pool constraints plus soft worker hints in queue dispatch workflows. |
| AWS ALB Sticky Sessions | Session affinity behavior and when stickiness improves UX continuity. | No guardrail model for rejecting sticky preference when target is overloaded or ineligible. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Pool hint strictness | `preferred_pool` narrows the topic's candidate pools to the hinted pool; the request fails with `ErrNoPoolMapping` if that pool is not mapped for the topic. | Prevents silently routing outside the declared topic-to-pool contract. |
| Worker hint softness | `preferred_worker_id` is honored only if worker exists, belongs to eligible pool, matches placement labels, and is not overloaded. | Avoids mandatory pinning into unhealthy capacity. |
| Fallback route | If preferred worker is unsuitable, strategy falls back to least-loaded scoring across eligible workers. | Keeps dispatch progress without manual hint cleanup. |
| Overload guard | A worker is overloaded when job utilization (active jobs / max parallel jobs) >= 0.9, or when CPU load or GPU utilization >= 90. | Hinted routing respects capacity safety limits. |
| Placement label scope | Only prefixed placement labels constrain worker matching; business labels are ignored. | Prevents accidental routing lock-in from application metadata. |
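The soft-hint and fallback rows above can be sketched in a few lines of Go. This is a minimal illustration, not Cordum's implementation: the `Worker` struct, `load`, and `pickWorker` are hypothetical names, and scoring by the active-jobs ratio with an ID tie-break is an assumption about what "least-loaded" means here.

```go
package main

import (
	"fmt"
	"sort"
)

// Worker is a hypothetical, simplified stand-in for a worker heartbeat.
type Worker struct {
	ID              string
	Pool            string
	ActiveJobs      int
	MaxParallelJobs int
}

// load returns the job-slot utilization; unknown capacity is treated as
// fully loaded (an assumption made for this sketch).
func load(w Worker) float64 {
	if w.MaxParallelJobs <= 0 {
		return 1
	}
	return float64(w.ActiveJobs) / float64(w.MaxParallelJobs)
}

// pickWorker honors the hint only when the hinted worker sits in an eligible
// pool and is below the 0.9 utilization threshold. Otherwise it falls back to
// the least-loaded worker across eligible pools.
func pickWorker(workers []Worker, eligiblePools map[string]bool, hint string) (id string, viaHint bool) {
	var candidates []Worker
	for _, w := range workers {
		if !eligiblePools[w.Pool] {
			continue
		}
		if w.ID == hint && load(w) < 0.9 {
			return w.ID, true
		}
		candidates = append(candidates, w)
	}
	if len(candidates) == 0 {
		return "", false
	}
	sort.Slice(candidates, func(i, j int) bool {
		li, lj := load(candidates[i]), load(candidates[j])
		if li != lj {
			return li < lj
		}
		return candidates[i].ID < candidates[j].ID // deterministic tie-break
	})
	return candidates[0].ID, false
}

func main() {
	workers := []Worker{
		{ID: "w1", Pool: "cpu", ActiveJobs: 2, MaxParallelJobs: 10},
		{ID: "w2", Pool: "cpu", ActiveJobs: 9, MaxParallelJobs: 10}, // at the 0.9 threshold
		{ID: "w3", Pool: "gpu", ActiveJobs: 0, MaxParallelJobs: 10},
	}
	id, viaHint := pickWorker(workers, map[string]bool{"cpu": true}, "w2")
	fmt.Println(id, viaHint)
}
```

In this run the hint for `w2` is dropped because `w2` is at the utilization threshold, so the job goes to `w1` through the fallback path rather than failing.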
Strategy code paths
Strict pool hint, soft worker hint
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
poolHint := labels["preferred_pool"]
if poolHint != "" {
	if !containsPool(topicPools, poolHint) {
		return "", fmt.Errorf("%w: preferred pool %q not mapped for topic %q", ErrNoPoolMapping, poolHint, req.Topic)
	}
	topicPools = []string{poolHint}
}
if preferredWorker := labels["preferred_worker_id"]; preferredWorker != "" {
	if hb, exists := workers[preferredWorker]; exists {
		if _, ok := poolSet[hb.GetPool()]; ok && matchesLabels(hb, requiredLabels) && !isOverloaded(hb) {
			return bus.DirectSubject(preferredWorker), nil
		}
	}
}
// otherwise fall back to least-loaded selection

Overload guardrails
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
const overloadUtilizationThreshold = 0.9

func isOverloaded(hb *pb.Heartbeat) bool {
	if capacity := hb.GetMaxParallelJobs(); capacity > 0 {
		utilization := float32(hb.GetActiveJobs()) / float32(capacity)
		if utilization >= overloadUtilizationThreshold {
			return true
		}
	}
	if hb.GetCpuLoad() >= 90 {
		return true
	}
	if hb.GetGpuUtilization() >= 90 {
		return true
	}
	return false
}

Placement label scoping + tests
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
func filterPlacementLabels(labels map[string]string) map[string]string {
	// Only prefixed placement keys are kept; everything else is dropped.
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if strings.HasPrefix(k, "placement.") ||
			strings.HasPrefix(k, "constraint.") ||
			strings.HasPrefix(k, "node.") {
			out[k] = v
		}
	}
	return out
}
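To see the scoping in isolation, the following self-contained sketch runs the same prefix filter over a sample label set. The label keys and values in `main` are hypothetical, and the `out` map initialization is added here so the function compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// filterPlacementLabels keeps only labels whose key starts with one of the
// placement prefixes shown in the excerpt above.
func filterPlacementLabels(labels map[string]string) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if strings.HasPrefix(k, "placement.") ||
			strings.HasPrefix(k, "constraint.") ||
			strings.HasPrefix(k, "node.") {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Hypothetical job labels mixing placement, business, and hint keys.
	labels := map[string]string{
		"placement.zone":      "eu-1",
		"node.gpu":            "a100",
		"team":                "billing",
		"preferred_worker_id": "w2",
	}
	fmt.Println(filterPlacementLabels(labels))
}
```

Only `placement.zone` and `node.gpu` survive the filter. The `team` label and the `preferred_worker_id` hint carry no placement prefix, so they never take part in worker label matching.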
// core/controlplane/scheduler/strategy_least_loaded_test.go (excerpt)
func TestLeastLoadedStrategyHonorsPreferredWorker(t *testing.T) {
	req := &pb.JobRequest{
		Topic:  "job.default",
		Labels: map[string]string{"preferred_worker_id": "w2"},
	}
	subject, _ := strategy.PickSubject(req, workers)
	if subject != "worker.w2.jobs" {
		t.Fatalf("expected preferred worker")
	}
}

func TestFilterPlacementLabels(t *testing.T) {
	// placement.* / constraint.* / node.* kept; business labels ignored
}

Validation runbook
Validate hint behavior explicitly. Do not assume worker hints are strict pins.
# 1) Validate strategy tests
go test ./core/controlplane/scheduler -run TestLeastLoadedStrategyHonorsPreferredWorker -count=1
go test ./core/controlplane/scheduler -run TestFilterPlacementLabels -count=1
# 2) Submit job with preferred worker hint
cordumctl job submit --topic job.default --prompt "hint probe" --labels '{"preferred_worker_id":"w2"}'
# 3) Submit job with strict preferred pool hint
cordumctl job submit --topic job.default --prompt "pool hint probe" --labels '{"preferred_pool":"gpu-batch"}'
# 4) Inspect scheduler logs for hint decisions
rg "strategy pick preferred worker|no pool mapping for topic" /var/log/cordum/scheduler.log

Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Soft worker hint + strict pool hint (current) | Good balance between determinism and safety. | Behavior can surprise teams expecting hard worker pinning. |
| Hard worker pinning | Maximum predictability for targeted workloads. | Higher risk of overload/staleness hotspots and manual operations burden. |
| Ignore all hints | Simplest scheduler behavior. | Loses useful locality and warm-cache optimization opportunities. |
- Soft hints are safer by default, but teams need clear docs to avoid wrong assumptions.
- Strict pool hints can be useful for compliance boundaries, but misconfiguration risk is higher.
- Capacity and staleness checks must stay in front of hint shortcuts.
Next step
Implement this next:
1. Add explicit docs table: which hints are hard constraints vs soft preferences.
2. Add a test for preferred worker fallback when hinted worker exists but is overloaded.
3. Emit metrics for hint usage and hint rejection causes (`overloaded`, `label_mismatch`, `pool_ineligible`).
4. Add a dry-run endpoint that returns selected worker and rejection rationale for hints.
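Items 3 and 4 both need a single place that names why a hint was rejected. One possible shape is sketched below. Everything here is hypothetical (`Worker`, `evaluateHint`, and the extra `unknown_worker` cause are not in the current code); the check order mirrors the condition order in the strategy excerpt, and exact-match label comparison is an assumption.

```go
package main

import "fmt"

// Worker is a hypothetical, simplified view of a worker heartbeat.
type Worker struct {
	Pool       string
	Labels     map[string]string
	Overloaded bool
}

// Rejection causes. The last three names come from the list above;
// "unknown_worker" is an added assumption for hints naming a missing worker.
const (
	ReasonNone           = ""
	ReasonUnknownWorker  = "unknown_worker"
	ReasonPoolIneligible = "pool_ineligible"
	ReasonLabelMismatch  = "label_mismatch"
	ReasonOverloaded     = "overloaded"
)

// evaluateHint returns ReasonNone when the hint can be honored, or the
// first rejection cause encountered.
func evaluateHint(workers map[string]Worker, id string, pools map[string]bool, required map[string]string) string {
	w, ok := workers[id]
	if !ok {
		return ReasonUnknownWorker
	}
	if !pools[w.Pool] {
		return ReasonPoolIneligible
	}
	for k, v := range required {
		if w.Labels[k] != v {
			return ReasonLabelMismatch
		}
	}
	if w.Overloaded {
		return ReasonOverloaded
	}
	return ReasonNone
}

func main() {
	workers := map[string]Worker{
		"w1": {Pool: "cpu", Labels: map[string]string{"placement.zone": "eu-1"}},
		"w2": {Pool: "cpu", Labels: map[string]string{"placement.zone": "eu-1"}, Overloaded: true},
	}
	pools := map[string]bool{"cpu": true}
	required := map[string]string{"placement.zone": "eu-1"}

	// A plain counter map stands in for a real metrics client.
	rejections := map[string]int{}
	for _, hint := range []string{"w1", "w2", "w9"} {
		if reason := evaluateHint(workers, hint, pools, required); reason != ReasonNone {
			rejections[reason]++
		}
	}
	fmt.Println(rejections)
}
```

A dry-run endpoint could return the same string alongside the worker chosen by the fallback path, so the metrics and the dry-run rationale cannot drift apart.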
Continue with AI Agent Stale Worker Dispatch Retries and AI Agent Priority Fair Scheduling.