The production problem
Multi-tenant incidents often start as performance complaints and end as trust incidents. One large tenant saturates shared capacity, smaller tenants miss their SLAs, and operators lose the signal needed to tell the offending tenant from its victims.
Security boundaries alone do not solve this. You also need fairness boundaries and dispatch-time enforcement to avoid noisy-neighbor starvation.
The target architecture is not absolute isolation. It is controlled sharing with explicit tenant limits and clear fallback behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Docs: Multi-tenancy | Strong hard/soft isolation framing across control plane and data plane. | No guidance for autonomous-agent retries, approvals, and policy-path behavior. |
| Amazon EKS Best Practices: Tenant Isolation | Concrete controls: RBAC, network policies, quotas, node isolation patterns. | No dispatch-layer reason code model for AI control planes. |
| AWS SaaS Tenant Isolation Strategies | Excellent silo/pool tradeoff analysis and isolation mindset. | No queue-level fairness strategy for autonomous workflow orchestration. |
Isolation model
Choose your isolation strategy per tenant segment, not per platform ideology. High-compliance tenants may need stronger boundaries than default pooled tenants.
| Model | Boundary style | Strengths | Tradeoffs |
|---|---|---|---|
| Silo isolation | Dedicated compute/data per tenant | Strong blast-radius control, simpler compliance posture | Higher cost and operational overhead |
| Pool isolation | Shared infrastructure with strict runtime policy enforcement | High utilization and operational simplicity | Requires rigorous fairness and policy controls |
| Bridge model | Most tenants pooled, selected tenants partially siloed | Balances economics and tenant-specific requirements | Adds routing and policy complexity |
| Priority-tier hybrid | Pooled baseline with premium resource tiers | Supports QoS tiers and commercial differentiation | Needs strong starvation safeguards |
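Choosing a model per tenant segment can be reduced to a small routing rule. A minimal sketch in Go, where the tier names (`default`, `premium`, `regulated`) and the tier-to-model mapping are illustrative assumptions, not a Cordum API:

```go
package main

import "fmt"

// IsolationModel mirrors the four models in the table above.
type IsolationModel string

const (
	Silo   IsolationModel = "silo"
	Pool   IsolationModel = "pool"
	Bridge IsolationModel = "bridge"
	Hybrid IsolationModel = "priority_tier_hybrid"
)

// modelForTier is a hypothetical routing rule: regulated tenants get
// silo boundaries, premium tenants get the priority-tier hybrid, and
// everyone else lands in the shared pool.
func modelForTier(tier string) IsolationModel {
	switch tier {
	case "regulated":
		return Silo
	case "premium":
		return Hybrid
	default:
		return Pool
	}
}

func main() {
	for _, tier := range []string{"default", "premium", "regulated"} {
		fmt.Printf("%s -> %s\n", tier, modelForTier(tier))
	}
}
```

Keeping the mapping in one function makes the tier-to-boundary decision auditable, which matters when a tenant's compliance requirements change.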
Cordum runtime mapping
| Implication | Current behavior | Why it matters |
|---|---|---|
| Tenant concurrency policy | `max_concurrent_jobs` is enforced per tenant before dispatch | Prevents one tenant from monopolizing scheduler and worker capacity. |
| Fairness reason codes | `tenant_limit`, `pool_overloaded`, `no_workers` | Gives operators actionable isolation/fairness diagnostics instead of generic failures. |
| Shared-platform stress signals | `cordum_scheduler_dispatch_latency_seconds`, `cordum_scheduler_stale_jobs` | Shows noisy-neighbor pressure before complete dispatch failure. |
| Policy dependency behavior | `cordum_safety_unavailable_total` and fail-mode configuration | Isolation must include governance dependencies, not only compute boundaries. |
| Retry pressure cap | Max scheduling retries 50 and `retryDelayNoWorkers` 2s | Bounds repeated scheduling attempts during tenant-specific capacity pressure. |
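Pre-dispatch enforcement of `max_concurrent_jobs` with the reason codes from the table above can be sketched as a single admission gate. The `admit` function and the ordering of its checks (tenant budget first, then worker availability, then pool capacity) are assumptions for illustration, not Cordum's actual dispatch path:

```go
package main

import "fmt"

// DispatchReason carries the fairness reason codes from the table above.
type DispatchReason string

const (
	ReasonOK          DispatchReason = ""
	ReasonTenantLimit DispatchReason = "tenant_limit"
	ReasonNoWorkers   DispatchReason = "no_workers"
	ReasonOverloaded  DispatchReason = "pool_overloaded"
)

// admit is a hypothetical pre-dispatch gate: the per-tenant
// max_concurrent_jobs budget is checked before shared capacity, so a
// single tenant hits tenant_limit before it can drive pool_overloaded
// for everyone else.
func admit(running, maxConcurrent, poolFree, idleWorkers int) (bool, DispatchReason) {
	switch {
	case running >= maxConcurrent:
		return false, ReasonTenantLimit
	case idleWorkers == 0:
		return false, ReasonNoWorkers
	case poolFree == 0:
		return false, ReasonOverloaded
	default:
		return true, ReasonOK
	}
}

func main() {
	ok, reason := admit(40, 40, 10, 5) // tenant already at its cap
	fmt.Println(ok, reason)            // prints "false tenant_limit"
}
```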
Implementation examples
Tenant isolation policy (YAML)
```yaml
tenancy:
  mode: pool
  tenant_limits:
    default:
      max_concurrent_jobs: 40
      max_retries: 3
    premium:
      max_concurrent_jobs: 120
      max_retries: 5
  fairness:
    scheduler_utilization_target: 0.70
    deny_cross_tenant_overrides: true
  alerts:
    tenant_limit_breach_rate_5m: "> 0.2"
    dispatch_p99_seconds: "> 1"
    stale_jobs: "> 50"
```
Reason-code routing (Go)
```go
type DispatchReason string

const (
	ReasonNoWorkers   DispatchReason = "no_workers"
	ReasonOverloaded  DispatchReason = "pool_overloaded"
	ReasonTenantLimit DispatchReason = "tenant_limit"
)

func routeOnReason(reason DispatchReason) string {
	switch reason {
	case ReasonTenantLimit:
		return "throttle_tenant_and_notify_owner"
	case ReasonOverloaded:
		return "shift_to_backup_pool_or_defer"
	case ReasonNoWorkers:
		return "scale_workers_and_retry"
	default:
		return "manual_triage"
	}
}
```
Fairness and isolation signals (PromQL)
```promql
# Failed ratio guardrail
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Dispatch latency guardrail
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Stale jobs guardrail
cordum_scheduler_stale_jobs

# Policy dependency degradation
rate(cordum_safety_unavailable_total[5m])
```
Limitations and tradeoffs
- Harder isolation improves security posture but increases infrastructure and operational cost.
- Aggressive tenant limits protect fairness but can frustrate burst-heavy legitimate workloads.
- Pool models need stronger observability to prove boundaries are enforced under load.
- Hybrid models satisfy business tiers but require disciplined policy lifecycle management.
Next step
Run this in one sprint:
1. Define tenant tiers (default, premium, regulated) and target isolation model per tier.
2. Set `max_concurrent_jobs` defaults and alert on `tenant_limit` reason frequency.
3. Add dispatch latency and stale-job guardrails to detect noisy-neighbor impact early.
4. Simulate one tenant burst and verify that other tenants remain within SLO thresholds.
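The burst test in step 4 can be prototyped offline before running it against real infrastructure. A toy simulation in Go, where the pool size, per-tenant cap, and tenant names are hypothetical numbers chosen only to show the mechanic:

```go
package main

import "fmt"

// simulate runs a toy version of the burst test: tenant "big" floods a
// shared pool of `slots` each round while tenant "small" submits one
// job per round. A per-tenant cap below the pool size should leave
// "small" with dispatch capacity every round.
func simulate(rounds, slots, perTenantCap int) map[string]int {
	dispatched := map[string]int{}
	for r := 0; r < rounds; r++ {
		queue := []string{"big", "big", "big", "big", "big", "big", "small"}
		used := map[string]int{}
		free := slots
		for _, tenant := range queue {
			if free == 0 {
				break
			}
			if used[tenant] >= perTenantCap {
				continue // tenant_limit: skip without starving others
			}
			used[tenant]++
			dispatched[tenant]++
			free--
		}
	}
	return dispatched
}

func main() {
	d := simulate(10, 6, 4)
	fmt.Printf("big=%d small=%d\n", d["big"], d["small"]) // prints "big=40 small=10"
}
```

If the small tenant's dispatch count drops to zero in a run like this, the cap is not doing its job, and the same shape of check applies to the real burst test against SLO thresholds.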
Continue with *AI Agent Priority Queues and Fair Scheduling*, then *AI Agent Capacity Planning Model*.