The production problem
Multi-tenant incidents often start as performance complaints and end as trust incidents. One large tenant saturates shared capacity, smaller tenants miss their SLAs, and operators lose the signal needed to tell the offending tenant from its victims.
Security boundaries alone do not solve this. You also need fairness boundaries and dispatch-time enforcement to avoid noisy-neighbor starvation.
The target architecture is not absolute isolation. It is controlled sharing with explicit tenant limits and clear fallback behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Docs: Multi-tenancy | Strong hard/soft isolation framing across control plane and data plane. | No guidance for autonomous-agent retries, approvals, and policy-path behavior. |
| Amazon EKS Best Practices: Tenant Isolation | Concrete controls: RBAC, network policies, quotas, node isolation patterns. | No dispatch-layer reason code model for AI control planes. |
| AWS SaaS Tenant Isolation Strategies | Excellent silo/pool tradeoff analysis and isolation mindset. | No queue-level fairness strategy for autonomous workflow orchestration. |
Isolation model
Choose your isolation strategy per tenant segment, not per platform ideology. High-compliance tenants may need stronger boundaries than default pooled tenants.
| Model | Boundary style | Strengths | Tradeoffs |
|---|---|---|---|
| Silo isolation | Dedicated compute/data per tenant | Strong blast-radius control, simpler compliance posture | Higher cost and operational overhead |
| Pool isolation | Shared infrastructure with strict runtime policy enforcement | High utilization and operational simplicity | Requires rigorous fairness and policy controls |
| Bridge model | Most tenants pooled, selected tenants partially siloed | Balances economics and tenant-specific requirements | Adds routing and policy complexity |
| Priority-tier hybrid | Pooled baseline with premium resource tiers | Supports QoS tiers and commercial differentiation | Needs strong starvation safeguards |
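Choosing a model per tenant segment can be reduced to a small routing rule. A minimal sketch in Go, where the tier names (`default`, `premium`, `regulated`) and the tier-to-model mapping are illustrative assumptions, not a Cordum API:

```go
package main

import "fmt"

// IsolationModel mirrors the four models in the table above.
type IsolationModel string

const (
	Silo   IsolationModel = "silo"
	Pool   IsolationModel = "pool"
	Bridge IsolationModel = "bridge"
	Hybrid IsolationModel = "priority_tier_hybrid"
)

// modelForTier is a hypothetical routing rule: regulated tenants get
// silo boundaries, premium tenants get the priority-tier hybrid, and
// everyone else lands in the shared pool.
func modelForTier(tier string) IsolationModel {
	switch tier {
	case "regulated":
		return Silo
	case "premium":
		return Hybrid
	default:
		return Pool
	}
}

func main() {
	for _, tier := range []string{"default", "premium", "regulated"} {
		fmt.Printf("%s -> %s\n", tier, modelForTier(tier))
	}
}
```

Keeping the mapping in one function makes the tier-to-boundary decision auditable, which matters when a tenant's compliance requirements change.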
Cordum runtime mapping
| Implication | Current behavior | Why it matters |
|---|---|---|
| Tenant concurrency policy | `max_concurrent_jobs` is enforced per tenant before dispatch | Prevents one tenant from monopolizing scheduler and worker capacity. |
| Fairness reason codes | `tenant_limit`, `pool_overloaded`, `no_workers` | Gives operators actionable isolation/fairness diagnostics instead of generic failures. |
| Shared-platform stress signals | `cordum_scheduler_dispatch_latency_seconds`, `cordum_scheduler_stale_jobs` | Shows noisy-neighbor pressure before complete dispatch failure. |
| Policy dependency behavior | `cordum_safety_unavailable_total` and fail-mode configuration | Isolation must include governance dependencies, not only compute boundaries. |
| Retry pressure cap | Max scheduling retries 50 and `retryDelayNoWorkers` 2s | Bounds repeated scheduling attempts during tenant-specific capacity pressure. |
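Pre-dispatch enforcement of `max_concurrent_jobs` with the reason codes from the table above can be sketched as a single admission gate. The `admit` function and the ordering of its checks (tenant budget first, then worker availability, then pool capacity) are assumptions for illustration, not Cordum's actual dispatch path:

```go
package main

import "fmt"

// DispatchReason carries the fairness reason codes from the table above.
type DispatchReason string

const (
	ReasonOK          DispatchReason = ""
	ReasonTenantLimit DispatchReason = "tenant_limit"
	ReasonNoWorkers   DispatchReason = "no_workers"
	ReasonOverloaded  DispatchReason = "pool_overloaded"
)

// admit is a hypothetical pre-dispatch gate: the per-tenant
// max_concurrent_jobs budget is checked before shared capacity, so a
// single tenant hits tenant_limit before it can drive pool_overloaded
// for everyone else.
func admit(running, maxConcurrent, poolFree, idleWorkers int) (bool, DispatchReason) {
	switch {
	case running >= maxConcurrent:
		return false, ReasonTenantLimit
	case idleWorkers == 0:
		return false, ReasonNoWorkers
	case poolFree == 0:
		return false, ReasonOverloaded
	default:
		return true, ReasonOK
	}
}

func main() {
	ok, reason := admit(40, 40, 10, 5) // tenant already at its cap
	fmt.Println(ok, reason)            // prints "false tenant_limit"
}
```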
Implementation examples
Tenant isolation policy (YAML)
```yaml
tenancy:
  mode: pool
  tenant_limits:
    default:
      max_concurrent_jobs: 40
      max_retries: 3
    premium:
      max_concurrent_jobs: 120
      max_retries: 5
  fairness:
    scheduler_utilization_target: 0.70
    deny_cross_tenant_overrides: true
  alerts:
    tenant_limit_breach_rate_5m: "> 0.2"
    dispatch_p99_seconds: "> 1"
    stale_jobs: "> 50"
```
Reason-code routing (Go)
```go
type DispatchReason string

const (
	ReasonNoWorkers   DispatchReason = "no_workers"
	ReasonOverloaded  DispatchReason = "pool_overloaded"
	ReasonTenantLimit DispatchReason = "tenant_limit"
)

func routeOnReason(reason DispatchReason) string {
	switch reason {
	case ReasonTenantLimit:
		return "throttle_tenant_and_notify_owner"
	case ReasonOverloaded:
		return "shift_to_backup_pool_or_defer"
	case ReasonNoWorkers:
		return "scale_workers_and_retry"
	default:
		return "manual_triage"
	}
}
```
Fairness and isolation signals (PromQL)
```promql
# Failed ratio guardrail
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Dispatch latency guardrail
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Stale jobs guardrail
cordum_scheduler_stale_jobs

# Policy dependency degradation
rate(cordum_safety_unavailable_total[5m])
```
Limitations and tradeoffs
- Harder isolation improves security posture but increases infrastructure and operational cost.
- Aggressive tenant limits protect fairness but can frustrate burst-heavy legitimate workloads.
- Pool models need stronger observability to prove boundaries are enforced under load.
- Hybrid models satisfy business tiers but require disciplined policy lifecycle management.
Next step
Run this in one sprint:
1. Define tenant tiers (default, premium, regulated) and target isolation model per tier.
2. Set `max_concurrent_jobs` defaults and alert on `tenant_limit` reason frequency.
3. Add dispatch latency and stale-job guardrails to detect noisy-neighbor impact early.
4. Simulate one tenant burst and verify that other tenants remain within SLO thresholds.
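The burst test in step 4 can be prototyped offline before running it against real infrastructure. A toy simulation in Go, where the pool size, per-tenant cap, and tenant names are hypothetical numbers chosen only to show the mechanic:

```go
package main

import "fmt"

// simulate runs a toy version of the burst test: tenant "big" floods a
// shared pool of `slots` each round while tenant "small" submits one
// job per round. A per-tenant cap below the pool size should leave
// "small" with dispatch capacity every round.
func simulate(rounds, slots, perTenantCap int) map[string]int {
	dispatched := map[string]int{}
	for r := 0; r < rounds; r++ {
		queue := []string{"big", "big", "big", "big", "big", "big", "small"}
		used := map[string]int{}
		free := slots
		for _, tenant := range queue {
			if free == 0 {
				break
			}
			if used[tenant] >= perTenantCap {
				continue // tenant_limit: skip without starving others
			}
			used[tenant]++
			dispatched[tenant]++
			free--
		}
	}
	return dispatched
}

func main() {
	d := simulate(10, 6, 4)
	fmt.Printf("big=%d small=%d\n", d["big"], d["small"]) // prints "big=40 small=10"
}
```

If the small tenant's dispatch count drops to zero in a run like this, the cap is not doing its job, and the same shape of check applies to the real burst test against SLO thresholds.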
Continue with *AI Agent Priority Queues and Fair Scheduling*, then *AI Agent Capacity Planning Model*.