Architecture

What K8s Taught Us About Governing Agents

The industry solved fleet governance once. The control plane pattern applies to AI agents.

Mar 23, 2026 · 10 min read · By Zvi
TL;DR

AI agent orchestration follows the same trajectory as container orchestration. Docker worked. Docker at scale without Kubernetes was chaos. Agents work. Agents at scale without a control plane is the same chaos. The primitives K8s established map directly to what agent fleets need.

- Admission controllers map to Safety Kernels. RBAC maps to capability restrictions. Resource quotas map to budget limits. The primitives are the same.
- K8s got three things right that agent governance needs: declarative state (policy-as-code), fail-closed defaults, and workload/infrastructure separation.
- Agents are non-deterministic. The same input can produce different actions. This makes pre-dispatch policy evaluation more important, not less.
- 82% of container users run K8s in production (CNCF 2025). The industry already solved fleet governance once. The pattern applies to agents.
Context

The CNCF 2025 Annual Survey reports that 82% of container users run Kubernetes in production. 96% of organizations that evaluated K8s adopted it. And 66% now use K8s for AI inference workloads. The cloud-native community has already begun mapping these patterns to agents: the kube-agentic-networking SIG is defining agent identity, auth, and policy primitives within the K8s ecosystem.

2015: containers without orchestration

Docker made containers easy to build and run. On a single machine, it worked. On ten machines, you needed scripts. On fifty machines, the scripts broke. Networking was ad-hoc. Secrets were environment variables. Resource limits were suggestions. Deployments were SSH scripts or Ansible playbooks that worked until they did not.

Kubernetes solved this by adding a control plane. Not by replacing Docker. By adding the governance, scheduling, and observability layer that made containers manageable at scale. You still ran containers. K8s just made sure they ran safely, within resource bounds, with proper identity, and with an audit trail.

AI agents are at the same inflection point. Individual agents work fine. Running 50 agents in production without governance is the same chaos as running 50 containers without K8s. Different workload, same problem, same solution pattern.

AI agent orchestration: K8s primitives mapped

- Admission Controllers (validate/mutate resources before persistence) -> Safety Kernel (evaluate every job against policy before dispatch)
- RBAC (role-based access to API resources) -> Capability Restrictions (per-agent capability scoping: read/write/admin)
- Resource Quotas (CPU/memory limits per namespace) -> Budget Limits (token spend and rate limits per agent/fleet)
- Namespaces (workload isolation boundaries) -> Multi-Tenancy (tenant-isolated agent environments)
- Audit Logging (structured record of every API call) -> Audit Trail (structured record of every agent decision)
- Liveness Probes (detect and restart unhealthy pods) -> Heartbeats (detect stale workers, reassign jobs)
- Helm Charts (declarative app packaging and updates) -> Pack System (declarative governance bundle installation)

Seven primitives. Seven direct mappings. This is not a forced analogy. These are the same governance problems applied to a different workload type.

What Kubernetes got right for agent governance

Three K8s design decisions apply directly to agent governance.

Declarative desired state. K8s does not tell containers what to do step by step. You declare the desired state (3 replicas, 512MB memory, port 8080) and the control plane reconciles reality to match. Agent governance works the same way. You declare policy as code (reads allowed, writes need approval, destructive operations blocked) and the Safety Kernel enforces it on every action.

K8s admission policy vs Agent safety policy
# K8s: OPA/Gatekeeper admission policy
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sBlockPrivilegedContainers
metadata:
  name: block-privileged
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]

# Agent equivalent: Safety Kernel policy
# safety.yaml
version: v1
rules:
  - id: block-destructive
    match:
      topics: ["job.*.delete", "job.*.drop"]
      risk_tags: ["destructive"]
    decision: deny
    reason: "Destructive operations blocked by policy"

Fail-closed by default. If a K8s admission controller cannot reach its webhook, it rejects the request. Not because it knows the request is bad, but because it cannot confirm it is safe. Cordum's Safety Kernel follows the same principle. If policy evaluation fails for any reason, the job is blocked. The safe default is deny, not allow.
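
A minimal Go sketch of the fail-closed rule, assuming a hypothetical `policyLookup` backend that can error out (rule set missing, evaluation service unreachable, timeout); the topics are invented for the example:

```go
package main

import (
	"errors"
	"fmt"
)

// policyLookup stands in for the real policy engine. It can fail
// independently of whether the request itself is safe.
func policyLookup(topic string) (bool, error) {
	switch topic {
	case "job.orders.read":
		return true, nil
	case "job.db.drop":
		return false, nil
	default:
		return false, errors.New("no rule matched and no rule set loaded")
	}
}

// allowed is fail-closed: any error during policy evaluation blocks
// the job, mirroring an admission webhook that cannot be reached.
func allowed(topic string) bool {
	ok, err := policyLookup(topic)
	if err != nil {
		return false // cannot confirm safety -> deny
	}
	return ok
}

func main() {
	fmt.Println(allowed("job.orders.read")) // true
	fmt.Println(allowed("job.db.drop"))     // false: denied by rule
	fmt.Println(allowed("job.unknown"))     // false: evaluation failed, fail closed
}
```

The third case is the one that matters: the evaluator knows nothing about the request, and the safe answer to "I don't know" is no.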

Workload/infrastructure separation. K8s does not care what runs inside a container. Application developers write code; platform engineers manage the control plane. Agent governance works the same way. Agent developers write agent logic; platform teams manage the governance layer. The two concerns are separate, and mixing them is how both containers and agents get into trouble.

Where the Kubernetes analogy breaks down

Honesty matters more than cleverness here. The analogy has limits.

Containers are deterministic. Agents are not. Run the same container image with the same input and you get the same output. Run the same agent with the same prompt and you might get a completely different sequence of actions. Temperature, context window state, and model updates all introduce variance. This makes pre-dispatch policy evaluation more important for agents than for containers, not less. You cannot predict what an agent will do from its configuration alone.
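
One consequence in code: the gate has to sit on each emitted action, not on the agent's configuration. A sketch under assumptions (the `Action` type and `gate` rules are invented here, loosely echoing the earlier safety.yaml example: deny `destructive` risk tags and `job.*.delete` / `job.*.drop` topics):

```go
package main

import (
	"fmt"
	"strings"
)

// Action is one step an agent decided to take at runtime. Because the
// agent is non-deterministic, the set of actions cannot be derived
// from its configuration; each one is checked at dispatch time.
type Action struct {
	Topic    string
	RiskTags []string
}

// gate evaluates a single action just before dispatch.
func gate(a Action) bool {
	for _, tag := range a.RiskTags {
		if tag == "destructive" {
			return false
		}
	}
	if strings.HasSuffix(a.Topic, ".delete") || strings.HasSuffix(a.Topic, ".drop") {
		return false
	}
	return true
}

func main() {
	// Two runs of the "same" agent produced different action sequences;
	// both sequences pass through the same per-action gate.
	runA := []Action{{Topic: "job.orders.read"}, {Topic: "job.report.write"}}
	runB := []Action{{Topic: "job.orders.read"}, {Topic: "job.orders.delete", RiskTags: []string{"destructive"}}}
	for _, a := range append(runA, runB...) {
		fmt.Println(a.Topic, gate(a))
	}
}
```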

Containers do not delegate. A container does not autonomously decide to spin up other containers and delegate work to them. Agents do. A research agent can decide to spawn a data access agent that spawns an API caller. This delegation chain does not exist in the container model, and it creates governance challenges (policy inheritance, approval escalation) that K8s never had to solve.
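
One plausible rule for policy inheritance down such a chain is intersection: a spawned agent can never hold a capability its parent lacks. A sketch of that rule (the `inherit` function and capability names are assumptions, not Cordum's API):

```go
package main

import "fmt"

// inherit computes a child agent's effective capabilities: the
// intersection of what it requested and what its parent holds.
// Delegation can narrow privileges but never escalate them.
func inherit(parent, requested map[string]bool) map[string]bool {
	effective := map[string]bool{}
	for c := range requested {
		if parent[c] {
			effective[c] = true
		}
	}
	return effective
}

func main() {
	research := map[string]bool{"read": true, "write": true}
	// The research agent spawns an API caller that asks for read+admin;
	// admin is dropped because the parent never held it.
	apiCaller := inherit(research, map[string]bool{"read": true, "admin": true})
	fmt.Println(apiCaller["read"], apiCaller["admin"]) // true false
}
```

Applied transitively, the rule means privileges can only shrink as the delegation chain deepens.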

Resource consumption is unpredictable. A container's resource usage is bounded by its limits. An agent's token consumption depends on the model's reasoning path, which varies per request. Budget enforcement for agents requires runtime monitoring and circuit breakers, not just static limits.
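The runtime-enforcement point can be sketched as a token circuit breaker. This is an illustration of the idea, not Cordum's budget enforcer; the `TokenBreaker` type and the numbers are invented:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// TokenBreaker enforces a per-agent token budget at runtime. Unlike a
// static container limit, spend is only known as responses come back,
// so the breaker has to trip mid-run once the budget is exhausted.
type TokenBreaker struct {
	mu     sync.Mutex
	budget int
	spent  int
}

var ErrBudgetExhausted = errors.New("token budget exhausted")

// Spend records tokens consumed by one model call, refusing any call
// that would push cumulative spend past the budget.
func (b *TokenBreaker) Spend(tokens int) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spent+tokens > b.budget {
		return ErrBudgetExhausted
	}
	b.spent += tokens
	return nil
}

func main() {
	b := &TokenBreaker{budget: 1000}
	for _, call := range []int{400, 400, 400} { // per-call spend is not known in advance
		if err := b.Spend(call); err != nil {
			fmt.Println("tripped:", err) // tripped: token budget exhausted
			break
		}
	}
}
```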

These differences do not invalidate the control plane pattern. They reinforce why agents need one even more than containers did.

The control plane pattern applied to agents

Cordum's architecture maps directly to the K8s control plane. This is not an accident. We built it this way because the pattern works. Read more about the workflow orchestration architecture.

K8s control plane to agent control plane mapping
# Kubernetes Control Plane          Agent Control Plane (Cordum)
# -------------------------         ---------------------------
# API Server                    ->  API Gateway
#   Single entrypoint                 Single entrypoint
#   Authn/authz on every request      X-API-Key + tenant isolation
#
# Admission Controllers         ->  Safety Kernel
#   Validate/mutate before persist    Evaluate before dispatch
#   OPA/Gatekeeper policies           safety.yaml rules
#   Fail-closed by default            Fail-closed by default
#
# kube-scheduler                ->  Scheduler
#   Bin-pack pods to nodes            Route jobs to worker pools
#   Resource-aware placement          Capability-based routing
#
# controller-manager            ->  Workflow Engine
#   Reconcile desired state           DAG step orchestration
#   Watch + act loop                  Event-driven progression
#
# etcd                          ->  NATS + Redis
#   Durable state store               Durable messaging + state
#   Watch streams                     JetStream subscriptions
#
# Audit Logging                 ->  Audit Trail
#   Every API call recorded           Every decision recorded
#   Structured JSON events            Structured JSON events

API Gateway handles authentication, routing, and the single entrypoint for all operations, like the K8s API server. Safety Kernel evaluates every job before dispatch, like admission controllers evaluate every resource before persistence. Scheduler routes jobs to worker pools based on capabilities and load, like kube-scheduler places pods on nodes. Workflow Engine orchestrates multi-step processes, like controller-manager reconciles desired state.

If you have operated K8s at scale, you already understand how these components interact. The workloads changed from containers to agents. The governance pattern did not.

For platform engineers

If you think in control plane patterns, you already understand agent governance. Admission controllers are Safety Kernels. RBAC is capability scoping. Resource quotas are budget limits. Audit logs are audit trails. The vocabulary is different. The architecture is the same.

Platform engineering teams are already treating agents as first-class platform citizens, applying the same RBAC, quota, and governance primitives they manage for microservices. The kube-agentic-networking SIG is building agent identity and policy primitives directly into the K8s ecosystem.

We built Cordum for this community. Source-available (BUSL-1.1), built in Go on NATS and Redis, with sub-5ms policy evaluation. If you have opinions about admission controllers and API server design, look at our framework comparison and architecture docs. The primitives will feel familiar.

By Zvi, CTO & Co-founder, Cordum

Previously at Checkpoint and Fireblocks, building security infrastructure. Now building the governance layer for autonomous AI agents.

Apply the control plane pattern

Admission controllers for agents. RBAC for capabilities. Quotas for budgets. Five-minute quickstart.
