The production problem
You need to drain one worker pool for maintenance. New jobs must stop landing there.
Existing jobs still need to finish, or at least transition cleanly.
If no timeout exists, a single stuck worker can leave the pool in draining status for hours.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Docs: Pod Lifecycle | SIGTERM and grace-period lifecycle for process termination. | No pool-level routing state machine for AI worker groups before pod-level shutdown. |
| Kubernetes Docs: Safely Drain a Node | Node-level eviction workflows and maintenance-safe drain practices. | No control-plane logic for topic-to-pool routing exclusion while jobs complete. |
| Google Cloud: Terminating with Grace | Grace-period sizing and SIGTERM handling basics. | No guidance on auto-transition from draining to inactive based on active job counters. |
Cordum runtime mechanics
Cordum models pool drain as explicit state transitions plus an active-job convergence loop.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Drain request | `POST /api/v1/pools/{name}/drain` sets status to draining, timestamps start time, and sets timeout. | New jobs stop routing to that pool immediately. |
| Default timeout | Drain API defaults to 300 seconds when timeout is absent or non-positive. | Operators get a bounded drain window even with minimal input. |
| Checker cadence | Background drain checker runs every 10 seconds. | Pool status converges quickly without requiring manual polling loops. |
| Exit conditions | Pool moves to inactive when active jobs are zero or when drain timeout expires. | No endless draining state during partial outages. |
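Triggering a drain from an operator script looks roughly like the following. The path and the `timeout_seconds` field come from the API table and runbook above; the host, pool name, and exact request-body shape are illustrative assumptions, not confirmed Cordum API details.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// drainRequest builds a POST /api/v1/pools/{name}/drain request.
// The JSON body shape ({"timeout_seconds": N}) is an assumption based
// on the runbook's timeout_seconds=120 example.
func drainRequest(host, pool string, timeoutSeconds int) (*http.Request, error) {
	body, err := json.Marshal(map[string]int{"timeout_seconds": timeoutSeconds})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v1/pools/%s/drain", host, pool)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := drainRequest("http://localhost:8080", "gpu-pool-a", 120)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Once this request succeeds, the pool stops receiving new jobs immediately; the rest of the lifecycle is handled by the background checker.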
```go
// core/controlplane/gateway/handlers_pools.go (excerpt)
if req.TimeoutSeconds <= 0 {
	req.TimeoutSeconds = 300
}
existing.Status = config.PoolStatusDraining
existing.DrainStartedAt = now
existing.DrainTimeoutSeconds = req.TimeoutSeconds
```

```go
// core/controlplane/gateway/pool_drain.go (excerpt)
const defaultDrainCheckInterval = 10 * time.Second

if time.Since(startedAt) > timeout {
	d.transitionToInactive(ctx, poolName, "drain timeout expired")
	return
}
activeJobs := d.countActiveJobsForPool(poolName)
if activeJobs == 0 {
	d.transitionToInactive(ctx, poolName, "all jobs completed")
	return
}
```

```go
// core/controlplane/gateway/pool_drain.go (excerpt)
pool.Status = config.PoolStatusInactive
pool.DrainStartedAt = ""
pool.DrainTimeoutSeconds = 0
d.srv.publishConfigChanged("system", "default")
```

Drain lifecycle details
Lifecycle is simple on paper: active, draining, inactive.
The hard part is deciding when draining is complete. Cordum uses two deterministic checks each cycle, in order: the drain timeout has expired, or the pool's active job count has reached zero.
Tests cover both paths so operators are not learning behavior from production incidents.
```go
// core/controlplane/gateway/pool_drain_test.go (excerpt)
func TestDrainChecker_Timeout_ForcesInactive(t *testing.T) {
	// pool started draining 10 minutes ago with 60 second timeout
	// active jobs still present
	checker.checkAll(context.Background())
	if status := getPoolStatus(t, s, "test-pool"); status != config.PoolStatusInactive {
		t.Errorf("expected inactive (timeout), got %q", status)
	}
}
```

Validation runbook
Run this in staging before using pool drain during a high-risk rollout.
```shell
# 1) Create a test pool and map one topic
# 2) Trigger drain with timeout_seconds=120
# 3) Verify pool status changes to draining and new jobs route elsewhere
# 4) Keep one worker busy and confirm active_jobs > 0 blocks transition
# 5) Wait for timeout, then verify status auto-changes to inactive
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| No timeout on draining pools | Maximum chance all long jobs complete naturally. | Risk of permanent draining state during failures. |
| Short timeout | Faster maintenance and rollout turnover. | More jobs may be interrupted or rerouted mid-window. |
| Active-job check plus timeout fallback (Cordum pattern) | Balanced behavior between graceful completion and bounded operations. | Requires reliable active-job snapshot data. |
Next step
Add a canary runbook that drains one low-traffic pool every week, records elapsed drain time versus timeout budget, and updates default timeout values from measured data rather than guesswork.