
AI Agent Worker Pool Draining

Stop new routing, finish in-flight work, then exit draining state on real conditions instead of hope.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • Pool drain should stop new routing while allowing in-flight jobs to complete.
  • Cordum marks the pool state as draining and stores `drain_started_at` plus a timeout via the API.
  • A background checker runs every 10 seconds and transitions the pool to inactive when active jobs hit zero.
  • If jobs never reach zero, timeout enforcement still transitions the pool to inactive to avoid permanent limbo.
Failure mode

Without timeout logic, draining pools can stay in limbo forever after partial worker failure.

Control point

Cordum exposes an explicit drain API and status transitions instead of ad-hoc manual routing edits.

Operational payoff

Rollouts and maintenance windows can proceed with deterministic pool lifecycle behavior.

Scope

This guide focuses on pool lifecycle and routing behavior in the gateway. It does not cover every autoscaler policy for every infrastructure provider.

The production problem

You need to drain one worker pool for maintenance. New jobs must stop landing there.

Existing jobs still need to finish, or at least transition cleanly.

If no timeout exists, a single stuck worker can leave the pool in draining status for hours.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Docs: Pod Lifecycle | SIGTERM and grace-period lifecycle for process termination. | No pool-level routing state machine for AI worker groups before pod-level shutdown. |
| Kubernetes Docs: Safely Drain a Node | Node-level eviction workflows and maintenance-safe drain practices. | No control-plane logic for topic-to-pool routing exclusion while jobs complete. |
| Google Cloud: Terminating with Grace | Grace-period sizing and SIGTERM handling basics. | No guidance on auto-transition from draining to inactive based on active job counters. |

Cordum runtime mechanics

Cordum models pool drain as explicit state transitions plus an active-job convergence loop.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Drain request | `POST /api/v1/pools/{name}/drain` sets status to draining, timestamps the start time, and sets the timeout. | New jobs stop routing to that pool immediately. |
| Default timeout | The drain API defaults to 300 seconds when the timeout is absent or non-positive. | Operators get a bounded drain window even with minimal input. |
| Checker cadence | The background drain checker runs every 10 seconds. | Pool status converges quickly without manual polling loops. |
| Exit conditions | The pool moves to inactive when active jobs reach zero or when the drain timeout expires. | No endless draining state during partial outages. |
Drain API behavior
```go
// core/controlplane/gateway/handlers_pools.go (excerpt)
if req.TimeoutSeconds <= 0 {
  req.TimeoutSeconds = 300
}

existing.Status = config.PoolStatusDraining
existing.DrainStartedAt = now
existing.DrainTimeoutSeconds = req.TimeoutSeconds
```
Drain checker loop
```go
// core/controlplane/gateway/pool_drain.go (excerpt)
const defaultDrainCheckInterval = 10 * time.Second

if time.Since(startedAt) > timeout {
  d.transitionToInactive(ctx, poolName, "drain timeout expired")
  return
}

activeJobs := d.countActiveJobsForPool(poolName)
if activeJobs == 0 {
  d.transitionToInactive(ctx, poolName, "all jobs completed")
  return
}
```
Inactive transition update
```go
// core/controlplane/gateway/pool_drain.go (excerpt)
pool.Status = config.PoolStatusInactive
pool.DrainStartedAt = ""
pool.DrainTimeoutSeconds = 0

d.srv.publishConfigChanged("system", "default")
```

Drain lifecycle details

Lifecycle is simple on paper: active, draining, inactive.

The hard part is deciding when draining is complete. Cordum uses two deterministic checks: active jobs equals zero, or timeout exceeded.

Tests cover both paths so operators are not learning behavior from production incidents.

Timeout path unit test
```go
// core/controlplane/gateway/pool_drain_test.go (excerpt)
func TestDrainChecker_Timeout_ForcesInactive(t *testing.T) {
  // pool started draining 10 minutes ago with 60 second timeout
  // active jobs still present
  checker.checkAll(context.Background())

  if status := getPoolStatus(t, s, "test-pool"); status != config.PoolStatusInactive {
    t.Errorf("expected inactive (timeout), got %q", status)
  }
}
```

Validation runbook

Run this in staging before using pool drain during a high-risk rollout.

Drain validation steps
```bash
# Endpoint paths other than /drain are illustrative; adjust to your deployment.
GATEWAY=http://localhost:8080

# 1) Create a test pool and map one topic to it (admin API)
# 2) Trigger drain with timeout_seconds=120
curl -X POST "$GATEWAY/api/v1/pools/test-pool/drain" \
     -H 'Content-Type: application/json' -d '{"timeout_seconds": 120}'
# 3) Verify status changes to draining and new jobs route elsewhere
curl "$GATEWAY/api/v1/pools/test-pool"
# 4) Keep one worker busy and confirm active_jobs > 0 blocks the transition
# 5) Wait for the timeout, then verify status auto-changes to inactive
curl "$GATEWAY/api/v1/pools/test-pool"
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| No timeout on draining pools | Maximum chance all long jobs complete naturally. | Risk of a permanent draining state during failures. |
| Short timeout | Faster maintenance and rollout turnover. | More jobs may be interrupted or rerouted mid-window. |
| Active-job check plus timeout fallback (Cordum pattern) | Balanced behavior between graceful completion and bounded operations. | Requires reliable active-job snapshot data. |

Next step

Add a canary runbook that drains one low-traffic pool every week, records elapsed drain time versus timeout budget, and updates default timeout values from measured data rather than guesswork.


Need production-safe agent governance?

Cordum helps teams enforce pre-dispatch policy, run dependable agent workflows, and keep evidence trails auditable.