The production problem
You need to drain one worker pool for maintenance. New jobs must stop landing there.
Existing jobs still need to finish, or at least transition cleanly.
If no timeout exists, a single stuck worker can leave the pool in draining status for hours.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Docs: Pod Lifecycle | SIGTERM and grace-period lifecycle for process termination. | No pool-level routing state machine for AI worker groups before pod-level shutdown. |
| Kubernetes Docs: Safely Drain a Node | Node-level eviction workflows and maintenance-safe drain practices. | No control-plane logic for topic-to-pool routing exclusion while jobs complete. |
| Google Cloud: Terminating with Grace | Grace-period sizing and SIGTERM handling basics. | No guidance on auto-transition from draining to inactive based on active job counters. |
Cordum runtime mechanics
Cordum models pool drain as explicit state transitions plus an active-job convergence loop.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Drain request | `POST /api/v1/pools/{name}/drain` sets status to draining, timestamps start time, and sets timeout. | New jobs stop routing to that pool immediately. |
| Default timeout | Drain API defaults to 300 seconds when timeout is absent or non-positive. | Operators get a bounded drain window even with minimal input. |
| Checker cadence | Background drain checker runs every 10 seconds. | Pool status converges quickly without requiring manual polling loops. |
| Exit conditions | Pool moves to inactive when active jobs are zero or when drain timeout expires. | No endless draining state during partial outages. |
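Triggering a drain from an operator script looks roughly like the following. The path and the `timeout_seconds` field come from the API table and runbook above; the host, pool name, and exact request-body shape are illustrative assumptions, not confirmed Cordum API details.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// drainRequest builds a POST /api/v1/pools/{name}/drain request.
// The JSON body shape ({"timeout_seconds": N}) is an assumption based
// on the runbook's timeout_seconds=120 example.
func drainRequest(host, pool string, timeoutSeconds int) (*http.Request, error) {
	body, err := json.Marshal(map[string]int{"timeout_seconds": timeoutSeconds})
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v1/pools/%s/drain", host, pool)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := drainRequest("http://localhost:8080", "gpu-pool-a", 120)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

Once this request succeeds, the pool stops receiving new jobs immediately; the rest of the lifecycle is handled by the background checker.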
```go
// core/controlplane/gateway/handlers_pools.go (excerpt)
if req.TimeoutSeconds <= 0 {
	req.TimeoutSeconds = 300
}
existing.Status = config.PoolStatusDraining
existing.DrainStartedAt = now
existing.DrainTimeoutSeconds = req.TimeoutSeconds
```

```go
// core/controlplane/gateway/pool_drain.go (excerpt)
const defaultDrainCheckInterval = 10 * time.Second

if time.Since(startedAt) > timeout {
	d.transitionToInactive(ctx, poolName, "drain timeout expired")
	return
}
activeJobs := d.countActiveJobsForPool(poolName)
if activeJobs == 0 {
	d.transitionToInactive(ctx, poolName, "all jobs completed")
	return
}
```

```go
// core/controlplane/gateway/pool_drain.go (excerpt)
pool.Status = config.PoolStatusInactive
pool.DrainStartedAt = ""
pool.DrainTimeoutSeconds = 0
d.srv.publishConfigChanged("system", "default")
```

Drain lifecycle details
Lifecycle is simple on paper: active, draining, inactive.
The hard part is deciding when draining is complete. Cordum uses two deterministic checks each cycle, in order: the drain timeout has expired, or the pool's active job count has reached zero.
Tests cover both paths so operators are not learning behavior from production incidents.
```go
// core/controlplane/gateway/pool_drain_test.go (excerpt)
func TestDrainChecker_Timeout_ForcesInactive(t *testing.T) {
	// pool started draining 10 minutes ago with 60 second timeout
	// active jobs still present
	checker.checkAll(context.Background())
	if status := getPoolStatus(t, s, "test-pool"); status != config.PoolStatusInactive {
		t.Errorf("expected inactive (timeout), got %q", status)
	}
}
```

Validation runbook
Run this in staging before using pool drain during a high-risk rollout.
```shell
# 1) Create a test pool and map one topic
# 2) Trigger drain with timeout_seconds=120
# 3) Verify pool status changes to draining and new jobs route elsewhere
# 4) Keep one worker busy and confirm active_jobs > 0 blocks transition
# 5) Wait for timeout, then verify status auto-changes to inactive
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| No timeout on draining pools | Maximum chance all long jobs complete naturally. | Risk of permanent draining state during failures. |
| Short timeout | Faster maintenance and rollout turnover. | More jobs may be interrupted or rerouted mid-window. |
| Active-job check plus timeout fallback (Cordum pattern) | Balanced behavior between graceful completion and bounded operations. | Requires reliable active-job snapshot data. |
Next step
Add a canary runbook that drains one low-traffic pool every week, records elapsed drain time versus timeout budget, and updates default timeout values from measured data rather than guesswork.