Deep Dive

AI Agent Workflow Admission 429 vs 503

Same endpoint. Different failure class. Different retry policy.

10 min read · Mar 2026
TL;DR
- In workflow admission, `429` and `503` are different failure classes and should not share one retry policy.
- Cordum returns `429` when run slots are full and `503` when the admission gate itself cannot be evaluated.
- Current workflow-run responses do not set `Retry-After`, so clients must infer retry pacing from policy.
- Cordum job-submit throttling already sets `Retry-After: 30`, which is a useful contrast point for workflow APIs.
Failure mode

Treating `429` and `503` as identical can cause burst retries and false incident noise.

Protocol signal

HTTP specs distinguish rate/overload from temporary unavailability for a reason.

Operational payoff

Separating policies reduces duplicate pressure on admission locks and smooths recovery.

Scope

This guide covers workflow run admission responses and client retry design. It does not redefine all HTTP error handling in your platform.

The production problem

Run-start clients often retry every non-200 with the same timer.

That policy hides intent. `429` usually means quota pressure. `503` usually means temporary backend inability.

Mix those paths and your client can amplify admission lock contention while still failing to recover fast from short backend outages.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| RFC 6585 (`429 Too Many Requests`) | `429` is for too many requests in a time window and can include `Retry-After` guidance for clients. | No control-plane pattern for distinguishing tenant concurrency ceilings from backend gate failures. |
| RFC 9110 (`503 Service Unavailable`) | `503` indicates temporary server inability to handle requests and can include `Retry-After` estimates. | No implementation detail for lock-backed workflow admission where status depends on store availability. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter, with explicit retryable classes like 500/502/503/504. | No direct mapping for workflow APIs that intentionally emit both `429` and `503` from different admission branches. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Admission lock failure | If lock acquisition fails, workflow run start returns `503 workflow concurrency gate unavailable`. | Clients should treat this as transient gate failure, not tenant quota pressure. |
| Active run count failure | If counting active runs fails, the response is `503 failed to enforce max concurrent runs`. | The signal points to store/gate health, not business-limit exhaustion. |
| Concurrency ceiling reached | If active runs are at the limit, the response is `429 max concurrent runs reached`. | The request is valid but cannot be admitted right now due to the tenant quota window. |
| Retry-After behavior | The workflow start paths above do not set `Retry-After`; the job submit throttle path sets `Retry-After: 30`. | Workflow clients need explicit retry-policy design in SDK or caller logic. |

Status-code map in code

Workflow admission returns `429` and `503` from different branches

core/controlplane/gateway/handlers_workflows.go
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
if err != nil {
	slog.Error("workflow admission lock failed", "org_id", orgID, "error", err)
	writeErrorJSON(w, http.StatusServiceUnavailable, "workflow concurrency gate unavailable") // 503
	return
}

if count, err := s.workflowStore.CountActiveRuns(r.Context(), orgID); err != nil {
	writeErrorJSON(w, http.StatusServiceUnavailable, "failed to enforce max concurrent runs") // 503
	return
} else if count >= limit {
	writeErrorJSON(w, http.StatusTooManyRequests, "max concurrent runs reached") // 429
	return
}
```

Job submit throttle already emits `Retry-After`

core/controlplane/gateway/handlers_jobs.go
```go
// core/controlplane/gateway/handlers_jobs.go (excerpt)
if policyResult.Throttled {
	w.Header().Set("Retry-After", "30")
	writeErrorJSON(w, http.StatusTooManyRequests, reason)
	return
}
```

Concurrency test confirms `429` saturation behavior

core/controlplane/gateway/workflow_runs_test.go
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// 10 parallel start requests
	// assert: only 2 succeed, the others return 429
}
```
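The shape of that test can be reproduced outside the repo with a self-contained simulation: a semaphore-style handler behind `httptest`, hit by parallel requests. The handler and counters below are illustrative stand-ins for the real gateway, not its implementation:

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"sync/atomic"
)

// runAdmissionExperiment models the cited test: a gate with maxConcurrent
// slots, hit by `requests` parallel start calls. Admitted runs never release
// their slot, mirroring runs that stay active for the test's duration.
func runAdmissionExperiment(maxConcurrent, requests int) (admitted, rejected int64) {
	var active int64
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Reserve a slot atomically; over-limit requests get 429.
		if atomic.AddInt64(&active, 1) > int64(maxConcurrent) {
			atomic.AddInt64(&active, -1)
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	var wg sync.WaitGroup
	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL)
			if err != nil {
				return
			}
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				atomic.AddInt64(&admitted, 1)
			} else {
				atomic.AddInt64(&rejected, 1)
			}
		}()
	}
	wg.Wait()
	return admitted, rejected
}
```

With a limit of 2 and 10 parallel requests, exactly two are admitted and the rest are rejected, regardless of goroutine scheduling, because slot reservation is a single atomic increment.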

Validation runbook

Run this before changing retry policy defaults in SDKs or internal clients.

runbook.sh
```bash
# 1) Set max_concurrent_runs=2 in staging
# 2) Issue 50 parallel run-start requests for one tenant
# 3) Record the response histogram (200/429/503)
# 4) Simulate a Redis admission-gate failure and confirm the 503 path
# 5) Compare client behavior under the 429 policy vs the 503 policy
# 6) Validate no duplicate run creation under retries with idempotency keys
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Single retry policy for all 4xx/5xx | Very simple client implementation. | Loses protocol signal; often retries too hard on quota pressure or too weakly on transient backend failures. |
| Separate `429` and `503` retry classes | Closer to HTTP semantics and better control-plane behavior under load. | Requires more client code paths and observability dimensions. |
| Add `Retry-After` for workflow admission responses | Gives callers first-party pacing hints without reverse-engineering backend timing. | Static values can be wrong under variable contention; dynamic estimates add complexity. |
- `Retry-After` alone does not solve client herd behavior unless retries also use jitter.
- Aggressive 503 retries can overload the same backend dependency that already failed.
- Aggressive 429 retries can lock-step with admission limits and waste CPU without improving the success ratio.

Next step

Implement this in one iteration:

1. Split workflow-start retry policies into explicit `429` and `503` branches.
2. Add jittered backoff for both classes, with a stricter cap for `429` loops.
3. Decide whether workflow endpoints should emit `Retry-After` and document the contract.
4. Add dashboards for the `429` vs `503` ratio by tenant and workflow.

Continue with AI Agent Workflow Admission Lock and AI Agent Workflow Idempotency Reservation.

Retry policy is product behavior

If your API emits two failure classes, your client must listen to both. Retries are part of the contract.