## The production problem
Clients that start workflow runs often retry every non-200 response on the same fixed timer. That uniform policy hides intent: `429` usually signals quota pressure, while `503` usually signals temporary backend inability. Mix those paths and a client can amplify admission-lock contention while still recovering too slowly from short backend outages.
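The anti-pattern looks like this. A minimal sketch: `naiveRetry` and the injected `startRun` function are hypothetical stand-ins for illustration, not Cordum SDK code, and the response sequence is invented.

```go
package main

import (
	"fmt"
	"time"
)

// naiveRetry models the anti-pattern: every non-200 gets the same fixed
// delay and the same attempt budget, whether the server said 429 or 503.
// startRun is a hypothetical stand-in for the run-start call, returning
// only a status code for illustration.
func naiveRetry(startRun func() int, maxAttempts int, delay time.Duration) (status, attempts int) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		status = startRun()
		if status == 200 {
			return
		}
		time.Sleep(delay) // identical pacing for quota pressure and gate outages
	}
	attempts = maxAttempts
	return
}

func main() {
	// An invented sequence: a tenant at its ceiling, then a transient gate
	// failure, then success. The client cannot tell which branch it waited on.
	responses := []int{429, 429, 503, 200}
	i := 0
	startRun := func() int { s := responses[i]; i++; return s }
	status, attempts := naiveRetry(startRun, 5, time.Millisecond)
	fmt.Println(status, attempts)
}
```

The fixed `delay` is exactly the lock-step behavior the rest of this section argues against.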
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RFC 6585 (`429 Too Many Requests`) | `429` is for too many requests in a time window and can include `Retry-After` guidance for clients. | No control-plane pattern for distinguishing tenant concurrency ceilings from backend gate failures. |
| RFC 9110 (`503 Service Unavailable`) | `503` indicates temporary server inability to handle requests and can include `Retry-After` estimates. | No implementation detail for lock-backed workflow admission where status depends on store availability. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter, with explicit retryable classes like 500/502/503/504. | No direct mapping for workflow APIs that intentionally emit both `429` and `503` from different admission branches. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Admission lock failure | If lock acquisition fails, workflow run start returns `503 workflow concurrency gate unavailable`. | Client should treat this as transient gate failure, not tenant quota pressure. |
| Active run count failure | If counting active runs fails, response is `503 failed to enforce max concurrent runs`. | Signal points to store/gate health, not business-limit exhaustion. |
| Concurrency ceiling reached | If active runs are at limit, response is `429 max concurrent runs reached`. | Request is valid but cannot be admitted right now due to tenant quota window. |
| Retry-After behavior | Workflow start paths above do not set `Retry-After`; job submit throttle path sets `Retry-After: 30`. | Workflow clients need explicit retry policy design in SDK or caller logic. |
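Given that mapping, a client can classify each run-start response before choosing a retry branch. The class names and the catch-all treatment of other 5xx codes are assumptions for this sketch, not part of any Cordum SDK.

```go
package main

import "fmt"

// retryClass is an illustrative client-side taxonomy for run-start responses.
type retryClass int

const (
	noRetry    retryClass = iota // 2xx or permanent 4xx: do not retry
	quotaRetry                   // 429: tenant ceiling reached, wait for a slot to free
	gateRetry                    // 503: admission gate or store unhealthy, transient failure
)

// classify maps the run-start status codes from the table above to a
// retry class, keeping quota pressure and gate failure on separate paths.
func classify(status int) retryClass {
	switch {
	case status == 429:
		return quotaRetry
	case status >= 500:
		return gateRetry // 503 and other 5xx: treat as transient by default (assumption)
	default:
		return noRetry
	}
}

func main() {
	for _, s := range []int{200, 429, 503} {
		fmt.Println(s, classify(s))
	}
}
```

The point is not the enum itself but that the two classes get different backoff parameters and different dashboards downstream.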
## Status-code map in code

### Workflow admission returns `429` and `503` from different branches
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
if err != nil {
	slog.Error("workflow admission lock failed", "org_id", orgID, "error", err)
	writeErrorJSON(w, http.StatusServiceUnavailable, "workflow concurrency gate unavailable") // 503
	return
}
if count, err := s.workflowStore.CountActiveRuns(r.Context(), orgID); err != nil {
	writeErrorJSON(w, http.StatusServiceUnavailable, "failed to enforce max concurrent runs") // 503
	return
} else if count >= limit {
	writeErrorJSON(w, http.StatusTooManyRequests, "max concurrent runs reached") // 429
	return
}
```

### Job submit throttle already emits `Retry-After`
```go
// core/controlplane/gateway/handlers_jobs.go (excerpt)
if policyResult.Throttled {
	w.Header().Set("Retry-After", "30")
	writeErrorJSON(w, http.StatusTooManyRequests, reason)
	return
}
```

### Concurrency test confirms `429` saturation behavior
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// 10 parallel start requests
	// assert: only 2 succeed, others return 429
}
```

## Validation runbook
Run this before changing retry policy defaults in SDKs or internal clients.
1. Set `max_concurrent_runs=2` in staging.
2. Issue 50 parallel run-start requests for one tenant.
3. Record the response histogram (200/429/503).
4. Simulate a Redis admission-gate failure and confirm the 503 path.
5. Compare client behavior under the 429 policy vs the 503 policy.
6. Validate no duplicate run creation under retries with idempotency keys.
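Steps 2 and 3 can be rehearsed locally before touching staging. Everything below is an assumption for illustration: the fake gateway handler stands in for Cordum's admission path, with a hard-coded limit of 2 and no run completion.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
	"sync/atomic"
)

// newFakeGateway is a stand-in admission handler with a fixed concurrency
// limit; admitted runs are never released, modeling long-lived runs.
func newFakeGateway(limit int64) *httptest.Server {
	var active int64
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&active, 1) > limit {
			atomic.AddInt64(&active, -1)
			w.WriteHeader(http.StatusTooManyRequests) // 429: ceiling reached
			return
		}
		w.WriteHeader(http.StatusOK) // admitted
	}))
}

// histogram fires n parallel run-start requests and tallies status codes,
// mirroring steps 2 and 3 of the runbook.
func histogram(url string, n int) map[int]int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counts := map[int]int{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Post(url, "application/json", nil)
			if err != nil {
				return
			}
			resp.Body.Close()
			mu.Lock()
			counts[resp.StatusCode]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	srv := newFakeGateway(2)
	defer srv.Close()
	counts := histogram(srv.URL, 10)
	codes := []int{}
	for c := range counts {
		codes = append(codes, c)
	}
	sort.Ints(codes)
	for _, c := range codes {
		fmt.Println(c, counts[c]) // expect 2 admissions and 8 rejections
	}
}
```

Against the real staging gateway, the same `histogram` helper gives the 200/429/503 breakdown the runbook asks for.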
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Single retry policy for all 4xx/5xx | Very simple client implementation. | Loses protocol signal; often retries too hard on quota pressure or too weakly on transient backend failures. |
| Separate `429` and `503` retry classes | Closer to HTTP semantics and better control-plane behavior under load. | Requires more client code paths and observability dimensions. |
| Add `Retry-After` for workflow admission responses | Gives callers first-party pacing hints without reverse-engineering backend timing. | Static values can be wrong under variable contention; dynamic estimates add complexity. |
- `Retry-After` alone does not solve client herd behavior unless retries also use jitter.
- Aggressive 503 retries can overload the same backend dependency that already failed.
- Aggressive 429 retries can lock-step with admission limits and waste CPU without improving success ratio.
## Next step
Implement this in one iteration:
1. Split workflow-start retry policies into explicit `429` and `503` branches.
2. Add jittered backoff for both classes, with a stricter cap for `429` loops.
3. Decide whether workflow endpoints should emit `Retry-After` and document the contract.
4. Add dashboards for `429` vs `503` ratio by tenant and workflow.
Continue with AI Agent Workflow Admission Lock and AI Agent Workflow Idempotency Reservation.