## The production problem
Clients that start workflow runs often retry every non-200 response on the same fixed timer. That uniform policy hides intent: `429` usually signals quota pressure, while `503` usually signals temporary backend inability. Mix those paths and a client can amplify admission-lock contention while still recovering too slowly from short backend outages.
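The anti-pattern looks like this. A minimal sketch: `naiveRetry` and the injected `startRun` function are hypothetical stand-ins for illustration, not Cordum SDK code, and the response sequence is invented.

```go
package main

import (
	"fmt"
	"time"
)

// naiveRetry models the anti-pattern: every non-200 gets the same fixed
// delay and the same attempt budget, whether the server said 429 or 503.
// startRun is a hypothetical stand-in for the run-start call, returning
// only a status code for illustration.
func naiveRetry(startRun func() int, maxAttempts int, delay time.Duration) (status, attempts int) {
	for attempts = 1; attempts <= maxAttempts; attempts++ {
		status = startRun()
		if status == 200 {
			return
		}
		time.Sleep(delay) // identical pacing for quota pressure and gate outages
	}
	attempts = maxAttempts
	return
}

func main() {
	// An invented sequence: a tenant at its ceiling, then a transient gate
	// failure, then success. The client cannot tell which branch it waited on.
	responses := []int{429, 429, 503, 200}
	i := 0
	startRun := func() int { s := responses[i]; i++; return s }
	status, attempts := naiveRetry(startRun, 5, time.Millisecond)
	fmt.Println(status, attempts)
}
```

The fixed `delay` is exactly the lock-step behavior the rest of this section argues against.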
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RFC 6585 (`429 Too Many Requests`) | `429` is for too many requests in a time window and can include `Retry-After` guidance for clients. | No control-plane pattern for distinguishing tenant concurrency ceilings from backend gate failures. |
| RFC 9110 (`503 Service Unavailable`) | `503` indicates temporary server inability to handle requests and can include `Retry-After` estimates. | No implementation detail for lock-backed workflow admission where status depends on store availability. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter, with explicit retryable classes like 500/502/503/504. | No direct mapping for workflow APIs that intentionally emit both `429` and `503` from different admission branches. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Admission lock failure | If lock acquisition fails, workflow run start returns `503 workflow concurrency gate unavailable`. | Client should treat this as transient gate failure, not tenant quota pressure. |
| Active run count failure | If counting active runs fails, response is `503 failed to enforce max concurrent runs`. | Signal points to store/gate health, not business-limit exhaustion. |
| Concurrency ceiling reached | If active runs are at limit, response is `429 max concurrent runs reached`. | Request is valid but cannot be admitted right now due to tenant quota window. |
| Retry-After behavior | Workflow start paths above do not set `Retry-After`; job submit throttle path sets `Retry-After: 30`. | Workflow clients need explicit retry policy design in SDK or caller logic. |
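Given that mapping, a client can classify each run-start response before choosing a retry branch. The class names and the catch-all treatment of other 5xx codes are assumptions for this sketch, not part of any Cordum SDK.

```go
package main

import "fmt"

// retryClass is an illustrative client-side taxonomy for run-start responses.
type retryClass int

const (
	noRetry    retryClass = iota // 2xx or permanent 4xx: do not retry
	quotaRetry                   // 429: tenant ceiling reached, wait for a slot to free
	gateRetry                    // 503: admission gate or store unhealthy, transient failure
)

// classify maps the run-start status codes from the table above to a
// retry class, keeping quota pressure and gate failure on separate paths.
func classify(status int) retryClass {
	switch {
	case status == 429:
		return quotaRetry
	case status >= 500:
		return gateRetry // 503 and other 5xx: treat as transient by default (assumption)
	default:
		return noRetry
	}
}

func main() {
	for _, s := range []int{200, 429, 503} {
		fmt.Println(s, classify(s))
	}
}
```

The point is not the enum itself but that the two classes get different backoff parameters and different dashboards downstream.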
## Status-code map in code

### Workflow admission returns `429` and `503` from different branches
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
if err != nil {
	slog.Error("workflow admission lock failed", "org_id", orgID, "error", err)
	writeErrorJSON(w, http.StatusServiceUnavailable, "workflow concurrency gate unavailable") // 503
	return
}
if count, err := s.workflowStore.CountActiveRuns(r.Context(), orgID); err != nil {
	writeErrorJSON(w, http.StatusServiceUnavailable, "failed to enforce max concurrent runs") // 503
	return
} else if count >= limit {
	writeErrorJSON(w, http.StatusTooManyRequests, "max concurrent runs reached") // 429
	return
}
```

### Job submit throttle already emits `Retry-After`
```go
// core/controlplane/gateway/handlers_jobs.go (excerpt)
if policyResult.Throttled {
	w.Header().Set("Retry-After", "30")
	writeErrorJSON(w, http.StatusTooManyRequests, reason)
	return
}
```

### Concurrency test confirms `429` saturation behavior
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// 10 parallel start requests
	// assert: only 2 succeed, others return 429
}
```

## Validation runbook
Run this before changing retry policy defaults in SDKs or internal clients.
1. Set `max_concurrent_runs=2` in staging.
2. Issue 50 parallel run-start requests for one tenant.
3. Record the response histogram (200/429/503).
4. Simulate a Redis admission-gate failure and confirm the 503 path.
5. Compare client behavior under the 429 policy vs the 503 policy.
6. Validate no duplicate run creation under retries with idempotency keys.
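Steps 2 and 3 can be rehearsed locally before touching staging. Everything below is an assumption for illustration: the fake gateway handler stands in for Cordum's admission path, with a hard-coded limit of 2 and no run completion.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"sync"
	"sync/atomic"
)

// newFakeGateway is a stand-in admission handler with a fixed concurrency
// limit; admitted runs are never released, modeling long-lived runs.
func newFakeGateway(limit int64) *httptest.Server {
	var active int64
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if atomic.AddInt64(&active, 1) > limit {
			atomic.AddInt64(&active, -1)
			w.WriteHeader(http.StatusTooManyRequests) // 429: ceiling reached
			return
		}
		w.WriteHeader(http.StatusOK) // admitted
	}))
}

// histogram fires n parallel run-start requests and tallies status codes,
// mirroring steps 2 and 3 of the runbook.
func histogram(url string, n int) map[int]int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	counts := map[int]int{}
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Post(url, "application/json", nil)
			if err != nil {
				return
			}
			resp.Body.Close()
			mu.Lock()
			counts[resp.StatusCode]++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return counts
}

func main() {
	srv := newFakeGateway(2)
	defer srv.Close()
	counts := histogram(srv.URL, 10)
	codes := []int{}
	for c := range counts {
		codes = append(codes, c)
	}
	sort.Ints(codes)
	for _, c := range codes {
		fmt.Println(c, counts[c]) // expect 2 admissions and 8 rejections
	}
}
```

Against the real staging gateway, the same `histogram` helper gives the 200/429/503 breakdown the runbook asks for.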
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Single retry policy for all 4xx/5xx | Very simple client implementation. | Loses protocol signal; often retries too hard on quota pressure or too weakly on transient backend failures. |
| Separate `429` and `503` retry classes | Closer to HTTP semantics and better control-plane behavior under load. | Requires more client code paths and observability dimensions. |
| Add `Retry-After` for workflow admission responses | Gives callers first-party pacing hints without reverse-engineering backend timing. | Static values can be wrong under variable contention; dynamic estimates add complexity. |
- `Retry-After` alone does not solve client herd behavior unless retries also use jitter.
- Aggressive 503 retries can overload the same backend dependency that already failed.
- Aggressive 429 retries can lock-step with admission limits and waste CPU without improving success ratio.
## Next step
Implement this in one iteration:
1. Split workflow-start retry policies into explicit `429` and `503` branches.
2. Add jittered backoff for both classes, with a stricter cap for `429` loops.
3. Decide whether workflow endpoints should emit `Retry-After` and document the contract.
4. Add dashboards for `429` vs `503` ratio by tenant and workflow.
Continue with AI Agent Workflow Admission Lock and AI Agent Workflow Idempotency Reservation.