The production problem
Teams often treat release as a best-effort epilogue. Under load, that assumption creates queue debt.
Work finishes. Release fails due to a transient backend issue. The lock remains until TTL expiry, and subsequent handlers see lock-busy even though the business action already completed.
One extra second of lock debt is cheap. Thousands of keys doing that together can look like a control-plane slowdown.
The fix is policy. Decide whether to retry release immediately or rely only on TTL expiry, then test that choice under failure.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: distributed lock release script | Token-checked release safety and why plain key delete is unsafe. | No runtime policy for what to do when release RPC fails after successful critical-section execution. |
| AWS Builders' Library: avoiding fallback | Fallback paths are usually less tested and can amplify incident scope. | No lock-specific guidance for balancing release retry cost against lock-hold debt from TTL-only recovery. |
| etcd API docs: lease expiry model | Lease-based ownership eventually expires and attached keys are removed. | No guidance for immediate post-failure behavior when explicit release cannot be confirmed. |
Cordum runtime mechanics
| Component | Current behavior | Runtime numbers | Operational effect |
|---|---|---|---|
| Scheduler release path | Release call uses `context.Background()` timeout, retries once on failure, then logs TTL fallback. | Store timeout 2s per attempt, 2 attempts max in this path. | Reduces temporary lock debt from single release blips. |
| Workflow release path | Single release attempt with 2s context; logs warning on failure. | One release attempt per lock-holder completion. | Simpler path, but more dependence on TTL cleanup under transient errors. |
| Ownership enforcement | Store-level release uses compare-and-delete script; mismatch returns `lock not owned`. | No blind key delete allowed in release API. | Prevents stale owner from removing newer owner's lock. |
| Scheduler test coverage | `TestWithJobLockReleaseRetry` asserts first release fails and second succeeds. | Verifies at least 2 release calls. | Hardens retry behavior against regression. |
Release paths in code
Scheduler: bounded release retry
```go
// core/controlplane/scheduler/engine.go (excerpt)
ctx, cancel := context.WithTimeout(context.Background(), storeOpTimeout)
defer cancel()
if err := e.jobStore.ReleaseLock(ctx, key, token); err != nil {
	slog.Warn("job lock release failed, retrying", "job_id", jobID, "error", err)
	ctx2, cancel2 := context.WithTimeout(context.Background(), storeOpTimeout)
	defer cancel2()
	if err2 := e.jobStore.ReleaseLock(ctx2, key, token); err2 != nil {
		slog.Error("job lock release retry failed, lock will expire via TTL",
			"job_id", jobID, "ttl", ttl, "error", err2)
	}
}
```

Scheduler test coverage
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestWithJobLockReleaseRetry(t *testing.T) {
	// failReleaseLockStore fails the first ReleaseLock, succeeds on the second
	// expect: function succeeds
	// expect: releaseCount >= 2
	// expect: lock not held after retry
}
```

Workflow: single release attempt
```go
// core/workflow/engine.go (excerpt)
if redisToken != "" && lm.locker != nil {
	rCtx, rCancel := context.WithTimeout(context.Background(), 2*time.Second)
	if err := lm.locker.ReleaseLock(rCtx, runLockKey(runID), redisToken); err != nil {
		slog.Warn("distributed run lock release failed", "run_id", runID, "error", err)
	}
	rCancel()
}
```

Store-level release ownership check
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) ReleaseLock(ctx context.Context, key, token string) error {
	result, err := releaseLockScript.Run(ctx, s.client, []string{key}, token).Int()
	if err != nil {
		return fmt.Errorf("job store release lock %s: %w", key, err)
	}
	if result == 0 {
		return fmt.Errorf("lock not owned")
	}
	return nil
}
```

Validation runbook
Validate release-failure behavior in staging before tuning lock TTLs or increasing worker concurrency.
```
# 1) Track release-failure logs by component
#    scheduler: "job lock release failed, retrying"
#    workflow:  "distributed run lock release failed"
# 2) Measure lock-busy debt after release failures
#    - count retries returning lock-busy for the same key/run shortly after failures
# 3) Verify retry effectiveness in scheduler path
#    - first release failure followed by successful second release
# 4) Check TTL fallback events
#    - "lock will expire via TTL" should remain rare
# 5) Incident threshold example
#    if release-failure rate spikes and lock-busy latency rises,
#    prioritize lock backend health and temporarily reduce concurrency
# 6) Post-incident hardening
#    decide whether workflow path should add bounded release retry parity
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Single release attempt + TTL cleanup | Simpler control flow and fewer backend calls. | Can increase temporary lock-busy debt after transient failures. |
| Bounded release retry (scheduler style) | Recovers quickly from one-off release blips. | Extra release RPC and more logic branches to test. |
| Aggressive indefinite retry | May minimize TTL wait in best case. | Risk of long tail latency and control-path blocking. |
- Scheduler release retry is tested and pragmatic for transient failures, but it is still bounded and can fall back to TTL debt.
- Workflow release path is simpler, yet can accumulate lock-busy latency during backend turbulence.
- I found concrete scheduler tests for release retry, but no equivalent workflow test enforcing any retry policy in this path.
Next step
Implement this next:
1. Define per-component release retry policy and document why they differ.
2. Add a workflow release retry hardening test similar to scheduler `TestWithJobLockReleaseRetry`.
3. Add a lock-debt dashboard: release-failure rate plus lock-busy latency impact.
4. Rehearse release-path failure drills before raising throughput limits.
Continue with AI Agent Lock Renewal Failure Policy and AI Agent Lock Token Ownership.