
AI Agent Lock Release Failure

Release safety is not binary. You need an explicit plan for what happens when release calls fail under load.

Deep Dive · 10 min read · Mar 2026
TL;DR
- Safe lock ownership checks are necessary but not sufficient. Release-failure policy determines how much contention debt accumulates after transient Redis failures.
- The Cordum scheduler retries lock release once with a fresh context, then explicitly falls back to TTL expiry if the retry still fails.
- The Cordum workflow lock manager logs release failure but does not perform a second release attempt in the same path.
- A release failure does not always mean correctness loss, but it can add lock-hold latency that cascades under high queue pressure.
Failure mode

Critical section finished, `ReleaseLock` call fails, and the next worker sees lock-busy even though business work is already done.

Current behavior

The scheduler performs one immediate retry. The workflow path currently emits a warning and relies on TTL cleanup.

Operational payoff

Explicit release policy reduces unnecessary lock-busy retries during short backend turbulence.

Scope

This guide focuses on release-failure handling after critical-section completion, not lock acquisition or renewal policy design.

The production problem

Teams often treat release as a best-effort epilogue. Under load, that assumption creates queue debt.

Work finishes. Release fails due to a transient backend issue. The lock remains until TTL expiry, and subsequent handlers see lock-busy even though the business action already completed.

One extra second of lock debt is cheap. Thousands of keys doing that together can look like a control-plane slowdown.
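To make that queue debt concrete, here is a back-of-envelope estimate. All three inputs — lock volume, failure rate, and residual TTL — are illustrative assumptions, not Cordum measurements:

```go
package main

import "fmt"

// debtKeySeconds estimates extra lock-hold debt per minute when failed
// releases wait out their remaining TTL instead of being retried.
// Inputs are illustrative assumptions, not measured values.
func debtKeySeconds(locksPerMinute, releaseFailRate, residualTTLSeconds float64) float64 {
	return locksPerMinute * releaseFailRate * residualTTLSeconds
}

func main() {
	// 5000 locks/min, 1% transient release failures, 10s of TTL left
	// after the critical section completes.
	fmt.Printf("extra lock-hold debt: %.0f key-seconds per minute\n",
		debtKeySeconds(5000, 0.01, 10))
	// → extra lock-hold debt: 500 key-seconds per minute
}
```

Even a 1% failure rate turns into hundreds of key-seconds of spurious lock-busy per minute, which is what subsequent handlers experience as contention.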

The fix is policy. Decide whether to retry release immediately or rely only on TTL expiry, then test that choice under injected failure.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Redis docs: distributed lock release script | Token-checked release safety and why plain key delete is unsafe. | No runtime policy for what to do when the release RPC fails after successful critical-section execution. |
| AWS Builders' Library: avoiding fallback | Fallback paths are usually less tested and can amplify incident scope. | No lock-specific guidance for balancing release retry cost against lock-hold debt from TTL-only recovery. |
| etcd API docs: lease expiry model | Lease-based ownership eventually expires and attached keys are removed. | No guidance for immediate post-failure behavior when explicit release cannot be confirmed. |

Cordum runtime mechanics

| Component | Current behavior | Runtime numbers | Operational effect |
| --- | --- | --- | --- |
| Scheduler release path | Release call uses a `context.Background()` timeout, retries once on failure, then logs TTL fallback. | Store timeout 2s per attempt, 2 attempts max in this path. | Reduces temporary lock debt from single release blips. |
| Workflow release path | Single release attempt with a 2s context; logs a warning on failure. | One release attempt per lock-holder completion. | Simpler path, but more dependence on TTL cleanup under transient errors. |
| Ownership enforcement | Store-level release uses a compare-and-delete script; mismatch returns `lock not owned`. | No blind key delete allowed in the release API. | Prevents a stale owner from removing a newer owner's lock. |
| Scheduler test coverage | `TestWithJobLockReleaseRetry` asserts the first release fails and the second succeeds. | Verifies at least 2 release calls. | Hardens retry behavior against regression. |

Release paths in code

Scheduler: bounded release retry

core/controlplane/scheduler/engine.go

```go
// core/controlplane/scheduler/engine.go (excerpt)
ctx, cancel := context.WithTimeout(context.Background(), storeOpTimeout)
defer cancel()
if err := e.jobStore.ReleaseLock(ctx, key, token); err != nil {
  slog.Warn("job lock release failed, retrying", "job_id", jobID, "error", err)

  ctx2, cancel2 := context.WithTimeout(context.Background(), storeOpTimeout)
  defer cancel2()
  if err2 := e.jobStore.ReleaseLock(ctx2, key, token); err2 != nil {
    slog.Error("job lock release retry failed, lock will expire via TTL",
      "job_id", jobID, "ttl", ttl, "error", err2)
  }
}
```

Scheduler test coverage

core/controlplane/scheduler/engine_hardening_test.go

```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestWithJobLockReleaseRetry(t *testing.T) {
  // failReleaseLockStore fails first ReleaseLock, succeeds second
  // expect: function succeeds
  // expect: releaseCount >= 2
  // expect: lock not held after retry
}
```

Workflow: single release attempt

core/workflow/engine.go

```go
// core/workflow/engine.go (excerpt)
if redisToken != "" && lm.locker != nil {
  rCtx, rCancel := context.WithTimeout(context.Background(), 2*time.Second)
  if err := lm.locker.ReleaseLock(rCtx, runLockKey(runID), redisToken); err != nil {
    slog.Warn("distributed run lock release failed", "run_id", runID, "error", err)
  }
  rCancel()
}
```

Store-level release ownership check

core/infra/store/job_store.go

```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) ReleaseLock(ctx context.Context, key, token string) error {
  result, err := releaseLockScript.Run(ctx, s.client, []string{key}, token).Int()
  if err != nil {
    return fmt.Errorf("job store release lock %s: %w", key, err)
  }
  if result == 0 {
    return fmt.Errorf("lock not owned")
  }
  return nil
}
```
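The compare-and-delete script behind `releaseLockScript` is conventionally the token-checked pattern from the Redis distributed-lock docs; the exact Lua Cordum embeds is not shown here, so treat the script below as the standard shape. The in-memory `fakeStore` simulates its ownership semantics in pure Go:

```go
package main

import (
	"fmt"
	"sync"
)

// Standard token-checked release script (Redis docs pattern); the exact
// script Cordum embeds may differ.
const releaseLockLua = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`

// fakeStore simulates the script's semantics in memory.
type fakeStore struct {
	mu    sync.Mutex
	locks map[string]string // key -> owner token
}

// compareAndDelete returns 1 if the token owned the key and it was deleted,
// 0 otherwise — mirroring the Lua script's return values.
func (s *fakeStore) compareAndDelete(key, token string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks[key] == token {
		delete(s.locks, key)
		return 1
	}
	return 0
}

func main() {
	s := &fakeStore{locks: map[string]string{"job:42": "token-B"}}
	// A stale owner (token-A) cannot remove the newer owner's lock.
	fmt.Println("stale release:", s.compareAndDelete("job:42", "token-A")) // → 0
	// The current owner can.
	fmt.Println("owner release:", s.compareAndDelete("job:42", "token-B")) // → 1
}
```

This is why retrying a failed release with the original token is safe: if the lock has since expired and been re-acquired by someone else, the retry returns `lock not owned` instead of deleting the new owner's lock.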

Validation runbook

Validate release-failure behavior in staging before tuning lock TTLs or increasing worker concurrency.

lock-release-runbook.sh

```bash
#!/usr/bin/env bash
# Log file paths below are illustrative — adjust to your deployment.

# 1) Track release-failure logs by component
grep -c "job lock release failed, retrying" scheduler.log        # scheduler
grep -c "distributed run lock release failed" workflow.log       # workflow

# 2) Measure lock-busy debt after release failures
#    Count retries returning lock-busy for the same key/run shortly
#    after each failure.

# 3) Verify retry effectiveness in the scheduler path
#    A first release failure should be followed by a successful
#    second release.

# 4) Check TTL fallback events — these should remain rare
grep -c "lock will expire via TTL" scheduler.log

# 5) Incident threshold example: if the release-failure rate spikes and
#    lock-busy latency rises, prioritize lock backend health and
#    temporarily reduce concurrency.

# 6) Post-incident hardening: decide whether the workflow path should
#    add bounded release retry for parity with the scheduler.
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Single release attempt + TTL cleanup | Simpler control flow and fewer backend calls. | Can increase temporary lock-busy debt after transient failures. |
| Bounded release retry (scheduler style) | Recovers quickly from one-off release blips. | Extra release RPC and more logic branches to test. |
| Aggressive indefinite retry | May minimize TTL wait in the best case. | Risk of long-tail latency and control-path blocking. |
- Scheduler release retry is tested and pragmatic for transient failures, but it is still bounded and can fall back to TTL debt.
- The workflow release path is simpler, yet can accumulate lock-busy latency during backend turbulence.
- There are concrete scheduler tests for release retry, but no equivalent workflow test enforcing any retry policy in this path.

Next step

Implement this next:

  1. Define per-component release retry policy and document why they differ.
  2. Add a workflow release retry hardening test similar to the scheduler's `TestWithJobLockReleaseRetry`.
  3. Add a lock-debt dashboard: release-failure rate plus lock-busy latency impact.
  4. Rehearse release-path failure drills before raising throughput limits.
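Step 2 could start from a test double shaped like the scheduler's. This is a hypothetical sketch — `fakeLocker` and `releaseRunLock` are assumed names, and the retry loop does not exist in the workflow path today:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeLocker fails its first ReleaseLock call and succeeds afterwards,
// mirroring the scheduler's failReleaseLockStore test double.
type fakeLocker struct{ calls int }

func (f *fakeLocker) ReleaseLock(ctx context.Context, key, token string) error {
	f.calls++
	if f.calls == 1 {
		return errors.New("transient redis error")
	}
	return nil
}

// releaseRunLock is the hypothetical hardened workflow path: one bounded
// retry, then TTL fallback — matching the scheduler's policy.
func releaseRunLock(l *fakeLocker, key, token string) error {
	ctx := context.Background()
	if err := l.ReleaseLock(ctx, key, token); err != nil {
		return l.ReleaseLock(ctx, key, token) // single bounded retry
	}
	return nil
}

func main() {
	l := &fakeLocker{}
	err := releaseRunLock(l, "run:abc", "token-1")
	fmt.Println("err:", err, "release calls:", l.calls)
	// → err: <nil> release calls: 2
}
```

The assertions to port from `TestWithJobLockReleaseRetry` are the same three: the call succeeds, at least two release attempts were made, and the lock is no longer held afterwards.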

Continue with AI Agent Lock Renewal Failure Policy and AI Agent Lock Token Ownership.

Release policy is capacity policy

If release failures silently turn into long TTL waits, your queue eventually feels it. Make this behavior explicit and measurable.