Deep Dive

AI Agent Lock Renewal Failure Policy

Renewal failure is where lock correctness usually breaks. Policy determines whether you keep processing or fence the critical section.

12 min read · Mar 2026
TL;DR
- A healthy acquire path does not guarantee a safe critical section. Renewal failure policy decides what happens after lock ownership uncertainty begins.
- The Cordum scheduler abandons the lock after 3 consecutive renewal failures and fences work with a canceled context cause.
- The Cordum workflow engine currently logs renewal failure and keeps running; this favors continuity but leaves a wider race window after lease loss.
- If you do not define renewal-failure policy per component, you get different correctness behavior under the same outage pattern.
Failure mode

Renew calls start failing during a network event. Work continues past lease validity and a second actor may enter the same critical section.

Current state

Scheduler and workflow engines handle renewal failures differently today.

Operational payoff

A documented policy reduces surprise behavior and shortens incident triage when lock anomalies appear.

Scope

This piece compares renewal-failure policy in Cordum scheduler and workflow locking paths. It does not re-explain basic lease locking.

The production problem

Most lock bugs are diagnosed at acquire time and caused at renew time.

Acquire succeeds, critical section starts, then renew operations begin failing due to network or storage turbulence. If code keeps running after lease confidence is lost, another worker can enter the same section.

The key question is simple: after repeated renew failures, do you fence and stop, or continue and hope idempotency absorbs the overlap?

Both strategies can be correct depending on workload risk. What fails teams is having no explicit strategy.

What top results cover and miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: distributed lock extension guidance | TTL windows, lock extension mechanism, and the requirement to limit reacquisition attempts. | No component-level policy template for when renewal attempts fail repeatedly inside business-critical sections. |
| AWS Builders' Library: avoiding fallback in distributed systems | Why rarely exercised fallback modes are risky and can amplify incidents. | No lock-renewal-specific rule set tying thresholds, fencing behavior, and release policy together. |
| etcd API docs: lease keepalive semantics | A lease expires if keepalive does not arrive in time, and attached keys are removed on expiry. | No application policy for what your code should do immediately after keepalive uncertainty begins. |

Cordum runtime mechanics

| Component | Renewal failure policy | Runtime numbers | Risk envelope |
|---|---|---|---|
| Scheduler engine | `maxRenewalFailures = 3`; abandon after 3 consecutive failures; cancel fenced context with `errLockAbandoned`. | Default lock TTL 60s, renewal interval ttl/3, store timeout 2s. | Stronger correctness boundary. Work is fenced when lock ownership cannot be trusted. |
| Workflow engine | Logs `run lock renewal failed` on each failed renew attempt and continues execution. | Run lock TTL 30s, renewal interval ttl/3 (~10s), renew timeout 2s. | Higher continuity. Potentially wider race window if the lease expires while work continues. |
| Scheduler observability | Increments `scheduler_job_lock_abandoned_total` on abandonment. | Counter emitted via the core infra metrics package. | Makes abandonment visible for alerts and rollback decisions. |
| Workflow observability | No dedicated abandonment counter for renewal failures in the workflow lock manager path. | Logs and a general lock fallback metric exist, but no renewal-fence counter. | Harder to quantify renewal instability versus harmless transient retries. |

Code paths and tests

Scheduler: bounded failures then fence

core/controlplane/scheduler/engine.go
```go
// core/controlplane/scheduler/engine.go (excerpt)
const (
  storeOpTimeout     = 2 * time.Second
  jobLockTTL         = 60 * time.Second
  maxRenewalFailures = 3
)

var errLockAbandoned = errors.New("job lock abandoned: renewal failed")

// Inside the renewal goroutine:
if err := e.jobStore.RenewLock(rCtx, key, token, ttl); err != nil {
  consecutiveFailures++
  if consecutiveFailures >= maxRenewalFailures {
    abandoned.Store(true)
    fenceCancel(errLockAbandoned)
    e.metrics.IncJobLockAbandoned()
    return
  }
} else {
  consecutiveFailures = 0 // only consecutive failures trigger abandonment
}

// Skip release after abandonment to avoid dropping a newer owner's lock.
if abandoned.Load() {
  return
}
```

Scheduler test: exactly 3 failed renew attempts

core/controlplane/scheduler/engine_hardening_test.go
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestWithJobLock_RenewalAbandonAfterConsecutiveFailures(t *testing.T) {
  // alwaysFailRenewStore returns an error on each renew.
  // expected: errLockAbandoned
  // expected: exactly 3 renewal attempts, then stop
  // expected: lock remains (release skipped after abandonment)
}
```

Workflow: warn-only renew failure path

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
  ticker := time.NewTicker(runLockTTL / 3)
  defer ticker.Stop()
  for {
    select {
    case <-renewCtx.Done():
      // The loop's only exit: context cancellation, not renewal failure.
      return
    case <-ticker.C:
      rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
      if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
        // Warn-only: execution continues even after repeated failures.
        slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
      }
      cancel()
    }
  }
}
```

Workflow tests around renewal

core/workflow/lock_test.go
```go
// core/workflow/lock_test.go (excerpt)
func TestDistributedRunLock_Renewal(t *testing.T) {
  // verifies the lock key stays alive past the original TTL while renew succeeds
}

func TestDistributedRunLock_RenewalStopsOnContextCancel(t *testing.T) {
  // verifies the renewal loop stops on context cancellation
}

// No dedicated test today enforces abandon/fence after N renewal failures.
```

Validation runbook

Run this in staging before changing lock renewal thresholds or context cancellation behavior.

lock-renewal-policy-runbook.sh
```bash
# 1) Track scheduler abandonment events
#    rate(cordum_scheduler_job_lock_abandoned_total[5m])

# 2) Track workflow renewal warnings
#    grep/aggregate: "run lock renewal failed"

# 3) Correlate with duplicate critical-section effects
#    - repeated processing for the same run_id/job_id
#    - state transition conflicts shortly after renewal failures

# 4) During an incident, classify behavior by component
#    - scheduler path should fence after 3 consecutive failures
#    - workflow path currently continues unless its context is canceled

# 5) Decide temporary policy
#    - high-risk workloads: reduce tolerance, prefer fence/stop
#    - low-risk workloads: allow continuity with stronger idempotency checks

# 6) Post-incident action
#    - add explicit renewal-failure SLOs and regression tests for both paths
```

Limitations and tradeoffs

| Approach | Upside | Downside |
|---|---|---|
| Fence quickly after repeated renewal failure | Higher correctness under uncertain lock ownership. | Lower continuity during lock backend turbulence. |
| Warn and continue | Better short-term throughput and fewer interrupted flows. | Potential overlap window after lease loss; harder correctness guarantees. |
| Policy by workload risk class | Applies strict fencing only where the correctness cost is highest. | More policy and testing complexity across components. |
- The scheduler path has clear fencing semantics and abandonment metric coverage. The workflow path currently prioritizes continuity with weaker lease-loss fencing.
- Workload-specific policy is often the right answer, but it requires explicit documentation and testing to avoid inconsistent behavior across components.
- I found strong scheduler hardening tests for consecutive and intermittent failures, but no equivalent workflow test enforcing a bounded-failure abandon policy.

Next step

Implement this next:

  1. Define a renewal-failure policy matrix by workload risk (strict fence vs continuity).
  2. Add workflow-level metric parity for renewal-failure abandonment, not only warning logs.
  3. Add workflow tests that inject repeated `RenewLock` failures and validate the chosen policy behavior.
  4. Document the expected operator response for each lock path in the runbook.

Continue with AI Agent Distributed Lock Fallback and AI Agent Lock Token Ownership.

Policy beats hope

Renewal failure handling is one of those design choices that looks optional until the first multi-replica outage. Make it explicit before that day.