The production problem
Most lock bugs are diagnosed at acquire time and caused at renew time.
Acquire succeeds, critical section starts, then renew operations begin failing due to network or storage turbulence. If code keeps running after lease confidence is lost, another worker can enter the same section.
The key question is simple: after repeated renew failures, do you fence and stop, or continue and hope idempotency absorbs the overlap?
Both strategies can be correct depending on workload risk. What fails teams is having no explicit strategy.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: distributed lock extension guidance | TTL windows, lock extension mechanism, and the requirement to limit reacquisition attempts. | No component-level policy template for when renewal attempts fail repeatedly inside business-critical sections. |
| AWS Builders' Library: avoiding fallback in distributed systems | Why rarely exercised fallback modes are risky and can amplify incidents. | No lock-renewal-specific rule set tying thresholds, fencing behavior, and release policy together. |
| etcd API docs: lease keepalive semantics | A lease expires if keepalive does not arrive in time, and attached keys are removed on expiry. | No application policy for what your code should do immediately after keepalive uncertainty begins. |
Cordum runtime mechanics
| Component | Renewal failure policy | Runtime numbers | Risk envelope |
|---|---|---|---|
| Scheduler engine | `maxRenewalFailures = 3`; abandon after 3 consecutive failures; cancel fenced context with `errLockAbandoned`. | Default lock TTL 60s, renewal interval ttl/3, store timeout 2s. | Stronger correctness boundary. Work is fenced when lock ownership cannot be trusted. |
| Workflow engine | Logs `run lock renewal failed` on each failed renew attempt and continues execution. | Run lock TTL 30s, renewal interval ttl/3 (~10s), renew timeout 2s. | Higher continuity. Potential wider race window if lease expires while work continues. |
| Scheduler observability | Increments `scheduler_job_lock_abandoned_total` on abandonment. | Counter emitted via core infra metrics package. | Makes abandonment visible for alerts and rollback decisions. |
| Workflow observability | No dedicated abandonment counter for renewal failures in workflow lock manager path. | Logs and general lock fallback metric exist, but no renewal-fence counter. | Harder to quantify renewal instability versus harmless transient retries. |
Code paths and tests
Scheduler: bounded failures then fence
```go
// core/controlplane/scheduler/engine.go (excerpt)
const (
    storeOpTimeout     = 2 * time.Second
    jobLockTTL         = 60 * time.Second
    maxRenewalFailures = 3
)

var errLockAbandoned = errors.New("job lock abandoned: renewal failed")

// Inside the renewal loop:
if err := e.jobStore.RenewLock(rCtx, key, token, ttl); err != nil {
    consecutiveFailures++
    if consecutiveFailures >= maxRenewalFailures {
        abandoned.Store(true)
        fenceCancel(errLockAbandoned)
        e.metrics.IncJobLockAbandoned()
        return
    }
} else {
    consecutiveFailures = 0 // only consecutive failures count toward abandonment
}

// Skip release after abandonment to avoid dropping a newer owner's lock.
if abandoned.Load() {
    return
}
```

Scheduler test: exactly 3 failed renew attempts
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestWithJobLock_RenewalAbandonAfterConsecutiveFailures(t *testing.T) {
    // alwaysFailRenewStore returns error on each renew
    // expected: errLockAbandoned
    // expected: exactly 3 renewal attempts, then stop
    // expected: lock remains (release skipped after abandonment)
}
```

Workflow: warn-only renew failure path
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
    ticker := time.NewTicker(runLockTTL / 3)
    defer ticker.Stop()
    for {
        select {
        case <-renewCtx.Done():
            return // renewal loop stops only on context cancellation
        case <-ticker.C:
            rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
            if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
                // Warn-only: execution continues even after repeated failures.
                slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
            }
            cancel()
        }
    }
}
```

Workflow tests around renewal
```go
// core/workflow/lock_test.go (excerpt)
func TestDistributedRunLock_Renewal(t *testing.T) {
    // verifies lock key remains alive past original TTL when renew succeeds
}

func TestDistributedRunLock_RenewalStopsOnContextCancel(t *testing.T) {
    // verifies renewal loop stops on context cancellation
}

// no dedicated test today that enforces abandon/fence after N renewal failures
```

Validation runbook
Run this in staging before changing lock renewal thresholds or context cancellation behavior.
```sh
# 1) Track scheduler abandonment events
#    rate(cordum_scheduler_job_lock_abandoned_total[5m])
# 2) Track workflow renewal warnings
#    grep/aggregate: "run lock renewal failed"
# 3) Correlate with duplicate critical-section effects
#    - repeated processing for same run_id/job_id
#    - state transition conflicts shortly after renewal failures
# 4) During incident, classify behavior by component
#    - scheduler path should fence after 3 consecutive failures
#    - workflow path currently continues unless context is canceled
# 5) Decide temporary policy
#    - high-risk workloads: reduce tolerance, prefer fence/stop
#    - low-risk workloads: allow continuity with stronger idempotency checks
# 6) Post-incident action
#    - add explicit renewal-failure SLOs and regression tests for both paths
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fence quickly after repeated renewal failure | Higher correctness under uncertain lock ownership. | Lower continuity during lock backend turbulence. |
| Warn and continue | Better short-term throughput and fewer interrupted flows. | Potential overlap window after lease loss; harder correctness guarantees. |
| Policy by workload risk class | Applies strict fencing only where correctness cost is highest. | More policy and testing complexity across components. |
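The third row, policy by workload risk class, can be captured in a small table-driven type. This is a hedged sketch: `RiskClass`, `RenewalPolicy`, `policyFor`, and the numbers are illustrative, not Cordum defaults.

```go
package main

import "fmt"

// RiskClass buckets workloads by the cost of duplicate execution.
type RiskClass int

const (
    HighRisk RiskClass = iota // e.g. billing, state transitions: fence fast
    LowRisk                   // e.g. idempotent cache refresh: allow continuity
)

// RenewalPolicy captures the knobs the tables above describe.
type RenewalPolicy struct {
    MaxRenewalFailures    int  // consecutive failures tolerated before acting
    FenceOnExhaustion     bool // true: cancel fenced context; false: warn and continue
    SkipReleaseAfterFence bool // avoid dropping a newer owner's lock
}

// policyFor is a hypothetical lookup; a real system would load this
// from per-component configuration.
func policyFor(r RiskClass) RenewalPolicy {
    switch r {
    case HighRisk:
        return RenewalPolicy{
            MaxRenewalFailures:    3,
            FenceOnExhaustion:     true,
            SkipReleaseAfterFence: true,
        }
    default:
        return RenewalPolicy{MaxRenewalFailures: 10, FenceOnExhaustion: false}
    }
}

func main() {
    fmt.Printf("%+v\n", policyFor(HighRisk))
}
```

Keeping the policy in one declared structure, rather than scattered constants per component, is what makes the "testing complexity" downside tractable: each row of the matrix becomes one table-driven test case.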
- The scheduler path has clear fencing semantics and abandonment metric coverage. The workflow path currently prioritizes continuity with weaker lease-loss fencing.
- Workload-specific policy is often the right answer, but it requires explicit documentation and testing to avoid inconsistent behavior across components.
- I found strong scheduler hardening tests for consecutive failures and intermittent failures, but no equivalent workflow test enforcing a bounded-failure abandon policy.
Next step
Implement this next:
1. Define a renewal-failure policy matrix by workload risk (strict fence vs continuity).
2. Add workflow-level metric parity for renewal-failure abandonment, not only warning logs.
3. Add workflow tests that inject repeated `RenewLock` failures and validate chosen policy behavior.
4. Document expected operator response for each lock path in the runbook.
Continue with AI Agent Distributed Lock Fallback and AI Agent Lock Token Ownership.