The production problem
Most lock bugs are diagnosed at acquire time and caused at renew time.
Acquire succeeds, critical section starts, then renew operations begin failing due to network or storage turbulence. If code keeps running after lease confidence is lost, another worker can enter the same section.
The key question is simple: after repeated renew failures, do you fence and stop, or continue and hope idempotency absorbs the overlap?
Both strategies can be correct depending on workload risk. What fails teams is having no explicit strategy.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: distributed lock extension guidance | TTL windows, lock extension mechanism, and the requirement to limit reacquisition attempts. | No component-level policy template for when renewal attempts fail repeatedly inside business-critical sections. |
| AWS Builders' Library: avoiding fallback in distributed systems | Why rarely exercised fallback modes are risky and can amplify incidents. | No lock-renewal-specific rule set tying thresholds, fencing behavior, and release policy together. |
| etcd API docs: lease keepalive semantics | A lease expires if keepalive does not arrive in time, and attached keys are removed on expiry. | No application policy for what your code should do immediately after keepalive uncertainty begins. |
Cordum runtime mechanics
| Component | Renewal failure policy | Runtime numbers | Risk envelope |
|---|---|---|---|
| Scheduler engine | `maxRenewalFailures = 3`; abandon after 3 consecutive failures; cancel fenced context with `errLockAbandoned`. | Default lock TTL 60s, renewal interval ttl/3, store timeout 2s. | Stronger correctness boundary. Work is fenced when lock ownership cannot be trusted. |
| Workflow engine | Logs `run lock renewal failed` on each failed renew attempt and continues execution. | Run lock TTL 30s, renewal interval ttl/3 (~10s), renew timeout 2s. | Higher continuity. Potential wider race window if lease expires while work continues. |
| Scheduler observability | Increments `scheduler_job_lock_abandoned_total` on abandonment. | Counter emitted via core infra metrics package. | Makes abandonment visible for alerts and rollback decisions. |
| Workflow observability | No dedicated abandonment counter for renewal failures in workflow lock manager path. | Logs and general lock fallback metric exist, but no renewal-fence counter. | Harder to quantify renewal instability versus harmless transient retries. |
Code paths and tests
Scheduler: bounded failures then fence
```go
// core/controlplane/scheduler/engine.go (excerpt)
const (
    storeOpTimeout     = 2 * time.Second
    jobLockTTL         = 60 * time.Second
    maxRenewalFailures = 3
)

var errLockAbandoned = errors.New("job lock abandoned: renewal failed")

// Inside the renewal loop:
if err := e.jobStore.RenewLock(rCtx, key, token, ttl); err != nil {
    consecutiveFailures++
    if consecutiveFailures >= maxRenewalFailures {
        abandoned.Store(true)
        fenceCancel(errLockAbandoned)
        e.metrics.IncJobLockAbandoned()
        return
    }
} else {
    consecutiveFailures = 0 // only consecutive failures count toward abandonment
}

// Skip release after abandonment to avoid dropping a newer owner's lock.
if abandoned.Load() {
    return
}
```

Scheduler test: exactly 3 failed renew attempts
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestWithJobLock_RenewalAbandonAfterConsecutiveFailures(t *testing.T) {
    // alwaysFailRenewStore returns error on each renew
    // expected: errLockAbandoned
    // expected: exactly 3 renewal attempts, then stop
    // expected: lock remains (release skipped after abandonment)
}
```

Workflow: warn-only renew failure path
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
    ticker := time.NewTicker(runLockTTL / 3)
    defer ticker.Stop()
    for {
        select {
        case <-renewCtx.Done():
            return // renewal loop stops only on context cancellation
        case <-ticker.C:
            rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
            if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
                // Warn-only: execution continues even after repeated failures.
                slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
            }
            cancel()
        }
    }
}
```

Workflow tests around renewal
```go
// core/workflow/lock_test.go (excerpt)
func TestDistributedRunLock_Renewal(t *testing.T) {
    // verifies lock key remains alive past original TTL when renew succeeds
}

func TestDistributedRunLock_RenewalStopsOnContextCancel(t *testing.T) {
    // verifies renewal loop stops on context cancellation
}

// no dedicated test today that enforces abandon/fence after N renewal failures
```

Validation runbook
Run this in staging before changing lock renewal thresholds or context cancellation behavior.
```sh
# 1) Track scheduler abandonment events
#    rate(cordum_scheduler_job_lock_abandoned_total[5m])
# 2) Track workflow renewal warnings
#    grep/aggregate: "run lock renewal failed"
# 3) Correlate with duplicate critical-section effects
#    - repeated processing for same run_id/job_id
#    - state transition conflicts shortly after renewal failures
# 4) During incident, classify behavior by component
#    - scheduler path should fence after 3 consecutive failures
#    - workflow path currently continues unless context is canceled
# 5) Decide temporary policy
#    - high-risk workloads: reduce tolerance, prefer fence/stop
#    - low-risk workloads: allow continuity with stronger idempotency checks
# 6) Post-incident action
#    - add explicit renewal-failure SLOs and regression tests for both paths
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fence quickly after repeated renewal failure | Higher correctness under uncertain lock ownership. | Lower continuity during lock backend turbulence. |
| Warn and continue | Better short-term throughput and fewer interrupted flows. | Potential overlap window after lease loss; harder correctness guarantees. |
| Policy by workload risk class | Applies strict fencing only where correctness cost is highest. | More policy and testing complexity across components. |
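The third row, policy by workload risk class, can be captured in a small table-driven type. This is a hedged sketch: `RiskClass`, `RenewalPolicy`, `policyFor`, and the numbers are illustrative, not Cordum defaults.

```go
package main

import "fmt"

// RiskClass buckets workloads by the cost of duplicate execution.
type RiskClass int

const (
    HighRisk RiskClass = iota // e.g. billing, state transitions: fence fast
    LowRisk                   // e.g. idempotent cache refresh: allow continuity
)

// RenewalPolicy captures the knobs the tables above describe.
type RenewalPolicy struct {
    MaxRenewalFailures    int  // consecutive failures tolerated before acting
    FenceOnExhaustion     bool // true: cancel fenced context; false: warn and continue
    SkipReleaseAfterFence bool // avoid dropping a newer owner's lock
}

// policyFor is a hypothetical lookup; a real system would load this
// from per-component configuration.
func policyFor(r RiskClass) RenewalPolicy {
    switch r {
    case HighRisk:
        return RenewalPolicy{
            MaxRenewalFailures:    3,
            FenceOnExhaustion:     true,
            SkipReleaseAfterFence: true,
        }
    default:
        return RenewalPolicy{MaxRenewalFailures: 10, FenceOnExhaustion: false}
    }
}

func main() {
    fmt.Printf("%+v\n", policyFor(HighRisk))
}
```

Keeping the policy in one declared structure, rather than scattered constants per component, is what makes the "testing complexity" downside tractable: each row of the matrix becomes one table-driven test case.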
- The scheduler path has clear fencing semantics and abandonment metric coverage. The workflow path currently prioritizes continuity with weaker lease-loss fencing.
- Workload-specific policy is often the right answer, but it requires explicit documentation and testing to avoid inconsistent behavior across components.
- I found strong scheduler hardening tests for consecutive failures and intermittent failures, but no equivalent workflow test enforcing a bounded-failure abandon policy.
Next step
Implement this next:
1. Define a renewal-failure policy matrix by workload risk (strict fence vs continuity).
2. Add workflow-level metric parity for renewal-failure abandonment, not only warning logs.
3. Add workflow tests that inject repeated `RenewLock` failures and validate chosen policy behavior.
4. Document expected operator response for each lock path in the runbook.
Continue with AI Agent Distributed Lock Fallback and AI Agent Lock Token Ownership.