fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)

Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:
  1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
     ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
     not start the heavy sub-gates at all (prevention, not post-hoc repair).
  2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
     a lost race aborts with NO side effect. Also closes the 6 paths that write the
     stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).

Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-15 17:37:11 +03:00
committed by deployer
parent cc03e68847
commit 6ea4402942
15 changed files with 1591 additions and 82 deletions

View File

@@ -60,6 +60,25 @@ async def lifespan(app: FastAPI):
if requeued:
log.warning(f"Queue-recovery: requeued {requeued} running job(s) after restart")
# ORCH-114 (adr-0045 / D7 / FR-4): clear durable transition-leases left by the
# PREVIOUS process boot. This process has a fresh boot_id, so every prior lease is
# stale by construction -> reclaim it so the just-requeued jobs can re-drive their
# side-effectful transitions cleanly. Idempotency of the re-drive comes from the
# authoritative durable facts (SHA-in-main / the INITIATED self-deploy marker /
# the coverage-ratchet CAS), NOT from a new recovery brain — the lease only
# guarantees the re-drive runs SEQUENTIALLY (one owner), never concurrently. Runs
# AFTER requeue_running_jobs and BEFORE the reaper starts. never raises.
try:
from . import transition_lease
cleared_leases = transition_lease.recover_on_startup()
if cleared_leases:
log.warning(
f"Transition-lease recovery: cleared {cleared_leases} stale lease(s) "
f"from a previous boot"
)
except Exception as e:
log.warning(f"Transition-lease recovery skipped: {e}")
# ORCH-065: proactive startup reclaim of dead/stale merge-leases, next to the
# queue-recovery above. A lease held by the previous (now dead) process pid is
# released at once instead of waiting for the TTL / a foreign acquire so the
@@ -215,6 +234,7 @@ async def queue():
from . import bug_fast_track
from . import lessons
from . import checkout_hygiene
from . import transition_lease
from .disk_watchdog import disk_watchdog
from .build_cache_pruner import build_cache_pruner
return {
@@ -258,6 +278,11 @@ async def queue():
# ORCH-112 (D3): deploy-base checkout-hygiene observability (read-only) —
# kill-switch + scope. Additive block; never-raise.
"checkout_hygiene": checkout_hygiene.snapshot(),
# ORCH-114 (adr-0045 / D10 / FR-6): durable transition-ownership lease
# observability (read-only) — kill-switch, scope, boot_id, active holders
# (owner/stage/age/live) + defer/reclaim/CAS-lost counters. Additive block;
# never-raise.
"transition_lease": transition_lease.snapshot(),
# ORCH-098 (FR-4 / AC-4): lessons-journal observability (read-only) —
# kill-switch + counts by type/status + last N lessons. Additive block;
# never-raise (snapshot() returns {"enabled": ...} minimum on error).
@@ -324,6 +349,39 @@ async def serial_gate_unfreeze(repo: str = ""):
return {"ok": True, "repo": repo, "cleared": cleared, "frozen": frozen}
@app.post("/transition-lease/release")
async def transition_lease_release(work_item: str = ""):
"""ORCH-114 (adr-0045 / D10): operator manual reclaim of a stuck transition-lease.
By образцу ``POST /serial-gate/unfreeze``: if a lease somehow outlives its owner
(the normal try/finally release + the reaper force-release + the Tier-3 backstop
should make this unnecessary), an operator can force-release it by work-item id so
a re-approve / the reconciler can re-drive the transition. Idempotent: releasing a
free task reports ``released: false``. Read-only/never-raise otherwise.
"""
from . import transition_lease
from .db import get_task_by_work_item_id
if not work_item or not work_item.strip():
return {"ok": False, "error": "missing 'work_item'", "work_item": work_item}
work_item = work_item.strip()
task = get_task_by_work_item_id(work_item)
if not task:
return {"ok": False, "error": "task not found", "work_item": work_item}
task_id = task["id"]
held_before = transition_lease.is_held_by_live_owner(task_id)
transition_lease.release(task_id, force=True)
if held_before:
try:
from .notifications import send_telegram, link_for
send_telegram(
f"🔓 {link_for(work_item)}: transition-lease сброшен вручную "
f"(task {task_id}). Переход может быть пере-исполнен."
)
except Exception:
pass
return {"ok": True, "work_item": work_item, "task_id": task_id, "released": held_before}
@app.post("/fs-normalize/check")
async def fs_normalize_check(normalize: bool = False):
"""ORCH-057 (D6 / AC-4): force a fresh legacy-ownership detect (bypass the TTL