fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)

Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:
  1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
     ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
     not start the heavy sub-gates at all (prevention, not post-hoc repair).
  2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
     a lost race aborts with NO side effect. Also closes the 6 paths that write the
     stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).

Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-15 17:37:11 +03:00
committed by deployer
parent cc03e68847
commit 6ea4402942
15 changed files with 1591 additions and 82 deletions

View File

@@ -70,6 +70,7 @@ from .webhooks.plane import handle_status_start, handle_verdict
from .notifications import send_telegram, link_for
from . import projects
from . import task_deps
from . import transition_lease
logger = logging.getLogger("orchestrator.reconciler")
@@ -153,6 +154,10 @@ class Reconciler:
# ORCH-068 observability: terminal-state skips and dedup suppressions.
self.skipped_terminal_total: int = 0
self.deduped_total: int = 0
# ORCH-114 (adr-0045 / FR-5): F-1 advances deferred because a live actor owns
# the task's side-effectful transition (transition-lease active). Reset on
# restart (safe: a live lease is itself recovered/reclaimed on restart).
self.transition_lease_defers_total: int = 0
# ORCH-068 (TR-3): in-memory dedup guard {issue_id -> last unblocked
# state uuid}. Best-effort (resets on restart, like unblocked_total);
# suppresses a repeat unblock notification for the same issue+state.
@@ -246,6 +251,19 @@ class Reconciler:
if cyc:
task_deps.handle_cycle(cyc)
return
# ORCH-114 (adr-0045 / FR-5, AC-7): a live actor already owns this task's
# side-effectful transition -> F-1 must NOT advance it in parallel. Silent
# defer (mirrors the escalated/Blocked/task-deps skip-guards above); the owner
# finishes the transition or, on death, the reaper reclaims it in bounded time.
# fail-safe: is_held_by_live_owner is conservative (True on doubt -> defer).
# never raises; no-op (False) when the lease is disabled / repo out of scope.
if transition_lease.is_held_by_live_owner(task_id):
self.transition_lease_defers_total += 1
logger.debug(
f"reconciler F-1: task {task_id} has an active transition-lease — "
f"deferring advance to its owner"
)
return
result = advance_if_gate_passed(
task_id,
stage,
@@ -596,6 +614,8 @@ class Reconciler:
# ORCH-068 observability.
"skipped_terminal_total": self.skipped_terminal_total,
"deduped_total": self.deduped_total,
# ORCH-114 observability: F-1 advances deferred to a live lease owner.
"transition_lease_defers_total": self.transition_lease_defers_total,
}