fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)
All checks were successful
CI / test (push) Successful in 1m11s
CI / test (pull_request) Successful in 1m8s

Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:
  1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
     ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
     not start the heavy sub-gates at all (prevention, not post-hoc repair).
  2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
     a lost race aborts with NO side effect. Also closes the 6 paths that write the
     stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).

Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-15 17:37:11 +03:00
parent 0939893c70
commit 0bda3c145b
15 changed files with 1591 additions and 82 deletions

View File

@@ -14,7 +14,6 @@ from ..db import (
get_task_by_plane_id,
get_next_work_item_id,
ensure_unique_work_item_id,
update_task_stage,
enqueue_job,
insert_event_dedup,
create_task_atomic,
@@ -35,6 +34,7 @@ from ..projects import (
get_project_by_repo,
known_plane_project_ids,
)
from .. import transition_lease
logger = logging.getLogger("orchestrator.webhooks.plane")
@@ -803,7 +803,17 @@ async def _rollback_stage(
if not prev_stage:
logger.info(f"Task {task_id}: rejected at {current_stage} but no previous stage")
return
update_task_stage(task_id, prev_stage)
# ORCH-114 (adr-0045 / D4, TR-4): this Rejected-rollback writes the stage in
# BYPASS of advance_stage. Route it through the expected-stage CAS so it can never
# clobber an authoritative write made by a concurrent owner (e.g. a deploy->done
# finalizer) — a lost race aborts the rollback WITHOUT its side effects. Kill-switch
# off / repo out of scope -> unconditional update (byte-for-byte).
if not transition_lease.commit_stage_cas(task_id, current_stage, prev_stage, repo):
logger.info(
f"Task {task_id}: rollback stage-CAS lost ({current_stage}->{prev_stage}) "
f"— task already moved by another writer; skipping rollback"
)
return
notify_stage_change(task_id, current_stage, prev_stage)
# Feature 3: plane_notify_stage moves the board to the prev stage's status.
plane_notify_stage(work_item_id, current_stage, prev_stage)
@@ -857,10 +867,25 @@ async def _try_advance_stage(
advance_stage). It is True ONLY on the "Confirm Deploy" path
(handle_confirm_deploy) and gates Phase B of the self-hosting prod deploy; the
plain Approved path (handle_verdict) leaves it at the default False.
ORCH-114 (adr-0045 / FR-5, AC-8): if a live actor already owns this task's
side-effectful transition (transition-lease active), DEFER — do not re-enter the
transition in parallel. The late legitimate signal is not lost: once the owner
releases (or dies and the reaper reclaims), a re-approve / the reconciler re-drives
it, or advance_stage becomes an idempotent no-op against the authoritative facts
(SHA-in-main / INITIATED). never raises; no-op when the lease is disabled / repo
out of scope.
"""
import asyncio
from ..stage_engine import advance_stage
if transition_lease.is_held_by_live_owner(task_id):
logger.info(
f"Task {task_id}: transition-lease active — deferring webhook advance "
f"from {current_stage} (confirm_deploy={confirm_deploy})"
)
return
await asyncio.to_thread(
advance_stage,
task_id,