fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)

Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful stage transitions had no single ownership. `advance_stage` is re-enterable and wrote the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the same transition independently. A concurrent or post-restart re-entry therefore re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild / prod-deploy initiation) and produced a contradictory rollback<->done (incident ORCH-111, job 1914 / PR #130). Two complementary layers, both additive, under one kill-switch, never-raise: 1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on ENTRY to the side-effectful region: a second actor that sees a LIVE owner does not start the heavy sub-gates at all (prevention, not post-hoc repair). 2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE: a lost race aborts with NO side effect. Also closes the 6 paths that write the stage in bypass of advance_stage (gitea x5 + plane rollback). Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so the cross-cutting budget invariant ORCH-065/109/110/113 is untouched. Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a durable cross-path lease: the reaper consults it on all relevant paths (defer live, reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease); reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged (it remains the kill-switch-off fallback). Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched). Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for- byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas — byte-for-byte (one additive table, no epoch column on tasks). Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>. Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green (2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage were updated to spy on the new commit_stage_cas write path. ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 17:37:11 +03:00
parent cc03e68847
commit 6ea4402942
15 changed files with 1591 additions and 82 deletions
--- a/src/config.py
+++ b/src/config.py
@@ -590,6 +590,43 @@ class Settings(BaseSettings):
    lease_reclaim_enabled: bool = True
    reaper_finalizer_liveness_enabled: bool = True

+    # ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS for
+    # side-effectful stage transitions. Generalises the process-local ORCH-113
+    # finalizer-liveness to a DURABLE, cross-path owner-exclusion (additive table
+    # `transition_lease`) so a concurrent OR post-restart re-entry into a side-effectful
+    # transition (reaper / reconciler / webhook / startup-requeue) is deferred or a
+    # no-op instead of re-applying an irreversible effect (merge_pr / coverage-ratchet /
+    # image-rebuild / prod-deploy initiation / contradictory rollback↔done). Two
+    # complementary layers, both gated by the SINGLE kill-switch below:
+    #   (1) durable lease on ENTRY to the side-effectful region (a second actor seeing a
+    #       live owner does not start the heavy sub-gates at all — prevention, not repair);
+    #   (2) expected-stage CAS on the stage WRITE (update_task_stage_cas: a lost race ->
+    #       abort with NO side effect), which also closes the 6 paths that write the
+    #       stage in bypass of advance_stage (gitea/plane direct update_task_stage).
+    # Liveness of the owner = owner_pid + owner_boot_id (NOT a heartbeat — a blocking
+    # 900s merge re-test cannot beat a heartbeat; ADR-001 D3), which makes restart
+    # recovery free (a new process -> new boot_id -> all prior leases are instantly
+    # stale -> reclaimed). The lease has NO own TTL: its hard age ceiling IS the reaper
+    # Tier-3 backstop reaper_max_running_s (5400), so the cross-cutting budget invariant
+    # ORCH-065/109/110/113 is untouched. STAGE_TRANSITIONS / QG_CHECKS / check_* /
+    # machine-verdict keys / existing table schemas — byte-for-byte. never-raise:
+    # hot-path guard fail-open (never wedge the shared queue), prod-safety fail-closed.
+    # See docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
+    # and the cross-cutting docs/architecture/adr/adr-0045-…md.
+    #   transition_lease_enabled -> SINGLE kill-switch (env ORCH_TRANSITION_LEASE_ENABLED).
+    #                               False -> the lease is neither written nor read AND the
+    #                               CAS degenerates to the prior unconditional
+    #                               update_task_stage -> behaviour byte-for-byte as before
+    #                               ORCH-114 (reaper -> ORCH-113 in-memory fallback,
+    #                               reconciler/webhook skip-guard inert). Default True.
+    #   transition_lease_repos   -> CSV scope (env ORCH_TRANSITION_LEASE_REPOS). Empty ->
+    #                               applies ONLY to the self-hosting repo (orchestrator),
+    #                               where the irreversible side-effectful edges live;
+    #                               non-empty -> only the listed repos. Mirrors
+    #                               coverage_gate_repos -> enduro untouched at the default.
+    transition_lease_enabled: bool = True
+    transition_lease_repos: str = ""
+
    # ORCH-063: disk-watchdog — background heartbeat that measures host-FS fill via
    # the mounted bind-paths and Telegram-alerts the operator at >= threshold. On
    # 07.06.2026 the mva154 host disk silently hit 100% and stalled the WHOLE
--- a/src/db.py
+++ b/src/db.py
@@ -263,6 +263,28 @@ def init_db():
    _ensure_column(conn, "lessons", "attribution", "TEXT")
    _ensure_column(conn, "lessons", "target_repo", "TEXT")
    _ensure_column(conn, "lessons", "target_domain", "TEXT")
+    # ORCH-114 (adr-0045 / 08-data-requirements.md): durable transition-ownership
+    # lease. ONE additive object (CREATE TABLE IF NOT EXISTS, pattern repo_freeze/
+    # coverage_baseline/lessons) -> idempotent, restart-safe on the shared prod DB;
+    # existing tables (tasks/jobs/agent_runs/...) untouched byte-for-byte (NFR-3,
+    # AC-11). One row per task = at most one active owner of a side-effectful
+    # transition. Liveness of the holder = owner_boot_id (this process's start nonce)
+    # + owner_pid (os.getpid of the holding process); a row from a previous boot is
+    # instantly stale on restart -> reclaimed (ADR-001 D3). No index needed (access by
+    # PK task_id; snapshot() is a full-scan over a tiny table). The src/transition_lease.py
+    # leaf wraps all access in its never-raise contract. NO epoch/version column (D2:
+    # for the one-process model the stage IS the CAS version).
+    conn.executescript("""
+        CREATE TABLE IF NOT EXISTS transition_lease (
+            task_id       INTEGER PRIMARY KEY,
+            owner         TEXT NOT NULL,
+            owner_pid     INTEGER,
+            owner_boot_id TEXT,
+            run_id        INTEGER,
+            stage         TEXT,
+            acquired_at   TEXT NOT NULL DEFAULT (datetime('now'))
+        );
+    """)
    conn.commit()
    conn.close()

@@ -679,6 +701,39 @@ def update_task_stage(task_id: int, stage: str):
    conn.close()


+def update_task_stage_cas(task_id: int, expected_stage: str, new_stage: str) -> bool:
+    """ORCH-114 (adr-0045 / FR-2): compare-and-swap variant of update_task_stage.
+
+    Writes the stage ONLY when the task is still at ``expected_stage`` (the value the
+    caller read before running the side-effectful region) — ``UPDATE … SET stage=?
+    WHERE id=? AND stage=?`` — and reports whether THIS writer won. Returns:
+
+      * ``True``  -> ``rowcount == 1``: the CAS succeeded, the stage moved exactly once.
+      * ``False`` -> ``rowcount == 0``: the task is no longer at ``expected_stage``
+        (another actor already advanced/rolled it back, or the row is gone) -> the
+        caller MUST abort WITHOUT applying any side effect (merge_pr / ratchet /
+        rebuild / deploy-init / enqueue) — it lost the race.
+
+    In the current one-process model each side-effectful edge leads to a DISTINCT
+    next stage, so the stage itself is a complete version for the compare-and-swap;
+    no separate epoch/version column is needed (ADR-001 D2). The plain
+    ``update_task_stage`` above is kept unchanged for the kill-switch-off path and
+    for non-side-effectful writes. Mirrors the atomic rowcount-guard idiom of
+    ``claim_next_job`` / ``reap_running_job``.
+    """
+    conn = get_db()
+    try:
+        cur = conn.execute(
+            "UPDATE tasks SET stage = ?, updated_at = datetime('now') "
+            "WHERE id = ? AND stage = ?",
+            (new_stage, task_id, expected_stage),
+        )
+        conn.commit()
+        return cur.rowcount == 1
+    finally:
+        conn.close()
+
+
 # ---------------------------------------------------------------------------
 # ORCH-019: bug-fast-track task type (tasks.track) helpers
 # ---------------------------------------------------------------------------
--- a/src/job_reaper.py
+++ b/src/job_reaper.py
@@ -434,18 +434,35 @@ class JobReaper:
            return None, None, None

    def _finalizer_owns(self, job: dict) -> bool:
-        """ORCH-113 (adr-0043 / D3): True iff a LIVE monitor still owns this job's
-        ``deploy-staging`` finalization, so the Tier-2 reap must be deferred.
+        """True iff a LIVE actor still owns this job's side-effectful finalization, so
+        the Tier-2 reap must be deferred.

-        Order matters for the zero-regression contract: the kill-switch is checked
-        FIRST (disabled -> ``False`` with no DB lookup, so the path is byte-for-byte
-        prior); then the stage is scoped to ``deploy-staging`` only (the sole edge
-        whose in-thread finalization runs for minutes — every other stage is left
-        untouched); only then is the process-local ownership marker consulted. Never
-        raises -> ``False`` on any error (conservative: never block reaping when
-        ownership is unknowable, so the Tier-3 backstop is never neutered).
+        ORCH-114 (adr-0045 / D6) GENERALISES the ORCH-113 process-local, Tier-2,
+        ``deploy-staging``-only marker to a DURABLE, cross-path lease: when the
+        transition-lease applies to this repo, consult ``transition_lease`` keyed on
+        the task (covers EVERY relevant edge — deploy-staging AND deploy->done — and
+        survives restart). Otherwise (kill-switch off) fall back to the unchanged
+        ORCH-113 in-memory ``finalizer_liveness`` (Tier-2 / ``deploy-staging`` only),
+        so the disabled path is byte-for-byte prior.
+
+        Either way the Tier-3 backstop (``reaper_max_running_s``) IGNORES this marker
+        (it does not call here), so a stuck/dead finalizer is still reaped in bounded
+        time. Never raises -> ``False`` on any error (conservative: never block reaping
+        when ownership is unknowable, so the backstop is never neutered).
        """
        try:
+            repo = job.get("repo")
+            # ORCH-114: durable cross-path lease (when enabled for this repo).
+            try:
+                from . import transition_lease
+                if transition_lease.applies(repo):
+                    return transition_lease.is_held_by_live_owner(job.get("task_id"))
+            except Exception as e:  # noqa: BLE001 - fall back to ORCH-113 on any error
+                logger.warning(
+                    "reaper: transition-lease check failed for job %s: %s",
+                    job.get("id"), e,
+                )
+            # ORCH-113 fallback (kill-switch off): process-local, Tier-2/deploy-staging.
            if not settings.reaper_finalizer_liveness_enabled:
                return False
            _branch, stage, _wid = self._task_meta(job)
@@ -472,6 +489,18 @@ class JobReaper:

    def _note_reap(self, job: dict, outcome: str, reason: str) -> None:
        """Record + log one successful reap (Р-6 observability)."""
+        # ORCH-114 (adr-0045 / D6): a reap reclaims the job, so its durable
+        # transition-lease must NOT outlive it — force-release (any owner/boot) so a
+        # requeued job can re-acquire cleanly. never-raise; no-op when the lease is
+        # disabled / no row exists.
+        try:
+            from . import transition_lease
+            transition_lease.release(job.get("task_id"), force=True)
+        except Exception as e:  # noqa: BLE001 - never break the reap
+            logger.warning(
+                "reaper: transition-lease force-release failed for job %s: %s",
+                job.get("id"), e,
+            )
        self.reaped_total += 1
        self.last_reaped = {
            "job_id": job.get("id"),
--- a/src/main.py
+++ b/src/main.py
@@ -60,6 +60,25 @@ async def lifespan(app: FastAPI):
    if requeued:
        log.warning(f"Queue-recovery: requeued {requeued} running job(s) after restart")

+    # ORCH-114 (adr-0045 / D7 / FR-4): clear durable transition-leases left by the
+    # PREVIOUS process boot. This process has a fresh boot_id, so every prior lease is
+    # stale by construction -> reclaim it so the just-requeued jobs can re-drive their
+    # side-effectful transitions cleanly. Idempotency of the re-drive comes from the
+    # authoritative durable facts (SHA-in-main / the INITIATED self-deploy marker /
+    # the coverage-ratchet CAS), NOT from a new recovery brain — the lease only
+    # guarantees the re-drive runs SEQUENTIALLY (one owner), never concurrently. Runs
+    # AFTER requeue_running_jobs and BEFORE the reaper starts. never raises.
+    try:
+        from . import transition_lease
+        cleared_leases = transition_lease.recover_on_startup()
+        if cleared_leases:
+            log.warning(
+                f"Transition-lease recovery: cleared {cleared_leases} stale lease(s) "
+                f"from a previous boot"
+            )
+    except Exception as e:
+        log.warning(f"Transition-lease recovery skipped: {e}")
+
    # ORCH-065: proactive startup reclaim of dead/stale merge-leases, next to the
    # queue-recovery above. A lease held by the previous (now dead) process pid is
    # released at once instead of waiting for the TTL / a foreign acquire so the
@@ -215,6 +234,7 @@ async def queue():
    from . import bug_fast_track
    from . import lessons
    from . import checkout_hygiene
+    from . import transition_lease
    from .disk_watchdog import disk_watchdog
    from .build_cache_pruner import build_cache_pruner
    return {
@@ -258,6 +278,11 @@ async def queue():
        # ORCH-112 (D3): deploy-base checkout-hygiene observability (read-only) —
        # kill-switch + scope. Additive block; never-raise.
        "checkout_hygiene": checkout_hygiene.snapshot(),
+        # ORCH-114 (adr-0045 / D10 / FR-6): durable transition-ownership lease
+        # observability (read-only) — kill-switch, scope, boot_id, active holders
+        # (owner/stage/age/live) + defer/reclaim/CAS-lost counters. Additive block;
+        # never-raise.
+        "transition_lease": transition_lease.snapshot(),
        # ORCH-098 (FR-4 / AC-4): lessons-journal observability (read-only) —
        # kill-switch + counts by type/status + last N lessons. Additive block;
        # never-raise (snapshot() returns {"enabled": ...} minimum on error).
@@ -324,6 +349,39 @@ async def serial_gate_unfreeze(repo: str = ""):
    return {"ok": True, "repo": repo, "cleared": cleared, "frozen": frozen}


+@app.post("/transition-lease/release")
+async def transition_lease_release(work_item: str = ""):
+    """ORCH-114 (adr-0045 / D10): operator manual reclaim of a stuck transition-lease.
+
+    By образцу ``POST /serial-gate/unfreeze``: if a lease somehow outlives its owner
+    (the normal try/finally release + the reaper force-release + the Tier-3 backstop
+    should make this unnecessary), an operator can force-release it by work-item id so
+    a re-approve / the reconciler can re-drive the transition. Idempotent: releasing a
+    free task reports ``released: false``. Read-only/never-raise otherwise.
+    """
+    from . import transition_lease
+    from .db import get_task_by_work_item_id
+    if not work_item or not work_item.strip():
+        return {"ok": False, "error": "missing 'work_item'", "work_item": work_item}
+    work_item = work_item.strip()
+    task = get_task_by_work_item_id(work_item)
+    if not task:
+        return {"ok": False, "error": "task not found", "work_item": work_item}
+    task_id = task["id"]
+    held_before = transition_lease.is_held_by_live_owner(task_id)
+    transition_lease.release(task_id, force=True)
+    if held_before:
+        try:
+            from .notifications import send_telegram, link_for
+            send_telegram(
+                f"🔓 {link_for(work_item)}: transition-lease сброшен вручную "
+                f"(task {task_id}). Переход может быть пере-исполнен."
+            )
+        except Exception:
+            pass
+    return {"ok": True, "work_item": work_item, "task_id": task_id, "released": held_before}
+
+
@app.post("/fs-normalize/check")
 async def fs_normalize_check(normalize: bool = False):
    """ORCH-057 (D6 / AC-4): force a fresh legacy-ownership detect (bypass the TTL
--- a/src/reconciler.py
+++ b/src/reconciler.py
@@ -70,6 +70,7 @@ from .webhooks.plane import handle_status_start, handle_verdict
 from .notifications import send_telegram, link_for
 from . import projects
 from . import task_deps
+from . import transition_lease

 logger = logging.getLogger("orchestrator.reconciler")

@@ -153,6 +154,10 @@ class Reconciler:
        # ORCH-068 observability: terminal-state skips and dedup suppressions.
        self.skipped_terminal_total: int = 0
        self.deduped_total: int = 0
+        # ORCH-114 (adr-0045 / FR-5): F-1 advances deferred because a live actor owns
+        # the task's side-effectful transition (transition-lease active). Reset on
+        # restart (safe: a live lease is itself recovered/reclaimed on restart).
+        self.transition_lease_defers_total: int = 0
        # ORCH-068 (TR-3): in-memory dedup guard {issue_id -> last unblocked
        # state uuid}. Best-effort (resets on restart, like unblocked_total);
        # suppresses a repeat unblock notification for the same issue+state.
@@ -246,6 +251,19 @@ class Reconciler:
                if cyc:
                    task_deps.handle_cycle(cyc)
                return
+        # ORCH-114 (adr-0045 / FR-5, AC-7): a live actor already owns this task's
+        # side-effectful transition -> F-1 must NOT advance it in parallel. Silent
+        # defer (mirrors the escalated/Blocked/task-deps skip-guards above); the owner
+        # finishes the transition or, on death, the reaper reclaims it in bounded time.
+        # fail-safe: is_held_by_live_owner is conservative (True on doubt -> defer).
+        # never raises; no-op (False) when the lease is disabled / repo out of scope.
+        if transition_lease.is_held_by_live_owner(task_id):
+            self.transition_lease_defers_total += 1
+            logger.debug(
+                f"reconciler F-1: task {task_id} has an active transition-lease — "
+                f"deferring advance to its owner"
+            )
+            return
        result = advance_if_gate_passed(
            task_id,
            stage,
@@ -596,6 +614,8 @@ class Reconciler:
            # ORCH-068 observability.
            "skipped_terminal_total": self.skipped_terminal_total,
            "deduped_total": self.deduped_total,
+            # ORCH-114 observability: F-1 advances deferred to a live lease owner.
+            "transition_lease_defers_total": self.transition_lease_defers_total,
        }


--- a/src/stage_engine.py
+++ b/src/stage_engine.py
@@ -41,6 +41,7 @@ from . import self_deploy
 from . import post_deploy
 from . import labels
 from . import bug_fast_track
+from . import transition_lease
 from .notifications import (
    notify_stage_change,
    notify_qg_failure,
@@ -173,6 +174,20 @@ def developer_retry_count(task_id: int) -> int:
 _developer_retry_count = developer_retry_count


+def _is_side_effectful_edge(current_stage: str | None, next_stage: str | None) -> bool:
+    """ORCH-114 (adr-0045 D4): does this ``advance_stage`` edge run IRREVERSIBLE work
+    that must be owned by exactly one actor (lease on entry)?
+
+      * ``deploy-staging`` (-> deploy): the heavy edge sub-gates (security / merge-gate
+        re-test / coverage / image-freshness rebuild) + Phase A.
+      * ``deploy`` (-> done OR Phase B): merge_pr / coverage-ratchet / proof-of-merge,
+        or the detached prod-deploy initiation (confirm_deploy).
+    Every other edge (created -> … -> testing) is reversible and is protected by the
+    CAS-on-write alone (no lease). Pure, never raises.
+    """
+    return current_stage in ("deploy-staging", "deploy")
+
+
 def advance_stage(
    task_id: int,
    current_stage: str,
@@ -210,6 +225,12 @@ def advance_stage(
    """
    result = AdvanceResult(from_stage=current_stage)
    agent = finished_agent
+    # ORCH-114 (adr-0045): set True once we acquire the durable transition-lease on a
+    # side-effectful edge, so the finally below ALWAYS releases it (on success, on a
+    # lost CAS, on a sub-gate rollback, and on ANY exception caught by the outer
+    # except). Released holder-aware (this process only) so a reaper reclaim + reacquire
+    # in between is never clobbered.
+    _lease_held = False
    try:
        qg_name = get_qg_for_stage(current_stage)
        next_stage = get_next_stage(current_stage)
@@ -240,6 +261,28 @@ def advance_stage(
            result.note = "terminal"
            return result

+        # --- ORCH-114 transition-ownership lease: acquire on ENTRY (ADR-001 D5) ----
+        # On a side-effectful edge (deploy-staging / deploy) acquire the DURABLE
+        # owner-exclusion lease BEFORE the Phase B / sub-gate / merge-verify region. A
+        # second concurrent actor (reaper / reconciler / webhook / a re-driven startup
+        # job) that sees a live owner gets a clean "busy" defer here and does NOT start
+        # the heavy region at all — this is what kills the double-effect class
+        # (incident ORCH-111) at the root. Released in the `finally` below. Kill-switch
+        # off / repo out of scope -> applies() False -> no lease, byte-for-byte prior.
+        if _is_side_effectful_edge(current_stage, next_stage) and transition_lease.applies(repo):
+            if not transition_lease.acquire(
+                task_id, finished_agent or "engine", run_id=None, stage=current_stage
+            ):
+                logger.info(
+                    f"Task {task_id}: transition-lease busy on "
+                    f"{current_stage}->{next_stage} — deferring (another actor owns "
+                    f"this transition)"
+                )
+                result.note = "transition-lease-busy"
+                result.advanced = False
+                return result
+            _lease_held = True
+
        # --- ORCH-036/059 Phase B: "Confirm Deploy" on `deploy` -> initiate ----
        # ORCH-059: the prod-deploy trigger is now the DEDICATED "Confirm Deploy"
        # status (confirm_deploy=True), NOT the overloaded "Approved". On the
@@ -399,7 +442,23 @@ def advance_stage(
                return result

        # --- Advance ---------------------------------------------------------
-        update_task_stage(task_id, next_stage)
+        # ORCH-114 (adr-0045 / FR-2): expected-stage compare-and-swap. Writes the
+        # stage only if the task is STILL at current_stage (the value we read on
+        # entry); a lost race (another writer advanced/rolled back first) returns
+        # False -> abort here WITHOUT any side effect (no notify / no arm / no
+        # terminal-sync / no enqueue). Kill-switch off / repo out of scope ->
+        # degenerates to the prior unconditional update_task_stage (returns True) ->
+        # byte-for-byte prior behaviour. Defense-in-depth: under the lease acquired
+        # above this CAS practically always wins; it also covers the narrow
+        # consult->acquire window and any bypass writer (TR-5).
+        if not transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo):
+            logger.info(
+                f"Task {task_id}: stage-CAS lost on {current_stage}->{next_stage} — "
+                f"aborting without side effects (another writer advanced first)"
+            )
+            result.note = "stage-cas-lost"
+            result.advanced = False
+            return result
        # Telegram live tracker: the analysis->architecture advance is the human
        # Approved gate clearing -> stamp the END of "Ревью БРД" (the only
        # human time). Idempotent: only the first stamp counts.
@@ -510,6 +569,16 @@ def advance_stage(
        logger.error(f"advance_stage failed for task_id={task_id}: {e}")
        result.note = f"error: {e}"
        return result
+    finally:
+        # ORCH-114 (adr-0045 / AC-3): release the transition-lease on EVERY exit —
+        # normal advance, lost CAS, sub-gate rollback, Phase A/B early return, and any
+        # exception caught above — so the lease never "leaks" and wedges the task.
+        # holder-aware (force=False): only releases a row this process owns.
+        if _lease_held:
+            try:
+                transition_lease.release(task_id)
+            except Exception as e:  # noqa: BLE001 - never-raise (Tier-3 backstop bounds it)
+                logger.warning(f"Task {task_id}: transition-lease release failed: {e}")


 def advance_if_gate_passed(
@@ -1482,7 +1551,21 @@ def _handle_self_deploy_phase_a(
    restart-safe `approve-requested` marker records that Phase A ran. The merge
    lease stays HELD.
    """
-    update_task_stage(task_id, "deploy")
+    # ORCH-114 (adr-0045 / D4): this IS the deploy-staging -> deploy stage write on
+    # the self-hosting path (advance_stage's line-402 CAS is not reached — Phase A
+    # returns first). Use the same expected-stage CAS. It runs under the transition-
+    # lease acquired by advance_stage, so it practically always wins; a lost CAS
+    # (a concurrent writer despite the lease) -> abort Phase A WITHOUT initiating the
+    # prod-deploy ask / autoDeploy (no double effect). Kill-switch off / repo out of
+    # scope -> unconditional update (byte-for-byte).
+    if not transition_lease.commit_stage_cas(task_id, current_stage, "deploy", repo):
+        logger.info(
+            f"Task {task_id}: Phase A stage-CAS lost ({current_stage}->deploy) — "
+            f"aborting Phase A without side effects"
+        )
+        result.note = "phase-a-cas-lost"
+        result.advanced = False
+        return
    notify_stage_change(task_id, current_stage, "deploy")
    result.advanced = True
    result.to_stage = "deploy"
--- a/src/transition_lease.py
+++ b/src/transition_lease.py
@@ -0,0 +1,471 @@
+"""ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS.
+
+Leaf module — pure, never-raise (pattern of ``serial_gate`` / ``coverage_gate`` /
+``finalizer_liveness``: imports only ``db`` + ``config`` and lazily
+``merge_gate.pid_alive`` / ``qg.checks.is_self_hosting_repo`` / ``notifications``;
+it NEVER imports ``stage_engine`` / ``launcher`` and talks to no network).
+
+The bug class it closes
+-----------------------
+``stage_engine.advance_stage`` is the single entry to side-effectful transitions
+(the heavy ``deploy-staging -> deploy`` edge sub-gates — security / merge-gate
+re-test / coverage / image-freshness — and the ``deploy -> done`` merge-verify:
+``merge_pr`` / coverage-ratchet / proof-of-merge). It is RE-ENTERABLE: at least
+five actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy
+finalizer) can enter the SAME transition independently, and the stage write was a
+bare ``UPDATE … WHERE id=?`` with no compare-and-swap. Two concurrent — or a
+post-restart re-driven — entry therefore re-applied irreversible effects and
+produced contradictory outcomes (one path rolled back to ``development`` while
+another merged + finished — incident ORCH-111, job 1914 / PR #130). ORCH-113
+closed only the in-memory, Tier-2, ``deploy-staging``-only slice of this; it is
+lost on restart.
+
+Two complementary layers (ADR-001 D1), both gated by one kill-switch:
+
+  1. **Durable lease (owner-exclusion on ENTRY).** A row in the additive
+     ``transition_lease`` table (one per task) records "an actor owns this task's
+     side-effectful transition". A second actor that sees a LIVE owner does not
+     start the heavy sub-gates AT ALL (prevention, not post-hoc repair).
+  2. **Expected-stage CAS (atomicity on the WRITE).** ``update_task_stage_cas``
+     writes the stage only when the task is still at the expected stage; a lost
+     race aborts with NO side effect. It also closes the six paths that write the
+     stage in BYPASS of ``advance_stage`` (gitea / plane direct ``update_task_stage``).
+
+Liveness without a heartbeat (ADR-001 D3)
+-----------------------------------------
+An owner is LIVE ⇔ ``owner_boot_id == <this process's boot id>`` AND
+``merge_gate.pid_alive(owner_pid)``. There is NO heartbeat (a blocking 900 s merge
+re-test cannot beat one — the very argument ORCH-113 used to reject heartbeats).
+This makes restart recovery free: a new process has a new ``boot_id`` so every row
+written by a previous process is instantly stale and reclaimed
+(``recover_on_startup``). Within the one-process model every live owner shares the
+SAME boot id and pid, so a same-boot row is by definition owned by the (alive)
+current process; only a different-boot row can be stale — which is why the
+acquire/recover logic keys staleness on the boot id.
+
+No own TTL (ADR-001 D8): the lease's hard age ceiling IS the reaper Tier-3 backstop
+``reaper_max_running_s`` (the reaper force-releases the lease when it reaps), so the
+cross-cutting budget invariant ORCH-065/109/110/113 is untouched.
+
+never-raise (ADR-001 D9 / NFR-1): every public function is isolated. The
+directional defaults:
+  * ``acquire`` error -> ``False`` (busy): the caller DEFERS/aborts a side-effectful
+    transition rather than risk a double effect (fail-CLOSED to no-double-effect).
+  * ``is_held_by_live_owner`` error -> ``True`` (treat as held): the consulting
+    reconciler / webhook / reaper conservatively DEFERS (the safe action; the reaper
+    Tier-3 backstop still bounds a genuinely stuck task).
+  * ``commit_stage_cas`` error on the CAS path -> ``False``: abort the write, never a
+    blind overwrite.
+The hot claim path (``db.claim_next_job``) is deliberately NOT touched, so a lease
+bug can never wedge the shared queue of all projects (AC-8 ORCH-088 intact).
+
+See docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
+and the cross-cutting docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md.
+"""
+from __future__ import annotations
+
+import logging
+import os
+import secrets
+import threading
+
+from . import db
+from .config import settings
+
+logger = logging.getLogger("orchestrator.transition_lease")
+
+# Per-process boot nonce (ADR-001 D3). Generated ONCE at import: every lease row a
+# previous process wrote carries a DIFFERENT boot id and is therefore instantly
+# stale after a restart -> reclaimed by recover_on_startup / acquire. Not derived
+# from the clock so it cannot collide across a fast restart.
+_BOOT_ID = secrets.token_hex(16)
+
+# Best-effort observability counters (reset on restart, like the reaper's). Guarded
+# by a lock because the monitor / reaper / reconciler / webhook threads all touch
+# them. Never a source of truth — purely for GET /queue.
+_LOCK = threading.Lock()
+_COUNTERS: dict[str, int] = {
+    "acquired_total": 0,      # leases successfully acquired
+    "busy_total": 0,          # acquire deferred — a live owner already held it
+    "released_total": 0,      # normal try/finally releases
+    "cas_lost_total": 0,      # stage-CAS lost the race (aborted without side effect)
+    "stale_reclaims_total": 0,  # rows reclaimed because the owner was not live
+    "force_reclaims_total": 0,  # rows force-released (reaper / operator)
+}
+
+
+def _bump(key: str, n: int = 1) -> None:
+    try:
+        with _LOCK:
+            _COUNTERS[key] = _COUNTERS.get(key, 0) + n
+    except Exception:  # noqa: BLE001 - counters never break a caller
+        pass
+
+
+def boot_id() -> str:
+    """This process's boot nonce (exposed for tests / observability)."""
+    return _BOOT_ID
+
+
+# ---------------------------------------------------------------------------
+# Conditionality (mirrors coverage_gate_applies — self-hosting-only by default)
+# ---------------------------------------------------------------------------
+def _enabled() -> bool:
+    try:
+        return bool(getattr(settings, "transition_lease_enabled", False))
+    except Exception:  # noqa: BLE001
+        return False
+
+
+def applies(repo: str) -> bool:
+    """Whether the transition-lease + CAS are REAL for this repo (ADR-001 D10).
+
+      * ``transition_lease_enabled=False`` -> always False (kill-switch; the lease is
+        neither written nor read AND ``commit_stage_cas`` degenerates to the prior
+        unconditional ``update_task_stage`` -> behaviour byte-for-byte as before
+        ORCH-114).
+      * ``transition_lease_repos`` (CSV) non-empty -> real only for the listed repos.
+      * empty CSV -> real ONLY for the self-hosting repo (``orchestrator``), where the
+        irreversible side-effectful edges live (mirrors coverage_gate_repos -> enduro
+        untouched at the default).
+    Never raises -> False on error (the safe "mechanism inert" default == kill-switch
+    off).
+    """
+    try:
+        if not _enabled():
+            return False
+        raw = (getattr(settings, "transition_lease_repos", "") or "").strip()
+        if raw:
+            allowed = {r.strip().lower() for r in raw.split(",") if r.strip()}
+            return (repo or "").strip().lower() in allowed
+        from .qg.checks import is_self_hosting_repo
+        return is_self_hosting_repo(repo)
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("transition_lease.applies error for %s: %s", repo, e)
+        return False
+
+
+# ---------------------------------------------------------------------------
+# Liveness
+# ---------------------------------------------------------------------------
+def _pid_alive(pid) -> bool:
+    """Probe ``pid`` liveness via ``merge_gate.pid_alive`` (ADR-001 references it for
+    a single shared semantics). Lazy import keeps this module a leaf; on import error
+    fall back to a conservative ``True`` (a lease whose pid we cannot probe is treated
+    as live — the boot-id check below + the Tier-3 backstop still bound it).
+    """
+    try:
+        from .merge_gate import pid_alive
+        return pid_alive(pid)
+    except Exception:  # noqa: BLE001
+        return True
+
+
+def _row_is_live(owner_boot_id, owner_pid) -> bool:
+    """True iff the lease owner is LIVE (this process's boot AND a live pid).
+
+    A row from a DIFFERENT boot id (a previous process) is dead by construction
+    (ADR-001 D3); a same-boot row is owned by the current — alive — process, but we
+    still probe the pid for forward-compatibility with a future multi-process model.
+    """
+    if owner_boot_id != _BOOT_ID:
+        return False
+    return _pid_alive(owner_pid)
+
+
+def is_held_by_live_owner(task_id) -> bool:
+    """True iff an active lease row for ``task_id`` is owned by a LIVE actor.
+
+    Consulted by the reconciler F-1 / Plane-webhook DEFER guards and the reaper.
+    Returns ``False`` when there is no row or the owner is stale. Fail-CLOSED on any
+    error -> ``True`` (treat as held): the consulting caller DEFERS, which is always
+    the safe-against-double-effect action (the reaper Tier-3 backstop still bounds a
+    truly stuck task). When the mechanism is disabled -> ``False`` (no defer).
+    """
+    if task_id is None:
+        return False
+    if not _enabled():
+        return False
+    try:
+        conn = db.get_db()
+        try:
+            row = conn.execute(
+                "SELECT owner_boot_id, owner_pid FROM transition_lease WHERE task_id=?",
+                (task_id,),
+            ).fetchone()
+        finally:
+            conn.close()
+        if row is None:
+            return False
+        return _row_is_live(row["owner_boot_id"], row["owner_pid"])
+    except Exception as e:  # noqa: BLE001 - fail-CLOSED on doubt (conservative defer)
+        logger.warning(
+            "transition_lease.is_held_by_live_owner error for task %s -> "
+            "fail-CLOSED (defer): %s", task_id, e,
+        )
+        return True
+
+
+# ---------------------------------------------------------------------------
+# Acquire / release / reclaim
+# ---------------------------------------------------------------------------
+def _clear_stale_row(conn, task_id) -> int:
+    """Delete the lease row for ``task_id`` IFF its owner is not live. Returns the
+    rowcount. Uses the caller's connection (same transaction as the INSERT in
+    ``acquire``). Raises on a real DB fault (the caller swallows)."""
+    row = conn.execute(
+        "SELECT owner_boot_id, owner_pid FROM transition_lease WHERE task_id=?",
+        (task_id,),
+    ).fetchone()
+    if row is None:
+        return 0
+    if _row_is_live(row["owner_boot_id"], row["owner_pid"]):
+        return 0
+    cur = conn.execute("DELETE FROM transition_lease WHERE task_id=?", (task_id,))
+    return cur.rowcount or 0
+
+
+def acquire(task_id, owner: str, run_id=None, stage: str | None = None) -> bool:
+    """Acquire the side-effectful-transition lease for ``task_id`` (ADR-001 D5).
+
+    Atomic rowcount-guard (pattern ``claim_next_job`` / ``reap_running_job``): a stale
+    row (owner from a previous boot / dead pid) is cleared first, then an
+    ``INSERT … ON CONFLICT(task_id) DO NOTHING`` competes only with LIVE same-process
+    owners. ``rowcount == 1`` -> WE won. ``rowcount == 0`` -> a live owner already
+    holds it -> ``False`` (the caller DEFERS without starting the heavy region).
+
+    Kill-switch off -> ``True`` (no-op acquire; the caller proceeds exactly as before
+    ORCH-114; ``release`` is then an idempotent no-op). ``task_id is None`` -> ``True``
+    (a job with no task cannot be leased — legacy direct ``launch()``; proceed).
+
+    never-raise: any error -> ``False`` (busy) so the caller DEFERS a side-effectful
+    transition rather than risk a double effect (fail-CLOSED to no-double-effect,
+    ADR-001 D9).
+    """
+    if not _enabled():
+        return True
+    if task_id is None:
+        return True
+    try:
+        conn = db.get_db()
+        try:
+            _clear_stale_row(conn, task_id)
+            cur = conn.execute(
+                "INSERT INTO transition_lease "
+                "(task_id, owner, owner_pid, owner_boot_id, run_id, stage) "
+                "VALUES (?, ?, ?, ?, ?, ?) "
+                "ON CONFLICT(task_id) DO NOTHING",
+                (task_id, owner or "engine", os.getpid(), _BOOT_ID, run_id, stage),
+            )
+            conn.commit()
+            won = (cur.rowcount == 1)
+        finally:
+            conn.close()
+        if won:
+            _bump("acquired_total")
+            return True
+        _bump("busy_total")
+        logger.info(
+            "transition_lease: task %s busy (a live owner holds the transition); "
+            "%s defers", task_id, owner,
+        )
+        return False
+    except Exception as e:  # noqa: BLE001 - fail-CLOSED (busy) to avoid double effects
+        logger.warning("transition_lease.acquire error for task %s: %s", task_id, e)
+        return False
+
+
+def release(task_id, force: bool = False) -> None:
+    """Release the lease for ``task_id`` (ADR-001 D5). Idempotent, never raises.
+
+      * ``force=False`` (normal try/finally release in ``advance_stage``): delete only
+        a row owned by THIS process (``owner_boot_id == boot``), so a release delayed
+        past a reaper-reclaim-then-reacquire can never delete a lease a DIFFERENT
+        process/owner acquired afterwards (holder-aware, mirrors ``release_merge_lease``).
+      * ``force=True`` (reaper reap / operator endpoint): delete unconditionally.
+    """
+    if task_id is None:
+        return
+    if not _enabled():
+        return
+    try:
+        conn = db.get_db()
+        try:
+            if force:
+                cur = conn.execute(
+                    "DELETE FROM transition_lease WHERE task_id=?", (task_id,)
+                )
+            else:
+                cur = conn.execute(
+                    "DELETE FROM transition_lease WHERE task_id=? AND owner_boot_id=?",
+                    (task_id, _BOOT_ID),
+                )
+            conn.commit()
+            n = cur.rowcount or 0
+        finally:
+            conn.close()
+        if n:
+            _bump("force_reclaims_total" if force else "released_total", n)
+    except Exception as e:  # noqa: BLE001 - never-raise (a leaked lease is bounded by Tier-3)
+        logger.warning("transition_lease.release error for task %s: %s", task_id, e)
+
+
+def reclaim_if_stale(task_id) -> bool:
+    """Reclaim (delete) the lease row for ``task_id`` IFF its owner is not live.
+
+    Returns True iff a stale row was reclaimed. Used by the operator endpoint and as
+    a backstop. never-raise -> False on error.
+    """
+    if task_id is None or not _enabled():
+        return False
+    try:
+        conn = db.get_db()
+        try:
+            n = _clear_stale_row(conn, task_id)
+            conn.commit()
+        finally:
+            conn.close()
+        if n:
+            _bump("stale_reclaims_total", n)
+            logger.warning("transition_lease: reclaimed stale lease for task %s", task_id)
+        return n > 0
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("transition_lease.reclaim_if_stale error for task %s: %s", task_id, e)
+        return False
+
+
+def recover_on_startup() -> int:
+    """Clear every lease left by a PREVIOUS process boot (ADR-001 D7).
+
+    Called from ``main.lifespan`` right after ``requeue_running_jobs`` and BEFORE the
+    reaper starts. A fresh process boot id means every existing row predates this
+    process -> stale -> deleted, so the requeued jobs re-drive their transitions
+    cleanly (idempotency comes from the authoritative durable facts — SHA-in-main,
+    the INITIATED self-deploy marker, the coverage-ratchet CAS — NOT from a new
+    recovery brain). Returns the number of rows cleared. never-raise -> 0 on error.
+    """
+    if not _enabled():
+        return 0
+    try:
+        conn = db.get_db()
+        try:
+            cur = conn.execute(
+                "DELETE FROM transition_lease "
+                "WHERE owner_boot_id IS NULL OR owner_boot_id != ?",
+                (_BOOT_ID,),
+            )
+            conn.commit()
+            n = cur.rowcount or 0
+        finally:
+            conn.close()
+        if n:
+            _bump("stale_reclaims_total", n)
+            logger.warning(
+                "transition_lease.recover_on_startup: cleared %d stale lease(s) from a "
+                "previous boot", n,
+            )
+            # FR-6 / AC-12: a forced/stale reclaim is observable (Telegram alert). A
+            # restart-time bulk reclaim is summarised (per-task clickable alerts come
+            # from the operator endpoint). best-effort, never-raise.
+            try:
+                from .notifications import send_telegram
+                send_telegram(
+                    f"♻️ Transition-lease recovery: сброшено {n} устаревш"
+                    f"(ий/их) lease после рестарта (переходы будут пере-исполнены "
+                    f"последовательно)."
+                )
+            except Exception:  # noqa: BLE001 - alert is best-effort
+                pass
+        return n
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("transition_lease.recover_on_startup error: %s", e)
+        return 0
+
+
+# ---------------------------------------------------------------------------
+# Stage write — compare-and-swap wrapper (ADR-001 D5 / FR-2)
+# ---------------------------------------------------------------------------
+def commit_stage_cas(task_id, expected_stage: str, new_stage: str, repo: str) -> bool:
+    """Write the task stage under the ORCH-114 contract. Returns True iff the write
+    was applied (and the caller may proceed with side effects), False iff the writer
+    lost the CAS race (the caller MUST abort WITHOUT side effects).
+
+      * ``applies(repo)`` False (kill-switch off / repo out of scope) -> the prior
+        unconditional ``update_task_stage`` (byte-for-byte) -> always True. Not wrapped
+        in a swallowing try, so a DB error propagates EXACTLY as before ORCH-114.
+      * ``applies(repo)`` True -> ``update_task_stage_cas`` (expected-stage compare-and-
+        swap). A lost race -> False (no side effect). never-raise on the CAS path: a DB
+        error -> False (abort the write; never a blind overwrite).
+    """
+    try:
+        scoped = applies(repo)
+    except Exception:  # noqa: BLE001 - applies already never-raises; belt-and-suspenders
+        scoped = False
+    if not scoped:
+        db.update_task_stage(task_id, new_stage)
+        return True
+    try:
+        won = db.update_task_stage_cas(task_id, expected_stage, new_stage)
+        if not won:
+            _bump("cas_lost_total")
+        return won
+    except Exception as e:  # noqa: BLE001 - abort the write (no blind overwrite)
+        logger.warning(
+            "transition_lease.commit_stage_cas error for task %s (%s->%s): %s",
+            task_id, expected_stage, new_stage, e,
+        )
+        return False
+
+
+# ---------------------------------------------------------------------------
+# Observability snapshot for GET /queue (ADR-001 D10 / FR-6)
+# ---------------------------------------------------------------------------
+def snapshot() -> dict:
+    """Read-only transition-lease summary for GET /queue. Additive block; existing
+    /queue keys untouched. never-raise -> a minimal dict on error.
+    """
+    try:
+        enabled = _enabled()
+    except Exception:  # noqa: BLE001
+        enabled = False
+    try:
+        repos_cfg = getattr(settings, "transition_lease_repos", "") or ""
+    except Exception:  # noqa: BLE001
+        repos_cfg = ""
+    holders: list[dict] = []
+    try:
+        conn = db.get_db()
+        try:
+            rows = conn.execute(
+                "SELECT task_id, owner, owner_pid, owner_boot_id, run_id, stage, "
+                "acquired_at, "
+                "CAST(strftime('%s','now') - strftime('%s', acquired_at) AS INTEGER) "
+                "  AS age_s "
+                "FROM transition_lease ORDER BY task_id"
+            ).fetchall()
+        finally:
+            conn.close()
+        for r in rows:
+            holders.append({
+                "task_id": r["task_id"],
+                "owner": r["owner"],
+                "stage": r["stage"],
+                "run_id": r["run_id"],
+                "age_s": r["age_s"],
+                "live": _row_is_live(r["owner_boot_id"], r["owner_pid"]),
+            })
+    except Exception as e:  # noqa: BLE001 - never break /queue
+        logger.warning("transition_lease.snapshot error: %s", e)
+    try:
+        with _LOCK:
+            counters = dict(_COUNTERS)
+    except Exception:  # noqa: BLE001
+        counters = {}
+    return {
+        "enabled": enabled,
+        "repos": repos_cfg,
+        "boot_id": _BOOT_ID,
+        "active": len(holders),
+        "holders": holders,
+        "counters": counters,
+    }
--- a/src/webhooks/gitea.py
+++ b/src/webhooks/gitea.py
@@ -13,7 +13,6 @@ from ..config import settings
 from ..db import (
    get_db,
    get_task_by_repo_branch,
-    update_task_stage,
    enqueue_job,
    insert_event_dedup,
 )
@@ -24,6 +23,7 @@ from ..notifications import notify_stage_change, notify_qg_failure, notify_error
 from ..agents.launcher import launcher
 from ..plane_sync import notify_stage_change as plane_notify_stage
 from ..projects import get_project_by_repo
+from .. import transition_lease

 logger = logging.getLogger("orchestrator.webhooks.gitea")

@@ -124,18 +124,25 @@ async def handle_push(payload: dict):
        if has_adr:
            # Advance to development
            next_stage = "development"
-            update_task_stage(task_id, next_stage)
-            notify_stage_change(task_id, current_stage, next_stage)
-            plane_notify_stage(work_item_id, current_stage, next_stage)
+            # ORCH-114 (adr-0045 / D4, TR-4): this push-driven advance writes the stage
+            # in BYPASS of advance_stage -> route through the expected-stage CAS so it
+            # cannot clobber a concurrent authoritative write; a lost race skips the
+            # notify + enqueue (no duplicate agent). Kill-switch off -> unconditional
+            # (byte-for-byte).
+            if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                notify_stage_change(task_id, current_stage, next_stage)
+                plane_notify_stage(work_item_id, current_stage, next_stage)

-            agent = get_agent_for_stage(current_stage)
-            if agent:
-                try:
-                    task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
-                    job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: push triggered {current_stage} → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                agent = get_agent_for_stage(current_stage)
+                if agent:
+                    try:
+                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
+                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: push triggered {current_stage} → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+            else:
+                logger.info(f"Task {task_id}: push-advance stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")

    elif current_stage == "development":
        # Source files pushed — just log, wait for CI
@@ -239,18 +246,22 @@ async def handle_ci_status(payload: dict):
        passed, reason = check_ci_green(repo_name, branch)
        if passed:
            next_stage = "review"
-            update_task_stage(task_id, next_stage)
-            notify_stage_change(task_id, current_stage, next_stage)
-            plane_notify_stage(work_item_id, current_stage, next_stage)
+            # ORCH-114 (adr-0045 / D4, TR-4): CI-green advance in BYPASS of
+            # advance_stage -> expected-stage CAS; a lost race skips notify + enqueue.
+            if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                notify_stage_change(task_id, current_stage, next_stage)
+                plane_notify_stage(work_item_id, current_stage, next_stage)

-            agent = get_agent_for_stage(current_stage)
-            if agent:
-                try:
-                    task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
-                    job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: CI green → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                agent = get_agent_for_stage(current_stage)
+                if agent:
+                    try:
+                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
+                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: CI green → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+            else:
+                logger.info(f"Task {task_id}: CI-green stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")
        else:
            notify_qg_failure(task_id, current_stage, "check_ci_green", reason)

@@ -330,18 +341,22 @@ async def handle_pr(payload: dict):
            passed, reason = check_review_approved(repo_name, pr_number)
            if passed:
                next_stage = "testing"
-                update_task_stage(task_id, next_stage)
-                notify_stage_change(task_id, current_stage, next_stage)
-                plane_notify_stage(work_item_id, current_stage, next_stage)
+                # ORCH-114 (adr-0045 / D4, TR-4): PR-approved advance in BYPASS of
+                # advance_stage -> expected-stage CAS; a lost race skips notify + enqueue.
+                if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                    notify_stage_change(task_id, current_stage, next_stage)
+                    plane_notify_stage(work_item_id, current_stage, next_stage)

-                agent = get_agent_for_stage(current_stage)
-                if agent:
-                    try:
-                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\nStage: {next_stage}"
-                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                        logger.info(f"Task {task_id}: PR approved → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                    except Exception as e:
-                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                    agent = get_agent_for_stage(current_stage)
+                    if agent:
+                        try:
+                            task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\nStage: {next_stage}"
+                            job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                            logger.info(f"Task {task_id}: PR approved → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                        except Exception as e:
+                            notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                else:
+                    logger.info(f"Task {task_id}: PR-approved stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")
            else:
                notify_qg_failure(task_id, current_stage, "check_review_approved", reason)

@@ -355,18 +370,24 @@ async def handle_pr(payload: dict):
            conn.close()

            if retry_count < MAX_DEV_RETRIES:
-                # Back to development, relaunch developer
-                update_task_stage(task_id, "development")
-                notify_stage_change(task_id, current_stage, "development")
-                try:
-                    task_desc = (
-                        f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\n"
-                        f"Stage: development\nNote: Changes requested in review (attempt {retry_count + 1}/{MAX_DEV_RETRIES})"
-                    )
-                    job_id = enqueue_job("developer", repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: changes requested, enqueued developer (attempt {retry_count + 1}, job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to relaunch developer: {e}")
+                # Back to development, relaunch developer.
+                # ORCH-114 (adr-0045 / D4, TR-4): REQUEST_CHANGES rollback writes the
+                # stage in BYPASS of advance_stage -> expected-stage CAS so it cannot
+                # clobber a concurrent authoritative write (e.g. a task that already
+                # advanced); a lost race skips the rollback + developer relaunch.
+                if transition_lease.commit_stage_cas(task_id, current_stage, "development", repo_name):
+                    notify_stage_change(task_id, current_stage, "development")
+                    try:
+                        task_desc = (
+                            f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\n"
+                            f"Stage: development\nNote: Changes requested in review (attempt {retry_count + 1}/{MAX_DEV_RETRIES})"
+                        )
+                        job_id = enqueue_job("developer", repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: changes requested, enqueued developer (attempt {retry_count + 1}, job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to relaunch developer: {e}")
+                else:
+                    logger.info(f"Task {task_id}: REQUEST_CHANGES rollback stage-CAS lost ({current_stage}->development); another writer moved it")
            else:
                notify_error(task_id, f"Max developer retries ({MAX_DEV_RETRIES}) reached, escalating")
                logger.error(f"Task {task_id}: max retries reached, needs manual intervention")
@@ -395,6 +416,11 @@ async def handle_pr(payload: dict):
                f"deployer verdict (check_deploy_status), ignoring merge-driven done."
            )
            return
-        update_task_stage(task_id, "done")
-        notify_stage_change(task_id, current_stage, "done")
-        logger.info(f"Task {task_id}: PR merged, stage → done")
+        # ORCH-114 (adr-0045 / D4, TR-4): merge-driven done writes the stage in BYPASS
+        # of advance_stage -> expected-stage CAS so a concurrent authoritative writer
+        # is not clobbered; a lost race skips the (idempotent) notify.
+        if transition_lease.commit_stage_cas(task_id, current_stage, "done", repo_name):
+            notify_stage_change(task_id, current_stage, "done")
+            logger.info(f"Task {task_id}: PR merged, stage → done")
+        else:
+            logger.info(f"Task {task_id}: merge-driven done stage-CAS lost ({current_stage}->done); another writer moved it")
--- a/src/webhooks/plane.py
+++ b/src/webhooks/plane.py
@@ -14,7 +14,6 @@ from ..db import (
    get_task_by_plane_id,
    get_next_work_item_id,
    ensure_unique_work_item_id,
-    update_task_stage,
    enqueue_job,
    insert_event_dedup,
    create_task_atomic,
@@ -35,6 +34,7 @@ from ..projects import (
    get_project_by_repo,
    known_plane_project_ids,
 )
+from .. import transition_lease

 logger = logging.getLogger("orchestrator.webhooks.plane")

@@ -803,7 +803,17 @@ async def _rollback_stage(
    if not prev_stage:
        logger.info(f"Task {task_id}: rejected at {current_stage} but no previous stage")
        return
-    update_task_stage(task_id, prev_stage)
+    # ORCH-114 (adr-0045 / D4, TR-4): this Rejected-rollback writes the stage in
+    # BYPASS of advance_stage. Route it through the expected-stage CAS so it can never
+    # clobber an authoritative write made by a concurrent owner (e.g. a deploy->done
+    # finalizer) — a lost race aborts the rollback WITHOUT its side effects. Kill-switch
+    # off / repo out of scope -> unconditional update (byte-for-byte).
+    if not transition_lease.commit_stage_cas(task_id, current_stage, prev_stage, repo):
+        logger.info(
+            f"Task {task_id}: rollback stage-CAS lost ({current_stage}->{prev_stage}) "
+            f"— task already moved by another writer; skipping rollback"
+        )
+        return
    notify_stage_change(task_id, current_stage, prev_stage)
    # Feature 3: plane_notify_stage moves the board to the prev stage's status.
    plane_notify_stage(work_item_id, current_stage, prev_stage)
@@ -857,10 +867,25 @@ async def _try_advance_stage(
    advance_stage). It is True ONLY on the "Confirm Deploy" path
    (handle_confirm_deploy) and gates Phase B of the self-hosting prod deploy; the
    plain Approved path (handle_verdict) leaves it at the default False.
+
+    ORCH-114 (adr-0045 / FR-5, AC-8): if a live actor already owns this task's
+    side-effectful transition (transition-lease active), DEFER — do not re-enter the
+    transition in parallel. The late legitimate signal is not lost: once the owner
+    releases (or dies and the reaper reclaims), a re-approve / the reconciler re-drives
+    it, or advance_stage becomes an idempotent no-op against the authoritative facts
+    (SHA-in-main / INITIATED). never raises; no-op when the lease is disabled / repo
+    out of scope.
    """
    import asyncio
    from ..stage_engine import advance_stage

+    if transition_lease.is_held_by_live_owner(task_id):
+        logger.info(
+            f"Task {task_id}: transition-lease active — deferring webhook advance "
+            f"from {current_stage} (confirm_deploy={confirm_deploy})"
+        )
+        return
+
    await asyncio.to_thread(
        advance_stage,
        task_id,