fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)

Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single owner. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5
actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer)
enter the same transition independently. A concurrent or post-restart re-entry
therefore re-applied irreversible effects (merge_pr / coverage-ratchet /
image-rebuild / prod-deploy initiation) and produced a contradictory
rollback<->done (incident ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:

1. Durable transition-lease (new table `transition_lease`): owner-exclusion on
   ENTRY to the side-effectful region. A second actor that sees a LIVE owner
   does not start the heavy sub-gates at all (prevention, not post-hoc repair).
2. Expected-stage CAS (`db.update_task_stage_cas`): atomicity on the stage
   WRITE. A lost race aborts with NO side effect. Also closes the 6 paths that
   write the stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat: a blocking 900s
merge re-test cannot beat one; ADR-001 D3), making restart recovery free (a
fresh boot_id renders every prior lease stale -> reclaimed by
recover_on_startup). The lease has no own TTL: its hard age ceiling is the
reaper Tier-3 backstop reaper_max_running_s, so the cross-cutting budget
invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging)
to a durable cross-path lease: the reaper consults it on all relevant paths
(defer live, reclaim dead; Tier-3 ignores the marker -> bounded; a reap
force-releases the lease); reconciler F-1 and the Plane webhook defer on an
active lease; main.lifespan calls recover_on_startup() after
requeue_running_jobs. finalizer_liveness.py is unchanged (it remains the
kill-switch-off fallback).

Scope: self-hosting (transition_lease_repos="" -> orchestrator only; enduro
untouched). Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates
to the prior unconditional update_task_stage, lease inert, reaper -> ORCH-113
fallback (byte-for-byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / existing table schemas: byte-for-byte (one additive
table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram
alert on forced/stale reclaim + optional
POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression
of the ORCH-111 class, red before fix, green after; TC-02..TC-14). Full suite
green (2048 passed); the 4 webhook tests that spied on the removed
gitea.update_task_stage were updated to spy on the new commit_stage_cas write
path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
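
The liveness rule in the message (same boot id AND a live pid, no heartbeat) can be sketched as below. This is a minimal illustration under stated assumptions, not the code from this commit: `LeaseRow`, `is_live`, `stale_rows`, and the injected `pid_alive` callable are hypothetical names, and treating a missing pid as "not live" is an assumption of the sketch.

```python
from __future__ import annotations

import os
import secrets
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical stand-in for the per-process boot nonce described in the message.
BOOT_ID = secrets.token_hex(16)


@dataclass
class LeaseRow:
    """Illustrative shape of one lease row (field names follow the new table)."""
    task_id: int
    owner_pid: Optional[int]
    owner_boot_id: Optional[str]


def is_live(row: LeaseRow, current_boot_id: str,
            pid_alive: Callable[[int], bool]) -> bool:
    """Live only if the row was written by this boot AND its pid is alive."""
    if row.owner_boot_id != current_boot_id:
        return False  # written by a previous process: stale by construction
    if row.owner_pid is None:
        return False  # assumption of this sketch: no pid means not live
    return pid_alive(row.owner_pid)


def stale_rows(rows: list[LeaseRow], current_boot_id: str,
               pid_alive: Callable[[int], bool]) -> list[LeaseRow]:
    """Rows that a recovery pass would reclaim under this rule."""
    return [r for r in rows if not is_live(r, current_boot_id, pid_alive)]
```

Because the boot id changes on every process start, no timer has to fire for a lease to become stale: a restart alone invalidates every earlier row.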
@@ -590,6 +590,43 @@ class Settings(BaseSettings):
     lease_reclaim_enabled: bool = True
     reaper_finalizer_liveness_enabled: bool = True
 
+    # ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS for
+    # side-effectful stage transitions. Generalises the process-local ORCH-113
+    # finalizer-liveness to a DURABLE, cross-path owner-exclusion (additive table
+    # `transition_lease`) so a concurrent OR post-restart re-entry into a side-effectful
+    # transition (reaper / reconciler / webhook / startup-requeue) is deferred or a
+    # no-op instead of re-applying an irreversible effect (merge_pr / coverage-ratchet /
+    # image-rebuild / prod-deploy initiation / contradictory rollback↔done). Two
+    # complementary layers, both gated by the SINGLE kill-switch below:
+    #   (1) durable lease on ENTRY to the side-effectful region (a second actor seeing a
+    #       live owner does not start the heavy sub-gates at all — prevention, not repair);
+    #   (2) expected-stage CAS on the stage WRITE (update_task_stage_cas: a lost race ->
+    #       abort with NO side effect), which also closes the 6 paths that write the
+    #       stage in bypass of advance_stage (gitea/plane direct update_task_stage).
+    # Liveness of the owner = owner_pid + owner_boot_id (NOT a heartbeat — a blocking
+    # 900s merge re-test cannot beat a heartbeat; ADR-001 D3), which makes restart
+    # recovery free (a new process -> new boot_id -> all prior leases are instantly
+    # stale -> reclaimed). The lease has NO own TTL: its hard age ceiling IS the reaper
+    # Tier-3 backstop reaper_max_running_s (5400), so the cross-cutting budget invariant
+    # ORCH-065/109/110/113 is untouched. STAGE_TRANSITIONS / QG_CHECKS / check_* /
+    # machine-verdict keys / existing table schemas — byte-for-byte. never-raise:
+    # hot-path guard fail-open (never wedge the shared queue), prod-safety fail-closed.
+    # See docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
+    # and the cross-cutting docs/architecture/adr/adr-0045-…md.
+    #   transition_lease_enabled -> SINGLE kill-switch (env ORCH_TRANSITION_LEASE_ENABLED).
+    #                               False -> the lease is neither written nor read AND the
+    #                               CAS degenerates to the prior unconditional
+    #                               update_task_stage -> behaviour byte-for-byte as before
+    #                               ORCH-114 (reaper -> ORCH-113 in-memory fallback,
+    #                               reconciler/webhook skip-guard inert). Default True.
+    #   transition_lease_repos   -> CSV scope (env ORCH_TRANSITION_LEASE_REPOS). Empty ->
+    #                               applies ONLY to the self-hosting repo (orchestrator),
+    #                               where the irreversible side-effectful edges live;
+    #                               non-empty -> only the listed repos. Mirrors
+    #                               coverage_gate_repos -> enduro untouched at the default.
+    transition_lease_enabled: bool = True
+    transition_lease_repos: str = ""
+
     # ORCH-063: disk-watchdog — background heartbeat that measures host-FS fill via
     # the mounted bind-paths and Telegram-alerts the operator at >= threshold. On
     # 07.06.2026 the mva154 host disk silently hit 100% and stalled the WHOLE
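
The scope rule described in the settings comment above (kill-switch first, then an empty CSV meaning "self-hosting repo only") might look roughly like this. It is a sketch under assumptions: `lease_applies` is a hypothetical helper, and comparing against the literal name `"orchestrator"` stands in for whatever self-hosting check the real module performs.

```python
def lease_applies(repo: str, enabled: bool, repos_csv: str,
                  self_hosting_repo: str = "orchestrator") -> bool:
    """Hypothetical scope check: does the transition-lease apply to `repo`?"""
    if not enabled:
        return False  # kill-switch off: lease inert for every repo
    # Parse the CSV scope, ignoring blanks and surrounding whitespace.
    listed = [name.strip() for name in repos_csv.split(",") if name.strip()]
    if not listed:
        # Empty scope: only the self-hosting repo is covered (assumed name match).
        return repo == self_hosting_repo
    return repo in listed
```

With the defaults from the hunk (`True`, `""`), only the orchestrator repo is in scope; listing repos explicitly replaces that default rather than extending it.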
55
src/db.py
@@ -263,6 +263,28 @@ def init_db():
     _ensure_column(conn, "lessons", "attribution", "TEXT")
     _ensure_column(conn, "lessons", "target_repo", "TEXT")
     _ensure_column(conn, "lessons", "target_domain", "TEXT")
+    # ORCH-114 (adr-0045 / 08-data-requirements.md): durable transition-ownership
+    # lease. ONE additive object (CREATE TABLE IF NOT EXISTS, pattern repo_freeze/
+    # coverage_baseline/lessons) -> idempotent, restart-safe on the shared prod DB;
+    # existing tables (tasks/jobs/agent_runs/...) untouched byte-for-byte (NFR-3,
+    # AC-11). One row per task = at most one active owner of a side-effectful
+    # transition. Liveness of the holder = owner_boot_id (this process's start nonce)
+    # + owner_pid (os.getpid of the holding process); a row from a previous boot is
+    # instantly stale on restart -> reclaimed (ADR-001 D3). No index needed (access by
+    # PK task_id; snapshot() is a full-scan over a tiny table). The src/transition_lease.py
+    # leaf wraps all access in its never-raise contract. NO epoch/version column (D2:
+    # for the one-process model the stage IS the CAS version).
+    conn.executescript("""
+        CREATE TABLE IF NOT EXISTS transition_lease (
+            task_id INTEGER PRIMARY KEY,
+            owner TEXT NOT NULL,
+            owner_pid INTEGER,
+            owner_boot_id TEXT,
+            run_id INTEGER,
+            stage TEXT,
+            acquired_at TEXT NOT NULL DEFAULT (datetime('now'))
+        );
+    """)
     conn.commit()
     conn.close()
 
@@ -679,6 +701,39 @@ def update_task_stage(task_id: int, stage: str):
         conn.close()
 
 
+def update_task_stage_cas(task_id: int, expected_stage: str, new_stage: str) -> bool:
+    """ORCH-114 (adr-0045 / FR-2): compare-and-swap variant of update_task_stage.
+
+    Writes the stage ONLY when the task is still at ``expected_stage`` (the value the
+    caller read before running the side-effectful region) — ``UPDATE … SET stage=?
+    WHERE id=? AND stage=?`` — and reports whether THIS writer won. Returns:
+
+    * ``True`` -> ``rowcount == 1``: the CAS succeeded, the stage moved exactly once.
+    * ``False`` -> ``rowcount == 0``: the task is no longer at ``expected_stage``
+      (another actor already advanced/rolled it back, or the row is gone) -> the
+      caller MUST abort WITHOUT applying any side effect (merge_pr / ratchet /
+      rebuild / deploy-init / enqueue) — it lost the race.
+
+    In the current one-process model each side-effectful edge leads to a DISTINCT
+    next stage, so the stage itself is a complete version for the compare-and-swap;
+    no separate epoch/version column is needed (ADR-001 D2). The plain
+    ``update_task_stage`` above is kept unchanged for the kill-switch-off path and
+    for non-side-effectful writes. Mirrors the atomic rowcount-guard idiom of
+    ``claim_next_job`` / ``reap_running_job``.
+    """
+    conn = get_db()
+    try:
+        cur = conn.execute(
+            "UPDATE tasks SET stage = ?, updated_at = datetime('now') "
+            "WHERE id = ? AND stage = ?",
+            (new_stage, task_id, expected_stage),
+        )
+        conn.commit()
+        return cur.rowcount == 1
+    finally:
+        conn.close()
+
+
 # ---------------------------------------------------------------------------
 # ORCH-019: bug-fast-track task type (tasks.track) helpers
 # ---------------------------------------------------------------------------
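
To see why the `WHERE id = ? AND stage = ?` guard is enough to pick a single winner, here is a small self-contained demonstration against an in-memory SQLite database. It is a sketch, not the project's code: `update_stage_cas` and `demo` are hypothetical names, and the `tasks` table is reduced to the two columns the guard needs.

```python
from __future__ import annotations

import sqlite3


def update_stage_cas(conn: sqlite3.Connection, task_id: int,
                     expected_stage: str, new_stage: str) -> bool:
    """Move the stage only if it still equals expected_stage; True if we won."""
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    return cur.rowcount == 1


def demo() -> tuple[bool, bool, str]:
    """Two writers that both read 'deploy' race to write; only the first wins."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT)")
    conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'deploy')")
    first = update_stage_cas(conn, 1, "deploy", "done")
    # The second writer still expects 'deploy', so its UPDATE matches no row.
    second = update_stage_cas(conn, 1, "deploy", "development")
    stage = conn.execute("SELECT stage FROM tasks WHERE id = 1").fetchone()[0]
    conn.close()
    return first, second, stage
```

The losing writer learns about the lost race from the return value alone, which is what lets the caller skip every follow-up side effect.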
@@ -434,18 +434,35 @@ class JobReaper:
         return None, None, None
 
     def _finalizer_owns(self, job: dict) -> bool:
-        """ORCH-113 (adr-0043 / D3): True iff a LIVE monitor still owns this job's
-        ``deploy-staging`` finalization, so the Tier-2 reap must be deferred.
+        """True iff a LIVE actor still owns this job's side-effectful finalization, so
+        the Tier-2 reap must be deferred.
 
-        Order matters for the zero-regression contract: the kill-switch is checked
-        FIRST (disabled -> ``False`` with no DB lookup, so the path is byte-for-byte
-        prior); then the stage is scoped to ``deploy-staging`` only (the sole edge
-        whose in-thread finalization runs for minutes — every other stage is left
-        untouched); only then is the process-local ownership marker consulted. Never
-        raises -> ``False`` on any error (conservative: never block reaping when
-        ownership is unknowable, so the Tier-3 backstop is never neutered).
+        ORCH-114 (adr-0045 / D6) GENERALISES the ORCH-113 process-local, Tier-2,
+        ``deploy-staging``-only marker to a DURABLE, cross-path lease: when the
+        transition-lease applies to this repo, consult ``transition_lease`` keyed on
+        the task (covers EVERY relevant edge — deploy-staging AND deploy->done — and
+        survives restart). Otherwise (kill-switch off) fall back to the unchanged
+        ORCH-113 in-memory ``finalizer_liveness`` (Tier-2 / ``deploy-staging`` only),
+        so the disabled path is byte-for-byte prior.
+
+        Either way the Tier-3 backstop (``reaper_max_running_s``) IGNORES this marker
+        (it does not call here), so a stuck/dead finalizer is still reaped in bounded
+        time. Never raises -> ``False`` on any error (conservative: never block reaping
+        when ownership is unknowable, so the backstop is never neutered).
         """
         try:
+            repo = job.get("repo")
+            # ORCH-114: durable cross-path lease (when enabled for this repo).
+            try:
+                from . import transition_lease
+                if transition_lease.applies(repo):
+                    return transition_lease.is_held_by_live_owner(job.get("task_id"))
+            except Exception as e:  # noqa: BLE001 - fall back to ORCH-113 on any error
+                logger.warning(
+                    "reaper: transition-lease check failed for job %s: %s",
+                    job.get("id"), e,
+                )
+            # ORCH-113 fallback (kill-switch off): process-local, Tier-2/deploy-staging.
             if not settings.reaper_finalizer_liveness_enabled:
                 return False
             _branch, stage, _wid = self._task_meta(job)
@@ -472,6 +489,18 @@ class JobReaper:
 
     def _note_reap(self, job: dict, outcome: str, reason: str) -> None:
         """Record + log one successful reap (Р-6 observability)."""
+        # ORCH-114 (adr-0045 / D6): a reap reclaims the job, so its durable
+        # transition-lease must NOT outlive it — force-release (any owner/boot) so a
+        # requeued job can re-acquire cleanly. never-raise; no-op when the lease is
+        # disabled / no row exists.
+        try:
+            from . import transition_lease
+            transition_lease.release(job.get("task_id"), force=True)
+        except Exception as e:  # noqa: BLE001 - never break the reap
+            logger.warning(
+                "reaper: transition-lease force-release failed for job %s: %s",
+                job.get("id"), e,
+            )
         self.reaped_total += 1
         self.last_reaped = {
             "job_id": job.get("id"),
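
The interaction between the two reaper tiers and the ownership marker can be summarised in a small decision function. This is an illustrative sketch only: `reap_decision` and the `tier2_s` parameter are hypothetical; only the rule that the Tier-3 backstop (`reaper_max_running_s`) ignores the marker while Tier-2 defers to a live owner comes from the docstrings above.

```python
def reap_decision(age_s: float, tier2_s: float, tier3_s: float,
                  owner_live: bool) -> str:
    """Return 'keep', 'defer', or 'reap' for a running job of the given age."""
    if age_s >= tier3_s:
        # Backstop: ownership is deliberately not consulted, so the wait is bounded.
        return "reap"
    if age_s >= tier2_s:
        # Tier-2: a live owner is still finalizing, so leave the job alone for now.
        return "defer" if owner_live else "reap"
    return "keep"
```

Because the backstop branch runs first, even a lease that is wrongly reported as live can delay a reap only until the Tier-3 ceiling.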
58
src/main.py
@@ -60,6 +60,25 @@ async def lifespan(app: FastAPI):
     if requeued:
         log.warning(f"Queue-recovery: requeued {requeued} running job(s) after restart")
 
+    # ORCH-114 (adr-0045 / D7 / FR-4): clear durable transition-leases left by the
+    # PREVIOUS process boot. This process has a fresh boot_id, so every prior lease is
+    # stale by construction -> reclaim it so the just-requeued jobs can re-drive their
+    # side-effectful transitions cleanly. Idempotency of the re-drive comes from the
+    # authoritative durable facts (SHA-in-main / the INITIATED self-deploy marker /
+    # the coverage-ratchet CAS), NOT from a new recovery brain — the lease only
+    # guarantees the re-drive runs SEQUENTIALLY (one owner), never concurrently. Runs
+    # AFTER requeue_running_jobs and BEFORE the reaper starts. never raises.
+    try:
+        from . import transition_lease
+        cleared_leases = transition_lease.recover_on_startup()
+        if cleared_leases:
+            log.warning(
+                f"Transition-lease recovery: cleared {cleared_leases} stale lease(s) "
+                f"from a previous boot"
+            )
+    except Exception as e:
+        log.warning(f"Transition-lease recovery skipped: {e}")
+
     # ORCH-065: proactive startup reclaim of dead/stale merge-leases, next to the
     # queue-recovery above. A lease held by the previous (now dead) process pid is
     # released at once instead of waiting for the TTL / a foreign acquire so the
@@ -215,6 +234,7 @@ async def queue():
     from . import bug_fast_track
     from . import lessons
     from . import checkout_hygiene
+    from . import transition_lease
     from .disk_watchdog import disk_watchdog
     from .build_cache_pruner import build_cache_pruner
     return {
@@ -258,6 +278,11 @@ async def queue():
         # ORCH-112 (D3): deploy-base checkout-hygiene observability (read-only) —
         # kill-switch + scope. Additive block; never-raise.
         "checkout_hygiene": checkout_hygiene.snapshot(),
+        # ORCH-114 (adr-0045 / D10 / FR-6): durable transition-ownership lease
+        # observability (read-only) — kill-switch, scope, boot_id, active holders
+        # (owner/stage/age/live) + defer/reclaim/CAS-lost counters. Additive block;
+        # never-raise.
+        "transition_lease": transition_lease.snapshot(),
         # ORCH-098 (FR-4 / AC-4): lessons-journal observability (read-only) —
         # kill-switch + counts by type/status + last N lessons. Additive block;
         # never-raise (snapshot() returns {"enabled": ...} minimum on error).
@@ -324,6 +349,39 @@ async def serial_gate_unfreeze(repo: str = ""):
     return {"ok": True, "repo": repo, "cleared": cleared, "frozen": frozen}
 
 
+@app.post("/transition-lease/release")
+async def transition_lease_release(work_item: str = ""):
+    """ORCH-114 (adr-0045 / D10): operator manual reclaim of a stuck transition-lease.
+
+    Modeled on ``POST /serial-gate/unfreeze``: if a lease somehow outlives its owner
+    (the normal try/finally release + the reaper force-release + the Tier-3 backstop
+    should make this unnecessary), an operator can force-release it by work-item id so
+    a re-approve / the reconciler can re-drive the transition. Idempotent: releasing a
+    free task reports ``released: false``. Read-only/never-raise otherwise.
+    """
+    from . import transition_lease
+    from .db import get_task_by_work_item_id
+    if not work_item or not work_item.strip():
+        return {"ok": False, "error": "missing 'work_item'", "work_item": work_item}
+    work_item = work_item.strip()
+    task = get_task_by_work_item_id(work_item)
+    if not task:
+        return {"ok": False, "error": "task not found", "work_item": work_item}
+    task_id = task["id"]
+    held_before = transition_lease.is_held_by_live_owner(task_id)
+    transition_lease.release(task_id, force=True)
+    if held_before:
+        try:
+            from .notifications import send_telegram, link_for
+            send_telegram(
+                f"🔓 {link_for(work_item)}: transition-lease сброшен вручную "
+                f"(task {task_id}). Переход может быть пере-исполнен."
+            )
+        except Exception:
+            pass
+    return {"ok": True, "work_item": work_item, "task_id": task_id, "released": held_before}
+
+
 @app.post("/fs-normalize/check")
 async def fs_normalize_check(normalize: bool = False):
     """ORCH-057 (D6 / AC-4): force a fresh legacy-ownership detect (bypass the TTL
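
The startup recovery called from `lifespan` above relies on one fact: any row whose boot id differs from the current one is stale. A minimal sketch of that sweep, assuming a reduced three-column version of the `transition_lease` table and a hypothetical helper name `clear_stale_leases` (the body of the real `recover_on_startup` is not shown in this part of the diff):

```python
import sqlite3

SCHEMA = """
CREATE TABLE transition_lease (
    task_id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    owner_boot_id TEXT
)
"""


def clear_stale_leases(conn: sqlite3.Connection, current_boot_id: str) -> int:
    """Delete every lease not written by this boot; return how many were cleared."""
    # IS NOT is SQLite's NULL-safe inequality, so rows with a NULL boot id
    # are treated as stale too (an assumption of this sketch).
    cur = conn.execute(
        "DELETE FROM transition_lease WHERE owner_boot_id IS NOT ?",
        (current_boot_id,),
    )
    conn.commit()
    return cur.rowcount
```

Running the sweep before the reaper starts, as the comment in the hunk requires, means no consumer ever observes a lease from a dead process as held.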
@@ -70,6 +70,7 @@ from .webhooks.plane import handle_status_start, handle_verdict
 from .notifications import send_telegram, link_for
 from . import projects
 from . import task_deps
+from . import transition_lease
 
 logger = logging.getLogger("orchestrator.reconciler")
 
@@ -153,6 +154,10 @@ class Reconciler:
         # ORCH-068 observability: terminal-state skips and dedup suppressions.
         self.skipped_terminal_total: int = 0
         self.deduped_total: int = 0
+        # ORCH-114 (adr-0045 / FR-5): F-1 advances deferred because a live actor owns
+        # the task's side-effectful transition (transition-lease active). Reset on
+        # restart (safe: a live lease is itself recovered/reclaimed on restart).
+        self.transition_lease_defers_total: int = 0
         # ORCH-068 (TR-3): in-memory dedup guard {issue_id -> last unblocked
         # state uuid}. Best-effort (resets on restart, like unblocked_total);
         # suppresses a repeat unblock notification for the same issue+state.
@@ -246,6 +251,19 @@ class Reconciler:
         if cyc:
             task_deps.handle_cycle(cyc)
             return
+        # ORCH-114 (adr-0045 / FR-5, AC-7): a live actor already owns this task's
+        # side-effectful transition -> F-1 must NOT advance it in parallel. Silent
+        # defer (mirrors the escalated/Blocked/task-deps skip-guards above); the owner
+        # finishes the transition or, on death, the reaper reclaims it in bounded time.
+        # fail-safe: is_held_by_live_owner is conservative (True on doubt -> defer).
+        # never raises; no-op (False) when the lease is disabled / repo out of scope.
+        if transition_lease.is_held_by_live_owner(task_id):
+            self.transition_lease_defers_total += 1
+            logger.debug(
+                f"reconciler F-1: task {task_id} has an active transition-lease — "
+                f"deferring advance to its owner"
+            )
+            return
         result = advance_if_gate_passed(
             task_id,
             stage,
@@ -596,6 +614,8 @@ class Reconciler:
             # ORCH-068 observability.
             "skipped_terminal_total": self.skipped_terminal_total,
             "deduped_total": self.deduped_total,
+            # ORCH-114 observability: F-1 advances deferred to a live lease owner.
+            "transition_lease_defers_total": self.transition_lease_defers_total,
         }
 
 
@@ -41,6 +41,7 @@ from . import self_deploy
 from . import post_deploy
 from . import labels
 from . import bug_fast_track
+from . import transition_lease
 from .notifications import (
     notify_stage_change,
     notify_qg_failure,
@@ -173,6 +174,20 @@ def developer_retry_count(task_id: int) -> int:
 _developer_retry_count = developer_retry_count
 
 
+def _is_side_effectful_edge(current_stage: str | None, next_stage: str | None) -> bool:
+    """ORCH-114 (adr-0045 D4): does this ``advance_stage`` edge run IRREVERSIBLE work
+    that must be owned by exactly one actor (lease on entry)?
+
+    * ``deploy-staging`` (-> deploy): the heavy edge sub-gates (security / merge-gate
+      re-test / coverage / image-freshness rebuild) + Phase A.
+    * ``deploy`` (-> done OR Phase B): merge_pr / coverage-ratchet / proof-of-merge,
+      or the detached prod-deploy initiation (confirm_deploy).
+    Every other edge (created -> … -> testing) is reversible and is protected by the
+    CAS-on-write alone (no lease). Pure, never raises.
+    """
+    return current_stage in ("deploy-staging", "deploy")
+
+
 def advance_stage(
     task_id: int,
     current_stage: str,
@@ -210,6 +225,12 @@ def advance_stage(
     """
     result = AdvanceResult(from_stage=current_stage)
     agent = finished_agent
+    # ORCH-114 (adr-0045): set True once we acquire the durable transition-lease on a
+    # side-effectful edge, so the finally below ALWAYS releases it (on success, on a
+    # lost CAS, on a sub-gate rollback, and on ANY exception caught by the outer
+    # except). Released holder-aware (this process only) so a reaper reclaim + reacquire
+    # in between is never clobbered.
+    _lease_held = False
     try:
         qg_name = get_qg_for_stage(current_stage)
         next_stage = get_next_stage(current_stage)
@@ -240,6 +261,28 @@ def advance_stage(
             result.note = "terminal"
             return result
 
+        # --- ORCH-114 transition-ownership lease: acquire on ENTRY (ADR-001 D5) ----
+        # On a side-effectful edge (deploy-staging / deploy) acquire the DURABLE
+        # owner-exclusion lease BEFORE the Phase B / sub-gate / merge-verify region. A
+        # second concurrent actor (reaper / reconciler / webhook / a re-driven startup
+        # job) that sees a live owner gets a clean "busy" defer here and does NOT start
+        # the heavy region at all — this is what kills the double-effect class
+        # (incident ORCH-111) at the root. Released in the `finally` below. Kill-switch
+        # off / repo out of scope -> applies() False -> no lease, byte-for-byte prior.
+        if _is_side_effectful_edge(current_stage, next_stage) and transition_lease.applies(repo):
+            if not transition_lease.acquire(
+                task_id, finished_agent or "engine", run_id=None, stage=current_stage
+            ):
+                logger.info(
+                    f"Task {task_id}: transition-lease busy on "
+                    f"{current_stage}->{next_stage} — deferring (another actor owns "
+                    f"this transition)"
+                )
+                result.note = "transition-lease-busy"
+                result.advanced = False
+                return result
+            _lease_held = True
+
         # --- ORCH-036/059 Phase B: "Confirm Deploy" on `deploy` -> initiate ----
         # ORCH-059: the prod-deploy trigger is now the DEDICATED "Confirm Deploy"
         # status (confirm_deploy=True), NOT the overloaded "Approved". On the
@@ -399,7 +442,23 @@ def advance_stage(
             return result
 
         # --- Advance ---------------------------------------------------------
-        update_task_stage(task_id, next_stage)
+        # ORCH-114 (adr-0045 / FR-2): expected-stage compare-and-swap. Writes the
+        # stage only if the task is STILL at current_stage (the value we read on
+        # entry); a lost race (another writer advanced/rolled back first) returns
+        # False -> abort here WITHOUT any side effect (no notify / no arm / no
+        # terminal-sync / no enqueue). Kill-switch off / repo out of scope ->
+        # degenerates to the prior unconditional update_task_stage (returns True) ->
+        # byte-for-byte prior behaviour. Defense-in-depth: under the lease acquired
+        # above this CAS practically always wins; it also covers the narrow
+        # consult->acquire window and any bypass writer (TR-5).
+        if not transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo):
+            logger.info(
+                f"Task {task_id}: stage-CAS lost on {current_stage}->{next_stage} — "
+                f"aborting without side effects (another writer advanced first)"
+            )
+            result.note = "stage-cas-lost"
+            result.advanced = False
+            return result
         # Telegram live tracker: the analysis->architecture advance is the human
         # Approved gate clearing -> stamp the END of "Ревью БРД" (the only
         # human time). Idempotent: only the first stamp counts.
@@ -510,6 +569,16 @@ def advance_stage(
         logger.error(f"advance_stage failed for task_id={task_id}: {e}")
         result.note = f"error: {e}"
         return result
+    finally:
+        # ORCH-114 (adr-0045 / AC-3): release the transition-lease on EVERY exit —
+        # normal advance, lost CAS, sub-gate rollback, Phase A/B early return, and any
+        # exception caught above — so the lease never "leaks" and wedges the task.
+        # holder-aware (force=False): only releases a row this process owns.
+        if _lease_held:
+            try:
+                transition_lease.release(task_id)
+            except Exception as e:  # noqa: BLE001 - never-raise (Tier-3 backstop bounds it)
+                logger.warning(f"Task {task_id}: transition-lease release failed: {e}")
 
 
 def advance_if_gate_passed(
@@ -1482,7 +1551,21 @@ def _handle_self_deploy_phase_a(
     restart-safe `approve-requested` marker records that Phase A ran. The merge
     lease stays HELD.
     """
-    update_task_stage(task_id, "deploy")
+    # ORCH-114 (adr-0045 / D4): this IS the deploy-staging -> deploy stage write on
+    # the self-hosting path (advance_stage's line-402 CAS is not reached — Phase A
+    # returns first). Use the same expected-stage CAS. It runs under the transition-
+    # lease acquired by advance_stage, so it practically always wins; a lost CAS
+    # (a concurrent writer despite the lease) -> abort Phase A WITHOUT initiating the
+    # prod-deploy ask / autoDeploy (no double effect). Kill-switch off / repo out of
+    # scope -> unconditional update (byte-for-byte).
+    if not transition_lease.commit_stage_cas(task_id, current_stage, "deploy", repo):
+        logger.info(
+            f"Task {task_id}: Phase A stage-CAS lost ({current_stage}->deploy) — "
+            f"aborting Phase A without side effects"
+        )
+        result.note = "phase-a-cas-lost"
+        result.advanced = False
+        return
     notify_stage_change(task_id, current_stage, "deploy")
     result.advanced = True
     result.to_stage = "deploy"
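
The acquire-on-entry / release-in-`finally` shape used by `advance_stage` above can be tried in isolation with the sketch below. It is an illustration under assumptions, not the module from this commit: `acquire`, `release`, and `run_transition` are hypothetical stand-ins, the table is reduced to three columns, and "a row from another boot is reclaimed on acquire" is the sketch's reading of the staleness rule.

```python
from __future__ import annotations

import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE transition_lease (
    task_id INTEGER PRIMARY KEY,
    owner TEXT NOT NULL,
    owner_boot_id TEXT
)
"""


def acquire(conn: sqlite3.Connection, task_id: int, owner: str, boot_id: str) -> bool:
    """Take the lease unless a same-boot row already holds it (assumed semantics)."""
    # Reclaim a row left behind by a previous boot before trying to insert ours.
    conn.execute(
        "DELETE FROM transition_lease WHERE task_id = ? AND owner_boot_id IS NOT ?",
        (task_id, boot_id),
    )
    try:
        conn.execute(
            "INSERT INTO transition_lease (task_id, owner, owner_boot_id) "
            "VALUES (?, ?, ?)",
            (task_id, owner, boot_id),
        )
    except sqlite3.IntegrityError:
        conn.rollback()  # primary-key conflict: a live owner already holds it
        return False
    conn.commit()
    return True


def release(conn: sqlite3.Connection, task_id: int, boot_id: str) -> None:
    """Holder-aware release: only remove a row written by this boot."""
    conn.execute(
        "DELETE FROM transition_lease WHERE task_id = ? AND owner_boot_id = ?",
        (task_id, boot_id),
    )
    conn.commit()


def run_transition(conn: sqlite3.Connection, task_id: int, owner: str,
                   boot_id: str, work: Callable[[], None]) -> str:
    """Acquire on entry, run the side-effectful work, release on every exit."""
    if not acquire(conn, task_id, owner, boot_id):
        return "transition-lease-busy"
    try:
        work()
        return "advanced"
    finally:
        release(conn, task_id, boot_id)
```

A second actor that enters while the first is still inside `work` gets the busy result immediately and never starts the heavy region, which is the prevention half of the design; the stage CAS remains the backstop for writers that bypass this entry point.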
471
src/transition_lease.py
Normal file
@@ -0,0 +1,471 @@
|
||||
"""ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS.
|
||||
|
||||
Leaf module — pure, never-raise (pattern of ``serial_gate`` / ``coverage_gate`` /
|
||||
``finalizer_liveness``: imports only ``db`` + ``config`` and lazily
|
||||
``merge_gate.pid_alive`` / ``qg.checks.is_self_hosting_repo`` / ``notifications``;
|
||||
it NEVER imports ``stage_engine`` / ``launcher`` and talks to no network).
|
||||
|
||||
The bug class it closes
|
||||
-----------------------
|
||||
``stage_engine.advance_stage`` is the single entry to side-effectful transitions
|
||||
(the heavy ``deploy-staging -> deploy`` edge sub-gates — security / merge-gate
|
||||
re-test / coverage / image-freshness — and the ``deploy -> done`` merge-verify:
|
||||
``merge_pr`` / coverage-ratchet / proof-of-merge). It is RE-ENTERABLE: at least
|
||||
five actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy
|
||||
finalizer) can enter the SAME transition independently, and the stage write was a
|
||||
bare ``UPDATE … WHERE id=?`` with no compare-and-swap. Two concurrent — or a
|
||||
post-restart re-driven — entry therefore re-applied irreversible effects and
|
||||
produced contradictory outcomes (one path rolled back to ``development`` while
|
||||
another merged + finished — incident ORCH-111, job 1914 / PR #130). ORCH-113
|
||||
closed only the in-memory, Tier-2, ``deploy-staging``-only slice of this; it is
|
||||
lost on restart.
|
||||
|
||||
Two complementary layers (ADR-001 D1), both gated by one kill-switch:
|
||||
|
||||
1. **Durable lease (owner-exclusion on ENTRY).** A row in the additive
|
||||
``transition_lease`` table (one per task) records "an actor owns this task's
|
||||
side-effectful transition". A second actor that sees a LIVE owner does not
|
||||
start the heavy sub-gates AT ALL (prevention, not post-hoc repair).
|
||||
2. **Expected-stage CAS (atomicity on the WRITE).** ``update_task_stage_cas``
|
||||
writes the stage only when the task is still at the expected stage; a lost
|
||||
race aborts with NO side effect. It also closes the six paths that write the
|
||||
stage in BYPASS of ``advance_stage`` (gitea / plane direct ``update_task_stage``).
|
||||
|
||||
Liveness without a heartbeat (ADR-001 D3)
|
||||
-----------------------------------------
|
||||
An owner is LIVE ⇔ ``owner_boot_id == <this process's boot id>`` AND
``merge_gate.pid_alive(owner_pid)``. There is NO heartbeat (a blocking 900 s merge
re-test cannot beat one — the very argument ORCH-113 used to reject heartbeats).
This makes restart recovery free: a new process has a new ``boot_id`` so every row
written by a previous process is instantly stale and reclaimed
(``recover_on_startup``). Within the one-process model every live owner shares the
SAME boot id and pid, so a same-boot row is by definition owned by the (alive)
current process; only a different-boot row can be stale — which is why the
acquire/recover logic keys staleness on the boot id.

No own TTL (ADR-001 D8): the lease's hard age ceiling IS the reaper Tier-3 backstop
``reaper_max_running_s`` (the reaper force-releases the lease when it reaps), so the
cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

never-raise (ADR-001 D9 / NFR-1): every public function is isolated. The
directional defaults:
* ``acquire`` error -> ``False`` (busy): the caller DEFERS/aborts a side-effectful
  transition rather than risk a double effect (fail-CLOSED to no-double-effect).
* ``is_held_by_live_owner`` error -> ``True`` (treat as held): the consulting
  reconciler / webhook / reaper conservatively DEFERS (the safe action; the reaper
  Tier-3 backstop still bounds a genuinely stuck task).
* ``commit_stage_cas`` error on the CAS path -> ``False``: abort the write, never a
  blind overwrite.
The hot claim path (``db.claim_next_job``) is deliberately NOT touched, so a lease
bug can never wedge the shared queue of all projects (AC-8 ORCH-088 intact).

See docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
and the cross-cutting docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md.
"""
from __future__ import annotations

import logging
import os
import secrets
import threading

from . import db
from .config import settings

logger = logging.getLogger("orchestrator.transition_lease")

# Per-process boot nonce (ADR-001 D3). Generated ONCE at import: every lease row a
# previous process wrote carries a DIFFERENT boot id and is therefore instantly
# stale after a restart -> reclaimed by recover_on_startup / acquire. Not derived
# from the clock so it cannot collide across a fast restart.
_BOOT_ID = secrets.token_hex(16)

# Best-effort observability counters (reset on restart, like the reaper's). Guarded
# by a lock because the monitor / reaper / reconciler / webhook threads all touch
# them. Never a source of truth — purely for GET /queue.
_LOCK = threading.Lock()
_COUNTERS: dict[str, int] = {
    "acquired_total": 0,        # leases successfully acquired
    "busy_total": 0,            # acquire deferred — a live owner already held it
    "released_total": 0,        # normal try/finally releases
    "cas_lost_total": 0,        # stage-CAS lost the race (aborted without side effect)
    "stale_reclaims_total": 0,  # rows reclaimed because the owner was not live
    "force_reclaims_total": 0,  # rows force-released (reaper / operator)
}


def _bump(key: str, n: int = 1) -> None:
    try:
        with _LOCK:
            _COUNTERS[key] = _COUNTERS.get(key, 0) + n
    except Exception:  # noqa: BLE001 - counters never break a caller
        pass


def boot_id() -> str:
    """This process's boot nonce (exposed for tests / observability)."""
    return _BOOT_ID


# ---------------------------------------------------------------------------
# Conditionality (mirrors coverage_gate_applies — self-hosting-only by default)
# ---------------------------------------------------------------------------
def _enabled() -> bool:
    try:
        return bool(getattr(settings, "transition_lease_enabled", False))
    except Exception:  # noqa: BLE001
        return False


def applies(repo: str) -> bool:
    """Whether the transition-lease + CAS are REAL for this repo (ADR-001 D10).

    * ``transition_lease_enabled=False`` -> always False (kill-switch; the lease is
      neither written nor read AND ``commit_stage_cas`` degenerates to the prior
      unconditional ``update_task_stage`` -> behaviour byte-for-byte as before
      ORCH-114).
    * ``transition_lease_repos`` (CSV) non-empty -> real only for the listed repos.
    * empty CSV -> real ONLY for the self-hosting repo (``orchestrator``), where the
      irreversible side-effectful edges live (mirrors coverage_gate_repos -> enduro
      untouched at the default).
    Never raises -> False on error (the safe "mechanism inert" default == kill-switch
    off).
    """
    try:
        if not _enabled():
            return False
        raw = (getattr(settings, "transition_lease_repos", "") or "").strip()
        if raw:
            allowed = {r.strip().lower() for r in raw.split(",") if r.strip()}
            return (repo or "").strip().lower() in allowed
        from .qg.checks import is_self_hosting_repo
        return is_self_hosting_repo(repo)
    except Exception as e:  # noqa: BLE001 - never-raise contract
        logger.warning("transition_lease.applies error for %s: %s", repo, e)
        return False


# ---------------------------------------------------------------------------
# Liveness
# ---------------------------------------------------------------------------
def _pid_alive(pid) -> bool:
    """Probe ``pid`` liveness via ``merge_gate.pid_alive`` (ADR-001 references it for
    a single shared semantics). Lazy import keeps this module a leaf; on import error
    fall back to a conservative ``True`` (a lease whose pid we cannot probe is treated
    as live — the boot-id check below + the Tier-3 backstop still bound it).
    """
    try:
        from .merge_gate import pid_alive
        return pid_alive(pid)
    except Exception:  # noqa: BLE001
        return True


def _row_is_live(owner_boot_id, owner_pid) -> bool:
    """True iff the lease owner is LIVE (this process's boot AND a live pid).

    A row from a DIFFERENT boot id (a previous process) is dead by construction
    (ADR-001 D3); a same-boot row is owned by the current — alive — process, but we
    still probe the pid for forward-compatibility with a future multi-process model.
    """
    if owner_boot_id != _BOOT_ID:
        return False
    return _pid_alive(owner_pid)


def is_held_by_live_owner(task_id) -> bool:
    """True iff an active lease row for ``task_id`` is owned by a LIVE actor.

    Consulted by the reconciler F-1 / Plane-webhook DEFER guards and the reaper.
    Returns ``False`` when there is no row or the owner is stale. Fail-CLOSED on any
    error -> ``True`` (treat as held): the consulting caller DEFERS, which is always
    the safe-against-double-effect action (the reaper Tier-3 backstop still bounds a
    truly stuck task). When the mechanism is disabled -> ``False`` (no defer).
    """
    if task_id is None:
        return False
    if not _enabled():
        return False
    try:
        conn = db.get_db()
        try:
            row = conn.execute(
                "SELECT owner_boot_id, owner_pid FROM transition_lease WHERE task_id=?",
                (task_id,),
            ).fetchone()
        finally:
            conn.close()
        if row is None:
            return False
        return _row_is_live(row["owner_boot_id"], row["owner_pid"])
    except Exception as e:  # noqa: BLE001 - fail-CLOSED on doubt (conservative defer)
        logger.warning(
            "transition_lease.is_held_by_live_owner error for task %s -> "
            "fail-CLOSED (defer): %s", task_id, e,
        )
        return True


# ---------------------------------------------------------------------------
# Acquire / release / reclaim
# ---------------------------------------------------------------------------
def _clear_stale_row(conn, task_id) -> int:
    """Delete the lease row for ``task_id`` IFF its owner is not live. Returns the
    rowcount. Uses the caller's connection (same transaction as the INSERT in
    ``acquire``). Raises on a real DB fault (the caller swallows)."""
    row = conn.execute(
        "SELECT owner_boot_id, owner_pid FROM transition_lease WHERE task_id=?",
        (task_id,),
    ).fetchone()
    if row is None:
        return 0
    if _row_is_live(row["owner_boot_id"], row["owner_pid"]):
        return 0
    cur = conn.execute("DELETE FROM transition_lease WHERE task_id=?", (task_id,))
    return cur.rowcount or 0


def acquire(task_id, owner: str, run_id=None, stage: str | None = None) -> bool:
    """Acquire the side-effectful-transition lease for ``task_id`` (ADR-001 D5).

    Atomic rowcount-guard (pattern ``claim_next_job`` / ``reap_running_job``): a stale
    row (owner from a previous boot / dead pid) is cleared first, then an
    ``INSERT … ON CONFLICT(task_id) DO NOTHING`` competes only with LIVE same-process
    owners. ``rowcount == 1`` -> WE won. ``rowcount == 0`` -> a live owner already
    holds it -> ``False`` (the caller DEFERS without starting the heavy region).

    Kill-switch off -> ``True`` (no-op acquire; the caller proceeds exactly as before
    ORCH-114; ``release`` is then an idempotent no-op). ``task_id is None`` -> ``True``
    (a job with no task cannot be leased — legacy direct ``launch()``; proceed).

    never-raise: any error -> ``False`` (busy) so the caller DEFERS a side-effectful
    transition rather than risk a double effect (fail-CLOSED to no-double-effect,
    ADR-001 D9).
    """
    if not _enabled():
        return True
    if task_id is None:
        return True
    try:
        conn = db.get_db()
        try:
            _clear_stale_row(conn, task_id)
            cur = conn.execute(
                "INSERT INTO transition_lease "
                "(task_id, owner, owner_pid, owner_boot_id, run_id, stage) "
                "VALUES (?, ?, ?, ?, ?, ?) "
                "ON CONFLICT(task_id) DO NOTHING",
                (task_id, owner or "engine", os.getpid(), _BOOT_ID, run_id, stage),
            )
            conn.commit()
            won = (cur.rowcount == 1)
        finally:
            conn.close()
        if won:
            _bump("acquired_total")
            return True
        _bump("busy_total")
        logger.info(
            "transition_lease: task %s busy (a live owner holds the transition); "
            "%s defers", task_id, owner,
        )
        return False
    except Exception as e:  # noqa: BLE001 - fail-CLOSED (busy) to avoid double effects
        logger.warning("transition_lease.acquire error for task %s: %s", task_id, e)
        return False


def release(task_id, force: bool = False) -> None:
    """Release the lease for ``task_id`` (ADR-001 D5). Idempotent, never raises.

    * ``force=False`` (normal try/finally release in ``advance_stage``): delete only
      a row owned by THIS process (``owner_boot_id == boot``), so a release delayed
      past a reaper-reclaim-then-reacquire can never delete a lease a DIFFERENT
      process/owner acquired afterwards (holder-aware, mirrors ``release_merge_lease``).
    * ``force=True`` (reaper reap / operator endpoint): delete unconditionally.
    """
    if task_id is None:
        return
    if not _enabled():
        return
    try:
        conn = db.get_db()
        try:
            if force:
                cur = conn.execute(
                    "DELETE FROM transition_lease WHERE task_id=?", (task_id,)
                )
            else:
                cur = conn.execute(
                    "DELETE FROM transition_lease WHERE task_id=? AND owner_boot_id=?",
                    (task_id, _BOOT_ID),
                )
            conn.commit()
            n = cur.rowcount or 0
        finally:
            conn.close()
        if n:
            _bump("force_reclaims_total" if force else "released_total", n)
    except Exception as e:  # noqa: BLE001 - never-raise (a leaked lease is bounded by Tier-3)
        logger.warning("transition_lease.release error for task %s: %s", task_id, e)


def reclaim_if_stale(task_id) -> bool:
    """Reclaim (delete) the lease row for ``task_id`` IFF its owner is not live.

    Returns True iff a stale row was reclaimed. Used by the operator endpoint and as
    a backstop. never-raise -> False on error.
    """
    if task_id is None or not _enabled():
        return False
    try:
        conn = db.get_db()
        try:
            n = _clear_stale_row(conn, task_id)
            conn.commit()
        finally:
            conn.close()
        if n:
            _bump("stale_reclaims_total", n)
            logger.warning("transition_lease: reclaimed stale lease for task %s", task_id)
        return n > 0
    except Exception as e:  # noqa: BLE001 - never-raise
        logger.warning("transition_lease.reclaim_if_stale error for task %s: %s", task_id, e)
        return False


def recover_on_startup() -> int:
    """Clear every lease left by a PREVIOUS process boot (ADR-001 D7).

    Called from ``main.lifespan`` right after ``requeue_running_jobs`` and BEFORE the
    reaper starts. A fresh process boot id means every existing row predates this
    process -> stale -> deleted, so the requeued jobs re-drive their transitions
    cleanly (idempotency comes from the authoritative durable facts — SHA-in-main,
    the INITIATED self-deploy marker, the coverage-ratchet CAS — NOT from a new
    recovery brain). Returns the number of rows cleared. never-raise -> 0 on error.
    """
    if not _enabled():
        return 0
    try:
        conn = db.get_db()
        try:
            cur = conn.execute(
                "DELETE FROM transition_lease "
                "WHERE owner_boot_id IS NULL OR owner_boot_id != ?",
                (_BOOT_ID,),
            )
            conn.commit()
            n = cur.rowcount or 0
        finally:
            conn.close()
        if n:
            _bump("stale_reclaims_total", n)
            logger.warning(
                "transition_lease.recover_on_startup: cleared %d stale lease(s) from a "
                "previous boot", n,
            )
            # FR-6 / AC-12: a forced/stale reclaim is observable (Telegram alert). A
            # restart-time bulk reclaim is summarised (per-task clickable alerts come
            # from the operator endpoint). best-effort, never-raise.
            try:
                from .notifications import send_telegram
                send_telegram(
                    f"♻️ Transition-lease recovery: сброшено {n} устаревш"
                    f"(ий/их) lease после рестарта (переходы будут пере-исполнены "
                    f"последовательно)."
                )
            except Exception:  # noqa: BLE001 - alert is best-effort
                pass
        return n
    except Exception as e:  # noqa: BLE001 - never-raise
        logger.warning("transition_lease.recover_on_startup error: %s", e)
        return 0


# ---------------------------------------------------------------------------
# Stage write — compare-and-swap wrapper (ADR-001 D5 / FR-2)
# ---------------------------------------------------------------------------
def commit_stage_cas(task_id, expected_stage: str, new_stage: str, repo: str) -> bool:
    """Write the task stage under the ORCH-114 contract. Returns True iff the write
    was applied (and the caller may proceed with side effects), False iff the writer
    lost the CAS race (the caller MUST abort WITHOUT side effects).

    * ``applies(repo)`` False (kill-switch off / repo out of scope) -> the prior
      unconditional ``update_task_stage`` (byte-for-byte) -> always True. Not wrapped
      in a swallowing try, so a DB error propagates EXACTLY as before ORCH-114.
    * ``applies(repo)`` True -> ``update_task_stage_cas`` (expected-stage compare-and-
      swap). A lost race -> False (no side effect). never-raise on the CAS path: a DB
      error -> False (abort the write; never a blind overwrite).
    """
    try:
        scoped = applies(repo)
    except Exception:  # noqa: BLE001 - applies already never-raises; belt-and-suspenders
        scoped = False
    if not scoped:
        db.update_task_stage(task_id, new_stage)
        return True
    try:
        won = db.update_task_stage_cas(task_id, expected_stage, new_stage)
        if not won:
            _bump("cas_lost_total")
        return won
    except Exception as e:  # noqa: BLE001 - abort the write (no blind overwrite)
        logger.warning(
            "transition_lease.commit_stage_cas error for task %s (%s->%s): %s",
            task_id, expected_stage, new_stage, e,
        )
        return False


# ---------------------------------------------------------------------------
# Observability snapshot for GET /queue (ADR-001 D10 / FR-6)
# ---------------------------------------------------------------------------
def snapshot() -> dict:
    """Read-only transition-lease summary for GET /queue. Additive block; existing
    /queue keys untouched. never-raise -> a minimal dict on error.
    """
    try:
        enabled = _enabled()
    except Exception:  # noqa: BLE001
        enabled = False
    try:
        repos_cfg = getattr(settings, "transition_lease_repos", "") or ""
    except Exception:  # noqa: BLE001
        repos_cfg = ""
    holders: list[dict] = []
    try:
        conn = db.get_db()
        try:
            rows = conn.execute(
                "SELECT task_id, owner, owner_pid, owner_boot_id, run_id, stage, "
                "acquired_at, "
                "CAST(strftime('%s','now') - strftime('%s', acquired_at) AS INTEGER) "
                " AS age_s "
                "FROM transition_lease ORDER BY task_id"
            ).fetchall()
        finally:
            conn.close()
        for r in rows:
            holders.append({
                "task_id": r["task_id"],
                "owner": r["owner"],
                "stage": r["stage"],
                "run_id": r["run_id"],
                "age_s": r["age_s"],
                "live": _row_is_live(r["owner_boot_id"], r["owner_pid"]),
            })
    except Exception as e:  # noqa: BLE001 - never break /queue
        logger.warning("transition_lease.snapshot error: %s", e)
    try:
        with _LOCK:
            counters = dict(_COUNTERS)
    except Exception:  # noqa: BLE001
        counters = {}
    return {
        "enabled": enabled,
        "repos": repos_cfg,
        "boot_id": _BOOT_ID,
        "active": len(holders),
        "holders": holders,
        "counters": counters,
    }
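The acquire path of the module above can be tried in isolation. What follows is a minimal sketch, not the module's actual code. It assumes a cut-down `transition_lease` schema with `task_id` as the primary key (the real DDL is not part of this excerpt), uses a hypothetical helper `sketch_acquire`, takes the boot id as a parameter so a restart can be simulated, and keys staleness on the boot id only, skipping the pid probe.

```python
import os
import secrets
import sqlite3

# Hypothetical, cut-down schema: the real migration is not shown in this diff.
SCHEMA = """
CREATE TABLE transition_lease (
    task_id       INTEGER PRIMARY KEY,
    owner         TEXT NOT NULL,
    owner_pid     INTEGER,
    owner_boot_id TEXT
)
"""


def sketch_acquire(conn, task_id, owner, boot_id):
    """Return True iff this caller won the lease for task_id."""
    # Step 1: drop a row written under another boot id (stale by construction).
    conn.execute(
        "DELETE FROM transition_lease "
        "WHERE task_id=? AND (owner_boot_id IS NULL OR owner_boot_id != ?)",
        (task_id, boot_id),
    )
    # Step 2: the insert changes a row only if no same-boot row remains.
    cur = conn.execute(
        "INSERT INTO transition_lease (task_id, owner, owner_pid, owner_boot_id) "
        "VALUES (?, ?, ?, ?) ON CONFLICT(task_id) DO NOTHING",
        (task_id, owner, os.getpid(), boot_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
boot_a = secrets.token_hex(16)
boot_b = secrets.token_hex(16)  # stands in for the process after a restart

first = sketch_acquire(conn, 42, "monitor", boot_a)          # no row yet
second = sketch_acquire(conn, 42, "reconciler", boot_a)      # same boot holds it
after_restart = sketch_acquire(conn, 42, "monitor", boot_b)  # old row is stale
```

In this sketch the second caller gets `False` without touching the row, which is the "defer before the heavy region" behaviour the docstring of `acquire` describes; the call under the new boot id succeeds because the old row is deleted first.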
@@ -13,7 +13,6 @@ from ..config import settings
 from ..db import (
     get_db,
     get_task_by_repo_branch,
-    update_task_stage,
     enqueue_job,
     insert_event_dedup,
 )
@@ -24,6 +23,7 @@ from ..notifications import notify_stage_change, notify_qg_failure, notify_error
 from ..agents.launcher import launcher
 from ..plane_sync import notify_stage_change as plane_notify_stage
 from ..projects import get_project_by_repo
+from .. import transition_lease
 
 logger = logging.getLogger("orchestrator.webhooks.gitea")
 
@@ -124,18 +124,25 @@ async def handle_push(payload: dict):
         if has_adr:
             # Advance to development
             next_stage = "development"
-            update_task_stage(task_id, next_stage)
-            notify_stage_change(task_id, current_stage, next_stage)
-            plane_notify_stage(work_item_id, current_stage, next_stage)
+            # ORCH-114 (adr-0045 / D4, TR-4): this push-driven advance writes the stage
+            # in BYPASS of advance_stage -> route through the expected-stage CAS so it
+            # cannot clobber a concurrent authoritative write; a lost race skips the
+            # notify + enqueue (no duplicate agent). Kill-switch off -> unconditional
+            # (byte-for-byte).
+            if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                notify_stage_change(task_id, current_stage, next_stage)
+                plane_notify_stage(work_item_id, current_stage, next_stage)
 
-            agent = get_agent_for_stage(current_stage)
-            if agent:
-                try:
-                    task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
-                    job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: push triggered {current_stage} → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                agent = get_agent_for_stage(current_stage)
+                if agent:
+                    try:
+                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
+                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: push triggered {current_stage} → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+            else:
+                logger.info(f"Task {task_id}: push-advance stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")
 
     elif current_stage == "development":
         # Source files pushed — just log, wait for CI
@@ -239,18 +246,22 @@ async def handle_ci_status(payload: dict):
         passed, reason = check_ci_green(repo_name, branch)
         if passed:
             next_stage = "review"
-            update_task_stage(task_id, next_stage)
-            notify_stage_change(task_id, current_stage, next_stage)
-            plane_notify_stage(work_item_id, current_stage, next_stage)
+            # ORCH-114 (adr-0045 / D4, TR-4): CI-green advance in BYPASS of
+            # advance_stage -> expected-stage CAS; a lost race skips notify + enqueue.
+            if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                notify_stage_change(task_id, current_stage, next_stage)
+                plane_notify_stage(work_item_id, current_stage, next_stage)
 
-            agent = get_agent_for_stage(current_stage)
-            if agent:
-                try:
-                    task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
-                    job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: CI green → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                agent = get_agent_for_stage(current_stage)
+                if agent:
+                    try:
+                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
+                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: CI green → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+            else:
+                logger.info(f"Task {task_id}: CI-green stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")
         else:
             notify_qg_failure(task_id, current_stage, "check_ci_green", reason)
 
@@ -330,18 +341,22 @@ async def handle_pr(payload: dict):
         passed, reason = check_review_approved(repo_name, pr_number)
         if passed:
             next_stage = "testing"
-            update_task_stage(task_id, next_stage)
-            notify_stage_change(task_id, current_stage, next_stage)
-            plane_notify_stage(work_item_id, current_stage, next_stage)
+            # ORCH-114 (adr-0045 / D4, TR-4): PR-approved advance in BYPASS of
+            # advance_stage -> expected-stage CAS; a lost race skips notify + enqueue.
+            if transition_lease.commit_stage_cas(task_id, current_stage, next_stage, repo_name):
+                notify_stage_change(task_id, current_stage, next_stage)
+                plane_notify_stage(work_item_id, current_stage, next_stage)
 
-            agent = get_agent_for_stage(current_stage)
-            if agent:
-                try:
-                    task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\nStage: {next_stage}"
-                    job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
-                    logger.info(f"Task {task_id}: PR approved → {next_stage}, enqueued '{agent}' (job_id={job_id})")
-                except Exception as e:
-                    notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+                agent = get_agent_for_stage(current_stage)
+                if agent:
+                    try:
+                        task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\nStage: {next_stage}"
+                        job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
+                        logger.info(f"Task {task_id}: PR approved → {next_stage}, enqueued '{agent}' (job_id={job_id})")
+                    except Exception as e:
+                        notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
+            else:
+                logger.info(f"Task {task_id}: PR-approved stage-CAS lost ({current_stage}->{next_stage}); another writer moved it")
         else:
             notify_qg_failure(task_id, current_stage, "check_review_approved", reason)
 
@@ -355,18 +370,24 @@ async def handle_pr(payload: dict):
             conn.close()
 
         if retry_count < MAX_DEV_RETRIES:
-            # Back to development, relaunch developer
-            update_task_stage(task_id, "development")
-            notify_stage_change(task_id, current_stage, "development")
-            try:
-                task_desc = (
-                    f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\n"
-                    f"Stage: development\nNote: Changes requested in review (attempt {retry_count + 1}/{MAX_DEV_RETRIES})"
-                )
-                job_id = enqueue_job("developer", repo_name, task_desc, task_id=task_id)
-                logger.info(f"Task {task_id}: changes requested, enqueued developer (attempt {retry_count + 1}, job_id={job_id})")
-            except Exception as e:
-                notify_error(task_id, f"Failed to relaunch developer: {e}")
+            # Back to development, relaunch developer.
+            # ORCH-114 (adr-0045 / D4, TR-4): REQUEST_CHANGES rollback writes the
+            # stage in BYPASS of advance_stage -> expected-stage CAS so it cannot
+            # clobber a concurrent authoritative write (e.g. a task that already
+            # advanced); a lost race skips the rollback + developer relaunch.
+            if transition_lease.commit_stage_cas(task_id, current_stage, "development", repo_name):
+                notify_stage_change(task_id, current_stage, "development")
+                try:
+                    task_desc = (
+                        f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\n"
+                        f"Stage: development\nNote: Changes requested in review (attempt {retry_count + 1}/{MAX_DEV_RETRIES})"
+                    )
+                    job_id = enqueue_job("developer", repo_name, task_desc, task_id=task_id)
+                    logger.info(f"Task {task_id}: changes requested, enqueued developer (attempt {retry_count + 1}, job_id={job_id})")
+                except Exception as e:
+                    notify_error(task_id, f"Failed to relaunch developer: {e}")
+            else:
+                logger.info(f"Task {task_id}: REQUEST_CHANGES rollback stage-CAS lost ({current_stage}->development); another writer moved it")
         else:
             notify_error(task_id, f"Max developer retries ({MAX_DEV_RETRIES}) reached, escalating")
             logger.error(f"Task {task_id}: max retries reached, needs manual intervention")
@@ -395,6 +416,11 @@ async def handle_pr(payload: dict):
                 f"deployer verdict (check_deploy_status), ignoring merge-driven done."
             )
             return
-        update_task_stage(task_id, "done")
-        notify_stage_change(task_id, current_stage, "done")
-        logger.info(f"Task {task_id}: PR merged, stage → done")
+        # ORCH-114 (adr-0045 / D4, TR-4): merge-driven done writes the stage in BYPASS
+        # of advance_stage -> expected-stage CAS so a concurrent authoritative writer
+        # is not clobbered; a lost race skips the (idempotent) notify.
+        if transition_lease.commit_stage_cas(task_id, current_stage, "done", repo_name):
+            notify_stage_change(task_id, current_stage, "done")
+            logger.info(f"Task {task_id}: PR merged, stage → done")
+        else:
+            logger.info(f"Task {task_id}: merge-driven done stage-CAS lost ({current_stage}->done); another writer moved it")
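All five hunks above route their stage write through `commit_stage_cas`, which delegates to `db.update_task_stage_cas`. That function's body is not part of this excerpt, so the following is only a sketch of what an expected-stage compare-and-swap could look like, assuming a hypothetical `tasks` table with `id` and `stage` columns and a hypothetical helper name `sketch_update_task_stage_cas`.

```python
import sqlite3


def sketch_update_task_stage_cas(conn, task_id, expected_stage, new_stage):
    """Write new_stage only if the row still holds expected_stage."""
    cur = conn.execute(
        # The extra "AND stage=?" predicate is the whole compare-and-swap.
        "UPDATE tasks SET stage=? WHERE id=? AND stage=?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    # rowcount == 0 means another writer moved the task first.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT NOT NULL)")
conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'review')")
conn.commit()

# Two writers both observed stage == 'review' before either one wrote.
approve_won = sketch_update_task_stage_cas(conn, 1, "review", "testing")
rollback_won = sketch_update_task_stage_cas(conn, 1, "review", "development")
final_stage = conn.execute("SELECT stage FROM tasks WHERE id=1").fetchone()[0]
```

The losing writer learns it lost from the return value alone, which is why each call site above can skip its notify and enqueue steps in the `else` branch instead of repairing state afterwards.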
@@ -14,7 +14,6 @@ from ..db import (
     get_task_by_plane_id,
     get_next_work_item_id,
     ensure_unique_work_item_id,
-    update_task_stage,
     enqueue_job,
     insert_event_dedup,
     create_task_atomic,
@@ -35,6 +34,7 @@ from ..projects import (
     get_project_by_repo,
     known_plane_project_ids,
 )
+from .. import transition_lease
 
 logger = logging.getLogger("orchestrator.webhooks.plane")
 
@@ -803,7 +803,17 @@ async def _rollback_stage(
     if not prev_stage:
         logger.info(f"Task {task_id}: rejected at {current_stage} but no previous stage")
         return
-    update_task_stage(task_id, prev_stage)
+    # ORCH-114 (adr-0045 / D4, TR-4): this Rejected-rollback writes the stage in
+    # BYPASS of advance_stage. Route it through the expected-stage CAS so it can never
+    # clobber an authoritative write made by a concurrent owner (e.g. a deploy->done
+    # finalizer) — a lost race aborts the rollback WITHOUT its side effects. Kill-switch
+    # off / repo out of scope -> unconditional update (byte-for-byte).
+    if not transition_lease.commit_stage_cas(task_id, current_stage, prev_stage, repo):
+        logger.info(
+            f"Task {task_id}: rollback stage-CAS lost ({current_stage}->{prev_stage}) "
+            f"— task already moved by another writer; skipping rollback"
+        )
+        return
     notify_stage_change(task_id, current_stage, prev_stage)
     # Feature 3: plane_notify_stage moves the board to the prev stage's status.
     plane_notify_stage(work_item_id, current_stage, prev_stage)
@@ -857,10 +867,25 @@ async def _try_advance_stage(
     advance_stage). It is True ONLY on the "Confirm Deploy" path
     (handle_confirm_deploy) and gates Phase B of the self-hosting prod deploy; the
     plain Approved path (handle_verdict) leaves it at the default False.
+
+    ORCH-114 (adr-0045 / FR-5, AC-8): if a live actor already owns this task's
+    side-effectful transition (transition-lease active), DEFER — do not re-enter the
+    transition in parallel. The late legitimate signal is not lost: once the owner
+    releases (or dies and the reaper reclaims), a re-approve / the reconciler re-drives
+    it, or advance_stage becomes an idempotent no-op against the authoritative facts
+    (SHA-in-main / INITIATED). never raises; no-op when the lease is disabled / repo
+    out of scope.
     """
     import asyncio
     from ..stage_engine import advance_stage
 
+    if transition_lease.is_held_by_live_owner(task_id):
+        logger.info(
+            f"Task {task_id}: transition-lease active — deferring webhook advance "
+            f"from {current_stage} (confirm_deploy={confirm_deploy})"
+        )
+        return
+
     await asyncio.to_thread(
         advance_stage,
         task_id,