fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance)

Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 16:06:27 +00:00
parent 9b7c855df3
commit 720c31393a
10 changed files with 354 additions and 99 deletions
--- a/src/config.py
+++ b/src/config.py
@@ -314,6 +314,14 @@ class Settings(BaseSettings):
    #   reaper_max_running_s  -> Tier-3 backstop ceiling: a job 'running' longer than
    #                           this is reaped even when liveness is unknowable. MUST be
    #                           > max agent_timeout + grace so a legit agent is safe.
+    #   reaper_finalize_grace_s -> Tier-2 anti-false-positive: a LIVE monitor writes
+    #                           agent_runs.exit_code FIRST, THEN does git commit/push +
+    #                           PR + Plane usage comments (seconds..minutes) and only
+    #                           then _finalize_job. The agent pid is already dead in
+    #                           that window, so pid cannot tell "monitor died" from
+    #                           "monitor still finalizing". A job is reaped via Tier-2
+    #                           only once exit_code has been recorded for at least this
+    #                           many seconds (MUST be > the max finalization window).
    #   lease_reclaim_enabled -> kill-switch for the proactive stale/dead lease reclaim
    #                           (false -> only the legacy lazy TTL reclaim in acquire).
    # (reuse) merge_lock_timeout_s -> lease TTL; merge_gate_repos -> reclaim scope.
@@ -321,6 +329,7 @@ class Settings(BaseSettings):
    reaper_interval_s: int = 60
    reaper_dead_ticks: int = 2
    reaper_max_running_s: int = 3600
+    reaper_finalize_grace_s: int = 300
    lease_reclaim_enabled: bool = True

    # Telegram notifications
--- a/src/db.py
+++ b/src/db.py
@@ -601,11 +601,15 @@ def requeue_running_jobs() -> int:
 def get_running_jobs() -> list[dict]:
    """ORCH-065: snapshot of every 'running' job for the job-reaper scan.

-    Each row carries the job columns plus three reaper inputs:
+    Each row carries the job columns plus four reaper inputs:
      * ``running_age_s`` — seconds since ``started_at`` (Tier-3 backstop);
      * ``exit_code``     — the linked ``agent_runs.exit_code`` (Tier-2: process
        finished but the job is still 'running' -> monitor died mid-finalize);
-      * ``finished_at_run`` — the linked ``agent_runs.finished_at`` (debug only).
+      * ``finished_at_run`` — the linked ``agent_runs.finished_at``;
+      * ``finished_age_s`` — seconds since ``agent_runs.finished_at`` (Tier-2
+        finalization grace: a LIVE monitor writes exit_code, THEN does git
+        push / PR / Plane comments before _finalize_job, so a freshly-finished
+        run is NOT yet a zombie — the reaper waits ``reaper_finalize_grace_s``).

    A LEFT JOIN on ``run_id`` keeps jobs with no agent_runs row (exit_code NULL).
    Read-only; never mutates. The reaper applies liveness/streak/backstop on top.
@@ -616,7 +620,9 @@ def get_running_jobs() -> list[dict]:
            "SELECT j.*, "
            "CAST(strftime('%s','now') - strftime('%s', j.started_at) AS INTEGER) "
            "  AS running_age_s, "
-            "r.exit_code AS exit_code, r.finished_at AS finished_at_run "
+            "r.exit_code AS exit_code, r.finished_at AS finished_at_run, "
+            "CAST(strftime('%s','now') - strftime('%s', r.finished_at) AS INTEGER) "
+            "  AS finished_age_s "
            "FROM jobs j LEFT JOIN agent_runs r ON r.id = j.run_id "
            "WHERE j.status='running'"
        ).fetchall()
--- a/src/job_reaper.py
+++ b/src/job_reaper.py
@@ -28,10 +28,17 @@ Liveness (defense in depth, ADR-001 Р-1):
    ``reaper_dead_ticks`` (>=2) CONSECUTIVE dead-pid ticks — an in-memory streak
    counter kills false positives (AC-3); a live agent within its timeout is
    never reaped.
-  * **Tier-2 (completion race): exit_code recorded but job still running.** The
-    monitor died between writing ``agent_runs.exit_code`` and ``_finalize_job``.
-    The outcome is KNOWN -> gate-driven advance on exit0, else the standard
-    transient/permanent contract.
+  * **Tier-2 (completion race): exit_code recorded but job still running.** This
+    window is AMBIGUOUS — it is both "the monitor died between writing
+    ``agent_runs.exit_code`` and ``_finalize_job``" AND "a LIVE monitor is still
+    finalizing" (``_monitor_agent`` writes ``exit_code`` FIRST, then git
+    commit/push (+PR), the БАГ-8 check and network Plane usage comments — seconds
+    to tens of seconds — and ONLY THEN ``_try_advance_stage`` -> ``_finalize_job``).
+    The agent pid is already dead in BOTH cases, so it cannot disambiguate. The
+    reaper therefore treats it as a dead monitor (KNOWN outcome) only after a
+    finalization grace: ``exit_code`` recorded for >= ``reaper_finalize_grace_s``
+    (a live finalizing monitor is NEVER reaped, FR-1.3/AC-3). Within the grace the
+    row is left untouched.
  * **Tier-3 (backstop): age ceiling.** A job ``running`` longer than
    ``reaper_max_running_s`` (deliberately > max ``agent_timeout`` + grace) is
    reaped even when liveness cannot be determined (pid reused / unknown).
@@ -41,13 +48,16 @@ Action on confirmed death reuses existing contracts (no new merge/stage logic):
    ``db.reap_running_job(... WHERE status='running')`` — so a late-arriving
    monitor / the startup ``requeue_running_jobs`` / a second tick can never
    double-process a row (AC-5; the loser sees ``rowcount==0``).
-  * **exit0 (Tier-2):** gate-driven idempotent advance — the source of truth is
-    the canonical quality gate, NOT "exit0". If the stage already advanced ->
-    just mark ``done`` (idempotent cleanup). Else run ``launcher._try_advance_stage``
-    (it runs the canonical QG: artifact/PR present -> green -> advance; absent ->
-    red -> no advance) and re-check: advanced -> ``done``; still red (e.g. the
-    monitor died before git-push, so no artifact) -> fall through to the failure
-    path. This makes a false ``done`` without real work impossible.
+  * **exit0 (Tier-2): claim-BEFORE-act (ADR-001 Р-1).** The source of truth is the
+    canonical quality gate, NOT "exit0". If the stage already advanced -> atomic
+    ``done`` claim only (idempotent cleanup). Else evaluate the canonical QG
+    READ-ONLY (no side effects, the reconciler pattern): red (e.g. the monitor died
+    before git-push, so no artifact) -> failure path (no false ``done``); green ->
+    atomically claim ``done`` FIRST, and only the claim winner then runs
+    ``launcher._try_advance_stage`` (advance + ``enqueue_job`` of the next stage).
+    A tick that loses the claim performs NO side effects, so a late-finalizing
+    monitor / the startup ``requeue_running_jobs`` can never be double-advanced or
+    double-enqueued.
  * **exit!=0 (Tier-2) / unknown outcome (Tier-1 dead pid, Tier-3 backstop):**
    ``attempts < max_attempts`` -> ``queued`` (mirrors ``requeue_running_jobs``);
    budget exhausted -> ``failed`` + Telegram. We never fabricate exit0.
@@ -173,27 +183,46 @@ class JobReaper:
        exit_code = job.get("exit_code")  # from the LEFT JOIN on agent_runs

        # Tier-2: the process finished (exit_code recorded) but the job is still
-        # 'running' -> the monitor died mid-finalize. Outcome is KNOWN.
+        # 'running'. This is AMBIGUOUS: it is BOTH "the monitor died mid-finalize"
+        # AND "a LIVE monitor is still finalizing" — _monitor_agent writes exit_code
+        # FIRST, then does git commit/push (+PR), the БАГ-8 check, network Plane
+        # usage comments (seconds..tens of seconds), and ONLY THEN _try_advance_stage
+        # -> _finalize_job. The agent pid is already dead in BOTH cases, so pid can
+        # NOT disambiguate. We treat it as a dead monitor (KNOWN outcome) only after
+        # a finalization grace: exit_code must have been recorded for at least
+        # `reaper_finalize_grace_s` (FR-1.3/AC-3 — a live finalizing monitor is never
+        # reaped). Within the grace window we leave the row alone (and fall through to
+        # the Tier-3 backstop only, which never trips before the grace given a sane
+        # config where reaper_max_running_s > reaper_finalize_grace_s).
        if exit_code is not None:
            self._streak.pop(job_id, None)
-            self._reap_known_outcome(job, int(exit_code))
-            return
-
-        # Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
-        if pid is not None and not merge_gate.pid_alive(pid):
-            n = self._streak.get(job_id, 0) + 1
-            self._streak[job_id] = n
-            if n >= max(int(settings.reaper_dead_ticks), 1):
-                self._streak.pop(job_id, None)
-                self._reap_unknown_outcome(job, reason=f"dead pid={pid}")
+            finished_age = job.get("finished_age_s")
+            grace = int(settings.reaper_finalize_grace_s)
+            if finished_age is not None and int(finished_age) >= grace:
+                self._reap_known_outcome(job, int(exit_code))
                return
            logger.info(
-                "reaper: job %s pid=%s dead (streak %d/%d) — deferring",
-                job_id, pid, n, settings.reaper_dead_ticks,
+                "reaper: job %s exit_code=%s recorded %ss ago (< grace %ss) — "
+                "deferring (monitor may still be finalizing)",
+                job_id, exit_code, finished_age, grace,
            )
+            # fall through to the Tier-3 backstop guard below.
        else:
-            # Alive / no pid -> reset the streak (must be CONSECUTIVE).
-            self._streak.pop(job_id, None)
+            # Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
+            if pid is not None and not merge_gate.pid_alive(pid):
+                n = self._streak.get(job_id, 0) + 1
+                self._streak[job_id] = n
+                if n >= max(int(settings.reaper_dead_ticks), 1):
+                    self._streak.pop(job_id, None)
+                    self._reap_unknown_outcome(job, reason=f"dead pid={pid}")
+                    return
+                logger.info(
+                    "reaper: job %s pid=%s dead (streak %d/%d) — deferring",
+                    job_id, pid, n, settings.reaper_dead_ticks,
+                )
+            else:
+                # Alive / no pid -> reset the streak (must be CONSECUTIVE).
+                self._streak.pop(job_id, None)

        # Tier-3: backstop ceiling (one-shot; reaps even when liveness is unknown).
        if age >= int(settings.reaper_max_running_s):
@@ -206,16 +235,83 @@ class JobReaper:
    def _reap_known_outcome(self, job: dict, exit_code: int) -> None:
        """Tier-2: the agent's exit_code is known; drive the job's terminal status."""
        if exit_code == 0:
-            if self._gate_driven_advance(job):
-                if reap_running_job(job["id"], "done", run_id=job.get("run_id")):
-                    self._note_reap(job, "done", reason="exit0, gate green")
-                return
-            # exit0 but the gate is red (e.g. monitor died before git-push -> no
-            # artifact). Do NOT fabricate 'done'; treat as a failed outcome.
-            self._reap_unknown_outcome(job, reason="exit0 but gate red")
+            self._reap_exit0(job)
        else:
            self._reap_unknown_outcome(job, reason=f"exit={exit_code}")

+    def _reap_exit0(self, job: dict) -> None:
+        """Reap an exit0 Tier-2 job with claim-BEFORE-act (ADR-001 Р-1).
+
+        The atomic ``reap_running_job`` claim (guard ``WHERE status='running'``) MUST
+        precede any ``advance_stage`` / ``enqueue_job`` side effect, so a reaper tick
+        that LOSES the row (to a late-finalizing monitor or the startup
+        ``requeue_running_jobs``) performs NO side effects — no duplicate advance, no
+        duplicate ``enqueue_job`` of the next stage (FR-1.2/AC-4).
+
+        Because the claim flips the row OUT of 'running', we cannot run the advance
+        first to learn the gate colour. Instead we evaluate the canonical quality gate
+        READ-ONLY (no side effects — the pattern the reconciler uses) to choose the
+        terminal status BEFORE claiming:
+          * already advanced past this agent -> idempotent clean ``done`` (no advance);
+          * gate green  -> claim ``done`` first, THEN advance exactly once;
+          * gate red (e.g. monitor died before git-push -> no artifact) -> NOT a real
+            success: route to the retry/fail contract (never a false ``done``).
+        """
+        job_id = job["id"]
+        run_id = job.get("run_id")
+        agent = job.get("agent")
+        branch, stage, work_item_id = self._task_meta(job)
+        candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
+
+        if stage is None or stage not in candidates:
+            # Stage already advanced past this agent (or unknown) -> a clean 'done'
+            # is correct WITHOUT re-advancing. Atomic claim only (idempotent cleanup).
+            if reap_running_job(job_id, "done", run_id=run_id):
+                self._note_reap(job, "done", reason="exit0, already advanced")
+            return
+
+        if not branch or not self._gate_is_green(stage, job, branch, work_item_id):
+            # exit0 but the gate is red -> do NOT fabricate 'done'; treat as failure
+            # (retry within budget, else failed + Telegram).
+            self._reap_unknown_outcome(job, reason="exit0 but gate red")
+            return
+
+        # Gate green. CLAIM-BEFORE-ACT: own the row atomically FIRST.
+        if not reap_running_job(job_id, "done", run_id=run_id):
+            # Lost the race -> the winner (late monitor / startup requeue) owns the
+            # advance; we do NOTHING (no duplicate side effects).
+            return
+        # We exclusively own the row now -> drive the gate-based advance exactly once.
+        self._gate_driven_advance(job)
+        self._note_reap(job, "done", reason="exit0, gate green")
+
+    def _gate_is_green(
+        self, stage: str, job: dict, branch: str, work_item_id: str | None
+    ) -> bool:
+        """Read-only canonical-QG evaluation for a reaped exit0 job (no side effects).
+
+        Mirrors the reconciler's cheap pre-evaluation: dispatch the stage's QG via
+        the SAME ``_run_qg`` the webhook path uses, returning its pass/fail WITHOUT
+        running ``advance_stage`` (so no stage move / enqueue / notification happens
+        here). A stage with no registered gate is treated as green (nothing blocks a
+        clean 'done'). Never raises -> any error returns False (conservative: route
+        to retry, never a false 'done').
+        """
+        try:
+            from .stages import get_qg_for_stage
+            from .stage_engine import _run_qg
+            qg_name = get_qg_for_stage(stage)
+            if not qg_name:
+                return True
+            passed, _reason = _run_qg(qg_name, job.get("repo"), work_item_id, branch)
+            return bool(passed)
+        except Exception as e:  # noqa: BLE001 - never break the reap
+            logger.warning(
+                "reaper: gate pre-eval failed for job %s (stage=%s): %s",
+                job.get("id"), stage, e,
+            )
+            return False
+
    def _reap_unknown_outcome(self, job: dict, reason: str) -> None:
        """Tier-1/Tier-3 (or exit!=0): outcome not a clean success.

@@ -252,7 +348,7 @@ class JobReaper:
        agent = job.get("agent")
        repo = job.get("repo")
        run_id = job.get("run_id")
-        branch, stage = self._task_branch_stage(job)
+        branch, stage, _wid = self._task_meta(job)
        # Candidate stages whose finishing agent is THIS agent (deployer maps to
        # both 'testing' and 'deploy-staging', hence a set).
        candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
@@ -270,28 +366,29 @@ class JobReaper:
                         job.get("id"), e)
            return False
        # Re-read the stage: advanced out of the candidate set -> gate was green.
-        _branch, new_stage = self._task_branch_stage(job)
+        _branch, new_stage, _wid2 = self._task_meta(job)
        return new_stage is None or new_stage not in candidates

    @staticmethod
-    def _task_branch_stage(job: dict) -> tuple[str | None, str | None]:
-        """Resolve (branch, stage) for the job's task. Never raises."""
+    def _task_meta(job: dict) -> tuple[str | None, str | None, str | None]:
+        """Resolve (branch, stage, work_item_id) for the job's task. Never raises."""
        task_id = job.get("task_id")
        if not task_id:
-            return None, None
+            return None, None, None
        try:
            conn = get_db()
            row = conn.execute(
-                "SELECT branch, stage FROM tasks WHERE id = ?", (task_id,)
+                "SELECT branch, stage, work_item_id FROM tasks WHERE id = ?",
+                (task_id,),
            ).fetchone()
            conn.close()
            if not row:
-                return None, None
-            return row["branch"], row["stage"]
+                return None, None, None
+            return row["branch"], row["stage"], row["work_item_id"]
        except Exception as e:  # noqa: BLE001 - never-raise contract
            logger.warning("reaper: task lookup failed for job %s: %s",
                           job.get("id"), e)
-            return None, None
+            return None, None, None

    def _notify_failed(self, job: dict, reason: str) -> None:
        try: