fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance)

Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 16:06:27 +00:00
parent 9b7c855df3
commit 720c31393a
10 changed files with 354 additions and 99 deletions
--- a/.env.example
+++ b/.env.example
@@ -123,13 +123,19 @@ ORCH_RECONCILE_SKIP_BLOCKED_ENABLED=true
 # (one zombie at max_concurrency=1 blocks the whole shared queue) and periodically
 # reclaims dead/stale merge-leases. Liveness is three-tier: Tier-1 dead jobs.pid
 # (os.kill(pid,0)) after REAPER_DEAD_TICKS consecutive dead ticks (anti-false-positive
-# for a live agent); Tier-2 agent_runs.exit_code recorded but job still 'running';
-# Tier-3 backstop after REAPER_MAX_RUNNING_S. The terminal flip carries an atomic
-# status='running' guard so it never double-processes a row racing requeue_running_jobs.
+# for a live agent); Tier-2 agent_runs.exit_code recorded but job still 'running'
+# (only after a REAPER_FINALIZE_GRACE_S finalization grace, so a live monitor still
+# doing git push / PR / Plane comments is never reaped); Tier-3 backstop after
+# REAPER_MAX_RUNNING_S. The terminal flip carries an atomic status='running' guard and
+# precedes any advance/enqueue (claim-before-act) so it never double-processes/-advances
+# a row racing a late monitor or requeue_running_jobs.
 #   REAPER_ENABLED          -> global kill-switch (false -> strictly prior behaviour).
 #   REAPER_INTERVAL_S       -> background scan period (seconds).
 #   REAPER_DEAD_TICKS       -> consecutive dead-pid ticks before reaping (Tier-1, >=2).
 #   REAPER_MAX_RUNNING_S    -> Tier-3 backstop ceiling; must exceed max agent_timeout+grace.
+#   REAPER_FINALIZE_GRACE_S -> Tier-2 grace: how long agent_runs.exit_code must have been
+#                              recorded before a still-'running' job is reaped; MUST exceed
+#                              the max finalization window (git push + PR + Plane comments).
 #   LEASE_RECLAIM_ENABLED   -> kill-switch for the proactive stale/dead lease reclaim
 #                              (false -> only the legacy lazy TTL reclaim in acquire_merge_lease).
 # (reuse) ORCH_MERGE_LOCK_TIMEOUT_S -> lease TTL; ORCH_MERGE_GATE_REPOS -> reclaim scope.
@@ -137,6 +143,7 @@ ORCH_REAPER_ENABLED=true
 ORCH_REAPER_INTERVAL_S=60
 ORCH_REAPER_DEAD_TICKS=2
 ORCH_REAPER_MAX_RUNNING_S=3600
+ORCH_REAPER_FINALIZE_GRACE_S=300
 ORCH_LEASE_RECLAIM_ENABLED=true

 # ORCH-021: post-deploy production monitoring + degradation reaction. After the
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -207,14 +207,19 @@ ORCH-065 вводит фоновый watchdog, чтобы смерть проц
  работает **без рестарта**. Трёхуровневая liveness: Tier-1 мёртвый `jobs.pid`
  (новая колонка) после `reaper_dead_ticks` подряд тиков (анти-ложноположительность
  — живой долгий агент не реапится); Tier-2 `agent_runs.exit_code` записан, а job
-  ещё `running` (monitor умер между записью exit_code и финализацией); Tier-3
-  backstop по потолку `reaper_max_running_s` (> max agent_timeout+grace). Действие
-  переиспользует контракты: exit0 → **gate-driven idempotent advance**
-  (`_try_advance_stage`+`_finalize_job`, источник истины — канонический QG, не
-  факт «exit0»; нет дубль-перехода); exit≠0/неизвестно → `attempts<max`→`queued`,
-  иначе `failed`+Telegram. Атомарный reap-claim (`UPDATE ... WHERE id=? AND
-  status='running'`) совместим со стартовым `requeue_running_jobs` (restart-safe,
-  без двойной обработки).
+  ещё `running` — но это окно неоднозначно (живой monitor пишет exit_code ПЕРВЫМ,
+  затем git push/PR/Plane-комментарии), поэтому Tier-2 реапит только после
+  finalization-grace `reaper_finalize_grace_s` (живой финализирующий monitor НЕ
+  реапится); Tier-3 backstop по потолку `reaper_max_running_s` (> max
+  agent_timeout+grace). Действие переиспользует контракты по принципу
+  **claim-before-act**: для exit0 канонический QG оценивается read-only ПЕРЕД
+  атомарным claim, затем claim `done` ПЕРВЫМ и только победитель claim делает
+  `_try_advance_stage` (advance+enqueue) — проигравший claim (поздний monitor /
+  стартовый requeue) не выполняет побочных эффектов (нет дубль-advance/-enqueue);
+  источник истины — канонический QG, не факт «exit0»; гейт красный или exit≠0/
+  неизвестно → `attempts<max`→`queued`, иначе `failed`+Telegram. Атомарный
+  reap-claim (`UPDATE ... WHERE id=? AND status='running'`) совместим со стартовым
+  `requeue_running_jobs` (restart-safe, без двойной обработки).
 - **Проактивный реклейм stale/dead lease** (функции в `merge_gate.py`:
  `pid_alive`, `reclaim_stale_lease`) — на старте (рядом с `requeue_running_jobs`)
  и периодически из тика reaper: освобождает lease, чей держатель **мёртв** (pid
@@ -238,7 +243,8 @@ ORCH-065 вводит фоновый watchdog, чтобы смерть проц
  `logger.warning`; reap→`failed` и lease-reclaim → Telegram.
 - **Kill-switch'и:** `ORCH_REAPER_ENABLED`, `ORCH_REAPER_INTERVAL_S`,
  `ORCH_REAPER_DEAD_TICKS`, `ORCH_REAPER_MAX_RUNNING_S`,
-  `ORCH_LEASE_RECLAIM_ENABLED`; `false` → строго прежнее поведение.
+  `ORCH_REAPER_FINALIZE_GRACE_S`, `ORCH_LEASE_RECLAIM_ENABLED`; `false` → строго
+  прежнее поведение.

 Подробнее: [adr-0011](adr/adr-0011-job-reaper-lease-reclaim.md), детально —
 `docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md`.
--- a/docs/architecture/adr/adr-0011-job-reaper-lease-reclaim.md
+++ b/docs/architecture/adr/adr-0011-job-reaper-lease-reclaim.md
@@ -28,13 +28,18 @@ self-restart во время deploy) оставляет строку `jobs` на
   never-raise, `_stop`-Event, старт/стоп в `lifespan`, снимок в `/queue`,
   kill-switch). Работает **без рестарта** процесса. Liveness — трёхуровневая:
   Tier-1 мёртвый `jobs.pid` (новая колонка) после `reaper_dead_ticks` подряд
-   тиков; Tier-2 `agent_runs.exit_code` записан, а job ещё `running`; Tier-3
-   backstop по потолку `reaper_max_running_s`. Действие переиспользует контракты:
-   exit0 → gate-driven idempotent advance (`_try_advance_stage`+`_finalize_job`,
-   источник истины — QG, не «exit0»); exit≠0 / неизвестно → `attempts<max` →
-   `queued`, иначе `failed`+Telegram. Атомарный reap-claim (`UPDATE ... WHERE
-   id=? AND status='running'` + `rowcount`, как `claim_next_job`) исключает
-   двойную обработку (совместимость со стартовым `requeue_running_jobs`).
+   тиков; Tier-2 `agent_runs.exit_code` записан, а job ещё `running` — но только
+   после finalization-grace `reaper_finalize_grace_s` (окно неоднозначно: живой
+   monitor пишет exit_code ПЕРВЫМ, затем git push/PR/Plane-комментарии, поэтому
+   живой финализирующий monitor НЕ реапится); Tier-3 backstop по потолку
+   `reaper_max_running_s`. Действие — **claim-before-act**: для exit0 канонический
+   QG оценивается read-only ПЕРЕД атомарным claim, затем claim `done` ПЕРВЫМ и
+   только победитель claim выполняет `_try_advance_stage` (advance+enqueue) —
+   проигравший не делает побочных эффектов (источник истины — QG, не «exit0»);
+   гейт красный или exit≠0 / неизвестно → `attempts<max` → `queued`, иначе
+   `failed`+Telegram. Атомарный reap-claim (`UPDATE ... WHERE id=? AND
+   status='running'` + `rowcount`, как `claim_next_job`) исключает двойную
+   обработку (совместимость со стартовым `requeue_running_jobs`).
 2. **Проактивный реклейм stale/dead lease** — функции в `merge_gate.py`
   (`pid_alive`, `reclaim_stale_lease`), вызываемые на старте (рядом с
   `requeue_running_jobs`) и периодически из тика reaper. Освобождение, если
@@ -49,9 +54,9 @@ self-restart во время deploy) оставляет строку `jobs` на
   (паттерн live-safe миграции). Больше ничего не меняется.

 Kill-switch'и (`ORCH_*`): `reaper_enabled`, `reaper_interval_s`,
-`reaper_dead_ticks`, `reaper_max_running_s`, `lease_reclaim_enabled`;
-переиспользуются `merge_lock_timeout_s`, `merge_gate_repos`. `false` → строго
-прежнее поведение.
+`reaper_dead_ticks`, `reaper_max_running_s`, `reaper_finalize_grace_s`,
+`lease_reclaim_enabled`; переиспользуются `merge_lock_timeout_s`,
+`merge_gate_repos`. `false` → строго прежнее поведение.

 ## Альтернативы
 - Reaper внутри reconciler — отвергнуто (смешение stage- и jobs-уровней, общий
--- a/docs/architecture/internals.md
+++ b/docs/architecture/internals.md
@@ -354,13 +354,20 @@ daemon-поток `src/job_reaper.py` (каркас `reconciler`) периоди
 - **Tier-1** — `jobs.pid` мёртв (`os.kill(pid,0)`→`ProcessLookupError`) на
  протяжении `reaper_dead_ticks` подряд тиков (анти-ложноположительность);
 - **Tier-2** — у `agent_runs[run_id]` записан `exit_code`, а `jobs.status` ещё
-  `running` (monitor умер между записью exit_code и `_finalize_job`);
+  `running`. Окно неоднозначно: живой monitor пишет `exit_code` ПЕРВЫМ, затем
+  git push/PR/Plane-комментарии (секунды-десятки секунд) и лишь потом
+  `_finalize_job`; pid агента к этому моменту мёртв в обоих случаях. Поэтому
+  Tier-2 реапит только после finalization-grace `reaper_finalize_grace_s`
+  (`finished_age_s >= grace`) — живой финализирующий monitor НЕ реапится;
 - **Tier-3** — backstop: job висит `running` дольше `reaper_max_running_s`.

 Реап атомарен (`UPDATE jobs SET ... WHERE id=? AND status='running'` + `rowcount`,
 как `claim_next_job`) → совместим со стартовым `requeue_running_jobs` без двойной
-обработки. Действие переиспользует контракты: exit0 → gate-driven
-`_try_advance_stage`+`_finalize_job` (источник истины — QG); exit≠0/неизвестно →
+обработки. Действие — **claim-before-act**: для exit0 канонический QG оценивается
+read-only ПЕРЕД атомарным claim, затем claim `done` ПЕРВЫМ и только победитель
+claim делает `_try_advance_stage` (advance+enqueue) — проигравший (поздний monitor
+/ стартовый requeue) не выполняет побочных эффектов (нет дубль-advance/-enqueue);
+источник истины — QG, не «exit0»; гейт красный или exit≠0/неизвестно →
 `attempts<max`→`queued`, иначе `failed`+Telegram. Тот же поток на старте и
 периодически делает проактивный реклейм stale/dead merge-lease (`merge_gate.py`:
 `pid_alive`/`reclaim_stale_lease`). never-raise; kill-switch `ORCH_REAPER_ENABLED`
--- a/docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md
+++ b/docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md
@@ -65,9 +65,18 @@ liveness процесса). Это разные never-raise-домены и ра
   `pid_alive(pid)` = `os.kill(pid, 0)` с обработкой `ProcessLookupError` (мёртв)
   / `PermissionError` (жив, чужой) — единственный сигнал, ловящий «monitor умер
   ДО записи `finished_at`».
-2. **Tier-2 (completion race): exit_code записан, job ещё `running`.** Если у
-   `agent_runs[run_id]` есть `finished_at`/`exit_code`, а `jobs.status='running'`
-   — monitor умер между записью exit_code и `_finalize_job`. Здесь исход **известен**.
+2. **Tier-2 (completion race): exit_code записан, job ещё `running`.** Окно
+   **неоднозначно**: это И «monitor умер между записью exit_code и
+   `_finalize_job`», И «живой monitor ещё финализирует» — `_monitor_agent`
+   пишет `exit_code` ПЕРВЫМ, затем git commit/push (+PR), БАГ-8-проверку и
+   сетевые usage-комментарии в Plane (секунды-десятки секунд), и лишь потом
+   `_try_advance_stage` → `_finalize_job`. pid агента к этому моменту уже мёртв в
+   ОБОИХ случаях, поэтому по pid их не различить. **Анти-ложноположительность
+   Tier-2 (FR-1.3, AC-3): finalization-grace.** Job реапится по Tier-2 только
+   когда `exit_code` записан не меньше `reaper_finalize_grace_s` назад (потолок
+   заведомо > максимального окна финализации). В пределах grace строка не
+   трогается (живой финализирующий monitor НИКОГДА не реапится; нет дубль-advance
+   / дубль-enqueue). После grace monitor заведомо мёртв → исход **известен**.
 3. **Tier-3 (backstop по потолку):** job висит `running` дольше
   `reaper_max_running_s` (заведомо > max `agent_timeout`+grace). Реап даже когда
   liveness определить нельзя (pid переиспользован/неизвестен).
@@ -87,18 +96,23 @@ liveness процесса). Это разные never-raise-домены и ра
  flip несёт guard `WHERE id=? AND status='running'` и проверяет `rowcount`. При
  гонке (поздно доехавший monitor, стартовый `requeue_running_jobs`) проигравший
  видит `rowcount==0` и НЕ обрабатывает строку повторно (AC-5).
- **Исход известен (Tier-2, exit_code в `agent_runs`):** маршрутизируем через
-  существующий `launcher._finalize_job(job_id, agent, run_id, exit_code,
-  output_path)`:
-  - `exit==0`: **gate-driven idempotent advance.** Сначала проверяем, не
-    продвинулась ли уже стадия (текущая `tasks.stage` ≠ исходная стадия агента
-    или активного job нет и гейт уже пройден) → если да, просто `mark_job(done)`
-    (идемпотентная уборка, без дубль-перехода). Если нет — `_try_advance_stage`
-    (он сам гоняет канонический QG: артефакт/PR есть → зелёный гейт → advance;
-    нет → красный гейт → НЕ advance), затем `_finalize_job`. **Источник истины —
-    гейт, не «exit0»** — это исключает ложный `done` без реально выполненной
-    работы (если monitor умер ДО git-push, артефакта нет → гейт красный →
-    переходим к ветке «исход неуспешен» ниже).
+- **Исход известен (Tier-2, exit_code в `agent_runs`, grace прошёл):**
+  - `exit==0`: **claim-BEFORE-act, gate-driven idempotent advance.** Порядок
+    критичен (см. «Атомарный reap-claim» выше): атомарный claim ОБЯЗАН
+    предшествовать любому `advance_stage`/`enqueue_job`. Поскольку claim
+    переводит строку ИЗ `running`, прогнать advance ДО claim, чтобы узнать цвет
+    гейта, нельзя — поэтому канонический QG оценивается **read-only, без
+    побочных эффектов** (тот же `_run_qg`, что у reconciler) ПЕРЕД claim:
+    - стадия уже продвинута мимо этого агента → атомарный `done` без advance
+      (идемпотентная уборка);
+    - гейт зелёный → атомарный claim `done` ПЕРВЫМ, и только победитель claim
+      выполняет `_try_advance_stage` (advance + `enqueue_job` следующей стадии)
+      РОВНО один раз; проигравший claim (поздний monitor / стартовый
+      `requeue_running_jobs`) НЕ делает побочных эффектов (нет дубль-advance /
+      дубль-enqueue);
+    - гейт красный (monitor умер ДО git-push, артефакта нет) → НЕ выдумываем
+      `done`, уходим в ветку «исход неуспешен» ниже.
+    **Источник истины — гейт, не «exit0».**
  - `exit!=0`: ровно существующий контракт `_finalize_job` (классификация
    transient/permanent, `attempts<max` → `queued`, иначе `failed`+Telegram).
 - **Исход неизвестен (Tier-1 мёртвый pid без exit_code, или Tier-3 backstop):**
@@ -177,6 +191,7 @@ logic».
 | `reaper_interval_s` | период сканирования | `60` |
 | `reaper_dead_ticks` | подряд тиков мёртвого pid перед реапом (Tier-1) | `2` |
 | `reaper_max_running_s` | потолок `running` (Tier-3 backstop), > max agent_timeout+grace | `3600` |
+| `reaper_finalize_grace_s` | Tier-2 grace: сколько `exit_code` должен быть записан до реапа (> max окна финализации) | `300` |
 | `lease_reclaim_enabled` | kill-switch проактивного реклейма lease | `True` |
 | (reuse) `merge_lock_timeout_s` | TTL lease | `300` |
 | (reuse) `merge_gate_repos` | область применения lease-reclaim | как есть |
--- a/src/config.py
+++ b/src/config.py
@@ -314,6 +314,14 @@ class Settings(BaseSettings):
    #   reaper_max_running_s  -> Tier-3 backstop ceiling: a job 'running' longer than
    #                           this is reaped even when liveness is unknowable. MUST be
    #                           > max agent_timeout + grace so a legit agent is safe.
+    #   reaper_finalize_grace_s -> Tier-2 anti-false-positive: a LIVE monitor writes
+    #                           agent_runs.exit_code FIRST, THEN does git commit/push +
+    #                           PR + Plane usage comments (seconds..minutes) and only
+    #                           then _finalize_job. The agent pid is already dead in
+    #                           that window, so pid cannot tell "monitor died" from
+    #                           "monitor still finalizing". A job is reaped via Tier-2
+    #                           only once exit_code has been recorded for at least this
+    #                           many seconds (MUST be > the max finalization window).
    #   lease_reclaim_enabled -> kill-switch for the proactive stale/dead lease reclaim
    #                           (false -> only the legacy lazy TTL reclaim in acquire).
    # (reuse) merge_lock_timeout_s -> lease TTL; merge_gate_repos -> reclaim scope.
@@ -321,6 +329,7 @@ class Settings(BaseSettings):
    reaper_interval_s: int = 60
    reaper_dead_ticks: int = 2
    reaper_max_running_s: int = 3600
+    reaper_finalize_grace_s: int = 300
    lease_reclaim_enabled: bool = True

    # Telegram notifications
--- a/src/db.py
+++ b/src/db.py
@@ -601,11 +601,15 @@ def requeue_running_jobs() -> int:
 def get_running_jobs() -> list[dict]:
    """ORCH-065: snapshot of every 'running' job for the job-reaper scan.

-    Each row carries the job columns plus three reaper inputs:
+    Each row carries the job columns plus four reaper inputs:
      * ``running_age_s`` — seconds since ``started_at`` (Tier-3 backstop);
      * ``exit_code``     — the linked ``agent_runs.exit_code`` (Tier-2: process
        finished but the job is still 'running' -> monitor died mid-finalize);
-      * ``finished_at_run`` — the linked ``agent_runs.finished_at`` (debug only).
+      * ``finished_at_run`` — the linked ``agent_runs.finished_at``;
+      * ``finished_age_s`` — seconds since ``agent_runs.finished_at`` (Tier-2
+        finalization grace: a LIVE monitor writes exit_code, THEN does git
+        push / PR / Plane comments before _finalize_job, so a freshly-finished
+        run is NOT yet a zombie — the reaper waits ``reaper_finalize_grace_s``).

    A LEFT JOIN on ``run_id`` keeps jobs with no agent_runs row (exit_code NULL).
    Read-only; never mutates. The reaper applies liveness/streak/backstop on top.
@@ -616,7 +620,9 @@ def get_running_jobs() -> list[dict]:
            "SELECT j.*, "
            "CAST(strftime('%s','now') - strftime('%s', j.started_at) AS INTEGER) "
            "  AS running_age_s, "
-            "r.exit_code AS exit_code, r.finished_at AS finished_at_run "
+            "r.exit_code AS exit_code, r.finished_at AS finished_at_run, "
+            "CAST(strftime('%s','now') - strftime('%s', r.finished_at) AS INTEGER) "
+            "  AS finished_age_s "
            "FROM jobs j LEFT JOIN agent_runs r ON r.id = j.run_id "
            "WHERE j.status='running'"
        ).fetchall()
--- a/src/job_reaper.py
+++ b/src/job_reaper.py
@@ -28,10 +28,17 @@ Liveness (defense in depth, ADR-001 Р-1):
    ``reaper_dead_ticks`` (>=2) CONSECUTIVE dead-pid ticks — an in-memory streak
    counter kills false positives (AC-3); a live agent within its timeout is
    never reaped.
-  * **Tier-2 (completion race): exit_code recorded but job still running.** The
-    monitor died between writing ``agent_runs.exit_code`` and ``_finalize_job``.
-    The outcome is KNOWN -> gate-driven advance on exit0, else the standard
-    transient/permanent contract.
+  * **Tier-2 (completion race): exit_code recorded but job still running.** This
+    window is AMBIGUOUS — it is both "the monitor died between writing
+    ``agent_runs.exit_code`` and ``_finalize_job``" AND "a LIVE monitor is still
+    finalizing" (``_monitor_agent`` writes ``exit_code`` FIRST, then git
+    commit/push (+PR), the БАГ-8 check and network Plane usage comments — seconds
+    to tens of seconds — and ONLY THEN ``_try_advance_stage`` -> ``_finalize_job``).
+    The agent pid is already dead in BOTH cases, so it cannot disambiguate. The
+    reaper therefore treats it as a dead monitor (KNOWN outcome) only after a
+    finalization grace: ``exit_code`` recorded for >= ``reaper_finalize_grace_s``
+    (a live finalizing monitor is NEVER reaped, FR-1.3/AC-3). Within the grace the
+    row is left untouched.
  * **Tier-3 (backstop): age ceiling.** A job ``running`` longer than
    ``reaper_max_running_s`` (deliberately > max ``agent_timeout`` + grace) is
    reaped even when liveness cannot be determined (pid reused / unknown).
@@ -41,13 +48,16 @@ Action on confirmed death reuses existing contracts (no new merge/stage logic):
    ``db.reap_running_job(... WHERE status='running')`` — so a late-arriving
    monitor / the startup ``requeue_running_jobs`` / a second tick can never
    double-process a row (AC-5; the loser sees ``rowcount==0``).
-  * **exit0 (Tier-2):** gate-driven idempotent advance — the source of truth is
-    the canonical quality gate, NOT "exit0". If the stage already advanced ->
-    just mark ``done`` (idempotent cleanup). Else run ``launcher._try_advance_stage``
-    (it runs the canonical QG: artifact/PR present -> green -> advance; absent ->
-    red -> no advance) and re-check: advanced -> ``done``; still red (e.g. the
-    monitor died before git-push, so no artifact) -> fall through to the failure
-    path. This makes a false ``done`` without real work impossible.
+  * **exit0 (Tier-2): claim-BEFORE-act (ADR-001 Р-1).** The source of truth is the
+    canonical quality gate, NOT "exit0". If the stage already advanced -> atomic
+    ``done`` claim only (idempotent cleanup). Else evaluate the canonical QG
+    READ-ONLY (no side effects, the reconciler pattern): red (e.g. the monitor died
+    before git-push, so no artifact) -> failure path (no false ``done``); green ->
+    atomically claim ``done`` FIRST, and only the claim winner then runs
+    ``launcher._try_advance_stage`` (advance + ``enqueue_job`` of the next stage).
+    A tick that loses the claim performs NO side effects, so a late-finalizing
+    monitor / the startup ``requeue_running_jobs`` can never be double-advanced or
+    double-enqueued.
  * **exit!=0 (Tier-2) / unknown outcome (Tier-1 dead pid, Tier-3 backstop):**
    ``attempts < max_attempts`` -> ``queued`` (mirrors ``requeue_running_jobs``);
    budget exhausted -> ``failed`` + Telegram. We never fabricate exit0.
@@ -173,12 +183,31 @@ class JobReaper:
        exit_code = job.get("exit_code")  # from the LEFT JOIN on agent_runs

        # Tier-2: the process finished (exit_code recorded) but the job is still
-        # 'running' -> the monitor died mid-finalize. Outcome is KNOWN.
+        # 'running'. This is AMBIGUOUS: it is BOTH "the monitor died mid-finalize"
+        # AND "a LIVE monitor is still finalizing" — _monitor_agent writes exit_code
+        # FIRST, then does git commit/push (+PR), the БАГ-8 check, network Plane
+        # usage comments (seconds..tens of seconds), and ONLY THEN _try_advance_stage
+        # -> _finalize_job. The agent pid is already dead in BOTH cases, so pid can
+        # NOT disambiguate. We treat it as a dead monitor (KNOWN outcome) only after
+        # a finalization grace: exit_code must have been recorded for at least
+        # `reaper_finalize_grace_s` (FR-1.3/AC-3 — a live finalizing monitor is never
+        # reaped). Within the grace window we leave the row alone (and fall through to
+        # the Tier-3 backstop only, which never trips before the grace given a sane
+        # config where reaper_max_running_s > reaper_finalize_grace_s).
        if exit_code is not None:
            self._streak.pop(job_id, None)
+            finished_age = job.get("finished_age_s")
+            grace = int(settings.reaper_finalize_grace_s)
+            if finished_age is not None and int(finished_age) >= grace:
                self._reap_known_outcome(job, int(exit_code))
                return
-
+            logger.info(
+                "reaper: job %s exit_code=%s recorded %ss ago (< grace %ss) — "
+                "deferring (monitor may still be finalizing)",
+                job_id, exit_code, finished_age, grace,
+            )
+            # fall through to the Tier-3 backstop guard below.
+        else:
            # Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
            if pid is not None and not merge_gate.pid_alive(pid):
                n = self._streak.get(job_id, 0) + 1
@@ -206,16 +235,83 @@ class JobReaper:
    def _reap_known_outcome(self, job: dict, exit_code: int) -> None:
        """Tier-2: the agent's exit_code is known; drive the job's terminal status."""
        if exit_code == 0:
-            if self._gate_driven_advance(job):
-                if reap_running_job(job["id"], "done", run_id=job.get("run_id")):
-                    self._note_reap(job, "done", reason="exit0, gate green")
-                return
-            # exit0 but the gate is red (e.g. monitor died before git-push -> no
-            # artifact). Do NOT fabricate 'done'; treat as a failed outcome.
-            self._reap_unknown_outcome(job, reason="exit0 but gate red")
+            self._reap_exit0(job)
        else:
            self._reap_unknown_outcome(job, reason=f"exit={exit_code}")

+    def _reap_exit0(self, job: dict) -> None:
+        """Reap an exit0 Tier-2 job with claim-BEFORE-act (ADR-001 Р-1).
+
+        The atomic ``reap_running_job`` claim (guard ``WHERE status='running'``) MUST
+        precede any ``advance_stage`` / ``enqueue_job`` side effect, so a reaper tick
+        that LOSES the row (to a late-finalizing monitor or the startup
+        ``requeue_running_jobs``) performs NO side effects — no duplicate advance, no
+        duplicate ``enqueue_job`` of the next stage (FR-1.2/AC-4).
+
+        Because the claim flips the row OUT of 'running', we cannot run the advance
+        first to learn the gate colour. Instead we evaluate the canonical quality gate
+        READ-ONLY (no side effects — the pattern the reconciler uses) to choose the
+        terminal status BEFORE claiming:
+          * already advanced past this agent -> idempotent clean ``done`` (no advance);
+          * gate green  -> claim ``done`` first, THEN advance exactly once;
+          * gate red (e.g. monitor died before git-push -> no artifact) -> NOT a real
+            success: route to the retry/fail contract (never a false ``done``).
+        """
+        job_id = job["id"]
+        run_id = job.get("run_id")
+        agent = job.get("agent")
+        branch, stage, work_item_id = self._task_meta(job)
+        candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
+
+        if stage is None or stage not in candidates:
+            # Stage already advanced past this agent (or unknown) -> a clean 'done'
+            # is correct WITHOUT re-advancing. Atomic claim only (idempotent cleanup).
+            if reap_running_job(job_id, "done", run_id=run_id):
+                self._note_reap(job, "done", reason="exit0, already advanced")
+            return
+
+        if not branch or not self._gate_is_green(stage, job, branch, work_item_id):
+            # exit0 but the gate is red -> do NOT fabricate 'done'; treat as failure
+            # (retry within budget, else failed + Telegram).
+            self._reap_unknown_outcome(job, reason="exit0 but gate red")
+            return
+
+        # Gate green. CLAIM-BEFORE-ACT: own the row atomically FIRST.
+        if not reap_running_job(job_id, "done", run_id=run_id):
+            # Lost the race -> the winner (late monitor / startup requeue) owns the
+            # advance; we do NOTHING (no duplicate side effects).
+            return
+        # We exclusively own the row now -> drive the gate-based advance exactly once.
+        self._gate_driven_advance(job)
+        self._note_reap(job, "done", reason="exit0, gate green")
+
+    def _gate_is_green(
+        self, stage: str, job: dict, branch: str, work_item_id: str | None
+    ) -> bool:
+        """Read-only canonical-QG evaluation for a reaped exit0 job (no side effects).
+
+        Mirrors the reconciler's cheap pre-evaluation: dispatch the stage's QG via
+        the SAME ``_run_qg`` the webhook path uses, returning its pass/fail WITHOUT
+        running ``advance_stage`` (so no stage move / enqueue / notification happens
+        here). A stage with no registered gate is treated as green (nothing blocks a
+        clean 'done'). Never raises -> any error returns False (conservative: route
+        to retry, never a false 'done').
+        """
+        try:
+            from .stages import get_qg_for_stage
+            from .stage_engine import _run_qg
+            qg_name = get_qg_for_stage(stage)
+            if not qg_name:
+                return True
+            passed, _reason = _run_qg(qg_name, job.get("repo"), work_item_id, branch)
+            return bool(passed)
+        except Exception as e:  # noqa: BLE001 - never break the reap
+            logger.warning(
+                "reaper: gate pre-eval failed for job %s (stage=%s): %s",
+                job.get("id"), stage, e,
+            )
+            return False
+
    def _reap_unknown_outcome(self, job: dict, reason: str) -> None:
        """Tier-1/Tier-3 (or exit!=0): outcome not a clean success.

@@ -252,7 +348,7 @@ class JobReaper:
        agent = job.get("agent")
        repo = job.get("repo")
        run_id = job.get("run_id")
-        branch, stage = self._task_branch_stage(job)
+        branch, stage, _wid = self._task_meta(job)
        # Candidate stages whose finishing agent is THIS agent (deployer maps to
        # both 'testing' and 'deploy-staging', hence a set).
        candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
@@ -270,28 +366,29 @@ class JobReaper:
                         job.get("id"), e)
            return False
        # Re-read the stage: advanced out of the candidate set -> gate was green.
-        _branch, new_stage = self._task_branch_stage(job)
+        _branch, new_stage, _wid2 = self._task_meta(job)
        return new_stage is None or new_stage not in candidates

    @staticmethod
-    def _task_branch_stage(job: dict) -> tuple[str | None, str | None]:
-        """Resolve (branch, stage) for the job's task. Never raises."""
+    def _task_meta(job: dict) -> tuple[str | None, str | None, str | None]:
+        """Resolve (branch, stage, work_item_id) for the job's task. Never raises."""
        task_id = job.get("task_id")
        if not task_id:
-            return None, None
+            return None, None, None
        try:
            conn = get_db()
            row = conn.execute(
-                "SELECT branch, stage FROM tasks WHERE id = ?", (task_id,)
+                "SELECT branch, stage, work_item_id FROM tasks WHERE id = ?",
+                (task_id,),
            ).fetchone()
            conn.close()
            if not row:
-                return None, None
-            return row["branch"], row["stage"]
+                return None, None, None
+            return row["branch"], row["stage"], row["work_item_id"]
        except Exception as e:  # noqa: BLE001 - never-raise contract
            logger.warning("reaper: task lookup failed for job %s: %s",
                           job.get("id"), e)
-            return None, None
+            return None, None, None

    def _notify_failed(self, job: dict, reason: str) -> None:
        try:
--- a/tests/test_job_reaper.py
+++ b/tests/test_job_reaper.py
@@ -32,18 +32,22 @@ def fresh_db(tmp_path, monkeypatch):
 # --- helpers ----------------------------------------------------------------
 def _make_running_job(agent="developer", repo="orchestrator", task_id=None,
                      pid=None, age_s=0, attempts=0, max_attempts=2,
-                      run_id=None, exit_code=None):
+                      run_id=None, exit_code=None, finished_age_s=600):
    """Insert a job already in 'running' with the given pid/age/attempts.

    started_at is back-dated by ``age_s`` seconds so running_age_s reflects it.
-    When ``exit_code`` is given an agent_runs row is created and linked (Tier-2).
+    When ``exit_code`` is given an agent_runs row is created and linked (Tier-2);
+    its ``finished_at`` is back-dated by ``finished_age_s`` seconds so the
+    Tier-2 finalization grace (``reaper_finalize_grace_s``, default 300) is
+    satisfied by default — pass a small ``finished_age_s`` to exercise the
+    "monitor may still be finalizing" deferral.
    """
    conn = get_db()
    if run_id is None and exit_code is not None:
        cur = conn.execute(
            "INSERT INTO agent_runs (task_id, agent, finished_at, exit_code) "
-            "VALUES (?, ?, datetime('now'), ?)",
-            (task_id, agent, exit_code),
+            "VALUES (?, ?, datetime('now', ?), ?)",
+            (task_id, agent, f"-{int(finished_age_s)} seconds", exit_code),
        )
        run_id = cur.lastrowid
    cur = conn.execute(
@@ -153,11 +157,20 @@ def test_tc04_backstop_no_pid(monkeypatch):


 # --- TC-05: correct outcome by exit_code (Tier-2) ---------------------------
+def _gate(monkeypatch, green: bool):
+    """Force the reaper's READ-ONLY gate pre-evaluation to green/red."""
+    monkeypatch.setattr(
+        JobReaper, "_gate_is_green",
+        lambda self, stage, job, branch, wid: green,
+    )
+
+
 def test_tc05_exit0_gate_green_done(monkeypatch):
    # A developer job runs to LEAVE the 'architecture' stage (-> 'development').
    tid = _make_task(stage="architecture")
    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0)
-    # gate green -> advance succeeds (stage leaves the developer candidate set).
+    _gate(monkeypatch, green=True)
+    # gate green -> the claim flips 'done' FIRST, then the advance runs.
    import src.agents.launcher as L
    monkeypatch.setattr(
        L.launcher, "_try_advance_stage",
@@ -172,13 +185,31 @@ def test_tc05_exit0_gate_red_requeues(monkeypatch):
    tid = _make_task(stage="architecture")
    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0,
                            attempts=0, max_attempts=2)
-    # gate red -> _try_advance_stage is a no-op (stage stays 'architecture').
+    _gate(monkeypatch, green=False)  # read-only pre-eval says red
+    # The advance path must NEVER run when the gate is red (claim-before-act).
    import src.agents.launcher as L
+    called = []
    monkeypatch.setattr(L.launcher, "_try_advance_stage",
-                        lambda run_id, agent, repo, branch: None)
+                        lambda run_id, agent, repo, branch: called.append(1))
    r = JobReaper()
    r.reap_once()
    assert get_job(jid)["status"] == "queued"  # exit0 but gate red -> not 'done'
+    assert not called, "no advance/side-effects on a red gate"
+
+
+def test_tc05_exit0_already_advanced_done_no_side_effects(monkeypatch):
+    # Stage already past the developer candidate set -> idempotent clean 'done'
+    # with NO advance call (the monitor already advanced before dying).
+    tid = _make_task(stage="development")  # developer's candidate is 'architecture'
+    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0)
+    import src.agents.launcher as L
+    called = []
+    monkeypatch.setattr(L.launcher, "_try_advance_stage",
+                        lambda run_id, agent, repo, branch: called.append(1))
+    r = JobReaper()
+    r.reap_once()
+    assert get_job(jid)["status"] == "done"
+    assert not called, "already-advanced reap must not re-advance"


 def test_tc05_nonzero_exit_requeue_then_failed(monkeypatch):
@@ -201,6 +232,78 @@ def test_tc05_nonzero_exit_requeue_then_failed(monkeypatch):
    assert sent, "failed reap must send a Telegram alert"


+# --- TC-05b: Tier-2 finalization grace (live monitor still finalizing) -------
+def test_tc05_tier2_within_grace_not_reaped(monkeypatch):
+    """exit_code freshly recorded -> a LIVE monitor may still be finalizing.
+
+    The reaper must NOT reap it within ``reaper_finalize_grace_s`` (FR-1.3/AC-3:
+    a live finalizing monitor — git push / PR / Plane comments — is never reaped,
+    no dup advance / enqueue).
+    """
+    monkeypatch.setattr(db.settings, "reaper_finalize_grace_s", 300)
+    tid = _make_task(stage="architecture")
+    # exit_code recorded only 5s ago -> still inside the finalization grace.
+    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0,
+                            finished_age_s=5)
+    import src.agents.launcher as L
+    called = []
+    monkeypatch.setattr(L.launcher, "_try_advance_stage",
+                        lambda run_id, agent, repo, branch: called.append(1))
+    r = JobReaper()
+    r.reap_once()
+    assert get_job(jid)["status"] == "running"  # deferred, NOT reaped
+    assert r.reaped_total == 0
+    assert not called, "a live finalizing monitor must not be advanced by the reaper"
+
+
+def test_tc05_tier2_after_grace_reaped(monkeypatch):
+    """Once exit_code has been recorded longer than the grace, the monitor is
+    genuinely dead and the Tier-2 reap proceeds."""
+    monkeypatch.setattr(db.settings, "reaper_finalize_grace_s", 300)
+    tid = _make_task(stage="architecture")
+    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0,
+                            finished_age_s=600)  # well past the grace
+    _gate(monkeypatch, green=True)
+    import src.agents.launcher as L
+    monkeypatch.setattr(
+        L.launcher, "_try_advance_stage",
+        lambda run_id, agent, repo, branch: db.update_task_stage(tid, "development"),
+    )
+    r = JobReaper()
+    r.reap_once()
+    assert get_job(jid)["status"] == "done"
+
+
+def test_tc05_tier2_lost_claim_no_side_effects(monkeypatch):
+    """claim-BEFORE-act: when another actor (a late monitor / startup requeue)
+    moves the row out of 'running' AFTER the reaper read it but BEFORE the atomic
+    claim, the reaper's claim loses (rowcount==0) and it performs NO advance side
+    effects (no dup advance / dup enqueue) — ADR-001 Р-1."""
+    monkeypatch.setattr(db.settings, "reaper_finalize_grace_s", 0)
+    tid = _make_task(stage="architecture")
+    jid = _make_running_job(agent="developer", task_id=tid, exit_code=0,
+                            finished_age_s=10)
+    import src.agents.launcher as L
+    called = []
+    monkeypatch.setattr(L.launcher, "_try_advance_stage",
+                        lambda run_id, agent, repo, branch: called.append(1))
+
+    # The read-only gate pre-eval reports green, but the row is concurrently
+    # claimed by someone else right before the reaper's atomic claim runs.
+    def green_then_steal(self, stage, job, branch, wid):
+        db.requeue_running_jobs()  # another actor wins the 'running' row first
+        return True
+
+    monkeypatch.setattr(JobReaper, "_gate_is_green", green_then_steal)
+    r = JobReaper()
+    r.reap_once()
+    # Reaper lost the atomic claim -> no advance, no double work. The row stays
+    # where the winner left it ('queued'), not flipped to 'done' by the reaper.
+    assert not called, "reaper that lost the claim must not advance/enqueue"
+    assert get_job(jid)["status"] == "queued"
+    assert r.reaped_total == 0
+
+
 # --- TC-06: atomicity — reaper vs requeue_running_jobs (status guard) --------
 def test_tc06_atomic_no_double_reap(monkeypatch):
    _dead_pid(monkeypatch)