fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance)
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -314,6 +314,14 @@ class Settings(BaseSettings):
|
||||
# reaper_max_running_s -> Tier-3 backstop ceiling: a job 'running' longer than
|
||||
# this is reaped even when liveness is unknowable. MUST be
|
||||
# > max agent_timeout + grace so a legit agent is safe.
|
||||
# reaper_finalize_grace_s -> Tier-2 anti-false-positive: a LIVE monitor writes
|
||||
# agent_runs.exit_code FIRST, THEN does git commit/push +
|
||||
# PR + Plane usage comments (seconds..minutes) and only
|
||||
# then _finalize_job. The agent pid is already dead in
|
||||
# that window, so pid cannot tell "monitor died" from
|
||||
# "monitor still finalizing". A job is reaped via Tier-2
|
||||
# only once exit_code has been recorded for at least this
|
||||
# many seconds (MUST be > the max finalization window).
|
||||
# lease_reclaim_enabled -> kill-switch for the proactive stale/dead lease reclaim
|
||||
# (false -> only the legacy lazy TTL reclaim in acquire).
|
||||
# (reuse) merge_lock_timeout_s -> lease TTL; merge_gate_repos -> reclaim scope.
|
||||
@@ -321,6 +329,7 @@ class Settings(BaseSettings):
|
||||
reaper_interval_s: int = 60
|
||||
reaper_dead_ticks: int = 2
|
||||
reaper_max_running_s: int = 3600
|
||||
reaper_finalize_grace_s: int = 300
|
||||
lease_reclaim_enabled: bool = True
|
||||
|
||||
# Telegram notifications
|
||||
|
||||
12
src/db.py
12
src/db.py
@@ -601,11 +601,15 @@ def requeue_running_jobs() -> int:
|
||||
def get_running_jobs() -> list[dict]:
|
||||
"""ORCH-065: snapshot of every 'running' job for the job-reaper scan.
|
||||
|
||||
Each row carries the job columns plus three reaper inputs:
|
||||
Each row carries the job columns plus four reaper inputs:
|
||||
* ``running_age_s`` — seconds since ``started_at`` (Tier-3 backstop);
|
||||
* ``exit_code`` — the linked ``agent_runs.exit_code`` (Tier-2: process
|
||||
finished but the job is still 'running' -> monitor died mid-finalize);
|
||||
* ``finished_at_run`` — the linked ``agent_runs.finished_at`` (debug only).
|
||||
* ``finished_at_run`` — the linked ``agent_runs.finished_at``;
|
||||
* ``finished_age_s`` — seconds since ``agent_runs.finished_at`` (Tier-2
|
||||
finalization grace: a LIVE monitor writes exit_code, THEN does git
|
||||
push / PR / Plane comments before _finalize_job, so a freshly-finished
|
||||
run is NOT yet a zombie — the reaper waits ``reaper_finalize_grace_s``).
|
||||
|
||||
A LEFT JOIN on ``run_id`` keeps jobs with no agent_runs row (exit_code NULL).
|
||||
Read-only; never mutates. The reaper applies liveness/streak/backstop on top.
|
||||
@@ -616,7 +620,9 @@ def get_running_jobs() -> list[dict]:
|
||||
"SELECT j.*, "
|
||||
"CAST(strftime('%s','now') - strftime('%s', j.started_at) AS INTEGER) "
|
||||
" AS running_age_s, "
|
||||
"r.exit_code AS exit_code, r.finished_at AS finished_at_run "
|
||||
"r.exit_code AS exit_code, r.finished_at AS finished_at_run, "
|
||||
"CAST(strftime('%s','now') - strftime('%s', r.finished_at) AS INTEGER) "
|
||||
" AS finished_age_s "
|
||||
"FROM jobs j LEFT JOIN agent_runs r ON r.id = j.run_id "
|
||||
"WHERE j.status='running'"
|
||||
).fetchall()
|
||||
|
||||
@@ -28,10 +28,17 @@ Liveness (defense in depth, ADR-001 Р-1):
|
||||
``reaper_dead_ticks`` (>=2) CONSECUTIVE dead-pid ticks — an in-memory streak
|
||||
counter kills false positives (AC-3); a live agent within its timeout is
|
||||
never reaped.
|
||||
* **Tier-2 (completion race): exit_code recorded but job still running.** The
|
||||
monitor died between writing ``agent_runs.exit_code`` and ``_finalize_job``.
|
||||
The outcome is KNOWN -> gate-driven advance on exit0, else the standard
|
||||
transient/permanent contract.
|
||||
* **Tier-2 (completion race): exit_code recorded but job still running.** This
|
||||
window is AMBIGUOUS — it is both "the monitor died between writing
|
||||
``agent_runs.exit_code`` and ``_finalize_job``" AND "a LIVE monitor is still
|
||||
finalizing" (``_monitor_agent`` writes ``exit_code`` FIRST, then git
|
||||
commit/push (+PR), the БАГ-8 check and network Plane usage comments — seconds
|
||||
to tens of seconds — and ONLY THEN ``_try_advance_stage`` -> ``_finalize_job``).
|
||||
The agent pid is already dead in BOTH cases, so it cannot disambiguate. The
|
||||
reaper therefore treats it as a dead monitor (KNOWN outcome) only after a
|
||||
finalization grace: ``exit_code`` recorded for >= ``reaper_finalize_grace_s``
|
||||
(a live finalizing monitor is NEVER reaped, FR-1.3/AC-3). Within the grace the
|
||||
row is left untouched.
|
||||
* **Tier-3 (backstop): age ceiling.** A job ``running`` longer than
|
||||
``reaper_max_running_s`` (deliberately > max ``agent_timeout`` + grace) is
|
||||
reaped even when liveness cannot be determined (pid reused / unknown).
|
||||
@@ -41,13 +48,16 @@ Action on confirmed death reuses existing contracts (no new merge/stage logic):
|
||||
``db.reap_running_job(... WHERE status='running')`` — so a late-arriving
|
||||
monitor / the startup ``requeue_running_jobs`` / a second tick can never
|
||||
double-process a row (AC-5; the loser sees ``rowcount==0``).
|
||||
* **exit0 (Tier-2):** gate-driven idempotent advance — the source of truth is
|
||||
the canonical quality gate, NOT "exit0". If the stage already advanced ->
|
||||
just mark ``done`` (idempotent cleanup). Else run ``launcher._try_advance_stage``
|
||||
(it runs the canonical QG: artifact/PR present -> green -> advance; absent ->
|
||||
red -> no advance) and re-check: advanced -> ``done``; still red (e.g. the
|
||||
monitor died before git-push, so no artifact) -> fall through to the failure
|
||||
path. This makes a false ``done`` without real work impossible.
|
||||
* **exit0 (Tier-2): claim-BEFORE-act (ADR-001 Р-1).** The source of truth is the
|
||||
canonical quality gate, NOT "exit0". If the stage already advanced -> atomic
|
||||
``done`` claim only (idempotent cleanup). Else evaluate the canonical QG
|
||||
READ-ONLY (no side effects, the reconciler pattern): red (e.g. the monitor died
|
||||
before git-push, so no artifact) -> failure path (no false ``done``); green ->
|
||||
atomically claim ``done`` FIRST, and only the claim winner then runs
|
||||
``launcher._try_advance_stage`` (advance + ``enqueue_job`` of the next stage).
|
||||
A tick that loses the claim performs NO side effects, so a late-finalizing
|
||||
monitor / the startup ``requeue_running_jobs`` can never be double-advanced or
|
||||
double-enqueued.
|
||||
* **exit!=0 (Tier-2) / unknown outcome (Tier-1 dead pid, Tier-3 backstop):**
|
||||
``attempts < max_attempts`` -> ``queued`` (mirrors ``requeue_running_jobs``);
|
||||
budget exhausted -> ``failed`` + Telegram. We never fabricate exit0.
|
||||
@@ -173,27 +183,46 @@ class JobReaper:
|
||||
exit_code = job.get("exit_code") # from the LEFT JOIN on agent_runs
|
||||
|
||||
# Tier-2: the process finished (exit_code recorded) but the job is still
|
||||
# 'running' -> the monitor died mid-finalize. Outcome is KNOWN.
|
||||
# 'running'. This is AMBIGUOUS: it is BOTH "the monitor died mid-finalize"
|
||||
# AND "a LIVE monitor is still finalizing" — _monitor_agent writes exit_code
|
||||
# FIRST, then does git commit/push (+PR), the БАГ-8 check, network Plane
|
||||
# usage comments (seconds..tens of seconds), and ONLY THEN _try_advance_stage
|
||||
# -> _finalize_job. The agent pid is already dead in BOTH cases, so pid can
|
||||
# NOT disambiguate. We treat it as a dead monitor (KNOWN outcome) only after
|
||||
# a finalization grace: exit_code must have been recorded for at least
|
||||
# `reaper_finalize_grace_s` (FR-1.3/AC-3 — a live finalizing monitor is never
|
||||
# reaped). Within the grace window we leave the row alone (and fall through to
|
||||
# the Tier-3 backstop only, which never trips before the grace given a sane
|
||||
# config where reaper_max_running_s > reaper_finalize_grace_s).
|
||||
if exit_code is not None:
|
||||
self._streak.pop(job_id, None)
|
||||
self._reap_known_outcome(job, int(exit_code))
|
||||
return
|
||||
|
||||
# Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
|
||||
if pid is not None and not merge_gate.pid_alive(pid):
|
||||
n = self._streak.get(job_id, 0) + 1
|
||||
self._streak[job_id] = n
|
||||
if n >= max(int(settings.reaper_dead_ticks), 1):
|
||||
self._streak.pop(job_id, None)
|
||||
self._reap_unknown_outcome(job, reason=f"dead pid={pid}")
|
||||
finished_age = job.get("finished_age_s")
|
||||
grace = int(settings.reaper_finalize_grace_s)
|
||||
if finished_age is not None and int(finished_age) >= grace:
|
||||
self._reap_known_outcome(job, int(exit_code))
|
||||
return
|
||||
logger.info(
|
||||
"reaper: job %s pid=%s dead (streak %d/%d) — deferring",
|
||||
job_id, pid, n, settings.reaper_dead_ticks,
|
||||
"reaper: job %s exit_code=%s recorded %ss ago (< grace %ss) — "
|
||||
"deferring (monitor may still be finalizing)",
|
||||
job_id, exit_code, finished_age, grace,
|
||||
)
|
||||
# fall through to the Tier-3 backstop guard below.
|
||||
else:
|
||||
# Alive / no pid -> reset the streak (must be CONSECUTIVE).
|
||||
self._streak.pop(job_id, None)
|
||||
# Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
|
||||
if pid is not None and not merge_gate.pid_alive(pid):
|
||||
n = self._streak.get(job_id, 0) + 1
|
||||
self._streak[job_id] = n
|
||||
if n >= max(int(settings.reaper_dead_ticks), 1):
|
||||
self._streak.pop(job_id, None)
|
||||
self._reap_unknown_outcome(job, reason=f"dead pid={pid}")
|
||||
return
|
||||
logger.info(
|
||||
"reaper: job %s pid=%s dead (streak %d/%d) — deferring",
|
||||
job_id, pid, n, settings.reaper_dead_ticks,
|
||||
)
|
||||
else:
|
||||
# Alive / no pid -> reset the streak (must be CONSECUTIVE).
|
||||
self._streak.pop(job_id, None)
|
||||
|
||||
# Tier-3: backstop ceiling (one-shot; reaps even when liveness is unknown).
|
||||
if age >= int(settings.reaper_max_running_s):
|
||||
@@ -206,16 +235,83 @@ class JobReaper:
|
||||
def _reap_known_outcome(self, job: dict, exit_code: int) -> None:
|
||||
"""Tier-2: the agent's exit_code is known; drive the job's terminal status."""
|
||||
if exit_code == 0:
|
||||
if self._gate_driven_advance(job):
|
||||
if reap_running_job(job["id"], "done", run_id=job.get("run_id")):
|
||||
self._note_reap(job, "done", reason="exit0, gate green")
|
||||
return
|
||||
# exit0 but the gate is red (e.g. monitor died before git-push -> no
|
||||
# artifact). Do NOT fabricate 'done'; treat as a failed outcome.
|
||||
self._reap_unknown_outcome(job, reason="exit0 but gate red")
|
||||
self._reap_exit0(job)
|
||||
else:
|
||||
self._reap_unknown_outcome(job, reason=f"exit={exit_code}")
|
||||
|
||||
def _reap_exit0(self, job: dict) -> None:
|
||||
"""Reap an exit0 Tier-2 job with claim-BEFORE-act (ADR-001 Р-1).
|
||||
|
||||
The atomic ``reap_running_job`` claim (guard ``WHERE status='running'``) MUST
|
||||
precede any ``advance_stage`` / ``enqueue_job`` side effect, so a reaper tick
|
||||
that LOSES the row (to a late-finalizing monitor or the startup
|
||||
``requeue_running_jobs``) performs NO side effects — no duplicate advance, no
|
||||
duplicate ``enqueue_job`` of the next stage (FR-1.2/AC-4).
|
||||
|
||||
Because the claim flips the row OUT of 'running', we cannot run the advance
|
||||
first to learn the gate colour. Instead we evaluate the canonical quality gate
|
||||
READ-ONLY (no side effects — the pattern the reconciler uses) to choose the
|
||||
terminal status BEFORE claiming:
|
||||
* already advanced past this agent -> idempotent clean ``done`` (no advance);
|
||||
* gate green -> claim ``done`` first, THEN advance exactly once;
|
||||
* gate red (e.g. monitor died before git-push -> no artifact) -> NOT a real
|
||||
success: route to the retry/fail contract (never a false ``done``).
|
||||
"""
|
||||
job_id = job["id"]
|
||||
run_id = job.get("run_id")
|
||||
agent = job.get("agent")
|
||||
branch, stage, work_item_id = self._task_meta(job)
|
||||
candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
|
||||
|
||||
if stage is None or stage not in candidates:
|
||||
# Stage already advanced past this agent (or unknown) -> a clean 'done'
|
||||
# is correct WITHOUT re-advancing. Atomic claim only (idempotent cleanup).
|
||||
if reap_running_job(job_id, "done", run_id=run_id):
|
||||
self._note_reap(job, "done", reason="exit0, already advanced")
|
||||
return
|
||||
|
||||
if not branch or not self._gate_is_green(stage, job, branch, work_item_id):
|
||||
# exit0 but the gate is red -> do NOT fabricate 'done'; treat as failure
|
||||
# (retry within budget, else failed + Telegram).
|
||||
self._reap_unknown_outcome(job, reason="exit0 but gate red")
|
||||
return
|
||||
|
||||
# Gate green. CLAIM-BEFORE-ACT: own the row atomically FIRST.
|
||||
if not reap_running_job(job_id, "done", run_id=run_id):
|
||||
# Lost the race -> the winner (late monitor / startup requeue) owns the
|
||||
# advance; we do NOTHING (no duplicate side effects).
|
||||
return
|
||||
# We exclusively own the row now -> drive the gate-based advance exactly once.
|
||||
self._gate_driven_advance(job)
|
||||
self._note_reap(job, "done", reason="exit0, gate green")
|
||||
|
||||
def _gate_is_green(
|
||||
self, stage: str, job: dict, branch: str, work_item_id: str | None
|
||||
) -> bool:
|
||||
"""Read-only canonical-QG evaluation for a reaped exit0 job (no side effects).
|
||||
|
||||
Mirrors the reconciler's cheap pre-evaluation: dispatch the stage's QG via
|
||||
the SAME ``_run_qg`` the webhook path uses, returning its pass/fail WITHOUT
|
||||
running ``advance_stage`` (so no stage move / enqueue / notification happens
|
||||
here). A stage with no registered gate is treated as green (nothing blocks a
|
||||
clean 'done'). Never raises -> any error returns False (conservative: route
|
||||
to retry, never a false 'done').
|
||||
"""
|
||||
try:
|
||||
from .stages import get_qg_for_stage
|
||||
from .stage_engine import _run_qg
|
||||
qg_name = get_qg_for_stage(stage)
|
||||
if not qg_name:
|
||||
return True
|
||||
passed, _reason = _run_qg(qg_name, job.get("repo"), work_item_id, branch)
|
||||
return bool(passed)
|
||||
except Exception as e: # noqa: BLE001 - never break the reap
|
||||
logger.warning(
|
||||
"reaper: gate pre-eval failed for job %s (stage=%s): %s",
|
||||
job.get("id"), stage, e,
|
||||
)
|
||||
return False
|
||||
|
||||
def _reap_unknown_outcome(self, job: dict, reason: str) -> None:
|
||||
"""Tier-1/Tier-3 (or exit!=0): outcome not a clean success.
|
||||
|
||||
@@ -252,7 +348,7 @@ class JobReaper:
|
||||
agent = job.get("agent")
|
||||
repo = job.get("repo")
|
||||
run_id = job.get("run_id")
|
||||
branch, stage = self._task_branch_stage(job)
|
||||
branch, stage, _wid = self._task_meta(job)
|
||||
# Candidate stages whose finishing agent is THIS agent (deployer maps to
|
||||
# both 'testing' and 'deploy-staging', hence a set).
|
||||
candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
|
||||
@@ -270,28 +366,29 @@ class JobReaper:
|
||||
job.get("id"), e)
|
||||
return False
|
||||
# Re-read the stage: advanced out of the candidate set -> gate was green.
|
||||
_branch, new_stage = self._task_branch_stage(job)
|
||||
_branch, new_stage, _wid2 = self._task_meta(job)
|
||||
return new_stage is None or new_stage not in candidates
|
||||
|
||||
@staticmethod
|
||||
def _task_branch_stage(job: dict) -> tuple[str | None, str | None]:
|
||||
"""Resolve (branch, stage) for the job's task. Never raises."""
|
||||
def _task_meta(job: dict) -> tuple[str | None, str | None, str | None]:
|
||||
"""Resolve (branch, stage, work_item_id) for the job's task. Never raises."""
|
||||
task_id = job.get("task_id")
|
||||
if not task_id:
|
||||
return None, None
|
||||
return None, None, None
|
||||
try:
|
||||
conn = get_db()
|
||||
row = conn.execute(
|
||||
"SELECT branch, stage FROM tasks WHERE id = ?", (task_id,)
|
||||
"SELECT branch, stage, work_item_id FROM tasks WHERE id = ?",
|
||||
(task_id,),
|
||||
).fetchone()
|
||||
conn.close()
|
||||
if not row:
|
||||
return None, None
|
||||
return row["branch"], row["stage"]
|
||||
return None, None, None
|
||||
return row["branch"], row["stage"], row["work_item_id"]
|
||||
except Exception as e: # noqa: BLE001 - never-raise contract
|
||||
logger.warning("reaper: task lookup failed for job %s: %s",
|
||||
job.get("id"), e)
|
||||
return None, None
|
||||
return None, None, None
|
||||
|
||||
def _notify_failed(self, job: dict, reason: str) -> None:
|
||||
try:
|
||||
|
||||
Reference in New Issue
Block a user