Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
468 lines
23 KiB
Python
468 lines
23 KiB
Python
"""ORCH-065: job-reaper + proactive merge-lease reclaim background daemon.
|
||
|
||
Three failure classes share one root cause — "the thread/process died while it
|
||
still held captured state" — and one inert recovery layer
|
||
(``requeue_running_jobs``) that only fires on a process restart:
|
||
|
||
* **A — zombie jobs.** A job's terminal status (``done``/``queued``/``failed``)
|
||
is written ONLY inside ``launcher._monitor_agent -> _finalize_job`` in the
|
||
live process. If that thread/process dies between ``proc.wait()`` and the
|
||
status write (crash, OOM, self-restart mid-deploy) the ``jobs`` row stays
|
||
``running`` forever. At ``max_concurrency=1`` one zombie blocks the claim of
|
||
EVERY project's jobs -> the whole shared pipeline stalls.
|
||
* **B — stuck merge-lease.** The file lease ``.merge-lease-<repo>.json``
|
||
(ORCH-043) is reclaimed only lazily, by TTL, and only when ANOTHER task tries
|
||
to acquire it. Holder liveness (pid) is never probed, so a death with the
|
||
lease held blocks foreign merges until the TTL expires.
|
||
|
||
This module is a background daemon thread modelled on ``reconciler``
|
||
(``threading.Thread(daemon=True)`` + ``threading.Event``, start/stop in
|
||
``main.lifespan``, ``/queue`` snapshot, per-unit never-raise, kill-switch). Each
|
||
tick: (1) scans ``running`` jobs and reaps the dead ones via three-tier liveness
|
||
detection; (2) proactively reclaims dead/stale merge-leases (mechanism B) for the
|
||
in-scope repos.
|
||
|
||
Liveness (defense in depth, ADR-001 Р-1):
|
||
* **Tier-1 (primary): dead pid.** ``jobs.pid`` (stamped by ``launcher._spawn``)
|
||
probed with ``merge_gate.pid_alive``. A job is reaped only after
|
||
``reaper_dead_ticks`` (>=2) CONSECUTIVE dead-pid ticks — an in-memory streak
|
||
counter kills false positives (AC-3); a live agent within its timeout is
|
||
never reaped.
|
||
* **Tier-2 (completion race): exit_code recorded but job still running.** This
|
||
window is AMBIGUOUS — it is both "the monitor died between writing
|
||
``agent_runs.exit_code`` and ``_finalize_job``" AND "a LIVE monitor is still
|
||
finalizing" (``_monitor_agent`` writes ``exit_code`` FIRST, then git
|
||
commit/push (+PR), the БАГ-8 check and network Plane usage comments — seconds
|
||
to tens of seconds — and ONLY THEN ``_try_advance_stage`` -> ``_finalize_job``).
|
||
The agent pid is already dead in BOTH cases, so it cannot disambiguate. The
|
||
reaper therefore treats it as a dead monitor (KNOWN outcome) only after a
|
||
finalization grace: ``exit_code`` recorded for >= ``reaper_finalize_grace_s``
|
||
(a live finalizing monitor is NEVER reaped, FR-1.3/AC-3). Within the grace the
|
||
row is left untouched.
|
||
* **Tier-3 (backstop): age ceiling.** A job ``running`` longer than
|
||
``reaper_max_running_s`` (deliberately > max ``agent_timeout`` + grace) is
|
||
reaped even when liveness cannot be determined (pid reused / unknown).
|
||
|
||
Action on confirmed death reuses existing contracts (no new merge/stage logic):
|
||
* The reaper's ONLY mutating write to a job row is the atomic terminal flip
|
||
``db.reap_running_job(... WHERE status='running')`` — so a late-arriving
|
||
monitor / the startup ``requeue_running_jobs`` / a second tick can never
|
||
double-process a row (AC-5; the loser sees ``rowcount==0``).
|
||
* **exit0 (Tier-2): claim-BEFORE-act (ADR-001 Р-1).** The source of truth is the
|
||
canonical quality gate, NOT "exit0". If the stage already advanced -> atomic
|
||
``done`` claim only (idempotent cleanup). Else evaluate the canonical QG
|
||
READ-ONLY (no side effects, the reconciler pattern): red (e.g. the monitor died
|
||
before git-push, so no artifact) -> failure path (no false ``done``); green ->
|
||
atomically claim ``done`` FIRST, and only the claim winner then runs
|
||
``launcher._try_advance_stage`` (advance + ``enqueue_job`` of the next stage).
|
||
A tick that loses the claim performs NO side effects, so a late-finalizing
|
||
monitor / the startup ``requeue_running_jobs`` can never be double-advanced or
|
||
double-enqueued.
|
||
* **exit!=0 (Tier-2) / unknown outcome (Tier-1 dead pid, Tier-3 backstop):**
|
||
``attempts < max_attempts`` -> ``queued`` (mirrors ``requeue_running_jobs``);
|
||
budget exhausted -> ``failed`` + Telegram. We never fabricate exit0.
|
||
|
||
Invariants (ТЗ §8 / ADR-001): never-raise per unit of work; idempotency (atomic
|
||
guard + gate-driven advance); restart-safe (the reaper starts AFTER the startup
|
||
``requeue_running_jobs``); silence when nothing is anomalous; the reaper NEVER
|
||
restarts/kills the prod container and NEVER pushes ``main``. ``STAGE_TRANSITIONS``
|
||
/ ``QG_CHECKS`` and every ``check_*`` signature are unchanged.
|
||
|
||
See docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md and
|
||
the cross-cutting docs/architecture/adr/adr-0011-job-reaper-lease-reclaim.md.
|
||
"""
|
||
|
||
import logging
|
||
import threading
|
||
from datetime import datetime, timezone
|
||
|
||
from .config import settings
|
||
from .db import (
|
||
get_db,
|
||
get_running_jobs,
|
||
reap_running_job,
|
||
)
|
||
from .stages import STAGE_TRANSITIONS, get_agent_for_stage
|
||
|
||
logger = logging.getLogger("orchestrator.job_reaper")
|
||
|
||
|
||
def reclaim_all_stale_leases() -> int:
|
||
"""Proactively reclaim dead/stale merge-leases for every in-scope repo.
|
||
|
||
Used both at startup (``main.lifespan``, next to ``requeue_running_jobs``) and
|
||
on every reaper tick (mechanism B). Iterates the merge-gate scope
|
||
(``merge_gate_repos`` CSV, else self-hosting ``orchestrator``) and calls the
|
||
never-raise ``merge_gate.reclaim_stale_lease`` per repo. Returns the number of
|
||
leases actually reclaimed. Never raises (per-repo isolation).
|
||
"""
|
||
if not settings.lease_reclaim_enabled:
|
||
return 0
|
||
reclaimed = 0
|
||
try:
|
||
from . import merge_gate
|
||
raw = (settings.merge_gate_repos or "").strip()
|
||
if raw:
|
||
repos = [r.strip() for r in raw.split(",") if r.strip()]
|
||
else:
|
||
from .qg.checks import SELF_HOSTING_REPO
|
||
repos = [SELF_HOSTING_REPO]
|
||
for repo in repos:
|
||
try:
|
||
if merge_gate.reclaim_stale_lease(repo):
|
||
reclaimed += 1
|
||
except Exception as e: # noqa: BLE001 - isolate one repo's failure
|
||
logger.error("lease-reclaim failed for repo %s: %s", repo, e)
|
||
except Exception as e: # noqa: BLE001 - never-raise contract
|
||
logger.error("reclaim_all_stale_leases error: %s", e)
|
||
return reclaimed
|
||
|
||
|
||
class JobReaper:
|
||
"""Background daemon that reaps zombie jobs and reclaims stale merge-leases.
|
||
|
||
Modelled on ``Reconciler``: a ``threading.Thread(daemon=True)`` + a
|
||
``threading.Event`` for a clean stop. The only in-memory state is the
|
||
best-effort Tier-1 dead-pid streak counter (``_streak``) and the
|
||
observability counters (``reaped_total`` / ``last_reaped`` /
|
||
``lease_reclaimed_total`` / ``last_run_ts``); all reset on restart, which is
|
||
safe because the startup ``requeue_running_jobs`` covers the restart path.
|
||
"""
|
||
|
||
def __init__(self, interval_s: float | None = None):
|
||
self.interval_s = (
|
||
interval_s if interval_s is not None else settings.reaper_interval_s
|
||
)
|
||
self._stop = threading.Event()
|
||
self._thread: threading.Thread | None = None
|
||
# Tier-1 anti-false-positive: {job_id: consecutive dead-pid ticks}.
|
||
self._streak: dict[int, int] = {}
|
||
# Best-effort observability (Р-6).
|
||
self.last_run_ts: float | None = None
|
||
self.reaped_total: int = 0
|
||
self.last_reaped: dict | None = None
|
||
self.lease_reclaimed_total: int = 0
|
||
|
||
# -- A: zombie-job reaping --------------------------------------------
|
||
def reap_once(self) -> None:
|
||
"""One scan over all ``running`` jobs (per-job never-raise) + lease reclaim."""
|
||
if settings.reaper_enabled:
|
||
try:
|
||
running = get_running_jobs()
|
||
except Exception as e: # noqa: BLE001 - never break the tick
|
||
logger.error("reaper: get_running_jobs failed: %s", e)
|
||
running = []
|
||
seen: set[int] = set()
|
||
for job in running:
|
||
jid = job.get("id")
|
||
if jid is not None:
|
||
seen.add(jid)
|
||
try:
|
||
self._reap_job(job)
|
||
except Exception as e: # noqa: BLE001 - isolate one job's failure
|
||
logger.error(
|
||
"reaper: job %s (agent=%s) failed: %s",
|
||
job.get("id"), job.get("agent"), e,
|
||
)
|
||
# Forget streaks for rows that are no longer running (reaped / requeued
|
||
# / finished by the monitor) so the dict cannot grow unbounded.
|
||
self._streak = {k: v for k, v in self._streak.items() if k in seen}
|
||
# Mechanism B: proactive stale/dead lease reclaim (own kill-switch).
|
||
try:
|
||
self.lease_reclaimed_total += reclaim_all_stale_leases()
|
||
except Exception as e: # noqa: BLE001 - never break the tick
|
||
logger.error("reaper: lease reclaim sweep failed: %s", e)
|
||
|
||
def _reap_job(self, job: dict) -> None:
|
||
"""Apply the three-tier liveness policy to a single running job."""
|
||
from . import merge_gate
|
||
|
||
job_id = job["id"]
|
||
pid = job.get("pid")
|
||
age = int(job.get("running_age_s") or 0)
|
||
exit_code = job.get("exit_code") # from the LEFT JOIN on agent_runs
|
||
|
||
# Tier-2: the process finished (exit_code recorded) but the job is still
|
||
# 'running'. This is AMBIGUOUS: it is BOTH "the monitor died mid-finalize"
|
||
# AND "a LIVE monitor is still finalizing" — _monitor_agent writes exit_code
|
||
# FIRST, then does git commit/push (+PR), the БАГ-8 check, network Plane
|
||
# usage comments (seconds..tens of seconds), and ONLY THEN _try_advance_stage
|
||
# -> _finalize_job. The agent pid is already dead in BOTH cases, so pid can
|
||
# NOT disambiguate. We treat it as a dead monitor (KNOWN outcome) only after
|
||
# a finalization grace: exit_code must have been recorded for at least
|
||
# `reaper_finalize_grace_s` (FR-1.3/AC-3 — a live finalizing monitor is never
|
||
# reaped). Within the grace window we leave the row alone (and fall through to
|
||
# the Tier-3 backstop only, which never trips before the grace given a sane
|
||
# config where reaper_max_running_s > reaper_finalize_grace_s).
|
||
if exit_code is not None:
|
||
self._streak.pop(job_id, None)
|
||
finished_age = job.get("finished_age_s")
|
||
grace = int(settings.reaper_finalize_grace_s)
|
||
if finished_age is not None and int(finished_age) >= grace:
|
||
self._reap_known_outcome(job, int(exit_code))
|
||
return
|
||
logger.info(
|
||
"reaper: job %s exit_code=%s recorded %ss ago (< grace %ss) — "
|
||
"deferring (monitor may still be finalizing)",
|
||
job_id, exit_code, finished_age, grace,
|
||
)
|
||
# fall through to the Tier-3 backstop guard below.
|
||
else:
|
||
# Tier-1: dead pid, only after `reaper_dead_ticks` consecutive dead ticks.
|
||
if pid is not None and not merge_gate.pid_alive(pid):
|
||
n = self._streak.get(job_id, 0) + 1
|
||
self._streak[job_id] = n
|
||
if n >= max(int(settings.reaper_dead_ticks), 1):
|
||
self._streak.pop(job_id, None)
|
||
self._reap_unknown_outcome(job, reason=f"dead pid={pid}")
|
||
return
|
||
logger.info(
|
||
"reaper: job %s pid=%s dead (streak %d/%d) — deferring",
|
||
job_id, pid, n, settings.reaper_dead_ticks,
|
||
)
|
||
else:
|
||
# Alive / no pid -> reset the streak (must be CONSECUTIVE).
|
||
self._streak.pop(job_id, None)
|
||
|
||
# Tier-3: backstop ceiling (one-shot; reaps even when liveness is unknown).
|
||
if age >= int(settings.reaper_max_running_s):
|
||
self._streak.pop(job_id, None)
|
||
self._reap_unknown_outcome(
|
||
job, reason=f"backstop age={age}s>={settings.reaper_max_running_s}s"
|
||
)
|
||
|
||
# -- reap actions ------------------------------------------------------
|
||
def _reap_known_outcome(self, job: dict, exit_code: int) -> None:
|
||
"""Tier-2: the agent's exit_code is known; drive the job's terminal status."""
|
||
if exit_code == 0:
|
||
self._reap_exit0(job)
|
||
else:
|
||
self._reap_unknown_outcome(job, reason=f"exit={exit_code}")
|
||
|
||
def _reap_exit0(self, job: dict) -> None:
|
||
"""Reap an exit0 Tier-2 job with claim-BEFORE-act (ADR-001 Р-1).
|
||
|
||
The atomic ``reap_running_job`` claim (guard ``WHERE status='running'``) MUST
|
||
precede any ``advance_stage`` / ``enqueue_job`` side effect, so a reaper tick
|
||
that LOSES the row (to a late-finalizing monitor or the startup
|
||
``requeue_running_jobs``) performs NO side effects — no duplicate advance, no
|
||
duplicate ``enqueue_job`` of the next stage (FR-1.2/AC-4).
|
||
|
||
Because the claim flips the row OUT of 'running', we cannot run the advance
|
||
first to learn the gate colour. Instead we evaluate the canonical quality gate
|
||
READ-ONLY (no side effects — the pattern the reconciler uses) to choose the
|
||
terminal status BEFORE claiming:
|
||
* already advanced past this agent -> idempotent clean ``done`` (no advance);
|
||
* gate green -> claim ``done`` first, THEN advance exactly once;
|
||
* gate red (e.g. monitor died before git-push -> no artifact) -> NOT a real
|
||
success: route to the retry/fail contract (never a false ``done``).
|
||
"""
|
||
job_id = job["id"]
|
||
run_id = job.get("run_id")
|
||
agent = job.get("agent")
|
||
branch, stage, work_item_id = self._task_meta(job)
|
||
candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
|
||
|
||
if stage is None or stage not in candidates:
|
||
# Stage already advanced past this agent (or unknown) -> a clean 'done'
|
||
# is correct WITHOUT re-advancing. Atomic claim only (idempotent cleanup).
|
||
if reap_running_job(job_id, "done", run_id=run_id):
|
||
self._note_reap(job, "done", reason="exit0, already advanced")
|
||
return
|
||
|
||
if not branch or not self._gate_is_green(stage, job, branch, work_item_id):
|
||
# exit0 but the gate is red -> do NOT fabricate 'done'; treat as failure
|
||
# (retry within budget, else failed + Telegram).
|
||
self._reap_unknown_outcome(job, reason="exit0 but gate red")
|
||
return
|
||
|
||
# Gate green. CLAIM-BEFORE-ACT: own the row atomically FIRST.
|
||
if not reap_running_job(job_id, "done", run_id=run_id):
|
||
# Lost the race -> the winner (late monitor / startup requeue) owns the
|
||
# advance; we do NOTHING (no duplicate side effects).
|
||
return
|
||
# We exclusively own the row now -> drive the gate-based advance exactly once.
|
||
self._gate_driven_advance(job)
|
||
self._note_reap(job, "done", reason="exit0, gate green")
|
||
|
||
def _gate_is_green(
|
||
self, stage: str, job: dict, branch: str, work_item_id: str | None
|
||
) -> bool:
|
||
"""Read-only canonical-QG evaluation for a reaped exit0 job (no side effects).
|
||
|
||
Mirrors the reconciler's cheap pre-evaluation: dispatch the stage's QG via
|
||
the SAME ``_run_qg`` the webhook path uses, returning its pass/fail WITHOUT
|
||
running ``advance_stage`` (so no stage move / enqueue / notification happens
|
||
here). A stage with no registered gate is treated as green (nothing blocks a
|
||
clean 'done'). Never raises -> any error returns False (conservative: route
|
||
to retry, never a false 'done').
|
||
"""
|
||
try:
|
||
from .stages import get_qg_for_stage
|
||
from .stage_engine import _run_qg
|
||
qg_name = get_qg_for_stage(stage)
|
||
if not qg_name:
|
||
return True
|
||
passed, _reason = _run_qg(qg_name, job.get("repo"), work_item_id, branch)
|
||
return bool(passed)
|
||
except Exception as e: # noqa: BLE001 - never break the reap
|
||
logger.warning(
|
||
"reaper: gate pre-eval failed for job %s (stage=%s): %s",
|
||
job.get("id"), stage, e,
|
||
)
|
||
return False
|
||
|
||
def _reap_unknown_outcome(self, job: dict, reason: str) -> None:
|
||
"""Tier-1/Tier-3 (or exit!=0): outcome not a clean success.
|
||
|
||
Mirrors ``requeue_running_jobs`` / the permanent-failure contract:
|
||
``attempts < max_attempts`` -> ``queued`` (a retry); budget exhausted ->
|
||
``failed`` + Telegram. The terminal flip is the atomic ``reap_running_job``
|
||
guard, so a racing requeue/monitor never double-processes the row.
|
||
"""
|
||
job_id = job["id"]
|
||
run_id = job.get("run_id")
|
||
attempts = int(job.get("attempts") or 0)
|
||
max_attempts = int(job.get("max_attempts") or 2)
|
||
err = f"reaped: {reason} (run_id={run_id})"
|
||
if attempts < max_attempts:
|
||
if reap_running_job(job_id, "queued", run_id=run_id, error=err):
|
||
self._note_reap(job, "queued", reason=reason)
|
||
else:
|
||
if reap_running_job(job_id, "failed", run_id=run_id, error=err):
|
||
self._note_reap(job, "failed", reason=reason)
|
||
self._notify_failed(job, reason)
|
||
|
||
def _gate_driven_advance(self, job: dict) -> bool:
|
||
"""Idempotent, gate-driven stage advance for a reaped exit0 job.
|
||
|
||
Returns True iff the stage is (or has become) advanced past this agent's
|
||
stage — i.e. the canonical quality gate is satisfied and a clean ``done``
|
||
is correct. Returns False when the gate is still red (the caller then
|
||
routes the job to the failure path instead of a false ``done``).
|
||
|
||
The advance itself reuses the UNCHANGED ``launcher._try_advance_stage``
|
||
(which runs the canonical QG and the unified ``advance_stage``); the
|
||
reaper never duplicates ``update_task_stage`` / ``enqueue_job``.
|
||
"""
|
||
agent = job.get("agent")
|
||
repo = job.get("repo")
|
||
run_id = job.get("run_id")
|
||
branch, stage, _wid = self._task_meta(job)
|
||
# Candidate stages whose finishing agent is THIS agent (deployer maps to
|
||
# both 'testing' and 'deploy-staging', hence a set).
|
||
candidates = {s for s in STAGE_TRANSITIONS if get_agent_for_stage(s) == agent}
|
||
if stage is None or stage not in candidates:
|
||
# Stage already advanced past this agent (or unknown) -> idempotent
|
||
# cleanup: a clean 'done' is correct without re-advancing.
|
||
return True
|
||
if not branch:
|
||
return False
|
||
try:
|
||
from .agents.launcher import launcher
|
||
launcher._try_advance_stage(run_id, agent, repo, branch)
|
||
except Exception as e: # noqa: BLE001 - never break the reap
|
||
logger.error("reaper: gate-driven advance failed for job %s: %s",
|
||
job.get("id"), e)
|
||
return False
|
||
# Re-read the stage: advanced out of the candidate set -> gate was green.
|
||
_branch, new_stage, _wid2 = self._task_meta(job)
|
||
return new_stage is None or new_stage not in candidates
|
||
|
||
@staticmethod
|
||
def _task_meta(job: dict) -> tuple[str | None, str | None, str | None]:
|
||
"""Resolve (branch, stage, work_item_id) for the job's task. Never raises."""
|
||
task_id = job.get("task_id")
|
||
if not task_id:
|
||
return None, None, None
|
||
try:
|
||
conn = get_db()
|
||
row = conn.execute(
|
||
"SELECT branch, stage, work_item_id FROM tasks WHERE id = ?",
|
||
(task_id,),
|
||
).fetchone()
|
||
conn.close()
|
||
if not row:
|
||
return None, None, None
|
||
return row["branch"], row["stage"], row["work_item_id"]
|
||
except Exception as e: # noqa: BLE001 - never-raise contract
|
||
logger.warning("reaper: task lookup failed for job %s: %s",
|
||
job.get("id"), e)
|
||
return None, None, None
|
||
|
||
def _notify_failed(self, job: dict, reason: str) -> None:
|
||
try:
|
||
from .notifications import send_telegram
|
||
send_telegram(
|
||
f"\U0001f6a8 reaper: job {job.get('id')} ({job.get('agent')}, "
|
||
f"repo {job.get('repo')}) reaped as FAILED: {reason}"
|
||
)
|
||
except Exception as e: # noqa: BLE001 - telegram best-effort
|
||
logger.warning("reaper: failed-notify telegram error: %s", e)
|
||
|
||
def _note_reap(self, job: dict, outcome: str, reason: str) -> None:
|
||
"""Record + log one successful reap (Р-6 observability)."""
|
||
self.reaped_total += 1
|
||
self.last_reaped = {
|
||
"job_id": job.get("id"),
|
||
"agent": job.get("agent"),
|
||
"outcome": outcome,
|
||
}
|
||
logger.warning(
|
||
"reaper: job %s (agent=%s, repo=%s, run_id=%s, pid=%s) reaped -> %s (%s)",
|
||
job.get("id"), job.get("agent"), job.get("repo"),
|
||
job.get("run_id"), job.get("pid"), outcome, reason,
|
||
)
|
||
|
||
# -- loop / lifecycle --------------------------------------------------
|
||
def _tick(self) -> None:
|
||
try:
|
||
self.reap_once()
|
||
finally:
|
||
self.last_run_ts = datetime.now(timezone.utc).timestamp()
|
||
|
||
def _run(self) -> None:
|
||
logger.info(
|
||
"JobReaper started (interval=%ss, enabled=%s, dead_ticks=%s, "
|
||
"max_running_s=%s, lease_reclaim=%s)",
|
||
self.interval_s, settings.reaper_enabled, settings.reaper_dead_ticks,
|
||
settings.reaper_max_running_s, settings.lease_reclaim_enabled,
|
||
)
|
||
while not self._stop.is_set():
|
||
try:
|
||
self._tick()
|
||
except Exception as e: # noqa: BLE001 - outer never-raise
|
||
logger.error("JobReaper loop error: %s", e)
|
||
self._stop.wait(self.interval_s)
|
||
logger.info("JobReaper stopped")
|
||
|
||
def start(self) -> None:
|
||
"""Start the daemon thread (idempotent: a live thread is a no-op)."""
|
||
if self._thread and self._thread.is_alive():
|
||
return
|
||
self._stop.clear()
|
||
self._thread = threading.Thread(
|
||
target=self._run, name="job-reaper", daemon=True
|
||
)
|
||
self._thread.start()
|
||
|
||
def stop(self, timeout: float = 5.0) -> None:
|
||
self._stop.set()
|
||
if self._thread:
|
||
self._thread.join(timeout=timeout)
|
||
|
||
def status(self) -> dict:
|
||
"""Reaper snapshot for /queue observability (Р-6)."""
|
||
return {
|
||
"enabled": settings.reaper_enabled,
|
||
"interval": self.interval_s,
|
||
"last_run_ts": self.last_run_ts,
|
||
"reaped_total": self.reaped_total,
|
||
"last_reaped": self.last_reaped,
|
||
"lease_reclaimed_total": self.lease_reclaimed_total,
|
||
}
|
||
|
||
|
||
# Module-level singleton used by the FastAPI lifespan.
|
||
reaper = JobReaper()
|