Files
orchestrator/src/finalizer_liveness.py
claude-bot 41074915d2
All checks were successful
CI / test (push) Successful in 3m10s
CI / test (pull_request) Successful in 2m53s
fix(reaper): do not re-run deploy-staging finalization while finalizer is alive
On the deploy-staging -> deploy edge the live monitor stamps
agent_runs.finished_at FIRST, then runs the heavy edge sub-gates
(security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES
and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from
finished_at, so past reaper_finalize_grace_s it treated the live, long
finalizer as dead and independently re-ran the advance -> a second re-test
went red -> false rollback deploy-staging -> development while the original
finalizer concurrently merged the PR (incident ORCH-111, job 1914).

Add a process-local finalizer-ownership registry (src/finalizer_liveness.py,
never-raise): the monitor mark()s ownership right after the exit_code stamp and
clear()s it in a try/finally around the (verbatim-extracted) finalization tail,
so an exception in the monitor thread still releases ownership and a genuinely
dead finalizer is reaped. The reaper Tier-2 consults the marker only when the
kill-switch is on AND the task stage == deploy-staging AND ownership is active
-> DEFER (no second advance) and fall through to the Tier-3 backstop, which
ignores the marker (a stuck/dead finalizer is still reaped in bounded time).
In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn
process); restart is covered by the startup requeue_running_jobs.

Additive, global kill-switch reaper_finalizer_liveness_enabled (default True;
false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every
check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the
ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes
main. Observability: finalizer_defers_total + finalizer_owned in GET /queue.
Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the
mandatory ORCH-111 regression: red before the fix, green after).

Refs: ORCH-113

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 12:42:04 +03:00

121 lines
5.3 KiB
Python

"""ORCH-113 (adr-0043): process-local finalizer-ownership registry.
Leaf module — pure, process-local, never-raise (pattern of ``serial_gate`` /
``coverage_gate``: imports nothing from ``stage_engine`` / ``launcher`` / the DB,
talks to no network). It records "a LIVE monitor thread is currently finalizing
job X" so the job-reaper can tell a long-running-but-alive finalizer apart from a
genuinely dead one.
Why in-memory is authoritative (ADR-001 / adr-0043): the monitor
(``launcher._monitor_agent``) and the reaper (``job_reaper``) are daemon THREADS
of the SAME single uvicorn process (CMD has no ``--workers``), sharing one SQLite
DB. So liveness of the finalizing thread can be observed in-process. A whole-process
death is covered by the startup ``requeue_running_jobs()`` (``running -> queued``),
which ``main.lifespan`` runs BEFORE the reaper starts — so a restart leaves this
registry empty and the requeued jobs are re-driven cleanly (restart-safe, no durable
state needed).
The bug this closes (incident ORCH-111, deployer job 1914): on the
``deploy-staging -> deploy`` edge the monitor stamps ``agent_runs.finished_at``
FIRST, then runs the heavy edge sub-gates (security -> merge-gate re-test ->
coverage -> image-freshness) synchronously in its own thread — MINUTES — and only
THEN ``_finalize_job``. Reaper Tier-2 measures ``finished_age_s`` from
``finished_at`` (= the START of finalization), so once it exceeds
``reaper_finalize_grace_s`` (300s) it treated the live, long-finalizing monitor as
dead and independently re-ran the same heavy advance -> a second re-test went red ->
false rollback ``deploy-staging -> development`` while the original finalizer
concurrently merged the PR. State diverged.
No own TTL: time-bounding is the reaper's Tier-3 backstop (``reaper_max_running_s``),
which deliberately IGNORES this marker so a truly stuck finalizer is still reaped in
bounded time. Every public function is isolated (``try/except`` -> safe default);
``is_active`` defaults to ``False`` on error (conservative: never block the reaping
of a possibly-dead finalizer).
See docs/work-items/ORCH-113/06-adr/ADR-001-reaper-finalizer-liveness-ownership.md
and the cross-cutting docs/architecture/adr/adr-0043-reaper-finalizer-liveness-ownership.md.
"""
from __future__ import annotations
import logging
import threading
import time
logger = logging.getLogger("orchestrator.finalizer_liveness")
# Process-local ownership registry: {job_id: {"run_id", "stage", "started_ts"}}.
# Guarded by a Lock because the monitor thread writes (mark/clear) while the reaper
# thread reads (is_active/snapshot). All state resets on process restart, which is
# safe (the startup requeue_running_jobs covers the restart path).
_LOCK = threading.Lock()
_OWNED: dict[int, dict] = {}
def mark(job_id: int | None, run_id: int | None, stage: str | None) -> None:
"""Register that a live monitor thread is finalizing ``job_id``.
Called by ``launcher._monitor_agent`` right after the ``exit_code`` stamp (the
earliest moment the reaper can enter Tier-2). ``stage`` is best-effort context
for the snapshot only — the reaper decides the actual stage from ``tasks`` via
its own ``_task_meta`` lookup. No-op when ``job_id is None`` (legacy direct
``launch()`` jobs are not in ``get_running_jobs`` and are unreapable). Never
raises.
"""
if job_id is None:
return
try:
with _LOCK:
_OWNED[job_id] = {
"run_id": run_id,
"stage": stage,
"started_ts": time.time(),
}
except Exception as e: # noqa: BLE001 - never-raise contract
logger.warning("finalizer_liveness.mark failed for job %s: %s", job_id, e)
def clear(job_id: int | None) -> None:
"""Release ownership of ``job_id`` (idempotent).
Called from the ``finally`` of the monitor's finalization tail, so ANY exception
in the monitor thread still releases ownership -> a genuinely dead finalizer is
reaped (FR-4). Never raises.
"""
if job_id is None:
return
try:
with _LOCK:
_OWNED.pop(job_id, None)
except Exception as e: # noqa: BLE001 - never-raise contract
logger.warning("finalizer_liveness.clear failed for job %s: %s", job_id, e)
def is_active(job_id: int | None) -> bool:
"""True iff a live monitor currently owns the finalization of ``job_id``.
Consulted by the reaper Tier-2 branch. Defaults to ``False`` on any error or
when ``job_id is None`` (conservative: never block the reaping of a possibly
dead finalizer). Never raises.
"""
if job_id is None:
return False
try:
with _LOCK:
return job_id in _OWNED
except Exception as e: # noqa: BLE001 - never-raise contract
logger.warning("finalizer_liveness.is_active failed for job %s: %s", job_id, e)
return False
def snapshot() -> dict:
"""Read-only view of current ownership for ``GET /queue`` observability.
Returns ``{"active": <count>, "jobs": [job_id, ...]}``. Never raises.
"""
try:
with _LOCK:
return {"active": len(_OWNED), "jobs": sorted(_OWNED.keys())}
except Exception as e: # noqa: BLE001 - never-raise contract
logger.warning("finalizer_liveness.snapshot failed: %s", e)
return {"active": 0, "jobs": []}