ORCH-113: reaper must not re-run deploy-staging finalization while the finalizer is alive #134
ORCH-113 (bug → escalate full-cycle) — incident ORCH-111
Root cause. On the `deploy-staging → deploy` edge, the live monitor
(`launcher._monitor_agent`) stamps `agent_runs.finished_at` / `exit_code` first,
then runs the heavy edge sub-gates (security → merge-gate re-test → coverage →
image-freshness) synchronously in its own thread, which takes minutes, and only
then calls `_finalize_job`. Reaper Tier-2 measures `finished_age_s` from
`finished_at` (= the start of finalization), so past `reaper_finalize_grace_s=300`
it treated the live, long-finalizing monitor as dead and independently re-ran the
same advance. The second re-test went red → false rollback
`deploy-staging → development` (+ false `developer-retry`) while the original
finalizer concurrently drove deploy to SUCCESS and merged the PR. State diverged
(incident ORCH-111, deployer job 1914).
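
The timing behind the false positive can be worked through with a few lines of arithmetic. This is only an illustrative sketch: the `Run` dataclass, the helper names, and the 420-second finalization are assumptions made up for the example; only the 300-second grace comes from this PR.

```python
from dataclasses import dataclass

REAPER_FINALIZE_GRACE_S = 300  # value quoted in this PR


@dataclass
class Run:
    finished_at: float     # stamped BEFORE the edge sub-gates start
    finalization_s: float  # hypothetical duration of sub-gates + _finalize_job


def tier2_thinks_dead(run: Run, now: float) -> bool:
    """Pre-fix Tier-2 rule: age since finished_at exceeds the grace."""
    finished_age_s = now - run.finished_at
    return finished_age_s > REAPER_FINALIZE_GRACE_S


def finalizer_still_running(run: Run, now: float) -> bool:
    """True while the monitor thread is still inside its finalization tail."""
    return now < run.finished_at + run.finalization_s


# Illustrative numbers: finalization takes 7 minutes, the reaper looks at t=310 s.
run = Run(finished_at=0.0, finalization_s=420.0)
now = 310.0
false_positive = tier2_thinks_dead(run, now) and finalizer_still_running(run, now)
print(false_positive)
```

With these numbers the reaper declares the run dead 110 seconds before the live finalizer would have completed, which is the window in which the duplicate advance ran.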
Fix (ADR-001 / adr-0043)
- `src/finalizer_liveness.py` (new leaf, never-raise, process-local): ownership
  registry `mark()` / `clear()` / `is_active()` / `snapshot()` under a
  `threading.Lock`. Authoritative in-memory because the monitor and the reaper are
  daemon threads of one uvicorn process; restart is covered by the startup
  `requeue_running_jobs`.
- `launcher._monitor_agent`: `mark()` right after the `exit_code` stamp (earliest
  Tier-2 moment); the finalization tail is extracted verbatim to
  `_run_monitor_finalization` and wrapped in `try/finally` with `clear()` in
  `finally`, so any exception in the monitor thread still releases ownership and a
  genuinely dead finalizer is reaped (FR-4). The marker is written unconditionally
  (the kill-switch gates only the reaper's consultation). Verify the tail is
  unchanged with `git diff -w` (+49 / −0 on launcher.py).
- `job_reaper._reap_job` Tier-2: when `reaper_finalizer_liveness_enabled` and stage
  `== "deploy-staging"` and ownership is active → defer (counter + log), fall
  through to the Tier-3 backstop, which ignores the marker (a stuck/dead finalizer
  is still reaped in bounded time).
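
A minimal sketch of the ownership pattern described above, not the code in `src/finalizer_liveness.py`. The class name `FinalizerLiveness`, keying by job id, storing a monotonic timestamp, and the `run_monitor_tail` helper are all assumptions for illustration; only the four method names, the lock, and the `try/finally` release come from this PR.

```python
import threading
import time
from typing import Callable, Dict


class FinalizerLiveness:
    """Process-local registry of jobs whose finalizer is currently running."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owned: Dict[int, float] = {}  # job_id -> monotonic mark time

    def mark(self, job_id: int) -> None:
        try:
            with self._lock:
                self._owned[job_id] = time.monotonic()
        except Exception:
            pass  # never-raise: liveness bookkeeping must not break the monitor

    def clear(self, job_id: int) -> None:
        try:
            with self._lock:
                self._owned.pop(job_id, None)
        except Exception:
            pass

    def is_active(self, job_id: int) -> bool:
        try:
            with self._lock:
                return job_id in self._owned
        except Exception:
            return False  # on doubt, report "not owned" so reaping is not blocked

    def snapshot(self) -> Dict[int, float]:
        try:
            with self._lock:
                return dict(self._owned)  # copy, so callers cannot mutate state
        except Exception:
            return {}


def run_monitor_tail(registry: FinalizerLiveness, job_id: int,
                     finalization: Callable[[], None]) -> None:
    """Hypothetical stand-in for the monitor: mark, finalize, always clear."""
    registry.mark(job_id)
    try:
        finalization()
    finally:
        registry.clear(job_id)  # released even if finalization raises


registry = FinalizerLiveness()
observed = []
run_monitor_tail(registry, 1914, lambda: observed.append(registry.is_active(1914)))
print(observed, registry.is_active(1914))
```

The `finally` clause is the part that matters for FR-4: if the finalization callable raises, ownership is still dropped, so the reaper is never blocked by a marker whose owner has died.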
Global kill-switch only; no per-repo split (the bug is common to every repo with a
`deploy-staging` stage).

Invariants (AC-5)
- `STAGE_TRANSITIONS` / `QG_CHECKS` / every `check_*` / machine-verdict keys / DB
  schema: byte-for-byte. Zero schema change.
- `reaper_finalize_grace_s` / `reaper_max_running_s` and the cross-cutting budget
  `reaper_max_running_s (5400) > Σ(gate-work) + grace` (ORCH-065/109/110) untouched.
- never-raise; does not restart prod; does not push `main`.

Kill-switch
`ORCH_REAPER_FINALIZER_LIVENESS_ENABLED=false` → reaper behaves byte-for-byte as before.

Observability
Additive keys `finalizer_liveness_enabled` / `finalizer_defers_total` /
`finalizer_owned` in the `reaper` block of `GET /queue` (existing keys unchanged).

Tests
`tests/test_orch113_reaper_finalizer_liveness.py`: TC-01..TC-08, including the
mandatory ORCH-111 regression (TC-05): red before the fix, green after (verified
by reproducing the false rollback with the kill-switch off). Full suite: 2001 passed.
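
The behaviour the regression exercises can be pictured as a pure decision function. This sketch is not the contents of the test file: `reap_decision`, its return labels, and the simplified tier rules (Tier-2 on `finished_age_s`, Tier-3 on total running age) are assumptions; the two thresholds and the three defer conditions are taken from this PR.

```python
REAPER_FINALIZE_GRACE_S = 300  # Tier-2 grace quoted in this PR
REAPER_MAX_RUNNING_S = 5400    # Tier-3 budget quoted in this PR


def reap_decision(stage: str, finished_age_s: float, running_age_s: float,
                  owned: bool, liveness_enabled: bool = True) -> str:
    """Return 'reap-tier2', 'reap-tier3', 'defer' or 'wait' (hypothetical labels)."""
    deferred = False
    if finished_age_s > REAPER_FINALIZE_GRACE_S:
        if liveness_enabled and stage == "deploy-staging" and owned:
            deferred = True  # a live finalizer owns the job: skip Tier-2
        else:
            return "reap-tier2"
    # Tier-3 backstop ignores the ownership marker entirely.
    if running_age_s > REAPER_MAX_RUNNING_S:
        return "reap-tier3"
    return "defer" if deferred else "wait"


# Fix enabled: the long-finalizing job is left alone.
print(reap_decision("deploy-staging", 310, 900, owned=True))
# Kill-switch off: the pre-fix behaviour, where Tier-2 reaps a live finalizer.
print(reap_decision("deploy-staging", 310, 900, owned=True, liveness_enabled=False))
```

Keeping the Tier-3 check outside the ownership branch is what bounds the damage of a stuck finalizer: the marker can delay reaping, but only until the overall running budget is exhausted.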
Docs (README/internals/adr-0043 + work-item docs) were updated upstream in this branch;
`CHANGELOG.md` updated in this PR.

Refs: ORCH-113
🤖 Generated with Claude Code
4ae4f81d79 to 7523b843a5