ORCH-113: reaper must not re-run deploy-staging finalization while the finalizer is alive #134

Merged
admin merged 8 commits from feature/ORCH-113-bug-job-reaper-must-not-re-run into main 2026-06-15 13:51:58 +03:00

ORCH-113 (bug → escalate full-cycle) — incident ORCH-111

Root cause. On the deploy-staging → deploy edge the live monitor
(launcher._monitor_agent) stamps agent_runs.finished_at/exit_code first,
then runs the heavy edge sub-gates (security → merge-gate re-test → coverage →
image-freshness) synchronously in its own thread — minutes — and only then
_finalize_job. Reaper Tier-2 measures finished_age_s from finished_at (= the
start of finalization), so past reaper_finalize_grace_s=300 it treated the live,
long-finalizing monitor as dead and independently re-ran the same advance. The
second re-test went red → false rollback deploy-staging → development (+ false
developer-retry) while the original finalizer concurrently drove deploy to
SUCCESS and merged the PR. State diverged (incident ORCH-111, deployer job 1914).
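
The misclassification can be sketched as follows. This is a simplified illustration, not the actual reaper code: `AgentRun`, `finalized_at` and `tier2_thinks_dead` are hypothetical names, and only `finished_at`, `finished_age_s` and the 300 s grace come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

REAPER_FINALIZE_GRACE_S = 300  # value quoted in the root-cause description


@dataclass
class AgentRun:
    # Stamped when the agent exits, i.e. when finalization STARTS.
    finished_at: float
    # Hypothetical field: set only after the sub-gates and _finalize_job complete.
    finalized_at: Optional[float] = None


def finished_age_s(run: AgentRun, now: float) -> float:
    """Age measured from finished_at, as Reaper Tier-2 does."""
    return now - run.finished_at


def tier2_thinks_dead(run: AgentRun, now: float) -> bool:
    """Pre-fix heuristic (simplified): not finalized and past the grace window."""
    return run.finalized_at is None and finished_age_s(run, now) > REAPER_FINALIZE_GRACE_S


# A monitor that is two minutes into its sub-gates is left alone,
# but the same live monitor seven minutes in is declared dead.
run = AgentRun(finished_at=1000.0)
early = tier2_thinks_dead(run, now=1000.0 + 120)
late = tier2_thinks_dead(run, now=1000.0 + 420)
```

Because the clock starts at the beginning of finalization, age alone cannot distinguish a dead finalizer from a slow one. That is the gap the ownership marker closes.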

Fix (ADR-001 / adr-0043)

  • src/finalizer_liveness.py (new leaf, never-raise, process-local): ownership
    registry mark()/clear()/is_active()/snapshot() under a threading.Lock.
    Authoritative in-memory because the monitor and the reaper are daemon threads of
    one uvicorn process; restart is covered by the startup requeue_running_jobs.
  • Emission (launcher._monitor_agent): mark() right after the exit_code
    stamp (earliest Tier-2 moment); the finalization tail is extracted verbatim to
    _run_monitor_finalization and wrapped in try/finally with clear() in
    finally — any exception in the monitor thread still releases ownership so a
    genuinely dead finalizer is reaped (FR-4). The marker is written unconditionally
    (the kill-switch gates only the reaper's consultation). Verify the tail is unchanged
    with git diff -w (+49 / −0 on launcher.py).
  • Consultation (job_reaper._reap_job Tier-2): when
    reaper_finalizer_liveness_enabled and stage == "deploy-staging" and
    ownership is active → defer (counter + log), fall through to the Tier-3 backstop,
    which ignores the marker (a stuck/dead finalizer is still reaped in bounded time).
    Global kill-switch only — no per-repo split (the bug is common to every repo with a
    deploy-staging stage).
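
The three pieces above can be sketched together as below. This is a minimal sketch under stated assumptions, not the merged code: the registry is assumed to be keyed by a job id, `run_monitor_tail`, `tier2_decision` and `tier3_decision` are hypothetical helpers, and the "report not owned on internal error" behavior of `is_active` is a guess at what never-raise implies.

```python
import threading

_lock = threading.Lock()
_owned = set()  # job ids whose finalization tail is currently running


def mark(job_id):
    """Record that a live monitor owns finalization of job_id. Never raises."""
    try:
        with _lock:
            _owned.add(job_id)
    except Exception:
        pass


def clear(job_id):
    """Release ownership. Never raises; clearing an unknown id is a no-op."""
    try:
        with _lock:
            _owned.discard(job_id)
    except Exception:
        pass


def is_active(job_id):
    """True while a monitor owns job_id. Assumed to fail open (not owned)."""
    try:
        with _lock:
            return job_id in _owned
    except Exception:
        return False


def snapshot():
    """Sorted copy of the owned ids, e.g. for observability."""
    try:
        with _lock:
            return sorted(_owned)
    except Exception:
        return []


def run_monitor_tail(job_id, finalization_tail):
    """Emission: mark before the tail, always clear after it."""
    mark(job_id)
    try:
        finalization_tail()
    finally:
        clear(job_id)  # runs even if the tail raises


def tier2_decision(job_id, stage, liveness_enabled):
    """Consultation: defer only when all three conditions hold."""
    if liveness_enabled and stage == "deploy-staging" and is_active(job_id):
        return "defer"
    return "advance"


def tier3_decision(running_age_s, max_running_s=5400):
    """Backstop: never consults the marker, so reaping stays bounded."""
    return "reap" if running_age_s > max_running_s else "wait"
```

Keeping the marker write unconditional and gating only `tier2_decision` on the flag means that turning the kill-switch off changes what the reaper reads, not what the monitor writes.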

Invariants (AC-5)

STAGE_TRANSITIONS / QG_CHECKS / every check_* / machine-verdict keys / DB schema
are byte-for-byte unchanged. Zero schema change. reaper_finalize_grace_s / reaper_max_running_s
and the cross-cutting budget reaper_max_running_s (5400) > Σ(gate-work) + grace
(ORCH-065/109/110) are untouched. The change is never-raise, does not restart prod, and does not push main.
Kill-switch ORCH_REAPER_FINALIZER_LIVENESS_ENABLED=false → reaper behaves byte-for-byte as before.
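
For illustration, the kill-switch and the budget inequality can be expressed as two small checks. Both helpers are hypothetical. The project's real settings loader is not shown in this PR, the set of accepted "false" spellings is an assumption, and the gate durations in the usage lines are made-up numbers. Only the variable name, the default of on, 300 and 5400 come from the text.

```python
import os


def liveness_enabled(environ=None):
    """Read the kill-switch; assumed to default to enabled when unset."""
    env = os.environ if environ is None else environ
    raw = env.get("ORCH_REAPER_FINALIZER_LIVENESS_ENABLED", "true")
    # Accepted falsy spellings are an assumption of this sketch.
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def budget_holds(gate_work_s, grace_s=300, max_running_s=5400):
    """Check reaper_max_running_s > sum(gate-work) + grace."""
    return max_running_s > sum(gate_work_s) + grace_s


# Hypothetical per-gate durations in seconds (security, re-test, coverage, freshness).
ok = budget_holds([600, 900, 300, 120])
too_slow = budget_holds([5200])
```

If the inequality ever failed, the Tier-3 backstop could fire on a finalizer that is merely slow, which is why the ceiling and grace are listed as untouched.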

Observability

Additive keys finalizer_liveness_enabled / finalizer_defers_total /
finalizer_owned in the reaper block of GET /queue (existing keys unchanged).
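
A rough sketch of what "additive" means for the reaper block follows. The helper `reaper_queue_block` and the pre-existing key `reaped_total` are hypothetical, and `finalizer_owned` is assumed here to be a count of currently owned jobs; the real payload may differ.

```python
def reaper_queue_block(existing, enabled, defers_total, owned_count):
    """Return a copy of the reaper block with the three new keys added."""
    block = dict(existing)  # never mutate the caller's dict
    additive = {
        "finalizer_liveness_enabled": enabled,
        "finalizer_defers_total": defers_total,
        "finalizer_owned": owned_count,  # assumed to be a count
    }
    for key, value in additive.items():
        # setdefault keeps any existing key untouched.
        block.setdefault(key, value)
    return block


before = {"reaped_total": 3}  # hypothetical pre-existing key
after = reaper_queue_block(before, enabled=True, defers_total=2, owned_count=1)
```

Consumers that read only the old keys see the same values as before; the new keys can be ignored safely.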

Tests

tests/test_orch113_reaper_finalizer_liveness.py — TC-01..TC-08, including the
mandatory ORCH-111 regression (TC-05): red before the fix, green after (verified
by reproducing the false rollback with the kill-switch off). Full suite: 2001 passed.

Docs (README/internals/adr-0043 + work-item docs) were updated upstream in this branch;
CHANGELOG.md updated in this PR.

Refs: ORCH-113

🤖 Generated with Claude Code

admin added 6 commits 2026-06-15 13:08:46 +03:00
On the deploy-staging -> deploy edge the live monitor stamps
agent_runs.finished_at FIRST, then runs the heavy edge sub-gates
(security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES
and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from
finished_at, so past reaper_finalize_grace_s it treated the live, long
finalizer as dead and independently re-ran the advance -> a second re-test
went red -> false rollback deploy-staging -> development while the original
finalizer concurrently merged the PR (incident ORCH-111, job 1914).

Add a process-local finalizer-ownership registry (src/finalizer_liveness.py,
never-raise): the monitor mark()s ownership right after the exit_code stamp and
clear()s it in a try/finally around the (verbatim-extracted) finalization tail,
so an exception in the monitor thread still releases ownership and a genuinely
dead finalizer is reaped. The reaper Tier-2 consults the marker only when the
kill-switch is on AND the task stage == deploy-staging AND ownership is active
-> DEFER (no second advance) and fall through to the Tier-3 backstop, which
ignores the marker (a stuck/dead finalizer is still reaped in bounded time).
In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn
process); restart is covered by the startup requeue_running_jobs.

Additive, global kill-switch reaper_finalizer_liveness_enabled (default True;
false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every
check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the
ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes
main. Observability: finalizer_defers_total + finalizer_owned in GET /queue.
Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the
mandatory ORCH-111 regression: red before the fix, green after).

Refs: ORCH-113

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=696
All checks were successful
CI / test (push) Successful in 4m41s
CI / test (pull_request) Successful in 4m1s
7523b843a5
admin force-pushed feature/ORCH-113-bug-job-reaper-must-not-re-run from 4ae4f81d79 to 7523b843a5 2026-06-15 13:08:46 +03:00
admin added 1 commit 2026-06-15 13:43:27 +03:00
developer(ET): auto-commit from developer run_id=699
All checks were successful
CI / test (push) Successful in 3m22s
CI / test (pull_request) Successful in 3m43s
b62e196710
admin added 1 commit 2026-06-15 13:51:48 +03:00
deploy(ORCH-036): finalize SUCCESS for ORCH-113
All checks were successful
CI / test (push) Successful in 3m9s
CI / test (pull_request) Successful in 3m5s
c8faa1ec23
admin merged commit a1544f4677 into main 2026-06-15 13:51:58 +03:00
Reference: admin/orchestrator#134