feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization (ORCH-065) #66

Merged
admin merged 10 commits from feature/ORCH-065-bug-zombie-jobs-merge-lease-ru into main 2026-06-07 19:16:24 +03:00

Summary

  • Closes the zombie-jobs incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 a single zombie blocked the queue for ALL projects (self-hosting risk). New background daemon src/job_reaper.py with three-tier liveness (dead-pid streak / known exit_code / max-running backstop); its only mutating write is an atomic terminal flip guarded by WHERE status='running', so a job is never processed twice. For exit code 0, the advance is decided by the canonical QG (gate-driven advance), not by the exit code alone.
  • Proactive stale merge-lease reclaim (dead pid OR TTL) via file delete only (no git ops); reclaim on every reaper tick and at startup.
  • Idempotent merge finalization: pr_already_merged guard + up-to-date short-circuit on re-drive (no second expensive rebase+re-test).
  • New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; reaper block in GET /queue.
  • Kill-switches and tunables: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED.
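
The guarded terminal flip from the first bullet can be illustrated with a minimal sketch. It assumes an SQLite-backed jobs table; the schema, the function name flip_to_terminal, and the 'failed' status value are hypothetical and not taken from src/job_reaper.py. The point is only that the WHERE status='running' guard lets exactly one caller win.

```python
import sqlite3


def flip_to_terminal(conn: sqlite3.Connection, job_id: int, status: str) -> bool:
    """Set a terminal status only while the row is still 'running'.

    Returns True for the one caller whose UPDATE actually matched the row.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
            (status, job_id),
        )
    return cur.rowcount == 1


# Hypothetical minimal schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, pid INTEGER)")
conn.execute("INSERT INTO jobs (id, status, pid) VALUES (1, 'running', 4242)")
conn.commit()

first = flip_to_terminal(conn, 1, "failed")   # matches the running row
second = flip_to_terminal(conn, 1, "failed")  # row is no longer 'running'
```

Because the status check and the write happen in one UPDATE statement, a second reaper (or the live launcher finishing late) sees zero affected rows and can skip any follow-up work.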

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. Background work is restart-safe, and no unit of background work ever raises.

Docs updated in same PR: docs/architecture/README.md, CHANGELOG.md, .env.example. ADR: docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md.

Test plan

  • tests/test_job_reaper.py (TC-01..08, TC-21)
  • tests/test_merge_lease_reclaim.py (TC-10..15 + pid_alive)
  • tests/test_merge_gate.py (TC-16 pr_already_merged)
  • tests/test_merge_gate_race.py (TC-17 idempotent re-drive)
  • tests/test_queue.py::TestReaperUnblocksQueue (TC-09/TC-18)
  • tests/test_config.py (TC-19 invariants / TC-20 settings)
  • Full suite: 742 passed; ruff clean for changed files

Refs: ORCH-065

admin added 10 commits 2026-06-07 19:14:46 +03:00
Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-process). For exit0 the canonical QG,
not the exit code itself, is the source of truth (gate-driven advance).

Also proactively reclaims stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).
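
A rough sketch of the "dead pid OR TTL" reclaim rule, assuming a JSON lease file with pid and acquired_at fields. The file layout, the reclaim_if_stale name, and the pid_alive body are illustrative assumptions (only the pid_alive name appears in the test plan), and the probe relies on POSIX signal semantics.

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without sending a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def reclaim_if_stale(lease_path: Path, ttl_s: float, now: Optional[float] = None) -> bool:
    """Delete the lease file if its holder is dead OR the lease outlived its TTL."""
    now = time.time() if now is None else now
    try:
        lease = json.loads(lease_path.read_text())
        pid = int(lease["pid"])
        acquired_at = float(lease["acquired_at"])
    except (OSError, ValueError, KeyError, TypeError):
        return False  # missing or unreadable lease: this sketch leaves it alone
    if pid_alive(pid) and now - acquired_at <= ttl_s:
        return False  # holder is alive and within TTL
    lease_path.unlink(missing_ok=True)  # file delete only, no git operations
    return True


with tempfile.TemporaryDirectory() as tmp:
    lease = Path(tmp) / "merge.lease"  # hypothetical lease location
    lease.write_text(json.dumps({"pid": os.getpid(), "acquired_at": time.time()}))
    fresh_reclaimed = reclaim_if_stale(lease, ttl_s=600)
    expired_reclaimed = reclaim_if_stale(lease, ttl_s=600, now=time.time() + 3600)
    still_exists = lease.exists()
```

Calling such a function both at startup and on every reaper tick is harmless, since a missing or already-removed lease simply returns False.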

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.
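
One way an idempotent "ensure column" helper might look, assuming SQLite. The body below is a guess at the pattern, not the project's actual _ensure_column, and the INTEGER declaration for jobs.pid is an assumption.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add a column only if it is missing; safe to call on every startup.

    Identifiers cannot be bound as SQL parameters, so pass trusted constants only.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")

added_first = ensure_column(conn, "jobs", "pid", "INTEGER")   # column is created
added_second = ensure_column(conn, "jobs", "pid", "INTEGER")  # no-op on re-run
columns = [row[1] for row in conn.execute("PRAGMA table_info(jobs)")]
```

Existing rows get NULL for the new column, which fits jobs that were spawned before the pid was stamped.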

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. restart-safe, never-raise per unit of background work.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pr_already_merged guard was defined + unit-tested but consulted by zero
production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path
consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The
actual merge actor is the LLM deployer agent (it merges the feature PR at the
start of the `deploy` stage), so on a reaper re-drive of an already-merged PR
the deployer would blindly re-merge → Gitea error → false BUG-8 rollback; AC-11
("no second merge") was not met deterministically.

Wire the guard at the real consultation point — the deployer prompt — so it
runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR
is already merged. check_branch_mergeable is left untouched (AC-13: check_*
behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a
deploy-stage re-drive where the second-merge risk lives).

- .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule.
- src/merge_gate.py: docstring names the deployer-prompt consultation point.
- docs/architecture/README.md, CHANGELOG.md: state the consultation point so
  golden-source matches implementation.
- tests/test_merge_gate.py: regression test asserting the deployer prompt wires
  the guard (so it can't silently become dead code again).

pytest tests/ -q: 743 passed.

Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes
agent_runs.exit_code FIRST, then does git push / PR / Plane comments before
_finalize_job, and the agent pid is already dead in that window — so the old
"exit_code recorded -> reap now" had no grace and could race a healthy job.
Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job)
BEFORE the atomic claim, so a reaper that lost the race had already enqueued
the next stage (dup advance / dup enqueue), violating ADR-001 Р-1.

Fix:
- Tier-2 grace: reap only once agent_runs.exit_code has been recorded for
  >= reaper_finalize_grace_s (new setting, default 300s; > max finalization
  window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New
  finished_age_s column computed in get_running_jobs.
- claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the
  reconciler pattern) to choose the terminal status, then atomically claim
  'done' FIRST; only the claim winner runs the advance. A loser performs no
  side effects -> no dup advance / dup enqueue.
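
The two fixes can be sketched together under stated assumptions: an SQLite jobs table, a read-only gate_passes callable standing in for the canonical QG evaluation, and an advance callable standing in for advance_stage. The names past_finalize_grace and reap_exit0, and the 'failed' status used when the gate does not pass, are hypothetical.

```python
import sqlite3
from typing import Callable, Optional

FINALIZE_GRACE_S = 300.0  # default named in this commit message


def past_finalize_grace(
    exit_code: Optional[int],
    finished_age_s: Optional[float],
    grace_s: float = FINALIZE_GRACE_S,
) -> bool:
    """Tier-2 applies only once exit_code has been recorded for >= grace_s."""
    if exit_code is None or finished_age_s is None:
        return False
    return finished_age_s >= grace_s


def reap_exit0(
    conn: sqlite3.Connection,
    job_id: int,
    gate_passes: Callable[[], bool],
    advance: Callable[[], None],
) -> bool:
    """Claim-before-act: choose the status read-only, claim it, then act."""
    terminal = "done" if gate_passes() else "failed"  # no side effects yet
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
            (terminal, job_id),
        )
    won = cur.rowcount == 1
    if won and terminal == "done":
        advance()  # only the claim winner triggers the next stage
    return won


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (id, status) VALUES (7, 'running')")
conn.commit()

advanced = []
winners = [
    reap_exit0(conn, 7, lambda: True, lambda: advanced.append(7)),
    reap_exit0(conn, 7, lambda: True, lambda: advanced.append(7)),
]
```

In this ordering the loser of the claim returns before any advance or enqueue happens, which is the property the fix is after.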

Docs (golden source) updated in the same change: ADR-001, global adr-0011,
README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011
link). New tests cover the grace window, lost-claim no-side-effects, and the
already-advanced idempotent path.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=324
All checks were successful
CI / test (push) Successful in 20s
CI / test (pull_request) Successful in 18s
930e65298c
admin force-pushed feature/ORCH-065-bug-zombie-jobs-merge-lease-ru from b4504edc58 to 930e65298c 2026-06-07 19:14:46 +03:00 Compare
admin merged commit bb03350ec9 into main 2026-06-07 19:16:24 +03:00
Reference: admin/orchestrator#66