feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization

Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-processing). On exit code 0 the canonical
QG decides the outcome via gate-driven advance; the exit code alone does not.
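
A minimal sketch of the guarded terminal flip described above, assuming a
SQLite `jobs` table with `id` and `status` columns. The helper name
`flip_terminal` and the reduced schema are illustrative assumptions, not the
code in src/job_reaper.py:

```python
import sqlite3


def flip_terminal(conn: sqlite3.Connection, job_id: int, new_status: str) -> bool:
    """Move a job to a terminal status only if it is still 'running'.

    Hypothetical helper: the WHERE guard makes the write a no-op when another
    actor has already finalized the row, so at most one caller "wins".
    """
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (new_status, job_id),
    )
    conn.commit()
    # rowcount is 1 only for the caller whose UPDATE actually changed the row.
    return cur.rowcount == 1


# Assumed, simplified schema for the demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (status) VALUES ('running')")

first = flip_terminal(conn, 1, "failed")   # row was 'running': flip applies
second = flip_terminal(conn, 1, "failed")  # row already terminal: no-op
final_status = conn.execute("SELECT status FROM jobs WHERE id = 1").fetchone()[0]
```

Because the status check and the write happen in one UPDATE statement, a
second reaper tick (or a launcher that is still alive) cannot finalize the
same row twice.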

Also proactively reclaims a stale merge-lease (dead pid OR expired TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).
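
A sketch of the "dead pid OR TTL, file delete only" rule, under loud
assumptions: the lease is taken to be a JSON file with `pid` and `ts` fields,
and `reclaim_if_stale` and this local `pid_alive` are hypothetical stand-ins
(the real lease format and helpers are not shown in this commit page). The
liveness probe is POSIX-only:

```python
import json
import os
import tempfile
import time
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """POSIX-only liveness probe: signal 0 checks existence without killing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def reclaim_if_stale(lease: Path, ttl_s: float, now: Optional[float] = None) -> bool:
    """Delete the lease file if its owner is dead or its TTL has expired.

    Assumed lease format: {"pid": <int>, "ts": <epoch seconds>}.
    Only the file is removed; no git operations are performed.
    """
    now = time.time() if now is None else now
    try:
        data = json.loads(lease.read_text())
        pid, ts = int(data["pid"]), float(data["ts"])
    except (FileNotFoundError, ValueError, KeyError, TypeError):
        return False  # missing or unreadable lease: leave it alone in this sketch
    if (now - ts > ttl_s) or not pid_alive(pid):
        lease.unlink(missing_ok=True)
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    lease = Path(tmp) / "merge.lease"
    # Live owner but an old timestamp: reclaimed on TTL.
    lease.write_text(json.dumps({"pid": os.getpid(), "ts": time.time() - 100}))
    reclaimed_stale = reclaim_if_stale(lease, ttl_s=10)
    gone = not lease.exists()
    # Live owner and a fresh timestamp: left in place.
    lease.write_text(json.dumps({"pid": os.getpid(), "ts": time.time()}))
    reclaimed_fresh = reclaim_if_stale(lease, ttl_s=10)
    kept = lease.exists()
```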

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.
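
One common way to add a column idempotently without a migration, sketched
for SQLite. The function below is an assumption about the general technique,
not the project's actual `_ensure_column`; its name and signature are
illustrative:

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` unless it already exists; return True if added.

    Hypothetical helper. Identifiers are interpolated directly, so callers
    must pass trusted, hard-coded names only.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")

added_first = ensure_column(conn, "jobs", "pid", "INTEGER")   # column created
added_again = ensure_column(conn, "jobs", "pid", "INTEGER")   # already there
columns = [row[1] for row in conn.execute("PRAGMA table_info(jobs)")]
```

Running the check on every startup keeps old databases compatible: existing
rows simply carry a NULL pid until a launcher stamps one.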

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. Restart-safe; never raises per unit of background work.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 15:31:37 +00:00
committed by Dev Agent
parent 9f846b5a50
commit 4bebb921ff
15 changed files with 1341 additions and 5 deletions

@@ -302,3 +302,58 @@ class TestWorkerConcurrency:
        assert count_running_jobs() == 0
        counts = job_status_counts()
        assert counts["failed"] == 1


# ---------------------------------------------------------------------------
# ORCH-065: job-reaper unblocks the shared queue (TC-09) + /queue block (TC-18)
# ---------------------------------------------------------------------------
class TestReaperUnblocksQueue:
    def test_tc09_reap_unblocks_claim_at_concurrency_1(self, monkeypatch):
        """A zombie 'running' row at max_concurrency=1 blocks every claim; once the
        reaper reaps it the next queued job can be claimed (AC-2)."""
        import src.merge_gate as mg
        from src.job_reaper import JobReaper

        monkeypatch.setattr(db.settings, "reaper_dead_ticks", 1)
        monkeypatch.setattr(mg, "pid_alive", lambda pid: False)  # zombie pid dead

        # A zombie row stuck 'running' with a dead pid.
        conn = db.get_db()
        cur = conn.execute(
            "INSERT INTO jobs (agent, repo, status, attempts, max_attempts, pid, "
            "started_at) VALUES ('developer','r','running',2,2,999999,datetime('now'))"
        )
        zombie = cur.lastrowid
        conn.commit()
        conn.close()

        # A second job waits in the queue behind it.
        nxt = enqueue_job("analyst", "r")

        # At concurrency 1 the slot is fully occupied -> nothing else can run.
        assert count_running_jobs() == 1

        monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
        JobReaper().reap_once()  # dead pid, attempts>=max -> failed

        assert get_job(zombie)["status"] == "failed"
        assert count_running_jobs() == 0

        # Queue is unblocked: the next job claims successfully.
        claimed = claim_next_job()
        assert claimed is not None and claimed["id"] == nxt

    def test_tc18_queue_endpoint_has_reaper_block(self):
        """GET /queue exposes the reaper observability block (AC-15).

        Calls the endpoint coroutine directly (no lifespan / no background
        threads / no network) so the test stays hermetic.
        """
        import asyncio
        import src.main as main

        body = asyncio.run(main.queue())
        assert "reaper" in body
        reaper = body["reaper"]
        for key in ("enabled", "interval", "last_run_ts", "reaped_total",
                    "last_reaped", "lease_reclaimed_total"):
            assert key in reaper