feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization

Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-process). For exit0 the canonical QG is
the source of truth via gate-driven advance, not "exit0".

Also proactively reclaims stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. restart-safe, never-raise per unit of background work.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-06-07 15:31:37 +00:00
committed by Dev Agent
parent 9f846b5a50
commit 4bebb921ff
15 changed files with 1341 additions and 5 deletions

View File

@@ -117,6 +117,28 @@ ORCH_RECONCILE_GRACE_OVERRIDES_JSON=
ORCH_RECONCILE_NOTIFY_UNBLOCK=true
ORCH_RECONCILE_SKIP_BLOCKED_ENABLED=true
# ORCH-065: job-reaper + proactive merge-lease reclaim. A background daemon thread
# (src/job_reaper.py, started LAST in main.lifespan after requeue_running_jobs) reaps
# zombie 'running' jobs whose monitor/process died before writing the terminal status
# (one zombie at max_concurrency=1 blocks the whole shared queue) and periodically
# reclaims dead/stale merge-leases. Liveness is three-tier: Tier-1 dead jobs.pid
# (os.kill(pid,0)) after REAPER_DEAD_TICKS consecutive dead ticks (anti-false-positive
# for a live agent); Tier-2 agent_runs.exit_code recorded but job still 'running';
# Tier-3 backstop after REAPER_MAX_RUNNING_S. The terminal flip carries an atomic
# status='running' guard so it never double-processes a row racing requeue_running_jobs.
# REAPER_ENABLED -> global kill-switch (false -> strictly prior behaviour).
# REAPER_INTERVAL_S -> background scan period (seconds).
# REAPER_DEAD_TICKS -> consecutive dead-pid ticks before reaping (Tier-1, >=2).
# REAPER_MAX_RUNNING_S -> Tier-3 backstop ceiling; must exceed max agent_timeout+grace.
# LEASE_RECLAIM_ENABLED -> kill-switch for the proactive stale/dead lease reclaim
# (false -> only the legacy lazy TTL reclaim in acquire_merge_lease).
# (reuse) ORCH_MERGE_LOCK_TIMEOUT_S -> lease TTL; ORCH_MERGE_GATE_REPOS -> reclaim scope.
ORCH_REAPER_ENABLED=true
ORCH_REAPER_INTERVAL_S=60
ORCH_REAPER_DEAD_TICKS=2
ORCH_REAPER_MAX_RUNNING_S=3600
ORCH_LEASE_RECLAIM_ENABLED=true
# ORCH-021: post-deploy production monitoring + degradation reaction. After the
# terminal deploy->done transition for an applicable repo, a reserved-agent job
# `post-deploy-monitor` (no LLM, modelled on deploy-finalizer) probes prod over a