feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization

Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 a single zombie blocked the queue for ALL
projects (self-hosting risk). Adds a background daemon
(src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running', so a job is never finalized twice. For exit_code 0
the canonical QG is the source of truth via gate-driven advance, not the
exit code itself.
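
A minimal sketch of the guarded terminal flip described above, assuming the
jobs table lives in SQLite (the "?" placeholders in the diff suggest this,
but it is an assumption). flip_terminal and the 'failed' status value are
hypothetical names for illustration, not the ones in src/job_reaper.py.

```python
import sqlite3


def flip_terminal(conn, job_id, new_status):
    """Move a job from 'running' to a terminal status.

    Returns True only for the caller whose UPDATE actually changed the row,
    so a second reaper tick (or a racing launcher) sees False and backs off.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (new_status, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Demo against an in-memory database (hypothetical minimal schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (id, status) VALUES (1, 'running')")

first = flip_terminal(conn, 1, "failed")   # wins the flip
second = flip_terminal(conn, 1, "failed")  # row is no longer 'running'
final_status = conn.execute("SELECT status FROM jobs WHERE id = 1").fetchone()[0]
```

Because the status check and the write are a single statement, no separate
read-then-write window exists in which two writers could both decide to act.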

Also proactively reclaims a stale merge-lease (dead pid OR expired TTL) by
deleting the lease file only (no git ops), and makes merge finalization
idempotent (pr_already_merged guard + up-to-date short-circuit on
re-drive).
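
The lease reclaim rule (dead pid OR TTL, file delete only) could look roughly
like the sketch below. The lease file layout is not shown in this commit, so
the JSON shape with "pid" and "acquired_at" fields, the function names, and
the choice to leave unreadable leases alone are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def pid_alive(pid):
    """POSIX liveness probe: signal 0 checks existence without signalling."""
    if not isinstance(pid, int) or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def reclaim_stale_lease(lease_path, ttl_s, now):
    """Delete the lease file if its holder is dead or the TTL has expired.

    Returns True if the file was removed. Only touches the file; no git ops.
    """
    try:
        data = json.loads(Path(lease_path).read_text())
    except (OSError, ValueError):
        return False  # missing or unreadable lease: leave it (assumption)
    if not isinstance(data, dict):
        return False
    expired = now - float(data.get("acquired_at", 0)) > ttl_s
    if pid_alive(data.get("pid")) and not expired:
        return False
    try:
        os.unlink(lease_path)
    except FileNotFoundError:
        return False  # someone else reclaimed it first
    return True


# Demo: a lease held by this (live) process, first fresh, then past its TTL.
with tempfile.TemporaryDirectory() as d:
    lease = Path(d) / "merge.lease"
    lease.write_text(json.dumps({"pid": os.getpid(), "acquired_at": 1000.0}))
    kept = reclaim_stale_lease(lease, ttl_s=60, now=1030.0)
    reclaimed = reclaim_stale_lease(lease, ttl_s=60, now=1100.0)
    still_there = lease.exists()
```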

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.
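
For readers unfamiliar with the "idempotent column add" pattern, here is a
sketch of how such a helper is commonly written for SQLite. The real
_ensure_column signature is not shown in this commit; ensure_column below is
a hypothetical stand-in, and SQLite as the storage engine is an assumption.

```python
import sqlite3


def ensure_column(conn, table, column, decl):
    """Add `column` to `table` unless it already exists.

    Returns True if the column was added, False if it was already there.
    `table`, `column` and `decl` are interpolated into SQL, so they must be
    trusted constants, never user input.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


# Demo: calling it on every startup is safe; only the first call alters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
added_first = ensure_column(conn, "jobs", "pid", "INTEGER")
added_again = ensure_column(conn, "jobs", "pid", "INTEGER")
columns = [row[1] for row in conn.execute("PRAGMA table_info(jobs)")]
```

Rows that existed before the column was added read back pid as NULL, which
is why a reaper built on it has to tolerate jobs without a stamped pid.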

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. Restart-safe; each unit of background work never raises.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 15:31:37 +00:00
committed by Dev Agent
parent 9f846b5a50
commit 4bebb921ff
15 changed files with 1341 additions and 5 deletions

@@ -417,6 +417,14 @@ class AgentLauncher:
                 "UPDATE agent_runs SET output_path = ? WHERE id = ?",
                 (output_path, run_id),
             )
+            # ORCH-065: stamp the agent process pid onto the job row so the job-reaper
+            # can probe liveness (os.kill(pid, 0)). proc.pid only exists after Popen,
+            # so this is a second UPDATE next to run_id/started_at (set above in _spawn).
+            if job_id is not None:
+                conn.execute(
+                    "UPDATE jobs SET pid = ? WHERE id = ?",
+                    (proc.pid, job_id),
+                )
             conn.commit()
             conn.close()
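
The os.kill(pid, 0) probe mentioned in the comment above, combined with a
dead-pid streak, might be sketched as follows. This is POSIX-only, and the
names pid_alive and next_dead_streak, along with the threshold of 3 ticks,
are illustrative assumptions rather than the reaper's actual code or defaults.

```python
import os


def pid_alive(pid):
    """Return True if a process with this pid currently exists (POSIX).

    Signal 0 performs the existence and permission checks without
    delivering any signal to the target process.
    """
    if pid is None or pid <= 0:
        return False  # no pid stamped yet, or an invalid value
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # exists, but owned by another user
    return True


def next_dead_streak(streak, alive):
    """Count consecutive dead observations; any live observation resets."""
    return 0 if alive else streak + 1


# Demo: this process is alive; a job with no stamped pid counts as dead.
DEAD_TICKS = 3  # hypothetical threshold, cf. ORCH_REAPER_DEAD_TICKS
self_alive = pid_alive(os.getpid())
streak = 0
for _ in range(DEAD_TICKS):
    streak = next_dead_streak(streak, pid_alive(None))
should_reap = streak >= DEAD_TICKS
```

Requiring several consecutive dead ticks before acting guards against a
single misleading probe; note that pid reuse can still make a dead job look
alive, which is presumably what the max-running backstop is for.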