feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization

Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 a single zombie blocked the queue of ALL
projects (self-hosting risk).

Adds a background daemon (src/job_reaper.py) with three-tier liveness
(dead-pid streak / known exit_code / max-running backstop) whose only
mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-processing). For exit code 0 the
canonical QG is the source of truth via gate-driven advance, not the
exit code itself.

Also proactively reclaims a stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue.

Kill-switches:
- ORCH_REAPER_ENABLED
- ORCH_REAPER_INTERVAL_S
- ORCH_REAPER_DEAD_TICKS
- ORCH_REAPER_MAX_RUNNING_S
- ORCH_LEASE_RECLAIM_ENABLED

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. Restart-safe; never raises per unit of background work.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.

Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
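
The guarded terminal flip and the idempotent column add described in the message can be sketched as follows. This is a minimal illustration, not the code in src/job_reaper.py: it assumes a SQLite-backed jobs table, and the helper names ensure_column and reap_job, the table layout, and the sample pid are hypothetical.

```python
import sqlite3


def ensure_column(conn, table, column, decl):
    """Add `column` to `table` only if it is missing (safe to call on every start)."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


def reap_job(conn, job_id, terminal_status):
    """Flip a job to a terminal status only if it is still 'running'.

    Returns True when this call performed the flip, False when some other
    writer had already finalized the row.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (terminal_status, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical in-memory schema, used only to exercise the two helpers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
ensure_column(conn, "jobs", "pid", "INTEGER")
ensure_column(conn, "jobs", "pid", "INTEGER")  # second call is a no-op
conn.execute("INSERT INTO jobs (id, status, pid) VALUES (1, 'running', 4242)")

first = reap_job(conn, 1, "failed")   # performs the flip
second = reap_job(conn, 1, "failed")  # row is no longer 'running'
```

Because the WHERE clause carries the status check, the decision "is this job still mine to finalize" and the write happen in one statement, so a reaper tick racing the live monitor cannot finalize the same row twice.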
@@ -296,6 +296,33 @@ class Settings(BaseSettings):
     post_deploy_auto_rollback: bool = False
     post_deploy_base_url: str = "http://localhost:8500"
 
+    # ORCH-065: job-reaper + proactive merge-lease reclaim. A background daemon
+    # thread (modelled on the reconciler) makes "the monitor thread / process died
+    # while a job/lease was held" self-heal WITHOUT a restart. Status (done/queued/
+    # failed) is otherwise only ever set by launcher._monitor_agent -> _finalize_job
+    # inside the live process; a death there left the jobs row 'running' forever and
+    # (at max_concurrency=1) wedged the queue of EVERY project (incidents 07.06: jobs
+    # 236/239/242/254). The same thread proactively reclaims a stale/dead merge-lease
+    # (ORCH-043) instead of waiting for the lazy TTL on the next foreign acquire. See
+    # docs/architecture/adr/adr-0011-job-reaper-lease-reclaim.md.
+    #   reaper_enabled        -> global kill-switch (false -> strictly prior behaviour;
+    #                            only the startup requeue_running_jobs remains).
+    #   reaper_interval_s     -> background scan period (seconds).
+    #   reaper_dead_ticks     -> Tier-1: consecutive ticks a job's pid must be dead
+    #                            before it is reaped (>=2 anti-false-positive; a live
+    #                            long-running agent is NEVER reaped).
+    #   reaper_max_running_s  -> Tier-3 backstop ceiling: a job 'running' longer than
+    #                            this is reaped even when liveness is unknowable. MUST be
+    #                            > max agent_timeout + grace so a legit agent is safe.
+    #   lease_reclaim_enabled -> kill-switch for the proactive stale/dead lease reclaim
+    #                            (false -> only the legacy lazy TTL reclaim in acquire).
+    #   (reuse) merge_lock_timeout_s -> lease TTL; merge_gate_repos -> reclaim scope.
+    reaper_enabled: bool = True
+    reaper_interval_s: int = 60
+    reaper_dead_ticks: int = 2
+    reaper_max_running_s: int = 3600
+    lease_reclaim_enabled: bool = True
+
     # Telegram notifications
     telegram_bot_token: str = ""
     telegram_chat_id: str = ""
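
As a rough illustration of how reaper_dead_ticks and reaper_max_running_s could drive the three liveness tiers named in the comments above, here is a sketch under stated assumptions. JobView, pid_alive, and should_reap are hypothetical names, not taken from src/job_reaper.py. The tier evaluation order, and applying the Tier-3 backstop regardless of pid liveness, are one reading of the settings comments rather than confirmed behaviour. The os.kill(pid, 0) probe is POSIX-only and must not be used on Windows, where os.kill can terminate the target process.

```python
import os
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JobView:
    """Hypothetical projection of a 'running' jobs row plus reaper-side state."""
    pid: Optional[int]
    exit_code: Optional[int]
    started_at: float      # epoch seconds when the job entered 'running'
    dead_streak: int = 0   # consecutive ticks on which the pid looked dead


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without sending a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def should_reap(
    job: JobView,
    now: float,
    dead_ticks: int = 2,
    max_running_s: int = 3600,
    is_alive: Callable[[int], bool] = pid_alive,
) -> Optional[str]:
    """Return the reason a job should be reaped on this tick, or None to leave it."""
    # Tier 1: the pid must look dead on `dead_ticks` consecutive ticks.
    if job.pid is not None:
        if is_alive(job.pid):
            job.dead_streak = 0
        else:
            job.dead_streak += 1
            if job.dead_streak >= dead_ticks:
                return "dead_pid"
    # Tier 2: an exit code was recorded but the status was never finalized.
    if job.exit_code is not None:
        return "exit_code"
    # Tier 3: backstop ceiling, also covers jobs with no pid to probe.
    if now - job.started_at > max_running_s:
        return "max_running"
    return None
```

Requiring a streak rather than a single dead observation is what keeps one transient probe failure from reaping a job; the liveness probe is injected as a parameter so the decision logic can be exercised without real processes.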