feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization (ORCH-065) #66
Summary
- Zombie jobs kept `jobs.status='running'` forever; at `max_concurrency=1`, one zombie blocked ALL projects' queue (self-hosting risk).
- New background daemon `src/job_reaper.py` with three-tier liveness (dead-pid streak / known `exit_code` / max-running backstop); its only mutating write is an atomic terminal flip guarded by `WHERE status='running'` (no double-process).
- exit0 routes through the canonical QG (gate-driven advance), not "exit0".
- `pr_already_merged` guard + up-to-date short-circuit on re-drive (no second expensive rebase + re-test).
- `jobs.pid` column via idempotent `_ensure_column` (no migration); pid stamped in `launcher._spawn` after `Popen`.
- Reaper start/stop in lifespan; `reaper` block in `GET /queue`.
- Settings: `ORCH_REAPER_ENABLED`, `ORCH_REAPER_INTERVAL_S`, `ORCH_REAPER_DEAD_TICKS`, `ORCH_REAPER_MAX_RUNNING_S`, `ORCH_LEASE_RECLAIM_ENABLED`.

Invariants unchanged (AC-13):
`STAGE_TRANSITIONS`, `QG_CHECKS` registry, `check_branch_mergeable` signature/behaviour, BUG-8 rollback, hook exit codes. Restart-safe, never-raise per unit of background work.

Docs updated in same PR:
`docs/architecture/README.md`, `CHANGELOG.md`, `.env.example`. ADR: `docs/work-items/ORCH-065/06-adr/ADR-001-job-reaper-and-lease-reclaim.md`.

Test plan
- `tests/test_job_reaper.py` (TC-01..08, TC-21)
- `tests/test_merge_lease_reclaim.py` (TC-10..15 + pid_alive)
- `tests/test_merge_gate.py` (TC-16 pr_already_merged)
- `tests/test_merge_gate_race.py` (TC-17 idempotent re-drive)
- `tests/test_queue.py::TestReaperUnblocksQueue` (TC-09/TC-18)
- `tests/test_config.py` (TC-19 invariants / TC-20 settings)

Refs: ORCH-065
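For reviewers, a minimal sketch of the three-tier liveness decision and the guarded terminal flip described in the summary. It is not the code in `src/job_reaper.py`: the function names, the tier ordering, the reason strings, the `id` column, and the two constants (stand-ins for `ORCH_REAPER_DEAD_TICKS` and `ORCH_REAPER_MAX_RUNNING_S`) are assumptions for illustration, as is the use of SQLite and a POSIX host.

```python
import os
import sqlite3
from typing import Optional

# Hypothetical defaults; the real values come from ORCH_REAPER_* settings.
DEAD_TICKS = 3
MAX_RUNNING_S = 3600.0


def pid_alive(pid: Optional[int]) -> bool:
    """Best-effort POSIX probe: signal 0 checks existence without signalling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def classify(exit_code: Optional[int], alive: bool,
             dead_streak: int, running_for_s: float) -> Optional[str]:
    """Return why a 'running' job should be reaped, or None to leave it alone.

    How an exit code of 0 is routed afterwards (the QG) is outside this sketch.
    """
    if exit_code is not None:
        return "exit_code"      # tier: the process finished and left an exit code
    if not alive and dead_streak >= DEAD_TICKS:
        return "dead_pid"       # tier: pid missing for enough consecutive ticks
    if running_for_s > MAX_RUNNING_S:
        return "max_running"    # tier: backstop for anything the others miss
    return None


def flip_to_terminal(conn: sqlite3.Connection, job_id: int, status: str) -> bool:
    """Atomically move a job out of 'running'; True only for the single winner."""
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (status, job_id),
    )
    conn.commit()
    # rowcount is 0 when another actor already flipped the row, so the caller
    # can skip follow-up work instead of processing the job twice.
    return cur.rowcount == 1
```

Because the `WHERE status='running'` predicate lives inside the `UPDATE` itself, two actors racing on the same job cannot both see `True`; the loser gets `False` and does nothing.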
The pr_already_merged guard was defined + unit-tested but consulted by zero production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The actual merge actor is the LLM deployer agent (it merges the feature PR at the start of the `deploy` stage), so on a reaper re-drive of an already-merged PR the deployer would blindly re-merge → Gitea error → false BUG-8 rollback; AC-11 ("no second merge") was not met deterministically.

Wire the guard at the real consultation point, the deployer prompt, so it runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR is already merged. check_branch_mergeable is left untouched (AC-13: check_* behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a deploy-stage re-drive where the second-merge risk lives).

 - .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule.
 - src/merge_gate.py: docstring names the deployer-prompt consultation point.
 - docs/architecture/README.md, CHANGELOG.md: state the consultation point so golden-source matches implementation.
 - tests/test_merge_gate.py: regression test asserting the deployer prompt wires the guard (so it can't silently become dead code again).

pytest tests/ -q: 743 passed.

Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

b4504edc58 to 930e65298c
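The check-before-merge behaviour this commit asks of the deployer can be sketched in plain Python. This is illustrative only: in the PR the guard is consulted from the deployer prompt, and the real signature of `merge_gate.pr_already_merged` is not shown here. `FakePR` and `guarded_merge` are hypothetical names, and `FakePR.merge` only imitates a forge rejecting a second merge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakePR:
    """Hypothetical in-memory stand-in for a pull request on the forge."""
    merged: bool = False
    merge_calls: int = 0

    def merge(self) -> None:
        # Imitates the forge rejecting a merge of an already-merged PR.
        if self.merged:
            raise RuntimeError("PR is already merged")
        self.merged = True
        self.merge_calls += 1


def guarded_merge(is_merged: Callable[[], bool],
                  merge: Callable[[], None]) -> str:
    """Consult the guard first; merge only when the PR is not merged yet."""
    if is_merged():
        return "already-merged"   # re-drive path: no second merge, no error
    merge()
    return "merged"
```

Calling `guarded_merge` twice on the same PR merges once and then no-ops, which is the property AC-11 ("no second merge") describes.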