orchestrator

Author	SHA1	Message	Date
claude-bot	6ea4402942	fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114) Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful stage transitions had no single ownership. `advance_stage` is re-enterable and wrote the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the same transition independently. A concurrent or post-restart re-entry therefore re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild / prod-deploy initiation) and produced a contradictory rollback<->done (incident ORCH-111, job 1914 / PR #130). Two complementary layers, both additive, under one kill-switch, never-raise: 1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on ENTRY to the side-effectful region: a second actor that sees a LIVE owner does not start the heavy sub-gates at all (prevention, not post-hoc repair). 2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE: a lost race aborts with NO side effect. Also closes the 6 paths that write the stage in bypass of advance_stage (gitea x5 + plane rollback). Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so the cross-cutting budget invariant ORCH-065/109/110/113 is untouched. 
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a durable cross-path lease: the reaper consults it on all relevant paths (defer live, reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease); reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged (it remains the kill-switch-off fallback). Scope: self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched). Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas: unchanged byte-for-byte (one additive table, no epoch column on tasks). Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>. Coverage: tests/test_orch114_transition_ownership.py (TC-01: mandatory regression of the ORCH-111 class, red before fix, green after; TC-02..TC-14). Full suite green (2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage were updated to spy on the new commit_stage_cas write path. ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:38 +03:00
claude-bot	7cb1f83f6c	fix(reaper): do not re-run deploy-staging finalization while finalizer is alive On the deploy-staging -> deploy edge the live monitor stamps agent_runs.finished_at FIRST, then runs the heavy edge sub-gates (security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from finished_at, so past reaper_finalize_grace_s it treated the live, long finalizer as dead and independently re-ran the advance -> a second re-test went red -> false rollback deploy-staging -> development while the original finalizer concurrently merged the PR (incident ORCH-111, job 1914). Add a process-local finalizer-ownership registry (src/finalizer_liveness.py, never-raise): the monitor mark()s ownership right after the exit_code stamp and clear()s it in a try/finally around the (verbatim-extracted) finalization tail, so an exception in the monitor thread still releases ownership and a genuinely dead finalizer is reaped. The reaper Tier-2 consults the marker only when the kill-switch is on AND the task stage == deploy-staging AND ownership is active -> DEFER (no second advance) and fall through to the Tier-3 backstop, which ignores the marker (a stuck/dead finalizer is still reaped in bounded time). In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn process); restart is covered by the startup requeue_running_jobs. Additive, global kill-switch reaper_finalizer_liveness_enabled (default True; false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes main. Observability: finalizer_defers_total + finalizer_owned in GET /queue. Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the mandatory ORCH-111 regression: red before the fix, green after). 
Refs: ORCH-113 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 13:08:41 +03:00
claude-bot	ebbf2e7a2d	feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090) Introduce the dedicated Plane STOP status as a single declarative task-cancel mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs (terminal `cancelled`, never requeued), remove the worktree + delete the remote feature branch (never main, never force-push), drive the task to the new system-terminal state `cancelled` and tombstone the natural keys so a later "To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a critical merge/deploy window is deferred until the irreversible step finishes honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to the `analysis` stage; the only pipeline-start entry point remains "To Analyse". Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} -> {done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged (`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe, under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression). New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block. Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345). Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example, CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup ORs on it) — original UUID recoverable from the parseable suffix. Refs: ORCH-090 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:31:56 +03:00
claude-bot	720c31393a	fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance) Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	4bebb921ff	feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization Closes the "zombie jobs" incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 one zombie blocked ALL projects' queue (self-hosting risk). Adds a background daemon (src/job_reaper.py) with three-tier liveness (dead-pid streak / known exit_code / max-running backstop) whose only mutating write is an atomic terminal flip guarded by WHERE status='running' (no double-process). For exit0 the canonical QG is the source of truth via gate-driven advance, not "exit0". Also proactively reclaims stale merge-lease (dead pid OR TTL) via file delete only (no git ops), and makes merge finalization idempotent (pr_already_merged guard + up-to-date short-circuit on re-drive). New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; "reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED. Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. restart-safe, never-raise per unit of background work. Docs: docs/architecture/README.md, CHANGELOG.md, .env.example. Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py, tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17), tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
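The guarded writes these commits describe (the expected-stage CAS from ORCH-114 and the reaper's atomic terminal flip `WHERE status='running'`) can be sketched roughly as below. This is a minimal illustration, not the repository's code: it assumes a SQLite backend and simplified `tasks` / `jobs` tables, and the function names `update_stage_cas` and `claim_terminal` are hypothetical stand-ins rather than the real `db.update_task_stage_cas` signature.

```python
import sqlite3


def update_stage_cas(conn, task_id, expected_stage, new_stage):
    """Write new_stage only if the row still holds expected_stage (hypothetical helper)."""
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    # rowcount == 1 means this caller won the race; 0 means someone else already moved the stage.
    return cur.rowcount == 1


def claim_terminal(conn, job_id, status):
    """Flip a job to a terminal status only while it is still 'running' (hypothetical helper)."""
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (status, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


# Assumed, simplified schema for the demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT)")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO tasks VALUES (1, 'deploy-staging')")
conn.execute("INSERT INTO jobs VALUES (1, 'running')")
conn.commit()

# Two actors attempt the same transition; only the first write applies.
winner = update_stage_cas(conn, 1, "deploy-staging", "deploy")
loser = update_stage_cas(conn, 1, "deploy-staging", "deploy")

# Claim-before-act: only the claim winner may go on to run side effects.
first_claim = claim_terminal(conn, 1, "done")
second_claim = claim_terminal(conn, 1, "done")
```

In this sketch the caller that gets `False` back is expected to stop without running any side effect, which is the "lost race aborts with NO side effect" behaviour the commit messages describe.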

5 Commits
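The process-local finalizer-ownership registry from the ORCH-113 commit (mark after the exit_code stamp, clear in a try/finally, reaper Tier-2 defers while ownership is active) might look roughly like the sketch below. Only the `mark` / `clear` idea and the three Tier-2 conditions come from the commit message; `is_owned`, `finalize_with_ownership`, `tier2_decision`, and the job-id keying are assumptions made for illustration.

```python
import threading

_lock = threading.Lock()
_owned = set()  # job ids whose finalization tail is currently running in this process


def mark(job_id):
    with _lock:
        _owned.add(job_id)


def clear(job_id):
    with _lock:
        _owned.discard(job_id)


def is_owned(job_id):
    with _lock:
        return job_id in _owned


def finalize_with_ownership(job_id, finalize):
    """Run the finalization tail while holding ownership (hypothetical wrapper)."""
    mark(job_id)
    try:
        finalize(job_id)
    finally:
        # Released even if finalize raises, so a genuinely dead finalizer can still be reaped.
        clear(job_id)


def tier2_decision(job_id, stage, enabled=True):
    """Defer only when the switch is on, the stage is deploy-staging, and ownership is active."""
    if enabled and stage == "deploy-staging" and is_owned(job_id):
        return "defer"
    return "reap"


# While the finalizer runs, the reaper defers; once it returns, the job is reapable again.
seen = []
finalize_with_ownership(1914, lambda j: seen.append(tier2_decision(j, "deploy-staging")))
after = tier2_decision(1914, "deploy-staging")


def failing(job_id):
    raise RuntimeError("finalization failed")


try:
    finalize_with_ownership(7, failing)
except RuntimeError:
    pass
owned_after_failure = is_owned(7)
```

A registry like this only holds for a single process, which matches the commit's note that restart is covered separately by the startup requeue and that the Tier-3 backstop ignores the marker.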