Commit Graph

5 Commits

Author SHA1 Message Date
6ea4402942 fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)
Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:
  1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
     ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
     not start the heavy sub-gates at all (prevention, not post-hoc repair).
  2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
     a lost race aborts with NO side effect. Also closes the 6 paths that write the
     stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).

Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:28:38 +03:00
7cb1f83f6c fix(reaper): do not re-run deploy-staging finalization while finalizer is alive
On the deploy-staging -> deploy edge the live monitor stamps
agent_runs.finished_at FIRST, then runs the heavy edge sub-gates
(security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES
and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from
finished_at, so past reaper_finalize_grace_s it treated the live, long
finalizer as dead and independently re-ran the advance -> a second re-test
went red -> false rollback deploy-staging -> development while the original
finalizer concurrently merged the PR (incident ORCH-111, job 1914).

Add a process-local finalizer-ownership registry (src/finalizer_liveness.py,
never-raise): the monitor mark()s ownership right after the exit_code stamp and
clear()s it in a try/finally around the (verbatim-extracted) finalization tail,
so an exception in the monitor thread still releases ownership and a genuinely
dead finalizer is reaped. The reaper Tier-2 consults the marker only when the
kill-switch is on AND the task stage == deploy-staging AND ownership is active
-> DEFER (no second advance) and fall through to the Tier-3 backstop, which
ignores the marker (a stuck/dead finalizer is still reaped in bounded time).
In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn
process); restart is covered by the startup requeue_running_jobs.

Additive, global kill-switch reaper_finalizer_liveness_enabled (default True;
false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every
check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the
ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes
main. Observability: finalizer_defers_total + finalizer_owned in GET /queue.
Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the
mandatory ORCH-111 regression: red before the fix, green after).

Refs: ORCH-113

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 13:08:41 +03:00
ebbf2e7a2d feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090)
Introduce the dedicated Plane STOP status as a single declarative task-cancel
mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs
(terminal `cancelled`, never requeued), remove the worktree + delete the remote
feature branch (never main, never force-push), drive the task to the new
system-terminal state `cancelled` and tombstone the natural keys so a later
"To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a
critical merge/deploy window is deferred until the irreversible step finishes
honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to
the `analysis` stage; the only pipeline-start entry point remains "To Analyse".

Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} ->
{done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker
requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged
(`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe,
under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).

New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns
cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example,
CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup
ORs on it) — original UUID recoverable from the parseable suffix.

Refs: ORCH-090

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:31:56 +03:00
720c31393a fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance)
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes
agent_runs.exit_code FIRST, then does git push / PR / Plane comments before
_finalize_job, and the agent pid is already dead in that window — so the old
"exit_code recorded -> reap now" had no grace and could race a healthy job.
Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job)
BEFORE the atomic claim, so a reaper that lost the race had already enqueued
the next stage (dup advance / dup enqueue), violating ADR-001 Р-1.

Fix:
- Tier-2 grace: reap only once agent_runs.exit_code has been recorded for
  >= reaper_finalize_grace_s (new setting, default 300s; > max finalization
  window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New
  finished_age_s column computed in get_running_jobs.
- claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the
  reconciler pattern) to choose the terminal status, then atomically claim
  'done' FIRST; only the claim winner runs the advance. A loser performs no
  side effects -> no dup advance / dup enqueue.

Docs (golden source) updated in the same change: ADR-001, global adr-0011,
README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011
link). New tests cover the grace window, lost-claim no-side-effects, and the
already-advanced idempotent path.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 16:14:45 +00:00
4bebb921ff feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization
Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-process). For exit0 the canonical QG is
the source of truth via gate-driven advance, not "exit0".

Also proactively reclaims stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).

New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.

Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. restart-safe, never-raise per unit of background work.

Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.

Refs: ORCH-065

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 16:14:45 +00:00