Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership
lease + expected-stage CAS):
P1 — documentation = golden source:
- .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS
(canon of 100% start keys, ORCH-101), next to the other gate kill-switches.
- CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags,
ADR links) so a future agent editing advance_stage/reaper/webhooks finds the
ownership invariant in the first mandatory-read doc (ORCH-078 traceability index).
P2 — should-fix:
- docs/overview/ (system showcase, ORCH-011): add transition_lease to
tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and
tech-architecture.md (components).
- ADR-001 D4 alignment: the four side-effectful-edge rollback handlers
(_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate /
_handle_image_freshness) now write `development` through the expected-stage CAS
via a shared _rollback_stage_cas helper (defence against the rollback↔done
contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the
held lease the sole owner always wins; a lost race aborts WITHOUT side effects.
Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1.
- Test isolation: make tests/test_webhooks.py order-independent by pinning the
proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only
passed by relying on import order. Drop the needless module-level ORCH_DB_PATH
setdefault in test_orch114 (fresh_db already isolates db_path).
New regression tests (TC-11): in-region rollback writes route through CAS;
rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`;
kill-switch-off rollback degenerates to the unconditional write.
ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed.
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).
Two complementary layers, both additive, under one kill-switch, never-raise:
1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
not start the heavy sub-gates at all (prevention, not post-hoc repair).
2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
a lost race aborts with NO side effect. Also closes the 6 paths that write the
stage in bypass of advance_stage (gitea x5 + plane rollback).
Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).
Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).
Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.
Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.
ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Staging suite 8/10 PASS, REAL failed: none; C9a/C9b infra-waived (ORCH-061).
staging_status: SUCCESS
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).
Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.
- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
pytest in its own process group (start_new_session) and tree-kills the WHOLE
group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
already CI/tester/staging-validated); fail-safe runs the re-test on any
uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
("ORCH-110","classify_retest_failure","src/merge_gate.py") to
MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
.env.example) updated in the same PR.
Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the observability gap between agent_hung (only tracked jobs by jobs.pid)
and orphaned pytest subprocesses the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps
running for days, starving CPU and failing merge-gate re-test — with no alert.
Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
/proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>