ORCH-118 (inventory-first, docs+tests only): publish an evidence-based map of
every place the orchestrator's control flow consumes (or can consume) an LLM
judgment, mark the control-path axis (C control-path vs P artifact-producer),
define "avoidable LLM control path" as a checkable two-bit predicate, classify
each call-site, and order the deterministic-replacement roadmap. Pin the map to
code with offline structural anti-drift tests.
- docs/architecture/llm-call-sites.md — map + machine-readable inventory block
+ control-path axis + classification + keep-LLM justifications + deterministic
non-agent paths (FR-1/FR-2/FR-3/FR-8).
- docs/architecture/llm-determinization-roadmap.md — ordered candidates BY ROLE,
savings sourced from agent_runs, recommended first slice = deployer staging
(FR-4). No fabricated follow-up Plane-IDs (R3/NFR-6).
- docs/architecture/llm-usage-policy.md — normative principle, keep/replace
criteria via the axis, definition of "avoidable LLM control path" (FR-5/FR-8).
- tests/test_llm_call_site_inventory.py — TC-01/02/03/04/05/06/09/12/13/14.
- tests/test_llm_determinization_docs.py — TC-07/08/11.
- CHANGELOG.md + docs/overview/tech-quality-security.md — golden-source sync (AC-8).
Avoidable LLM control paths = {tester, deployer}; control-path-keep = {reviewer};
not-control-path (P) = {analyst, architect, developer}. Single LLM transport =
launcher._spawn (S0); no alternative transport (TC-12). Runtime untouched:
STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are
byte-for-byte unchanged; no replacement runners implemented (FR-7). Full suite: 2081 passed.
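A minimal sketch of how the two-bit predicate could be checked per call-site. The first bit is the control-path axis (C vs P) from the map; the name and meaning of the second bit (`deterministic_ok`: a deterministic replacement is acceptable) are assumptions made for illustration, as are the `CallSite` and `classify` names. The real inventory block may encode this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallSite:
    role: str
    control_path: bool      # axis bit: True = C (control-path), False = P (artifact-producer)
    deterministic_ok: bool  # hypothetical second bit: a deterministic replacement is acceptable


def classify(site: CallSite) -> str:
    """Map the two bits onto the three classes named in this commit."""
    if not site.control_path:
        # P call-sites are never "avoidable control paths", whatever the second bit is.
        return "not-control-path"
    return "avoidable" if site.deterministic_ok else "control-path-keep"
```

Under these assumptions `tester` and `deployer` carry both bits set, `reviewer` carries only the axis bit, and `analyst` / `architect` / `developer` have the axis bit clear.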
Refs: ORCH-118
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the root class of incident ORCH-114: a pytest/worktree process performed a
REAL write (PATCH issues state=<Done> + comment) against the PRODUCTION Plane
project, because test/staging processes inherit the live Plane token
(PLANE_HEADERS/PROJECT_ID are captured at import — a post-hoc env/token swap is a
no-op) and nothing forced them to write only to the sandbox. Symmetric to the
existing _no_telegram autouse floor.
- New pure never-raise leaf src/plane_write_guard.py (decide/audit_block/
audit_allow), wired into the 3 plane_sync write primitives (update_issue_state /
add_comment / _set_issue_state_direct) via _guard_allows_write, AT CALL TIME,
before any network step. Active ONLY in a test process (pytest in sys.modules /
PYTEST_CURRENT_TEST); live + staging runtimes (uvicorn) are a strict no-op.
- In a test process: default-deny. A write is allowed iff opt-in
(plane_test_write_enabled) AND target project in the sandbox allowlist
(plane_test_sandbox_projects, default = the one SANDBOX id). Prod is blocked even
with opt-in (allowlist sandbox-only); unresolved project -> block (fail-closed).
- Independent second layer: tests/conftest.py::_plane_sandbox_only autouse floor.
Intentionally NO prod-block kill-switch (anti back-door, NFR-6).
- Audit: block -> loud ERROR; sandbox-allow -> INFO.
- Bypass fixtures for the 3 (+1) pre-existing tests that assert on the mocked
write primitive's httpx call (header/URL/state logic). The guard is not a Quality
Gate: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema
untouched.
- Tests: tests/test_orch117_plane_write_isolation.py (TC-01 mandatory ORCH-114
regression + TC-02..TC-14). Docs: CLAUDE.md, architecture/README.md,
operations/INFRA.md, .env.example, CHANGELOG.md.
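The decision rule described in the bullets above can be sketched roughly as follows. This is an illustration, not the contents of src/plane_write_guard.py: the function signatures and parameter names are hypothetical, and the never-raise wrapping and audit logging are omitted.

```python
from typing import Collection, Mapping, Optional


def is_test_process(modules: Collection[str], environ: Mapping[str, str]) -> bool:
    """Test-process detection as described: pytest imported or PYTEST_CURRENT_TEST set."""
    return "pytest" in modules or "PYTEST_CURRENT_TEST" in environ


def decide(
    *,
    in_test_process: bool,
    write_enabled: bool,                # stands in for plane_test_write_enabled
    project_id: Optional[str],          # None or "" models an unresolved project
    sandbox_projects: Collection[str],  # stands in for plane_test_sandbox_projects
) -> bool:
    """Return True if the write may proceed."""
    if not in_test_process:
        return True   # live and staging runtimes: strict no-op
    if not write_enabled:
        return False  # default-deny without opt-in
    if not project_id:
        return False  # unresolved project: fail-closed
    return project_id in sandbox_projects  # prod is never in the allowlist
```

A caller would evaluate this at call time, for example `is_test_process(sys.modules, os.environ)`, rather than at import, so that it cannot be defeated by values captured earlier.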
Refs: ORCH-117
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership
lease + expected-stage CAS):
P1 — documentation = golden source:
- .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS
(canon of 100% start keys, ORCH-101), next to the other gate kill-switches.
- CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags,
ADR links) so a future agent editing advance_stage/reaper/webhooks finds the
ownership invariant in the first mandatory-read doc (ORCH-078 traceability index).
P2 — should-fix:
- docs/overview/ (system showcase, ORCH-011): add transition_lease to
tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and
tech-architecture.md (components).
- ADR-001 D4 alignment: the four side-effectful-edge rollback handlers
(_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate /
_handle_image_freshness) now write `development` through the expected-stage CAS
via a shared _rollback_stage_cas helper (defence against the rollback↔done
contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the
held lease the sole owner always wins; a lost race aborts WITHOUT side effects.
Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1.
- Test isolation: make tests/test_webhooks.py order-independent by pinning the
proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only
passed by relying on import order. Drop the needless module-level ORCH_DB_PATH
setdefault in test_orch114 (fresh_db already isolates db_path).
New regression tests (TC-11): in-region rollback writes route through CAS;
rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`;
kill-switch-off rollback degenerates to the unconditional write.
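The three rollback behaviours exercised by TC-11 can be illustrated with a small sketch. It uses an in-memory dict in place of the database, so it shows only the decision logic, not real atomicity; the function name, the `guarded` flag and the `review` stage are hypothetical placeholders, not the actual _rollback_stage_cas signature.

```python
from typing import Dict


def rollback_stage(
    stages: Dict[str, str],
    task_id: str,
    expected_stage: str,
    *,
    guarded: bool,
) -> bool:
    """Write `development`; return False when the guarded write loses the race."""
    if not guarded:
        # Kill-switch off or out-of-scope repo: the prior unconditional write.
        stages[task_id] = "development"
        return True
    if stages.get(task_id) != expected_stage:
        # CAS lost: another actor already moved the task; abort without side effects.
        return False
    stages[task_id] = "development"
    return True
```

With the guard on, a task already at `done` is left alone; with the guard off, the same call overwrites it, which is the pre-ORCH-114 behaviour.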
ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed.
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single owner. `advance_stage` is re-entrant and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).
Two complementary layers, both additive, under one kill-switch, never-raise:
1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
not start the heavy sub-gates at all (prevention, not post-hoc repair).
2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
a lost race aborts with NO side effect. Also closes the 6 paths that write the
stage while bypassing advance_stage (gitea x5 + plane rollback).
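A minimal sketch of the expected-stage CAS idea, assuming a SQLite backend and a `tasks` table with `id` and `stage` columns (both assumptions; the real `db.update_task_stage_cas` signature and schema may differ). The `staging` stage is a placeholder.

```python
import sqlite3


def update_task_stage_cas(
    conn: sqlite3.Connection, task_id: int, expected_stage: str, new_stage: str
) -> bool:
    """Write new_stage only if the row is still at expected_stage."""
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    # rowcount is 1 when this caller won the race, 0 when the stage had already moved.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT NOT NULL)")
conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'staging')")
conn.commit()

first = update_task_stage_cas(conn, 1, "staging", "done")          # wins
second = update_task_stage_cas(conn, 1, "staging", "development")  # loses, no write
final_stage = conn.execute("SELECT stage FROM tasks WHERE id = 1").fetchone()[0]
```

The loser learns from the return value that it must skip its side effects, which is the difference from the bare `UPDATE ... WHERE id=?`.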
Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat: a blocking 900s merge
re-test cannot emit one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.
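The liveness rule can be sketched as below. How the boot id is generated and how the pid is probed are not stated in this commit, so `BOOT_ID`, `pid_alive` and the `Lease` dataclass are illustrative assumptions; the `os.kill(pid, 0)` probe is POSIX-only.

```python
import os
import uuid
from dataclasses import dataclass
from typing import Callable

# Assumption: one fresh id per process start, so every lease from a prior boot is stale.
BOOT_ID = uuid.uuid4().hex


@dataclass(frozen=True)
class Lease:
    work_item: str
    owner_pid: int
    owner_boot_id: str


def pid_alive(pid: int) -> bool:
    """POSIX-only probe: signal 0 checks for existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_live(
    lease: Lease,
    current_boot_id: str = BOOT_ID,
    pid_probe: Callable[[int], bool] = pid_alive,
) -> bool:
    """A lease is live only if it was taken in this boot and its owner pid still exists."""
    if lease.owner_boot_id != current_boot_id:
        return False  # stale after restart, eligible for reclaim
    return pid_probe(lease.owner_pid)
```

Because no periodic write is required, a long blocking step inside the owner does not make the lease look dead.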
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).
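One way to read the reaper rules above (defer live, reclaim dead, Tier-3 ignores the marker) is the following decision function. It is a simplified sketch under assumptions: the real reaper has more tiers and inputs, and the function name and return labels are hypothetical.

```python
def reaper_decision(
    lease_live: bool, running_s: float, reaper_max_running_s: float
) -> str:
    """Decide what the reaper does with a long-running job that holds a lease."""
    if running_s >= reaper_max_running_s:
        # Tier-3 backstop: ignores the lease marker, so the job's age stays bounded.
        # The reap also force-releases the lease.
        return "reap"
    if lease_live:
        return "defer"    # a live owner is still working; leave it alone
    return "reclaim"      # the owner is dead or from a previous boot
```

The first branch is what keeps the lease from needing its own TTL: the age ceiling comes from `reaper_max_running_s`.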
Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-byte
pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas: byte-for-byte unchanged (one additive table, no epoch column on tasks).
Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.
Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.
ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Staging suite 8/10 PASS, REAL failed: none; C9a/C9b infra-waived (ORCH-061).
staging_status: SUCCESS
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>