fix(plane): sandbox-only fail-closed guard for Plane writes from test process (ORCH-117)

Close the root class of incident ORCH-114: a pytest/worktree process performed a
REAL write (PATCH issues state=<Done> + comment) against the PRODUCTION Plane
project, because test/staging processes inherit the live Plane token
(PLANE_HEADERS/PROJECT_ID are captured at import — a post-hoc env/token swap is a
no-op) and nothing forced them to write only to the sandbox. Symmetric to the
existing _no_telegram autouse floor.

- New pure never-raise leaf src/plane_write_guard.py (decide/audit_block/
  audit_allow), wired into the 3 plane_sync write primitives (update_issue_state /
  add_comment / _set_issue_state_direct) via _guard_allows_write, AT CALL TIME,
  before any network step. Active ONLY in a test process (pytest in sys.modules /
  PYTEST_CURRENT_TEST); live + staging runtimes (uvicorn) are a strict no-op.
- In a test process: default-deny. A write is allowed iff opt-in
  (plane_test_write_enabled) AND target project in the sandbox allowlist
  (plane_test_sandbox_projects, default = the one SANDBOX id). Prod is blocked even
  with opt-in (allowlist sandbox-only); unresolved project -> block (fail-closed).
- Independent second layer: tests/conftest.py::_plane_sandbox_only autouse floor.
  Intentionally NO prod-block kill-switch (anti back-door, NFR-6).
- Audit: block -> loud ERROR; sandbox-allow -> INFO.
- Bypass fixtures for the 3 (+1) pre-existing tests that assert on the mocked
  write primitive's httpx call (header/URL/state logic), the guard is no Quality
  Gate: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema
  untouched.
- Tests: tests/test_orch117_plane_write_isolation.py (TC-01 mandatory ORCH-114
  regression + TC-02..TC-14). Docs: CLAUDE.md, architecture/README.md,
  operations/INFRA.md, .env.example, CHANGELOG.md.

Refs: ORCH-117
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-15 21:16:28 +03:00
committed by deployer
parent 77d3a66356
commit 861b5ee984
14 changed files with 679 additions and 1 deletions

193
src/plane_write_guard.py Normal file
View File

@@ -0,0 +1,193 @@
"""ORCH-117: fail-closed guard for Plane WRITE primitives from a test/worktree process.
Leaf module — pure, never-raise in the live path, config-gated. Mirrors the leaf
pattern of ``src/deploy_status_guard.py`` / ``src/serial_gate.py`` / ``src/cancel.py``:
it imports only ``config`` (and stdlib ``os``/``sys``), never ``plane_sync`` /
``stage_engine`` — the three write primitives that need a verdict call
:func:`decide`, the guard does not live there.
The incident (established fact, ORCH-114). A test/worktree process performed a REAL
write to Plane against the **production** project: ``PATCH …/issues/… state=<Done>``
plus a "Stage: deploy → done" comment, i.e. ``notify_stage_change("ORCH-114",
"deploy","done")`` run from pytest mutated a live board issue ("false Done"). The
root: test/staging processes inherit the live Plane token (``PLANE_HEADERS`` /
``PROJECT_ID`` are captured as literals at module import, so a post-hoc env/token
swap is a no-op, NFR-4), and *nothing* forced them to write only to the sandbox.
The precedent. ``tests/conftest.py::_no_telegram`` is an autouse fixture muting
``send_telegram`` in ALL tests, exactly because "pytest on prod sent REAL Telegram
messages to Slava". The symmetric protection for Plane WRITES did not exist — this
is that protection.
The fix (ADR-001 D1/D3): a low choke-point on the entry of the three write
primitives, evaluated **at call time** (not at import). The guard is active **only
in a test process** (``pytest``-in-process detection) — for the live orchestrator
runtime and the staging runtime (both ``uvicorn src.main:app``, no pytest in the
process) it is a strict no-op (byte-for-byte, NFR-2/NFR-3). In a test process a
write is allowed **iff** simultaneously (a) the dedicated opt-in flag is ON **and**
(b) the target project ∈ the sandbox-allowlist; otherwise BLOCK (default-deny). A
non-resolvable target → BLOCK (fail-closed, NFR-1). An allowed sandbox write is
audited at INFO; a block is audited LOUDLY at ERROR so an ORCH-114-class incident is
obvious, not silent (FR-5 / D7).
Deliberately NO kill-switch for the prod-block (ADR-001 D4 / FR-7 / NFR-6): a guard
that makes a prod write from pytest *physically impossible* must not ship with a
config that re-opens it (that would be the very back-door NFR-6 forbids). The only
reversible regulator is the sandbox-bound opt-in (``plane_test_write_enabled`` +
``plane_test_sandbox_projects``); "disable the guard""allow prod from pytest"
that transition does not exist by design. The independent conftest floor
(``_plane_sandbox_only``, ADR-001 D5) is the second sandbox-bound layer.
This is bugfix-isolation, NOT a Quality Gate and NOT a stage: ``STAGE_TRANSITIONS`` /
``QG_CHECKS`` / ``check_*`` / machine-verdict keys / the DB schema are byte-for-byte
untouched.
"""
from __future__ import annotations
import logging
import os
import sys
from .config import settings
logger = logging.getLogger("orchestrator.plane_write_guard")
# Verdicts returned by decide() (the calling primitive executes them).
ALLOW = True
BLOCK = False
# Reasons (stable slugs — asserted by tests / read in audit lines).
R_LIVE_RUNTIME = "live-runtime" # not a test process -> no-op (prod/staging).
R_AMBIGUOUS = "ambiguous-target" # project_id empty/unresolved -> "don't know => don't write".
R_OPT_IN_DISABLED = "opt-in-disabled" # test process, opt-in OFF -> default-deny.
R_PROD_IN_TEST = "prod-project-in-test" # test process, project NOT in sandbox allowlist.
R_SANDBOX_OPT_IN = "sandbox-opt-in" # test process, opt-in ON + sandbox project -> ALLOW.
R_GUARD_ERROR = "guard-error" # internal error inside the test-path -> fail-closed BLOCK.
# Operation tokens (one per call site) — used only for the audit line.
OP_STATE = "state" # update_issue_state / _set_issue_state_direct (httpx.patch)
OP_COMMENT = "comment" # add_comment (httpx.post)
def _in_test_process() -> bool:
"""True iff this Python process is a pytest/worktree test process (ADR-001 D2).
``"pytest" in sys.modules`` is true for the whole pytest run (collection +
execution) in THIS process, which is exactly the worktree ``python -m pytest``
process from the incident. The live orchestrator and the staging runtime start
via ``uvicorn src.main:app`` and never import pytest into their process, so the
detection never fires there (NFR-2/NFR-3, AC-5/AC-6). ``PYTEST_CURRENT_TEST`` is
a secondary confirming signal pytest sets for the duration of a test body. Both
are read at call time (NFR-4). Never raises: on any error we treat the process
as NOT-a-test (-> live ALLOW), so the guard can never accidentally wedge a
legitimate prod write.
"""
try:
if "pytest" in sys.modules:
return True
if os.environ.get("PYTEST_CURRENT_TEST"):
return True
except Exception: # noqa: BLE001 - never-raise; ambiguity -> "not a test" (live ALLOW).
return False
return False
def _sandbox_allowlist() -> set[str]:
"""Sanitised set of sandbox project ids from ``plane_test_sandbox_projects``.
Empty/blank CSV -> empty set (then EVERY project blocks in a test process,
fail-closed). Never raises.
"""
try:
raw = (settings.plane_test_sandbox_projects or "").strip()
except Exception: # noqa: BLE001 - never-raise.
return set()
if not raw:
return set()
return {tok.strip() for tok in raw.split(",") if tok.strip()}
def decide(project_id: str | None, op: str, work_item_id: str | None = None) -> tuple[bool, str]:
"""Decide whether a Plane WRITE primitive may hit the network (ADR-001 D3).
Returns ``(ok, reason)``. Steps:
1. ``not _in_test_process()`` -> ALLOW (``live-runtime``: prod/staging no-op).
2. ``project_id`` empty/None/unresolved -> BLOCK (``ambiguous-target``, NFR-1).
3. opt-in flag OFF -> BLOCK (``opt-in-disabled``, default-deny).
4. ``project_id`` ∉ sandbox allowlist -> BLOCK (``prod-project-in-test``, AC-3).
5. otherwise -> ALLOW (``sandbox-opt-in``, audit INFO).
never-raise: the live path returns at step 1 BEFORE the try-block, so a prod
write can never be wedged by a guard bug. Once we know we are in a test process,
any internal error fails CLOSED to BLOCK (``guard-error``) — the defect surfaces
loudly rather than re-opening prod (AC-9 / FR-7).
"""
# Step 1 — outside any test process the guard is a strict no-op. Evaluated FIRST
# and OUTSIDE the try-block so a live prod/staging write is never affected.
if not _in_test_process():
return ALLOW, R_LIVE_RUNTIME
# From here on we are in a test process: default-deny, fail-closed on any error.
try:
pid = (project_id or "").strip()
if not pid:
return BLOCK, R_AMBIGUOUS # step 2
if not bool(getattr(settings, "plane_test_write_enabled", False)):
return BLOCK, R_OPT_IN_DISABLED # step 3
if pid not in _sandbox_allowlist():
return BLOCK, R_PROD_IN_TEST # step 4 — sandbox-only, even with opt-in.
return ALLOW, R_SANDBOX_OPT_IN # step 5
except Exception as e: # noqa: BLE001 - never-raise; in-test -> fail CLOSED.
logger.error(
"plane_write_guard.decide error in test-process -> BLOCK (fail-closed): %s", e
)
return BLOCK, R_GUARD_ERROR
def audit_block(project_id: str | None, op: str, work_item_id: str | None, reason: str) -> None:
"""Loud structured audit of a BLOCKED write (FR-5 / D7).
Logged at ERROR so an ORCH-114-class incident (a pytest mutating a non-sandbox
project) is obvious in the run log, not silent. Never raises.
"""
try:
logger.error(
"plane_write_guard BLOCKED Plane %s write from a test process: "
"project_id=%s work_item=%s reason=%s "
"(ORCH-117 fail-closed; this would have mutated a non-sandbox Plane "
"project from pytest — cf. the ORCH-114 incident)",
op, project_id, work_item_id, reason,
)
except Exception: # noqa: BLE001 - logging must never raise.
pass
def audit_allow(project_id: str | None, op: str, work_item_id: str | None,
reason: str = R_SANDBOX_OPT_IN) -> None:
"""Audit (INFO) an ALLOWED real sandbox write from a test process (FR-5 / D7).
Only the ``sandbox-opt-in`` case is audited here — the ``live-runtime`` ALLOW
(prod/staging) is the normal hot path and is intentionally NOT logged to avoid
spamming the production log. Never raises.
"""
try:
logger.info(
"plane_write_guard ALLOWED sandbox Plane %s write from a test process: "
"project_id=%s work_item=%s reason=%s",
op, project_id, work_item_id, reason,
)
except Exception: # noqa: BLE001 - logging must never raise.
pass
def snapshot() -> dict:
"""Read-only view of the guard state (optional observability, D7). Never raises."""
try:
return {
"in_test_process": _in_test_process(),
"opt_in_enabled": bool(getattr(settings, "plane_test_write_enabled", False)),
"sandbox_allowlist": sorted(_sandbox_allowlist()),
}
except Exception as e: # noqa: BLE001 - never-raise.
return {"error": str(e)}