fix(plane): sandbox-only fail-closed guard for Plane writes from test process (ORCH-117)
Close the root class of incident ORCH-114: a pytest/worktree process performed a REAL write (PATCH issues state=<Done> + comment) against the PRODUCTION Plane project, because test/staging processes inherit the live Plane token (PLANE_HEADERS/PROJECT_ID are captured at import — a post-hoc env/token swap is a no-op) and nothing forced them to write only to the sandbox. Symmetric to the existing _no_telegram autouse floor. - New pure never-raise leaf src/plane_write_guard.py (decide/audit_block/ audit_allow), wired into the 3 plane_sync write primitives (update_issue_state / add_comment / _set_issue_state_direct) via _guard_allows_write, AT CALL TIME, before any network step. Active ONLY in a test process (pytest in sys.modules / PYTEST_CURRENT_TEST); live + staging runtimes (uvicorn) are a strict no-op. - In a test process: default-deny. A write is allowed iff opt-in (plane_test_write_enabled) AND target project in the sandbox allowlist (plane_test_sandbox_projects, default = the one SANDBOX id). Prod is blocked even with opt-in (allowlist sandbox-only); unresolved project -> block (fail-closed). - Independent second layer: tests/conftest.py::_plane_sandbox_only autouse floor. Intentionally NO prod-block kill-switch (anti back-door, NFR-6). - Audit: block -> loud ERROR; sandbox-allow -> INFO. - Bypass fixtures for the 3 (+1) pre-existing tests that assert on the mocked write primitive's httpx call (header/URL/state logic), the guard is no Quality Gate: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema untouched. - Tests: tests/test_orch117_plane_write_isolation.py (TC-01 mandatory ORCH-114 regression + TC-02..TC-14). Docs: CLAUDE.md, architecture/README.md, operations/INFRA.md, .env.example, CHANGELOG.md. Refs: ORCH-117 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
193
src/plane_write_guard.py
Normal file
193
src/plane_write_guard.py
Normal file
@@ -0,0 +1,193 @@
|
||||
"""ORCH-117: fail-closed guard for Plane WRITE primitives from a test/worktree process.
|
||||
|
||||
Leaf module — pure, never-raise in the live path, config-gated. Mirrors the leaf
|
||||
pattern of ``src/deploy_status_guard.py`` / ``src/serial_gate.py`` / ``src/cancel.py``:
|
||||
it imports only ``config`` (and stdlib ``os``/``sys``), never ``plane_sync`` /
|
||||
``stage_engine`` — the three write primitives that need a verdict call
|
||||
:func:`decide`, the guard does not live there.
|
||||
|
||||
The incident (established fact, ORCH-114). A test/worktree process performed a REAL
|
||||
write to Plane against the **production** project: ``PATCH …/issues/… state=<Done>``
|
||||
plus a "Stage: deploy → done" comment, i.e. ``notify_stage_change("ORCH-114",
|
||||
"deploy","done")`` run from pytest mutated a live board issue ("false Done"). The
|
||||
root: test/staging processes inherit the live Plane token (``PLANE_HEADERS`` /
|
||||
``PROJECT_ID`` are captured as literals at module import, so a post-hoc env/token
|
||||
swap is a no-op, NFR-4), and *nothing* forced them to write only to the sandbox.
|
||||
|
||||
The precedent. ``tests/conftest.py::_no_telegram`` is an autouse fixture muting
|
||||
``send_telegram`` in ALL tests, exactly because "pytest on prod sent REAL Telegram
|
||||
messages to Slava". The symmetric protection for Plane WRITES did not exist — this
|
||||
is that protection.
|
||||
|
||||
The fix (ADR-001 D1/D3): a low choke-point on the entry of the three write
|
||||
primitives, evaluated **at call time** (not at import). The guard is active **only
|
||||
in a test process** (``pytest``-in-process detection) — for the live orchestrator
|
||||
runtime and the staging runtime (both ``uvicorn src.main:app``, no pytest in the
|
||||
process) it is a strict no-op (byte-for-byte, NFR-2/NFR-3). In a test process a
|
||||
write is allowed **iff** simultaneously (a) the dedicated opt-in flag is ON **and**
|
||||
(b) the target project ∈ the sandbox-allowlist; otherwise BLOCK (default-deny). A
|
||||
non-resolvable target → BLOCK (fail-closed, NFR-1). An allowed sandbox write is
|
||||
audited at INFO; a block is audited LOUDLY at ERROR so an ORCH-114-class incident is
|
||||
obvious, not silent (FR-5 / D7).
|
||||
|
||||
Deliberately NO kill-switch for the prod-block (ADR-001 D4 / FR-7 / NFR-6): a guard
|
||||
that makes a prod write from pytest *physically impossible* must not ship with a
|
||||
config that re-opens it (that would be the very back-door NFR-6 forbids). The only
|
||||
reversible regulator is the sandbox-bound opt-in (``plane_test_write_enabled`` +
|
||||
``plane_test_sandbox_projects``); "disable the guard" ≠ "allow prod from pytest" —
|
||||
that transition does not exist by design. The independent conftest floor
|
||||
(``_plane_sandbox_only``, ADR-001 D5) is the second sandbox-bound layer.
|
||||
|
||||
This is bugfix-isolation, NOT a Quality Gate and NOT a stage: ``STAGE_TRANSITIONS`` /
|
||||
``QG_CHECKS`` / ``check_*`` / machine-verdict keys / the DB schema are byte-for-byte
|
||||
untouched.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import sys
|
||||
|
||||
from .config import settings
|
||||
|
||||
logger = logging.getLogger("orchestrator.plane_write_guard")
|
||||
|
||||
# Verdicts returned by decide() (the calling primitive executes them).
|
||||
ALLOW = True
|
||||
BLOCK = False
|
||||
|
||||
# Reasons (stable slugs — asserted by tests / read in audit lines).
|
||||
R_LIVE_RUNTIME = "live-runtime" # not a test process -> no-op (prod/staging).
|
||||
R_AMBIGUOUS = "ambiguous-target" # project_id empty/unresolved -> "don't know => don't write".
|
||||
R_OPT_IN_DISABLED = "opt-in-disabled" # test process, opt-in OFF -> default-deny.
|
||||
R_PROD_IN_TEST = "prod-project-in-test" # test process, project NOT in sandbox allowlist.
|
||||
R_SANDBOX_OPT_IN = "sandbox-opt-in" # test process, opt-in ON + sandbox project -> ALLOW.
|
||||
R_GUARD_ERROR = "guard-error" # internal error inside the test-path -> fail-closed BLOCK.
|
||||
|
||||
# Operation tokens (one per call site) — used only for the audit line.
|
||||
OP_STATE = "state" # update_issue_state / _set_issue_state_direct (httpx.patch)
|
||||
OP_COMMENT = "comment" # add_comment (httpx.post)
|
||||
|
||||
|
||||
def _in_test_process() -> bool:
|
||||
"""True iff this Python process is a pytest/worktree test process (ADR-001 D2).
|
||||
|
||||
``"pytest" in sys.modules`` is true for the whole pytest run (collection +
|
||||
execution) in THIS process, which is exactly the worktree ``python -m pytest``
|
||||
process from the incident. The live orchestrator and the staging runtime start
|
||||
via ``uvicorn src.main:app`` and never import pytest into their process, so the
|
||||
detection never fires there (NFR-2/NFR-3, AC-5/AC-6). ``PYTEST_CURRENT_TEST`` is
|
||||
a secondary confirming signal pytest sets for the duration of a test body. Both
|
||||
are read at call time (NFR-4). Never raises: on any error we treat the process
|
||||
as NOT-a-test (-> live ALLOW), so the guard can never accidentally wedge a
|
||||
legitimate prod write.
|
||||
"""
|
||||
try:
|
||||
if "pytest" in sys.modules:
|
||||
return True
|
||||
if os.environ.get("PYTEST_CURRENT_TEST"):
|
||||
return True
|
||||
except Exception: # noqa: BLE001 - never-raise; ambiguity -> "not a test" (live ALLOW).
|
||||
return False
|
||||
return False
|
||||
|
||||
|
||||
def _sandbox_allowlist() -> set[str]:
|
||||
"""Sanitised set of sandbox project ids from ``plane_test_sandbox_projects``.
|
||||
|
||||
Empty/blank CSV -> empty set (then EVERY project blocks in a test process,
|
||||
fail-closed). Never raises.
|
||||
"""
|
||||
try:
|
||||
raw = (settings.plane_test_sandbox_projects or "").strip()
|
||||
except Exception: # noqa: BLE001 - never-raise.
|
||||
return set()
|
||||
if not raw:
|
||||
return set()
|
||||
return {tok.strip() for tok in raw.split(",") if tok.strip()}
|
||||
|
||||
|
||||
def decide(project_id: str | None, op: str, work_item_id: str | None = None) -> tuple[bool, str]:
|
||||
"""Decide whether a Plane WRITE primitive may hit the network (ADR-001 D3).
|
||||
|
||||
Returns ``(ok, reason)``. Steps:
|
||||
|
||||
1. ``not _in_test_process()`` -> ALLOW (``live-runtime``: prod/staging no-op).
|
||||
2. ``project_id`` empty/None/unresolved -> BLOCK (``ambiguous-target``, NFR-1).
|
||||
3. opt-in flag OFF -> BLOCK (``opt-in-disabled``, default-deny).
|
||||
4. ``project_id`` ∉ sandbox allowlist -> BLOCK (``prod-project-in-test``, AC-3).
|
||||
5. otherwise -> ALLOW (``sandbox-opt-in``, audit INFO).
|
||||
|
||||
never-raise: the live path returns at step 1 BEFORE the try-block, so a prod
|
||||
write can never be wedged by a guard bug. Once we know we are in a test process,
|
||||
any internal error fails CLOSED to BLOCK (``guard-error``) — the defect surfaces
|
||||
loudly rather than re-opening prod (AC-9 / FR-7).
|
||||
"""
|
||||
# Step 1 — outside any test process the guard is a strict no-op. Evaluated FIRST
|
||||
# and OUTSIDE the try-block so a live prod/staging write is never affected.
|
||||
if not _in_test_process():
|
||||
return ALLOW, R_LIVE_RUNTIME
|
||||
|
||||
# From here on we are in a test process: default-deny, fail-closed on any error.
|
||||
try:
|
||||
pid = (project_id or "").strip()
|
||||
if not pid:
|
||||
return BLOCK, R_AMBIGUOUS # step 2
|
||||
if not bool(getattr(settings, "plane_test_write_enabled", False)):
|
||||
return BLOCK, R_OPT_IN_DISABLED # step 3
|
||||
if pid not in _sandbox_allowlist():
|
||||
return BLOCK, R_PROD_IN_TEST # step 4 — sandbox-only, even with opt-in.
|
||||
return ALLOW, R_SANDBOX_OPT_IN # step 5
|
||||
except Exception as e: # noqa: BLE001 - never-raise; in-test -> fail CLOSED.
|
||||
logger.error(
|
||||
"plane_write_guard.decide error in test-process -> BLOCK (fail-closed): %s", e
|
||||
)
|
||||
return BLOCK, R_GUARD_ERROR
|
||||
|
||||
|
||||
def audit_block(project_id: str | None, op: str, work_item_id: str | None, reason: str) -> None:
|
||||
"""Loud structured audit of a BLOCKED write (FR-5 / D7).
|
||||
|
||||
Logged at ERROR so an ORCH-114-class incident (a pytest mutating a non-sandbox
|
||||
project) is obvious in the run log, not silent. Never raises.
|
||||
"""
|
||||
try:
|
||||
logger.error(
|
||||
"plane_write_guard BLOCKED Plane %s write from a test process: "
|
||||
"project_id=%s work_item=%s reason=%s "
|
||||
"(ORCH-117 fail-closed; this would have mutated a non-sandbox Plane "
|
||||
"project from pytest — cf. the ORCH-114 incident)",
|
||||
op, project_id, work_item_id, reason,
|
||||
)
|
||||
except Exception: # noqa: BLE001 - logging must never raise.
|
||||
pass
|
||||
|
||||
|
||||
def audit_allow(project_id: str | None, op: str, work_item_id: str | None,
|
||||
reason: str = R_SANDBOX_OPT_IN) -> None:
|
||||
"""Audit (INFO) an ALLOWED real sandbox write from a test process (FR-5 / D7).
|
||||
|
||||
Only the ``sandbox-opt-in`` case is audited here — the ``live-runtime`` ALLOW
|
||||
(prod/staging) is the normal hot path and is intentionally NOT logged to avoid
|
||||
spamming the production log. Never raises.
|
||||
"""
|
||||
try:
|
||||
logger.info(
|
||||
"plane_write_guard ALLOWED sandbox Plane %s write from a test process: "
|
||||
"project_id=%s work_item=%s reason=%s",
|
||||
op, project_id, work_item_id, reason,
|
||||
)
|
||||
except Exception: # noqa: BLE001 - logging must never raise.
|
||||
pass
|
||||
|
||||
|
||||
def snapshot() -> dict:
|
||||
"""Read-only view of the guard state (optional observability, D7). Never raises."""
|
||||
try:
|
||||
return {
|
||||
"in_test_process": _in_test_process(),
|
||||
"opt_in_enabled": bool(getattr(settings, "plane_test_write_enabled", False)),
|
||||
"sandbox_allowlist": sorted(_sandbox_allowlist()),
|
||||
}
|
||||
except Exception as e: # noqa: BLE001 - never-raise.
|
||||
return {"error": str(e)}
|
||||
Reference in New Issue
Block a user