fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict

The self-hosting orchestrator looped on deploy-staging -> development because scripts/staging_check.py exited 1 on ANY failed check. Two infra-only checks (C9a sandbox branch, C9b analyst-job) fail because the SANDBOX bot accounts are not members of the sandbox Plane project, NOT because of a pipeline regression, yet they forced staging_status: FAILED -> rollback -> loop, burning developer retries and tokens.

Direction (b) per ADR-001: classify staging checks as REAL (all pipeline checks, fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable).

New leaf module src/staging_verdict.py (stdlib-only, never-raise): classify_check + compute_staging_verdict fold per-check results into a tolerant-but-fail-closed verdict:
  - any REAL failure -> FAILED/exit1 (safety net holds under any flag)
  - only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list
  - only infra failed & strict -> FAILED/exit1
  - any internal error -> FAILED/exit1 (never a false green)

staging_check.py now auto-classifies each check (public 3-tuple _items shape kept as an ORCH-048 b6 regression guard), exposes categorized_items(), prints INFRA-WAIVED/VERDICT lines, and exits via the verdict. The new --strict flag forces legacy strictness for a single run. The kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED defaults to true; setting it to false restores legacy strict mode globally.

launcher gains action_stage_no_changes_note so "no changes to commit" on action stages is logged as expected, not treated as under-delivery.

Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status: / deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB migration.

Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG.

Refs: ORCH-061
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 12:39:00 +00:00
parent 1d1208c136
commit 9070489968
15 changed files with 831 additions and 7 deletions
--- a/src/agents/launcher.py
+++ b/src/agents/launcher.py
@@ -20,6 +20,33 @@ logger = logging.getLogger("orchestrator.launcher")
 # never passed through to the CLI.
 VALID_EFFORTS = frozenset({"low", "medium", "high", "xhigh", "max"})

+# ORCH-061: action stages whose success is an ACTION (restart/retag), not a src
+# edit — so "no changes to commit" is EXPECTED there, not under-delivery (FR-3).
+_ACTION_STAGES = frozenset({"deploy-staging", "deploy"})
+
+
+def action_stage_no_changes_note(stage, repo) -> str | None:
+    """ORCH-061 (FR-3 / FR-7): observability for an empty diff on an action stage.
+
+    The ``deploy-staging`` / ``deploy`` stages are actions (restart / retag), not
+    code edits, so the post-run "no changes to commit" is the NORMAL case there —
+    advancement is decided by the agent exit-code + the staging/deploy gate verdict,
+    NEVER by the presence of a commit (FR-3 / AC-4). This is a PURE decision used
+    only to emit an explicit log line distinguishing an expected action-stage no-op
+    from a code-stage no-op; it has no effect on stage advancement.
+
+    Returns an explicit note string when the empty diff is expected (an action
+    stage of a self-deploy repo), else ``None``. Never raises.
+    """
+    try:
+        if stage in _ACTION_STAGES:
+            from ..self_deploy import self_deploy_applies
+            if self_deploy_applies(repo):
+                return f"{stage}: no code changes (expected on action stage)"
+        return None
+    except Exception:  # noqa: BLE001 - observability only, never raise
+        return None
+

 def _resolve_agent_attr(agent, project_id, project_map_attr, env_attr_prefix,
                        default_attr):
@@ -582,6 +609,22 @@ class AgentLauncher:
                    logger.warning(f"Agent run_id={run_id}: commit failed: {commit_result.stderr}")
            else:
                logger.info(f"Agent run_id={run_id}: no changes to commit")
+                # ORCH-061: on a self-deploy action stage (deploy-staging/deploy)
+                # an empty diff is EXPECTED (action, not a src edit). Emit an
+                # explicit observability line so an operator can tell this apart
+                # from a code-stage no-op. Does NOT affect advancement (decided by
+                # exit-code + gate verdict, never by a commit existing).
+                try:
+                    _t = get_task_by_repo_branch(repo, branch)
+                    _stage = _t["stage"] if _t else None
+                    _note = action_stage_no_changes_note(_stage, repo)
+                    if _note:
+                        logger.info(f"Agent run_id={run_id}: {_note}")
+                except Exception as _e:
+                    logger.debug(
+                        f"Agent run_id={run_id}: action-stage no-changes note "
+                        f"skipped: {_e}"
+                    )
        except Exception as e:
            logger.error(f"Agent run_id={run_id}: post-run git failed: {e}")
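The launcher hunks above can be exercised in isolation with a small sketch. It mirrors `action_stage_no_changes_note` but replaces the real `self_deploy_applies` (imported from `..self_deploy` and not shown in this diff) with a hypothetical stand-in, so the repo names below are illustrative assumptions only.

```python
from typing import Optional

_ACTION_STAGES = frozenset({"deploy-staging", "deploy"})


def self_deploy_applies(repo) -> bool:
    """Hypothetical stand-in: only the 'orchestrator' repo self-deploys."""
    return repo == "orchestrator"


def action_stage_no_changes_note(stage, repo) -> Optional[str]:
    """Return a log note when an empty diff is expected, else None. Never raises."""
    try:
        if stage in _ACTION_STAGES and self_deploy_applies(repo):
            return f"{stage}: no code changes (expected on action stage)"
        return None
    except Exception:  # observability only: swallow and report "no note"
        return None


# Action stage of a self-deploy repo: a note is produced.
print(action_stage_no_changes_note("deploy-staging", "orchestrator"))
# Code stage, or a repo that does not self-deploy: no note.
print(action_stage_no_changes_note("development", "orchestrator"))
print(action_stage_no_changes_note("deploy", "some-other-repo"))
# An unhashable stage makes the membership test raise; the helper returns None.
print(action_stage_no_changes_note(["not", "a", "stage"], "orchestrator"))
```

The note is purely informational: in the diff it is only passed to `logger.info`, and stage advancement is still decided by the exit code and the gate verdict.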

--- a/src/config.py
+++ b/src/config.py
@@ -219,6 +219,22 @@ class Settings(BaseSettings):
    image_freshness_enabled: bool = True
    image_freshness_repos: str = ""

+    # ORCH-061: tolerate KNOWN sandbox-infra FAILs (C9a/C9b) in the staging suite.
+    # The self-hosting deploy-staging stage looped because scripts/staging_check.py
+    # exited non-zero on ANY failed check, so two infra-only failures (sandbox bot
+    # accounts not members of the sandbox Plane project) produced staging_status:
+    # FAILED -> rollback deploy-staging -> development -> loop.
+    #   True  -> a run whose ONLY failures are allowlisted sandbox-infra checks
+    #            (C9a/C9b) is waived to SUCCESS; ANY real pipeline check that fails
+    #            still fails closed -> FAILED -> rollback (safety net intact, FR-4).
+    #   False -> 1:1 pre-ORCH-061 strict behaviour: any FAIL -> FAILED -> rollback.
+    # Default True (mirrors merge_gate_enabled / image_freshness_enabled /
+    # self_deploy_enabled): the safety net holds regardless of the flag; the flag
+    # exists to instantly restore legacy strictness without a code redeploy. Lives
+    # in .env.staging (ORCH_ prefix) so it is reachable inside orchestrator-staging.
+    # Env ORCH_STAGING_INFRA_TOLERANCE_ENABLED.
+    staging_infra_tolerance_enabled: bool = True
+
    # ORCH-053: stuck-task reconciler (sweeper for lost webhooks). A background
    # daemon thread reconciles the "source of truth (gate / Plane) != task stage"
    # drift left behind by a dropped webhook (502 on rebuild, no Plane/Gitea
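The interaction between the global kill-switch and the per-run `--strict` flag might look roughly like the sketch below. This is an assumption about how `scripts/staging_check.py` (not shown in this part of the diff) combines the two; the helper names `env_flag` and `resolve_infra_tolerant` are hypothetical, and the real project reads the setting through its pydantic `Settings` class rather than `os.environ` directly.

```python
import argparse
import os

ENV_NAME = "ORCH_STAGING_INFRA_TOLERANCE_ENABLED"
_TRUE = {"1", "true", "yes", "on"}


def env_flag(name: str, default: bool) -> bool:
    """Parse a boolean env var.

    Unset -> default. Set but not a recognised truthy value -> False, which is
    this sketch's own conservative (strict) choice, not the project's rule.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE


def resolve_infra_tolerant(argv) -> bool:
    """Tolerance is on only if the kill-switch allows it AND --strict is absent."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--strict", action="store_true",
                        help="force legacy strictness for this run")
    args = parser.parse_args(argv)
    return env_flag(ENV_NAME, default=True) and not args.strict


if __name__ == "__main__":
    os.environ.pop(ENV_NAME, None)
    print(resolve_infra_tolerant([]))            # default: tolerant
    print(resolve_infra_tolerant(["--strict"]))  # per-run override: strict
    os.environ[ENV_NAME] = "false"
    print(resolve_infra_tolerant([]))            # global kill-switch: strict
```

Either switch can only make the run stricter; neither can waive a failed REAL check, because that rule lives in the verdict logic itself.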
--- a/src/staging_verdict.py
+++ b/src/staging_verdict.py
@@ -0,0 +1,173 @@
+"""ORCH-061: pure staging-verdict logic (classification + tolerant verdict).
+
+The self-hosting ``orchestrator`` looped on ``deploy-staging`` because
+``scripts/staging_check.py`` summed ``all_ok = passed == total`` and exited
+non-zero on ANY failed check — so two *infrastructure-only* failures (C9a branch
+not found / C9b analyst-job not in queue, both caused by the SANDBOX bot accounts
+not being members of the sandbox Plane project) produced ``staging_status:
+FAILED`` → rollback ``deploy-staging → development`` → loop (ADR-001 §Context).
+
+This module isolates the **pure verdict logic** so both outcomes are unit-testable
+without a live staging stand or docker (TRZ §9):
+
+  * ``classify_check(label)`` — label → ``REAL`` | ``SANDBOX_INFRA`` (narrow,
+    allowlist-driven, fail-closed to ``REAL`` on anything unrecognised);
+  * ``compute_staging_verdict(items, infra_tolerant)`` — fold the per-check
+    pass/fail + category into a single ``StagingVerdict``.
+
+It is a **leaf**: stdlib only, no I/O, no project imports — so it is safe to import
+both from the orchestrator process and from ``scripts/staging_check.py`` (which
+runs inside the ``orchestrator-staging`` container, pattern B6 / ORCH-048). Every
+public function honours a **never-raise** contract: on any malformed input it
+returns the *conservative* (fail-closed) result, never an exception.
+
+Safety invariant (FR-4 / AC-3): a failed REAL check ALWAYS yields ``FAILED`` /
+exit 1 regardless of ``infra_tolerant``. The waiver applies ONLY to the named
+``SANDBOX_INFRA`` checks and ONLY when every REAL check (incl. C7/C8) is green —
+so the blast-radius of the tolerance is exactly the two allowlisted checks.
+"""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+# Category constants ---------------------------------------------------------
+REAL = "real"                 # a real pipeline check — fail-closed, always counts
+SANDBOX_INFRA = "sandbox_infra"  # known to depend on sandbox infra (waivable)
+
+# Narrow allowlist of checks known to depend on sandbox infrastructure rather
+# than the pipeline itself (ADR-001 §1). Matched by the check's leading label
+# token, e.g. "C9a Branch appears in orchestrator-sandbox" -> token "C9a".
+# Keep this set MINIMAL — every entry is a hole in the staging safety-net.
+SANDBOX_INFRA_CHECKS = frozenset({"C9a", "C9b"})
+
+
+def classify_check(label) -> str:
+    """Classify a staging-check label as ``REAL`` or ``SANDBOX_INFRA``.
+
+    A label is ``SANDBOX_INFRA`` iff its leading whitespace-delimited token equals
+    or starts with an entry of :data:`SANDBOX_INFRA_CHECKS` (e.g. token ``"C9a"``
+    from ``"C9a Branch appears…"``). Everything else, including anything
+    unrecognised or malformed, is ``REAL`` (conservative / fail-closed: an unknown
+    check counts toward the safety net). Never raises.
+    """
+    try:
+        text = str(label).strip()
+        if not text:
+            return REAL
+        token = text.split()[0]
+        for prefix in SANDBOX_INFRA_CHECKS:
+            if token == prefix or token.startswith(prefix):
+                return SANDBOX_INFRA
+        return REAL
+    except Exception:
+        return REAL
+
+
+@dataclass
+class StagingVerdict:
+    """Outcome of folding the staging-check suite into a single verdict.
+
+    ``status``    — ``"SUCCESS"`` | ``"FAILED"`` (mirrors the ``staging_status:``
+                    frontmatter contract the deployer writes; unchanged).
+    ``exit_code`` — ``0`` (advance) | ``1`` (rollback). Drives ``sys.exit`` in
+                    ``staging_check.py``.
+    ``waived``    — labels of SANDBOX_INFRA checks that failed but were tolerated
+                    (empty unless the waiver actually fired — observability, FR-7).
+    ``summary``   — human-readable one-liner for logs.
+    """
+
+    status: str
+    exit_code: int
+    waived: list = field(default_factory=list)
+    summary: str = ""
+
+
+def _coerce_item(item) -> tuple[str, bool, str]:
+    """Normalise an input row into ``(label, passed, category)``.
+
+    Accepts ``(label, passed)`` or ``(label, passed, category)``. A missing/None
+    category is resolved via :func:`classify_check`. Never raises — a malformed
+    row degrades to a failed REAL check (fail-closed) so it cannot silently pass.
+    """
+    try:
+        label = str(item[0])
+        passed = bool(item[1])
+        category = item[2] if len(item) > 2 and item[2] else None
+    except Exception:
+        return ("<malformed>", False, REAL)
+    if category not in (REAL, SANDBOX_INFRA):
+        category = classify_check(label)
+    return (label, passed, category)
+
+
+def compute_staging_verdict(items, infra_tolerant: bool) -> StagingVerdict:
+    """Fold per-check results into a tolerant-but-fail-closed staging verdict.
+
+    ``items`` — iterable of ``(label, passed: bool[, category: str])``.
+
+    Decision table (ADR-001 §1):
+      * any REAL check failed                      -> FAILED / exit 1 (safety net)
+      * only SANDBOX_INFRA failed & infra_tolerant -> SUCCESS / exit 0 (waived)
+      * only SANDBOX_INFRA failed & !infra_tolerant -> FAILED / exit 1 (legacy strict)
+      * nothing failed                             -> SUCCESS / exit 0
+
+    Never raises: on any internal error the verdict degrades to a conservative
+    ``FAILED`` / exit 1 (never a false green) — AC-10.
+    """
+    try:
+        real_failed: list[str] = []
+        infra_failed: list[str] = []
+        for raw in items:
+            label, passed, category = _coerce_item(raw)
+            if passed:
+                continue
+            if category == SANDBOX_INFRA:
+                infra_failed.append(label)
+            else:
+                real_failed.append(label)
+
+        if real_failed:
+            # Safety net (FR-4): a real pipeline regression always fails closed,
+            # regardless of tolerance. Infra failures (if any) are noted but the
+            # verdict is dominated by the real failure.
+            extra = f"; infra-fail {infra_failed}" if infra_failed else ""
+            return StagingVerdict(
+                status="FAILED",
+                exit_code=1,
+                waived=[],
+                summary=f"FAILED: real checks failed {real_failed}{extra}",
+            )
+        if infra_failed and infra_tolerant:
+            # Waiver fires ONLY here: every REAL check is green and the only
+            # failures are allowlisted sandbox-infra checks (FR-2).
+            return StagingVerdict(
+                status="SUCCESS",
+                exit_code=0,
+                waived=list(infra_failed),
+                summary=(
+                    f"SUCCESS (infra-waived): {infra_failed} are known sandbox-infra "
+                    "checks; all real checks green"
+                ),
+            )
+        if infra_failed and not infra_tolerant:
+            # Legacy strict (kill-switch off): any failure fails closed (1:1 pre-061).
+            return StagingVerdict(
+                status="FAILED",
+                exit_code=1,
+                waived=[],
+                summary=f"FAILED (strict): {infra_failed} failed and tolerance disabled",
+            )
+        return StagingVerdict(
+            status="SUCCESS",
+            exit_code=0,
+            waived=[],
+            summary="SUCCESS: all checks green",
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise; fail closed on doubt
+        return StagingVerdict(
+            status="FAILED",
+            exit_code=1,
+            waived=[],
+            summary=f"FAILED (verdict error, fail-closed): {e}",
+        )
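The decision table in `compute_staging_verdict` can be restated as a condensed, self-contained sketch for quick experimentation. It is not the module above: it returns a plain tuple instead of `StagingVerdict`, omits the never-raise wrapping and the summary text, and the check labels other than the `C9a` one quoted in the source are hypothetical.

```python
SANDBOX_INFRA_CHECKS = frozenset({"C9a", "C9b"})


def is_sandbox_infra(label) -> bool:
    """True when the label's first token starts with an allowlisted check id."""
    tokens = str(label).split()
    return bool(tokens) and tokens[0].startswith(tuple(SANDBOX_INFRA_CHECKS))


def verdict(items, infra_tolerant):
    """Condensed decision table: returns (status, exit_code, waived_labels)."""
    failed = [label for label, passed in items if not passed]
    real = [label for label in failed if not is_sandbox_infra(label)]
    infra = [label for label in failed if is_sandbox_infra(label)]
    if real:
        return ("FAILED", 1, [])      # safety net: a real failure always wins
    if infra and not infra_tolerant:
        return ("FAILED", 1, [])      # legacy strict mode
    return ("SUCCESS", 0, infra)      # all green, or infra-only and waived


infra_only = [
    ("C7 Hypothetical real check", True),
    ("C9a Branch appears in orchestrator-sandbox", False),
]
real_and_infra = [
    ("C7 Hypothetical real check", False),
    ("C9b Hypothetical analyst-job check", False),
]

print(verdict(infra_only, infra_tolerant=True))       # waiver fires
print(verdict(infra_only, infra_tolerant=False))      # strict: fails
print(verdict(real_and_infra, infra_tolerant=True))   # real failure dominates
```

One consequence of matching by prefix, both here and in `classify_check`, is that any future label whose first token merely begins with `C9a` or `C9b` would also be waivable, so new check ids are worth choosing with that in mind.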