feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115)

Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting orchestrator) with a deterministic staging-runner intercepted in launch_job BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent precedent). The runner executes the SAME staging suite, maps the exit-code to `staging_status:` via the existing self_deploy.map_exit_code_to_status contract, writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate exactly as a finished LLM-deployer would. Invariant (NFR-1): this replaces only the *producer* of the artifact — the artifact contract, the gate / _parse_staging_status / check_staging_status name, STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV, never-raise, fail-safe back to the LLM path. Two-level outcome (D5, anti ORCH-110): suite executed -> verdict -> advance (FAILED -> the existing deploy-staging -> development rollback + developer-retry, same as a FAILED LLM verdict); tool-error (suite did not execute) -> bounded DEFER -> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent advance / false green). First implemented slice of the LLM determinization roadmap (ORCH-118 A6, replace-deterministic-now). - New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout) - launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job) - config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget) - GET /queue staging_runner observability block - docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks + single-transport invariant intact), deployer.md (LLM branch -> fallback), CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security), .env.example - tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14) Refs: ORCH-115 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:59:43 +03:00
parent f120e4bd8f
commit b50cf1dd08
16 changed files with 1235 additions and 7 deletions
--- a/src/agents/launcher.py
+++ b/src/agents/launcher.py
@@ -385,6 +385,14 @@ class AgentLauncher:
        (no-LLM) job — intercept it BEFORE _spawn (which would raise
        "Unknown agent", R-6) and run the deploy finalizer synchronously, driving
        the jobs row status itself. Returns None (no agent_run row).
+
+        ORCH-115: the LLM ``deployer`` on the ``deploy-staging`` stage (self-hosting
+        scope) is replaced by a DETERMINISTIC staging-runner — intercepted here
+        BEFORE _spawn (same precedent as deploy-finalizer / post-deploy-monitor). The
+        discriminator is the TASK STAGE (deploy-staging), not the role name, so the
+        prod ``deploy`` deployer is never caught (staging_runner.should_intercept).
+        Kill-switch off / out of scope -> should_intercept False -> the prior LLM
+        deployer runs via _spawn byte-for-byte.
        """
        if job.get("agent") == "deploy-finalizer":
            return self._run_deploy_finalizer_job(job)
@@ -393,6 +401,11 @@ class AgentLauncher:
        # observation tick synchronously. Returns None (no agent_run row).
        if job.get("agent") == "post-deploy-monitor":
            return self._run_post_deploy_monitor_job(job)
+        # ORCH-115: deterministic staging-runner intercept (BEFORE _spawn).
+        if job.get("agent") == "deployer":
+            from .. import staging_runner
+            if staging_runner.should_intercept(job):
+                return self._run_staging_runner_job(job)
        return self._spawn(
            job["agent"],
            job["repo"],
@@ -422,6 +435,28 @@ class AgentLauncher:
                pass
        return None

+    def _run_staging_runner_job(self, job: dict):
+        """ORCH-115: run the deterministic staging gate for a deployer job.
+
+        Not an LLM spawn — there is no subprocess/monitor of an agent, so we mark the
+        jobs row done/failed here (mirror of _run_deploy_finalizer_job). The runner
+        never-raises, but we guard anyway so a runner fault can't wedge the worker.
+        Returns None (no agent_run row, _spawn not called).
+        """
+        from ..db import mark_job
+        from .. import staging_runner
+        try:
+            staging_runner.run_staging_gate(job)
+            mark_job(job["id"], "done")
+            logger.info(f"staging-runner job {job['id']} done")
+        except Exception as e:
+            logger.error(f"staging-runner job {job['id']} failed: {e}")
+            try:
+                mark_job(job["id"], "failed", error=f"staging-runner error: {e}")
+            except Exception:
+                pass
+        return None
+
    def _run_post_deploy_monitor_job(self, job: dict):
        """ORCH-021: run one deterministic post-deploy monitor tick for a job.

--- a/src/config.py
+++ b/src/config.py
@@ -413,6 +413,51 @@ class Settings(BaseSettings):
    coverage_tool_fail_closed: bool = False
    coverage_run_timeout_s: int = 900

+    # ORCH-115: deterministic staging-runner replacing the LLM `deployer` agent on
+    # the `deploy-staging` stage for the self-hosting orchestrator. A new leaf
+    # src/staging_runner.py (never-raise) is intercepted in launch_job BEFORE _spawn
+    # (mirroring the deploy-finalizer / post-deploy-monitor reserved-agent
+    # precedent, launcher.py:389/394): it runs the SAME staging suite the LLM ran
+    # (`docker exec orchestrator-staging python3 .../staging_check.py`), maps the
+    # exit-code -> staging_status: via the existing self_deploy.map_exit_code_to_status
+    # contract, writes 15-staging-log.md, and initiates the EXISTING check_staging_status
+    # gate exactly as a finished LLM-deployer would. The artifact contract, the gate,
+    # STAGE_TRANSITIONS and the DB schema are byte-for-byte UNCHANGED — this only
+    # replaces the *producer* of the artifact. Pattern = coverage_gate_* / self_deploy_*.
+    # See docs/work-items/ORCH-115/06-adr/ADR-001-deterministic-staging-runner.md and
+    # docs/architecture/adr/adr-0048-deterministic-staging-runner.md.
+    #   staging_runner_enabled            -> SINGLE kill-switch (env
+    #                                        ORCH_STAGING_RUNNER_ENABLED). False -> the
+    #                                        intercept never fires -> the prior LLM
+    #                                        deployer runs on deploy-staging via _spawn
+    #                                        byte-for-byte as before ORCH-115 (D8/AC-6).
+    #   staging_runner_repos              -> CSV scope (env ORCH_STAGING_RUNNER_REPOS).
+    #                                        Empty -> self-hosting only (orchestrator)
+    #                                        via is_self_hosting_repo; non-empty ->
+    #                                        membership. Mirrors coverage_gate_repos.
+    #   staging_runner_timeout_s          -> wall-clock budget for the docker-exec
+    #                                        staging suite (env ORCH_STAGING_RUNNER_TIMEOUT_S).
+    #                                        Malformed/non-positive -> default + WARNING
+    #                                        (never-break). Aligned with the cross-cutting
+    #                                        budget invariant ORCH-065/109/110 WITHOUT
+    #                                        touching reaper_max_running_s (D9): it replaces
+    #                                        the up-to-900s LLM staging window with a bounded
+    #                                        <=600s deterministic one (Σ on the edge does not grow).
+    #   staging_runner_infra_max_retries  -> tool-error (suite did NOT execute: spawn-error /
+    #                                        timeout / returncode None) bounded DEFER budget
+    #                                        before a fail-closed FAILED (env
+    #                                        ORCH_STAGING_RUNNER_INFRA_MAX_RETRIES). Mirrors
+    #                                        merge_retest_infra_max_retries — infra hiccup is
+    #                                        NOT a code-fault, so it never burns a developer-retry
+    #                                        until the budget is exhausted (D5, anti ORCH-110).
+    #   staging_runner_infra_retry_delay_s-> delay before the re-queued deployer job
+    #                                        (env ORCH_STAGING_RUNNER_INFRA_RETRY_DELAY_S).
+    staging_runner_enabled: bool = True
+    staging_runner_repos: str = ""
+    staging_runner_timeout_s: int = 600
+    staging_runner_infra_max_retries: int = 2
+    staging_runner_infra_retry_delay_s: int = 30
+
    # ORCH-098 (FND/F2): machine lessons-journal — additive `lessons` table + leaf
    # src/lessons.py (never-raise observer, by образцу serial_gate/coverage_gate/
    # metrics). The journal is an OBSERVER, never a Quality Gate: writing a lesson
--- a/src/main.py
+++ b/src/main.py
@@ -235,6 +235,7 @@ async def queue():
    from . import lessons
    from . import checkout_hygiene
    from . import transition_lease
+    from . import staging_runner
    from .disk_watchdog import disk_watchdog
    from .build_cache_pruner import build_cache_pruner
    return {
@@ -283,6 +284,11 @@ async def queue():
        # (owner/stage/age/live) + defer/reclaim/CAS-lost counters. Additive block;
        # never-raise.
        "transition_lease": transition_lease.snapshot(),
+        # ORCH-115 (FR-7 / AC-10): deterministic staging-runner observability
+        # (read-only) — kill-switch, scope, timeout/infra budget + run/success/
+        # failed/tool_error/deferred counters, so a code-fail FAILED is distinguishable
+        # from an infra tool-error. Additive block; never-raise.
+        "staging_runner": staging_runner.snapshot(),
        # ORCH-098 (FR-4 / AC-4): lessons-journal observability (read-only) —
        # kill-switch + counts by type/status + last N lessons. Additive block;
        # never-raise (snapshot() returns {"enabled": ...} minimum on error).
--- a/src/staging_runner.py
+++ b/src/staging_runner.py
@@ -0,0 +1,581 @@
+"""Deterministic staging-runner (ORCH-115).
+
+The `deploy-staging` stage for the self-hosting ``orchestrator`` repo was driven by
+the LLM ``deployer`` agent, but the actual work is purely deterministic
+(``.openclaw/agents/deployer.md`` steps 1-4): run the staging suite, map its
+**exit-code** to a verdict (``0 -> SUCCESS``, else ``FAILED``), write
+``15-staging-log.md`` and commit it. This leaf replaces that LLM consultation with
+deterministic code, intercepted in ``launcher.launch_job`` BEFORE ``_spawn`` (the
+``deploy-finalizer`` / ``post-deploy-monitor`` reserved-agent precedent,
+``launcher.py:389/394``).
+
+What is and is NOT changed (NFR-1, the critical invariant):
+  * UNCHANGED — the artifact contract (``15-staging-log.md`` with
+    ``staging_status: SUCCESS|FAILED``), the gate ``check_staging_status`` /
+    ``_parse_staging_status``, ``STAGE_TRANSITIONS``, the DB schema. This module
+    replaces only the *producer* of the artifact, never the gate that reads it.
+  * NEW — a deterministic producer + a launch_job intercept. Under a kill-switch +
+    repo-scope CSV; fail-safe back to the LLM path when off / out of scope.
+
+This module is a **leaf** (mirror of ``self_deploy`` / ``proc_group`` /
+``staging_verdict``): it imports only ``config`` / ``logging`` / ``proc_group`` at
+module load; ``db`` / ``git_worktree`` / ``self_deploy.map_exit_code_to_status`` /
+``qg.checks`` / ``stage_engine.advance_stage`` / ``notifications`` are imported
+LAZILY inside functions so the heavy ``stage_engine`` is never pulled at import and
+no import cycle forms (pattern: ``transition_lease`` / ``merge_gate``). Every public
+function honours a **never-raise** contract (AC-7): a staging-infra hiccup can never
+crash the worker / wedge the queue.
+
+Two-level outcome (D5 — the key safety decision, anti ORCH-110):
+  * the suite EXECUTED (a real exit-code, 0 or non-zero) -> trust the code:
+    ``0 -> SUCCESS -> advance``; ``!=0 -> FAILED -> the existing rollback
+    deploy-staging -> development`` (same developer-retry path as a FAILED LLM
+    verdict). ORCH-061 infra-tolerance is already INSIDE ``staging_check.py`` — the
+    runner never re-judges it (BR-4).
+  * the suite did NOT execute (tool-error: spawn-error / timeout / ``returncode is
+    None``) -> an infra fault, NOT a code fault -> a bounded DEFER (re-queue a fresh
+    ``deployer`` job with a delay + a restart-safe marker). On budget exhaustion ->
+    fail-closed ``FAILED`` + advance + alert. So the runner NEVER does a silent
+    advance / false green, and NEVER wedges the queue, but does NOT burn a
+    developer-retry on transient infra.
+"""
+
+import logging
+import time
+
+from .config import settings
+from . import proc_group
+
+logger = logging.getLogger("orchestrator.staging_runner")
+
+# Platform service literal — the staging compose service the suite runs inside.
+# Already an accepted platform literal (mirror image_freshness._STAGING_SERVICE);
+# NOT a host hardcode (test_no_host_hardcodes forbids host IP/home/hostname only).
+STAGING_SERVICE = "orchestrator-staging"
+
+# Default wall-clock budget for the docker-exec staging suite (D9). Kept <= the LLM
+# staging window it replaces so Σ(work on the deploy-staging edge) does not grow and
+# the cross-cutting reaper invariant (ORCH-065/109/110) holds WITHOUT touching
+# reaper_max_running_s.
+_DEFAULT_TIMEOUT_S = 600
+
+_GIT_TIMEOUT = 60
+
+# Restart-safe DEFER marker (counted from the persisted jobs queue, mirror of
+# stage_engine._merge_infra_retry_count). Embedded verbatim in the re-queued job's
+# task_content so a service restart never resets the infra-retry budget.
+_INFRA_RETRY_MARKER = "staging-runner infra-retry"
+
+# In-process observability counters (mirror merge_gate._MERGE_GATE_COUNTERS, ORCH-110).
+_STAGING_RUNNER_COUNTERS: dict = {
+    "runs": 0,          # run_staging_gate entered
+    "success": 0,       # suite ran, exit 0 -> SUCCESS
+    "failed": 0,        # suite ran non-zero, OR infra budget exhausted -> FAILED
+    "tool_error": 0,    # suite did NOT execute (spawn-error / timeout / None)
+    "deferred": 0,      # bounded infra DEFER (re-queued)
+}
+
+
+def _bump(key: str) -> None:
+    """Increment an observability counter. Never raises."""
+    try:
+        _STAGING_RUNNER_COUNTERS[key] += 1
+    except Exception:  # noqa: BLE001 - observability must never break a decision
+        pass
+
+
+# ---------------------------------------------------------------------------
+# Conditionality (D8 / FR-6 / AC-6)
+# ---------------------------------------------------------------------------
+def applies(repo: str) -> bool:
+    """Whether the deterministic staging-runner is REAL for ``repo``.
+
+    Mirrors ``self_deploy_applies`` / ``coverage_gate``:
+      * ``staging_runner_enabled=False`` -> always False (global kill-switch); the
+        legacy LLM-deployer path runs on deploy-staging via ``_spawn``.
+      * ``staging_runner_repos`` (CSV) non-empty -> only the listed repos.
+      * empty CSV -> only the self-hosting repo (``orchestrator``), which is the only
+        one with a staging instance.
+    Checked FIRST in ``should_intercept`` (local, no network, no DB) so a disabled
+    flag costs nothing. never-raise -> False (fail-safe to the prior LLM path).
+    """
+    try:
+        if not settings.staging_runner_enabled:
+            return False
+        raw = (settings.staging_runner_repos or "").strip()
+        if raw:
+            allowed = {r.strip().lower() for r in raw.split(",") if r.strip()}
+            return (repo or "").strip().lower() in allowed
+        # Lazy import keeps this module a leaf (no qg import at module load).
+        from .qg.checks import is_self_hosting_repo
+        return is_self_hosting_repo(repo)
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("staging_runner.applies error for %s: %s", repo, e)
+        return False
+
+
+def should_intercept(job: dict) -> bool:
+    """True iff this ``deployer`` job is the deterministic staging-suite job (D1).
+
+    The discriminator of "staging vs prod" is the TASK STAGE, not the role name
+    (``deployer`` owns both ``deploy-staging`` and ``deploy``): intercept iff
+    ``agent == "deployer"`` AND the task's ``tasks.stage == "deploy-staging"`` AND
+    ``applies(repo)``. For self-hosting the prod ``deploy`` edge runs Phase A (no
+    deployer spawn); for non-self repos ``applies`` is False, so a non-self prod
+    ``deployer`` job is never intercepted (R-1 / TC-06).
+
+    never-raise -> False (a DB-lookup failure falls through to ``_spawn``, fail-safe
+    to the prior LLM path).
+    """
+    try:
+        if (job.get("agent") or "") != "deployer":
+            return False
+        # applies() FIRST (local, no DB): disabled flag -> zero DB overhead.
+        if not applies(job.get("repo")):
+            return False
+        task_id = job.get("task_id")
+        if task_id is None:
+            return False
+        from .db import get_db
+        conn = get_db()
+        row = conn.execute("SELECT stage FROM tasks WHERE id=?", (task_id,)).fetchone()
+        conn.close()
+        if not row:
+            return False
+        return (row[0] or "") == "deploy-staging"
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("staging_runner.should_intercept error: %s", e)
+        return False
+
+
+# ---------------------------------------------------------------------------
+# Suite execution (D3 / FR-2 / NFR-3 / AC-8 / AC-9)
+# ---------------------------------------------------------------------------
+def build_staging_command() -> list[str]:
+    """Build the canonical staging-suite argv (same command the LLM-deployer ran).
+
+    ``docker exec <STAGING_SERVICE> python3 <repos_dir>/<self-repo>/scripts/staging_check.py
+    --base-url http://localhost:<staging_port> --mode stub``. Host-specifics come from
+    config (ORCH-101, no host hardcodes). Self-hosting safety (BR-7 / AC-8 / TC-12):
+    NO restart of 8500, NO ``docker compose up orchestrator`` / ``--build``, NO
+    force-push, NO ``.env`` edit — the runner only reads/executes the staging suite
+    (8501) and writes a log.
+    """
+    from .qg.checks import SELF_HOSTING_REPO
+    repos_dir = (settings.repos_dir or "/repos").rstrip("/")
+    script = f"{repos_dir}/{SELF_HOSTING_REPO}/scripts/staging_check.py"
+    base_url = f"http://localhost:{int(settings.staging_port)}"
+    return [
+        "docker", "exec", STAGING_SERVICE,
+        "python3", script,
+        "--base-url", base_url,
+        "--mode", "stub",
+    ]
+
+
+def _resolve_timeout() -> int:
+    """Resolve ``staging_runner_timeout_s`` (malformed/non-positive -> default +
+    WARNING, never-break — mirror of ``merge_gate._resolve_retest_timeout``)."""
+    raw = getattr(settings, "staging_runner_timeout_s", _DEFAULT_TIMEOUT_S)
+    try:
+        t = int(raw)
+        if t > 0:
+            return t
+        logger.warning(
+            "staging_runner_timeout_s non-positive (%r) -> default %ds",
+            raw, _DEFAULT_TIMEOUT_S,
+        )
+    except (TypeError, ValueError):
+        logger.warning(
+            "staging_runner_timeout_s malformed (%r) -> default %ds",
+            raw, _DEFAULT_TIMEOUT_S,
+        )
+    return _DEFAULT_TIMEOUT_S
+
+
+def run_staging_suite() -> proc_group.ProcResult:
+    """Execute the staging suite in its own process group, tree-killed on timeout.
+
+    Reuses ``proc_group.run_in_process_group`` (ORCH-110) so a hung docker-exec /
+    pytest subtree is killed whole (no orphans, AC-9). Never raises (proc_group
+    degrades any OS error to a safe ``ProcResult``).
+    """
+    cmd = build_staging_command()
+    timeout = _resolve_timeout()
+    try:
+        grace = float(getattr(settings, "agent_kill_grace_seconds", 5) or 5)
+    except (TypeError, ValueError):
+        grace = 5.0
+    return proc_group.run_in_process_group(
+        cmd,
+        cwd=None,
+        timeout=timeout,
+        grace_s=grace,
+        tree_kill=bool(getattr(settings, "subprocess_tree_kill_enabled", True)),
+    )
+
+
+# ---------------------------------------------------------------------------
+# exit-code -> verdict (D4 / FR-3 / AC-3) — reuse the single contract, no 2nd mapping
+# ---------------------------------------------------------------------------
+def map_exit_code_to_status(exit_code) -> str:
+    """``0 -> SUCCESS``; non-zero / None / non-int -> ``FAILED`` (fail-closed).
+
+    Delegates to ``self_deploy.map_exit_code_to_status`` — the SAME pure, unit-tested
+    contract the deploy-finalizer uses (BR-4: no second, drifting mapping).
+    """
+    from .self_deploy import map_exit_code_to_status as _map
+    return _map(exit_code)
+
+
+# ---------------------------------------------------------------------------
+# Artifact 15-staging-log.md (D6 / FR-4 / AC-2 / AC-8) — mirror write_deploy_log
+# ---------------------------------------------------------------------------
+def _extract_infra_waived(stdout: str) -> list[str]:
+    """Lines from the suite stdout carrying the ORCH-061 ``INFRA-WAIVED:`` marker
+    (copied into the log body for observability, as the prompt required)."""
+    if not stdout:
+        return []
+    return [ln.strip() for ln in stdout.splitlines() if "INFRA-WAIVED" in ln]
+
+
+def build_staging_log(
+    work_item_id: str, exit_code, status: str, stdout: str = "", *, tool_error: bool = False
+) -> str:
+    """Render a ``15-staging-log.md`` body whose ``staging_status:`` frontmatter is the
+    verdict ``check_staging_status`` / ``_parse_staging_status`` reads (contract
+    UNCHANGED, AC-2). Carries the mandatory 52c schema (``work_item`` / ``stage`` /
+    ``author_agent`` / ``status`` / ``created_at`` / ``model_used``); ``author_agent:
+    staging-runner`` / ``model_used: n/a`` honestly reflect the DETERMINISTIC producer.
+    The machine key ``staging_status:`` and its UPPERCASE value are NOT changed.
+
+    Written as a literal block (mirror of ``self_deploy.build_deploy_log``) so the
+    machine-read frontmatter is byte-exact; only the frontmatter is machine-read, the
+    body is informational.
+    """
+    import datetime
+    created = datetime.date.today().isoformat()
+    sub_status = "success" if status == "SUCCESS" else "failed"
+    base_url = f"http://localhost:{int(settings.staging_port)}"
+
+    waived = _extract_infra_waived(stdout)
+    waived_body = ""
+    if waived:
+        waived_body = "\nINFRA-WAIVED lines (ORCH-061, copied for observability):\n" + "\n".join(
+            f"- {ln}" for ln in waived
+        ) + "\n"
+
+    tail = ""
+    if stdout:
+        tail_text = stdout.strip()[-1500:]
+        if tail_text:
+            tail = f"\nStaging suite stdout (tail):\n```\n{tail_text}\n```\n"
+
+    note = (
+        "Staging suite did NOT execute (tool-error) and the infra-retry budget was "
+        "exhausted -> fail-closed FAILED."
+        if tool_error
+        else f"Staging suite exit-code `{exit_code}` -> `staging_status: {status}`."
+    )
+
+    return (
+        "---\n"
+        f"staging_status: {status}\n"
+        f"work_item: {work_item_id}\n"
+        "stage: deploy-staging\n"
+        "author_agent: staging-runner\n"
+        f"status: {sub_status}\n"
+        f"created_at: {created}\n"
+        "model_used: n/a\n"
+        f"exit_code: {exit_code}\n"
+        f"base_url: {base_url}\n"
+        "---\n\n"
+        "# Staging Gate Log (deterministic runner, ORCH-115)\n\n"
+        f"{note}\n\n"
+        "Вердикт зафиксирован детерминированным staging-раннером (ORCH-115), не LLM. "
+        "infra-tolerance (ORCH-061) уже учтена внутри `staging_check.py` — раннер её не "
+        "пересуживает.\n"
+        f"{waived_body}"
+        f"{tail}"
+    )
+
+
+def write_staging_log(
+    repo: str, work_item_id: str, branch: str, exit_code, status: str,
+    stdout: str = "", *, tool_error: bool = False,
+) -> bool:
+    """Write ``15-staging-log.md`` into the task worktree (so ``check_staging_status``
+    reads it) and best-effort commit+push it to the FEATURE BRANCH. Returns True iff
+    the file was written. Never raises.
+
+    Mirror of ``self_deploy.write_deploy_log`` EXCEPT: the actor name is
+    ``staging-runner`` and the log is pushed only to the feature branch — there is NO
+    separate PR-merge of the log into ``main`` (the gate reads the worktree first;
+    excluding any direct work on ``main`` strengthens AC-8 / BR-7). The feature branch
+    is merged into ``main`` later by the normal merge-gate path.
+    """
+    import os
+    import subprocess
+    from .git_worktree import get_worktree_path
+
+    rel = f"docs/work-items/{work_item_id}/15-staging-log.md"
+    try:
+        wt = get_worktree_path(repo, branch)
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.error("write_staging_log: worktree error for %s/%s: %s", repo, branch, e)
+        return False
+
+    path = os.path.join(wt, rel)
+    content = build_staging_log(work_item_id, exit_code, status, stdout, tool_error=tool_error)
+    try:
+        os.makedirs(os.path.dirname(path), exist_ok=True)
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(content)
+    except OSError as e:
+        logger.error("write_staging_log: write error at %s: %s", path, e)
+        return False
+
+    # Best-effort commit + push to the feature branch (the gate also falls back to
+    # origin/main). ORCH-101: HOME + email domain from Settings; the actor NAME stays
+    # the platform literal `staging-runner` (deterministic system-actor commits).
+    _email = f"staging-runner@{settings.git_email_domain}"
+    git_env = {
+        **os.environ,
+        "HOME": settings.agent_home_dir,
+        "GIT_AUTHOR_NAME": "staging-runner",
+        "GIT_AUTHOR_EMAIL": _email,
+        "GIT_COMMITTER_NAME": "staging-runner",
+        "GIT_COMMITTER_EMAIL": _email,
+    }
+    try:
+        subprocess.run(["git", "-C", wt, "add", rel],
+                       capture_output=True, timeout=_GIT_TIMEOUT, env=git_env)
+        commit = subprocess.run(
+            ["git", "-C", wt, "commit", "-m",
+             f"staging(ORCH-115): staging gate {status} for {work_item_id}"],
+            capture_output=True, text=True, timeout=_GIT_TIMEOUT, env=git_env,
+        )
+        if commit.returncode == 0:
+            subprocess.run(["git", "-C", wt, "push", "origin", branch],
+                           capture_output=True, timeout=_GIT_TIMEOUT, env=git_env)
+    except (subprocess.SubprocessError, OSError) as e:
+        logger.warning("write_staging_log: git commit/push best-effort failed: %s", e)
+    return True
+
+
+# ---------------------------------------------------------------------------
+# Existing-gate initiation (D7 / FR-5 / AC-4) — no new routing branch
+# ---------------------------------------------------------------------------
+def _advance(task_id, repo: str, work_item_id: str, branch: str) -> None:
+    """Initiate the SAME ``advance_stage`` evaluation a finished LLM-deployer would
+    (``finished_agent="deployer"`` on ``deploy-staging``): SUCCESS -> sub-gates
+    security->merge->coverage->image-freshness (ORCH-022/043/027/058) + Phase A
+    (ORCH-036); FAILED -> the existing rollback deploy-staging -> development
+    (``stage_engine.py:932``). No new routing branch. The transition-lease (ORCH-114)
+    is taken INSIDE advance_stage on the side-effectful edge — the runner never
+    touches it (task boundary O1). never-raise."""
+    try:
+        from . import stage_engine
+        stage_engine.advance_stage(
+            task_id=task_id,
+            current_stage="deploy-staging",
+            repo=repo,
+            work_item_id=work_item_id,
+            branch=branch,
+            finished_agent="deployer",
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise into the worker
+        logger.error(
+            "staging_runner._advance: advance_stage failed for task %s (%s): %s",
+            task_id, work_item_id, e,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Two-level outcome (D5) — tool-error DEFER bookkeeping
+# ---------------------------------------------------------------------------
+def _infra_retry_count(task_id) -> int:
+    """How many times this task was re-queued by the tool-error DEFER path
+    (restart-safe; counted from the persisted jobs queue by the marker — mirror of
+    ``stage_engine._merge_infra_retry_count``). Never raises -> 0 on error."""
+    try:
+        from .db import get_db
+        conn = get_db()
+        n = conn.execute(
+            "SELECT COUNT(*) FROM jobs WHERE task_id=? AND task_content LIKE ?",
+            (task_id, f"%{_INFRA_RETRY_MARKER}%"),
+        ).fetchone()[0]
+        conn.close()
+        return int(n)
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("staging_runner._infra_retry_count error for %s: %s", task_id, e)
+        return 0
+
+
+def _handle_tool_error(
+    task_id, repo: str, work_item_id: str, branch: str, result: proc_group.ProcResult
+) -> None:
+    """Suite did NOT execute (tool-error) -> bounded DEFER, then fail-closed (D5).
+
+    Anti ORCH-110: an infra fault is NOT a code fault, so we re-queue a fresh
+    ``deployer`` job (which re-enters this runner) with a delay instead of an
+    immediate FAILED-rollback that would burn a developer-retry. On budget exhaustion
+    -> write ``staging_status: FAILED`` + advance (the existing rollback) + an
+    INFRA-specific alert (explicitly "not a code defect"). Never a silent advance /
+    false green; never wedges the queue. never-raise."""
+    retries = _infra_retry_count(task_id)
+    try:
+        max_retries = int(settings.staging_runner_infra_max_retries)
+    except (TypeError, ValueError):
+        max_retries = 2
+    try:
+        delay = int(settings.staging_runner_infra_retry_delay_s)
+    except (TypeError, ValueError):
+        delay = 30
+
+    if retries < max_retries:
+        _bump("deferred")
+        reason = "timeout" if result.timed_out else "suite did not execute (tool-error)"
+        task_desc = (
+            f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
+            f"Stage: deploy-staging\nNote: {_INFRA_RETRY_MARKER} "
+            f"(attempt {retries + 1}/{max_retries}) — {reason}, retrying after {delay}s."
+        )
+        try:
+            from .db import enqueue_job
+            new_job = enqueue_job(
+                "deployer", repo, task_desc, task_id=task_id, available_at_delay_s=delay,
+            )
+            logger.warning(
+                "Task %s (%s): staging suite did not execute (%s) -> infra-DEFER "
+                "(job_id=%s, attempt %d/%d)",
+                task_id, work_item_id, reason, new_job, retries + 1, max_retries,
+            )
+        except Exception as e:  # noqa: BLE001 - never-raise
+            logger.error("staging_runner: infra-DEFER enqueue failed for %s: %s", task_id, e)
+        return
+
+    # Budget exhausted -> fail-closed FAILED (terminal, never a false green).
+    _bump("failed")
+    logger.error(
+        "Task %s (%s): staging tool-error DEFER budget exhausted (%d) -> fail-closed FAILED",
+        task_id, work_item_id, max_retries,
+    )
+    write_staging_log(repo, work_item_id, branch, result.returncode, "FAILED",
+                      result.stdout, tool_error=True)
+    _alert_infra_exhausted(work_item_id, max_retries)
+    _advance(task_id, repo, work_item_id, branch)
+
+
+def _alert_infra_exhausted(work_item_id: str, max_retries: int) -> None:
+    """Best-effort Telegram alert that the staging suite never executed (infra, NOT a
+    code defect) after the retry budget. never-raise."""
+    try:
+        from .notifications import send_telegram, link_for
+        send_telegram(
+            f"\U0001f6a8 {link_for(work_item_id)}: staging suite не запустилась "
+            f"(инфра, НЕ дефект кода) после {max_retries} попыток — fail-closed FAILED, "
+            f"откат на development. Нужно проверить staging-стенд."
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("staging_runner: infra-exhausted alert failed for %s: %s", work_item_id, e)
+
+
+# ---------------------------------------------------------------------------
+# Entry point (D2) — owns the full deterministic flow, mirror run_deploy_finalizer
+# ---------------------------------------------------------------------------
+def run_staging_gate(job: dict) -> None:
+    """Deterministic staging gate for a ``deployer`` job on ``deploy-staging``.
+
+    Flow (mirror of ``stage_engine.run_deploy_finalizer``):
+      1. resolve ``work_item_id`` / ``branch`` by ``task_id``;
+      2. execute the staging suite (D3) -> ProcResult;
+      3. suite EXECUTED -> map exit-code -> ``staging_status:``, write
+         ``15-staging-log.md``, initiate the existing gate via ``advance_stage`` (D7);
+      4. suite did NOT execute (tool-error) -> bounded DEFER / fail-closed (D5);
+      5. observability counters + one structured verdict log (D10).
+    Never raises into the caller (the launcher marks the job done/failed)."""
+    started = time.time()
+    _bump("runs")
+    task_id = job.get("task_id")
+    repo = job.get("repo")
+
+    # 1. resolve task fields.
+    work_item_id, branch = None, None
+    try:
+        from .db import get_db
+        conn = get_db()
+        row = conn.execute(
+            "SELECT work_item_id, branch FROM tasks WHERE id=?", (task_id,)
+        ).fetchone()
+        conn.close()
+        if row:
+            work_item_id, branch = row[0], row[1]
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.error("staging_runner: task lookup failed for task_id=%s: %s", task_id, e)
+    if not work_item_id or not branch:
+        logger.error(
+            "staging_runner: missing work_item_id/branch for task_id=%s — aborting", task_id
+        )
+        return
+
+    # 2-4. execute + classify + route — guarded so AC-7 (never-raise) holds even if
+    # an unexpected error escapes a sub-step (the worker must never crash on staging
+    # infra; the task is left on deploy-staging for the reconciler/reaper to re-drive).
+    try:
+        result = run_staging_suite()
+        duration_s = round(time.time() - started, 1)
+        suite_ran = (result.returncode is not None) and (not result.timed_out)
+
+        if suite_ran:
+            # 3. trust the exit-code (ORCH-061 already inside staging_check.py).
+            status = map_exit_code_to_status(result.returncode)
+            _bump("success" if status == "SUCCESS" else "failed")
+            logger.info(
+                "staging_runner verdict: work_item=%s repo=%s exit_code=%s status=%s "
+                "duration_s=%s outcome=%s",
+                work_item_id, repo, result.returncode, status, duration_s,
+                "code-pass" if status == "SUCCESS" else "code-fail",
+            )
+            write_staging_log(repo, work_item_id, branch, result.returncode, status, result.stdout)
+            _advance(task_id, repo, work_item_id, branch)
+            return
+
+        # 4. tool-error (suite did not execute) -> DEFER / fail-closed (D5).
+        _bump("tool_error")
+        logger.warning(
+            "staging_runner verdict: work_item=%s repo=%s exit_code=%s status=%s "
+            "duration_s=%s outcome=tool-error (timed_out=%s)",
+            work_item_id, repo, result.returncode, "TOOL-ERROR", duration_s, result.timed_out,
+        )
+        _handle_tool_error(task_id, repo, work_item_id, branch, result)
+    except Exception as e:  # noqa: BLE001 - never-raise into the worker (AC-7)
+        logger.error(
+            "staging_runner.run_staging_gate: unexpected error for task %s (%s): %s",
+            task_id, work_item_id, e,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Observability (D10 / FR-7 / AC-10)
+# ---------------------------------------------------------------------------
+def snapshot() -> dict:
+    """Read-only staging-runner summary for ``GET /queue`` (FR-7 / AC-10).
+
+    Additive block; existing ``/queue`` keys are untouched. never-raise: any error ->
+    a minimal dict with the kill-switch state."""
+    try:
+        return {
+            "enabled": bool(settings.staging_runner_enabled),
+            "repos": getattr(settings, "staging_runner_repos", "") or "",
+            "timeout_s": getattr(settings, "staging_runner_timeout_s", _DEFAULT_TIMEOUT_S),
+            "infra_max_retries": getattr(settings, "staging_runner_infra_max_retries", 2),
+            "runs": _STAGING_RUNNER_COUNTERS["runs"],
+            "success": _STAGING_RUNNER_COUNTERS["success"],
+            "failed": _STAGING_RUNNER_COUNTERS["failed"],
+            "tool_error": _STAGING_RUNNER_COUNTERS["tool_error"],
+            "deferred": _STAGING_RUNNER_COUNTERS["deferred"],
+        }
+    except Exception as e:  # noqa: BLE001 - never-raise -> minimal dict
+        logger.warning("staging_runner.snapshot error: %s", e)
+        return {"enabled": False}