feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115)

Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting orchestrator) with a deterministic staging-runner intercepted in launch_job BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent precedent). The runner executes the SAME staging suite, maps the exit-code to `staging_status:` via the existing self_deploy.map_exit_code_to_status contract, writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate exactly as a finished LLM-deployer would.

Invariant (NFR-1): this replaces only the *producer* of the artifact. The artifact contract, the gate / _parse_staging_status / check_staging_status name, STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV, never-raise, fail-safe back to the LLM path.

Two-level outcome (D5, anti ORCH-110):
- suite executed -> verdict -> advance (FAILED -> the existing deploy-staging -> development rollback + developer-retry, same as a FAILED LLM verdict)
- tool-error (suite did not execute) -> bounded DEFER -> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent advance / false green)

First implemented slice of the LLM determinization roadmap (ORCH-118 A6, replace-deterministic-now).

- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks + single-transport invariant intact), deployer.md (LLM branch -> fallback), CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security), .env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)

Refs: ORCH-115
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
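
Reviewer note: the two-level outcome (D5) in the message above can be read as a small decision function. The sketch below is illustrative only and is not the code in src/staging_runner.py. The names `Outcome`, `decide`, `prior_defers` and `infra_budget` are hypothetical, and the `PASSED` value and the "exit code 0 means pass" mapping are assumptions standing in for self_deploy.map_exit_code_to_status, which this diff does not show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Outcome:
    action: str                    # "advance", "defer" or "fail_closed"
    staging_status: Optional[str]  # verdict to record; None while deferring
    alert: bool = False            # raised only when the retry budget is exhausted


def decide(exit_code: Optional[int], prior_defers: int, infra_budget: int) -> Outcome:
    """Hypothetical two-level decision.

    exit_code is None when the suite did not execute at all (tool-error).
    prior_defers counts the infra DEFERs already spent on this task.
    """
    if exit_code is not None:
        # Level 1: the suite ran, so its verdict is trusted and the task
        # advances. A FAILED verdict takes the existing rollback path.
        # ASSUMPTION: 0 -> PASSED, anything else -> FAILED.
        status = "PASSED" if exit_code == 0 else "FAILED"
        return Outcome("advance", status)

    # Level 2: the suite never ran. This is an infra fault, not a code fault,
    # so retry within a bounded budget instead of blaming the developer.
    if prior_defers < infra_budget:
        return Outcome("defer", None)

    # Budget exhausted: fail closed and alert. Never advance silently.
    return Outcome("fail_closed", "FAILED", alert=True)
```

The property worth checking in review is that no branch returns a passing status when `exit_code` is None, which is what rules out a false green.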
2026-06-16 01:59:43 +03:00
parent f120e4bd8f
commit b50cf1dd08
16 changed files with 1235 additions and 7 deletions
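
Reviewer note: src/staging_runner.py is not part of the hunks shown here, so the body of `staging_runner.should_intercept` is not visible. The following is a minimal sketch of what such a predicate might look like, based only on the commit message (kill-switch, repo-scope CSV, never-raise) and on the launcher docstring (the discriminator is the task stage, not the role name). The function name `should_intercept_sketch`, the explicit `enabled` / `repos_csv` parameters and the `stage` job key are assumptions; the real function takes only `job` and presumably reads the ORCH_STAGING_RUNNER_* config itself.

```python
def should_intercept_sketch(job, *, enabled: bool, repos_csv: str) -> bool:
    """Hypothetical predicate: True only for in-scope deploy-staging jobs.

    Never raises: any unexpected input falls back to False, which means the
    prior LLM deployer runs via _spawn.
    """
    try:
        if not enabled:
            # Kill-switch off -> always take the LLM path.
            return False
        # ASSUMPTION: the job dict carries the task stage under "stage".
        if job.get("stage") != "deploy-staging":
            # The prod `deploy` stage also uses the deployer role, so the
            # stage (not the agent name) is what separates the two.
            return False
        # Repo scope is a comma-separated list; ignore blanks and whitespace.
        allowed = {r.strip() for r in repos_csv.split(",") if r.strip()}
        return job.get("repo") in allowed
    except Exception:
        # Fail-safe: when in doubt, do not intercept.
        return False
```

With this shape, an empty `repos_csv` intercepts nothing, so enabling the kill-switch alone would not change behaviour for any repo until a scope is configured.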
--- a/src/agents/launcher.py
+++ b/src/agents/launcher.py
@@ -385,6 +385,14 @@ class AgentLauncher:
        (no-LLM) job — intercept it BEFORE _spawn (which would raise
        "Unknown agent", R-6) and run the deploy finalizer synchronously, driving
        the jobs row status itself. Returns None (no agent_run row).
+
+        ORCH-115: the LLM ``deployer`` on the ``deploy-staging`` stage (self-hosting
+        scope) is replaced by a DETERMINISTIC staging-runner — intercepted here
+        BEFORE _spawn (same precedent as deploy-finalizer / post-deploy-monitor). The
+        discriminator is the TASK STAGE (deploy-staging), not the role name, so the
+        prod ``deploy`` deployer is never caught (staging_runner.should_intercept).
+        Kill-switch off / out of scope -> should_intercept False -> the prior LLM
+        deployer runs via _spawn byte-for-byte.
        """
        if job.get("agent") == "deploy-finalizer":
            return self._run_deploy_finalizer_job(job)
@@ -393,6 +401,11 @@ class AgentLauncher:
        # observation tick synchronously. Returns None (no agent_run row).
        if job.get("agent") == "post-deploy-monitor":
            return self._run_post_deploy_monitor_job(job)
+        # ORCH-115: deterministic staging-runner intercept (BEFORE _spawn).
+        if job.get("agent") == "deployer":
+            from .. import staging_runner
+            if staging_runner.should_intercept(job):
+                return self._run_staging_runner_job(job)
        return self._spawn(
            job["agent"],
            job["repo"],
@@ -422,6 +435,28 @@ class AgentLauncher:
                pass
        return None

+    def _run_staging_runner_job(self, job: dict):
+        """ORCH-115: run the deterministic staging gate for a deployer job.
+
+        Not an LLM spawn — there is no subprocess/monitor of an agent, so we mark the
+        jobs row done/failed here (mirror of _run_deploy_finalizer_job). The runner
+        never-raises, but we guard anyway so a runner fault can't wedge the worker.
+        Returns None (no agent_run row, _spawn not called).
+        """
+        from ..db import mark_job
+        from .. import staging_runner
+        try:
+            staging_runner.run_staging_gate(job)
+            mark_job(job["id"], "done")
+            logger.info(f"staging-runner job {job['id']} done")
+        except Exception as e:
+            logger.error(f"staging-runner job {job['id']} failed: {e}")
+            try:
+                mark_job(job["id"], "failed", error=f"staging-runner error: {e}")
+            except Exception:
+                pass
+        return None
+
    def _run_post_deploy_monitor_job(self, job: dict):
        """ORCH-021: run one deterministic post-deploy monitor tick for a job.