feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115)

Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting orchestrator) with a deterministic staging-runner intercepted in launch_job BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent precedent). The runner executes the SAME staging suite, maps the exit code to `staging_status:` via the existing self_deploy.map_exit_code_to_status contract, writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate exactly as a finished LLM-deployer would.

Invariant (NFR-1): this replaces only the *producer* of the artifact. The artifact contract, the gate / _parse_staging_status / check_staging_status name, STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV, never-raise, fail-safe back to the LLM path.

Two-level outcome (D5, anti ORCH-110):
- suite executed -> verdict -> advance (FAILED -> the existing deploy-staging -> development rollback + developer-retry, same as a FAILED LLM verdict);
- tool-error (suite did not execute) -> bounded DEFER -> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent advance / false green).

First implemented slice of the LLM determinization roadmap (ORCH-118 A6, replace-deterministic-now).

- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks + single-transport invariant intact), deployer.md (LLM branch -> fallback), CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security), .env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)

Refs: ORCH-115
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
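
The two-level outcome described in the commit message can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in src/staging_runner.py: the `Outcome` dataclass, the `classify` function, the `infra_attempts` counter and the `SUCCESS` status name are hypothetical, and `map_exit_code_to_status` here is only a stand-in for the real self_deploy contract, whose mapping is not shown in this commit. A `returncode` of `None` stands for every tool-error case (spawn-error, timeout, no return code).

```python
from dataclasses import dataclass
from typing import Optional


def map_exit_code_to_status(returncode: int) -> str:
    """Stand-in for self_deploy.map_exit_code_to_status (assumed mapping)."""
    return "SUCCESS" if returncode == 0 else "FAILED"


@dataclass
class Outcome:
    action: str                            # "advance" | "defer" | "fail_closed"
    staging_status: Optional[str] = None   # value written under `staging_status:`
    alert: bool = False


def classify(returncode: Optional[int], infra_attempts: int,
             infra_max_retries: int = 2) -> Outcome:
    """Split 'suite executed' from 'suite did not execute'.

    infra_attempts is assumed to count tool-error retries already consumed.
    """
    if returncode is not None:
        # Level 1: the suite ran, so its exit code is a real verdict.
        # The existing gate then advances or rolls back on FAILED.
        return Outcome("advance", map_exit_code_to_status(returncode))
    if infra_attempts < infra_max_retries:
        # Level 2: infra hiccup, re-queue without burning a developer-retry.
        return Outcome("defer")
    # Budget exhausted: fail closed and alert, never a silent advance.
    return Outcome("fail_closed", "FAILED", alert=True)
```

With the default budget of 2, a tool-error is deferred on attempts 0 and 1 and fails closed on attempt 2, while any executed suite yields a verdict immediately regardless of the retry counter.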
2026-06-16 01:59:43 +03:00
parent f120e4bd8f
commit b50cf1dd08
16 changed files with 1235 additions and 7 deletions
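
The kill-switch, CSV scope and timeout fallback rules documented in the src/config.py comment of this diff could be resolved along these lines. This is a hedged sketch only: `resolve_timeout_s`, `in_scope` and `runner_applies` are hypothetical helper names, the string comparison against `"orchestrator"` stands in for the real is_self_hosting_repo check, and the actual validation in the repository may live elsewhere.

```python
import logging

logger = logging.getLogger("staging_runner_sketch")

DEFAULT_TIMEOUT_S = 600  # default of staging_runner_timeout_s in this diff


def resolve_timeout_s(raw: object, default: int = DEFAULT_TIMEOUT_S) -> int:
    """Malformed or non-positive values fall back to the default with a WARNING."""
    try:
        value = int(raw)
    except (TypeError, ValueError, OverflowError):
        value = 0
    if value <= 0:
        logger.warning("invalid staging_runner_timeout_s=%r; using %ds", raw, default)
        return default
    return value


def in_scope(repo: str, repos_csv: str, self_hosting_repo: str = "orchestrator") -> bool:
    """Empty CSV -> self-hosting repo only; non-empty CSV -> membership."""
    entries = {item.strip() for item in repos_csv.split(",") if item.strip()}
    if not entries:
        # Stand-in for is_self_hosting_repo (assumed to match on repo name).
        return repo == self_hosting_repo
    return repo in entries


def runner_applies(enabled: bool, repo: str, repos_csv: str) -> bool:
    """Kill-switch first: when disabled, the LLM deployer path is left untouched."""
    return enabled and in_scope(repo, repos_csv)
```

Checking the kill-switch before the scope keeps the disabled path trivially identical to the pre-ORCH-115 behaviour, since no scope parsing runs at all.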
--- a/src/config.py
+++ b/src/config.py
@@ -413,6 +413,51 @@ class Settings(BaseSettings):
    coverage_tool_fail_closed: bool = False
    coverage_run_timeout_s: int = 900

+    # ORCH-115: deterministic staging-runner replacing the LLM `deployer` agent on
+    # the `deploy-staging` stage for the self-hosting orchestrator. A new leaf
+    # src/staging_runner.py (never-raise) is intercepted in launch_job BEFORE _spawn
+    # (mirroring the deploy-finalizer / post-deploy-monitor reserved-agent
+    # precedent, launcher.py:389/394): it runs the SAME staging suite the LLM ran
+    # (`docker exec orchestrator-staging python3 .../staging_check.py`), maps the
+    # exit-code -> staging_status: via the existing self_deploy.map_exit_code_to_status
+    # contract, writes 15-staging-log.md, and initiates the EXISTING check_staging_status
+    # gate exactly as a finished LLM-deployer would. The artifact contract, the gate,
+    # STAGE_TRANSITIONS and the DB schema are byte-for-byte UNCHANGED — this only
+    # replaces the *producer* of the artifact. Pattern = coverage_gate_* / self_deploy_*.
+    # See docs/work-items/ORCH-115/06-adr/ADR-001-deterministic-staging-runner.md and
+    # docs/architecture/adr/adr-0048-deterministic-staging-runner.md.
+    #   staging_runner_enabled            -> SINGLE kill-switch (env
+    #                                        ORCH_STAGING_RUNNER_ENABLED). False -> the
+    #                                        intercept never fires -> the prior LLM
+    #                                        deployer runs on deploy-staging via _spawn
+    #                                        byte-for-byte as before ORCH-115 (D8/AC-6).
+    #   staging_runner_repos              -> CSV scope (env ORCH_STAGING_RUNNER_REPOS).
+    #                                        Empty -> self-hosting only (orchestrator)
+    #                                        via is_self_hosting_repo; non-empty ->
+    #                                        membership. Mirrors coverage_gate_repos.
+    #   staging_runner_timeout_s          -> wall-clock budget for the docker-exec
+    #                                        staging suite (env ORCH_STAGING_RUNNER_TIMEOUT_S).
+    #                                        Malformed/non-positive -> default + WARNING
+    #                                        (never-break). Aligned with the cross-cutting
+    #                                        budget invariant ORCH-065/109/110 WITHOUT
+    #                                        touching reaper_max_running_s (D9): it replaces
+    #                                        the up-to-900s LLM staging window with a bounded
+    #                                        <=600s deterministic one (Σ on the edge does not grow).
+    #   staging_runner_infra_max_retries  -> tool-error (suite did NOT execute: spawn-error /
+    #                                        timeout / returncode None) bounded DEFER budget
+    #                                        before a fail-closed FAILED (env
+    #                                        ORCH_STAGING_RUNNER_INFRA_MAX_RETRIES). Mirrors
+    #                                        merge_retest_infra_max_retries — infra hiccup is
+    #                                        NOT a code-fault, so it never burns a developer-retry
+    #                                        until the budget is exhausted (D5, anti ORCH-110).
+    #   staging_runner_infra_retry_delay_s-> delay before the re-queued deployer job
+    #                                        (env ORCH_STAGING_RUNNER_INFRA_RETRY_DELAY_S).
+    staging_runner_enabled: bool = True
+    staging_runner_repos: str = ""
+    staging_runner_timeout_s: int = 600
+    staging_runner_infra_max_retries: int = 2
+    staging_runner_infra_retry_delay_s: int = 30
+
    # ORCH-098 (FND/F2): machine lessons-journal — additive `lessons` table + leaf
    # src/lessons.py (never-raise observer, following the pattern of serial_gate/coverage_gate/
    # metrics). The journal is an OBSERVER, never a Quality Gate: writing a lesson