feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115)

Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting
orchestrator) with a deterministic staging-runner intercepted in launch_job
BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent
precedent). The runner executes the SAME staging suite, maps the exit-code to
`staging_status:` via the existing self_deploy.map_exit_code_to_status contract,
writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate
exactly as a finished LLM-deployer would.

Invariant (NFR-1): this replaces only the *producer* of the artifact — the
artifact contract, the gate / _parse_staging_status / check_staging_status name,
STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are
byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV,
never-raise, fail-safe back to the LLM path.
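
The scoping rule above (kill-switch, repo-scope CSV, never-raise, fall back to the
LLM path) can be sketched roughly as follows. This is an illustration only, not the
code in src/staging_runner.py: the exact key names `ORCH_STAGING_RUNNER_ENABLED` and
`ORCH_STAGING_RUNNER_REPOS`, the `stage` field on the job dict, and the `env`
parameter are assumptions made for the example.

```python
import os


def should_intercept(job, env=None):
    """Hypothetical sketch: True only for an enabled, in-scope deploy-staging job.

    Any unexpected input yields False, so the caller falls back to the LLM path.
    """
    env = os.environ if env is None else env
    try:
        # Discriminate on the task stage, not only on the role name
        # (the "stage" key is an assumed field name).
        if job.get("agent") != "deployer" or job.get("stage") != "deploy-staging":
            return False
        # Kill-switch: anything other than an explicit truthy value means "off".
        flag = env.get("ORCH_STAGING_RUNNER_ENABLED", "").strip().lower()
        if flag not in {"1", "true", "yes", "on"}:
            return False
        # Repo scope: comma-separated allow-list, blanks ignored.
        raw = env.get("ORCH_STAGING_RUNNER_REPOS", "")
        repos = {r.strip() for r in raw.split(",") if r.strip()}
        return job.get("repo") in repos
    except Exception:
        # Never raise: a malformed job or env simply means "do not intercept".
        return False
```

With the switch off, or with a repo outside the list, the function returns False and
the job would continue to `_spawn` as before.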

Two-level outcome (D5, anti ORCH-110): suite executed -> verdict -> advance
(FAILED -> the existing deploy-staging -> development rollback + developer-retry,
same as a FAILED LLM verdict); tool-error (suite did not execute) -> bounded DEFER
-> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent
advance / false green).
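
A minimal sketch of the two-level outcome described above, under stated assumptions:
the `Outcome` type, the `decide` function, the action names "gate" and "defer", and
the passing status string "PASSED" are hypothetical (only FAILED appears in this
message, and the real mapping lives in self_deploy.map_exit_code_to_status).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    action: str                    # "gate" (hand a verdict to the gate) or "defer"
    staging_status: Optional[str]  # verdict value, None while deferring
    alert: bool = False            # True only when the infra budget is exhausted


def decide(executed: bool, exit_code: Optional[int],
           infra_attempts: int, infra_budget: int) -> Outcome:
    """Level 1: the suite ran, so its exit code becomes the verdict.
    Level 2: the suite did not run (tool error), so defer within a budget
    and fail closed once the budget is used up.
    """
    if executed:
        # "PASSED" is an assumed name for the non-failing verdict.
        return Outcome("gate", "PASSED" if exit_code == 0 else "FAILED")
    if infra_attempts < infra_budget:
        # infra_attempts counts deferrals already spent on this job.
        return Outcome("defer", None)
    # Budget exhausted: never a silent advance, report FAILED and raise an alert.
    return Outcome("gate", "FAILED", alert=True)
```

The point of the split is that an infrastructure fault is retried rather than blamed
on the code, while still ending in a visible FAILED instead of a false green.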

First implemented slice of the LLM determinization roadmap (ORCH-118 A6,
replace-deterministic-now).

- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks +
  single-transport invariant intact), deployer.md (LLM branch -> fallback),
  CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security),
  .env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)
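
For the "proc_group tree-kill + timeout" item in the list above, one common POSIX
pattern looks like the sketch below. It is not the code from src/staging_runner.py;
the function name `run_with_tree_kill` and the choice to report a timeout as "did not
execute" are assumptions made for the example.

```python
import os
import signal
import subprocess


def run_with_tree_kill(cmd, timeout):
    """Run cmd in its own process group; return (executed, exit_code). Never raises.

    POSIX only: relies on os.killpg and on start_new_session creating a new group.
    """
    try:
        # start_new_session=True makes the child a session/group leader, so its
        # process group id equals its pid and covers any grandchildren.
        proc = subprocess.Popen(cmd, start_new_session=True)
    except Exception:
        return False, None  # could not even start: a tool error, not a verdict
    try:
        return True, proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)  # kill the whole tree
            proc.wait(timeout=5)                 # reap the leader
        except Exception:
            pass
        # Assumption: a timeout is treated as "suite did not complete".
        return False, None
    except Exception:
        return False, None
```

Killing the group rather than only the child matters when the suite spawns its own
subprocesses, which would otherwise outlive the timeout.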

Refs: ORCH-115

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 01:59:43 +03:00
parent f120e4bd8f
commit b50cf1dd08
16 changed files with 1235 additions and 7 deletions


@@ -385,6 +385,14 @@ class AgentLauncher:
        (no-LLM) job: intercept it BEFORE _spawn (which would raise
        "Unknown agent", R-6) and run the deploy finalizer synchronously, driving
        the jobs row status itself. Returns None (no agent_run row).
        ORCH-115: the LLM ``deployer`` on the ``deploy-staging`` stage (self-hosting
        scope) is replaced by a DETERMINISTIC staging-runner, intercepted here
        BEFORE _spawn (same precedent as deploy-finalizer / post-deploy-monitor). The
        discriminator is the TASK STAGE (deploy-staging), not the role name, so the
        prod ``deploy`` deployer is never caught (staging_runner.should_intercept).
        Kill-switch off / out of scope -> should_intercept False -> the prior LLM
        deployer runs via _spawn byte-for-byte.
        """
        if job.get("agent") == "deploy-finalizer":
            return self._run_deploy_finalizer_job(job)
@@ -393,6 +401,11 @@ class AgentLauncher:
        # observation tick synchronously. Returns None (no agent_run row).
        if job.get("agent") == "post-deploy-monitor":
            return self._run_post_deploy_monitor_job(job)
        # ORCH-115: deterministic staging-runner intercept (BEFORE _spawn).
        if job.get("agent") == "deployer":
            from .. import staging_runner
            if staging_runner.should_intercept(job):
                return self._run_staging_runner_job(job)
        return self._spawn(
            job["agent"],
            job["repo"],
@@ -422,6 +435,28 @@ class AgentLauncher:
                pass
        return None

    def _run_staging_runner_job(self, job: dict):
        """ORCH-115: run the deterministic staging gate for a deployer job.
        Not an LLM spawn: there is no subprocess/monitor of an agent, so we mark the
        jobs row done/failed here (mirror of _run_deploy_finalizer_job). The runner
        never raises, but we guard anyway so a runner fault can't wedge the worker.
        Returns None (no agent_run row, _spawn not called).
        """
        from ..db import mark_job
        from .. import staging_runner
        try:
            staging_runner.run_staging_gate(job)
            mark_job(job["id"], "done")
            logger.info(f"staging-runner job {job['id']} done")
        except Exception as e:
            logger.error(f"staging-runner job {job['id']} failed: {e}")
            try:
                mark_job(job["id"], "failed", error=f"staging-runner error: {e}")
            except Exception:
                pass
        return None

    def _run_post_deploy_monitor_job(self, job: dict):
        """ORCH-021: run one deterministic post-deploy monitor tick for a job.