feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115)

Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting orchestrator) with a deterministic staging-runner intercepted in launch_job BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent precedent). The runner executes the SAME staging suite, maps the exit code to `staging_status:` via the existing self_deploy.map_exit_code_to_status contract, writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate exactly as a finished LLM deployer would.

Invariant (NFR-1): this replaces only the *producer* of the artifact. The artifact contract, the gate / _parse_staging_status / check_staging_status name, STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are byte-for-byte unchanged. The change is additive, sits under a kill-switch + repo-scope CSV, is never-raise, and fails safe back to the LLM path.

Two-level outcome (D5, anti ORCH-110):
- suite executed -> verdict -> advance (FAILED -> the existing deploy-staging -> development rollback + developer-retry, same as a FAILED LLM verdict);
- tool-error (suite did not execute) -> bounded DEFER -> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent advance / false green).

First implemented slice of the LLM determinization roadmap (ORCH-118 A6, replace-deterministic-now).

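
The two-level outcome described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/staging_runner.py: `resolve_outcome`, `Outcome`, the `run_suite` callable (returning `None` when the suite could not execute) and the in-process retry loop are all assumptions made for the example. The real runner defers the job rather than looping, and the "PASSED" value used in the note below is a placeholder for whatever self_deploy.map_exit_code_to_status returns.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    status: str           # value written after `staging_status:`
    suite_executed: bool  # False means a tool-error, not a code fault
    attempts: int
    alert: bool = False   # raised only when the infra-retry budget is exhausted


def resolve_outcome(
    run_suite: Callable[[], Optional[int]],
    map_exit_code: Callable[[int], str],
    max_infra_retries: int,
) -> Outcome:
    """Level 1: the suite ran, so its exit code decides the verdict.
    Level 2: the suite never ran, so retry within a bounded budget and
    fail closed (FAILED + alert) instead of advancing silently."""
    attempts = 0
    for _ in range(max_infra_retries + 1):
        attempts += 1
        exit_code = run_suite()  # hypothetical: None == suite did not execute
        if exit_code is not None:
            return Outcome(map_exit_code(exit_code), True, attempts)
    # Budget exhausted: never report a false green.
    return Outcome("FAILED", False, attempts, alert=True)
```

With a mapper such as `lambda code: "PASSED" if code == 0 else "FAILED"`, a suite that exits non-zero yields a FAILED verdict with `alert=False` (a code fault routed to the existing rollback), while a suite that never starts yields FAILED with `alert=True` once the budget is spent.
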
- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks + single-transport invariant intact), deployer.md (LLM branch -> fallback), CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security), .env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)

Refs: ORCH-115
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
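
For readers unfamiliar with the kill-switch + repo-scope CSV gating, the sketch below shows one way such a check could look. It is an assumption-laden illustration: the real staging_runner.should_intercept takes only `job` and reads its own configuration, whereas here `enabled` and `repos_csv` are passed in explicitly, the `"stage"` job key is hypothetical, and treating an empty CSV as "no repo in scope" is a guess.

```python
from typing import Optional, Set


def parse_repo_scope(repos_csv: str) -> Set[str]:
    """Split a comma-separated repo list, ignoring blanks and stray spaces."""
    return {name.strip() for name in repos_csv.split(",") if name.strip()}


def should_intercept(job: Optional[dict], enabled: bool, repos_csv: str) -> bool:
    """Return True only when the deterministic runner should take the job.

    Never raises: any unexpected input falls back to False, which means
    the prior LLM deployer path runs unchanged.
    """
    try:
        if not enabled:                            # kill-switch off
            return False
        if job.get("stage") != "deploy-staging":   # hypothetical key: task stage
            return False                           # e.g. prod `deploy` is skipped
        return job.get("repo") in parse_repo_scope(repos_csv)
    except Exception:
        return False                               # fail-safe to the LLM path
```

Keying on the task stage rather than the role name is what keeps a prod `deploy` job for the same `deployer` agent out of the intercept.
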
@@ -385,6 +385,14 @@ class AgentLauncher:
         (no-LLM) job — intercept it BEFORE _spawn (which would raise
         "Unknown agent", R-6) and run the deploy finalizer synchronously, driving
         the jobs row status itself. Returns None (no agent_run row).
+
+        ORCH-115: the LLM ``deployer`` on the ``deploy-staging`` stage (self-hosting
+        scope) is replaced by a DETERMINISTIC staging-runner — intercepted here
+        BEFORE _spawn (same precedent as deploy-finalizer / post-deploy-monitor). The
+        discriminator is the TASK STAGE (deploy-staging), not the role name, so the
+        prod ``deploy`` deployer is never caught (staging_runner.should_intercept).
+        Kill-switch off / out of scope -> should_intercept False -> the prior LLM
+        deployer runs via _spawn byte-for-byte.
         """
         if job.get("agent") == "deploy-finalizer":
             return self._run_deploy_finalizer_job(job)
@@ -393,6 +401,11 @@ class AgentLauncher:
         # observation tick synchronously. Returns None (no agent_run row).
         if job.get("agent") == "post-deploy-monitor":
             return self._run_post_deploy_monitor_job(job)
+        # ORCH-115: deterministic staging-runner intercept (BEFORE _spawn).
+        if job.get("agent") == "deployer":
+            from .. import staging_runner
+            if staging_runner.should_intercept(job):
+                return self._run_staging_runner_job(job)
         return self._spawn(
             job["agent"],
             job["repo"],
@@ -422,6 +435,28 @@ class AgentLauncher:
                 pass
         return None
 
+    def _run_staging_runner_job(self, job: dict):
+        """ORCH-115: run the deterministic staging gate for a deployer job.
+
+        Not an LLM spawn — there is no subprocess/monitor of an agent, so we mark the
+        jobs row done/failed here (mirror of _run_deploy_finalizer_job). The runner
+        never-raises, but we guard anyway so a runner fault can't wedge the worker.
+        Returns None (no agent_run row, _spawn not called).
+        """
+        from ..db import mark_job
+        from .. import staging_runner
+        try:
+            staging_runner.run_staging_gate(job)
+            mark_job(job["id"], "done")
+            logger.info(f"staging-runner job {job['id']} done")
+        except Exception as e:
+            logger.error(f"staging-runner job {job['id']} failed: {e}")
+            try:
+                mark_job(job["id"], "failed", error=f"staging-runner error: {e}")
+            except Exception:
+                pass
+        return None
+
     def _run_post_deploy_monitor_job(self, job: dict):
         """ORCH-021: run one deterministic post-deploy monitor tick for a job.
 
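
The guard in _run_staging_runner_job can be exercised in isolation with stand-ins for the database and the runner. The sketch below restates the same try/except shape as a free function; `run_guarded`, the injected `run_gate` and `mark_job` callables, and the demo logger name are assumptions for the example and not part of the commit.

```python
import logging
from typing import Callable

logger = logging.getLogger("staging_runner_demo")  # hypothetical logger name


def run_guarded(job: dict, run_gate: Callable[[dict], None], mark_job: Callable) -> None:
    """Run the gate and record done/failed on the jobs row, never raising.

    Mirrors the shape of the method in the diff: the outer except catches a
    runner fault, the inner except swallows a failure to record that fault,
    so the worker loop is never wedged.
    """
    try:
        run_gate(job)
        mark_job(job["id"], "done")
        logger.info("staging-runner job %s done", job["id"])
    except Exception as e:
        logger.error("staging-runner job %s failed: %s", job["id"], e)
        try:
            mark_job(job["id"], "failed", error=f"staging-runner error: {e}")
        except Exception:
            pass  # nothing more can be done; stay silent rather than crash
    return None


if __name__ == "__main__":
    rows = []

    def fake_mark_job(job_id, status, error=None):
        rows.append((job_id, status, error))

    def failing_gate(job):
        raise RuntimeError("suite crashed")

    run_guarded({"id": 1}, lambda job: None, fake_mark_job)
    run_guarded({"id": 2}, failing_gate, fake_mark_job)
    print(rows)
```

One consequence of this shape worth keeping in mind: because `mark_job(..., "done")` sits inside the same `try`, an error while recording success is also routed to the `failed` branch.
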