Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting
orchestrator) with a deterministic staging-runner intercepted in launch_job
BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent
precedent). The runner executes the SAME staging suite, maps the exit-code to
`staging_status:` via the existing self_deploy.map_exit_code_to_status contract,
writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate
exactly as a finished LLM-deployer would.
Invariant (NFR-1): this replaces only the *producer* of the artifact — the
artifact contract, the gate / _parse_staging_status / check_staging_status name,
STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are
byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV,
never-raise, fail-safe back to the LLM path.
Two-level outcome (D5, anti ORCH-110): suite executed -> verdict -> advance
(FAILED -> the existing deploy-staging -> development rollback + developer-retry,
same as a FAILED LLM verdict); tool-error (suite did not execute) -> bounded DEFER
-> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent
advance / false green).
First implemented slice of the LLM determinization roadmap (ORCH-118 A6,
replace-deterministic-now).
- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks +
single-transport invariant intact), deployer.md (LLM branch -> fallback),
CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security),
.env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)
Refs: ORCH-115
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).
Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.
- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
pytest in its own process group (start_new_session) and tree-kills the WHOLE
group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
already CI/tester/staging-validated); fail-safe runs the re-test on any
uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
("ORCH-110","classify_retest_failure","src/merge_gate.py") to
MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
.env.example) updated in the same PR.
Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>