fix(staging): host-side ssh execution + env classification for staging-runner (ORCH-123)
The ORCH-115 deterministic staging-runner ran `docker exec` FROM INSIDE the prod `orchestrator` container, which ships only `openssh-client git curl` — no `docker` CLI (Dockerfile:11). `Popen(["docker", ...])` hit FileNotFoundError -> a PERMANENT environment defect that was mis-routed as a code-fail rollback `deploy-staging -> development` (burning developer-retries). Incident ORCH-116: every self-hosting task reaching deploy-staging was doomed to a false rollback. Fix (adr-0049, additive, flag-gated, never-raise, self-hosting scope; the gate / artifact contract / STAGE_TRANSITIONS / DB schema are byte-for-byte unchanged): - D1: build_staging_command() wraps the SAME `docker exec ... staging_check.py ... --mode stub` in `ssh <user@host> '<...>'` so it runs HOST-SIDE over the existing trusted ssh channel (mirror self_deploy / image_freshness). New flag staging_runner_exec_host_side (default True). No docker CLI/SDK added to the image, docker.sock not used in-container (D2 security). - D3: three-way classify_staging_outcome (suite-ran / permanent-env / transient-infra), disambiguating the exit=1 collision by scanning stderr. - D4: invariant "infra != code-fail" — permanent-env / exhausted transient-infra end in an infra-HOLD (no rollback, no developer-retry), NOT a false FAILED rollback (supersedes ORCH-115 D5). A really-executed failing suite still rolls back (anti-over-tolerance). R-2 verified: a held deploy-staging task is not rolled back by the reconciler. - D5: prod-like preflight() of the host-side channel at startup (main.lifespan, best-effort, never blocks). - D8: snapshot adds permanent_env / exec_host_side / preflight. Docs (golden source, same PR): INFRA.md execution-boundary section, architecture/README.md, CLAUDE.md, CHANGELOG.md, .env.example. Tests: tests/test_orch123_staging_runner_exec.py (TC-01 mandatory regression red->green; TC-02..TC-14 + R-2). ORCH-115 anti-drift green (3 tests updated for the D1/D4/D8 supersession). Full suite: 2131 passed. Refs: ORCH-123 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
11
.env.example
11
.env.example
@@ -569,14 +569,21 @@ ORCH_COVERAGE_RUN_TIMEOUT_S=900
|
||||
# STAGING_RUNNER_REPOS -> CSV scope; empty -> self-hosting only.
|
||||
# STAGING_RUNNER_TIMEOUT_S -> wall-clock budget for the docker-exec suite
|
||||
# (malformed/non-positive -> default 600 + WARNING).
|
||||
# STAGING_RUNNER_INFRA_MAX_RETRIES -> tool-error (suite did NOT execute) bounded DEFER
|
||||
# budget before a fail-closed FAILED (anti ORCH-110).
|
||||
# STAGING_RUNNER_INFRA_MAX_RETRIES -> transient-infra (timeout/ssh) bounded DEFER
|
||||
# budget before an infra-HOLD (anti ORCH-110).
|
||||
# STAGING_RUNNER_INFRA_RETRY_DELAY_S-> delay before the re-queued deployer job.
|
||||
# STAGING_RUNNER_EXEC_HOST_SIDE -> ORCH-123 (adr-0049): true (default = prod) wraps
|
||||
# the `docker exec` in `ssh <user@host> '<...>'` so
|
||||
# the suite runs HOST-SIDE (the prod container ships
|
||||
# no docker CLI; incident ORCH-116). false -> the
|
||||
# prior in-container `docker exec` (valid only where
|
||||
# a docker CLI is baked into the image). Rollback knob.
|
||||
ORCH_STAGING_RUNNER_ENABLED=true
|
||||
ORCH_STAGING_RUNNER_REPOS=
|
||||
ORCH_STAGING_RUNNER_TIMEOUT_S=600
|
||||
ORCH_STAGING_RUNNER_INFRA_MAX_RETRIES=2
|
||||
ORCH_STAGING_RUNNER_INFRA_RETRY_DELAY_S=30
|
||||
ORCH_STAGING_RUNNER_EXEC_HOST_SIDE=true
|
||||
|
||||
# ORCH-057 (follow-up ORCH-040): legacy root-owned ownership detect + actionable
|
||||
# worktree error. After the uid migration (user: "1000:1000") legacy root:root files
|
||||
|
||||
Reference in New Issue
Block a user