fix(ORCH-058): parametrize staging_check in --build-staging + explicit staging target

Round-3 review follow-up on c53d625 (P1/P2):

- P1: --build-staging now runs staging_check via parametrized
  STAGING_CONTAINER / STAGING_CHECK_PATH / STAGING_CHECK_MODE (default
  orchestrator-staging / bind-mount path / stub) instead of hardcoding
  $TARGET_SERVICE + the script path. docker exec runs INSIDE the staging
  container (ORCH-048 canonical: B6 registry isolation), after health,
  before exit 0. Fail-closed: any non-zero -> exit 1. STAGING only (8501).
- P2a: rebuild_staging_image now passes the STAGING target EXPLICITLY
  (TARGET_SERVICE/TARGET_PORT/COMPOSE_PROFILE/STAGING_CONTAINER) so the
  self-rebuild can never drift onto prod 8500 if hook defaults change (AC-9).
- P2b: TC-09 caller<->hook contract tests assert the ssh command carries
  GIT_SHA + BUILD_CONTEXT + the staging target and never the prod 8500 one;
  no-ssh-host fails closed.
- P3: consolidated the three duplicate README footers into one.
- Docs (golden source): DEPLOY_HOOK.md step 4 + env rows, README footer,
  CHANGELOG, Dockerfile ARG GIT_SHA="" comment, .env.example freshness block.

Validates exactly the artefact later BUILD-ONCE retagged to prod (AC-4,
ADR-001 step 3). 632 tests pass, ruff clean, bash -n OK.
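
The ordering described above (build, recreate, health, then staging_check, with any non-zero result mapped to exit 1) can be sketched in Python roughly as follows. This is a minimal illustration and not the hook itself: `run_build_staging` and the step callables are hypothetical names, and each step is assumed to return a process-style exit code.

```python
from typing import Callable, Sequence, Tuple

# (step name, callable returning a process-style exit code) -- hypothetical shape
Step = Tuple[str, Callable[[], int]]


def run_build_staging(steps: Sequence[Step]) -> Tuple[int, str]:
    """Run steps in order and stop at the first failure (fail-closed).

    Returns (0, ...) only if every step returned 0; any non-zero code or
    raised exception collapses to exit code 1, and later steps never run.
    """
    for name, step in steps:
        try:
            code = step()
        except Exception as exc:  # a crashing step counts as a failure
            return 1, f"{name}: raised {exc!r}"
        if code != 0:
            return 1, f"{name}: exit {code}"
    return 0, "all steps passed"
```

Passing the steps as `[("build", ...), ("recreate", ...), ("health", ...), ("staging_check", ...)]` keeps the success exit reachable only after staging_check has passed.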

Refs: ORCH-058

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 09:24:38 +00:00
parent c53d625744
commit 6ddff5583d
8 changed files with 210 additions and 162 deletions


@@ -72,6 +72,19 @@ ORCH_DEPLOY_PROD_TARGET_IMAGE=orchestrator-orchestrator
ORCH_DEPLOY_PROD_COMPOSE_PROFILE=
ORCH_DEPLOY_PROD_PREV_IMAGE_FILE=.deploy-prev-image-prod
# ORCH-058: staging-image provenance before the BUILD-ONCE prod retag (INV-FRESH).
# Guarantees the staging image promoted to prod is the EXACT artefact rebuilt from the
# validated commit — two layers, self-hosting only:
# A (liveness): QG sub-check `check_staging_image_fresh` on the deploy-staging->deploy
# edge rebuilds orchestrator-orchestrator-staging from the validated commit + recreates
# 8501; FAIL -> rollback to development. (builds/recreate STAGING only, never prod.)
# B (safety): the Dockerfile stamps `org.opencontainers.image.revision`; the prod hook
# fail-closes (exit 1) before `docker tag` if SOURCE_IMAGE's label != EXPECTED_REVISION.
# ENABLED -> single kill-switch for A+B as a WHOLE (never "B without A"); false -> legacy.
# REPOS -> CSV of repos where the gate is REAL; empty -> only self-hosting (orchestrator).
ORCH_IMAGE_FRESHNESS_ENABLED=true
ORCH_IMAGE_FRESHNESS_REPOS=
# ORCH-053: stuck-task reconciler (sweeper for lost webhooks). A background daemon
# replays a missed stage transition through the SAME gates/handlers a webhook would,
# fixing tasks that got stuck on a dropped event (502 on rebuild, no Plane/Gitea
@@ -88,16 +101,3 @@ ORCH_RECONCILE_INTERVAL_S=120
ORCH_RECONCILE_GRACE_DEFAULT_S=600
ORCH_RECONCILE_GRACE_OVERRIDES_JSON=
ORCH_RECONCILE_NOTIFY_UNBLOCK=true
# ORCH-058: staging-image provenance before the BUILD-ONCE retag to prod. Closes the
# "silent stale promote" bug (LESSONS_ORCH-036 §4): retag promoted the staging image
# to prod without proving it was built from the validated commit. Two layers (A+B),
# self-hosting only, gated as a WHOLE by a single switch (no "B without A" deadlock):
# A (liveness) -> QG sub-check check_staging_image_fresh rebuilds the staging image
# from the validated commit on the deploy-staging->deploy edge (after merge-gate).
# B (safety) -> deploy-hook fail-closes (exit 1) before `docker tag` if SOURCE_IMAGE
# OCI revision label != EXPECTED_REVISION (the validated SHA).
# ENABLED -> single kill-switch for the WHOLE feature; false -> legacy build-once.
# REPOS -> CSV of repos where the feature is REAL; empty -> only self-hosting.
ORCH_IMAGE_FRESHNESS_ENABLED=true
ORCH_IMAGE_FRESHNESS_REPOS=
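
The ENABLED/REPOS semantics described in the comment above could be evaluated roughly like this. It is a sketch under stated assumptions: `freshness_applies` is a hypothetical name (the real conditionality helper is not shown in this diff), and a non-empty REPOS list is assumed to be the complete set of repos where the gate is real.

```python
def freshness_applies(repo: str, enabled: bool, repos_csv: str,
                      self_repo: str = "orchestrator") -> bool:
    """Decide whether the image-freshness gate is real for `repo`.

    enabled=False -> legacy behaviour, the gate never applies.
    Empty CSV     -> only the self-hosting repo is covered.
    Non-empty CSV -> ASSUMPTION: the list is taken as the full set.
    """
    if not enabled:
        return False
    listed = {item.strip() for item in repos_csv.split(",") if item.strip()}
    if not listed:
        return repo == self_repo
    return repo in listed
```

A single boolean guarding both layers matches the "never B without A" note: there is no input combination here that enables one layer alone.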


@@ -1,9 +1,13 @@
FROM python:3.12-slim
WORKDIR /app
# ORCH-58: stamp the validated git commit into the OCI revision label so the
# deploy hook provenance guard can fail-closed on it before the prod retag.
ARG GIT_SHA
# ORCH-058 (Strategy B): stamp the image with the git commit it was built from so
# the deploy hook can fail-close if a stale staging image would be promoted to prod
# (INV-FRESH). Passed at build time via `--build-arg GIT_SHA=<sha>` (the staging
# rebuild in check_staging_image_fresh / the --build-staging hook mode supplies it).
# Without the build-arg the label is empty -> the hook treats it as a mismatch
# (fail-closed). The OCI-standard key is read by `docker image inspect`.
ARG GIT_SHA=""
LABEL org.opencontainers.image.revision=$GIT_SHA
WORKDIR /app
RUN apt-get update -qq && apt-get install -y -qq openssh-client git && rm -rf /var/lib/apt/lists/*
# git operations run as root over bind-mounted /repos (may be owned by host uid) -> trust it.
RUN git config --system --add safe.directory '*'


@@ -194,6 +194,4 @@ never-raise per unit of work; silence on synchron
DB schema, data flows, resilience layer, Dockerfile details: [internals.md](internals.md).
---
*Current as of 2026-06-06. Update when src/stages.py, src/qg/checks.py, src/main.py change. ORCH-043: merge-gate, design (see adr-0006), implementation in branch feature/ORCH-043. ORCH-036: executable self-deploy of the `deploy` stage, design (see adr-0007), implementation in branch feature/ORCH-036.*
*Current as of 2026-06-06. Update when src/stages.py, src/qg/checks.py, src/main.py change. ORCH-043: merge-gate, design (see adr-0006), implementation in branch feature/ORCH-043. ORCH-053: reconciler, implemented (see adr-0007, src/reconciler.py).*
*ORCH-058: staging-image provenance before the BUILD-ONCE retag (check_staging_image_fresh + hook guard), implemented in branch feature/ORCH-058 (see adr-0008, src/image_freshness.py). Also update when src/self_deploy.py, scripts/orchestrator-deploy-hook.sh, Dockerfile change.*
*Current as of 2026-06-07. Update when src/stages.py, src/qg/checks.py, src/main.py change. Status of follow-up work: ORCH-036 (executable self-deploy of `deploy`, adr-0007): implemented; ORCH-043 (merge-gate, adr-0006): design, branch feature/ORCH-043; ORCH-053 (reconciler, adr-0007, src/reconciler.py): implemented; ORCH-058 (staging-image provenance: check_staging_image_fresh + staging_check of the fresh image + hook guard, adr-0008): implemented in branch feature/ORCH-058 (also update when src/image_freshness.py, scripts/orchestrator-deploy-hook.sh, Dockerfile change).*


@@ -24,9 +24,10 @@
1. `docker build --build-arg GIT_SHA=$GIT_SHA -t $TARGET_IMAGE $BUILD_CONTEXT`: rebuild from the host worktree of the validated commit; `GIT_SHA` is stamped into the OCI label `org.opencontainers.image.revision`.
2. `docker compose [--profile $COMPOSE_PROFILE] up -d --no-build $TARGET_SERVICE`: recreate staging on the fresh image.
3. Health loop 10×6s. Healthy → `exit 0`; build/health failure → `exit 1`.
3. Health loop 10×6s. Build/health failure → `exit 1`.
4. **`staging_check` against the FRESH image** (Strategy A, step 3; ADR-001, AC-4): after health, the hook runs `docker exec $STAGING_CONTAINER python3 $STAGING_CHECK_PATH --base-url http://localhost:$TARGET_PORT --mode $STAGING_CHECK_MODE` (default `--mode stub`, no LLM spend). Running **inside** the staging container is canonical (ORCH-048): the suite reads the registry from the container's own env, and `staging_check.py` comes from the bind-mount (`/repos/orchestrator/scripts/...`, not from the image). This is exactly the artefact that build-once later retags to prod → we validate what we promote (AC-4). PASS → `exit 0`; any non-zero (check FAIL or safety-abort `ORCH_STAGING≠true`) → `exit 1`.
Run by the orchestrator on the `deploy-staging → deploy` edge (QG sub-check `check_staging_image_fresh`, see `INFRA.md`). Same exit-code contract (0 = healthy).
Run by the orchestrator on the `deploy-staging → deploy` edge (QG sub-check `check_staging_image_fresh` → `rebuild_staging_image` passes an explicit staging target, see `INFRA.md`). Same exit-code contract (0 = healthy **and** staging_check PASS).
### `--rollback` mode
@@ -45,6 +46,9 @@
| `EXPECTED_REVISION` | _(unset)_ | Build-once (ORCH-058, Strategy B): expected git SHA of `$SOURCE_IMAGE` (label `org.opencontainers.image.revision`). Set → fail-closed guard before `docker tag`. Unset → check skipped. |
| `GIT_SHA` | _(unset)_ | `--build-staging` (ORCH-058, Strategy A): commit stamped into the OCI `revision` label when the staging image is rebuilt. |
| `BUILD_CONTEXT` | `$REPO` | `--build-staging`: docker build context (host worktree of the validated commit). |
| `STAGING_CONTAINER` | `$TARGET_SERVICE` (`orchestrator-staging`) | `--build-staging` (ORCH-058): container inside which `docker exec` runs `staging_check`. |
| `STAGING_CHECK_PATH` | `/repos/orchestrator/scripts/staging_check.py` | `--build-staging` (ORCH-058): path to `staging_check.py` inside the container (bind-mount, not the image). |
| `STAGING_CHECK_MODE` | `stub` | `--build-staging` (ORCH-058): `staging_check` mode (`stub`: fast, no LLM; `full-real`: waits for the analyst). |
| `LOG` | `/var/log/orchestrator/deploy-hook.log` | Log file (fallback: `$REPO/deploy-hook.log`) |
> ⚠️ **Default is always STAGING**. Prod is activated only by an explicit env override.
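
As an illustration of how the `STAGING_*` variables in the table combine, the following Python sketch resolves them with the documented defaults and builds the `docker exec` argument list. `staging_check_argv` is a hypothetical helper (the hook does this in bash); `or` is used to mimic `${VAR:-default}`, which also falls back when the variable is set but empty.

```python
from typing import List, Mapping

DEFAULT_CHECK_PATH = "/repos/orchestrator/scripts/staging_check.py"


def staging_check_argv(env: Mapping[str, str]) -> List[str]:
    """Build the docker-exec argv for staging_check from an env mapping."""
    service = env.get("TARGET_SERVICE") or "orchestrator-staging"
    container = env.get("STAGING_CONTAINER") or service  # defaults to the service
    path = env.get("STAGING_CHECK_PATH") or DEFAULT_CHECK_PATH
    mode = env.get("STAGING_CHECK_MODE") or "stub"
    port = env.get("TARGET_PORT") or "8501"  # staging port, never prod 8500
    return [
        "docker", "exec", container, "python3", path,
        "--base-url", f"http://localhost:{port}",
        "--mode", mode,
    ]
```

With an empty mapping the result targets `orchestrator-staging` on port 8501 in `stub` mode, which is the staging-safe default the warning above describes.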


@@ -14,17 +14,18 @@
# TARGET_IMAGE instead of rebuilding — guarantees prod runs the
# exact artefact that passed staging (no `docker build`).
# EXPECTED_REVISION- expected git SHA of SOURCE_IMAGE (default: unset; ORCH-58)
# Strategy-B fail-closed provenance guard: when set, the
# Strategy B fail-closed provenance guard: when set, the
# SOURCE_IMAGE's org.opencontainers.image.revision label MUST
# equal this value before the BUILD-ONCE retag, else exit 1
# (a stale image is never promoted). Unset -> no check (legacy).
# GIT_SHA - --build-staging build-arg (default: unset; ORCH-58)
# Commit stamped into the rebuilt staging image's revision
# label. Supplied by the caller (validated commit) — NOT
# recomputed from the host clone's HEAD.
# BUILD_CONTEXT - --build-staging build context (default: $REPO; ORCH-58)
# Host worktree of the validated commit; the staging image is
# rebuilt FROM this tree (not the prod clone on main).
# GIT_SHA - build-arg for --build-staging (default: unset; ORCH-58)
# BUILD_CONTEXT - docker build context dir (default: $REPO; --build-staging)
# STAGING_CONTAINER- container to docker-exec staging_check in (--build-staging;
# default: $TARGET_SERVICE → orchestrator-staging; ORCH-58)
# STAGING_CHECK_PATH- staging_check.py path inside that container (--build-staging;
# default: /repos/orchestrator/scripts/staging_check.py; ORCH-58)
# STAGING_CHECK_MODE- staging_check mode stub|full-real (--build-staging;
# default: stub — fast, no LLM spend; ORCH-58)
# LOG - log file path (default: /var/log/orchestrator/deploy-hook.log)
#
# Usage:
@@ -45,11 +46,11 @@ PREV_IMAGE_FILE="${PREV_IMAGE_FILE:-$REPO/.deploy-prev-image-staging}"
# Build-once (ORCH-36): optional prevalidated source image to retag onto
# TARGET_IMAGE. Unset -> backward-compatible (no retag), exit-code contract intact.
SOURCE_IMAGE="${SOURCE_IMAGE:-}"
# Provenance guard (ORCH-58 Strategy-B): the OCI revision label the hook
# inspects on SOURCE_IMAGE, and the git revision it MUST match before retag
# onto prod. EXPECTED_REVISION unset -> backward-compatible (guard skipped).
REVISION_LABEL="org.opencontainers.image.revision"
# Provenance guard (ORCH-58, Strategy B): expected git SHA of SOURCE_IMAGE. Unset
# -> backward-compatible (no provenance check), exit-code contract intact.
EXPECTED_REVISION="${EXPECTED_REVISION:-}"
# The OCI-standard label key the Dockerfile stamps with the build commit.
REVISION_LABEL="org.opencontainers.image.revision"
# ---- Log setup -------------------------------------------------------------
LOG_DIR=/var/log/orchestrator
@@ -149,20 +150,19 @@ fi
# ============================================================================
# --build-staging mode (ORCH-58, Strategy A): rebuild the STAGING image from the
# VALIDATED commit and recreate 8501, so the artefact we validate is the EXACT one
# later BUILD-ONCE retagged to prod (INV-FRESH). Builds/recreates STAGING ONLY
# (8501) — never prod (8500). Same exit-code contract (0 = healthy, !=0 = failed).
#
# Uses the caller-supplied GIT_SHA + BUILD_CONTEXT (the validated worktree) — it
# must NOT recompute HEAD from $REPO (the prod clone on `main`): on the
# deploy-staging -> deploy edge the PR is not yet merged, so `main` HEAD != the
# validated SHA, which would stamp the wrong revision label and deadlock the
# Strategy-B guard on every valid self-deploy.
# VALIDATED commit, recreate 8501, and run the AUTHORITATIVE staging_check against
# the fresh image, so the artefact we validate is the exact one later BUILD-ONCE
# retagged to prod (INV-FRESH, AC-4). Builds/recreates STAGING ONLY (8501) — never
# prod (8500). Same exit-code contract (0 = healthy + staging_check PASS).
# GIT_SHA - commit stamped into the image revision label (build-arg).
# BUILD_CONTEXT - docker build context (host worktree of the validated commit).
# Steps: (1) docker build → (2) recreate 8501 → (3a) health-check →
# (3b) staging_check.py --mode stub against the fresh 8501 (ADR-001 step 3).
# ============================================================================
if [[ "${1:-}" == "--build-staging" ]]; then
BUILD_CONTEXT="${BUILD_CONTEXT:-$REPO}"
GIT_SHA="${GIT_SHA:-}"
log "BUILD-STAGING: rebuilding $TARGET_IMAGE from $BUILD_CONTEXT (GIT_SHA=$GIT_SHA, service=$TARGET_SERVICE, port=$TARGET_PORT)"
log "BUILD-STAGING: rebuilding $TARGET_IMAGE from $BUILD_CONTEXT (GIT_SHA=$GIT_SHA, port=$TARGET_PORT)"
if ! docker build --build-arg GIT_SHA="$GIT_SHA" -t "$TARGET_IMAGE" "$BUILD_CONTEXT" >> "$LOG" 2>&1; then
log "BUILD-STAGING: docker build failed - aborting (exit 1)"
exit 1
@@ -174,24 +174,28 @@ if [[ "${1:-}" == "--build-staging" ]]; then
docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
fi
log "BUILD-STAGING: running health-check on port $TARGET_PORT (10x6s)"
if health_check 10 6 "build-staging-health"; then
log "BUILD-STAGING: $TARGET_SERVICE healthy on the fresh image"
# AC-4 / ADR-001 step 3: validate the EXACT fresh artefact that will be
# BUILD-ONCE retagged to prod by running staging_check.py against the
# freshly recreated STAGING stand (8501, never prod 8500 - AC-9).
# --mode stub: fast, deterministic, no LLM spend (ADR). Run INSIDE the
# container so B6 reads the running instance own env (.env.staging).
log "BUILD-STAGING: running staging_check.py --mode stub against fresh 8501 (port $TARGET_PORT)"
if docker exec "$TARGET_SERVICE" \\
python3 /repos/orchestrator/scripts/staging_check.py \\
--base-url "http://localhost:$TARGET_PORT" --mode stub >> "$LOG" 2>&1; then
log "BUILD-STAGING: staging_check --mode stub PASS on fresh image (exit 0)"
exit 0
fi
log "BUILD-STAGING: staging_check --mode stub FAILED on fresh image - not promoting (exit 1)"
if ! health_check 10 6 "build-staging-health"; then
log "BUILD-STAGING: health FAILED after rebuild (exit 1)"
exit 1
fi
log "BUILD-STAGING: health FAILED after rebuild (exit 1)"
log "BUILD-STAGING: $TARGET_SERVICE healthy on fresh image"
# (3b) ORCH-58 (Strategy A, step 3 — ADR-001): authoritative e2e validation of
# the FRESH image. Run staging_check.py against the just-rebuilt 8501 INSIDE the
# staging container (ORCH-048 canonical: it reads its OWN staging registry env, so
# B6 is correct; the script lives at /repos/... via bind-mount, not in /app). This
# is the same artefact later BUILD-ONCE retagged to prod, so we validate exactly
# what we promote (AC-4). Any non-zero (FAIL or ORCH_STAGING safety-abort) -> exit 1
# -> freshness gate FAIL -> rollback to development. Same exit-code contract.
STAGING_CONTAINER="${STAGING_CONTAINER:-$TARGET_SERVICE}"
STAGING_CHECK_PATH="${STAGING_CHECK_PATH:-/repos/orchestrator/scripts/staging_check.py}"
STAGING_CHECK_MODE="${STAGING_CHECK_MODE:-stub}"
log "BUILD-STAGING: running staging_check (--mode $STAGING_CHECK_MODE) against fresh http://localhost:$TARGET_PORT inside $STAGING_CONTAINER"
if docker exec "$STAGING_CONTAINER" python3 "$STAGING_CHECK_PATH" \
--base-url "http://localhost:$TARGET_PORT" --mode "$STAGING_CHECK_MODE" >> "$LOG" 2>&1; then
log "BUILD-STAGING: staging_check PASS on fresh image (exit 0)"
exit 0
fi
log "BUILD-STAGING: staging_check FAILED on fresh image - artefact not promotable (exit 1)"
exit 1
fi
@@ -222,21 +226,19 @@ git pull origin main >> "$LOG" 2>&1
# Backward compatible: skipped when SOURCE_IMAGE is unset.
if [[ -n "$SOURCE_IMAGE" ]]; then
if docker image inspect "$SOURCE_IMAGE" >/dev/null 2>&1; then
# Fail-closed provenance guard: when EXPECTED_REVISION is set, the
# source image MUST carry the matching git-revision OCI label, else
# abort BEFORE the prod retag. Empty EXPECTED_REVISION -> guard
# skipped (ORCH-36 backward-compat).
# ORCH-58 (Strategy B): fail-closed provenance guard BEFORE docker tag.
# When EXPECTED_REVISION is set, SOURCE_IMAGE's git-commit label MUST match,
# else exit 1 (FAILED -> БАГ-8 rollback); prod is NEVER touched. Empty label
# / inspect error / mismatch all fail-close. Unset EXPECTED_REVISION -> no
# check (backward-compatible for non-self repos / legacy calls).
if [[ -n "$EXPECTED_REVISION" ]]; then
IMG_REV=$(docker image inspect --format '{{ index .Config.Labels "'"$REVISION_LABEL"'" }}' "$SOURCE_IMAGE" 2>/dev/null || true)
# docker emits "<no value>" when the label is absent -> normalise.
if [[ "$IMG_REV" == "<no value>" ]]; then
IMG_REV=""
fi
IMG_REV=$(docker image inspect --format "{{ index .Config.Labels \"$REVISION_LABEL\" }}" "$SOURCE_IMAGE" 2>/dev/null || true)
if [[ "$IMG_REV" == "<no value>" ]]; then IMG_REV=""; fi
if [[ -z "$IMG_REV" || "$IMG_REV" != "$EXPECTED_REVISION" ]]; then
log "PROVENANCE: SOURCE_IMAGE revision '$IMG_REV' != expected '$EXPECTED_REVISION' - aborting before retag (exit 1)"
log "PROVENANCE: SOURCE_IMAGE revision '$IMG_REV' != expected '$EXPECTED_REVISION' (fail-closed) - aborting (exit 1)"
exit 1
fi
log "PROVENANCE: SOURCE_IMAGE revision matches expected ($EXPECTED_REVISION)"
log "PROVENANCE: SOURCE_IMAGE revision matches expected ($EXPECTED_REVISION) - retag allowed"
fi
log "BUILD-ONCE: retagging $SOURCE_IMAGE -> $TARGET_IMAGE (no rebuild)"
docker tag "$SOURCE_IMAGE" "$TARGET_IMAGE" >> "$LOG" 2>&1
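
The provenance guard above reduces to a small pure decision. The following is a Python sketch of that logic under stated assumptions: `inspect_label_argv` and `revision_ok` are hypothetical names, and the raw label string is assumed to come from the stdout of the `docker image inspect --format` call shown in the hook.

```python
from typing import List

REVISION_LABEL = "org.opencontainers.image.revision"


def inspect_label_argv(image: str, label: str = REVISION_LABEL) -> List[str]:
    """argv that prints one label of an image via a Go template."""
    fmt = '{{ index .Config.Labels "%s" }}' % label
    return ["docker", "image", "inspect", "--format", fmt, image]


def revision_ok(raw_label: str, expected: str) -> bool:
    """Return True if the retag may proceed.

    An empty `expected` means the guard is skipped (backward compatible).
    Otherwise an absent label ("<no value>"), an empty label, or any
    mismatch fails closed.
    """
    if not expected:
        return True
    actual = raw_label.strip()
    if actual == "<no value>":  # docker's rendering of a missing label
        actual = ""
    return bool(actual) and actual == expected
```

Keeping the comparison separate from the `docker` call makes the fail-closed cases easy to unit-test without a Docker daemon.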


@@ -14,9 +14,10 @@ self-hosting:
* **A — liveness:** :func:`check_staging_image_fresh` is a QG sub-check on the
``deploy-staging -> deploy`` edge (composed by ``stage_engine`` AFTER the
merge-gate, BEFORE Phase A). It rebuilds ``orchestrator-orchestrator-staging``
from the VALIDATED commit (worktree HEAD after the merge-gate rebase) and
recreates the 8501 container, so we validate and promote ONE artefact. FAIL ->
rollback to ``development`` (mirrors the merge-gate).
from the VALIDATED commit (worktree HEAD after the merge-gate rebase), recreates
the 8501 container, and runs ``staging_check.py --mode stub`` against that fresh
8501 (ADR-001 step 3), so we validate exactly the ONE artefact later retagged to
prod (AC-4). FAIL -> rollback to ``development`` (mirrors the merge-gate).
* **B — safety:** :func:`expected_revision` feeds the validated SHA to
``self_deploy.build_deploy_command`` as ``EXPECTED_REVISION``; the host hook
fail-closes (``exit 1``) before ``docker tag`` if the SOURCE_IMAGE revision
@@ -48,10 +49,18 @@ REVISION_LABEL = "org.opencontainers.image.revision"
# Bounded timeouts so a hung git/docker/ssh never wedges the monitor-thread.
_GIT_TIMEOUT = 30
_INSPECT_TIMEOUT = 30
# The remote rebuild (docker build + compose recreate + health) is the slow path;
# keep it generous but bounded (mirrors the merge-gate re-test budget order).
# The remote rebuild (docker build + compose recreate + health + staging_check) is
# the slow path; keep it generous but bounded (mirrors the merge-gate re-test order).
_REBUILD_TIMEOUT = 1200
# Explicit STAGING target for the --build-staging rebuild (Strategy A). These mirror
# the hook's staging-safe defaults but are passed EXPLICITLY so a future change to the
# hook defaults can never silently retarget the self-rebuild at prod (8500) — the whole
# path builds/recreates STAGING ONLY (AC-9, review P2). Never the prod 8500 target.
_STAGING_SERVICE = "orchestrator-staging"
_STAGING_PORT = 8501
_STAGING_COMPOSE_PROFILE = "staging"
# ---------------------------------------------------------------------------
# Conditionality (mirrors self_deploy_applies / _merge_gate_applies)
@@ -234,9 +243,12 @@ def rebuild_staging_image(repo: str, branch: str, sha: str) -> tuple[bool, str]:
The hook (``orchestrator-deploy-hook.sh --build-staging``) runs, on the host:
``docker build --build-arg GIT_SHA=<sha> -t <staging-image> <host-worktree>``
-> ``docker compose --profile staging up -d --no-build orchestrator-staging``
-> health-check 8501. Same exit-code contract (0 = ok). This touches
staging ONLY (8501), NEVER prod (8500) (AC-9): all build/recreate targets are
the staging service.
-> health-check 8501
-> ``staging_check.py --mode stub`` against the FRESH 8501 (ADR-001 step 3,
AC-4: validate exactly the artefact later retagged to prod).
Same exit-code contract (0 = ok). This touches staging ONLY (8501),
NEVER prod (8500) (AC-9): all build/recreate/validate targets are the staging
service — passed EXPLICITLY below, not left to hook defaults (review P2).
Synchronous ssh is fine here (unlike Phase B): recreating staging does not kill
the prod worker running this code. Bounded by ``_REBUILD_TIMEOUT``.
@@ -248,17 +260,18 @@ def rebuild_staging_image(repo: str, branch: str, sha: str) -> tuple[bool, str]:
if not target:
return False, "no ssh host configured for staging rebuild"
host_ctx = _host_worktree_path(repo, branch)
# We pass ONLY GIT_SHA (validated commit -> revision label, the shared anchor
# with Strategy B), BUILD_CONTEXT (the validated worktree to build FROM) and
# TARGET_IMAGE (the staging image name to retag in prod later). COMPOSE_PROFILE
# / TARGET_SERVICE / TARGET_PORT are deliberately omitted so the hook keeps its
# built-in STAGING defaults (profile=staging, orchestrator-staging, 8501): this
# rebuild/recreate must touch STAGING ONLY (8501), NEVER prod (8500) (AC-9), and
# the prod defaults are never reachable on this path.
# Pass the STAGING target explicitly (service/port/profile/container), so the
# rebuild + recreate + staging_check can never drift onto the prod 8500 service
# even if the hook's defaults change (AC-9, review P2). STAGING_CONTAINER is the
# container staging_check is docker-exec'd inside (step 3b).
env_assignments = (
f"GIT_SHA={shlex.quote(sha)} "
f"BUILD_CONTEXT={shlex.quote(host_ctx)} "
f"TARGET_IMAGE={shlex.quote(settings.deploy_prod_source_image)}"
f"TARGET_IMAGE={shlex.quote(settings.deploy_prod_source_image)} "
f"TARGET_SERVICE={shlex.quote(_STAGING_SERVICE)} "
f"TARGET_PORT={shlex.quote(str(_STAGING_PORT))} "
f"COMPOSE_PROFILE={shlex.quote(_STAGING_COMPOSE_PROFILE)} "
f"STAGING_CONTAINER={shlex.quote(_STAGING_SERVICE)}"
)
inner = (
f"cd {shlex.quote(settings.deploy_host_repo_path)} && "
@@ -290,9 +303,10 @@ def check_staging_image_fresh(repo: str, work_item_id: str, branch: str) -> tupl
a repo the feature is not real for -> ``(True, "image-freshness N/A for <repo>")``.
2. Anchor: ``sha = validated_revision(repo, branch)``. Empty -> fail-closed
``(False, ...)`` (AC-3): we never rebuild/promote without a known commit.
3. Rebuild the staging image from that commit + recreate 8501 (host hook).
Healthy -> ``(True, ...)``: the artefact we just validated is the exact one
that will be retagged to prod (AC-4, loop closed). FAIL -> ``(False, ...)``
3. Rebuild the staging image from that commit, recreate 8501, and run
``staging_check.py --mode stub`` against the fresh 8501 (host hook). PASS ->
``(True, ...)``: the artefact we just validated (build + e2e) is the exact
one that will be retagged to prod (AC-4, loop closed). FAIL -> ``(False, ...)``
-> the engine rolls back to ``development`` (AC-2).
Never-raise (AC-8): any internal error -> ``(False, "<reason>")``; an exception
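
The explicit-target idea in `rebuild_staging_image` can be shown in isolation. This is a minimal sketch, not the module's code: `build_remote_command` is a hypothetical helper, and the exact form of the hook invocation after `cd ... &&` is an assumption, since that part of `inner` is not visible in this diff.

```python
import shlex
from typing import Mapping


def build_remote_command(repo_path: str, hook: str, env: Mapping[str, str]) -> str:
    """Compose a remote shell line: cd <repo> && K=V ... <hook> --build-staging.

    Every value goes through shlex.quote so spaces or shell metacharacters
    in a path or SHA cannot change the meaning of the command.
    """
    assignments = " ".join(f"{key}={shlex.quote(str(value))}" for key, value in env.items())
    return f"cd {shlex.quote(repo_path)} && {assignments} {hook} --build-staging"


if __name__ == "__main__":
    print(build_remote_command(
        "/srv/orchestrator",                      # hypothetical host repo path
        "scripts/orchestrator-deploy-hook.sh",
        {
            "GIT_SHA": "abc123def456",
            "TARGET_SERVICE": "orchestrator-staging",
            "TARGET_PORT": 8501,
            "COMPOSE_PROFILE": "staging",
            "STAGING_CONTAINER": "orchestrator-staging",
        },
    ))
```

Because the staging service, port and profile travel with every call, a later change to the hook's defaults would not silently retarget this path.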


@@ -1,13 +1,19 @@
"""ORCH-058 TC-07/08: static guarantees of the Strategy-B provenance plumbing.
"""ORCH-058 TC-07/08: static + caller-contract guarantees of the provenance plumbing.
These assert the *shape* of the deploy artefacts that can't be unit-tested by
running them (they shell out to docker/ssh on the host):
* TC-07 — the deploy hook fail-closes BEFORE `docker tag` when the staging
image's git-revision label != EXPECTED_REVISION (exit 1), and the
new `--build-staging` rebuild mode stamps GIT_SHA into the image.
new `--build-staging` rebuild mode (a) stamps GIT_SHA into the image,
(b) uses $BUILD_CONTEXT as the build context, (c) recreates 8501 +
health-checks, (d) runs staging_check against the FRESH image
(Strategy A step 3, AC-4), and (e) never recomputes GIT_SHA from $REPO.
* TC-08 — the Dockerfile declares `ARG GIT_SHA` and stamps it into the
`org.opencontainers.image.revision` OCI label (the anchor B reads).
* TC-09 — the caller↔hook contract: `rebuild_staging_image` invokes the hook
in `--build-staging` mode with BUILD_CONTEXT=<host-worktree>,
GIT_SHA=<validated sha>, and an EXPLICIT staging target (never prod).
"""
import pathlib
@@ -17,17 +23,6 @@ _HOOK = _ROOT / "scripts" / "orchestrator-deploy-hook.sh"
_DOCKERFILE = _ROOT / "Dockerfile"
def _build_staging_block() -> str:
"""Return only the body of the hook's ``--build-staging`` branch, so the
contract assertions below cannot be satisfied by lookalike strings elsewhere
in the script (e.g. the NORMAL DEPLOY recreate). The block runs from the
``--build-staging`` guard up to the NORMAL DEPLOY section header."""
text = _HOOK.read_text(encoding="utf-8")
start = text.index('"${1:-}" == "--build-staging"')
end = text.index("NORMAL DEPLOY mode", start)
return text[start:end]
# ---------------------------------------------------------------------------
# TC-07: hook fail-closed provenance guard + --build-staging rebuild mode
# ---------------------------------------------------------------------------
@@ -60,68 +55,42 @@ def test_tc07_build_staging_mode_stamps_git_sha():
assert 'docker build --build-arg GIT_SHA="$GIT_SHA"' in text
def test_tc07_build_staging_builds_from_caller_context_not_repo():
"""Contract (caller <-> hook): --build-staging must build from the
caller-supplied BUILD_CONTEXT (the validated worktree), NOT the prod clone.
Regression guard for the P0 deadlock: the block must honour the caller's
GIT_SHA (BUILD_CONTEXT/GIT_SHA defaulting) and must NOT recompute the SHA
from the host clone's HEAD (`git rev-parse HEAD`) — on the
deploy-staging -> deploy edge `main` HEAD != validated SHA, which would
stamp the wrong revision label and deadlock the Strategy-B guard.
"""
block = _build_staging_block()
# Build context is the caller-supplied worktree, defaulting to $REPO.
assert 'BUILD_CONTEXT="${BUILD_CONTEXT:-$REPO}"' in block
assert 'docker build --build-arg GIT_SHA="$GIT_SHA" -t "$TARGET_IMAGE" "$BUILD_CONTEXT"' in block
# Honour the caller's GIT_SHA; never hard-build against the prod clone.
assert 'GIT_SHA="${GIT_SHA:-}"' in block
assert 'docker build --build-arg GIT_SHA="$GIT_SHA" -t "$TARGET_IMAGE" "$REPO"' not in block
# Must NOT recompute the validated SHA from the host clone's HEAD.
assert "git rev-parse HEAD" not in block
def test_tc07_build_staging_uses_build_context_and_recreates_8501():
"""The rebuild must use $BUILD_CONTEXT as the docker build context and recreate
the staging service with a health-check (not a bare build)."""
text = _HOOK.read_text(encoding="utf-8")
# $BUILD_CONTEXT is the build context of the rebuild (validated worktree).
assert 'docker build --build-arg GIT_SHA="$GIT_SHA" -t "$TARGET_IMAGE" "$BUILD_CONTEXT"' in text
# Recreate the staging service on the fresh image (no-build) + health-check.
assert 'up -d --no-build "$TARGET_SERVICE"' in text
assert 'health_check 10 6 "build-staging-health"' in text
def test_tc07_build_staging_recreates_and_health_checks_8501():
"""AC-4: --build-staging must recreate the staging container on the fresh
image and validate it (health-check), so rebuild_staging_image's rc=0 truly
means "rebuilt AND healthy". A bare `docker build` + exit 0 would make the
freshness verdict a lie."""
block = _build_staging_block()
# Recreate the staging service on the freshly built image.
assert 'docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE"' in block
# Validate the fresh container before reporting success.
assert 'health_check 10 6 "build-staging-health"' in block
# Health failure surfaces as a non-zero exit (FAILED contract preserved).
assert "exit 1" in block
def test_tc07_build_staging_does_not_recompute_git_sha_from_repo():
"""Regression guard (root cause of the silent-stale-promote class): the
--build-staging mode must NOT derive GIT_SHA itself from the prod $REPO clone —
it must consume the GIT_SHA passed in by the caller (the validated commit)."""
text = _HOOK.read_text(encoding="utf-8")
# Anchor on the actual block guard (not the header comment mentions).
after = text[text.index('"${1:-}" == "--build-staging"'):]
assert 'GIT_SHA="${GIT_SHA:-}"' in after
assert "git rev-parse" not in after, "GIT_SHA must come from the caller, not the prod clone"
def test_tc07_build_staging_runs_staging_check_stub_after_health():
"""AC-4 / ADR-001 step 3: after the fresh staging container is healthy, the
--build-staging mode MUST run staging_check.py --mode stub against the fresh
8501 stand BEFORE reporting success, and fail-closed (exit 1) if it fails -
so the EXACT artefact promoted to prod is the one that passed staging."""
block = _build_staging_block()
# staging_check is invoked in --mode stub (fast, no LLM spend per ADR).
assert "staging_check.py" in block
assert "--mode stub" in block
# It targets the fresh STAGING stand (8501 / TARGET_PORT), never prod 8500.
assert '--base-url "http://localhost:$TARGET_PORT"' in block
# AC-9: the staging_check invocation must NOT hard-code the prod port (8500).
invocation_lines = [
ln for ln in block.splitlines()
if "staging_check.py" in ln or "--base-url" in ln
]
assert invocation_lines, "expected a staging_check.py invocation line"
assert all("8500" not in ln for ln in invocation_lines)
# Ordering: staging_check runs AFTER the health-check, BEFORE the final exit 0.
health_idx = block.index('health_check 10 6 "build-staging-health"')
check_idx = block.index("staging_check.py")
assert health_idx < check_idx, "staging_check must run after health_check"
exit0_idx = block.index("staging_check --mode stub PASS")
success_exit = block.index("exit 0", exit0_idx)
assert check_idx < success_exit, "staging_check must precede the success exit 0"
# Fail-closed: a non-zero staging_check surfaces as exit 1 (no prod promote).
assert "staging_check --mode stub FAILED" in block
def test_tc07_build_staging_runs_staging_check_against_fresh_image():
"""Strategy A step 3 (ADR-001, AC-4): after recreate+health, the FRESH image is
validated by staging_check.py (not health-only). This is the P1 the reviewer
flagged: validate exactly the artefact later retagged to prod."""
text = _HOOK.read_text(encoding="utf-8")
# Anchor on the actual block guard (not the header comment mentions).
after = text[text.index('"${1:-}" == "--build-staging"'):]
# staging_check is invoked, inside the staging container, --mode stub by default.
assert "staging_check.py" in after
assert 'docker exec "$STAGING_CONTAINER"' in after
assert '--mode "$STAGING_CHECK_MODE"' in after
assert 'STAGING_CHECK_MODE="${STAGING_CHECK_MODE:-stub}"' in after
# The staging_check run must come AFTER the health-check (health gates readiness).
assert after.index('health_check 10 6 "build-staging-health"') < after.index("staging_check.py")
# ---------------------------------------------------------------------------
@@ -131,3 +100,60 @@ def test_tc08_dockerfile_stamps_revision_label():
text = _DOCKERFILE.read_text(encoding="utf-8")
assert "ARG GIT_SHA" in text
assert "LABEL org.opencontainers.image.revision=$GIT_SHA" in text
# ---------------------------------------------------------------------------
# TC-09: caller↔hook contract — rebuild_staging_image builds the right command
# ---------------------------------------------------------------------------
def test_tc09_rebuild_staging_image_passes_validated_context_and_staging_target(monkeypatch):
"""`rebuild_staging_image` must invoke the hook `--build-staging` over ssh with
BUILD_CONTEXT=<host-worktree>, GIT_SHA=<validated sha>, and an EXPLICIT staging
target (service/port/profile/container) — never the prod 8500 target. The absence
of this contract test is what hid the earlier P0s (review P2)."""
import src.image_freshness as imgf
captured = {}
class _FakeCompleted:
returncode = 0
stdout = ""
stderr = ""
def _fake_run(cmd, *a, **kw):
captured["cmd"] = cmd
return _FakeCompleted()
monkeypatch.setattr(imgf, "_ssh_target", lambda: "slin@host")
monkeypatch.setattr(imgf, "_host_worktree_path",
lambda repo, branch: "/home/slin/repos/_wt/orchestrator/feature_X")
monkeypatch.setattr(imgf.subprocess, "run", _fake_run)
ok, msg = imgf.rebuild_staging_image("orchestrator", "feature/ORCH-058", "abc123def456")
assert ok, msg
cmd = captured["cmd"]
assert cmd[0] == "ssh"
inner = cmd[-1] # the remote shell command string
# Validated commit + validated worktree as build context.
assert "GIT_SHA=abc123def456" in inner
assert "BUILD_CONTEXT=/home/slin/repos/_wt/orchestrator/feature_X" in inner
# Explicit STAGING target — never the prod 8500 service/port.
assert "TARGET_SERVICE=orchestrator-staging" in inner
assert "TARGET_PORT=8501" in inner
assert "COMPOSE_PROFILE=staging" in inner
assert "STAGING_CONTAINER=orchestrator-staging" in inner
assert "orchestrator-orchestrator-staging" in inner # staging TARGET_IMAGE
assert "--build-staging" in inner
# Hard safety: the prod service/port must NOT leak into the staging rebuild.
assert "TARGET_PORT=8500" not in inner
assert "TARGET_SERVICE=orchestrator " not in inner
def test_tc09_rebuild_staging_image_no_ssh_host_fails_closed(monkeypatch):
"""No ssh host configured -> never-raise, fail-closed (False), no command run."""
import src.image_freshness as imgf
monkeypatch.setattr(imgf, "_ssh_target", lambda: None)
ok, reason = imgf.rebuild_staging_image("orchestrator", "feature/ORCH-058", "abc123")
assert ok is False
assert "ssh host" in reason
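
The never-raise, fail-closed contract these tests assert, together with the bounded timeouts in `image_freshness.py`, follows a common pattern that can be sketched on its own. `run_bounded` is a hypothetical helper written for illustration; it is not taken from the module.

```python
import subprocess
from typing import Sequence, Tuple


def run_bounded(cmd: Sequence[str], timeout_s: float) -> Tuple[bool, str]:
    """Run a command with a hard timeout and never raise.

    Returns (True, "ok") only on exit code 0. A timeout, a missing
    executable, or any non-zero exit yields (False, "<reason>").
    """
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    except OSError as exc:  # e.g. executable not found
        return False, f"could not start: {exc}"
    if proc.returncode != 0:
        return False, f"exit {proc.returncode}"
    return True, "ok"
```

Returning a `(bool, reason)` tuple instead of raising lets a caller such as a quality-gate check treat every failure mode the same way.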