fix(deploy): resilient-pull hygiene for dirty shared deploy-base (ORCH-112)

The self-deploy `git pull` blocks when the shared main checkout is dirty (manual or
abandoned WIP left by a failed/cancelled task). In incident ORCH-111, "Your local
changes to src/config.py would be overwritten by merge" wedged the prod deploy and
required manual intervention (a class of risk for self-hosting setups).

The deploy hook (--deploy) now converges the deploy-base to a clean, current
origin/main BEFORE the pull (git fetch + reset --hard origin/main + a SCOPED
`git clean -fd`, NEVER -x), strictly preserving the rollback/log artefacts
(.deploy-prev-image-* / deploy-hook.log via -e), gitignored .env/data/*.db/build
(no -x), and sibling/.git state (out of clean scope). Gated by the CHECKOUT_HYGIENE
env var, which self_deploy.build_deploy_command injects only when applies(repo) in
the new pure, never-raise leaf module src/checkout_hygiene.py returns true
(kill-switch + self-hosting scope). Convergence after a failed/cancelled task is
this same deploy-time self-heal: cancel_task is NOT extended and no background
janitor is introduced. Observability: the hook writes a `hygiene` sentinel, and the
Phase-C finalizer reads it and sends a best-effort Telegram alert.
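
The clean scope described above can be checked in a throwaway repo. This is a
minimal sketch, not the hook itself: the file names `.deploy-prev-image-web`,
`stray-wip.txt` and `app.txt` are made up for illustration, and only the
`git clean -fd -e ...` invocation mirrors the hook.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo; nothing here touches a real deploy-base.
repo="$(mktemp -d)"
cd "$repo"
git init -q .
printf '.env\n' > .gitignore
printf 'tracked\n' > app.txt
git add .gitignore app.txt
git -c user.name=demo -c user.email=demo@example.invalid \
    -c commit.gpgsign=false commit -q -m init

# Four kinds of untracked state (names are illustrative assumptions):
printf 'SECRET=1\n' > .env               # gitignored: kept because -x is never passed
printf 'img\n' > .deploy-prev-image-web  # untracked, protected by -e
printf 'log\n' > deploy-hook.log         # untracked, protected by -e
printf 'wip\n' > stray-wip.txt           # untracked and unprotected: removed

# Same flags as the hook: force, include directories, two excludes, no -x.
git clean -fd -e '.deploy-prev-image-*' -e 'deploy-hook.log' >/dev/null

# List what is left in the working tree (excluding .git itself).
survivors="$(ls -A | grep -v '^\.git$' | LC_ALL=C sort | tr '\n' ' ')"
echo "left after clean: $survivors"
```

Only `stray-wip.txt` is expected to disappear; adding `-x` to the same command
would also delete `.env`, which is why the hook never passes it.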

Additive, under kill-switch (ORCH_CHECKOUT_HYGIENE_ENABLED, default true; off ->
bare `git pull origin main`, 1:1 with pre-ORCH-112 behaviour), never-raise,
self-hosting scope.
STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema / the
hook exit-code contract (0/1/2, ORCH-036) are byte-for-byte untouched.
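
The gating can be pictured with a small shell stand-in. The real gate is the
Python leaf src/checkout_hygiene.py, which is not shown in this commit view, so
everything below is an assumption for illustration: the function name
`hygiene_flag`, its self-hosted argument, and the accepted truthy spellings are
hypothetical. Only the variable names and the default of true come from the
commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for the Python gate: prints 1 or 0, never fails.
# $1: "1" when the repo is the self-hosted deploy-base (assumed input).
hygiene_flag() {
  local is_self_hosted="${1:-0}"
  # Default is true per the commit message; lower-case for comparison.
  local enabled
  enabled="$(printf '%s' "${ORCH_CHECKOUT_HYGIENE_ENABLED:-true}" \
             | tr '[:upper:]' '[:lower:]')"
  case "$enabled" in
    1|true|yes|on) ;;            # assumed truthy spellings
    *) echo 0; return 0 ;;       # kill-switch off: bare pull path
  esac
  if [[ "$is_self_hosted" == "1" ]]; then
    echo 1                       # in scope: the hook would run hygiene
  else
    echo 0                       # out of self-hosting scope
  fi
}

# The value the hook later reads as "${CHECKOUT_HYGIENE:-0}".
CHECKOUT_HYGIENE="$(hygiene_flag 1)"
echo "CHECKOUT_HYGIENE=$CHECKOUT_HYGIENE"
```

With the flag at 0 the hook skips the whole hygiene block and goes straight to
the bare pull.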

Coverage: tests/test_deploy_checkout_hygiene.py (TC-01..TC-10; real-hook shell
simulation in a temp git repo, no network/prod/ssh, + unit). TC-01 is the
mandatory ORCH-111 regression (RED before the fix, GREEN after). Docs golden
source updated in the same PR (CLAUDE.md, CHANGELOG.md, .env.example; INFRA.md /
architecture/README.md / adr-0044 written at the architecture stage).
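
The shape of the ORCH-111 regression can be reproduced locally. This is a rough
sketch under stated assumptions, not the contents of
tests/test_deploy_checkout_hygiene.py: the repo layout, the helper `g`, and the
file contents are invented, and the real test drives the hook rather than
calling git directly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Helper (hypothetical): git with a fixed identity so commits work anywhere.
g() {
  git -c user.name=demo -c user.email=demo@example.invalid \
      -c commit.gpgsign=false "$@"
}

work="$(mktemp -d)"

# Local bare "origin" whose default branch is main; no network involved.
git init -q --bare "$work/origin.git"
git -C "$work/origin.git" symbolic-ref HEAD refs/heads/main

# Seed origin/main with v1, then clone it as the shared deploy-base.
git init -q "$work/seed"
git -C "$work/seed" symbolic-ref HEAD refs/heads/main
printf 'v1\n' > "$work/seed/config.py"
g -C "$work/seed" add config.py
g -C "$work/seed" commit -q -m v1
g -C "$work/seed" remote add origin "$work/origin.git"
g -C "$work/seed" push -q origin main
git clone -q "$work/origin.git" "$work/base"

# Upstream moves to v2 while the deploy-base holds abandoned WIP.
printf 'v2\n' > "$work/seed/config.py"
g -C "$work/seed" commit -q -am v2
g -C "$work/seed" push -q origin main
printf 'abandoned WIP\n' > "$work/base/config.py"

# RED: the bare pull refuses to overwrite the local change.
if g -C "$work/base" pull -q origin main >/dev/null 2>&1; then
  pull_before=ok
else
  pull_before=blocked
fi

# Converge the same way the hook does, then pull again.
g -C "$work/base" fetch -q origin main
g -C "$work/base" reset -q --hard origin/main
g -C "$work/base" clean -fdq -e '.deploy-prev-image-*' -e 'deploy-hook.log'

# GREEN: the tree is clean and current, so the pull goes through.
if g -C "$work/base" pull -q origin main >/dev/null 2>&1; then
  pull_after=ok
else
  pull_after=blocked
fi
echo "before=$pull_before after=$pull_after"
```

The abandoned edit to `config.py` is discarded by the reset, which is the
intended trade-off: the deploy-base is treated as disposable, not as a place to
keep work.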

Refs: ORCH-112

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 14:50:43 +03:00
committed by deployer
parent 31b4f3fd1d
commit a1f3b7588a
11 changed files with 825 additions and 3 deletions


@@ -220,6 +220,35 @@ else
    log "No previous image captured (first deploy or service not running?)"
fi
# 2a. ORCH-112: resilient pull — converge the shared deploy-base to a clean, current
# origin/main BEFORE the pull, so a dirty working tree (manual/abandoned WIP left
# by a failed/cancelled task) never blocks the deploy (incident ORCH-111, dirt from
# ORCH-104). Gated by CHECKOUT_HYGIENE (Python kill-switch + self-hosting scope,
# injected by self_deploy.build_deploy_command). NEVER `-x` (would delete gitignored
# .env / data/*.db / build/); EXCLUDES the untracked-but-not-ignored rollback/log
# artefacts .deploy-prev-image-* and deploy-hook.log (NFR-2). Best-effort: every git
# step is `|| log "...continuing"` and the bare `git pull` below still runs
# (never-break). On a CLEAN base the whole block is a no-op -> the happy-path
# behaviour and exit-codes (0/1/2, ORCH-036) are byte-for-byte unchanged.
if [[ "${CHECKOUT_HYGIENE:-0}" == "1" ]]; then
    dirty="$(git status --porcelain 2>/dev/null || true)"
    if [[ -n "$dirty" ]]; then
        log "HYGIENE: dirty deploy-base detected, converging to origin/main:"
        log "$dirty"
        git fetch origin main >> "$LOG" 2>&1 || log "HYGIENE: fetch failed (continuing)"
        git reset --hard origin/main >> "$LOG" 2>&1 || log "HYGIENE: reset failed (continuing)"
        git clean -fd \
            -e '.deploy-prev-image-*' \
            -e 'deploy-hook.log' \
            >> "$LOG" 2>&1 || log "HYGIENE: clean failed (continuing)"
        if [[ -n "${HYGIENE_REPORT:-}" ]]; then
            { printf 'dirty=1\n'; printf '%s\n' "$dirty"; } > "$HYGIENE_REPORT" 2>/dev/null || true
        fi
    else
        log "HYGIENE: deploy-base already clean (no-op)"
    fi
fi
# 2. Pull latest code (keeps the host working tree current for future builds;
# the DEPLOYED artefact is the retagged SOURCE_IMAGE below when build-once).
log "git pull origin main"