Self-deploy git pull blocked on a dirty shared main checkout (manual/abandoned WIP from a failed/cancelled task)

Incident ORCH-111: "Your local changes to src/config.py would be overwritten by merge" wedged the prod deploy and required manual intervention (a group risk on self-hosting).

The deploy hook (--deploy) now converges the deploy-base to a clean, current origin/main BEFORE the pull (git fetch + reset --hard origin/main + a SCOPED `git clean -fd`, NEVER -x). It strictly preserves the rollback/log artefacts (.deploy-prev-image-* / deploy-hook.log, via -e), the gitignored .env / data/*.db / build (no -x), and sibling/.git state (out of clean scope).

Gated by the CHECKOUT_HYGIENE env var, injected by self_deploy.build_deploy_command only when the new pure, never-raise leaf src/checkout_hygiene.py says applies(repo) (kill-switch + self-hosting scope). Convergence after a failed/cancelled task is this same deploy-time self-heal: cancel_task is NOT extended and no background janitor is introduced.

Observability: the hook writes a `hygiene` sentinel; the Phase-C finalizer reads it and sends a best-effort Telegram alert.

Additive, under kill-switch (ORCH_CHECKOUT_HYGIENE_ENABLED, default true; off -> bare `git pull origin main`, 1:1 with the behaviour before ORCH-112), never-raise, self-hosting scope. STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema / the hook exit-code contract (0/1/2, ORCH-036) are byte-for-byte untouched.

Coverage: tests/test_deploy_checkout_hygiene.py (TC-01..TC-10; real-hook shell simulation in a temp git repo, no network/prod/ssh, plus unit tests). TC-01 is the mandatory ORCH-111 regression (RED before the fix, GREEN after).

Docs golden source updated in the same PR (CLAUDE.md, CHANGELOG.md, .env.example; INFRA.md / architecture/README.md / adr-0044 written at the architecture stage).

Refs: ORCH-112
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
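The gating described above can be sketched in shell. This is only an illustration: the real gate is Python (src/checkout_hygiene.py and self_deploy.build_deploy_command) and is not shown here. The shell function bodies, the accepted "off" spellings, and the repo-name comparison used for the self-hosting scope are all assumptions.

```shell
# Hypothetical shell mirror of the gate; the real one is Python and not shown here.

# Kill-switch: ORCH_CHECKOUT_HYGIENE_ENABLED, default true.
# The accepted "off" spellings below are an assumption.
hygiene_enabled() {
  case "${ORCH_CHECKOUT_HYGIENE_ENABLED:-true}" in
    0|false|False|FALSE|off|no) return 1 ;;
    *) return 0 ;;
  esac
}

# Self-hosting scope: assumed here to be a plain repo-name comparison.
hygiene_applies() {
  hygiene_enabled && [ "$1" = "orchestrator" ]
}

# Print the hook invocation, prefixing CHECKOUT_HYGIENE=1 only when the gate applies.
build_deploy_command() {
  repo_name="$1"
  hook_path="$2"
  if hygiene_applies "$repo_name"; then
    printf 'CHECKOUT_HYGIENE=1 %s --deploy\n' "$hook_path"
  else
    printf '%s --deploy\n' "$hook_path"
  fi
}

build_deploy_command orchestrator ./orchestrator-deploy-hook.sh
```

With the switch off, or for any other repo, the printed command carries no CHECKOUT_HYGIENE prefix, so the hook falls through to the bare `git pull origin main`.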
315 lines
16 KiB
Bash
Executable File
#!/bin/bash
# Deploy hook for orchestrator
# Supports --deploy (default), --rollback and --build-staging modes.
# Adds health-check loop + automatic rollback if new deploy is unhealthy.
#
# Parametrised via env vars (defaults are STAGING, never prod):
#   TARGET_SERVICE     - docker-compose service name (default: orchestrator-staging)
#   TARGET_PORT        - health check port (default: 8501)
#   TARGET_IMAGE       - image name for retag (default: orchestrator-orchestrator-staging)
#   COMPOSE_PROFILE    - docker compose profile (default: staging)
#   PREV_IMAGE_FILE    - path to prev-image snapshot (default: $REPO/.deploy-prev-image-staging)
#   SOURCE_IMAGE       - build-once source image (default: unset; ORCH-36)
#                        When set, the prevalidated (staging) image is retagged onto
#                        TARGET_IMAGE instead of rebuilding; guarantees prod runs the
#                        exact artefact that passed staging (no `docker build`).
#   EXPECTED_REVISION  - expected git SHA of SOURCE_IMAGE (default: unset; ORCH-58)
#                        Strategy B fail-closed provenance guard: when set, the
#                        SOURCE_IMAGE's org.opencontainers.image.revision label MUST
#                        equal this value before the BUILD-ONCE retag, else exit 1
#                        (a stale image is never promoted). Unset -> no check (legacy).
#   GIT_SHA            - build-arg for --build-staging (default: unset; ORCH-58)
#   BUILD_CONTEXT      - docker build context dir (default: $REPO; --build-staging)
#   STAGING_CONTAINER  - container to docker-exec staging_check in (--build-staging;
#                        default: $TARGET_SERVICE → orchestrator-staging; ORCH-58)
#   STAGING_CHECK_PATH - staging_check.py path inside that container (--build-staging;
#                        default: /repos/orchestrator/scripts/staging_check.py; ORCH-58)
#   STAGING_CHECK_MODE - staging_check mode stub|full-real (--build-staging;
#                        default: stub, which is fast with no LLM spend; ORCH-58)
#   LOG                - log file path (default: /var/log/orchestrator/deploy-hook.log)
#
# Usage:
#   ./orchestrator-deploy-hook.sh [--deploy]        # normal deploy (default)
#   ./orchestrator-deploy-hook.sh --rollback        # manual rollback
#   ./orchestrator-deploy-hook.sh --build-staging   # ORCH-58: rebuild staging image (8501)

set -euo pipefail

# ORCH-101 (D7): env-override like every other variable of this hook. The wired
# invokers (self_deploy.build_deploy_command / image_freshness.rebuild_staging_image)
# pass REPO explicitly from ORCH_DEPLOY_HOST_REPO_PATH; the default below serves
# manual operator runs on the current host.
REPO="${REPO:-/home/slin/repos/orchestrator}"

# ---- Defaults (STAGING - safe) ---------------------------------------------
TARGET_SERVICE="${TARGET_SERVICE:-orchestrator-staging}"
TARGET_PORT="${TARGET_PORT:-8501}"
TARGET_IMAGE="${TARGET_IMAGE:-orchestrator-orchestrator-staging}"
COMPOSE_PROFILE="${COMPOSE_PROFILE:-staging}"
PREV_IMAGE_FILE="${PREV_IMAGE_FILE:-$REPO/.deploy-prev-image-staging}"
# Build-once (ORCH-36): optional prevalidated source image to retag onto
# TARGET_IMAGE. Unset -> backward-compatible (no retag), exit-code contract intact.
SOURCE_IMAGE="${SOURCE_IMAGE:-}"
# Provenance guard (ORCH-58, Strategy B): expected git SHA of SOURCE_IMAGE. Unset
# -> backward-compatible (no provenance check), exit-code contract intact.
EXPECTED_REVISION="${EXPECTED_REVISION:-}"
# The OCI-standard label key the Dockerfile stamps with the build commit.
REVISION_LABEL="org.opencontainers.image.revision"

# ---- Log setup -------------------------------------------------------------
LOG_DIR=/var/log/orchestrator
if mkdir -p "$LOG_DIR" 2>/dev/null; then
  LOG="${LOG:-$LOG_DIR/deploy-hook.log}"
else
  LOG="${LOG:-$REPO/deploy-hook.log}"
fi

log() {
  echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*" | tee -a "$LOG"
}

log "Deploy hook called: target=$TARGET_SERVICE port=$TARGET_PORT args=$*"

cd "$REPO"

# ============================================================================
# HEALTH CHECK helper
#   Args: max_attempts sleep_sec label
#   Returns 0 if healthy within attempts, 1 otherwise
# ============================================================================
health_check() {
  local max_attempts="$1"
  local sleep_sec="$2"
  local label="${3:-health-check}"
  local attempt=0
  while [[ $attempt -lt $max_attempts ]]; do
    attempt=$(( attempt + 1 ))
    log "$label: attempt $attempt/$max_attempts - GET http://localhost:$TARGET_PORT/health"
    local http_code body
    body=$(curl -s --max-time 5 "http://localhost:$TARGET_PORT/health" 2>/dev/null || true)
    http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://localhost:$TARGET_PORT/health" 2>/dev/null || echo "000")
    if [[ "$http_code" == "200" ]] && echo "$body" | grep -q '"status":"ok"'; then
      log "$label: OK (HTTP $http_code, body=$body)"
      return 0
    fi
    log "$label: not ready yet (HTTP $http_code, body=$body)"
    if [[ $attempt -lt $max_attempts ]]; then
      sleep "$sleep_sec"
    fi
  done
  log "$label: FAILED after $max_attempts attempts"
  return 1
}

# ============================================================================
# ROLLBACK helper (also called for auto-rollback after bad deploy)
# ============================================================================
do_rollback() {
  log "ROLLBACK: checking $PREV_IMAGE_FILE"
  if [[ ! -s "$PREV_IMAGE_FILE" ]]; then
    log "ROLLBACK: no previous image recorded - rollback skipped (exit 1)"
    return 1
  fi
  local prev_img
  prev_img=$(cat "$PREV_IMAGE_FILE")
  if [[ -z "$prev_img" ]]; then
    log "ROLLBACK: PREV_IMAGE_FILE is empty - rollback skipped (exit 1)"
    return 1
  fi
  if ! docker image inspect "$prev_img" >/dev/null 2>&1; then
    log "ROLLBACK: recorded image '$prev_img' not found locally - rollback skipped (exit 1)"
    return 1
  fi
  log "ROLLBACK: retagging $prev_img -> $TARGET_IMAGE"
  docker tag "$prev_img" "$TARGET_IMAGE" >> "$LOG" 2>&1
  log "ROLLBACK: restarting $TARGET_SERVICE on previous image"
  if [[ -n "$COMPOSE_PROFILE" ]]; then
    docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
  else
    docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
  fi
  log "ROLLBACK: container restarted, running post-rollback health check (5x3s)"
  if health_check 5 3 "ROLLBACK-health"; then
    log "ROLLBACK: service is healthy on previous image ($prev_img)"
    return 0
  else
    log "ROLLBACK: ROLLBACK ALSO FAILED - service still unhealthy after restoring $prev_img"
    return 2
  fi
}

# ============================================================================
# MANUAL --rollback mode
# ============================================================================
if [[ "${1:-}" == "--rollback" ]]; then
  log "Manual ROLLBACK requested"
  if do_rollback; then
    log "Manual ROLLBACK succeeded"
    exit 0
  else
    log "Manual ROLLBACK failed"
    exit 1
  fi
fi

# ============================================================================
# --build-staging mode (ORCH-58, Strategy A): rebuild the STAGING image from the
# VALIDATED commit, recreate 8501, and run the AUTHORITATIVE staging_check against
# the fresh image, so the artefact we validate is the exact one later BUILD-ONCE
# retagged to prod (INV-FRESH, AC-4). Builds/recreates STAGING ONLY (8501), never
# prod (8500). Same exit-code contract (0 = healthy + staging_check PASS).
#   GIT_SHA       - commit stamped into the image revision label (build-arg).
#   BUILD_CONTEXT - docker build context (host worktree of the validated commit).
# Steps: (1) docker build → (2) recreate 8501 → (3a) health-check →
#        (3b) staging_check.py --mode stub against the fresh 8501 (ADR-001 step 3).
# ============================================================================
if [[ "${1:-}" == "--build-staging" ]]; then
  BUILD_CONTEXT="${BUILD_CONTEXT:-$REPO}"
  GIT_SHA="${GIT_SHA:-}"
  log "BUILD-STAGING: rebuilding $TARGET_IMAGE from $BUILD_CONTEXT (GIT_SHA=$GIT_SHA, port=$TARGET_PORT)"
  if ! docker build --build-arg GIT_SHA="$GIT_SHA" -t "$TARGET_IMAGE" "$BUILD_CONTEXT" >> "$LOG" 2>&1; then
    log "BUILD-STAGING: docker build failed - aborting (exit 1)"
    exit 1
  fi
  log "BUILD-STAGING: recreating $TARGET_SERVICE (profile=$COMPOSE_PROFILE) on the fresh image"
  if [[ -n "$COMPOSE_PROFILE" ]]; then
    docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
  else
    docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
  fi
  log "BUILD-STAGING: running health-check on port $TARGET_PORT (10x6s)"
  if ! health_check 10 6 "build-staging-health"; then
    log "BUILD-STAGING: health FAILED after rebuild (exit 1)"
    exit 1
  fi
  log "BUILD-STAGING: $TARGET_SERVICE healthy on fresh image"
  # (3b) ORCH-58 (Strategy A, step 3; ADR-001): authoritative e2e validation of
  # the FRESH image. Run staging_check.py against the just-rebuilt 8501 INSIDE the
  # staging container (ORCH-048 canonical: it reads its OWN staging registry env, so
  # B6 is correct; the script lives at /repos/... via bind-mount, not in /app). This
  # is the same artefact later BUILD-ONCE retagged to prod, so we validate exactly
  # what we promote (AC-4). Any non-zero (FAIL or ORCH_STAGING safety-abort) -> exit 1
  # -> freshness gate FAIL -> rollback to development. Same exit-code contract.
  STAGING_CONTAINER="${STAGING_CONTAINER:-$TARGET_SERVICE}"
  STAGING_CHECK_PATH="${STAGING_CHECK_PATH:-/repos/orchestrator/scripts/staging_check.py}"
  STAGING_CHECK_MODE="${STAGING_CHECK_MODE:-stub}"
  log "BUILD-STAGING: running staging_check (--mode $STAGING_CHECK_MODE) against fresh http://localhost:$TARGET_PORT inside $STAGING_CONTAINER"
  if docker exec "$STAGING_CONTAINER" python3 "$STAGING_CHECK_PATH" \
      --base-url "http://localhost:$TARGET_PORT" --mode "$STAGING_CHECK_MODE" >> "$LOG" 2>&1; then
    log "BUILD-STAGING: staging_check PASS on fresh image (exit 0)"
    exit 0
  fi
  log "BUILD-STAGING: staging_check FAILED on fresh image - artefact not promotable (exit 1)"
  exit 1
fi

# ============================================================================
# NORMAL DEPLOY mode (--deploy or no argument)
# ============================================================================

# 1. Capture currently running image BEFORE restart (best-effort)
PREV_IMG=""
SVC_CID=$(docker compose --profile "$COMPOSE_PROFILE" ps -q "$TARGET_SERVICE" 2>/dev/null || true)
if [[ -n "$SVC_CID" ]]; then
  PREV_IMG=$(docker inspect --format '{{.Image}}' "$SVC_CID" 2>/dev/null || true)
fi
if [[ -n "$PREV_IMG" ]]; then
  echo "$PREV_IMG" > "$PREV_IMAGE_FILE"
  log "Saved previous image: $PREV_IMG -> $PREV_IMAGE_FILE"
else
  log "No previous image captured (first deploy or service not running?)"
fi

# 2a. ORCH-112: resilient pull: converge the shared deploy-base to a clean, current
#     origin/main BEFORE the pull, so a dirty working tree (manual/abandoned WIP left
#     by a failed/cancelled task) never blocks the deploy (incident ORCH-111, dirt from
#     ORCH-104). Gated by CHECKOUT_HYGIENE (Python kill-switch + self-hosting scope,
#     injected by self_deploy.build_deploy_command). NEVER `-x` (would delete gitignored
#     .env / data/*.db / build/); EXCLUDES the untracked-but-not-ignored rollback/log
#     artefacts .deploy-prev-image-* and deploy-hook.log (NFR-2). Best-effort: every git
#     step is `|| log "...continuing"` and the bare `git pull` below still runs
#     (never-break). On a CLEAN base the whole block is a no-op -> the happy-path
#     behaviour and exit-codes (0/1/2, ORCH-036) are byte-for-byte unchanged.
if [[ "${CHECKOUT_HYGIENE:-0}" == "1" ]]; then
  dirty="$(git status --porcelain 2>/dev/null || true)"
  if [[ -n "$dirty" ]]; then
    log "HYGIENE: dirty deploy-base detected, converging to origin/main:"
    log "$dirty"
    git fetch origin main >> "$LOG" 2>&1 || log "HYGIENE: fetch failed (continuing)"
    git reset --hard origin/main >> "$LOG" 2>&1 || log "HYGIENE: reset failed (continuing)"
    git clean -fd \
      -e '.deploy-prev-image-*' \
      -e 'deploy-hook.log' \
      >> "$LOG" 2>&1 || log "HYGIENE: clean failed (continuing)"
    if [[ -n "${HYGIENE_REPORT:-}" ]]; then
      { printf 'dirty=1\n'; printf '%s\n' "$dirty"; } > "$HYGIENE_REPORT" 2>/dev/null || true
    fi
  else
    log "HYGIENE: deploy-base already clean (no-op)"
  fi
fi

# 2. Pull latest code (keeps the host working tree current for future builds;
#    the DEPLOYED artefact is the retagged SOURCE_IMAGE below when build-once).
log "git pull origin main"
git pull origin main >> "$LOG" 2>&1

# 2b. Build-once (ORCH-36): retag the prevalidated staging image onto TARGET_IMAGE
#     instead of rebuilding, so prod runs the exact artefact that passed staging.
#     Backward compatible: skipped when SOURCE_IMAGE is unset.
if [[ -n "$SOURCE_IMAGE" ]]; then
  if docker image inspect "$SOURCE_IMAGE" >/dev/null 2>&1; then
    # ORCH-58 (Strategy B): fail-closed provenance guard BEFORE docker tag.
    # When EXPECTED_REVISION is set, SOURCE_IMAGE's git-commit label MUST match,
    # else exit 1 (FAILED -> БАГ-8 rollback); prod is NEVER touched. Empty label
    # / inspect error / mismatch all fail-close. Unset EXPECTED_REVISION -> no
    # check (backward-compatible for non-self repos / legacy calls).
    if [[ -n "$EXPECTED_REVISION" ]]; then
      IMG_REV=$(docker image inspect --format "{{ index .Config.Labels \"$REVISION_LABEL\" }}" "$SOURCE_IMAGE" 2>/dev/null || true)
      if [[ "$IMG_REV" == "<no value>" ]]; then IMG_REV=""; fi
      if [[ -z "$IMG_REV" || "$IMG_REV" != "$EXPECTED_REVISION" ]]; then
        log "PROVENANCE: SOURCE_IMAGE revision '$IMG_REV' != expected '$EXPECTED_REVISION' (fail-closed) - aborting (exit 1)"
        exit 1
      fi
      log "PROVENANCE: SOURCE_IMAGE revision matches expected ($EXPECTED_REVISION) - retag allowed"
    fi
    log "BUILD-ONCE: retagging $SOURCE_IMAGE -> $TARGET_IMAGE (no rebuild)"
    docker tag "$SOURCE_IMAGE" "$TARGET_IMAGE" >> "$LOG" 2>&1
  else
    log "BUILD-ONCE: SOURCE_IMAGE '$SOURCE_IMAGE' not found locally - aborting (exit 1)"
    exit 1
  fi
fi

# 3. Restart service
log "Starting $TARGET_SERVICE (profile=$COMPOSE_PROFILE)"
if [[ -n "$COMPOSE_PROFILE" ]]; then
  docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
else
  docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
fi
log "$TARGET_SERVICE restarted"

# 4. Health-check loop: 10 attempts x 6 seconds = up to 60s
log "Starting health-check: 10 attempts x 6s (max 60s)"
if health_check 10 6 "deploy-health"; then
  log "Deploy SUCCESS: $TARGET_SERVICE healthy on port $TARGET_PORT"
  exit 0
fi

# 5. Health failed -> AUTO ROLLBACK
log "deploy FAILED: health not ok after 60s - initiating AUTO ROLLBACK"
rollback_rc=0
do_rollback || rollback_rc=$?

if [[ $rollback_rc -eq 0 ]]; then
  log "deploy FAILED, rolled back to previous image successfully - exit 1"
  exit 1
elif [[ $rollback_rc -eq 2 ]]; then
  log "deploy FAILED, ROLLBACK ALSO FAILED - service may be down - exit 2"
  exit 2
else
  log "deploy FAILED, rollback skipped (no previous image) - exit 1"
  exit 1
fi