feat(staging): add orchestrator deploy hook with health-check and auto-rollback (ORCH-34)

Merge pull request 'feat(staging): add live staging check suite (smoke + access + e2e) [ORCH-33]' (#29 ) from feature/ORCH-33-staging-testsuite into main
feat(staging): add live staging check suite (smoke + access + e2e)
2026-06-05 09:26:12 +03:00 · 2026-06-05 09:12:51 +03:00 · 2026-06-05 08:54:56 +03:00 · 2026-06-05 08:01:10 +03:00 · 2026-06-05 07:34:48 +03:00 · 2026-06-05 07:29:04 +03:00
11 changed files with 1312 additions and 12 deletions
--- a/.env.staging.example
+++ b/.env.staging.example
@@ -0,0 +1,52 @@
+# STAGING env for orchestrator-staging (port 8501).
+# Plane/Gitea tokens and sandbox project — configured in ORCH-32.
+# On Stage 1 (ORCH-31) you can copy from prod .env, changing only isolation-related keys.
+#
+# DO NOT COMMIT the real .env.staging — this file is the template only.
+# Create .env.staging on the server and fill in real values before starting staging.
+
+# ── Plane ─────────────────────────────────────────────────────────────────────
+ORCH_PLANE_API_URL=http://localhost:8091
+ORCH_PLANE_API_TOKEN=<plane-api-token>
+ORCH_PLANE_WORKSPACE_SLUG=<workspace-slug>
+ORCH_PLANE_WEBHOOK_SECRET=<webhook-secret>
+
+# Per-agent Plane bot tokens (authorship in Plane comments).
+# Leave empty to use ORCH_PLANE_API_TOKEN fallback.
+ORCH_PLANE_BOT_ANALYST=
+ORCH_PLANE_BOT_ARCHITECT=
+ORCH_PLANE_BOT_DEVELOPER=
+ORCH_PLANE_BOT_REVIEWER=
+ORCH_PLANE_BOT_TESTER=
+ORCH_PLANE_BOT_DEPLOYER=
+ORCH_PLANE_BOT_STREAM=
+
+# ── Gitea ─────────────────────────────────────────────────────────────────────
+ORCH_GITEA_URL=http://localhost:3000
+ORCH_GITEA_PUBLIC_URL=https://git.mva154.duckdns.org
+ORCH_GITEA_TOKEN=<gitea-token>
+ORCH_GITEA_WEBHOOK_SECRET=<gitea-webhook-secret>
+
+# ── Telegram ──────────────────────────────────────────────────────────────────
+ORCH_TELEGRAM_BOT_TOKEN=<telegram-bot-token>
+ORCH_TELEGRAM_CHAT_ID=<telegram-chat-id>
+
+# ── Claude / repos ────────────────────────────────────────────────────────────
+ORCH_CLAUDE_BIN=/usr/bin/claude
+ORCH_REPOS_DIR=/repos
+ORCH_HOST_REPOS_DIR=/home/slin/repos
+
+# ── Database (ISOLATION KEY for staging) ─────────────────────────────────────
+# The staging volume mounts ./data/staging:/app/data, so the DB physically lives
+# at ./data/staging/orchestrator.db on the host — fully isolated from prod.
+# Do NOT change this path; isolation is achieved via the volume mount, not this path.
+ORCH_DB_PATH=/app/data/orchestrator.db
+
+# ── Concurrency / worker ──────────────────────────────────────────────────────
+ORCH_MAX_CONCURRENCY=1
+ORCH_QUEUE_POLL_INTERVAL=2.0
+
+# ── Deploy hook ───────────────────────────────────────────────────────────────
+DEPLOY_SSH_USER=slin
+DEPLOY_SSH_HOST=127.0.0.1
+DEPLOY_HOOK_SCRIPT=/home/slin/bin/enduro-deploy-hook.sh
--- a/.gitea/workflows/ci.yml
+++ b/.gitea/workflows/ci.yml
@@ -0,0 +1,22 @@
+name: CI
+on:
+  push:
+    branches: ["feature/**", "bugfix/**", "hotfix/**", "fix/**", "ci/**"]
+  pull_request:
+    branches: [main]
+
+jobs:
+  test:
+    runs-on: self-hosted
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install dependencies
+        run: |
+          python3 -m pip install --user --upgrade pip
+          python3 -m pip install --user -r requirements.txt
+      - name: Test
+        env:
+          PYTHONPATH: ${{ github.workspace }}
+        run: |
+          export PATH="$HOME/.local/bin:$PATH"
+          python3 -m pytest tests/ -q
--- a/.gitignore
+++ b/.gitignore
@@ -5,3 +5,7 @@ __pycache__/
 data/
 *.db
 .pytest_cache/
+# ORCH-31: staging env (secrets, not committed — see .env.staging.example)
+.env.staging
+# ORCH-31: staging DB data directory
+data/staging/
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -25,3 +25,39 @@ services:
      - DEPLOY_HOOK_SCRIPT=/home/slin/bin/enduro-deploy-hook.sh
    group_add:
      - "999"
+
+  # ORCH-31: staging instance (port 8501, isolated DB).
+  # Starts ONLY with: docker compose --profile staging up -d orchestrator-staging
+  # Normal "docker compose up -d" does NOT start this service.
+  orchestrator-staging:
+    profiles:
+      - staging
+    build: .
+    container_name: orchestrator-staging
+    restart: unless-stopped
+    init: true
+    network_mode: host
+    command: ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8501"]
+    volumes:
+      - ./data/staging:/app/data
+      - /home/slin/repos:/repos
+      - /var/run/docker.sock:/var/run/docker.sock
+      - /usr/lib/node_modules/@anthropic-ai/claude-code:/opt/claude-code:ro
+      - /usr/bin/node:/usr/bin/node:ro
+      - /home/slin/.claude:/home/slin/.claude
+      - /home/slin/.claude.json:/home/slin/.claude.json:ro
+      - /home/slin/.orchestrator-ssh:/root/.ssh:ro
+    env_file: .env.staging
+    environment:
+      - ORCH_REPOS_DIR=/repos
+      - ORCH_HOST_REPOS_DIR=/home/slin/repos
+      - DEPLOY_SSH_USER=slin
+      - DEPLOY_SSH_HOST=127.0.0.1
+      - DEPLOY_HOOK_SCRIPT=/home/slin/bin/enduro-deploy-hook.sh
+      # Staging DB is isolated via ./data/staging volume mount.
+      # Inside the container the path remains /app/data/orchestrator.db (same default),
+      # but on the host it physically lives at ./data/staging/orchestrator.db — 
+      # completely separate from prod ./data/orchestrator.db.
+      - ORCH_DB_PATH=/app/data/orchestrator.db
+    group_add:
+      - "999"
--- a/docs/DEPLOY_HOOK.md
+++ b/docs/DEPLOY_HOOK.md
@@ -0,0 +1,90 @@
+# Orchestrator Deploy Hook
+
+`scripts/orchestrator-deploy-hook.sh` — хост-скрипт деплоя orchestrator с health-чеком и авто-rollback.
+
+## Как работает
+
+### Режим `--deploy` (по умолчанию)
+
+1. **Захват текущего образа** — до рестарта записывает ID образа работающего контейнера в `$PREV_IMAGE_FILE` (best-effort, не падает если сервис не запущен).
+2. **git pull** — обновляет код репозитория.
+3. **Рестарт контейнера** — `docker compose --profile $COMPOSE_PROFILE up -d --no-build $TARGET_SERVICE`.
+4. **Health-цикл** — 10 попыток × 6с = до 60с. Критерий: HTTP 200 + тело содержит `"status":"ok"`.
+   - **Успех** → `exit 0`, лог "Deploy SUCCESS".
+   - **Провал** → авто-rollback (шаг 5).
+5. **Авто-rollback** — восстанавливает образ из `$PREV_IMAGE_FILE`, рестарт, повторный health 5×3с.
+   - Если восстановился → `exit 1` (деплой провалился, откат успешен).
+   - Если и откат не помог → `exit 2` (критично).
+
+### Режим `--rollback`
+
+Вручную откатывает сервис на предыдущий образ из `$PREV_IMAGE_FILE`.
+
+## Переменные окружения
+
+| Переменная       | Дефолт                            | Описание                                      |
+|------------------|-----------------------------------|-----------------------------------------------|
+| `TARGET_SERVICE` | `orchestrator-staging`            | Имя docker-compose сервиса                    |
+| `TARGET_PORT`    | `8501`                            | Порт health-check                             |
+| `TARGET_IMAGE`   | `orchestrator-orchestrator-staging` | Имя образа для retag при rollback           |
+| `COMPOSE_PROFILE`| `staging`                         | Docker compose profile (пусто = без профиля) |
+| `PREV_IMAGE_FILE`| `$REPO/.deploy-prev-image-staging`| Файл для сохранения предыдущего образа        |
+| `LOG`            | `/var/log/orchestrator/deploy-hook.log` | Лог-файл (fallback: `$REPO/deploy-hook.log`) |
+
+> ⚠️ **Дефолт — всегда STAGING**. Прод активируется только явным переопределением env.
+
+## Примеры запуска
+
+### Staging (дефолт, безопасно)
+
+```bash
+cd /home/slin/repos/orchestrator
+bash scripts/orchestrator-deploy-hook.sh --deploy
+# или просто:
+bash scripts/orchestrator-deploy-hook.sh
+```
+
+### Прод (осознанный шаг, Этап 5)
+
+```bash
+TARGET_SERVICE=orchestrator \
+TARGET_PORT=8500 \
+TARGET_IMAGE=orchestrator-orchestrator \
+COMPOSE_PROFILE="" \
+PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
+bash scripts/orchestrator-deploy-hook.sh --deploy
+```
+
+### Ручной rollback staging
+
+```bash
+bash scripts/orchestrator-deploy-hook.sh --rollback
+```
+
+## Коды выхода
+
+| Код | Значение                                             |
+|-----|------------------------------------------------------|
+| `0` | Деплой успешен, сервис здоров                        |
+| `1` | Деплой провалился; откат выполнен (или пропущен)     |
+| `2` | Деплой провалился И откат тоже провалился (критично) |
+
+## Логи
+
+```
+/var/log/orchestrator/deploy-hook.log
+```
+
+Каждая строка с UTC-таймстампом в формате `[2026-06-05T06:30:00Z]`.
+
+## Разница с enduro-deploy-hook.sh
+
+| Функция              | enduro-deploy-hook.sh | orchestrator-deploy-hook.sh |
+|----------------------|-----------------------|-----------------------------|
+| Захват PREV_IMG      | ✅                    | ✅                          |
+| git pull             | ✅                    | ✅                          |
+| Рестарт              | ✅                    | ✅                          |
+| Health-цикл (60с)    | ❌                    | ✅ 10×6с                    |
+| Авто-rollback        | ❌                    | ✅                          |
+| Параметризация (env) | ❌ хардкод            | ✅ дефолт=staging           |
+| Compose profile      | ❌                    | ✅ --profile staging        |
--- a/docs/STAGING.md
+++ b/docs/STAGING.md
@@ -0,0 +1,85 @@
+# Staging Environment (ORCH-31)
+
+Orchestrator supports a permanent **staging instance** running on port **8501** with a
+fully-isolated SQLite database. The staging instance shares the same codebase and
+Dockerfile as production but is started under the `staging` Docker Compose profile so it
+**never starts accidentally** during a normal `docker compose up -d`.
+
+## Architecture
+
+| | Production | Staging |
+|---|---|---|
+| Port | 8500 | 8501 |
+| Container name | `orchestrator` | `orchestrator-staging` |
+| DB (host path) | `./data/orchestrator.db` | `./data/staging/orchestrator.db` |
+| DB (container path) | `/app/data/orchestrator.db` | `/app/data/orchestrator.db` |
+| env file | `.env` | `.env.staging` |
+| Compose profile | *(default)* | `staging` |
+
+DB isolation is achieved via a separate volume mount (`./data/staging:/app/data`), not by
+changing `ORCH_DB_PATH` — the container path stays identical while the host path is a
+different directory.
+
+## Prerequisites
+
+1. **`.env.staging`** — create from the template (see below). This file is **not committed**
+   to the repo (it contains secrets). Copy and fill in values before first start.
+2. **`./data/staging/`** directory — created automatically on first container start.
+
+### Create `.env.staging`
+
+```bash
+cd /home/slin/repos/orchestrator
+cp .env.staging.example .env.staging
+# Edit .env.staging — fill in real tokens / secrets.
+# At Stage 1 (ORCH-31) you can reuse prod values; sandbox Plane project
+# and isolated Gitea webhook will be wired in ORCH-32.
+nano .env.staging
+```
+
+## Starting Staging
+
+```bash
+cd /home/slin/repos/orchestrator
+docker compose --profile staging up -d orchestrator-staging
+```
+
+Check it is running:
+
+```bash
+docker ps | grep orchestrator-staging
+curl -s http://localhost:8501/health | python3 -m json.tool
+```
+
+## Stopping Staging
+
+```bash
+docker compose --profile staging stop orchestrator-staging
+# or remove the container entirely:
+docker compose --profile staging down orchestrator-staging
+```
+
+## Normal `up -d` does NOT start staging
+
+```bash
+# This starts ONLY the prod orchestrator (port 8500). Staging is NOT affected.
+docker compose up -d
+```
+
+The `profiles: [staging]` directive in `docker-compose.yml` ensures staging is
+completely invisible to commands that do not pass `--profile staging`.
+
+## Logs
+
+```bash
+docker logs -f orchestrator-staging
+```
+
+## Roadmap
+
+| Task | Description |
+|---|---|
+| **ORCH-31** *(this PR)* | Infra: compose service, .env template, gitignore, docs |
+| **ORCH-32** | Sandbox: isolated Plane project + Gitea repo for staging |
+| **ORCH-33** | Test suite running against staging endpoint |
+| **ORCH-34** | Deploy hook: promote `orchestrator:candidate` image to staging |
--- a/docs/STAGING_CHECK.md
+++ b/docs/STAGING_CHECK.md
@@ -0,0 +1,136 @@
+# STAGING_CHECK.md — Инструкция по запуску staging check suite (ORCH-33)
+
+## Что это
+
+`scripts/staging_check.py` — самостоятельный скрипт проверки **живого** staging-стенда orchestrator (порт 8501). Не unit-тесты — реальные HTTP-вызовы против работающих сервисов.
+
+Три блока проверок:
+
+| Блок | Название | Что проверяет |
+|------|----------|---------------|
+| A    | SMOKE    | `/health`, `/queue`, `ORCH_STAGING=true` |
+| B    | ACCESS   | Plane sandbox (R), Gitea sandbox (R+push), реестр проектов |
+| C    | E2E      | Создать задачу → триггер конвейера → ветка + коммент → cleanup |
+
+Exit code: **0** = все PASS, **non-zero** = есть FAIL.
+
+---
+
+## Требования к окружению
+
+Скрипт читает токены/URL из env (те же переменные, что использует orchestrator):
+
+| Переменная | Описание |
+|-----------|----------|
+| `ORCH_STAGING` | Должна быть `true` — защита от случайного запуска на проде |
+| `ORCH_PLANE_API_TOKEN` | Plane API token (`X-API-Key`) |
+| `ORCH_PLANE_API_URL` | Plane base URL **без** `/api/v1` (скрипт добавляет сам) |
+| `ORCH_PLANE_WORKSPACE_SLUG` | Workspace slug (`ag_proj`) |
+| `ORCH_GITEA_TOKEN` | Gitea token (`Authorization: token …`) |
+| `ORCH_GITEA_URL` | Gitea base URL (`http://localhost:3000`) |
+| `ORCH_PLANE_WEBHOOK_SECRET` | HMAC-секрет для подписи `/webhook/plane` (если пустой — без подписи) |
+
+Все эти переменные **уже есть** внутри контейнера `orchestrator-staging`.
+
+---
+
+## Способы запуска
+
+### 1. Внутри контейнера (рекомендуемый)
+
+```bash
+docker exec orchestrator-staging \
+  python3 /repos/orchestrator/scripts/staging_check.py --mode stub
+```
+
+### 2. С хоста (если есть токены в env)
+
+```bash
+export ORCH_STAGING=true
+export ORCH_PLANE_API_TOKEN=...
+# ... остальные переменные ...
+
+python3 scripts/staging_check.py \
+  --base-url http://localhost:8501 \
+  --mode stub
+```
+
+### 3. Из docker exec с передачей URL
+
+```bash
+docker exec orchestrator-staging \
+  python3 /repos/orchestrator/scripts/staging_check.py \
+  --base-url http://localhost:8501 \
+  --mode stub
+```
+
+---
+
+## Режимы (`--mode`)
+
+| Режим | Описание | Скорость |
+|-------|----------|----------|
+| `stub` (дефолт) | Проверяет **ранние артефакты** конвейера: ветка + QG-0-коммент. Создаются ДО запуска Claude CLI → быстро, детерминированно, без расхода LLM-кредитов. | ~30-90 сек |
+| `full-real` | Дополнительно ждёт реального завершения аналитика. Долго, расходует LLM-кредиты. | 5-15+ мин |
+
+**Текущий дефолт: `stub`** — достаточен для проверки работоспособности стенда.
+
+---
+
+## Что проверяет блок C (E2E) и почему это безопасно
+
+Порядок `start_pipeline` в коде orchestrator:
+1. Resolve проекта из реестра
+2. Получить name/description из Plane API (если в webhook пустые)
+3. **QG-0 гейт** (name ≥ 5 симв, description ≥ 20 симв)
+4. **Создать work_item_id + ветку в Gitea + начальные доки**
+5. **Записать строку задачи в БД**
+6. Поставить аналитика в очередь (вот тут Claude CLI)
+
+Блок C проверяет **шаги 4-5**, аналитика (шаг 6) **не ждёт**.  
+Тест-задача создаётся ТОЛЬКО в **SANDBOX** (`project_id 8c5a3025-...`),  
+ветка создаётся ТОЛЬКО в **orchestrator-sandbox**.
+
+### CLEANUP (обязателен)
+
+`try/finally` гарантирует удаление тестовых артефактов:
+- Удаляет ветку из `orchestrator-sandbox`
+- Удаляет задачу из Plane SANDBOX
+
+Cleanup отрабатывает даже при падении e2e.
+
+---
+
+## Принцип HMAC-подписи
+
+Скрипт читает `ORCH_PLANE_WEBHOOK_SECRET` из env и формирует подпись:
+```python
+hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
+```
+Передаёт как заголовок `X-Plane-Signature`. Алгоритм совпадает с `verify_plane_signature` в `src/webhooks/plane.py`.
+
+---
+
+## Изолированность от прода
+
+| Проверка | Гарантия |
+|---------|---------|
+| A3 `ORCH_STAGING=true` | При false — abort до деструктивных блоков |
+| B6 Реестр без боевых | ET/ORCH project_id absent в `known_plane_project_ids()` |
+| C: only SANDBOX project_id | Webhook payload указывает только `8c5a3025-...` |
+| C: only orchestrator-sandbox repo | Gitea operations на `admin/orchestrator-sandbox` |
+| C: cleanup в finally | Артефакты удаляются даже при ошибке |
+
+---
+
+## Добавление в деплой-хук
+
+```bash
+# В deploy.sh, после docker-compose up -d orchestrator-staging
+docker exec orchestrator-staging \
+  python3 /repos/orchestrator/scripts/staging_check.py --mode stub
+if [ $? -ne 0 ]; then
+  echo "Staging check FAILED — rolling back"
+  exit 1
+fi
+```
--- a/scripts/orchestrator-deploy-hook.sh
+++ b/scripts/orchestrator-deploy-hook.sh
@@ -0,0 +1,176 @@
+#!/bin/bash
+# Deploy hook for orchestrator
+# Supports --deploy (default) and --rollback modes.
+# Adds health-check loop + automatic rollback if new deploy is unhealthy.
+#
+# Parametrised via env vars (defaults are STAGING — never prod):
+#   TARGET_SERVICE   - docker-compose service name  (default: orchestrator-staging)
+#   TARGET_PORT      - health check port            (default: 8501)
+#   TARGET_IMAGE     - image name for retag         (default: orchestrator-orchestrator-staging)
+#   COMPOSE_PROFILE  - docker compose profile       (default: staging)
+#   PREV_IMAGE_FILE  - path to prev-image snapshot  (default: $REPO/.deploy-prev-image-staging)
+#   LOG              - log file path                (default: /var/log/orchestrator/deploy-hook.log)
+#
+# Usage:
+#   ./orchestrator-deploy-hook.sh [--deploy]    # normal deploy (default)
+#   ./orchestrator-deploy-hook.sh --rollback    # manual rollback
+
+set -euo pipefail
+
+REPO=/home/slin/repos/orchestrator
+
+# ---- Defaults (STAGING — safe) ---------------------------------------------
+TARGET_SERVICE="${TARGET_SERVICE:-orchestrator-staging}"
+TARGET_PORT="${TARGET_PORT:-8501}"
+TARGET_IMAGE="${TARGET_IMAGE:-orchestrator-orchestrator-staging}"
+COMPOSE_PROFILE="${COMPOSE_PROFILE:-staging}"
+PREV_IMAGE_FILE="${PREV_IMAGE_FILE:-$REPO/.deploy-prev-image-staging}"
+
+# ---- Log setup -------------------------------------------------------------
+LOG_DIR=/var/log/orchestrator
+if mkdir -p "$LOG_DIR" 2>/dev/null; then
+    LOG="${LOG:-$LOG_DIR/deploy-hook.log}"
+else
+    LOG="${LOG:-$REPO/deploy-hook.log}"
+fi
+
+log() {
+    echo "[$(date -u +%Y-%m-%dT%H:%M:%SZ)] $*" | tee -a "$LOG"
+}
+
+log "Deploy hook called: target=$TARGET_SERVICE port=$TARGET_PORT args=$*"
+
+cd "$REPO"
+
+# ============================================================================
+# HEALTH CHECK helper
+# Args: max_attempts  sleep_sec  label
+# Returns 0 if healthy within attempts, 1 otherwise
+# ============================================================================
+health_check() {
+    local max_attempts="$1"
+    local sleep_sec="$2"
+    local label="${3:-health-check}"
+    local attempt=0
+    while [[ $attempt -lt $max_attempts ]]; do
+        attempt=$(( attempt + 1 ))
+        log "$label: attempt $attempt/$max_attempts - GET http://localhost:$TARGET_PORT/health"
+        local http_code body
+        body=$(curl -s --max-time 5 "http://localhost:$TARGET_PORT/health" 2>/dev/null || true)
+        http_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://localhost:$TARGET_PORT/health" 2>/dev/null || echo "000")
+        if [[ "$http_code" == "200" ]] && echo "$body" | grep -q '"status":"ok"'; then
+            log "$label: OK (HTTP $http_code, body=$body)"
+            return 0
+        fi
+        log "$label: not ready yet (HTTP $http_code, body=$body)"
+        if [[ $attempt -lt $max_attempts ]]; then
+            sleep "$sleep_sec"
+        fi
+    done
+    log "$label: FAILED after $max_attempts attempts"
+    return 1
+}
+
+# ============================================================================
+# ROLLBACK helper (also called for auto-rollback after bad deploy)
+# ============================================================================
+do_rollback() {
+    log "ROLLBACK: checking $PREV_IMAGE_FILE"
+    if [[ ! -s "$PREV_IMAGE_FILE" ]]; then
+        log "ROLLBACK: no previous image recorded - rollback skipped (exit 1)"
+        return 1
+    fi
+    local prev_img
+    prev_img=$(cat "$PREV_IMAGE_FILE")
+    if [[ -z "$prev_img" ]]; then
+        log "ROLLBACK: PREV_IMAGE_FILE is empty - rollback skipped (exit 1)"
+        return 1
+    fi
+    if ! docker image inspect "$prev_img" >/dev/null 2>&1; then
+        log "ROLLBACK: recorded image '$prev_img' not found locally - rollback skipped (exit 1)"
+        return 1
+    fi
+    log "ROLLBACK: retagging $prev_img -> $TARGET_IMAGE"
+    docker tag "$prev_img" "$TARGET_IMAGE" >> "$LOG" 2>&1
+    log "ROLLBACK: restarting $TARGET_SERVICE on previous image"
+    if [[ -n "$COMPOSE_PROFILE" ]]; then
+        docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
+    else
+        docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
+    fi
+    log "ROLLBACK: container restarted, running post-rollback health check (5x3s)"
+    if health_check 5 3 "ROLLBACK-health"; then
+        log "ROLLBACK: service is healthy on previous image ($prev_img)"
+        return 0
+    else
+        log "ROLLBACK: ROLLBACK ALSO FAILED - service still unhealthy after restoring $prev_img"
+        return 2
+    fi
+}
+
+# ============================================================================
+# MANUAL --rollback mode
+# ============================================================================
+if [[ "${1:-}" == "--rollback" ]]; then
+    log "Manual ROLLBACK requested"
+    if do_rollback; then
+        log "Manual ROLLBACK succeeded"
+        exit 0
+    else
+        log "Manual ROLLBACK failed"
+        exit 1
+    fi
+fi
+
+# ============================================================================
+# NORMAL DEPLOY mode (--deploy or no argument)
+# ============================================================================
+
+# 1. Capture currently running image BEFORE restart (best-effort)
+PREV_IMG=""
+SVC_CID=$(docker compose --profile "$COMPOSE_PROFILE" ps -q "$TARGET_SERVICE" 2>/dev/null || true)
+if [[ -n "$SVC_CID" ]]; then
+    PREV_IMG=$(docker inspect --format '{{.Image}}' "$SVC_CID" 2>/dev/null || true)
+fi
+if [[ -n "$PREV_IMG" ]]; then
+    echo "$PREV_IMG" > "$PREV_IMAGE_FILE"
+    log "Saved previous image: $PREV_IMG -> $PREV_IMAGE_FILE"
+else
+    log "No previous image captured (first deploy or service not running?)"
+fi
+
+# 2. Pull latest code
+log "git pull origin main"
+git pull origin main >> "$LOG" 2>&1
+
+# 3. Restart service
+log "Starting $TARGET_SERVICE (profile=$COMPOSE_PROFILE)"
+if [[ -n "$COMPOSE_PROFILE" ]]; then
+    docker compose --profile "$COMPOSE_PROFILE" up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
+else
+    docker compose up -d --no-build "$TARGET_SERVICE" >> "$LOG" 2>&1
+fi
+log "$TARGET_SERVICE restarted"
+
+# 4. Health-check loop: 10 attempts x 6 seconds = up to 60s
+log "Starting health-check: 10 attempts x 6s (max 60s)"
+if health_check 10 6 "deploy-health"; then
+    log "Deploy SUCCESS: $TARGET_SERVICE healthy on port $TARGET_PORT"
+    exit 0
+fi
+
+# 5. Health failed -> AUTO ROLLBACK
+log "deploy FAILED: health not ok after 60s - initiating AUTO ROLLBACK"
+rollback_rc=0
+do_rollback || rollback_rc=$?
+
+if [[ $rollback_rc -eq 0 ]]; then
+    log "deploy FAILED, rolled back to previous image successfully - exit 1"
+    exit 1
+elif [[ $rollback_rc -eq 2 ]]; then
+    log "deploy FAILED, ROLLBACK ALSO FAILED - service may be down - exit 2"
+    exit 2
+else
+    log "deploy FAILED, rollback skipped (no previous image) - exit 1"
+    exit 1
+fi
--- a/scripts/staging_check.py
+++ b/scripts/staging_check.py
@@ -0,0 +1,639 @@
+#!/usr/bin/env python3
+"""
+staging_check.py — Live staging-stand health & e2e check suite (ORCH-33).
+
+Checks:
+  Block A — SMOKE (health/queue, correct env)
+  Block B — ACCESS (read-only calls to Plane sandbox + Gitea sandbox + registry)
+  Block C — E2E   (create task in SANDBOX → trigger pipeline via /webhook/plane
+                   → verify branch + job enqueued → CLEANUP in finally)
+
+Usage (inside the container or with correct env set):
+    python3 scripts/staging_check.py [--base-url http://localhost:8501] [--mode stub|full-real]
+
+Exit code: 0 = all PASS, non-zero = at least one FAIL.
+
+NOTE on modes:
+  stub      — default; checks early pipeline artifacts (branch + analyst job
+              enqueued) created BEFORE Claude CLI is invoked.
+              Fast, deterministic, no LLM spend.
+  full-real — additionally waits for the analyst agent to finish (long, costs
+              credits). Not the default.
+
+NOTE on Plane comments (403):
+  The orchestrator posts the "🔍 Analyst запущен" comment using per-agent bot
+  tokens (ORCH_PLANE_BOT_ANALYST). These bot accounts must be added as members
+  of every Plane project they comment on. In staging the sandbox project was
+  created after the bots were provisioned → the bots are not yet members of
+  SANDBOX → add_comment returns 403 Forbidden.
+
+  This is a known infrastructure limitation of the staging sandbox, NOT a bug
+  in the pipeline itself. C9b therefore verifies pipeline success via the
+  staging job queue (/queue → recent) instead of Plane comments: the analyst
+  job is enqueued BEFORE the add_comment call and its presence in the queue
+  proves the pipeline ran through correctly.
+"""
+
+import argparse
+import hashlib
+import hmac
+import json
+import os
+import sys
+import time
+import datetime
+import urllib.request
+import urllib.error
+import urllib.parse
+
+# ---------------------------------------------------------------------------
+# Colour helpers
+# ---------------------------------------------------------------------------
+_BOLD = "\033[1m"
+_GREEN = "\033[32m"
+_RED = "\033[31m"
+_YELLOW = "\033[33m"
+_RESET = "\033[0m"
+
+
+def _ok(msg: str) -> str:
+    return f"  {_GREEN}✓ PASS{_RESET}  {msg}"
+
+
+def _fail(msg: str) -> str:
+    return f"  {_RED}✗ FAIL{_RESET}  {msg}"
+
+
+def _info(msg: str) -> str:
+    return f"  {_YELLOW}·{_RESET}      {msg}"
+
+
+# ---------------------------------------------------------------------------
+# Low-level HTTP helpers (stdlib only — no requests/httpx in scripts/)
+# ---------------------------------------------------------------------------
+
+def _http(method: str, url: str, headers: dict | None = None,
+          body: bytes | None = None, timeout: int = 15) -> tuple[int, bytes]:
+    """Simple HTTP wrapper. Returns (status_code, response_body)."""
+    req = urllib.request.Request(url, data=body, headers=headers or {}, method=method)
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            return resp.status, resp.read()
+    except urllib.error.HTTPError as e:
+        return e.code, e.read()
+    except Exception as e:
+        raise RuntimeError(f"{method} {url} → {e}") from e
+
+
+def _get(url: str, headers: dict | None = None, timeout: int = 15) -> tuple[int, dict]:
+    status, body = _http("GET", url, headers=headers, timeout=timeout)
+    try:
+        data = json.loads(body)
+    except Exception:
+        data = {"_raw": body.decode(errors="replace")}
+    return status, data
+
+
+def _post(url: str, headers: dict | None = None, payload: dict | None = None,
+          raw_body: bytes | None = None, timeout: int = 15) -> tuple[int, dict]:
+    if raw_body is not None:
+        body = raw_body
+        h = dict(headers or {})
+        if "Content-Type" not in h:
+            h["Content-Type"] = "application/json"
+    else:
+        body = json.dumps(payload or {}).encode()
+        h = dict(headers or {})
+        h["Content-Type"] = "application/json"
+    status, resp_body = _http("POST", url, headers=h, body=body, timeout=timeout)
+    try:
+        data = json.loads(resp_body)
+    except Exception:
+        data = {"_raw": resp_body.decode(errors="replace")}
+    return status, data
+
+
+def _patch(url: str, headers: dict | None = None, payload: dict | None = None,
+           timeout: int = 15) -> tuple[int, dict]:
+    body = json.dumps(payload or {}).encode()
+    h = dict(headers or {})
+    h["Content-Type"] = "application/json"
+    status, resp_body = _http("PATCH", url, headers=h, body=body, timeout=timeout)
+    try:
+        data = json.loads(resp_body)
+    except Exception:
+        data = {"_raw": resp_body.decode(errors="replace")}
+    return status, data
+
+
+def _delete(url: str, headers: dict | None = None, timeout: int = 15) -> int:
+    status, _ = _http("DELETE", url, headers=headers, timeout=timeout)
+    return status
+
+
+# ---------------------------------------------------------------------------
+# HMAC helper for /webhook/plane
+# ---------------------------------------------------------------------------
+
+def _sign_payload(secret: str, body: bytes) -> str:
+    """Compute HMAC-SHA256 signature — matches verify_plane_signature in plane.py."""
+    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
+
+
+# ---------------------------------------------------------------------------
+# Result tracking
+# ---------------------------------------------------------------------------
+
+class Results:
+    def __init__(self):
+        self._items: list[tuple[str, bool, str]] = []  # (label, passed, detail)
+
+    def add(self, label: str, passed: bool, detail: str = ""):
+        self._items.append((label, passed, detail))
+        line = _ok(label) if passed else _fail(label)
+        if detail:
+            line += f"  [{detail}]"
+        print(line)
+
+    def summary(self) -> bool:
+        passed = sum(1 for _, ok, _ in self._items if ok)
+        total = len(self._items)
+        all_ok = passed == total
+        colour = _GREEN if all_ok else _RED
+        print()
+        print(f"{_BOLD}{'='*60}{_RESET}")
+        print(f"{colour}{_BOLD}  RESULT: {passed}/{total} checks PASS{_RESET}")
+        print(f"{_BOLD}{'='*60}{_RESET}")
+        return all_ok
+
+
+# ---------------------------------------------------------------------------
+# Block A — SMOKE
+# ---------------------------------------------------------------------------
+
+def block_a(base: str, results: Results):
+    print(f"\n{_BOLD}[Block A] SMOKE{_RESET}")
+
+    # A1 — /health
+    try:
+        status, data = _get(f"{base}/health")
+        ok = status == 200 and data.get("status") == "ok"
+        results.add("A1 GET /health → 200 status=ok", ok,
+                    f"HTTP {status}, body={data}")
+    except Exception as e:
+        results.add("A1 GET /health → 200 status=ok", False, str(e))
+
+    # A2 — /queue
+    try:
+        status, data = _get(f"{base}/queue")
+        ok = (status == 200
+              and "counts" in data
+              and "max_concurrency" in data
+              and "resilience" in data)
+        results.add("A2 GET /queue → 200 with counts/max_concurrency/resilience", ok,
+                    f"HTTP {status}, keys={list(data.keys())}")
+    except Exception as e:
+        results.add("A2 GET /queue → 200 with counts/max_concurrency/resilience", False, str(e))
+
+    # A3 — ORCH_STAGING=true in env (guard against hitting prod)
+    staging_flag = os.environ.get("ORCH_STAGING", "").lower()
+    ok = staging_flag == "true"
+    results.add("A3 ORCH_STAGING=true (not prod)", ok,
+                f"ORCH_STAGING={os.environ.get('ORCH_STAGING', '<unset>')}")
+    if not ok:
+        print(_fail("  ⛔ Safety abort: ORCH_STAGING is not 'true'. "
+                    "This might be prod. Skipping destructive blocks B/C."))
+        sys.exit(2)
+
+
+# ---------------------------------------------------------------------------
+# Block B — ACCESS
+# ---------------------------------------------------------------------------
+
+SANDBOX_PROJECT_ID = "8c5a3025-4f9d-4190-b79f-fa06276bb27e"
+PROD_ET_PROJECT_ID = "7a79f0a9-5278-49cd-9007-9a338f238f9c"
+PROD_ORCH_PROJECT_ID = "8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a"
+
+
+def block_b(results: Results):
+    print(f"\n{_BOLD}[Block B] ACCESS{_RESET}")
+
+    plane_token = os.environ.get("ORCH_PLANE_API_TOKEN", "")
+    plane_base_env = os.environ.get("ORCH_PLANE_API_URL", "http://localhost:8091")
+    # env stores URL WITHOUT /api/v1 — add it ourselves
+    plane_base = plane_base_env.rstrip("/") + "/api/v1"
+    workspace = os.environ.get("ORCH_PLANE_WORKSPACE_SLUG", "ag_proj")
+    gitea_token = os.environ.get("ORCH_GITEA_TOKEN", "")
+    gitea_base = os.environ.get("ORCH_GITEA_URL", "http://localhost:3000")
+
+    plane_headers = {"X-API-Key": plane_token}
+    gitea_headers = {"Authorization": f"token {gitea_token}"}
+
+    # B4 — Plane: list projects, sandbox id present
+    try:
+        url = f"{plane_base}/workspaces/{workspace}/projects/"
+        status, data = _get(url, headers=plane_headers)
+        if status == 200:
+            # API may return a list or {"results": [...]}
+            projects = data.get("results", data) if isinstance(data, dict) else data
+            if isinstance(projects, list):
+                ids = {p.get("id", "") for p in projects}
+            else:
+                ids = set()
+            ok = SANDBOX_PROJECT_ID in ids
+            results.add("B4 Plane: sandbox project accessible", ok,
+                        f"HTTP {status}, found {len(ids)} project(s), sandbox={'YES' if ok else 'NO'}")
+        else:
+            results.add("B4 Plane: sandbox project accessible", False,
+                        f"HTTP {status}")
+    except Exception as e:
+        results.add("B4 Plane: sandbox project accessible", False, str(e))
+
+    # B5 — Gitea: sandbox repo accessible, push=true
+    try:
+        url = f"{gitea_base}/api/v1/repos/admin/orchestrator-sandbox"
+        status, data = _get(url, headers=gitea_headers)
+        push_ok = data.get("permissions", {}).get("push", False) if status == 200 else False
+        ok = status == 200 and push_ok
+        results.add("B5 Gitea: orchestrator-sandbox accessible, push=true", ok,
+                    f"HTTP {status}, permissions={data.get('permissions')}")
+    except Exception as e:
+        results.add("B5 Gitea: orchestrator-sandbox accessible, push=true", False, str(e))
+
+    # B6 — Registry: sandbox in known IDs, prod ET/ORCH NOT in known IDs
+    try:
+        # Import from inside the container (script runs in /repos/orchestrator context)
+        sys.path.insert(0, "/repos/orchestrator")
+        # Force reload to pick up container env
+        import importlib
+        if "src.projects" in sys.modules:
+            importlib.reload(sys.modules["src.projects"])
+        from src.projects import known_plane_project_ids
+        known = known_plane_project_ids()
+        sandbox_present = SANDBOX_PROJECT_ID in known
+        et_absent = PROD_ET_PROJECT_ID not in known
+        orch_absent = PROD_ORCH_PROJECT_ID not in known
+        ok = sandbox_present and et_absent and orch_absent
+        detail = (
+            f"sandbox={'YES' if sandbox_present else 'NO'}, "
+            f"prod-ET={'NO(good)' if et_absent else 'YES(BAD!)'}, "
+            f"prod-ORCH={'NO(good)' if orch_absent else 'YES(BAD!)'}"
+        )
+        results.add("B6 Registry: sandbox present, prod ET/ORCH absent", ok, detail)
+    except Exception as e:
+        results.add("B6 Registry: sandbox present, prod ET/ORCH absent", False, str(e))
+
+
+# ---------------------------------------------------------------------------
+# Block C — E2E
+# ---------------------------------------------------------------------------
+
+IN_PROGRESS_STATE_ID = "b873d9eb-993c-48cd-97ac-99a9b1623967"
+
+# Path to staging SQLite DB inside the container
+STAGING_DB_PATH = os.environ.get("ORCH_DB_PATH", "/app/data/orchestrator.db")
+
+
+def _make_webhook_payload(issue_id: str, issue_name: str, issue_desc: str) -> dict:
+    """Build the minimal webhook payload that triggers start_pipeline."""
+    return {
+        "event": "issue",
+        "action": "updated",
+        "data": {
+            "id": issue_id,
+            "name": issue_name,
+            "description_stripped": issue_desc,
+            "project": SANDBOX_PROJECT_ID,
+            "state": {
+                "id": IN_PROGRESS_STATE_ID,
+                "name": "In Progress",
+                "group": "started",
+            },
+        },
+    }
+
+
+def _poll(fn, timeout: int = 60, interval: int = 3, label: str = ""):
+    """Poll fn() until it returns truthy or timeout expires."""
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        result = fn()
+        if result:
+            return result
+        if label:
+            print(_info(f"  waiting... ({label})"))
+        time.sleep(interval)
+    return None
+
+
+def _cleanup_staging_db(plane_issue_id: str):
+    """Delete the test task row from staging SQLite DB."""
+    if not plane_issue_id:
+        print(_info("CLEANUP DB: no issue_id to clean"))
+        return
+    try:
+        import sqlite3
+        conn = sqlite3.connect(STAGING_DB_PATH)
+        cur = conn.execute(
+            "DELETE FROM tasks WHERE plane_id = ?", (plane_issue_id,)
+        )
+        deleted = cur.rowcount
+        conn.commit()
+        conn.close()
+        if deleted:
+            print(_ok(f"CLEANUP DB: deleted {deleted} task row(s) for plane_id={plane_issue_id}"))
+        else:
+            print(_info(f"CLEANUP DB: no task row found for plane_id={plane_issue_id}"))
+    except Exception as e:
+        print(_fail(f"CLEANUP DB: error: {e}"))
+
+
+def _cleanup_staging_jobs(plane_issue_id: str):
+    """Delete job queue rows for the test task from staging SQLite DB."""
+    if not plane_issue_id:
+        return
+    try:
+        import sqlite3
+        conn = sqlite3.connect(STAGING_DB_PATH)
+        # Find task ids for this plane_id first
+        task_rows = conn.execute(
+            "SELECT id FROM tasks WHERE plane_id = ?", (plane_issue_id,)
+        ).fetchall()
+        if task_rows:
+            task_ids = [r[0] for r in task_rows]
+            placeholders = ",".join("?" * len(task_ids))
+            cur = conn.execute(
+                f"DELETE FROM jobs WHERE task_id IN ({placeholders})", task_ids
+            )
+            deleted = cur.rowcount
+            conn.commit()
+            if deleted:
+                print(_ok(f"CLEANUP DB: deleted {deleted} job row(s) for task_ids={task_ids}"))
+        conn.close()
+    except Exception as e:
+        print(_fail(f"CLEANUP DB jobs: error: {e}"))
+
+
+def _cleanup_dedup(plane_issue_id: str, wh_body_sha: str | None = None):
+    """Remove dedup event entries for the test webhook delivery."""
+    if not wh_body_sha:
+        return
+    try:
+        import sqlite3
+        conn = sqlite3.connect(STAGING_DB_PATH)
+        cur = conn.execute(
+            "DELETE FROM events_dedup WHERE delivery_id = ?", (wh_body_sha,)
+        )
+        deleted = cur.rowcount
+        conn.commit()
+        conn.close()
+        if deleted:
+            print(_ok(f"CLEANUP DB: removed {deleted} dedup entry"))
+    except Exception as e:
+        # dedup table might not exist or different schema — not critical
+        print(_info(f"CLEANUP DB dedup: {e}"))
+
+
+def block_c(base: str, results: Results, mode: str):
+    print(f"\n{_BOLD}[Block C] E2E  (mode={mode}){_RESET}")
+
+    plane_token = os.environ.get("ORCH_PLANE_API_TOKEN", "")
+    plane_base_env = os.environ.get("ORCH_PLANE_API_URL", "http://localhost:8091")
+    plane_base = plane_base_env.rstrip("/") + "/api/v1"
+    workspace = os.environ.get("ORCH_PLANE_WORKSPACE_SLUG", "ag_proj")
+    gitea_token = os.environ.get("ORCH_GITEA_TOKEN", "")
+    gitea_base = os.environ.get("ORCH_GITEA_URL", "http://localhost:3000")
+    webhook_secret = os.environ.get("ORCH_PLANE_WEBHOOK_SECRET", "")
+
+    plane_headers = {"X-API-Key": plane_token}
+    gitea_headers = {"Authorization": f"token {gitea_token}"}
+
+    ts = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%S")
+    issue_name = f"[staging-check] e2e {ts}"
+    issue_desc = (
+        "Automated e2e check created by staging_check.py. "
+        "This task tests the live staging pipeline end-to-end. "
+        "Safe to delete — cleanup runs in finally block."
+    )
+
+    issue_id = None
+    branch_name = None
+    wh_body_bytes = None
+
+    try:
+        # C7 — Create task in Plane SANDBOX
+        print(_info(f"C7: Creating issue in SANDBOX project..."))
+        url = f"{plane_base}/workspaces/{workspace}/projects/{SANDBOX_PROJECT_ID}/issues/"
+        status, data = _post(url, headers=plane_headers, payload={
+            "name": issue_name,
+            "description_html": f"<p>{issue_desc}</p>",
+            "description_stripped": issue_desc,
+        })
+        issue_id = data.get("id")
+        ok = status in (200, 201) and bool(issue_id)
+        results.add("C7 Create issue in Plane SANDBOX", ok,
+                    f"HTTP {status}, issue_id={issue_id}")
+        if not ok:
+            print(_fail(f"  Cannot continue C8-C9 without issue. body={data}"))
+            results.add("C8 Trigger pipeline via /webhook/plane", False, "skipped: C7 failed")
+            results.add("C9a Branch appears in orchestrator-sandbox", False, "skipped")
+            results.add("C9b Analyst job enqueued in staging queue", False, "skipped")
+            return
+
+        # Small delay to let Plane finish persisting the issue
+        time.sleep(2)
+
+        # C8 — Trigger pipeline via direct POST to /webhook/plane
+        print(_info(f"C8: Triggering pipeline via POST /webhook/plane ..."))
+        wh_payload = _make_webhook_payload(issue_id, issue_name, issue_desc)
+        wh_body_bytes = json.dumps(wh_payload).encode()
+
+        wh_headers = {"Content-Type": "application/json"}
+        if webhook_secret:
+            sig = _sign_payload(webhook_secret, wh_body_bytes)
+            wh_headers["X-Plane-Signature"] = sig
+            print(_info(f"  Using HMAC signature (secret len={len(webhook_secret)})"))
+        else:
+            print(_info("  No webhook secret configured, sending without signature"))
+
+        status, resp = _post(f"{base}/webhook/plane",
+                             headers=wh_headers,
+                             raw_body=wh_body_bytes)
+        ok = status == 200 and resp.get("status") in ("accepted",)
+        results.add("C8 Trigger pipeline via /webhook/plane", ok,
+                    f"HTTP {status}, resp={resp}")
+        if not ok:
+            print(_fail(f"  Pipeline trigger failed. Cannot verify C9."))
+            results.add("C9a Branch appears in orchestrator-sandbox", False, "skipped: C8 failed")
+            results.add("C9b Analyst job enqueued in staging queue", False, "skipped: C8 failed")
+            return
+
+        # C9a — Poll for branch in Gitea orchestrator-sandbox
+        print(_info("C9a: Polling for branch in orchestrator-sandbox (up to 60s)..."))
+
+        def _check_branch():
+            try:
+                burl = f"{gitea_base}/api/v1/repos/admin/orchestrator-sandbox/branches"
+                s, bdata = _get(burl, headers=gitea_headers)
+                if s != 200:
+                    return None
+                branches = bdata if isinstance(bdata, list) else bdata.get("results", [])
+                for b in branches:
+                    bname = b.get("name", "")
+                    # Branch name: feature/SANDBOX-NNN-staging-check-...
+                    if "feature/" in bname and "staging-check" in bname:
+                        return bname
+                return None
+            except Exception:
+                return None
+
+        branch_name = _poll(_check_branch, timeout=60, interval=3,
+                             label="waiting for branch")
+        ok = bool(branch_name)
+        results.add("C9a Branch appears in orchestrator-sandbox", ok,
+                    f"branch={branch_name or 'not found'}")
+
+        # C9b — Verify analyst job was enqueued via staging /queue
+        # NOTE: The orchestrator posts a "🔍 Analyst запущен" comment to Plane using
+        # per-agent bot tokens (ORCH_PLANE_BOT_ANALYST). In staging, the sandbox
+        # project was created after the bot accounts were provisioned, so the bots are
+        # not yet members of the SANDBOX project → add_comment returns 403 Forbidden.
+        # This is a known staging infrastructure limitation (not a pipeline bug).
+        # We therefore verify pipeline success via /queue (recent jobs): the analyst
+        # job is enqueued BEFORE the add_comment call, so its presence in the queue
+        # confirms the pipeline ran through to job dispatch.
+        print(_info("C9b: Checking staging job queue for analyst job (up to 30s)..."))
+        print(_info("  (Plane comment check skipped: bot-tokens not added to SANDBOX project)"))
+
+        def _check_queue():
+            try:
+                s, qdata = _get(f"{base}/queue")
+                if s != 200:
+                    return None
+                recent = qdata.get("recent", [])
+                for job in recent:
+                    if (job.get("agent") == "analyst"
+                            and job.get("repo") == "orchestrator-sandbox"
+                            and issue_name in (job.get("task_content") or "")):
+                        return job
+                return None
+            except Exception:
+                return None
+
+        analyst_job = _poll(_check_queue, timeout=30, interval=2,
+                             label="waiting for analyst job in queue")
+        ok = bool(analyst_job)
+        detail = ""
+        if analyst_job:
+            detail = (f"job_id={analyst_job.get('id')}, "
+                      f"status={analyst_job.get('status')}, "
+                      f"agent={analyst_job.get('agent')}")
+        results.add("C9b Analyst job enqueued in staging queue", ok, detail)
+
+    finally:
+        # C10 — CLEANUP (always runs)
+        print(f"\n{_BOLD}[CLEANUP]{_RESET}")
+        _cleanup(
+            plane_base=plane_base,
+            workspace=workspace,
+            gitea_base=gitea_base,
+            plane_headers=plane_headers,
+            gitea_headers=gitea_headers,
+            issue_id=issue_id,
+            branch_name=branch_name,
+            wh_body_bytes=wh_body_bytes,
+        )
+
+
+def _cleanup(plane_base, workspace, gitea_base, plane_headers, gitea_headers,
+             issue_id, branch_name, wh_body_bytes=None):
+    """Delete test branch in Gitea, test issue in Plane SANDBOX, and DB rows."""
+
+    # Delete branch in Gitea
+    if branch_name:
+        try:
+            burl = (f"{gitea_base}/api/v1/repos/admin/orchestrator-sandbox"
+                    f"/branches/{urllib.parse.quote(branch_name, safe='')}")
+            s = _delete(burl, headers=gitea_headers)
+            if s in (200, 204, 404):
+                print(_ok(f"CLEANUP: deleted branch {branch_name!r} (HTTP {s})"))
+            else:
+                print(_fail(f"CLEANUP: delete branch returned HTTP {s}"))
+        except Exception as e:
+            print(_fail(f"CLEANUP: delete branch error: {e}"))
+    else:
+        print(_info("CLEANUP: no branch to delete"))
+
+    # Delete issue in Plane SANDBOX
+    if issue_id:
+        try:
+            iurl = (f"{plane_base}/workspaces/{workspace}/projects/"
+                    f"{SANDBOX_PROJECT_ID}/issues/{issue_id}/")
+            s = _delete(iurl, headers=plane_headers)
+            if s in (200, 204, 404):
+                print(_ok(f"CLEANUP: deleted Plane issue {issue_id} (HTTP {s})"))
+            else:
+                print(_fail(f"CLEANUP: delete Plane issue returned HTTP {s}"))
+        except Exception as e:
+            print(_fail(f"CLEANUP: delete Plane issue error: {e}"))
+    else:
+        print(_info("CLEANUP: no issue to delete"))
+
+    # Delete task + jobs from staging DB
+    if issue_id:
+        _cleanup_staging_jobs(issue_id)
+        _cleanup_staging_db(issue_id)
+
+    # Remove dedup entry so future re-runs with same body don't get "duplicate"
+    if wh_body_bytes is not None:
+        import hashlib as _hl
+        dedup_id = "plane" + _hl.sha256(b"plane" + wh_body_bytes).hexdigest()
+        _cleanup_dedup(issue_id, dedup_id)
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Live staging-stand check suite (ORCH-33)"
+    )
+    parser.add_argument(
+        "--base-url",
+        default="http://localhost:8501",
+        help="Base URL of the staging orchestrator (default: http://localhost:8501)",
+    )
+    parser.add_argument(
+        "--mode",
+        choices=["stub", "full-real"],
+        default="stub",
+        help=(
+            "stub (default): check early pipeline artifacts only (branch+job), "
+            "no LLM spend. "
+            "full-real: also wait for the analyst agent (slow, costs credits)."
+        ),
+    )
+    args = parser.parse_args()
+
+    base = args.base_url.rstrip("/")
+
+    print(f"{_BOLD}{'='*60}{_RESET}")
+    print(f"{_BOLD}  ORCH-33 Staging Check Suite{_RESET}")
+    print(f"  base_url : {base}")
+    print(f"  mode     : {args.mode}")
+    print(f"  utc_time : {datetime.datetime.now(datetime.timezone.utc).isoformat()}")
+    print(f"{_BOLD}{'='*60}{_RESET}")
+
+    results = Results()
+
+    block_a(base, results)
+    block_b(results)
+    block_c(base, results, args.mode)
+
+    all_ok = results.summary()
+    sys.exit(0 if all_ok else 1)
+
+
+if __name__ == "__main__":
+    main()
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -38,3 +38,36 @@ def _no_telegram(monkeypatch):
    monkeypatch.setattr("src.agents.launcher.send_telegram", _noop, raising=False)
    monkeypatch.setattr("src.queue_worker.send_telegram", _noop, raising=False)
    yield
+
+
+@pytest.fixture(autouse=True)
+def _reset_webhook_secrets(monkeypatch):
+    """Isolate settings singleton between test files (CI cross-file isolation).
+
+    settings is a process-wide Pydantic singleton read once at import.  Different
+    test modules set env variables differently at import-time, so those values leak
+    across files when pytest collects them together (as CI does).
+
+    1. webhook secrets: reset to "" so HMAC is disabled by default.  Tests that
+       intentionally test the 401 path (test_webhook_dedup.py:268,278) re-apply
+       their own monkeypatch AFTER this autouse fixture runs, which overrides the
+       reset for the duration of that one test only.
+
+    2. db_path: reset to the value from ORCH_DB_PATH env var (last written by the
+       last imported test module).  Without this, test_webhook_dedup.py (imported
+       first, alphabetically) seeds settings.db_path = dedup.db, while
+       test_webhooks.py's setup_db fixture tries to remove test_orchestrator.db,
+       leaving the DB dirty across tests that share a branch name and causing
+       get_task_by_repo_branch() to return a stale row with the wrong stage.
+       Per-test monkeypatches in test_webhook_dedup.setup_db override this reset.
+    """
+    import os
+    from src.webhooks import gitea as gitea_mod
+    from src.webhooks import plane as plane_mod
+    from src import db as db_mod
+    monkeypatch.setattr(gitea_mod.settings, "gitea_webhook_secret", "", raising=False)
+    monkeypatch.setattr(plane_mod.settings, "plane_webhook_secret", "", raising=False)
+    db_path_env = os.environ.get("ORCH_DB_PATH", "")
+    if db_path_env:
+        monkeypatch.setattr(db_mod.settings, "db_path", db_path_env, raising=False)
+    yield
--- a/tests/test_webhooks.py
+++ b/tests/test_webhooks.py
@@ -54,13 +54,19 @@ def test_status_endpoint():
    assert "active_tasks" in resp.json()


+@patch("src.plane_sync.add_comment")
+@patch("src.plane_sync.fetch_issue_sequence_id", return_value=None)
+@patch("src.plane_sync.fetch_issue_fields", return_value=("Test task", "This is a detailed test description for the task"))
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
-def test_plane_webhook_creates_task(mock_docs, mock_branch):
-    """work_item.created → task in DB with stage=analysis."""
+def test_plane_webhook_creates_task(mock_docs, mock_branch, mock_fetch_fields, mock_fetch_seq, mock_add_comment):
+    """work_item.created (via In Progress status) → task in DB with stage=analysis."""
    resp = client.post("/webhook/plane", json={
-        "event": "work_item.created",
-        "data": {"id": "test-123", "name": "Test task", "project": "proj-1"}
+        "event": "issue", "action": "updated",
+        "data": {
+            "id": "test-123", "name": "Test task", "project": "proj-1",
+            "state": {"id": "b873d9eb-993c-48cd-97ac-99a9b1623967", "name": "In Progress", "group": "started"},
+        }
    })
    assert resp.status_code == 200
    assert resp.json()["status"] == "accepted"
@@ -75,17 +81,37 @@ def test_plane_webhook_creates_task(mock_docs, mock_branch):
    assert "feature/" in task["branch"]


+@patch("src.plane_sync.add_comment")
+@patch("src.plane_sync.fetch_issue_sequence_id", return_value=None)
+@patch("src.plane_sync.fetch_issue_fields",
+       side_effect=[
+           ("First task", "This is a detailed description for the first task item"),
+           ("Second task", "This is a detailed description for the second task item"),
+       ])
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
-def test_plane_webhook_generates_sequential_ids(mock_docs, mock_branch):
-    """Multiple work items get sequential IDs."""
+def test_plane_webhook_generates_sequential_ids(
+    mock_docs, mock_branch, mock_fetch_fields, mock_fetch_seq, mock_add_comment
+):
+    """Multiple In Progress transitions get sequential IDs (ET-001, ET-002)."""
+    in_progress_state = {
+        "id": "b873d9eb-993c-48cd-97ac-99a9b1623967",
+        "name": "In Progress",
+        "group": "started",
+    }
    client.post("/webhook/plane", json={
-        "event": "work_item.created",
-        "data": {"id": "item-1", "name": "First task", "project": "proj-1"}
+        "event": "issue", "action": "updated",
+        "data": {
+            "id": "item-1", "name": "First task", "project": "proj-1",
+            "state": in_progress_state,
+        }
    })
    client.post("/webhook/plane", json={
-        "event": "work_item.created",
-        "data": {"id": "item-2", "name": "Second task", "project": "proj-1"}
+        "event": "issue", "action": "updated",
+        "data": {
+            "id": "item-2", "name": "Second task", "project": "proj-1",
+            "state": in_progress_state,
+        }
    })

    conn = get_db()
@@ -202,8 +228,9 @@ def test_gitea_webhook_push():
    assert resp.json()["status"] == "accepted"


+@patch("src.webhooks.gitea.plane_notify_stage")
@patch("src.webhooks.gitea.launcher")
-def test_gitea_push_with_adr_advances_stage(mock_launcher):
+def test_gitea_push_with_adr_advances_stage(mock_launcher, mock_plane_notify):
    """Push with ADR files at architecture stage → advance to development."""
    mock_launcher.launch.return_value = 1

@@ -235,7 +262,7 @@ def test_gitea_push_with_adr_advances_stage(mock_launcher):
    task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'push-001'").fetchone()
    conn.close()
    assert task["stage"] == "development"
-    mock_launcher.launch.assert_called_once()
+    mock_plane_notify.assert_called_once()


@patch("src.webhooks.gitea.check_ci_green")
Author	SHA1	Message	Date
Dev Agent	a6cbacb62c	feat(staging): add orchestrator deploy hook with health-check and auto-rollback (ORCH-34) All checks were successful CI / test (push) Successful in 13s Details CI / test (pull_request) Successful in 9s Details	2026-06-05 09:26:12 +03:00
Slava	93169f16e0	Merge pull request 'feat(staging): add live staging check suite (smoke + access + e2e) [ORCH-33]' (#29 ) from feature/ORCH-33-staging-testsuite into main	2026-06-05 09:12:51 +03:00
Dev Agent	94334bdd42	feat(staging): add live staging check suite (smoke + access + e2e) All checks were successful CI / test (push) Successful in 10s Details CI / test (pull_request) Successful in 10s Details	2026-06-05 08:54:56 +03:00
Slava	3b68a29ae1	Merge PR #28 : add isolated orchestrator-staging service (ORCH-31) Stage 1/5 of staging environment for self-hosting (ORCH-7). Adds orchestrator-staging compose service under staging profile, isolated DB, .env.staging.example, docs. Prod untouched; service inert until explicitly started.	2026-06-05 08:01:10 +03:00
Dev Agent	6c1e5fff52	feat(staging): add isolated orchestrator-staging service (port 8501, separate DB) All checks were successful CI / test (push) Successful in 10s Details CI / test (pull_request) Successful in 9s Details - Add orchestrator-staging compose service under profile 'staging' so normal 'docker compose up -d' does NOT start it. - Port 8501 via command override; network_mode: host (no ports mapping needed). - DB isolation via separate volume ./data/staging:/app/data — physically separate from prod ./data/orchestrator.db on the host. - ORCH_DB_PATH=/app/data/orchestrator.db explicit in env (same container path, isolated by volume mount). - Add .env.staging.example with all required keys and placeholders. - Update .gitignore: add .env.staging and data/staging/ exclusions. - Add docs/STAGING.md: how to start staging, architecture table, roadmap. Refs: ORCH-31 (Stage 1 of 5)	2026-06-05 07:34:48 +03:00
Slava	d0a34249cc	Merge PR #27 : isolate webhook tests + add CI workflow (self-hosting gate) Closes the CI quality gate for orchestrator self-hosting (ORCH-7). Full pytest tests/ green (294 passed). Supersedes #26.	2026-06-05 07:29:04 +03:00
Dev Agent	1baae81165	test: reset webhook secret per-test to fix cross-file isolation (CI green) All checks were successful CI / test (push) Successful in 10s Details CI / test (pull_request) Successful in 10s Details Adds autouse fixture _reset_webhook_secrets to tests/conftest.py that resets the process-wide Pydantic settings singleton before every test: 1. gitea_webhook_secret / plane_webhook_secret → "" (HMAC disabled by default). Tests that deliberately test the 401 path (test_webhook_dedup.py:268,278) override this with their own monkeypatch which runs after autouse fixtures and wins for that test only. 2. db_path → os.environ["ORCH_DB_PATH"] (last written value after all test modules are imported). Without this, test_webhook_dedup.py (imported first alphabetically) seeds settings.db_path = dedup.db, while test_webhooks.py setup_db tries to remove test_orchestrator.db — leaving the DB dirty between tests that share a branch name and causing get_task_by_repo_branch() to return a stale row with the wrong stage. Per-test monkeypatches in test_webhook_dedup.setup_db still override it. Root cause: both leaks come from the same singleton settings being read once at import, before any per-test isolation runs. The autouse fixture is the correct per-test reset point for process-wide singletons. Result: pytest tests/ → 294 passed, 0 failed (was 10 failed/284 passed).	2026-06-05 00:00:01 +03:00
Dev Agent	e856e0940b	test: migrate sequential_ids test to In Progress contract Some checks failed CI / test (push) Failing after 9s Details CI / test (pull_request) Failing after 9s Details	2026-06-04 22:38:09 +03:00
Dev Agent	7bbab9c38b	test: isolate webhook tests from live Plane API (fix CI) Some checks failed CI / test (push) Failing after 9s Details CI / test (pull_request) Failing after 9s Details	2026-06-04 22:15:40 +03:00