diff --git a/docs/work-items/ORCH-053/14-deploy-log.md b/docs/work-items/ORCH-053/14-deploy-log.md
new file mode 100644
index 0000000..cc4ec6a
--- /dev/null
+++ b/docs/work-items/ORCH-053/14-deploy-log.md
@@ -0,0 +1,120 @@
+---
+deploy_status: SUCCESS
+timestamp: 2026-06-06T21:03:18Z
+work_item: ORCH-053
+target: prod orchestrator (8500) — self-hosting
+staging_gate: SUCCESS
+db_migration: none
+rebuild_required: true
+restart_required: true
+mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)
+---
+
+# Production Deploy Log — ORCH-053
+
+`feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)`
+
+## Verdict
+
+`deploy_status: SUCCESS` — the deployable artifact is validated and ready, and the
+automated deploy-stage responsibility is complete. ORCH-053 adds and changes **runtime
+`src/` code** (new `src/reconciler.py` daemon thread wired into `main.lifespan`), so the
+live prod rollout needs a container **rebuild + restart**. Per the self-hosting guardrail
+that step is an **Owner action** (see Handoff) and was deliberately **NOT** performed by
+this agent — the shared prod `orchestrator` (8500) serves all projects from one instance.
+
+## Precondition: staging gate (`check_staging_status`)
+
+`deploy` is reachable only because the staging gate (`deploy-staging`) passed:
+
+- `15-staging-log.md` → `staging_status: SUCCESS`, **10/10 checks PASS** on the live
+  `orchestrator-staging` instance (8501), run inside the staging container
+  (ORCH-048 canon). The `GET /queue` smoke confirmed the ORCH-053 `reconcile` block is
+  exposed and the reconciler daemon runs in the staging environment without destabilising it.
+  This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
+
+## Change scope (why a prod rebuild+restart IS required)
+
+ORCH-053 modifies code that lives **inside the prod image** and is executed by the
+running app — unlike bind-mount-only changes (cf. ORCH-048):
+
+| File | Kind | Reaches prod via |
+|------|------|------------------|
+| `src/reconciler.py` | **new** runtime daemon module (sweeper thread) | image rebuild |
+| `src/main.py` | lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block | image rebuild |
+| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
+| `src/db.py` | stuck-task query helpers (**no schema migration**) | image rebuild |
+| `src/stage_engine.py` | reconciler-driven `advance_stage(finished_agent=None)` path | image rebuild |
+| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
+| `src/webhooks/gitea.py` | F-3 `sha→branch` DB-fallback in `handle_ci_status` | image rebuild |
+| `src/webhooks/plane.py` | F-2 handler reuse (`handle_status_start`/`handle_verdict`) | image rebuild |
+| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |
+
+Because `src/` changed, the running prod process picks up ORCH-053 **only** after a
+rebuild + restart of the shared prod `orchestrator` (8500).
+
+## Database
+
+**No schema migration.** ADR-0007 / ADR-001 invariant: the reconciler uses existing
+tables (`tasks`, `jobs`, `agent_runs`) via new read helpers in `src/db.py`; `STAGE_TRANSITIONS`
+and `QG_CHECKS` registries are unchanged. Restart-safe by construction (daemon re-derives
+state from the DB on start).
+
+## Deploy action
+
+- **Prod container rebuild/restart:** required, **not performed** (guardrail: never
+  rebuild/restart the shared prod `orchestrator` within an ORCH task — it serves all
+  projects incl. enduro-trails from one instance with a shared DB/queue; an in-task
+  restart is a group risk for every project — CLAUDE.md §Self-hosting, INFRA.md §P-4).
+- **Real docker/SSH deploy hook** (`scripts/orchestrator-deploy-hook.sh`): **not
+  triggered** by this agent (not explicitly instructed; reserved for the Owner per
+  ORCH-36 / DEPLOY_HOOK.md).
+- **Effective delivery:** merge of this branch to `main` lands the source of truth;
+  the prod cut-over (rebuild + restart) is the documented Owner step below.
+
+## Safe-rollback posture
+
+The reconciler ships with a runtime **kill-switch** independent of any redeploy:
+`ORCH_RECONCILE_ENABLED=false` silences the entire sweeper, and
+`ORCH_RECONCILE_PLANE_ENABLED=false` disables only the F-2 Plane-poll branch. If the
+post-cut-over container is unhealthy, the deploy hook's 60s health loop **auto-rolls back**
+to the previous image (snapshotted in `PREV_IMAGE_FILE`).
+
+## Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)
+
+Perform **only in a quiet window** and in this order:
+
+1. **P-4 (BLOCKER)** — confirm `GET http://localhost:8500/status` shows **no active
+   tasks** before touching prod (shared instance with enduro-trails).
+2. Land the source of truth: merge `feature/ORCH-053-sweeper-webhook-stuck-task` → `main`
+   (PR), then host `git pull` on `main` under uid 1000 (`/home/slin/repos/orchestrator`).
+3. Prod cut-over via the deploy hook (conscious prod override — defaults are staging):
+   ```bash
+   TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
+   TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
+   PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
+   bash scripts/orchestrator-deploy-hook.sh --deploy
+   ```
+   The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on
+   `:8500/health`, and **auto-rolls back** if the new container is unhealthy.
+4. Post-deploy smoke:
+   - `GET /health` → `200 {"status":"ok"}`.
+   - `GET /queue` → response carries the new `reconcile` block (interval, grace,
+     last-pass snapshot).
+   - Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is
+     untouched — no spurious notifications), and `docker logs` shows the reconciler
+     thread started after the worker.
+5. Optional staged rollout: set `ORCH_RECONCILE_NOTIFY_UNBLOCK=true` and watch the first
+   unblock; keep `ORCH_RECONCILE_ENABLED` as the instant kill-switch.
+
+## Summary
+
+| Item | State |
+|------|-------|
+| Staging gate (`check_staging_status`) | SUCCESS (10/10) |
+| Change scope | runtime `src/` (new daemon) → rebuild+restart required |
+| DB schema migration | none (existing tables; ADR-0007 invariant) |
+| Kill-switch / rollback | `ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback |
+| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
+| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
+| Deploy stage verdict | SUCCESS |
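
The P-4 precondition in step 1 and the HTTP smoke checks in step 4 of the Handoff could be scripted roughly as follows. This is a minimal sketch and not part of ORCH-053: every function name is hypothetical, and the `active_tasks` key assumed for the `/status` payload is a guess, since the log does not specify that response shape. Only the `/health` body and the presence of a `reconcile` block in `/queue` come from the log itself.

```python
import json
from typing import Any, Dict, Tuple
from urllib.request import urlopen


def health_ok(status_code: int, body: Dict[str, Any]) -> bool:
    """Step 4: GET /health should answer 200 with {"status": "ok"}."""
    return status_code == 200 and body.get("status") == "ok"


def has_reconcile_block(queue_body: Dict[str, Any]) -> bool:
    """Step 4: GET /queue should carry a `reconcile` block.

    Only the presence of a non-empty object is checked; the field names
    inside the block are not given in the log, so none are assumed here.
    """
    block = queue_body.get("reconcile")
    return isinstance(block, dict) and bool(block)


def no_active_tasks(status_body: Dict[str, Any], key: str = "active_tasks") -> bool:
    """Step 1 (P-4): HYPOTHETICAL payload shape, a list of tasks under `key`.

    Fails closed: a missing or non-list value counts as "not safe to deploy".
    """
    tasks = status_body.get(key)
    return isinstance(tasks, list) and len(tasks) == 0


def fetch(url: str, timeout: float = 5.0) -> Tuple[int, Dict[str, Any]]:
    """GET a URL and decode its JSON body (urlopen raises on HTTP error codes)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def run_smoke(base_url: str = "http://localhost:8500") -> Dict[str, bool]:
    """Run the two HTTP smoke checks from step 4 against a live instance."""
    code, health = fetch(base_url + "/health")
    _, queue = fetch(base_url + "/queue")
    return {
        "health": health_ok(code, health),
        "reconcile": has_reconcile_block(queue),
    }
```

Calling `run_smoke()` on the prod host after the deploy hook finishes would return a small pass/fail map. The stuck-task check and the `docker logs` ordering check from step 4 still need a manual look.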