---
deploy_status: SUCCESS
timestamp: 2026-06-06T21:03:18Z
work_item: ORCH-053
target: prod orchestrator (8500) - self-hosting
staging_gate: SUCCESS
db_migration: none
rebuild_required: true
restart_required: true
mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)
---

# Production Deploy Log: ORCH-053

`feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)`

## Verdict

`deploy_status: SUCCESS`. The deployable artifact is validated and ready, and the automated deploy-stage responsibility is complete.

ORCH-053 adds and changes **runtime `src/` code** (new `src/reconciler.py` daemon thread wired into `main.lifespan`), so the live prod rollout needs a container **rebuild + restart**. Per the self-hosting guardrail that step is an **Owner action** (see Handoff) and was deliberately **NOT** performed by this agent, because the shared prod `orchestrator` (8500) serves all projects from one instance.

## Precondition: staging gate (`check_staging_status`)

`deploy` is reachable only because the staging gate (`deploy-staging`) passed:

- `15-staging-log.md` → `staging_status: SUCCESS`, **10/10 checks PASS** on the live `orchestrator-staging` instance (8501), run inside the staging container (ORCH-048 canon). The `GET /queue` smoke confirmed the ORCH-053 `reconcile` block is exposed and the reconciler daemon runs in the staging stand without destabilising it. This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).

## Change scope (why a prod rebuild+restart IS required)

ORCH-053 modifies code that lives **inside the prod image** and is executed by the running app, unlike bind-mount-only changes (cf.
ORCH-048):

| File | Kind | Reaches prod via |
|------|------|------------------|
| `src/reconciler.py` | **new** runtime daemon module (sweeper thread) | image rebuild |
| `src/main.py` | lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block | image rebuild |
| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
| `src/db.py` | stuck-task query helpers (**no schema migration**) | image rebuild |
| `src/stage_engine.py` | reconciler-driven `advance_stage(finished_agent=None)` path | image rebuild |
| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
| `src/webhooks/gitea.py` | F-3 `sha→branch` DB-fallback in `handle_ci_status` | image rebuild |
| `src/webhooks/plane.py` | F-2 handler reuse (`handle_status_start`/`handle_verdict`) | image rebuild |
| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |

Because `src/` changed, the running prod process picks up ORCH-053 **only** after a rebuild + restart of the shared prod `orchestrator` (8500).

## Database

**No schema migration.** ADR-0007 / ADR-001 invariant: the reconciler uses existing tables (`tasks`, `jobs`, `agent_runs`) via new read helpers in `src/db.py`; `STAGE_TRANSITIONS` and `QG_CHECKS` registries are unchanged. Restart-safe by construction (daemon re-derives state from the DB on start).

## Deploy action

- **Prod container rebuild/restart:** required, **not performed** (guardrail: never rebuild/restart the shared prod `orchestrator` within an ORCH task: it serves all projects incl. enduro-trails from one instance with a shared DB/queue; an in-task restart is a group risk for every project; see CLAUDE.md §Self-hosting, INFRA.md §P-4).
- **Real docker/SSH deploy hook** (`scripts/orchestrator-deploy-hook.sh`): **not triggered** by this agent (not explicitly instructed; reserved for the Owner per ORCH-36 / DEPLOY_HOOK.md).
- **Effective delivery:** merge of this branch to `main` lands the source of truth; the prod cut-over (rebuild + restart) is the documented Owner step below.

## Safe-rollback posture

The reconciler ships with a runtime **kill-switch** independent of any redeploy: `ORCH_RECONCILE_ENABLED=false` silences the entire sweeper, and `ORCH_RECONCILE_PLANE_ENABLED=false` disables only the F-2 Plane-poll branch. If the post-cut-over container is unhealthy, the deploy hook's 60s health loop **auto-rolls back** to the previous image (snapshotted in `PREV_IMAGE_FILE`).

## Handoff: Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)

Perform **only in a quiet window** and in this order:

1. **P-4 (BLOCKER):** confirm `GET http://localhost:8500/status` shows **no active tasks** before touching prod (shared instance with enduro-trails).
2. Land the source of truth: merge `feature/ORCH-053-sweeper-webhook-stuck-task` → `main` (PR), then host `git pull` on `main` under uid 1000 (`/home/slin/repos/orchestrator`).
3. Prod cut-over via the deploy hook (conscious prod override; defaults are staging):

   ```bash
   TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
   TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
   PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
   bash scripts/orchestrator-deploy-hook.sh --deploy
   ```

   The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on `:8500/health`, and **auto-rolls back** if the new container is unhealthy.
4. Post-deploy smoke:
   - `GET /health` → `200 {"status":"ok"}`.
   - `GET /queue` → response carries the new `reconcile` block (interval, grace, last-pass snapshot).
   - Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is untouched, with no spurious notifications), and `docker logs` shows the reconciler thread started after the worker.
5.
   Optional staged rollout: set `ORCH_RECONCILE_NOTIFY_UNBLOCK=true` and watch the first unblock; keep `ORCH_RECONCILE_ENABLED` as the instant kill-switch.

## Summary

| Item | State |
|------|-------|
| Staging gate (`check_staging_status`) | SUCCESS (10/10) |
| Change scope | runtime `src/` (new daemon) → rebuild+restart required |
| DB schema migration | none (existing tables; ADR-0007 invariant) |
| Kill-switch / rollback | `ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback |
| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
| Deploy stage verdict | SUCCESS |
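
The first two post-deploy smoke checks in Handoff step 4 could be scripted roughly as below. This is a minimal sketch, not part of the repository: the function names (`health_ok`, `has_reconcile_block`, `fetch_json`, `run_smoke`) are hypothetical, and it assumes the `reconcile` block appears in the `/queue` response as a JSON object under a top-level `reconcile` key, which this log does not spell out.

```python
import json
from urllib.request import urlopen

# Prod orchestrator port as stated in this log.
BASE_URL = "http://localhost:8500"


def health_ok(status_code: int, body: dict) -> bool:
    """True when /health answered 200 with {"status": "ok"}."""
    return status_code == 200 and body.get("status") == "ok"


def has_reconcile_block(queue_body: dict) -> bool:
    """True when the /queue payload has a `reconcile` object.

    Assumption: the block is a JSON object under the key "reconcile".
    """
    return isinstance(queue_body.get("reconcile"), dict)


def fetch_json(path: str, timeout: float = 5.0):
    """GET BASE_URL + path and return (status_code, parsed_json).

    urlopen raises urllib.error.HTTPError on non-2xx responses,
    so a failing endpoint surfaces as an exception here.
    """
    with urlopen(BASE_URL + path, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def run_smoke() -> bool:
    """Run both checks against the live instance (Owner, after cut-over)."""
    code, health = fetch_json("/health")
    _, queue = fetch_json("/queue")
    return health_ok(code, health) and has_reconcile_block(queue)
```

`run_smoke()` only covers the two HTTP checks; confirming that a stuck task is actually unblocked, and that `docker logs` shows the reconciler thread starting after the worker, remains a manual step.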