deployer(ET): auto-commit from deployer run_id=204

2026-06-06 21:04:39 +00:00
parent 5089f99bb1
commit c1196e34e8
1 changed file with 120 additions and 0 deletions
--- a/docs/work-items/ORCH-053/14-deploy-log.md
+++ b/docs/work-items/ORCH-053/14-deploy-log.md
@@ -0,0 +1,120 @@
+---
+deploy_status: SUCCESS
+timestamp: 2026-06-06T21:03:18Z
+work_item: ORCH-053
+target: prod orchestrator (8500) — self-hosting
+staging_gate: SUCCESS
+db_migration: none
+rebuild_required: true
+restart_required: true
+mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)
+---
+
+# Production Deploy Log — ORCH-053
+
+`feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)`
+
+## Verdict
+
+`deploy_status: SUCCESS` — the deployable artifact is validated and ready, and the
+automated deploy-stage responsibility is complete. ORCH-053 adds and changes **runtime
+`src/` code** (new `src/reconciler.py` daemon thread wired into `main.lifespan`), so the
+live prod rollout needs a container **rebuild + restart**. Per the self-hosting guardrail
+that step is an **Owner action** (see Handoff) and was deliberately **NOT** performed by
+this agent — the shared prod `orchestrator` (8500) serves all projects from one instance.
+
+## Precondition: staging gate (`check_staging_status`)
+
+`deploy` is reachable only because the staging gate (`deploy-staging`) passed:
+
+- `15-staging-log.md` → `staging_status: SUCCESS`, **10/10 checks PASS** on the live
+  `orchestrator-staging` instance (8501), run inside the staging container
+  (ORCH-048 canon). The `GET /queue` smoke confirmed the ORCH-053 `reconcile` block is
+  exposed and the reconciler daemon runs in the staging stand without destabilising it.
+  This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
+
+## Change scope (why a prod rebuild+restart IS required)
+
+ORCH-053 modifies code that lives **inside the prod image** and is executed by the
+running app — unlike bind-mount-only changes (cf. ORCH-048):
+
+| File | Kind | Reaches prod via |
+|------|------|------------------|
+| `src/reconciler.py` | **new** runtime daemon module (sweeper thread) | image rebuild |
+| `src/main.py` | lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block | image rebuild |
+| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
+| `src/db.py` | stuck-task query helpers (**no schema migration**) | image rebuild |
+| `src/stage_engine.py` | reconciler-driven `advance_stage(finished_agent=None)` path | image rebuild |
+| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
+| `src/webhooks/gitea.py` | F-3 `sha→branch` DB-fallback in `handle_ci_status` | image rebuild |
+| `src/webhooks/plane.py` | F-2 handler reuse (`handle_status_start`/`handle_verdict`) | image rebuild |
+| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |
+
+Because `src/` changed, the running prod process picks up ORCH-053 **only** after a
+rebuild + restart of the shared prod `orchestrator` (8500).
+
+## Database
+
+**No schema migration.** ADR-0007 / ADR-001 invariant: the reconciler uses existing
+tables (`tasks`, `jobs`, `agent_runs`) via new read helpers in `src/db.py`; `STAGE_TRANSITIONS`
+and `QG_CHECKS` registries are unchanged. Restart-safe by construction (daemon re-derives
+state from the DB on start).
+
+## Deploy action
+
+- **Prod container rebuild/restart:** required, **not performed**. Guardrail: never
+  rebuild/restart the shared prod `orchestrator` within an ORCH task. It serves all
+  projects, incl. enduro-trails, from one instance with a shared DB/queue, so an in-task
+  restart puts every project at risk (CLAUDE.md §Self-hosting, INFRA.md §P-4).
+- **Real docker/SSH deploy hook** (`scripts/orchestrator-deploy-hook.sh`): **not
+  triggered** by this agent (not explicitly instructed; reserved for the Owner per
+  ORCH-36 / DEPLOY_HOOK.md).
+- **Effective delivery:** merge of this branch to `main` lands the source of truth;
+  the prod cut-over (rebuild + restart) is the documented Owner step below.
+
+## Safe-rollback posture
+
+The reconciler ships with a runtime **kill-switch** independent of any redeploy:
+`ORCH_RECONCILE_ENABLED=false` silences the entire sweeper, and
+`ORCH_RECONCILE_PLANE_ENABLED=false` disables only the F-2 Plane-poll branch. If the
+post-cut-over container is unhealthy, the deploy hook's 60s health loop **auto-rolls back**
+to the previous image (snapshotted in `PREV_IMAGE_FILE`).
+
+## Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)
+
+Perform **only in a quiet window** and in this order:
+
+1. **P-4 (BLOCKER)** — confirm `GET http://localhost:8500/status` shows **no active
+   tasks** before touching prod (shared instance with enduro-trails).
+2. Land the source of truth: merge `feature/ORCH-053-sweeper-webhook-stuck-task` → `main`
+   (PR), then host `git pull` on `main` under uid 1000 (`/home/slin/repos/orchestrator`).
+3. Prod cut-over via the deploy hook (deliberate prod override; the hook defaults to staging):
+   ```bash
+   TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
+   TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
+   PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
+   bash scripts/orchestrator-deploy-hook.sh --deploy
+   ```
+   The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on
+   `:8500/health`, and **auto-rolls back** if the new container is unhealthy.
+4. Post-deploy smoke:
+   - `GET /health` → `200 {"status":"ok"}`.
+   - `GET /queue` → response carries the new `reconcile` block (interval, grace,
+     last-pass snapshot).
+   - Confirm that the sweeper unblocks a stuck task (or that a synchronous task is
+     left untouched, with no spurious notifications), and that `docker logs` shows
+     the reconciler thread started after the worker.
+5. Optional staged rollout: set `ORCH_RECONCILE_NOTIFY_UNBLOCK=true` and watch the first
+   unblock; keep `ORCH_RECONCILE_ENABLED` as the instant kill-switch.
+
+## Summary
+
+| Item | State |
+|------|-------|
+| Staging gate (`check_staging_status`) | SUCCESS (10/10) |
+| Change scope | runtime `src/` (new daemon) → rebuild+restart required |
+| DB schema migration | none (existing tables; ADR-0007 invariant) |
+| Kill-switch / rollback | `ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback |
+| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
+| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
+| Deploy stage verdict | SUCCESS |
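
The first two post-deploy smoke checks in the Handoff section (`GET /health` → `200 {"status":"ok"}` and a `reconcile` block in `GET /queue`) could be scripted roughly as below. This is a minimal sketch, not part of the commit: the function names `evaluate_smoke`, `fetch_json` and `run_smoke` are hypothetical, and it assumes both endpoints return a JSON object. Only the presence of the `reconcile` key is checked, because the log does not name the fields inside that block.

```python
import json
from urllib.request import urlopen

# Prod port taken from the deploy log; change it for staging (8501).
BASE_URL = "http://localhost:8500"


def evaluate_smoke(health_status, health_body, queue_body):
    """Return a list of failure messages; an empty list means the smoke passed."""
    failures = []
    if health_status != 200:
        failures.append(f"/health returned HTTP {health_status}, expected 200")
    if not isinstance(health_body, dict) or health_body.get("status") != "ok":
        failures.append(f"/health body is {health_body!r}, expected status 'ok'")
    if not isinstance(queue_body, dict) or "reconcile" not in queue_body:
        failures.append("/queue response has no 'reconcile' block")
    return failures


def fetch_json(path, timeout=5.0):
    """GET BASE_URL + path and return (HTTP status, decoded JSON body).

    urlopen raises urllib.error.HTTPError for non-2xx responses, so a failing
    endpoint surfaces as an exception rather than a status code.
    """
    with urlopen(BASE_URL + path, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def run_smoke():
    """Fetch both endpoints from the live instance and evaluate them."""
    health_status, health_body = fetch_json("/health")
    _, queue_body = fetch_json("/queue")
    return evaluate_smoke(health_status, health_body, queue_body)
```

Calling `run_smoke()` on the host after the cut-over returns an empty list when both checks pass. The third smoke item (a stuck task being unblocked, and the reconciler thread appearing in `docker logs`) still needs a manual look.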