orchestrator/docs/work-items/ORCH-053/14-deploy-log.md
2026-06-06 21:04:04 +00:00


deploy_status: SUCCESS
timestamp: 2026-06-06T21:03:18Z
work_item: ORCH-053
target: prod orchestrator (8500), self-hosting
staging_gate: SUCCESS
db_migration: none
rebuild_required: true
restart_required: true
mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)

Production Deploy Log — ORCH-053

feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)

Verdict

deploy_status: SUCCESS — the deployable artifact is validated and ready, and the automated deploy-stage responsibility is complete. ORCH-053 adds and changes runtime src/ code (new src/reconciler.py daemon thread wired into main.lifespan), so the live prod rollout needs a container rebuild + restart. Per the self-hosting guardrail that step is an Owner action (see Handoff) and was deliberately NOT performed by this agent — the shared prod orchestrator (8500) serves all projects from one instance.

Precondition: staging gate (check_staging_status)

The deploy stage is reachable only because the staging gate (deploy-staging) passed:

  • 15-staging-log.md → staging_status: SUCCESS, 10/10 checks PASS on the live orchestrator-staging instance (8501), run inside the staging container (ORCH-048 canon). The GET /queue smoke confirmed the ORCH-053 reconcile block is exposed and the reconciler daemon runs in the staging stand without destabilising it. This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
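
The gate above can be pictured as a small fail-closed check over the staging log. This is a minimal sketch, assuming the log carries a plain `staging_status: <VALUE>` line; the function name `staging_gate_passed` and the parsing are hypothetical and may differ from the real check_staging_status.

```python
import re

# Assumption: the staging log contains a line of the form "staging_status: SUCCESS".
_STATUS_RE = re.compile(r"^\s*staging_status:\s*(\w+)", re.MULTILINE)


def staging_gate_passed(staging_log_text: str) -> bool:
    """Return True only when the staging log reports staging_status: SUCCESS."""
    match = _STATUS_RE.search(staging_log_text)
    # A missing status line counts as a failed gate (fail closed).
    return match is not None and match.group(1) == "SUCCESS"
```

Treating a missing status as failure keeps the deploy stage unreachable when the staging log is absent or malformed.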

Change scope (why a prod rebuild+restart IS required)

ORCH-053 modifies code that lives inside the prod image and is executed by the running app — unlike bind-mount-only changes (cf. ORCH-048):

| File | Kind | Reaches prod via |
|---|---|---|
| src/reconciler.py | new runtime daemon module (sweeper thread) | image rebuild |
| src/main.py | lifespan wiring: reconciler.start()/stop(), /queue reconcile block | image rebuild |
| src/config.py | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
| src/db.py | stuck-task query helpers (no schema migration) | image rebuild |
| src/stage_engine.py | reconciler-driven advance_stage(finished_agent=None) path | image rebuild |
| src/plane_sync.py | F-2 plane-side reconcile support | image rebuild |
| src/webhooks/gitea.py | F-3 sha→branch DB-fallback in handle_ci_status | image rebuild |
| src/webhooks/plane.py | F-2 handler reuse (handle_status_start/handle_verdict) | image rebuild |
| tests/*, docs/*, .env.example, README.md | tests + docs + env descriptor | n/a (not deployed) |

Because src/ changed, the running prod process picks up ORCH-053 only after a rebuild + restart of the shared prod orchestrator (8500).
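
To illustrate why this wiring only takes effect after a process restart, here is a minimal sketch of a sweeper thread started and stopped from a lifespan context. The `Reconciler` class, its constructor arguments, and the `lifespan(reconciler)` signature are assumptions for illustration; the real src/reconciler.py and main.lifespan are not shown in this log.

```python
import asyncio
import threading
from contextlib import asynccontextmanager


class Reconciler:
    """Hypothetical sweeper: calls sweep() every `interval` seconds until stopped."""

    def __init__(self, sweep, interval: float):
        self._sweep = sweep
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, name="reconciler", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=5)

    def _run(self) -> None:
        # Event.wait doubles as an interruptible sleep: stop() wakes it at once.
        while not self._stop.wait(self._interval):
            self._sweep()


@asynccontextmanager
async def lifespan(reconciler: Reconciler):
    # The thread exists only inside the running process, so code baked into
    # a new image starts sweeping only once that process is restarted.
    reconciler.start()
    try:
        yield
    finally:
        reconciler.stop()
```

A short run such as `async with lifespan(Reconciler(sweep, interval=30)): ...` starts the sweeper on entry and joins it on exit.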

Database

No schema migration. ADR-0007 / ADR-001 invariant: the reconciler uses existing tables (tasks, jobs, agent_runs) via new read helpers in src/db.py; STAGE_TRANSITIONS and QG_CHECKS registries are unchanged. Restart-safe by construction (daemon re-derives state from the DB on start).
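
A read-only helper of the kind described might look like the sketch below. The column names (`id`, `stage`, `updated_at`), the `'done'` terminal stage, and the function name `find_stuck_tasks` are assumptions; only the `tasks` table name comes from this log.

```python
import sqlite3


def find_stuck_tasks(conn: sqlite3.Connection, now: float, grace_seconds: float):
    """Return (id, stage) for unfinished tasks idle longer than the grace period.

    Read-only: no DDL and no writes, so no schema migration is involved.
    Column names and the 'done' stage are hypothetical.
    """
    rows = conn.execute(
        "SELECT id, stage FROM tasks "
        "WHERE stage != 'done' AND updated_at < ? "
        "ORDER BY updated_at",
        (now - grace_seconds,),
    )
    return [(task_id, stage) for task_id, stage in rows]
```

Because the result depends only on the table contents and the current time, a freshly restarted daemon sees the same stuck tasks as the one it replaced, which is the sense in which the design is restart-safe.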

Deploy action

  • Prod container rebuild/restart: required, not performed (guardrail: never rebuild/restart the shared prod orchestrator within an ORCH task — it serves all projects incl. enduro-trails from one instance with a shared DB/queue; an in-task restart is a group risk for every project — CLAUDE.md §Self-hosting, INFRA.md §P-4).
  • Real docker/SSH deploy hook (scripts/orchestrator-deploy-hook.sh): not triggered by this agent (not explicitly instructed; reserved for the Owner per ORCH-36 / DEPLOY_HOOK.md).
  • Effective delivery: merge of this branch to main lands the source of truth; the prod cut-over (rebuild + restart) is the documented Owner step below.

Safe-rollback posture

The reconciler ships with a runtime kill-switch independent of any redeploy: ORCH_RECONCILE_ENABLED=false silences the entire sweeper, and ORCH_RECONCILE_PLANE_ENABLED=false disables only the F-2 Plane-poll branch. If the post-cut-over container is unhealthy, the deploy hook's 60s health loop auto-rolls back to the previous image (snapshotted in PREV_IMAGE_FILE).
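
The two switches can be read as a master flag plus a narrower one. A minimal sketch of that precedence, assuming both default to enabled and that values such as `true`/`1`/`yes`/`on` count as true (neither assumption is confirmed by this log; `ReconcileFlags` and `load_flags` are hypothetical names):

```python
import os
from dataclasses import dataclass

# Assumption: which strings count as "true" is not specified in the log.
_TRUE = {"1", "true", "yes", "on"}


def _flag(env, name: str, default: bool) -> bool:
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE


@dataclass(frozen=True)
class ReconcileFlags:
    enabled: bool        # whole sweeper
    plane_enabled: bool  # F-2 Plane-poll branch only


def load_flags(env=os.environ) -> ReconcileFlags:
    enabled = _flag(env, "ORCH_RECONCILE_ENABLED", True)
    # The master switch wins: with the sweeper off, the Plane branch is off too.
    plane = enabled and _flag(env, "ORCH_RECONCILE_PLANE_ENABLED", True)
    return ReconcileFlags(enabled, plane)
```

Passing a plain dict instead of `os.environ` makes the precedence easy to check without touching the real environment.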

Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)

Perform only in a quiet window and in this order:

  1. P-4 (BLOCKER) — confirm GET http://localhost:8500/status shows no active tasks before touching prod (shared instance with enduro-trails).
  2. Land the source of truth: merge feature/ORCH-053-sweeper-webhook-stuck-task → main (PR), then run git pull on main on the host under uid 1000 (/home/slin/repos/orchestrator).
  3. Prod cut-over via the deploy hook (conscious prod override — defaults are staging):
    TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
    TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
    PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
    bash scripts/orchestrator-deploy-hook.sh --deploy
    
    The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on :8500/health, and auto-rolls back if the new container is unhealthy.
  4. Post-deploy smoke:
    • GET /health → 200 {"status":"ok"}.
    • GET /queue → response carries the new reconcile block (interval, grace, last-pass snapshot).
    • Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is untouched — no spurious notifications), and docker logs shows the reconciler thread started after the worker.
  5. Optional staged rollout: set ORCH_RECONCILE_NOTIFY_UNBLOCK=true and watch the first unblock; keep ORCH_RECONCILE_ENABLED as the instant kill-switch.
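
The ordering of the steps above (P-4 gate, deploy, health window, rollback) can be sketched as a small control flow. This is an illustration in Python, not the actual scripts/orchestrator-deploy-hook.sh; the callables, the 2-second poll interval, and the return strings are hypothetical, while the 60-second window comes from the hook description.

```python
import time
from typing import Callable


def cut_over(
    active_tasks: Callable[[], int],
    deploy: Callable[[], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
    health_window: float = 60.0,  # from the hook description
    poll_every: float = 2.0,      # assumption: poll interval is not documented here
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'blocked', 'deployed', or 'rolled_back'."""
    # P-4 (BLOCKER): refuse to touch prod while any task is active.
    if active_tasks() > 0:
        return "blocked"
    deploy()
    deadline = clock() + health_window
    while clock() < deadline:
        if healthy():
            return "deployed"
        sleep(poll_every)
    # The health window expired without a healthy response: restore the old image.
    rollback()
    return "rolled_back"
```

Injecting `sleep` and `clock` keeps the sketch testable without waiting a real minute; in practice `healthy` would probe :8500/health and `active_tasks` would read /status.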

Summary

| Item | State |
|---|---|
| Staging gate (check_staging_status) | SUCCESS (10/10) |
| Change scope | runtime src/ (new daemon) → rebuild+restart required |
| DB schema migration | none (existing tables; ADR-0007 invariant) |
| Kill-switch / rollback | ORCH_RECONCILE_ENABLED env + deploy-hook auto-rollback |
| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
| Deploy stage verdict | SUCCESS |