| deploy_status | timestamp | work_item | target | staging_gate | db_migration | rebuild_required | restart_required | mode |
|---|---|---|---|---|---|---|---|---|
| SUCCESS | 2026-06-06T21:03:18Z | ORCH-053 | prod orchestrator (8500) — self-hosting | SUCCESS | none | true | true | artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard) |
Production Deploy Log — ORCH-053
feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)
Verdict
deploy_status: SUCCESS — the deployable artifact is validated and ready, and the
automated deploy-stage responsibility is complete. ORCH-053 adds and changes runtime
src/ code (new src/reconciler.py daemon thread wired into main.lifespan), so the
live prod rollout needs a container rebuild + restart. Per the self-hosting guardrail
that step is an Owner action (see Handoff) and was deliberately NOT performed by
this agent — the shared prod orchestrator (8500) serves all projects from one instance.
Precondition: staging gate (check_staging_status)
deploy is reachable only because the staging gate (deploy-staging) passed:
15-staging-log.md → staging_status: SUCCESS, 10/10 checks PASS on the live orchestrator-staging instance (8501), run inside the staging container (ORCH-048 canon). The GET /queue smoke confirmed the ORCH-053 reconcile block is exposed and the reconciler daemon runs in the staging stand without destabilising it. This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
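As an illustration of the gate described above, the following is a minimal sketch of a check that lets the deploy stage proceed only when the staging log declares success. The function name `staging_gate_passed` and the assumption that the log carries a `staging_status: <VALUE>` line are hypothetical; the real check_staging_status implementation is not shown in this log.

```python
import re


def staging_gate_passed(log_text: str) -> bool:
    """Return True only if the log declares `staging_status: SUCCESS`.

    Assumption: the staging log contains a line of the form
    `staging_status: <VALUE>`. A missing line counts as a failed gate.
    """
    match = re.search(
        r"^\s*staging_status\s*:\s*(\w+)", log_text, flags=re.MULTILINE
    )
    return match is not None and match.group(1).upper() == "SUCCESS"
```

Treating a missing status line as a failure keeps the gate fail-closed: a truncated or absent staging log blocks the deploy stage rather than letting it through.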
Change scope (why a prod rebuild+restart IS required)
ORCH-053 modifies code that lives inside the prod image and is executed by the running app — unlike bind-mount-only changes (cf. ORCH-048):
| File | Kind | Reaches prod via |
|---|---|---|
| `src/reconciler.py` | new runtime daemon module (sweeper thread) | image rebuild |
| `src/main.py` | lifespan wiring: reconciler.start()/stop(), /queue reconcile block | image rebuild |
| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
| `src/db.py` | stuck-task query helpers (no schema migration) | image rebuild |
| `src/stage_engine.py` | reconciler-driven advance_stage(finished_agent=None) path | image rebuild |
| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
| `src/webhooks/gitea.py` | F-3 sha→branch DB-fallback in handle_ci_status | image rebuild |
| `src/webhooks/plane.py` | F-2 handler reuse (handle_status_start/handle_verdict) | image rebuild |
| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |
Because src/ changed, the running prod process picks up ORCH-053 only after a
rebuild + restart of the shared prod orchestrator (8500).
Database
No schema migration. ADR-0007 / ADR-001 invariant: the reconciler uses existing
tables (tasks, jobs, agent_runs) via new read helpers in src/db.py; STAGE_TRANSITIONS
and QG_CHECKS registries are unchanged. Restart-safe by construction (daemon re-derives
state from the DB on start).
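The restart-safety claim can be illustrated with a small sketch: a sweeper that keeps no state between passes and re-queries the database every time, so a fresh process sees the same stuck tasks as the old one. Everything here is an assumption for illustration only: the `Reconciler` class, the `find_stuck_tasks` helper, the `status` and `updated_at` columns, and the use of `sqlite3` are hypothetical and do not reproduce the real src/reconciler.py or src/db.py.

```python
import sqlite3
import threading
import time


def find_stuck_tasks(conn, now, grace_seconds):
    """Hypothetical read helper: in-progress tasks idle longer than the grace period."""
    rows = conn.execute(
        "SELECT id FROM tasks WHERE status = 'in_progress' AND updated_at < ? "
        "ORDER BY id",
        (now - grace_seconds,),
    ).fetchall()
    return [row[0] for row in rows]


class Reconciler:
    """Stateless sweeper: every pass derives its work list from the DB."""

    def __init__(self, db_path, interval_seconds, grace_seconds, on_stuck):
        self.db_path = db_path
        self.interval = interval_seconds
        self.grace = grace_seconds
        self.on_stuck = on_stuck  # callback invoked once per stuck task id
        self._stop = threading.Event()
        self._thread = None

    def run_once(self, now=None):
        """Run a single sweep and return the ids that were handed to on_stuck."""
        now = time.time() if now is None else now
        conn = sqlite3.connect(self.db_path)
        try:
            stuck = find_stuck_tasks(conn, now, self.grace)
        finally:
            conn.close()
        for task_id in stuck:
            self.on_stuck(task_id)
        return stuck

    def _loop(self):
        while not self._stop.is_set():
            self.run_once()
            # Event.wait doubles as an interruptible sleep for fast shutdown.
            self._stop.wait(self.interval)

    def start(self):
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._loop, daemon=True, name="reconciler"
        )
        self._thread.start()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=5)
```

Because `run_once` reads only from the database, constructing a new `Reconciler` after a restart yields the same result as the previous instance would have produced, which is the property the section relies on.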
Deploy action
- Prod container rebuild/restart: required, not performed (guardrail: never rebuild/restart the shared prod orchestrator within an ORCH task, because it serves all projects incl. enduro-trails from one instance with a shared DB/queue, so an in-task restart is a group risk for every project; see CLAUDE.md §Self-hosting, INFRA.md §P-4).
- Real docker/SSH deploy hook (scripts/orchestrator-deploy-hook.sh): not triggered by this agent (not explicitly instructed; reserved for the Owner per ORCH-36 / DEPLOY_HOOK.md).
- Effective delivery: merge of this branch to main lands the source of truth; the prod cut-over (rebuild + restart) is the documented Owner step below.
Safe-rollback posture
The reconciler ships with a runtime kill-switch independent of any redeploy:
ORCH_RECONCILE_ENABLED=false silences the entire sweeper, and
ORCH_RECONCILE_PLANE_ENABLED=false disables only the F-2 Plane-poll branch. If the
post-cut-over container is unhealthy, the deploy hook's 60s health loop auto-rolls back
to the previous image (snapshotted in PREV_IMAGE_FILE).
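The relationship between the two switches can be sketched as follows. This is a minimal illustration, not the code in src/config.py: the helper names `env_flag` and `reconcile_branches`, the set of values treated as false, and the default of enabled when a variable is unset are all assumptions.

```python
import os

# Assumption: these spellings count as "off"; the real parser may differ.
_FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name, default=True, environ=None):
    """Read a boolean flag from the environment (hypothetical helper)."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() not in _FALSE_VALUES


def reconcile_branches(environ=None):
    """Resolve which reconciler branches run under the two kill-switches."""
    sweeper = env_flag("ORCH_RECONCILE_ENABLED", True, environ)
    # The Plane-poll branch is part of the sweeper, so the master switch wins.
    plane_poll = sweeper and env_flag("ORCH_RECONCILE_PLANE_ENABLED", True, environ)
    return {"sweeper": sweeper, "plane_poll": plane_poll}
```

For example, `reconcile_branches({"ORCH_RECONCILE_PLANE_ENABLED": "false"})` leaves the sweeper on and turns off only the Plane-poll branch, while setting `ORCH_RECONCILE_ENABLED` to `false` turns off both.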
Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)
Perform only in a quiet window and in this order:
- P-4 (BLOCKER): confirm GET http://localhost:8500/status shows no active tasks before touching prod (shared instance with enduro-trails).
- Land the source of truth: merge feature/ORCH-053-sweeper-webhook-stuck-task → main (PR), then host git pull on main under uid 1000 (/home/slin/repos/orchestrator).
- Prod cut-over via the deploy hook (conscious prod override; defaults are staging):

      TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
      TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
      PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
      bash scripts/orchestrator-deploy-hook.sh --deploy

  The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on :8500/health, and auto-rolls back if the new container is unhealthy.
- Post-deploy smoke:
  - GET /health → 200 {"status":"ok"}.
  - GET /queue → response carries the new reconcile block (interval, grace, last-pass snapshot).
  - Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is untouched, with no spurious notifications), and docker logs shows the reconciler thread started after the worker.
- Optional staged rollout: set ORCH_RECONCILE_NOTIFY_UNBLOCK=true and watch the first unblock; keep ORCH_RECONCILE_ENABLED as the instant kill-switch.
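The two HTTP smoke checks in the list above could be scripted roughly as below. This is a sketch under assumptions: the function names are hypothetical, and the JSON key `reconcile` is assumed from the wording "reconcile block"; the exact response shape of GET /queue is not specified in this log.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8500"  # prod orchestrator port named in this log


def fetch_json(path, base_url=BASE_URL, timeout=5.0):
    """GET a path and return (status_code, parsed_json_body).

    urlopen raises urllib.error.HTTPError for non-2xx responses, so a
    failing endpoint surfaces as an exception rather than a False result.
    """
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return resp.status, json.loads(resp.read().decode("utf-8"))


def check_health(status_code, body):
    """GET /health is expected to return 200 with {"status": "ok"}."""
    return status_code == 200 and body.get("status") == "ok"


def check_queue(status_code, body):
    """GET /queue is expected to carry a reconcile block (key name assumed)."""
    return status_code == 200 and isinstance(body.get("reconcile"), dict)


def run_smoke(fetch=fetch_json):
    """Run both checks; `fetch` is injectable so the logic can be tested offline."""
    return {
        "health": check_health(*fetch("/health")),
        "queue_reconcile": check_queue(*fetch("/queue")),
    }
```

Calling `run_smoke()` on the host after the cut-over returns a dictionary with one boolean per check. The third smoke item, confirming an actual unblock and the thread start order in docker logs, remains a manual step.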
Summary
| Item | State |
|---|---|
| Staging gate (check_staging_status) | SUCCESS (10/10) |
| Change scope | runtime src/ (new daemon) → rebuild+restart required |
| DB schema migration | none (existing tables; ADR-0007 invariant) |
| Kill-switch / rollback | ORCH_RECONCILE_ENABLED env + deploy-hook auto-rollback |
| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
| Deploy stage verdict | SUCCESS |