claude-bot d43603b224 docs(ORCH-053): deploy gate log — deploy_status SUCCESS

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-06-06 21:04:04 +00:00

deploy_status	timestamp	work_item	target	staging_gate	db_migration	rebuild_required	restart_required	mode
SUCCESS	2026-06-06T21:03:18Z	ORCH-053	prod orchestrator (8500) — self-hosting	SUCCESS	none	true	true	artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)

Production Deploy Log — ORCH-053

feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)

Verdict

deploy_status: SUCCESS — the deployable artifact is validated and ready, and the automated deploy-stage responsibility is complete. ORCH-053 adds and changes runtime src/ code (new src/reconciler.py daemon thread wired into main.lifespan), so the live prod rollout needs a container rebuild + restart. Per the self-hosting guardrail that step is an Owner action (see Handoff) and was deliberately NOT performed by this agent — the shared prod orchestrator (8500) serves all projects from one instance.

Precondition: staging gate (`check_staging_status`)

The deploy stage is reachable only because the staging gate (deploy-staging) passed:

15-staging-log.md → staging_status: SUCCESS, 10/10 checks PASS on the live orchestrator-staging instance (8501), run inside the staging container (ORCH-048 canon). The GET /queue smoke confirmed the ORCH-053 reconcile block is exposed and the reconciler daemon runs in the staging stand without destabilising it. This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
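
A gate of this kind can be sketched as follows. This is a minimal illustration, not the orchestrator's actual `check_staging_status`: the function name `staging_gate_passed` is hypothetical, and so is the assumption that the staging log carries a tab-separated header row followed by a value row, in the same shape as the summary table at the top of this log.

```python
def staging_gate_passed(log_text: str, field: str = "staging_status") -> bool:
    """Return True only if the row under the header containing `field` says SUCCESS.

    Assumed (hypothetical) format: a tab-separated header line followed
    directly by a tab-separated value line.
    """
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    for header, values in zip(lines, lines[1:]):
        names = [h.strip() for h in header.split("\t")]
        if field in names:
            cells = [c.strip() for c in values.split("\t")]
            idx = names.index(field)
            return idx < len(cells) and cells[idx] == "SUCCESS"
    # Field not found: fail closed so the deploy stage stays blocked.
    return False
```

The sketch fails closed: a missing or malformed status blocks the deploy stage rather than letting it through.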

Change scope (why a prod rebuild+restart IS required)

ORCH-053 modifies code that lives inside the prod image and is executed by the running app — unlike bind-mount-only changes (cf. ORCH-048):

File	Kind	Reaches prod via
`src/reconciler.py`	new runtime daemon module (sweeper thread)	image rebuild
`src/main.py`	lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block	image rebuild
`src/config.py`	reconciler settings (enabled / interval / grace / notify flags)	image rebuild
`src/db.py`	stuck-task query helpers (no schema migration)	image rebuild
`src/stage_engine.py`	reconciler-driven `advance_stage(finished_agent=None)` path	image rebuild
`src/plane_sync.py`	F-2 plane-side reconcile support	image rebuild
`src/webhooks/gitea.py`	F-3 `sha→branch` DB-fallback in `handle_ci_status`	image rebuild
`src/webhooks/plane.py`	F-2 handler reuse (`handle_status_start`/`handle_verdict`)	image rebuild
`tests/`, `docs/`, `.env.example`, `README.md`	tests + docs + env descriptor	n/a (not deployed)

Because src/ changed, the running prod process picks up ORCH-053 only after a rebuild + restart of the shared prod orchestrator (8500).
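
To illustrate why this is in-image runtime code, here is a minimal sketch of a sweeper daemon thread with `start()`/`stop()` hooks. It is not the contents of `src/reconciler.py`: the class name `Reconciler`, the `sweep` callback and the `interval` parameter are assumptions made for the example.

```python
import logging
import threading

log = logging.getLogger("reconciler")


class Reconciler:
    """Hypothetical sweeper: calls `sweep` every `interval` seconds until stopped."""

    def __init__(self, sweep, interval: float):
        self._sweep = sweep
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None

    def start(self) -> None:
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, name="reconciler", daemon=True
        )
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout)

    def _run(self) -> None:
        # Event.wait() returns True as soon as stop() is called, so the loop
        # exits promptly instead of sleeping out the full interval.
        while not self._stop.wait(self._interval):
            try:
                self._sweep()
            except Exception:
                # One failed pass must not kill the thread; the next pass retries.
                log.exception("reconcile pass failed")
```

In a lifespan handler, `start()` would run before the app begins serving and `stop()` on shutdown. Since the thread lives inside the app process, only a rebuilt and restarted container can run it.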

Database

No schema migration. ADR-0007 / ADR-001 invariant: the reconciler uses existing tables (tasks, jobs, agent_runs) via new read helpers in src/db.py; STAGE_TRANSITIONS and QG_CHECKS registries are unchanged. Restart-safe by construction (daemon re-derives state from the DB on start).
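
A read-only helper of the kind described could look like the sketch below. The table names `tasks` and `jobs` come from this log; everything else is assumed for illustration: the column names (`stage`, `updated_at`, `task_id`, `state`), the state values, the use of SQLite, and the function name `find_stuck_tasks`.

```python
import sqlite3


def find_stuck_tasks(
    conn: sqlite3.Connection, now: float, grace_s: float
) -> list[tuple[int, str]]:
    """Return (task_id, stage) for tasks idle longer than the grace period
    that have no job still queued or running. Issues SELECTs only."""
    rows = conn.execute(
        """
        SELECT t.id, t.stage
        FROM tasks AS t
        WHERE t.updated_at < ?
          AND t.stage != 'done'
          AND NOT EXISTS (
              SELECT 1 FROM jobs AS j
              WHERE j.task_id = t.id
                AND j.state IN ('queued', 'running')
          )
        ORDER BY t.id
        """,
        (now - grace_s,),
    ).fetchall()
    return [(row[0], row[1]) for row in rows]
```

Because the helper only reads and all state is re-derived from the tables on each pass, a restart loses nothing. That is the property the restart-safety claim above relies on.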

Deploy action

Prod container rebuild/restart: required, not performed (guardrail: never rebuild/restart the shared prod orchestrator within an ORCH task — it serves all projects incl. enduro-trails from one instance with a shared DB/queue; an in-task restart is a group risk for every project — CLAUDE.md §Self-hosting, INFRA.md §P-4).
Real docker/SSH deploy hook (scripts/orchestrator-deploy-hook.sh): not triggered by this agent (not explicitly instructed; reserved for the Owner per ORCH-36 / DEPLOY_HOOK.md).
Effective delivery: merge of this branch to main lands the source of truth; the prod cut-over (rebuild + restart) is the documented Owner step below.

Safe-rollback posture

The reconciler ships with a runtime kill-switch independent of any redeploy: ORCH_RECONCILE_ENABLED=false silences the entire sweeper, and ORCH_RECONCILE_PLANE_ENABLED=false disables only the F-2 Plane-poll branch. If the post-cut-over container is unhealthy, the deploy hook's 60s health loop auto-rolls back to the previous image (snapshotted in PREV_IMAGE_FILE).
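
The two flags can be read as a small decision table. The sketch below assumes both flags default to enabled when unset and that `0`, `false`, `no` and `off` count as false; the real parsing in `src/config.py` may differ, and the names `env_flag` and `reconcile_branches` are hypothetical.

```python
import os

_FALSE_VALUES = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool = True, env=None) -> bool:
    """Read a boolean flag from the environment (assumed parsing rules)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    return raw.strip().lower() not in _FALSE_VALUES


def reconcile_branches(env=None) -> dict[str, bool]:
    """Which sweeper branches run under the current flags."""
    enabled = env_flag("ORCH_RECONCILE_ENABLED", env=env)
    # The Plane-poll branch only runs if the sweeper as a whole is enabled.
    plane = enabled and env_flag("ORCH_RECONCILE_PLANE_ENABLED", env=env)
    return {"sweeper": enabled, "plane_poll": plane}
```

With `ORCH_RECONCILE_ENABLED=false` both entries are false; with only `ORCH_RECONCILE_PLANE_ENABLED=false` the sweeper keeps running without the F-2 branch.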

Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)

Perform only in a quiet window and in this order:

P-4 (BLOCKER) — confirm GET http://localhost:8500/status shows no active tasks before touching prod (shared instance with enduro-trails).
Land the source of truth: merge feature/ORCH-053-sweeper-webhook-stuck-task → main (PR), then run git pull on main on the host as uid 1000 (/home/slin/repos/orchestrator).

Prod cut-over via the deploy hook (conscious prod override — defaults are staging):

TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
bash scripts/orchestrator-deploy-hook.sh --deploy

The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on :8500/health, and auto-rolls back if the new container is unhealthy.

Post-deploy smoke:
- GET /health → 200 {"status":"ok"}.
- GET /queue → response carries the new reconcile block (interval, grace, last-pass snapshot).
- Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is left untouched, with no spurious notifications), and that docker logs shows the reconciler thread starting after the worker.
Optional staged rollout: set ORCH_RECONCILE_NOTIFY_UNBLOCK=true and watch the first unblock; keep ORCH_RECONCILE_ENABLED as the instant kill-switch.
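
The hook behaviour described in the cut-over step (rebuild and restart, a 60s health loop, automatic rollback) follows a common pattern, restated here as a Python sketch. It is not a transcription of `scripts/orchestrator-deploy-hook.sh`: the callables `deploy`, `probe` and `rollback` and the 2-second poll interval are assumptions. The health criterion mirrors the smoke check above (HTTP 200 with `{"status":"ok"}`).

```python
import time
from typing import Callable


def healthy(status_code: int, body: dict) -> bool:
    """Health criterion taken from the smoke check: 200 and status == 'ok'."""
    return status_code == 200 and body.get("status") == "ok"


def deploy_with_rollback(
    deploy: Callable[[], None],
    probe: Callable[[], tuple[int, dict]],
    rollback: Callable[[], None],
    timeout_s: float = 60.0,
    poll_s: float = 2.0,  # assumed poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deploy, poll health until the deadline, roll back if never healthy."""
    deploy()
    deadline = clock() + timeout_s
    while clock() < deadline:
        try:
            if healthy(*probe()):
                return True
        except OSError:
            # The container may not accept connections yet; keep polling.
            pass
        sleep(poll_s)
    rollback()
    return False
```

Injecting `clock` and `sleep` keeps the loop testable without waiting a real minute. In practice `probe` would issue the GET against :8500/health and `rollback` would restore the image recorded in PREV_IMAGE_FILE.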

Summary

Item	State
Staging gate (`check_staging_status`)	SUCCESS (10/10)
Change scope	runtime `src/` (new daemon) → rebuild+restart required
DB schema migration	none (existing tables; ADR-0007 invariant)
Kill-switch / rollback	`ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback
In-task prod rebuild/restart	NOT performed (self-hosting safeguard, by design)
Prod cut-over	handed off to Owner (P-4 + deploy hook, prod override)
Deploy stage verdict	SUCCESS
