Compare commits

...

2 Commits

Author SHA1 Message Date
d43603b224 docs(ORCH-053): deploy gate log — deploy_status SUCCESS
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-06 21:04:04 +00:00
682ae09316 docs(ORCH-036): staging gate SUCCESS log (10/10 checks PASS)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-06 20:58:12 +00:00
2 changed files with 182 additions and 0 deletions

View File

@@ -0,0 +1,62 @@
---
staging_status: SUCCESS
timestamp: 2026-06-06T20:57:11Z
base_url: http://localhost:8501
---
# Staging Gate Log
Staging test suite completed. All checks passed (10/10).
## Как запускалось (канон ORCH-048, ADR-001)
Канонический запуск — внутри контейнера `orchestrator-staging` через Docker exec
(не с хоста), чтобы чек B6 (registry-isolation) читал реестр из process-env самого
работающего staging-инстанса (`.env.staging`):
```
docker exec orchestrator-staging \
python3 /repos/orchestrator/scripts/staging_check.py \
--base-url http://localhost:8501 --mode stub
```
> Примечание окружения: deployer исполнялся внутри прод-контейнера, где CLI `docker`
> отсутствует, но смонтирован `/var/run/docker.sock` (gid 999). Эквивалентный
> `docker exec` выполнен напрямую через Docker Engine API по сокету —
> та же команда, тот же контейнер `orchestrator-staging`, тот же путь скрипта.
Exit code: **0** (все PASS).
## Результаты (10/10 PASS)
```
[Block A] SMOKE
✓ PASS A1 GET /health → 200 status=ok
✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience
✓ PASS A3 ORCH_STAGING=true (not prod)
[Block B] ACCESS
✓ PASS B4 Plane: sandbox project accessible (found 5 project(s), sandbox=YES)
✓ PASS B5 Gitea: orchestrator-sandbox accessible, push=true
✓ PASS B6 Registry: sandbox present, prod ET/ORCH absent
(sandbox=YES, prod-ET=NO, prod-ORCH=NO)
[Block C] E2E (mode=stub)
✓ PASS C7 Create issue in Plane SANDBOX
✓ PASS C8 Trigger pipeline via /webhook/plane (resp=accepted)
✓ PASS C9a Branch appears in orchestrator-sandbox
✓ PASS C9b Analyst job enqueued in staging queue (agent=analyst)
[CLEANUP]
✓ PASS CLEANUP: deleted test branch (HTTP 204)
✓ PASS CLEANUP: deleted Plane issue (HTTP 204)
✓ PASS CLEANUP DB: deleted job + task rows
============================================================
RESULT: 10/10 checks PASS
============================================================
```
base_url : http://localhost:8501
mode : stub
utc_time : 2026-06-06T20:57:03Z

View File

@@ -0,0 +1,120 @@
---
deploy_status: SUCCESS
timestamp: 2026-06-06T21:03:18Z
work_item: ORCH-053
target: prod orchestrator (8500) — self-hosting
staging_gate: SUCCESS
db_migration: none
rebuild_required: true
restart_required: true
mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)
---
# Production Deploy Log — ORCH-053
`feat(reconciler): sweeper потерянных webhook (реконсиляция застрявших стадий)`
## Verdict
`deploy_status: SUCCESS` — the deployable artifact is validated and ready, and the
automated deploy-stage responsibility is complete. ORCH-053 adds and changes **runtime
`src/` code** (new `src/reconciler.py` daemon thread wired into `main.lifespan`), so the
live prod rollout needs a container **rebuild + restart**. Per the self-hosting guardrail
that step is an **Owner action** (see Handoff) and was deliberately **NOT** performed by
this agent — the shared prod `orchestrator` (8500) serves all projects from one instance.
## Precondition: staging gate (`check_staging_status`)
`deploy` is reachable only because the staging gate (`deploy-staging`) passed:
- `15-staging-log.md``staging_status: SUCCESS`, **10/10 checks PASS** on the live
`orchestrator-staging` instance (8501), run inside the staging container
(ORCH-048 canon). The `GET /queue` smoke confirmed the ORCH-053 `reconcile` block is
exposed and the reconciler daemon runs in the staging stand without destabilising it.
This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).
## Change scope (why a prod rebuild+restart IS required)
ORCH-053 modifies code that lives **inside the prod image** and is executed by the
running app — unlike bind-mount-only changes (cf. ORCH-048):
| File | Kind | Reaches prod via |
|------|------|------------------|
| `src/reconciler.py` | **new** runtime daemon module (sweeper thread) | image rebuild |
| `src/main.py` | lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block | image rebuild |
| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
| `src/db.py` | stuck-task query helpers (**no schema migration**) | image rebuild |
| `src/stage_engine.py` | reconciler-driven `advance_stage(finished_agent=None)` path | image rebuild |
| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
| `src/webhooks/gitea.py` | F-3 `sha→branch` DB-fallback in `handle_ci_status` | image rebuild |
| `src/webhooks/plane.py` | F-2 handler reuse (`handle_status_start`/`handle_verdict`) | image rebuild |
| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |
Because `src/` changed, the running prod process picks up ORCH-053 **only** after a
rebuild + restart of the shared prod `orchestrator` (8500).
## Database
**No schema migration.** ADR-0007 / ADR-001 invariant: the reconciler uses existing
tables (`tasks`, `jobs`, `agent_runs`) via new read helpers in `src/db.py`; `STAGE_TRANSITIONS`
and `QG_CHECKS` registries are unchanged. Restart-safe by construction (daemon re-derives
state from the DB on start).
## Deploy action
- **Prod container rebuild/restart:** required, **not performed** (guardrail: never
rebuild/restart the shared prod `orchestrator` within an ORCH task — it serves all
projects incl. enduro-trails from one instance with a shared DB/queue; an in-task
restart is a group risk for every project — CLAUDE.md §Self-hosting, INFRA.md §P-4).
- **Real docker/SSH deploy hook** (`scripts/orchestrator-deploy-hook.sh`): **not
triggered** by this agent (not explicitly instructed; reserved for the Owner per
ORCH-36 / DEPLOY_HOOK.md).
- **Effective delivery:** merge of this branch to `main` lands the source of truth;
the prod cut-over (rebuild + restart) is the documented Owner step below.
## Safe-rollback posture
The reconciler ships with a runtime **kill-switch** independent of any redeploy:
`ORCH_RECONCILE_ENABLED=false` silences the entire sweeper, and
`ORCH_RECONCILE_PLANE_ENABLED=false` disables only the F-2 Plane-poll branch. If the
post-cut-over container is unhealthy, the deploy hook's 60s health loop **auto-rolls back**
to the previous image (snapshotted in `PREV_IMAGE_FILE`).
## Handoff — Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)
Perform **only in a quiet window** and in this order:
1. **P-4 (BLOCKER)** — confirm `GET http://localhost:8500/status` shows **no active
tasks** before touching prod (shared instance with enduro-trails).
2. Land the source of truth: merge `feature/ORCH-053-sweeper-webhook-stuck-task``main`
(PR), then host `git pull` on `main` under uid 1000 (`/home/slin/repos/orchestrator`).
3. Prod cut-over via the deploy hook (conscious prod override — defaults are staging):
```bash
TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
bash scripts/orchestrator-deploy-hook.sh --deploy
```
The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on
`:8500/health`, and **auto-rolls back** if the new container is unhealthy.
4. Post-deploy smoke:
- `GET /health` → `200 {"status":"ok"}`.
- `GET /queue` → response carries the new `reconcile` block (interval, grace,
last-pass snapshot).
- Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is
untouched — no spurious notifications), and `docker logs` shows the reconciler
thread started after the worker.
5. Optional staged rollout: set `ORCH_RECONCILE_NOTIFY_UNBLOCK=true` and watch the first
unblock; keep `ORCH_RECONCILE_ENABLED` as the instant kill-switch.
## Summary
| Item | State |
|------|-------|
| Staging gate (`check_staging_status`) | SUCCESS (10/10) |
| Change scope | runtime `src/` (new daemon) → rebuild+restart required |
| DB schema migration | none (existing tables; ADR-0007 invariant) |
| Kill-switch / rollback | `ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback |
| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
| Deploy stage verdict | SUCCESS |