Compare commits
13 Commits
1199d5aa7a...c0bcb544cf
| Author | SHA1 | Date |
|---|---|---|
| | c0bcb544cf | |
| | 2be39b398b | |
| | d79defeadd | |
| | 9f43e6a0ae | |
| | 10f2a39a58 | |
| | 63187ff102 | |
| | 5c5525548d | |
| | 0d0cd6e281 | |
| | 480b203a9d | |
| | 7705552f08 | |
| | d43603b224 | |
| | 682ae09316 | |
| | 434bd6243d | |
39
docs/work-items/ORCH-036/15-staging-log.md
Normal file
@@ -0,0 +1,39 @@
---
staging_status: SUCCESS
timestamp: 2026-06-06T21:06:37Z
base_url: http://localhost:8501
---

# Staging Gate Log

Staging test suite completed against the live `orchestrator-staging` instance (port 8501).
Executed canonically inside the container (ORCH-048, ADR-001):

```
docker exec orchestrator-staging \
  python3 /repos/orchestrator/scripts/staging_check.py \
  --base-url http://localhost:8501 --mode stub
```

(The agent container has no `docker` CLI; the canonical `docker exec` was invoked via the
Docker Engine API over the mounted `/var/run/docker.sock`, which is equivalent: the command
ran inside `orchestrator-staging`, so the B6 registry-isolation check read the staging
process-env `.env.staging`.)


**Result: 10/10 checks PASS, exit code 0.**

| Block | Check | Verdict |
|-------|-------|---------|
| A SMOKE | A1 `GET /health` → 200 status=ok | PASS |
| A SMOKE | A2 `GET /queue` → 200 (counts/max_concurrency/resilience) | PASS |
| A SMOKE | A3 `ORCH_STAGING=true` (not prod) | PASS |
| B ACCESS | B4 Plane sandbox project accessible | PASS |
| B ACCESS | B5 Gitea `orchestrator-sandbox` accessible, push=true | PASS |
| B ACCESS | B6 Registry: sandbox present, prod ET/ORCH absent | PASS |
| C E2E | C7 Create issue in Plane SANDBOX | PASS |
| C E2E | C8 Trigger pipeline via `/webhook/plane` | PASS |
| C E2E | C9a Branch appears in `orchestrator-sandbox` | PASS |
| C E2E | C9b Analyst job enqueued in staging queue | PASS |

CLEANUP: test branch deleted, Plane SANDBOX issue deleted, staging DB job/task rows removed
(`try/finally` guaranteed). No prod (8500) container was touched.
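The `try/finally` guarantee mentioned in the cleanup note can be illustrated with a small sketch. This is not the code of `staging_check.py`; `run_e2e` and its arguments are hypothetical names. The idea is that every cleanup runs even when an E2E step raises, and that one failing cleanup does not prevent the remaining ones from running.

```python
from typing import Callable


def run_e2e(steps: list[Callable[[], None]],
            cleanups: list[Callable[[], None]],
            cleanup_errors: list[str]) -> None:
    """Run the E2E steps in order, then always run every cleanup."""
    try:
        for step in steps:
            step()
    finally:
        for cleanup in cleanups:
            try:
                cleanup()
            except Exception as exc:
                # Record the failure and continue with the remaining cleanups.
                cleanup_errors.append(f"{cleanup.__name__}: {exc}")
```

If a step raises, the exception still propagates to the caller after the cleanups have run, so a failed check is reported as a failure rather than being masked by a clean teardown.
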
120
docs/work-items/ORCH-053/14-deploy-log.md
Normal file
@@ -0,0 +1,120 @@
---
deploy_status: SUCCESS
timestamp: 2026-06-06T21:03:18Z
work_item: ORCH-053
target: prod orchestrator (8500), self-hosting
staging_gate: SUCCESS
db_migration: none
rebuild_required: true
restart_required: true
mode: artifact-validated; prod rebuild+restart handed off to Owner (self-hosting safeguard)
---

# Production Deploy Log: ORCH-053

`feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)`

## Verdict

`deploy_status: SUCCESS` means the deployable artifact is validated and ready, and the
automated deploy-stage responsibility is complete. ORCH-053 adds and changes **runtime
`src/` code** (new `src/reconciler.py` daemon thread wired into `main.lifespan`), so the
live prod rollout needs a container **rebuild + restart**. Per the self-hosting guardrail,
that step is an **Owner action** (see Handoff) and was deliberately **NOT** performed by
this agent, because the shared prod `orchestrator` (8500) serves all projects from one instance.

## Precondition: staging gate (`check_staging_status`)

`deploy` is reachable only because the staging gate (`deploy-staging`) passed:

- `15-staging-log.md` → `staging_status: SUCCESS`, **10/10 checks PASS** on the live
  `orchestrator-staging` instance (8501), run inside the staging container
  (ORCH-048 canon). The `GET /queue` smoke confirmed the ORCH-053 `reconcile` block is
  exposed and the reconciler daemon runs in the staging stand without destabilising it.
  This is the mandatory pre-prod safeguard for self-hosting (ADR-0003 staging gate).

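As an illustration of how a gate of this kind can read the staging log, here is a minimal sketch that parses the `---` front matter and checks `staging_status`. It is not the actual `check_staging_status` implementation; `parse_front_matter` and `staging_gate_passed` are hypothetical names, and the parser handles only flat `key: value` lines like the ones in these logs.

```python
def parse_front_matter(text: str) -> dict[str, str]:
    """Parse a flat 'key: value' block delimited by '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        # Split on the first colon only, so values such as URLs and
        # ISO timestamps keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat the block as malformed


def staging_gate_passed(log_text: str) -> bool:
    """True only when the front matter says staging_status: SUCCESS."""
    return parse_front_matter(log_text).get("staging_status") == "SUCCESS"
```

A missing file section, a missing closing delimiter, or any status other than `SUCCESS` all evaluate to `False`, which keeps the gate closed by default.
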
## Change scope (why a prod rebuild+restart IS required)

ORCH-053 modifies code that lives **inside the prod image** and is executed by the
running app, unlike bind-mount-only changes (cf. ORCH-048):

| File | Kind | Reaches prod via |
|------|------|------------------|
| `src/reconciler.py` | **new** runtime daemon module (sweeper thread) | image rebuild |
| `src/main.py` | lifespan wiring: `reconciler.start()/stop()`, `/queue` reconcile block | image rebuild |
| `src/config.py` | reconciler settings (enabled / interval / grace / notify flags) | image rebuild |
| `src/db.py` | stuck-task query helpers (**no schema migration**) | image rebuild |
| `src/stage_engine.py` | reconciler-driven `advance_stage(finished_agent=None)` path | image rebuild |
| `src/plane_sync.py` | F-2 plane-side reconcile support | image rebuild |
| `src/webhooks/gitea.py` | F-3 `sha→branch` DB-fallback in `handle_ci_status` | image rebuild |
| `src/webhooks/plane.py` | F-2 handler reuse (`handle_status_start`/`handle_verdict`) | image rebuild |
| `tests/*`, `docs/*`, `.env.example`, `README.md` | tests + docs + env descriptor | n/a (not deployed) |

Because `src/` changed, the running prod process picks up ORCH-053 **only** after a
rebuild + restart of the shared prod `orchestrator` (8500).

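The rule applied in the table above (anything under `src/` reaches prod only through an image rebuild; tests, docs and env descriptors are not deployed) can be sketched as a small path classifier. `needs_image_rebuild` is a hypothetical helper, and treating `src/` as the only image-baked directory is an assumption taken from this table alone.

```python
from pathlib import PurePosixPath


def needs_image_rebuild(changed_paths: list[str]) -> bool:
    """True if any changed file lives under the top-level src/ directory."""
    return any(PurePosixPath(path).parts[:1] == ("src",) for path in changed_paths)
```

For the ORCH-053 change set this returns `True` as soon as `src/reconciler.py` is in the list, whereas a change touching only `tests/`, `docs/`, `.env.example` or `README.md` returns `False`.
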
## Database

**No schema migration.** ADR-0007 / ADR-001 invariant: the reconciler uses existing
tables (`tasks`, `jobs`, `agent_runs`) via new read helpers in `src/db.py`; `STAGE_TRANSITIONS`
and `QG_CHECKS` registries are unchanged. Restart-safe by construction (daemon re-derives
state from the DB on start).

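To make the "re-derives state from the DB" property concrete, here is a sketch of a read-only stuck-task query evaluated on every pass, so nothing needs to survive a restart. It is an illustration only: the column names (`updated_at`, `task_id`, `status`), the status values, and the use of `sqlite3` are all assumptions, and the real helpers in `src/db.py` may look quite different.

```python
import sqlite3

# Hypothetical schema: tasks(id, updated_at), jobs(id, task_id, status).
STUCK_TASKS_SQL = """
SELECT t.id
FROM tasks AS t
WHERE t.updated_at < :cutoff
  AND NOT EXISTS (
      SELECT 1 FROM jobs AS j
      WHERE j.task_id = t.id AND j.status IN ('queued', 'running')
  )
ORDER BY t.id
"""


def find_stuck_tasks(conn: sqlite3.Connection, now: float, grace_s: float) -> list[int]:
    """Return ids of tasks idle for longer than the grace period with no live job."""
    rows = conn.execute(STUCK_TASKS_SQL, {"cutoff": now - grace_s}).fetchall()
    return [row[0] for row in rows]
```

Because the query only reads existing rows, a sweeper built this way needs no new tables and no in-memory bookkeeping between passes.
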
## Deploy action

- **Prod container rebuild/restart:** required, **not performed** (guardrail: never
  rebuild/restart the shared prod `orchestrator` within an ORCH task; it serves all
  projects incl. enduro-trails from one instance with a shared DB/queue, so an in-task
  restart is a group risk for every project; see CLAUDE.md §Self-hosting, INFRA.md §P-4).
- **Real docker/SSH deploy hook** (`scripts/orchestrator-deploy-hook.sh`): **not
  triggered** by this agent (not explicitly instructed; reserved for the Owner per
  ORCH-036 / DEPLOY_HOOK.md).
- **Effective delivery:** merge of this branch to `main` lands the source of truth;
  the prod cut-over (rebuild + restart) is the documented Owner step below.

## Safe-rollback posture

The reconciler ships with a runtime **kill-switch** independent of any redeploy:
`ORCH_RECONCILE_ENABLED=false` silences the entire sweeper, and
`ORCH_RECONCILE_PLANE_ENABLED=false` disables only the F-2 Plane-poll branch. If the
post-cut-over container is unhealthy, the deploy hook's 60s health loop **auto-rolls back**
to the previous image (snapshotted in `PREV_IMAGE_FILE`).

## Handoff: Owner prod cut-over (DEPLOY_HOOK.md, INFRA.md §Self-hosting)

Perform **only in a quiet window** and in this order:

1. **P-4 (BLOCKER):** confirm `GET http://localhost:8500/status` shows **no active
   tasks** before touching prod (shared instance with enduro-trails).
2. Land the source of truth: merge `feature/ORCH-053-sweeper-webhook-stuck-task` → `main`
   (PR), then host `git pull` on `main` under uid 1000 (`/home/slin/repos/orchestrator`).
3. Prod cut-over via the deploy hook (conscious prod override; defaults are staging):
   ```bash
   TARGET_SERVICE=orchestrator TARGET_PORT=8500 \
   TARGET_IMAGE=orchestrator-orchestrator COMPOSE_PROFILE="" \
   PREV_IMAGE_FILE=/home/slin/repos/orchestrator/.deploy-prev-image-prod \
   bash scripts/orchestrator-deploy-hook.sh --deploy
   ```
   The hook snapshots the previous image, rebuilds+restarts, runs a 60s health loop on
   `:8500/health`, and **auto-rolls back** if the new container is unhealthy.
4. Post-deploy smoke:
   - `GET /health` → `200 {"status":"ok"}`.
   - `GET /queue` → response carries the new `reconcile` block (interval, grace,
     last-pass snapshot).
   - Confirm a stuck task is unblocked by the sweeper (or that a synchronous task is
     untouched, with no spurious notifications), and `docker logs` shows the reconciler
     thread started after the worker.
5. Optional staged rollout: set `ORCH_RECONCILE_NOTIFY_UNBLOCK=true` and watch the first
   unblock; keep `ORCH_RECONCILE_ENABLED` as the instant kill-switch.

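The control flow of step 3 (deploy, poll health for up to 60 seconds, roll back if the service never becomes healthy) can be sketched in Python. The real hook is the bash script `scripts/orchestrator-deploy-hook.sh`, whose internals are not shown in this log; `wait_healthy`, `cut_over`, the 2-second polling interval and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_healthy(probe: Callable[[], bool],
                 timeout_s: float = 60.0,
                 interval_s: float = 2.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll probe() until it succeeds or the time budget is used up."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() + interval_s > deadline:
            return False  # another wait would exceed the budget
        sleep(interval_s)


def cut_over(deploy: Callable[[], None],
             probe: Callable[[], bool],
             rollback: Callable[[], None],
             **health_kwargs) -> str:
    """Deploy, then keep the new version only if it becomes healthy in time."""
    deploy()
    if wait_healthy(probe, **health_kwargs):
        return "deployed"
    rollback()
    return "rolled_back"
```

Passing the clock and sleep functions as parameters keeps the loop testable without real waiting; in production use the defaults apply.
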
## Summary

| Item | State |
|------|-------|
| Staging gate (`check_staging_status`) | SUCCESS (10/10) |
| Change scope | runtime `src/` (new daemon) → rebuild+restart required |
| DB schema migration | none (existing tables; ADR-0007 invariant) |
| Kill-switch / rollback | `ORCH_RECONCILE_ENABLED` env + deploy-hook auto-rollback |
| In-task prod rebuild/restart | NOT performed (self-hosting safeguard, by design) |
| Prod cut-over | handed off to Owner (P-4 + deploy hook, prod override) |
| Deploy stage verdict | SUCCESS |
42
docs/work-items/ORCH-053/15-staging-log.md
Normal file
@@ -0,0 +1,42 @@
---
staging_status: SUCCESS
timestamp: 2026-06-06T20:54:16Z
base_url: http://localhost:8501
---

# Staging Gate Log

Staging test suite completed against the live `orchestrator-staging` instance (port 8501).
All checks passed; staging gate is GREEN.

## Run

- **Canonical execution:** inside container `orchestrator-staging` (ORCH-048, ADR-001).
  The host environment has no `docker` CLI, so the `docker exec` was driven through the
  Docker Engine API over the unix socket `/var/run/docker.sock`, functionally equivalent
  to `docker exec orchestrator-staging python3 /repos/orchestrator/scripts/staging_check.py
  --base-url http://localhost:8501 --mode stub`. B6 registry-isolation therefore reads the
  running staging instance's own process-env (`.env.staging`), avoiding the false-FAIL of a
  host-side run.
- **Mode:** `stub` (early-artifact verification: branch + QG-0 comment; no LLM credits).
- **Container:** `orchestrator-staging` (095be2c4ca3f)
- **Exit code:** 0

## Result: 10/10 checks PASS

| Block | Check | Verdict |
|-------|-------|---------|
| A SMOKE | A1 GET /health → 200 status=ok | PASS |
| A SMOKE | A2 GET /queue → 200 (counts/max_concurrency/resilience) | PASS |
| A SMOKE | A3 ORCH_STAGING=true (not prod) | PASS |
| B ACCESS | B4 Plane sandbox project accessible | PASS |
| B ACCESS | B5 Gitea orchestrator-sandbox accessible, push=true | PASS |
| B ACCESS | B6 Registry: sandbox present, prod ET/ORCH absent | PASS |
| C E2E | C7 Create issue in Plane SANDBOX | PASS |
| C E2E | C8 Trigger pipeline via /webhook/plane | PASS |
| C E2E | C9a Branch appears in orchestrator-sandbox | PASS |
| C E2E | C9b Analyst job enqueued in staging queue | PASS |

Cleanup completed (sandbox branch + Plane issue + DB rows removed). The `GET /queue`
response exposed the `resilience` block; the ORCH-053 reconciler runs in this staging
instance without destabilising the stand.