diff --git a/docs/work-items/ORCH-058/15-staging-log.md b/docs/work-items/ORCH-058/15-staging-log.md
index b3c7f9c..3e75b90 100644
--- a/docs/work-items/ORCH-058/15-staging-log.md
+++ b/docs/work-items/ORCH-058/15-staging-log.md
@@ -1,24 +1,61 @@
 ---
-staging_status: SUCCESS
-timestamp: 2026-06-07T09:50:05Z
+staging_status: FAILED
+timestamp: 2026-06-07T10:14:30Z
 base_url: http://localhost:8501
 ---
 
 # Staging Gate Log — ORCH-058
 
-Staging test suite completed against the live staging environment. **All checks passed.**
+Staging test suite ran against the live staging environment and **FAILED** (exit code `1`,
+**8/10 checks PASS**). The two end-to-end (Block C) checks failed: the pipeline was **not
+triggered** on the freshly-built staging image, so no task / branch / analyst job was created.
 
-- Execution: canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001),
-  invoked via the Docker Engine API over the mounted unix socket (the `docker` CLI
-  binary is not present in the agent runtime image; the Engine API exec is the exact
-  equivalent of `docker exec orchestrator-staging python3 \
-  /repos/orchestrator/scripts/staging_check.py --base-url http://localhost:8501 --mode stub`).
-  Running inside the container is required so B6 (registry isolation) reads the staging
-  instance's own process-env (`ORCH_PROJECTS_JSON` = sandbox only).
-- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount).
+Per the staging-gate contract this is a machine verdict `FAILED` → the task rolls back to
+`development`. The verdict reflects the real suite exit code, not an LLM declaration.
+
+## Execution
+- Canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001), invoked via the
+  Docker Engine API over the mounted unix socket (the `docker` CLI binary is absent in the
+  agent runtime image; the Engine-API exec is the exact equivalent of
+  `docker exec orchestrator-staging python3 /repos/orchestrator/scripts/staging_check.py
+  --base-url http://localhost:8501 --mode stub`).
+- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount, `main`).
 - Mode: `stub`
-- Exit code: `0`
-- Result: **10/10 checks PASS**
+- Exit code: `1`
+- Result: **8/10 checks PASS** (FAIL: C9a, C9b)
+
+## Root cause (actionable for development rollback)
+The E2E flow (`staging_check.py` Block C) creates a SANDBOX Plane issue (C7 ✓), then POSTs a
+signed `/webhook/plane` payload with state `IN_PROGRESS_STATE_ID` (name `"In Progress"`,
+group `"started"`) to start the pipeline (C8 ✓ — HTTP 200 `{"status":"accepted"}`). However the
+staging instance logged:
+
+```
+2026-06-07 10:14:09 [INFO] orchestrator.webhooks.plane: issue ed5db89e-657d-4728-9179-901d2404be85
+  updated to state b873d9eb..., no pipeline action
+```
+
+→ **"no pipeline action"**: the `In Progress` / `started` webhook did NOT start the pipeline,
+so no `tasks` row, no Gitea branch (C9a FAIL — branch never appeared after 60s), and no analyst
+job enqueued (C9b FAIL — queue had no new job; latest job is id=8 from 2026-06-06). Cleanup
+confirmed `no task row found for plane_id=ed5db89e...` and `no branch to delete`.
+
+This is a **deterministic regression in the validated artifact**, not a timing flake (the
+webhook was explicitly classified as no-op, not a poll timeout):
+- The **same** `staging_check.py` against the **same** SANDBOX config passed **10/10** at
+  09:31 UTC on the pre-rebuild image (see git history of this file).
+- The staging image was **freshly rebuilt** at 10:13:29 UTC (revision label
+  `org.opencontainers.image.revision=094b5e2f960f696216f8661ff9c27b0d4706f219`, container
+  recreated 10:13:36 UTC) — consistent with ORCH-058 Strategy A rebuilding 8501 from the
+  validated commit. The new image now exposes the `reconcile` key in `/queue` (ORCH-053),
+  absent at 09:31, confirming the image changed between the two runs.
+- Net: the artifact about to be promoted to prod no longer starts the pipeline on a Plane
+  `In Progress` (group `started`) transition. **Investigate `handle_status_start` /
+  webhook start-state matching in `src/webhooks/plane.py`** against the validated commit.
+
+Smoke (A1–A3) and access (B4–B6) all passed, including B6 registry isolation
+(sandbox present; prod ET/ORCH absent) — confirming the check ran inside the staging
+instance's own process-env, so there is no false-FAIL / spurious-rollback risk from B6.
 
 ## Test output
 
@@ -27,12 +64,12 @@ Staging test suite completed against the live staging environment. **All checks
 ORCH-33 Staging Check Suite
   base_url : http://localhost:8501
   mode : stub
-  utc_time : 2026-06-07T09:50:05.706197+00:00
+  utc_time : 2026-06-07T10:14:07.188198+00:00
 ============================================================
 
 [Block A] SMOKE
   ✓ PASS A1 GET /health → 200 status=ok [HTTP 200, body={'status': 'ok', 'service': 'orchestrator'}]
-  ✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'recent']]
+  ✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'reconcile', 'recent']]
   ✓ PASS A3 ORCH_STAGING=true (not prod) [ORCH_STAGING=true]
 
 [Block B] ACCESS
@@ -41,23 +78,28 @@ Staging test suite completed against the live staging environment. **All checks
   ✓ PASS B6 Registry: sandbox present, prod ET/ORCH absent [sandbox=YES, prod-ET=NO(good), prod-ORCH=NO(good)]
 
 [Block C] E2E (mode=stub)
-  ✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201]
+  · C7: Creating issue in SANDBOX project...
+  ✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201, issue_id=ed5db89e-657d-4728-9179-901d2404be85]
+  · C8: Triggering pipeline via POST /webhook/plane ...
+  · Using HMAC signature (secret len=40)
   ✓ PASS C8 Trigger pipeline via /webhook/plane [HTTP 200, resp={'status': 'accepted'}]
-  ✓ PASS C9a Branch appears in orchestrator-sandbox
-  ✓ PASS C9b Analyst job enqueued in staging queue [status=queued, agent=analyst]
+  · C9a: Polling for branch in orchestrator-sandbox (up to 60s)...
+  · waiting... (waiting for branch) [×20]
+  ✗ FAIL C9a Branch appears in orchestrator-sandbox [branch=not found]
+  · C9b: Checking staging job queue for analyst job (up to 30s)...
+  · (Plane comment check skipped: bot-tokens not added to SANDBOX project)
+  · waiting... (waiting for analyst job in queue) [×15]
+  ✗ FAIL C9b Analyst job enqueued in staging queue
 
 [CLEANUP]
-  ✓ PASS CLEANUP: deleted branch (HTTP 204)
-  ✓ PASS CLEANUP: deleted Plane issue (HTTP 204)
-  ✓ PASS CLEANUP DB: deleted 1 job row(s)
-  ✓ PASS CLEANUP DB: deleted 1 task row(s)
+  · CLEANUP: no branch to delete
+  ✓ PASS CLEANUP: deleted Plane issue ed5db89e-657d-4728-9179-901d2404be85 (HTTP 204)
+  · CLEANUP DB: no task row found for plane_id=ed5db89e-657d-4728-9179-901d2404be85
+  · CLEANUP DB dedup: no such table: events_dedup
 
 ============================================================
-  RESULT: 10/10 checks PASS
+  RESULT: 8/10 checks PASS
 ============================================================
 ```
 
-Note: B6 registry isolation passed (sandbox present; prod ET/ORCH absent), confirming the
-check ran inside the staging instance's own process-env as required — no false-FAIL/spurious
-rollback risk. This is a re-run of the staging gate (prior SUCCESS at 09:31:58Z); verdict
-unchanged: SUCCESS.
+EXIT_CODE=1
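
The staging-gate contract this log relies on (a machine verdict taken from the suite's exit code, with `FAILED` rolling the task back to `development`) can be sketched roughly as below. This is only an illustration under stated assumptions, not the orchestrator's actual code: the function names `verdict_from_exit_code` and `rollback_target` are hypothetical, and the rule that any non-zero exit code means failure is assumed. Only the `SUCCESS`/`FAILED` values and the `development` rollback target are taken from the log.

```python
from typing import Optional


def verdict_from_exit_code(exit_code: int) -> str:
    """Map the suite's process exit code to a staging_status value.

    Assumption: 0 means every check passed; any other code counts as a failure.
    """
    return "SUCCESS" if exit_code == 0 else "FAILED"


def rollback_target(verdict: str) -> Optional[str]:
    """Return the stage to roll back to, or None when no rollback is needed.

    The "development" target comes from the log; the function itself is hypothetical.
    """
    return "development" if verdict == "FAILED" else None


if __name__ == "__main__":
    # The run recorded in this log ended with EXIT_CODE=1.
    verdict = verdict_from_exit_code(1)
    print(verdict, rollback_target(verdict))
```

Deriving the verdict from the exit code alone keeps the gate decision independent of how the run is summarized in prose, which matches the log's statement that the verdict is not an LLM declaration.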