docs(ORCH-058): staging gate re-run on fresh image — staging_status FAILED #58

Merged
admin merged 1 commit from deployer/ORCH-058-staging-verdict into main 2026-06-07 13:22:14 +03:00


@@ -1,24 +1,61 @@
---
staging_status: SUCCESS
timestamp: 2026-06-07T09:50:05Z
staging_status: FAILED
timestamp: 2026-06-07T10:14:30Z
base_url: http://localhost:8501
---
# Staging Gate Log — ORCH-058
Staging test suite completed against the live staging environment. **All checks passed.**
Staging test suite ran against the live staging environment and **FAILED** (exit code `1`,
**8/10 checks PASS**). The two end-to-end (Block C) checks failed: the pipeline was **not
triggered** on the freshly-built staging image, so no task / branch / analyst job was created.
- Execution: canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001),
invoked via the Docker Engine API over the mounted unix socket (the `docker` CLI
binary is not present in the agent runtime image; the Engine API exec is the exact
equivalent of `docker exec orchestrator-staging python3 \
/repos/orchestrator/scripts/staging_check.py --base-url http://localhost:8501 --mode stub`).
Running inside the container is required so B6 (registry isolation) reads the staging
instance's own process-env (`ORCH_PROJECTS_JSON` = sandbox only).
- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount).
Per the staging-gate contract this is a machine verdict `FAILED` → the task rolls back to
`development`. The verdict reflects the real suite exit code, not an LLM declaration.
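The rule that the verdict follows the suite's exit code can be sketched as below. This is a minimal illustration, not the deployer's actual code: the helper names `staging_verdict` and `front_matter` are hypothetical, and only the three front-matter fields visible in this log are assumed.

```python
from datetime import datetime, timezone


def staging_verdict(exit_code: int) -> str:
    """Map the suite's process exit code to the gate verdict (0 means SUCCESS)."""
    return "SUCCESS" if exit_code == 0 else "FAILED"


def front_matter(exit_code: int, base_url: str, now: datetime) -> str:
    """Render the YAML front matter that heads the gate log (hypothetical helper)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        "---\n"
        f"staging_status: {staging_verdict(exit_code)}\n"
        f"timestamp: {stamp}\n"
        f"base_url: {base_url}\n"
        "---\n"
    )


if __name__ == "__main__":
    run_time = datetime(2026, 6, 7, 10, 14, 30, tzinfo=timezone.utc)
    print(front_matter(1, "http://localhost:8501", run_time))
```

Deriving the status from the integer exit code alone keeps the verdict independent of any free-text summary written around it.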
## Execution
- Canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001), invoked via the
Docker Engine API over the mounted unix socket (the `docker` CLI binary is absent in the
agent runtime image; the Engine-API exec is the exact equivalent of
`docker exec orchestrator-staging python3 /repos/orchestrator/scripts/staging_check.py
--base-url http://localhost:8501 --mode stub`).
- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount, `main`).
- Mode: `stub`
- Exit code: `0`
- Result: **10/10 checks PASS**
- Exit code: `1`
- Result: **8/10 checks PASS** (FAIL: C9a, C9b)
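For readers unfamiliar with running an exec without the `docker` CLI, the following is a rough sketch of the Engine API sequence (create exec, start it attached, inspect for the exit code) using only the Python standard library. It is not the agent's actual code. The socket path `/var/run/docker.sock`, the unversioned API paths, and the helper names are assumptions; the command line is the one quoted above.

```python
from __future__ import annotations

import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a unix socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def staging_check_cmd(base_url: str, mode: str) -> list[str]:
    """Command executed inside the container (taken from the log above)."""
    return [
        "python3", "/repos/orchestrator/scripts/staging_check.py",
        "--base-url", base_url, "--mode", mode,
    ]


def exec_create_body(cmd: list[str]) -> dict:
    """JSON body for POST /containers/{id}/exec."""
    return {"AttachStdout": True, "AttachStderr": True, "Tty": False, "Cmd": cmd}


def _request(socket_path: str, method: str, path: str, body: dict | None = None):
    conn = UnixHTTPConnection(socket_path)
    try:
        data = json.dumps(body) if body is not None else None
        conn.request(method, path, body=data,
                     headers={"Content-Type": "application/json"})
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def run_in_container(container: str, cmd: list[str],
                     socket_path: str = "/var/run/docker.sock") -> int:
    """Run cmd in the container and return its exit code."""
    status, raw = _request(socket_path, "POST",
                           f"/containers/{container}/exec", exec_create_body(cmd))
    if status != 201:
        raise RuntimeError(f"exec create failed: HTTP {status}")
    exec_id = json.loads(raw)["Id"]
    # Attached start: the response body is the multiplexed stdout/stderr stream
    # and ends when the command exits. The stream is not decoded in this sketch.
    _request(socket_path, "POST", f"/exec/{exec_id}/start",
             {"Detach": False, "Tty": False})
    _, info = _request(socket_path, "GET", f"/exec/{exec_id}/json")
    return json.loads(info)["ExitCode"]
```

Under these assumptions, `run_in_container("orchestrator-staging", staging_check_cmd("http://localhost:8501", "stub"))` would return the suite's exit code, which is the value the verdict is derived from.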
## Root cause (actionable for development rollback)
The E2E flow (`staging_check.py` Block C) creates a SANDBOX Plane issue (C7 ✓), then POSTs a
signed `/webhook/plane` payload with state `IN_PROGRESS_STATE_ID` (name `"In Progress"`,
group `"started"`) to start the pipeline (C8 ✓ — HTTP 200 `{"status":"accepted"}`). However the
staging instance logged:
```
2026-06-07 10:14:09 [INFO] orchestrator.webhooks.plane: issue ed5db89e-657d-4728-9179-901d2404be85
updated to state b873d9eb..., no pipeline action
```
**"no pipeline action"**: the `In Progress` / `started` webhook did NOT start the pipeline,
so no `tasks` row, no Gitea branch (C9a FAIL — branch never appeared after 60s), and no analyst
job enqueued (C9b FAIL — queue had no new job; latest job is id=8 from 2026-06-06). Cleanup
confirmed `no task row found for plane_id=ed5db89e...` and `no branch to delete`.
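To reproduce the C8 step in isolation while debugging, a signed webhook request can be built roughly as follows. This is a sketch under stated assumptions: the log only says an HMAC signature is used, so the SHA-256 digest, the hex encoding, the `X-Plane-Signature` header name, and the payload field names are all assumptions here rather than facts from `staging_check.py`.

```python
import hashlib
import hmac
import json


def sign_payload(secret: str, body: bytes) -> str:
    """Hex HMAC of the raw request body (SHA-256 is an assumption)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def build_signed_request(secret: str, payload: dict):
    """Return (body, headers) for a POST to /webhook/plane."""
    body = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Header name is an assumption, not taken from the log.
        "X-Plane-Signature": sign_payload(secret, body),
    }
    return body, headers


def verify_signature(secret: str, body: bytes, received: str) -> bool:
    """Constant-time comparison, as a receiver would typically do."""
    return hmac.compare_digest(sign_payload(secret, body), received)


if __name__ == "__main__":
    # Hypothetical payload shape; only the state name and group come from the log.
    payload = {
        "event": "issue",
        "action": "updated",
        "data": {
            "id": "ed5db89e-657d-4728-9179-901d2404be85",
            "state": {"name": "In Progress", "group": "started"},
        },
    }
    body, headers = build_signed_request("x" * 40, payload)
    print(verify_signature("x" * 40, body, headers["X-Plane-Signature"]))
```

Because C8 returned HTTP 200 `{"status":"accepted"}`, signature verification evidently succeeded in this run; the sketch is only useful for replaying the request against a local build while investigating the state matching.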
This is a **deterministic regression in the validated artifact**, not a timing flake (the
webhook was explicitly classified as no-op, not a poll timeout):
- The **same** `staging_check.py` against the **same** SANDBOX config passed **10/10** at
09:31 UTC on the pre-rebuild image (see git history of this file).
- The staging image was **freshly rebuilt** at 10:13:29 UTC (revision label
`org.opencontainers.image.revision=094b5e2f960f696216f8661ff9c27b0d4706f219`, container
recreated 10:13:36 UTC) — consistent with ORCH-058 Strategy A rebuilding 8501 from the
validated commit. The new image now exposes the `reconcile` key in `/queue` (ORCH-053),
absent at 09:31, confirming the image changed between the two runs.
- Net: the artifact about to be promoted to prod no longer starts the pipeline on a Plane
`In Progress` (group `started`) transition. **Investigate `handle_status_start` /
webhook start-state matching in `src/webhooks/plane.py`** against the validated commit.
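The "image changed between the two runs" argument rests on comparing the A2 `/queue` key lists from the two test outputs. A small sketch of that comparison follows; the helper name `added_keys` is hypothetical, and the key lists are copied from the A2 lines in the test output.

```python
def added_keys(before, after):
    """Keys present in the newer /queue response but not in the older one."""
    return sorted(set(after) - set(before))


# A2 key lists as printed by the earlier and the later run.
keys_before = ["counts", "max_concurrency", "poll_interval", "resilience", "recent"]
keys_after = ["counts", "max_concurrency", "poll_interval", "resilience",
              "reconcile", "recent"]

if __name__ == "__main__":
    print(added_keys(keys_before, keys_after))
```

A non-empty difference shows the two runs hit different builds, which supports treating the failure as a regression in the rebuilt image rather than a flake.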
Smoke (A1-A3) and access (B4-B6) all passed, including B6 registry isolation
(sandbox present; prod ET/ORCH absent) — confirming the check ran inside the staging
instance's own process-env, so there is no false-FAIL / spurious-rollback risk from B6.
## Test output
@@ -27,12 +64,12 @@ Staging test suite completed against the live staging environment. **All checks
ORCH-33 Staging Check Suite
base_url : http://localhost:8501
mode : stub
utc_time : 2026-06-07T09:50:05.706197+00:00
utc_time : 2026-06-07T10:14:07.188198+00:00
============================================================
[Block A] SMOKE
✓ PASS A1 GET /health → 200 status=ok [HTTP 200, body={'status': 'ok', 'service': 'orchestrator'}]
✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'recent']]
✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'reconcile', 'recent']]
✓ PASS A3 ORCH_STAGING=true (not prod) [ORCH_STAGING=true]
[Block B] ACCESS
@@ -41,23 +78,28 @@ Staging test suite completed against the live staging environment. **All checks
✓ PASS B6 Registry: sandbox present, prod ET/ORCH absent [sandbox=YES, prod-ET=NO(good), prod-ORCH=NO(good)]
[Block C] E2E (mode=stub)
✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201]
· C7: Creating issue in SANDBOX project...
✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201, issue_id=ed5db89e-657d-4728-9179-901d2404be85]
· C8: Triggering pipeline via POST /webhook/plane ...
· Using HMAC signature (secret len=40)
✓ PASS C8 Trigger pipeline via /webhook/plane [HTTP 200, resp={'status': 'accepted'}]
✓ PASS C9a Branch appears in orchestrator-sandbox
✓ PASS C9b Analyst job enqueued in staging queue [status=queued, agent=analyst]
· C9a: Polling for branch in orchestrator-sandbox (up to 60s)...
· waiting... (waiting for branch) [×20]
✗ FAIL C9a Branch appears in orchestrator-sandbox [branch=not found]
· C9b: Checking staging job queue for analyst job (up to 30s)...
· (Plane comment check skipped: bot-tokens not added to SANDBOX project)
· waiting... (waiting for analyst job in queue) [×15]
✗ FAIL C9b Analyst job enqueued in staging queue
[CLEANUP]
✓ PASS CLEANUP: deleted branch (HTTP 204)
✓ PASS CLEANUP: deleted Plane issue (HTTP 204)
✓ PASS CLEANUP DB: deleted 1 job row(s)
✓ PASS CLEANUP DB: deleted 1 task row(s)
· CLEANUP: no branch to delete
✓ PASS CLEANUP: deleted Plane issue ed5db89e-657d-4728-9179-901d2404be85 (HTTP 204)
· CLEANUP DB: no task row found for plane_id=ed5db89e-657d-4728-9179-901d2404be85
· CLEANUP DB dedup: no such table: events_dedup
============================================================
RESULT: 10/10 checks PASS
RESULT: 8/10 checks PASS
============================================================
```
Note: B6 registry isolation passed (sandbox present; prod ET/ORCH absent), confirming the
check ran inside the staging instance's own process-env as required — no false-FAIL/spurious
rollback risk. This is a re-run of the staging gate (prior SUCCESS at 09:31:58Z); verdict
unchanged: SUCCESS.
EXIT_CODE=1