docs(ORCH-058): staging gate re-run on fresh image — staging_status FAILED (8/10)
All checks were successful
CI / test (pull_request) Successful in 16s
Strategy-A freshness re-validation rebuilt 8501 from merged commit 094b5e2 and
re-ran staging_check; E2E C9a/C9b fail (Plane "In Progress"/started webhook ->
"no pipeline action", no task/branch/analyst-job). Machine verdict FAILED ->
rollback to development. Prod (8500) untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
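The commit message states that the verdict is a machine verdict driven by the suite result, with `FAILED` causing a rollback to development. A minimal sketch of that rule follows, assuming the verdict is taken from the process exit code and the pass counts are read from the `RESULT:` line that `staging_check.py` prints. The names `machine_verdict`, `Verdict` and `next_stage`, and the `"promotion"` label for the success path, are hypothetical and not taken from the repository.

```python
import re
from dataclasses import dataclass

# Matches the summary line printed by the suite, e.g. "RESULT: 8/10 checks PASS".
RESULT_RE = re.compile(r"^\s*RESULT:\s*(\d+)/(\d+) checks PASS\s*$", re.MULTILINE)


@dataclass(frozen=True)
class Verdict:
    staging_status: str  # "SUCCESS" or "FAILED"
    passed: int
    total: int
    next_stage: str      # hypothetical label for where the task goes next


def machine_verdict(exit_code: int, output: str) -> Verdict:
    """Derive the gate verdict from the exit code; counts are informational."""
    match = RESULT_RE.search(output)
    passed, total = (int(match.group(1)), int(match.group(2))) if match else (0, 0)
    ok = exit_code == 0  # the exit code alone decides, not the parsed text
    return Verdict(
        staging_status="SUCCESS" if ok else "FAILED",
        passed=passed,
        total=total,
        next_stage="promotion" if ok else "development",
    )
```

With the values from this run (`exit_code=1`, `RESULT: 8/10 checks PASS`) the sketch yields `FAILED` and `development`, which matches the front matter written by this commit.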
@@ -1,24 +1,61 @@
 ---
-staging_status: SUCCESS
-timestamp: 2026-06-07T09:50:05Z
+staging_status: FAILED
+timestamp: 2026-06-07T10:14:30Z
 base_url: http://localhost:8501
 ---

 # Staging Gate Log — ORCH-058

-Staging test suite completed against the live staging environment. **All checks passed.**
+Staging test suite ran against the live staging environment and **FAILED** (exit code `1`,
+**8/10 checks PASS**). The two end-to-end (Block C) checks failed: the pipeline was **not
+triggered** on the freshly-built staging image, so no task / branch / analyst job was created.

-- Execution: canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001),
-invoked via the Docker Engine API over the mounted unix socket (the `docker` CLI
-binary is not present in the agent runtime image; the Engine API exec is the exact
-equivalent of `docker exec orchestrator-staging python3 \
-/repos/orchestrator/scripts/staging_check.py --base-url http://localhost:8501 --mode stub`).
-Running inside the container is required so B6 (registry isolation) reads the staging
-instance's own process-env (`ORCH_PROJECTS_JSON` = sandbox only).
-- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount).
+Per the staging-gate contract this is a machine verdict `FAILED` → the task rolls back to
+`development`. The verdict reflects the real suite exit code, not an LLM declaration.
+
+## Execution
+- Canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001), invoked via the
+Docker Engine API over the mounted unix socket (the `docker` CLI binary is absent in the
+agent runtime image; the Engine-API exec is the exact equivalent of
+`docker exec orchestrator-staging python3 /repos/orchestrator/scripts/staging_check.py
+--base-url http://localhost:8501 --mode stub`).
+- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount, `main`).
 - Mode: `stub`
-- Exit code: `0`
-- Result: **10/10 checks PASS**
+- Exit code: `1`
+- Result: **8/10 checks PASS** (FAIL: C9a, C9b)

+## Root cause (actionable for development rollback)
+The E2E flow (`staging_check.py` Block C) creates a SANDBOX Plane issue (C7 ✓), then POSTs a
+signed `/webhook/plane` payload with state `IN_PROGRESS_STATE_ID` (name `"In Progress"`,
+group `"started"`) to start the pipeline (C8 ✓ — HTTP 200 `{"status":"accepted"}`). However the
+staging instance logged:
+
+```
+2026-06-07 10:14:09 [INFO] orchestrator.webhooks.plane: issue ed5db89e-657d-4728-9179-901d2404be85
+updated to state b873d9eb..., no pipeline action
+```
+
+→ **"no pipeline action"**: the `In Progress` / `started` webhook did NOT start the pipeline,
+so no `tasks` row, no Gitea branch (C9a FAIL — branch never appeared after 60s), and no analyst
+job enqueued (C9b FAIL — queue had no new job; latest job is id=8 from 2026-06-06). Cleanup
+confirmed `no task row found for plane_id=ed5db89e...` and `no branch to delete`.
+
+This is a **deterministic regression in the validated artifact**, not a timing flake (the
+webhook was explicitly classified as no-op, not a poll timeout):
+- The **same** `staging_check.py` against the **same** SANDBOX config passed **10/10** at
+09:31 UTC on the pre-rebuild image (see git history of this file).
+- The staging image was **freshly rebuilt** at 10:13:29 UTC (revision label
+`org.opencontainers.image.revision=094b5e2f960f696216f8661ff9c27b0d4706f219`, container
+recreated 10:13:36 UTC) — consistent with ORCH-058 Strategy A rebuilding 8501 from the
+validated commit. The new image now exposes the `reconcile` key in `/queue` (ORCH-053),
+absent at 09:31, confirming the image changed between the two runs.
+- Net: the artifact about to be promoted to prod no longer starts the pipeline on a Plane
+`In Progress` (group `started`) transition. **Investigate `handle_status_start` /
+webhook start-state matching in `src/webhooks/plane.py`** against the validated commit.
+
+Smoke (A1–A3) and access (B4–B6) all passed, including B6 registry isolation
+(sandbox present; prod ET/ORCH absent) — confirming the check ran inside the staging
+instance's own process-env, so there is no false-FAIL / spurious-rollback risk from B6.
+
 ## Test output

@@ -27,12 +64,12 @@ Staging test suite completed against the live staging environment. **All checks
 ORCH-33 Staging Check Suite
 base_url : http://localhost:8501
 mode : stub
-utc_time : 2026-06-07T09:50:05.706197+00:00
+utc_time : 2026-06-07T10:14:07.188198+00:00
 ============================================================

 [Block A] SMOKE
 ✓ PASS A1 GET /health → 200 status=ok [HTTP 200, body={'status': 'ok', 'service': 'orchestrator'}]
-✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'recent']]
+✓ PASS A2 GET /queue → 200 with counts/max_concurrency/resilience [HTTP 200, keys=['counts', 'max_concurrency', 'poll_interval', 'resilience', 'reconcile', 'recent']]
 ✓ PASS A3 ORCH_STAGING=true (not prod) [ORCH_STAGING=true]

 [Block B] ACCESS
@@ -41,23 +78,28 @@ Staging test suite completed against the live staging environment. **All checks
 ✓ PASS B6 Registry: sandbox present, prod ET/ORCH absent [sandbox=YES, prod-ET=NO(good), prod-ORCH=NO(good)]

 [Block C] E2E (mode=stub)
-✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201]
+· C7: Creating issue in SANDBOX project...
+✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201, issue_id=ed5db89e-657d-4728-9179-901d2404be85]
+· C8: Triggering pipeline via POST /webhook/plane ...
+· Using HMAC signature (secret len=40)
 ✓ PASS C8 Trigger pipeline via /webhook/plane [HTTP 200, resp={'status': 'accepted'}]
-✓ PASS C9a Branch appears in orchestrator-sandbox
-✓ PASS C9b Analyst job enqueued in staging queue [status=queued, agent=analyst]
+· C9a: Polling for branch in orchestrator-sandbox (up to 60s)...
+· waiting... (waiting for branch) [×20]
+✗ FAIL C9a Branch appears in orchestrator-sandbox [branch=not found]
+· C9b: Checking staging job queue for analyst job (up to 30s)...
+· (Plane comment check skipped: bot-tokens not added to SANDBOX project)
+· waiting... (waiting for analyst job in queue) [×15]
+✗ FAIL C9b Analyst job enqueued in staging queue

 [CLEANUP]
-✓ PASS CLEANUP: deleted branch (HTTP 204)
-✓ PASS CLEANUP: deleted Plane issue (HTTP 204)
-✓ PASS CLEANUP DB: deleted 1 job row(s)
-✓ PASS CLEANUP DB: deleted 1 task row(s)
+· CLEANUP: no branch to delete
+✓ PASS CLEANUP: deleted Plane issue ed5db89e-657d-4728-9179-901d2404be85 (HTTP 204)
+· CLEANUP DB: no task row found for plane_id=ed5db89e-657d-4728-9179-901d2404be85
+· CLEANUP DB dedup: no such table: events_dedup

 ============================================================
-RESULT: 10/10 checks PASS
+RESULT: 8/10 checks PASS
 ============================================================
 ```

-Note: B6 registry isolation passed (sandbox present; prod ET/ORCH absent), confirming the
-check ran inside the staging instance's own process-env as required — no false-FAIL/spurious
-rollback risk. This is a re-run of the staging gate (prior SUCCESS at 09:31:58Z); verdict
-unchanged: SUCCESS.
+EXIT_CODE=1