Merge pull request '#59 staging gate FAILED — corrected root cause' into main
Some checks failed
CI / test (push) Has been cancelled
This commit was merged in pull request #59.
@@ -1,19 +1,24 @@
---
staging_status: FAILED
timestamp: 2026-06-07T10:39:15Z
timestamp: 2026-06-07T11:01:00Z
base_url: http://localhost:8501
---

# Staging Gate Log — ORCH-058

Staging test suite ran against the live staging environment and **FAILED** (exit code `1`,
**8/10 checks PASS**). The two end-to-end (Block C) checks failed: the pipeline was **not
triggered** on the staging image, so no task / branch / analyst job was created.
**8/10 checks PASS**). Block C (E2E) checks C9a and C9b failed.

Per the staging-gate contract this is a machine verdict `FAILED` → the task rolls back to
`development`. The verdict reflects the real suite exit code, not an LLM declaration. This run
**reproduces** the regression already recorded in the previous re-run of this file (same 8/10,
same `C9a`/`C9b` failure, same `no pipeline action` classification).
Per the staging-gate contract this is the machine verdict `FAILED` (it reflects the real suite
exit code, never an LLM declaration). Smoke (A1–A3) and access (B4–B6) all passed, **including
B6 registry isolation** — so this is NOT a B6/ORCH-048 false-FAIL.

> ⚠️ **CORRECTED ROOT CAUSE — read before acting on this rollback.** The previous revision of
> this log blamed `handle_status_start` / a regression in the validated artifact. **That was
> wrong**, which is why the dev↔staging cycle kept repeating. Direct inspection inside the
> running staging instance proves the production code is **correct** and the failure is a bug in
> the **test harness `scripts/staging_check.py`**. Do NOT touch `src/webhooks/plane.py` /
> `handle_status_start` / any ORCH-058 image-freshness code. **Fix `scripts/staging_check.py`.**

## Execution
- Canonical `docker exec` into `orchestrator-staging` (ORCH-048, ADR-001), invoked via the
@@ -21,38 +26,62 @@ same `C9a`/`C9b` failure, same `no pipeline action` classification).
agent runtime image; the Engine-API exec is the exact equivalent of
`docker exec orchestrator-staging python3 /repos/orchestrator/scripts/staging_check.py
--base-url http://localhost:8501 --mode stub`).
- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount, `main`).
- Script: `/repos/orchestrator/scripts/staging_check.py` (bind-mount, served from the host repo,
NOT baked into the image — so a harness fix takes effect on the next run without a rebuild).
- Mode: `stub`
- Exit code: `1`
- Result: **8/10 checks PASS** (FAIL: C9a, C9b)
- Staging image under test: `orchestrator-orchestrator-staging`, OCI label
`org.opencontainers.image.revision=094b5e2f960f696216f8661ff9c27b0d4706f219` (= the **merge
commit of ORCH-058 into `main`**, PR #57; ancestor of branch HEAD `60e5596e`). Container
recreated 2026-06-07T10:13:36Z. So the artifact under test genuinely contains the validated
ORCH-058 code.

## Root cause (actionable for development rollback)
The E2E flow (`staging_check.py` Block C) creates a SANDBOX Plane issue (C7 ✓), then POSTs a
signed `/webhook/plane` payload to start the pipeline (C8 ✓ — HTTP 200 `{"status":"accepted"}`).
However the staging instance logged:
## Decisive root cause (proven, actionable)
Block C creates a SANDBOX Plane issue (C7 ✓), then POSTs a signed `/webhook/plane` payload to
start the pipeline (C8 ✓ — HTTP 200 `{"status":"accepted"}`). The staging instance logged for
the test issue `427cb94e-…`:

```
2026-06-07 10:39:17,333 [INFO] orchestrator.webhooks.plane: issue 990c99b5-6a1d-4e63-a59a-9a11716e07b9
2026-06-07 10:59:04 [INFO] orchestrator.webhooks.plane: issue 427cb94e-cedd-4def-ba5d-21c555a82477
updated to state b873d9eb..., no pipeline action
```

→ **"no pipeline action"**: the webhook transition did NOT start the pipeline, so no `tasks`
row, no Gitea branch (C9a FAIL — branch never appeared after 60s), and no analyst job enqueued
(C9b FAIL — queue had no new job after 30s). Cleanup confirmed `no task row found for
plane_id=990c99b5...` and `no branch to delete`.
`handle_issue_updated` (src/webhooks/plane.py) starts the pipeline **only** when the webhook's
new state equals the **incoming project's** `in_progress` state, resolved per-project from the
Plane API by `get_project_states(project_id)` (ORCH-10). The webhook the harness sends carries
state `b873d9eb-993c-48cd-97ac-99a9b1623967`.

This is a **deterministic regression in the validated artifact**, not a timing flake (the
webhook was explicitly classified as a no-op, not a poll timeout):
- The **same** `staging_check.py` against the **same** SANDBOX config passed **10/10** on an
earlier pre-rebuild image (see git history of this file).
- The state id `b873d9eb...` from the webhook payload is not matched as a pipeline-start
(`group="started"`) transition by the staging instance. **Investigate `handle_status_start`
/ webhook start-state matching in `src/webhooks/plane.py`** against the validated commit, and
confirm the staging start-state id wiring used by `staging_check.py`.
**The mismatch (queried live inside the staging container):**

Smoke (A1–A3) and access (B4–B6) all passed, including B6 registry isolation
(sandbox present; prod ET/ORCH absent) — confirming the check ran inside the staging
instance's own process-env, so there is no false-FAIL / spurious-rollback risk from B6.
| | UUID |
|---|---|
| `staging_check.py` `IN_PROGRESS_STATE_ID` (hardcoded) | `b873d9eb-993c-48cd-97ac-99a9b1623967` |
| `get_project_states(SANDBOX)["in_progress"]` (real) | `84a76f65-75f8-4022-9554-379dad38523c` |
| `_DEFAULT_STATES["in_progress"]` (enduro-trails fallback) | `b873d9eb-993c-48cd-97ac-99a9b1623967` |

The hardcoded `b873d9eb…` is the **enduro-trails** In Progress UUID (the `_DEFAULT_STATES`
fallback), **not** SANDBOX's. SANDBOX's actual In Progress is `84a76f65…`. So the handler
**correctly** classifies the enduro-state webhook as `no pipeline action` for a SANDBOX issue →
no `tasks` row, no Gitea branch (C9a FAIL after 60s), no analyst job enqueued (C9b FAIL).
Cleanup confirmed `no task row found` and `no branch to delete`.

**Why it intermittently "passed 10/10" before (09:31):** `get_project_states` falls back to
`_DEFAULT_STATES` (= `b873d9eb…`) whenever the Plane states API call fails / returns no
recognisable states. On runs where that fallback fired, the hardcoded harness state accidentally
matched and the pipeline started. On this run the SANDBOX states API call succeeded at startup
(`GET …/projects/8c5a3025-…/states/ → 200 OK`), so SANDBOX resolved to its real `84a76f65…` and
the accidental match disappeared. The green runs were the bug; the red runs are correct handler
behaviour exposing a harness that hardcodes the wrong project's state.

## Required fix (for the development rollback) — in `scripts/staging_check.py` ONLY
Make the E2E harness send SANDBOX's **actual** `in_progress` state instead of a hardcoded enduro
UUID. Resolve it dynamically the same way the app does — e.g. `GET
/workspaces/<slug>/projects/<SANDBOX_PROJECT_ID>/states/`, pick the state whose `name` is
`"In Progress"` (group `"started"`), and use its `id` in `_make_webhook_payload`. (The harness
already calls the Plane API for B4/B6, so credentials/URL are available.) Do **not** rely on the
`_DEFAULT_STATES` fallback coincidence. No production-code change is warranted; ORCH-058's
image-provenance feature is unaffected by this and is functioning.

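One way the selection step of this fix could look, as a hedged sketch rather than the harness's actual code. `pick_in_progress_state_id` is a hypothetical helper name. It assumes the states response has already been fetched with the harness's existing Plane client and decoded into a list of objects with `id`, `name` and `group` keys; the real response may wrap that list (for example in a paginated envelope), in which case the unwrapping needs adapting. Sample ids other than SANDBOX's `84a76f65…` are placeholders.

```python
from typing import Iterable, Mapping


def pick_in_progress_state_id(states: Iterable[Mapping[str, str]]) -> str:
    """Return the id of the state named "In Progress" in group "started".

    Raises LookupError instead of falling back to a default, so a missing
    state fails the harness loudly rather than matching by coincidence.
    """
    for state in states:
        if state.get("name") == "In Progress" and state.get("group") == "started":
            return state["id"]
    raise LookupError('no "In Progress" state in group "started"')


# Assumed shape of the decoded states list; non-SANDBOX ids are placeholders.
sample_states = [
    {"id": "placeholder-backlog-id", "name": "Backlog", "group": "backlog"},
    {"id": "84a76f65-75f8-4022-9554-379dad38523c", "name": "In Progress", "group": "started"},
    {"id": "placeholder-done-id", "name": "Done", "group": "completed"},
]

in_progress_state_id = pick_in_progress_state_id(sample_states)
print(in_progress_state_id)  # prints 84a76f65-75f8-4022-9554-379dad38523c
```

The returned id would then take the place of the hardcoded `IN_PROGRESS_STATE_ID` when `_make_webhook_payload` builds the C8 payload. Raising on a missing state is a deliberate choice here: a silent default is exactly what produced the accidental green runs.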
## Test output

@@ -61,7 +90,7 @@ instance's own process-env, so there is no false-FAIL / spurious-rollback risk f
ORCH-33 Staging Check Suite
base_url : http://localhost:8501
mode : stub
utc_time : 2026-06-07T10:39:15.004026+00:00
utc_time : 2026-06-07T10:59:02.392888+00:00
============================================================

[Block A] SMOKE
@@ -76,7 +105,7 @@ instance's own process-env, so there is no false-FAIL / spurious-rollback risk f

[Block C] E2E (mode=stub)
· C7: Creating issue in SANDBOX project...
✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201, issue_id=990c99b5-6a1d-4e63-a59a-9a11716e07b9]
✓ PASS C7 Create issue in Plane SANDBOX [HTTP 201, issue_id=427cb94e-cedd-4def-ba5d-21c555a82477]
· C8: Triggering pipeline via POST /webhook/plane ...
· Using HMAC signature (secret len=40)
✓ PASS C8 Trigger pipeline via /webhook/plane [HTTP 200, resp={'status': 'accepted'}]
@@ -90,8 +119,8 @@ instance's own process-env, so there is no false-FAIL / spurious-rollback risk f

[CLEANUP]
· CLEANUP: no branch to delete
✓ PASS CLEANUP: deleted Plane issue 990c99b5-6a1d-4e63-a59a-9a11716e07b9 (HTTP 204)
· CLEANUP DB: no task row found for plane_id=990c99b5-6a1d-4e63-a59a-9a11716e07b9
✓ PASS CLEANUP: deleted Plane issue 427cb94e-cedd-4def-ba5d-21c555a82477 (HTTP 204)
· CLEANUP DB: no task row found for plane_id=427cb94e-cedd-4def-ba5d-21c555a82477
· CLEANUP DB dedup: no such table: events_dedup

============================================================