Canonical staging_check.py run inside orchestrator-staging:
8/10 PASS, all REAL checks green, C9a/C9b infra-waived (ORCH-061),
exit 0 → advance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes
agent_runs.exit_code FIRST, then does git push / PR / Plane comments before
_finalize_job, and the agent pid is already dead in that window — so the old
"exit_code recorded -> reap now" had no grace and could race a healthy job.
Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job)
BEFORE the atomic claim, so a reaper that lost the race had already enqueued
the next stage (dup advance / dup enqueue), violating ADR-001 Р-1.
Fix:
- Tier-2 grace: reap only once agent_runs.exit_code has been recorded for
>= reaper_finalize_grace_s (new setting, default 300s; > max finalization
window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New
finished_age_s column computed in get_running_jobs.
- claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the
reconciler pattern) to choose the terminal status, then atomically claim
'done' FIRST; only the claim winner runs the advance. A loser performs no
side effects -> no dup advance / dup enqueue.
Docs (golden source) updated in the same change: ADR-001, global adr-0011,
README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011
link). New tests cover the grace window, lost-claim no-side-effects, and the
already-advanced idempotent path.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran staging_check inside orchestrator-staging (exit 0); all REAL checks
green, C9a/C9b waived per ORCH-061.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Validated ORCH-061 infra-tolerance against live staging (8501): all REAL
checks pass, only sandbox-infra C9a/C9b fail and are waived → exit 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Staging check exit code 1 (C9a/C9b). Live inspection inside orchestrator-staging
proves the production webhook handler is correct: get_project_states(SANDBOX).in_progress
= 84a76f65..., but scripts/staging_check.py hardcodes the enduro fallback b873d9eb...
=> handler correctly classifies the webhook as "no pipeline action". Fix belongs in
scripts/staging_check.py (resolve SANDBOX in_progress dynamically), NOT in handle_status_start
or any ORCH-058 image-freshness code. Image under test = ORCH-058 merge commit 094b5e2f.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
E2E pipeline not triggered on staging webhook ("no pipeline action" on
state b873d9eb...); reproduces prior FAILED. Rolls task back to development.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Strategy-A freshness re-validation rebuilt 8501 from merged commit 094b5e2 and
re-ran staging_check; E2E C9a/C9b fail (Plane "In Progress"/started webhook ->
"no pipeline action", no task/branch/analyst-job). Machine verdict FAILED ->
rollback to development. Prod (8500) untouched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>