Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).
Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.
- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
pytest in its own process group (start_new_session) and tree-kills the WHOLE
group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
already CI/tester/staging-validated); fail-safe runs the re-test on any
uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
("ORCH-110","classify_retest_failure","src/merge_gate.py") to
MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
.env.example) updated in the same PR.
Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
measure_coverage hardcoded "python" for the coverage subprocess; the prod container
and the CI runner expose "python3" (a bare "python" may be absent), and pytest-cov
lives in exactly the running interpreter's environment. Use sys.executable so the
measurement always runs under the same interpreter as the orchestrator.
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Introduce a deterministic (no-LLM) coverage sub-gate that blocks coverage
degradation before a task branch merges into `main`. Existing gates judge only by
the FACT of passing (check_ci_green / check_tests_passed / merge-gate re-test), not
by completeness — so a batch autonomous run (ORCH-088) silently erodes coverage.
Pattern mirrors the security-gate (ORCH-022): leaf src/coverage_gate.py (never-raise)
+ thin check_coverage_gate in QG_CHECKS + _handle_coverage_gate splice in advance_stage,
run AFTER merge-gate (measured on the caught-up HEAD that lands in main) and BEFORE
image-freshness (fail before the expensive docker rebuild).
- measure_coverage: pytest --cov=src --cov-report=json in the per-branch worktree ->
line coverage %; None on tool error -> fail-open + WARNING by default (FR-6).
- compute_coverage_verdict (pure): absolute | baseline | both + epsilon (NFR-4 anti-flap);
baseline None -> bootstrap (absolute-only).
- coverage_baseline DB table (additive, CREATE TABLE IF NOT EXISTS) + ratchet-up in
_handle_merge_verify (deploy->done): atomic compare-and-set under merge-lease, never
decreases; bootstrap on first merge.
- Artefact 18-coverage-report.md (coverage_status: frontmatter, single source of truth);
GET /queue `coverage` block; FAIL -> Telegram; optional POST /coverage/baseline override.
- Flags ORCH_COVERAGE_* (kill-switch + self-hosting-only scope) -> enduro untouched;
STAGE_TRANSITIONS / existing check_* / verdict keys byte-for-byte unchanged (NFR-5/AC-8).
- pytest-cov==5.0.0 added to requirements.txt.
Tests: tests/test_coverage_gate.py (TC-01..TC-15). Frozen QG-registry anti-regress
tests + deploy-staging edge tests updated for the new sub-gate. Full suite green.
Docs: README / adr-0029 / PIPELINE_DOCS / 18-coverage-report.md template (architecture
stage) + CHANGELOG / CLAUDE.md / .env.example (this PR).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>