fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest

Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).

The change is additive, guarded by 5 independent kill-switches, never-raise, and
limited to self-hosting scope. Untouched byte-for-byte: STAGE_TRANSITIONS, the
QG_CHECKS registry, check_branch_mergeable name/semantics, machine-verdict keys,
and the DB schema. INV-4 (never push/force-push main) and the no-prod-restart
rule are preserved.

- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
  pytest in its own process group (start_new_session) and tree-kills the WHOLE
  group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
  merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
  Never-break fallback: with subprocess_tree_kill_enabled=False or on a
  non-POSIX host, the prior subprocess.run path is used.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
  other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
  re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
  red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
  already CI/tester/staging-validated); fail-safe runs the re-test on any
  uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
  reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
  ("ORCH-110","classify_retest_failure","src/merge_gate.py") to
  MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
  .env.example) updated in the same PR.
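
The D1 mechanism can be sketched as below. This is a minimal, POSIX-only
illustration, not the contents of src/proc_group.py: the ProcResult fields are
taken from the test in this commit, while the function signature, the grace_s
parameter, and the _signal_group helper are assumptions made for the example.

```python
import os
import signal
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProcResult:
    # Field names follow the test in this commit; the real class may differ.
    returncode: Optional[int]
    stdout: str
    stderr: str
    timed_out: bool


def _signal_group(pgid: int, sig: int) -> None:
    """Send sig to every process in the group; ignore an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_in_process_group(
    cmd: Sequence[str], timeout_s: float, grace_s: float = 2.0
) -> ProcResult:
    """Run cmd in its own session and kill the whole group on timeout."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, err = proc.communicate(timeout=timeout_s)
        return ProcResult(proc.returncode, out, err, timed_out=False)
    except subprocess.TimeoutExpired:
        pgid = proc.pid  # the session leader's pid is also the group id
        _signal_group(pgid, signal.SIGTERM)
        try:
            out, err = proc.communicate(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # Leader still alive, or a descendant still holds the pipes:
            # escalate to SIGKILL for the whole group.
            _signal_group(pgid, signal.SIGKILL)
            out, err = proc.communicate()
        return ProcResult(None, out or "", err or "", timed_out=True)


if __name__ == "__main__":
    # A child that sleeps past its budget is reported as timed out.
    slow = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_in_process_group(slow, timeout_s=1.0).timed_out)
```

Signalling the group rather than the single child is what prevents orphaned
workers: subprocess.run with a timeout kills only the direct child, so any
processes that child spawned can keep running and consuming CPU.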

Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
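
The timeout/red split from D2/D3 that the incident regression exercises might
look roughly like the sketch below. All names here (RetestOutcome,
RetestResult, classify, route, max_attempts) are hypothetical and chosen for
the example; treating pytest exit code 1 as "red" is an assumption, and the
handling of lock-busy/other is not described in this message, so the sketch
leaves it to the existing path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RetestOutcome(Enum):
    TIMEOUT = "timeout"
    RED = "red"
    LOCK_BUSY = "lock-busy"
    OTHER = "other"


@dataclass
class RetestResult:
    # Hypothetical summary of one local re-test attempt.
    returncode: Optional[int]
    timed_out: bool = False
    lock_busy: bool = False


def classify(result: RetestResult) -> RetestOutcome:
    """Map a re-test result to one of the four outcome classes."""
    if result.lock_busy:
        return RetestOutcome.LOCK_BUSY
    if result.timed_out:
        return RetestOutcome.TIMEOUT
    if result.returncode == 1:
        # pytest exits with 1 when tests ran and at least one failed.
        return RetestOutcome.RED
    return RetestOutcome.OTHER


def route(outcome: RetestOutcome, attempts: int, max_attempts: int) -> str:
    """Pick the next action; only the timeout and red paths are modelled."""
    if outcome is RetestOutcome.TIMEOUT:
        # Bounded re-queue: the task stays on deploy-staging until the
        # budget is exhausted, then a single infra alert is raised.
        return "infra-retry" if attempts < max_attempts else "infra-alert"
    if outcome is RetestOutcome.RED:
        return "rollback"
    return "existing-handling"


if __name__ == "__main__":
    timed_out = RetestResult(returncode=None, timed_out=True)
    print(route(classify(timed_out), attempts=0, max_attempts=2))
```

The point of the split is that a timeout carries no information about the
code under test, so it should never feed the developer-retry loop.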

Refs: ORCH-110

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:14:55 +03:00
committed by deployer
parent cf602b4810
commit 651b9af7c3
24 changed files with 1816 additions and 50 deletions

@@ -425,12 +425,14 @@ def test_tc14_real_measurement(tmp_path, monkeypatch):
 def test_tc14_measure_timeout_returns_none(monkeypatch):
-    import subprocess
+    # ORCH-110: measure_coverage now runs via proc_group.run_in_process_group
+    # (tree-kill on timeout). A timed_out ProcResult -> None (prior contract).
+    from src.proc_group import ProcResult
     monkeypatch.setattr(cg, "ensure_worktree", lambda r, b: "/tmp")
-    def _timeout(*a, **k):
-        raise subprocess.TimeoutExpired(cmd="pytest", timeout=1)
-    monkeypatch.setattr(cg.subprocess, "run", _timeout)
+    monkeypatch.setattr(
+        cg, "run_in_process_group",
+        lambda *a, **k: ProcResult(returncode=None, stdout="", stderr="", timed_out=True),
+    )
     assert cg.measure_coverage(_REPO, _BRANCH) is None