fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest

Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).

Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.

- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
  pytest in its own process group (start_new_session) and tree-kills the WHOLE
  group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
  merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
  Never-break fallback: with subprocess_tree_kill_enabled=False, or on a
  non-POSIX host, both callers use the prior subprocess.run path.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
  other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
  re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
  red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
  already CI/tester/staging-validated); fail-safe runs the re-test on any
  uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
  reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
  ("ORCH-110","classify_retest_failure","src/merge_gate.py") to
  MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
  .env.example) updated in the same PR.
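
The D1 tree-kill can be sketched roughly as below. This is a minimal illustration, not the contents of src/proc_group.py: the function name `run_in_process_group`, the helper `_signal_group`, and the 5-second default grace are hypothetical. Only `start_new_session`, `os.killpg`, and the SIGTERM -> grace -> SIGKILL order come from the description above. It is POSIX-only.

```python
import os
import signal
import subprocess
import sys


def _signal_group(pgid, sig):
    """Send sig to every process in the group; ignore an already-gone group."""
    try:
        os.killpg(pgid, sig)
    except (ProcessLookupError, PermissionError):
        # No such group any more (or only zombies left on some platforms).
        pass


def run_in_process_group(cmd, timeout_s, grace_s=5.0):
    """Run cmd as the leader of a new session/process group.

    Returns (returncode, timed_out). On timeout the WHOLE group is signalled,
    so grandchildren spawned by cmd do not survive as orphans.
    """
    # start_new_session=True calls setsid() in the child, so the child's pid
    # is also the id of a fresh process group.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    _signal_group(proc.pid, signal.SIGTERM)      # polite request first
    try:
        proc.wait(timeout=grace_s)               # give the leader time to exit
    except subprocess.TimeoutExpired:
        pass
    _signal_group(proc.pid, signal.SIGKILL)      # sweep any survivors
    proc.wait()                                  # reap the leader
    return proc.returncode, True


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_in_process_group(sleeper, timeout_s=0.5, grace_s=1.0))
```

A caller would treat `timed_out=True` as the signal to classify the failure as a timeout rather than inspect the return code.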

Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
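
The routing that the incident regression exercises (D2/D3) might look like the sketch below. It is an assumption-laden illustration, not the code in src/merge_gate.py: the `RetestFailure` enum, the `classify` and `route` signatures, and the returned action strings are hypothetical, and sending the "other" class to rollback is a guess. The four failure classes, the bounded infra-retry, the single alert on exhaustion, and the kill-switch behaviour follow the description above.

```python
from enum import Enum
from typing import Optional


class RetestFailure(Enum):
    TIMEOUT = "timeout"
    RED = "red"
    LOCK_BUSY = "lock-busy"
    OTHER = "other"


def classify(timed_out: bool, lock_busy: bool, returncode: Optional[int]) -> RetestFailure:
    """Map a failed re-test outcome onto one of the four failure classes."""
    if lock_busy:
        return RetestFailure.LOCK_BUSY
    if timed_out:
        return RetestFailure.TIMEOUT
    if returncode == 1:
        # pytest exits with 1 when tests ran and at least one failed.
        return RetestFailure.RED
    return RetestFailure.OTHER


def route(failure: RetestFailure, infra_retries_used: int,
          max_retries: int = 2, tolerance_enabled: bool = True) -> str:
    """Pick the next action; the task leaves deploy-staging only on rollback."""
    if failure is RetestFailure.TIMEOUT and tolerance_enabled:
        if infra_retries_used < max_retries:
            return "infra-retry"   # bounded re-queue, no developer retry burned
        return "infra-alert"       # budget exhausted: alert once, stop looping
    if failure is RetestFailure.LOCK_BUSY:
        return "defer"             # lock held elsewhere: re-run the gate later
    # Red re-test, timeout with the kill-switch off, and (assumed) "other".
    return "rollback"


if __name__ == "__main__":
    failure = classify(timed_out=True, lock_busy=False, returncode=None)
    print(failure.value, route(failure, infra_retries_used=0))
```

With the tolerance flag off, a timeout falls through to the rollback branch, which matches the pre-ORCH-110 behaviour the flag is meant to restore.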

Refs: ORCH-110

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:14:55 +03:00
committed by deployer
parent cf602b4810
commit 651b9af7c3
24 changed files with 1816 additions and 50 deletions


@@ -202,18 +202,50 @@ class Settings(BaseSettings):
# only the self-hosting repo (orchestrator). Other
# repos -> conditional no-op (mirrors ORCH-35 staging).
# merge_retest_timeout_s -> wall-clock budget for the post-rebase re-test.
# ORCH-110 (D5): raised 600 -> 900 (74% headroom over the observed 516.7s
# suite vs the prior ~16%). Cross-invariant (ORCH-065/109): keep
# reaper_max_running_s (5400) > Σ(deploy-staging gate-work) + grace — see
# docs/work-items/ORCH-110/07-infra-requirements.md.
# merge_retest_target -> pytest target for the re-test (portability across repos).
# merge_lock_timeout_s -> max lease age; an older lease is reclaimed (crash backstop).
# merge_defer_delay_s -> delay before re-running the gate when the lock is busy.
# merge_defer_max_attempts -> defer retries before escalation (avoids livelock).
merge_gate_enabled: bool = True
merge_gate_repos: str = ""
-merge_retest_timeout_s: int = 600
+merge_retest_timeout_s: int = 900
merge_retest_target: str = "tests/"
merge_lock_timeout_s: int = 300
merge_defer_delay_s: int = 60
merge_defer_max_attempts: int = 5
# ORCH-110: merge-gate re-test infra-timeout tolerance + tree-kill of the
# orchestrator-spawned pytest subprocess (re-test + coverage). Each default =
# the desired prod behaviour (ORCH-101 canon); each flag is an INDEPENDENT
# kill-switch (off -> byte-for-byte pre-ORCH-110). Detailed ADR:
# docs/work-items/ORCH-110/06-adr/ADR-001-merge-gate-retest-infra-tolerance-and-tree-kill.md.
# subprocess_tree_kill_enabled -> D1: spawn the re-test / coverage
# pytest in its own process group and tree-kill the WHOLE group on timeout
# (no orphan grandchildren grinding the host CPU). off -> the prior
# subprocess.run(timeout=) (ORCH_SUBPROCESS_TREE_KILL_ENABLED).
# merge_retest_infra_tolerance_enabled -> D3: a re-test TIMEOUT is treated as a
# transient infra fault (bounded infra-retry, NOT a code-fault rollback to
# development that burns a developer retry). off -> timeout = the prior
# rollback (ORCH_MERGE_RETEST_INFRA_TOLERANCE_ENABLED).
# merge_retest_infra_max_retries -> D3: infra-retry budget before an
# infra-alert (anti-loop). (ORCH_MERGE_RETEST_INFRA_MAX_RETRIES)
# merge_retest_infra_retry_delay_s -> D3: delay before the staging-deployer
# re-run. (ORCH_MERGE_RETEST_INFRA_RETRY_DELAY_S)
# merge_retest_skip_when_current_enabled -> D4: skip the local re-test when the
# pre-merge rebase was a PROVEN no-op (branch already at origin/main; HEAD is
# exactly the CI/tester/staging-validated commit). off -> always re-test
# after rebase (ORCH_MERGE_RETEST_SKIP_WHEN_CURRENT_ENABLED).
# The tree-kill grace reuses the existing agent_kill_grace_seconds (no new key).
subprocess_tree_kill_enabled: bool = True
merge_retest_infra_tolerance_enabled: bool = True
merge_retest_infra_max_retries: int = 2
merge_retest_infra_retry_delay_s: int = 120
merge_retest_skip_when_current_enabled: bool = True
# ORCH-036: executable self-deploy (deploy stage drives the host hook).
# The `deploy` stage for the self-hosting repo is turned into a REAL prod
# restart via a detached host process, gated by a manual approve. Three-phase