fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest

Eliminate the false `deploy-staging -> development` rollback that fired
when the merge-gate local re-test timed out (infra/resource) on a green
CI + tester + staging branch (incident ORCH-109/PR #129: a 516.7s suite
blew its 600s budget under CPU starvation from orphaned pytest processes
-> timeout misrouted as a code fault -> developer-retry loop -> manual
gate).

Additive, 5 independent kill-switches, never-raise, self-hosting scope.
Untouched byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry,
check_branch_mergeable name/semantics, machine-verdict keys, the DB
schema. INV-4 (never push/force-push main) and the no-prod-restart rule
are preserved.

- D1: new stdlib-only leaf src/proc_group.py runs the spawned
  re-test/coverage pytest in its own process group (start_new_session)
  and tree-kills the WHOLE group on timeout (os.killpg
  SIGTERM->grace->SIGKILL); used by merge_gate.retest_branch and
  coverage_gate.measure_coverage. No orphan leak. Fallback never-break:
  subprocess_tree_kill_enabled=False / non-POSIX -> the prior
  subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes
  timeout/red/lock-busy/other; an infra timeout routes to
  _handle_merge_gate_infra_retry (bounded re-queue, task stays on
  deploy-staging, no rollback / no developer-retry); a red re-test /
  conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven
  no-op (HEAD already CI/tester/staging-validated); fail-safe runs the
  re-test on any uncertainty. Flag
  merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout
  validation; reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue;
  appended ("ORCH-110","classify_retest_failure","src/merge_gate.py") to
  MAIN_REGRESSION_MARKERS.

Docs (README/internals overview/CLAUDE/CHANGELOG/.env.example) updated
in the same PR.

Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the
red-before/green-after incident regression). Full suite green (1988
passed).

Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
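
For readers unfamiliar with the D1 technique, here is a minimal sketch of a process-group tree-kill using only the standard library. It is not the contents of src/proc_group.py: the function name `run_in_group`, its return shape, and the `grace_s` default are assumptions made for illustration. Only `start_new_session`, `os.killpg`, the SIGTERM -> grace -> SIGKILL order, and the `subprocess.run` fallback come from the commit message.

```python
import os
import signal
import subprocess
import sys


def run_in_group(cmd, timeout_s, grace_s=5.0):
    """Run cmd in its own process group; kill the whole group on timeout.

    Returns (returncode, timed_out). Hypothetical helper for illustration.
    """
    if os.name != "posix":
        # Fallback path: no process groups, so use plain subprocess.run.
        try:
            return subprocess.run(cmd, timeout=timeout_s).returncode, False
        except subprocess.TimeoutExpired:
            return None, True

    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)       # polite request to the whole group
        try:
            proc.wait(timeout=grace_s)        # grace period for the leader
        except subprocess.TimeoutExpired:
            pass
        os.killpg(pgid, signal.SIGKILL)       # anything still alive is killed
    except (ProcessLookupError, PermissionError):
        pass                                  # group already gone
    proc.wait()                               # reap the leader
    return proc.returncode, True


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_in_group(sleeper, timeout_s=0.5, grace_s=1.0))
```

Signalling the group rather than the single child is what reaches worker processes that pytest itself spawned, which is the orphan leak the incident describes.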
@@ -708,10 +708,38 @@ def check_branch_mergeable(repo: str, work_item_id: str, branch: str) -> tuple[b
        logger.info("check_branch_mergeable: %s up-to-date with main", branch)
        return True, "branch up-to-date with main"

    # ORCH-110 (D4): capture HEAD before/after the rebase to detect a no-op
    # rebase (branch already contained the latest origin/main).
    pre_sha = merge_gate.head_sha(repo, branch)
    ok, rb_reason = merge_gate.auto_rebase_onto_main(repo, branch)
    if not ok:
        merge_gate.release_merge_lease(repo, branch)
        return False, rb_reason # "rebase conflict: ..."
    post_sha = merge_gate.head_sha(repo, branch)

    # ORCH-110 (D4 / FR-4 / AC-6): re-test contract. ORCH-043 catches a
    # SEMANTIC merge conflict that can only arise when ``main`` actually moved
    # and the branch was really rebased onto new commits. When the rebase was
    # a PROVEN no-op (HEAD unchanged), there is no "moved main" -> the local
    # re-test re-checks exactly the commit CI + tester + staging already
    # validated on THIS HEAD -> it is a redundant single point of false
    # failure (the ORCH-109 timeout incident). Skip it ONLY on a proven no-op
    # (both SHAs non-empty AND equal); on ANY uncertainty (empty SHA / flag
    # off) the re-test runs (fail-safe to BR-6/AC-3). This extends to the
    # premerge_rebase_always=True path the same optimisation the
    # premerge_rebase_always=False not-behind short-circuit already has.
    if (
        bool(getattr(settings, "merge_retest_skip_when_current_enabled", False))
        and pre_sha and post_sha and pre_sha == post_sha
    ):
        logger.info(
            "check_branch_mergeable: %s rebase no-op (HEAD %s unchanged) -> "
            "re-test skipped (HEAD CI-validated)", branch, pre_sha[:8],
        )
        merge_gate.note_retest_skipped_current()
        return True, (
            "branch up-to-date (re-test skipped: rebase no-op, HEAD CI-validated)"
        )

    ok_t, t_reason = merge_gate.retest_branch(repo, branch)
    if ok_t:
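
The hunk stops where the re-test result is inspected. As a rough illustration of the D2/D3 split described in the commit message, the sketch below classifies a failed re-test and picks a route. It is not merge_gate.classify_retest_failure: the inputs (`timed_out`, `returncode`, `lock_busy`), the `route` helper, and the action strings are all hypothetical. The only outside fact it relies on is that pytest exits with code 1 when tests ran and some failed. The commit message does not say how lock-busy or other failures are routed, so the sketch leaves them unspecified.

```python
from enum import Enum
from typing import Optional


class RetestFailure(Enum):
    TIMEOUT = "timeout"
    RED = "red"
    LOCK_BUSY = "lock-busy"
    OTHER = "other"


def classify(timed_out: bool, returncode: Optional[int], lock_busy: bool) -> RetestFailure:
    """Map a failed re-test onto one of the four kinds (assumed inputs)."""
    if lock_busy:
        return RetestFailure.LOCK_BUSY
    if timed_out:
        return RetestFailure.TIMEOUT
    if returncode == 1:  # pytest: tests were run and some failed
        return RetestFailure.RED
    return RetestFailure.OTHER


def route(kind: RetestFailure, attempts: int, max_attempts: int) -> str:
    """Pick an action; the strings are illustrative labels only."""
    if kind is RetestFailure.TIMEOUT:
        if attempts < max_attempts:
            return "requeue"      # bounded re-queue, task stays on deploy-staging
        return "infra-alert"      # retries exhausted -> one infra alert
    if kind is RetestFailure.RED:
        return "rollback"         # a genuinely red re-test still rolls back
    return "unspecified"          # routing not stated in the commit message


if __name__ == "__main__":
    kind = classify(timed_out=True, returncode=None, lock_busy=False)
    print(kind.value, route(kind, attempts=0, max_attempts=2))
```

Keeping classification separate from routing means the retry bound can change without touching how a timeout is told apart from a red run.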