Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).

Additive, 5 independent kill-switches, never-raise, self-hosting scope.
Untouched byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry,
check_branch_mergeable name/semantics, machine-verdict keys, the DB schema.
INV-4 (never push/force-push main) and the no-prod-restart rule are preserved.

- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
  pytest in its own process group (start_new_session) and tree-kills the WHOLE
  group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
  merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
  Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
  prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes
  timeout/red/lock-busy/other; an infra timeout routes to
  _handle_merge_gate_infra_retry (bounded re-queue, task stays on
  deploy-staging, no rollback / no developer-retry); a red re-test / conflict
  still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
  already CI/tester/staging-validated); fail-safe runs the re-test on any
  uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
  reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
  ("ORCH-110","classify_retest_failure","src/merge_gate.py") to
  MAIN_REGRESSION_MARKERS.

Docs (README/internals overview/CLAUDE/CHANGELOG/.env.example) updated in the
same PR.

Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).

Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
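The D1 tree-kill described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the contents of src/proc_group.py: the helper name `run_in_group`, its parameters and its return shape are hypothetical, and it assumes a POSIX host (the commit itself falls back to the prior subprocess.run elsewhere).

```python
import os
import signal
import subprocess
import sys


def _signal_group(pgid, sig):
    """Send sig to every process in the group; ignore an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_in_group(cmd, timeout_s, grace_s=2.0):
    """Hypothetical helper: run cmd in its own session and tree-kill on timeout.

    Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child a session and process-group
    # leader, so its process-group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    pgid = proc.pid
    _signal_group(pgid, signal.SIGTERM)      # polite request to the whole group
    try:
        proc.wait(timeout=grace_s)           # grace period for a clean exit
    except subprocess.TimeoutExpired:
        pass
    _signal_group(pgid, signal.SIGKILL)      # sweep up anything still alive
    proc.wait()                              # reap the leader if not yet reaped
    return proc.returncode, True


if __name__ == "__main__":
    # A sleeper that would outlive a plain single-process kill of its parent.
    print(run_in_group([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

Signalling the group rather than the single child is what prevents orphaned workers from surviving the timeout and competing for CPU with the next run.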
65 lines
2.4 KiB
Python
"""ORCH-110 TC-03: classify_retest_failure distinguishes infra-timeout from red.
|
|
|
|
Covers D2 (FR-1 / AC-2): the pure predicate that lets the engine route an INFRA
|
|
re-test timeout differently from a deterministically RED re-test, WITHOUT changing
|
|
the name / PASS-FAIL semantics of the registered ``check_branch_mergeable``.
|
|
|
|
The critical scope guard: an ``auto_rebase_onto_main`` "rebase timeout" is a
|
|
DIFFERENT timeout (git hung) and must NOT be classified as the infra-tolerated
|
|
re-test timeout (it stays on the rollback path).
|
|
"""
|
|
import os
|
|
import tempfile
|
|
|
|
os.environ.setdefault("ORCH_DB_PATH", os.path.join(tempfile.gettempdir(), "test_orch110_classify.db"))
|
|
os.environ.setdefault("ORCH_GITEA_TOKEN", "test-token")
|
|
os.environ.setdefault("ORCH_PLANE_API_TOKEN", "test-token")
|
|
|
|
import pytest # noqa: E402
|
|
|
|
from src import merge_gate # noqa: E402
|
|
|
|
classify = merge_gate.classify_retest_failure
|
|
|
|
|
|
@pytest.mark.parametrize(
|
|
"reason,expected",
|
|
[
|
|
("re-test timeout after 900s", "timeout"),
|
|
("re-test timeout after 600s", "timeout"),
|
|
("re-test failed after rebase: 1 failed, 5 passed", "red"),
|
|
("re-test failed: ...AssertionError\n1 failed", "red"),
|
|
("merge-lock busy", "lock-busy"),
|
|
("rebase conflict: src/db.py", "other"),
|
|
# SCOPE GUARD: a git "rebase timeout" is NOT the infra-tolerated re-test
|
|
# timeout — it must stay on the rollback path (ADR D2).
|
|
("rebase timeout", "other"),
|
|
("push --force-with-lease failed: ...", "other"),
|
|
("", "other"),
|
|
],
|
|
)
|
|
def test_tc03_classify_reasons(reason, expected):
|
|
assert classify(reason) == expected
|
|
|
|
|
|
def test_tc03_classify_never_raises_on_bad_input():
|
|
# None / non-str must degrade to the safe "other" (-> rollback), never raise.
|
|
assert classify(None) == "other"
|
|
assert classify(12345) == "other"
|
|
|
|
|
|
def test_tc03_case_insensitive():
|
|
assert classify("RE-TEST TIMEOUT AFTER 900S") == "timeout"
|
|
assert classify("Merge-Lock Busy") == "lock-busy"
|
|
|
|
|
|
def test_tc03_distinct_from_lock_busy_and_conflict():
|
|
"""timeout is a distinct class from the existing defer (lock-busy) and rollback
|
|
(conflict) reasons — the three must never collide."""
|
|
classes = {
|
|
classify("re-test timeout after 900s"),
|
|
classify("merge-lock busy"),
|
|
classify("rebase conflict: x"),
|
|
}
|
|
assert classes == {"timeout", "lock-busy", "other"}
|
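For readers without src/merge_gate.py at hand, here is one way a predicate satisfying the table above could look. It is a sketch only: the function name `classify_retest_failure_sketch` is hypothetical, and the prefix and substring matching rules are assumptions inferred from the test cases, not the real implementation.

```python
def classify_retest_failure_sketch(reason):
    """Hypothetical stand-in: map a merge-gate failure reason to a class.

    Returns one of "timeout", "red", "lock-busy", "other". Never raises;
    anything unrecognised degrades to "other" (the rollback path).
    """
    if not isinstance(reason, str):
        return "other"
    text = reason.strip().lower()
    if "merge-lock busy" in text:
        return "lock-busy"
    # Only the re-test's own timeout is infra-tolerated. A git "rebase timeout"
    # does not start with "re-test", so it falls through to "other".
    if text.startswith("re-test timeout"):
        return "timeout"
    if text.startswith("re-test failed"):
        return "red"
    return "other"


if __name__ == "__main__":
    for sample in ("re-test timeout after 900s", "rebase timeout", None):
        print(repr(sample), "->", classify_retest_failure_sketch(sample))
```

Anchoring on the "re-test" prefix rather than searching for the word "timeout" anywhere is what keeps the scope guard intact in this sketch.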