fix(merge-gate): SHA-in-main as sole merge-verify criterion + main regression guard
Root-cause fix for main erosion (phantom merge): code of ORCH-067/069 reached `done` while absent from origin/main (only their auto docs-PRs landed). - FR-1: verify_merged_to_main confirms merge ONLY by `git merge-base --is-ancestor <validated_sha> origin/main`; the OR-branch pr_already_merged is removed (a merged PR no longer confirms). Empty SHA / git error -> False. - FR-2: pr_already_merged demoted to merge_pr idempotency-guard; counts a PR only when merged & head.ref==<branch> & base.ref=="main" (explicit in-loop filter). - FR-3: merge_pr selects the open code-PR by head==<branch> AND base==main. - FR-5: new deterministic check_main_regression in _handle_merge_verify (after confirmed SHA-in-main, before done) verifies MAIN_REGRESSION_MARKERS still in origin/main; deterministic count==0 -> alert "main regressed" + HOLD (NOT done, no rollback); git error of the grep -> fail-open. Kill-switch ORCH_REGRESSION_GUARD_ENABLED; non-self -> no-op. - FR-4: root .gitattributes `CHANGELOG.md merge=union` so Unreleased edits auto-merge on rebase without conflict (branch not rolled back). Invariants unchanged (STAGE_TRANSITIONS, QG_CHECKS, deploy-status, merge-gate, image-freshness, DB schema, external HTTP API); non-self repos no-op (INV-5); never-raise (INV-1); merge only via Gitea PR-API (INV-2). Docs: CHANGELOG, .env.example (README/ADR updated by architect). Tests: tests/test_orch073_*.py (TC-01..18); existing merge-gate tests updated for the new code-PR filter. Refs: ORCH-073 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -396,6 +396,19 @@ class Settings(BaseSettings):
|
||||
merge_pr_timeout_s: int = 60
|
||||
merge_verify_timeout_s: int = 60
|
||||
|
||||
# ORCH-073 (ADR-001 Р-4): main-integrity regression guard. After the merge-verify
|
||||
# under-gate confirms the deployed SHA is an ancestor of origin/main (FR-1), a
|
||||
# secondary deterministic (no-LLM) guard checks that a declarative set of markers
|
||||
# for recently-merged tasks (MAIN_REGRESSION_MARKERS in merge_gate.py) is still
|
||||
# present in origin/main — i.e. a CHANGELOG-rebase or phantom-merge did not silently
|
||||
# roll back a neighbouring task's code. A missing marker (deterministic count==0) ->
|
||||
# ALERT + HOLD (task stays on `deploy`, NOT done); an infra/git error on the grep
|
||||
# itself -> fail-OPEN (do not block done; SHA-in-main remains the primary gate).
|
||||
# regression_guard_enabled -> kill-switch (env ORCH_REGRESSION_GUARD_ENABLED);
|
||||
# reuses the merge_verify_applies scope (self-hosting /
|
||||
# merge_verify_repos), so non-self repos are a no-op.
|
||||
regression_guard_enabled: bool = True
|
||||
|
||||
# Telegram notifications
|
||||
telegram_bot_token: str = ""
|
||||
telegram_chat_id: str = ""
|
||||
|
||||
@@ -445,25 +445,30 @@ def reclaim_stale_lease(repo: str) -> bool:
|
||||
# ORCH-065: idempotent merge finalization guard (Problem C)
|
||||
# ---------------------------------------------------------------------------
|
||||
def pr_already_merged(repo: str, branch: str) -> bool:
|
||||
"""Return True iff the PR for ``branch`` is ALREADY merged (ADR-001 Р-3, FR-3.2).
|
||||
"""Return True iff the **code-PR of ``branch``** is ALREADY merged (idempotency-guard).
|
||||
|
||||
A deterministic, read-only guard the merge path consults BEFORE attempting a
|
||||
(second) merge so a re-driven / reaped task is idempotent: an already-merged
|
||||
PR -> no-op, never a duplicate merge and never an error. This is the ONLY new
|
||||
merge-related helper and it does NOT merge — it only READS the PR state via
|
||||
the existing Gitea client, so it does not introduce duplicate merge logic.
|
||||
ORCH-073 ADR-001 Р-2 (FR-2): this is an **idempotency-guard for ``merge_pr``**, NOT
|
||||
a source of truth for ``done`` (the only proof of merge is SHA-in-main, FR-1 /
|
||||
``verify_merged_to_main``). It lets a re-driven / reaped ``merge_pr`` be idempotent:
|
||||
the code-PR is already merged -> no-op, never a duplicate merge.
|
||||
|
||||
Consultation point: the actual merge actor is the **deployer agent** (it merges
|
||||
the feature PR at the start of the ``deploy`` stage — see webhooks/gitea.py),
|
||||
so the wiring lives in the deployer prompt (``.openclaw/agents/deployer.md``),
|
||||
which runs this exact function before any (re-)merge. The merge-gate quality
|
||||
check (``qg.checks.check_branch_mergeable``) is intentionally NOT modified
|
||||
(ORCH-065 AC-13: ``check_*`` behaviour unchanged) — it runs on the FIRST
|
||||
deploy-staging -> deploy edge and does not re-run on a ``deploy``-stage re-drive,
|
||||
which is exactly where the second-merge risk lives.
|
||||
Root-cause fix (G4 audit): the previous implementation returned True for ANY
|
||||
``merged == True`` PR returned by ``GET /pulls?state=all&head=<branch>``. Gitea's
|
||||
``head`` query-param filters unreliably for a bare branch name, so auto docs-PRs
|
||||
(staging/deploy logs, ``head=docs/*``) leaked into the result and were counted as
|
||||
"merged" — the ORCH-067/069 phantom-merge. We now apply an EXPLICIT in-loop filter
|
||||
instead of trusting the query-param: a PR counts only when it carries the code of
|
||||
THIS feature-branch into ``main``:
|
||||
|
||||
* ``pr.merged is True`` AND
|
||||
* ``pr.head.ref == branch`` (the code of exactly this feature-branch) AND
|
||||
* ``pr.base.ref == "main"`` (target is main, not a docs/other base).
|
||||
|
||||
This excludes auto docs-PRs (different ``head.ref``) and PRs onto a non-``main``
|
||||
base, so a merged docs-PR can no longer make ``merge_pr`` skip a real code merge.
|
||||
|
||||
Queries Gitea ``GET /repos/{owner}/{repo}/pulls?state=all&head=<branch>`` and
|
||||
reports True when any matching PR has ``merged == True``. Never raises (AC-9):
|
||||
reports True only when a matching PR passes the filter above. Never raises (AC-9):
|
||||
any HTTP/parse error -> ``False`` (conservative: "not known-merged" lets the
|
||||
normal gate re-evaluate rather than silently skipping a real merge).
|
||||
"""
|
||||
@@ -479,7 +484,11 @@ def pr_already_merged(repo: str, branch: str) -> bool:
|
||||
if resp.status_code != 200:
|
||||
return False
|
||||
for pr in resp.json() or []:
|
||||
if pr.get("merged") is True:
|
||||
if (
|
||||
pr.get("merged") is True
|
||||
and pr.get("head", {}).get("ref") == branch
|
||||
and pr.get("base", {}).get("ref") == "main"
|
||||
):
|
||||
return True
|
||||
return False
|
||||
except Exception as e: # noqa: BLE001 - never-raise contract
|
||||
@@ -505,6 +514,7 @@ def pr_already_merged(repo: str, branch: str) -> bool:
|
||||
_MERGE_VERIFY_COUNTERS: dict = {
|
||||
"merge_verified_total": 0,
|
||||
"not_merged_alerts_total": 0,
|
||||
"main_regressed_alerts_total": 0, # ORCH-073 Р-4: regression-guard HOLD+alert count.
|
||||
"last_alert_wi": None,
|
||||
}
|
||||
|
||||
@@ -526,6 +536,15 @@ def note_not_merged_alert(work_item_id: str | None) -> None:
|
||||
pass
|
||||
|
||||
|
||||
def note_main_regressed_alert(work_item_id: str | None) -> None:
|
||||
"""Bump the 'main regressed (marker missing)' counter (ORCH-073 Р-4). Never raises."""
|
||||
try:
|
||||
_MERGE_VERIFY_COUNTERS["main_regressed_alerts_total"] += 1
|
||||
_MERGE_VERIFY_COUNTERS["last_alert_wi"] = work_item_id
|
||||
except Exception: # noqa: BLE001 - observability must never break a decision
|
||||
pass
|
||||
|
||||
|
||||
def merge_verify_status() -> dict:
|
||||
"""Snapshot of the merge-verify under-gate for GET /queue. Never raises."""
|
||||
try:
|
||||
@@ -534,6 +553,7 @@ def merge_verify_status() -> dict:
|
||||
"repos": settings.merge_verify_repos or "",
|
||||
"merge_verified_total": _MERGE_VERIFY_COUNTERS["merge_verified_total"],
|
||||
"not_merged_alerts_total": _MERGE_VERIFY_COUNTERS["not_merged_alerts_total"],
|
||||
"main_regressed_alerts_total": _MERGE_VERIFY_COUNTERS["main_regressed_alerts_total"],
|
||||
"last_alert_wi": _MERGE_VERIFY_COUNTERS["last_alert_wi"],
|
||||
}
|
||||
except Exception as e: # noqa: BLE001 - never-raise contract
|
||||
@@ -578,7 +598,10 @@ def merge_pr(repo: str, branch: str) -> tuple[bool, str]:
|
||||
Algorithm:
|
||||
1. ``pr_already_merged`` -> True -> no-op ``(True, "already-merged")`` (INV-5/AC-9).
|
||||
2. ``GET /repos/{owner}/{repo}/pulls?state=open`` -> the open PR whose head ref
|
||||
== ``branch`` -> its index. No open PR -> ``(False, "no open PR")``.
|
||||
== ``branch`` AND base ref == ``main`` -> its index. ORCH-073 ADR-001 Р-3
|
||||
(FR-3) adds the ``base == main`` filter so the actor merges exactly the
|
||||
feature code-PR and never an auto docs-PR / a PR onto a foreign base. No
|
||||
such open PR -> ``(False, "no open PR")``.
|
||||
3. ``POST /repos/{owner}/{repo}/pulls/{index}/merge`` (Do: ``merge``) ->
|
||||
200/201 -> ``(True, "merged PR #<n>")``; otherwise ``(False, "<reason>")``.
|
||||
|
||||
@@ -602,7 +625,10 @@ def merge_pr(repo: str, branch: str) -> tuple[bool, str]:
|
||||
return False, f"list PRs failed: HTTP {resp.status_code}"
|
||||
index = None
|
||||
for pr in resp.json() or []:
|
||||
if pr.get("head", {}).get("ref") == branch:
|
||||
if (
|
||||
pr.get("head", {}).get("ref") == branch
|
||||
and pr.get("base", {}).get("ref") == "main"
|
||||
):
|
||||
index = pr.get("number")
|
||||
break
|
||||
if index is None:
|
||||
@@ -631,26 +657,32 @@ def merge_pr(repo: str, branch: str) -> tuple[bool, str]:
|
||||
def verify_merged_to_main(repo: str, branch: str, sha: str) -> bool:
|
||||
"""Return True iff the deployed commit is confirmed merged into ``origin/main``.
|
||||
|
||||
Post-deploy verification (FR-2 / D4): the merge is confirmed when EITHER
|
||||
* ``pr_already_merged(repo, branch)`` is True (Gitea ``PR.merged == true``), OR
|
||||
* ``git merge-base --is-ancestor <sha> origin/main`` succeeds in the per-branch
|
||||
worktree (after ``git fetch origin main``), i.e. the validated SHA is an
|
||||
ancestor of the current ``origin/main``.
|
||||
Post-deploy verification — ORCH-073 ADR-001 Р-1 (FR-1): the merge is confirmed by
|
||||
the SINGLE, authoritative fact "the deployed commit IS an ancestor of the current
|
||||
``origin/main``":
|
||||
|
||||
* after ``git fetch origin main`` (in the per-branch worktree),
|
||||
``git merge-base --is-ancestor <sha> origin/main`` returns ``rc == 0``.
|
||||
|
||||
The former OR-branch ``pr_already_merged(repo, branch)`` was REMOVED: a merged
|
||||
``PR.merged == true`` is no longer sufficient to confirm a merge. That branch was
|
||||
the ORCH-067/069 phantom-merge root cause — an auto docs-PR (staging/deploy logs)
|
||||
counted as "merged" via the unreliable Gitea ``head`` query, turning merge-verify
|
||||
falsely GREEN while the code-PR was never merged. ``pr_already_merged`` now serves
|
||||
ONLY as an idempotency-guard inside ``merge_pr`` (Р-2/Р-3), never as proof of merge.
|
||||
|
||||
``sha`` is the validated commit (``image_freshness.validated_revision`` =
|
||||
worktree ``git rev-parse HEAD``). An empty ``sha`` makes the git branch
|
||||
inconclusive (only the PR-merged branch can then confirm).
|
||||
worktree ``git rev-parse HEAD``). An empty ``sha`` is inconclusive -> ``False``
|
||||
(fail-closed: alert + HOLD), since the SHA-in-main check cannot run without it.
|
||||
|
||||
Never-raise (INV-1/AC-7 / TC-04): any git/HTTP error -> ``False`` (= "not
|
||||
confirmed" -> fail-closed for ``done``: alert + HOLD). The exception is NEVER
|
||||
propagated into ``advance_stage``.
|
||||
"""
|
||||
try:
|
||||
if pr_already_merged(repo, branch):
|
||||
return True
|
||||
if not sha:
|
||||
logger.warning(
|
||||
"verify_merged_to_main: empty SHA for %s/%s and PR not known-merged",
|
||||
"verify_merged_to_main: empty SHA for %s/%s -> cannot confirm SHA-in-main",
|
||||
repo, branch,
|
||||
)
|
||||
return False
|
||||
@@ -675,3 +707,110 @@ def verify_merged_to_main(repo: str, branch: str, sha: str) -> bool:
|
||||
"verify_merged_to_main unexpected error for %s/%s: %s", repo, branch, e
|
||||
)
|
||||
return False
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# ORCH-073 (ADR-001 Р-4): main-integrity regression guard.
|
||||
#
|
||||
# A secondary, deterministic (no-LLM) guard that runs in `_handle_merge_verify`
|
||||
# AFTER the SHA-in-main check (verify_merged_to_main, FR-1) confirms the deployed
|
||||
# commit, and BEFORE the task is stamped `done`. It checks that a DECLARATIVE set
|
||||
# of markers for recently-merged tasks is still present in `origin/main` — i.e. a
|
||||
# CHANGELOG-rebase / phantom-merge did not silently roll back a neighbouring task's
|
||||
# code (the ORCH-067/069 failure mode, which SHA-in-main alone would not catch when
|
||||
# the deployed SHA itself IS in main but a sibling's code is gone).
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
# Declarative, append-only marker set (ADR-001 Р-4). Each future task that lands
|
||||
# significant code SHOULD append its own (task, marker_substring, path) row so the
|
||||
# guard protects it from a later phantom-merge / rebase rollback. Kept in code (not
|
||||
# DB / Plane — a non-goal) so it versions together with the fix it protects.
|
||||
MAIN_REGRESSION_MARKERS: list[tuple[str, str, str]] = [
|
||||
("ORCH-067", "plane_issue_link", "src/notifications.py"),
|
||||
("ORCH-069", "qg0_title_max", "src/config.py"),
|
||||
("ORCH-071", "verify_merged_to_main", "src/merge_gate.py"),
|
||||
("ORCH-073", "check_main_regression", "src/merge_gate.py"),
|
||||
]
|
||||
|
||||
|
||||
def check_main_regression(repo: str, branch: str) -> tuple[bool, str]:
|
||||
"""Verify the declarative marker set is still present in ``origin/main``.
|
||||
|
||||
ORCH-073 ADR-001 Р-4 (FR-5). For each ``(task, marker, path)`` in
|
||||
``MAIN_REGRESSION_MARKERS`` run ``git grep -c <marker> origin/main -- <path>`` in
|
||||
the per-branch worktree (after ``git fetch origin main``). A DETERMINISTIC count
|
||||
of ``0`` for any marker means a neighbouring task's code was rolled back ->
|
||||
regression.
|
||||
|
||||
Returns ``(ok, reason)``:
|
||||
* ``(True, "markers intact (<n>)")`` — every marker present -> proceed.
|
||||
* ``(False, "main regressed: <task> ...")`` — a marker is deterministically
|
||||
absent (count==0) -> caller HOLDs the task (NOT done) + alerts.
|
||||
|
||||
**Fail-OPEN on infra error** (intentional trade-off, ADR-001 Р-4): any git/OS
|
||||
error on the grep itself -> ``(True, "guard inconclusive: <reason>")`` so a flaky
|
||||
git never produces a false HOLD. "Regressed" is asserted ONLY on a deterministic
|
||||
``count == 0``, never on "could not determine". The PRIMARY fail-closed gate is
|
||||
SHA-in-main (FR-1); this marker-grep is a secondary, best-effort guard.
|
||||
|
||||
Never raises (INV-1): any unexpected error -> ``(True, "guard error: ...")``.
|
||||
"""
|
||||
try:
|
||||
try:
|
||||
wt = ensure_worktree(repo, branch)
|
||||
except Exception as e: # noqa: BLE001 - never-raise contract -> fail-open
|
||||
logger.warning(
|
||||
"check_main_regression: worktree error for %s/%s: %s (fail-open)",
|
||||
repo, branch, e,
|
||||
)
|
||||
return True, f"guard inconclusive: worktree error: {e}"
|
||||
|
||||
try:
|
||||
subprocess.run(
|
||||
["git", "-C", wt, "fetch", "origin", "main"],
|
||||
capture_output=True, timeout=settings.merge_verify_timeout_s,
|
||||
)
|
||||
except (subprocess.SubprocessError, OSError) as e:
|
||||
logger.warning(
|
||||
"check_main_regression: fetch error for %s/%s: %s (fail-open)",
|
||||
repo, branch, e,
|
||||
)
|
||||
return True, f"guard inconclusive: fetch error: {e}"
|
||||
|
||||
for task, marker, path in MAIN_REGRESSION_MARKERS:
|
||||
try:
|
||||
r = subprocess.run(
|
||||
["git", "-C", wt, "grep", "-c", marker, "origin/main", "--", path],
|
||||
capture_output=True, text=True, timeout=_SHORT_TIMEOUT,
|
||||
)
|
||||
except (subprocess.SubprocessError, OSError) as e:
|
||||
# Infra error on this marker -> fail-open (do NOT assert regression).
|
||||
logger.warning(
|
||||
"check_main_regression: grep error for %s (%s @ %s): %s (fail-open)",
|
||||
task, marker, path, e,
|
||||
)
|
||||
return True, f"guard inconclusive: grep error for {task}: {e}"
|
||||
# git grep exit codes: 0 = match(es) found, 1 = no match, >1 = real error.
|
||||
if r.returncode == 0:
|
||||
continue
|
||||
if r.returncode == 1:
|
||||
# Deterministic absence -> regression of a neighbouring task's code.
|
||||
logger.warning(
|
||||
"check_main_regression: marker MISSING in origin/main for %s "
|
||||
"(%s @ %s) -> main regressed", task, marker, path,
|
||||
)
|
||||
return False, f"main regressed: {task} code missing ({marker} @ {path})"
|
||||
# rc > 1 -> git error (e.g. bad path/ref) -> inconclusive -> fail-open.
|
||||
logger.warning(
|
||||
"check_main_regression: ambiguous git grep rc=%s for %s (%s @ %s) "
|
||||
"(fail-open)", r.returncode, task, marker, path,
|
||||
)
|
||||
return True, f"guard inconclusive: git grep rc={r.returncode} for {task}"
|
||||
|
||||
return True, f"markers intact ({len(MAIN_REGRESSION_MARKERS)})"
|
||||
except Exception as e: # noqa: BLE001 - never-raise contract -> fail-open
|
||||
logger.warning(
|
||||
"check_main_regression unexpected error for %s/%s: %s (fail-open)",
|
||||
repo, branch, e,
|
||||
)
|
||||
return True, f"guard error: {e}"
|
||||
|
||||
@@ -1277,6 +1277,50 @@ def _deploy_finalize_defer_count(task_id: int) -> int:
|
||||
return n
|
||||
|
||||
|
||||
def _hold_main_regressed(
|
||||
task_id, repo, work_item_id, branch, guard_msg: str, result: AdvanceResult
|
||||
) -> bool:
|
||||
"""HOLD the task because the regression guard found neighbouring code missing.
|
||||
|
||||
ORCH-073 Р-4 (FR-5 / AC-5): the deployed SHA IS in `main` (FR-1 passed) but a
|
||||
declarative marker of a recently-merged task is gone -> a phantom-merge / rebase
|
||||
rolled back sibling code. Reaction is ALERT-only + HOLD (Telegram + Plane
|
||||
``set_issue_blocked`` + comment), task stays on `deploy` (NOT done), NO rollback
|
||||
to development (an infra defect, not a code fault — symmetric to the not-merged
|
||||
HOLD). Returns ``True`` (INTERVENED). Never breaks the HOLD on a notify error.
|
||||
"""
|
||||
merge_gate.note_main_regressed_alert(work_item_id)
|
||||
msg = (
|
||||
f"main regressed: {guard_msg} (repo={repo}, branch={branch}, "
|
||||
f"wi={work_item_id}). Соседний код пропал из `main` — задача удержана на "
|
||||
f"`deploy` (НЕ done). Нужно ручное восстановление кода."
|
||||
)
|
||||
logger.warning(f"Task {task_id}: {msg}")
|
||||
if work_item_id:
|
||||
try:
|
||||
set_issue_blocked(work_item_id)
|
||||
except Exception as e: # noqa: BLE001 - never break the HOLD
|
||||
logger.warning(f"Task {task_id}: set_issue_blocked failed: {e}")
|
||||
try:
|
||||
plane_add_comment(
|
||||
work_item_id,
|
||||
"\U0001f6a8 Регресс `main`: " + guard_msg + ". Код соседней задачи "
|
||||
"пропал из `main`. Задача удержана на `deploy` (НЕ done) — нужно "
|
||||
"восстановить код и повторить approve.",
|
||||
author="deployer",
|
||||
)
|
||||
except Exception as e: # noqa: BLE001 - never break the HOLD
|
||||
logger.warning(f"Task {task_id}: plane regressed comment failed: {e}")
|
||||
try:
|
||||
send_telegram(f"\U0001f6a8 {msg}")
|
||||
except Exception as e: # noqa: BLE001 - never break the HOLD
|
||||
logger.warning(f"Task {task_id}: main-regressed telegram failed: {e}")
|
||||
result.alerted = True
|
||||
result.note = "main-regressed-hold"
|
||||
result.advanced = False
|
||||
return True
|
||||
|
||||
|
||||
def _handle_merge_verify(task_id, repo, work_item_id, branch, result: AdvanceResult) -> bool:
|
||||
"""ORCH-071 merge-verify under-gate on the `deploy -> done` edge.
|
||||
|
||||
@@ -1317,6 +1361,20 @@ def _handle_merge_verify(task_id, repo, work_item_id, branch, result: AdvanceRes
|
||||
|
||||
confirmed = merge_gate.verify_merged_to_main(repo, branch, sha)
|
||||
if confirmed:
|
||||
# ORCH-073 Р-4 (FR-5): secondary main-integrity regression guard. The
|
||||
# deployed SHA is in main (FR-1), but a CHANGELOG-rebase / phantom-merge
|
||||
# could still have rolled back a NEIGHBOURING task's code. Verify the
|
||||
# declarative marker set is intact; a deterministic miss -> HOLD + alert
|
||||
# (NOT done, no rollback — an infra defect, not a code fault). Fail-OPEN
|
||||
# on a git error of the guard itself (SHA-in-main remains the primary
|
||||
# gate). Honours the same scope/kill-switch as the under-gate.
|
||||
if settings.regression_guard_enabled:
|
||||
guard_ok, guard_msg = merge_gate.check_main_regression(repo, branch)
|
||||
if not guard_ok:
|
||||
return _hold_main_regressed(
|
||||
task_id, repo, work_item_id, branch, guard_msg, result
|
||||
)
|
||||
|
||||
merge_gate.note_merge_verified()
|
||||
try:
|
||||
self_deploy.record_merged_to_main(repo, work_item_id, branch, True)
|
||||
|
||||
Reference in New Issue
Block a user