fix(deploy): terminal-window-aware guard so done tasks hold Done in Plane (ORCH-094)

A task with DB stage=done and 0 active jobs flapped in Plane between
`Awaiting Deploy` and `Monitoring after Deploy` instead of holding `Done`
(verified live on ORCH-061, task 47): the three deploy-phase setters were
terminal-blind, so any stale/duplicate/unknown caller under the bot token
re-stamped an intermediate status over the terminal Done, forever.

- New leaf src/deploy_status_guard.py (pure, never-raise, config-gated):
  decide() -> ALLOW | CONVERGE_DONE | SUPPRESS on the entry of
  set_issue_awaiting_deploy / set_issue_deploying / set_issue_monitoring.
  A deploy-phase status is legitimate iff the task is non-terminal OR
  (done AND post-deploy window active); otherwise done converges to Done
  idempotently, cancelled is suppressed (FR-2, D1/D2).
- D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in
  advance_stage so window_active is True when the legitimate first
  Monitoring is set (the task is already DB-done by then); a re-drive after
  the window closes converges to Done.
- D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when
  the task became cancelled mid-window (zombie-tick guard, FR-3).
- D5: additive `reason` kwarg on the three setters + one structured log
  line per verdict (work_item/caller/target/db_stage/window_active/verdict);
  new read-only db.get_task_by_work_item_id; post_deploy.window_active
  helper.
- Flags deploy_status_guard_enabled (kill-switch -> 1:1) /
  deploy_status_guard_repos (CSV; empty = self-hosting only).

STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema
untouched (reads existing tasks.stage).

Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the
reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve.
Full regress green (1413).

Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md (status ->
implemented), .env.example.

Refs: ORCH-094
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:31:30 +03:00
parent db4dd275e4
commit a46dcbcab3
18 changed files with 1088 additions and 25 deletions
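
The verdict rule described in the commit message can be restated as a small pure function. This is only a sketch under assumptions: `sketch_decide` is a hypothetical name, it takes the DB stage and window state as plain arguments instead of reading them from the database or sentinels, and it leaves out the kill-switch, repo-scope, and fail-safe steps of the real `decide()`.

```python
from typing import Optional

ALLOW, CONVERGE_DONE, SUPPRESS = "ALLOW", "CONVERGE_DONE", "SUPPRESS"


def sketch_decide(db_stage: Optional[str], target: str, window_active: bool) -> str:
    """Hypothetical restatement of the verdict rule (not the project's function)."""
    stage = (db_stage or "").strip()
    if stage not in ("done", "cancelled"):
        return ALLOW            # live deploy cycle: the setter proceeds as before
    if stage == "cancelled":
        return SUPPRESS         # never stamp over a cancelled terminal
    # stage == "done": only Monitoring inside an open window is legitimate
    if target == "monitoring" and window_active:
        return ALLOW
    return CONVERGE_DONE        # anything else converges to Done


if __name__ == "__main__":
    for args in [("deploy", "awaiting", False),
                 ("done", "monitoring", True),
                 ("done", "awaiting", True),
                 ("cancelled", "monitoring", True)]:
        print(args, "->", sketch_decide(*args))
```

Keeping the rule free of I/O in this form makes the cases easy to enumerate in a table-driven test.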
--- a/src/config.py
+++ b/src/config.py
@@ -649,6 +649,32 @@ class Settings(BaseSettings):
    stop_status_enabled: bool = True
    stop_status_repos: str = ""

+    # ORCH-094: terminal-window-aware guard for deploy-phase Plane status setters.
+    # A task with DB stage='done' (and 0 active jobs) was flapping in Plane between
+    # `Awaiting Deploy` and `Monitoring after Deploy` instead of holding `Done`,
+    # because the three deploy-phase setters (set_issue_awaiting_deploy /
+    # set_issue_deploying / set_issue_monitoring) are terminal-blind: any stale /
+    # duplicate / unknown caller under the bot token re-stamps an intermediate
+    # deploy status over the terminal Done. ORCH-094 puts a single low choke-point
+    # guard on the entry of those three setters (leaf src/deploy_status_guard.py):
+    # for a task whose DB stage is terminal it converges to Done idempotently
+    # (CONVERGE_DONE), EXCEPT the legitimate post-deploy `Monitoring` while the
+    # window is still active (ARMED & not DONE). Additive, never-raise; reads the
+    # existing tasks.stage (no migration); STAGE_TRANSITIONS / QG_CHECKS /
+    # machine-verdict keys are NOT touched. See
+    # docs/work-items/ORCH-094/06-adr/ADR-001-terminal-window-aware-deploy-status-guard.md
+    # and the cross-cutting docs/architecture/adr/adr-0028-…md.
+    #   deploy_status_guard_enabled -> kill-switch (env ORCH_DEPLOY_STATUS_GUARD_ENABLED).
+    #                                  False -> the setters are terminal-blind, behaviour
+    #                                  strictly 1:1 as before ORCH-094 (zero regression).
+    #   deploy_status_guard_repos   -> CSV scope (env ORCH_DEPLOY_STATUS_GUARD_REPOS).
+    #                                  Empty -> applies ONLY to the self-hosting repo
+    #                                  (orchestrator), where deploy-phase statuses are set
+    #                                  at all; non-empty -> only the listed repos. Tokens
+    #                                  are sanitised (^[A-Za-z0-9._-]+$) by the guard leaf.
+    deploy_status_guard_enabled: bool = True
+    deploy_status_guard_repos: str = ""
+
    # ORCH-073 (ADR-001 Р-4): main-integrity regression guard. After the merge-verify
    # under-gate confirms the deployed SHA is an ancestor of origin/main (FR-1), a
    # secondary deterministic (no-LLM) guard checks that a declarative set of markers
--- a/src/db.py
+++ b/src/db.py
@@ -223,6 +223,28 @@ def get_task_by_plane_id(plane_id: str) -> dict | None:
    return None


+def get_task_by_work_item_id(work_item_id: str) -> dict | None:
+    """ORCH-094: read-only lookup of the live task row by human-readable
+    ``work_item_id`` (e.g. ``"ORCH-061"``).
+
+    ``get_task_by_plane_id`` matches the Plane UUIDs (``plane_id`` /
+    ``plane_issue_id``), not the human-readable ``work_item_id`` the deploy-phase
+    setters receive — hence this thin accessor. A live row matches exactly; the
+    ORCH-090 cancel tombstones carry a ``#cancelled-<id>`` suffix on
+    ``work_item_id`` so they never collide with a clean id. No schema change.
+    """
+    if not work_item_id:
+        return None
+    conn = get_db()
+    try:
+        row = conn.execute(
+            "SELECT * FROM tasks WHERE work_item_id = ?", (work_item_id,)
+        ).fetchone()
+    finally:
+        conn.close()
+    return dict(row) if row else None
+
+
 def get_task_by_repo_branch(repo: str, branch: str) -> dict | None:
    """Find task by repo and branch name."""
    conn = get_db()
--- a/src/deploy_status_guard.py
+++ b/src/deploy_status_guard.py
@@ -0,0 +1,191 @@
+"""ORCH-094: terminal-window-aware guard for deploy-phase Plane status setters.
+
+Leaf module — pure, never-raise, config-gated logic over the existing ``tasks``
+table and the restart-safe post-deploy sentinels. Mirrors the leaf pattern of
+``src/serial_gate.py`` / ``src/labels.py`` / ``src/cancel.py``: it imports only
+``config`` (and lazily ``db`` / ``post_deploy`` / ``qg.checks``), never
+``plane_sync`` / ``stage_engine`` — the setters that need a verdict call
+:func:`decide`, they do not live here.
+
+The bug (verified live on ORCH-061, task 47, done since 07.06): a task with DB
+``stage='done'`` and no active job flaps in Plane between ``Awaiting Deploy`` and
+``Monitoring after Deploy`` instead of holding ``Done``. The three deploy-phase
+setters (``set_issue_awaiting_deploy`` / ``set_issue_deploying`` /
+``set_issue_monitoring``) are **terminal-blind**: any stale / duplicate / unknown
+caller under the bot token re-stamps an intermediate deploy status over the
+terminal Done, and the pendulum never settles.
+
+The fix is a single low choke-point on the entry of those three setters. For a
+task whose DB stage is terminal the verdict converges to ``Done`` idempotently,
+EXCEPT the one legitimate case: the post-deploy ``Monitoring`` status while the
+observation window is still active (``post_deploy.window_active`` — ARMED & not
+DONE). The deploy ``Awaiting``/``Deploying`` statuses are ALWAYS spurious for a
+``done`` task (Phase A/B happen strictly BEFORE ``deploy -> done``).
+
+Key invariant (ADR-001 D2): a deploy-phase status is legitimate iff the task is
+non-terminal OR (``done`` AND the post-deploy window is active); otherwise the
+verdict is idempotent convergence to ``Done`` (for ``done``) / suppression (for
+``cancelled``).
+
+never-raise contract (self-hosting safety): any error / inability to determine
+the DB stage degrades to ``ALLOW`` (fail-safe to the prior 1:1 behaviour, NFR-1)
+— a local SQLite read is reliable, so in the normal case the stage is read and
+the pendulum cannot arise.
+"""
+from __future__ import annotations
+
+import logging
+import re
+
+from .config import settings
+
+logger = logging.getLogger("orchestrator.deploy_status_guard")
+
+# Verdicts returned by decide() (the setter executes them).
+ALLOW = "ALLOW"  # PATCH the requested deploy-phase status (normal path).
+CONVERGE_DONE = "CONVERGE_DONE"  # set_issue_done instead (idempotent convergence).
+SUPPRESS = "SUPPRESS"  # do nothing (do not stamp over a `cancelled` terminal).
+
+# Deploy-phase target tokens (one per guarded setter).
+AWAITING = "awaiting"
+DEPLOYING = "deploying"
+MONITORING = "monitoring"
+
+# Terminal DB stages (harmonised with serial_gate / adr-0026).
+_TERMINAL = ("done", "cancelled")
+
+# Repo tokens embedded into config CSV must match this (mirrors serial_gate R-6).
+_REPO_TOKEN = re.compile(r"^[A-Za-z0-9._-]+$")
+
+
+# ---------------------------------------------------------------------------
+# Conditionality (mirrors post_deploy_applies / _merge_gate_applies)
+# ---------------------------------------------------------------------------
+def _scope_repos() -> set[str]:
+    """Sanitised set of in-scope repo tokens from ``deploy_status_guard_repos``.
+
+    Empty/blank CSV -> empty set, meaning "self-hosting only" (resolved by the
+    caller via :func:`applies`). Invalid tokens (regex miss) are dropped. Never
+    raises.
+    """
+    try:
+        raw = (settings.deploy_status_guard_repos or "").strip()
+    except Exception:  # noqa: BLE001
+        return set()
+    if not raw:
+        return set()
+    out: set[str] = set()
+    for tok in raw.split(","):
+        t = tok.strip()
+        if t and _REPO_TOKEN.match(t):
+            out.add(t)
+        elif t:
+            logger.warning("deploy_status_guard: dropping invalid repo token %r", t)
+    return out
+
+
+def applies(repo: str) -> bool:
+    """Whether the guard is REAL for this repo (D6).
+
+      * ``deploy_status_guard_enabled=False`` -> always False (kill-switch; the
+        setters are terminal-blind, 1:1 as before ORCH-094).
+      * ``deploy_status_guard_repos`` (CSV) non-empty -> real only for listed repos.
+      * empty CSV -> real ONLY for the self-hosting repo (``orchestrator``), where
+        deploy-phase statuses are set at all. Mirrors the ORCH-35/36/43/58
+        self-hosting-only rollout -> non-self repos (enduro-trails) are untouched
+        (they never see Awaiting/Deploying/Monitoring; terminal-sync goes straight
+        to Done), i.e. zero regression.
+    Never raises -> False on error (degrade to "guard inert").
+    """
+    try:
+        if not getattr(settings, "deploy_status_guard_enabled", False):
+            return False
+        scope = _scope_repos()
+        if scope:
+            return (repo or "").strip() in scope
+        # Lazy import keeps this module a leaf (avoid importing qg at load time).
+        from .qg.checks import is_self_hosting_repo
+        return is_self_hosting_repo(repo)
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("deploy_status_guard.applies error for %s: %s", repo, e)
+        return False
+
+
+# ---------------------------------------------------------------------------
+# Verdict (the single predicate — ADR-001 D2)
+# ---------------------------------------------------------------------------
+def decide(work_item_id: str, target_status: str, reason: str | None = None) -> str:
+    """Decide what a deploy-phase setter should do for ``work_item_id`` (D2).
+
+    Returns one of :data:`ALLOW` / :data:`CONVERGE_DONE` / :data:`SUPPRESS`.
+    Steps (ADR-001 D2):
+
+      1. kill-switch off                          -> ALLOW (behaviour 1:1).
+      2. task not found                           -> ALLOW (foreign/unknown issue).
+      3. guard not applicable for the repo        -> ALLOW (non-self / out-of-scope).
+      4. DB stage non-terminal                    -> ALLOW (live deploy cycle, AC-4).
+      5. DB stage == 'cancelled'                  -> SUPPRESS (do not stamp over it).
+      6. DB stage == 'done':
+           * target == 'monitoring' AND window active -> ALLOW (legit post-deploy).
+           * otherwise                                -> CONVERGE_DONE.
+      7. any exception / undeterminable stage     -> ALLOW (fail-safe, NFR-1).
+
+    Always emits exactly one structured observability line (FR-4 / D5): work_item,
+    caller (``reason``), target_status, db_stage, window_active, verdict.
+    """
+    db_stage = None
+    window = None
+    verdict = ALLOW
+    try:
+        if not getattr(settings, "deploy_status_guard_enabled", False):
+            return ALLOW  # step 1 (logged in finally)
+
+        from . import db
+        task = db.get_task_by_work_item_id(work_item_id)
+        if task is None:
+            return ALLOW  # step 2
+
+        repo = task.get("repo")
+        if not applies(repo):
+            return ALLOW  # step 3
+
+        db_stage = (task.get("stage") or "").strip()
+        if db_stage not in _TERMINAL:
+            verdict = ALLOW  # step 4 — non-terminal: legit working deploy cycle
+            return verdict
+
+        if db_stage == "cancelled":
+            verdict = SUPPRESS  # step 5
+            return verdict
+
+        # step 6 — db_stage == 'done'
+        if target_status == MONITORING:
+            from . import post_deploy
+            window = post_deploy.window_active(repo, work_item_id)
+            if window:
+                verdict = ALLOW
+                return verdict
+        verdict = CONVERGE_DONE
+        return verdict
+    except Exception as e:  # noqa: BLE001 - never-raise; fail-safe to ALLOW
+        logger.warning(
+            "deploy_status_guard.decide error for %s (target=%s) -> ALLOW: %s",
+            work_item_id, target_status, e,
+        )
+        verdict = ALLOW
+        return verdict
+    finally:
+        # FR-4 / D5: one structured line per call. Convergence/suppression is the
+        # interesting case; log it at WARNING so a future flap is easy to attribute.
+        try:
+            msg = (
+                "deploy_status_guard: work_item=%s caller=%s target=%s db_stage=%s "
+                "window_active=%s verdict=%s"
+            )
+            argv = (work_item_id, reason, target_status, db_stage, window, verdict)
+            if verdict == ALLOW:
+                logger.info(msg, *argv)
+            else:
+                logger.warning(msg, *argv)
+        except Exception:  # noqa: BLE001 - logging must never raise
+            pass
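
The CSV scope sanitisation used by this module can be tried in isolation. The sketch below is an approximation, not the module's code: `parse_scope` is a hypothetical stand-in for `_scope_repos`, it takes the raw string as an argument instead of reading `settings`, and it drops invalid tokens silently rather than logging them.

```python
import re
from typing import Optional, Set

# Same pattern as in the diff: letters, digits, dot, underscore, hyphen.
_REPO_TOKEN = re.compile(r"^[A-Za-z0-9._-]+$")


def parse_scope(raw: Optional[str]) -> Set[str]:
    """Hypothetical helper: split a CSV string and keep only valid repo tokens."""
    out: Set[str] = set()
    for tok in (raw or "").split(","):
        t = tok.strip()
        if t and _REPO_TOKEN.match(t):
            out.add(t)
    return out


if __name__ == "__main__":
    print(sorted(parse_scope("orchestrator, enduro-trails,, bad token!")))
```

An empty result is not "no repos": per the `applies` docstring it means the guard falls back to the self-hosting repo only.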
--- a/src/plane_sync.py
+++ b/src/plane_sync.py
@@ -951,32 +951,67 @@ def set_issue_code_review(work_item_id: str, project_id: str = None):
    _set_issue_state_direct(work_item_id, state_id, project_id)


-def set_issue_awaiting_deploy(work_item_id: str, project_id: str = None):
+def _deploy_status_guarded(work_item_id: str, target: str, reason: str | None) -> bool:
+    """ORCH-094: apply the terminal-window-aware guard for a deploy-phase setter.
+
+    Returns True iff the caller should PROCEED with the normal PATCH (verdict
+    ALLOW). On CONVERGE_DONE it drives the task to terminal ``Done`` here (the
+    idempotent convergence target) and returns False; on SUPPRESS it does nothing
+    and returns False. never-raise: any error degrades to ALLOW (proceed), keeping
+    behaviour 1:1 with pre-ORCH-094 (the guard leaf itself fails safe to ALLOW).
+    """
+    try:
+        from . import deploy_status_guard
+        verdict = deploy_status_guard.decide(work_item_id, target, reason=reason)
+        if verdict == deploy_status_guard.CONVERGE_DONE:
+            set_issue_done(work_item_id)
+            return False
+        if verdict == deploy_status_guard.SUPPRESS:
+            return False
+        return True
+    except Exception as e:  # noqa: BLE001 - never-raise; proceed (1:1) on doubt
+        logger.warning(f"deploy_status_guard wrapper error for {work_item_id}: {e}")
+        return True
+
+
+def set_issue_awaiting_deploy(work_item_id: str, project_id: str = None, reason: str = None):
    """ORCH-066: set issue to 'Awaiting Deploy' — self-deploy Phase A approval-pending.

    Degrades to the project's In Review UUID when 'Awaiting Deploy' is not created.
+    ORCH-094: terminal-window-aware — a task whose DB stage is terminal converges to
+    Done instead of stamping a spurious deploy status (``reason`` = caller, FR-4).
    """
+    if not _deploy_status_guarded(work_item_id, "awaiting", reason):
+        return
    project_id = _resolve_project_id(work_item_id, project_id)
    state_id = get_project_states(project_id)["awaiting_deploy"]
    _set_issue_state_direct(work_item_id, state_id, project_id)


-def set_issue_deploying(work_item_id: str, project_id: str = None):
+def set_issue_deploying(work_item_id: str, project_id: str = None, reason: str = None):
    """ORCH-066: set issue to 'Deploying' — self-deploy Phase B prod deploy in flight.

    Degrades to the project's In Progress UUID when 'Deploying' is not created.
+    ORCH-094: terminal-window-aware (see :func:`set_issue_awaiting_deploy`).
    """
+    if not _deploy_status_guarded(work_item_id, "deploying", reason):
+        return
    project_id = _resolve_project_id(work_item_id, project_id)
    state_id = get_project_states(project_id)["deploying"]
    _set_issue_state_direct(work_item_id, state_id, project_id)


-def set_issue_monitoring(work_item_id: str, project_id: str = None):
+def set_issue_monitoring(work_item_id: str, project_id: str = None, reason: str = None):
    """ORCH-066: set issue to 'Monitoring after Deploy' — post-deploy window open.

    Degrades to the project's Done UUID when 'Monitoring after Deploy' is not
    created (so the board shows Done, exactly as before ORCH-066).
+    ORCH-094: terminal-window-aware. The LEGITIMATE first Monitoring (DB already
+    ``done`` when advance_stage calls this, but the post-deploy window is active)
+    is allowed; a stale Monitoring after the window has closed converges to Done.
    """
+    if not _deploy_status_guarded(work_item_id, "monitoring", reason):
+        return
    project_id = _resolve_project_id(work_item_id, project_id)
    state_id = get_project_states(project_id)["monitoring"]
    _set_issue_state_direct(work_item_id, state_id, project_id)
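
How a guarded setter acts on a verdict can be shown with injected callables. This is a simplified sketch: `guarded_set`, `decide`, `patch`, and `set_done` are hypothetical names for illustration, and unlike `_deploy_status_guarded` only the verdict lookup is wrapped in the fail-safe `try`.

```python
from typing import Callable

ALLOW, CONVERGE_DONE, SUPPRESS = "ALLOW", "CONVERGE_DONE", "SUPPRESS"


def guarded_set(
    work_item_id: str,
    target: str,
    decide: Callable[[str, str], str],
    patch: Callable[[str, str], None],
    set_done: Callable[[str], None],
) -> bool:
    """Return True iff the requested deploy-phase status was actually PATCHed."""
    try:
        verdict = decide(work_item_id, target)
    except Exception:
        verdict = ALLOW  # fail-safe: on doubt behave like the unguarded setter
    if verdict == CONVERGE_DONE:
        set_done(work_item_id)  # converge to the terminal status instead
        return False
    if verdict == SUPPRESS:
        return False            # leave the issue untouched
    patch(work_item_id, target)
    return True


if __name__ == "__main__":
    log = []
    guarded_set("ORCH-061", "awaiting",
                decide=lambda w, t: CONVERGE_DONE,
                patch=lambda w, t: log.append(("patch", w, t)),
                set_done=lambda w: log.append(("done", w)))
    print(log)
```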
--- a/src/post_deploy.py
+++ b/src/post_deploy.py
@@ -316,6 +316,28 @@ def has_marker(repo: str, work_item_id: str | None, name: str) -> bool:
        return False


+def window_active(repo: str, work_item_id: str | None) -> bool:
+    """ORCH-094: True iff a post-deploy observation window is currently OPEN.
+
+    A window is open iff it has been armed (``ARMED`` sentinel) and has NOT yet
+    finished (no ``DONE`` sentinel). The terminal-window-aware deploy-status guard
+    (``deploy_status_guard.decide``) uses this to keep the legitimate post-deploy
+    ``Monitoring after Deploy`` status for a task that is already DB-``done`` while
+    its window is live, and to converge to ``Done`` once the window has closed.
+
+    Restart-safe (the sentinels live on disk) and never-raise -> False on error
+    (a doubt resolves to "window closed", i.e. converge to Done — the safe-for-
+    indication default that matches the bug we are fixing).
+    """
+    try:
+        return has_marker(repo, work_item_id, ARMED) and not has_marker(
+            repo, work_item_id, DONE
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("window_active error for %s/%s: %s", repo, work_item_id, e)
+        return False
+
+
 def write_marker(repo: str, work_item_id: str | None, name: str, content: str = "") -> bool:
    """Create/overwrite a sentinel (best-effort). Returns True on success."""
    try:
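
The "armed and not done" rule behind `window_active` can be illustrated with plain sentinel files. The layout here is an assumption made for the example (one directory per task holding files named `armed` and `done`); the diff does not show how `has_marker` stores its sentinels.

```python
import tempfile
from pathlib import Path


def sketch_window_active(state_dir: Path) -> bool:
    """Hypothetical check: window is open iff 'armed' exists and 'done' does not."""
    try:
        return (state_dir / "armed").exists() and not (state_dir / "done").exists()
    except OSError:
        return False  # doubt resolves to "window closed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        state = Path(d)
        print(sketch_window_active(state))   # nothing armed yet
        (state / "armed").touch()
        print(sketch_window_active(state))   # armed, not finished
        (state / "done").touch()
        print(sketch_window_active(state))   # finished
```

Because the state lives on disk, the answer survives a process restart, which is the property the docstring calls restart-safe.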
--- a/src/stage_engine.py
+++ b/src/stage_engine.py
@@ -384,6 +384,29 @@ def advance_stage(
            f"(auto-advance after {agent})"
        )

+        # ORCH-021: arm post-deploy monitoring PAST `done`. Responsibility extends
+        # beyond the restart-time health-check to catch the "green deploy, red prod"
+        # class (ET-8). Idempotent (sentinel `armed`) + conditional (applies()), so a
+        # double webhook / reconciler / finalizer re-driving `done` never doubles it
+        # and non-applicable repos are untouched. never-raise (arm_monitor + guard).
+        #
+        # ORCH-094 (ADR-001 D3): the arm block is moved ABOVE the terminal-sync
+        # block (it used to run AFTER set_issue_monitoring). The order matters now
+        # that set_issue_monitoring is terminal-window-aware: by the time the
+        # legitimate first `Monitoring` is set, the task is ALREADY DB-`done`
+        # (update_task_stage ran above), so the guard must see the window as ACTIVE
+        # (ARMED & not DONE) to let it through. Arming first writes the ARMED
+        # sentinel -> window_active==True -> the guard returns ALLOW. A re-drive of
+        # deploy->done AFTER the window has closed (DONE present) -> window_active
+        # False -> the guard converges to Done (no resurrected Monitoring). The
+        # move is safe: arm_monitor only writes a sentinel + enqueues a deferred
+        # job; it depends on neither the Plane status nor the merge lease.
+        if next_stage == "done" and post_deploy.post_deploy_applies(repo):
+            try:
+                post_deploy.arm_monitor(repo, work_item_id, branch, task_id)
+            except Exception as e:  # noqa: BLE001 - monitoring must never crash done
+                logger.warning(f"Task {task_id}: post-deploy arm failed: {e}")
+
        # --- Terminal sync: deploy -> done must reach Plane's Done -----------
        # When the deployer's check_deploy_status passes we advance to the
        # terminal 'done' stage. Previously a merged-PR webhook completed the
@@ -401,7 +424,7 @@ def advance_stage(
        if next_stage == "done" and work_item_id:
            try:
                if post_deploy.post_deploy_applies(repo):
-                    set_issue_monitoring(work_item_id)
+                    set_issue_monitoring(work_item_id, reason="advance:deploy->done")
                    logger.info(
                        f"Task {task_id}: deploy->done (self), Plane state -> "
                        f"Monitoring after Deploy (post-deploy window)"
@@ -416,24 +439,14 @@ def advance_stage(

        # ORCH-043: the merge has landed (deploy->done). Release the merge lease as
        # a backstop in case the PR-merged webhook was lost (holder-aware no-op if a
-        # different task already owns it). Never raises.
+        # different task already owns it). Never raises. ORCH-094: stays AFTER the
+        # terminal-sync (the arm-block move above does not touch the lease).
        if next_stage == "done":
            try:
                merge_gate.release_merge_lease(repo, branch)
            except Exception as e:  # noqa: BLE001 - defensive
                logger.warning(f"Task {task_id}: merge-lease release on done failed: {e}")

-        # ORCH-021: arm post-deploy monitoring PAST `done`. Responsibility extends
-        # beyond the restart-time health-check to catch the "green deploy, red prod"
-        # class (ET-8). Idempotent (sentinel `armed`) + conditional (applies()), so a
-        # double webhook / reconciler / finalizer re-driving `done` never doubles it
-        # and non-applicable repos are untouched. never-raise (arm_monitor + guard).
-        if next_stage == "done" and post_deploy.post_deploy_applies(repo):
-            try:
-                post_deploy.arm_monitor(repo, work_item_id, branch, task_id)
-            except Exception as e:  # noqa: BLE001 - monitoring must never crash done
-                logger.warning(f"Task {task_id}: post-deploy arm failed: {e}")
-
        # --- Launch the next agent (ORCH-4 fix: current_stage, not next) -----
        next_agent = get_agent_for_stage(current_stage)
        if next_agent:
@@ -1214,8 +1227,8 @@ def _handle_self_deploy_phase_a(
        # ORCH-066 (AC-6/AC-13): Phase A approval-pending is now `Awaiting Deploy`,
        # which discharges `In Review` of the deploy-approval meaning (In Review
        # stays for analyst BRD/review approve-pending only). Degrades to In Review
-        # where the status is not created.
-        set_issue_awaiting_deploy(work_item_id)
+        # where the status is not created. ORCH-094: reason tags the caller (FR-4).
+        set_issue_awaiting_deploy(work_item_id, reason="phase_a")
    # ORCH-036: belt-and-suspenders — wipe any STALE deploy-state markers before
    # arming a fresh approve. A prior FAILED pass clears on rollback, but clearing
    # here too guarantees the entry to every new prod-deploy pass starts clean
@@ -1312,8 +1325,9 @@ def _handle_self_deploy_phase_b(task_id, repo, work_item_id, branch, result: Adv
    )
    # ORCH-066 (AC-7): the prod deploy is now in flight -> indicate `Deploying`
    # (degrades to In Progress where the status is not created).
+    # ORCH-094: reason tags the caller (FR-4).
    if work_item_id:
-        set_issue_deploying(work_item_id)
+        set_issue_deploying(work_item_id, reason="phase_b")
    task_desc = (
        f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
        f"Stage: deploy\nNote: deploy-finalize poll (prod self-deploy initiated)."
@@ -1714,7 +1728,7 @@ def run_post_deploy_monitor(job: dict):
    try:
        conn = get_db()
        row = conn.execute(
-            "SELECT work_item_id, branch FROM tasks WHERE id=?", (task_id,)
+            "SELECT work_item_id, branch, stage FROM tasks WHERE id=?", (task_id,)
        ).fetchone()
        conn.close()
    except Exception as e:  # noqa: BLE001 - never-raise
@@ -1723,13 +1737,28 @@ def run_post_deploy_monitor(job: dict):
    if not row:
        logger.error(f"post-deploy-monitor: no task row for task_id={task_id}")
        return
-    work_item_id, branch = row[0], row[1]
+    work_item_id, branch, db_stage = row[0], row[1], row[2]

    # AC-15: a finished window is a no-op (defends against a duplicate job).
    if post_deploy.has_marker(repo, work_item_id, post_deploy.DONE):
        logger.info(f"post-deploy-monitor: {work_item_id} already done (no-op)")
        return

+    # ORCH-094 (FR-3 / D4 / AC-3): a tick must have an active basis. If the task
+    # became terminal ANOMALOUSLY mid-window (cancelled via STOP, ORCH-090), the
+    # tick is a "zombie" — close the window WITHOUT a status PATCH and WITHOUT
+    # re-queueing the next tick (a cancelled task already reached its own terminal;
+    # stamping a deploy status over it would flap). A `done` stage is the NORMAL
+    # state of a post-deploy window (it opens strictly past deploy->done) so it is
+    # NOT treated as an anomaly here.
+    if (db_stage or "").strip() == "cancelled":
+        logger.info(
+            f"post-deploy-monitor: {work_item_id} task cancelled mid-window -> "
+            f"closing window, no status PATCH, no re-queue (zombie-tick guard)"
+        )
+        post_deploy.mark_done(repo, work_item_id)
+        return
+
    # One probe -> append -> classify (restart-safe via the persisted series).
    probe = post_deploy.probe_signals(settings.post_deploy_base_url)
    series = post_deploy.append_probe(repo, work_item_id, probe)