fix(cancel): narrow STOP critical-window so deploy-park cancel applies (ORCH-090)
Review P1: a STOP while a self-hosting task is PARKED on `deploy` awaiting the manual `Confirm Deploy` was classified as a critical merge/deploy window solely because the task still held the per-repo merge-lease (held from merge-gate through deploy->done). That window is fully reversible — nothing is merged or deployed yet (the irreversible merge_pr runs later in _handle_merge_verify, always under an INITIATED marker). So the cancel was DEFERRED to run_deploy_finalizer, which only runs after Phase B (Confirm Deploy) — the very step the operator pressed STOP to avoid. Result: the deferred cancel was never applied, the task wedged non-terminal holding the lease, blocking the repo's serial-gate (ORCH-088) and merges. Fix: gate the merge-lease branch of cancel.in_critical_window on an actively RUNNING actor (_task_has_running_actor). Lease held + running deploy/merge job -> still deferred (genuine in-flight step). Lease held + no running actor (idle deploy parking) -> NOT critical -> immediate full reset, which itself releases the lease (step 3c) and drives the task terminal. INITIATED-marker deferral unchanged. Also fixes review P2 (AC-6): set_task_cancel_requested now returns the first-stamp fact (rowcount), and the deferred branch only notifies on the first transition — a repeated STOP while still deferred no longer spams duplicate notifications. Tests: test_d7_lease_held_idle_parking_is_not_critical, test_d7_lease_held_with_running_actor_still_critical, test_d7_stop_on_deploy_awaiting_confirm_full_resets, test_d7_repeated_stop_in_critical_window_no_duplicate_notify. Full suite green (1349). Refs: ORCH-090 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -3,8 +3,8 @@
|
||||
Leaf module mirroring ``src/serial_gate.py`` / ``src/labels.py``: pure,
|
||||
unit-testable, never-raise functions over config + the existing DB / deploy-state.
|
||||
Module-level imports are limited to ``config`` (and ``re``); the critical-window
|
||||
probe lazily imports ``self_deploy`` / ``merge_gate`` so a cycle can never form and
|
||||
an import failure degrades safely.
|
||||
probe lazily imports ``self_deploy`` / ``merge_gate`` / ``db`` so a cycle can never
|
||||
form and an import failure degrades safely.
|
||||
|
||||
What it answers:
|
||||
* ``applies(repo)`` — is STOP-cancellation REAL for this repo?
|
||||
@@ -78,16 +78,50 @@ def applies(repo: str) -> bool:
|
||||
return False
|
||||
|
||||
|
||||
def _task_has_running_actor(task_id) -> bool:
|
||||
"""True iff the task currently has a RUNNING job — an active merge/deploy actor.
|
||||
|
||||
Distinguishes a genuinely in-flight merge/deploy (a running deployer / deploy
|
||||
finalizer job actually executing the irreversible step) from a task merely
|
||||
PARKED on ``deploy`` awaiting the human ``Confirm Deploy`` (the merge-lease is
|
||||
held across that wait, ORCH-036/043, but nothing is executing and nothing has
|
||||
been merged/deployed). Lazily imports ``db``; raises on a db error so the caller
|
||||
fails CLOSED (treat as critical) rather than silently mis-classifying on doubt.
|
||||
"""
|
||||
if not task_id:
|
||||
return False
|
||||
from . import db
|
||||
for job in db.get_active_jobs_for_task(task_id):
|
||||
if job.get("status") == "running":
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
def in_critical_window(task: dict) -> bool:
|
||||
"""Is the task inside an irreversible merge/deploy step (ADR-001 D7 / AC-7)?
|
||||
|
||||
A STOP that lands here must NOT tear the step apart (half-merge / detached prod
|
||||
deploy / dead prod container, NFR-3). Two markers (existing, no new state):
|
||||
deploy / dead prod container, NFR-3). Markers (existing, no new state):
|
||||
* self-deploy Phase B initiated — the ``INITIATED`` sentinel in
|
||||
``<repos_dir>/.deploy-state-<repo>/<wi>/`` (ORCH-036);
|
||||
* the task currently HOLDS the per-repo merge-lease
|
||||
``<repos_dir>/.merge-lease-<repo>.json`` (ORCH-043), holder branch == task
|
||||
branch.
|
||||
``<repos_dir>/.deploy-state-<repo>/<wi>/`` (ORCH-036) — the detached prod
|
||||
deploy + the deterministic ``merge_pr`` (``_handle_merge_verify``, run later
|
||||
under the SAME marker) are both covered here;
|
||||
* the task HOLDS the per-repo merge-lease ``<repos_dir>/.merge-lease-<repo>.json``
|
||||
(ORCH-043), holder branch == task branch, **AND** a merge/deploy actor is
|
||||
actually RUNNING.
|
||||
|
||||
The merge-lease branch is gated on a running actor on purpose (ORCH-090 review
|
||||
P1 fix). For the self-hosting repo the lease is HELD from the merge-gate PASS
|
||||
(``deploy-staging -> deploy`` edge) right through to ``deploy -> done`` — including
|
||||
the whole time the task sits PARKED on ``deploy`` awaiting a human ``Confirm
|
||||
Deploy`` (Phase A). That wait is FULLY REVERSIBLE: nothing is merged or deployed
|
||||
(the irreversible ``merge_pr`` only runs later in ``_handle_merge_verify``, always
|
||||
under an ``INITIATED`` marker already caught above). Classifying that idle parking
|
||||
as "critical" used to DEFER the cancel to a deploy finalizer that the operator —
|
||||
having pressed STOP precisely to NOT confirm — never triggers, so the cancel was
|
||||
never applied and the task wedged while still holding the lease (blocking the
|
||||
repo's serial-gate / merges). Now idle parking (lease held, no running actor) is
|
||||
NOT critical: the full reset runs immediately and itself releases the lease.
|
||||
|
||||
fail-CLOSED (TR-3): any error/uncertainty -> True (DEFER cancellation). Outside
|
||||
the window -> False (apply the full reset immediately).
|
||||
@@ -108,7 +142,16 @@ def in_critical_window(task: dict) -> bool:
|
||||
from . import merge_gate
|
||||
holder = merge_gate.current_lease_holder(repo)
|
||||
if holder and branch and holder == branch:
|
||||
return True
|
||||
# Lease held. Critical ONLY if an actor is actively merging/deploying;
|
||||
# an idle task parked on `deploy` awaiting Confirm Deploy is reversible.
|
||||
if _task_has_running_actor(task.get("id")):
|
||||
return True
|
||||
logger.info(
|
||||
"cancel.in_critical_window: task %s holds the merge-lease but no "
|
||||
"actor is running (idle deploy parking, awaiting Confirm Deploy) -> "
|
||||
"NOT critical; full reset will release the lease", task.get("id"),
|
||||
)
|
||||
return False
|
||||
except Exception as e: # noqa: BLE001 - fail-CLOSED on doubt
|
||||
logger.warning("cancel.in_critical_window merge-lease probe error: %s", e)
|
||||
return True
|
||||
|
||||
13
src/db.py
13
src/db.py
@@ -867,20 +867,23 @@ def mark_task_cancelled(task_id: int) -> bool:
|
||||
def set_task_cancel_requested(task_id: int) -> bool:
|
||||
"""ORCH-090 (ADR-001 D7): mark a deferred cancellation (STOP in critical window).
|
||||
|
||||
Idempotent: only stamps ``cancel_requested_at`` the first time. The deterministic
|
||||
deploy/merge finalizer reads it once the irreversible step completes and then
|
||||
applies the full cancellation. never-raise -> False on error.
|
||||
Idempotent: only stamps ``cancel_requested_at`` the first time. Returns the
|
||||
**first-stamp fact** — ``True`` iff THIS call actually stamped the column (a
|
||||
repeated STOP while still deferred updates 0 rows -> ``False``), so the caller can
|
||||
suppress duplicate notifications (AC-6). The deterministic deploy/merge finalizer
|
||||
reads the column once the irreversible step completes and then applies the full
|
||||
cancellation. never-raise -> False on error.
|
||||
"""
|
||||
try:
|
||||
conn = get_db()
|
||||
try:
|
||||
conn.execute(
|
||||
cur = conn.execute(
|
||||
"UPDATE tasks SET cancel_requested_at=datetime('now') "
|
||||
"WHERE id = ? AND cancel_requested_at IS NULL",
|
||||
(task_id,),
|
||||
)
|
||||
conn.commit()
|
||||
return True
|
||||
return cur.rowcount > 0
|
||||
finally:
|
||||
conn.close()
|
||||
except Exception:
|
||||
|
||||
@@ -1919,20 +1919,24 @@ def cancel_task(
|
||||
|
||||
# (2) Critical merge/deploy window -> DEFER (unless forced by the finalizer).
|
||||
if not force and cancel_mod.in_critical_window(task):
|
||||
set_task_cancel_requested(task_id)
|
||||
first = set_task_cancel_requested(task_id)
|
||||
result["cancelled_jobs"] = cancel_jobs_for_task(task_id, only_queued=True)
|
||||
result["deferred"] = True
|
||||
result["ok"] = True
|
||||
result["note"] = "deferred-critical-window"
|
||||
msg = (
|
||||
f"⏸️ {link_for(work_item_id)}: STOP получен во время "
|
||||
f"критичного шага (merge/deploy) — отмена ОТЛОЖЕНА до честного "
|
||||
f"завершения шага. main/прод не трогаются."
|
||||
)
|
||||
_notify_cancel(work_item_id, task_id, msg)
|
||||
result["note"] = "deferred-critical-window" if first else "deferred-already-pending"
|
||||
# AC-6: only alert on the FIRST deferral transition — a repeated STOP while
|
||||
# still deferred must not spam duplicate Telegram/Plane notifications.
|
||||
if first:
|
||||
msg = (
|
||||
f"⏸️ {link_for(work_item_id)}: STOP получен во время "
|
||||
f"критичного шага (merge/deploy) — отмена ОТЛОЖЕНА до честного "
|
||||
f"завершения шага. main/прод не трогаются."
|
||||
)
|
||||
_notify_cancel(work_item_id, task_id, msg)
|
||||
logger.warning(
|
||||
"cancel_task: task %s (%s) in critical window -> deferred cancel "
|
||||
"(queued jobs cancelled=%s)", task_id, work_item_id, result["cancelled_jobs"],
|
||||
"(first=%s, queued jobs cancelled=%s)", task_id, work_item_id, first,
|
||||
result["cancelled_jobs"],
|
||||
)
|
||||
return result
|
||||
|
||||
|
||||
Reference in New Issue
Block a user