feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090)

Introduce the dedicated Plane STOP status as a single declarative task-cancel
mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs
(terminal `cancelled`, never requeued), remove the worktree + delete the remote
feature branch (never main, never force-push), drive the task to the new
system-terminal state `cancelled` and tombstone the natural keys so a later
"To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a
critical merge/deploy window is deferred until the irreversible step finishes
honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to
the `analysis` stage; the only pipeline-start entry point remains "To Analyse".

Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} ->
{done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker
requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged
(`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe,
under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).
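
A rough illustration of the widened predicate, as a sketch only: `TERMINAL_STATES` and `is_task_terminal` are hypothetical names chosen for this example, not identifiers taken from the commit. The only thing the text above confirms is that the terminal set grows from {done} to {done, cancelled}.

```python
from __future__ import annotations

# Hypothetical constant: the commit states only that the "task terminal"
# set is widened from {done} to {done, cancelled}.
TERMINAL_STATES = frozenset({"done", "cancelled"})


def is_task_terminal(stage: str | None) -> bool:
    """Return True when the task sits in a terminal state (illustrative helper)."""
    return stage in TERMINAL_STATES


print(is_task_terminal("cancelled"))  # True
print(is_task_terminal("analysis"))   # False
```

Keeping the set in one place would let callers such as a serial gate or a requeue guard share the same answer; whether the real code is organised this way is an assumption.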

New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns
cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example,
CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup
ORs on it) — original UUID recoverable from the parseable suffix.

Refs: ORCH-090

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:01:57 +03:00
committed by orchestrator-deployer
parent ab083ba826
commit ebbf2e7a2d
27 changed files with 1394 additions and 38 deletions

src/cancel.py Normal file

@@ -0,0 +1,144 @@
"""ORCH-090 (ADR-001 D9 / adr-0026): STOP-cancellation leaf — pure decision logic.

Leaf module mirroring ``src/serial_gate.py`` / ``src/labels.py``: pure,
unit-testable, never-raise functions over config + the existing DB / deploy-state.
Module-level imports are limited to ``config`` (plus stdlib ``logging`` / ``re``);
the critical-window probe lazily imports ``self_deploy`` / ``merge_gate`` so a
cycle can never form and an import failure degrades safely.

What it answers:
  * ``applies(repo)``            — is STOP-cancellation REAL for this repo?
  * ``in_critical_window(task)`` — is the task inside an irreversible merge/deploy
    step where cancellation must be DEFERRED (ADR-001 D7) instead of applied now?
  * ``snapshot()``               — read-only summary for ``GET /queue`` (AC-10).

The ORCHESTRATION of a cancellation (SIGTERM, cancel-jobs, worktree/branch
cleanup, key tombstone, notifications) lives in ``stage_engine.cancel_task`` — this
leaf only decides, it never mutates.

never-raise contract (self-hosting safety): every public function degrades
conservatively. ``applies`` -> False on error (gate inert, the kill-switch-off
default). ``in_critical_window`` -> True on doubt (fail-CLOSED: when we cannot
confirm we are OUTSIDE a critical window, DEFER cancellation rather than risk
tearing a half-merge / detached prod deploy, NFR-3 / TR-3).
"""
from __future__ import annotations

import logging
import re

from .config import settings

logger = logging.getLogger("orchestrator.cancel")

# Repo tokens in the CSV scope must match this (mirrors serial_gate._REPO_TOKEN).
_REPO_TOKEN = re.compile(r"^[A-Za-z0-9._-]+$")


def _scope_repos() -> set[str]:
    """Sanitised set of in-scope repo tokens from ``stop_status_repos`` (CSV).

    Empty/blank CSV -> empty set, meaning "apply to ALL repos" (D9). Invalid tokens
    (regex miss) are dropped. Never raises.
    """
    try:
        raw = (settings.stop_status_repos or "").strip()
    except Exception:  # noqa: BLE001
        return set()
    if not raw:
        return set()
    out: set[str] = set()
    for tok in raw.split(","):
        t = tok.strip()
        if t and _REPO_TOKEN.match(t):
            out.add(t)
        elif t:
            logger.warning("cancel: dropping invalid repo token %r from CSV", t)
    return out


def applies(repo: str) -> bool:
    """Whether STOP-cancellation is REAL for this repo (D9 / AC-8).

    * ``stop_status_enabled=False`` -> always False (kill-switch; STOP handling and
      the relaunch-hole gate are 1:1 as before ORCH-090).
    * ``stop_status_repos`` (CSV) non-empty -> real only for listed repos.
    * empty CSV -> real for ALL repos (cancellation is meaningful for enduro too).

    Never raises -> False on error (degrade to "inert", matching kill-switch off).
    """
    try:
        if not getattr(settings, "stop_status_enabled", False):
            return False
        scope = _scope_repos()
        if scope:
            return (repo or "").strip() in scope
        return True
    except Exception as e:  # noqa: BLE001 - never-raise
        logger.warning("cancel.applies error for %s: %s", repo, e)
        return False


def in_critical_window(task: dict) -> bool:
    """Is the task inside an irreversible merge/deploy step (ADR-001 D7 / AC-7)?

    A STOP that lands here must NOT tear the step apart (half-merge / detached prod
    deploy / dead prod container, NFR-3). Two markers (existing, no new state):

      * self-deploy Phase B initiated — the ``INITIATED`` sentinel in
        ``<repos_dir>/.deploy-state-<repo>/<wi>/`` (ORCH-036);
      * the task currently HOLDS the per-repo merge-lease
        ``<repos_dir>/.merge-lease-<repo>.json`` (ORCH-043), holder branch == task
        branch.

    fail-CLOSED (TR-3): any error/uncertainty -> True (DEFER cancellation). Outside
    the window -> False (apply the full reset immediately).
    """
    if not task:
        return False
    repo = task.get("repo")
    work_item_id = task.get("work_item_id")
    branch = task.get("branch")
    try:
        from . import self_deploy

        if self_deploy.has_marker(repo, work_item_id, self_deploy.INITIATED):
            return True
    except Exception as e:  # noqa: BLE001 - fail-CLOSED on doubt
        logger.warning("cancel.in_critical_window self_deploy probe error: %s", e)
        return True
    try:
        from . import merge_gate

        holder = merge_gate.current_lease_holder(repo)
        if holder and branch and holder == branch:
            return True
    except Exception as e:  # noqa: BLE001 - fail-CLOSED on doubt
        logger.warning("cancel.in_critical_window merge-lease probe error: %s", e)
        return True
    return False


def snapshot() -> dict:
    """Read-only STOP-cancellation summary for GET /queue (AC-10).

    Additive block; existing /queue keys are untouched. never-raise -> a minimal
    dict with the flags on error.
    """
    try:
        enabled = bool(getattr(settings, "stop_status_enabled", False))
    except Exception:  # noqa: BLE001
        enabled = False
    try:
        repos_cfg = getattr(settings, "stop_status_repos", "") or ""
    except Exception:  # noqa: BLE001
        repos_cfg = ""
    try:
        from . import db

        stats = db.cancelled_tasks_snapshot(10)
    except Exception as e:  # noqa: BLE001 - never-raise
        logger.warning("cancel.snapshot error: %s", e)
        stats = {"count": 0, "pending": 0, "recent": []}
    return {
        "enabled": enabled,
        "repos": repos_cfg,
        "cancelled_count": stats.get("count", 0),
        "deferred_pending": stats.get("pending", 0),
        "recent": stats.get("recent", []),
    }
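
A minimal sketch of how a caller might combine the two decision predicates from this leaf. The real orchestration lives in `stage_engine.cancel_task`, which is not part of this file, so everything here is an assumption: `decide_stop` is a hypothetical helper, its three return values are invented for illustration, and the predicates are injected as plain callables so the example runs without the orchestrator package.

```python
from __future__ import annotations

from typing import Callable


def decide_stop(
    task: dict,
    applies: Callable[[str], bool],
    in_critical_window: Callable[[dict], bool],
) -> str:
    """Return "ignore", "defer" or "cancel" for a STOP request (hypothetical helper)."""
    repo = (task or {}).get("repo") or ""
    if not applies(repo):
        # Kill-switch off, or repo outside the configured CSV scope.
        return "ignore"
    try:
        if in_critical_window(task):
            # Irreversible merge/deploy step in flight: postpone the reset.
            return "defer"
    except Exception:
        # Mirror the leaf's fail-CLOSED stance if a stand-in predicate raises.
        return "defer"
    return "cancel"


# Stand-in predicates for demonstration only.
task = {"repo": "orchestrator", "work_item_id": "ORCH-090", "branch": "feature/x"}
print(decide_stop(task, lambda repo: True, lambda t: False))   # cancel
print(decide_stop(task, lambda repo: True, lambda t: True))    # defer
print(decide_stop(task, lambda repo: False, lambda t: False))  # ignore
```

Checking `applies` first keeps the kill-switch path free of any deploy-state or merge-lease probing, which matches the "off -> zero regression" intent stated in the commit message.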