feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090)

Introduce the dedicated Plane STOP status as a single declarative task-cancel
mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs
(terminal `cancelled`, never requeued), remove the worktree + delete the remote
feature branch (never main, never force-push), drive the task to the new
system-terminal state `cancelled` and tombstone the natural keys so a later
"To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a
critical merge/deploy window is deferred until the irreversible step has actually
finished. Also closes the relaunch hole: handle_status_start relaunch is gated to
the `analysis` stage; the only pipeline-start entry point remains "To Analyse".
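
The relaunch gate described above can be sketched as a pure predicate. This is a
minimal illustration, not the code in this commit: `may_relaunch` is a
hypothetical helper, and the stage names other than `analysis` are placeholders.
Only the `analysis` stage and the kill-switch behaviour (off means the gate is
inert) are taken from the message.

```python
def may_relaunch(current_stage: str, stop_status_enabled: bool) -> bool:
    """Return True if moving an existing task to "To Analyse" may relaunch its agent.

    Hypothetical helper for illustration only; the real gate lives inline in
    handle_status_start.
    """
    if not stop_status_enabled:
        # Kill-switch off: the gate is inert, behaviour is as before ORCH-090.
        return True
    # Kill-switch on: only a task sitting on the analysis stage may resume.
    return current_stage == "analysis"


if __name__ == "__main__":
    # "development" and "review" are assumed stage names, used only as examples.
    for stage in ("analysis", "development", "review"):
        print(stage, may_relaunch(stage, stop_status_enabled=True))
```

Keeping the decision in one predicate makes the "off -> zero regression" claim
easy to check: with the switch off the function returns True for every stage.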

Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} ->
{done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker
requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged
(`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe,
under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).
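
A minimal sketch of the widened "task terminal" predicate and of a requeue guard
built on it. The names `TERMINAL_STAGES`, `is_task_terminal` and `may_requeue`
are assumptions made for illustration; the actual helpers in serial_gate,
task_deps and the reaper/worker may be shaped differently.

```python
# Assumed constant: before ORCH-090 the set would have been {"done"} only.
TERMINAL_STAGES = frozenset({"done", "cancelled"})


def is_task_terminal(stage: str) -> bool:
    """A task is terminal once it has reached either sink stage."""
    return stage in TERMINAL_STAGES


def may_requeue(job_status: str, task_stage: str) -> bool:
    """Guard for a reaper/worker: cancelled jobs and jobs of terminal tasks stay put."""
    if job_status == "cancelled":
        # Terminal job status: never requeued.
        return False
    return not is_task_terminal(task_stage)


if __name__ == "__main__":
    print(is_task_terminal("cancelled"))
    print(may_requeue("failed", "cancelled"))
```

Because `cancelled` is only added to the sink set, no transition table needs a
new edge, which matches the note that STAGE_TRANSITIONS is unchanged.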

New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns
cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example,
CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup
ORs on it) — original UUID recoverable from the parseable suffix.

Refs: ORCH-090

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:01:57 +03:00
committed by orchestrator-deployer
parent ab083ba826
commit ebbf2e7a2d
27 changed files with 1394 additions and 38 deletions


@@ -160,8 +160,15 @@ async def handle_issue_updated(data: dict, project_id: str = ""):
     # fallback) resolve to None, so the branch simply never activates (no KeyError,
     # no blind deploy). Checked before `approved` so the two gestures never alias.
     confirm_state = proj_states.get("confirm_deploy")
+    # ORCH-090: dedicated operator STOP status -> cancel the task (stop agent + full
+    # reset). fail-closed via .get (no UUID on a board without the status -> None ->
+    # branch never activates, exactly like confirm_deploy). Checked FIRST so a STOP
+    # is never aliased by to_analyse/approved/rejected.
+    stop_state = proj_states.get("stop")
     # ORCH-066: start/resume trigger is `To Analyse` (human entry-point).
-    if new_state == proj_states["to_analyse"]:
+    if stop_state and new_state == stop_state:
+        await handle_stop(data, project_id)
+    elif new_state == proj_states["to_analyse"]:
         await handle_status_start(data, project_id)
     elif confirm_state and new_state == confirm_state:
         await handle_confirm_deploy(data, project_id)
@@ -212,6 +219,44 @@ async def handle_confirm_deploy(data: dict, project_id: str = ""):
)
+async def handle_stop(data: dict, project_id: str = ""):
+    """ORCH-090: a human flipped the issue to the dedicated STOP status, so cancel
+    the task (stop the active agent + full progress reset).
+    Resolves the task by plane_id and delegates to the unified
+    ``stage_engine.cancel_task`` (run off the event loop via asyncio.to_thread; it
+    is synchronous and may sleep during the graceful SIGTERM cascade). Guards:
+    * kill-switch / repo-scope via ``cancel.applies(repo)`` (False -> no-op-log);
+    * idempotent: an absent / already-terminal task is a no-op inside cancel_task.
+    Contract is never-raise (NFR-5): any error is logged, the webhook flow never
+    crashes.
+    """
+    import asyncio
+    from .. import cancel
+    from ..stage_engine import cancel_task
+    plane_id = str(data.get("id") or "")
+    task = get_task_by_plane_id(plane_id)
+    if not task:
+        logger.info(f"STOP for {plane_id} but no task found, ignoring (no-op)")
+        return
+    task_id = task["id"]
+    repo = task.get("repo", "")
+    if not cancel.applies(repo):
+        logger.info(
+            f"STOP for {plane_id} (task {task_id}, repo={repo}) but cancellation is "
+            f"not applicable (kill-switch off / out of scope); no-op"
+        )
+        return
+    logger.info(f"Task {task_id}: STOP status -> cancelling (stop agent + full reset)")
+    try:
+        await asyncio.to_thread(cancel_task, task_id, reason="Plane STOP status", source="stop")
+    except Exception as e:  # never-raise: the webhook flow must not crash
+        logger.error(f"STOP handling failed for task {task_id}: {e}")
 async def handle_status_start(data: dict, project_id: str = ""):
     """An issue moved into In Progress.
@@ -279,6 +324,36 @@ async def handle_status_start(data: dict, project_id: str = ""):
)
return
+    # ORCH-090 (ADR-001 D6 / AC-5): close the relaunch hole. The legitimate "answer
+    # to Needs Input" resume is owned ONLY by the analyst (ORCH-066: the sole
+    # Needs-Input setter). A manual move of an EXISTING task at any OTHER stage to
+    # "To Analyse" must NOT silently relaunch the mid-pipeline agent on the old
+    # branch (the incident pattern). Gate the relaunch to `analysis`; any other
+    # stage -> no-op-with-log + a best-effort Plane hint to use STOP -> To Analyse
+    # for a clean-slate restart. With the kill-switch off this gate is inert
+    # (behaviour 1:1 as before ORCH-090).
+    from ..config import settings as _settings
+    if getattr(_settings, "stop_status_enabled", False) and current_stage != "analysis":
+        logger.info(
+            f"Status->To Analyse for {plane_id}: existing task on stage "
+            f"'{current_stage}', NOT relaunching {stage_agent} (relaunch-hole closed, "
+            f"ORCH-090). Use STOP then To Analyse to restart from scratch."
+        )
+        try:
+            _add_comment(
+                work_item_id,
+                " Relaunching the agent by changing the "
+                "working status is disabled (ORCH-090). "
+                "To restart from scratch: "
+                "STOP → To Analyse.",
+                author=stage_agent,
+            )
+        except Exception as e:
+            logger.error(f"Failed to post relaunch-hole comment for {work_item_id}: {e}")
+        return
     task_desc = (
         f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
         f"Stage: {current_stage}\nNote: Stakeholder returned the issue to In "