feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090)

Introduce the dedicated Plane STOP status as a single declarative task-cancel
mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs
(terminal `cancelled`, never requeued), remove the worktree + delete the remote
feature branch (never main, never force-push), drive the task to the new
system-terminal state `cancelled` and tombstone the natural keys so a later
"To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a
critical merge/deploy window is deferred until the irreversible step has actually
finished. Also closes the relaunch hole: handle_status_start relaunch is gated to
the `analysis` stage; the only pipeline-start entry point remains "To Analyse".
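
The "never main" rule for remote branch deletion can be illustrated with a small guard. This is only a sketch: `can_delete_remote_branch` and `PROTECTED_BRANCHES` are hypothetical names, treating `master` as protected is an assumption, and the real `delete_remote_branch` in src/gitea.py may enforce the rule differently.

```python
# Hypothetical guard: names and the protected set are assumptions, not the commit's code.
PROTECTED_BRANCHES = frozenset({"main", "master"})


def can_delete_remote_branch(branch: str, protected: frozenset = PROTECTED_BRANCHES) -> bool:
    """Return True only for a non-empty branch name that is not protected."""
    name = branch.strip()
    # Accept both short names and fully qualified refs.
    prefix = "refs/heads/"
    if name.startswith(prefix):
        name = name[len(prefix):]
    return bool(name) and name not in protected
```

A caller would check the guard first and skip the deletion, without raising, when it returns False.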

Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} ->
{done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker
requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged
(`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe,
under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).
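
A minimal sketch of the widened "task terminal" predicate follows. It assumes the widening itself is gated by ORCH_STOP_STATUS_ENABLED and that the switch accepts the usual truthy strings; both are assumptions, and `terminal_states` / `is_task_terminal` are hypothetical names rather than the helpers used in serial_gate or task_deps.

```python
import os

# Assumption: which strings count as "enabled" is not stated in the commit.
_TRUTHY = {"1", "true", "yes", "on"}


def terminal_states(env=None) -> frozenset:
    """{done} when the kill-switch is off, {done, cancelled} when it is on."""
    env = os.environ if env is None else env
    enabled = env.get("ORCH_STOP_STATUS_ENABLED", "").strip().lower() in _TRUTHY
    return frozenset({"done", "cancelled"}) if enabled else frozenset({"done"})


def is_task_terminal(status: str, env=None) -> bool:
    """Single predicate that gates, dependency checks and requeue guards could share."""
    return status in terminal_states(env)
```

Routing every caller through one predicate is what would keep the off state a zero-regression path: with the switch off, `cancelled` is never treated as terminal.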

New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns
cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example,
CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup
ORs on it) — original UUID recoverable from the parseable suffix.
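
The tombstoning idea can be sketched as below. The suffix format (`::cancelled::<epoch>`) and the names `tombstone` / `recover` are purely illustrative assumptions; the commit only states that the suffix is parseable and that the original UUID can be recovered from the tombstoned value.

```python
# Hypothetical suffix format; the real one is not shown in this commit.
TOMBSTONE_MARKER = "::cancelled::"


def tombstone(key: str, cancelled_at: int) -> str:
    """Rewrite a natural key so a later lookup no longer matches it."""
    if TOMBSTONE_MARKER in key:
        return key  # already tombstoned: keep the operation idempotent
    return f"{key}{TOMBSTONE_MARKER}{cancelled_at}"


def recover(key: str) -> str:
    """Strip the tombstone suffix, returning the original key."""
    return key.split(TOMBSTONE_MARKER, 1)[0]
```

Because the lookup ORs on plane_issue_id, leaving that column untouched would let a re-created task match the cancelled row, which is why it has to be tombstoned along with the other natural keys.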

Refs: ORCH-090

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:01:57 +03:00
committed by orchestrator-deployer
parent ab083ba826
commit ebbf2e7a2d
27 changed files with 1394 additions and 38 deletions


@@ -679,17 +679,47 @@ class AgentLauncher:
         if timeout is None:
             timeout = self._resolve_timeout(agent)
         time.sleep(timeout)
+        # ORCH-090: the SIGTERM->grace->SIGKILL cascade is now a reusable helper
+        # (stop_process) shared with the STOP-cancellation path. The timeout
+        # watchdog just sleeps the timeout, then drives the cascade.
+        logger.warning(
+            f"Agent run_id={run_id} exceeded {timeout}s timeout (pid={pid})"
+        )
+        self.stop_process(pid, run_id, reason=f"timeout>{timeout}s")
+
+    def stop_process(self, pid: int, run_id: int | None, *, reason: str = "stop") -> bool:
+        """ORCH-7 / ORCH-090 (ADR-001 D2): graceful SIGTERM->grace->SIGKILL cascade.
+        Extracted from ``_watchdog`` so the STOP-cancellation path
+        (``stage_engine.cancel_task``) stops an active agent through the SAME
+        graceful cascade instead of a new "dirty" kill (AC-1). Send SIGTERM, give
+        the process up to ``settings.agent_kill_grace_seconds`` to flush and exit,
+        SIGKILL only if it is still alive after the grace; stamp ``agent_runs``
+        exit_code=-9 via ``_record_kill`` whenever a kill actually happened.
+        never-raise; ``ProcessLookupError`` is tolerated at every step (the process
+        may already be gone). Returns True iff a SIGTERM was delivered to a live
+        process; False when the process was already gone (no record — the monitor's
+        ``proc.wait()`` owns that exit).
+        """
+        if pid is None:
+            return False

         # Phase 1: SIGTERM (graceful). If the process is already gone, we're done.
         try:
             os.kill(pid, signal.SIGTERM)
             logger.warning(
-                f"Agent run_id={run_id} exceeded {timeout}s timeout: sent SIGTERM "
-                f"(pid={pid}), grace={settings.agent_kill_grace_seconds}s"
+                f"stop_process ({reason}): sent SIGTERM to pid={pid} "
+                f"(run_id={run_id}), grace={settings.agent_kill_grace_seconds}s"
             )
         except ProcessLookupError:
-            logger.info(f"Agent run_id={run_id} already exited before SIGTERM")
-            return  # nothing to record: the monitor's proc.wait() owns the exit
+            logger.info(
+                f"stop_process ({reason}): pid={pid} already exited "
+                f"(run_id={run_id}); nothing to record"
+            )
+            return False
+        except Exception as e:  # noqa: BLE001 - never-raise
+            logger.warning(f"stop_process SIGTERM error pid={pid}: {e}")
+            return False

         # Phase 2: poll for graceful exit within the grace window.
         grace = settings.agent_kill_grace_seconds
@@ -702,21 +732,27 @@ class AgentLauncher:
                 os.kill(pid, 0)  # signal 0 = liveness probe, does not kill
             except ProcessLookupError:
                 logger.info(
-                    f"Agent run_id={run_id} exited gracefully after SIGTERM "
-                    f"({waited:.1f}s); no SIGKILL needed"
+                    f"stop_process ({reason}): pid={pid} exited gracefully after "
+                    f"SIGTERM ({waited:.1f}s); no SIGKILL needed"
                 )
                 self._record_kill(run_id)
-                return
+                return True
+            except Exception:  # noqa: BLE001 - probe error -> escalate to SIGKILL
+                break

         # Phase 3: still alive -> hard SIGKILL.
         try:
             os.kill(pid, signal.SIGKILL)
             logger.warning(
-                f"Agent run_id={run_id} did not exit within {grace}s grace: sent SIGKILL"
+                f"stop_process ({reason}): pid={pid} did not exit within {grace}s "
+                f"grace: sent SIGKILL"
             )
         except ProcessLookupError:
-            logger.info(f"Agent run_id={run_id} exited just before SIGKILL")
+            logger.info(f"stop_process ({reason}): pid={pid} exited just before SIGKILL")
+        except Exception as e:  # noqa: BLE001 - never-raise
+            logger.warning(f"stop_process SIGKILL error pid={pid}: {e}")
         self._record_kill(run_id)
+        return True

     @staticmethod
     def _record_kill(run_id: int):
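
For readers who want to try the SIGTERM->grace->SIGKILL cascade in isolation, here is a self-contained sketch. It is not the code from this commit: it swaps the `os.kill` polling loop for `subprocess.Popen.terminate()` / `wait(timeout=...)` / `kill()`, which works only when the caller owns the `Popen` handle, and `stop_gracefully` is a hypothetical name. POSIX signal semantics are assumed.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Stop a child process: SIGTERM first, SIGKILL only after the grace window.

    Returns "already-exited", "terminated" or "killed".
    """
    if proc.poll() is not None:
        return "already-exited"  # nothing to signal
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)  # give the child time to flush and exit
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # still alive after the grace window: SIGKILL
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

The diff polls with `os.kill(pid, 0)` instead, presumably because the watchdog holds only a pid and the monitor's `proc.wait()` owns the exit; the sketch above is the simpler variant for code that owns the process object.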