feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090)

Introduce the dedicated Plane STOP status as a single declarative task-cancel mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs (terminal `cancelled`, never requeued), remove the worktree + delete the remote feature branch (never main, never force-push), drive the task to the new system-terminal state `cancelled` and tombstone the natural keys so a later "To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a critical merge/deploy window is deferred until the irreversible step finishes honestly.

Also closes the relaunch hole: handle_status_start relaunch is gated to the `analysis` stage; the only pipeline-start entry point remains "To Analyse".

Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} -> {done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged (`cancelled` is a sink, not a new edge).

Additive, never-raise, restart-safe, under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).

New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example, CHANGELOG).

ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup ORs on it); the original UUID is recoverable from the parseable suffix.

Refs: ORCH-090
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
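
For illustration only, the widened "task terminal" predicate described in the message could look roughly like the sketch below. The names `TERMINAL_STATUSES`, `is_task_terminal` and `may_requeue` are hypothetical and do not come from this commit; the real checks live in serial_gate / task_deps / stages and the reaper/worker guards, and may be shaped differently.

```python
# Hypothetical sketch, not the code from this commit.
# Before ORCH-090 the terminal set was {"done"}; it now also contains "cancelled".
TERMINAL_STATUSES = frozenset({"done", "cancelled"})


def is_task_terminal(status: str) -> bool:
    """Return True when the task is in a sink state and needs no further work."""
    return status in TERMINAL_STATUSES


def may_requeue(status: str) -> bool:
    """Requeue-guard sketch: jobs of a terminal task are never requeued."""
    return not is_task_terminal(status)
```

Because `cancelled` is only added to the set of sinks, nothing in this sketch introduces a new stage transition, which matches the note that STAGE_TRANSITIONS is unchanged.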
2026-06-09 21:01:57 +03:00
parent ab083ba826
commit ebbf2e7a2d
27 changed files with 1394 additions and 38 deletions
--- a/src/agents/launcher.py
+++ b/src/agents/launcher.py
@@ -679,17 +679,47 @@ class AgentLauncher:
        if timeout is None:
            timeout = self._resolve_timeout(agent)
        time.sleep(timeout)
+        # ORCH-090: the SIGTERM->grace->SIGKILL cascade is now a reusable helper
+        # (stop_process) shared with the STOP-cancellation path. The timeout
+        # watchdog just sleeps the timeout, then drives the cascade.
+        logger.warning(
+            f"Agent run_id={run_id} exceeded {timeout}s timeout (pid={pid})"
+        )
+        self.stop_process(pid, run_id, reason=f"timeout>{timeout}s")

+    def stop_process(self, pid: int | None, run_id: int | None, *, reason: str = "stop") -> bool:
+        """ORCH-7 / ORCH-090 (ADR-001 D2): graceful SIGTERM->grace->SIGKILL cascade.
+
+        Extracted from ``_watchdog`` so the STOP-cancellation path
+        (``stage_engine.cancel_task``) stops an active agent through the SAME
+        graceful cascade instead of a new "dirty" kill (AC-1). Send SIGTERM, give
+        the process up to ``settings.agent_kill_grace_seconds`` to flush and exit,
+        SIGKILL only if it is still alive after the grace; stamp ``agent_runs``
+        exit_code=-9 via ``_record_kill`` whenever a kill actually happened.
+
+        Never raises; ``ProcessLookupError`` is tolerated at every step (the process
+        may already be gone). Returns True iff a SIGTERM was delivered to a live
+        process; False when ``pid`` is None, when the process was already gone (no
+        record: the monitor's ``proc.wait()`` owns that exit), or when sending
+        SIGTERM failed unexpectedly.
+        """
+        if pid is None:
+            return False
        # Phase 1: SIGTERM (graceful). If the process is already gone, we're done.
        try:
            os.kill(pid, signal.SIGTERM)
            logger.warning(
-                f"Agent run_id={run_id} exceeded {timeout}s timeout: sent SIGTERM "
-                f"(pid={pid}), grace={settings.agent_kill_grace_seconds}s"
+                f"stop_process ({reason}): sent SIGTERM to pid={pid} "
+                f"(run_id={run_id}), grace={settings.agent_kill_grace_seconds}s"
            )
        except ProcessLookupError:
-            logger.info(f"Agent run_id={run_id} already exited before SIGTERM")
-            return  # nothing to record: the monitor's proc.wait() owns the exit
+            logger.info(
+                f"stop_process ({reason}): pid={pid} already exited "
+                f"(run_id={run_id}); nothing to record"
+            )
+            return False
+        except Exception as e:  # noqa: BLE001 - never-raise
+            logger.warning(f"stop_process SIGTERM error pid={pid}: {e}")
+            return False

        # Phase 2: poll for graceful exit within the grace window.
        grace = settings.agent_kill_grace_seconds
@@ -702,21 +732,27 @@ class AgentLauncher:
                os.kill(pid, 0)  # signal 0 = liveness probe, does not kill
            except ProcessLookupError:
                logger.info(
-                    f"Agent run_id={run_id} exited gracefully after SIGTERM "
-                    f"({waited:.1f}s); no SIGKILL needed"
+                    f"stop_process ({reason}): pid={pid} exited gracefully after "
+                    f"SIGTERM ({waited:.1f}s); no SIGKILL needed"
                )
                self._record_kill(run_id)
-                return
+                return True
+            except Exception:  # noqa: BLE001 - probe error -> escalate to SIGKILL
+                break

        # Phase 3: still alive -> hard SIGKILL.
        try:
            os.kill(pid, signal.SIGKILL)
            logger.warning(
-                f"Agent run_id={run_id} did not exit within {grace}s grace: sent SIGKILL"
+                f"stop_process ({reason}): pid={pid} did not exit within {grace}s "
+                f"grace: sent SIGKILL"
            )
        except ProcessLookupError:
-            logger.info(f"Agent run_id={run_id} exited just before SIGKILL")
+            logger.info(f"stop_process ({reason}): pid={pid} exited just before SIGKILL")
+        except Exception as e:  # noqa: BLE001 - never-raise
+            logger.warning(f"stop_process SIGKILL error pid={pid}: {e}")
        self._record_kill(run_id)
+        return True

    @staticmethod
    def _record_kill(run_id: int):