feat(watchdog): proc_blocking alert for orphaned long-lived test processes

Close the observability gap between agent_hung (which tracks only jobs
by jobs.pid) and orphaned pytest subprocesses that the orchestrator
launches itself (merge_gate.retest_branch /
coverage_gate.measure_coverage). On a timeout-kill of the agent (-9,
ORCH-109) the grandchild pytest reparents onto tini and keeps running
for days, starving CPU and failing the merge-gate re-test, with no
alert.

Strictly inside the observer (watchdog/** + the watchdog compose
service):

- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
  read-only, never-raise -> []; pure parsers split from I/O (tested on a
  fake /proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
  ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU
  detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero
  overhead / byte-for-byte when off) + RECOVERY synthesis for a vanished
  process through the existing decide()/AlertState (no new anti-spam
  logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false), AGE_MIN(60),
  PATTERNS(pytest), COOLDOWN_S(1800)}; default threshold >
  max(merge_retest_timeout_s=600, coverage_run_timeout_s=900) so a legit
  in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only
  privilege).

Avoiding false positives and overlap with agent_hung is guaranteed by
construction (cmdline scope + age threshold), not by fragile
cross-namespace PID matching.

Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example
block; documented in LITE_SETUP.md and docs/architecture/README.md
(architect).

src/**, /metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*,
machine-verdict and the DB schema are untouched; the deploy rebuilds only
the sidecar, and the prod orchestrator is not restarted (NFR-3).

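The "pure parsers split from I/O" and "active iff age_s > proc_age_s"
points above could look roughly like the sketch below. It is not the
contents of watchdog/collectors/proc.py: Candidate, parse_cmdline,
parse_starttime_ticks, build_candidate and is_blocking are hypothetical
names, substring matching of patterns is an assumption, and clk_tck=100
is a common default (a real collector would more likely ask
os.sysconf("SC_CLK_TCK")). Only the /proc file layout itself is taken
from the Linux proc(5) documentation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one matching process."""
    pid: int
    cmdline: str
    age_s: float


def parse_cmdline(raw: bytes) -> str:
    """/proc/<pid>/cmdline is NUL-separated argv; join it with spaces."""
    return " ".join(p.decode("utf-8", "replace") for p in raw.split(b"\0") if p)


def parse_starttime_ticks(stat_text: str) -> Optional[int]:
    """Return field 22 (starttime, in clock ticks since boot) of /proc/<pid>/stat.

    Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    so we split after the LAST ')' instead of splitting the whole line.
    """
    end = stat_text.rfind(")")
    if end < 0:
        return None
    rest = stat_text[end + 1:].split()
    try:
        # rest[0] is field 3 (state), so field 22 sits at index 19.
        return int(rest[19])
    except (IndexError, ValueError):
        return None


def build_candidate(
    pid: int,
    cmdline_raw: bytes,
    stat_text: str,
    uptime_s: float,
    patterns: Iterable[str],
    clk_tck: int = 100,  # assumption; see lead-in
) -> Optional[Candidate]:
    """Pure function: bytes/text in, Candidate or None out. No file access."""
    cmdline = parse_cmdline(cmdline_raw)
    if not cmdline or not any(p in cmdline for p in patterns):
        return None  # out of cmdline scope
    ticks = parse_starttime_ticks(stat_text)
    if ticks is None:
        return None  # unparsable stat: skip instead of raising
    age_s = max(0.0, uptime_s - ticks / clk_tck)
    return Candidate(pid=pid, cmdline=cmdline, age_s=age_s)


def is_blocking(candidate: Candidate, proc_age_s: float) -> bool:
    """Active iff the process is strictly older than the threshold."""
    return candidate.age_s > proc_age_s
```

Because every function takes already-read bytes or text, such parsers
can be exercised against fixture strings without a live /proc, which is
the property the fake /proc tree tests rely on.
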
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing),
test_tick_proc_blocking_integration.py (TC-07), plus pid: host and
proc-config assertions. Full pytest tests/ green (1930).

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 01:46:09 +03:00
parent 7298f11064
commit 2e73ccf090
15 changed files with 948 additions and 2 deletions
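The RECOVERY synthesis in the watchdog/core.py hunk can be illustrated
in isolation with the toy model below. It is a sketch under stated
assumptions, not the project's code: Signal and AlertState are reduced
to the two fields the example needs, and decide() here is a hypothetical
edge-triggered stand-in that omits the cooldown and re-alert handling
the real decide() presumably has. What it demonstrates is the
bookkeeping idea only: a key that was alerting and is missing from the
current scan gets an active=False signal, which yields exactly one
RECOVERY.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Signal:
    key: tuple
    active: bool


@dataclass
class AlertState:
    alerting: bool = False


def decide(state: AlertState, sig: Signal) -> Optional[str]:
    """Hypothetical stand-in: fire only on state edges (no cooldown)."""
    if sig.active and not state.alerting:
        state.alerting = True
        return "ALERT"
    if not sig.active and state.alerting:
        state.alerting = False
        return "RECOVERY"
    return None


def synthesize_recoveries(
    states: Dict[tuple, AlertState], current: List[Signal]
) -> List[Signal]:
    """Emit an inactive signal for each alerting proc_blocking key that vanished."""
    current_keys = {s.key for s in current}
    return [
        Signal(key=key, active=False)
        for key, st in states.items()
        if isinstance(key, tuple)
        and key
        and key[0] == "proc_blocking"
        and st.alerting
        and key not in current_keys
    ]


def run_tick(
    states: Dict[tuple, AlertState], observed: List[Signal]
) -> List[Tuple[str, tuple]]:
    """One simplified tick: observed signals plus synthesised recoveries."""
    sigs = list(observed) + synthesize_recoveries(states, observed)
    actions = []
    for sig in sigs:
        action = decide(states.setdefault(sig.key, AlertState()), sig)
        if action:
            actions.append((action, sig.key))
    return actions


if __name__ == "__main__":
    states: Dict[tuple, AlertState] = {}
    key = ("proc_blocking", 4242)
    print(run_tick(states, [Signal(key, True)]))  # process seen, over threshold
    print(run_tick(states, []))                   # process gone from the scan
    print(run_tick(states, []))                   # nothing left to recover
```

Routing the vanished key through the same decision function is what
keeps this free of a second anti-spam path: once the state stops
alerting, later ticks synthesise nothing for that key.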
--- a/watchdog/core.py
+++ b/watchdog/core.py
@@ -19,6 +19,7 @@ from .collectors import containers as containers_mod
 from .collectors import deps as deps_mod
 from .collectors import host as host_mod
 from .collectors import orch as orch_mod
+from .collectors import proc as proc_mod
 from .config import Config
 from .notify import Notifier
 from . import signals as signals_mod
@@ -93,6 +94,18 @@ class Watchdog:
            logger.warning("watchdog: deps collect error: %s", e)
            return {}

+    def _collect_proc(self, now: float) -> list:
+        # Opt-in: when WATCHDOG_PROC_ENABLED is false the scan is NOT called
+        # (gate mirrors _collect_disk on disk_crit_enabled) -> zero overhead and
+        # byte-for-byte tick behaviour as before ORCH-111 (D5 / AC-7).
+        if not self.cfg.proc_enabled:
+            return []
+        try:
+            return proc_mod.collect_candidates(self.cfg.proc_patterns, now=now)
+        except Exception as e:  # noqa: BLE001 - never-raise: one signal skipped
+            logger.warning("watchdog: proc collect error: %s", e)
+            return []
+
    # -- one tick ---------------------------------------------------------
    def tick(self) -> list:
        """Run one full pass; returns the dispatched ``(action, Signal)`` list.
@@ -134,10 +147,53 @@ class Watchdog:
        # 4) external dependency pings
        built.extend(signals_mod.dep_signals(self._collect_deps()))

+        # 5) long-lived blocking test/child processes (opt-in; pid: host /proc).
+        # Gated entirely on proc_enabled so a disabled sidecar is byte-for-byte
+        # as before ORCH-111 (D5/AC-7); RECOVERY for a vanished process is
+        # synthesised through the SAME decide()/AlertState machinery (D4).
+        if self.cfg.proc_enabled:
+            proc_sigs = signals_mod.proc_signals(self.cfg, self._collect_proc(now))
+            proc_sigs.extend(self._synthesize_proc_recoveries(proc_sigs))
+            built.extend(proc_sigs)
+
        dispatched = self._dispatch(built, now)
        self.last_run_ts = now
        return dispatched

+    def _synthesize_proc_recoveries(self, current_sigs: list) -> list:
+        """Synthesise an inactive ``Signal`` for every vanished proc_blocking key.
+
+        ``proc_signals`` emits a signal ONLY for a currently observed candidate,
+        so a process that disappeared leaves an alerting :class:`AlertState` with
+        no fresh signal and would never recover. Reusing ``decide()``/
+        ``AlertState`` (FR-5 — no separate anti-spam logic), we emit an
+        ``active=False`` signal for each alerting ``("proc_blocking", …)`` key
+        absent from the current set -> ``decide`` yields exactly one RECOVERY and
+        clears the state. This is per-family bookkeeping, not new throttling.
+        """
+        out: list = []
+        try:
+            current_keys = {s.key for s in current_sigs}
+            for key, state in list(self._states.items()):
+                if (
+                    isinstance(key, tuple)
+                    and key
+                    and key[0] == "proc_blocking"
+                    and state.alerting
+                    and key not in current_keys
+                ):
+                    out.append(
+                        signals_mod.Signal(
+                            key=key,
+                            active=False,
+                            title="Блокирующий процесс",
+                            detail=f"процесс PID {key[1]} завершился",
+                        )
+                    )
+        except Exception as e:  # noqa: BLE001 - never-raise: skip recovery synthesis
+            logger.warning("watchdog: proc recovery synth error: %s", e)
+        return out
+
    # -- decision + dispatch ----------------------------------------------
    def _dispatch(self, built: list, now: float) -> list:
        """Run each signal through ``decide`` and send alert/realert/recovery."""