feat(watchdog): proc_blocking alert for orphaned long-lived test processes

Close the observability gap between agent_hung (which tracks only jobs by
jobs.pid) and the orphaned pytest subprocesses that the orchestrator launches
itself (merge_gate.retest_branch / coverage_gate.measure_coverage). When the
agent is timeout-killed (-9, ORCH-109), the grandchild pytest reparents onto tini
and keeps running for days, starving the CPU and failing the merge-gate re-test,
all without any alert.

Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
  read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
  /proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
  ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
  byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
  existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
  COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
  coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
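
A minimal sketch of the collector idea from the first bullet (pure parsers
split from I/O, a never-raise scan, testable on a fake /proc tree). The names
scan_proc, parse_cmdline, parse_starttime_ticks and the proc_root / clk_tck
parameters are hypothetical and are not taken from watchdog/collectors/proc.py;
substring matching on the command line is also an assumption.

```python
import os
from pathlib import Path


def parse_cmdline(raw: bytes) -> str:
    """Pure parser: /proc/<pid>/cmdline holds argv separated by NUL bytes."""
    return " ".join(p.decode("utf-8", "replace") for p in raw.split(b"\0") if p)


def parse_starttime_ticks(stat_text: str) -> int:
    """Pure parser: field 22 of /proc/<pid>/stat (start time, in clock ticks).

    Field 2 (comm) may contain spaces or ')', so split after the LAST ')'.
    """
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[19])  # rest[0] is field 3, so field 22 is rest[19]


def scan_proc(patterns, proc_root="/proc", clk_tck=None):
    """Return [{'pid', 'cmdline', 'age_s'}] for matching processes.

    Read-only and never-raise: any unexpected failure yields [].
    Only cmdline, stat and uptime are read; environ is never touched.
    """
    try:
        root = Path(proc_root)
        tck = clk_tck or os.sysconf("SC_CLK_TCK")
        uptime_s = float((root / "uptime").read_text().split()[0])
        found = []
        for entry in root.iterdir():
            if not entry.name.isdigit():
                continue  # not a PID directory
            try:
                cmdline = parse_cmdline((entry / "cmdline").read_bytes())
                if not any(p in cmdline for p in patterns):
                    continue
                start_s = parse_starttime_ticks((entry / "stat").read_text()) / tck
                found.append(
                    {
                        "pid": int(entry.name),
                        "cmdline": cmdline,
                        "age_s": max(0.0, uptime_s - start_s),
                    }
                )
            except (OSError, ValueError, IndexError):
                continue  # process exited mid-scan or is unreadable: skip it
        return found
    except Exception:  # never-raise: the caller just sees "no candidates"
        return []
```

Because the parsers take bytes or text rather than paths, they can be unit
tested without a real /proc, and scan_proc can be pointed at a temporary
directory that imitates the /proc layout.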

False-positive avoidance and non-overlap with agent_hung hold by construction
(cmdline scope + age threshold), rather than relying on fragile cross-namespace
PID matching.
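
The age-threshold half of that argument can be sketched as follows. This is an
illustration only: the Signal fields mirror the ones constructed in
watchdog/core.py, but proc_signals_sketch takes the threshold directly, whereas
the real proc_signals takes the config object, and the title/detail wording
here is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """Stand-in for watchdog.signals.Signal (fields as used in core.py)."""

    key: tuple
    active: bool
    title: str
    detail: str


def proc_signals_sketch(proc_age_s: float, candidates: list) -> list:
    """One Signal per observed candidate; active iff age_s > proc_age_s."""
    return [
        Signal(
            key=("proc_blocking", c["pid"]),  # per-entity key
            active=c["age_s"] > proc_age_s,
            title="blocking process",
            detail=f"PID {c['pid']} running for {int(c['age_s'])} s: {c['cmdline']}",
        )
        for c in candidates
    ]


# Default threshold from this commit: 60 minutes, which is above both
# merge_retest_timeout_s (600) and coverage_run_timeout_s (900).
PROC_AGE_S = 60 * 60
assert PROC_AGE_S > max(600, 900)
```

With that default, a coverage run that is still inside its 900 s timeout
produces an inactive signal, while a pytest orphaned for days produces an
active one.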

Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
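
For orientation, the WATCHDOG_PROC_* variables and their defaults could be
loaded roughly like this. It is a sketch, not the contents of
watchdog/config.py: load_proc_config, ProcConfig and proc_cooldown_s are
hypothetical names, and both the accepted truthy strings and the
comma-separated pattern format are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcConfig:
    proc_enabled: bool = False          # WATCHDOG_PROC_ENABLED
    proc_age_s: int = 60 * 60           # WATCHDOG_PROC_AGE_MIN, in seconds
    proc_patterns: tuple = ("pytest",)  # WATCHDOG_PROC_PATTERNS
    proc_cooldown_s: int = 1800         # WATCHDOG_PROC_COOLDOWN_S


def load_proc_config(env=None) -> ProcConfig:
    """Build the proc settings from an env mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    raw_enabled = env.get("WATCHDOG_PROC_ENABLED", "false")
    enabled = raw_enabled.strip().lower() in ("1", "true", "yes", "on")
    age_min = int(env.get("WATCHDOG_PROC_AGE_MIN", "60"))
    raw_patterns = env.get("WATCHDOG_PROC_PATTERNS", "pytest")
    patterns = tuple(p.strip() for p in raw_patterns.split(",") if p.strip())
    cooldown_s = int(env.get("WATCHDOG_PROC_COOLDOWN_S", "1800"))
    return ProcConfig(enabled, age_min * 60, patterns, cooldown_s)
```

With an empty environment this yields a disabled collector, so the tick
behaves as it did before ORCH-111.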

Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
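
The tick-level behaviour that an integration test of this kind would pin down
(ALERT once, then exactly one RECOVERY after the process vanishes) can be
illustrated with a toy model. Everything below is a stand-in: the real
decide() and AlertState also handle cooldown and re-alerts, which are omitted
here, and this is not the code of the tests listed above.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    key: tuple
    active: bool


@dataclass
class AlertState:
    alerting: bool = False


def decide(state: AlertState, sig: Signal):
    """Toy decide(): ALERT on inactive->active, RECOVERY on active->inactive."""
    if sig.active and not state.alerting:
        state.alerting = True
        return "ALERT"
    if not sig.active and state.alerting:
        state.alerting = False
        return "RECOVERY"
    return None


def tick(states: dict, observed: list) -> list:
    """Run one pass; `observed` holds Signals for currently visible processes."""
    sigs = list(observed)
    seen = {s.key for s in sigs}
    # A vanished process produces no fresh signal, so synthesise an inactive
    # one for every alerting proc_blocking key that is absent from this pass.
    sigs += [
        Signal(key, False)
        for key, state in list(states.items())
        if key[0] == "proc_blocking" and state.alerting and key not in seen
    ]
    dispatched = []
    for sig in sigs:
        action = decide(states.setdefault(sig.key, AlertState()), sig)
        if action:
            dispatched.append((action, sig.key))
    return dispatched
```

Feeding the same active signal twice yields one ALERT and then nothing; an
empty pass afterwards yields one RECOVERY, and further empty passes stay
silent because the state is no longer alerting.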

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 01:46:09 +03:00
committed by orchestrator-deployer
parent 7298f11064
commit 2e73ccf090
15 changed files with 948 additions and 2 deletions

@@ -19,6 +19,7 @@ from .collectors import containers as containers_mod
from .collectors import deps as deps_mod
from .collectors import host as host_mod
from .collectors import orch as orch_mod
from .collectors import proc as proc_mod
from .config import Config
from .notify import Notifier
from . import signals as signals_mod
@@ -93,6 +94,18 @@ class Watchdog:
            logger.warning("watchdog: deps collect error: %s", e)
            return {}

    def _collect_proc(self, now: float) -> list:
        # Opt-in: when WATCHDOG_PROC_ENABLED is false the scan is NOT called
        # (gate mirrors _collect_disk on disk_crit_enabled) -> zero overhead and
        # byte-for-byte tick behaviour as before ORCH-111 (D5 / AC-7).
        if not self.cfg.proc_enabled:
            return []
        try:
            return proc_mod.collect_candidates(self.cfg.proc_patterns, now=now)
        except Exception as e:  # noqa: BLE001 - never-raise: one signal skipped
            logger.warning("watchdog: proc collect error: %s", e)
            return []

    # -- one tick ---------------------------------------------------------
    def tick(self) -> list:
        """Run one full pass; returns the dispatched ``(action, Signal)`` list.
@@ -134,10 +147,53 @@ class Watchdog:
        # 4) external dependency pings
        built.extend(signals_mod.dep_signals(self._collect_deps()))

        # 5) long-lived blocking test/child processes (opt-in; pid: host /proc).
        # Gated entirely on proc_enabled so a disabled sidecar is byte-for-byte
        # as before ORCH-111 (D5/AC-7); RECOVERY for a vanished process is
        # synthesised through the SAME decide()/AlertState machinery (D4).
        if self.cfg.proc_enabled:
            proc_sigs = signals_mod.proc_signals(self.cfg, self._collect_proc(now))
            proc_sigs.extend(self._synthesize_proc_recoveries(proc_sigs))
            built.extend(proc_sigs)

        dispatched = self._dispatch(built, now)
        self.last_run_ts = now
        return dispatched

    def _synthesize_proc_recoveries(self, current_sigs: list) -> list:
        """Synthesise an inactive ``Signal`` for every vanished proc_blocking key.

        ``proc_signals`` emits a signal ONLY for a currently observed candidate,
        so a process that disappeared leaves an alerting :class:`AlertState` with
        no fresh signal and would never recover. Reusing ``decide()``/
        ``AlertState`` (FR-5: no separate anti-spam logic), we emit an
        ``active=False`` signal for each alerting ``("proc_blocking", …)`` key
        absent from the current set -> ``decide`` yields exactly one RECOVERY and
        clears the state. This is per-family bookkeeping, not new throttling.
        """
        out: list = []
        try:
            current_keys = {s.key for s in current_sigs}
            for key, state in list(self._states.items()):
                if (
                    isinstance(key, tuple)
                    and key
                    and key[0] == "proc_blocking"
                    and state.alerting
                    and key not in current_keys
                ):
                    out.append(
                        signals_mod.Signal(
                            key=key,
                            active=False,
                            title="Блокирующий процесс",
                            detail=f"процесс PID {key[1]} завершился",
                        )
                    )
        except Exception as e:  # noqa: BLE001 - never-raise: skip recovery synthesis
            logger.warning("watchdog: proc recovery synth error: %s", e)
        return out

    # -- decision + dispatch ----------------------------------------------
    def _dispatch(self, built: list, now: float) -> list:
        """Run each signal through ``decide`` and send alert/realert/recovery."""