feat(watchdog): proc_blocking alert for orphaned long-lived test processes
Close the observability gap between agent_hung (only tracked jobs by jobs.pid)
and orphaned pytest subprocesses the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps
running for days, starving CPU and failing merge-gate re-test — with no alert.
Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
/proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
10
.env.example
10
.env.example
@@ -569,6 +569,12 @@ ORCH_QG0_TITLE_MAX=200
|
||||
# CONTAINERS -> CSV of container names to watch (status != running/healthy).
|
||||
# DOCKER_SOCK -> path to the read-only docker.sock inside the container.
|
||||
# DEPS -> CSV of name=url dependency pings (empty -> no pings).
|
||||
# PROC_ENABLED -> ORCH-111 opt-in: alert on a long-lived test process (pytest)
|
||||
# orphaned on the host (needs `pid: host`, default OFF).
|
||||
# PROC_AGE_MIN -> minutes a test process may live before alerting; MUST exceed
|
||||
# max(merge_retest_timeout_s, coverage_run_timeout_s)/60.
|
||||
# PROC_PATTERNS -> CSV of cmdline substrings that mark the test-class (pytest).
|
||||
# PROC_COOLDOWN_S-> per-signal re-alert throttle for proc_blocking.
|
||||
# TG_BOT_TOKEN / TG_CHAT_ID -> the sidecar's OWN Telegram bot/chat (independent
|
||||
# of the orchestrator's; absent -> logs, does not send).
|
||||
WATCHDOG_ENABLED=true
|
||||
@@ -588,5 +594,9 @@ WATCHDOG_QUEUE_DEPTH=20
|
||||
WATCHDOG_CONTAINERS=orchestrator
|
||||
WATCHDOG_DOCKER_SOCK=/var/run/docker.sock
|
||||
WATCHDOG_DEPS=
|
||||
WATCHDOG_PROC_ENABLED=false
|
||||
WATCHDOG_PROC_AGE_MIN=60
|
||||
WATCHDOG_PROC_PATTERNS=pytest
|
||||
WATCHDOG_PROC_COOLDOWN_S=1800
|
||||
WATCHDOG_TG_BOT_TOKEN=
|
||||
WATCHDOG_TG_CHAT_ID=
|
||||
|
||||
Reference in New Issue
Block a user