ORCH-111: watchdog proc_blocking alert on long-lived orphaned test processes #130
## ORCH-111: Watchdog `proc_blocking`: alert on long-lived orphaned test processes

Closes the observability blind spot where orphaned `pytest` subprocesses (launched by the orchestrator itself in `merge_gate.retest_branch` / `coverage_gate.measure_coverage`) survive a timeout-kill of the agent (-9, ORCH-109), reparent onto tini, and live for days, starving CPU and failing the merge-gate re-test without raising any alert (the established incident: `test_install_lite_script.py` processes lived >2 days unnoticed).

## What changed (strictly inside the observer)
- `watchdog/collectors/proc.py` (new, D3): stdlib-only `/proc` scan (under `pid: host` the container `/proc` reflects the host PID namespace). Reads `/proc/stat` `btime` + `SC_CLK_TCK`, iterates numeric `/proc/<pid>`, matches `cmdline` by test-class pattern, parses `/proc/<pid>/stat` (field 22 `starttime` → `age_s`; 14+15 → `cpu_s`, informational). Read-only (no `os.kill`/signals/`subprocess`; never reads `/proc/<pid>/environ`), never-raise (per-pid race skipped; top-level → `[]`). Pure parsers split from I/O.
- `watchdog/signals.py` (D4): pure `proc_signals` builder, per-entity `Signal("proc_blocking", pid)`, `active ⇔ age_s > proc_age_s`; actionable RU detail (PID + age + truncated cmdline + CPU time).
- `watchdog/core.py` (D4/D7): opt-in `tick()` block gated on `proc_enabled` (disabled → zero overhead, byte-for-byte as before) + RECOVERY synthesis for a vanished process reusing the existing `decision.decide()` / `AlertState` (no new anti-spam logic, FR-5).
- `watchdog/config.py` (D5): `WATCHDOG_PROC_ENABLED` (default false, opt-in) / `WATCHDOG_PROC_AGE_MIN` (60) / `WATCHDOG_PROC_PATTERNS` (pytest) / `WATCHDOG_PROC_COOLDOWN_S` (1800), never-raise parsers. Default threshold (3600s) exceeds `max(merge_retest_timeout_s=600, coverage_run_timeout_s=900)` so a legit in-flight run never crosses it.
- `docker-compose.yml` (D6): `pid: host` on `orchestrator-watchdog` only (read-only privilege; not a volume → read-only-mounts invariant intact).

## Invariants (AC-9)
`src/**`, `/metrics`, `schema_version`, `STAGE_TRANSITIONS`, `QG_CHECKS`, `check_*`, machine-verdict keys and the DB schema are byte-for-byte untouched. Deploy rebuilds only the sidecar; prod `orchestrator` is not restarted (NFR-3). Anti-false-positive and no overlap with `agent_hung` are by construction (cmdline scope + age threshold), not fragile cross-namespace PID matching.

## Canon (AC-10 / NFR-5)
`WATCHDOG_PROC_*` synced `.env.watchdog.example` ↔ `.env.example` block (key-sync test green); documented in `docs/deployment/LITE_SETUP.md` §4 and `docs/architecture/README.md`.

## Tests
- `tests/watchdog/test_proc_blocking_signal.py`: TC-01…TC-06 (builder, anti-FP, config/kill-switch, never-raise/read-only AST scan, anti-spam+recovery cycle, no-dup-with-`agent_hung`).
- `tests/watchdog/test_proc_collector.py`: `/proc` parsing fixtures (btime/pid-stat/cmdline/age/cpu/filtering/race).
- `tests/watchdog/test_tick_proc_blocking_integration.py`: TC-07 tick→dispatch + kill-switch-off + in-budget + collector-explodes.
- `tests/watchdog/test_compose_service.py`: positive `pid: host` (and prod must NOT have it); `test_config_killswitch.py`: proc-config defaults/env.

`pytest tests/ -q` → 1930 passed.

ADR: `docs/work-items/ORCH-111/06-adr/ADR-001-watchdog-orphan-test-process-alert.md`, cross-cutting `docs/architecture/adr/adr-0041-watchdog-orphan-test-process-alert.md`.

Refs: ORCH-111
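For reviewers unfamiliar with the `/proc` arithmetic behind `age_s` and `cpu_s`, here is a minimal sketch of how pure parsers of this kind could look. It is not the code in `watchdog/collectors/proc.py`: the function names `parse_btime`, `parse_pid_stat` and `age_and_cpu` are hypothetical, and only the field numbers (22 for `starttime`, 14+15 for CPU time) and the use of `btime` and the clock-tick rate come from the description above.

```python
from typing import Optional, Tuple


def parse_btime(proc_stat_text: str) -> Optional[int]:
    """Return the boot time (epoch seconds) from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "btime" and parts[1].isdigit():
            return int(parts[1])
    return None


def parse_pid_stat(stat_text: str) -> Optional[Tuple[int, int, int]]:
    """Return (utime, stime, starttime) in clock ticks from /proc/<pid>/stat.

    Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    so split on the LAST ')' and count the remaining fields from field 3.
    """
    end = stat_text.rfind(")")
    if end == -1:
        return None
    rest = stat_text[end + 1:].split()
    try:
        # rest[0] is field 3 (state), so field N lives at rest[N - 3].
        return int(rest[14 - 3]), int(rest[15 - 3]), int(rest[22 - 3])
    except (IndexError, ValueError):
        return None  # malformed or truncated line: skip, never raise


def age_and_cpu(stat_text: str, btime: int, clk_tck: int,
                now: float) -> Optional[Tuple[float, float]]:
    """Return (age_s, cpu_s) for one process, or None if unparsable."""
    parsed = parse_pid_stat(stat_text)
    if parsed is None or clk_tck <= 0:
        return None
    utime, stime, starttime = parsed
    start_epoch = btime + starttime / clk_tck  # starttime is ticks since boot
    return max(0.0, now - start_epoch), (utime + stime) / clk_tck
```

In a real collector the inputs would come from reading `/proc/stat` and `/proc/<pid>/stat`, with `clk_tck` taken from `os.sysconf("SC_CLK_TCK")` and `now` from `time.time()`. Keeping the parsers free of I/O is what allows them to be exercised against a fake `/proc` tree.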
🤖 Generated with Claude Code
Close the observability gap between agent_hung (only tracked jobs by jobs.pid) and orphaned pytest subprocesses the orchestrator launches itself (merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps running for days, starving CPU and failing merge-gate re-test, with no alert.

Strictly inside the observer (watchdog/** + the watchdog compose service):

- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host), read-only, never-raise -> []; pure parsers split from I/O (tested on a fake /proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead / byte-for-byte when off) + RECOVERY synthesis for a vanished process through the existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600, coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).

Anti-false-positive and no overlap with agent_hung are by construction (cmdline scope + age threshold), not fragile cross-namespace PID matching.

Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block; documented in LITE_SETUP.md and docs/architecture/README.md (architect).

src/**, /metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict and the DB schema are untouched; deploy rebuilds only the sidecar, prod orchestrator is not restarted (NFR-3).
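The activation rule above (one per-entity signal per matched process, active iff age_s > proc_age_s) and the default-threshold argument can be illustrated with a short sketch. This is an assumption-laden illustration, not the code in watchdog/signals.py: the ProcInfo and Signal dataclasses, their field names, the max_cmdline parameter and the English detail string are all hypothetical. Only the signal kind, the strict comparison, and the numbers 60 min, 600 s and 900 s come from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical collector record; field names are assumptions."""
    pid: int
    age_s: float
    cpu_s: float
    cmdline: str


@dataclass(frozen=True)
class Signal:
    """Hypothetical signal shape; (kind, entity) identifies one alert."""
    kind: str
    entity: int
    active: bool
    detail: str


def proc_signals(procs: Iterable[ProcInfo], proc_age_s: float,
                 max_cmdline: int = 80) -> List[Signal]:
    """Pure builder: one proc_blocking signal per matched process."""
    out = []
    for p in procs:
        cmd = p.cmdline
        if len(cmd) > max_cmdline:
            cmd = cmd[:max_cmdline] + "..."  # keep the alert text short
        detail = f"PID {p.pid}, age {p.age_s / 3600:.1f} h, CPU {p.cpu_s:.0f} s: {cmd}"
        # Strict '>' so a process exactly at the threshold is not yet active.
        out.append(Signal("proc_blocking", p.pid, p.age_s > proc_age_s, detail))
    return out


# WATCHDOG_PROC_AGE_MIN defaults to 60 minutes, i.e. 3600 s.
DEFAULT_PROC_AGE_S = 60 * 60
# The default must exceed the longest legitimate run quoted above
# (merge_retest_timeout_s=600, coverage_run_timeout_s=900).
assert DEFAULT_PROC_AGE_S > max(600, 900)
```

Because the builder takes plain records and returns plain records, it can be tested without touching /proc, and the existing decide()/AlertState path can consume its output unchanged.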
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06), test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py (TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

7fe552909d to 1fbfb941a9