Close the observability gap between agent_hung (which tracks jobs only by
jobs.pid) and orphaned pytest subprocesses that the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). When the agent is
killed on timeout (-9, ORCH-109), the grandchild pytest is reparented to tini and
keeps running for days, starving the CPU and failing the merge-gate re-test, all
with no alert.
All changes stay strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
/proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
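The "by construction" rule above can be sketched as a pure function. This is a minimal illustration, not the code in watchdog/signals.py: the ProcInfo and Signal types, their field names, and the English detail text (standing in for the RU detail) are all assumptions made for the example. Only the entity key ("proc_blocking", pid), the cmdline pattern scope, and the strict age_s > proc_age_s test come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class ProcInfo:
    """Hypothetical record for one scanned process (not the real collector type)."""
    pid: int
    cmdline: str
    age_s: float


@dataclass(frozen=True)
class Signal:
    """Hypothetical signal shape: entity key, active flag, human-readable detail."""
    entity: Tuple[str, int]
    active: bool
    detail: str


def proc_signals(
    procs: Iterable[ProcInfo], patterns: Sequence[str], proc_age_s: float
) -> List[Signal]:
    """Build one signal per process whose cmdline matches any pattern.

    A signal is active only when the process is strictly older than the
    threshold, so an in-flight run below the threshold never alerts.
    """
    signals = []
    for p in procs:
        # Scope by cmdline: processes outside the patterns are ignored entirely.
        if not any(pat in p.cmdline for pat in patterns):
            continue
        signals.append(
            Signal(
                entity=("proc_blocking", p.pid),
                active=p.age_s > proc_age_s,
                detail=f"pid {p.pid} has been running for {int(p.age_s)}s: {p.cmdline}",
            )
        )
    return signals


# Default threshold from the config bullet: 60 minutes, expressed in seconds.
DEFAULT_PROC_AGE_S = 60 * 60
```

With the defaults, the threshold of 3600 s sits above both merge_retest_timeout_s=600 and coverage_run_timeout_s=900, so a legitimate run is killed by its own timeout long before it could cross it.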
Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
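For illustration, the "pure parsers split from I/O, tested on a fake /proc tree" idea could look roughly like the sketch below. It is not the code in watchdog/collectors/proc.py: the function names, the tuple return shape, and the fixed clk_tck default of 100 are assumptions for the example (a real collector would more likely read os.sysconf("SC_CLK_TCK")). The layout of /proc/<pid>/stat, /proc/<pid>/cmdline and /proc/uptime follows the Linux proc(5) manual page.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def parse_cmdline(raw: bytes) -> str:
    """Turn NUL-separated /proc/<pid>/cmdline bytes into a space-joined string."""
    parts = [p for p in raw.split(b"\0") if p]
    return b" ".join(parts).decode("utf-8", errors="replace")


def parse_starttime_ticks(stat_text: str) -> int:
    """Return field 22 (starttime, clock ticks since boot) of /proc/<pid>/stat."""
    # comm (field 2) may contain spaces or ')', so split after the LAST ')'.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    return int(rest[19])  # rest[0] is field 3 (state), so field 22 is rest[19]


def scan_proc(root: Path, clk_tck: int = 100) -> List[Tuple[int, str, float]]:
    """Read-only scan returning (pid, cmdline, age_s); any failure yields []."""
    try:
        uptime_s = float((root / "uptime").read_text().split()[0])
        found = []
        for entry in root.iterdir():
            if not entry.name.isdigit():
                continue
            try:
                cmd = parse_cmdline((entry / "cmdline").read_bytes())
                ticks = parse_starttime_ticks((entry / "stat").read_text())
            except (OSError, ValueError, IndexError):
                continue  # process vanished mid-scan or is unparsable: skip it
            found.append((int(entry.name), cmd, uptime_s - ticks / clk_tck))
        return sorted(found)
    except Exception:
        return []  # never raise: the watchdog tick must not fail on a bad /proc


def build_fake_proc(root: Path) -> None:
    """Write a tiny fake /proc tree with one pytest process (pid 4242)."""
    (root / "uptime").write_text("10000.00 5000.00\n")
    pid_dir = root / "4242"
    pid_dir.mkdir()
    (pid_dir / "cmdline").write_bytes(b"python\0-m\0pytest\0tests/\0")
    # Fields 3..22: state, 18 placeholder fields, then starttime = 500000 ticks.
    fields = ["S"] + ["0"] * 18 + ["500000"]
    (pid_dir / "stat").write_text("4242 (py test) " + " ".join(fields) + "\n")
```

Because the parsers take bytes or text rather than paths, they can be unit-tested without touching the real /proc, and scan_proc only needs a directory that looks like one. The sketch also never opens environ, matching the constraint stated in the collector bullet.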
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
78 lines
2.8 KiB
Python
"""TC-12: compose invariant — orchestrator-watchdog is a separate service.

It declares its own build (watchdog/Dockerfile), restart policy, mem_limit, and
mounts docker.sock read-only (:ro). Parses the real docker-compose.yml.
"""
import pathlib

import yaml

REPO_ROOT = pathlib.Path(__file__).resolve().parents[2]


def _compose():
    with open(REPO_ROOT / "docker-compose.yml") as f:
        return yaml.safe_load(f)


def test_watchdog_service_declared():
    svc = _compose()["services"]
    assert "orchestrator-watchdog" in svc


def test_watchdog_builds_from_watchdog_dockerfile():
    wd = _compose()["services"]["orchestrator-watchdog"]
    build = wd["build"]
    assert isinstance(build, dict)
    assert build["dockerfile"] == "watchdog/Dockerfile"
    assert build["context"] == "."


def test_watchdog_has_restart_and_mem_limit():
    wd = _compose()["services"]["orchestrator-watchdog"]
    assert wd["restart"] == "unless-stopped"
    assert wd["mem_limit"] == "128m"  # thin stack, not Grafana/Prometheus


def test_docker_sock_mounted_read_only():
    wd = _compose()["services"]["orchestrator-watchdog"]
    sock = [v for v in wd["volumes"] if "docker.sock" in v]
    assert sock, "docker.sock must be mounted"
    assert all(v.endswith(":ro") for v in sock), "docker.sock must be :ro"


def test_host_paths_mounted_read_only():
    wd = _compose()["services"]["orchestrator-watchdog"]
    # Every bind mount the watchdog uses is read-only (it only reads).
    for v in wd["volumes"]:
        assert v.endswith(":ro"), f"watchdog mount must be :ro: {v}"


def test_watchdog_shares_host_pid_namespace():
    # ORCH-111 (adr-0041 D6): the sidecar shares the host PID-namespace so its
    # /proc reflects the host (proc_blocking collector). `pid: host` is NOT a
    # volume, so the read-only-mounts invariant above is unaffected.
    wd = _compose()["services"]["orchestrator-watchdog"]
    assert wd.get("pid") == "host", "orchestrator-watchdog must declare `pid: host`"
    # The privilege stays on the OBSERVER only; prod orchestrator must NOT get it.
    orch = _compose()["services"]["orchestrator"]
    assert "pid" not in orch, "the prod orchestrator service must not share the host PID-namespace"


def test_env_file_is_optional():
    # A missing .env.watchdog must not break `docker compose up` (self-hosting).
    wd = _compose()["services"]["orchestrator-watchdog"]
    env_file = wd["env_file"]
    assert isinstance(env_file, list)
    assert env_file[0]["required"] is False


def test_watchdog_dockerfile_exists_and_is_stdlib_only():
    df = REPO_ROOT / "watchdog" / "Dockerfile"
    assert df.exists()
    text = df.read_text()
    # No pip install of third-party deps (stdlib-only, D1).
    assert "pip install" not in text
    assert "COPY requirements" not in text
    assert "requirements.txt" not in text