Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service: the brain half of the domain-0 observability pair.

F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b reads it, augments it with host / container / dependency probes, runs each signal through a generalised pure decision function (`decide(signal_active, prev, now, cooldown)`, a strict superset of `disk_watchdog.decide_action`) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks, so no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control; self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB schema are untouched.

env_file is optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
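The commit message names a pure decision function `decide(signal_active, prev, now, cooldown)` with per-signal dedup/throttle/recovery but does not show it. The following is only a minimal sketch of how such a function could look, not the actual implementation: the `SignalState` type, the action strings (`"alert"`, `"recover"`, `"none"`), and the exact re-alert rule (repeat an active alert once `cooldown` seconds have passed) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal state kept in memory between ticks."""
    alerting: bool = False
    last_sent: Optional[float] = None


def decide(signal_active: bool, prev: SignalState, now: float,
           cooldown: float) -> Tuple[str, SignalState]:
    """Pure function: no I/O, returns (action, new_state).

    Assumed semantics (not taken from the source):
      - dedup:    an active signal already alerted does not alert again
      - throttle: ...unless `cooldown` seconds have elapsed since the last send
      - recovery: an alerting signal that goes inactive emits one "recover"
    """
    if signal_active:
        first_time = not prev.alerting
        cooldown_elapsed = (
            prev.last_sent is None or now - prev.last_sent >= cooldown
        )
        if first_time or cooldown_elapsed:
            return "alert", SignalState(alerting=True, last_sent=now)
        return "none", prev
    if prev.alerting:
        return "recover", SignalState(alerting=False, last_sent=None)
    return "none", prev
```

Because the function takes `now` as an argument and returns the next state instead of mutating anything, a caller can drive it tick by tick in a test without mocking the clock.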
43 lines
1.8 KiB
Python
"""TC-13: anti-duplicate disk alert (coordinated with ORCH-063 / disk_watchdog).

ADR-001 D6: disk_watchdog (ORCH-063) is the SOLE owner of the 85% disk alert via
the orchestrator's Telegram. The sidecar carries NO disk alert by default
(``WATCHDOG_DISK_CRIT_ENABLED=false``) -> structurally zero double-alert. The
sidecar's contribution is an OPT-IN independent ceiling at a HIGHER threshold
(a different event, separate channel).
"""
from watchdog.config import Config
from watchdog.signals import host_signals


def _cfg(**kw):
    return Config.from_env(kw)


def test_disk_signal_absent_by_default():
    # Disk full at 90% -> sidecar produces NO disk signal (disk_watchdog owns it).
    cfg = _cfg()
    assert cfg.disk_crit_enabled is False
    sigs = host_signals(cfg, mem_pct=None, disk=("/repos", 90.0))
    assert [s for s in sigs if s.key == "host_disk_crit"] == []


def test_opt_in_ceiling_is_separate_higher_event():
    cfg = _cfg(WATCHDOG_DISK_CRIT_ENABLED="true", WATCHDOG_DISK_CRIT_PCT="97")
    # Below the ceiling (90% < 97%) -> not active even when opted in (no 85% dup).
    below = host_signals(cfg, mem_pct=None, disk=("/repos", 90.0))
    crit_below = [s for s in below if s.key == "host_disk_crit"]
    assert len(crit_below) == 1 and crit_below[0].active is False

    # At/over the high ceiling -> active (a DIFFERENT event from disk_watchdog 85%).
    over = host_signals(cfg, mem_pct=None, disk=("/repos", 98.0))
    crit_over = [s for s in over if s.key == "host_disk_crit"]
    assert len(crit_over) == 1 and crit_over[0].active is True


def test_mem_signal_independent_of_disk():
    cfg = _cfg(WATCHDOG_MEM_PCT="90")
    sigs = host_signals(cfg, mem_pct=95.0, disk=None)
    mem = [s for s in sigs if s.key == "host_mem"]
    assert len(mem) == 1 and mem[0].active is True
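
The tests above pin down the observable behaviour of `host_signals` without showing its body. As a rough illustration of the D6 rule they encode, here is a self-contained sketch that would satisfy the same three cases. It is not the code in `watchdog/signals.py`: `SketchConfig`, `Signal`, `sketch_host_signals`, the `detail` field, the default values, and the `>=` comparison are all hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Signal:
    """Hypothetical signal record; the tests only rely on .key and .active."""
    key: str
    active: bool
    detail: str = ""


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for the real Config (defaults are assumptions)."""
    mem_pct: float = 90.0
    disk_crit_enabled: bool = False  # opt-in: off means no disk signal at all
    disk_crit_pct: float = 97.0      # deliberately above disk_watchdog's 85%


def sketch_host_signals(
    cfg: SketchConfig,
    mem_pct: Optional[float],
    disk: Optional[Tuple[str, float]],
) -> List[Signal]:
    """Build host-level signals from already-collected readings (no I/O)."""
    sigs: List[Signal] = []
    if mem_pct is not None:
        sigs.append(
            Signal("host_mem", mem_pct >= cfg.mem_pct, f"mem {mem_pct:.1f}%")
        )
    # The disk signal exists only when opted in, so with the default config
    # there is structurally nothing that could duplicate the 85% alert.
    if cfg.disk_crit_enabled and disk is not None:
        path, used_pct = disk
        sigs.append(
            Signal(
                "host_disk_crit",
                used_pct >= cfg.disk_crit_pct,
                f"{path} {used_pct:.1f}%",
            )
        )
    return sigs
```

Emitting the signal with `active=False` below the ceiling, rather than omitting it, is what lets a downstream decision step see the transition back to inactive and send a recovery message.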