Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service: the brain half of the domain-0
observability pair.

F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b reads it, augments
it with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering
  is itself the master `orch_down` alarm (debounced K ticks, so no flap on a
  hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control; self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the
  85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*,
  DB schema are untouched.

env_file is optional, so a missing .env.watchdog never breaks
`docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README +
adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
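
The commit message names a pure decision function, decide(signal_active, prev, now, cooldown), with per-signal dedup, throttle and recovery. A minimal sketch of what such a function could look like follows. It is not the code in `watchdog/`: the `SignalState` type, its fields, and the action names "alert", "remind" and "recover" are assumptions made for illustration. Only the "none" action and the argument names come from the source.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory kept between ticks (assumed shape)."""

    alerting: bool = False           # True once an alert has been sent
    last_sent: float | None = None   # timestamp of the last outgoing message


def decide(
    signal_active: bool, prev: SignalState, now: float, cooldown: float
) -> tuple[str, SignalState]:
    """Pure function: no I/O and no clock reads; returns (action, new state)."""
    if signal_active:
        if not prev.alerting:
            # First sighting of the problem: alert once (dedup starts here).
            return "alert", SignalState(True, now)
        if prev.last_sent is not None and now - prev.last_sent >= cooldown:
            # Still broken and the cooldown elapsed: throttled reminder.
            return "remind", SignalState(True, now)
        # Still broken but inside the cooldown window: stay quiet.
        return "none", prev
    if prev.alerting:
        # Problem went away after an alert: announce recovery exactly once.
        return "recover", SignalState(False, None)
    return "none", prev
```

Because the function takes `now` as an argument and returns the next state instead of mutating anything, every transition can be unit-tested without sleeping or mocking the clock.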
76 lines
2.3 KiB
Python
"""Sidecar entrypoint: the tick loop with kill-switch + per-tick never-raise (D8).

Run as ``python -m watchdog`` (the container ``ENTRYPOINT``). The loop:
* honours ``WATCHDOG_ENABLED=false`` -> stays INERT (idle-loops with a log line,
  does NOT ``exit``, so ``restart: unless-stopped`` does not spin a restart loop);
* wraps every tick in an outer ``try/except`` so a tick error logs and the daemon
  survives (per-tick never-raise);
* logs start / each tick so the container logs prove the sidecar is alive and why
  an alert did (not) fire (NFR-7).
"""
from __future__ import annotations

import logging
import time

from .config import Config
from .core import Watchdog

logger = logging.getLogger("watchdog")


def _setup_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def run(cfg: Config | None = None, max_ticks: int | None = None) -> None:
    """Run the tick loop. ``max_ticks`` bounds the loop for tests (``None`` = forever)."""
    cfg = cfg or Config.from_env()

    if not cfg.enabled:
        logger.info("watchdog: WATCHDOG_ENABLED=false -> inert (idle, no ticks)")
        # Idle, not exit: keep the container up so restart-policy does not flap.
        ticks = 0
        while max_ticks is None or ticks < max_ticks:
            time.sleep(cfg.interval_s)
            ticks += 1
        return

    logger.info(
        "watchdog started (interval=%ss, metrics=%s, containers=%s, deps=%s, "
        "mem_pct=%s, disk_crit=%s)",
        cfg.interval_s,
        cfg.metrics_url,
        cfg.containers,
        list(cfg.deps),
        cfg.mem_pct,
        cfg.disk_crit_enabled,
    )
    dog = Watchdog(cfg)
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            dispatched = dog.tick()
            fired = [
                (a, getattr(s, "key", None)) for a, s in dispatched if a != "none"
            ]
            logger.info("watchdog tick ok (fired=%s)", fired)
        except Exception as e:  # noqa: BLE001 - per-tick outer never-raise (D8)
            logger.error("watchdog tick error: %s", e)
        ticks += 1
        if max_ticks is not None and ticks >= max_ticks:
            break
        time.sleep(cfg.interval_s)


def main() -> None:
    _setup_logging()
    run()


if __name__ == "__main__":
    main()
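
The per-tick never-raise loop with a `max_ticks` bound can be exercised in isolation. The sketch below reproduces the same loop shape without the `watchdog` package so it runs on its own. `run_loop`, its injectable `sleep` parameter and the returned failure count are hypothetical names introduced here for illustration and are not part of the module above.

```python
from __future__ import annotations

import logging
import time
from typing import Callable

log = logging.getLogger("watchdog.sketch")


def run_loop(
    tick: Callable[[], object],
    interval_s: float,
    max_ticks: int | None = None,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests never wait
) -> int:
    """Call ``tick`` repeatedly; a failing tick is logged, never propagated.

    Returns the number of ticks that raised (an addition for this sketch only).
    """
    failures = 0
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            tick()
        except Exception as e:  # outer guard: one bad tick must not kill the loop
            failures += 1
            log.error("tick error: %s", e)
        ticks += 1
        if max_ticks is not None and ticks >= max_ticks:
            break  # skip the trailing sleep once the bound is reached
        sleep(interval_s)
    return failures


# Usage: the second tick raises, yet all three ticks still run.
calls: list[int] = []
slept: list[float] = []


def flaky_tick() -> None:
    calls.append(1)
    if len(calls) == 2:
        raise RuntimeError("boom")


failures = run_loop(flaky_tick, 5.0, max_ticks=3, sleep=slept.append)
```

Passing the sleep function in as a parameter is one way to keep such a test fast; the module above calls `time.sleep` directly, so a test against it would instead rely on a small `interval_s` or on patching.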