orchestrator/watchdog/__main__.py
claude-bot 259b507906 feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100)
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.
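
The decision function described above might look roughly like the sketch below. Only the signature `decide(signal_active, prev, now, cooldown)` and the `"none"` action come from this commit. The `SignalState` type, the other action names (`"alert"`, `"realert"`, `"recovered"`) and the exact throttle rule are assumptions made for illustration, not the package's actual code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory kept between ticks (in-memory only)."""

    alerting: bool = False
    last_alert_at: float | None = None


def decide(
    signal_active: bool, prev: SignalState, now: float, cooldown: float
) -> tuple[str, SignalState]:
    """Pure function: no I/O, no clock reads; returns (action, next_state)."""
    if signal_active:
        if not prev.alerting:
            # First time the signal is seen active: alert immediately.
            return "alert", SignalState(alerting=True, last_alert_at=now)
        if prev.last_alert_at is not None and now - prev.last_alert_at >= cooldown:
            # Still active and the cooldown has elapsed: remind (throttled).
            return "realert", SignalState(alerting=True, last_alert_at=now)
        # Still active but inside the cooldown window: deduplicate.
        return "none", prev
    if prev.alerting:
        # Signal cleared after an alert: emit a single recovery notice.
        return "recovered", SignalState()
    return "none", prev
```

Because the function takes `now` as an argument instead of reading the clock, each transition can be checked with plain values and no sleeping.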

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
  itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
  alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
  schema — untouched. env_file optional so a missing .env.watchdog never breaks
  `docker compose up` for the prod orchestrator.
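
The "debounced K ticks" behaviour of the `orch_down` alarm could be sketched as a consecutive-failure counter. The class name `Debounce` and the parameter `k` are hypothetical; the commit states only that the alarm is debounced over K ticks so that a single missed /metrics response does not flap.

```python
class Debounce:
    """Report a condition only after k consecutive failed observations."""

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("k must be >= 1")
        self.k = k
        self.failures = 0

    def observe(self, ok: bool) -> bool:
        """Record one tick; return True once k failures in a row were seen."""
        # Any successful probe resets the streak, so one hiccup never alarms.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.k


# Usage: with k=3, two misses followed by a success do not raise the alarm.
debounce = Debounce(k=3)
history = [debounce.observe(ok) for ok in (False, False, True, False, False, False)]
```

The boolean returned by `observe` would then be a natural candidate for the `signal_active` argument of the decision function.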

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.

Refs: ORCH-100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:36:02 +03:00

76 lines
2.3 KiB
Python

"""Sidecar entrypoint: the tick loop with kill-switch + per-tick never-raise (D8).
Run as ``python -m watchdog`` (the container ``ENTRYPOINT``). The loop:

* honours ``WATCHDOG_ENABLED=false`` -> stays INERT (idle-loops with a log line,
  does NOT ``exit``, so ``restart: unless-stopped`` does not spin a restart loop);
* wraps every tick in an outer ``try/except`` so a tick error logs and the daemon
  survives (per-tick never-raise);
* logs start / each tick so the container logs prove the sidecar is alive and why
  an alert did (not) fire (NFR-7).
"""
from __future__ import annotations

import logging
import time

from .config import Config
from .core import Watchdog

logger = logging.getLogger("watchdog")


def _setup_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def run(cfg: Config | None = None, max_ticks: int | None = None) -> None:
    """Run the tick loop. ``max_ticks`` bounds the loop for tests (``None`` = forever)."""
    cfg = cfg or Config.from_env()
    if not cfg.enabled:
        logger.info("watchdog: WATCHDOG_ENABLED=false -> inert (idle, no ticks)")
        # Idle, not exit: keep the container up so restart-policy does not flap.
        ticks = 0
        while max_ticks is None or ticks < max_ticks:
            time.sleep(cfg.interval_s)
            ticks += 1
        return
    logger.info(
        "watchdog started (interval=%ss, metrics=%s, containers=%s, deps=%s, "
        "mem_pct=%s, disk_crit=%s)",
        cfg.interval_s,
        cfg.metrics_url,
        cfg.containers,
        list(cfg.deps),
        cfg.mem_pct,
        cfg.disk_crit_enabled,
    )
    dog = Watchdog(cfg)
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            dispatched = dog.tick()
            fired = [
                (a, getattr(s, "key", None)) for a, s in dispatched if a != "none"
            ]
            logger.info("watchdog tick ok (fired=%s)", fired)
        except Exception as e:  # noqa: BLE001 - per-tick outer never-raise (D8)
            logger.error("watchdog tick error: %s", e)
        ticks += 1
        if max_ticks is not None and ticks >= max_ticks:
            break
        time.sleep(cfg.interval_s)


def main() -> None:
    _setup_logging()
    run()


if __name__ == "__main__":
    main()
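
The per-tick never-raise property and the `max_ticks` bound can be exercised without sleeping or importing the package. The block below is a minimal self-contained stand-in for the loop, not the module's own `run()`: `run_ticks`, `flaky_tick` and the injected `sleep` callable are hypothetical names introduced for illustration.

```python
import logging
from typing import Callable

logger = logging.getLogger("watchdog.sketch")


def run_ticks(
    tick: Callable[[], None],
    max_ticks: int,
    sleep: Callable[[float], None],
    interval_s: float = 30.0,
) -> int:
    """Run a bounded tick loop; return how many ticks raised."""
    errors = 0
    for i in range(max_ticks):
        try:
            tick()
        except Exception as e:  # per-tick never-raise: log and keep going
            errors += 1
            logger.error("tick error: %s", e)
        if i < max_ticks - 1:
            sleep(interval_s)  # no sleep after the final bounded tick
    return errors


calls: list[int] = []
naps: list[float] = []


def flaky_tick() -> None:
    """Fail on the second call only, like a probe that breaks once."""
    calls.append(len(calls))
    if len(calls) == 2:
        raise RuntimeError("probe failed")


# The second tick raises, yet the third still runs and the loop returns.
errors = run_ticks(flaky_tick, max_ticks=3, sleep=naps.append, interval_s=30.0)
```

Injecting `sleep` as a parameter is one way to keep such a check fast; patching `time.sleep` in the real module would be an alternative.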