Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service: the brain half of the domain-0 observability pair.

F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b reads it, augments it with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks, so no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control. Self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB schema are untouched.

env_file is optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
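The per-signal dedup/throttle/recovery behaviour described above can be illustrated with a small pure function. This is only a sketch under stated assumptions: the `SignalState` type, the action strings, and the exact re-alert rule are hypothetical and are not taken from the `watchdog/` package or from `disk_watchdog.decide_action`; only the argument order `(signal_active, prev, now, cooldown)` comes from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory kept between ticks."""

    alerting: bool = False          # was an alert already sent for this episode?
    last_sent: float | None = None  # timestamp of the last alert, if any


def decide(
    signal_active: bool, prev: SignalState, now: float, cooldown: float
) -> tuple[str, SignalState]:
    """Pure decision: return (action, next_state).

    action is one of "alert", "recover" or "none" (names assumed here).
    """
    if signal_active:
        cooled_down = prev.last_sent is None or now - prev.last_sent >= cooldown
        if not prev.alerting or cooled_down:
            # First firing of an episode, or a throttled reminder.
            return "alert", SignalState(alerting=True, last_sent=now)
        # Dedup: still firing and inside the cooldown window.
        return "none", prev
    if prev.alerting:
        # Signal cleared after an alert: emit a single recovery notice.
        return "recover", SignalState()
    return "none", prev
```

Because the function has no side effects and takes `now` as an argument, a test can replay a timeline of ticks without sleeping or mocking the clock.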
52 lines
1.9 KiB
Python
"""Collector: external dependency pings for Plane / Gitea / Anthropic (FR-6).

A light ``GET`` with a short timeout per configured dependency. never-raise: an
unreachable dependency returns ``False`` (a signal for the threshold), never an
exception (D8). Endpoints / timeouts are configured via ``WATCHDOG_DEPS`` (D5);
an empty config means no pings (fail-safe).
"""
from __future__ import annotations

import logging
import urllib.error
import urllib.request

logger = logging.getLogger("watchdog.collectors.deps")


def ping(url: str, timeout_s: float, *, opener=urllib.request.urlopen) -> bool:
    """True when ``url`` answers with a non-5xx HTTP status. never-raise.

    A 4xx still counts as "reachable" (the host is up and responding); we ping
    for liveness, not for auth. ``opener`` is injected so tests never hit the
    network.
    """
    try:
        req = urllib.request.Request(url, method="GET")
        with opener(req, timeout=timeout_s) as resp:
            status = int(getattr(resp, "status", None) or resp.getcode())
            return status < 500
    except urllib.error.HTTPError as e:
        # An HTTP error response still proves the host is reachable, unless 5xx.
        return int(getattr(e, "code", 500)) < 500
    except Exception as e:  # noqa: BLE001 - unreachable -> down signal, not a crash
        logger.warning("watchdog: dep ping %s failed: %s", url, e)
        return False


def ping_all(
    deps: dict[str, str],
    timeout_s: float,
    *,
    opener=urllib.request.urlopen,
) -> dict[str, bool]:
    """Ping every configured dependency -> ``{name: reachable}``. never-raise."""
    out: dict[str, bool] = {}
    for name, url in deps.items():
        try:
            out[name] = ping(url, timeout_s, opener=opener)
        except Exception as e:  # noqa: BLE001 - one dep degrades, others continue
            logger.warning("watchdog: dep %s ping error: %s", name, e)
            out[name] = False
    return out
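The `opener` keyword on `ping` / `ping_all` is what lets tests avoid the network. One way to exercise it is with a small fake that mimics the `opener(req, timeout=...)` call shape and the context-manager protocol used above. This is a sketch only: `FakeResponse`, `make_fake_opener`, `http_error`, and the `*.example.invalid` URLs are hypothetical helpers, not part of the repository's test suite.

```python
import io
import urllib.error
import urllib.request
from contextlib import contextmanager
from email.message import Message


class FakeResponse:
    """Minimal stand-in for a urlopen() response; ping() only reads .status."""

    def __init__(self, status: int) -> None:
        self.status = status


def http_error(url: str, code: int) -> urllib.error.HTTPError:
    """Build the exception the real urlopen() raises for 4xx/5xx answers."""
    return urllib.error.HTTPError(url, code, "fake", Message(), io.BytesIO())


def make_fake_opener(outcomes):
    """Return an opener mapping each URL to a canned outcome.

    `outcomes` maps a URL to an int status (returned as a response) or to an
    exception instance (raised when the context manager is entered).
    """

    @contextmanager
    def opener(req, timeout=None):
        outcome = outcomes[req.full_url]
        if isinstance(outcome, BaseException):
            raise outcome
        yield FakeResponse(outcome)

    return opener


# Hypothetical dependency table in the {name: url} shape ping_all() expects.
DEPS = {
    "plane": "https://plane.example.invalid/health",
    "gitea": "https://gitea.example.invalid/health",
    "anthropic": "https://anthropic.example.invalid/health",
}

fake = make_fake_opener({
    DEPS["plane"]: 200,
    DEPS["gitea"]: http_error(DEPS["gitea"], 404),
    DEPS["anthropic"]: urllib.error.URLError("connection refused"),
})

# Usage against the module above (import path assumed from the logger name):
#   from watchdog.collectors.deps import ping_all
#   ping_all(DEPS, 1.0, opener=fake)
```

With this table, `plane` answers 200 and `gitea` raises a 404 `HTTPError`, so both take the "reachable" branches of `ping`, while the `URLError` for `anthropic` falls into the catch-all branch and is reported as `False` rather than raised.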