claude-bot 259b507906 feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100)
Add the `watchdog/` package (a thin, stdlib-only Python 3.12 daemon) and the
`orchestrator-watchdog` compose service: the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b
reads it, augments it with host / container / dependency probes, and runs each
signal through a generalised pure decision function,
decide(signal_active, prev, now, cooldown), a strict superset of
disk_watchdog.decide_action, with per-signal in-memory dedup/throttle/recovery.
Alerts go out over the watchdog's OWN independent Telegram channel.
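
For illustration only, a minimal sketch of what a pure decision function with
this signature could look like. The state shape (`SignalState`), the action
strings ("alert" / "recover") and the re-alert-after-cooldown rule are
assumptions made for the example, not the contents of `watchdog/`:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory kept between ticks."""

    active: bool = False
    last_alert_at: float | None = None


def decide(signal_active: bool, prev: SignalState, now: float, cooldown: float):
    """Pure function: returns (action, next_state).

    action is "alert", "recover" or None (assumed vocabulary).
    """
    if signal_active:
        rising_edge = not prev.active
        cooled_down = (
            prev.last_alert_at is not None and now - prev.last_alert_at >= cooldown
        )
        if rising_edge or cooled_down:
            return "alert", SignalState(True, now)
        # Still active but inside the cooldown window: suppress the duplicate.
        return None, SignalState(True, prev.last_alert_at)
    if prev.active:
        # Falling edge: the signal cleared, so report recovery once.
        return "recover", SignalState(False, None)
    return None, prev


# Trace one signal over five ticks with a 60 s cooldown.
state = SignalState()
actions = []
for now, active in [(0.0, True), (30.0, True), (60.0, True), (70.0, False), (80.0, False)]:
    action, state = decide(active, state, now, cooldown=60.0)
    actions.append(action)
# actions == ["alert", None, "alert", "recover", None]
```

Because the function takes `now` as an argument and touches no I/O, tests can
drive it with synthetic timestamps.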

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
  itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
  alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
  schema — untouched. env_file optional so a missing .env.watchdog never breaks
  `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.

Refs: ORCH-100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:36:02 +03:00


"""Collector: external dependency pings — Plane / Gitea / Anthropic (FR-6).
A light ``GET`` with a short timeout per configured dependency. never-raise: an
unreachable dependency returns ``False`` (a signal for the threshold), never an
exception (D8). Endpoints / timeouts are configured via ``WATCHDOG_DEPS`` (D5);
an empty config means no pings (fail-safe).
"""
from __future__ import annotations

import logging
import urllib.error
import urllib.request

logger = logging.getLogger("watchdog.collectors.deps")


def ping(url: str, timeout_s: float, *, opener=urllib.request.urlopen) -> bool:
    """True when ``url`` answers with a non-5xx HTTP status. never-raise.

    A 4xx still counts as "reachable" (the host is up and responding): we ping
    for liveness, not for auth. ``opener`` is injected so tests never hit the
    network.
    """
    try:
        req = urllib.request.Request(url, method="GET")
        with opener(req, timeout=timeout_s) as resp:
            status = int(getattr(resp, "status", None) or resp.getcode())
            return status < 500
    except urllib.error.HTTPError as e:
        # An HTTP error response still proves the host is reachable, unless 5xx.
        return int(getattr(e, "code", 500)) < 500
    except Exception as e:  # noqa: BLE001 - unreachable -> down signal, not a crash
        logger.warning("watchdog: dep ping %s failed: %s", url, e)
        return False


def ping_all(
    deps: dict[str, str],
    timeout_s: float,
    *,
    opener=urllib.request.urlopen,
) -> dict[str, bool]:
    """Ping every configured dependency -> ``{name: reachable}``. never-raise."""
    out: dict[str, bool] = {}
    for name, url in deps.items():
        try:
            out[name] = ping(url, timeout_s, opener=opener)
        except Exception as e:  # noqa: BLE001 - one dep degrades, others continue
            logger.warning("watchdog: dep %s ping error: %s", name, e)
            out[name] = False
    return out
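
A hedged usage sketch of the injected ``opener``, showing how the liveness rule (2xx-4xx reachable, 5xx or no answer down) can be exercised without any network access. ``FakeResponse``, ``fake_opener``, ``SCRIPT`` and the ``*.example`` URLs are hypothetical names invented for this example; ``ping`` is a condensed copy of the function above (logging dropped) so the block runs on its own.

```python
import urllib.error
import urllib.request


def ping(url, timeout_s, *, opener=urllib.request.urlopen):
    """Condensed copy of the collector's ping(), repeated for a standalone run."""
    try:
        req = urllib.request.Request(url, method="GET")
        with opener(req, timeout=timeout_s) as resp:
            return int(getattr(resp, "status", None) or resp.getcode()) < 500
    except urllib.error.HTTPError as e:
        return int(getattr(e, "code", 500)) < 500
    except Exception:
        return False


class FakeResponse:
    """Hypothetical stand-in for an HTTP response; supports ``with``."""

    def __init__(self, status):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


# Hypothetical script: URL -> HTTP status. URLs not listed are "unreachable".
SCRIPT = {
    "http://plane.example/": 200,
    "http://gitea.example/": 401,
    "http://anthropic.example/": 503,
}


def fake_opener(req, timeout):
    """Hypothetical opener that mimics urlopen's error behaviour offline."""
    status = SCRIPT.get(req.full_url)
    if status is None:
        raise urllib.error.URLError("connection refused")
    if status >= 400:
        # The real urlopen raises HTTPError for 4xx/5xx responses.
        raise urllib.error.HTTPError(req.full_url, status, "scripted", None, None)
    return FakeResponse(status)


deps = {
    "plane": "http://plane.example/",
    "gitea": "http://gitea.example/",
    "anthropic": "http://anthropic.example/",
    "down": "http://down.example/",
}
results = {name: ping(url, 2.0, opener=fake_opener) for name, url in deps.items()}
# results == {"plane": True, "gitea": True, "anthropic": False, "down": False}
```

The 401 case is the one worth pinning in a test: an auth failure still counts as reachable, so a rotated token alone would not raise a dependency-down alert.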