Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service: the brain half of the domain-0 observability pair.

F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it, augments with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks, so no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control; self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB schema are all untouched.

env_file optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
69 lines
2.4 KiB
Python
"""Independent Telegram transport for the sidecar (D7, FR-8, BR-8).

Reads its OWN ``WATCHDOG_TG_BOT_TOKEN`` / ``WATCHDOG_TG_CHAT_ID`` and POSTs via
``urllib`` to ``api.telegram.org``. It is FORBIDDEN to import
``src/notifications.py`` or to use the orchestrator's token / chat / functions;
otherwise a crash or refactor of the orchestrator would drag down the alert
channel (a direct violation of C-1 / BR-8). Missing token/chat -> log and skip
(fail-safe), never raise (NFR-3).
"""
from __future__ import annotations

import logging
import urllib.parse
import urllib.request

logger = logging.getLogger("watchdog.notify")

_TELEGRAM_API = "https://api.telegram.org"


def send_telegram(
    bot_token: str,
    chat_id: str,
    text: str,
    timeout_s: float = 5.0,
    *,
    api_base: str = _TELEGRAM_API,
    opener=urllib.request.urlopen,
) -> bool:
    """Send one Telegram message over the sidecar's own bot. never-raise (D8).

    Returns ``True`` on a delivered message, ``False`` on any failure (missing
    credentials, network error, non-2xx). ``opener`` / ``api_base`` are injected
    so tests never touch the real network.
    """
    if not bot_token or not chat_id:
        logger.warning("watchdog: telegram token/chat not configured -> skip send")
        return False
    try:
        url = f"{api_base}/bot{bot_token}/sendMessage"
        payload = urllib.parse.urlencode(
            {
                "chat_id": chat_id,
                "text": text,
                "parse_mode": "HTML",
                "disable_web_page_preview": "true",
            }
        ).encode("utf-8")
        req = urllib.request.Request(url, data=payload, method="POST")
        with opener(req, timeout=timeout_s) as resp:
            status = getattr(resp, "status", None) or resp.getcode()
            return 200 <= int(status) < 300
    except Exception as e:  # noqa: BLE001 - delivery is best-effort
        logger.warning("watchdog: telegram send failed: %s", e)
        return False


class Notifier:
    """Thin stateful wrapper binding the sidecar credentials for the tick loop."""

    def __init__(self, bot_token: str, chat_id: str, timeout_s: float = 5.0):
        self._token = bot_token
        self._chat = chat_id
        self._timeout = timeout_s

    def send(self, text: str) -> bool:
        """Best-effort send through the sidecar's own channel (never raises)."""
        return send_telegram(self._token, self._chat, text, self._timeout)
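
Because `opener` and `api_base` are injectable, the never-raise contract of `send_telegram` can be exercised without any network access. The following is a minimal sketch of that idea, not the project's actual test suite: `FakeResponse`, `make_opener`, `failing_opener`, and the placeholder token / chat values are hypothetical names introduced here for illustration, and `send_telegram` is repeated in condensed form only so the snippet runs on its own.

```python
import logging
import urllib.parse
import urllib.request

logger = logging.getLogger("watchdog.notify")


def send_telegram(bot_token, chat_id, text, timeout_s=5.0, *,
                  api_base="https://api.telegram.org",
                  opener=urllib.request.urlopen):
    """Condensed copy of the function above, repeated so the sketch runs alone."""
    if not bot_token or not chat_id:
        logger.warning("watchdog: telegram token/chat not configured -> skip send")
        return False
    try:
        url = f"{api_base}/bot{bot_token}/sendMessage"
        payload = urllib.parse.urlencode(
            {"chat_id": chat_id, "text": text, "parse_mode": "HTML",
             "disable_web_page_preview": "true"}
        ).encode("utf-8")
        req = urllib.request.Request(url, data=payload, method="POST")
        with opener(req, timeout=timeout_s) as resp:
            status = getattr(resp, "status", None) or resp.getcode()
            return 200 <= int(status) < 300
    except Exception as e:  # noqa: BLE001 - delivery is best-effort
        logger.warning("watchdog: telegram send failed: %s", e)
        return False


class FakeResponse:
    """Hypothetical stand-in for the context manager that urlopen returns."""

    def __init__(self, status):
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # do not swallow exceptions raised inside the with-block


def make_opener(status, calls):
    """Build a fake opener that records each request and replies with `status`."""
    def opener(req, timeout):
        calls.append((req.full_url, req.data, timeout))
        return FakeResponse(status)
    return opener


def failing_opener(req, timeout):
    """Simulate a network failure; send_telegram is expected to swallow it."""
    raise OSError("simulated network outage")


calls = []
# Placeholder credentials: "TOKEN" and "42" are illustrative, not real values.
ok = send_telegram("TOKEN", "42", "disk 97%", api_base="http://fake.invalid",
                   opener=make_opener(200, calls))
# The real urlopen raises HTTPError for most non-2xx replies (handled by the
# except branch); the fake returns the status directly to hit the range check.
server_error = send_telegram("TOKEN", "42", "x", opener=make_opener(500, []))
network_down = send_telegram("TOKEN", "42", "x", opener=failing_opener)
unconfigured = send_telegram("", "42", "x", opener=failing_opener)
```

In this sketch only the first call reports success; the other three return `False` instead of raising, which is the property the tick loop relies on when it calls `Notifier.send`.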