feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100)

Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service: the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b
reads it, augments it with host / container / dependency probes, runs each
signal through a generalised pure decision function
(decide(signal_active, prev, now, cooldown_s), a strict superset of
disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery,
and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering
  is itself the master `orch_down` alarm (debounced K ticks, so no flap on a
  hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control (self-hosting-safe).
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the
  85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*,
  DB schema are untouched.

env_file is optional so a missing .env.watchdog never breaks
`docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README +
adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

45
.env.example
@@ -465,3 +465,48 @@ ORCH_POST_DEPLOY_BASE_URL=http://localhost:8500
# DB title TEXT is unbounded). Default 200. An invalid/empty value gracefully
# degrades to 200 (the process never crashes on startup).
ORCH_QG0_TITLE_MAX=200

# ── ORCH-100 (FND/F1b): sidecar-watchdog (orchestrator-watchdog container) ─────
# The monitoring brain runs in a SEPARATE container with its OWN config. These
# keys are read by the watchdog package (watchdog/config.py), NOT by the
# orchestrator. At runtime they live in `.env.watchdog` (env_file of the
# orchestrator-watchdog service); this block is the canon. NO real secrets here.
#   ENABLED         -> kill-switch; false (or not starting the service) -> inert.
#   INTERVAL_S      -> seconds between ticks.
#   HTTP_TIMEOUT_S  -> per-request timeout (metrics / pings / docker / telegram).
#   COOLDOWN_S      -> re-alert throttle for a sustained signal (anti-spam).
#   METRICS_URL     -> orchestrator /metrics (host-network -> 127.0.0.1:8500).
#   ORCH_DOWN_TICKS -> K consecutive /metrics failures before "орк не отвечает" (orchestrator not responding).
#   MEM_PCT         -> host memory used-% threshold.
#   DISK_CRIT_*     -> OPT-IN independent disk CEILING (disk_watchdog/ORCH-063 owns
#                      the 85% alert; this is a higher ceiling on the sidecar's own
#                      channel, OFF by default -> no double disk-alert, AC-5/D6).
#   DISK_PATHS      -> host paths measured for the opt-in ceiling.
#   AGENT_HUNG_MIN  -> runtime minutes before an agent with ~0 CPU is "hung".
#   AGENT_CPU_FLOOR -> CPU fraction below which a long-running agent counts as hung.
#   STAGE_STUCK_MIN -> minutes a task may sit in one stage before alerting.
#   QUEUE_DEPTH     -> queued-job depth threshold.
#   CONTAINERS      -> CSV of container names to watch (status != running/healthy).
#   DOCKER_SOCK     -> path to the read-only docker.sock inside the container.
#   DEPS            -> CSV of name=url dependency pings (empty -> no pings).
#   TG_BOT_TOKEN / TG_CHAT_ID -> the sidecar's OWN Telegram bot/chat (independent
#                      of the orchestrator's; absent -> logs, does not send).
WATCHDOG_ENABLED=true
WATCHDOG_INTERVAL_S=30
WATCHDOG_HTTP_TIMEOUT_S=5
WATCHDOG_COOLDOWN_S=1800
WATCHDOG_METRICS_URL=http://127.0.0.1:8500/metrics
WATCHDOG_ORCH_DOWN_TICKS=3
WATCHDOG_MEM_PCT=90
WATCHDOG_DISK_CRIT_ENABLED=false
WATCHDOG_DISK_CRIT_PCT=97
WATCHDOG_DISK_PATHS=/repos,/app/data
WATCHDOG_AGENT_HUNG_MIN=20
WATCHDOG_AGENT_CPU_FLOOR=0.01
WATCHDOG_STAGE_STUCK_MIN=120
WATCHDOG_QUEUE_DEPTH=20
WATCHDOG_CONTAINERS=orchestrator
WATCHDOG_DOCKER_SOCK=/var/run/docker.sock
WATCHDOG_DEPS=
WATCHDOG_TG_BOT_TOKEN=
WATCHDOG_TG_CHAT_ID=

@@ -1,4 +1,4 @@
Work item: ORCH-057
Work item: ORCH-100
Repo: orchestrator
Branch: feature/ORCH-057-bug-follow-up-orch-040-normali
Branch: feature/ORCH-100-fnd-f1b-sidecar-watchdog
Stage: development

@@ -3,6 +3,15 @@
Format: [Keep a Changelog](https://keepachangelog.com/). One entry per meaningful PR/task.

## [Unreleased]
- **FND/F1b: sidecar-watchdog, the monitoring brain in a separate container** (ORCH-100, `feat`): a new `watchdog/` directory (a thin **Python-3.12-stdlib-only** daemon) plus the `orchestrator-watchdog` service in `docker-compose.yml` (`network_mode: host`, read-only `docker.sock`, `mem_limit: 128m`). This is the second half of the domain-0 observability pair: F1a (ORCH-099) serves `GET /metrics` (the raw signal), and F1b is the **brain** that reads that raw signal, augments it with external signals (host/containers/dependencies), and turns it into **alerts** over its **own** independent Telegram channel. **`src/**` is NOT changed**: F1b is a consumer of `/metrics`; `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/the orchestrator DB schema are byte-for-byte identical. Additive, behind the `WATCHDOG_ENABLED` kill-switch, strictly read-only toward the observed system (self-hosting-safe). ADR: `docs/work-items/ORCH-100/06-adr/ADR-001-sidecar-watchdog.md`, cross-cutting `docs/architecture/adr/adr-0033-sidecar-watchdog.md`.
  - **Stack (D1):** Python 3.12 stdlib-only on `python:3.12-slim`: `urllib` (HTTP `/metrics` + pings + Telegram POST), raw HTTP-over-unix-socket for the read-only `docker.sock` (WITHOUT the `docker` pip package), `shutil.disk_usage`/`/proc/meminfo` for the host. No dependency tree (thinness, C-3). A separate image, `watchdog/Dockerfile` (build context = repo root; `src/**` is NOT copied, for isolation, C-1).
  - **Topology (D2):** the service is built from `watchdog/Dockerfile`, `restart: unless-stopped` (self-healing), `network_mode: host` → `/metrics` is reachable at `http://127.0.0.1:8500/metrics`; `docker.sock` is mounted `:ro` AND the code is GET-only (a double read-only guarantee); host paths are bind-mounted `:ro`; `mem_limit: 128m` + `mem_reservation: 32m`. `env_file` is optional (`required: false`) → a missing `.env.watchdog` does NOT break `docker compose up` for the prod orchestrator. Deploying the watchdog brings up ONLY the watchdog: the prod `orchestrator` is not rebuilt or restarted.
  - **Generalised pure decision function (D4):** `watchdog/decision.py::decide(signal_active, prev, now, cooldown_s) -> alert|realert|recovery|none` is a strict generalisation of `disk_watchdog.decide_action` (a boolean `signal_active` instead of `used_pct >= threshold`), with a per-signal in-memory `AlertState` (anti-spam/recovery; a restart resets it → a correct repeat alert for a standing problem).
  - **Signal registry (D5):** `orch_down` (K=3 consecutive failed `/metrics` reads: a debounce, so it does not flap on a single hiccup), `host_mem` (≥90%), `host_disk_crit` (opt-in 97% ceiling, default off, D6), `agent_hung` (per run_id, two polls: `runtime > N` AND CPU fraction `< floor`), `stage_stuck` (per work_item), `job_failed` (edge, counter growth), `queue_depth` (≥20), `container_down` (per name, status ∉ {running,healthy}), `dep_down` (per name, pings Plane/Gitea/Anthropic). All thresholds/intervals/URLs/tokens come from env (`WATCHDOG_*`, canon in `.env.example`).
  - **Disk-alert anti-duplicate (D6, AC-5):** the regular 85% alert stays SOLELY with `disk_watchdog` (ORCH-063) → **zero duplicates by construction**; the sidecar contributes `orch_down` (when the orchestrator is down, the in-process guards are dead) plus an **opt-in** independent `host_disk_crit` ceiling (97%, default off) as a backup channel. One owner per threshold.
  - **Independent transport (D7):** `watchdog/notify.py` reads its **own** `WATCHDOG_TG_BOT_TOKEN`/`WATCHDOG_TG_CHAT_ID`; importing `src/notifications.py` or the orchestrator's token is **forbidden** (an orchestrator crash cannot take the alert channel down with it). A missing token → fail-safe (logs, does not send, does not crash).
  - **never-raise + kill-switch (D8):** three levels (per-source: a broken collector degrades one signal; per-tick: an outer try/except around the loop; per-send: a wrapped send). `WATCHDOG_ENABLED=false` → the daemon is inert (an idle loop with a log line, NOT an exit, so the restart policy does not spin in a loop). Tolerance to the `/metrics` version (D9): unknown fields are ignored, and a `schema_version` increase is logged (warning) without a crash.
  - Tests: `tests/watchdog/test_*.py` (TC-01…TC-13: decision/orch-down/never-raise/kill-switch/full-tick/docker-readonly/notify-isolation/metrics-parse/compose/disk-dedup + host/deps collectors) + the full `tests/ -q` regression is green (TC-14, `src/**` untouched). **Infra precondition** (07): add the service to compose, create the watchdog bot/chat + `.env.watchdog`, first launch on the host. Rollback: do not start the service / `WATCHDOG_ENABLED=false`.
- **Bugfix track: a simplified/cheaper pipeline route for bugs** (ORCH-019, `feat`): a task with the Plane label `Bug` takes a **shortened route**: the `architecture` stage is skipped (a separate run of the opus agent `architect` + ADR + the `check_architecture_done` exit gate), and heavy analysis is replaced by a lightweight package (a short bug report + a mandatory regression-test plan). **All Quality Gates run unchanged** (root invariant NFR-1): `STAGE_TRANSITIONS` / the `QG_CHECKS` registry / `check_*` signatures / machine-verdict keys (`verdict:`/`result:`/`deploy_status:`/`staging_status:`/`security_status:`/`coverage_status:`) are byte-for-byte the same as before; bugfix routing is a property of the scheduler, **not** a gate. Additive, behind a kill-switch, repo-scoped, never-raise, fail-safe → full cycle. ADR: `docs/work-items/ORCH-019/06-adr/ADR-001-bug-fast-track.md`, cross-cutting `docs/architecture/adr/adr-0032-bug-fast-track.md`.
  - **Classification (D1, FR-1):** a new leaf `src/bug_fast_track.py` (never-raise, the `labels`/`serial_gate` pattern). `bug_fast_track_applies(repo)` (local, no network) is checked FIRST → a disabled flag means zero network overhead; `is_bug_task(work_item_id, project_id)` delegates to the proven `labels.has_label` (ORCH-089: `fetch_issue_labels`+`get_project_labels`, normalisation, TTL cache). **The source of truth is the Plane API**, not the webhook payload. The label is read only in `start_pipeline`, **never** in the hot `claim_next_job` (NFR-4).
  - **Type storage (D2):** an additive idempotent column `tasks.track TEXT DEFAULT 'full'` (`_ensure_column`, the `tasks.cancelled_at` pattern from ORCH-090); values `'full'` (default, ALL existing and non-bug tasks) | `'bug'`. Helpers `db.set_task_track`/`db.get_task_track` (missing/NULL → `'full'`, fail-safe). The `create_task_atomic` signature does not change.

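The `orch_down` debounce listed under D5 (K consecutive failed `/metrics` reads before the alarm) can be pictured as a consecutive-failure counter. This is a minimal sketch under stated assumptions, not the code in this commit: the class name `OrchDownDebounce` and its `observe` method are hypothetical and do not appear in the diff.

```python
from dataclasses import dataclass


@dataclass
class OrchDownDebounce:
    """Hypothetical helper: counts consecutive /metrics failures."""

    k: int = 3          # mirrors WATCHDOG_ORCH_DOWN_TICKS
    failures: int = 0   # consecutive failed reads seen so far

    def observe(self, metrics_ok: bool) -> bool:
        """Record one tick; return True while orch_down should be active."""
        if metrics_ok:
            self.failures = 0       # any success resets the streak
        else:
            self.failures += 1
        return self.failures >= self.k


debounce = OrchDownDebounce(k=3)
# One isolated failure, a success, then three failures in a row, then recovery.
history = [debounce.observe(ok) for ok in (False, True, False, False, False, True)]
print(history)
```

A single hiccup never reaches K, so the signal only turns active on the third failure in a row and clears on the next successful read.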
@@ -38,6 +38,39 @@ services:
    group_add:
      - "999"

  # ORCH-100 (FND/F1b): sidecar-watchdog, the monitoring brain in a SEPARATE
  # container (observer separated from observed, ADR-001 D2). Deploying it builds
  # ONLY this service; the prod `orchestrator` is NOT rebuilt/restarted.
  #  * network_mode: host -> /metrics reachable at http://127.0.0.1:8500/metrics
  #    and host interfaces visible for memory/disk reads.
  #  * docker.sock mounted :ro AND the code is GET-only (double read-only guard).
  #  * host disk paths bind-mounted :ro so shutil.disk_usage sees the host FS but
  #    can never write (opt-in disk ceiling, D6).
  #  * mem_limit caps the thin stdlib daemon (D2): OOM = early "sidecar grew" signal.
  #  * WATCHDOG_ENABLED=false (or simply not starting the service) -> inert.
  orchestrator-watchdog:
    build:
      context: .
      dockerfile: watchdog/Dockerfile
    container_name: orchestrator-watchdog
    restart: unless-stopped
    init: true
    network_mode: host
    mem_limit: 128m
    mem_reservation: 32m
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /home/slin/repos:/repos:ro
      - ./data:/app/data:ro
    # Optional env_file (required: false): a missing .env.watchdog must NOT fail
    # `docker compose up` for the prod orchestrator (self-hosting safety). Absent
    # file -> WATCHDOG_* defaults, no token -> fail-safe (logs, does not send).
    env_file:
      - path: .env.watchdog
        required: false
    group_add:
      - "999"

  # ORCH-31: staging instance (port 8501, isolated DB).
  # Starts ONLY with: docker compose --profile staging up -d orchestrator-staging
  # Normal "docker compose up -d" does NOT start this service.

0
tests/watchdog/__init__.py
Normal file
46
tests/watchdog/conftest.py
Normal file
@@ -0,0 +1,46 @@
"""Shared helpers/fixtures for the watchdog (ORCH-100, F1b) test suite.

A tiny urllib-style fake opener so HTTP collectors / Telegram transport never
touch the network (test plan §scope: all collectors/transport are mocked).
"""
from __future__ import annotations

import io
import urllib.error


class FakeResponse:
    """Context-manager response mimicking ``urllib`` ``addinfourl``."""

    def __init__(self, status: int = 200, body: bytes = b"{}"):
        self.status = status
        self._body = body

    def getcode(self):
        return self.status

    def read(self):
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *a):
        return False


def make_opener(*, status=200, body=b"{}", exc=None):
    """Build a fake ``urlopen`` that returns a body or raises ``exc``."""

    def _opener(req, timeout=None):
        if exc is not None:
            raise exc
        return FakeResponse(status=status, body=body)

    return _opener


def http_error(code: int) -> urllib.error.HTTPError:
    return urllib.error.HTTPError(
        url="http://x", code=code, msg="err", hdrs=None, fp=io.BytesIO(b"")
    )
66
tests/watchdog/test_compose_service.py
Normal file
@@ -0,0 +1,66 @@
"""TC-12: compose invariant: orchestrator-watchdog is a separate service.

It declares its own build (watchdog/Dockerfile), restart policy, mem_limit, and
mounts docker.sock read-only (:ro). Parses the real docker-compose.yml.
"""
import pathlib

import yaml

REPO_ROOT = pathlib.Path(__file__).resolve().parents[2]


def _compose():
    with open(REPO_ROOT / "docker-compose.yml") as f:
        return yaml.safe_load(f)


def test_watchdog_service_declared():
    svc = _compose()["services"]
    assert "orchestrator-watchdog" in svc


def test_watchdog_builds_from_watchdog_dockerfile():
    wd = _compose()["services"]["orchestrator-watchdog"]
    build = wd["build"]
    assert isinstance(build, dict)
    assert build["dockerfile"] == "watchdog/Dockerfile"
    assert build["context"] == "."


def test_watchdog_has_restart_and_mem_limit():
    wd = _compose()["services"]["orchestrator-watchdog"]
    assert wd["restart"] == "unless-stopped"
    assert wd["mem_limit"] == "128m"  # thin stack, not Grafana/Prometheus


def test_docker_sock_mounted_read_only():
    wd = _compose()["services"]["orchestrator-watchdog"]
    sock = [v for v in wd["volumes"] if "docker.sock" in v]
    assert sock, "docker.sock must be mounted"
    assert all(v.endswith(":ro") for v in sock), "docker.sock must be :ro"


def test_host_paths_mounted_read_only():
    wd = _compose()["services"]["orchestrator-watchdog"]
    # Every bind mount the watchdog uses is read-only (it only reads).
    for v in wd["volumes"]:
        assert v.endswith(":ro"), f"watchdog mount must be :ro: {v}"


def test_env_file_is_optional():
    # A missing .env.watchdog must not break `docker compose up` (self-hosting).
    wd = _compose()["services"]["orchestrator-watchdog"]
    env_file = wd["env_file"]
    assert isinstance(env_file, list)
    assert env_file[0]["required"] is False


def test_watchdog_dockerfile_exists_and_is_stdlib_only():
    df = REPO_ROOT / "watchdog" / "Dockerfile"
    assert df.exists()
    text = df.read_text()
    # No pip install of third-party deps (stdlib-only, D1).
    assert "pip install" not in text
    assert "COPY requirements" not in text
    assert "requirements.txt" not in text
69
tests/watchdog/test_config_killswitch.py
Normal file
@@ -0,0 +1,69 @@
"""TC-07: kill-switch + env-driven config (no hardcoded thresholds).

``WATCHDOG_ENABLED=false`` -> the daemon is inert (idle, no ticks). Thresholds /
intervals / timeouts come from env, not constants.
"""
from watchdog.config import Config


def test_killswitch_off_is_inert(monkeypatch):
    from watchdog import __main__ as entry

    cfg = Config.from_env({"WATCHDOG_ENABLED": "false", "WATCHDOG_INTERVAL_S": "0"})
    assert cfg.enabled is False

    built = {"n": 0}

    class _Dog:
        def tick(self):
            built["n"] += 1

    # If run() ever constructed a Watchdog / ticked while disabled, this would fire.
    monkeypatch.setattr(entry, "Watchdog", lambda c: _Dog())
    monkeypatch.setattr(entry.time, "sleep", lambda *_: None)
    entry.run(cfg=cfg, max_ticks=3)
    assert built["n"] == 0  # inert: never ticked


def test_thresholds_read_from_env():
    cfg = Config.from_env(
        {
            "WATCHDOG_INTERVAL_S": "7",
            "WATCHDOG_MEM_PCT": "77",
            "WATCHDOG_QUEUE_DEPTH": "9",
            "WATCHDOG_AGENT_HUNG_MIN": "5",
            "WATCHDOG_STAGE_STUCK_MIN": "11",
            "WATCHDOG_ORCH_DOWN_TICKS": "4",
            "WATCHDOG_COOLDOWN_S": "60",
            "WATCHDOG_HTTP_TIMEOUT_S": "2",
            "WATCHDOG_CONTAINERS": "orchestrator,plane-app",
            "WATCHDOG_DEPS": "gitea=http://g/healthz,plane=http://p/",
        }
    )
    assert cfg.interval_s == 7.0
    assert cfg.mem_pct == 77.0
    assert cfg.queue_depth == 9
    assert cfg.agent_hung_s == 5 * 60.0
    assert cfg.stage_stuck_s == 11 * 60.0
    assert cfg.orch_down_ticks == 4
    assert cfg.cooldown_s == 60.0
    assert cfg.http_timeout_s == 2.0
    assert cfg.containers == ["orchestrator", "plane-app"]
    assert cfg.deps == {"gitea": "http://g/healthz", "plane": "http://p/"}


def test_defaults_when_env_absent():
    cfg = Config.from_env({})
    assert cfg.enabled is True
    assert cfg.interval_s == 30.0
    assert cfg.metrics_url.endswith(":8500/metrics")
    assert cfg.disk_crit_enabled is False
    assert cfg.containers == ["orchestrator"]
    assert cfg.deps == {}


def test_malformed_env_degrades_to_default():
    # A garbage numeric value must not crash config; it degrades to the default.
    cfg = Config.from_env({"WATCHDOG_INTERVAL_S": "abc", "WATCHDOG_MEM_PCT": ""})
    assert cfg.interval_s == 30.0
    assert cfg.mem_pct == 90.0
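The config tests above pin down two behaviours: garbage or empty numeric values degrade to the default, and the CSV keys (`WATCHDOG_CONTAINERS`, `WATCHDOG_DEPS`) parse into a list and a name-to-URL mapping. The diff does not show `watchdog/config.py`, so the following is only a sketch of one way to get that behaviour; the helper names `env_float`, `parse_csv`, and `parse_deps` are hypothetical.

```python
from typing import Mapping


def env_float(env: Mapping[str, str], key: str, default: float) -> float:
    """Return env[key] as a float; fall back to default if missing or malformed."""
    try:
        return float(env[key])
    except (KeyError, ValueError):
        # Missing key, empty string, or non-numeric text: never raise.
        return default


def parse_csv(raw: str) -> list[str]:
    """Split a comma-separated value, dropping empty items and stray whitespace."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def parse_deps(raw: str) -> dict[str, str]:
    """Parse 'name=url,name=url' into a dict, skipping malformed items."""
    deps: dict[str, str] = {}
    for item in parse_csv(raw):
        # Split on the first '=' only, so a URL with '=' in its query survives.
        name, sep, url = item.partition("=")
        if sep and name.strip() and url.strip():
            deps[name.strip()] = url.strip()
    return deps
```

One limitation of this sketch: a URL that itself contains a comma would be split apart by the CSV step, so such a dependency could not be expressed in this format.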
56
tests/watchdog/test_decision.py
Normal file
@@ -0,0 +1,56 @@
"""TC-01…TC-04: the pure decision function (alert/throttle/realert/recovery).

Mirrors the disk_watchdog.decide_action tests: the generalised ``decide`` is a
strict superset (boolean ``signal_active`` instead of ``used_pct >= threshold``).
"""
from watchdog.decision import (
    ACTION_ALERT,
    ACTION_NONE,
    ACTION_REALERT,
    ACTION_RECOVERY,
    AlertState,
    decide,
)

COOLDOWN = 1800.0


def test_tc01_not_alerting_active_alerts():
    # TC-01: not-alerting & signal active -> ALERT (one per crossing).
    prev = AlertState(alerting=False)
    assert decide(True, prev, now=100.0, cooldown_s=COOLDOWN) == ACTION_ALERT


def test_tc01_not_alerting_inactive_is_none():
    prev = AlertState(alerting=False)
    assert decide(False, prev, now=100.0, cooldown_s=COOLDOWN) == ACTION_NONE


def test_tc02_alerting_active_in_cooldown_is_none():
    # TC-02: alerting & still active & cooldown NOT elapsed -> NONE (anti-spam).
    prev = AlertState(alerting=True, last_alert_at=1000.0)
    assert decide(True, prev, now=1000.0 + 10.0, cooldown_s=COOLDOWN) == ACTION_NONE


def test_tc03_alerting_active_cooldown_elapsed_realerts():
    # TC-03: alerting & still active & cooldown elapsed -> REALERT.
    prev = AlertState(alerting=True, last_alert_at=1000.0)
    assert decide(True, prev, now=1000.0 + COOLDOWN, cooldown_s=COOLDOWN) == ACTION_REALERT


def test_tc03_alerting_active_no_last_alert_realerts():
    # Defensive: alerting but last_alert_at missing -> treat cooldown as elapsed.
    prev = AlertState(alerting=True, last_alert_at=None)
    assert decide(True, prev, now=5.0, cooldown_s=COOLDOWN) == ACTION_REALERT


def test_tc04_alerting_recovers_when_inactive():
    # TC-04: alerting & signal back to normal -> RECOVERY.
    prev = AlertState(alerting=True, last_alert_at=1000.0)
    assert decide(False, prev, now=1200.0, cooldown_s=COOLDOWN) == ACTION_RECOVERY


def test_cooldown_boundary_is_inclusive():
    # Exactly at cooldown boundary -> REALERT (>= semantics, like disk_watchdog).
    prev = AlertState(alerting=True, last_alert_at=0.0)
    assert decide(True, prev, now=COOLDOWN, cooldown_s=COOLDOWN) == ACTION_REALERT
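The decision table that TC-01…TC-04 exercise can be sketched as a small pure function. `watchdog/decision.py` itself is not part of this excerpt, so this is a reconstruction from the tests rather than the committed code; in particular, the string values of the `ACTION_*` constants and the shape of `AlertState` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed values: the changelog names the outcomes alert|realert|recovery|none.
ACTION_ALERT = "alert"
ACTION_REALERT = "realert"
ACTION_RECOVERY = "recovery"
ACTION_NONE = "none"


@dataclass(frozen=True)
class AlertState:
    """Per-signal in-memory state (assumed shape, taken from the tests)."""

    alerting: bool = False
    last_alert_at: Optional[float] = None


def decide(signal_active: bool, prev: AlertState, now: float, cooldown_s: float) -> str:
    """Pure decision: no I/O, no clock reads, no mutation of prev."""
    if not prev.alerting:
        # First crossing fires once; a quiet signal stays quiet.
        return ACTION_ALERT if signal_active else ACTION_NONE
    if not signal_active:
        # Was alerting, now back to normal.
        return ACTION_RECOVERY
    # Still active: re-alert only once the cooldown has elapsed (inclusive),
    # or defensively when the last alert time is unknown.
    if prev.last_alert_at is None or now - prev.last_alert_at >= cooldown_s:
        return ACTION_REALERT
    return ACTION_NONE
```

Because the function takes `now` as an argument instead of reading a clock, the cooldown boundary can be tested with plain numbers, as the tests above do.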
39
tests/watchdog/test_deps_collector.py
Normal file
@@ -0,0 +1,39 @@
"""Dependency ping collector: reachable / unreachable / 5xx (never-raise)."""
from watchdog.collectors import deps as deps_mod

from .conftest import http_error, make_opener


def test_ping_reachable():
    assert deps_mod.ping("http://x", 1.0, opener=make_opener(status=200)) is True


def test_ping_4xx_still_reachable():
    # A 4xx proves the host is up (we ping for liveness, not auth).
    assert deps_mod.ping("http://x", 1.0, opener=make_opener(exc=http_error(404))) is True


def test_ping_5xx_is_down():
    assert deps_mod.ping("http://x", 1.0, opener=make_opener(exc=http_error(503))) is False


def test_ping_timeout_is_down():
    assert deps_mod.ping(
        "http://x", 1.0, opener=make_opener(exc=TimeoutError())
    ) is False


def test_ping_all_mixed():
    def opener_factory(url):
        return make_opener(status=200) if "good" in url else make_opener(
            exc=ConnectionError()
        )

    def opener(req, timeout=None):
        url = req.full_url if hasattr(req, "full_url") else req
        return opener_factory(url)(req, timeout)

    res = deps_mod.ping_all(
        {"good": "http://good", "bad": "http://bad"}, 1.0, opener=opener
    )
    assert res == {"good": True, "bad": False}
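The liveness rule these tests describe (any response below 500 counts as reachable; 5xx, timeouts, and connection errors count as down; nothing is ever raised) can be sketched with `urllib` alone. `watchdog/collectors/deps.py` is not shown in this excerpt, so treat the body below as an assumption about how such a `ping` might look, with the injectable `opener` parameter modelled on how the tests call it.

```python
import urllib.error
import urllib.request


def ping(url: str, timeout_s: float, opener=urllib.request.urlopen) -> bool:
    """Return True if the dependency answered with a status below 500."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            return resp.getcode() < 500
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx; a 4xx still proves the host is up.
        return err.code < 500
    except Exception:
        # Timeouts, refused connections, DNS failures: down, but never raise.
        return False


def ping_all(deps: dict[str, str], timeout_s: float, opener=urllib.request.urlopen) -> dict[str, bool]:
    """Ping every name=url pair; one failing dependency does not affect the others."""
    return {name: ping(url, timeout_s, opener=opener) for name, url in deps.items()}
```

`HTTPError` has to be caught before the broad `except`, because it is a subclass of `OSError` and would otherwise be swallowed as "down" even for a 404.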
42
tests/watchdog/test_disk_alert_dedup.py
Normal file
@@ -0,0 +1,42 @@
"""TC-13: anti-duplicate disk alert (coordinated with ORCH-063 / disk_watchdog).

ADR-001 D6: disk_watchdog (ORCH-063) is the SOLE owner of the 85% disk alert via
the orchestrator's Telegram. The sidecar carries NO disk alert by default
(``WATCHDOG_DISK_CRIT_ENABLED=false``) -> structurally zero double-alert. The
sidecar's contribution is an OPT-IN independent ceiling at a HIGHER threshold
(a different event, separate channel).
"""
from watchdog.config import Config
from watchdog.signals import host_signals


def _cfg(**kw):
    return Config.from_env(kw)


def test_disk_signal_absent_by_default():
    # Disk full at 90% -> sidecar produces NO disk signal (disk_watchdog owns it).
    cfg = _cfg()
    assert cfg.disk_crit_enabled is False
    sigs = host_signals(cfg, mem_pct=None, disk=("/repos", 90.0))
    assert [s for s in sigs if s.key == "host_disk_crit"] == []


def test_opt_in_ceiling_is_separate_higher_event():
    cfg = _cfg(WATCHDOG_DISK_CRIT_ENABLED="true", WATCHDOG_DISK_CRIT_PCT="97")
    # Below the ceiling (90% < 97%) -> not active even when opted in (no 85% dup).
    below = host_signals(cfg, mem_pct=None, disk=("/repos", 90.0))
    crit_below = [s for s in below if s.key == "host_disk_crit"]
    assert len(crit_below) == 1 and crit_below[0].active is False

    # At/over the high ceiling -> active (a DIFFERENT event from disk_watchdog 85%).
    over = host_signals(cfg, mem_pct=None, disk=("/repos", 98.0))
    crit_over = [s for s in over if s.key == "host_disk_crit"]
    assert len(crit_over) == 1 and crit_over[0].active is True


def test_mem_signal_independent_of_disk():
    cfg = _cfg(WATCHDOG_MEM_PCT="90")
    sigs = host_signals(cfg, mem_pct=95.0, disk=None)
    mem = [s for s in sigs if s.key == "host_mem"]
    assert len(mem) == 1 and mem[0].active is True
79
tests/watchdog/test_docker_readonly.py
Normal file
@@ -0,0 +1,79 @@
"""TC-09: self-hosting safety: the Docker client is read-only by construction.

The client exposes ONLY read methods (list/inspect), its single request
primitive hard-codes the ``GET`` HTTP method, and the source carries no
mutating Docker verb (start/stop/restart/kill/exec/POST). ``classify_container``
is a pure status mapper.
"""
import inspect as _inspect

from watchdog.collectors import containers as cmod


def test_request_primitive_is_get_only(monkeypatch):
    captured = {}

    class _FakeConn:
        def __init__(self, *a, **k):
            pass

        def request(self, method, path):
            captured["method"] = method
            captured["path"] = path

        def getresponse(self):
            class _R:
                status = 200

                def read(self_inner):
                    return b"[]"

            return _R()

        def close(self):
            pass

    monkeypatch.setattr(cmod, "_UnixHTTPConnection", _FakeConn)
    reader = cmod.DockerSockReader("/var/run/docker.sock")
    reader.list_containers()
    assert captured["method"] == "GET"
    reader.inspect("orchestrator")
    assert captured["method"] == "GET"


def test_no_mutating_verbs_in_source():
    src = _inspect.getsource(cmod)
    lowered = src.lower()
    # No write/control verbs should appear as Docker actions in this module.
    for verb in ("/start", "/stop", "/restart", "/kill", "/exec", "\"post\"", "'post'"):
        assert verb not in lowered, f"mutating verb leaked into containers.py: {verb}"


def test_reader_exposes_only_read_methods():
    public = [
        n for n in dir(cmod.DockerSockReader)
        if not n.startswith("_")
    ]
    assert set(public) == {"list_containers", "inspect"}


def test_classify_container_pure_mapping():
    assert cmod.classify_container({"State": {"Status": "running"}}) == "running"
    assert cmod.classify_container({"State": {"Status": "exited"}}) == "exited"
    assert cmod.classify_container(
        {"State": {"Status": "running", "Health": {"Status": "unhealthy"}}}
    ) == "unhealthy"
    assert cmod.classify_container(
        {"State": {"Status": "running", "Health": {"Status": "healthy"}}}
    ) == "healthy"
    assert cmod.classify_container(None) == "unknown"
    assert cmod.classify_container({}) == "unknown"


def test_container_alarm_semantics():
    assert cmod.container_alarm("running") is False
    assert cmod.container_alarm("healthy") is False
    assert cmod.container_alarm("exited") is True
    assert cmod.container_alarm("restarting") is True
    assert cmod.container_alarm("unhealthy") is True
    assert cmod.container_alarm("unknown") is True
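The status mapping and alarm rule asserted above can be sketched as two pure functions over the `State` object of a container-inspect payload. `watchdog/collectors/containers.py` is not included in this excerpt, so this is one possible reading of the tests, not the committed code. The choice to let the health status override only while the container is running is an assumption of this sketch.

```python
from typing import Any, Optional

# Statuses that do not raise container_down (from the changelog: running, healthy).
OK_STATUSES = frozenset({"running", "healthy"})


def classify_container(info: Optional[dict[str, Any]]) -> str:
    """Map an inspect payload to a single status string; never raises."""
    if not isinstance(info, dict):
        return "unknown"
    state = info.get("State")
    if not isinstance(state, dict):
        return "unknown"
    status = state.get("Status")
    if not isinstance(status, str) or not status:
        return "unknown"
    if status != "running":
        # exited / restarting / paused / ...: report the state as-is.
        return status
    health = state.get("Health")
    health_status = health.get("Status") if isinstance(health, dict) else None
    if health_status in ("healthy", "unhealthy"):
        # A running container with a healthcheck reports its health verdict.
        return health_status
    return "running"


def container_alarm(status: str) -> bool:
    """True when the status should raise container_down."""
    return status not in OK_STATUSES
```

Under this sketch a health status of `starting` falls through to `running`, so a container that is still warming up does not alarm; whether the real collector treats it that way is not visible in the diff.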
54
tests/watchdog/test_host_collector.py
Normal file
@@ -0,0 +1,54 @@
"""Host collector: /proc/meminfo parsing + disk reads (never-raise)."""
import os
import tempfile

from watchdog.collectors import host as host_mod


def test_mem_used_pct_from_meminfo():
    content = "MemTotal: 1000 kB\nMemFree: 100 kB\nMemAvailable: 250 kB\n"
    with tempfile.NamedTemporaryFile("w", suffix=".meminfo", delete=False) as f:
        f.write(content)
        path = f.name
    try:
        pct = host_mod.read_mem_used_pct(path)
        # used = (1 - 250/1000) * 100 = 75.0
        assert pct == 75.0
    finally:
        os.unlink(path)


def test_mem_used_pct_missing_file_is_none():
    assert host_mod.read_mem_used_pct("/no/such/meminfo") is None


def test_mem_used_pct_garbage_is_none():
    with tempfile.NamedTemporaryFile("w", delete=False) as f:
        f.write("totally not meminfo\n")
        path = f.name
    try:
        assert host_mod.read_mem_used_pct(path) is None
    finally:
        os.unlink(path)


def test_disk_used_pct_real_path():
    pct = host_mod.read_disk_used_pct("/")
    assert pct is None or (0.0 <= pct <= 100.0)


def test_disk_used_pct_missing_path_is_none():
    assert host_mod.read_disk_used_pct("/no/such/path/xyz") is None


def test_max_disk_used_pct_picks_worst(monkeypatch):
    monkeypatch.setattr(
        host_mod, "read_disk_used_pct",
        lambda p: {"/a": 10.0, "/b": 80.0, "/c": None}.get(p),
    )
    assert host_mod.max_disk_used_pct(["/a", "/b", "/c"]) == ("/b", 80.0)


def test_max_disk_used_pct_all_unreadable(monkeypatch):
    monkeypatch.setattr(host_mod, "read_disk_used_pct", lambda p: None)
    assert host_mod.max_disk_used_pct(["/a", "/b"]) is None
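The host readings tested above follow the formula in the test comment, used % = (1 - MemAvailable / MemTotal) * 100, plus a disk percentage from `shutil.disk_usage`. The sketch below shows one stdlib-only way to compute both while returning `None` on any failure. `watchdog/collectors/host.py` is not in this excerpt, so the function bodies are assumptions that merely satisfy the same tests.

```python
import shutil
from typing import Optional


def read_mem_used_pct(path: str = "/proc/meminfo") -> Optional[float]:
    """Return used memory in percent from a meminfo-style file, or None."""
    try:
        fields: dict[str, float] = {}
        with open(path, encoding="ascii", errors="replace") as f:
            for line in f:
                # Lines look like "MemTotal:  1000 kB"; keep the numeric part.
                name, sep, rest = line.partition(":")
                parts = rest.split()
                if sep and parts:
                    fields[name.strip()] = float(parts[0])
        total = fields["MemTotal"]
        available = fields["MemAvailable"]
        if total <= 0:
            return None
        return (1.0 - available / total) * 100.0
    except (OSError, KeyError, ValueError):
        # Missing file, missing field, or non-numeric value: degrade to None.
        return None


def read_disk_used_pct(path: str) -> Optional[float]:
    """Return used disk space in percent for the filesystem at path, or None."""
    try:
        usage = shutil.disk_usage(path)
    except OSError:
        return None
    if usage.total <= 0:
        return None
    return usage.used / usage.total * 100.0
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable page cache as used memory, which is why the test expects 75.0 and not 90.0 for the sample file.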
118
tests/watchdog/test_metrics_parse.py
Normal file
118
tests/watchdog/test_metrics_parse.py
Normal file
@@ -0,0 +1,118 @@
"""TC-11: tolerance to the /metrics contract.

Unknown fields are ignored, a missing optional does not crash, and a
schema_version above the known one logs a warning (no crash). Also covers the
envelope-derived signal evaluation (agent_hung / stage_stuck / job_failed /
queue_depth).
"""
import logging

from watchdog.collectors import orch as orch_mod
from watchdog.config import Config
from watchdog.signals import AgentSample, eval_envelope


def _cfg(**kw):
    return Config.from_env(kw)


def test_unknown_field_ignored():
    body = '{"schema_version":1,"stages":[],"brand_new_field":42}'
    env = orch_mod.parse_envelope(body)
    assert env["brand_new_field"] == 42  # tolerated, not a crash


def test_missing_optional_not_an_error():
    env = orch_mod.parse_envelope('{"schema_version":1}')
    ev = eval_envelope(env, _cfg(), prev_agents={}, prev_failed=None)
    assert ev.signals == []  # no stages/agents/queue -> no signals, no crash


def test_non_object_body_raises_valueerror():
    import pytest

    with pytest.raises(ValueError):
        orch_mod.parse_envelope("[1,2,3]")


def test_schema_version_bump_warns(caplog):
    env = {"schema_version": 999}
    with caplog.at_level(logging.WARNING):
        orch_mod.check_schema_version(env)
    assert any("schema_version" in r.message for r in caplog.records)


def test_parse_generated_at_roundtrip_and_tolerant():
    assert orch_mod.parse_generated_at({"generated_at": "2026-06-10T00:00:00Z"})
    assert orch_mod.parse_generated_at({"generated_at": "garbage"}) is None
    assert orch_mod.parse_generated_at({}) is None


def test_queue_depth_and_job_failed_signals():
    env = {
        "schema_version": 1,
        "queue": {"depth": 25, "counts": {"failed": 5}},
    }
    cfg = _cfg(WATCHDOG_QUEUE_DEPTH="20")
    # First tick: failed baseline established, depth over threshold fires.
    ev = eval_envelope(env, cfg, prev_agents={}, prev_failed=None)
    keys = {s.key for s in ev.signals}
    assert "queue_depth" in keys
    assert "job_failed" not in keys  # no prior baseline -> no edge yet
    assert ev.failed_count == 5

    # Next tick: failed grew 5 -> 7 -> edge job_failed alert.
    env2 = {"queue": {"depth": 0, "counts": {"failed": 7}}}
    ev2 = eval_envelope(env2, cfg, prev_agents={}, prev_failed=ev.failed_count)
    jf = [s for s in ev2.signals if s.key == "job_failed"]
    assert len(jf) == 1 and jf[0].edge is True and jf[0].active is True


def test_stage_stuck_signal():
    env = {"stages": [{"work_item": "ORCH-1", "stage": "review", "age_in_stage_s": 9999}]}
    cfg = _cfg(WATCHDOG_STAGE_STUCK_MIN="1")  # 60s threshold
    ev = eval_envelope(env, cfg, prev_agents={}, prev_failed=None)
    stuck = [s for s in ev.signals if s.key == ("stage_stuck", "ORCH-1")]
    assert len(stuck) == 1 and stuck[0].active is True


def test_agent_hung_needs_two_polls_and_low_cpu():
    cfg = _cfg(WATCHDOG_AGENT_HUNG_MIN="1", WATCHDOG_AGENT_CPU_FLOOR="0.01")
    env = {
        "schema_version": 1,
        "generated_at": "2026-06-10T00:01:40Z",  # +100s vs prev sample below
        "clk_tck": 100,
        "agents": [{"run_id": 7, "agent": "developer", "runtime_s": 999, "cpu_ticks": 50}],
    }
    prev_t = orch_mod.parse_generated_at({"generated_at": "2026-06-10T00:00:00Z"})
    prev = {7: AgentSample(cpu_ticks=40, generated_at=prev_t)}
    # Δticks=10 over clk_tck=100 -> 0.1 CPU-seconds over 100s -> frac 0.001 < floor.
    ev = eval_envelope(env, cfg, prev_agents=prev, prev_failed=None)
    hung = [s for s in ev.signals if s.key == ("agent_hung", 7)]
    assert len(hung) == 1 and hung[0].active is True


def test_agent_hung_skipped_when_cpu_ticks_null():
    cfg = _cfg(WATCHDOG_AGENT_HUNG_MIN="1")
    env = {
        "generated_at": "2026-06-10T00:01:40Z",
        "clk_tck": 100,
        "agents": [{"run_id": 8, "runtime_s": 999, "cpu_ticks": None}],
    }
    prev = {8: AgentSample(cpu_ticks=10, generated_at=0.0)}
    ev = eval_envelope(env, cfg, prev_agents=prev, prev_failed=None)
    assert [s for s in ev.signals if s.key == ("agent_hung", 8)] == []


def test_agent_busy_not_hung():
    cfg = _cfg(WATCHDOG_AGENT_HUNG_MIN="1", WATCHDOG_AGENT_CPU_FLOOR="0.01")
    env = {
        "generated_at": "2026-06-10T00:01:40Z",
        "clk_tck": 100,
        "agents": [{"run_id": 9, "runtime_s": 999, "cpu_ticks": 5000}],
    }
    prev_t = orch_mod.parse_generated_at({"generated_at": "2026-06-10T00:00:00Z"})
    prev = {9: AgentSample(cpu_ticks=40, generated_at=prev_t)}
    # Big Δticks -> high CPU fraction -> not hung.
    ev = eval_envelope(env, cfg, prev_agents=prev, prev_failed=None)
    assert [s for s in ev.signals if s.key == ("agent_hung", 9)] == []
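The agent_hung tests above rest on one piece of arithmetic: the CPU-tick delta between two polls is converted to CPU-seconds via `clk_tck` and divided by the wall-clock gap between the two `generated_at` stamps. A minimal sketch of that calculation follows. `cpu_fraction` is a hypothetical helper written for illustration, not the function in `watchdog/signals.py`, and its None handling is an assumption modelled on the null-ticks test.

```python
from typing import Optional


def cpu_fraction(
    prev_ticks: Optional[int],
    cur_ticks: Optional[int],
    clk_tck: int,
    elapsed_s: float,
) -> Optional[float]:
    """Fraction of one CPU used between two polls, or None if not computable."""
    if prev_ticks is None or cur_ticks is None:
        return None  # assumption: a missing sample skips the check, it is not guessed
    if clk_tck <= 0 or elapsed_s <= 0:
        return None
    cpu_seconds = (cur_ticks - prev_ticks) / clk_tck
    return cpu_seconds / elapsed_s


FLOOR = 0.01  # mirrors WATCHDOG_AGENT_CPU_FLOOR="0.01" in the tests

# Same numbers as the tests: prev=40 ticks, 100 s between polls, clk_tck=100.
idle = cpu_fraction(40, 50, clk_tck=100, elapsed_s=100.0)    # 10 ticks of work
busy = cpu_fraction(40, 5000, clk_tck=100, elapsed_s=100.0)  # 4960 ticks of work
print(idle < FLOOR)  # True: below the floor, a candidate for agent_hung
print(busy < FLOOR)  # False: the agent is clearly doing work
```

Keeping the calculation separate from the threshold comparison makes the "skipped when cpu_ticks is null" case a plain None check rather than a special branch in the alerting code.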
88
tests/watchdog/test_never_raise.py
Normal file
@@ -0,0 +1,88 @@
"""TC-06: three-level never-raise.

A raising collector (host / containers / deps) degrades ONE signal and the tick
reaches the end collecting the rest; a raising send is swallowed; the daemon
loop survives a raising tick.
"""
from watchdog.config import Config
from watchdog.core import Watchdog


class _BoomDocker:
    def inspect(self, name):
        raise RuntimeError("docker socket blew up")


class _Notifier:
    def __init__(self):
        self.sent = []

    def send(self, text):
        self.sent.append(text)
        return True


class _BoomNotifier:
    def send(self, text):
        raise RuntimeError("telegram blew up")


def _cfg(**kw):
    base = {
        "WATCHDOG_TG_BOT_TOKEN": "t",
        "WATCHDOG_TG_CHAT_ID": "c",
        "WATCHDOG_CONTAINERS": "orchestrator",
    }
    return Config.from_env({**base, **kw})


def _good_fetch_patch(dog, monkeypatch):
    from watchdog.collectors import orch as orch_mod

    env = {"schema_version": 1, "generated_at": "2026-06-10T00:00:00Z",
           "clk_tck": 100, "agents": [], "stages": [],
           "queue": {"depth": 0, "counts": {"failed": 0}}}
    monkeypatch.setattr(
        orch_mod, "fetch_metrics",
        lambda *a, **k: orch_mod.FetchResult(ok=True, envelope=env),
    )


def test_per_source_broken_container_degrades_one_signal(monkeypatch):
    notifier = _Notifier()
    dog = Watchdog(_cfg(), notifier=notifier, docker=_BoomDocker())
    _good_fetch_patch(dog, monkeypatch)
    # Should not raise; tick completes and produces results for other sources.
    results = dog.tick()
    keys = [getattr(s, "key", None) for _, s in results]
    # orch_down evaluated (orch was up -> not active) and container evaluated.
    assert "orch_down" in keys
    assert ("container_down", "orchestrator") in keys


def test_per_send_failure_is_swallowed(monkeypatch):
    # A raising notifier must not break the tick (per-send never-raise).
    cfg = _cfg(WATCHDOG_MEM_PCT="0")  # mem >= 0 always -> force an alert send
    dog = Watchdog(cfg, notifier=_BoomNotifier(), docker=_BoomDocker())
    _good_fetch_patch(dog, monkeypatch)
    monkeypatch.setattr(
        "watchdog.collectors.host.read_mem_used_pct", lambda *a, **k: 50.0
    )
    # Must not raise despite the notifier exploding on a triggered alert.
    dog.tick()


def test_per_tick_loop_survives_raising_tick(monkeypatch):
    # The __main__ run loop must survive a tick that raises (outer never-raise).
    from watchdog import __main__ as entry

    cfg = _cfg(WATCHDOG_INTERVAL_S="0")

    class _BoomDog:
        def tick(self):
            raise RuntimeError("tick blew up")

    monkeypatch.setattr(entry, "Watchdog", lambda c: _BoomDog())
    monkeypatch.setattr(entry.time, "sleep", lambda *_: None)
    # max_ticks bounds the loop; it must return cleanly, not propagate.
    entry.run(cfg=cfg, max_ticks=3)
84
tests/watchdog/test_notify_isolation.py
Normal file
@@ -0,0 +1,84 @@
"""TC-10: independent Telegram transport.

The sidecar sends through its OWN bot_token/chat_id from env and must NOT import
``src.notifications`` or the orchestrator's code (C-1 / BR-8).
"""
import pathlib

from watchdog import notify as notify_mod
from watchdog.notify import Notifier, send_telegram


def test_notify_uses_own_token_and_chat(monkeypatch):
    captured = {}

    def _fake_opener(req, timeout=None):
        captured["url"] = req.full_url
        captured["data"] = req.data

        class _R:
            status = 200

            def getcode(self):
                return 200

            def __enter__(self_inner):
                return self_inner

            def __exit__(self_inner, *a):
                return False

        return _R()

    ok = send_telegram(
        "MYTOKEN", "MYCHAT", "hello", opener=_fake_opener, api_base="https://tg.test"
    )
    assert ok is True
    assert "botMYTOKEN" in captured["url"]
    assert b"MYCHAT" in captured["data"]


def test_missing_credentials_is_failsafe_no_send():
    # Absent token/chat -> logs and returns False, never raises (fail-safe).
    assert send_telegram("", "chat", "x") is False
    assert send_telegram("tok", "", "x") is False


def test_send_failure_is_swallowed():
    def _boom(req, timeout=None):
        raise OSError("network down")

    assert send_telegram("t", "c", "x", opener=_boom) is False


def test_notifier_wraps_credentials(monkeypatch):
    sent = {}
    monkeypatch.setattr(
        notify_mod, "send_telegram",
        lambda tok, chat, text, timeout: sent.update(tok=tok, chat=chat, text=text) or True,
    )
    Notifier("TOK", "CHAT").send("body")
    assert sent == {"tok": "TOK", "chat": "CHAT", "text": "body"}


def test_watchdog_package_does_not_import_src():
    # No watchdog/*.py file may reference the orchestrator's src package (C-1).
    # (Source scan, not sys.modules: the global test conftest imports src.* for
    # every test, so a runtime check would be polluted.)
    pkg_root = pathlib.Path(notify_mod.__file__).resolve().parent
    offenders = []
    for py in pkg_root.rglob("*.py"):
        text = py.read_text(encoding="utf-8")
        for needle in ("import src", "from src", "src.notifications"):
            if needle in text:
                offenders.append(f"{py.name}: {needle}")
    assert offenders == [], f"watchdog references the orchestrator src: {offenders}"


def test_notify_source_has_no_src_notifications_import():
    import inspect

    src = inspect.getsource(notify_mod)
    assert "src.notifications" not in src
    assert "from src" not in src
    assert "import src" not in src
67
tests/watchdog/test_orch_down.py
Normal file
@@ -0,0 +1,67 @@
"""TC-05: orchestrator-down detection.

A ``/metrics`` timeout / connection-refused / 5xx / unreadable body -> the
``orch_down`` signal -> ALERT "орк не отвечает" ("orchestrator is not
responding") once the debounce threshold of consecutive failures is reached
(FR-3).
"""
from watchdog.collectors import orch as orch_mod
from watchdog.config import Config
from watchdog.signals import orch_down_signal

from .conftest import http_error, make_opener


def _cfg(**kw):
    return Config.from_env({**{"WATCHDOG_ORCH_DOWN_TICKS": "3"}, **kw})


def test_fetch_timeout_is_not_ok():
    opener = make_opener(exc=TimeoutError("timed out"))
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is False
    assert res.envelope is None
    assert res.error


def test_fetch_connection_refused_is_not_ok():
    opener = make_opener(exc=ConnectionRefusedError("refused"))
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is False


def test_fetch_5xx_is_not_ok():
    opener = make_opener(status=503, body=b"oops")
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is False
    assert "503" in (res.error or "")


def test_fetch_httperror_5xx_is_not_ok():
    opener = make_opener(exc=http_error(502))
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is False


def test_fetch_unreadable_body_is_not_ok():
    opener = make_opener(status=200, body=b"not-json{{{")
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is False


def test_fetch_good_body_is_ok():
    opener = make_opener(status=200, body=b'{"schema_version":1,"stages":[]}')
    res = orch_mod.fetch_metrics("http://x/metrics", 1.0, opener=opener)
    assert res.ok is True
    assert res.envelope["schema_version"] == 1


def test_orch_down_signal_debounce_then_alert():
    cfg = _cfg()
    # Single transient failure -> NOT active (does not flap).
    assert orch_down_signal(1, cfg, "timeout").active is False
    assert orch_down_signal(2, cfg, "timeout").active is False
    # K-th consecutive failure -> active alarm.
    sig = orch_down_signal(3, cfg, "timeout")
    assert sig.active is True
    assert sig.key == "orch_down"
    # The alert text is Russian: "не отвечает" = "not responding".
    assert "не отвечает" in sig.detail
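The debounce exercised by the last test can be pictured as a counter of consecutive failed polls that resets on the first success. The sketch below is an illustration under that assumption; `FailureDebouncer` is a hypothetical name, and the real `orch_down_signal` receives the failure count from its caller rather than keeping it in a class like this.

```python
from typing import List


class FailureDebouncer:
    """Counts consecutive failures; the alarm is active once the count reaches K."""

    def __init__(self, threshold: int) -> None:
        self.threshold = max(1, threshold)
        self.consecutive = 0

    def observe(self, ok: bool) -> bool:
        """Record one poll result and return whether the alarm is active."""
        self.consecutive = 0 if ok else self.consecutive + 1
        return self.consecutive >= self.threshold


deb = FailureDebouncer(threshold=3)  # mirrors WATCHDOG_ORCH_DOWN_TICKS="3"
polls = (False, False, True, False, False, False)  # one hiccup, then a real outage
states: List[bool] = [deb.observe(ok) for ok in polls]
print(states)  # [False, False, False, False, False, True]
```

A single successful poll wipes the streak, so two failures followed by a recovery never page anyone; only the third failure in a row does.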
106
tests/watchdog/test_tick_orch_down_integration.py
Normal file
@@ -0,0 +1,106 @@
"""TC-08: full tick with the orchestrator down (integration).

With ``/metrics`` failing, the tick must not crash, must still collect host /
containers / deps, must produce EXACTLY ONE ``orch_down`` alert (after
the debounce), suppress within cooldown, and emit recovery on restoration.
"""
from watchdog.collectors import orch as orch_mod
from watchdog.config import Config
from watchdog.core import Watchdog


class _Notifier:
    def __init__(self):
        self.sent = []

    def send(self, text):
        self.sent.append(text)
        return True


class _StubDocker:
    def inspect(self, name):
        return {"State": {"Status": "running"}}


def _cfg(**kw):
    base = {
        "WATCHDOG_TG_BOT_TOKEN": "t",
        "WATCHDOG_TG_CHAT_ID": "c",
        "WATCHDOG_ORCH_DOWN_TICKS": "2",
        "WATCHDOG_COOLDOWN_S": "1000",
        "WATCHDOG_CONTAINERS": "orchestrator",
    }
    return Config.from_env({**base, **kw})


def _clock():
    t = {"v": 0.0}

    def now():
        return t["v"]

    return t, now


def _down(monkeypatch):
    monkeypatch.setattr(
        orch_mod, "fetch_metrics",
        lambda *a, **k: orch_mod.FetchResult(ok=False, error="timeout"),
    )


def _up(monkeypatch):
    env = {"schema_version": 1, "generated_at": "2026-06-10T00:00:00Z",
           "clk_tck": 100, "agents": [], "stages": [],
           "queue": {"depth": 0, "counts": {"failed": 0}}}
    monkeypatch.setattr(
        orch_mod, "fetch_metrics",
        lambda *a, **k: orch_mod.FetchResult(ok=True, envelope=env),
    )


def _orch_down_alerts(notifier):
    # "не отвечает" = "not responding" (the Russian orch_down alert text).
    return [m for m in notifier.sent if "не отвечает" in m]


def test_tick_orch_down_one_alert_then_throttle_then_recovery(monkeypatch):
    notifier = _Notifier()
    t, now = _clock()
    dog = Watchdog(_cfg(), notifier=notifier, docker=_StubDocker(), now_provider=now)

    _down(monkeypatch)
    # tick 1: first failure -> debounced, NOT yet active -> no alert.
    dog.tick()
    assert _orch_down_alerts(notifier) == []

    # tick 2: second consecutive failure -> active -> EXACTLY ONE alert.
    t["v"] = 30.0
    dog.tick()
    assert len(_orch_down_alerts(notifier)) == 1

    # tick 3: still down, within cooldown -> throttled (no new alert).
    t["v"] = 60.0
    dog.tick()
    assert len(_orch_down_alerts(notifier)) == 1

    # restore: orchestrator answers again -> recovery message.
    _up(monkeypatch)
    t["v"] = 90.0
    dog.tick()
    # "восстановление" = "recovery"; "Орк" = "Orc", short for the orchestrator.
    recoveries = [m for m in notifier.sent if "восстановление" in m and "Орк" in m]
    assert len(recoveries) == 1


def test_tick_does_not_crash_when_everything_breaks(monkeypatch):
    # orch down + docker raising + no deps: tick still completes.
    class _BoomDocker:
        def inspect(self, name):
            raise RuntimeError("boom")

    notifier = _Notifier()
    dog = Watchdog(_cfg(), notifier=notifier, docker=_BoomDocker())
    _down(monkeypatch)
    dog.tick()  # must not raise
    dog.tick()
    assert len(_orch_down_alerts(notifier)) == 1
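The alert, throttle, recovery sequence asserted by the integration test can be modelled as a pure function of the current signal state, the time of the last alert, the clock, and the cooldown. The sketch below is only an illustration of that idea: the signature, the state representation, and the action names `"alert"` and `"recovery"` are assumptions, not the package's actual `decide` (only the `"none"` action is visible in `watchdog/__main__.py`).

```python
from typing import List, Optional, Tuple


def decide(
    active: bool,
    last_alert_at: Optional[float],
    now: float,
    cooldown_s: float,
) -> Tuple[str, Optional[float]]:
    """Return (action, new_last_alert_at); action is 'alert', 'recovery' or 'none'."""
    if active:
        if last_alert_at is None or now - last_alert_at >= cooldown_s:
            return "alert", now        # first alert, or the cooldown has elapsed
        return "none", last_alert_at   # still firing, throttled
    if last_alert_at is not None:
        return "recovery", None        # was alerting, signal is now clear
    return "none", None                # quiet before, quiet now


# Replay the test timeline: with K=2 the signal only becomes active on tick 2,
# the cooldown is 1000 s, and the orchestrator comes back at t=90.
timeline = [(0.0, False), (30.0, True), (60.0, True), (90.0, False)]
state: Optional[float] = None
actions: List[str] = []
for now, active in timeline:
    action, state = decide(active, state, now, cooldown_s=1000.0)
    actions.append(action)
print(actions)  # ['none', 'alert', 'none', 'recovery']
```

Because the function holds no state of its own, the per-signal memory reduces to one timestamp per key, which is easy to keep in a dictionary and easy to test with a fake clock.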
28
watchdog/Dockerfile
Normal file
@@ -0,0 +1,28 @@
# ORCH-100 (FND/F1b): sidecar-watchdog — thin stdlib-only monitoring brain.
#
# A separate, deliberately tiny image (NO pip dependencies — Python 3.12 stdlib
# only, ADR-001 D1): urllib for HTTP/Telegram, a raw HTTP-over-unix-socket client
# for the read-only docker.sock, shutil/proc for host metrics. Kept thin on a
# tight host (C-3); mem_limit is enforced in docker-compose.yml (D2).
#
# The build context is the REPO ROOT (see docker-compose.yml `build:
# context: . / dockerfile: watchdog/Dockerfile`) so we can COPY the watchdog/
# package. src/** is intentionally NOT copied — the sidecar must not import the
# orchestrator (C-1).
FROM python:3.12-slim

WORKDIR /app

# Run as a non-root user; the sidecar only READS (docker.sock :ro, host paths :ro).
RUN useradd -u 1000 -m -d /home/watchdog -s /bin/bash watchdog

# Copy ONLY the sidecar package (no src/, no requirements — stdlib only).
COPY watchdog/ ./watchdog/

ENV PYTHONPATH=/app
ENV PYTHONUNBUFFERED=1

USER watchdog

# `python -m watchdog` runs watchdog/__main__.py (the tick loop).
ENTRYPOINT ["python", "-m", "watchdog"]
31
watchdog/__init__.py
Normal file
@@ -0,0 +1,31 @@
"""ORCH-100 (FND/F1b): sidecar-watchdog — the monitoring brain in a separate container.

This package is the *brain* half of the domain-0 observability pair. F1a
(ORCH-099, ``src/metrics.py``) exposes a lightweight read-only ``GET /metrics``
envelope — raw signal only. F1b (this package) is the stateful observer that
reads that envelope, augments it with host / container / dependency probes, runs
every signal through a generalised pure decision function (modelled 1:1 on
``src/disk_watchdog.py::decide_action``) with per-signal in-memory
dedup / throttle / recovery, and emits alerts over its OWN independent Telegram
channel.

Hard invariants (ADR-001, ``docs/work-items/ORCH-100/06-adr/``):
* The observer is separated from the observed: the runtime is a separate
  container (``orchestrator-watchdog``). A hang/crash of the orchestrator makes
  the sidecar *louder* (``orch_down``), never silent.
* Strictly read-only to the observed system: ``docker.sock`` is GET-only (and
  mounted ``:ro``), no DB writes, no disk writes, no process control
  (start/stop/restart/exec) — self-hosting-safe on the shared prod host.
* never-raise on three levels (per-source / per-tick / per-send) + a
  ``WATCHDOG_ENABLED`` kill-switch.
* NO import from ``src/**`` — the sidecar must survive a refactor/crash of the
  orchestrator process (C-1).

``KNOWN_SCHEMA_VERSION`` is the highest ``/metrics`` schema_version this build
understands. A higher value from the orchestrator is tolerated (warning, read
the compatible subset), never a crash (D9).
"""

KNOWN_SCHEMA_VERSION = 1

__all__ = ["KNOWN_SCHEMA_VERSION"]
75
watchdog/__main__.py
Normal file
@@ -0,0 +1,75 @@
"""Sidecar entrypoint: the tick loop with kill-switch + per-tick never-raise (D8).

Run as ``python -m watchdog`` (the container ``ENTRYPOINT``). The loop:
* honours ``WATCHDOG_ENABLED=false`` -> stays INERT (idle-loops with a log line,
  does NOT ``exit``, so ``restart: unless-stopped`` does not spin a restart loop);
* wraps every tick in an outer ``try/except`` so a tick error logs and the daemon
  survives (per-tick never-raise);
* logs start / each tick so the container logs prove the sidecar is alive and why
  an alert did (not) fire (NFR-7).
"""
from __future__ import annotations

import logging
import time

from .config import Config
from .core import Watchdog

logger = logging.getLogger("watchdog")


def _setup_logging() -> None:
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )


def run(cfg: Config | None = None, max_ticks: int | None = None) -> None:
    """Run the tick loop. ``max_ticks`` bounds the loop for tests (``None`` = forever)."""
    cfg = cfg or Config.from_env()

    if not cfg.enabled:
        logger.info("watchdog: WATCHDOG_ENABLED=false -> inert (idle, no ticks)")
        # Idle, not exit: keep the container up so restart-policy does not flap.
        ticks = 0
        while max_ticks is None or ticks < max_ticks:
            time.sleep(cfg.interval_s)
            ticks += 1
        return

    logger.info(
        "watchdog started (interval=%ss, metrics=%s, containers=%s, deps=%s, "
        "mem_pct=%s, disk_crit=%s)",
        cfg.interval_s,
        cfg.metrics_url,
        cfg.containers,
        list(cfg.deps),
        cfg.mem_pct,
        cfg.disk_crit_enabled,
    )
    dog = Watchdog(cfg)
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        try:
            dispatched = dog.tick()
            fired = [
                (a, getattr(s, "key", None)) for a, s in dispatched if a != "none"
            ]
            logger.info("watchdog tick ok (fired=%s)", fired)
        except Exception as e:  # noqa: BLE001 - per-tick outer never-raise (D8)
            logger.error("watchdog tick error: %s", e)
        ticks += 1
        if max_ticks is not None and ticks >= max_ticks:
            break
        time.sleep(cfg.interval_s)


def main() -> None:
    _setup_logging()
    run()


if __name__ == "__main__":
    main()
5
watchdog/collectors/__init__.py
Normal file
@@ -0,0 +1,5 @@
"""Sidecar collectors: orchestrator ``/metrics``, host, containers, dependencies.

Each collector is never-raise at the source level (per-source degradation, D8):
a broken source degrades ONE signal and the tick keeps collecting the rest.
"""
119
watchdog/collectors/containers.py
Normal file
@@ -0,0 +1,119 @@
"""Collector: container statuses over a READ-ONLY ``docker.sock`` (D1, D2, FR-5).

Raw HTTP-over-unix-socket via stdlib (``socket.AF_UNIX`` +
``http.client.HTTPConnection`` subclass) — NO ``docker`` pip package. The client
issues ``GET`` requests ONLY (``GET /containers/json``,
``GET /containers/<name>/json``) — it is read-only **by construction**: there is
no method that POSTs / starts / stops / restarts / execs (AC-6, TC-09). The
mount is additionally ``:ro``, a second guarantee.

``classify_container`` is a pure function (Up / healthy / restarting / exited /
unhealthy) and ``container_alarm`` decides whether the status is alerting — both
testable without a live Docker.
"""
from __future__ import annotations

import http.client
import json
import logging
import socket

logger = logging.getLogger("watchdog.collectors.containers")

# A container is "healthy" (no alarm) only in these states.
_OK_STATES = frozenset({"running", "healthy"})


class _UnixHTTPConnection(http.client.HTTPConnection):
    """``HTTPConnection`` over an ``AF_UNIX`` socket (stdlib only, GET-only use)."""

    def __init__(self, sock_path: str, timeout: float):
        super().__init__("localhost", timeout=timeout)
        self._sock_path = sock_path

    def connect(self) -> None:  # noqa: D401 - override
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._sock_path)
        self.sock = sock


class DockerSockReader:
    """Read-only Docker API client over the unix socket.

    EXPOSES READ METHODS ONLY (``list_containers`` / ``inspect``); the single
    private primitive ``_get`` hard-codes the ``GET`` HTTP method, so no caller
    can ever mutate the Docker state (AC-6 / TC-09). never-raise: any socket /
    HTTP / parse error degrades to ``None`` / ``[]``.
    """

    def __init__(self, sock_path: str = "/var/run/docker.sock", timeout_s: float = 3.0):
        self._sock_path = sock_path
        self._timeout = timeout_s

    def _get(self, path: str) -> object | None:
        """Issue a single ``GET <path>`` over the socket. never-raise.

        This is the ONLY request primitive and it is GET-only — the read-only
        guarantee is structural, not policy.
        """
        conn = None
        try:
            conn = _UnixHTTPConnection(self._sock_path, self._timeout)
            conn.request("GET", path)
            resp = conn.getresponse()
            body = resp.read()
            if resp.status >= 400:
                logger.warning("watchdog: docker GET %s -> %s", path, resp.status)
                return None
            return json.loads(body.decode("utf-8", errors="replace"))
        except Exception as e:  # noqa: BLE001 - docker unreachable -> degrade
            logger.warning("watchdog: docker GET %s failed: %s", path, e)
            return None
        finally:
            if conn is not None:
                try:
                    conn.close()
                except Exception:  # noqa: BLE001
                    pass

    def list_containers(self) -> list:
        """``GET /containers/json?all=1`` — every container (read-only)."""
        data = self._get("/containers/json?all=1")
        return data if isinstance(data, list) else []

    def inspect(self, name: str) -> dict | None:
        """``GET /containers/<name>/json`` — one container's detail (read-only)."""
        data = self._get(f"/containers/{name}/json")
        return data if isinstance(data, dict) else None


def classify_container(inspect: dict | None) -> str:
    """Pure classifier: inspect-JSON -> a coarse status token (D5).

    Returns one of ``running`` / ``healthy`` / ``unhealthy`` / ``restarting`` /
    ``exited`` / ``created`` / ``paused`` / ``dead`` / ``unknown``. When a
    healthcheck is present its verdict (``healthy`` / ``unhealthy``) takes
    precedence over the bare ``running`` state. Never raises.
    """
    try:
        if not inspect:
            return "unknown"
        state = inspect.get("State")
        if not isinstance(state, dict):
            return "unknown"
        status = (state.get("Status") or "").strip().lower()
        health = state.get("Health")
        if isinstance(health, dict):
            hstatus = (health.get("Status") or "").strip().lower()
            if hstatus in ("healthy", "unhealthy"):
                return hstatus
        return status or "unknown"
    except Exception as e:  # noqa: BLE001 - classification must never crash
        logger.warning("watchdog: classify_container error: %s", e)
        return "unknown"


def container_alarm(status: str) -> bool:
    """True when ``status`` is NOT a healthy state (restarting/exited/unhealthy/...)."""
    return (status or "").strip().lower() not in _OK_STATES
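The HTTP-over-unix-socket technique used by this module can be tried without a Docker daemon by pointing the same kind of connection at a throwaway local server. The sketch below assumes a Unix-like host with `AF_UNIX` support. `FakeDockerHandler` and the canned inspect body are hypothetical stand-ins for the Docker API, and `UnixHTTPConnection` restates the module's private class so the example is self-contained.

```python
import http.client
import json
import os
import socket
import socketserver
import tempfile
import threading
from http.server import BaseHTTPRequestHandler


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection whose transport is an AF_UNIX stream socket."""

    def __init__(self, sock_path: str, timeout: float) -> None:
        super().__init__("localhost", timeout=timeout)
        self._sock_path = sock_path

    def connect(self) -> None:
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.settimeout(self.timeout)
        sock.connect(self._sock_path)
        self.sock = sock


class FakeDockerHandler(BaseHTTPRequestHandler):
    """Stand-in for the Docker API: answers every GET with a canned inspect body."""

    def do_GET(self) -> None:
        body = json.dumps({"State": {"Status": "running"}}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        pass  # a unix-socket peer has no host/port for the default logger


with tempfile.TemporaryDirectory() as tmp:
    sock_path = os.path.join(tmp, "fake-docker.sock")
    server = socketserver.UnixStreamServer(sock_path, FakeDockerHandler)
    worker = threading.Thread(target=server.handle_request, daemon=True)
    worker.start()  # serve exactly one request in the background

    conn = UnixHTTPConnection(sock_path, timeout=3.0)
    conn.request("GET", "/containers/orchestrator/json")
    resp = conn.getresponse()
    status_code = resp.status
    payload = json.loads(resp.read().decode("utf-8"))
    conn.close()

    worker.join(timeout=3.0)
    server.server_close()

print(status_code, payload["State"]["Status"])
```

The same fake server is a convenient fixture for exercising a reader like `DockerSockReader` end to end, since only the socket path differs from production.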
51
watchdog/collectors/deps.py
Normal file
@@ -0,0 +1,51 @@
"""Collector: external dependency pings — Plane / Gitea / Anthropic (FR-6).

A light ``GET`` with a short timeout per configured dependency. never-raise: an
unreachable dependency returns ``False`` (a signal for the threshold), never an
exception (D8). Endpoints / timeouts are configured via ``WATCHDOG_DEPS`` (D5);
an empty config means no pings (fail-safe).
"""
from __future__ import annotations

import logging
import urllib.error
import urllib.request

logger = logging.getLogger("watchdog.collectors.deps")


def ping(url: str, timeout_s: float, *, opener=urllib.request.urlopen) -> bool:
    """True when ``url`` answers with a non-5xx HTTP status. never-raise.

    A 4xx still counts as "reachable" (the host is up and responding) — we ping
    for liveness, not for auth. ``opener`` is injected so tests never hit the
    network.
    """
    try:
        req = urllib.request.Request(url, method="GET")
        with opener(req, timeout=timeout_s) as resp:
            status = int(getattr(resp, "status", None) or resp.getcode())
            return status < 500
    except urllib.error.HTTPError as e:
        # An HTTP error response still proves the host is reachable, unless 5xx.
        return int(getattr(e, "code", 500)) < 500
    except Exception as e:  # noqa: BLE001 - unreachable -> down signal, not a crash
        logger.warning("watchdog: dep ping %s failed: %s", url, e)
        return False


def ping_all(
    deps: dict[str, str],
    timeout_s: float,
    *,
    opener=urllib.request.urlopen,
) -> dict[str, bool]:
    """Ping every configured dependency -> ``{name: reachable}``. never-raise."""
    out: dict[str, bool] = {}
    for name, url in deps.items():
        try:
            out[name] = ping(url, timeout_s, opener=opener)
        except Exception as e:  # noqa: BLE001 - one dep degrades, others continue
            logger.warning("watchdog: dep %s ping error: %s", name, e)
            out[name] = False
    return out
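The liveness rule has two code paths because `urllib` delivers 4xx and 5xx answers as a raised `HTTPError`, not as a returned response. The sketch below shows how an injected opener exercises both paths without touching the network. `fake_opener`, `FakeResponse`, and `reachable` are hypothetical names; `reachable` is a condensed restatement of the rule for illustration, not the module's `ping`.

```python
import urllib.error
import urllib.request


class FakeResponse:
    """Just enough of an HTTP response for a context-managed status check."""

    def __init__(self, status: int) -> None:
        self.status = status

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


def fake_opener(status=None, exc=None):
    """Build an opener that either returns a canned status or raises `exc`."""
    def opener(req, timeout=None):
        if exc is not None:
            raise exc
        return FakeResponse(status)
    return opener


def reachable(url: str, timeout_s: float, opener) -> bool:
    """Any non-5xx answer counts as 'up'; no answer at all counts as 'down'."""
    try:
        req = urllib.request.Request(url, method="GET")
        with opener(req, timeout=timeout_s) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500  # the host answered, so only 5xx means trouble
    except Exception:
        return False         # timeout, DNS failure, refused connection, ...


url = "http://dep.invalid/health"  # never contacted: the opener is faked
not_found = urllib.error.HTTPError(url, 404, "Not Found", None, None)
unavailable = urllib.error.HTTPError(url, 503, "Unavailable", None, None)
results = {
    "204": reachable(url, 1.0, fake_opener(status=204)),
    "404": reachable(url, 1.0, fake_opener(exc=not_found)),
    "503": reachable(url, 1.0, fake_opener(exc=unavailable)),
    "dns": reachable(url, 1.0, fake_opener(exc=urllib.error.URLError("no route"))),
}
print(results)  # {'204': True, '404': True, '503': False, 'dns': False}
```

Treating 404 as reachable keeps an unauthenticated or mistyped health path from paging as an outage, at the cost of not noticing a misconfigured endpoint.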
75
watchdog/collectors/host.py
Normal file
@@ -0,0 +1,75 @@
|
||||
"""Collector: host metrics — memory (/proc/meminfo), disk (shutil.disk_usage).
|
||||
|
||||
stdlib-only, the same primitives ``disk_watchdog`` uses (D1). Every reader is
|
||||
never-raise: a missing path / unreadable proc-file degrades to ``None`` (one
|
||||
signal skipped), never a tick crash (D8). CPU "hung agent" liveness is computed
|
||||
from the ``/metrics`` envelope (cpu_ticks), not here.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import shutil
|
||||
|
||||
logger = logging.getLogger("watchdog.collectors.host")
|
||||
|
||||
|
||||
def read_mem_used_pct(meminfo_path: str = "/proc/meminfo") -> float | None:
    """Host memory used-% from ``/proc/meminfo`` (``MemTotal`` / ``MemAvailable``).

    ``used_pct = (1 - MemAvailable/MemTotal) * 100``. Returns ``None`` on a
    missing file / unparseable content / non-Linux (never raises).
    """
    try:
        fields: dict[str, int] = {}
        with open(meminfo_path, "r") as f:
            for line in f:
                parts = line.split(":")
                if len(parts) != 2:
                    continue
                key = parts[0].strip()
                val = parts[1].strip().split()
                if val:
                    try:
                        fields[key] = int(val[0])  # value is in kB
                    except ValueError:
                        continue
        total = fields.get("MemTotal")
        avail = fields.get("MemAvailable")
        if not total or avail is None:
            return None
        used_pct = (1.0 - (avail / total)) * 100.0
        return round(used_pct, 1)
    except Exception as e:  # noqa: BLE001 - degrade one signal, keep the tick
        logger.warning("watchdog: cannot read memory: %s", e)
        return None


def read_disk_used_pct(path: str) -> float | None:
    """Disk used-% for one path via ``shutil.disk_usage`` (1:1 with disk_watchdog).

    Returns ``None`` if the path is missing / unreadable (never raises).
    """
    try:
        usage = shutil.disk_usage(path)
        total = int(usage.total)
        if total <= 0:
            return None
        return round(int(usage.used) / total * 100.0, 1)
    except Exception as e:  # noqa: BLE001 - skip this path, keep the tick
        logger.warning("watchdog: cannot measure disk %s: %s", path, e)
        return None


def max_disk_used_pct(paths: list[str]) -> tuple[str, float] | None:
    """The fullest of ``paths`` as ``(path, used_pct)``, the worst-case ceiling.

    A path that cannot be measured is skipped; ``None`` if none could be read.
    """
    worst: tuple[str, float] | None = None
    for p in paths:
        pct = read_disk_used_pct(p)
        if pct is None:
            continue
        if worst is None or pct > worst[1]:
            worst = (p, pct)
    return worst
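A worked example of the memory formula used by `read_mem_used_pct`, `used_pct = (1 - MemAvailable/MemTotal) * 100`. This is a standalone sketch that parses an in-memory sample instead of `/proc/meminfo`; the helper name `used_pct_from_meminfo` and the sample numbers are made up for illustration.

```python
from __future__ import annotations


def used_pct_from_meminfo(text: str) -> float | None:
    """Compute (1 - MemAvailable/MemTotal) * 100 from /proc/meminfo-style text."""
    fields: dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens and tokens[0].isdigit():
            fields[key.strip()] = int(tokens[0])  # values are reported in kB
    total = fields.get("MemTotal")
    avail = fields.get("MemAvailable")
    if not total or avail is None:
        return None  # missing field or zero total: no verdict rather than a guess
    return round((1.0 - avail / total) * 100.0, 1)


SAMPLE = """MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:    4000000 kB
"""

used = used_pct_from_meminfo(SAMPLE)  # 1 - 4000000/16000000 = 0.75
missing = used_pct_from_meminfo("MemTotal: 16000000 kB\n")  # no MemAvailable line
```

Note that the formula relies on `MemAvailable`, not `MemFree`, so reclaimable page cache does not count as used memory.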
118
watchdog/collectors/orch.py
Normal file
@@ -0,0 +1,118 @@
"""Collector: orchestrator ``GET /metrics`` -> parsed envelope | orchestrator_down.

The orchestrator runs ``network_mode: host`` on port 8500, so from the
host-network sidecar ``/metrics`` is reachable at ``http://127.0.0.1:8500/metrics``
(configurable). The body is the F1a versioned envelope
``{schema_version, generated_at, clk_tck, stages[], queue, agents[], cost,
enabled}`` (adr-0030 D2). Parsing is DEFENSIVE (D9): unknown keys are ignored,
a missing optional is not an error, a ``schema_version`` higher than known is
logged (warning) but read as the compatible subset, never a crash.

A timeout / connection-refused / 5xx / unreadable body is itself the master
alarm signal ``orchestrator_down`` (FR-3), surfaced by ``FetchResult.ok ==
False``, NOT an exception (never-raise per-source, D8).
"""
from __future__ import annotations

import json
import logging
import urllib.error
import urllib.request
from dataclasses import dataclass
from datetime import datetime, timezone

from .. import KNOWN_SCHEMA_VERSION

logger = logging.getLogger("watchdog.collectors.orch")


@dataclass
class FetchResult:
    """Outcome of one ``/metrics`` probe.

    ``ok`` is ``True`` only when a 2xx response carried a parseable JSON object.
    Any other outcome (timeout / refused / 5xx / unreadable) -> ``ok == False``
    with a human ``error`` -> the ``orchestrator_down`` signal source.
    """

    ok: bool
    envelope: dict | None = None
    error: str | None = None


def parse_envelope(body: str | bytes) -> dict:
    """Parse the ``/metrics`` body into a dict, tolerant (D9, TC-11).

    Raises ``ValueError`` only when the body is not a JSON object (that is the
    "unreadable body" case the caller maps to ``orchestrator_down``). A valid
    object with unknown / missing keys parses cleanly; downstream readers use
    ``.get(...)`` with defaults.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    data = json.loads(body)
    if not isinstance(data, dict):
        raise ValueError("metrics body is not a JSON object")
    return data


def check_schema_version(envelope: dict) -> None:
    """Warn (never crash) when the orchestrator advertises a newer contract (D9)."""
    try:
        sv = envelope.get("schema_version")
        if isinstance(sv, int) and sv > KNOWN_SCHEMA_VERSION:
            logger.warning(
                "watchdog: /metrics schema_version=%s > known=%s; reading the "
                "compatible subset",
                sv,
                KNOWN_SCHEMA_VERSION,
            )
    except Exception as e:  # noqa: BLE001 - tolerance must never crash
        logger.warning("watchdog: schema_version check error: %s", e)


def fetch_metrics(
    url: str,
    timeout_s: float,
    *,
    opener=urllib.request.urlopen,
) -> FetchResult:
    """Probe ``GET <url>`` and return a :class:`FetchResult`. never-raise (D8).

    ``opener`` is injected so tests drive timeout / refused / 5xx / good-body
    without the network. Any HTTP error status (4xx or 5xx, whether returned by
    the opener or raised as ``HTTPError``) is treated as down; a parseable 2xx
    object is ``ok``.
    """
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = int(getattr(resp, "status", None) or resp.getcode())
            raw = resp.read()
        if status >= 500:
            return FetchResult(ok=False, error=f"http {status}")
        if status >= 400:
            # 4xx is "reachable but refusing": still not a usable envelope.
            return FetchResult(ok=False, error=f"http {status}")
        env = parse_envelope(raw)
        check_schema_version(env)
        return FetchResult(ok=True, envelope=env)
    except urllib.error.HTTPError as e:  # noqa: PERF203
        return FetchResult(ok=False, error=f"http {getattr(e, 'code', '?')}")
    except Exception as e:  # noqa: BLE001 - timeout / refused / unreadable -> down
        return FetchResult(ok=False, error=str(e) or e.__class__.__name__)


def parse_generated_at(envelope: dict) -> float | None:
    """Convert the envelope ``generated_at`` ISO-8601 (``...Z``) to epoch seconds.

    Returns ``None`` on a missing / malformed timestamp (never raises); the
    caller then skips the CPU-fraction computation for that tick.
    """
    try:
        raw = envelope.get("generated_at")
        if not raw or not isinstance(raw, str):
            return None
        dt = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        return dt.timestamp()
    except Exception as e:  # noqa: BLE001 - tolerant parsing
        logger.warning("watchdog: cannot parse generated_at: %s", e)
        return None
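The injected `opener` is what lets tests drive `fetch_metrics` without a socket. The sketch below shows one way such a test double might look; `FakeResponse`, `good_opener`, `refused_opener` and the simplified `probe` (which returns a tuple instead of `FetchResult`) are all hypothetical names, not part of the package.

```python
from __future__ import annotations

import json
import urllib.error


class FakeResponse:
    """Hypothetical test double: the minimal surface the probe touches."""

    def __init__(self, status: int, body: bytes):
        self.status = status
        self._body = body

    def read(self) -> bytes:
        return self._body

    def __enter__(self):
        return self

    def __exit__(self, *exc) -> bool:
        return False  # do not swallow exceptions


def good_opener(url, timeout=None):
    return FakeResponse(200, json.dumps({"schema_version": 1, "agents": []}).encode())


def refused_opener(url, timeout=None):
    raise urllib.error.URLError("connection refused")


def list_opener(url, timeout=None):
    return FakeResponse(200, b"[]")  # valid JSON, but not an object


def probe(url: str, timeout_s: float, opener) -> tuple[bool, str | None]:
    """Simplified stand-in for fetch_metrics: (ok, error) instead of FetchResult."""
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = int(resp.status)
            raw = resp.read()
        if status >= 400:
            return False, f"http {status}"
        data = json.loads(raw)
        if not isinstance(data, dict):
            return False, "metrics body is not a JSON object"
        return True, None
    except Exception as e:  # timeout / refused / unreadable all count as down
        return False, str(e) or e.__class__.__name__


URL = "http://127.0.0.1:8500/metrics"
ok_result = probe(URL, 5.0, good_opener)
down_result = probe(URL, 5.0, refused_opener)
list_result = probe(URL, 5.0, list_opener)
```

Every failure mode collapses into the same "not ok, with a reason" shape, which is what allows the caller to treat an unreachable orchestrator as a signal rather than as an error.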
159
watchdog/config.py
Normal file
@@ -0,0 +1,159 @@
"""Read ``WATCHDOG_*`` env into a frozen config (thresholds / intervals / tokens /
URLs / kill-switch), with safe defaults (D1/D8, FR-10).

Every parser is never-raise: a missing / malformed value degrades to its
documented default, the process never crashes on a bad env (the same spirit as
``disk_watchdog.parse_paths``). ``.env.example`` is the canon of the keys.
"""
from __future__ import annotations

import os
from dataclasses import dataclass, field


def _str(env: dict, key: str, default: str) -> str:
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return default
        return str(v).strip()
    except Exception:  # noqa: BLE001 - never break config on a bad env
        return default


def _int(env: dict, key: str, default: int) -> int:
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return default
        return int(str(v).strip())
    except Exception:  # noqa: BLE001
        return default


def _float(env: dict, key: str, default: float) -> float:
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return default
        return float(str(v).strip())
    except Exception:  # noqa: BLE001
        return default


def _bool(env: dict, key: str, default: bool) -> bool:
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return default
        return str(v).strip().lower() in ("1", "true", "yes", "on")
    except Exception:  # noqa: BLE001
        return default


def _csv(env: dict, key: str, default: list[str]) -> list[str]:
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return list(default)
        out = [p.strip() for p in str(v).split(",") if p.strip()]
        return out or list(default)
    except Exception:  # noqa: BLE001
        return list(default)


def _deps(env: dict, key: str) -> dict[str, str]:
    """Parse ``name=url,name=url`` dependency pings (FR-6). Empty -> no pings.

    Default is empty (fail-safe: no hardcoded network), the canonical example
    URLs live in ``.env.example`` so the operator opts in explicitly.
    """
    out: dict[str, str] = {}
    try:
        raw = env.get(key)
        if not raw or not str(raw).strip():
            return out
        for pair in str(raw).split(","):
            pair = pair.strip()
            if not pair or "=" not in pair:
                continue
            name, _, url = pair.partition("=")
            name, url = name.strip(), url.strip()
            if name and url:
                out[name] = url
    except Exception:  # noqa: BLE001
        return {}
    return out


@dataclass(frozen=True)
class Config:
    """Immutable sidecar config built from the environment (FR-10)."""

    # -- lifecycle / loop -------------------------------------------------
    enabled: bool = True
    interval_s: float = 30.0
    http_timeout_s: float = 5.0
    cooldown_s: float = 1800.0  # re-alert throttle for sustained signals

    # -- orchestrator /metrics -------------------------------------------
    metrics_url: str = "http://127.0.0.1:8500/metrics"
    orch_down_ticks: int = 3  # K consecutive failures before orch_down fires

    # -- host -------------------------------------------------------------
    mem_pct: float = 90.0
    disk_paths: list[str] = field(default_factory=lambda: ["/repos", "/app/data"])
    disk_crit_enabled: bool = False  # opt-in independent disk ceiling (D6)
    disk_crit_pct: float = 97.0

    # -- agents / queue / stages (derived from the /metrics envelope) -----
    agent_hung_min: float = 20.0  # minutes of runtime before "hung" is considered
    agent_cpu_floor: float = 0.01  # CPU fraction below which a long agent is "hung"
    stage_stuck_min: float = 120.0  # minutes a task may sit in one stage
    queue_depth: int = 20

    # -- containers (docker.sock, read-only) ------------------------------
    containers: list[str] = field(default_factory=lambda: ["orchestrator"])
    docker_sock: str = "/var/run/docker.sock"

    # -- external dependencies -------------------------------------------
    deps: dict[str, str] = field(default_factory=dict)

    # -- independent Telegram transport ----------------------------------
    tg_bot_token: str = ""
    tg_chat_id: str = ""

    # -- derived helpers --------------------------------------------------
    @property
    def agent_hung_s(self) -> float:
        return self.agent_hung_min * 60.0

    @property
    def stage_stuck_s(self) -> float:
        return self.stage_stuck_min * 60.0

    @classmethod
    def from_env(cls, env: dict | None = None) -> "Config":
        """Build a Config from ``env`` (defaults to ``os.environ``). never-raise."""
        e = dict(os.environ if env is None else env)
        return cls(
            enabled=_bool(e, "WATCHDOG_ENABLED", True),
            interval_s=_float(e, "WATCHDOG_INTERVAL_S", 30.0),
            http_timeout_s=_float(e, "WATCHDOG_HTTP_TIMEOUT_S", 5.0),
            cooldown_s=_float(e, "WATCHDOG_COOLDOWN_S", 1800.0),
            metrics_url=_str(e, "WATCHDOG_METRICS_URL", "http://127.0.0.1:8500/metrics"),
            orch_down_ticks=_int(e, "WATCHDOG_ORCH_DOWN_TICKS", 3),
            mem_pct=_float(e, "WATCHDOG_MEM_PCT", 90.0),
            disk_paths=_csv(e, "WATCHDOG_DISK_PATHS", ["/repos", "/app/data"]),
            disk_crit_enabled=_bool(e, "WATCHDOG_DISK_CRIT_ENABLED", False),
            disk_crit_pct=_float(e, "WATCHDOG_DISK_CRIT_PCT", 97.0),
            agent_hung_min=_float(e, "WATCHDOG_AGENT_HUNG_MIN", 20.0),
            agent_cpu_floor=_float(e, "WATCHDOG_AGENT_CPU_FLOOR", 0.01),
            stage_stuck_min=_float(e, "WATCHDOG_STAGE_STUCK_MIN", 120.0),
            queue_depth=_int(e, "WATCHDOG_QUEUE_DEPTH", 20),
            containers=_csv(e, "WATCHDOG_CONTAINERS", ["orchestrator"]),
            docker_sock=_str(e, "WATCHDOG_DOCKER_SOCK", "/var/run/docker.sock"),
            deps=_deps(e, "WATCHDOG_DEPS"),
            tg_bot_token=_str(e, "WATCHDOG_TG_BOT_TOKEN", ""),
            tg_chat_id=_str(e, "WATCHDOG_TG_CHAT_ID", ""),
        )
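A short sketch of how the never-raise parsers behave on a bad environment. The two helpers below restate the logic of `_float` and `_deps` under the hypothetical names `parse_float` and `parse_deps` so the example runs on its own; the env values and `example.invalid` URLs are invented.

```python
from __future__ import annotations


def parse_float(env: dict, key: str, default: float) -> float:
    """Never-raise float parser: missing, blank or malformed values fall back."""
    try:
        v = env.get(key)
        if v is None or not str(v).strip():
            return default
        return float(str(v).strip())
    except Exception:
        return default


def parse_deps(raw: str | None) -> dict[str, str]:
    """Parse 'name=url,name=url'; pairs without '=' or with an empty side are skipped."""
    out: dict[str, str] = {}
    for pair in (raw or "").split(","):
        name, sep, url = pair.strip().partition("=")
        if sep and name.strip() and url.strip():
            out[name.strip()] = url.strip()
    return out


env = {
    "WATCHDOG_INTERVAL_S": "abc",  # malformed -> default
    "WATCHDOG_MEM_PCT": " 85.5 ",  # surrounding whitespace is tolerated
    "WATCHDOG_DEPS": "git=https://example.invalid/health, bad, =nourl, tg=https://example.invalid/tg",
}

interval = parse_float(env, "WATCHDOG_INTERVAL_S", 30.0)
mem_pct = parse_float(env, "WATCHDOG_MEM_PCT", 90.0)
deps = parse_deps(env.get("WATCHDOG_DEPS"))
```

One consequence of this design is that a typo in a threshold is silent: the sidecar starts with the default rather than failing fast, which matches the stated priority of never crashing on startup.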
183
watchdog/core.py
Normal file
@@ -0,0 +1,183 @@
"""The sidecar tick orchestration: collect -> evaluate -> decide -> dispatch (D3).

The ``Watchdog`` owns the cross-tick state the sidecar is responsible for:
  * ``_states``: per signal_key :class:`AlertState` (anti-spam / recovery);
  * ``_agents``: per run_id :class:`AgentSample` (cpu_ticks, generated_at);
  * ``_failed``: last seen ``queue.counts.failed`` (job_failed edge);
  * ``_orch_fail``: consecutive ``/metrics`` failures (orch_down debounce).

All collection is wrapped per-source and the whole ``tick`` is wrapped per-tick
(never-raise, D8). ``now_provider`` is injectable for deterministic tests.
"""
from __future__ import annotations

import logging
import time

from . import decision
from .collectors import containers as containers_mod
from .collectors import deps as deps_mod
from .collectors import host as host_mod
from .collectors import orch as orch_mod
from .config import Config
from .notify import Notifier
from . import signals as signals_mod

logger = logging.getLogger("watchdog.core")


class Watchdog:
    """Stateful observer: one ``tick`` collects every source and dispatches alerts."""

    def __init__(
        self,
        cfg: Config,
        notifier: Notifier | None = None,
        docker: containers_mod.DockerSockReader | None = None,
        now_provider=None,
    ):
        self.cfg = cfg
        self._now = now_provider or time.time
        self._notifier = notifier or Notifier(
            cfg.tg_bot_token, cfg.tg_chat_id, cfg.http_timeout_s
        )
        self._docker = docker or containers_mod.DockerSockReader(
            cfg.docker_sock, cfg.http_timeout_s
        )
        # cross-tick state owned by the sidecar
        self._states: dict[object, decision.AlertState] = {}
        self._agents: dict[object, signals_mod.AgentSample] = {}
        self._failed: int | None = None
        self._orch_fail: int = 0
        self.last_run_ts: float | None = None

    # -- collection (each source guarded; per-source never-raise) ---------
    def _collect_orch(self) -> orch_mod.FetchResult:
        try:
            return orch_mod.fetch_metrics(self.cfg.metrics_url, self.cfg.http_timeout_s)
        except Exception as e:  # noqa: BLE001 - treat as down, never crash the tick
            logger.warning("watchdog: orch collect error: %s", e)
            return orch_mod.FetchResult(ok=False, error=str(e))

    def _collect_host_mem(self) -> float | None:
        try:
            return host_mod.read_mem_used_pct()
        except Exception as e:  # noqa: BLE001
            logger.warning("watchdog: host mem collect error: %s", e)
            return None

    def _collect_disk(self) -> tuple | None:
        if not self.cfg.disk_crit_enabled:
            return None
        try:
            return host_mod.max_disk_used_pct(self.cfg.disk_paths)
        except Exception as e:  # noqa: BLE001
            logger.warning("watchdog: disk collect error: %s", e)
            return None

    def _collect_containers(self) -> dict:
        out: dict[str, str] = {}
        for name in self.cfg.containers:
            try:
                inspect = self._docker.inspect(name)
                out[name] = containers_mod.classify_container(inspect)
            except Exception as e:  # noqa: BLE001 - one container degrades, others continue
                logger.warning("watchdog: container %s collect error: %s", name, e)
                out[name] = "unknown"
        return out

    def _collect_deps(self) -> dict:
        try:
            return deps_mod.ping_all(self.cfg.deps, self.cfg.http_timeout_s)
        except Exception as e:  # noqa: BLE001
            logger.warning("watchdog: deps collect error: %s", e)
            return {}

    # -- one tick ---------------------------------------------------------
    def tick(self) -> list:
        """Run one full pass; returns the dispatched ``(action, Signal)`` list.

        Per-source collection is independently guarded so a broken source (orch
        down / docker unreachable / dep timeout) degrades ONE signal and the rest
        of the tick still runs (D8). The orchestrator being down is itself the
        ``orchestrator_down`` signal, not a failed tick (FR-3).
        """
        now = self._now()
        built: list[signals_mod.Signal] = []

        # 1) orchestrator /metrics (+ orch_down debounce)
        fetch = self._collect_orch()
        if fetch.ok and fetch.envelope is not None:
            self._orch_fail = 0
            ev = signals_mod.eval_envelope(
                fetch.envelope, self.cfg, self._agents, self._failed
            )
            self._agents = ev.agent_samples
            self._failed = ev.failed_count
            built.extend(ev.signals)
        else:
            self._orch_fail += 1
            built.append(
                signals_mod.orch_down_signal(self._orch_fail, self.cfg, fetch.error)
            )

        # 2) host memory + opt-in disk ceiling
        built.extend(
            signals_mod.host_signals(
                self.cfg, self._collect_host_mem(), self._collect_disk()
            )
        )

        # 3) containers (read-only docker.sock)
        built.extend(signals_mod.container_signals(self.cfg, self._collect_containers()))

        # 4) external dependency pings
        built.extend(signals_mod.dep_signals(self._collect_deps()))

        dispatched = self._dispatch(built, now)
        self.last_run_ts = now
        return dispatched

    # -- decision + dispatch ----------------------------------------------
    def _dispatch(self, built: list, now: float) -> list:
        """Run each signal through ``decide`` and send alert/realert/recovery."""
        results: list = []
        for sig in built:
            try:
                cooldown = sig.cooldown_s if sig.cooldown_s is not None else self.cfg.cooldown_s
                if sig.edge:
                    # Edge signals (job_failed) fire on each new occurrence and
                    # keep no sustained state: a fresh empty prev -> ALERT iff active.
                    prev = decision.AlertState()
                else:
                    prev = self._states.get(sig.key) or decision.AlertState()
                action = decision.decide(sig.active, prev, now, cooldown)
                if action in (decision.ACTION_ALERT, decision.ACTION_REALERT):
                    self._send(self._format(sig, action))
                    if not sig.edge:
                        self._states[sig.key] = decision.AlertState(
                            alerting=True, last_alert_at=now
                        )
                elif action == decision.ACTION_RECOVERY:
                    self._send(self._format(sig, action))
                    self._states[sig.key] = decision.AlertState(
                        alerting=False, last_alert_at=None
                    )
                results.append((action, sig))
            except Exception as e:  # noqa: BLE001 - one signal degrades, others dispatch
                logger.warning("watchdog: dispatch error for %s: %s", sig.key, e)
        return results

    @staticmethod
    def _format(sig: signals_mod.Signal, action: str) -> str:
        if action == decision.ACTION_RECOVERY:
            return f"\U0001f7e2 {sig.title}: recovered. {sig.detail}"
        prefix = "\U0001f534" if action == decision.ACTION_ALERT else "\U0001f501"
        return f"{prefix} {sig.title}: {sig.detail}"

    def _send(self, text: str) -> None:
        """Best-effort dispatch through the sidecar's own channel. never-raise."""
        try:
            self._notifier.send(text)
        except Exception as e:  # noqa: BLE001 - per-send never-raise (D8)
            logger.warning("watchdog: send failed: %s", e)
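The `_orch_fail` counter in `tick` is the input to the `orch_down` debounce. `orch_down_signal` itself is not shown in this part of the diff, so the rule below (active once the streak reaches `orch_down_ticks`) is an assumption inferred from the `Config` comment "K consecutive failures before orch_down fires"; `orch_down_active` and `replay` are hypothetical names.

```python
from __future__ import annotations


def orch_down_active(consecutive_failures: int, k: int) -> bool:
    """Assumed debounce rule: the alarm is active only after K failures in a row."""
    return consecutive_failures >= k


def replay(outcomes: list[bool], k: int = 3) -> list[bool]:
    """Replay a sequence of /metrics probe outcomes (True = ok) through the counter."""
    fails = 0
    active_per_tick = []
    for ok in outcomes:
        fails = 0 if ok else fails + 1  # a single success resets the streak
        active_per_tick.append(orch_down_active(fails, k))
    return active_per_tick


# Isolated failures never reach K = 3, so a hiccup does not raise the alarm.
hiccup = replay([True, False, True, False, False, True])
# A sustained outage trips on the third consecutive failure and clears on success.
outage = replay([False, False, False, False, True])
```

With the default 30 s interval and K = 3, this rule would delay the master alarm by roughly one and a half minutes in exchange for not flapping on a single missed poll.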
63
watchdog/decision.py
Normal file
@@ -0,0 +1,63 @@
"""Generalised pure alert-decision function + in-memory anti-spam state (D4).

``src/disk_watchdog.py::decide_action`` is hard-wired to ``used_pct >= threshold``.
F1b has many heterogeneous signals (booleans: "orch down", "container
unhealthy"; counters: "job-failed delta"; thresholds: "memory %", "agent hung N
min"), so the *comparison is lifted out* and this function works on an
already-computed boolean ``signal_active``. The set of outcomes, the cooldown /
recovery semantics and the in-memory best-effort state are a strict
generalisation of the disk variant (BRD §BR-9 names it the template).

``now`` and ``cooldown_s`` are injected so the cooldown / recovery logic is
testable deterministically without a real timer (TC-01…TC-04).
"""
from __future__ import annotations

from dataclasses import dataclass

# Decision outcomes: same vocabulary as ``disk_watchdog`` (1:1 semantics).
ACTION_NONE = "none"
ACTION_ALERT = "alert"
ACTION_REALERT = "realert"
ACTION_RECOVERY = "recovery"


@dataclass
class AlertState:
    """In-memory anti-spam state for one signal key (1:1 with ``PathAlertState``).

    Best-effort: lives only in the daemon (no DB row, no migration). After a
    process restart ``alerting`` resets to ``False`` -> a still-standing problem
    re-alerts once, which is safe (an early signal, not an SLA; FR-7).
    """

    alerting: bool = False
    last_alert_at: float | None = None


def decide(
    signal_active: bool,
    prev: AlertState,
    now: float,
    cooldown_s: float,
) -> str:
    """Pure alert decision, testable without a thread or a real timer (D4).

    Returns one of ``ACTION_{NONE,ALERT,REALERT,RECOVERY}`` as a function of the
    current boolean signal, the previous per-key state and the injected clock:

    * not alerting & active                 -> ALERT    (threshold crossed)
    * alerting & active & cooldown elapsed  -> REALERT  (re-alert)
    * alerting & active & in cooldown       -> NONE     (anti-spam)
    * alerting & not active                 -> RECOVERY (back to normal)
    * not alerting & not active             -> NONE     (normal)
    """
    if not prev.alerting:
        return ACTION_ALERT if signal_active else ACTION_NONE
    # prev.alerting is True
    if not signal_active:
        return ACTION_RECOVERY
    last = prev.last_alert_at
    if last is None or (now - last) >= cooldown_s:
        return ACTION_REALERT
    return ACTION_NONE
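The decision table can be walked through on a small timeline. The sketch below repeats `AlertState` and `decide` in compact form so it runs standalone, and adds a hypothetical `run_timeline` helper that updates state the same way the dispatcher in `core.py` does; the timestamps are invented.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AlertState:
    alerting: bool = False
    last_alert_at: float | None = None


def decide(signal_active: bool, prev: AlertState, now: float, cooldown_s: float) -> str:
    """Compact copy of the decision function, repeated so the sketch is self-contained."""
    if not prev.alerting:
        return "alert" if signal_active else "none"
    if not signal_active:
        return "recovery"
    if prev.last_alert_at is None or (now - prev.last_alert_at) >= cooldown_s:
        return "realert"
    return "none"


def run_timeline(samples: list[tuple[float, bool]], cooldown_s: float) -> list[str]:
    """Feed (now, active) samples through decide(), updating state after each action."""
    state = AlertState()
    actions = []
    for now, active in samples:
        action = decide(active, state, now, cooldown_s)
        if action in ("alert", "realert"):
            state = AlertState(alerting=True, last_alert_at=now)
        elif action == "recovery":
            state = AlertState()
        actions.append(action)
    return actions


actions = run_timeline(
    [(0, False), (30, True), (60, True), (1830, True), (1860, False), (1890, False)],
    cooldown_s=1800.0,
)
```

The first active sample alerts, the one 30 s later is suppressed, the one exactly one cooldown after the alert re-alerts, and the first inactive sample produces a single recovery.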
68
watchdog/notify.py
Normal file
@@ -0,0 +1,68 @@
"""Independent Telegram transport for the sidecar (D7, FR-8, BR-8).

Reads its OWN ``WATCHDOG_TG_BOT_TOKEN`` / ``WATCHDOG_TG_CHAT_ID`` and POSTs via
``urllib`` to ``api.telegram.org``. It is FORBIDDEN to import
``src/notifications.py`` or to use the orchestrator's token / chat / functions;
otherwise a crash or refactor of the orchestrator would drag down the alert
channel (a direct violation of C-1 / BR-8). Missing token/chat -> log and skip
(fail-safe), never raise (NFR-3).
"""
from __future__ import annotations

import logging
import urllib.parse
import urllib.request

logger = logging.getLogger("watchdog.notify")

_TELEGRAM_API = "https://api.telegram.org"


def send_telegram(
    bot_token: str,
    chat_id: str,
    text: str,
    timeout_s: float = 5.0,
    *,
    api_base: str = _TELEGRAM_API,
    opener=urllib.request.urlopen,
) -> bool:
    """Send one Telegram message over the sidecar's own bot. never-raise (D8).

    Returns ``True`` on a delivered message, ``False`` on any failure (missing
    credentials, network error, non-2xx). ``opener`` / ``api_base`` are injected
    so tests never touch the real network.
    """
    if not bot_token or not chat_id:
        logger.warning("watchdog: telegram token/chat not configured -> skip send")
        return False
    try:
        url = f"{api_base}/bot{bot_token}/sendMessage"
        payload = urllib.parse.urlencode(
            {
                "chat_id": chat_id,
                "text": text,
                # HTML mode: raw '<', '>' and '&' in ``text`` must be pre-escaped.
                "parse_mode": "HTML",
                "disable_web_page_preview": "true",
            }
        ).encode("utf-8")
        req = urllib.request.Request(url, data=payload, method="POST")
        with opener(req, timeout=timeout_s) as resp:
            status = getattr(resp, "status", None) or resp.getcode()
        return 200 <= int(status) < 300
    except Exception as e:  # noqa: BLE001 - delivery is best-effort
        logger.warning("watchdog: telegram send failed: %s", e)
        return False


class Notifier:
    """Thin stateful wrapper binding the sidecar credentials for the tick loop."""

    def __init__(self, bot_token: str, chat_id: str, timeout_s: float = 5.0):
        self._token = bot_token
        self._chat = chat_id
        self._timeout = timeout_s

    def send(self, text: str) -> bool:
        """Best-effort send through the sidecar's own channel (never raises)."""
        return send_telegram(self._token, self._chat, text, self._timeout)
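A sketch of how the `opener` hook can capture the outgoing request in a test. `FakeResponse`, `capturing_opener` and `build_request` are hypothetical names. The `html.escape` call is an addition that `send_telegram` does not make; it is shown because Telegram's HTML parse mode requires raw `<`, `>` and `&` to be sent as entities. The token and chat id are placeholders.

```python
from __future__ import annotations

import html
import urllib.parse
import urllib.request


class FakeResponse:
    """Hypothetical response double reporting a successful delivery."""

    status = 200

    def __enter__(self):
        return self

    def __exit__(self, *exc) -> bool:
        return False


captured: list[urllib.request.Request] = []


def capturing_opener(req, timeout=None):
    """Hypothetical opener: records the request instead of sending it."""
    captured.append(req)
    return FakeResponse()


def build_request(api_base: str, bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a form-encoded sendMessage POST of the same shape send_telegram issues."""
    payload = urllib.parse.urlencode(
        {
            "chat_id": chat_id,
            # Escaping here is this sketch's assumption, not the module's behaviour.
            "text": html.escape(text, quote=False),
            "parse_mode": "HTML",
        }
    ).encode("utf-8")
    return urllib.request.Request(f"{api_base}/bot{bot_token}/sendMessage", data=payload, method="POST")


req = build_request("https://api.telegram.org", "TEST_TOKEN", "12345", "cpu=0.0010 (< 0.01)")
with capturing_opener(req, timeout=5.0) as resp:
    delivered = 200 <= int(resp.status) < 300

# Decode the captured form body to inspect what would have been sent.
sent = urllib.parse.parse_qs(captured[0].data.decode("utf-8"))
```

Because nothing leaves the process, the same pattern can assert on the URL, the method and the encoded fields in a unit test.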
283
watchdog/signals.py
Normal file
@@ -0,0 +1,283 @@
"""Pure signal builders: turn collected raw inputs into ``Signal`` objects (D5).

A ``Signal`` is ``(key, active, title, detail, edge)``. ``key`` identifies the
signal for per-key anti-spam state: a scalar (``"orch_down"``, ``"host_mem"``)
or a tuple for per-entity signals (``("agent_hung", run_id)``,
``("container_down", name)``, ``("stage_stuck", work_item)``,
``("dep_down", name)``).

These builders are PURE: given the envelope / host readings / prev-sample state
they return signals + the next sample state, with no I/O, so the whole decision
surface is unit-testable without a container, a socket or a timer (TC-01…TC-11).
"""
from __future__ import annotations

import logging
from dataclasses import dataclass, field

from .collectors import containers as containers_mod
from .collectors import orch as orch_mod
from .config import Config

logger = logging.getLogger("watchdog.signals")


@dataclass
class Signal:
    """One evaluated signal heading into the decision function.

    ``edge`` marks event-style signals (e.g. ``job_failed``) that fire on each
    new occurrence and have no sustained "recovery": the dispatcher does not
    persist alerting state for them.
    """

    key: object
    active: bool
    title: str
    detail: str
    edge: bool = False
    cooldown_s: float | None = None  # per-signal override of the global cooldown


@dataclass
class AgentSample:
    """Previous ``(cpu_ticks, generated_at_epoch)`` for one running agent (D5)."""

    cpu_ticks: int
    generated_at: float


@dataclass
class EnvelopeEval:
    """Result of evaluating the ``/metrics`` envelope: signals + carried state."""

    signals: list = field(default_factory=list)
    agent_samples: dict = field(default_factory=dict)  # run_id -> AgentSample
    failed_count: int | None = None


def _cpu_fraction(
    cur_ticks: int,
    cur_gen: float,
    prev: AgentSample,
    clk_tck: int,
) -> float | None:
    """CPU fraction of one agent across two ``/metrics`` polls (D5).

    ``frac = (Δticks / clk_tck) / Δseconds``. Returns ``None`` if the deltas are
    not usable (no wall-time elapsed, non-positive clk_tck) so a degenerate
    sample never produces a false "hung" verdict.
    """
    try:
        dt = cur_gen - prev.generated_at
        if dt <= 0 or not clk_tck or clk_tck <= 0:
            return None
        cpu_seconds = (cur_ticks - prev.cpu_ticks) / clk_tck
        if cpu_seconds < 0:
            return None
        return cpu_seconds / dt
    except Exception as e:  # noqa: BLE001 - degenerate sample, no verdict
        logger.warning("watchdog: cpu_fraction error: %s", e)
        return None


def eval_envelope(
    envelope: dict,
    cfg: Config,
    prev_agents: dict,
    prev_failed: int | None,
) -> EnvelopeEval:
    """Derive agent_hung / stage_stuck / job_failed / queue_depth signals (D5).

    Pure: no I/O. ``prev_agents`` (run_id -> :class:`AgentSample`) and
    ``prev_failed`` carry the cross-tick state the sidecar owns; the returned
    :class:`EnvelopeEval` includes the NEXT state to persist. never-raise: a bad
    sub-section degrades that family of signals, the rest still evaluate.
    """
    out = EnvelopeEval()
    if not isinstance(envelope, dict):
        out.agent_samples = dict(prev_agents)
        out.failed_count = prev_failed
        return out

    clk_tck = envelope.get("clk_tck")
    gen_at = orch_mod.parse_generated_at(envelope)

    # -- agent_hung (needs two polls; per run_id) -------------------------
    new_samples: dict = {}
    try:
        for a in envelope.get("agents") or []:
            run_id = a.get("run_id")
            cpu_ticks = a.get("cpu_ticks")
            runtime_s = a.get("runtime_s")
            if run_id is None:
                continue
            if cpu_ticks is None or gen_at is None:
                # pid dead / non-Linux / no timestamp -> cannot judge; skip.
                continue
            new_samples[run_id] = AgentSample(int(cpu_ticks), gen_at)
            prev = prev_agents.get(run_id)
            if prev is None or not isinstance(clk_tck, int):
                continue
            frac = _cpu_fraction(int(cpu_ticks), gen_at, prev, clk_tck)
            if frac is None or runtime_s is None:
                continue
            hung = (runtime_s > cfg.agent_hung_s) and (frac < cfg.agent_cpu_floor)
            if hung:
                out.signals.append(
                    Signal(
                        key=("agent_hung", run_id),
                        active=True,
                        title="Agent hung",
                        detail=(
                            f"agent={a.get('agent')} run_id={run_id} "
                            f"runtime={int(runtime_s)}s cpu={frac:.4f} "
                            # "below" rather than "<": the notifier sends
                            # parse_mode=HTML, where a raw "<" is not valid.
                            f"(below {cfg.agent_cpu_floor})"
                        ),
                    )
                )
    except Exception as e:  # noqa: BLE001 - degrade agent family only
        logger.warning("watchdog: eval agents error: %s", e)
    out.agent_samples = new_samples

# -- stage_stuck (per work_item) -------------------------------------
|
||||
try:
|
||||
for s in envelope.get("stages") or []:
|
||||
age = s.get("age_in_stage_s")
|
||||
wi = s.get("work_item")
|
||||
if age is None or wi is None:
|
||||
continue
|
||||
if age > cfg.stage_stuck_s:
|
||||
out.signals.append(
|
||||
Signal(
|
||||
key=("stage_stuck", wi),
|
||||
active=True,
|
||||
title="Стадия застряла",
|
||||
detail=(
|
||||
f"{wi} в стадии {s.get('stage')} уже {int(age)}s "
|
||||
f"(порог {int(cfg.stage_stuck_s)}s)"
|
||||
),
|
||||
)
|
||||
)
|
||||
except Exception as e: # noqa: BLE001
|
||||
logger.warning("watchdog: eval stages error: %s", e)

    # -- queue depth + job_failed (edge) ---------------------------------
    failed_now: int | None = prev_failed
    try:
        queue = envelope.get("queue") or {}
        depth = queue.get("depth")
        if isinstance(depth, int) and depth >= cfg.queue_depth:
            out.signals.append(
                Signal(
                    key="queue_depth",
                    active=True,
                    title="Очередь растёт",  # EN: "Queue is growing"
                    # EN: "queue depth <depth> (threshold <N>)"
                    detail=f"глубина очереди {depth} (порог {cfg.queue_depth})",
                )
            )
        counts = queue.get("counts") or {}
        failed = counts.get("failed")
        if isinstance(failed, int):
            failed_now = failed
            # Edge-triggered: fire only when the failed counter grew since the last poll.
            if prev_failed is not None and failed > prev_failed:
                out.signals.append(
                    Signal(
                        key="job_failed",
                        active=True,
                        title="Job упал",  # EN: "Job failed"
                        # EN: "failed jobs now <failed> (was <prev_failed>, +<delta>)"
                        detail=(
                            f"failed-джобов стало {failed} "
                            f"(было {prev_failed}, +{failed - prev_failed})"
                        ),
                        edge=True,
                    )
                )
    except Exception as e:  # noqa: BLE001
        logger.warning("watchdog: eval queue error: %s", e)
    out.failed_count = failed_now

    return out
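
Note on the agent_hung branch above: it calls `_cpu_fraction`, which is defined outside this hunk. The following is only a minimal sketch of how such a two-sample computation could work, assuming `cpu_ticks` is a cumulative per-process tick counter, `gen_at` is a timestamp in seconds, and `clk_tck` is the number of ticks per second. `AgentSampleSketch` and `cpu_fraction_sketch` are hypothetical names, not the commit's code, and the real helper may differ.

```python
from typing import NamedTuple, Optional


class AgentSampleSketch(NamedTuple):
    """Hypothetical stand-in for AgentSample: cumulative ticks + sample time."""

    cpu_ticks: int
    at: float  # seconds


def cpu_fraction_sketch(
    cpu_ticks: int, at: float, prev: AgentSampleSketch, clk_tck: int
) -> Optional[float]:
    """CPU seconds consumed per wall-clock second between two polls, or None."""
    dt = at - prev.at
    dticks = cpu_ticks - prev.cpu_ticks
    if dt <= 0 or dticks < 0 or clk_tck <= 0:
        # Clock went backwards, counter reset (e.g. pid reuse), or bad tick rate:
        # the pair of samples cannot be judged.
        return None
    return (dticks / clk_tck) / dt


prev = AgentSampleSketch(cpu_ticks=1000, at=100.0)
busy = cpu_fraction_sketch(1300, 130.0, prev, clk_tck=100)  # 3 CPU-s over 30 s
idle = cpu_fraction_sketch(1000, 130.0, prev, clk_tck=100)  # no ticks consumed
```

Under these assumptions an agent is only flagged when both conditions hold: it has run longer than `cfg.agent_hung_s` and its fraction is below `cfg.agent_cpu_floor`. A first poll never fires because there is no previous sample to diff against.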


def host_signals(cfg: Config, mem_pct: float | None, disk: tuple | None) -> list:
    """Build host memory + opt-in disk-ceiling signals (D5/D6). Pure."""
    sigs: list = []
    if mem_pct is not None:
        sigs.append(
            Signal(
                key="host_mem",
                active=mem_pct >= cfg.mem_pct,
                title="Память хоста",  # EN: "Host memory"
                # EN: "host memory <pct>% (threshold <N>%)"
                detail=f"память хоста {mem_pct}% (порог {cfg.mem_pct}%)",
            )
        )
    # Disk ceiling is OPT-IN (D6): disk_watchdog (ORCH-063) owns the 85% alert;
    # the sidecar only carries an independent HIGHER ceiling when explicitly
    # enabled, so there is no double-alert on the same fill event (FR-9/AC-5).
    if cfg.disk_crit_enabled and disk is not None:
        path, pct = disk
        sigs.append(
            Signal(
                key="host_disk_crit",
                active=pct >= cfg.disk_crit_pct,
                title="Диск (критический потолок)",  # EN: "Disk (critical ceiling)"
                # EN: "disk <path> <pct>% (critical ceiling <N>%, independent sidecar channel)"
                detail=(
                    f"диск {path} {pct}% (критический потолок {cfg.disk_crit_pct}%, "
                    "независимый канал sidecar)"
                ),
            )
        )
    return sigs
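
To illustrate the ownership split described in the comment above, here is a small sketch of which component would alert at a given disk fill level. The 85% and 97% figures come from the commit message; the function name, parameter names, and return shape are hypothetical and exist only for this illustration.

```python
def disk_alert_owners_sketch(
    pct: float,
    sidecar_ceiling_enabled: bool,
    disk_watchdog_pct: float = 85.0,    # owned by disk_watchdog (ORCH-063)
    sidecar_ceiling_pct: float = 97.0,  # opt-in sidecar ceiling, default off
) -> list[str]:
    """Return the components that would raise a disk alert at ``pct`` percent."""
    owners: list[str] = []
    if pct >= disk_watchdog_pct:
        owners.append("disk_watchdog")
    if sidecar_ceiling_enabled and pct >= sidecar_ceiling_pct:
        owners.append("sidecar")
    return owners


default_at_90 = disk_alert_owners_sketch(90.0, sidecar_ceiling_enabled=False)
default_at_98 = disk_alert_owners_sketch(98.0, sidecar_ceiling_enabled=False)
opt_in_at_98 = disk_alert_owners_sketch(98.0, sidecar_ceiling_enabled=True)
```

With the ceiling left at its default (off), the sidecar never alerts on disk fill, so the 85% event has a single owner. When enabled, the sidecar adds a second alert only at the higher threshold, over its own channel.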


def container_signals(cfg: Config, statuses: dict) -> list:
    """Build per-container down signals from ``{name: status}``. Pure."""
    sigs: list = []
    for name, status in statuses.items():
        sigs.append(
            Signal(
                key=("container_down", name),
                active=containers_mod.container_alarm(status),
                title="Контейнер не в норме",  # EN: "Container unhealthy"
                # EN: "container <name>: status '<status>'"
                detail=f"контейнер {name}: статус '{status}'",
            )
        )
    return sigs


def dep_signals(reachability: dict) -> list:
    """Build per-dependency down signals from ``{name: reachable}``. Pure."""
    sigs: list = []
    for name, reachable in reachability.items():
        sigs.append(
            Signal(
                key=("dep_down", name),
                active=not reachable,
                title="Зависимость недоступна",  # EN: "Dependency unavailable"
                # EN: "dependency <name> is not responding"
                detail=f"зависимость {name} не отвечает",
            )
        )
    return sigs


def orch_down_signal(consecutive_failures: int, cfg: Config, error: str | None) -> Signal:
    """The master ``orch_down`` signal (FR-3).

    Active once ``/metrics`` has failed ``orch_down_ticks`` times in a row, so a
    single transient hiccup does not flap. The text explicitly notes that the
    in-process guards (disk / reaper / reconciler) are dead too, so the operator
    knows to check the host directly (D6).
    """
    active = consecutive_failures >= cfg.orch_down_ticks
    return Signal(
        key="orch_down",
        active=active,
        title="Орк не отвечает",  # EN: "Orchestrator not responding"
        # EN: "GET /metrics has not answered for <N> tick(s) in a row
        # (threshold <K>): <error or 'unavailable'>. In-process guards
        # (disk/reaper/reconciler) are dead too; check the host (incl. disk)
        # and the orchestrator container."
        detail=(
            f"GET /metrics не отвечает {consecutive_failures} тик(ов) подряд "
            f"(порог {cfg.orch_down_ticks}): {error or 'недоступен'}. "
            "In-process стражи (disk/reaper/reconciler) тоже мертвы — проверьте "
            "хост (вкл. диск) и контейнер orchestrator."
        ),
    )
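
The debounce in `orch_down_signal` depends on a consecutive-failure counter kept by the caller, which this hunk does not show. A rough sketch of how that counter might be maintained, assuming it resets to zero on any successful poll; the helper names are hypothetical and not part of the commit.

```python
def next_failure_count(prev_count: int, poll_ok: bool) -> int:
    """Reset on any successful poll, otherwise count one more failure."""
    return 0 if poll_ok else prev_count + 1


def is_orch_down(consecutive_failures: int, orch_down_ticks: int) -> bool:
    """Same comparison as orch_down_signal: active once the streak reaches K."""
    return consecutive_failures >= orch_down_ticks


# One hiccup, a recovery, then three failures in a row, then recovery.
polls = [True, False, True, False, False, False, True]
count = 0
timeline = []
for ok in polls:
    count = next_failure_count(count, ok)
    timeline.append(is_orch_down(count, orch_down_ticks=3))
```

In this sketch the isolated failure at the second poll never activates the signal; it only turns active on the third consecutive failure and clears on the next successful poll.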