feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) #116
ORCH-100 — FND/F1b: sidecar-watchdog (monitoring brain in a separate container)
The brain half of the domain-0 observability pair. F1a (ORCH-099) exposes the GET /metrics raw signal; F1b (this PR) reads it, augments it with host / container / dependency probes, runs each signal through a generalised pure decision function, and emits alerts over its own independent Telegram channel.

What's added
watchdog/— thin Python-3.12 stdlib-only daemon (no pip deps):config.py—WATCHDOG_*env → frozenConfig(never-raise parsers, defaults).decision.py—decide(signal_active, prev, now, cooldown_s) → alert|realert|recovery|none, a strict generalisation ofdisk_watchdog.decide_action+ in-memoryAlertState.collectors/—orch(GET /metrics → envelope |orch_down, tolerant parsing),host(/proc/meminfo+shutil.disk_usage),containers(read-only GET-only docker.sock overAF_UNIX, nodockerSDK),deps(Plane/Gitea/Anthropic pings).signals.py— pure signal builders (agent_hung / stage_stuck / job_failed / queue_depth / host / container / dep / orch_down).core.py—Watchdog.tick(): collect → evaluate → decide → dispatch (per-source guards).notify.py— independent Telegram transport (own token/chat; nosrc.notificationsimport).__main__.py— tick loop with kill-switch + per-tick never-raise.Dockerfile—python:3.12-slim, stdlib-only, copies onlywatchdog/(notsrc/**).docker-compose.yml—orchestrator-watchdogservice:network_mode: host,docker.sock :ro, host paths:ro,mem_limit: 128m, optionalenv_file(required: false) so a missing.env.watchdognever breaksdocker compose upfor prod..env.example— canonicalWATCHDOG_*block.CHANGELOG.md— F1b entry. (Architecture README + adr-0033 authored at the architecture stage.)Invariants
- /metrics not answering is itself the master orch_down alarm (debounced K=3 ticks, so no flap on a single hiccup). No import from src/**.
- :ro (double guard), no DB/disk writes, no process control. Deploying the sidecar does not rebuild/restart the prod orchestrator.
- disk_watchdog (ORCH-063) stays the sole owner of the 85% alert → zero duplicate by construction; sidecar adds orch_down + an opt-in 97% ceiling (default off).
- WATCHDOG_ENABLED kill-switch (disabled → inert idle-loop, not exit).
- src/** / STAGE_TRANSITIONS / QG_CHECKS / check_* / DB schema: untouched.

Tests
- tests/watchdog/: TC-01…TC-13 (decision, orch-down, never-raise, kill-switch, full-tick integration, docker read-only, notify isolation, metrics tolerance, compose invariant, disk dedup) + host/deps collectors. Placed under tests/ so the existing CI pytest tests/ collects them.
- pytest tests/ -q green: 1617 passed (TC-14 regression: src/** not modified).
- ruff check watchdog/ tests/watchdog/ clean.

Infra precondition (one-off, see 07-infra-requirements.md)

Create the orchestrator-watchdog bot/chat + .env.watchdog on the host, then docker compose up -d --build orchestrator-watchdog. Absent bot/chat → fail-safe (logs, no send).

Refs: ORCH-100
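
Reviewer aid, not the code in this PR: a minimal sketch of how the K=3 orch_down debounce could feed the decide(signal_active, prev, now, cooldown_s) contract described above. Only the decide signature and the four action names come from the description. The AlertState fields, the advance and next_streak helpers, the run driver, and the tick/cooldown values are assumptions made for illustration.

```python
from dataclasses import dataclass

ALERT, REALERT, RECOVERY, NONE = "alert", "realert", "recovery", "none"


@dataclass(frozen=True)
class AlertState:
    """Hypothetical in-memory state for one signal (field names are assumed)."""
    alerting: bool = False
    last_alert_at: float = 0.0


def decide(signal_active: bool, prev: AlertState, now: float, cooldown_s: float) -> str:
    """Pure decision: no I/O, no mutation, same inputs give the same action."""
    if signal_active and not prev.alerting:
        return ALERT        # first time the signal is seen active
    if signal_active and now - prev.last_alert_at >= cooldown_s:
        return REALERT      # still active and the cooldown has elapsed
    if not signal_active and prev.alerting:
        return RECOVERY     # signal cleared after an alert was sent
    return NONE


def advance(prev: AlertState, action: str, now: float) -> AlertState:
    """Assumed helper: fold the chosen action into the state kept between ticks."""
    if action in (ALERT, REALERT):
        return AlertState(alerting=True, last_alert_at=now)
    if action == RECOVERY:
        return AlertState()
    return prev


def next_streak(streak: int, probe_ok: bool) -> int:
    """Assumed helper: count consecutive failed probes; a success resets it."""
    return 0 if probe_ok else streak + 1


def run(probes, k=3, cooldown_s=120.0, tick_s=30.0):
    """Replay probe results tick by tick and collect the resulting actions."""
    state, streak, actions = AlertState(), 0, []
    for i, ok in enumerate(probes):
        now = i * tick_s
        streak = next_streak(streak, ok)
        # orch_down only counts as active after k consecutive failed ticks.
        action = decide(streak >= k, state, now, cooldown_s)
        state = advance(state, action, now)
        actions.append(action)
    return actions


if __name__ == "__main__":
    # One hiccup, then a sustained outage, then recovery.
    print(run([True, False, True] + [False] * 7 + [True]))
```

With the assumed 30 s tick and 120 s cooldown, the single failed probe at tick 1 produces nothing, the third consecutive failure (tick 5) produces alert, tick 9 produces realert once the cooldown has elapsed, and the first successful probe afterwards produces recovery.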
🤖 Generated with Claude Code
c377a8ded3 to 0ef1cf6698