orchestrator/watchdog/__init__.py
claude-bot 259b507906 feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100)
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.
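
A minimal sketch of what such a pure decision function could look like. Only
the signature `decide(signal_active, prev, now, cooldown)` comes from the text
above; the `SignalState` shape and the action strings are assumptions made for
illustration, not the package's actual types.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignalState:
    # Hypothetical per-signal memory; the real shape of `prev` is not shown.
    alerting: bool = False
    last_alert_at: Optional[float] = None


def decide(
    signal_active: bool, prev: SignalState, now: float, cooldown: float
) -> Tuple[str, SignalState]:
    """Return (action, next_state); action is 'alert', 'recover' or 'none'."""
    if signal_active:
        if not prev.alerting:
            # First sighting of the problem: alert immediately.
            return "alert", SignalState(True, now)
        if prev.last_alert_at is not None and now - prev.last_alert_at >= cooldown:
            # Still broken and the cooldown has elapsed: send a reminder.
            return "alert", SignalState(True, now)
        # Still broken but inside the cooldown: dedup / throttle.
        return "none", prev
    if prev.alerting:
        # Signal cleared after an alert: announce recovery once.
        return "recover", SignalState(False, None)
    return "none", prev
```

Because the function has no I/O and takes `now` as an argument, every
transition can be unit-tested without a clock or a network.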

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
  itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
  alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
  schema — untouched. env_file optional so a missing .env.watchdog never breaks
  `docker compose up` for the prod orchestrator.
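
Two of the properties above, the per-source never-raise guard and the
K-tick debounce of `orch_down`, can be sketched as follows. The helper names
`probe_safely` and `debounce_down` and the parameter `k` are hypothetical;
the commit only states the behaviour, not the code.

```python
from typing import Callable, Optional, Tuple


def probe_safely(probe: Callable[[], dict]) -> Optional[dict]:
    """Per-source never-raise guard: any failure becomes None."""
    try:
        return probe()
    except Exception:
        # Deliberately broad: one bad source must not kill the whole tick.
        return None


def debounce_down(
    metrics: Optional[dict], misses: int, k: int
) -> Tuple[bool, int]:
    """Return (orch_down_active, new_miss_count) after one tick.

    The alarm becomes active only after k consecutive ticks without an
    answer, so a single hiccup does not flap.
    """
    misses = 0 if metrics is not None else misses + 1
    return misses >= k, misses
```

The boolean returned by `debounce_down` would then be the `signal_active`
input of the decision function, so debounce and throttle stay separate
concerns.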

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.

Refs: ORCH-100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:36:02 +03:00


"""ORCH-100 (FND/F1b): sidecar-watchdog — the monitoring brain in a separate container.
This package is the *brain* half of the domain-0 observability pair. F1a
(ORCH-099, ``src/metrics.py``) exposes a lightweight read-only ``GET /metrics``
envelope — raw signal only. F1b (this package) is the stateful observer that
reads that envelope, augments it with host / container / dependency probes, runs
every signal through a generalised pure decision function (modelled 1:1 on
``src/disk_watchdog.py::decide_action``) with per-signal in-memory
dedup / throttle / recovery, and emits alerts over its OWN independent Telegram
channel.

Hard invariants (ADR-001, ``docs/work-items/ORCH-100/06-adr/``):

* The observer is separated from the observed: the runtime is a separate
  container (``orchestrator-watchdog``). A hang/crash of the orchestrator makes
  the sidecar *louder* (``orchestrator_down``), never silent.
* Strictly read-only to the observed system: ``docker.sock`` is GET-only (and
  mounted ``:ro``), no DB writes, no disk writes, no process control
  (start/stop/restart/exec), so it is self-hosting-safe on the shared prod host.
* never-raise on three levels (per-source / per-tick / per-send) + a
  ``WATCHDOG_ENABLED`` kill-switch.
* NO import from ``src/**``: the sidecar must survive a refactor/crash of the
  orchestrator process (C-1).

``KNOWN_SCHEMA_VERSION`` is the highest ``/metrics`` ``schema_version`` this
build understands. A higher value from the orchestrator is tolerated (log a
warning, read the compatible subset) and never causes a crash (D9).
"""
KNOWN_SCHEMA_VERSION = 1
__all__ = ["KNOWN_SCHEMA_VERSION"]
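
A hedged sketch of how a reader of the envelope might honour the schema
tolerance rule (D9): warn on a newer `schema_version`, keep only the fields
this build knows, never raise. The function `read_envelope`, the logger name
and the field list `KNOWN_FIELDS` are assumptions for illustration; the real
`/metrics` fields are not shown in this file.

```python
import logging
from typing import Any, Dict

KNOWN_SCHEMA_VERSION = 1
# Hypothetical field names; the actual envelope layout is not shown here.
KNOWN_FIELDS = ("schema_version", "uptime_s")

log = logging.getLogger("watchdog")


def read_envelope(envelope: Dict[str, Any]) -> Dict[str, Any]:
    """Return the compatible subset of a /metrics envelope without raising."""
    version = envelope.get("schema_version")
    if isinstance(version, int) and version > KNOWN_SCHEMA_VERSION:
        # Newer orchestrator: tolerate it, but make the mismatch visible.
        log.warning(
            "metrics schema_version %s is newer than known %s; "
            "reading the compatible subset",
            version,
            KNOWN_SCHEMA_VERSION,
        )
    # Unknown fields are dropped, missing ones are simply absent.
    return {key: envelope[key] for key in KNOWN_FIELDS if key in envelope}
```

Filtering by an allow-list rather than rejecting unknown keys is what lets an
older sidecar keep working when the orchestrator adds fields.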