Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service: the brain half of the domain-0 observability pair.

F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it, augments with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks; no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control; self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB schema: untouched. env_file optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033 authored at the architecture stage.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
32 lines
1.6 KiB
Python
"""ORCH-100 (FND/F1b): sidecar-watchdog — the monitoring brain in a separate container.

This package is the *brain* half of the domain-0 observability pair. F1a
(ORCH-099, ``src/metrics.py``) exposes a lightweight read-only ``GET /metrics``
envelope: raw signal only. F1b (this package) is the stateful observer that
reads that envelope, augments it with host / container / dependency probes, runs
every signal through a generalised pure decision function (modelled 1:1 on
``src/disk_watchdog.py::decide_action``) with per-signal in-memory
dedup / throttle / recovery, and emits alerts over its OWN independent Telegram
channel.

Hard invariants (ADR-001, ``docs/work-items/ORCH-100/06-adr/``):
* The observer is separated from the observed: the runtime is a separate
  container (``orchestrator-watchdog``). A hang/crash of the orchestrator makes
  the sidecar *louder* (``orchestrator_down``), never silent.
* Strictly read-only to the observed system: ``docker.sock`` is GET-only (and
  mounted ``:ro``), no DB writes, no disk writes, no process control
  (start/stop/restart/exec); self-hosting-safe on the shared prod host.
* never-raise on three levels (per-source / per-tick / per-send) + a
  ``WATCHDOG_ENABLED`` kill-switch.
* NO import from ``src/**``: the sidecar must survive a refactor/crash of the
  orchestrator process (C-1).

The highest known ``/metrics`` schema_version this build understands. A higher
value from the orchestrator is tolerated (warning, read the compatible subset),
never a crash (D9).
"""

KNOWN_SCHEMA_VERSION = 1

__all__ = ["KNOWN_SCHEMA_VERSION"]
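The docstring and commit message describe a pure decision function with the signature `decide(signal_active, prev, now, cooldown)` that provides per-signal dedup, throttle, and recovery. Its body is not shown here, so the following is only a minimal sketch of how such a function could look. The `SignalState` dataclass, the action strings `"alert"` / `"recover"` / `"none"`, and the choice to re-alert once the cooldown has elapsed are all assumptions made for illustration, not the package's actual implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory carried between ticks (an assumption)."""

    alerting: bool = False              # True while an alert is outstanding
    last_sent: Optional[float] = None   # timestamp of the last alert sent


def decide(
    signal_active: bool,
    prev: SignalState,
    now: float,
    cooldown: float,
) -> Tuple[str, SignalState]:
    """Return (action, next_state) without side effects.

    action is one of "alert", "recover", "none" (names are illustrative).
    """
    if signal_active:
        if not prev.alerting:
            # First tick on which the signal fires: alert immediately.
            return "alert", SignalState(alerting=True, last_sent=now)
        if prev.last_sent is None or now - prev.last_sent >= cooldown:
            # Still firing and the cooldown has elapsed: send a reminder.
            return "alert", SignalState(alerting=True, last_sent=now)
        # Still firing inside the cooldown window: dedup / throttle.
        return "none", prev
    if prev.alerting:
        # Signal cleared after an alert: emit one recovery notice.
        return "recover", SignalState(alerting=False, last_sent=None)
    # Quiet before, quiet now.
    return "none", prev


# Walk one signal through a short timeline with a 60-second cooldown.
state = SignalState()
for t, active in [(0, True), (10, True), (70, True), (80, False), (90, False)]:
    action, state = decide(active, state, now=float(t), cooldown=60.0)
    print(t, action)
```

Because the function takes the previous state and the current time as arguments and returns the next state, it can be unit-tested without a clock, a network, or a Telegram channel; the caller owns the in-memory state and performs the send.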