feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) #116

Merged
admin merged 10 commits from feature/ORCH-100-fnd-f1b-sidecar-watchdog into main 2026-06-10 09:57:13 +03:00
ORCH-100 — FND/F1b: sidecar-watchdog (monitoring brain in a separate container)

The brain half of the domain-0 observability pair. F1a (ORCH-099) exposes the raw signal at GET /metrics; F1b (this PR) reads it, augments it with host / container / dependency probes, runs each signal through a generalised pure decision function, and emits alerts over its own independent Telegram channel.

What's added

  • watchdog/ — thin Python-3.12 stdlib-only daemon (no pip deps):
    • config.py — WATCHDOG_* env → frozen Config (never-raise parsers, defaults).
    • decision.py — decide(signal_active, prev, now, cooldown_s) → alert|realert|recovery|none, a strict generalisation of disk_watchdog.decide_action + in-memory AlertState.
    • collectors/ — orch (GET /metrics → envelope | orch_down, tolerant parsing), host (/proc/meminfo + shutil.disk_usage), containers (read-only GET-only docker.sock over AF_UNIX, no docker SDK), deps (Plane/Gitea/Anthropic pings).
    • signals.py — pure signal builders (agent_hung / stage_stuck / job_failed / queue_depth / host / container / dep / orch_down).
    • core.py — Watchdog.tick(): collect → evaluate → decide → dispatch (per-source guards).
    • notify.py — independent Telegram transport (own token/chat; no src.notifications import).
    • __main__.py — tick loop with kill-switch + per-tick never-raise.
    • Dockerfile — python:3.12-slim, stdlib-only, copies only watchdog/ (not src/**).
  • docker-compose.yml — orchestrator-watchdog service: network_mode: host, docker.sock :ro, host paths :ro, mem_limit: 128m, optional env_file (required: false) so a missing .env.watchdog never breaks docker compose up for prod.
  • .env.example — canonical WATCHDOG_* block.
  • CHANGELOG.md — F1b entry. (Architecture README + adr-0033 authored at the architecture stage.)
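
For readers who have not opened decision.py, the decide(...) contract listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the AlertState field name, the (action, next_state) return shape, and the ">= cooldown_s" comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AlertState:
    """Hypothetical per-signal state: when the last alert went out, if any."""
    last_alert_at: Optional[float] = None  # None means "not currently alerting"


def decide(signal_active: bool, prev: AlertState, now: float,
           cooldown_s: float) -> Tuple[str, AlertState]:
    """Pure function: return (action, next_state).

    action is one of "alert", "realert", "recovery", "none".
    """
    if signal_active:
        if prev.last_alert_at is None:
            # First time the signal is seen active: raise the alert.
            return "alert", AlertState(last_alert_at=now)
        if now - prev.last_alert_at >= cooldown_s:
            # Still active and the cooldown has elapsed: remind.
            return "realert", AlertState(last_alert_at=now)
        # Still active but inside the cooldown window: stay quiet.
        return "none", prev
    if prev.last_alert_at is not None:
        # Signal cleared after an alert: announce recovery and reset.
        return "recovery", AlertState()
    return "none", prev
```

Because the function takes the clock as an argument and touches no I/O, every transition can be unit-tested with plain values, which is presumably what the decision test cases rely on.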

Invariants

  • Observer separated from observed (C-1): /metrics not answering is itself the master orch_down alarm (debounced K=3 ticks — no flap on a single hiccup). No import from src/**.
  • Strictly read-only / self-hosting-safe: docker.sock GET-only and :ro (double guard), no DB/disk writes, no process control. Deploying the sidecar does not rebuild/restart the prod orchestrator.
  • Disk anti-duplicate (D6/AC-5): disk_watchdog (ORCH-063) stays the sole owner of the 85% alert → zero duplicate by construction; sidecar adds orch_down + an opt-in 97% ceiling (default off).
  • never-raise (3 levels) + WATCHDOG_ENABLED kill-switch (disabled → inert idle-loop, not exit).
  • src/** / STAGE_TRANSITIONS / QG_CHECKS / check_* / DB schema — untouched.
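
The K=3 debounce on orch_down mentioned in the first invariant can be pictured with a small consecutive-failure counter. The sketch below is an assumption about the shape of that logic; the DownDebouncer class and its observe method are hypothetical names introduced here, not identifiers from the watchdog/ package.

```python
class DownDebouncer:
    """Hypothetical helper: report 'down' only after k consecutive failed probes."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.failures = 0  # consecutive failed ticks so far

    def observe(self, probe_ok: bool) -> bool:
        """Feed one tick's /metrics probe result; return True while the signal is active."""
        if probe_ok:
            # Any successful probe resets the streak, so one hiccup never alarms.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures >= self.k
```

The boolean it returns would then be fed to the decision function as signal_active, so dedup, re-alert and recovery handling stay in one place.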

Tests

  • tests/watchdog/ — TC-01…TC-13 (decision, orch-down, never-raise, kill-switch, full-tick integration, docker read-only, notify isolation, metrics tolerance, compose invariant, disk dedup) + host/deps collectors. Placed under tests/ so the existing CI pytest tests/ collects them.
  • Full pytest tests/ -q green: 1617 passed (TC-14 regression — src/** not modified).
  • ruff check watchdog/ tests/watchdog/ clean.

Infra precondition (one-off, see 07-infra-requirements.md)

Create the orchestrator-watchdog bot/chat + .env.watchdog on the host, then docker compose up -d --build orchestrator-watchdog. Absent bot/chat → fail-safe (logs, no send).
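
The fail-safe behaviour (absent bot/chat means the alert is logged and nothing is sent) might look roughly like the stdlib-only sketch below. The function name, logger name and parameters are assumptions for illustration; only the Telegram Bot API sendMessage endpoint is a known external fact, and the real notify.py may be structured differently.

```python
import json
import logging
import urllib.request

log = logging.getLogger("watchdog.notify")  # hypothetical logger name


def send_alert(token: str, chat_id: str, text: str, timeout_s: float = 5.0) -> bool:
    """Try to send one Telegram message; return True on success, never raise."""
    if not token or not chat_id:
        # Fail-safe path: no bot/chat configured, so log and skip the send.
        log.warning("telegram not configured; alert not sent: %s", text)
        return False
    try:
        body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
        req = urllib.request.Request(
            f"https://api.telegram.org/bot{token}/sendMessage",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            return resp.status == 200
    except Exception:
        # Per-send never-raise: a transport failure must not kill the tick loop.
        log.exception("telegram send failed")
        return False
```

Returning a boolean instead of raising keeps the caller simple: the tick loop can record a failed send and carry on with the next signal.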

Note on size: this is a single indivisible new component (one container/feature — a partial watchdog can't ship); the ~2.2k diff is mostly the test-plan-mandated 14 TCs and repo-convention docstrings, not multiple features. Task-level decomposition isn't applicable.

Refs: ORCH-100

🤖 Generated with Claude Code

admin added 9 commits 2026-06-10 09:36:05 +03:00
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.

Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
  itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
  paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
  kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
  alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
  schema — untouched. env_file optional so a missing .env.watchdog never breaks
  `docker compose up` for the prod orchestrator.

Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.

Refs: ORCH-100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the
self-hosting environment because launcher._finalize_job classifies a non-zero
exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir
defaults to the live prod dir /app/data/runs, which on the host holds REAL
accumulated agent logs; a real 2.log containing "429" flips the expected
'permanent' classification to 'transient', requeueing the job instead of
marking it 'failed'. This is ambient prod pollution, not a code fault.

Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram /
_disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir
so _run_log_path() resolves to a non-existent file and classify_log_file()
returns the documented 'permanent' default. Full suite: 1617 passed. src/**
untouched.

Refs: ORCH-100

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=571
All checks were successful
CI / test (push) Successful in 1m1s
CI / test (pull_request) Successful in 58s
0ef1cf6698
admin force-pushed feature/ORCH-100-fnd-f1b-sidecar-watchdog from c377a8ded3 to 0ef1cf6698 2026-06-10 09:36:05 +03:00
admin merged commit db78c9eb7a into main 2026-06-10 09:57:13 +03:00
Reference: admin/orchestrator#116