feat(disk-watchdog): host-FS fill heartbeat + Telegram alert at >=85% (ORCH-063)

Adds src/disk_watchdog.py, a background daemon thread modelled on reconciler/job_reaper that measures host-FS fill via the mounted bind-paths (/repos, /app/data) with shutil.disk_usage and Telegram-alerts the operator at >= threshold (default 85%). The missing proactive signal: on 07.06.2026 the mva154 host disk silently hit 100% and stalled the whole self-hosting pipeline.

- Pure decide_action(used_pct, threshold, prev, now, realert_s): alert on crossing up, cooldown re-alert, single recovery below threshold (unit-tested without a thread/timer; clock injected).
- measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path never-raise (a broken path never fails the tick).
- Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold 1..100, positive intervals -> default + warning). Kill-switch -> daemon does not start.
- Additive disk_monitor block in GET /queue; start/stop in main.lifespan.
- never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS/QG_CHECKS/check_*/DB schema untouched, no migration (anti-spam state in-memory).

Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite green (1296).
Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md + ADRs authored at architecture stage).

Refs: ORCH-063
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
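
For readers without src/disk_watchdog.py at hand, the alert decision described in the message can be sketched roughly as follows. Only the function name and its five parameters come from the commit message; the AlertState dataclass, the action strings ("alert", "realert", "recovered", "none") and the tuple return shape are assumptions made for illustration and may differ from the real module.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AlertState:
    """Hypothetical in-memory anti-spam state (not taken from the real module)."""
    alerting: bool = False
    last_alert_at: Optional[float] = None


def decide_action(used_pct: float, threshold: float, prev: AlertState,
                  now: float, realert_s: float) -> Tuple[str, AlertState]:
    """Pure decision step: no I/O and no clock access; `now` is injected."""
    if used_pct >= threshold:
        if not prev.alerting:
            # Crossing up: first alert for this episode.
            return "alert", AlertState(True, now)
        if prev.last_alert_at is None or now - prev.last_alert_at >= realert_s:
            # Still above threshold and the cooldown has elapsed.
            return "realert", AlertState(True, now)
        # Still above threshold, but inside the cooldown window.
        return "none", prev
    if prev.alerting:
        # Dropped below threshold: emit exactly one recovery message.
        return "recovered", AlertState(False, None)
    return "none", prev


state = AlertState()
action, state = decide_action(90.0, 85.0, state, now=0.0, realert_s=3600.0)     # "alert"
action, state = decide_action(91.0, 85.0, state, now=60.0, realert_s=3600.0)    # "none"
action, state = decide_action(91.0, 85.0, state, now=3600.0, realert_s=3600.0)  # "realert"
action, state = decide_action(70.0, 85.0, state, now=3700.0, realert_s=3600.0)  # "recovered"
```

Because the clock is passed in as `now`, the whole alert/cooldown/recovery cycle can be exercised in a unit test without starting a thread or sleeping, which matches the testing approach the message describes.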
2026-06-09 18:51:37 +03:00
parent 4d9251c698
commit 8759cb7df8
7 changed files with 817 additions and 0 deletions
--- a/src/main.py
+++ b/src/main.py
@@ -105,9 +105,19 @@ async def lifespan(app: FastAPI):
    from .job_reaper import reaper
    reaper.start()

+    # ORCH-063: start the disk-watchdog LAST (after the reaper). It is independent
+    # of the queue/DB — it only reads host-FS fill and Telegram-alerts at >=
+    # threshold — so the order is not critical, but we follow the daemon
+    # convention. Honours the kill-switch ORCH_DISK_MONITOR_ENABLED (start() is a
+    # no-op when disabled, so behaviour is 1:1 as before).
+    from .disk_watchdog import disk_watchdog
+    disk_watchdog.start()
+
    try:
        yield
    finally:
+        # ORCH-063: stop the disk-watchdog first (reverse of startup).
+        disk_watchdog.stop()
        # Graceful shutdown order mirrors startup in reverse: stop the reaper
        # first, then the reconciler (it must not enqueue new work while the
        # worker is winding down), then the worker. Running agents keep going;
@@ -151,6 +161,7 @@ async def queue():
    from . import task_deps
    from . import serial_gate
    from . import labels
+    from .disk_watchdog import disk_watchdog
    return {
        "counts": job_status_counts(),
        "max_concurrency": worker.max_concurrency,
@@ -169,6 +180,10 @@ async def queue():
        # ORCH-089 (D7): auto-mode-by-label observability (read-only) — kill-switch,
        # label names, scope. Additive block.
        "auto_labels": labels.snapshot(),
+        # ORCH-063 (FR-6 / AC-7): disk-watchdog observability (read-only) —
+        # enabled, threshold, interval, last measurement per host-path. Additive
+        # block; never-raise (status() returns {"enabled": ...} minimum on error).
+        "disk_monitor": disk_watchdog.status(),
        "recent": recent_jobs(10),
    }
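
The per-path measurement that feeds the disk_monitor block is not shown in this hunk. A minimal sketch of what the commit message describes (shutil.disk_usage per path, dedup by st_dev, never-raise per path) might look like the following. The function name comes from the commit message; the returned dictionary layout and the "error" key are assumptions for illustration only.

```python
import os
import shutil
from typing import Dict, Iterable


def measure_paths(paths: Iterable[str]) -> Dict[str, dict]:
    """Measure fill per path, skipping paths on an already-seen device.

    A failing path is recorded with an "error" entry instead of raising,
    so one broken mount cannot fail the whole tick.
    """
    results: Dict[str, dict] = {}
    seen_devices = set()
    for path in paths:
        try:
            device = os.stat(path).st_dev
            if device in seen_devices:
                # Same filesystem as an earlier path: do not report it twice.
                continue
            seen_devices.add(device)
            usage = shutil.disk_usage(path)
            used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
            results[path] = {
                "total": usage.total,
                "used": usage.used,
                "free": usage.free,
                "used_pct": round(used_pct, 1),
            }
        except Exception as exc:  # never-raise: report the failure and move on
            results[path] = {"error": str(exc)}
    return results


# Hypothetical usage with the bind-paths named in the commit message.
snapshot = measure_paths(["/repos", "/app/data"])
```

Deduplicating by st_dev matters when several bind-paths live on the same host filesystem: without it, one full disk could be reported once per path.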