feat(disk-watchdog): host-FS fill heartbeat + Telegram alert at >=85% (ORCH-063)

Adds src/disk_watchdog.py, a background daemon thread modelled on
reconciler/job_reaper that measures host-FS fill via the mounted
bind-paths (/repos, /app/data) with shutil.disk_usage and sends the
operator a Telegram alert at >= threshold (default 85%). This is the
missing proactive signal: on 2026-06-07 the mva154 host disk silently
hit 100% and stalled the whole self-hosting pipeline.

- Pure decide_action(used_pct, threshold, prev, now, realert_s): alert
  on crossing up, re-alert after the cooldown, single recovery notice on
  dropping below threshold (unit-tested without a thread/timer; clock
  injected).
- measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path
  never-raise (a broken path never fails the tick).
- Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold
  must be 1..100 and intervals positive; otherwise -> default + warning).
  Kill-switch off -> the daemon does not start.
- Additive disk_monitor block in GET /queue; start/stop in main.lifespan.
- never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS, QG_CHECKS,
  check_* and the DB schema are untouched, no migration (anti-spam state
  is in-memory).

Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite
green (1296).
Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md + ADRs
authored at architecture stage).

Refs: ORCH-063
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:51:37 +03:00
parent 4d9251c698
commit 8759cb7df8
7 changed files with 817 additions and 0 deletions
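The alert policy named in the commit message (alert on crossing up, cooldown re-alert, single recovery) could look roughly like the sketch below. Only the function name and parameter order come from the commit message; the `AlertState` dataclass used for `prev`, the returned action strings, and the (action, new_state) return shape are assumptions made for illustration and are not taken from src/disk_watchdog.py.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AlertState:
    """Hypothetical in-memory anti-spam state (not the real module's type)."""
    above: bool = False                    # was the last measurement >= threshold?
    last_alert_ts: Optional[float] = None  # clock value of the last alert sent


def decide_action(
    used_pct: float,
    threshold: float,
    prev: AlertState,
    now: float,
    realert_s: float,
) -> Tuple[str, AlertState]:
    """Pure decision step: no I/O, no timers; the clock is passed in as `now`.

    Returns an assumed action name ("alert", "realert", "recovered", "none")
    together with the state to carry into the next tick.
    """
    if used_pct >= threshold:
        if not prev.above:
            # Crossed the threshold upwards: alert immediately.
            return "alert", AlertState(above=True, last_alert_ts=now)
        if prev.last_alert_ts is None or now - prev.last_alert_ts >= realert_s:
            # Still above and the cooldown has elapsed: repeat the alert.
            return "realert", AlertState(above=True, last_alert_ts=now)
        # Still above but inside the cooldown window: stay quiet.
        return "none", prev
    if prev.above:
        # Dropped below the threshold: emit one recovery notice and reset.
        return "recovered", AlertState(above=False, last_alert_ts=None)
    return "none", prev
```

Because the function takes the clock as an argument and returns the next state instead of mutating anything, a test can step through a whole incident with plain integers for `now`, which matches the "unit-tested without a thread/timer" note in the commit message.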
--- a/src/config.py
+++ b/src/config.py
@@ -1,3 +1,5 @@
+import logging
+
 from pydantic import field_validator
 from pydantic_settings import BaseSettings

@@ -381,6 +383,68 @@ class Settings(BaseSettings):
    reaper_finalize_grace_s: int = 300
    lease_reclaim_enabled: bool = True

+    # ORCH-063: disk-watchdog — background heartbeat that measures host-FS fill via
+    # the mounted bind-paths and Telegram-alerts the operator at >= threshold. On
+    # 2026-06-07 the mva154 host disk silently hit 100% and stalled the WHOLE
+    # self-hosting pipeline; the watchdog is the missing proactive signal. Modelled
+    # on reconciler/job_reaper (daemon thread, start/stop in main.lifespan, /queue
+    # snapshot, never-raise). Anti-spam state is in-memory (no DB migration).
+    #   disk_monitor_enabled       -> kill-switch; False -> the daemon does not start
+    #                                 (zero regression), env ORCH_DISK_MONITOR_ENABLED.
+    #   disk_monitor_interval_s    -> heartbeat measurement period, seconds (order of
+    #                                 minutes; cheap shutil.disk_usage, no df subprocess).
+    #   disk_monitor_threshold_pct -> fill % that triggers the alert (Owner-fixed 85).
+    #   disk_monitor_realert_s     -> min interval between repeat alerts while still
+    #                                 above threshold (anti-spam cooldown, ~6h).
+    #   disk_monitor_paths         -> CSV of monitored HOST bind-paths (NOT overlay /);
+    #                                 empty -> the default set (/repos, /app/data).
+    # Defensive validation (ADR-001 D7): threshold out of 1..100 or a non-positive
+    # interval -> default + warning (the process never crashes on a bad env value).
+    disk_monitor_enabled: bool = True
+    disk_monitor_interval_s: int = 300
+    disk_monitor_threshold_pct: int = 85
+    disk_monitor_realert_s: int = 21600
+    disk_monitor_paths: str = "/repos,/app/data"
+
+    @field_validator(
+        "disk_monitor_interval_s", "disk_monitor_realert_s", mode="before"
+    )
+    @classmethod
+    def _disk_positive_int(cls, v, info):
+        # Non-positive / non-numeric interval -> the field default (never crash).
+        _defaults = {"disk_monitor_interval_s": 300, "disk_monitor_realert_s": 21600}
+        fallback = _defaults.get(info.field_name, 1)
+        try:
+            if v is None or (isinstance(v, str) and v.strip() == ""):
+                return fallback
+            iv = int(v)
+            if iv <= 0:
+                logging.getLogger("orchestrator.config").warning(
+                    "%s must be > 0, got %s; falling back to %s",
+                    info.field_name, v, fallback,
+                )
+                return fallback
+            return iv
+        except (TypeError, ValueError):
+            return fallback
+
+    @field_validator("disk_monitor_threshold_pct", mode="before")
+    @classmethod
+    def _disk_threshold_pct(cls, v):
+        # Threshold must be a percentage in 1..100; otherwise -> default 85.
+        try:
+            if v is None or (isinstance(v, str) and v.strip() == ""):
+                return 85
+            iv = int(v)
+            if 1 <= iv <= 100:
+                return iv
+            logging.getLogger("orchestrator.config").warning(
+                "disk_monitor_threshold_pct must be 1..100, got %s; using 85", v
+            )
+            return 85
+        except (TypeError, ValueError):
+            return 85
+
    # ORCH-071: merge-verify under-gate on the `deploy -> done` edge. For the
    # self-hosting repo the `deploy` stage runs the DETERMINISTIC self-deploy path
    # (Phase A/B/C), where the LLM `deployer` agent — historically the ONLY actor