feat(infra): auto-prune docker build cache on mva154 (ORCH-062)

Add src/build_cache_pruner.py, a background daemon thread modelled 1:1 on src/disk_watchdog.py. It periodically runs only `docker builder prune -f --filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans.

This automatically removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down), with no operator involvement.

ADR-001 (Variant A): host-over-ssh, the same channel as image_freshness/self_deploy (no docker CLI in the image). Touches ONLY the build cache: no image/system prune, no image/container removal, and it never restarts the docker daemon or the prod container (self-hosting safety). No ssh target -> the tick is a no-op.

- src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default).
- src/main.py: start last (after disk_watchdog) / stop first in lifespan; additive read-only build_cache_prune block in GET /queue.
- never-raise on two levels (per-command + per-tick); kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before).
- STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED; last-run/last-result is in-memory (no migration).
- tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked).
- .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already carry the component (architecture stage).

Refs: ORCH-062
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
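
For illustration only, here is a minimal sketch of the per-command half of the behaviour described above: build the prune command, run it on the host over ssh with a timeout, and never raise. This is not the contents of src/build_cache_pruner.py, which this page does not show. The function names, the `ssh_target` and `runner` parameters, and the shape of the returned dict are all assumptions made for the example.

```python
import shlex
import subprocess


def build_prune_command(until: str = "24h", prune_all: bool = False) -> list[str]:
    """Return the argv for a BuildKit cache prune (hypothetical helper)."""
    cmd = ["docker", "builder", "prune", "-f", "--filter", f"until={until}"]
    if prune_all:
        # `-a` is only ever added alongside the `until` filter above.
        cmd.append("-a")
    return cmd


def prune_over_ssh(ssh_target, until="24h", prune_all=False, timeout_s=120,
                   runner=subprocess.run):
    """Run one prune command on the host; never raises (hypothetical helper).

    `runner` is injectable so tests can mock ssh/docker completely.
    """
    if not ssh_target:
        # No ssh target configured: the tick is a no-op.
        return {"ok": None, "skipped": True, "output": ""}
    remote = shlex.join(build_prune_command(until, prune_all))
    try:
        proc = runner(["ssh", ssh_target, remote],
                      capture_output=True, text=True, timeout=timeout_s)
    except Exception as exc:  # timeout, missing ssh binary, OS errors, ...
        return {"ok": False, "skipped": False, "output": str(exc)}
    return {"ok": proc.returncode == 0, "skipped": False, "output": proc.stdout}
```

A per-tick wrapper would presumably add a second try/except around the whole tick, which is what "never-raise on two levels" suggests; that outer layer is omitted here.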
2026-06-09 19:43:53 +03:00
parent d2604e42cd
commit 664c2e945a
6 changed files with 855 additions and 1 deletions
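
The fallback rules of the validators this commit adds to src/config.py can be restated as plain functions, which makes the "a bad env value never crashes the start" behaviour easy to check without pydantic. This is a sketch under that assumption: the function names are hypothetical and the logging calls of the real validators are left out.

```python
import re

# Docker `until` filter as accepted by the config: digits + optional unit.
_UNTIL_RE = re.compile(r"^\d+[smhdw]?$")


def positive_int_or(v, fallback: int) -> int:
    """Return int(v) if it is > 0, otherwise the fallback (hypothetical helper)."""
    if v is None or (isinstance(v, str) and v.strip() == ""):
        return fallback
    try:
        iv = int(v)
    except (TypeError, ValueError):
        return fallback
    return iv if iv > 0 else fallback


def until_or_default(v, default: str = "24h") -> str:
    """Return the stripped value if it matches the pattern, else the default."""
    s = "" if v is None else str(v).strip()
    return s if _UNTIL_RE.match(s) else default


def non_negative_float_or_zero(v) -> float:
    """Return float(v) if it is >= 0, otherwise 0.0 (silent threshold)."""
    try:
        fv = float(v)
    except (TypeError, ValueError):
        return 0.0
    return fv if fv >= 0 else 0.0


# Example inputs as they might arrive from environment variables.
print(positive_int_or("0", 21600))        # non-positive -> fallback
print(until_or_default("1 day"))          # does not match the pattern -> "24h"
print(non_negative_float_or_zero("-1"))   # negative -> 0.0
```
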
--- a/src/config.py
+++ b/src/config.py
@@ -1,4 +1,5 @@
 import logging
+import re

 from pydantic import field_validator
 from pydantic_settings import BaseSettings
@@ -445,6 +446,88 @@ class Settings(BaseSettings):
        except (TypeError, ValueError):
            return 85

+    # ORCH-062: build-cache-pruner — the "second half" of the disk-watchdog
+    # (ORCH-063): watchdog SIGNALS, pruner CLEANS. A background daemon thread
+    # modelled 1:1 on disk_watchdog (start/stop in main.lifespan, /queue snapshot,
+    # never-raise, kill-switch) that periodically runs `docker builder prune` on
+    # the HOST over ssh (the container ships no docker CLI — same channel as
+    # image_freshness/self_deploy). Touches ONLY the BuildKit build cache: never
+    # images/containers of running services, never restarts the docker daemon or
+    # the prod container (self-hosting safety). State (last run / result) is
+    # in-memory, best-effort — no DB migration. ADR-001 D1..D7.
+    #   build_cache_prune_enabled       -> kill-switch; False -> daemon does not
+    #                                      start (1:1 as before), env *_ENABLED.
+    #   build_cache_prune_interval_s    -> tick period, seconds (order of hours).
+    #   build_cache_prune_until         -> retention age for warm cache
+    #                                      (`docker builder prune --filter until=`).
+    #   build_cache_prune_all           -> add `-a` (ALWAYS paired with until).
+    #   build_cache_prune_timeout_s     -> bound on the ssh command, seconds.
+    #   build_cache_prune_notify_min_gb -> Telegram when reclaimed >= N GB; 0 -> silent.
+    # Defensive validation (ADR-001 D4): non-positive interval/timeout -> default +
+    # warning (non-numeric -> default, silently); `until` not matching ^\d+[smhdw]?$
+    # -> "24h"; negative notify threshold -> 0. A bad env value NEVER crashes the start.
+    build_cache_prune_enabled: bool = True
+    build_cache_prune_interval_s: int = 21600
+    build_cache_prune_until: str = "24h"
+    build_cache_prune_all: bool = False
+    build_cache_prune_timeout_s: int = 120
+    build_cache_prune_notify_min_gb: float = 0.0
+
+    @field_validator(
+        "build_cache_prune_interval_s", "build_cache_prune_timeout_s", mode="before"
+    )
+    @classmethod
+    def _bcp_positive_int(cls, v, info):
+        # Non-positive / non-numeric -> the field default (never crash the start).
+        _defaults = {
+            "build_cache_prune_interval_s": 21600,
+            "build_cache_prune_timeout_s": 120,
+        }
+        fallback = _defaults.get(info.field_name, 1)
+        try:
+            if v is None or (isinstance(v, str) and v.strip() == ""):
+                return fallback
+            iv = int(v)
+            if iv <= 0:
+                logging.getLogger("orchestrator.config").warning(
+                    "%s must be > 0, got %s; falling back to %s",
+                    info.field_name, v, fallback,
+                )
+                return fallback
+            return iv
+        except (TypeError, ValueError):
+            return fallback
+
+    @field_validator("build_cache_prune_until", mode="before")
+    @classmethod
+    def _bcp_until(cls, v):
+        # A docker `until` filter: digits + optional unit (s/m/h/d/w). Anything
+        # else -> the safe default "24h" (keeps warm cache, BR-2).
+        try:
+            if v is None:
+                return "24h"
+            s = str(v).strip()
+            if s and re.match(r"^\d+[smhdw]?$", s):
+                return s
+            logging.getLogger("orchestrator.config").warning(
+                "build_cache_prune_until must match ^\\d+[smhdw]?$, got %r; using 24h", v
+            )
+            return "24h"
+        except (TypeError, ValueError):
+            return "24h"
+
+    @field_validator("build_cache_prune_notify_min_gb", mode="before")
+    @classmethod
+    def _bcp_notify_min_gb(cls, v):
+        # A non-negative GB threshold; negative / non-numeric -> 0 (silent).
+        try:
+            if v is None or (isinstance(v, str) and v.strip() == ""):
+                return 0.0
+            fv = float(v)
+            return fv if fv >= 0 else 0.0
+        except (TypeError, ValueError):
+            return 0.0
+
    # ORCH-071: merge-verify under-gate on the `deploy -> done` edge. For the
    # self-hosting repo the `deploy` stage runs the DETERMINISTIC self-deploy path
    # (Phase A/B/C), where the LLM `deployer` agent — historically the ONLY actor