feat(infra): auto-prune docker build cache on mva154 (ORCH-062)

Add src/build_cache_pruner.py — a background daemon thread modelled 1:1 on
src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f
--filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second
half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans.
Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk
100% -> whole self-hosting pipeline down) automatically, with no operator involved.

ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy
(no docker CLI in the image). Touches ONLY the build cache — no image/system
prune, no image/container removal, never restarts the docker daemon or the prod
container (self-hosting safety). No ssh target -> tick is a no-op.

- src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators
  (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default).
- src/main.py: start last (after disk_watchdog) / stop first in lifespan;
  additive read-only build_cache_prune block in GET /queue.
- never-raise on two levels (per-command + per-tick); kill-switch
  ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before).
- STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED;
  last-run/last-result is in-memory (no migration).
- tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked).
- .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already
  carry the component (architecture stage).

Refs: ORCH-062

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
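
The remote command described in the commit message above can be sketched in isolation as follows. This is a minimal restatement under stated assumptions, not the module's code: `prune_argv` and the target `deploy@build-host` are hypothetical names, and the ssh options used by the real module are omitted for brevity.

```python
import shlex


def prune_argv(ssh_target, until, prune_all=False):
    """Hypothetical helper: ssh argv that runs only `docker builder prune`."""
    remote = "docker builder prune -f"
    if prune_all:
        # -a is only meaningful here together with the age filter below.
        remote += " -a"
    if until:
        # Quote the value because the remote side passes it through a shell.
        remote += " --filter until=" + shlex.quote(until)
    return ["ssh", ssh_target, remote]


# "deploy@build-host" is a made-up target used only for illustration.
argv = prune_argv("deploy@build-host", "24h")
hostile = prune_argv("deploy@build-host", "24h; rm -rf /")
```

Quoting the `until` value means that even a malformed setting reaches `docker builder prune` as a single filter argument instead of being interpreted by the remote shell.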
2026-06-09 19:43:53 +03:00
committed by orchestrator-deployer
parent d2604e42cd
commit 664c2e945a
6 changed files with 855 additions and 1 deletions

351
src/build_cache_pruner.py Normal file

@@ -0,0 +1,351 @@
"""ORCH-062: build-cache-pruner — periodic ``docker builder prune`` on the host.

On 07.06.2026 the mva154 host disk silently grew to 100% and took down the WHOLE
self-hosting pipeline of every project. The dominant consumer was the **docker
build cache** (~11 GB accumulated by frequent rebuilds: ``docker compose up
--build`` on prod deploy, the ``--profile staging`` rebuild, the build-once retag
behind ``check_staging_image_fresh``). ORCH-063 added the disk-watchdog, which
only **signals** (Telegram alert at >=85%) and explicitly deferred the cleanup to
this task. **This module is that cleanup: the watchdog signals, the pruner
cleans.**

It is a background daemon thread modelled **1:1 on** ``src/disk_watchdog.py``
(``threading.Thread(daemon=True)`` + ``threading.Event`` for a clean stop, the
``start()`` / ``stop(timeout)`` / ``status()`` contract, a ``/queue`` snapshot,
per-tick never-raise and a kill-switch ``ORCH_BUILD_CACHE_PRUNE_ENABLED``). Each
tick runs **strictly** ``docker builder prune -f --filter until=<until>`` (BuildKit
GC) on the **host over ssh**: the prod container ships no docker CLI, only
``openssh-client`` (``Dockerfile:11``), so docker operations run over ssh on the
host, the same channel ``image_freshness``/``self_deploy`` already use.

Invariants (TRZ §5/§6 / ADR-001 D2/D6):

* The command touches **only** the BuildKit build cache. There is NO
  ``docker image prune``, NO ``docker system prune``, no image/container removal
  of running services and no container stop/restart. The prod ``orchestrator``
  container is NEVER restarted (self-hosting blast radius). ``-a/--all`` is only
  ever added **paired with** the ``until`` age filter, never as a bare
  "nuke everything".
* ``STAGE_TRANSITIONS`` / ``QG_CHECKS`` / ``check_*`` / ``_parse_*`` /
  ``src/stage_engine.py`` / the DB schema are UNCHANGED: the pruner is an
  operational daemon, not a Quality Gate (like ``reconciler`` / ``job_reaper`` /
  ``disk_watchdog``). No new migration (last-run / last-result is in-memory,
  best-effort, may reset on restart, which is safe: at worst one extra safe prune).
* never-raise on two levels: per-command (non-zero rc / timeout / ``OSError`` /
  no ssh target / output-parse error -> logged and swallowed, the tick lives)
  and per-tick (outer ``try/except`` in ``_run``, like ``disk_watchdog._run``).
  The background loop and the pipeline never fall over.
* No ssh target configured (``deploy_ssh_host`` empty) -> the tick is a no-op
  (logged, reflected in ``status().last_error``). This scopes the feature to the
  self-hosting prod (where ssh is configured) and makes the default safe in any
  environment without host access, parallel to how ``self_deploy`` /
  ``image_freshness`` degrade without a target.
* Kill-switch ``build_cache_prune_enabled=False`` -> the daemon does not start
  (``main.lifespan`` guard + ``start()`` guard) and ``/queue`` returns
  ``{"enabled": false, ...}``, i.e. behaviour 1:1 as before the task.

See docs/work-items/ORCH-062/06-adr/ADR-001-build-cache-pruner.md and the
cross-cutting docs/architecture/adr/adr-0025-build-cache-pruner.md.
"""
import logging
import re
import shlex
import subprocess
import threading
import time

from .config import settings
from .notifications import send_telegram

logger = logging.getLogger("orchestrator.build_cache_pruner")

_BYTES_PER_GB = 1024 ** 3

# Multipliers for the "Total reclaimed space: <n><unit>" line emitted by
# `docker builder prune`. Decimal units are base-1000 (docker's HumanSize),
# the *i* binary units base-1024. Best-effort: only used for observability /
# the optional notify threshold, never for a decision.
_SIZE_UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000 ** 2, "GB": 1000 ** 3, "TB": 1000 ** 4,
    "KIB": 1024, "MIB": 1024 ** 2, "GIB": 1024 ** 3, "TIB": 1024 ** 4,
}

_RECLAIMED_RE = re.compile(
    r"Total reclaimed space:\s*([\d.]+)\s*([KMGT]?i?B)", re.IGNORECASE
)

def decide_prune(prev_run_ts: float | None, now: float, interval_s: float) -> bool:
    """Pure decision (anti-frequency, NFR-4): should this tick prune?

    Returns ``True`` when no prune has run yet (``prev_run_ts is None``) or at
    least ``interval_s`` seconds have elapsed since the last attempt; ``False``
    otherwise. Testable without a thread or a real timer (TC-01/TC-02). A
    non-positive / unusable ``interval_s`` falls open to ``True`` (prune): the
    config validator already guards the value, this is belt-and-braces.
    """
    if prev_run_ts is None:
        return True
    try:
        return (now - prev_run_ts) >= interval_s
    except TypeError:  # pragma: no cover - defensive, inputs are numbers
        return True


def _ssh_target() -> str | None:
    """ssh ``user@host`` for the host prune, or ``None`` when no host is
    configured (tests / non-self contexts). Mirrors ``image_freshness._ssh_target``.
    """
    host = (settings.deploy_ssh_host or "").strip()
    if not host:
        return None
    user = (settings.deploy_ssh_user or "").strip()
    return f"{user}@{host}" if user else host

def build_prune_command(
    ssh_target: str, until: str, prune_all: bool = False
) -> list[str]:
    """Build the ssh command that runs ``docker builder prune`` on the host.

    The remote is **strictly** ``docker builder prune -f`` (BuildKit GC), with the
    age filter ``--filter until=<until>`` appended whenever ``until`` is set so the
    warm recent cache is kept (BR-2/AC-2), and ``-a`` added **only** when
    ``prune_all`` is set, always paired with the age filter (D2). It NEVER emits
    ``docker image prune`` / ``docker system prune`` / any image/container removal
    (BR-3/AC-3). The ``until`` value is ``shlex.quote``-d for the remote shell.
    """
    remote = "docker builder prune -f"
    if prune_all:
        remote += " -a"
    if until:
        remote += " --filter until=" + shlex.quote(until)
    return ["ssh", "-o", "StrictHostKeyChecking=no", ssh_target, remote]


def parse_reclaimed(output: str) -> int | None:
    """Best-effort parse of ``Total reclaimed space: <n><unit>`` -> bytes.

    Returns the reclaimed size in bytes, or ``None`` when the line is absent /
    unparseable (FR-4: observability is best-effort, never a decision). Never
    raises.
    """
    try:
        m = _RECLAIMED_RE.search(output or "")
        if not m:
            return None
        value = float(m.group(1))
        unit = m.group(2).upper()
        mult = _SIZE_UNITS.get(unit)
        if mult is None:
            return None
        return int(value * mult)
    except Exception as e:  # noqa: BLE001 - parsing is best-effort
        logger.warning("build-cache-pruner: cannot parse reclaimed space: %s", e)
        return None

class BuildCachePruner:
    """Background daemon running ``docker builder prune`` on the host on a period.

    Modelled on ``DiskWatchdog``: a ``threading.Thread(daemon=True)`` + a
    ``threading.Event`` for a clean stop. The only in-memory state is the
    best-effort ``last_run_ts`` / ``_last_reclaimed`` / ``_last_error``, all reset
    on restart, which is safe (at worst one extra safe prune; D6).

    ``now_provider`` is injectable so the anti-frequency decision is testable
    deterministically without a real timer.
    """

    def __init__(self, interval_s: float | None = None, now_provider=None):
        self.interval_s = (
            interval_s
            if interval_s is not None
            else settings.build_cache_prune_interval_s
        )
        self._now = now_provider or time.time
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None
        # Best-effort in-memory state (no DB row, no migration).
        self.last_run_ts: float | None = None
        self._last_reclaimed: int | None = None
        self._last_reclaimed_human: str | None = None
        self._last_error: str | None = None

    # -- config helpers ----------------------------------------------------
    @property
    def _until(self) -> str:
        return settings.build_cache_prune_until

    @property
    def _all(self) -> bool:
        return settings.build_cache_prune_all

    @property
    def _timeout_s(self) -> int:
        return settings.build_cache_prune_timeout_s

    @property
    def _notify_min_gb(self) -> float:
        return settings.build_cache_prune_notify_min_gb

    # -- tick --------------------------------------------------------------
    def tick(self) -> None:
        """One pass: prune if the anti-frequency window has elapsed (never-raise).

        Runs the pure ``decide_prune`` against the injected clock; on a PRUNE
        decision it performs the host prune (``_prune``), which is itself
        never-raise. A SKIP decision leaves all state untouched.
        """
        now = self._now()
        if not decide_prune(self.last_run_ts, now, self.interval_s):
            return
        self._prune(now)

    def _prune(self, now: float) -> None:
        """Run ``docker builder prune`` on the host over ssh. Never raises (AC-4).

        Records the attempt time (``last_run_ts``) up front so the anti-frequency
        window advances even when the command fails or there is no ssh target.
        Every failure mode (no target, timeout, non-zero rc, ``OSError``) is
        logged, stored in ``_last_error`` and swallowed; the loop stays alive.
        """
        self.last_run_ts = now
        target = _ssh_target()
        if not target:
            self._last_error = "no ssh host configured (deploy_ssh_host empty)"
            logger.info("build-cache-pruner: %s — tick is a no-op", self._last_error)
            return
        cmd = build_prune_command(target, self._until, self._all)
        try:
            r = subprocess.run(
                cmd, capture_output=True, text=True, timeout=self._timeout_s
            )
        except subprocess.TimeoutExpired:
            self._last_error = f"timeout after {self._timeout_s}s"
            logger.warning("build-cache-pruner: prune %s", self._last_error)
            return
        except (subprocess.SubprocessError, OSError) as e:
            self._last_error = f"ssh/subprocess error: {e}"
            logger.warning("build-cache-pruner: %s", self._last_error)
            return
        if r.returncode != 0:
            self._last_error = (
                f"rc={r.returncode}: {(r.stderr or '').strip()[:200]}"
            )
            logger.warning("build-cache-pruner: prune %s", self._last_error)
            return
        # Success: parse the best-effort reclaimed size and clear the error.
        self._last_error = None
        reclaimed = parse_reclaimed(r.stdout or "")
        self._last_reclaimed = reclaimed
        self._last_reclaimed_human = self._format_reclaimed(reclaimed)
        logger.info(
            "build-cache-pruner: pruned host build cache (until=%s, all=%s), "
            "reclaimed=%s",
            self._until, self._all, self._last_reclaimed_human or "unknown",
        )
        self._maybe_notify(reclaimed)

    @staticmethod
    def _format_reclaimed(reclaimed: int | None) -> str | None:
        """Human GB label for a reclaimed byte count (best-effort, never raises)."""
        if reclaimed is None:
            return None
        try:
            return f"{reclaimed / _BYTES_PER_GB:.2f} GB"
        except Exception:  # noqa: BLE001 - observability only
            return None

    def _maybe_notify(self, reclaimed: int | None) -> None:
        """Telegram when reclaimed >= ``notify_min_gb`` (>0 to enable). Never raises."""
        try:
            min_gb = self._notify_min_gb
            if not min_gb or min_gb <= 0 or reclaimed is None:
                return
            gb = reclaimed / _BYTES_PER_GB
            if gb < min_gb:
                return
            # The Telegram text is in Russian; in English it reads:
            # "freed <gb> GB of docker build cache on the host (until=<until>)."
            self._send(
                f"\U0001f9f9 build-cache-pruner: освобождено {gb:.2f} ГБ "
                f"docker build cache на хосте (until={self._until})."
            )
        except Exception as e:  # noqa: BLE001 - notify is best-effort
            logger.warning("build-cache-pruner: notify decision failed: %s", e)

    def _send(self, text: str) -> None:
        """Send a Telegram notification. Never raises (best-effort)."""
        try:
            send_telegram(text)
        except Exception as e:  # noqa: BLE001 - delivery is best-effort
            logger.warning("build-cache-pruner: telegram send failed: %s", e)

    # -- loop / lifecycle --------------------------------------------------
    def _tick(self) -> None:
        try:
            self.tick()
        except Exception as e:  # noqa: BLE001 - inner never-raise
            logger.error("build-cache-pruner: tick error: %s", e)

    def _run(self) -> None:
        logger.info(
            "BuildCachePruner started (interval=%ss, until=%s, all=%s, "
            "timeout=%ss, enabled=%s)",
            self.interval_s, self._until, self._all, self._timeout_s,
            settings.build_cache_prune_enabled,
        )
        while not self._stop.is_set():
            try:
                self._tick()
            except Exception as e:  # noqa: BLE001 - outer never-raise
                logger.error("BuildCachePruner loop error: %s", e)
            self._stop.wait(self.interval_s)
        logger.info("BuildCachePruner stopped")

    def start(self) -> None:
        """Start the daemon thread (idempotent: a live thread is a no-op).

        Honours the kill-switch: ``build_cache_prune_enabled=False`` -> no-op (the
        daemon never starts; ``main.lifespan`` also guards, AC-5/TC-07).
        """
        if not settings.build_cache_prune_enabled:
            return
        if self._thread and self._thread.is_alive():
            return
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, name="build-cache-pruner", daemon=True
        )
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop.set()
        if self._thread:
            self._thread.join(timeout=timeout)

    def status(self) -> dict:
        """Build-cache-pruner snapshot for /queue observability (FR-4/AC-7).

        Never raises: returns a minimal ``{"enabled": ...}`` on any error.
        """
        try:
            return {
                "enabled": settings.build_cache_prune_enabled,
                "interval_s": self.interval_s,
                "until": self._until,
                "all": self._all,
                "last_run_ts": self.last_run_ts,
                "last_reclaimed_bytes": self._last_reclaimed,
                "last_reclaimed": self._last_reclaimed_human,
                "last_error": self._last_error,
            }
        except Exception as e:  # noqa: BLE001 - observability must never raise
            logger.warning("build-cache-pruner: status() failed: %s", e)
            return {"enabled": settings.build_cache_prune_enabled}


# Module-level singleton used by the FastAPI lifespan.
build_cache_pruner = BuildCachePruner()
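
As a rough illustration of how the two pure helpers in this file behave, the standalone sketch below restates the anti-frequency check and the reclaimed-size parse without importing the module. The names `should_prune` and `reclaimed_bytes` are hypothetical stand-ins for `decide_prune` and `parse_reclaimed`, and the sample output lines are made up for the example rather than captured from a real host.

```python
import re

# Hypothetical stand-ins for decide_prune / parse_reclaimed, not the module's code.
_UNITS = {
    "B": 1,
    "KB": 1000, "MB": 1000 ** 2, "GB": 1000 ** 3, "TB": 1000 ** 4,
    "KIB": 1024, "MIB": 1024 ** 2, "GIB": 1024 ** 3, "TIB": 1024 ** 4,
}
_LINE = re.compile(r"Total reclaimed space:\s*([\d.]+)\s*([KMGT]?i?B)", re.IGNORECASE)


def should_prune(prev_run_ts, now, interval_s):
    """True on the first run or once interval_s seconds have elapsed."""
    return prev_run_ts is None or (now - prev_run_ts) >= interval_s


def reclaimed_bytes(output):
    """Bytes from a 'Total reclaimed space' line, or None if absent/unparseable."""
    m = _LINE.search(output or "")
    if not m:
        return None
    try:
        value = float(m.group(1))
    except ValueError:  # e.g. "1.2.3" matches the character class but is no number
        return None
    mult = _UNITS.get(m.group(2).upper())
    return None if mult is None else int(value * mult)


# Injected clock values instead of time.time(): 6 h interval, last run at t=0.
first_run = should_prune(None, 100.0, 21600)
too_early = should_prune(0.0, 3600.0, 21600)
due_again = should_prune(0.0, 21600.0, 21600)

# Made-up sample lines in the shape the regex expects.
decimal = reclaimed_bytes("Total reclaimed space: 1.5GB")
binary = reclaimed_bytes("Total reclaimed space: 2GiB")
missing = reclaimed_bytes("nothing pruned")
```

Passing the clock values in as plain arguments is what lets the interval logic be checked without a thread or a real timer, which is the same reason the class accepts a `now_provider`.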


@@ -1,4 +1,5 @@
import logging
import re
from pydantic import field_validator
from pydantic_settings import BaseSettings
@@ -445,6 +446,88 @@ class Settings(BaseSettings):
        except (TypeError, ValueError):
            return 85

    # ORCH-062: build-cache-pruner, the "second half" of the disk-watchdog
    # (ORCH-063): watchdog SIGNALS, pruner CLEANS. A background daemon thread
    # modelled 1:1 on disk_watchdog (start/stop in main.lifespan, /queue snapshot,
    # never-raise, kill-switch) that periodically runs `docker builder prune` on
    # the HOST over ssh (the container ships no docker CLI, so it uses the same
    # channel as image_freshness/self_deploy). Touches ONLY the BuildKit build
    # cache: never images/containers of running services, never restarts the
    # docker daemon or the prod container (self-hosting safety). State (last run /
    # result) is in-memory and best-effort, so no DB migration. ADR-001 D1..D7.
    #   build_cache_prune_enabled       -> kill-switch; False -> daemon does not
    #                                      start (1:1 as before), env *_ENABLED.
    #   build_cache_prune_interval_s    -> tick period, seconds (order of hours).
    #   build_cache_prune_until         -> retention age for warm cache
    #                                      (`docker builder prune --filter until=`).
    #   build_cache_prune_all           -> add `-a` (ALWAYS paired with until).
    #   build_cache_prune_timeout_s     -> bound on the ssh command, seconds.
    #   build_cache_prune_notify_min_gb -> Telegram when reclaimed >= N GB; 0 -> silent.
    # Defensive validation (ADR-001 D4): a non-positive / non-numeric interval or
    # timeout -> default + warning; an `until` not matching ^\d+[smhdw]?$ -> "24h";
    # a negative notify threshold -> 0. A bad env value NEVER crashes the start.
    build_cache_prune_enabled: bool = True
    build_cache_prune_interval_s: int = 21600
    build_cache_prune_until: str = "24h"
    build_cache_prune_all: bool = False
    build_cache_prune_timeout_s: int = 120
    build_cache_prune_notify_min_gb: float = 0.0

    @field_validator(
        "build_cache_prune_interval_s", "build_cache_prune_timeout_s", mode="before"
    )
    @classmethod
    def _bcp_positive_int(cls, v, info):
        # Non-positive / non-numeric -> the field default (never crash the start).
        _defaults = {
            "build_cache_prune_interval_s": 21600,
            "build_cache_prune_timeout_s": 120,
        }
        fallback = _defaults.get(info.field_name, 1)
        try:
            if v is None or (isinstance(v, str) and v.strip() == ""):
                return fallback
            iv = int(v)
            if iv <= 0:
                logging.getLogger("orchestrator.config").warning(
                    "%s must be > 0, got %s; falling back to %s",
                    info.field_name, v, fallback,
                )
                return fallback
            return iv
        except (TypeError, ValueError):
            # Non-numeric input: warn as well, so the behaviour matches the
            # documented "default + warning" rule, then fall back.
            logging.getLogger("orchestrator.config").warning(
                "%s must be a positive integer, got %r; falling back to %s",
                info.field_name, v, fallback,
            )
            return fallback

    @field_validator("build_cache_prune_until", mode="before")
    @classmethod
    def _bcp_until(cls, v):
        # A docker `until` filter: digits + optional unit (s/m/h/d/w). Anything
        # else -> the safe default "24h" (keeps warm cache, BR-2).
        try:
            if v is None:
                return "24h"
            s = str(v).strip()
            if s and re.match(r"^\d+[smhdw]?$", s):
                return s
            logging.getLogger("orchestrator.config").warning(
                "build_cache_prune_until must match ^\\d+[smhdw]?$, got %r; using 24h", v
            )
            return "24h"
        except (TypeError, ValueError):
            return "24h"

    @field_validator("build_cache_prune_notify_min_gb", mode="before")
    @classmethod
    def _bcp_notify_min_gb(cls, v):
        # A non-negative GB threshold; negative / non-numeric -> 0 (silent).
        try:
            if v is None or (isinstance(v, str) and v.strip() == ""):
                return 0.0
            fv = float(v)
            return fv if fv >= 0 else 0.0
        except (TypeError, ValueError):
            return 0.0

# ORCH-071: merge-verify under-gate on the `deploy -> done` edge. For the
# self-hosting repo the `deploy` stage runs the DETERMINISTIC self-deploy path
# (Phase A/B/C), where the LLM `deployer` agent — historically the ONLY actor
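
The `until` fallback rule from the validators in this hunk can be tried out on its own with a small sketch. `normalize_until` is a hypothetical helper that mirrors only the documented rule (digits plus an optional `s`/`m`/`h`/`d`/`w` unit, otherwise `"24h"`); it does not involve pydantic and is not the project's validator.

```python
import re

# Hypothetical helper mirroring the documented rule; the real check lives in
# the pydantic field validator, this one is plain Python for illustration.
_UNTIL_RE = re.compile(r"^\d+[smhdw]?$")


def normalize_until(raw, default="24h"):
    """Return the stripped value if it matches the pattern, else the default."""
    if raw is None:
        return default
    text = str(raw).strip()
    return text if _UNTIL_RE.match(text) else default


# A few inputs: two accepted, three that fall back to the default.
examples = {
    value: normalize_until(value)
    for value in ("48h", " 7d ", "1h30m", "", "soon")
}
```

Note that a compound value such as `1h30m` is rejected by this pattern and replaced by the default, so only single-unit ages pass through unchanged.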


@@ -113,10 +113,20 @@ async def lifespan(app: FastAPI):
    from .disk_watchdog import disk_watchdog
    disk_watchdog.start()

    # ORCH-062: start the build-cache-pruner LAST, right after the disk-watchdog
    # (D7). It is the "second half" of the watchdog (watchdog signals, pruner
    # cleans): a daemon thread that periodically runs `docker builder prune` on
    # the host over ssh. Honours the kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED
    # (start() is a no-op when disabled, so behaviour is 1:1 as before).
    from .build_cache_pruner import build_cache_pruner
    build_cache_pruner.start()

    try:
        yield
    finally:
        # ORCH-063: stop the disk-watchdog first (reverse of startup).
        # ORCH-062: stop the build-cache-pruner first (reverse of startup, D7).
        build_cache_pruner.stop()
        # ORCH-063: stop the disk-watchdog next (reverse of startup).
        disk_watchdog.stop()
        # Graceful shutdown order mirrors startup in reverse: stop the reaper
        # first, then the reconciler (it must not enqueue new work while the
@@ -162,6 +172,7 @@ async def queue():
    from . import serial_gate
    from . import labels
    from .disk_watchdog import disk_watchdog
    from .build_cache_pruner import build_cache_pruner
    return {
        "counts": job_status_counts(),
        "max_concurrency": worker.max_concurrency,
@@ -184,6 +195,11 @@ async def queue():
        # enabled, threshold, interval, last measurement per host-path. Additive
        # block; never-raise (status() returns {"enabled": ...} minimum on error).
        "disk_monitor": disk_watchdog.status(),
        # ORCH-062 (FR-4 / AC-7): build-cache-pruner observability (read-only):
        # enabled, interval, retention (until), last run + best-effort reclaimed /
        # last error. Additive block; never-raise (status() returns {"enabled":
        # ...} minimum on error).
        "build_cache_prune": build_cache_pruner.status(),
        "recent": recent_jobs(10),
    }
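
The start-last / stop-first ordering used in the lifespan hunk above can be sketched with toy daemons. Everything here is an assumption made for illustration: `ToyDaemon`, the `lifespan(daemons)` signature and the event log are hypothetical, and the real application wires its daemons by name inside the FastAPI lifespan.

```python
import asyncio
import threading
from contextlib import asynccontextmanager


class ToyDaemon:
    """Hypothetical stand-in for a start()/stop() daemon such as the pruner."""

    def __init__(self, name, log, enabled=True):
        self.name, self.log, self.enabled = name, log, enabled
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        if not self.enabled:  # kill-switch: the thread is never created
            return
        self._stop.clear()
        # The toy thread just blocks until stop() sets the event.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()
        self.log.append(f"start {self.name}")

    def stop(self, timeout=5.0):
        self._stop.set()
        if self._thread:
            self._thread.join(timeout=timeout)
            self.log.append(f"stop {self.name}")


@asynccontextmanager
async def lifespan(daemons):
    for d in daemons:  # start in declaration order
        d.start()
    try:
        yield
    finally:
        for d in reversed(daemons):  # stop in reverse order
            d.stop()


async def demo(pruner_enabled=True):
    log = []
    daemons = [ToyDaemon("watchdog", log), ToyDaemon("pruner", log, pruner_enabled)]
    async with lifespan(daemons):
        pass
    return log


events = asyncio.run(demo())
events_disabled = asyncio.run(demo(pruner_enabled=False))
```

With the kill-switch off, the disabled daemon leaves no trace in the log, which mirrors the "1:1 as before" behaviour the diff describes for a disabled pruner.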