feat(infra): auto-prune docker build cache on mva154 (ORCH-062)

Add src/build_cache_pruner.py, a background daemon thread modelled 1:1 on
src/disk_watchdog.py that periodically runs exactly one command, `docker
builder prune -f --filter until=<until>` (BuildKit GC), on the HOST over ssh.
It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals,
the pruner cleans. This automatically removes the root cause of the 07.06.2026
incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down),
with no operator involvement.
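
The periodic daemon thread described above could look roughly like the sketch
below. This is an illustration only, not the contents of
src/build_cache_pruner.py: the class name `PeriodicDaemon`, its constructor
parameters and the `last_error` attribute are hypothetical.

```python
import threading
from typing import Callable, Optional


class PeriodicDaemon:
    """Hypothetical sketch: call `tick` every `interval_s` seconds until stopped."""

    def __init__(self, tick: Callable[[], None], interval_s: float, enabled: bool = True):
        self._tick = tick
        self._interval_s = interval_s
        self._enabled = enabled
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None
        self.last_error: Optional[str] = None

    def start(self) -> None:
        # Kill-switch: when disabled (or already running) no thread is created.
        if not self._enabled or self._thread is not None:
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, name="periodic-daemon", daemon=True)
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout)
            self._thread = None

    def _run(self) -> None:
        while not self._stop.is_set():
            try:
                self._tick()
                self.last_error = None
            except Exception as exc:
                # Per-tick never-raise: a failed tick must not kill the thread.
                self.last_error = repr(exc)
            # Event.wait acts as a sleep that stop() can interrupt immediately.
            self._stop.wait(self._interval_s)
```

Waiting on a `threading.Event` rather than calling `time.sleep` lets `stop()`
return promptly even when the interval is long.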

ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy
(no docker CLI in the image). Touches ONLY the build cache — no image/system
prune, no image/container removal, never restarts the docker daemon or the prod
container (self-hosting safety). No ssh target -> tick is a no-op.
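
As a rough sketch of the host-over-ssh variant: the command can be built as an
argv list and skipped entirely when no ssh target is configured. The function
names, the result dict shape and the `BatchMode=yes` option are assumptions made
for illustration; the real module may build and run the command differently.

```python
import shlex
import subprocess
from typing import List, Optional


def build_prune_command(ssh_target: Optional[str], until: str) -> Optional[List[str]]:
    """Return the ssh argv for the prune, or None when there is no ssh target."""
    if not ssh_target:
        return None  # no ssh target -> the tick is a no-op
    # Only the build cache is touched: no image/system prune, no restarts.
    remote = "docker builder prune -f --filter " + shlex.quote(f"until={until}")
    return ["ssh", "-o", "BatchMode=yes", ssh_target, remote]


def run_prune(ssh_target: Optional[str], until: str, timeout_s: float = 120.0) -> dict:
    """Per-command never-raise: every failure is reported in the result dict."""
    argv = build_prune_command(ssh_target, until)
    if argv is None:
        return {"ran": False, "ok": True, "error": None}
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.SubprocessError) as exc:
        # Covers a missing ssh binary and subprocess.TimeoutExpired.
        return {"ran": True, "ok": False, "error": repr(exc)}
    ok = proc.returncode == 0
    return {"ran": True, "ok": ok, "error": None if ok else proc.stderr.strip()}
```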

- src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators
  (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default).
- src/main.py: start last (after disk_watchdog) / stop first in lifespan;
  additive read-only build_cache_prune block in GET /queue.
- never-raise on two levels (per-command + per-tick); kill-switch
  ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before).
- STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED;
  last-run/last-result is in-memory (no migration).
- tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked).
- .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already
  carry the component (architecture stage).
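
The defensive validators listed above might be sketched as follows. Only the
`until` pattern is taken from this commit message; the helper names and the
default values used in the usage note are hypothetical.

```python
import math
import re

# Pattern from the commit message: digits plus an optional unit suffix.
_UNTIL_RE = re.compile(r"\d+[smhdw]?")


def positive_int(raw: object, default: int) -> int:
    """Return int(raw) if it parses and is > 0, else the safe default."""
    try:
        value = int(str(raw).strip())
    except ValueError:
        return default
    return value if value > 0 else default


def valid_until(raw: object, default: str) -> str:
    """Return raw if it fully matches the retention pattern, else the default."""
    text = str(raw).strip()
    return text if _UNTIL_RE.fullmatch(text) else default


def non_negative_float(raw: object, default: float) -> float:
    """Return float(raw) if it is finite and >= 0, else the safe default."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        return default
    return value if math.isfinite(value) and value >= 0 else default
```

For example, `valid_until("7d; rm -rf /", "24h")` falls back to `"24h"`, so a
malformed setting never reaches the remote shell.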

Refs: ORCH-062

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 19:43:53 +03:00
committed by orchestrator-deployer
parent d2604e42cd
commit 664c2e945a
6 changed files with 855 additions and 1 deletions


@@ -113,10 +113,20 @@ async def lifespan(app: FastAPI):
     from .disk_watchdog import disk_watchdog
     disk_watchdog.start()
+    # ORCH-062: start the build-cache-pruner LAST, right after the disk-watchdog
+    # (D7). It is the "second half" of the watchdog (watchdog signals, pruner
+    # cleans): a daemon thread that periodically runs `docker builder prune` on
+    # the host over ssh. Honours the kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED
+    # (start() is a no-op when disabled, so behaviour is 1:1 as before).
+    from .build_cache_pruner import build_cache_pruner
+    build_cache_pruner.start()
     try:
         yield
     finally:
-        # ORCH-063: stop the disk-watchdog first (reverse of startup).
+        # ORCH-062: stop the build-cache-pruner first (reverse of startup, D7).
+        build_cache_pruner.stop()
+        # ORCH-063: stop the disk-watchdog next (reverse of startup).
         disk_watchdog.stop()
         # Graceful shutdown order mirrors startup in reverse: stop the reaper
         # first, then the reconciler (it must not enqueue new work while the
@@ -162,6 +172,7 @@ async def queue():
     from . import serial_gate
     from . import labels
     from .disk_watchdog import disk_watchdog
+    from .build_cache_pruner import build_cache_pruner
     return {
         "counts": job_status_counts(),
         "max_concurrency": worker.max_concurrency,
@@ -184,6 +195,11 @@ async def queue():
         # enabled, threshold, interval, last measurement per host-path. Additive
         # block; never-raise (status() returns {"enabled": ...} minimum on error).
         "disk_monitor": disk_watchdog.status(),
+        # ORCH-062 (FR-4 / AC-7): build-cache-pruner observability (read-only):
+        # enabled, interval, retention (until), last run + best-effort reclaimed /
+        # last error. Additive block; never-raise (status() returns {"enabled":
+        # ...} minimum on error).
+        "build_cache_prune": build_cache_pruner.status(),
         "recent": recent_jobs(10),
     }
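
The never-raise contract of `status()` mentioned in the comments above could be
sketched like this. `safe_status` and its `collect` callback are hypothetical
names used only for illustration; the actual `status()` methods may be
structured differently.

```python
from typing import Any, Callable, Dict


def safe_status(enabled: bool, collect: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: never raises, always returns at least {"enabled": ...}."""
    try:
        details = dict(collect())
    except Exception:
        # Observability must not break GET /queue, so errors collapse to the minimum.
        details = {}
    # "enabled" is written last so a bad detail key cannot override it.
    return {**details, "enabled": bool(enabled)}
```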