feat(infra): auto-prune docker build cache on mva154 (ORCH-062)
Add src/build_cache_pruner.py — a background daemon thread modelled 1:1 on src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f --filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans. Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down) automatically, без оператора. ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy (no docker CLI in the image). Touches ONLY the build cache — no image/system prune, no image/container removal, never restarts the docker daemon or the prod container (self-hosting safety). No ssh target -> tick is a no-op. - src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default). - src/main.py: start last (after disk_watchdog) / stop first in lifespan; additive read-only build_cache_prune block in GET /queue. - never-raise on two levels (per-command + per-tick); kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before). - STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED; last-run/last-result is in-memory (no migration). - tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked). - .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already carry the component (architecture stage). Refs: ORCH-062 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
20
.env.example
20
.env.example
@@ -286,6 +286,26 @@ ORCH_DISK_MONITOR_THRESHOLD_PCT=85
|
||||
ORCH_DISK_MONITOR_REALERT_S=21600
|
||||
ORCH_DISK_MONITOR_PATHS=/repos,/app/data
|
||||
|
||||
# ORCH-062: build-cache-pruner — the "second half" of the disk-watchdog
|
||||
# (watchdog SIGNALS, pruner CLEANS). A daemon thread modelled on disk_watchdog
|
||||
# that periodically runs STRICTLY `docker builder prune -f --filter until=<until>`
|
||||
# on the HOST over ssh (BuildKit GC). Touches ONLY the build cache: never
|
||||
# images/containers of running services, never restarts the docker daemon or the
|
||||
# prod container (self-hosting safety). State is in-memory (no DB migration). No
|
||||
# ssh host configured -> the tick is a no-op. See docs/operations/INFRA.md.
|
||||
# BUILD_CACHE_PRUNE_ENABLED -> kill-switch; false -> the daemon does not start (1:1 as before).
|
||||
# BUILD_CACHE_PRUNE_INTERVAL_S -> tick period, seconds (order of hours; default ~6h). >0, else default.
|
||||
# BUILD_CACHE_PRUNE_UNTIL -> retention age for the warm cache (`--filter until=`); ^\d+[smhdw]?$, else 24h.
|
||||
# BUILD_CACHE_PRUNE_ALL -> add `-a` (ALWAYS paired with until); default false.
|
||||
# BUILD_CACHE_PRUNE_TIMEOUT_S -> bound on the ssh command, seconds. >0, else default.
|
||||
# BUILD_CACHE_PRUNE_NOTIFY_MIN_GB -> Telegram when reclaimed >= N GB; 0 -> silent.
|
||||
ORCH_BUILD_CACHE_PRUNE_ENABLED=true
|
||||
ORCH_BUILD_CACHE_PRUNE_INTERVAL_S=21600
|
||||
ORCH_BUILD_CACHE_PRUNE_UNTIL=24h
|
||||
ORCH_BUILD_CACHE_PRUNE_ALL=false
|
||||
ORCH_BUILD_CACHE_PRUNE_TIMEOUT_S=120
|
||||
ORCH_BUILD_CACHE_PRUNE_NOTIFY_MIN_GB=0
|
||||
|
||||
# ORCH-022: security-gate (secret-scanning + dependency audit) on the
|
||||
# deploy-staging -> deploy edge, run FIRST among the edge sub-gates. Deterministic
|
||||
# (no LLM): gitleaks (offline secret-scan, pinned Go binary in the image) + pip-audit
|
||||
|
||||
Reference in New Issue
Block a user