fix(test): isolate settings.runs_dir in conftest to stop ambient prod-log pollution (ORCH-100)

test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the self-hosting environment because launcher._finalize_job classifies a non-zero exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir defaults to the live prod dir /app/data/runs, which on the host holds REAL accumulated agent logs; a real 2.log containing "429" flips the expected 'permanent' classification to 'transient', requeueing the job instead of marking it 'failed'. This is ambient prod pollution, not a code fault.

Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram / _disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir, so that _run_log_path() resolves to a non-existent file and classify_log_file() returns the documented 'permanent' default.

Full suite: 1617 passed. src/** untouched.

Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 09:21:12 +03:00
parent d61b583dad
commit 318bae7472
2 changed files with 29 additions and 0 deletions
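
The failure mode described in the commit message can be reproduced in isolation. The sketch below is illustrative only: `settings`, `run_log_path`, `classify_log_file`, and the marker list are hypothetical stand-ins written for this example, not the project's actual `src/` code, which this commit does not show. It assumes only what the message states: the classifier reads the log tail, treats "429"-style markers as transient, and defaults to 'permanent' when the log file does not exist.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Hypothetical stand-in for the process-wide settings singleton.
settings = SimpleNamespace(runs_dir="/app/data/runs")

# Assumed marker list, loosely based on "429/overload/timeout" in the docstring.
TRANSIENT_MARKERS = ("429", "overload", "timeout")


def run_log_path(run_id: int) -> Path:
    """Resolve <runs_dir>/<run_id>.log against the current settings."""
    return Path(settings.runs_dir) / f"{run_id}.log"


def classify_log_file(path: Path, tail_bytes: int = 4096) -> str:
    """Return 'transient' if the log tail holds a known marker, else 'permanent'."""
    try:
        data = path.read_bytes()
    except OSError:
        # A missing or unreadable log falls back to the 'permanent' default.
        return "permanent"
    tail = data[-tail_bytes:].decode("utf-8", errors="replace")
    return "transient" if any(m in tail for m in TRANSIENT_MARKERS) else "permanent"


def demo() -> tuple:
    """Classify run_id=2 against a polluted directory, then an empty one."""
    with tempfile.TemporaryDirectory() as polluted, tempfile.TemporaryDirectory() as clean:
        # Simulate a leftover log from an unrelated real run.
        (Path(polluted) / "2.log").write_text("HTTP 429 Too Many Requests\n")

        settings.runs_dir = polluted
        before = classify_log_file(run_log_path(2))

        # What the fixture does in effect: point runs_dir at an empty directory.
        settings.runs_dir = clean
        after = classify_log_file(run_log_path(2))
    return before, after


if __name__ == "__main__":
    print(demo())
```

The same `run_id` yields a different classification depending only on what happens to be in `runs_dir`, which is why the fix redirects the directory rather than changing the test or the classifier.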
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,7 @@

 ## [Unreleased]
 - **FND/F1b: sidecar-watchdog, the monitoring brain in a separate container** (ORCH-100, `feat`): new `watchdog/` directory (a thin **Python-3.12-stdlib-only** daemon) plus an `orchestrator-watchdog` service in `docker-compose.yml` (`network_mode: host`, read-only `docker.sock`, `mem_limit: 128m`). This is the second half of the domain 0 observability pair: F1a (ORCH-099) serves `GET /metrics` (the raw data), and F1b is the **brain** that reads that data, augments it with external signals (host/containers/dependencies), and turns it into **alerts** over its **own** independent Telegram channel. **`src/**` is NOT changed**: F1b is a consumer of `/metrics`; `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/the orchestrator DB schema are byte-for-byte identical. Additive, behind the `WATCHDOG_ENABLED` kill switch, strictly read-only towards what it observes (self-hosting-safe). ADR: `docs/work-items/ORCH-100/06-adr/ADR-001-sidecar-watchdog.md`, cross-cutting `docs/architecture/adr/adr-0033-sidecar-watchdog.md`.
+  - **fix(test): isolate `settings.runs_dir` in conftest**: removes the ambient prod dependency that was failing `test_queue.py::TestRetry::test_finalize_job_requeue_then_fail` in the self-hosting environment (TC-14 "full tests/ regression green"). `launcher._finalize_job` classifies a failure from the tail of `<settings.runs_dir>/<run_id>.log`; `runs_dir` defaults to the live prod directory `/app/data/runs`, where REAL agent logs have accumulated on the host (`2.log` contains `429` → 'transient'), so the test with the literal `run_id=2` read an unrelated prod log and got a requeue instead of `failed`. A new autouse fixture `_isolate_runs_dir` in `tests/conftest.py` (modelled on `_no_telegram`/`_disable_merge_verify`) redirects `runs_dir` to a per-test tmp dir → `_run_log_path()` points at a non-existent file → `classify_log_file()` returns the documented default 'permanent'. Determinism of the whole suite is restored (1617 passed); `src/**` is untouched.
  - **Stack (D1):** Python 3.12 stdlib-only on `python:3.12-slim`: `urllib` (HTTP `/metrics` + pings + Telegram POST), raw HTTP-over-unix-socket for the read-only `docker.sock` (WITHOUT the `docker` pip package), `shutil.disk_usage`/`/proc/meminfo` for the host. No dependency tree (thinness, C-3). Separate image `watchdog/Dockerfile` (build context = repo root; `src/**` is NOT copied, isolation C-1).
  - **Topology (D2):** the service is built from `watchdog/Dockerfile`, `restart: unless-stopped` (self-healing), `network_mode: host` → `/metrics` is reachable at `http://127.0.0.1:8500/metrics`; `docker.sock` is mounted `:ro` AND the code is GET-only (a double read-only guarantee); host paths are bind-mounted `:ro`; `mem_limit: 128m` + `mem_reservation: 32m`. `env_file` is optional (`required: false`) → a missing `.env.watchdog` does NOT break `docker compose up` for the prod orchestrator. Deploying the watchdog brings up ONLY the watchdog; the prod `orchestrator` is neither rebuilt nor restarted.
  - **Generalised pure decision function (D4):** `watchdog/decision.py::decide(signal_active, prev, now, cooldown_s) -> alert|realert|recovery|none`, a strict generalisation of `disk_watchdog.decide_action` (a boolean `signal_active` instead of `used_pct >= threshold`), with per-signal in-memory `AlertState` (anti-spam/recovery; a restart resets it → a correct repeat alert for a standing problem).
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -77,6 +77,34 @@ def _reset_webhook_secrets(monkeypatch):
    yield


+@pytest.fixture(autouse=True)
+def _isolate_runs_dir(monkeypatch, tmp_path):
+    """ORCH-100: point settings.runs_dir at a per-test tmp dir in ALL tests.
+
+    Background: ``launcher._run_log_path(run_id)`` resolves to
+    ``<settings.runs_dir>/<run_id>.log`` and, on a non-zero exit,
+    ``_finalize_job`` classifies the failure by reading the *tail of that log*
+    (transient 429/overload/timeout -> backoff-requeue; permanent -> attempts
+    requeue then 'failed'). settings.runs_dir defaults to the live prod dir
+    ``/app/data/runs``, which on the self-hosting host holds REAL accumulated
+    agent logs (1.log, 2.log, ...). Tests that exercise the finalize path with a
+    small literal run_id (e.g. test_finalize_job_requeue_then_fail uses run_id=1/2)
+    therefore read whatever a real prod run happened to log — and a real 2.log that
+    contains "429" silently flips an expected 'permanent' classification to
+    'transient', requeueing instead of failing. That is ambient prod pollution, not
+    a code fault.
+
+    Redirecting runs_dir to an empty tmp dir makes _run_log_path() resolve to a
+    non-existent file -> classify_log_file() returns the documented 'permanent'
+    default, restoring deterministic, environment-independent behaviour for the
+    whole suite. settings is a process-wide singleton shared by launcher
+    (``launcher.settings is config.settings``), so patching the source covers it.
+    """
+    from src import config as _cfg
+    monkeypatch.setattr(_cfg.settings, "runs_dir", str(tmp_path), raising=False)
+    yield
+
+
 @pytest.fixture(autouse=True)
 def _disable_merge_verify(monkeypatch):
    """ORCH-071: disable the merge-verify under-gate by default in ALL tests.