admin/orchestrator

feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) #116

Merged

admin merged 10 commits from feature/ORCH-100-fnd-f1b-sidecar-watchdog into main

2026-06-10 09:57:13 +03:00

Author	SHA1	Message	Date
deploy-finalizer	e7dad0f644	deploy(ORCH-036): finalize SUCCESS for ORCH-100 [CI / test (push): successful in 52s]	2026-06-10 09:57:11 +03:00
claude-bot	0ef1cf6698	tester(ET): auto-commit from tester run_id=571 [CI / test (push): successful in 1m1s; CI / test (pull_request): successful in 58s]	2026-06-10 09:36:02 +03:00
claude-bot	9f62e05d01	reviewer(ET): auto-commit from reviewer run_id=570	2026-06-10 09:36:02 +03:00
claude-bot	318bae7472	fix(test): isolate settings.runs_dir in conftest to stop ambient prod-log pollution (ORCH-100) test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the self-hosting environment because launcher._finalize_job classifies a non-zero exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir defaults to the live prod dir /app/data/runs, which on the host holds REAL accumulated agent logs; a real 2.log containing "429" flips the expected 'permanent' classification to 'transient', requeueing the job instead of marking it 'failed'. This is ambient prod pollution, not a code fault. Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram / _disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir so _run_log_path() resolves to a non-existent file and classify_log_file() returns the documented 'permanent' default. Full suite: 1617 passed. src/** untouched. Refs: ORCH-100 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:36:02 +03:00
claude-bot	d61b583dad	tester(ET): auto-commit from tester run_id=568	2026-06-10 09:36:02 +03:00
claude-bot	93cf2732a2	reviewer(ET): auto-commit from reviewer run_id=567	2026-06-10 09:36:02 +03:00
claude-bot	259b507906	feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service — the brain half of the domain-0 observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it, augments with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel. Key properties (ADR-001): - Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup). - Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control — self-hosting-safe. - never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit). - Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off). - NO import from src/ (C-1); src/, STAGE_TRANSITIONS, QG_CHECKS, check_, DB schema — untouched. env_file optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator. Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14). Docs: CHANGELOG, .env.example canon (WATCHDOG_); architecture README + adr-0033 authored at the architecture stage. Refs: ORCH-100 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:36:02 +03:00
claude-bot	1c08b3f62a	architect(ET): auto-commit from architect run_id=565	2026-06-10 09:36:02 +03:00
claude-bot	36102f253f	analyst(ET): auto-commit from analyst run_id=564	2026-06-10 09:36:02 +03:00
Slava	874cc29ff7	docs: init ORCH-100 business request	2026-06-10 09:36:02 +03:00
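
Commit 259b507906 describes a pure decision function, decide(signal_active, prev, now, cooldown), with per-signal in-memory dedup, throttle and recovery. The commit message gives only the signature, so the following is a minimal sketch of how such a function might behave. The SignalState type, the action names "alert" and "recover", and the choice to reset state on recovery are all assumptions made for illustration, not the code from the watchdog/ package.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SignalState:
    """Hypothetical per-signal memory; the real state shape is not in the PR text."""
    active: bool = False
    last_alert_at: Optional[float] = None


def decide(
    signal_active: bool, prev: SignalState, now: float, cooldown: float
) -> Tuple[Optional[str], SignalState]:
    """Pure function: returns (action, new_state) and performs no I/O.

    action is "alert", "recover" or None (assumed names).
    """
    if signal_active:
        # Alert on first activation, then repeat only once the cooldown elapsed.
        if prev.last_alert_at is None or now - prev.last_alert_at >= cooldown:
            return "alert", SignalState(active=True, last_alert_at=now)
        # Still active but inside the cooldown window: stay silent (dedup).
        return None, SignalState(active=True, last_alert_at=prev.last_alert_at)
    if prev.active:
        # Signal cleared after having been active: report recovery once.
        return "recover", SignalState()
    return None, prev


# Per-signal in-memory state, keyed by signal name.
states = {}
actions = []
for now, active in [(0.0, True), (30.0, True), (60.0, True), (70.0, False), (80.0, False)]:
    action, states["orch_down"] = decide(
        active, states.get("orch_down", SignalState()), now, cooldown=60.0
    )
    actions.append(action)
print(actions)
```

Keeping the function free of I/O means the same logic can be unit-tested with synthetic timestamps, while the caller owns the state dictionary and the actual Telegram send.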
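
The same commit states that an unanswered /metrics endpoint is itself the master orch_down alarm, debounced over K ticks, and that each source probe must never raise. One way this could be sketched with the standard library alone is shown below. The names probe_metrics and DownDebouncer, the 200-only success rule and the timeout value are hypothetical; the PR text does not show the real probe.

```python
import urllib.request


def probe_metrics(url: str, timeout: float = 3.0) -> bool:
    """Return True only if a GET on url answers 200. Never raises (per-source guard)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Any failure (refused connection, timeout, HTTP error, bad URL)
        # counts as "not answering" instead of crashing the tick.
        return False


class DownDebouncer:
    """Reports orch_down only after k consecutive failed probes."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.failures = 0

    def update(self, probe_ok: bool) -> bool:
        # A single successful probe resets the streak, so one hiccup does not flap.
        self.failures = 0 if probe_ok else self.failures + 1
        return self.failures >= self.k


debouncer = DownDebouncer(k=3)
history = [debouncer.update(ok) for ok in (False, False, True, False, False, False)]
print(history)
```

The boolean returned by update() would then be the signal_active input of the decision function, so debouncing and alert throttling stay separate concerns.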
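
Commit 318bae7472 fixes a test that failed because settings.runs_dir pointed at the live /app/data/runs directory, where a real 2.log containing "429" flipped the classification from 'permanent' to 'transient'. The sketch below reproduces that effect and the isolation idea using only the standard library. The settings object, run_log_path, the body of classify_log_file and the isolated_runs_dir context manager are stand-ins written for illustration; in the actual fix the redirection is done by an autouse pytest fixture named _isolate_runs_dir, whose code is not shown in the PR text.

```python
import contextlib
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Stand-in for the application's settings object (assumption).
settings = SimpleNamespace(runs_dir=Path("/app/data/runs"))


def run_log_path(run_id: int) -> Path:
    return Path(settings.runs_dir) / f"{run_id}.log"


def classify_log_file(path: Path, tail_bytes: int = 4096) -> str:
    """Assumed rule: "429" in the log tail means 'transient', else 'permanent'."""
    try:
        tail = Path(path).read_bytes()[-tail_bytes:]
    except OSError:
        return "permanent"  # missing or unreadable log: documented default
    return "transient" if b"429" in tail else "permanent"


@contextlib.contextmanager
def isolated_runs_dir():
    """Point settings.runs_dir at a fresh temp dir, then restore it."""
    original = settings.runs_dir
    with tempfile.TemporaryDirectory() as tmp:
        settings.runs_dir = Path(tmp)
        try:
            yield Path(tmp)
        finally:
            settings.runs_dir = original


with isolated_runs_dir() as runs_dir:
    clean = classify_log_file(run_log_path(2))      # no file in the temp dir
    (runs_dir / "2.log").write_text("HTTP 429 Too Many Requests\n")
    polluted = classify_log_file(run_log_path(2))   # ambient log flips the result
print(clean, polluted)
```

With the directory isolated per test, the outcome depends only on files the test itself creates, which is why the fix needed no change under src/.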