test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the
self-hosting environment because launcher._finalize_job classifies a non-zero
exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir
defaults to the live prod dir /app/data/runs, which on the host holds REAL
accumulated agent logs; a real 2.log containing "429" flips the expected
'permanent' classification to 'transient', requeueing the job instead of
marking it 'failed'. This is ambient prod pollution, not a code fault.
Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram /
_disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir
so _run_log_path() resolves to a non-existent file and classify_log_file()
returns the documented 'permanent' default. Full suite: 1617 passed. src/**
untouched.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.
Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
schema — untouched. env_file optional so a missing .env.watchdog never breaks
`docker compose up` for the prod orchestrator.
Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up ORCH-040: legacy root:root files in /repos broke worktree creation
under uid 1000 with a raw "Permission denied" (agent never started, no diagnosis).
Three additive, kill-switch-reversible layers; STAGE_TRANSITIONS / QG_CHECKS /
check_* / machine-verdict keys / DB schema are byte-for-byte unchanged.
- D1: ensure_worktree classifies the permission class and raises an actionable
RuntimeError (cause + chown command + INFRA.md ref); non-permission errors keep
the prior raw-stderr contract; kill-switch off -> contract 1:1 as before ORCH-057.
- D2: new never-raise leaf src/fs_normalize.py — scan_ownership (TTL-cached,
early-exit per root), applies()-first scope (empty CSV -> self-hosting only),
opt-in normalize() that chowns ONLY when privileged (no-op under uid 1000).
- D3: best-effort startup detect in main.lifespan (WARNING + Telegram on mismatch,
never-fatal); read-only fs_ownership block in GET /queue; POST /fs-normalize/check.
Claim is NOT blocked — the clear early outcome is delivered by D1 at launch.
- Docs/config: .env.example flags + CHANGELOG (architecture README / adr-0031 /
INFRA.md procedure already landed on the branch).
- Tests: test_fs_normalize.py, test_git_worktree_perm.py,
test_fs_normalize_startup.py, test_api_queue.py (TC-01..TC-12). Full suite green.
Refs: ORCH-057
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
/proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
extension of the hot-path get_running_jobs()), agent_cost_totals(),
queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
by architect); .env.example ORCH_METRICS_ENABLED.
Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewer P1 (ORCH-027 attempt 2): inserting the ORCH-027 changelog
block duplicated the adjacent ORCH-095 entry — its paragraph body was
repeated verbatim, corrupting a golden-source doc and another work
item's artifact (CLAUDE.md §3). Remove the duplicate half, leaving a
single ORCH-095 body. ORCH-027 entry untouched (already correct).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>