The merge-gate re-test bounced ORCH-111 back to development with 1 failed + 40
errors in 488s. That is a resource-exhaustion signature, NOT a code defect:
- This branch is watchdog-only (watchdog/** + compose); it touches no src/,
no STAGE_TRANSITIONS/QG_CHECKS/check_*, and no tests/test_stage_engine.py.
- The failing tests (test_stage_engine.py::TestStagingInfraTolerance
tc02/tc12/tc13/tc14) are outside this branch's scope, pass in isolation
(5 passed/19s), and pass right after the new watchdog tests (105 passed).
tc14 takes NO fixtures yet "errored" — a systemic/host failure, not logic.
- Host load was ~10-12 on a 4-core box at re-test time (the exact orphaned-
pytest CPU-starvation incident ORCH-111 alerts on; ORCH-111 by design only
observes, it does not reap — BR-3).
Evidence the branch is sound: full `pytest tests/` is green locally
(1933 passed, 0 failed, 0 errors in 267s, well under the 600s budget) and
Gitea CI on the branch HEAD is green (push + pull_request). This empty commit
re-runs the pipeline now that host load has dropped (10.5 -> 6).
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the observability gap between agent_hung (which tracks only jobs by
jobs.pid) and orphaned pytest subprocesses that the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grandchild pytest reparents onto tini and keeps
running for days, starving CPU and failing the merge-gate re-test, with no alert.
Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
/proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
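A minimal sketch of how the pure parsers and the age gate described above might
look. The function names (parse_cmdline, parse_starttime_ticks, age_seconds,
is_blocking), the substring match and the 100 Hz clock-tick default are
illustrative assumptions, not the actual watchdog/collectors/proc.py or
watchdog/signals.py API; only the pytest pattern, the 60-minute default and the
"active iff age_s > proc_age_s" rule are taken from the notes above.

```python
from typing import Iterable, List

# Default threshold from the notes: 60 min, above max(600, 900) seconds so a
# legitimate in-flight re-test or coverage run never crosses it.
PROC_AGE_S = 60 * 60


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    return [part.decode("utf-8", "replace") for part in raw.split(b"\0") if part]


def parse_starttime_ticks(stat: str) -> int:
    """Return field 22 (starttime, in clock ticks) of a /proc/<pid>/stat line.

    The comm field may itself contain spaces or ')', so split after the LAST ')'.
    """
    rest = stat[stat.rindex(")") + 1:].split()
    return int(rest[19])  # rest[0] is field 3 (state), so field 22 is rest[19]


def age_seconds(uptime_s: float, starttime_ticks: int, clk_tck: int = 100) -> float:
    """Process age = system uptime minus start time since boot (assumed 100 Hz)."""
    return uptime_s - starttime_ticks / clk_tck


def is_blocking(argv: Iterable[str], age_s: float,
                patterns: Iterable[str] = ("pytest",),
                proc_age_s: float = PROC_AGE_S) -> bool:
    """Active only for in-scope command lines that are older than the threshold."""
    argv = list(argv)
    in_scope = any(p in arg for arg in argv for p in patterns)
    return in_scope and age_s > proc_age_s
```

Because the parsers take bytes and strings rather than paths, they can be
exercised on fixture data without a real /proc, which is the point of splitting
them from the I/O.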
Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the ORCH-109 entry was inserted above the ORCH-105 entry, the
ORCH-105 bullet had its body accidentally duplicated (the same
"слайдо-источник …" paragraph appeared twice in one bullet). Restore
the ORCH-105 entry to its canonical single-bodied form (byte-for-byte
identical to origin/main); the legitimate ORCH-109 additions are
untouched.
Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two additive, isolated launch-subsystem fixes from incident ORCH-104, without
touching STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema.
D1 — launch-time model stamp: write the resolved model into agent_runs.model in
the SAME UPDATE as the effort stamp (ORCH-087), so the model is present from
launch, survives a timeout-kill (exit_code=-9), and is visible in-flight in
/metrics & /queue. record_usage stays an enrichment (model=COALESCE preserves the
launch stamp when the usage JSON model is None). The stamp is never-raise
(isolated try/except).
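A hedged sketch of the D1 idea using sqlite3. The table layout (id, effort,
model) and the helper names stamp_launch / record_usage-style signatures are
assumptions for illustration; only agent_runs.model, the single-UPDATE stamp,
the COALESCE enrichment and the never-raise behaviour come from the description
above.

```python
import logging
import sqlite3
from typing import Optional

log = logging.getLogger("launch")


def stamp_launch(conn: sqlite3.Connection, run_id: int,
                 model: str, effort: str) -> bool:
    """Write effort and model in the SAME UPDATE; never raise to the caller."""
    try:
        conn.execute(
            "UPDATE agent_runs SET effort = ?, model = ? WHERE id = ?",
            (effort, model, run_id),
        )
        return True
    except sqlite3.Error:
        # Isolated failure: the launch itself must not break on a stamp error.
        log.warning("launch stamp failed for run %s", run_id, exc_info=True)
        return False


def record_usage(conn: sqlite3.Connection, run_id: int,
                 usage_model: Optional[str]) -> None:
    """Enrichment only: a NULL usage model keeps the launch-time stamp."""
    conn.execute(
        "UPDATE agent_runs SET model = COALESCE(?, model) WHERE id = ?",
        (usage_model, run_id),
    )
```

With this shape, a run killed before record_usage is ever called still carries
the model written at launch.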
D3/D4 — dedicated per-role budgets: agent_timeout_developer_s=3600 /
agent_timeout_reviewer_s=3000 with a deterministic _resolve_timeout ladder
(overrides_json[agent] > dedicated role key > agent_timeout_seconds=1800; other
roles byte-for-byte). Malformed/non-positive config falls back to the global
default + WARNING (never-break). reaper_max_running_s raised 3600 -> 5400 in
lockstep to keep the ORCH-065 invariant (5400 > 3600 + 20 = 3620).
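The ladder and the reaper invariant can be sketched roughly as follows. This is
not the real _resolve_timeout: the function signature, the plain-dict config,
the ROLE_KEYS table and the name REAPER_MARGIN_S for the 20-second margin are
hypothetical; the key names, default values and precedence order are the ones
listed above. It also assumes that a malformed value on any rung falls straight
back to the global default rather than to the next rung.

```python
import logging
from typing import Any, Mapping, Optional

log = logging.getLogger("launch")

GLOBAL_DEFAULT_S = 1800
ROLE_KEYS = {
    "developer": "agent_timeout_developer_s",
    "reviewer": "agent_timeout_reviewer_s",
}
REAPER_MARGIN_S = 20  # hypothetical name for the +20 in the invariant


def _positive_int(value: Any) -> Optional[int]:
    """Return value as a positive int, or None if malformed/non-positive."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


def resolve_timeout(agent: str, config: Mapping[str, Any],
                    overrides: Mapping[str, Any]) -> int:
    """overrides[agent] > dedicated role key > agent_timeout_seconds."""
    default = _positive_int(config.get("agent_timeout_seconds")) or GLOBAL_DEFAULT_S
    rungs = [overrides.get(agent)]
    if agent in ROLE_KEYS:
        rungs.append(config.get(ROLE_KEYS[agent]))
    for raw in rungs:
        if raw is None:
            continue  # rung not configured, try the next one
        parsed = _positive_int(raw)
        if parsed is None:
            log.warning("bad timeout %r for %s, using %ss", raw, agent, default)
            return default  # never-break: fall back instead of raising
        return parsed
    return default


def reaper_invariant_holds(reaper_max_running_s: int, max_timeout_s: int) -> bool:
    """The reaper must outlive the longest agent budget plus the margin."""
    return reaper_max_running_s > max_timeout_s + REAPER_MARGIN_S
```

With the values from this change, reaper_invariant_holds(5400, 3600) is true,
while the previous reaper_max_running_s of 3600 would no longer satisfy it.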
FR-4 (kill / in-flight visibility) and FR-5 (anti-salvage) are structural in the
existing code; pinned here by regression tests (tests/test_orch109_timeout_model.py,
TC-01..TC-12). Docs: .env.example, config passport, CHANGELOG, CLAUDE.md
(README/internals authored by architect in this branch).
Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Canonical staging_check.py (stub) exits 0; all REAL checks are green, and
C9a/C9b are waived as sandbox-infra (ORCH-061).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>