ORCH-111: watchdog proc_blocking alert on long-lived orphaned test processes #130

Merged
admin merged 14 commits from feature/ORCH-111-bug-watchdog-must-alert-on-lon into main 2026-06-15 09:14:22 +03:00

ORCH-111 — Watchdog proc_blocking: alert on long-lived orphaned test processes

Closes the observability blind spot where orphaned pytest subprocesses (launched by the orchestrator itself in merge_gate.retest_branch / coverage_gate.measure_coverage) survive a timeout-kill of the agent (-9, ORCH-109), reparent onto tini and live for days — starving CPU and failing merge-gate re-test without raising any alert (the established incident: test_install_lite_script.py processes lived >2 days unnoticed).

What changed (strictly inside the observer)

  • watchdog/collectors/proc.py (new, D3): stdlib-only /proc scan (under pid: host, the container /proc reflects the host PID namespace). Reads /proc/stat btime + SC_CLK_TCK, iterates numeric /proc/<pid>, matches cmdline by test-class pattern, parses /proc/<pid>/stat (field 22 starttime → age_s; fields 14+15 → cpu_s, informational). Read-only (no os.kill/signals/subprocess; never reads /proc/<pid>/environ), never-raise (per-pid race skipped; top-level → []). Pure parsers split from I/O.
  • watchdog/signals.py (D4): pure proc_signals builder — per-entity Signal("proc_blocking", pid), active ⇔ age_s > proc_age_s; actionable RU detail (PID + age + truncated cmdline + CPU time).
  • watchdog/core.py (D4/D7): opt-in tick() block gated on proc_enabled (disabled → zero overhead, byte-for-byte as before) + RECOVERY synthesis for a vanished process reusing the existing decision.decide()/AlertState (no new anti-spam logic — FR-5).
  • watchdog/config.py (D5): WATCHDOG_PROC_ENABLED (default false, opt-in) / WATCHDOG_PROC_AGE_MIN (60) / WATCHDOG_PROC_PATTERNS (pytest) / WATCHDOG_PROC_COOLDOWN_S (1800), never-raise parsers. Default threshold (3600s) exceeds max(merge_retest_timeout_s=600, coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
  • docker-compose.yml (D6): pid: host on orchestrator-watchdog only (read-only privilege; not a volume → read-only-mounts invariant intact).
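
The collector's arithmetic can be illustrated with a minimal sketch. This is not the code in watchdog/collectors/proc.py: the function names (parse_btime, parse_pid_stat, age_and_cpu, decode_cmdline, scan), the dict shape of a result, and the substring matching of patterns are all assumptions made for illustration. Only the /proc field positions (22 for starttime, 14 and 15 for utime and stime) and the use of btime and SC_CLK_TCK come from the description above.

```python
import os
import time


def parse_btime(proc_stat_text):
    """Boot time (epoch seconds) from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line in /proc/stat")


def parse_pid_stat(stat_text):
    """(utime, stime, starttime) in clock ticks from /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may contain spaces or ')',
    # so split only what follows the LAST ')'.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), hence field N lives at rest[N - 3].
    return int(rest[14 - 3]), int(rest[15 - 3]), int(rest[22 - 3])


def age_and_cpu(stat_text, btime, now, clk_tck):
    """Wall-clock age and consumed CPU time, both in seconds."""
    utime, stime, starttime = parse_pid_stat(stat_text)
    age_s = now - (btime + starttime / clk_tck)
    cpu_s = (utime + stime) / clk_tck
    return age_s, cpu_s


def decode_cmdline(raw):
    """/proc/<pid>/cmdline is NUL-separated bytes; join the args with spaces."""
    return " ".join(a.decode("utf-8", "replace") for a in raw.split(b"\0") if a)


def scan(proc_root="/proc", patterns=("pytest",), now=None):
    """Read-only scan; never raises, returns [] on any top-level failure."""
    try:
        now = time.time() if now is None else now
        clk_tck = os.sysconf("SC_CLK_TCK")
        with open(os.path.join(proc_root, "stat")) as f:
            btime = parse_btime(f.read())
        found = []
        for name in os.listdir(proc_root):
            if not name.isdigit():
                continue
            pid_dir = os.path.join(proc_root, name)
            try:
                with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
                    cmdline = decode_cmdline(f.read())
                # Assumption: a pattern matches as a plain substring.
                if not any(p in cmdline for p in patterns):
                    continue
                with open(os.path.join(pid_dir, "stat"), errors="replace") as f:
                    age_s, cpu_s = age_and_cpu(f.read(), btime, now, clk_tck)
            except (OSError, ValueError, IndexError):
                continue  # the process exited mid-scan or its stat was malformed
            found.append({"pid": int(name), "cmdline": cmdline,
                          "age_s": age_s, "cpu_s": cpu_s})
        return found
    except Exception:
        return []
```

Keeping the parsers free of I/O is what lets them be exercised on plain strings or on a fake /proc tree passed as proc_root, without a live process table.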

Invariants (AC-9)

src/**, /metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict keys and the DB schema are byte-for-byte untouched. Deploy rebuilds only the sidecar; prod orchestrator is not restarted (NFR-3). Anti-false-positive and no overlap with agent_hung are by construction (cmdline scope + age threshold), not fragile cross-namespace PID matching.

Canon (AC-10 / NFR-5)

WATCHDOG_PROC_* synced .env.watchdog.example ↔ .env.example block (key-sync test green); documented in docs/deployment/LITE_SETUP.md §4 and docs/architecture/README.md.
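
A key-sync check of this kind could look roughly like the sketch below. The helper names env_keys and out_of_sync are hypothetical, and the sketch assumes that commented-out entries such as "# KEY=value" count as declared keys; the repository's actual key-sync test may apply different rules.

```python
import re

# Matches "KEY=value" and commented-out "# KEY=value" example lines.
_KEY_LINE = re.compile(r"^\s*#?\s*([A-Za-z_][A-Za-z0-9_]*)\s*=")


def env_keys(text, prefix):
    """Set of keys starting with `prefix` that an env-example text declares."""
    keys = set()
    for line in text.splitlines():
        match = _KEY_LINE.match(line)
        if match and match.group(1).startswith(prefix):
            keys.add(match.group(1))
    return keys


def out_of_sync(text_a, text_b, prefix="WATCHDOG_PROC_"):
    """Keys present in exactly one of the two files (empty set means in sync)."""
    return env_keys(text_a, prefix) ^ env_keys(text_b, prefix)
```

In a test, the two texts would be read from .env.watchdog.example and .env.example, and the assertion would be that out_of_sync returns an empty set.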

Tests

  • tests/watchdog/test_proc_blocking_signal.py — TC-01…TC-06 (builder, anti-FP, config/kill-switch, never-raise/read-only AST scan, anti-spam+recovery cycle, no-dup-with-agent_hung).
  • tests/watchdog/test_proc_collector.py: /proc parsing fixtures (btime/pid-stat/cmdline/age/cpu/filtering/race).
  • tests/watchdog/test_tick_proc_blocking_integration.py — TC-07 tick→dispatch + kill-switch-off + in-budget + collector-explodes.
  • tests/watchdog/test_compose_service.py — positive pid: host (and prod must NOT have it); test_config_killswitch.py — proc-config defaults/env.
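
The read-only AST scan mentioned for TC-04 can be approximated with the standard ast module. The sketch below is an assumption about how such a check might be written, not the test in the repository: the forbidden sets and the function name read_only_violations are illustrative, and the string-literal check for "environ" is deliberately crude (it would also flag a docstring that mentions the word).

```python
import ast

FORBIDDEN_IMPORTS = {"subprocess", "signal"}
FORBIDDEN_ATTRS = {("os", "kill"), ("os", "killpg")}


def read_only_violations(source):
    """List constructs in `source` that would break the read-only contract."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in FORBIDDEN_IMPORTS:
                    violations.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] in FORBIDDEN_IMPORTS:
                violations.append(f"from {module} import ...")
            for alias in node.names:
                if (module, alias.name) in FORBIDDEN_ATTRS:
                    violations.append(f"{module}.{alias.name}")
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            if (node.value.id, node.attr) in FORBIDDEN_ATTRS:
                violations.append(f"{node.value.id}.{node.attr}")
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            # Crude guard against building a /proc/<pid>/environ path.
            if "environ" in node.value:
                violations.append(f"string literal {node.value!r}")
    return violations
```

A test would read the collector's source file and assert that the returned list is empty. A static scan cannot see dynamically constructed names, so it complements code review rather than replacing it.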

pytest tests/ -q → 1930 passed.

ADR: docs/work-items/ORCH-111/06-adr/ADR-001-watchdog-orphan-test-process-alert.md, cross-cutting docs/architecture/adr/adr-0041-watchdog-orphan-test-process-alert.md.

Refs: ORCH-111

🤖 Generated with Claude Code

admin added 6 commits 2026-06-15 02:14:25 +03:00
Close the observability gap between agent_hung (only tracked jobs by jobs.pid)
and orphaned pytest subprocesses the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps
running for days, starving CPU and failing merge-gate re-test — with no alert.

Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
  read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
  /proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
  ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
  byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
  existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
  COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
  coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).

Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
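
A minimal sketch of the "by construction" argument follows. The Signal fields beyond kind and entity, the English detail text (the real detail is in Russian), and the with_recovery helper are assumptions for illustration; the real builder lives in watchdog/signals.py, and RECOVERY goes through the existing decide()/AlertState.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    kind: str
    entity: int          # the PID acts as the per-entity key
    active: bool
    detail: str = ""


def proc_signals(procs, proc_age_s):
    """One signal per matched process; active only when strictly older than the threshold."""
    signals = []
    for p in procs:
        detail = (f"PID {p['pid']}, age {int(p['age_s'])}s, "
                  f"CPU {p['cpu_s']:.0f}s, cmd: {p['cmdline'][:80]}")
        signals.append(Signal("proc_blocking", p["pid"], p["age_s"] > proc_age_s, detail))
    return signals


def with_recovery(signals, previously_alerting):
    """Append an inactive signal for every alerting PID that has since vanished."""
    seen = {s.entity for s in signals}
    gone = sorted(set(previously_alerting) - seen)
    return signals + [Signal("proc_blocking", pid, False, "process no longer present")
                      for pid in gone]


# Default threshold: WATCHDOG_PROC_AGE_MIN=60 minutes.
DEFAULT_PROC_AGE_S = 60 * 60
# A run still inside its budget (600s merge re-test, 900s coverage) cannot cross it.
assert DEFAULT_PROC_AGE_S > max(600, 900)
```

Because the threshold sits above both orchestrator timeouts, a pytest process that is still legitimately running never produces an active signal, and no PID needs to be correlated across namespaces.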

Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).

Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=678
All checks were successful
CI / test (push) Successful in 4m22s
CI / test (pull_request) Successful in 4m27s
1fbfb941a9
admin force-pushed feature/ORCH-111-bug-watchdog-must-alert-on-lon from 7fe552909d to 1fbfb941a9 2026-06-15 02:14:25 +03:00
admin added 1 commit 2026-06-15 02:40:06 +03:00
chore(ORCH-111): retrigger merge-gate re-test (flaked under host CPU starvation)
All checks were successful
CI / test (push) Successful in 2m52s
CI / test (pull_request) Successful in 3m10s
4311720c39
The merge-gate re-test bounced ORCH-111 to development with 1 failed + 40
errors in 488s — a resource-exhaustion signature, NOT a code defect:

- This branch is watchdog-only (watchdog/** + compose); it touches no src/,
  no STAGE_TRANSITIONS/QG_CHECKS/check_*, and no tests/test_stage_engine.py.
- The failing tests (test_stage_engine.py::TestStagingInfraTolerance
  tc02/tc12/tc13/tc14) are outside this branch's scope, pass in isolation
  (5 passed/19s), and pass right after the new watchdog tests (105 passed).
  tc14 takes NO fixtures yet "errored" — a systemic/host failure, not logic.
- Host load was ~10-12 on a 4-core box at re-test time (the exact orphaned-
  pytest CPU-starvation incident ORCH-111 alerts on; ORCH-111 by design only
  observes, it does not reap — BR-3).

Evidence the branch is sound: full `pytest tests/` is green locally
(1933 passed, 0 failed, 0 errors in 267s, well under the 600s budget) and
Gitea CI on the branch HEAD is green (push + pull_request). Empty commit to
re-run the pipeline now that host load has dropped (10.5 -> 6).

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
admin added 1 commit 2026-06-15 02:43:36 +03:00
developer(ET): auto-commit from developer run_id=680
Some checks failed
CI / test (push) Has been cancelled
CI / test (pull_request) Successful in 2m50s
27b85144c2
admin added 1 commit 2026-06-15 02:43:44 +03:00
deploy(ORCH-036): finalize FAILED for ORCH-111
All checks were successful
CI / test (push) Successful in 3m0s
CI / test (pull_request) Successful in 3m0s
007a9ad47d
admin added 1 commit 2026-06-15 08:31:51 +03:00
reviewer(ET): auto-commit from reviewer run_id=681
All checks were successful
CI / test (push) Successful in 3m48s
CI / test (pull_request) Successful in 4m48s
521a72e702
admin added 1 commit 2026-06-15 08:43:59 +03:00
tester(ET): auto-commit from tester run_id=682
All checks were successful
CI / test (push) Successful in 3m3s
CI / test (pull_request) Successful in 3m13s
3f16b77d2b
admin added 1 commit 2026-06-15 08:47:49 +03:00
deploy-staging(ORCH-111): staging gate SUCCESS (8/10 PASS, C9a/C9b infra-waived)
All checks were successful
CI / test (push) Successful in 3m31s
CI / test (pull_request) Successful in 4m15s
d1e8346605
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
admin added 1 commit 2026-06-15 09:13:19 +03:00
chore(ORCH-111): retrigger merge-gate re-test (2nd host CPU-starvation flake)
Some checks failed
CI / test (push) Has been cancelled
CI / test (pull_request) Successful in 3m1s
2d0d654022
The deploy-edge merge-gate re-test bounced ORCH-111 back to development again
with `3 failed, 1916 passed, 14 errors in 444.79s` — a resource-exhaustion
signature, NOT a code defect. This is the SECOND occurrence of the identical
flake on this branch (cf. 4311720).

Evidence the branch is sound:
- Watchdog-only change (watchdog/** + docker-compose.yml + docs). It touches no
  src/, no STAGE_TRANSITIONS/QG_CHECKS/check_*, and none of the failing test
  files (tests/test_stage_engine.py, tests/test_orch109_timeout_model.py).
- The failures/errors are OUTSIDE this branch's scope:
  test_stage_engine.py::TestStagingInfraTolerance tc02/tc13/tc14 and
  test_orch109_timeout_model.py::TestContractsUnchanged::test_tc12. They pass in
  isolation (4 passed/5.9s) and were ERRORS (subprocess timeouts), not assertion
  failures — a systemic host failure, not logic.
- No pytest-randomly/-xdist installed -> deterministic order; merge-gate re-test
  and a local run execute the same order on the same code.
- The failed run took 444.79s vs a clean local full run of 204.72s (2x slower):
  the orphaned-pytest CPU-starvation incident ORCH-111 itself alerts on. By
  design ORCH-111 only observes; it does not reap (ADR BR-3).

Full `pytest tests/` is green locally: 1933 passed, 0 failed, 0 errors in
204.72s (well under the 600s merge_retest budget), and the local run was FASTER
than the prior retrigger's (267s) -> host load is currently low. Empty commit to
re-run CI + the pipeline now.

NOTE (operator): until the orphaned host pytest processes are cleaned up, the
merge-gate re-test can keep flaking. ORCH-111 detects them (proc_blocking,
default-off) but does not reap them (BR-3) -> manual host cleanup is the durable
fix; a follow-up work item for reap/remediation is recommended.

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
admin added 1 commit 2026-06-15 09:14:13 +03:00
deploy(ORCH-036): finalize SUCCESS for ORCH-111
All checks were successful
CI / test (push) Successful in 2m41s
CI / test (pull_request) Successful in 3m12s
da599e8736
admin merged commit b6c9d27e9c into main 2026-06-15 09:14:22 +03:00
Reference: admin/orchestrator#130