ORCH-7: cleanup + hardening (M-4 dead code + M-2 graceful timeout) #4

Merged
admin merged 3 commits from feature/ORCH-7-hardening into main 2026-06-03 08:31:26 +03:00
Owner

ORCH-7: cleanup + hardening (M-4 + M-2). Small focused PR, no pipeline behavior change.

M-4 — remove dead code

  • _auto_merge_pr had 0 callers (merging is the deployer agent's job). Method removed. _ensure_pr (used by auto-advance) kept. grep auto_merge src/ = 0 matches.

M-2 — graceful timeout + configurable

  • _watchdog: SIGTERM -> poll os.kill(pid,0) up to agent_kill_grace_seconds -> SIGKILL only if still alive. ProcessLookupError tolerated at every step. Recorded exit_code stays -9 so ORCH-1 retry/fail is unchanged (timeout-kill classifies permanent, bounded requeue, no loop).
  • config.py: agent_timeout_seconds=1800, agent_kill_grace_seconds=20, agent_timeout_overrides_json="" (per-agent JSON override). AGENT_TIMEOUT kept as backward-compat alias. _resolve_timeout(agent) picks override else default.
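
A minimal sketch of the SIGTERM -> poll -> SIGKILL sequence described for _watchdog, not the actual implementation. The function name terminate_gracefully, the poll_interval parameter, and the injectable kill/sleep/clock arguments are assumptions added for illustration and testability; recording the exit code is left out. It is POSIX-only, since signal.SIGKILL does not exist on Windows.

```python
import os
import signal
import time


def terminate_gracefully(pid, grace_seconds=20.0, poll_interval=0.1,
                         kill=os.kill, sleep=time.sleep, clock=time.monotonic):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL if still alive.

    Returns the list of signals that were actually delivered.
    All names here are hypothetical; only the ordering mirrors the PR text.
    """
    sent = []
    try:
        kill(pid, signal.SIGTERM)
        sent.append(signal.SIGTERM)
    except ProcessLookupError:
        return sent  # process was already gone before SIGTERM

    deadline = clock() + grace_seconds
    while clock() < deadline:
        try:
            kill(pid, 0)  # signal 0 only checks that the pid exists
        except ProcessLookupError:
            return sent  # exited within the grace period, no SIGKILL needed
        sleep(poll_interval)

    try:
        kill(pid, signal.SIGKILL)
        sent.append(signal.SIGKILL)
    except ProcessLookupError:
        pass  # exited between the last poll and SIGKILL
    return sent
```

One caveat with this approach: an exited child that has not been reaped yet still answers os.kill(pid, 0), so a real watchdog would likely also need to wait on the child.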

Tests

  • New: watchdog SIGTERM->SIGKILL ordering; graceful-exit-in-grace skips SIGKILL; already-dead-before-SIGTERM tolerated; _resolve_timeout override/default/malformed-JSON.
  • Baseline: 118 passed (110 + 8 new), 9 pre-existing webhook-401 untouched.

Deploy

  • Rebuilt from branch; /health ok; /queue ok (breaker closed, preflight ok).

Do NOT merge — Стрим merges after review.

admin added 3 commits 2026-06-03 08:30:50 +03:00
_auto_merge_pr had zero callers (merge is handled by the deployer agent).
Removed the method; _ensure_pr (still used by the auto-advance path) is kept.
The watchdog used to time.sleep(timeout) then immediately SIGKILL, which cut
claude off mid-write and left half-written artifacts. It now sends SIGTERM,
polls os.kill(pid, 0) for up to agent_kill_grace_seconds, and only sends SIGKILL if
the process is still alive; ProcessLookupError is tolerated at every step.

Timeout is now configurable via config.py: agent_timeout_seconds (default 1800),
agent_kill_grace_seconds (default 20), and agent_timeout_overrides_json for
per-agent overrides (e.g. {"reviewer": 3600}). AGENT_TIMEOUT is kept as a
backward-compatible alias. The recorded exit_code stays -9 so the ORCH-1
monitor retry/fail logic is unchanged (timeout-kills classify as permanent and
requeue within max_attempts, no retry loop).
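
A rough sketch of how a per-agent override lookup with a malformed-JSON fallback could work. This is an illustration under assumptions, not the code from this PR: the standalone resolve_timeout signature, the DEFAULT_TIMEOUT_SECONDS constant, and the rejection of non-numeric or non-positive values are all hypothetical. Only the default of 1800 and the {"reviewer": 3600} example come from the commit message.

```python
import json

DEFAULT_TIMEOUT_SECONDS = 1800  # matches the stated agent_timeout_seconds default


def resolve_timeout(agent, overrides_json="", default=DEFAULT_TIMEOUT_SECONDS):
    """Return the per-agent timeout if a valid override exists, else the default."""
    if not overrides_json:
        return default
    try:
        overrides = json.loads(overrides_json)
    except json.JSONDecodeError:
        return default  # malformed JSON falls back to the default
    if not isinstance(overrides, dict):
        return default  # e.g. a JSON list or bare number is not a mapping
    value = overrides.get(agent)
    # Assumption: only positive numbers count as valid overrides (bool excluded).
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        return default
    return value
```

For example, resolve_timeout("reviewer", '{"reviewer": 3600}') yields 3600, while an agent without an entry, an empty string, or unparsable JSON all yield 1800.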
Cover M-2: SIGTERM-before-SIGKILL ordering, graceful exit within grace skips
SIGKILL, ProcessLookupError before SIGTERM is tolerated (no _record_kill), and
_resolve_timeout per-agent override / default / malformed-JSON fallback.
Cover M-4: _auto_merge_pr removed, _ensure_pr retained.
admin merged commit fd554c8a5a into main 2026-06-03 08:31:26 +03:00

Reference: admin/orchestrator#4