fix(launcher): raise developer/reviewer timeout budgets + stamp model at launch
Two additive, isolated launch-subsystem fixes from incident ORCH-104. They do not touch STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema.

D1 (launch-time model stamp): write the resolved model into agent_runs.model in the SAME UPDATE as the effort stamp (ORCH-087), so the model is present from launch, survives a timeout-kill (exit_code=-9), and is visible in-flight in /metrics & /queue. record_usage stays an enrichment (model=COALESCE preserves the launch stamp when the usage JSON model is None). Never raises (isolated try/except).

D3/D4 (dedicated per-role budgets): agent_timeout_developer_s=3600 / agent_timeout_reviewer_s=3000 with a deterministic _resolve_timeout ladder (overrides_json[agent] > dedicated role key > agent_timeout_seconds=1800; other roles are unchanged, byte-for-byte). A malformed or non-positive config value falls back to the global default with a WARNING (never-break). reaper_max_running_s is raised 3600 -> 5400 in lockstep to keep the ORCH-065 invariant (5400 > 3600 + 20 = 3620).

FR-4 (kill / in-flight visibility) and FR-5 (anti-salvage) are structural in the existing code; they are pinned here by regression tests (tests/test_orch109_timeout_model.py, TC-01..TC-12).

Docs: .env.example, config passport, CHANGELOG, CLAUDE.md (README/internals authored by architect in this branch).

Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
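The D1 behavior described in the commit message can be illustrated with a minimal, self-contained sketch. It assumes a simplified agent_runs table (only id, model, effort, exit_code) in an in-memory SQLite database. The helper names stamp_launch, record_usage_model and model_of, and the model strings, are hypothetical and are not the project's actual functions.

```python
import sqlite3


def stamp_launch(conn, run_id, model, effort):
    """Write model and effort in one UPDATE; empty strings become NULL."""
    conn.execute(
        "UPDATE agent_runs SET model=?, effort=? WHERE id=?",
        (model or None, effort or None, run_id),
    )
    conn.commit()


def record_usage_model(conn, run_id, usage_model):
    """Post-hoc enrichment: COALESCE keeps the stored value when usage_model is None."""
    conn.execute(
        "UPDATE agent_runs SET model=COALESCE(?, model) WHERE id=?",
        (usage_model, run_id),
    )
    conn.commit()


def model_of(conn, run_id):
    """Read back the stored model for one run."""
    row = conn.execute("SELECT model FROM agent_runs WHERE id=?", (run_id,)).fetchone()
    return row[0]


# Simplified schema, assumed for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE agent_runs "
    "(id INTEGER PRIMARY KEY, model TEXT, effort TEXT, exit_code INTEGER)"
)
conn.executemany("INSERT INTO agent_runs (id) VALUES (?)", [(1,), (2,), (3,)])

# Run 1: stamped at launch, then killed on timeout; no usage JSON is recorded.
stamp_launch(conn, 1, "example-model", "high")
conn.execute("UPDATE agent_runs SET exit_code=-9 WHERE id=1")

# Run 2: stamped at launch; the usage JSON carries no model (None).
stamp_launch(conn, 2, "example-model", "")
record_usage_model(conn, 2, None)

# Run 3: empty resolve (no model configured), so the launch stamp is NULL.
stamp_launch(conn, 3, "", "")
```

In this sketch run 1 keeps its model even though only the launch stamp ever ran, run 2 keeps the launch stamp because COALESCE ignores the None from the usage step, and run 3 stays NULL until a later usage record supplies a model.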
@@ -563,14 +563,26 @@ class AgentLauncher:
         # so this is the only reliable source for the tracker's "· model · effort"
         # line. Empty resolve (no --effort flag) -> NULL so the suffix is omitted.
         # Reuses the still-open conn; never blocks the launch.
+        #
+        # ORCH-109 (D1): stamp the resolved MODEL in the SAME UPDATE. Previously the
+        # model was only written post-hoc from the final usage-JSON (usage.record_usage,
+        # model=COALESCE(?, model)); a timeout-killed run never emits that JSON, so the
+        # model stayed NULL exactly when an incident needs it. Resolving it here is
+        # deterministic (resolve_agent_model above), so the value is present from launch,
+        # survives a timeout-kill (-9), and is visible in-flight in /metrics & /queue.
+        # The post-hoc record_usage stays an ENRICHMENT (COALESCE keeps the launch stamp
+        # when the JSON model is None/missing). Empty resolve (model == "", CLI default
+        # with no --model) -> NULL, symmetric with `effort or None`, so the tracker's
+        # model suffix is correctly omitted. never-raise: failure is isolated + WARNING;
+        # the launch continues (model_flag is built from the local `model`, not the DB).
         try:
             conn.execute(
-                "UPDATE agent_runs SET effort=? WHERE id=?",
-                (effort or None, run_id),
+                "UPDATE agent_runs SET model=?, effort=? WHERE id=?",
+                (model or None, effort or None, run_id),
             )
             conn.commit()
         except Exception as e:
-            logger.warning(f"effort stamp failed for run_id={run_id}: {e}")
+            logger.warning(f"model/effort stamp failed for run_id={run_id}: {e}")
         model_flag = f"--model {model} " if model else ""
         effort_flag = f"--effort {effort} " if effort else ""
         # ORCH-074 (G2): agent_fallback_model is read directly here, bypassing
@@ -658,16 +670,34 @@ class AgentLauncher:
         notify_agent_started(run_id, agent, task_id)
         return run_id

+    # ORCH-109 (D3): dedicated raised-budget keys for the two HEAVY roles. Maps the
+    # role to its Settings attribute; resolved BELOW the operator JSON escape-hatch
+    # and ABOVE the global default. A role absent here keeps the global default.
+    _TIMEOUT_ROLE_KEYS = {
+        "developer": "agent_timeout_developer_s",
+        "reviewer": "agent_timeout_reviewer_s",
+    }
+
     @staticmethod
     def _resolve_timeout(agent: str = None) -> int:
-        """ORCH-7 (M-2): resolve the wall-clock timeout for an agent.
+        """ORCH-7 (M-2) + ORCH-109 (D3): resolve the wall-clock timeout for an agent.

-        Per-agent override from settings.agent_timeout_overrides_json (a JSON object
-        like {"reviewer": 3600}) wins; otherwise the global default
-        settings.agent_timeout_seconds is used. A malformed override JSON is ignored
-        (falls back to the default) and only logged, so a bad env never bricks runs.
+        Deterministic priority ladder (highest first):
+          1. settings.agent_timeout_overrides_json[agent] -- operator escape-hatch,
+             wins for ANY role (full BC). A malformed JSON is ignored + logged.
+          2. dedicated per-role key (ORCH-109): developer -> agent_timeout_developer_s
+             (3600), reviewer -> agent_timeout_reviewer_s (3000). A non-positive /
+             non-int value is ignored + logged (never-break) and falls through to (3).
+          3. settings.agent_timeout_seconds -- the global default (1800) for every
+             other role (analyst/architect/tester/deployer), byte-for-byte as before.
+
+        Never raises: any bad config degrades to the global default so a bad env
+        never bricks runs. Cross-invariant (ORCH-065): max(resolved) + grace must
+        stay < reaper_max_running_s (raised to 5400 in lockstep; see config.py).
         """
         default = settings.agent_timeout_seconds
+
+        # (1) operator JSON override -- highest priority, unchanged semantics.
         raw = (settings.agent_timeout_overrides_json or "").strip()
         if agent and raw:
             try:
@@ -676,6 +706,22 @@ class AgentLauncher:
                     return int(overrides[agent])
             except (ValueError, TypeError) as e:
                 logger.warning(f"Invalid agent_timeout_overrides_json, using default: {e}")
+
+        # (2) dedicated per-role raised budget (ORCH-109 D3/D4).
+        key = AgentLauncher._TIMEOUT_ROLE_KEYS.get(agent)
+        if key is not None:
+            try:
+                value = int(getattr(settings, key))
+                if value > 0:
+                    return value
+                logger.warning(
+                    f"Non-positive {key}={value!r}; falling back to "
+                    f"agent_timeout_seconds={default}"
+                )
+            except (ValueError, TypeError) as e:
+                logger.warning(f"Invalid {key} for agent '{agent}', using default: {e}")
+
+        # (3) global default.
         return default

     def _watchdog(self, pid: int, run_id: int, timeout: int = None,

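The priority ladder in _resolve_timeout can be exercised in isolation with a sketch like the following. It is not the project's code: SketchSettings is a hypothetical stand-in for the real Settings class, resolve_timeout is a free function rather than a static method, and the two lines that parse the override JSON are assumed, because the diff elides them between hunks.

```python
import json
import logging
from dataclasses import dataclass
from typing import Any, Optional

logger = logging.getLogger("timeout_sketch")


@dataclass
class SketchSettings:
    """Hypothetical stand-in for Settings; defaults copied from the diff."""
    agent_timeout_seconds: int = 1800
    agent_timeout_overrides_json: str = ""
    agent_timeout_developer_s: Any = 3600
    agent_timeout_reviewer_s: Any = 3000


ROLE_KEYS = {
    "developer": "agent_timeout_developer_s",
    "reviewer": "agent_timeout_reviewer_s",
}


def resolve_timeout(settings: SketchSettings, agent: Optional[str] = None) -> int:
    default = settings.agent_timeout_seconds

    # (1) operator JSON override: highest priority, for any role.
    raw = (settings.agent_timeout_overrides_json or "").strip()
    if agent and raw:
        try:
            overrides = json.loads(raw)  # assumed; these lines are elided in the diff
            if isinstance(overrides, dict) and agent in overrides:
                return int(overrides[agent])
        except (ValueError, TypeError) as e:
            logger.warning("Invalid overrides JSON, ignoring it: %s", e)

    # (2) dedicated per-role key; a non-positive or non-int value falls through.
    key = ROLE_KEYS.get(agent)
    if key is not None:
        try:
            value = int(getattr(settings, key))
            if value > 0:
                return value
            logger.warning("Non-positive %s=%r; using default %s", key, value, default)
        except (ValueError, TypeError) as e:
            logger.warning("Invalid %s for agent %r, using default: %s", key, agent, e)

    # (3) global default for every other role.
    return default
```

One detail worth noticing in this sketch: a malformed override JSON does not send the developer or reviewer straight to the global default. It only skips step (1), so those two roles still receive their dedicated budget from step (2).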
@@ -120,10 +120,28 @@ class Settings(BaseSettings):
     #                                 (env ORCH_AGENT_KILL_GRACE_SECONDS).
     # agent_timeout_overrides_json -> optional per-agent override JSON object,
     #                                 e.g. {"reviewer": 3600, "architect": 2700}
-    #                                 (env ORCH_AGENT_TIMEOUT_OVERRIDES_JSON).
+    #                                 (env ORCH_AGENT_TIMEOUT_OVERRIDES_JSON). HIGHEST
+    #                                 priority escape-hatch in _resolve_timeout (wins for
+    #                                 any role).
+    # ORCH-109 (D3/D4): raised wall-clock budgets for the two HEAVY roles.
+    # agent_timeout_developer_s    -> developer is the bottleneck (effort xhigh,
+    #                                 coding/agentic); 3600s/60m (env
+    #                                 ORCH_AGENT_TIMEOUT_DEVELOPER_S).
+    # agent_timeout_reviewer_s     -> reviewer reads a large diff + writes the review
+    #                                 (high reasoning); 3000s/50m (env
+    #                                 ORCH_AGENT_TIMEOUT_REVIEWER_S).
+    # _resolve_timeout ladder: overrides_json[agent] > dedicated role key >
+    # agent_timeout_seconds (other roles stay at 1800, byte-for-byte). A malformed
+    # JSON / non-positive dedicated value falls back to agent_timeout_seconds +
+    # WARNING (never-break). The defaults ARE the prod budget (ORCH-101 canon: empty
+    # .env reproduces prod). CROSS-INVARIANT (ORCH-065): reaper_max_running_s MUST
+    # stay > max(resolved timeout) + agent_kill_grace_seconds; raised in lockstep to
+    # 5400 below (5400 > 3600 + 20 = 3620).
     agent_timeout_seconds: int = 1800
     agent_kill_grace_seconds: int = 20
     agent_timeout_overrides_json: str = ""
+    agent_timeout_developer_s: int = 3600
+    agent_timeout_reviewer_s: int = 3000

     # ORCH-41: per-agent LLM model. Empty -> agent_model_default. Resolution order:
     # project-override (projects_json agent_models) > ORCH_AGENT_MODEL_<AGENT> >
@@ -480,6 +498,9 @@ class Settings(BaseSettings):
     # reaper_max_running_s    -> Tier-3 backstop ceiling: a job 'running' longer than
     #                            this is reaped even when liveness is unknowable. MUST be
     #                            > max agent_timeout + grace so a legit agent is safe.
+    #                            ORCH-109 (D4): raised 3600 -> 5400 in lockstep with the
+    #                            developer budget (5400 > 3600 + 20 = 3620; headroom 1780s
+    #                            also covers the monitor finalization window).
     # reaper_finalize_grace_s -> Tier-2 anti-false-positive: a LIVE monitor writes
     #                            agent_runs.exit_code FIRST, THEN does git commit/push +
     #                            PR + Plane usage comments (seconds..minutes) and only
@@ -494,7 +515,7 @@ class Settings(BaseSettings):
     reaper_enabled: bool = True
     reaper_interval_s: int = 60
     reaper_dead_ticks: int = 2
-    reaper_max_running_s: int = 3600
+    reaper_max_running_s: int = 5400
     reaper_finalize_grace_s: int = 300
     lease_reclaim_enabled: bool = True

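The ORCH-065 cross-invariant (reaper_max_running_s > max resolved timeout + agent_kill_grace_seconds) is plain arithmetic, so it can be checked with a small sketch. The function name reaper_headroom and the budgets dictionary are hypothetical; the numbers are the defaults shown in this diff.

```python
def reaper_headroom(timeouts, grace_s, reaper_max_running_s):
    """Seconds of slack before the reaper ceiling; positive means the invariant holds."""
    return reaper_max_running_s - (max(timeouts) + grace_s)


# Default budgets from the diff (seconds).
budgets = {"developer": 3600, "reviewer": 3000, "other roles": 1800}
grace_s = 20  # agent_kill_grace_seconds

# New ceiling (5400) versus the previous one (3600).
headroom_new = reaper_headroom(budgets.values(), grace_s, 5400)
headroom_old = reaper_headroom(budgets.values(), grace_s, 3600)

print(f"new ceiling headroom: {headroom_new}s")
print(f"old ceiling headroom: {headroom_old}s")
```

With the old ceiling of 3600 the developer budget alone would overshoot it by 20 seconds, which is why the ceiling moves in the same commit. This check covers only the defaults: a larger value supplied through agent_timeout_overrides_json would have to be added to the timeouts for the result to remain meaningful.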