fix(launcher): raise developer/reviewer timeout budgets + stamp model at launch
All checks were successful
CI / test (push) Successful in 4m24s
Two additive, isolated launch-subsystem fixes from incident ORCH-104, without touching STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema.

D1 (launch-time model stamp): write the resolved model into agent_runs.model in the SAME UPDATE as the effort stamp (ORCH-087), so the model is present from launch, survives a timeout-kill (exit_code=-9), and is visible in-flight in /metrics & /queue. record_usage stays an enrichment (model=COALESCE preserves the launch stamp when the usage JSON model is None). The stamp is never-raise (isolated try/except).

D3/D4 (dedicated per-role budgets): agent_timeout_developer_s=3600 / agent_timeout_reviewer_s=3000 with a deterministic _resolve_timeout ladder (overrides_json[agent] > dedicated role key > agent_timeout_seconds=1800; other roles are unchanged byte-for-byte). Malformed/non-positive config falls back to the global default + WARNING (never-break). reaper_max_running_s is raised 3600 -> 5400 in lockstep to keep the ORCH-065 invariant (5400 > 3600 + 20 = 3620).

FR-4 (kill / in-flight visibility) and FR-5 (anti-salvage) are structural in the existing code; they are pinned here by regression tests (tests/test_orch109_timeout_model.py, TC-01..TC-12).

Docs: .env.example, config passport, CHANGELOG, CLAUDE.md (README/internals authored by architect in this branch).

Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
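To make the D1 behaviour concrete, here is a minimal sketch of a launch-time stamp followed by a COALESCE enrichment, using an in-memory SQLite table. Only the agent_runs table, its model and exit_code columns, the record_usage name, and the COALESCE idea come from the commit message; the other columns (id, effort, tokens), the stamp_launch helper, and both signatures are assumptions for illustration, not the launcher's actual code.

```python
import sqlite3


def stamp_launch(conn, run_id, model, effort):
    """Write model and effort in one UPDATE. Never raises into the caller."""
    try:
        conn.execute(
            "UPDATE agent_runs SET model = ?, effort = ? WHERE id = ?",
            (model, effort, run_id),
        )
        conn.commit()
        return True
    except Exception:
        # never-raise: a failed stamp must not block the launch
        return False


def record_usage(conn, run_id, usage):
    """Enrich the row after the run; a None model keeps the launch stamp."""
    conn.execute(
        "UPDATE agent_runs SET model = COALESCE(?, model), tokens = ? WHERE id = ?",
        (usage.get("model"), usage.get("tokens"), run_id),
    )
    conn.commit()


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: only model and exit_code are named in the source.
    conn.execute(
        "CREATE TABLE agent_runs ("
        "id INTEGER PRIMARY KEY, model TEXT, effort TEXT, "
        "exit_code INTEGER, tokens INTEGER)"
    )
    conn.execute("INSERT INTO agent_runs (id) VALUES (1)")
    stamp_launch(conn, 1, "opus-4-8", "xhigh")
    # Simulate a timeout kill: the run ends with exit_code=-9 and no usage model.
    conn.execute("UPDATE agent_runs SET exit_code = -9 WHERE id = 1")
    record_usage(conn, 1, {"model": None, "tokens": 0})
    return conn.execute(
        "SELECT model, exit_code FROM agent_runs WHERE id = 1"
    ).fetchone()
```

In this sketch, demo() returns ("opus-4-8", -9): the model written at launch is still on the row after the simulated kill, because COALESCE(NULL, model) evaluates to the existing value.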
28
.env.example
@@ -107,6 +107,30 @@ ORCH_AGENT_EFFORT_DEPLOYER=medium
 # (G4 NOT enabled, ADR-001 ORCH-74: determinism — all agents stay on opus-4-8). A
 # non-empty value is validated by the SAME predicate as the model; a typo is dropped.
 ORCH_AGENT_FALLBACK_MODEL=
+
+# ── Agent timeout / wall-clock budgets (ORCH-7, raised per-role ORCH-109) ─────
+# The in-process watchdog kills a run that exceeds its wall-clock budget
+# (SIGTERM -> grace -> SIGKILL, exit_code=-9). _resolve_timeout ladder (highest
+# first): OVERRIDES_JSON[agent] > dedicated role key > SECONDS (global default).
+# SECONDS -> global default budget for every role WITHOUT a raised
+# key (analyst/architect/tester/deployer).
+# KILL_GRACE_SECONDS -> pause between SIGTERM and SIGKILL so claude can flush
+# artifacts before the hard kill.
+# OVERRIDES_JSON -> optional per-agent override object, e.g.
+# {"reviewer":3600,"architect":2700}; wins for ANY role.
+# Malformed JSON -> ignored + WARNING (never-break).
+# ORCH-109: the two HEAVY roles get raised dedicated budgets (defaults = prod, so an
+# empty .env reproduces prod — ORCH-101 canon). A non-positive value falls back to
+# SECONDS + WARNING.
+# DEVELOPER_S -> developer budget (xhigh, coding/agentic bottleneck), 60m.
+# REVIEWER_S -> reviewer budget (large diff + high reasoning), 50m.
+# CROSS-INVARIANT (ORCH-065): ORCH_REAPER_MAX_RUNNING_S MUST stay > max(budget)+grace;
+# it is raised to 5400 in lockstep below (5400 > 3600 + 20 = 3620).
+ORCH_AGENT_TIMEOUT_SECONDS=1800
+ORCH_AGENT_KILL_GRACE_SECONDS=20
+ORCH_AGENT_TIMEOUT_OVERRIDES_JSON=
+ORCH_AGENT_TIMEOUT_DEVELOPER_S=3600
+ORCH_AGENT_TIMEOUT_REVIEWER_S=3000
 # ORCH-042/ORCH-067: live-tracker mode. bump (DEFAULT since ORCH-067) -> on every
 # update the old card is deleted and a fresh one is sent silently to the BOTTOM of
 # the chat (deleteMessage + sendMessage + repoint), so the current status is always
@@ -365,6 +389,8 @@ ORCH_PLANE_STATES_TTL_S=300
 # REAPER_INTERVAL_S -> background scan period (seconds).
 # REAPER_DEAD_TICKS -> consecutive dead-pid ticks before reaping (Tier-1, >=2).
 # REAPER_MAX_RUNNING_S -> Tier-3 backstop ceiling; must exceed max agent_timeout+grace.
+# ORCH-109: raised 3600 -> 5400 in lockstep with the developer
+# budget (5400 > 3600 + 20 = 3620).
 # REAPER_FINALIZE_GRACE_S -> Tier-2 grace: how long agent_runs.exit_code must have been
 # recorded before a still-'running' job is reaped; MUST exceed
 # the max finalization window (git push + PR + Plane comments).
@@ -374,7 +400,7 @@ ORCH_PLANE_STATES_TTL_S=300
 ORCH_REAPER_ENABLED=true
 ORCH_REAPER_INTERVAL_S=60
 ORCH_REAPER_DEAD_TICKS=2
-ORCH_REAPER_MAX_RUNNING_S=3600
+ORCH_REAPER_MAX_RUNNING_S=5400
 ORCH_REAPER_FINALIZE_GRACE_S=300
 ORCH_LEASE_RECLAIM_ENABLED=true
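The resolution ladder and the reaper cross-invariant described in the .env.example comments above can be sketched as follows. This is an illustration under stated assumptions, not the project's _resolve_timeout: the lowercase config keys mirror the names in the commit message, but the dict-based config, the helper names, and the choice to let an invalid override entry fall through to the next rung are all hypothetical.

```python
import json
import logging

log = logging.getLogger("launcher")

# Defaults taken from the diff; a missing key reproduces these values.
DEFAULTS = {
    "agent_timeout_seconds": 1800,
    "agent_kill_grace_seconds": 20,
    "agent_timeout_overrides_json": "",
    "agent_timeout_developer_s": 3600,
    "agent_timeout_reviewer_s": 3000,
    "reaper_max_running_s": 5400,
}
ROLE_KEYS = {
    "developer": "agent_timeout_developer_s",
    "reviewer": "agent_timeout_reviewer_s",
}
ROLES = ("analyst", "architect", "developer", "reviewer", "tester", "deployer")


def _positive_int(value):
    """Return value as a positive int, or None if malformed or non-positive."""
    try:
        n = int(value)
    except (TypeError, ValueError):
        return None
    return n if n > 0 else None


def resolve_timeout(agent, user_cfg=None):
    """Ladder: overrides_json[agent] > dedicated role key > global default."""
    cfg = {**DEFAULTS, **(user_cfg or {})}
    global_s = _positive_int(cfg["agent_timeout_seconds"]) or DEFAULTS["agent_timeout_seconds"]

    raw = cfg["agent_timeout_overrides_json"] or ""
    if raw.strip():
        try:
            overrides = json.loads(raw)
        except ValueError:
            log.warning("malformed overrides JSON ignored")
            overrides = {}
        if isinstance(overrides, dict) and agent in overrides:
            n = _positive_int(overrides[agent])
            if n is not None:
                return n
            # Assumption: a bad override entry falls through to the next rung.
            log.warning("invalid override for %s ignored", agent)

    key = ROLE_KEYS.get(agent)
    if key is not None:
        n = _positive_int(cfg[key])
        if n is not None:
            return n
        log.warning("non-positive %s, using global default", key)
    return global_s


def reaper_invariant_holds(user_cfg=None):
    """ORCH-065: reaper ceiling must exceed the largest budget plus kill grace."""
    cfg = {**DEFAULTS, **(user_cfg or {})}
    worst = max(resolve_timeout(role, user_cfg) for role in ROLES)
    return int(cfg["reaper_max_running_s"]) > worst + int(cfg["agent_kill_grace_seconds"])
```

With the defaults from the diff, resolve_timeout("developer") yields 3600 and reaper_invariant_holds() is true (5400 > 3620). Passing {"reaper_max_running_s": 3600} makes the check false, which is the situation the lockstep raise avoids.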
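The watchdog sequence named in the .env.example comments (SIGTERM -> grace -> SIGKILL, exit_code=-9) can be sketched with the standard subprocess module. This is a POSIX-only illustration; run_with_budget and its parameters are hypothetical names, and the real watchdog may be structured differently.

```python
import subprocess
import sys


def run_with_budget(cmd, budget_s, grace_s):
    """Run cmd; on budget overrun send SIGTERM, wait grace_s, then SIGKILL.

    Returns the child's return code. On POSIX a child killed by a signal
    reports the negative signal number, so a hard kill yields -9.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=budget_s)
    except subprocess.TimeoutExpired:
        proc.terminate()  # SIGTERM: give the child a chance to flush artifacts
        try:
            return proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL: the child ignored SIGTERM for the whole grace period
            return proc.wait()


# A child that ignores SIGTERM, so only the hard kill stops it.
STUBBORN_CHILD = [
    sys.executable,
    "-c",
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(30)",
]
```

Calling run_with_budget(STUBBORN_CHILD, 2.0, 0.5) on a POSIX system is expected to return -9, since the child survives SIGTERM and is stopped by SIGKILL after the grace period; a child that finishes within its budget simply returns its own exit code.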