feat(preflight): catch logged-out auth and treat empty result as failure
ORCH-044 closes two blind spots that let a single de-authenticated agent
stall the shared queue for all projects:
P1 — preflight auth gate. `claude --version` answers even when logged out,
so version-only preflight was blind to auth. Adds a token-free, network-free
check of <AGENT_HOME>/.claude/.credentials.json: missing/unreadable/no-oauth
or an expired `claudeAiOauth.expiresAt` (epoch ms, vs now + skew) => preflight
FAIL; absent expiry => OK (no false positives). Result is cached on the same
preflight_cache_ttl. Post-factum safety net: launcher detects auth markers
("not logged in" / "/login" / "unauthorized" / 401) in the run log and resets
the preflight cache so the next tick re-evaluates auth. Auth failure is a gate,
not a transient — it does not spin the circuit breaker. Emergency toggle
ORCH_PREFLIGHT_CHECK_AUTH=false restores version-only behaviour.
P3 — empty log / no result-JSON => job failed. exit_code==0 with an empty or
JSON-less run log no longer counts as success: a separate result_ok flag gates
stage advance + usage comments, fires a Telegram alert, and routes the job
through the normal transient/permanent failure path (exit_code integrity in
agent_runs preserved).
Scope: P2 (--effort) is intentionally excluded and tracked in ORCH-50.
New settings: ORCH_PREFLIGHT_CHECK_AUTH, ORCH_CLAUDE_CREDENTIALS_PATH,
ORCH_AUTH_EXPIRY_SKEW_SECONDS. Docs updated (INFRA.md, internals.md, CHANGELOG).
Refs: ORCH-044
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -64,6 +64,25 @@ class Settings(BaseSettings):
|
||||
# breaker_threshold -> consecutive transient failures that OPEN the breaker.
|
||||
# breaker_pause_seconds -> how long the breaker stays open before half-open.
|
||||
preflight_cache_ttl: int = 45
|
||||
# ORCH-044 (P1): token-free preflight auth gate. After `claude --version`
|
||||
# succeeds, preflight also checks that claude is logged in by reading the
|
||||
# local OAuth credentials file (no network / no prompt-ping — BR-1).
|
||||
# preflight_check_auth -> master toggle (env ORCH_PREFLIGHT_CHECK_AUTH).
|
||||
# Emergency off-switch if the check ever
|
||||
# false-positives and wedges the shared queue.
|
||||
# claude_credentials_path -> explicit path to .credentials.json
|
||||
# (env ORCH_CLAUDE_CREDENTIALS_PATH). Empty ->
|
||||
# <AGENT_HOME>/.claude/.credentials.json, where
|
||||
# AGENT_HOME is the HOME the launcher really
|
||||
# spawns claude under (/home/slin), NOT the
|
||||
# orchestrator process env.
|
||||
# auth_expiry_skew_seconds -> clock-drift slack when comparing
|
||||
# claudeAiOauth.expiresAt (env
|
||||
# ORCH_AUTH_EXPIRY_SKEW_SECONDS); a token within
|
||||
# this many seconds of now is treated as expired.
|
||||
preflight_check_auth: bool = True
|
||||
claude_credentials_path: str = ""
|
||||
auth_expiry_skew_seconds: int = 0
|
||||
backoff_base_seconds: int = 10
|
||||
backoff_max_seconds: int = 600
|
||||
transient_max_attempts: int = 5
|
||||
|
||||
Reference in New Issue
Block a user