feat(preflight): catch logged-out auth and treat empty result as failure

ORCH-044 closes two blind spots that let a single de-authenticated agent
stall the shared queue for all projects:

P1 — preflight auth gate. `claude --version` answers even when logged out,
so version-only preflight was blind to auth. Adds a token-free, network-free
check of <AGENT_HOME>/.claude/.credentials.json: missing/unreadable/no-oauth
or an expired `claudeAiOauth.expiresAt` (epoch ms, vs now + skew) => preflight
FAIL; absent expiry => OK (no false positives). Result is cached on the same
preflight_cache_ttl. Post-factum safety net: launcher detects auth markers
("not logged in" / "/login" / "unauthorized" / 401) in the run log and resets
the preflight cache so the next tick re-evaluates auth. Auth failure is a gate,
not a transient failure, so it does not count toward the circuit breaker. Emergency toggle
ORCH_PREFLIGHT_CHECK_AUTH=false restores version-only behaviour.
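
A minimal sketch of the credentials check described in P1, not the code from this commit. The function name `check_auth`, its return shape, and the handling of a non-numeric `expiresAt` (treated like an absent expiry) are assumptions made for illustration; only the file layout (`claudeAiOauth.expiresAt` in epoch ms) and the FAIL/OK rules come from the text above.

```python
import json
import time
from pathlib import Path
from typing import Optional, Tuple


def check_auth(credentials_path: str, skew_seconds: int = 0,
               now_ms: Optional[int] = None) -> Tuple[bool, str]:
    """Return (ok, reason) without touching the network (hypothetical helper)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)

    # Missing or unreadable file, or invalid JSON => FAIL.
    try:
        data = json.loads(Path(credentials_path).read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        return False, f"credentials missing or unreadable ({type(exc).__name__})"

    # No OAuth section => FAIL.
    oauth = data.get("claudeAiOauth") if isinstance(data, dict) else None
    if not isinstance(oauth, dict) or not oauth:
        return False, "no claudeAiOauth section"

    # Absent expiry => OK, to avoid false positives.
    # Assumption: a non-numeric expiry is treated the same way.
    expires_at = oauth.get("expiresAt")
    if not isinstance(expires_at, (int, float)):
        return True, "no expiry recorded"

    # expiresAt is epoch milliseconds; a token within the skew window of
    # now counts as already expired.
    if expires_at <= now_ms + skew_seconds * 1000:
        return False, "token expired"
    return True, "token valid"
```

Passing `now_ms` explicitly keeps the check deterministic in tests; in normal use it defaults to the current wall-clock time.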

P3 — empty log / no result-JSON => job failed. exit_code==0 with an empty or
JSON-less run log no longer counts as success: a separate result_ok flag gates
stage advance + usage comments, fires a Telegram alert, and routes the job
through the normal transient/permanent failure path (exit_code integrity in
agent_runs preserved).
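
The P3 rule can be sketched roughly as follows. This is an illustration under assumptions, not the commit's implementation: the function name `compute_result_ok` is hypothetical, and "result JSON" is approximated as any log line that parses to a JSON object, since the exact shape the launcher looks for is not stated here.

```python
import json


def compute_result_ok(exit_code: int, log_text: str) -> bool:
    """True only if the run exited 0 AND the log carries a JSON object."""
    if exit_code != 0:
        return False
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # cheap filter before attempting a parse
        try:
            if isinstance(json.loads(line), dict):
                return True
        except ValueError:
            continue  # looked like JSON but was not; keep scanning
    # Empty log or no JSON object found: exit_code == 0 is not enough.
    return False
```

Keeping the flag separate from `exit_code` means the stored exit code stays truthful while stage advance and usage comments are gated on the stricter condition.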

Scope: P2 (--effort) is intentionally excluded and tracked in ORCH-50.

New settings: ORCH_PREFLIGHT_CHECK_AUTH, ORCH_CLAUDE_CREDENTIALS_PATH,
ORCH_AUTH_EXPIRY_SKEW_SECONDS. Docs updated (INFRA.md, internals.md, CHANGELOG).

Refs: ORCH-044

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-06 08:11:27 +00:00
parent 8fb59cd87f
commit 98b47fe021
8 changed files with 891 additions and 29 deletions

@@ -64,6 +64,25 @@ class Settings(BaseSettings):
# breaker_threshold -> consecutive transient failures that OPEN the breaker.
# breaker_pause_seconds -> how long the breaker stays open before half-open.
preflight_cache_ttl: int = 45
# ORCH-044 (P1): token-free preflight auth gate. After `claude --version`
# succeeds, preflight also checks that claude is logged in by reading the
# local OAuth credentials file (no network / no prompt-ping — BR-1).
# preflight_check_auth -> master toggle (env ORCH_PREFLIGHT_CHECK_AUTH).
# Emergency off-switch if the check ever
# false-positives and wedges the shared queue.
# claude_credentials_path -> explicit path to .credentials.json
# (env ORCH_CLAUDE_CREDENTIALS_PATH). Empty ->
# <AGENT_HOME>/.claude/.credentials.json, where
# AGENT_HOME is the HOME the launcher really
# spawns claude under (/home/slin), NOT the
# orchestrator process env.
# auth_expiry_skew_seconds -> clock-drift slack when comparing
# claudeAiOauth.expiresAt (env
# ORCH_AUTH_EXPIRY_SKEW_SECONDS); a token within
# this many seconds of now is treated as expired.
preflight_check_auth: bool = True
claude_credentials_path: str = ""
auth_expiry_skew_seconds: int = 0
backoff_base_seconds: int = 10
backoff_max_seconds: int = 600
transient_max_attempts: int = 5