fix(queue): enforce queued ⇒ no run-ownership invariant (ORCH-126)

Queued analyst-jobs hung forever even with ORCH_SERIAL_GATE_ENABLED=false
(incident ORCH-124/125, job 2286: queued + run_id=759/760 + pid=35/42 +
started_at=NULL, a physically impossible state). None of the paths that return
a job to 'queued' reset its run-ownership (run_id/pid). After a container
restart, a reused pid made pid_alive(stale)=True, so the job-reaper Tier-1 saw
a phantom 'running' job and, at max_concurrency=1, wedged claims on the whole
shared queue.

Enforce the invariant `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND
started_at IS NULL` on existing columns (no schema change):
- D1 forward-cleanup: requeue_running_jobs / mark_job('queued') /
mark_job_transient / reap_running_job('queued') reset run_id=NULL, pid=NULL
in the same UPDATE that clears started_at; atomic status-guards preserved.
- D2 clean claim: claim_next_job resets pid/run_id on the queued->running flip
(defense-in-depth) so the row carries pid IS NULL until _spawn stamps it.
- D4 self-heal + observability: db.find_impossible_queued_jobs /
sanitize_impossible_queued run at startup (main.lifespan) and on each reaper
tick (JobReaper.sanitize_impossible_queued_once, never-raise); counter
impossible_queued_total in the GET /queue reaper block. Kill-switch
ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED (default on; gates only the D4 sweep).
- D5: reaper Tier-1 unchanged — the fix restores its precondition (pid reflects
THIS run). Marked invariants ORCH-065/113/114/099 preserved.
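
Illustration of D1/D2/D4 (not the code in src/db.py): a minimal sketch of the
invariant on an in-memory SQLite table. The schema, the function signatures and
the choice to stamp started_at at claim time are assumptions made for the
example; only the column names and function names come from this message.

```python
import sqlite3

# Hypothetical minimal schema: only the columns named in the commit message.
SCHEMA = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL,
    run_id     INTEGER,
    pid        INTEGER,
    started_at TEXT
)
"""

# Predicate matching rows that violate "queued => no run-ownership".
IMPOSSIBLE = (
    "status='queued' AND "
    "(run_id IS NOT NULL OR pid IS NOT NULL OR started_at IS NOT NULL)"
)


def requeue_job(conn: sqlite3.Connection, job_id: int) -> bool:
    """D1 sketch: return a running job to 'queued' and clear run-ownership
    in the same UPDATE, guarded on the current status."""
    cur = conn.execute(
        "UPDATE jobs SET status='queued', run_id=NULL, pid=NULL, started_at=NULL "
        "WHERE id=? AND status='running'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1


def claim_next_job(conn: sqlite3.Connection):
    """D2 sketch: flip the oldest queued job to 'running' with pid/run_id reset.
    Stamping started_at here is an assumption of this example."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status='queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE jobs SET status='running', run_id=NULL, pid=NULL, "
        "started_at=CURRENT_TIMESTAMP WHERE id=? AND status='queued'",
        (row[0],),
    )
    conn.commit()
    return row[0] if cur.rowcount == 1 else None


def find_impossible_queued_jobs(conn: sqlite3.Connection) -> list:
    """D4 sketch (detect): ids of queued rows that still carry run-ownership."""
    rows = conn.execute(
        f"SELECT id FROM jobs WHERE {IMPOSSIBLE} ORDER BY id"
    ).fetchall()
    return [r[0] for r in rows]


def sanitize_impossible_queued(conn: sqlite3.Connection) -> int:
    """D4 sketch (self-heal): clear stale run-ownership; returns rows fixed."""
    cur = conn.execute(
        f"UPDATE jobs SET run_id=NULL, pid=NULL, started_at=NULL WHERE {IMPOSSIBLE}"
    )
    conn.commit()
    return cur.rowcount
```

The status guard in each UPDATE is what keeps the reset atomic: a second
requeue of the same job matches zero rows and changes nothing.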

Tests: tests/test_orch126_queued_stale_run.py (TC-01 mandatory regression
red->green; TC-02..TC-10). Full pytest tests/ -q green (2189 passed).
Docs: internals.md (run-ownership invariant section), .env.example, CHANGELOG;
cross-cutting adr-0052.

Refs: ORCH-126
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -722,6 +722,17 @@ class Settings(BaseSettings):
     lease_reclaim_enabled: bool = True
     reaper_finalizer_liveness_enabled: bool = True

+    # ORCH-126 (D4/FR-4): detect + self-heal "impossible" queued rows — a job that
+    # is `status='queued'` while still carrying run-ownership (run_id/pid/started_at),
+    # which is physically impossible (the incident state of job 2286: `queued +
+    # run_id=759 + pid=35 + started_at=NULL`). The BASE run-ownership reset on every
+    # requeue/claim path (D1-D3 in src/db.py) is UNCONDITIONAL — this kill-switch
+    # gates ONLY the optional detect/sanitize sweep (run at startup in main.lifespan
+    # and on each job-reaper tick) plus its read-only /queue counter. Default on;
+    # False -> the sweep is a no-op (D1-D3 still enforce the invariant going forward).
+    # never-raise: a sweep error is isolated and never wedges startup / the reaper.
+    impossible_queued_sanitize_enabled: bool = True
+
     # ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS for
     # side-effectful stage transitions. Generalises the process-local ORCH-113
     # finalizer-liveness to a DURABLE, cross-path owner-exclusion (additive table
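
The gating described in the ORCH-126 comment could look roughly like the sketch
below: a kill-switch that turns the sweep into a no-op, and a never-raise
wrapper that isolates sweep errors. SweepRunner, env_flag and the injected
sanitize callable are hypothetical names for this example; the real project
reads the flag through its pydantic Settings class, and env_flag is only a
simplified stand-in for that parsing.

```python
import logging
import os
from dataclasses import dataclass
from typing import Callable

log = logging.getLogger("job_reaper_sketch")


def env_flag(name: str, default: bool = True) -> bool:
    """Simplified stand-in for settings parsing: unset means `default`;
    '0', 'false', 'no' and 'off' (case-insensitive) mean False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in {"0", "false", "no", "off"}


@dataclass
class SweepRunner:
    """Hypothetical holder for the sweep, its kill-switch and its counter."""

    sanitize: Callable[[], int]        # returns the number of rows repaired
    enabled: bool = True               # kill-switch, default on
    impossible_queued_total: int = 0   # read-only counter exposed to callers

    def sanitize_once(self) -> int:
        # Kill-switch off: the sweep is a no-op.
        if not self.enabled:
            return 0
        try:
            fixed = self.sanitize()
        except Exception:
            # never-raise: log and carry on so startup / the reaper tick survive.
            log.exception("impossible-queued sweep failed; continuing")
            return 0
        self.impossible_queued_total += fixed
        return fixed
```

A caller would build the runner once, for example
`SweepRunner(sanitize=my_sweep, enabled=env_flag("ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED"))`,
and invoke `sanitize_once()` at startup and on each reaper tick.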