fix(queue): enforce queued ⇒ no run-ownership invariant (ORCH-126)
Queued analyst-jobs hung forever even with ORCH_SERIAL_GATE_ENABLED=false
(incident ORCH-124/125, job 2286: queued + run_id=759/760 + pid=35/42 +
started_at=NULL — physically impossible). No path returning a job to
'queued' reset its run-ownership (run_id/pid); after a container restart a
reused pid made pid_alive(stale)=True, so the job-reaper Tier-1 saw a phantom
'running' and at max_concurrency=1 wedged the claim of the whole shared queue.
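The pid-reuse hazard described above can be illustrated with a minimal sketch. It assumes a POSIX host and a liveness probe built on signal 0; the helper names `pid_alive` and `looks_running` and the row layout are illustrative and are not taken from the orchestrator's code. The current process's pid stands in for a pid that was reused after a container restart.

```python
import os


def pid_alive(pid):
    """Return True if some process currently holds this pid (POSIX only).

    The probe cannot tell whether that process is the one that originally
    ran the job, which is why a stale pid on a queued row is dangerous.
    """
    if pid is None:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def looks_running(row):
    """Hypothetical liveness check that trusts the pid stored on the row."""
    return pid_alive(row["pid"])


# A queued row that kept run-ownership from an earlier run (shape of job 2286).
# os.getpid() plays the role of an unrelated process that reused the old pid.
stale_row = {"id": 2286, "status": "queued", "run_id": 759,
             "pid": os.getpid(), "started_at": None}

# The same row once the invariant holds: nothing left for the probe to match.
clean_row = {"id": 2286, "status": "queued", "run_id": None,
             "pid": None, "started_at": None}
```

With the stale row, the probe reports a live process even though the job never started; with the clean row there is no pid to probe, so the phantom cannot occur.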
Enforce the invariant `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND
started_at IS NULL` on existing columns (no schema change):
- D1 forward-cleanup: requeue_running_jobs / mark_job('queued') /
mark_job_transient / reap_running_job('queued') reset run_id=NULL, pid=NULL
in the same UPDATE that clears started_at; atomic status-guards preserved.
- D2 clean claim: claim_next_job resets pid/run_id on the queued->running flip
(defense-in-depth) so the row carries pid IS NULL until _spawn stamps it.
- D4 self-heal + observability: db.find_impossible_queued_jobs /
sanitize_impossible_queued run at startup (main.lifespan) and on each reaper
tick (JobReaper.sanitize_impossible_queued_once, never-raise); counter
impossible_queued_total in the GET /queue reaper block. Kill-switch
ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED (default on; gates only the D4 sweep).
- D5: reaper Tier-1 unchanged — the fix restores its precondition (pid reflects
THIS run). Marked invariants ORCH-065/113/114/099 preserved.
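The D1 and D2 items above can be sketched against an in-memory SQLite table. This is an illustration under assumptions, not the project's implementation: the table layout, the function signatures, and the choice to stamp `started_at` at claim time are all hypothetical; only the column names and the invariant come from the commit message.

```python
import sqlite3
import time


def make_db():
    """Create a throwaway jobs table with the columns named in the invariant."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "run_id INTEGER, pid INTEGER, started_at REAL)"
    )
    return conn


def requeue_job(conn, job_id):
    """D1 sketch: return a running job to 'queued' and clear run-ownership.

    All three columns are reset in the same UPDATE that flips the status,
    and the status guard keeps the transition atomic.
    """
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET status = 'queued', run_id = NULL, pid = NULL, "
            "started_at = NULL WHERE id = ? AND status = 'running'",
            (job_id,),
        )
    return cur.rowcount == 1


def claim_next_job(conn):
    """D2 sketch: flip the oldest queued job to 'running' with a clean pid.

    pid and run_id are reset here as defense-in-depth, so the row carries
    pid IS NULL until the spawner stamps the real one. Stamping started_at
    at claim time is an assumption of this sketch.
    """
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "UPDATE jobs SET status = 'running', run_id = NULL, pid = NULL, "
            "started_at = ? WHERE id = ? AND status = 'queued'",
            (time.time(), row[0]),
        )
    return row[0] if cur.rowcount == 1 else None
```

Resetting the ownership columns inside the status-flipping statement means there is no window in which a row is `queued` yet still carries a pid from a previous run.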
Tests: tests/test_orch126_queued_stale_run.py (TC-01 mandatory regression
red->green; TC-02..TC-10). Full pytest tests/ -q green (2189 passed).
Docs: internals.md (run-ownership invariant section), .env.example, CHANGELOG;
cross-cutting adr-0052.
Refs: ORCH-126
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
12
.env.example
@@ -453,6 +453,18 @@ ORCH_REAPER_MAX_RUNNING_S=5400
 ORCH_REAPER_FINALIZE_GRACE_S=300
 ORCH_LEASE_RECLAIM_ENABLED=true
 
+# ORCH-126 (adr-0052): run-ownership hygiene of the `jobs` row — invariant
+# `status='queued' => run_id IS NULL AND pid IS NULL AND started_at IS NULL`. The BASE
+# reset on every requeue/claim path (requeue_running_jobs / mark_job('queued') /
+# mark_job_transient / reap_running_job('queued') / claim_next_job) is UNCONDITIONAL
+# (no flag — it fixes a data invariant). This kill-switch gates ONLY the optional
+# detect/self-heal sweep of "impossible" queued rows (a queued job still carrying
+# run_id/pid/started_at — the incident state of job 2286) run at startup + on each
+# reaper tick, plus its read-only /queue counter (reaper.impossible_queued_total).
+# IMPOSSIBLE_QUEUED_SANITIZE_ENABLED -> default true; false -> the sweep is a no-op
+# (D1-D3 still enforce the invariant going forward).
+ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED=true
+
 # ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS for
 # side-effectful stage transitions. Generalises the process-local ORCH-113 finalizer-
 # liveness into a DURABLE, cross-path owner-exclusion (additive table `transition_lease`)
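The kill-switch documented in the `.env.example` comment can be sketched as follows. This is a hedged illustration: the function names echo those in the commit message, but their signatures, the boolean parsing rules, the `sweep_once` wrapper, the table layout, and the choice to count healed rows in the counter are all assumptions of this sketch.

```python
import os
import sqlite3

FLAG = "ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED"

# A queued row that still carries any run-ownership column is "impossible".
IMPOSSIBLE = (
    "status = 'queued' AND "
    "(run_id IS NOT NULL OR pid IS NOT NULL OR started_at IS NOT NULL)"
)


def make_demo_db():
    """Throwaway table holding one impossible row, one clean row, one running row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "run_id INTEGER, pid INTEGER, started_at REAL)"
    )
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [(1, "queued", None, None, None),
         (2, "running", 760, 42, 1.0),
         (2286, "queued", 759, 35, None)],
    )
    conn.commit()
    return conn


def sanitize_enabled(environ=os.environ):
    """Default on; the accepted 'false' spellings are an assumption."""
    raw = environ.get(FLAG, "true")
    return raw.strip().lower() not in {"false", "0", "no", "off"}


def find_impossible_queued_jobs(conn):
    rows = conn.execute(f"SELECT id FROM jobs WHERE {IMPOSSIBLE} ORDER BY id")
    return [r[0] for r in rows.fetchall()]


def sanitize_impossible_queued(conn):
    """Clear run-ownership on impossible queued rows; return how many were healed."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET run_id = NULL, pid = NULL, started_at = NULL "
            f"WHERE {IMPOSSIBLE}"
        )
    return cur.rowcount


def sweep_once(conn, counters, environ=os.environ):
    """Never-raise sweep: a no-op when the flag is off, otherwise heal and count."""
    if not sanitize_enabled(environ):
        return 0
    try:
        healed = sanitize_impossible_queued(conn)
    except sqlite3.Error:
        return 0  # a failing sweep must not take the reaper tick down with it
    counters["impossible_queued_total"] = (
        counters.get("impossible_queued_total", 0) + healed
    )
    return healed
```

Only the sweep consults the flag; the unconditional resets on the requeue and claim paths are separate code and are unaffected by it.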