fix(queue): enforce queued ⇒ no run-ownership invariant (ORCH-126)

Queued analyst-jobs hung forever even with ORCH_SERIAL_GATE_ENABLED=false (incident ORCH-124/125, job 2286: queued + run_id=759/760 + pid=35/42 + started_at=NULL, which is physically impossible). No path returning a job to 'queued' reset its run-ownership (run_id/pid); after a container restart a reused pid made pid_alive(stale)=True, so the job-reaper Tier-1 saw a phantom 'running' and at max_concurrency=1 wedged the claim of the whole shared queue.

Enforce the invariant `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND started_at IS NULL` on existing columns (no schema change):

- D1 forward-cleanup: requeue_running_jobs / mark_job('queued') / mark_job_transient / reap_running_job('queued') reset run_id=NULL, pid=NULL in the same UPDATE that clears started_at; atomic status-guards preserved.
- D2 clean claim: claim_next_job resets pid/run_id on the queued->running flip (defense-in-depth) so the row carries pid IS NULL until _spawn stamps it.
- D4 self-heal + observability: db.find_impossible_queued_jobs / sanitize_impossible_queued run at startup (main.lifespan) and on each reaper tick (JobReaper.sanitize_impossible_queued_once, never-raise); counter impossible_queued_total in the GET /queue reaper block. Kill-switch ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED (default on; gates only the D4 sweep).
- D5: reaper Tier-1 unchanged; the fix restores its precondition (pid reflects THIS run).

Marked invariants ORCH-065/113/114/099 preserved.

Tests: tests/test_orch126_queued_stale_run.py (TC-01 mandatory regression red->green; TC-02..TC-10). Full pytest tests/ -q green (2189 passed).

Docs: internals.md (run-ownership invariant section), .env.example, CHANGELOG; cross-cutting adr-0052.

Refs: ORCH-126
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
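
Illustration only (not part of this commit): a minimal sketch of what a detect + sanitize sweep for the invariant above could look like. It assumes a SQLite table named `jobs` with the columns the message mentions (status, run_id, pid, started_at); the table name, column types, and both function bodies are hypothetical, and the real db.find_impossible_queued_jobs / sanitize_impossible_queued may differ.

```python
import sqlite3

# Hypothetical minimal schema; the real table and column types are not
# shown in this commit.
SCHEMA = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    status     TEXT NOT NULL,
    run_id     INTEGER,
    pid        INTEGER,
    started_at TEXT
)
"""

# A queued row violates the invariant if it still carries any
# run-ownership field.
IMPOSSIBLE_QUEUED = (
    "status = 'queued' AND "
    "(run_id IS NOT NULL OR pid IS NOT NULL OR started_at IS NOT NULL)"
)


def find_impossible_queued(conn):
    """Return the ids of queued rows that still carry run-ownership."""
    rows = conn.execute(
        f"SELECT id FROM jobs WHERE {IMPOSSIBLE_QUEUED} ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def sanitize_impossible_queued(conn):
    """Clear run-ownership on violating rows; return how many were healed."""
    cur = conn.execute(
        "UPDATE jobs SET run_id = NULL, pid = NULL, started_at = NULL "
        f"WHERE {IMPOSSIBLE_QUEUED}"
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            (1, "queued", None, None, None),                # valid queued row
            (2286, "queued", 759, 35, None),                # incident-like row
            (3, "running", 761, 50, "2026-06-17T11:00:00"), # valid running row
        ],
    )
    before = find_impossible_queued(conn)
    healed = sanitize_impossible_queued(conn)
    after = find_impossible_queued(conn)
    running = conn.execute(
        "SELECT run_id, pid FROM jobs WHERE id = 3"
    ).fetchone()
    return before, healed, after, running
```

The WHERE clause keeps the status guard inside the UPDATE itself, so a row that was claimed between detection and sanitization is left alone, and rows in 'running' are never touched.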
2026-06-17 11:39:26 +03:00
parent 3fb7bd6e4c
commit d7e7a4d817
9 changed files with 549 additions and 8 deletions
--- a/src/config.py
+++ b/src/config.py
@@ -722,6 +722,17 @@ class Settings(BaseSettings):
    lease_reclaim_enabled: bool = True
    reaper_finalizer_liveness_enabled: bool = True

+    # ORCH-126 (D4/FR-4): detect + self-heal "impossible" queued rows — a job that
+    # is `status='queued'` while still carrying run-ownership (run_id/pid/started_at),
+    # which is physically impossible (the incident state of job 2286: `queued +
+    # run_id=759 + pid=35 + started_at=NULL`). The BASE run-ownership reset on every
+    # requeue/claim path (D1-D3 in src/db.py) is UNCONDITIONAL — this kill-switch
+    # gates ONLY the optional detect/sanitize sweep (run at startup in main.lifespan
+    # and on each job-reaper tick) plus its read-only /queue counter. Default on;
+    # False -> the sweep is a no-op (D1-D3 still enforce the invariant going forward).
+    # never-raise: a sweep error is isolated and never wedges startup / the reaper.
+    impossible_queued_sanitize_enabled: bool = True
+
    # ORCH-114 (adr-0045): durable transition-ownership lease + expected-stage CAS for
    # side-effectful stage transitions. Generalises the process-local ORCH-113
    # finalizer-liveness to a DURABLE, cross-path owner-exclusion (additive table