ORCH-1 (F-2b): persistent job queue instead of in-process daemon threads #3

Merged
admin merged 13 commits from feature/ORCH-1-job-queue into main 2026-06-03 08:09:23 +03:00

Persistent SQLite job queue + background worker, replacing in-process Popen+daemon-thread agent spawn (8 webhook call points). Plus a resilience layer (Slava extension).

Queue core

  • jobs table + atomic claim_next_job (available_at-gated)
  • launch_job + _finalize_job (done/requeue/failed, retries)
  • 8 webhook points + 4 internal advance calls -> enqueue_job
  • queue_worker (max_concurrency) + lifespan start/stop + queue-recovery (restart-safe)
  • GET /queue observability
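
The atomic claim listed above can be sketched as follows. This is a minimal illustration, not the code from the PR: the schema, the `open_queue` helper, and the `now` parameter are assumptions, and only the names `jobs`, `claim_next_job`, and `available_at` come from the description.

```python
import sqlite3
import time


def open_queue(path):
    """Open a connection in autocommit mode and create a hypothetical jobs table."""
    conn = sqlite3.connect(path, isolation_level=None)  # we issue BEGIN/COMMIT ourselves
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'queued',"
        " available_at REAL NOT NULL DEFAULT 0)"
    )
    return conn


def claim_next_job(conn, now=None):
    """Move the oldest eligible queued job to 'running' and return its id, or None."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock before the SELECT, so two workers
    # cannot both read the same row as 'queued' and then both claim it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM jobs"
            " WHERE status = 'queued' AND available_at <= ?"
            " ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else row[0]
```

A job whose `available_at` lies in the future is skipped by the `WHERE` clause, which is how the backoff in the resilience layer can delay a retry without a separate scheduler.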

Resilience (ADDENDUM)

  • A. Cheap preflight (CLAUDE_BIN exists + claude --version, cached, NO tokens, no prompt-ping) gates claiming; FAIL -> jobs stay queued
  • B. 429/transient vs permanent classifier on the run-log tail + Retry-After
  • C. Exponential backoff via jobs.available_at + separate transient_attempts budget
  • D. Circuit breaker: N consecutive transient -> open (pause, no CLI, Telegram), half-open probe, closed on recovery; reflected in /queue.resilience
  • fix: preflight checks the binary the launcher actually spawns (container ORCH_CLAUDE_BIN pointed at a missing path)

Tests: 110 passed / 9 pre-existing test_webhooks failures untouched. Popen never spawned (mocked). Live-verified on the box (preflight ok, transient backoff, breaker open/half-open/closed).

B-1/B-2/M-1/ORCH-2/ORCH-6 preserved. See docs/ORCH-1_JOB_QUEUE.md.

admin added 6 commits 2026-06-02 23:59:03 +03:00
Persistent SQLite job queue (F-2b): jobs table + idx, atomic claim_next_job,
enqueue/mark/count/requeue/get helpers. New settings max_concurrency
(ORCH_MAX_CONCURRENCY) and queue_poll_interval (ORCH_QUEUE_POLL_INTERVAL).
Refactor launch() into shared _spawn(); add launch_job(job) that threads job_id
through monitor/watchdog. _finalize_job marks done / requeue (attempts<max) /
failed+notify. Internal advance-chain self.launch -> enqueue_job. B-1/B-2/M-1/ORCH-2
spawn logic unchanged.
All 8 webhook launch points (plane x4, gitea x4) now enqueue a job and return
immediately instead of synchronously spawning claude in the uvicorn process.
queue_worker.QueueWorker drains the queue respecting max_concurrency. main.py
lifespan: queue-recovery (requeue running jobs) after M-1 orphan-recovery, starts
worker and stops it on shutdown. New GET /queue endpoint (counts + recent jobs).
Covers enqueue->claim->mark, atomic claim (no double dispatch, 8-thread race),
retry fail->queued->failed, requeue_running_jobs, observability, worker
max_concurrency. Popen fully mocked (no real agent spawned).
ARCHITECTURE job-queue section + flow diagram, README /queue endpoint and
ORCH_MAX_CONCURRENCY/ORCH_QUEUE_POLL_INTERVAL, new docs/ORCH-1_JOB_QUEUE.md.
admin added 6 commits 2026-06-03 00:12:26 +03:00
jobs.transient_attempts + available_at columns (idempotent _ensure_column
migration); claim_next_job honours available_at; mark_job_transient (backoff
requeue with separate transient budget). Config: preflight_cache_ttl,
backoff_base/max_seconds, transient_max_attempts, breaker_threshold,
breaker_pause_seconds.
preflight.py: cached CLAUDE_BIN exists + claude --version (no tokens, no
prompt-ping). error_classifier.py: classify_log_file -> transient|permanent
from log tail + Retry-After parsing.
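
A rough sketch of a log-tail classifier follows. The commit's function is `classify_log_file`; the `classify_log_tail` helper here works on text already read from the log, and its marker patterns, tail length, and return shape are all assumptions chosen for illustration.

```python
import re

# Assumed markers of a transient failure; the real classifier may use others.
TRANSIENT_RE = re.compile(r"\b429\b|rate.?limit|overloaded", re.IGNORECASE)
RETRY_AFTER_RE = re.compile(r"retry-after\s*[:=]\s*(\d+)", re.IGNORECASE)


def classify_log_tail(text, tail_lines=50):
    """Return (kind, retry_after_seconds) based on the last `tail_lines` lines.

    kind is 'transient' or 'permanent'; retry_after_seconds is an int or None.
    """
    tail = "\n".join(text.splitlines()[-tail_lines:])
    if not TRANSIENT_RE.search(tail):
        return "permanent", None
    match = RETRY_AFTER_RE.search(tail)
    return "transient", int(match.group(1)) if match else None
```

Looking only at the tail keeps an old, already-recovered 429 earlier in the log from reclassifying a run that later failed for a permanent reason.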
_finalize_job classifies the run log: transient (429/overload) -> backoff
requeue via mark_job_transient with separate transient_attempts budget honouring
Retry-After; permanent -> normal attempts<max. on_outcome callback feeds the
circuit breaker. _backoff_seconds = min(2^n*base, max) | Retry-After.
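
The backoff rule can be written out as a small function. This sketch assumes that a server-supplied Retry-After value takes precedence over the computed delay, which is one reading of the `| Retry-After` notation; the parameter names are illustrative.

```python
def backoff_seconds(n, base, max_seconds, retry_after=None):
    """Delay before the next attempt, where n is the count of prior transient failures.

    Assumption: Retry-After, when present, replaces the exponential value.
    """
    if retry_after is not None:
        return float(retry_after)
    return float(min((2 ** n) * base, max_seconds))
```

With `base=5` and `max_seconds=300`, successive transient failures wait 5, 10, 20, 40 seconds and so on, capped at 300. The worker would add the result to the current time and store it in `jobs.available_at`.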
QueueWorker gates claims behind preflight and the CircuitBreaker (open ->
pause, no CLI calls + Telegram alert; half-open probes one job; closed on
recovery). Wires launcher.on_outcome. /queue exposes resilience snapshot.
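
The open / half-open / closed cycle can be sketched as a small state machine. Only the `CircuitBreaker` name and the three states come from the commit; the method names, the injectable clock, and the choice to treat a permanent failure as breaking the transient streak are assumptions. The Telegram alert is omitted.

```python
import time


class CircuitBreaker:
    def __init__(self, threshold, pause_seconds, clock=time.monotonic):
        self.threshold = threshold          # consecutive transient failures before opening
        self.pause_seconds = pause_seconds  # how long to stay open before probing
        self._clock = clock
        self._consecutive = 0
        self._opened_at = None
        self._probing = False

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.pause_seconds:
            return "half-open"
        return "open"

    def allow_claim(self):
        """True if the worker may claim a job right now."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probing:
            self._probing = True  # let exactly one probe job through
            return True
        return False

    def record(self, outcome):
        """Feed a job outcome: 'ok', 'transient' or 'permanent'."""
        if outcome == "transient":
            self._consecutive += 1
            if self._probing or self._consecutive >= self.threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
        else:
            # Assumption: any non-transient outcome means the API answered, so close.
            self._consecutive = 0
            self._opened_at = None
        self._probing = False
```

In this sketch the worker would call `allow_claim()` before each claim and pass `record` as the launcher's `on_outcome` callback.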
Covers preflight FAIL->queued + cache, transient/permanent classifier +
Retry-After, exp backoff + available_at gating, launcher transient vs permanent
finalize, circuit breaker open/half-open/closed. test_queue worker tests stub
preflight OK. Popen never spawned.
admin added 1 commit 2026-06-03 00:13:45 +03:00
Container ORCH_CLAUDE_BIN pointed at a non-existent /usr/bin/claude while the
launcher spawns the hardcoded /opt/claude-code/bin/claude.exe. Preflight now
follows AgentLauncher.CLAUDE_BIN (the genuinely executed path), so it no longer
falsely blocks every job in production.
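
A cached preflight along these lines might look like the sketch below. The class shape, the 10-second timeout, and the injectable clock are assumptions. The point it illustrates is the one from the fix: the path passed in should be the one the launcher really executes (`AgentLauncher.CLAUDE_BIN`), not a separately configured value.

```python
import os
import subprocess
import time


class Preflight:
    def __init__(self, binary, ttl_seconds, clock=time.monotonic):
        self.binary = binary            # should be the path the launcher actually spawns
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._cached = None
        self._checked_at = 0.0

    def check(self):
        """Return the cached result while it is fresh, otherwise probe again."""
        now = self._clock()
        if self._cached is not None and now - self._checked_at < self.ttl_seconds:
            return self._cached
        self._cached = self._probe()
        self._checked_at = now
        return self._cached

    def _probe(self):
        # Step 1: the binary exists and is executable.
        if not (os.path.isfile(self.binary) and os.access(self.binary, os.X_OK)):
            return False
        # Step 2: `--version` runs cleanly. No prompt is sent, so no tokens are spent.
        try:
            result = subprocess.run(
                [self.binary, "--version"], capture_output=True, timeout=10
            )
        except (OSError, subprocess.TimeoutExpired):
            return False
        return result.returncode == 0
```

When `check()` returns False the worker simply skips claiming, so jobs stay queued until a later probe succeeds.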
admin merged commit 4e52e192e4 into main 2026-06-03 08:09:23 +03:00
Reference: admin/orchestrator#3