feat(queue): resilience schema + backoff helper + config (ORCH-1)

jobs.transient_attempts + available_at columns (idempotent _ensure_column
migration); claim_next_job honours available_at; mark_job_transient (backoff
requeue with separate transient budget). Config: preflight_cache_ttl,
backoff_base/max_seconds, transient_max_attempts, breaker_threshold,
breaker_pause_seconds.
This commit is contained in:
Dev Agent
2026-06-03 00:12:17 +03:00
parent 4be168c0ec
commit 0cd9b11fe0
2 changed files with 107 additions and 1 deletions

View File

@@ -36,6 +36,23 @@ class Settings(BaseSettings):
max_concurrency: int = 1
queue_poll_interval: float = 2.0
# ORCH-1b (resilience): preflight + 429/rate-limit + backoff + circuit breaker.
# preflight_cache_ttl -> cache the cheap CLI/network preflight result (seconds);
# the worker does NOT re-run `claude --version` more often
# than this (env ORCH_PREFLIGHT_CACHE_TTL).
# backoff_base_seconds -> base for exponential transient backoff.
# backoff_max_seconds -> ceiling for the transient backoff.
# transient_max_attempts -> retry budget for transient (429/overload/network)
# failures, separate from code-fault `attempts`.
# breaker_threshold -> consecutive transient failures that OPEN the breaker.
# breaker_pause_seconds -> how long the breaker stays open before half-open.
preflight_cache_ttl: int = 45
backoff_base_seconds: int = 10
backoff_max_seconds: int = 600
transient_max_attempts: int = 5
breaker_threshold: int = 3
breaker_pause_seconds: int = 300
# Telegram notifications
telegram_bot_token: str = ""