orchestrator

Author	SHA1	Message	Date
claude-bot	664c2e945a	feat(infra): auto-prune docker build cache on mva154 (ORCH-062) Add src/build_cache_pruner.py — a background daemon thread modelled 1:1 on src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f --filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans. Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down) automatically, без оператора. ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy (no docker CLI in the image). Touches ONLY the build cache — no image/system prune, no image/container removal, never restarts the docker daemon or the prod container (self-hosting safety). No ssh target -> tick is a no-op. - src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default). - src/main.py: start last (after disk_watchdog) / stop first in lifespan; additive read-only build_cache_prune block in GET /queue. - never-raise on two levels (per-command + per-tick); kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before). - STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED; last-run/last-result is in-memory (no migration). - tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked). - .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already carry the component (architecture stage). Refs: ORCH-062 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:55:00 +03:00
claude-bot	8759cb7df8	feat(disk-watchdog): host-FS fill heartbeat + Telegram alert at >=85% (ORCH-063) Adds src/disk_watchdog.py — a background daemon thread modelled on reconciler/job_reaper that measures host-FS fill via the mounted bind-paths (/repos, /app/data) with shutil.disk_usage and Telegram-alerts the operator at >= threshold (default 85%). The missing proactive signal: on 07.06.2026 the mva154 host disk silently hit 100% and stalled the whole self-hosting pipeline. - Pure decide_action(used_pct, threshold, prev, now, realert_s): alert on crossing up, cooldown re-alert, single recovery below threshold (unit-tested without a thread/timer; clock injected). - measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path never-raise (a broken path never fails the tick). - Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold 1..100, positive intervals -> default + warning). Kill-switch -> daemon does not start. - Additive disk_monitor block in GET /queue; start/stop in main.lifespan. - never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS/QG_CHECKS/check_*/ DB schema untouched, no migration (anti-spam state in-memory). Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite green (1296). Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md + ADRs authored at architecture stage). Refs: ORCH-063 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:04:36 +03:00
claude-bot	a6d0ba51c0	feat(labels): auto-mode by Plane labels — autoApprove + autoDeploy (ORCH-089) Lift the two HUMAN gates that block an autonomous batch run (epic ORCH-088): the BRD gate (analysis: manual Approved) and the prod-deploy gate (deploy Phase A: manual Confirm Deploy, ORCH-059). Selective (a Plane label on the issue), declarative, reversible, and WITHOUT touching a single technical check. Additive, mirroring the conditional sub-gates (ORCH-035/043/058/088): leaf src/labels.py (never-raise) + two point insertions + config flags. STAGE_TRANSITIONS / QG_CHECKS / check_* / DB schema are NOT touched. - autoApprove: врезка in _handle_analysis_approved_flow (files_ok branch) -> set_issue_approved + log/Telegram/Plane-comment + advance_stage( finished_agent=None) — the SAME path a human Approved takes (approved-via- status -> analysis->architecture + mark_brd_review_ended). No duplicated transition logic; re-entrancy safe. - autoDeploy: врезка in _handle_self_deploy_phase_a after advance to deploy + clear_state -> log/Telegram/Plane-comment + _handle_self_deploy_phase_b (INITIATED marker, Deploying, finalizer). Only the indicative human steps are skipped. BR-5 holds structurally: Phase A is reached only after the green edge sub-gates, so autoDeploy can never deploy a broken build. - plane_sync: fetch_issue_labels (None on error != []), get_project_labels ({normalized_name->uuid}, TTL cache, ambiguity sentinel), set_issue_approved. - config flags: auto_label_enabled (kill-switch), auto_approve_label/ auto_deploy_label, auto_label_repos (empty -> self-hosting only), auto_label_states_ttl_s. applies() (local) checked FIRST; has_label (network) only when applies==True -> zero network / zero regression when disabled (AC-8). - Fail-safe (never auto on doubt), transparency via log+Telegram+Plane+card, read-only auto_labels block in GET /queue. - Tests TC-01..TC-26 across 7 modules; docs (CLAUDE.md, architecture README, CHANGELOG) updated in the same PR. Refs: ORCH-089 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 12:31:24 +03:00
claude-bot	ee4773f5b0	feat(serial-gate): per-repo serial gate + deferred branch cut + rollback-freeze (ORCH-088) Этап 1 (serial e2e) пакетного автономного режима. Новая задача репо не входит в analysis (analyst-job не выбирается, ветка не режется), пока в репо есть более ранняя незавершённая задача (FIFO, t2.id < jobs.task_id) ИЛИ репо заморожен. - src/serial_gate.py — новый leaf (never-raise): build_claim_clause (fail-OPEN), is_repo_frozen (fail-CLOSED), set/clear_repo_freeze, serial_gate_applies, snapshot. - src/db.py — идемпотентная миграция repo_freeze + serial_gate-фрагмент в claim_next_job. - src/webhooks/plane.py + src/agents/launcher.py — отложенный срез ветки: start_pipeline не создаёт Gitea-ветку/docs для применимого репо; релокация в _materialize_deferred_branch на момент claim analyst-job (база = свежий origin/main с кодом предшественника, AC-6). - src/stage_engine.py — post-deploy DEGRADED → durable per-repo freeze + Telegram-алерт. - src/main.py — блок serial_gate в GET /queue + POST /serial-gate/unfreeze. - src/config.py — serial_gate_enabled / serial_gate_repos / serial_gate_freeze_enabled. FIFO-уточнение реализации (FR-2): ADR-001 D1 фиксировал t2.id != jobs.task_id; при != пакет одновременно созданных свежих задач взаимно блокировался бы (дедлок). t2.id < jobs.task_id допускает самую раннюю задачу и сериализует остальные, сохраняя AC-1/R-7. STAGE_TRANSITIONS / QG_CHECKS / check_* — без изменений. Аддитивно, под kill-switch, never-raise, restart-safe; при выключенном флаге — нулевая регрессия (enduro не затронут). Тесты: TC-01..TC-22 (test_serial_gate*.py + test_queue_endpoint.py); полный прогон 1114 зелёных. Docs: README (serial gate / /queue / API / БД), CLAUDE.md, CHANGELOG.md, .env.example. Refs: ORCH-088 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 11:24:48 +03:00
claude-bot	a74379f657	feat(ORCH-026): task dependencies (B waits for A) + single-repo merge serialization Level A — merge/deploy serialization within one repo: reuse the existing ORCH-043/065 merge-lease (no new mechanism); the only new logic is an unconditional pre-merge rebase in check_branch_mergeable — under the held lease, auto_rebase_onto_main is ALWAYS called when premerge_rebase_always (default True), not just when the branch is behind. No-op on an up-to-date branch (rebase keeps HEAD, force-with-lease -> "Everything up-to-date", CI not triggered). Kill-switch off -> ORCH-043 behaviour 1:1. Level B — declarative task dependencies: additive job_deps table (CREATE ... IF NOT EXISTS, no live-DB migration); claim_next_job gate (NOT EXISTS) defers a job whose depends-on tasks are not yet 'done' without occupying a max_concurrency slot; inert on empty job_deps -> zero regression. New leaf src/task_deps.py (never-raise): is_task_ready (fail-open), DFS cycle detection + Blocked/alert, declare/ingest_plane_relations (db source never hits the network on the hot path), snapshot. Telegram waiting-line, /queue observability, reconciler skip + cycle backstop, reaper untouched. Invariants unchanged: STAGE_TRANSITIONS, QG_CHECKS registry (dep gate is a claim_next_job врезка, not a registered QG), DB schema of existing tables, HTTP endpoints; non-self repos remain a no-op on empty deps/scope. Flags: ORCH_PREMERGE_REBASE_ALWAYS, ORCH_TASK_DEPS_ENABLED, ORCH_TASK_DEPS_SOURCE. Docs: docs/architecture/README.md, CLAUDE.md, .env.example, CHANGELOG.md, adr-0015. Tests: tests/test_orch026_*.py (64 tests); full suite 991 green. Refs: ORCH-026 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 19:17:44 +03:00
claude-bot	fb25e9a0cf	developer(ET): auto-commit from developer run_id=355	2026-06-08 08:45:31 +00:00
claude-bot	4bebb921ff	feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization Closes the "zombie jobs" incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 one zombie blocked ALL projects' queue (self-hosting risk). Adds a background daemon (src/job_reaper.py) with three-tier liveness (dead-pid streak / known exit_code / max-running backstop) whose only mutating write is an atomic terminal flip guarded by WHERE status='running' (no double-process). For exit0 the canonical QG is the source of truth via gate-driven advance, not "exit0". Also proactively reclaims stale merge-lease (dead pid OR TTL) via file delete only (no git ops), and makes merge finalization idempotent (pr_already_merged guard + up-to-date short-circuit on re-drive). New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; "reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED. Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. restart-safe, never-raise per unit of background work. Docs: docs/architecture/README.md, CHANGELOG.md, .env.example. Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py, tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17), tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	2f4c553fd8	feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021) Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod"). - src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise. - Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job. - Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY. - Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY. - Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only. - QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12). - Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*). Refs: ORCH-021 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:40:06 +00:00
claude-bot	7d2d77217a	feat(reconciler): sweeper потерянных webhook (реконсиляция застрявших стадий) Конвейер продвигается только входящими webhook; потерянное событие (502 на ребилде, отсутствие ретраев у Plane/Gitea, неразрезолвленный sha→branch) оставляет задачу молча застрявшей (класс инцидента ORCH-044). Новый фоновый daemon-поток src/reconciler.py (паттерн queue_worker) доигрывает пропущенный переход через те же штатные гейты/обработчики, что и webhook: - F-1 gate-side: для задач stage≠done, без активного job и age(updated_at) ≥ grace_for_stage(stage) — read-only пред-оценка канонического QG; зелёный → stage_engine.advance_stage(..., finished_agent=None); красный → тишина (спам нотификаций структурно невозможен). analysis F-1 не трогает (человеческий гейт). - F-2 plane-side: опрос Plane API per-project (plane_sync.list_issues_by_state, курсорная пагинация, never-raise) → реплей In Progress/Approved/Rejected через существующие handle_status_start/handle_verdict (async из sync-потока, asyncio.run). - F-3: усиление sha→branch в handle_ci_status — БД-fallback по единственной development-задаче repo (неоднозначность → не резолвим), debug→info. - Анти-дубль на создании (db.create_task_atomic под process-wide Lock): гонка reconcile↔webhook не плодит второй task/branch/worktree/analyst-job (AC-4). - F-4 observability: лог-строка разблокировки + Telegram + блок reconcile в /queue. Старт/стоп в main.lifespan (после worker.start() / перед worker.stop()), restart-safe, never-raise на единицу работы. Kill-switches ORCH_RECONCILE_ENABLED / ORCH_RECONCILE_PLANE_ENABLED + grace-настройки. Схема БД и реестры STAGE_TRANSITIONS/QG_CHECKS не менялись. Тесты: test_reconciler.py, test_reconciler_plane.py, test_gitea_sha_resolve.py, test_config.py (33 новых, 563 всего зелёные). Документация обновлена (golden source): architecture/README.md, INFRA.md, README.md, CHANGELOG.md, adr-0007 → accepted. Refs: ORCH-053 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-06 20:55:25 +00:00
Dev Agent	0653c2437f	feat(launcher): prune old run logs (L-2)	2026-06-03 09:53:55 +03:00
Dev Agent	f314ae09e5	feat(worker): preflight gate + circuit breaker + /queue resilience (ORCH-1) QueueWorker gates claims behind preflight and the CircuitBreaker (open -> pause, no CLI calls + Telegram alert; half-open probes one job; closed on recovery). Wires launcher.on_outcome. /queue exposes resilience snapshot.	2026-06-03 00:12:17 +03:00
Dev Agent	b6d4426a48	feat(worker): background queue worker + lifespan + queue-recovery + /queue (ORCH-1) queue_worker.QueueWorker drains the queue respecting max_concurrency. main.py lifespan: queue-recovery (requeue running jobs) after M-1 orphan-recovery, starts worker and stops it on shutdown. New GET /queue endpoint (counts + recent jobs).	2026-06-02 23:58:44 +03:00
Dev Agent	212352997e	fix(main): proper orphan recovery with per-run warning + notify (M-1)	2026-06-02 20:12:29 +03:00
Dev Agent	0ad56e1f0a	fix: tini entrypoint, event routing wildcard, orphan recovery	2026-05-22 13:52:46 +03:00
Dev Agent	b545665e2d	feat: full pipeline fixes - CI status branch lookup, review webhook routing, auto-advance, plane sync - handle_ci_status: fallback git branch -r --contains when branches[] empty - webhook router: handle pull_request_approved event type - handle_pr: map review.type to review.state for new Gitea format - launcher: auto-advance stage after agent completion (_try_advance_stage) - plane_sync: notify Plane on stage changes - stages.py: stage machine with QG definitions - notifications.py: stage change notifications - safe.directory fix for container git operations	2026-05-22 01:57:02 +03:00
Dev Agent	daf8cdad9e	feat: orchestrator MVP — webhooks, agent launcher, QG checks	2026-05-19 15:57:00 +03:00

16 Commits