orchestrator

Author	SHA1	Message	Date
claude-bot	f1635ddb39	feat(replication): расхардкод хоста + секреты нового хоста + smoke-runbook All checks were successful CI / test (push) Successful in 57s Details CI / test (pull_request) Successful in 55s Details Фундамент тиража 10-common (эпик ORCH-10): платформа разворачивается на новой инфре без правки кода — только env/конфиг. Каждый дефолт = боевому значению (пустой .env => поведение 1:1, kill-switch-природа, NFR-2); STAGE_TRANSITIONS/QG_CHECKS/check_/machine-verdict/схема БД не тронуты. - config: agent_home_dir / agent_git_name / git_email_domain / staging_port (ADR-001 D2/D4); код-блокеры A1-A4 закрыты: plane_sync ссылки из gitea_public_url+gitea_owner, launcher - единый agent_git_env() (x2 места), self_deploy/post_deploy - HOME+домен из Settings (имена системных акторов - платформенные литералы) - image_freshness: staging_port из конфига + fail-closed guard staging_port == прод-порт -> отказ ДО ssh/build (инвариант ORCH-058 AC-9 стал исполняемым); REPO= передаётся хуку явно обоими инвокерами (D7) - SELF_HOSTING_REPO - нормативная платформенная константа (D3, пин-тест) - compose: полная ${VAR:-default}-интерполяция (реестр B, карта D6); группа ORCH-040 uid/gid/HOME/маунты двигается согласованно (build.args APP_); group_add "МИНА 1" сохранён x3; оба app-сервиса с явным command: - Dockerfile: ARG APP_UID/APP_GID/APP_USER/APP_HOME (CMD exec-form 8500 сознательно не тронут - D5); deploy-hook: REPO="${REPO:-...}" (D1 реестра) - секреты: stdlib scripts/gen_secrets.py (token_hex(32); печать по умолчанию; --write никогда не перезаписывает существующий .env молча, exit=2; перезапись только --force); .env.example дополнен до полноты ключей старта - доки: новый docs/operations/REPLICATION.md (карта env, чек-лист секретов, smoke-процедура с PASS/FAIL, границы 10-common/Lite/Bundled), INFRA.md, README, CLAUDE.md, CHANGELOG - анти-регресс: tests/test_no_host_hardcodes.py (tokenize-сканер запрещённых литералов, config-модули - структурное исключение, allowlist пуст, негативная самопроверка) + test_host_config_keys / test_infra_parametrization / test_secrets_gen / test_replication_smoke; согласованные структурные правки test_orch040_compose (судит резолв дефолтов) и test_deploy_hook_rollback_sim (REPO через env-override = контракт D7) Полный регресс: 1764 passed. Refs: ORCH-101 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 20:50:43 +03:00
claude-bot	259b507906	feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service — the brain half of the domain-0 observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it, augments with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel. Key properties (ADR-001): - Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup). - Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control — self-hosting-safe. - never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit). - Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off). - NO import from src/ (C-1); src/, STAGE_TRANSITIONS, QG_CHECKS, check_, DB schema — untouched. env_file optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator. Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14). Docs: CHANGELOG, .env.example canon (WATCHDOG_); architecture README + adr-0033 authored at the architecture stage. Refs: ORCH-100 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:36:02 +03:00
claude-bot	50bcae765a	feat(bug-fast-track): cheaper/shorter pipeline route for bug-fix tasks (ORCH-019) A task carrying the Plane `Bug` label takes a shortened route that skips the `architecture` stage (one opus architect run + ADR + check_architecture_done), replacing heavy analysis with a lite package (bug-report + mandatory regression test plan). EVERY Quality Gate / sub-gate runs UNCHANGED — the route is a scheduler property, not a gate (root invariant NFR-1): STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys are byte-for-byte preserved. - src/bug_fast_track.py: new leaf (never-raise) — bug_fast_track_applies (local, network-free, checked first), is_bug_task (labels.has_label, Plane API source), skips_architecture (pure DB-backed routing predicate), snapshot. - src/db.py: additive idempotent tasks.track column (TEXT DEFAULT 'full') + set_task_track / get_task_track helpers (missing/NULL -> 'full', fail-safe). - src/stage_engine.py: routing-override on the analysis-exit edge (track='bug' -> development/developer, skipping architect); brd-review-clock stamp extended to analysis->development. get_next_stage/get_agent_for_stage stay pure. - src/webhooks/plane.py: classify task as bug in start_pipeline (applies-first short-circuit; never-raise -> full cycle on any error). - src/main.py: additive bug_fast_track block in GET /queue + POST /bug-fast-track/escalate (reset 'bug'->'full' to return to the full cycle). - src/config.py: bug_fast_track_enabled / _label / _repos flags (empty CSV -> self-hosting only). - src/notifications.py: optional 🐞 marker on the bug-track card (never-raise). - Prompts: analyst.md (lite bug package + escalation), reviewer.md (regression- test axis) — 52d canon preserved. - Docs: CLAUDE.md, README.md (env + API + section), docs/architecture/README.md, CHANGELOG.md, .env.example. - Tests: tests/test_bug_fast_track*.py + test_db_migrations.py + queue block (TC-01..TC-15). Full regression green (1551 passed). Kill-switch ORCH_BUG_FAST_TRACK_ENABLED=false -> 1:1 pre-ORCH-019 (zero regression; residual track column harmless). Refs: ORCH-019 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 03:58:15 +03:00
claude-bot	a98d605477	feat(fs): legacy root-owned ownership detect + actionable worktree error (ORCH-057) Follow-up ORCH-040: legacy root:root files in /repos broke worktree creation under uid 1000 with a raw "Permission denied" (agent never started, no diagnosis). Three additive, kill-switch-reversible layers; STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are byte-for-byte unchanged. - D1: ensure_worktree classifies the permission class and raises an actionable RuntimeError (cause + chown command + INFRA.md ref); non-permission errors keep the prior raw-stderr contract; kill-switch off -> contract 1:1 as before ORCH-057. - D2: new never-raise leaf src/fs_normalize.py — scan_ownership (TTL-cached, early-exit per root), applies()-first scope (empty CSV -> self-hosting only), opt-in normalize() that chowns ONLY when privileged (no-op under uid 1000). - D3: best-effort startup detect in main.lifespan (WARNING + Telegram on mismatch, never-fatal); read-only fs_ownership block in GET /queue; POST /fs-normalize/check. Claim is NOT blocked — the clear early outcome is delivered by D1 at launch. - Docs/config: .env.example flags + CHANGELOG (architecture README / adr-0031 / INFRA.md procedure already landed on the branch). - Tests: test_fs_normalize.py, test_git_worktree_perm.py, test_fs_normalize_startup.py, test_api_queue.py (TC-01..TC-12). Full suite green. Refs: ORCH-057 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 03:03:34 +03:00
claude-bot	d8793c9698	feat(metrics): lightweight read-only GET /metrics raw-signal endpoint (ORCH-099) FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the orchestrator's own raw state for the future observability sidecar F1b — active task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens. The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history live in the separate sidecar (observer separated from observed, BRD §1). - src/metrics.py: new leaf collector build_metrics() (never-raise per section, serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck + stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from /proc/<pid>/stat (null on None/dead/non-Linux pid — never raises). - src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue). - src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an extension of the hot-path get_running_jobs()), agent_cost_totals(), queue_retry_stats(); job_status_counts() default dict gains the cancelled key. - src/config.py: metrics_endpoint_enabled kill-switch (default True), env ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch actually controls the flag. - docs: README API table row + CHANGELOG entry (contract section already added by architect); .env.example ORCH_METRICS_ENABLED. Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte. Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py. Full suite green (1482). Refs: ORCH-099 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 02:09:19 +03:00
claude-bot	eadfd8419b	feat(coverage): deterministic test-coverage gate on deploy-staging->deploy edge (ORCH-027) Introduce a deterministic (no-LLM) coverage sub-gate that blocks coverage degradation before a task branch merges into `main`. Existing gates judge only by the FACT of passing (check_ci_green / check_tests_passed / merge-gate re-test), not by completeness — so a batch autonomous run (ORCH-088) silently erodes coverage. Pattern mirrors the security-gate (ORCH-022): leaf src/coverage_gate.py (never-raise) + thin check_coverage_gate in QG_CHECKS + _handle_coverage_gate splice in advance_stage, run AFTER merge-gate (measured on the caught-up HEAD that lands in main) and BEFORE image-freshness (fail before the expensive docker rebuild). - measure_coverage: pytest --cov=src --cov-report=json in the per-branch worktree -> line coverage %; None on tool error -> fail-open + WARNING by default (FR-6). - compute_coverage_verdict (pure): absolute \| baseline \| both + epsilon (NFR-4 anti-flap); baseline None -> bootstrap (absolute-only). - coverage_baseline DB table (additive, CREATE TABLE IF NOT EXISTS) + ratchet-up in _handle_merge_verify (deploy->done): atomic compare-and-set under merge-lease, never decreases; bootstrap on first merge. - Artefact 18-coverage-report.md (coverage_status: frontmatter, single source of truth); GET /queue `coverage` block; FAIL -> Telegram; optional POST /coverage/baseline override. - Flags ORCH_COVERAGE_* (kill-switch + self-hosting-only scope) -> enduro untouched; STAGE_TRANSITIONS / existing check_* / verdict keys byte-for-byte unchanged (NFR-5/AC-8). - pytest-cov==5.0.0 added to requirements.txt. Tests: tests/test_coverage_gate.py (TC-01..TC-15). Frozen QG-registry anti-regress tests + deploy-staging edge tests updated for the new sub-gate. Full suite green. Docs: README / adr-0029 / PIPELINE_DOCS / 18-coverage-report.md template (architecture stage) + CHANGELOG / CLAUDE.md / .env.example (this PR). Refs: ORCH-027 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 01:26:24 +03:00
claude-bot	a46dcbcab3	fix(deploy): terminal-window-aware guard so done tasks hold Done in Plane (ORCH-094) A DB stage=done task with 0 active jobs flapped in Plane between `Awaiting Deploy` and `Monitoring after Deploy` instead of holding `Done` (verified live on ORCH-061, task 47): the three deploy-phase setters were terminal-blind, so any stale/duplicate/unknown caller under the bot token re-stamped an intermediate status over the terminal Done, forever. - New leaf src/deploy_status_guard.py (pure, never-raise, config-gated): decide() -> ALLOW \| CONVERGE_DONE \| SUPPRESS on the entry of set_issue_awaiting_deploy / set_issue_deploying / set_issue_monitoring. A deploy-phase status is legitimate iff the task is non-terminal OR (done AND post-deploy window active); otherwise done converges to Done idempotently, cancelled is suppressed (FR-2, D1/D2). - D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in advance_stage so window_active is True when the legitimate first Monitoring is set (the task is already DB-done by then); a re-drive after the window closes converges to Done. - D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when the task became cancelled mid-window (zombie-tick guard, FR-3). - D5: additive `reason` kwarg on the three setters + one structured log line per verdict (work_item/caller/target/db_stage/window_active/verdict); new read-only db.get_task_by_work_item_id; post_deploy.window_active helper. - Flags deploy_status_guard_enabled (kill-switch -> 1:1) / deploy_status_guard_repos (CSV; empty = self-hosting only). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema untouched (reads existing tasks.stage). Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve. Full regress green (1413). Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md (status -> реализовано), .env.example. Refs: ORCH-094 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 23:41:24 +03:00
claude-bot	0b25fc1527	fix(merge_gate): retry transient Gitea merge errors + already-in-main guard merge_pr now wraps ONLY the mutating POST /pulls/{n}/merge in a bounded exponential-backoff retry-loop on TRANSIENT outcomes (405 "try again later", 408, any 5xx, network/timeout, and 409\|422 while the PR is still mergeable); TERMINAL outcomes (403/404/real conflict via mergeable==False) -> fast honest False, so the ORCH-071/081 not-merged HOLD backstop is unchanged. Fixes the ORCH-063 false HOLD + manual re-merge on Gitea's post-push mergeability hiccup. ensure_open_pr gains an "already fully in main" guard (_branch_fully_in_main, git merge-base --is-ancestor HEAD origin/main) BEFORE creating a PR -> new "already-in-main" outcome avoids the garbage empty PR on a re-driven finalizer; _handle_merge_verify skips merge_pr on that outcome and lets the authoritative SHA-in-main check confirm -> done (not a HOLD). git error of the guard fails OPEN to the create path. New ORCH_MERGE_RETRY_* settings (kill-switch merge_retry_enabled -> one-shot, max_attempts=3, backoff base=2/max=5). INV-4 (merge only via Gitea PR-merge API, never push/force-push main), never-raise, STAGE_TRANSITIONS/QG_CHECKS/DB schema unchanged. Docs (README merge-verify section, CLAUDE.md, CHANGELOG, .env.example) updated in the same PR. Tests: test_merge_gate.py TC-01..12, test_config.py TC-13, test_merge_verify.py TC-14..16; full suite green (1389). Refs: ORCH-093 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 22:47:20 +03:00
claude-bot	ebbf2e7a2d	feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090) Introduce the dedicated Plane STOP status as a single declarative task-cancel mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs (terminal `cancelled`, never requeued), remove the worktree + delete the remote feature branch (never main, never force-push), drive the task to the new system-terminal state `cancelled` and tombstone the natural keys so a later "To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a critical merge/deploy window is deferred until the irreversible step finishes honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to the `analysis` stage; the only pipeline-start entry point remains "To Analyse". Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} -> {done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged (`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe, under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression). New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block. Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345). Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example, CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup ORs on it) — original UUID recoverable from the parseable suffix. Refs: ORCH-090 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:31:56 +03:00
claude-bot	664c2e945a	feat(infra): auto-prune docker build cache on mva154 (ORCH-062) Add src/build_cache_pruner.py — a background daemon thread modelled 1:1 on src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f --filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans. Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down) automatically, без оператора. ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy (no docker CLI in the image). Touches ONLY the build cache — no image/system prune, no image/container removal, never restarts the docker daemon or the prod container (self-hosting safety). No ssh target -> tick is a no-op. - src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default). - src/main.py: start last (after disk_watchdog) / stop first in lifespan; additive read-only build_cache_prune block in GET /queue. - never-raise on two levels (per-command + per-tick); kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before). - STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED; last-run/last-result is in-memory (no migration). - tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked). - .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already carry the component (architecture stage). Refs: ORCH-062 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:55:00 +03:00
claude-bot	8759cb7df8	feat(disk-watchdog): host-FS fill heartbeat + Telegram alert at >=85% (ORCH-063) Adds src/disk_watchdog.py — a background daemon thread modelled on reconciler/job_reaper that measures host-FS fill via the mounted bind-paths (/repos, /app/data) with shutil.disk_usage and Telegram-alerts the operator at >= threshold (default 85%). The missing proactive signal: on 07.06.2026 the mva154 host disk silently hit 100% and stalled the whole self-hosting pipeline. - Pure decide_action(used_pct, threshold, prev, now, realert_s): alert on crossing up, cooldown re-alert, single recovery below threshold (unit-tested without a thread/timer; clock injected). - measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path never-raise (a broken path never fails the tick). - Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold 1..100, positive intervals -> default + warning). Kill-switch -> daemon does not start. - Additive disk_monitor block in GET /queue; start/stop in main.lifespan. - never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS/QG_CHECKS/check_*/ DB schema untouched, no migration (anti-spam state in-memory). Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite green (1296). Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md + ADRs authored at architecture stage). Refs: ORCH-063 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:04:36 +03:00
claude-bot	ee4773f5b0	feat(serial-gate): per-repo serial gate + deferred branch cut + rollback-freeze (ORCH-088) Этап 1 (serial e2e) пакетного автономного режима. Новая задача репо не входит в analysis (analyst-job не выбирается, ветка не режется), пока в репо есть более ранняя незавершённая задача (FIFO, t2.id < jobs.task_id) ИЛИ репо заморожен. - src/serial_gate.py — новый leaf (never-raise): build_claim_clause (fail-OPEN), is_repo_frozen (fail-CLOSED), set/clear_repo_freeze, serial_gate_applies, snapshot. - src/db.py — идемпотентная миграция repo_freeze + serial_gate-фрагмент в claim_next_job. - src/webhooks/plane.py + src/agents/launcher.py — отложенный срез ветки: start_pipeline не создаёт Gitea-ветку/docs для применимого репо; релокация в _materialize_deferred_branch на момент claim analyst-job (база = свежий origin/main с кодом предшественника, AC-6). - src/stage_engine.py — post-deploy DEGRADED → durable per-repo freeze + Telegram-алерт. - src/main.py — блок serial_gate в GET /queue + POST /serial-gate/unfreeze. - src/config.py — serial_gate_enabled / serial_gate_repos / serial_gate_freeze_enabled. FIFO-уточнение реализации (FR-2): ADR-001 D1 фиксировал t2.id != jobs.task_id; при != пакет одновременно созданных свежих задач взаимно блокировался бы (дедлок). t2.id < jobs.task_id допускает самую раннюю задачу и сериализует остальные, сохраняя AC-1/R-7. STAGE_TRANSITIONS / QG_CHECKS / check_* — без изменений. Аддитивно, под kill-switch, never-raise, restart-safe; при выключенном флаге — нулевая регрессия (enduro не затронут). Тесты: TC-01..TC-22 (test_serial_gate*.py + test_queue_endpoint.py); полный прогон 1114 зелёных. Docs: README (serial gate / /queue / API / БД), CLAUDE.md, CHANGELOG.md, .env.example. Refs: ORCH-088 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 11:24:48 +03:00
claude-bot	0ab6a33ef5	feat(merge-verify): guarantee idempotent open code-PR before merge_pr (ORCH-082) Close the missing invariant "by merge-verify time the branch has an open code-PR". The pipeline created a PR only on the developer path with a fresh worktree commit (launcher._ensure_pr), so a branch (e.g. after a manual main restore) could reach the deploy->done merge-verify under-gate PR-less -> merge_pr returned "no open PR" -> a FALSE HOLD (ORCH-074 incident). - merge_gate.ensure_open_pr(repo, branch) -> (status, detail): idempotent leaf-actor (never-raise). GET open PRs filtered head==branch AND base==main (identical to merge_pr/ORCH-073 FR-3 — auto docs-PR is not a code-PR) -> existed; else POST -> created; 409/422 race -> re-GET -> existed (no dup); any other error -> failed. - stage_engine._handle_merge_verify: врезка after validated_revision and BEFORE merge_pr. created\|existed -> proceed; failed -> honest HOLD via new _hold_pr_create_failed (note "pr-create-failed-hold", text distinguishable from the not-merged HOLD; task stays on deploy, NO rollback). - launcher._ensure_pr delegated to ensure_open_pr (single PR-creation path, shared head==branch & base==main filter); the developer-only trigger is unchanged. - ORCH-073 protection untouched & authoritative: merge is confirmed ONLY by verify_merged_to_main (SHA-in-main) + check_main_regression. Real un-merged code still HOLDs. - Kill-switch ORCH_MERGE_VERIFY_AUTOCREATE_PR_ENABLED (default true); scope = merge_verify_applies (self-hosting / merge_verify_repos); non-self -> no-op; false -> ORCH-074 behaviour 1:1. No DB migration; main never push/force-push. - Append ORCH-082 marker to MAIN_REGRESSION_MARKERS (append-only convention). - conftest defaults the autocreate flag OFF (mirrors merge_verify_enabled) so unrelated deploy->done tests stay 1:1 (no network). Tests: tests/test_orch082_ensure_pr.py (TC-01..05), tests/test_orch082_merge_verify_autocreate.py (TC-06..12). Docs: README merge-verify block (ORCH-082), CHANGELOG, .env.example. Refs: ORCH-082 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 00:57:08 +03:00
claude-bot	1ada41f272	fix(effort): per-role floor for --effort resolution + developer→xhigh resolve_agent_effort returned '' for all agents in prod because empty ORCH_AGENT_EFFORT_*= env vars clobber pydantic class-defaults, leaving no non-empty floor to fall back to -> --effort never reached the Claude CLI. Add a level-4 per-role floor in resolve_agent_effort (src/agents/launcher.py): _agent_effort_floor reads the declared class-default of agent_effort_<agent> (model_fields[...].default), which a present-but-empty env cannot override. Floor applies only when levels 1-3 are empty and BEFORE validation, so a typo (non-empty) still drops to '' (never-break ORCH-41) and explicit env/override still wins (priority preserved). config.py: agent_effort_developer high->xhigh (single source of truth; floor follows automatically). Refs: ORCH-081 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 22:50:47 +03:00
claude-bot	0873803faa	feat(launcher): drop dead frontmatter model + validate model name (never-break) G1: remove the dead `model:` line from all 6 .openclaw/agents/.md prompts — launcher never read it; config (agent_model_) is the single source of truth. G2: add is_valid_model helper (format check ^claude-…$) applied inside resolve_agent_model's resolution cascade and at the inline --fallback-model read in _spawn. An invalid name is logged and skipped to the next valid level (in the limit: no --model flag), never passed to the CLI, never raises. Format check chosen over an allowlist for forward-compatibility (ADR-001). G3 (routing) and G4 (fallback) intentionally NOT enabled — all agents stay on claude-opus-4-8; agent_fallback_model stays "". Docs (golden source) updated in the same change: README model/effort table + validation, CLAUDE.md, .env.example (ORCH_AGENT_MODEL_/EFFORT_/FALLBACK_MODEL), CHANGELOG. Tests: test_agent_frontmatter_no_model.py (G1), extended test_resolve_agent_model.py (G2 never-break). Refs: ORCH-074 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 22:00:54 +03:00
claude-bot	a74379f657	feat(ORCH-026): task dependencies (B waits for A) + single-repo merge serialization Level A — merge/deploy serialization within one repo: reuse the existing ORCH-043/065 merge-lease (no new mechanism); the only new logic is an unconditional pre-merge rebase in check_branch_mergeable — under the held lease, auto_rebase_onto_main is ALWAYS called when premerge_rebase_always (default True), not just when the branch is behind. No-op on an up-to-date branch (rebase keeps HEAD, force-with-lease -> "Everything up-to-date", CI not triggered). Kill-switch off -> ORCH-043 behaviour 1:1. Level B — declarative task dependencies: additive job_deps table (CREATE ... IF NOT EXISTS, no live-DB migration); claim_next_job gate (NOT EXISTS) defers a job whose depends-on tasks are not yet 'done' without occupying a max_concurrency slot; inert on empty job_deps -> zero regression. New leaf src/task_deps.py (never-raise): is_task_ready (fail-open), DFS cycle detection + Blocked/alert, declare/ingest_plane_relations (db source never hits the network on the hot path), snapshot. Telegram waiting-line, /queue observability, reconciler skip + cycle backstop, reaper untouched. Invariants unchanged: STAGE_TRANSITIONS, QG_CHECKS registry (dep gate is a claim_next_job врезка, not a registered QG), DB schema of existing tables, HTTP endpoints; non-self repos remain a no-op on empty deps/scope. Flags: ORCH_PREMERGE_REBASE_ALWAYS, ORCH_TASK_DEPS_ENABLED, ORCH_TASK_DEPS_SOURCE. Docs: docs/architecture/README.md, CLAUDE.md, .env.example, CHANGELOG.md, adr-0015. Tests: tests/test_orch026_*.py (64 tests); full suite 991 green. Refs: ORCH-026 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 19:17:44 +03:00
claude-bot	aff334e82b	fix(merge-gate): SHA-in-main as sole merge-verify criterion + main regression guard Root-cause fix for main erosion (phantom merge): code of ORCH-067/069 reached `done` while absent from origin/main (only their auto docs-PRs landed). - FR-1: verify_merged_to_main confirms merge ONLY by `git merge-base --is-ancestor <validated_sha> origin/main`; the OR-branch pr_already_merged is removed (a merged PR no longer confirms). Empty SHA / git error -> False. - FR-2: pr_already_merged demoted to merge_pr idempotency-guard; counts a PR only when merged & head.ref==<branch> & base.ref=="main" (explicit in-loop filter). - FR-3: merge_pr selects the open code-PR by head==<branch> AND base==main. - FR-5: new deterministic check_main_regression in _handle_merge_verify (after confirmed SHA-in-main, before done) verifies MAIN_REGRESSION_MARKERS still in origin/main; deterministic count==0 -> alert "main regressed" + HOLD (NOT done, no rollback); git error of the grep -> fail-open. Kill-switch ORCH_REGRESSION_GUARD_ENABLED; non-self -> no-op. - FR-4: root .gitattributes `CHANGELOG.md merge=union` so Unreleased edits auto-merge on rebase without conflict (branch not rolled back). Invariants unchanged (STAGE_TRANSITIONS, QG_CHECKS, deploy-status, merge-gate, image-freshness, DB schema, external HTTP API); non-self repos no-op (INV-5); never-raise (INV-1); merge only via Gitea PR-API (INV-2). Docs: CHANGELOG, .env.example (README/ADR updated by architect). Tests: tests/test_orch073_*.py (TC-01..18); existing merge-gate tests updated for the new code-PR filter. Refs: ORCH-073 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 16:30:46 +03:00
stream	183e6d68bc	restore: re-merge ORCH-069 qg0_title_max All checks were successful CI / test (pull_request) Successful in 21s Details # Conflicts: # CHANGELOG.md	2026-06-08 14:58:30 +03:00
claude-bot	0ed05417e6	feat(qg0): configurable QG-0 title limit via ORCH_QG0_TITLE_MAX (default 200) Replace the hardcoded `len(name) > 80` cap in the QG-0 entry validation (_qg0_errors) with a configurable Settings.qg0_title_max (env ORCH_QG0_TITLE_MAX, default 200). The 80-char cap was a hygiene limit, not structural, so valid 81-200 char titles were rejected without a business reason. The limit is read dynamically per call and the error text interpolates the active value. Graceful degradation (AC-3, self-hosting safety): an empty/non-numeric env value no longer crashes the process on startup. A field_validator(mode="before") intercepts the raw env before int-parsing and falls back to 200 (never raises), suppressing pydantic ValidationError. Additive and backward-compatible (default 200 > old 80). Invariants unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, DB schema, slug [:30], lower limits, soft-QG-0 warning path, API. Refs: ORCH-069 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 11:24:01 +00:00
claude-bot	f330a580c4	docs(tracker): update CHANGELOG, CLAUDE.md, .env.example for ORCH-067 Закрывает P0/P1 ревью (attempt 2/3): документация = golden source. - CHANGELOG.md: запись ORCH-067 в [Unreleased] (bump-дефолт, статус-строка карточки по модели ORCH-066, кликабельный номер задачи, новые флаги). - CLAUDE.md: раздел «Нотификации / Telegram live-tracker» (ТЗ §5). - .env.example: ORCH_TRACKER_MODE=bump (синхрон с новым дефолтом) + ORCH_TRACKER_LIVE_STATUS / _TTL_S / _TIMEOUT_S. Refs: ORCH-067 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 10:34:33 +00:00
stream	2ec6873e33	integ: merge ORCH-068 reconciler livelock fix # Conflicts: # docs/architecture/README.md # src/reconciler.py	2026-06-08 06:36:29 +00:00
claude-bot	1d72c44587	fix(reconciler): stop F-2 livelock spam on synced terminal tasks + cache TTL Reconciler F-2 spammed Telegram "<wi> разблокирована" every ~120s for a fully-synchronized Done task (incident ET-002, 191+ msgs/night) after the ORCH-066 Plane status model merge. Two stacked defects (defense in depth): - D1 (selection): actionable states were told apart by bare UUID, so a Done issue aliased onto the approved UUID entered the approved branch. Now terminal states are excluded by Plane state GROUP (completed/cancelled), a project-independent discriminator robust to UUID aliasing; per-issue check with a logical-key fallback when the group is unavailable. get_project_states caches {uuid -> group} from the same /states/ fetch; new sibling accessor get_project_state_groups. - D2 (notification): _note_unblock fired unconditionally after _dispatch. Now it only fires on a confirmed state change (stage before/after _dispatch; task-appears for the start case) — handlers' contracts untouched. - TR-3: in-memory dedup guard {issue_id -> last unblocked state} as a backstop. - TR-4: _STATES_CACHE lived for the whole process lifetime, so a new Plane status was invisible without a restart. Added TTL ORCH_PLANE_STATES_TTL_S (default 300s; 0 = previous lifetime cache) reusing reload_project_states(); a failed refresh serves the stale-but-correct set, not enduro defaults. STAGE_TRANSITIONS / QG_CHECKS / DB schema / handle_* contracts / F-1 / F-3 unchanged; never-raise preserved; self-hosting tick never restarts prod. Observability: skipped_terminal_total / deduped_total in /queue reconcile block. Tests: tests/test_reconciler_plane.py (TC-01..TC-10), tests/test_plane_states_cache.py (TC-11/TC-12). Refs: ORCH-068 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 05:18:46 +00:00
claude-bot	30b6187c73	feat(security): security-gate (gitleaks secret-scan + pip-audit) before merge Add a deterministic (no-LLM) security sub-gate on the deploy-staging -> deploy edge, run FIRST (before merge-gate ORCH-043 and image-freshness ORCH-058) so it fails cheaply before any expensive rebase/rebuild, and scans origin/main..HEAD before rebase so a task is never blamed for a CVE introduced by an updated main. Why: the autonomous pipeline merged branches into main with no check for a leaked secret or a vulnerable dependency. For the self-hosting orchestrator (one shared prod instance serving every project from a shared DB) a single leak/CVE landed in the prod of all projects (CLAUDE.md self-hosting, section 8). - New leaf src/security_gate.py (never-raise): gitleaks (offline, fail-closed on tool error => secrets guarantee is unconditional) + pip-audit (best-effort; unreachable CVE feed degrades fail-open + loud warning by default, strict via security_dep_audit_fail_closed). Verdict lives ONLY in 17-security-report.md YAML frontmatter (write -> read-back single source of truth); FAIL is authoritative; missing/broken frontmatter => fail-closed. - check_security_gate thin wrapper registered in QG_CHECKS (lazy import, no cycle). - _handle_security_gate wired FIRST in advance_stage deploy-staging block: FAIL -> rollback to development + developer-retry (cap MAX_DEVELOPER_RETRIES); task_desc carries verbatim findings (ORCH-046 pattern). No merge-lease release (runs before lease acquire). Self-hosting safe: only reads/scans/writes, never deploys. - Conditional rollout (security_gate_enabled + security_gate_repos; empty scope -> self-hosting only). 6 new ORCH_SECURITY_* settings. - Infra: pinned gitleaks Go binary in Dockerfile (+curl/ca-certificates), pip-audit in requirements.txt, versioned .gitleaks.toml at repo root. - STAGE_TRANSITIONS and DB schema unchanged. Docs: docs/architecture/README.md (marked realized), CLAUDE.md (artifact 17), CHANGELOG.md. Tests: test_security_gate.py, test_qg_security.py, test_stage_engine_security_gate.py + updated registry/edge snapshots. Refs: ORCH-022 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 18:04:50 +00:00
claude-bot	720c31393a	fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance) Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	4bebb921ff	feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization Closes the "zombie jobs" incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 one zombie blocked ALL projects' queue (self-hosting risk). Adds a background daemon (src/job_reaper.py) with three-tier liveness (dead-pid streak / known exit_code / max-running backstop) whose only mutating write is an atomic terminal flip guarded by WHERE status='running' (no double-process). For exit0 the canonical QG is the source of truth via gate-driven advance, not "exit0". Also proactively reclaims stale merge-lease (dead pid OR TTL) via file delete only (no git ops), and makes merge finalization idempotent (pr_already_merged guard + up-to-date short-circuit on re-drive). New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; "reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED. Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. restart-safe, never-raise per unit of background work. Docs: docs/architecture/README.md, CHANGELOG.md, .env.example. Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py, tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17), tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	2f4c553fd8	feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021) Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod"). - src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise. - Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job. - Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY. - Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY. - Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only. - QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12). - Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*). Refs: ORCH-021 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:40:06 +00:00
claude-bot	9070489968	fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict Some checks failed CI / test (push) Failing after 39s Details CI / test (pull_request) Failing after 35s Details The self-hosting orchestrator looped on deploy-staging -> development because scripts/staging_check.py exited 1 on ANY failed check, so two infra-only checks (C9a sandbox branch / C9b analyst-job — caused by SANDBOX bot accounts not being members of the sandbox Plane project, NOT a pipeline regress) forced staging_status: FAILED -> rollback -> loop, burning developer retries and tokens. Direction (б) per ADR-001: classify staging checks as REAL (all pipeline checks, fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable). New leaf module src/staging_verdict.py (stdlib-only, never-raise): classify_check + compute_staging_verdict fold per-check results into a tolerant-but-fail-closed verdict — any REAL failure -> FAILED/exit1 (safety net holds under any flag); only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list; only infra & strict -> FAILED/exit1; any internal error -> FAILED/exit1 (never a false green). staging_check.py now auto-classifies each check (public 3-tuple _items shape kept as an ORCH-048 b6 regression guard), exposes categorized_items(), prints INFRA-WAIVED/VERDICT lines, and exits via the verdict; new --strict flag forces legacy strictness per-run. Kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED (default true) restores legacy strict mode globally. launcher gains action_stage_no_changes_note so "no changes to commit" on action stages is logged as expected, not treated as under-delivery. Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status:/ deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB migration. Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG. Refs: ORCH-061 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 12:39:00 +00:00
claude-bot	4db8276f98	fix(reconciler): skip escalated / Blocked / Needs-Input tasks in F-1 All checks were successful CI / test (push) Successful in 16s Details CI / test (pull_request) Successful in 16s Details Reconciler F-1 could not tell "stuck by a lost webhook" from "escalated: max developer retries reached, waiting for a human". With CI green and a reviewer that kept sending REQUEST_CHANGES up to the cap, every tick re-unblocked development -> review -> rollback -> re-unblock (incident ET-013, infinite bounce: wasted agent runs, Telegram spam, parasitic load on the shared self-hosting instance). Add two pre-gate guards in Reconciler._reconcile_gate_task (after the existing analysis/no-gate/active-job/grace guards, before the gate pre-evaluation), each an early silent return (no advance, no unblocked_total increment, no notifications): - Guard 1 (escalated, deterministic, no network, checked first): developer_retry_count(task_id) >= MAX_DEVELOPER_RETRIES. Promote stage_engine._developer_retry_count to public developer_retry_count (single source of truth; private alias kept). Limit from the constant, not a literal 3. - Guard 2 (explicit human Plane gate, Variant A, no DB migration): new never-raise plane_sync.fetch_issue_state + Reconciler._is_blocked_or_needs_input; any error/None/unresolved project -> conservative skip. New sub-flag ORCH_RECONCILE_SKIP_BLOCKED_ENABLED mutes only the networked Guard 2. F-2 unchanged: Blocked/Needs Input are outside {in_progress, approved, rejected} so they are never replayed (regression test added). DB schema, STAGE_TRANSITIONS, QG_CHECKS, never-raise, analysis carve-out and kill-switches untouched. Refs: ORCH-060 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 11:50:02 +00:00
claude-bot	6ddff5583d	fix(ORCH-058): parametrize staging_check in --build-staging + explicit staging target All checks were successful CI / test (push) Successful in 19s Details CI / test (pull_request) Successful in 18s Details Round-3 review follow-up on `c53d625` (P1/P2): - P1: --build-staging now runs staging_check via parametrized STAGING_CONTAINER / STAGING_CHECK_PATH / STAGING_CHECK_MODE (default orchestrator-staging / bind-mount path / stub) instead of hardcoding $TARGET_SERVICE + the script path. docker exec runs INSIDE the staging container (ORCH-048 canonical: B6 registry isolation), after health, before exit 0. Fail-closed: any non-zero -> exit 1. STAGING only (8501). - P2a: rebuild_staging_image now passes the STAGING target EXPLICITLY (TARGET_SERVICE/TARGET_PORT/COMPOSE_PROFILE/STAGING_CONTAINER) so the self-rebuild can never drift onto prod 8500 if hook defaults change (AC-9). - P2b: TC-09 caller<->hook contract tests assert the ssh command carries GIT_SHA + BUILD_CONTEXT + the staging target and never the prod 8500 one; no-ssh-host fails closed. - P3: consolidated the three duplicate README footers into one. - Docs (golden source): DEPLOY_HOOK.md step 4 + env rows, README footer, CHANGELOG, Dockerfile ARG GIT_SHA="" comment, .env.example freshness block. Validates exactly the artefact later BUILD-ONCE retagged to prod (AC-4, ADR-001 step 3). 632 tests pass, ruff clean, bash -n OK. Refs: ORCH-058 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 09:24:38 +00:00
claude-bot	3b3d587300	docs(ORCH-058): add CHANGELOG entry, .env.example flags, fix README status All checks were successful CI / test (push) Successful in 17s Details Close AC-11 documentation gap left by the prior developer run: the ORCH-058 feature (staging-image provenance before BUILD-ONCE retag) was implemented and green but never recorded in the golden-source docs. - CHANGELOG.md: add the ORCH-058 [Unreleased]/Added entry (layers A+B, validated_revision anchor, check_staging_image_fresh, EXPECTED_REVISION hook guard, new ORCH_IMAGE_FRESHNESS_* flags, ADR/test refs). - .env.example (canon): document ORCH_IMAGE_FRESHNESS_ENABLED / ORCH_IMAGE_FRESHNESS_REPOS, mirroring the ORCH-036/043/053 precedent. - docs/architecture/README.md: footer note design -> реализовано, aligning it with the already-updated section. Refs: ORCH-058 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 08:27:57 +00:00
stream	36c1898fac	Merge remote-tracking branch 'origin/main' into feature/ORCH-036-orch-36-deploy-b All checks were successful CI / test (push) Successful in 16s Details CI / test (pull_request) Successful in 14s Details # Conflicts: # .env.example # CHANGELOG.md # docs/architecture/README.md # docs/operations/INFRA.md # src/config.py	2026-06-07 00:22:19 +03:00
claude-bot	d79defeadd	fix(deploy): clear stale self-deploy markers on rollback; document env Re-deploy after a FAILED prod deploy wedged the task on `deploy`: the sentinel markers (approve-requested/initiated/result) are keyed by the stable work_item_id, so after the БАГ-8 rollback (deploy -> development) and a developer fix, Phase B's idempotency-guard saw a STALE `initiated` and became a no-op — the detached hook never re-launched and the finalizer was never enqueued. Add self_deploy.clear_state (never-raise, idempotent) and call it on the check_deploy_status FAILED rollback and at the start of Phase A, so every fresh prod-deploy pass starts clean. Also document the new ORCH_SELF_DEPLOY_* / ORCH_DEPLOY_* descriptors in the canonical .env.example (CLAUDE.md rule #8, ТЗ §2.6), modelled on the ORCH-043 merge-gate block (placeholders only, secrets not committed). Contracts untouched: STAGE_TRANSITIONS, QG_CHECKS, _parse_deploy_status, БАГ-8, merge-gate. Refs: ORCH-036 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-06 21:07:35 +00:00
claude-bot	7d2d77217a	feat(reconciler): sweeper потерянных webhook (реконсиляция застрявших стадий) Конвейер продвигается только входящими webhook; потерянное событие (502 на ребилде, отсутствие ретраев у Plane/Gitea, неразрезолвленный sha→branch) оставляет задачу молча застрявшей (класс инцидента ORCH-044). Новый фоновый daemon-поток src/reconciler.py (паттерн queue_worker) доигрывает пропущенный переход через те же штатные гейты/обработчики, что и webhook: - F-1 gate-side: для задач stage≠done, без активного job и age(updated_at) ≥ grace_for_stage(stage) — read-only пред-оценка канонического QG; зелёный → stage_engine.advance_stage(..., finished_agent=None); красный → тишина (спам нотификаций структурно невозможен). analysis F-1 не трогает (человеческий гейт). - F-2 plane-side: опрос Plane API per-project (plane_sync.list_issues_by_state, курсорная пагинация, never-raise) → реплей In Progress/Approved/Rejected через существующие handle_status_start/handle_verdict (async из sync-потока, asyncio.run). - F-3: усиление sha→branch в handle_ci_status — БД-fallback по единственной development-задаче repo (неоднозначность → не резолвим), debug→info. - Анти-дубль на создании (db.create_task_atomic под process-wide Lock): гонка reconcile↔webhook не плодит второй task/branch/worktree/analyst-job (AC-4). - F-4 observability: лог-строка разблокировки + Telegram + блок reconcile в /queue. Старт/стоп в main.lifespan (после worker.start() / перед worker.stop()), restart-safe, never-raise на единицу работы. Kill-switches ORCH_RECONCILE_ENABLED / ORCH_RECONCILE_PLANE_ENABLED + grace-настройки. Схема БД и реестры STAGE_TRANSITIONS/QG_CHECKS не менялись. Тесты: test_reconciler.py, test_reconciler_plane.py, test_gitea_sha_resolve.py, test_config.py (33 новых, 563 всего зелёные). Документация обновлена (golden source): architecture/README.md, INFRA.md, README.md, CHANGELOG.md, adr-0007 → accepted. Refs: ORCH-053 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-06 20:55:25 +00:00
claude-bot	00d69d9e27	feat(merge-gate): auto-rebase onto current main + re-test + serialise merges All checks were successful CI / test (push) Successful in 15s Details CI / test (pull_request) Successful in 17s Details Deterministic (no-LLM) sub-gate on the deploy-staging -> deploy edge that catches a feature branch up to the CURRENT origin/main, re-tests the combined tree, and serialises merges with a per-repo file lease — so two green parallel branches can no longer break main (self-hosting safety for the orchestrator repo). - src/merge_gate.py: branch_is_behind_main, auto_rebase_onto_main (push --force-with-lease ONLY the task branch, NEVER main), retest_branch, and a file merge-lease (atomic O_CREAT\|O_EXCL, holder-aware release, stale reclaim). Strict never-raise contract; all git ops in the per-branch worktree. - src/qg/checks.py: check_branch_mergeable composes the primitives under the lease; registered in QG_CHECKS. Conditional rollout (merge_gate_enabled / merge_gate_repos, default self-hosting only). - src/stage_engine.py: sub-gate hook on deploy-staging (not a new stage). PASS -> advance; "merge-lock busy" -> DEFER (re-queue with available_at, anti-deadlock at max_concurrency=1, capped); conflict/red re-test -> rollback to development + developer retry (capped by MAX_DEVELOPER_RETRIES). Lease released on deploy->done / rollback / PR-merged webhook. - src/db.py: enqueue_job(available_at_delay_s=...) for the defer (no schema change). - src/webhooks/gitea.py: holder-aware lease release on PR-merged. - src/config.py + .env.example: ORCH_MERGE_* settings. Docs: README + adr-0006 (architect) already cover the design; CHANGELOG updated. Tests: test_merge_gate.py, test_qg_merge_gate.py, test_merge_gate_race.py, test_stage_engine.py::TestMergeGate, test_config.py, QG-registry snapshot. Full suite: 535 passed. Refs: ORCH-043 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-06 17:32:50 +00:00
claude-bot	05c17135c1	feat(notifications): add bump mode + russify Telegram live-tracker All checks were successful CI / test (push) Successful in 13s Details CI / test (pull_request) Successful in 13s Details ORCH-042: new ORCH_TRACKER_MODE (Settings.tracker_mode, default edit) selects the live-tracker card behaviour. bump mode re-creates the card at the bottom of the chat on every update (delete_telegram + send silently + repoint message_id), keeping the "one card per task" invariant: <=1 new message per call, repoint only on successful send, delete result never gates the send. New never-raising delete_telegram helper. Anything != "bump" resolves to edit (zero regression). Also russify/cosmetic-fix the card text (both modes): "Подтверждение BRD" label, ✅ after approve-gate, Russian stage labels, "📦 Внедрено". Docs updated in the same PR (CHANGELOG, internals.md, .env.example). Refs: ORCH-042 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-06 10:13:49 +00:00
claude-bot	69a4aaab99	feat(notifications): direct BRD + Plane links in approve ping (ORCH-017) All checks were successful CI / test (push) Successful in 12s Details CI / test (pull_request) Successful in 12s Details notify_approve_requested now embeds two HTML <a> links into the single notifying approve-gate message: a Gitea branch-view link to 01-brd.md and a Plane issue browser link. Adds ORCH_PLANE_WEB_URL (external Plane web URL, fallback to plane_api_url) with a loopback-guard that omits the Plane link when the resolved base is localhost/empty (no broken localhost URLs in prod). Each link is built independently and omitted on missing data; the message and the "flip to Approved" call to action are always sent as exactly one ping. The shared send_telegram helper is left untouched (min blast radius for the self-hosting prod container). Dynamic labels are html.escaped; parse_mode=HTML preserved. QG registry / stages / approve handler unchanged. Docs updated in-PR: CHANGELOG, .env.example, INFRA env map. Tests: test_notify_approve_links.py, test_analysis_approve_flow_links.py. Refs: ORCH-017 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-05 17:58:00 +00:00
Dev Agent	daf8cdad9e	feat: orchestrator MVP — webhooks, agent launcher, QG checks	2026-05-19 15:57:00 +03:00

37 Commits