orchestrator

Author	SHA1	Message	Date
claude-bot	238de9ba44	feat(estimator): task estimation triggered by Plane status «Оценка» (ORCH-020) All checks were successful CI / test (push) Successful in 1m20s Details CI / test (pull_request) Successful in 1m22s Details Add a deterministic task-estimation side-mechanism: a new operator Plane status «Оценка» (third action-status, family STOP/Confirm Deploy) triggers a new never-raise leaf src/estimator.py that forecasts cost/time/tokens/story points {1,2,3,5,8} from the history of completed tasks (no LLM — ADR-001 D1), writes the forecast to Plane (estimate_point + comment), the Telegram card and the additive task_estimates ledger (UPSERT by work_item_id), then returns the issue to Backlog. On task completion the fact (from usage.py) is written to Plane `point`. Massivity is free (Plane multi-select -> N webhooks); re-estimate is idempotent; anti-disruption skips in-flight tasks; anti-loop (Backlog matches no trigger branch). INVARIANT (NFR-1/NFR-3): the estimator is an observer/producer, never a Quality Gate or stage — STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas and the hot launch path (resolve_agent_model/_spawn) are byte-for-byte untouched. fail-closed: `estimate` absent from _DEFAULT_STATES -> a board without the status is inert (zero regression, enduro untouched). - config: ORCH_ESTIMATOR_* flags (kill-switch + CSV scope empty->self-hosting only, bootstrap defaults, story-point cost thresholds, wall cap). - db: additive task_estimates table + record_estimate/set_actual/get_estimate/ estimates_snapshot + read-only completed_task_stats history aggregate. - plane_sync: «Оценка»->estimate name map (NOT in _DEFAULT_STATES); set_issue_ backlog/set_issue_point/set_issue_estimate_point + get_project_estimate_points (all via ORCH-117 write-guard, best-effort/fail-safe). - webhooks/plane: fail-closed estimate branch + handle_estimate (anti-disruption, auto-return to Backlog, off-loop). 
- stage_engine: best-effort fact-on-done hook (after terminal decision). - notifications: «Оценка» card line (read from ledger, omitted when empty). - main: read-only `estimator` GET /queue block + optional POST/GET /estimate. - onboarding canon + Lite/Bundled/ONBOARDING docs: 23rd status «Оценка» (group unstarted, never terminal); anti-drift count pins bumped 22->23. - tests: tests/test_orch020_estimator.py (TC-01..TC-20), full suite green. Refs: ORCH-020 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 21:47:00 +03:00
claude-bot	dbd8df6eb2	docs(operations): add user FAQ for STOP cancellation status (ORCH-108) All checks were successful CI / test (push) Successful in 1m15s Details CI / test (pull_request) Successful in 1m16s Details Create docs/operations/FAQ_STOP.md, a user-facing "question → answer" FAQ for Plane board users explaining the STOP status: what it does, how to cancel a task, step-by-step consequences (agent stops → jobs cancelled → working branch and worktree removed → task → cancelled → Telegram+Plane; docs artifacts preserved, main/prod untouched), deferred cancellation in the critical merge/deploy window, explicit "STOP does NOT revert merged/deployed code" (revert is a separate task), restart only via "To Analyse" from scratch, no-op causes, where to observe the result, and STOP/Approved/Confirm Deploy disambiguation. docs-only: src/*, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict keys and the DB schema are byte-for-byte untouched. STOP behaviour source of truth remains ORCH-090 (adr-0026); the FAQ documents and links to it (link-first: machine details such as markers/lease/tombstone are given by reference, not duplicated). Add two-way cross-links: docs/overview/business.md (Scenario 6) and docs/overview/tech-pipeline.md (Cancellation: STOP → cancelled) → FAQ; FAQ → overview + ADR ORCH-090. Structure guarded by deterministic anti-drift test tests/test_faq_stop_doc.py (offline, no network/LLM/subprocess; mirrors tests/test_lite_setup_doc.py): existence + 8 section anchors + fact bricks + cross-links + claim-level negative scan (sentence-level, not bare substrings, so it never false-fails on correctly negating phrases; TR-3, with a non-evergreen self-check). Full pytest tests/ green (2227 passed). Refs: ORCH-108 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 17:50:58 +03:00
claude-bot	3d3f07ff05	docs(changelog): restore ORCH-119 entry dropped by rebase auto-merge Some checks failed CI / test (push) Has been cancelled Details CI / test (pull_request) Successful in 1m16s Details The merge-gate's auto_rebase_onto_main silently dropped the ORCH-119 CHANGELOG bullet during a same-anchor 3-way merge: origin/main's ORCH-120/126 entries were kept while the ORCH-119 insertion was lost. Re-spliced the entry verbatim under ## [Unreleased] alongside 120/126. Refs: ORCH-119 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 15:10:14 +03:00
claude-bot	19c31778b2	docs(overview): sync system showcase with analyst open-questions auto-park (ORCH-120) All checks were successful CI / test (push) Successful in 1m15s Details CI / test (pull_request) Successful in 1m14s Details Address reviewer P1 (ORCH-011/ORCH-079 axis, agent rule no. 6): the showcase described the serial-gate pause as purely operator-driven, but ORCH-120 added an engine-driven auto-park/unpark on analyst Needs Input. - tech-pipeline.md: the pause paragraph now names two sources (operator + engine auto-park on Needs Input, flag analyst_needs_input_autopause_enabled, self-hosting scope, symmetric unpark on resume). - tech-observability.md: the pause item in GET /queue covers both sources. - tech-agents.md: the analyst's when-applicable signal channel 01-questions.md (table row + explanatory note; not a machine-verdict, not a deliverable). - CHANGELOG: the ORCH-120 entry gains a line about the showcase update. tests/test_system_docs.py green (29 passed). src/STAGE_TRANSITIONS/QG_CHECKS untouched: docs-only. Refs: ORCH-120 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:28:06 +03:00
claude-bot	d6b495f156	fix(analysis): activate analyst open-questions -> Needs Input flow (ORCH-120) All checks were successful CI / test (push) Successful in 1m14s Details CI / test (pull_request) Successful in 1m11s Details Activates and completes the previously dead "analyst asks BLOCKING questions -> 01-questions.md -> Needs Input" path. Four coordinated changes, additive, under kill-switch, self-hosting scope, never-raise; STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are byte-for-byte UNCHANGED (the flow is a pre-gate engine branch, NOT a Quality Gate; 01-questions.md is a SIGNAL artifact, NOT a machine-verdict). - D1 contract + canon: analyst.md documents the 01-questions.md channel (blocking questions -> Needs Input, do NOT fabricate deliverables) + resume behaviour; new skeleton docs/_templates/01-questions.md; PIPELINE_DOCS.md manifest row + 01- prefix note. - D2 freshness-supersede (DQ-2): pure offline mtime predicate questions_active in the new leaf src/analyst_questions.py (a full FRESH package supersedes a stale untouched 01-questions.md -> no Needs-Input loop, AC-6). - D3 priority: questions take priority over "files ready" in _handle_analysis_approved_flow (_decide_analysis_outcome + _emit_analysis_*); off/out-of-scope runs the ORIGINAL byte-for-byte order (AC-9). - D4 auto-park: set_task_paused on Needs Input via the ORCH-124 pause axis so the repo serial-gate FIFO is not wedged while waiting for a human (AC-4); D5 resume + unpark (clear_task_paused) in handle_status_start (analysis branch). Flags (config.py, safe defaults): analyst_questions_gate_enabled / analyst_questions_gate_repos (empty -> self-hosting only) / analyst_needs_input_autopause_enabled. Tests: test_orch120_analyst_needs_input.py (TC-01 regress + TC-02/03/06/09/10), test_orch120_serial_gate_needs_input.py (TC-04), test_orch120_resume_unpark.py (TC-05), test_orch120_questions_artifact_canon.py (TC-08), assert in test_agent_prompts_canon.py (TC-07). 
Full suite green (2205 passed). Refs: ORCH-120 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 13:15:27 +03:00
claude-bot	d7e7a4d817	fix(queue): enforce queued ⇒ no run-ownership invariant (ORCH-126) All checks were successful CI / test (push) Successful in 1m14s Details CI / test (pull_request) Successful in 1m15s Details Queued analyst-jobs hung forever even with ORCH_SERIAL_GATE_ENABLED=false (incident ORCH-124/125, job 2286: queued + run_id=759/760 + pid=35/42 + started_at=NULL — physically impossible). No path returning a job to 'queued' reset its run-ownership (run_id/pid); after a container restart a reused pid made pid_alive(stale)=True, so the job-reaper Tier-1 saw a phantom 'running' and at max_concurrency=1 wedged the claim of the whole shared queue. Enforce the invariant `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND started_at IS NULL` on existing columns (no schema change): - D1 forward-cleanup: requeue_running_jobs / mark_job('queued') / mark_job_transient / reap_running_job('queued') reset run_id=NULL, pid=NULL in the same UPDATE that clears started_at; atomic status-guards preserved. - D2 clean claim: claim_next_job resets pid/run_id on the queued->running flip (defense-in-depth) so the row carries pid IS NULL until _spawn stamps it. - D4 self-heal + observability: db.find_impossible_queued_jobs / sanitize_impossible_queued run at startup (main.lifespan) and on each reaper tick (JobReaper.sanitize_impossible_queued_once, never-raise); counter impossible_queued_total in the GET /queue reaper block. Kill-switch ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED (default on; gates only the D4 sweep). - D5: reaper Tier-1 unchanged — the fix restores its precondition (pid reflects THIS run). Marked invariants ORCH-065/113/114/099 preserved. Tests: tests/test_orch126_queued_stale_run.py (TC-01 mandatory regression red->green; TC-02..TC-10). Full pytest tests/ -q green (2189 passed). Docs: internals.md (run-ownership invariant section), .env.example, CHANGELOG; cross-cutting adr-0052. 
Refs: ORCH-126 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-17 11:39:26 +03:00
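The run-ownership invariant from ORCH-126 can be illustrated with a small sketch. This is not the project's code: the table layout is a hypothetical minimum, and the two function bodies are assumptions that only borrow the names `find_impossible_queued_jobs` and `sanitize_impossible_queued` from the commit message. It shows how a single predicate can both detect and repair rows that violate `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND started_at IS NULL`.

```python
import sqlite3

# Hypothetical minimal schema; the real jobs table is assumed to have more columns.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    status TEXT NOT NULL,
    run_id INTEGER,
    pid INTEGER,
    started_at TEXT
)
"""

# A queued row that still carries any run-ownership field violates the invariant.
IMPOSSIBLE = (
    "status = 'queued' AND "
    "(run_id IS NOT NULL OR pid IS NOT NULL OR started_at IS NOT NULL)"
)


def find_impossible_queued_jobs(conn):
    """Return ids of queued jobs that still carry run-ownership."""
    rows = conn.execute(f"SELECT id FROM jobs WHERE {IMPOSSIBLE} ORDER BY id")
    return [row[0] for row in rows]


def sanitize_impossible_queued(conn):
    """Reset run-ownership on violating rows; return how many were repaired."""
    cur = conn.execute(
        "UPDATE jobs SET run_id = NULL, pid = NULL, started_at = NULL "
        f"WHERE {IMPOSSIBLE}"
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?, ?)",
        [
            (2286, "queued", 759, 35, None),   # shape of the incident row
            (2287, "queued", None, None, None),  # healthy queued row
            (2288, "running", 761, 50, "2026-06-17T11:00:00"),  # legitimately owned
        ],
    )
    before = find_impossible_queued_jobs(conn)
    repaired = sanitize_impossible_queued(conn)
    after = find_impossible_queued_jobs(conn)
    return before, repaired, after
```

Calling `demo()` flags only the stale queued row, repairs it, and leaves the running job's ownership fields alone.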
claude-bot	58e5dfe55d	docs(serial-gate): sync system showcase + clean stray tags (ORCH-124) All checks were successful CI / test (push) Successful in 1m15s Details CI / test (pull_request) Successful in 1m12s Details Addresses reviewer REQUEST_CHANGES (run 768) on ORCH-124: docs-only, no src/tests touched, fix scope unchanged. P1: update docs/overview/ showcase for the new serial-gate "pause without blocking" axis (changed task-routing functionality, ORCH-011/ORCH-079): - tech-pipeline.md: FIFO exception "pause without blocking" next to freeze - tech-data-model.md: durable signal tasks.paused_at on the Task row - tech-observability.md: paused/reason in serial_gate GET /queue block + operator endpoints POST /serial-gate/pause|resume P2: strip leaked tool-call trailing tags (</content>/</invoke>) from 4 golden-source docs of this PR (06-adr/ADR-001, adr-0051, 08-data-requirements.md, 10-tech-risks.md). CHANGELOG "Docs" bullet extended accordingly. Full suite green (2178 passed); test_system_docs.py green (machine-checked showcase facts intact). Refs: ORCH-124 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 21:50:45 +03:00
claude-bot	3a1972875f	fix(tests): isolate repos_dir in ORCH-123 staging-runner test fixture All checks were successful CI / test (push) Successful in 1m13s Details CI / test (pull_request) Successful in 1m12s Details The deterministic test-runner gate (full `pytest tests/`) failed on test_orch123_staging_runner_exec.py::test_r2_held_deploy_staging_not_rolled_back once ORCH-124 reached the testing stage. Root cause (pre-existing latent regress, surfaced — not introduced — by ORCH-124): the fixture isolated `worktrees_dir` but not `repos_dir`. `check_staging_status` falls back to `<repos_dir>/<repo>` (and its origin/main) when the feature worktree is absent. After ORCH-123 merged, the real `/repos/orchestrator/docs/work-items/ORCH-123/15-staging-log.md` (verdict SUCCESS) exists on disk, so the intended-RED staging gate read it and went green -> advance_stage was called -> the R-2 assertion failed. Order-dependent: the test passed alone, failed in the full suite. Fix: isolate `settings.repos_dir` to an empty tmp subdir in the fixture (mirroring the existing worktrees_dir isolation) so the staging gate is deterministically "not found" -> red, regardless of suite ordering. The ORCH-123 R-2 invariant (a held deploy-staging task is never rolled back to development, adr-0049/ADR-001 D4) is preserved and strengthened — the fix only restores the test's stated premise. src/** / STAGE_TRANSITIONS / QG_CHECKS / check_* untouched (test-only change). Refs: ORCH-124 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 20:12:28 +03:00
claude-bot	87af857082	fix(serial-gate): pause-without-blocking via per-task park signal (ORCH-124) All checks were successful CI / test (push) Successful in 1m12s Details CI / test (pull_request) Successful in 1m17s Details Fixes incident ORCH-116/ORCH-123: serial_gate defined a repo's "active task" purely by machine stage (tasks.stage NOT IN ('done','cancelled')). Plane statuses Backlog/Blocked/Needs-Input (layer-B indication, ORCH-066) do NOT change tasks.stage (layer A), so a paused predecessor was indistinguishable from an active one and held the FIFO gate closed against an urgent successor — the urgent fix could not start until the paused task was formally done. Introduces an explicit, durable, DB-resolvable per-task "park" signal — additive nullable column tasks.paused_at (pattern of cancelled_at/track) — and a new ORTHOGONAL scheduler "pause" axis. The serial-gate "active task" predicate becomes `stage NOT IN ('done','cancelled') AND paused_at IS NULL` across all three points (build_claim_clause / repo_has_active_task / _per_repo_snapshot). The terminal set {done,cancelled} in serial_gate/task_deps/stages.py is byte-for-byte unchanged (adr-0026 not regressed): task_deps/stages.py do NOT read paused_at, so a paused declared dependency and an active repo_freeze STILL block (pause never bypasses them — different axes). Anti-stale-base on resume relies on the existing deferred branch cut (ORCH-088) + pre-merge auto_rebase_onto_main + merge-gate re-test (ORCH-026/093/110) — no new rebase machinery. Additive, under an independent sub-flag, never-raise, restart-safe; hot-claim fail-OPEN and freeze fail-CLOSED preserved. STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas are byte-for-byte untouched (this is a queue-scheduler + observability change, not a Quality Gate). 
- src/db.py: additive tasks.paused_at column (_ensure_column) + set/clear/is helpers - src/serial_gate.py: _pause_layer_enabled() + pause-term in the 3 points; `paused` list + per-job `reason` (freeze>dependency>active-task>null) in the /queue snapshot - src/config.py + .env.example: serial_gate_pause_enabled (default True = true no-op) - src/main.py: POST /serial-gate/pause|resume?work_item=<id> (modelled on unfreeze) - tests/test_orch124_serial_gate_pause.py: TC-01 mandatory incident regress + TC-02..15 - CHANGELOG.md: [Unreleased] entry ADR: docs/work-items/ORCH-124/06-adr/ADR-001-serial-gate-pause-without-blocking.md Cross-cutting: docs/architecture/adr/adr-0051-serial-gate-pause-without-blocking.md Refs: ORCH-124 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 19:35:55 +03:00
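The "active task" predicate change in ORCH-124 is easier to see in a small sketch. The schema, the `active_clause` helper, and the `pause_enabled` parameter below are illustrative assumptions; only the predicate text and the name `repo_has_active_task` come from the commit message.

```python
import sqlite3

# Terminal set quoted in the commit message; left unchanged by the pause axis.
BASE_ACTIVE = "stage NOT IN ('done','cancelled')"


def active_clause(pause_enabled):
    """Build the active-task predicate, optionally excluding parked tasks."""
    if pause_enabled:
        return BASE_ACTIVE + " AND paused_at IS NULL"
    return BASE_ACTIVE


def repo_has_active_task(conn, repo, pause_enabled=True):
    """True when the repo has a task that should hold the serial gate closed."""
    sql = f"SELECT 1 FROM tasks WHERE repo = ? AND {active_clause(pause_enabled)} LIMIT 1"
    return conn.execute(sql, (repo,)).fetchone() is not None


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal tasks table.
    conn.execute(
        "CREATE TABLE tasks (id TEXT PRIMARY KEY, repo TEXT, stage TEXT, paused_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?, ?)",
        [
            ("ORCH-116", "orchestrator", "development", "2026-06-16T19:00:00"),  # parked
            ("ORCH-100", "orchestrator", "done", None),  # terminal
        ],
    )
    with_pause = repo_has_active_task(conn, "orchestrator", pause_enabled=True)
    without_pause = repo_has_active_task(conn, "orchestrator", pause_enabled=False)
    return with_pause, without_pause
```

With the pause term a parked predecessor no longer counts as active, so an urgent successor can be claimed; without it the same row keeps the gate closed, which is the incident behaviour.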
claude-bot	74fccf3a09	fix(testing): reconcile ORCH-116 with merged ORCH-123 (ADR renumber, CHANGELOG, env parity) All checks were successful CI / test (push) Successful in 1m12s Details CI / test (pull_request) Successful in 1m12s Details Recovery from the merge-gate rebase-conflict bounce. The feature branch was rebased onto origin/main (which had merged ORCH-123). The single conflicting hunk — docs/architecture/README.md — was resolved during the rebase: kept ORCH-123's host-side staging-runner line AND the ORCH-116 test-runner bullet. This follow-up commit reconciles the remainder: - Renumber the global sweeping ADR adr-0049 -> adr-0050. ORCH-123 took adr-0049 (adr-0049-host-side-docker-execution-boundary.md) on main while ORCH-116 was in flight, so ORCH-116 yields to the merged task and moves to the next free number. Mechanical cross-reference reconciliation only (git mv + title + every test-runner reference across README/internals/CLAUDE/CHANGELOG/config.py + 06-adr/ADR-001 + 12-review). Main's adr-0049 host-side references are left byte-for-byte untouched. No design/verdict content was altered. - Restore the ORCH-116 CHANGELOG entry that the CHANGELOG auto-merge silently dropped (both ORCH-123 and ORCH-116 inserted at the same [Unreleased] anchor; git kept only ORCH-123). - Add the missing ORCH_TEST_RUNNER_* keys to .env.example (parity with the ORCH_STAGING_RUNNER_* block; ORCH-101 canon of start keys). Refs: ORCH-116 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 09:56:47 +03:00
claude-bot	cc41dd849c	fix(staging): host-side ssh execution + env classification for staging-runner (ORCH-123) All checks were successful CI / test (push) Successful in 1m8s Details CI / test (pull_request) Successful in 1m8s Details The ORCH-115 deterministic staging-runner ran `docker exec` FROM INSIDE the prod `orchestrator` container, which ships only `openssh-client git curl` — no `docker` CLI (Dockerfile:11). `Popen(["docker", ...])` hit FileNotFoundError -> a PERMANENT environment defect that was mis-routed as a code-fail rollback `deploy-staging -> development` (burning developer-retries). Incident ORCH-116: every self-hosting task reaching deploy-staging was doomed to a false rollback. Fix (adr-0049, additive, flag-gated, never-raise, self-hosting scope; the gate / artifact contract / STAGE_TRANSITIONS / DB schema are byte-for-byte unchanged): - D1: build_staging_command() wraps the SAME `docker exec ... staging_check.py ... --mode stub` in `ssh <user@host> '<...>'` so it runs HOST-SIDE over the existing trusted ssh channel (mirror self_deploy / image_freshness). New flag staging_runner_exec_host_side (default True). No docker CLI/SDK added to the image, docker.sock not used in-container (D2 security). - D3: three-way classify_staging_outcome (suite-ran / permanent-env / transient-infra), disambiguating the exit=1 collision by scanning stderr. - D4: invariant "infra != code-fail" — permanent-env / exhausted transient-infra end in an infra-HOLD (no rollback, no developer-retry), NOT a false FAILED rollback (supersedes ORCH-115 D5). A really-executed failing suite still rolls back (anti-over-tolerance). R-2 verified: a held deploy-staging task is not rolled back by the reconciler. - D5: prod-like preflight() of the host-side channel at startup (main.lifespan, best-effort, never blocks). - D8: snapshot adds permanent_env / exec_host_side / preflight. 
Docs (golden source, same PR): INFRA.md execution-boundary section, architecture/README.md, CLAUDE.md, CHANGELOG.md, .env.example. Tests: tests/test_orch123_staging_runner_exec.py (TC-01 mandatory regression red->green; TC-02..TC-14 + R-2). ORCH-115 anti-drift green (3 tests updated for the D1/D4/D8 supersession). Full suite: 2131 passed. Refs: ORCH-123 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 08:42:36 +03:00
claude-bot	b50cf1dd08	feat(staging): deterministic staging-runner replacing LLM deployer on deploy-staging (ORCH-115) All checks were successful CI / test (push) Successful in 1m8s Details CI / test (pull_request) Successful in 1m8s Details Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting orchestrator) with a deterministic staging-runner intercepted in launch_job BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent precedent). The runner executes the SAME staging suite, maps the exit-code to `staging_status:` via the existing self_deploy.map_exit_code_to_status contract, writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate exactly as a finished LLM-deployer would. Invariant (NFR-1): this replaces only the producer of the artifact — the artifact contract, the gate / _parse_staging_status / check_staging_status name, STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV, never-raise, fail-safe back to the LLM path. Two-level outcome (D5, anti ORCH-110): suite executed -> verdict -> advance (FAILED -> the existing deploy-staging -> development rollback + developer-retry, same as a FAILED LLM verdict); tool-error (suite did not execute) -> bounded DEFER -> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent advance / false green). First implemented slice of the LLM determinization roadmap (ORCH-118 A6, replace-deterministic-now). 
- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout) - launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job) - config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget) - GET /queue staging_runner observability block - docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks + single-transport invariant intact), deployer.md (LLM branch -> fallback), CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security), .env.example - tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14) Refs: ORCH-115 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 01:59:43 +03:00
claude-bot	9710d5f80d	docs(llm): LLM call-site map, control-path axis, roadmap & usage policy + anti-drift tests All checks were successful CI / test (push) Successful in 1m8s Details CI / test (pull_request) Successful in 1m10s Details ORCH-118 (inventory-first, docs+tests only): publish an evidence-based map of every place the orchestrator's control flow consumes (or can consume) an LLM judgment, mark the control-path axis (C control-path vs P artifact-producer), define "avoidable LLM control path" as a checkable two-bit predicate, classify each call-site, and order the deterministic-replacement roadmap. Pin the map to code with offline structural anti-drift tests. - docs/architecture/llm-call-sites.md — map + machine-readable inventory block + control-path axis + classification + keep-LLM justifications + deterministic non-agent paths (FR-1/FR-2/FR-3/FR-8). - docs/architecture/llm-determinization-roadmap.md — ordered candidates BY ROLE, savings sourced from agent_runs, recommended first slice = deployer staging (FR-4). No fabricated follow-up Plane-IDs (R3/NFR-6). - docs/architecture/llm-usage-policy.md — normative principle, keep/replace criteria via the axis, definition of "avoidable LLM control path" (FR-5/FR-8). - tests/test_llm_call_site_inventory.py — TC-01/02/03/04/05/06/09/12/13/14. - tests/test_llm_determinization_docs.py — TC-07/08/11. - CHANGELOG.md + docs/overview/tech-quality-security.md — golden-source sync (AC-8). Avoidable LLM control paths = {tester, deployer}; control-path-keep = {reviewer}; not-control-path (P) = {analyst, architect, developer}. Single LLM transport = launcher._spawn (S0); no alternative transport (TC-12). Runtime untouched: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are byte-for-byte; no replacement runners implemented (FR-7). Full suite: 2081 passed. Refs: ORCH-118 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-16 00:13:07 +03:00
claude-bot	861b5ee984	fix(plane): sandbox-only fail-closed guard for Plane writes from test process (ORCH-117) Close the root class of incident ORCH-114: a pytest/worktree process performed a REAL write (PATCH issues state=<Done> + comment) against the PRODUCTION Plane project, because test/staging processes inherit the live Plane token (PLANE_HEADERS/PROJECT_ID are captured at import — a post-hoc env/token swap is a no-op) and nothing forced them to write only to the sandbox. Symmetric to the existing _no_telegram autouse floor. - New pure never-raise leaf src/plane_write_guard.py (decide/audit_block/ audit_allow), wired into the 3 plane_sync write primitives (update_issue_state / add_comment / _set_issue_state_direct) via _guard_allows_write, AT CALL TIME, before any network step. Active ONLY in a test process (pytest in sys.modules / PYTEST_CURRENT_TEST); live + staging runtimes (uvicorn) are a strict no-op. - In a test process: default-deny. A write is allowed iff opt-in (plane_test_write_enabled) AND target project in the sandbox allowlist (plane_test_sandbox_projects, default = the one SANDBOX id). Prod is blocked even with opt-in (allowlist sandbox-only); unresolved project -> block (fail-closed). - Independent second layer: tests/conftest.py::_plane_sandbox_only autouse floor. Intentionally NO prod-block kill-switch (anti back-door, NFR-6). - Audit: block -> loud ERROR; sandbox-allow -> INFO. - Bypass fixtures for the 3 (+1) pre-existing tests that assert on the mocked write primitive's httpx call (header/URL/state logic), the guard is no Quality Gate: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema untouched. - Tests: tests/test_orch117_plane_write_isolation.py (TC-01 mandatory ORCH-114 regression + TC-02..TC-14). Docs: CLAUDE.md, architecture/README.md, operations/INFRA.md, .env.example, CHANGELOG.md. Refs: ORCH-117 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 21:32:20 +03:00
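The default-deny rule from ORCH-117 can be written as one pure function. The signature below is a hypothetical simplification (the real `decide` in src/plane_write_guard.py may take different arguments); it only encodes the rules stated in the commit message: no-op outside a test process, block on an unresolved project, and allow only opt-in writes to an allowlisted sandbox project.

```python
from typing import Iterable, Optional


def decide(
    *,
    in_test_process: bool,
    opt_in: bool,
    project_id: Optional[str],
    sandbox_projects: Iterable[str],
) -> bool:
    """Return True when a Plane write may proceed (illustrative sketch)."""
    if not in_test_process:
        # Live and staging runtimes: the guard is a strict no-op.
        return True
    if not project_id:
        # Unresolved target project: fail closed.
        return False
    # Test process: allow only with opt-in AND a sandbox-allowlisted target.
    return opt_in and project_id in set(sandbox_projects)


# Hypothetical ids used only for the example.
SANDBOX = {"sandbox-project"}

examples = {
    "live runtime": decide(
        in_test_process=False, opt_in=False,
        project_id="prod-project", sandbox_projects=SANDBOX),
    "test, no opt-in": decide(
        in_test_process=True, opt_in=False,
        project_id="sandbox-project", sandbox_projects=SANDBOX),
    "test, opt-in, prod": decide(
        in_test_process=True, opt_in=True,
        project_id="prod-project", sandbox_projects=SANDBOX),
    "test, opt-in, sandbox": decide(
        in_test_process=True, opt_in=True,
        project_id="sandbox-project", sandbox_projects=SANDBOX),
    "test, unresolved": decide(
        in_test_process=True, opt_in=True,
        project_id=None, sandbox_projects=SANDBOX),
}
```

Keeping the decision pure (no network, no environment reads) is what lets it run at call time, before any request is built.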
claude-bot	c4a97a7a28	fix(stage-engine): address ORCH-114 review — env/docs canon + in-region rollback CAS Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership lease + expected-stage CAS): P1 — documentation = golden source: - .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS (canon of 100% start keys, ORCH-101), next to the other gate kill-switches. - CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags, ADR links) so a future agent editing advance_stage/reaper/webhooks finds the ownership invariant in the first mandatory-read doc (ORCH-078 traceability index). P2 — should-fix: - docs/overview/ (system showcase, ORCH-011): add transition_lease to tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and tech-architecture.md (components). - ADR-001 D4 alignment: the four side-effectful-edge rollback handlers (_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate / _handle_image_freshness) now write `development` through the expected-stage CAS via a shared _rollback_stage_cas helper (defence against the rollback↔done contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the held lease the sole owner always wins; a lost race aborts WITHOUT side effects. Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1. - Test isolation: make tests/test_webhooks.py order-independent by pinning the proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only passed by relying on import order. Drop the needless module-level ORCH_DB_PATH setdefault in test_orch114 (fresh_db already isolates db_path). New regression tests (TC-11): in-region rollback writes route through CAS; rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`; kill-switch-off rollback degenerates to the unconditional write. 
ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed. Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:38 +03:00
claude-bot	6ea4402942	fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114) Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful stage transitions had no single ownership. `advance_stage` is re-enterable and wrote the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the same transition independently. A concurrent or post-restart re-entry therefore re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild / prod-deploy initiation) and produced a contradictory rollback<->done (incident ORCH-111, job 1914 / PR #130). Two complementary layers, both additive, under one kill-switch, never-raise: 1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on ENTRY to the side-effectful region: a second actor that sees a LIVE owner does not start the heavy sub-gates at all (prevention, not post-hoc repair). 2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE: a lost race aborts with NO side effect. Also closes the 6 paths that write the stage in bypass of advance_stage (gitea x5 + plane rollback). Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so the cross-cutting budget invariant ORCH-065/109/110/113 is untouched. 
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a durable cross-path lease: the reaper consults it on all relevant paths (defer live, reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease); reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged (it remains the kill-switch-off fallback). Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched). Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for- byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas — byte-for-byte (one additive table, no epoch column on tasks). Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>. Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green (2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage were updated to spy on the new commit_stage_cas write path. ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:38 +03:00
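The expected-stage compare-and-swap described in ORCH-114 can be sketched as follows. The schema and function body are assumptions for illustration; only the name `update_task_stage_cas` and the idea of guarding the write by the expected stage come from the commit message.

```python
import sqlite3


def update_task_stage_cas(conn, task_id, expected_stage, new_stage):
    """Write new_stage only if the row is still at expected_stage.

    Returns True when this caller won the race, False when another actor
    already moved the task (the caller must then skip its side effects).
    """
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    return cur.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical minimal tasks table.
    conn.execute("CREATE TABLE tasks (id TEXT PRIMARY KEY, stage TEXT)")
    conn.execute("INSERT INTO tasks VALUES ('ORCH-111', 'deploy-staging')")

    # Actor A (the original finalizer) completes the transition.
    a_won = update_task_stage_cas(conn, "ORCH-111", "deploy-staging", "done")
    # Actor B (a late re-entry) tries a rollback from the stage it last saw.
    b_won = update_task_stage_cas(conn, "ORCH-111", "deploy-staging", "development")

    stage = conn.execute("SELECT stage FROM tasks WHERE id = 'ORCH-111'").fetchone()[0]
    return a_won, b_won, stage
```

The second writer loses because the row is no longer at the stage it expected, so `done` is not clobbered by a stale rollback.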
claude-bot	a1f3b7588a	fix(deploy): resilient-pull hygiene for dirty shared deploy-base (ORCH-112) Self-deploy git pull blocked on a dirty shared main checkout (manual/abandoned WIP from a failed/cancelled task) — incident ORCH-111: "Your local changes to src/config.py would be overwritten by merge" wedged the prod deploy and required manual intervention (a group risk on self-hosting). The deploy hook (--deploy) now converges the deploy-base to a clean, current origin/main BEFORE the pull (git fetch + reset --hard origin/main + a SCOPED `git clean -fd`, NEVER -x), strictly preserving the rollback/log artefacts (.deploy-prev-image-* / deploy-hook.log via -e), gitignored .env/data/.db/build (no -x), and sibling/.git state (out of clean scope). Gated by CHECKOUT_HYGIENE env injected by self_deploy.build_deploy_command only when the new pure never-raise leaf src/checkout_hygiene.py says applies(repo) (kill-switch + self-hosting scope). Convergence after failed/cancelled is this same deploy-time self-heal — cancel_task is NOT extended and no background janitor is introduced. Observability: the hook writes a `hygiene` sentinel, the Phase-C finalizer reads it and sends a best-effort Telegram alert. Additive, under kill-switch (ORCH_CHECKOUT_HYGIENE_ENABLED, default true; off -> bare `git pull origin main` 1:1 before ORCH-112), never-raise, self-hosting scope. STAGE_TRANSITIONS / QG_CHECKS / check_ / machine-verdict keys / DB schema / the hook exit-code contract (0/1/2, ORCH-036) are byte-for-byte untouched. Coverage: tests/test_deploy_checkout_hygiene.py (TC-01..TC-10; real-hook shell simulation in a temp git repo, no network/prod/ssh, + unit). TC-01 is the mandatory ORCH-111 regression (RED before the fix, GREEN after). Docs golden source updated in the same PR (CLAUDE.md, CHANGELOG.md, .env.example; INFRA.md / architecture/README.md / adr-0044 written at the architecture stage). 
Refs: ORCH-112 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:15:56 +03:00
claude-bot	7cb1f83f6c	fix(reaper): do not re-run deploy-staging finalization while finalizer is alive On the deploy-staging -> deploy edge the live monitor stamps agent_runs.finished_at FIRST, then runs the heavy edge sub-gates (security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from finished_at, so past reaper_finalize_grace_s it treated the live, long finalizer as dead and independently re-ran the advance -> a second re-test went red -> false rollback deploy-staging -> development while the original finalizer concurrently merged the PR (incident ORCH-111, job 1914). Add a process-local finalizer-ownership registry (src/finalizer_liveness.py, never-raise): the monitor mark()s ownership right after the exit_code stamp and clear()s it in a try/finally around the (verbatim-extracted) finalization tail, so an exception in the monitor thread still releases ownership and a genuinely dead finalizer is reaped. The reaper Tier-2 consults the marker only when the kill-switch is on AND the task stage == deploy-staging AND ownership is active -> DEFER (no second advance) and fall through to the Tier-3 backstop, which ignores the marker (a stuck/dead finalizer is still reaped in bounded time). In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn process); restart is covered by the startup requeue_running_jobs. Additive, global kill-switch reaper_finalizer_liveness_enabled (default True; false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes main. Observability: finalizer_defers_total + finalizer_owned in GET /queue. Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the mandatory ORCH-111 regression: red before the fix, green after). 
Refs: ORCH-113 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 13:08:41 +03:00
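The mark()/clear() ownership pattern described in the commit above can be sketched as follows. This is a minimal illustration only: the module-level set, the `is_owned` helper and `finalize_with_ownership` are assumed names for the example, not the actual contents of src/finalizer_liveness.py.

```python
import threading

_lock = threading.Lock()
_owned = set()  # job ids whose finalizer is currently alive (process-local)


def mark(job_id):
    """Record that a finalizer owns this job."""
    with _lock:
        _owned.add(job_id)


def clear(job_id):
    """Release ownership; safe to call even if the job was never marked."""
    with _lock:
        _owned.discard(job_id)


def is_owned(job_id):
    """What a reaper would consult before re-running finalization."""
    with _lock:
        return job_id in _owned


def finalize_with_ownership(job_id, finalize):
    """Run the finalization tail while holding ownership.

    The try/finally guarantees that an exception inside finalize() still
    releases ownership, so a genuinely dead finalizer can be reaped.
    """
    mark(job_id)
    try:
        finalize()
    finally:
        clear(job_id)
```

A reaper thread in the same process would then defer while `is_owned(job_id)` is true and fall through to its time-bounded backstop otherwise.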
claude-bot	651b9af7c3	fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest Eliminate the false `deploy-staging -> development` rollback that fired when the merge-gate local re-test timed out (infra/resource) on a green CI + tester + staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget under CPU starvation from orphaned pytest processes -> timeout misrouted as a code fault -> developer-retry loop -> manual gate). Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push main) and the no-prod-restart rule are preserved. - D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage pytest in its own process group (start_new_session) and tree-kills the WHOLE group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak. Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the prior subprocess.run. - D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/ other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded re-queue, task stays on deploy-staging, no rollback / no developer-retry); a red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert. - D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD already CI/tester/staging-validated); fail-safe runs the re-test on any uncertainty. Flag merge_retest_skip_when_current_enabled. - D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation; reaper_max_running_s invariant preserved without change. - D6: in-process counters + read-only merge_gate block in GET /queue; appended ("ORCH-110","classify_retest_failure","src/merge_gate.py") to MAIN_REGRESSION_MARKERS. 
Docs (README/internals overview/CLAUDE/CHANGELOG/.env.example) updated in the same PR. Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after incident regression). Full suite green (1988 passed). Refs: ORCH-110 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 10:42:34 +03:00
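Item D1 above (own process group plus tree-kill on timeout) follows a standard POSIX pattern. The sketch below illustrates that pattern under stated assumptions: `run_in_group` and its parameters are hypothetical names, and this is not the code of src/proc_group.py.

```python
import os
import signal
import subprocess
import sys


def run_in_group(cmd, timeout_s, grace_s=2.0):
    """Run cmd in its own session; on timeout kill the whole process group.

    Returns (returncode, timed_out). POSIX only.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pgid = proc.pid
        try:
            os.killpg(pgid, signal.SIGTERM)  # polite stop for the whole tree
        except ProcessLookupError:
            pass
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            try:
                os.killpg(pgid, signal.SIGKILL)  # grace expired: force kill
            except ProcessLookupError:
                pass
            proc.wait()
        return proc.returncode, True


if __name__ == "__main__":
    # Hypothetical usage: a child that would otherwise sleep for 30 seconds.
    print(run_in_group([sys.executable, "-c", "import time; time.sleep(30)"], 0.5))
```

Because the signal goes to the group rather than to the direct child, grand-children spawned by the command are stopped as well, which is what prevents the orphan leak the commit describes.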
claude-bot	2e73ccf090	feat(watchdog): proc_blocking alert for orphaned long-lived test processes Close the observability gap between agent_hung (only tracked jobs by jobs.pid) and orphaned pytest subprocesses the orchestrator launches itself (merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps running for days, starving CPU and failing merge-gate re-test — with no alert. Strictly inside the observer (watchdog/** + the watchdog compose service): - watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host), read-only, never-raise -> []; pure parsers split from I/O (tested on a fake /proc tree). Never reads /proc/<pid>/environ. - watchdog/signals.py: pure proc_signals builder, per-entity ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail. - watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead / byte-for-byte when off) + RECOVERY synthesis for a vanished process through the existing decide()/AlertState (no new anti-spam logic). - watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest), COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600, coverage_run_timeout_s=900) so a legit in-flight run never crosses it. - docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege). Anti-false-positive and no overlap with agent_hung are by construction (cmdline scope + age threshold), not fragile cross-namespace PID matching. Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block; documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/*, /metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict and the DB schema are untouched; deploy rebuilds only the sidecar, prod orchestrator is not restarted (NFR-3). 
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06), test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py (TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930). Refs: ORCH-111 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 02:14:17 +03:00
claude-bot	bc96977eb7	docs(readme): sync Watchdog section with per-role timeout budgets The front-page README "### Watchdog" section still claimed a "30-minute timeout", which became incorrect after ORCH-109 (per-role budgets: developer 60 min / reviewer 50 min / all others 30 min by default, `_resolve_timeout`). Brought it in line with docs/architecture/internals.md and added the Tier-3 backstop reaper_max_running_s=90 min. Closes the reviewer's P1 finding (12-review.md). Docs-only: src/**, STAGE_TRANSITIONS, QG_CHECKS and the DB schema are untouched. Refs: ORCH-109 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 14:26:11 +03:00
claude-bot	bbcaa93cff	docs(changelog): fix duplicated ORCH-105 entry body When the ORCH-109 entry was inserted above the ORCH-105 entry, the ORCH-105 bullet had its body accidentally duplicated (the same "слайдо-источник …" paragraph appeared twice in one bullet). Restore the ORCH-105 entry to its canonical single-bodied form (byte-for-byte identical to origin/main); the legitimate ORCH-109 additions are untouched. Refs: ORCH-109 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 14:26:11 +03:00
claude-bot	6bd7f9ba84	fix(launcher): raise developer/reviewer timeout budgets + stamp model at launch Two additive, isolated launch-subsystem fixes from incident ORCH-104, without touching STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema. D1 — launch-time model stamp: write the resolved model into agent_runs.model in the SAME UPDATE as the effort stamp (ORCH-087), so the model is present from launch, survives a timeout-kill (exit_code=-9), and is visible in-flight in /metrics & /queue. record_usage stays an enrichment (model=COALESCE preserves the launch stamp when the usage JSON model is None). never-raise (isolated try/except). D3/D4 — dedicated per-role budgets: agent_timeout_developer_s=3600 / agent_timeout_reviewer_s=3000 with a deterministic _resolve_timeout ladder (overrides_json[agent] > dedicated role key > agent_timeout_seconds=1800; other roles byte-for-byte). Malformed/non-positive config falls back to the global default + WARNING (never-break). reaper_max_running_s raised 3600 -> 5400 in lockstep to keep the ORCH-065 invariant (5400 > 3600 + 20 = 3620). FR-4 (kill / in-flight visibility) and FR-5 (anti-salvage) are structural in the existing code; pinned here by regression tests (tests/test_orch109_timeout_model.py, TC-01..TC-12). Docs: .env.example, config passport, CHANGELOG, CLAUDE.md (README/internals authored by architect in this branch). Refs: ORCH-109 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 14:26:11 +03:00
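The resolution ladder described in the commit above (per-agent override, then dedicated role key, then the global default) can be sketched as a pure function. The numeric values come from the commit message; the function name, argument shapes and the exact fallback behaviour for invalid values are assumptions of this sketch, not the repository's `_resolve_timeout`.

```python
import json
import logging

log = logging.getLogger("timeout_sketch")

DEFAULT_TIMEOUT_S = 1800  # agent_timeout_seconds from the commit message
ROLE_TIMEOUTS_S = {"developer": 3600, "reviewer": 3000}  # dedicated role keys


def resolve_timeout_sketch(agent, overrides_json="", role_timeouts=None,
                           default_s=DEFAULT_TIMEOUT_S):
    """Ladder: overrides_json[agent] > dedicated role key > global default.

    Assumption: any malformed or non-positive value on the chosen rung
    falls back to the global default with a WARNING (never raises).
    """
    role_timeouts = ROLE_TIMEOUTS_S if role_timeouts is None else role_timeouts
    try:
        overrides = json.loads(overrides_json) if overrides_json else {}
    except ValueError:
        log.warning("malformed overrides_json, using default %s", default_s)
        return default_s
    if not isinstance(overrides, dict):
        log.warning("overrides_json is not an object, using default")
        return default_s
    if agent in overrides:
        value = overrides[agent]
    elif agent in role_timeouts:
        value = role_timeouts[agent]
    else:
        return default_s
    valid = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not valid or value <= 0:
        log.warning("invalid timeout %r for %s, using default", value, agent)
        return default_s
    return int(value)
```

With these values the lockstep invariant quoted in the commit holds: 5400 > 3600 + 20.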
claude-bot	d016ac9b4c	docs(overview): ORCH-105 - slides on Lite installation and usage via Plane Extends the presentation slide source docs/overview/presentation.md with three slides in the ORCH-011 canon (16 → 19, continuous numbering preserved): - Slide "Starting and running a task via Plane" (entry point "To Analyse", statuses = indication, observation: board + Telegram card + comments). - Slide "What the human decides: gates, auto mode, cancellation" (Approved / Confirm Deploy; autoApprove/autoDeploy/Bug, none of which skip technical checks; STOP). - Slide "Lite installation via scripts" (two platform containers; config only; gen_secrets.py/onboard_project.py + docker compose up -d; runbook LITE_SETUP.md; the one-step bootstrap belongs to the adjacent Bundled mode, not Lite). Facts verified against the golden sources (LITE_SETUP.md, tech-pipeline.md, tech-integrations.md, CLAUDE.md). Anti-drift: a new function test_presentation_covers_lite_and_plane_usage_bits in tests/test_system_docs.py (existing checks are not relaxed). CHANGELOG updated. Docs+tests only: src/*, STAGE_TRANSITIONS, QG_CHECKS, check_* and the DB schema are byte-for-byte unchanged; python-pptx is not in the prod image; the .pptx is not committed to git. The manual .pptx build (TC-07) was verified in a dev venv: "Собрано слайдов: 19" (slides built: 19), exit 0. Refs: ORCH-105 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-12 08:19:36 +03:00
claude-bot	6d798c01ef	docs(overview): system showcase docs/overview/ - business + tech, 3 audiences, presentation (ORCH-011) A single entry point into the platform documentation (ADR-001 D1–D9): - docs/overview/: 10 files: index (routes "I am a customer / I am a manager / I am a developer" + the norm "changed functionality → update the showcase in the same PR"), business.md (jargon-free, 6 scenarios), 7 tech blocks (link-first), presentation.md (16 slides + a build procedure in the "command + Проверка: (Check)" format). - scripts/build_presentation.py: a .pptx generator with a dark design (python-pptx; a pure-stdlib parser parse_slides + lazy import of pptx; the binary is not committed, build/ is in .gitignore; the dependency is NOT in the prod image, enforced by the machine guard TC-09). - tests/test_system_docs.py: structural anti-drift: derive-checks of stages/gates/agents by importing STAGE_TRANSITIONS/QG_CHECKS, globbing prompts and reading config, link validity, FORBIDDEN scan + secret heuristic, slides parsed with the canonical parser, NFR-2, pointers. - reviewer.md: the ORCH-079 overview-docs axis is extended to the showcase (D7; canon 52d byte-for-byte, only the text inside sections changed) + an anti-regression assert in test_agent_prompts_canon.py. - Pointers: README.md, CLAUDE.md (rules No. 2 / No. 6, "Structure"), PRODUCT_VISION.md (inset link), CHANGELOG.md. Runtime byte-for-byte: src/*, docker-compose.yml, Dockerfile, requirements have zero changes (docs + tests + dev script, the ORCH-102/103 pattern). pytest: 1873 passed. Refs: ORCH-011 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 09:36:40 +03:00
claude-bot	f0cd19d748	feat(replication): ORCH-10b Bundled replication - bundle compose of the whole stack + bootstrap script Closes Type B of the ORCH-10 epic (per ADR-001 ORCH-103, D1–D11): - deploy/bundled/docker-compose.yml: a self-sufficient compose of the whole stack (orchestrator + watchdog + Gitea 1.22.6 + a mirror of upstream Plane CE v0.23.1, ~14 containers); project name orchestrator-bundle (a recognisable prefix), container_name is not pinned, there is no staging contour; one bridge network, machine traffic goes over service DNS, only human-facing ports are exposed; GITEA__webhook__ALLOWED_HOST_LIST=orchestrator; all images are pinned to immutable tags. The root compose/Dockerfile/src/** are byte-for-byte unchanged. - deploy/bundled/.env.example: the bundle config canon (placeholders, not a single default password; a test keeps the key set of interpolations in sync). - scripts/bootstrap_bundle.py: Python stdlib-only, modes plan/apply/verify, a check→ensure step engine, exit 0/2/1: preflight (fail-fast before mutations) → secrets (gen_secrets.py + stdlib secrets, no overwriting) → up + readiness → automatic Gitea init → Plane init (manual step with API verification) → onboarding strictly via onboard_project.py apply+verify → token-remote clone → assembly of .env/.env.watchdog (the single writer, mode 600) → health. There are no delete operations at all (D9), secrets are not printed (NFR-3). - CHANGELOG.md, CLAUDE.md (Type B paragraph), .gitignore (deploy/bundled/repos/). The BUNDLED_SETUP.md doc, REPLICATION §1, the arch README, adr-0038 and three structural test modules (TC-01…TC-11) are in the previous commits of the branch; full regression 1844 passed, ruff on the task's files is clean. Refs: ORCH-103 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 02:16:32 +03:00
claude-bot	8351e91382	docs(deployment): ORCH-10a Lite replication - LITE_SETUP.md + watchdog config canon + anti-drift contour Closes Type A of the ORCH-10 epic (on top of 10-common ORCH-101). Docs+tests (the ORCH-077/092 pattern): src/, docker-compose.yml, Dockerfile, scripts/ have zero changes; the pipeline (STAGE_TRANSITIONS/QG_CHECKS/check_*/machine-verdict/DB schema) is byte-for-byte unchanged. - docs/deployment/LITE_SETUP.md (D1/D2): the golden source of Lite replication: 13 normative sections in the order of the operator's route, each step = a fenced command + an explicit "Проверка:" (Check)/PASS/FAIL, host specifics only as placeholders; the canon is not forked (statuses/env/webhooks/smoke are given as links to ONBOARDING §1 / REPLICATION §2–§4 / SETUP_WEBHOOKS; only the fail-closed Confirm Deploy/STOP and the mandatory keys of the new host are stated explicitly). - .env.watchdog.example (D5, outcome A-4): the third canonical env example; key set = the WATCHDOG_ block of .env.example (19 keys, tokens are empty placeholders); closes the carrier-file trap (the sidecar reads ONLY .env.watchdog); C-1 ORCH-100 + port coherence in the header; .env.watchdog added to .gitignore (secret hygiene, mirroring .env.staging). - tests/test_lite_setup_doc.py (D8): 25 structural tests without network/LLM/subprocess: 13 sections in D2 order, the FR-6.1 building blocks, key sync of the watchdog canon, env keys ⊂ .env.example, compose subset (exactly orchestrator + watchdog by default, staging behind a profile, a guard against plane/gitea appearing), fenced scan for FORBIDDEN (imported from test_no_host_hardcodes) + a secret heuristic with a negative self-check, "22 statuses" verified by importing plane_sync._PLANE_NAME_TO_KEY, cross-references. - Cross-referenced docs (FR-7): REPLICATION.md §1 (Type A: Lite → ✅ ORCH-102 + link), README.md (the Lite capability + docs/deployment/ in the structure), INFRA.md (.env.watchdog in the secret norm + a link to deployment), CLAUDE.md (ORCH-102 block), CHANGELOG.md. 
Section norms: Gitea: do NOT enable branch protection on main (D3 / ADR D10 ORCH-009 / INV-4), pre-receive is not introduced, ONE global webhook secret; the staging fork is optional (D6); the code source is a parametrized git clone <ORCHESTRATOR_GIT_URL> (D7); stateless: the production host's data/tasks/secrets are NOT transferred (AC-3). Tests: pytest tests/ -q: 1789 passed (full regression green). Refs: ORCH-102 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-11 00:42:15 +03:00
claude-bot	f1635ddb39	feat(replication): host de-hardcoding + new-host secrets + smoke runbook The foundation of the 10-common replication (ORCH-10 epic): the platform is deployed on new infrastructure without code changes, only env/config. Every default equals the production value (empty .env => behaviour 1:1, kill-switch nature, NFR-2); STAGE_TRANSITIONS/QG_CHECKS/check_*/machine-verdict/DB schema are untouched. - config: agent_home_dir / agent_git_name / git_email_domain / staging_port (ADR-001 D2/D4); code blockers A1-A4 closed: plane_sync links are built from gitea_public_url+gitea_owner, launcher uses a single agent_git_env() (in 2 places), self_deploy/post_deploy take HOME + domain from Settings (the names of system actors are platform literals) - image_freshness: staging_port from config + a fail-closed guard: staging_port == prod port -> refusal BEFORE ssh/build (the ORCH-058 AC-9 invariant became executable); REPO= is passed to the hook explicitly by both invokers (D7) - SELF_HOSTING_REPO: a normative platform constant (D3, pin test) - compose: full ${VAR:-default} interpolation (registry B, map D6); the ORCH-040 group uid/gid/HOME/mounts moves consistently (build.args APP_); group_add "МИНА 1" preserved x3; both app services have an explicit command: - Dockerfile: ARG APP_UID/APP_GID/APP_USER/APP_HOME (the exec-form CMD 8500 is deliberately untouched, D5); deploy-hook: REPO="${REPO:-...}" (D1 of the registry) - secrets: stdlib scripts/gen_secrets.py (token_hex(32); prints by default; --write never silently overwrites an existing .env, exit=2; overwrite only with --force); .env.example completed to the full set of startup keys - docs: new docs/operations/REPLICATION.md (env map, secrets checklist, smoke procedure with PASS/FAIL, the 10-common/Lite/Bundled boundaries), INFRA.md, README, CLAUDE.md, CHANGELOG - anti-regression: tests/test_no_host_hardcodes.py (a tokenize-based scanner of forbidden literals, config modules are a structural exception, the allowlist is empty, 
a negative self-check) + test_host_config_keys / test_infra_parametrization / test_secrets_gen / test_replication_smoke; coordinated structural edits to test_orch040_compose (judges the resolution of defaults) and test_deploy_hook_rollback_sim (REPO via env override = the D7 contract) Full regression: 1764 passed. Refs: ORCH-101 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 20:50:43 +03:00
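The secrets contract named in the commit above (token_hex(32); --write never silently overwrites an existing .env and exits with 2; overwrite only with --force) can be illustrated with a small sketch. The function name `write_secret_env` and the key name `EXAMPLE_WEBHOOK_SECRET` are hypothetical; this is not the code of scripts/gen_secrets.py.

```python
import secrets


def write_secret_env(path, key="EXAMPLE_WEBHOOK_SECRET", force=False):
    """Write KEY=<64 hex chars> to path and return an exit code.

    0 = written, 2 = file already exists and force was not given.
    Mode "x" creates the file exclusively, so an existing .env is never
    overwritten silently, even if it appears between a check and the write.
    """
    line = f"{key}={secrets.token_hex(32)}\n"  # 32 random bytes, hex-encoded
    try:
        with open(path, "w" if force else "x", encoding="utf-8") as fh:
            fh.write(line)
    except FileExistsError:
        return 2
    return 0
```

A caller would pass the returned code to `sys.exit`, so an operator script can distinguish "refused to overwrite" from a successful write.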
claude-bot	e9038182a1	fix(tests): hermetic ORCH-41 model/effort tests vs host env (unblock merge-gate) Merge-gate re-test runs under the orchestrator's prod env, where the operator legitimately set ORCH_AGENT_FALLBACK_MODEL and changed ORCH_AGENT_MODEL_DEFAULT / ORCH_AGENT_EFFORT_*. Two ORCH-41-era tests asserted SHIPPED defaults through the env-backed settings singleton and failed 3/3 there, while Gitea CI (clean env) stayed green. Branch ORCH-009 touches neither src/ nor these tests - latent non-hermetic landmine on main, detonated by the prod env change. - test_resolve_agent_effort.py: autouse fixture now mirrors the sibling model-file baseline (pins shipped model/fallback fields) so the flag-assembly tests are env-independent. - test_resolve_agent_model.py: fixture also resets agent_fallback_model; test_fallback_model_disabled_by_default now asserts the CLASS field default (the actual ORCH-074 ADR-001 G4 invariant: shipped default is ""), never-break is_valid_model asserts unchanged byte-for-byte. Clean-env behaviour is byte-equivalent (fixtures pin exactly what an empty env yields). Full suite: 1713 passed (was 2 failed / 1711). Refs: ORCH-009 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 16:17:54 +03:00
claude-bot	dc1cb87818	feat(onboarding): turnkey project onboarding — kit + CLI + runbook (ORCH-009) Operator capability to bring a NEW project online in one pass, fully outside the runtime and the pipeline (src/** byte-exact, no kill-switch needed — activation is an explicit human CLI run). Reference = the orchestrator repo itself (ORCH-52b/c/d/e canons). * onboarding/repo-skeleton/ — parametrized kit of a new repo: 6 agent prompt templates per canon 52d/92 (5 ru + deployer en with the shared-host guardrail frame), reviewer doc-gate (REQUEST_CHANGES), CLAUDE.md passport, AGENTS.md, CONTRIBUTING.md, docs/ skeleton with mandatory operations/INFRA.md, .env.example; {{NAME}} placeholders + stdlib render, dictionary onboarding/placeholders.json (bijection held by tests). Canon is NOT forked: docs/_templates + docs/_standards are live-copied from the checkout at materialization time (BR-2/D3). * scripts/onboard_project.py — plan (default, GET-only, zero mutations) / apply (idempotent ensure, no delete ops at all) / verify (registry round-trip via the actual projects._parse_projects_json, all 22 state names incl. fail-closed Confirm Deploy/STOP, labels, webhook, kit completeness, unresolved-placeholder scan). Closed read-only src import list (ADR D4); state groups fixed per ADR D5 (STOP→cancelled, terminal groups only Done/Cancelled/STOP); Gitea webhook reuses the single global ORCH_GITEA_WEBHOOK_SECRET (TR-6); initial push ONLY into a freshly created empty repo (INV-4 untouched); never restarts prod / never edits .env / deletes nothing (NFR-2); secrets masked (NFR-3); Plane CE API gaps degrade to manual-step (fail-safe). * docs/operations/ONBOARDING.md runbook + SETUP_WEBHOOKS.md generalized per-repo; CLAUDE.md / docs/architecture/README.md / CHANGELOG.md updated in the same PR (golden source). 
* Anti-drift tests: test_onboarding_kit.py / test_onboarding_script.py (mocked, no network) / test_onboarding_invariants.py (snapshots of STAGE_TRANSITIONS/QG_CHECKS, closed CLI import list, reference .openclaw/agents/ prompts untouched). Full regression: 1713 passed. Refs: ORCH-009 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 16:08:43 +03:00
claude-bot	21a47e85d3	fix(lessons): resolve land-race with ORCH-100 — renumber ADR 0033→0034 Merge-gate auto_rebase_onto_main bounced this branch back: ORCH-100 landed in main first and claimed global ADR number adr-0033 (adr-0033-sidecar-watchdog), while this branch had created adr-0033-lessons-journal. Resolved the genuine land race: - rebased feature/ORCH-098-fnd onto current origin/main (linear history) - resolved docs/architecture/README.md component-list conflict — both the Lessons-journal and Sidecar-watchdog bullets now coexist - renamed docs/architecture/adr/adr-0033-lessons-journal.md → adr-0034-lessons-journal.md (next free global ADR number) + fixed the in-file header - updated all cross-references (CLAUDE.md, README.md, work-item ADR-001, 12-review.md) 0033→0034 for the lessons journal; ORCH-100's adr-0033 (sidecar) left intact - recovered the ORCH-098 CHANGELOG entry silently dropped by the rebase auto-merge (now above ORCH-100, ADR ref corrected to 0034) No code semantics changed; src/ auto-merged cleanly (ORCH-100 did not touch src/). ruff: n/a locally (CI). pytest tests/ -q: 1630 passed. Refs: ORCH-098 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 10:44:34 +03:00
claude-bot	318bae7472	fix(test): isolate settings.runs_dir in conftest to stop ambient prod-log pollution (ORCH-100) test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the self-hosting environment because launcher._finalize_job classifies a non-zero exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir defaults to the live prod dir /app/data/runs, which on the host holds REAL accumulated agent logs; a real 2.log containing "429" flips the expected 'permanent' classification to 'transient', requeueing the job instead of marking it 'failed'. This is ambient prod pollution, not a code fault. Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram / _disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir so _run_log_path() resolves to a non-existent file and classify_log_file() returns the documented 'permanent' default. Full suite: 1617 passed. src/** untouched. Refs: ORCH-100 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:36:02 +03:00
claude-bot	259b507906	feat(watchdog): sidecar-watchdog F1b — monitoring brain in a separate container (ORCH-100) Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the `orchestrator-watchdog` compose service — the brain half of the domain-0 observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it, augments with host / container / dependency probes, runs each signal through a generalised pure decision function (decide(signal_active, prev, now, cooldown), a strict superset of disk_watchdog.decide_action) with per-signal in-memory dedup/throttle/recovery, and alerts over its OWN independent Telegram channel. Key properties (ADR-001): - Observer separated from observed: separate container; /metrics not answering is itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup). - Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host paths :ro, no DB/disk writes, no process control — self-hosting-safe. - never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED kill-switch (disabled -> inert idle-loop, not exit). - Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85% alert; sidecar carries orch_down + an opt-in 97% ceiling (default off). - NO import from src/ (C-1); src/, STAGE_TRANSITIONS, QG_CHECKS, check_, DB schema — untouched. env_file optional so a missing .env.watchdog never breaks `docker compose up` for the prod orchestrator. Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14). Docs: CHANGELOG, .env.example canon (WATCHDOG_); architecture README + adr-0033 authored at the architecture stage. Refs: ORCH-100 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 09:36:02 +03:00
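The commit above names a pure decision function decide(signal_active, prev, now, cooldown) with dedup, throttle and recovery. Its actual return type is not shown in this log, so the following is a sketch under assumptions: the `AlertStateSketch` tuple and the action strings "alert" / "recovery" are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class AlertStateSketch(NamedTuple):
    active: bool                    # was the signal active on the last tick
    last_alert_at: Optional[float]  # timestamp of the last alert sent, if any


def decide_sketch(signal_active: bool, prev: AlertStateSketch, now: float,
                  cooldown: float) -> Tuple[Optional[str], AlertStateSketch]:
    """Pure function: same inputs always give the same (action, new_state)."""
    if signal_active:
        newly_active = not prev.active
        cooled_down = (prev.last_alert_at is None
                       or now - prev.last_alert_at >= cooldown)
        if newly_active or cooled_down:
            return "alert", AlertStateSketch(True, now)
        return None, prev  # still active inside the cooldown: throttle
    if prev.active:
        # The signal went away: send one recovery message.
        return "recovery", AlertStateSketch(False, prev.last_alert_at)
    return None, prev  # quiet before, quiet now
```

Keeping the function pure means the caller owns the per-signal state and the clock, so every branch can be tested without timers or network access.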
claude-bot	50bcae765a	feat(bug-fast-track): cheaper/shorter pipeline route for bug-fix tasks (ORCH-019) A task carrying the Plane `Bug` label takes a shortened route that skips the `architecture` stage (one opus architect run + ADR + check_architecture_done), replacing heavy analysis with a lite package (bug-report + mandatory regression test plan). EVERY Quality Gate / sub-gate runs UNCHANGED — the route is a scheduler property, not a gate (root invariant NFR-1): STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys are byte-for-byte preserved. - src/bug_fast_track.py: new leaf (never-raise) — bug_fast_track_applies (local, network-free, checked first), is_bug_task (labels.has_label, Plane API source), skips_architecture (pure DB-backed routing predicate), snapshot. - src/db.py: additive idempotent tasks.track column (TEXT DEFAULT 'full') + set_task_track / get_task_track helpers (missing/NULL -> 'full', fail-safe). - src/stage_engine.py: routing-override on the analysis-exit edge (track='bug' -> development/developer, skipping architect); brd-review-clock stamp extended to analysis->development. get_next_stage/get_agent_for_stage stay pure. - src/webhooks/plane.py: classify task as bug in start_pipeline (applies-first short-circuit; never-raise -> full cycle on any error). - src/main.py: additive bug_fast_track block in GET /queue + POST /bug-fast-track/escalate (reset 'bug'->'full' to return to the full cycle). - src/config.py: bug_fast_track_enabled / _label / _repos flags (empty CSV -> self-hosting only). - src/notifications.py: optional 🐞 marker on the bug-track card (never-raise). - Prompts: analyst.md (lite bug package + escalation), reviewer.md (regression- test axis) — 52d canon preserved. - Docs: CLAUDE.md, README.md (env + API + section), docs/architecture/README.md, CHANGELOG.md, .env.example. - Tests: tests/test_bug_fast_track*.py + test_db_migrations.py + queue block (TC-01..TC-15). Full regression green (1551 passed). 
Kill-switch ORCH_BUG_FAST_TRACK_ENABLED=false -> 1:1 pre-ORCH-019 (zero regression; residual track column harmless). Refs: ORCH-019 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 03:58:15 +03:00
claude-bot	a98d605477	feat(fs): legacy root-owned ownership detect + actionable worktree error (ORCH-057) Follow-up ORCH-040: legacy root:root files in /repos broke worktree creation under uid 1000 with a raw "Permission denied" (agent never started, no diagnosis). Three additive, kill-switch-reversible layers; STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are byte-for-byte unchanged. - D1: ensure_worktree classifies the permission class and raises an actionable RuntimeError (cause + chown command + INFRA.md ref); non-permission errors keep the prior raw-stderr contract; kill-switch off -> contract 1:1 as before ORCH-057. - D2: new never-raise leaf src/fs_normalize.py — scan_ownership (TTL-cached, early-exit per root), applies()-first scope (empty CSV -> self-hosting only), opt-in normalize() that chowns ONLY when privileged (no-op under uid 1000). - D3: best-effort startup detect in main.lifespan (WARNING + Telegram on mismatch, never-fatal); read-only fs_ownership block in GET /queue; POST /fs-normalize/check. Claim is NOT blocked — the clear early outcome is delivered by D1 at launch. - Docs/config: .env.example flags + CHANGELOG (architecture README / adr-0031 / INFRA.md procedure already landed on the branch). - Tests: test_fs_normalize.py, test_git_worktree_perm.py, test_fs_normalize_startup.py, test_api_queue.py (TC-01..TC-12). Full suite green. Refs: ORCH-057 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 03:03:34 +03:00
claude-bot	d8793c9698	feat(metrics): lightweight read-only GET /metrics raw-signal endpoint (ORCH-099) FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the orchestrator's own raw state for the future observability sidecar F1b — active task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens. The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history live in the separate sidecar (observer separated from observed, BRD §1). - src/metrics.py: new leaf collector build_metrics() (never-raise per section, serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck + stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from /proc/<pid>/stat (null on None/dead/non-Linux pid — never raises). - src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue). - src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an extension of the hot-path get_running_jobs()), agent_cost_totals(), queue_retry_stats(); job_status_counts() default dict gains the cancelled key. - src/config.py: metrics_endpoint_enabled kill-switch (default True), env ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch actually controls the flag. - docs: README API table row + CHANGELOG entry (contract section already added by architect); .env.example ORCH_METRICS_ENABLED. Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte. Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py. Full suite green (1482). Refs: ORCH-099 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 02:09:19 +03:00
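Reading utime+stime from /proc/<pid>/stat, as described for _read_cpu_ticks in the commit above, could look roughly like the sketch below. The function name and the `proc_root` parameter (added so the parser can run against a fake /proc tree) are assumptions; the field positions follow the Linux proc(5) layout, where utime is field 14 and stime is field 15.

```python
import os


def read_cpu_ticks_sketch(pid, proc_root="/proc"):
    """Return utime + stime in clock ticks, or None if unavailable.

    Never raises: a None pid, a dead pid, a non-Linux host or a malformed
    stat line all yield None.
    """
    if pid is None:
        return None
    try:
        with open(os.path.join(proc_root, str(pid), "stat"), "r") as fh:
            raw = fh.read()
        # Field 2 (comm) is wrapped in parentheses and may itself contain
        # spaces or parentheses, so split only after the LAST ')'.
        rest = raw[raw.rindex(")") + 2:].split()
        # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
        return int(rest[11]) + int(rest[12])
    except (OSError, ValueError, IndexError):
        return None
```

Dividing a delta of this value by the system clock tick rate (the `clk_tck` value in the envelope) gives CPU seconds consumed between two samples.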
claude-bot	78b6cdb3f1	docs(changelog): repair duplicated ORCH-095 entry body Reviewer P1 (ORCH-027 attempt 2): inserting the ORCH-027 changelog block duplicated the adjacent ORCH-095 entry — its paragraph body was repeated verbatim, corrupting a golden-source doc and another work item's artifact (CLAUDE.md §3). Remove the duplicate half, leaving a single ORCH-095 body. ORCH-027 entry untouched (already correct). Refs: ORCH-027 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 01:26:24 +03:00
claude-bot	eadfd8419b	feat(coverage): deterministic test-coverage gate on deploy-staging->deploy edge (ORCH-027) Introduce a deterministic (no-LLM) coverage sub-gate that blocks coverage degradation before a task branch merges into `main`. Existing gates judge only by the FACT of passing (check_ci_green / check_tests_passed / merge-gate re-test), not by completeness, so a batch autonomous run (ORCH-088) silently erodes coverage. Pattern mirrors the security-gate (ORCH-022): leaf src/coverage_gate.py (never-raise) + thin check_coverage_gate in QG_CHECKS + _handle_coverage_gate splice in advance_stage, run AFTER merge-gate (measured on the caught-up HEAD that lands in main) and BEFORE image-freshness (fail before the expensive docker rebuild). - measure_coverage: pytest --cov=src --cov-report=json in the per-branch worktree -> line coverage %; None on tool error -> fail-open + WARNING by default (FR-6). - compute_coverage_verdict (pure): absolute | baseline | both + epsilon (NFR-4 anti-flap); baseline None -> bootstrap (absolute-only). - coverage_baseline DB table (additive, CREATE TABLE IF NOT EXISTS) + ratchet-up in _handle_merge_verify (deploy->done): atomic compare-and-set under merge-lease, never decreases; bootstrap on first merge. - Artefact 18-coverage-report.md (coverage_status: frontmatter, single source of truth); GET /queue `coverage` block; FAIL -> Telegram; optional POST /coverage/baseline override. - Flags ORCH_COVERAGE_* (kill-switch + self-hosting-only scope) -> enduro untouched; STAGE_TRANSITIONS / existing check_* / verdict keys byte-for-byte unchanged (NFR-5/AC-8). - pytest-cov==5.0.0 added to requirements.txt. Tests: tests/test_coverage_gate.py (TC-01..TC-15). Frozen QG-registry anti-regress tests + deploy-staging edge tests updated for the new sub-gate. Full suite green. Docs: README / adr-0029 / PIPELINE_DOCS / 18-coverage-report.md template (architecture stage) + CHANGELOG / CLAUDE.md / .env.example (this PR). 
Refs: ORCH-027 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 01:26:24 +03:00
claude-bot	b38cc16041	fix(notifications): escape all card data fields at the render boundary (ORCH-095) render_task_tracker sends/edits the live card with parse_mode=HTML. _fmt_minutes returns the literal "<1м" for a sub-minute stage; interpolated raw into HTML text Telegram parsed "<1м" as an opening tag -> editMessageText 400 can't parse entities -> edit_telegram EDIT_FAILED -> update_task_tracker early return (anti-duplicate ORCH-087) -> the card froze (incident ORCH-093, message_id 18854). Close the whole "unescaped data in HTML text" class per ADR-001: a module-local _esc(x)=html.escape(str(x)) (never-raise) wraps every DATA slot (durations, status label, model, effort, token/cost metrics) exactly once at the render boundary in render_task_tracker/_stage_line. Source functions stay HTML-agnostic (_fmt_minutes still returns "<1м"; escape on the boundary renders it visually identical as <1м, so the visible format is unchanged). Intentional MARKUP slots (num_html / link_for / _done_link / already-escaped esc_title) are NOT escaped, so the issue number stays a clickable <a> tag and nothing is double-escaped. A previously-frozen card auto-recovers on the next stage transition (a new safe render edits in place, 200) — no new code, no touch to edit_telegram / update_task_tracker / the orphan ledger, so the ORCH-087 anti-duplicate invariant is preserved (a transient edit failure still does not spawn a new card). STAGE_TRANSITIONS / QG_CHECKS / check_* / notification transport / DB schema are untouched. New tests/test_tracker_html_escape.py (TC-01..TC-11); full suite green. Refs: ORCH-095 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-10 00:17:26 +03:00
claude-bot	a46dcbcab3	fix(deploy): terminal-window-aware guard so done tasks hold Done in Plane (ORCH-094) A DB stage=done task with 0 active jobs flapped in Plane between `Awaiting Deploy` and `Monitoring after Deploy` instead of holding `Done` (verified live on ORCH-061, task 47): the three deploy-phase setters were terminal-blind, so any stale/duplicate/unknown caller under the bot token re-stamped an intermediate status over the terminal Done, forever. - New leaf src/deploy_status_guard.py (pure, never-raise, config-gated): decide() -> ALLOW | CONVERGE_DONE | SUPPRESS on the entry of set_issue_awaiting_deploy / set_issue_deploying / set_issue_monitoring. A deploy-phase status is legitimate iff the task is non-terminal OR (done AND post-deploy window active); otherwise done converges to Done idempotently, cancelled is suppressed (FR-2, D1/D2). - D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in advance_stage so window_active is True when the legitimate first Monitoring is set (the task is already DB-done by then); a re-drive after the window closes converges to Done. - D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when the task became cancelled mid-window (zombie-tick guard, FR-3). - D5: additive `reason` kwarg on the three setters + one structured log line per verdict (work_item/caller/target/db_stage/window_active/verdict); new read-only db.get_task_by_work_item_id; post_deploy.window_active helper. - Flags deploy_status_guard_enabled (kill-switch -> 1:1) / deploy_status_guard_repos (CSV; empty = self-hosting only). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema untouched (reads existing tasks.stage). Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve. Full regress green (1413). Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md (status -> implemented), .env.example. 
Refs: ORCH-094 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 23:41:24 +03:00
claude-bot	0b25fc1527	fix(merge_gate): retry transient Gitea merge errors + already-in-main guard merge_pr now wraps ONLY the mutating POST /pulls/{n}/merge in a bounded exponential-backoff retry-loop on TRANSIENT outcomes (405 "try again later", 408, any 5xx, network/timeout, and 409|422 while the PR is still mergeable); TERMINAL outcomes (403/404/real conflict via mergeable==False) -> fast honest False, so the ORCH-071/081 not-merged HOLD backstop is unchanged. Fixes the ORCH-063 false HOLD + manual re-merge on Gitea's post-push mergeability hiccup. ensure_open_pr gains an "already fully in main" guard (_branch_fully_in_main, git merge-base --is-ancestor HEAD origin/main) BEFORE creating a PR -> new "already-in-main" outcome avoids the garbage empty PR on a re-driven finalizer; _handle_merge_verify skips merge_pr on that outcome and lets the authoritative SHA-in-main check confirm -> done (not a HOLD). git error of the guard fails OPEN to the create path. New ORCH_MERGE_RETRY_* settings (kill-switch merge_retry_enabled -> one-shot, max_attempts=3, backoff base=2/max=5). INV-4 (merge only via Gitea PR-merge API, never push/force-push main), never-raise, STAGE_TRANSITIONS/QG_CHECKS/DB schema unchanged. Docs (README merge-verify section, CLAUDE.md, CHANGELOG, .env.example) updated in the same PR. Tests: test_merge_gate.py TC-01..12, test_config.py TC-13, test_merge_verify.py TC-14..16; full suite green (1389). Refs: ORCH-093 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 22:47:20 +03:00
claude-bot	328ae78da3	fix(notifications): tracker card: status-map completeness, rollback reflection, stage-metric summation (ORCH-091) Three verified live-card defects in src/notifications.py (ORCH-067/087), all additive and indication-only (STAGE_TRANSITIONS / QG_CHECKS / check_* / transport / DB schema untouched; never-raise; revert = git revert): - Defect 1 (D1): _STAGE_STATUS_LABEL covered 8 of 10 STAGE_TRANSITIONS keys; deploy-staging and cancelled (ORCH-090) fell back to the misleading "To Analyse". Added deploy-staging→"Deploying (staging)", cancelled→"Cancelled"; replaced the runtime fallback for an UNMAPPED stage with a neutral capitalized label (_neutral_stage_label). created stays an explicit "To Analyse"; broken/None input degrades safely. Map completeness is asserted programmatically from STAGE_TRANSITIONS.keys() (single source of truth), not a static list. - Defect 2 (D2): the stage-row loop drew ✅ for any stage with a finished agent run regardless of position, so after a rollback the card showed the absurd "✅ Внедрение + 🔄 Разработка" ("Deployment" done while "Development" is in progress). Added read-only _pipeline_pos from the STAGE_TRANSITIONS order and a suppression gate (✅ only when current_pos >= _pipeline_pos(stage_key)); deploy-staging→deploy normalization applied ONLY to the current position; is_active_stage untouched. - Defect 3 (D3): _stage_line took only the LAST run (ORCH-069: developer 3 runs Σ $3.98 rendered ~$0.00). It now aggregates ALL of the agent's runs with the same per-run formulas as the task totals → strict convergence with SUM(agent_runs) by task_id; model/effort/attempt come from the last run. Tests: test_tracker_status_line.py (ORCH-091 TC-01..TC-03 + updated tc06); new test_tracker_rollback_metrics.py (TC-05..TC-08). Full suite green (1370). Docs: CHANGELOG + internals.md (architecture README already updated by architect). Refs: ORCH-091 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 22:08:52 +03:00
claude-bot	aae65969d5	fix(cancel): narrow STOP critical-window so deploy-park cancel applies (ORCH-090) Review P1: a STOP while a self-hosting task is PARKED on `deploy` awaiting the manual `Confirm Deploy` was classified as a critical merge/deploy window solely because the task still held the per-repo merge-lease (held from merge-gate through deploy->done). That window is fully reversible — nothing is merged or deployed yet (the irreversible merge_pr runs later in _handle_merge_verify, always under an INITIATED marker). So the cancel was DEFERRED to run_deploy_finalizer, which only runs after Phase B (Confirm Deploy) — the very step the operator pressed STOP to avoid. Result: the deferred cancel was never applied, the task wedged non-terminal holding the lease, blocking the repo's serial-gate (ORCH-088) and merges. Fix: gate the merge-lease branch of cancel.in_critical_window on an actively RUNNING actor (_task_has_running_actor). Lease held + running deploy/merge job -> still deferred (genuine in-flight step). Lease held + no running actor (idle deploy parking) -> NOT critical -> immediate full reset, which itself releases the lease (step 3c) and drives the task terminal. INITIATED-marker deferral unchanged. Also fixes review P2 (AC-6): set_task_cancel_requested now returns the first-stamp fact (rowcount), and the deferred branch only notifies on the first transition — a repeated STOP while still deferred no longer spams duplicate notifications. Tests: test_d7_lease_held_idle_parking_is_not_critical, test_d7_lease_held_with_running_actor_still_critical, test_d7_stop_on_deploy_awaiting_confirm_full_resets, test_d7_repeated_stop_in_critical_window_no_duplicate_notify. Full suite green (1349). Refs: ORCH-090 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:31:56 +03:00
claude-bot	ebbf2e7a2d	feat(cancel): STOP-status task cancellation + relaunch-hole close (ORCH-090) Introduce the dedicated Plane STOP status as a single declarative task-cancel mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs (terminal `cancelled`, never requeued), remove the worktree + delete the remote feature branch (never main, never force-push), drive the task to the new system-terminal state `cancelled` and tombstone the natural keys so a later "To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a critical merge/deploy window is deferred until the irreversible step finishes honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to the `analysis` stage; the only pipeline-start entry point remains "To Analyse". Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} -> {done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged (`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe, under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression). New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block. Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345). Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example, CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup ORs on it) — original UUID recoverable from the parseable suffix. Refs: ORCH-090 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 21:31:56 +03:00
claude-bot	664c2e945a	feat(infra): auto-prune docker build cache on mva154 (ORCH-062) Add src/build_cache_pruner.py, a background daemon thread modelled 1:1 on src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f --filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans. Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk 100% -> whole self-hosting pipeline down) automatically, without an operator. ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy (no docker CLI in the image). Touches ONLY the build cache: no image/system prune, no image/container removal, never restarts the docker daemon or the prod container (self-hosting safety). No ssh target -> tick is a no-op. - src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators (interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default). - src/main.py: start last (after disk_watchdog) / stop first in lifespan; additive read-only build_cache_prune block in GET /queue. - never-raise on two levels (per-command + per-tick); kill-switch ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before). - STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED; last-run/last-result is in-memory (no migration). - tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked). - .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already carry the component (architecture stage). Refs: ORCH-062 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:55:00 +03:00
claude-bot	8759cb7df8	feat(disk-watchdog): host-FS fill heartbeat + Telegram alert at >=85% (ORCH-063) Adds src/disk_watchdog.py — a background daemon thread modelled on reconciler/job_reaper that measures host-FS fill via the mounted bind-paths (/repos, /app/data) with shutil.disk_usage and Telegram-alerts the operator at >= threshold (default 85%). The missing proactive signal: on 07.06.2026 the mva154 host disk silently hit 100% and stalled the whole self-hosting pipeline. - Pure decide_action(used_pct, threshold, prev, now, realert_s): alert on crossing up, cooldown re-alert, single recovery below threshold (unit-tested without a thread/timer; clock injected). - measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path never-raise (a broken path never fails the tick). - Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold 1..100, positive intervals -> default + warning). Kill-switch -> daemon does not start. - Additive disk_monitor block in GET /queue; start/stop in main.lifespan. - never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS/QG_CHECKS/check_*/ DB schema untouched, no migration (anti-spam state in-memory). Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite green (1296). Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md + ADRs authored at architecture stage). Refs: ORCH-063 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 19:04:36 +03:00
claude-bot	6cae171745	docs(prompts): ORCH-092 - audit of 6 agent prompts (de-hardcoding, escalation, cleanup) Epilogue of epic ORCH-52. Docs/prompts-only: src/*, STAGE_TRANSITIONS, QG_CHECKS, machine-verdict keys and the DB schema are untouched; frontmatter_validation_strict=False. - FR-1/FR-2: the copyable frontmatter examples in all 6 prompts are de-hardcoded (created_at: <YYYY-MM-DD> / model_used: <resolve ORCH-41> + a callout "do not copy literally, substitute date +%F and the model from the config"); the literal claude-opus-4-8 remains only as a reference in the fields table. - FR-3: the check_ names in the prompts were reconciled with QG_CHECKS; no mismatches (pinned by integration test TC-03). - FR-4: developer "PR>1500 → split" is rephrased as a task-level escalation. - FR-5: <escalation> section for developer/reviewer/tester (after </success_criteria>): back-to:analysis / back-to:dev / REQUEST_CHANGES. - FR-6: deployer: critical self-hosting prohibitions in a prominent box at the start of <context>. - FR-7: tester enriched with the worktree path, a serial_gate smoke test (ORCH-088), and TC coverage. - FR-8: the dead line "the same Developer instance" is removed from reviewer. - FR-9 (ADR-001 D1): the manual git rebase origin/main is removed; base freshness is maintained by the engine (serial-gate ORCH-088 + auto_rebase_onto_main under merge-lease). - FR-10 (ADR-001 D2): deployer.md is left in English as a normative exception. - FR-11: tests/test_agent_prompts_canon.py extended (ORCH-092 TC-01…TC-08); the 52d canon and test_agent_frontmatter_no_model.py are green; full regress green (1278). Documentation: 6 prompts, CLAUDE.md, docs/architecture/README.md, CHANGELOG.md. Refs: ORCH-092 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 17:46:27 +03:00
claude-bot	d97b26a59f	docs(ORCH-079): ORCH-52f — sync README with code + reviewer overview-docs axis Layer 5 (final) of epic ORCH-52. Docs + prompt-only; src/ untouched. - README.md «Известные ограничения»: fix numbering (was 1,2,3,4,3,4), move 6 resolved/obsolete items to «Закрыто (история)» trail with ORCH refs, keep only really-open limitations (Telegram-48h ORCH-087, task-deps intra-repo ORCH-026, serial-gate ORCH-088). Point-sync stage table (development → check_ci_green) and event-routing (ORCH-045). - reviewer.md: overview-docs axis (axis 4 + constraints) — closing a README limitation without updating README → finding ≥P1 (canon 52d «❌→✅»; verdict key + 5 XML sections + 6 schema fields byte-intact). - tests: new tests/test_readme_limitations.py (numbering + no resolved items as open); test_agent_prompts_canon.py asserts the new axis. - CLAUDE.md / CHANGELOG.md updated; epic ORCH-52 closed (52b→…→52f). Refs: ORCH-079 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 16:33:33 +03:00
claude-bot	572b3172cd	docs(ORCH-078): ORCH-52e - ORCH-NNN traceability standard + ADR reading rule Layer 4 (traceability) of epic ORCH-52, closing the 52b/52c/52d chain. Docs + prompts-only: src/**, STAGE_TRANSITIONS, QG_CHECKS, src/frontmatter.py and the DB schema are untouched; no new QG is introduced; retrofitting 51 markers is out of scope. - New normative standard docs/_standards/TRACEABILITY.md: marker format, placement rule, reading history with a real verifiable example (src/serial_gate.py → ORCH-088 → ADR-001-serial-gate.md), fallback access (git show origin/main:...), anti-archaeology (3+ → a consolidated cross-cutting ADR), canonical text of the reading rule (single source). - Targeted additive inserts into the prompts (the 52d canon is not rewritten): developer.md (rule for reading someone else's marker + fallback, "❌ X → ✅ Y"), architect.md (reading rule + anti-archaeology), reviewer.md (the "ADR compliance" axis strengthened with a sub-item: breaking a marked invariant → finding ≥P1). All three reference the single text in TRACEABILITY.md rather than copying it (anti-duplication BR-6). - Related: CLAUDE.md, docs/architecture/README.md (layer 4 of epic 52), CHANGELOG.md. - Anti-regression: tests/test_agent_prompts_canon.py extended (9 new checks); the 52d checks and test_agent_frontmatter_no_model.py are green; full pytest tests/ -q green (1253 passed), src/ unchanged. Refs: ORCH-078 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 15:48:43 +03:00
claude-bot	8beed58d98	docs(prompts): rewrite 6 agent prompts in Anthropic canon + emit 52c schema (ORCH-52d) Closing layer of epic ORCH-52. The body of all 6 prompts .openclaw/agents/.md is rewritten in a single Anthropic canon (5 mandatory XML sections <context>/<task>/<deliverables>/<constraints>/<output_format>, prohibitions in the "❌ X → ✅ Y" form, <thinking> for the decision-making roles), and each prompt voluntarily emits the 6-field 52c frontmatter schema (work_item/stage/author_agent/status/created_at/model_used) additively, alongside the machine-verdict key, without changing its name/case/values (verdict:/result:/staging_status:/deploy_status:/security_status:). Docs/prompts-only: src/*, STAGE_TRANSITIONS, QG_CHECKS and the DB schema are untouched; frontmatter_validation_strict remains False (enforcement is not enabled). The functional content of the old prompts is carried over 1:1 (inventory TRZ §FR-6). - tests/test_agent_prompts_canon.py: structural anti-regression (TC-01…TC-07) - tests/manual/ab_prompt_compare.md: A/B method (TC-09 / AC-6) - CLAUDE.md, CHANGELOG.md updated; README/ADR by the architect. Full regress pytest tests/ -q green (1244); test_agent_frontmatter_no_model remains green. Refs: ORCH-077 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-09 15:08:27 +03:00

100 Commits