Add a deterministic task-estimation side-mechanism: a new operator Plane
status «Оценка» (third action-status, family STOP/Confirm Deploy) triggers a
new never-raise leaf src/estimator.py that forecasts cost/time/tokens/story
points {1,2,3,5,8} from the history of completed tasks (no LLM — ADR-001 D1),
writes the forecast to Plane (estimate_point + comment), the Telegram card and
the additive task_estimates ledger (UPSERT by work_item_id), then returns the
issue to Backlog. On task completion the fact (from usage.py) is written to
Plane `point`. Massivity is free (Plane multi-select -> N webhooks); re-estimate
is idempotent; anti-disruption skips in-flight tasks; anti-loop (Backlog matches
no trigger branch).
INVARIANT (NFR-1/NFR-3): the estimator is an observer/producer, never a Quality
Gate or stage — STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas and the hot launch path (resolve_agent_model/_spawn) are
byte-for-byte untouched. fail-closed: `estimate` absent from _DEFAULT_STATES ->
a board without the status is inert (zero regression, enduro untouched).
- config: ORCH_ESTIMATOR_* flags (kill-switch + CSV scope empty->self-hosting
only, bootstrap defaults, story-point cost thresholds, wall cap).
- db: additive task_estimates table + record_estimate/set_actual/get_estimate/
estimates_snapshot + read-only completed_task_stats history aggregate.
- plane_sync: «Оценка»->estimate name map (NOT in _DEFAULT_STATES); set_issue_
backlog/set_issue_point/set_issue_estimate_point + get_project_estimate_points
(all via ORCH-117 write-guard, best-effort/fail-safe).
- webhooks/plane: fail-closed estimate branch + handle_estimate (anti-disruption,
auto-return to Backlog, off-loop).
- stage_engine: best-effort fact-on-done врезка (after terminal decision).
- notifications: «Оценка» card line (read from ledger, omitted when empty).
- main: read-only `estimator` GET /queue block + optional POST/GET /estimate.
- onboarding canon + Lite/Bundled/ONBOARDING docs: 23rd status «Оценка» (group
unstarted, never terminal); anti-drift count pins bumped 22->23.
- tests: tests/test_orch020_estimator.py (TC-01..TC-20), full suite green.
Refs: ORCH-020
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Create docs/operations/FAQ_STOP.md — a user-facing "вопрос → ответ" FAQ for
Plane board users explaining the STOP status: what it does, how to cancel a
task, step-by-step consequences (agent stops → jobs cancelled → working branch
and worktree removed → task → cancelled → Telegram+Plane; docs artifacts
preserved, main/prod untouched), deferred cancellation in the critical
merge/deploy window, explicit "STOP does NOT revert merged/deployed code"
(revert is a separate task), restart only via "To Analyse" from scratch, no-op
causes, where to observe the result, and STOP/Approved/Confirm Deploy disambig.
docs-only: src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict keys
and the DB schema are byte-for-byte untouched. STOP behaviour source of truth
remains ORCH-090 (adr-0026); the FAQ documents and links to it (link-first:
machine details markers/lease/tombstone given by reference, not duplicated).
Add two-way cross-links: docs/overview/business.md (Сценарий 6) and
docs/overview/tech-pipeline.md (Отмена: STOP → cancelled) → FAQ; FAQ → overview
+ ADR ORCH-090. Structure guarded by deterministic anti-drift test
tests/test_faq_stop_doc.py (offline, no network/LLM/subprocess; mirrors
tests/test_lite_setup_doc.py): existence + 8 section anchors + fact bricks +
cross-links + claim-level negative scan (sentence-level, not bare substrings,
so it never false-fails on correctly negating phrases — TR-3, with a
non-evergreen self-check). Full pytest tests/ green (2227 passed).
Refs: ORCH-108
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The merge-gate's auto_rebase_onto_main silently dropped the ORCH-119
CHANGELOG bullet during a same-anchor 3-way merge: origin/main's
ORCH-120/126 entries were kept while the ORCH-119 insertion was lost.
Re-spliced the entry verbatim under ## [Unreleased] alongside 120/126.
Refs: ORCH-119
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Address reviewer P1 (ось ORCH-011/ORCH-079, правило агентов №6): витрина
описывала паузу serial-gate как исключительно операторскую, но ORCH-120
добавил движковый авто-park/unpark на analyst Needs Input.
- tech-pipeline.md: абзац пауз теперь называет два источника (оператор +
авто-park движком на Needs Input, флаг analyst_needs_input_autopause_enabled,
скоуп self-hosting, симметричный unpark на resume).
- tech-observability.md: пункт пауз в GET /queue — оба источника.
- tech-agents.md: when-applicable сигнальный канал 01-questions.md у analyst
(строка таблицы + поясняющая врезка; не machine-verdict, не deliverable).
- CHANGELOG: запись ORCH-120 дополнена строкой про обновление витрины.
tests/test_system_docs.py зелёный (29 passed). src/STAGE_TRANSITIONS/QG_CHECKS
не тронуты — docs-only.
Refs: ORCH-120
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Activates and completes the previously dead "analyst asks BLOCKING questions ->
01-questions.md -> Needs Input" path. Four coordinated changes, additive, under
kill-switch, self-hosting scope, never-raise; STAGE_TRANSITIONS / QG_CHECKS /
check_* / machine-verdict keys / DB schema are byte-for-byte UNCHANGED (the flow
is a pre-gate engine branch, NOT a Quality Gate; 01-questions.md is a SIGNAL
artifact, NOT a machine-verdict).
- D1 contract + canon: analyst.md documents the 01-questions.md channel (blocking
questions -> Needs Input, do NOT fabricate deliverables) + resume behaviour; new
skeleton docs/_templates/01-questions.md; PIPELINE_DOCS.md manifest row + 01-
prefix note.
- D2 freshness-supersede (DQ-2): pure offline mtime predicate questions_active in
the new leaf src/analyst_questions.py (a full FRESH package supersedes a stale
untouched 01-questions.md -> no Needs-Input loop, AC-6).
- D3 priority: questions take priority over "files ready" in
_handle_analysis_approved_flow (_decide_analysis_outcome + _emit_analysis_*);
off/out-of-scope runs the ORIGINAL byte-for-byte order (AC-9).
- D4 auto-park: set_task_paused on Needs Input via the ORCH-124 pause axis so the
repo serial-gate FIFO is not wedged while waiting for a human (AC-4); D5 resume +
unpark (clear_task_paused) in handle_status_start (analysis branch).
Flags (config.py, safe defaults): analyst_questions_gate_enabled /
analyst_questions_gate_repos (empty -> self-hosting only) /
analyst_needs_input_autopause_enabled.
Tests: test_orch120_analyst_needs_input.py (TC-01 regress + TC-02/03/06/09/10),
test_orch120_serial_gate_needs_input.py (TC-04), test_orch120_resume_unpark.py
(TC-05), test_orch120_questions_artifact_canon.py (TC-08), assert in
test_agent_prompts_canon.py (TC-07). Full suite green (2205 passed).
Refs: ORCH-120
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Queued analyst-jobs hung forever even with ORCH_SERIAL_GATE_ENABLED=false
(incident ORCH-124/125, job 2286: queued + run_id=759/760 + pid=35/42 +
started_at=NULL — physically impossible). No path returning a job to
'queued' reset its run-ownership (run_id/pid); after a container restart a
reused pid made pid_alive(stale)=True, so the job-reaper Tier-1 saw a phantom
'running' and at max_concurrency=1 wedged the claim of the whole shared queue.
Enforce the invariant `status='queued' ⇒ run_id IS NULL AND pid IS NULL AND
started_at IS NULL` on existing columns (no schema change):
- D1 forward-cleanup: requeue_running_jobs / mark_job('queued') /
mark_job_transient / reap_running_job('queued') reset run_id=NULL, pid=NULL
in the same UPDATE that clears started_at; atomic status-guards preserved.
- D2 clean claim: claim_next_job resets pid/run_id on the queued->running flip
(defense-in-depth) so the row carries pid IS NULL until _spawn stamps it.
- D4 self-heal + observability: db.find_impossible_queued_jobs /
sanitize_impossible_queued run at startup (main.lifespan) and on each reaper
tick (JobReaper.sanitize_impossible_queued_once, never-raise); counter
impossible_queued_total in the GET /queue reaper block. Kill-switch
ORCH_IMPOSSIBLE_QUEUED_SANITIZE_ENABLED (default on; gates only the D4 sweep).
- D5: reaper Tier-1 unchanged — the fix restores its precondition (pid reflects
THIS run). Marked invariants ORCH-065/113/114/099 preserved.
Tests: tests/test_orch126_queued_stale_run.py (TC-01 mandatory regression
red->green; TC-02..TC-10). Full pytest tests/ -q green (2189 passed).
Docs: internals.md (run-ownership invariant section), .env.example, CHANGELOG;
cross-cutting adr-0052.
Refs: ORCH-126
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses reviewer REQUEST_CHANGES (run 768) on ORCH-124 — docs-only,
no src/tests touched, fix scope unchanged.
P1: update docs/overview/ showcase for the new serial-gate "pause without
blocking" axis (changed task-routing functionality, ORCH-011/ORCH-079):
- tech-pipeline.md: FIFO exception "pause without blocking" next to freeze
- tech-data-model.md: durable signal tasks.paused_at on the Task row
- tech-observability.md: paused/reason in serial_gate GET /queue block +
operator endpoints POST /serial-gate/pause|resume
P2: strip leaked tool-call trailing tags (</content>/</invoke>) from 4
golden-source docs of this PR (06-adr/ADR-001, adr-0051,
08-data-requirements.md, 10-tech-risks.md).
CHANGELOG "Доки" bullet extended accordingly. Full suite green (2178 passed);
test_system_docs.py green (machine-checked showcase facts intact).
Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The deterministic test-runner gate (full `pytest tests/`) failed on
test_orch123_staging_runner_exec.py::test_r2_held_deploy_staging_not_rolled_back
once ORCH-124 reached the testing stage.
Root cause (pre-existing latent regress, surfaced — not introduced — by
ORCH-124): the fixture isolated `worktrees_dir` but not `repos_dir`.
`check_staging_status` falls back to `<repos_dir>/<repo>` (and its
origin/main) when the feature worktree is absent. After ORCH-123 merged,
the real `/repos/orchestrator/docs/work-items/ORCH-123/15-staging-log.md`
(verdict SUCCESS) exists on disk, so the intended-RED staging gate read it
and went green -> advance_stage was called -> the R-2 assertion failed.
Order-dependent: the test passed alone, failed in the full suite.
Fix: isolate `settings.repos_dir` to an empty tmp subdir in the fixture
(mirroring the existing worktrees_dir isolation) so the staging gate is
deterministically "not found" -> red, regardless of suite ordering. The
ORCH-123 R-2 invariant (a held deploy-staging task is never rolled back to
development, adr-0049/ADR-001 D4) is preserved and strengthened — the fix
only restores the test's stated premise. src/** / STAGE_TRANSITIONS /
QG_CHECKS / check_* untouched (test-only change).
Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fixes incident ORCH-116/ORCH-123: serial_gate defined a repo's "active task"
purely by machine stage (tasks.stage NOT IN ('done','cancelled')). Plane statuses
Backlog/Blocked/Needs-Input (layer-B indication, ORCH-066) do NOT change
tasks.stage (layer A), so a paused predecessor was indistinguishable from an active
one and held the FIFO gate closed against an urgent successor — the urgent fix
could not start until the paused task was formally done.
Introduces an explicit, durable, DB-resolvable per-task "park" signal — additive
nullable column tasks.paused_at (pattern of cancelled_at/track) — and a new
ORTHOGONAL scheduler "pause" axis. The serial-gate "active task" predicate becomes
`stage NOT IN ('done','cancelled') AND paused_at IS NULL` across all three points
(build_claim_clause / repo_has_active_task / _per_repo_snapshot). The terminal set
{done,cancelled} in serial_gate/task_deps/stages.py is byte-for-byte unchanged
(adr-0026 not regressed): task_deps/stages.py do NOT read paused_at, so a paused
declared dependency and an active repo_freeze STILL block (pause never bypasses
them — different axes). Anti-stale-base on resume relies on the existing deferred
branch cut (ORCH-088) + pre-merge auto_rebase_onto_main + merge-gate re-test
(ORCH-026/093/110) — no new rebase machinery.
Additive, under an independent sub-flag, never-raise, restart-safe; hot-claim
fail-OPEN and freeze fail-CLOSED preserved. STAGE_TRANSITIONS / QG_CHECKS / check_*
/ machine-verdict keys / existing table schemas are byte-for-byte untouched (this is
a queue-scheduler + observability change, not a Quality Gate).
- src/db.py: additive tasks.paused_at column (_ensure_column) + set/clear/is helpers
- src/serial_gate.py: _pause_layer_enabled() + pause-term in the 3 points; `paused`
list + per-job `reason` (freeze>dependency>active-task>null) in the /queue snapshot
- src/config.py + .env.example: serial_gate_pause_enabled (default True = true no-op)
- src/main.py: POST /serial-gate/pause|resume?work_item=<id> (by образцу unfreeze)
- tests/test_orch124_serial_gate_pause.py: TC-01 mandatory incident regress + TC-02..15
- CHANGELOG.md: [Unreleased] entry
ADR: docs/work-items/ORCH-124/06-adr/ADR-001-serial-gate-pause-without-blocking.md
Cross-cutting: docs/architecture/adr/adr-0051-serial-gate-pause-without-blocking.md
Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Recovery from the merge-gate rebase-conflict bounce. The feature branch was
rebased onto origin/main (which had merged ORCH-123). The single conflicting
hunk — docs/architecture/README.md — was resolved during the rebase: kept
ORCH-123's host-side staging-runner line AND the ORCH-116 test-runner bullet.
This follow-up commit reconciles the remainder:
- Renumber the global sweeping ADR adr-0049 -> adr-0050. ORCH-123 took adr-0049
(adr-0049-host-side-docker-execution-boundary.md) on main while ORCH-116 was
in flight, so ORCH-116 yields to the merged task and moves to the next free
number. Mechanical cross-reference reconciliation only (git mv + title + every
test-runner reference across README/internals/CLAUDE/CHANGELOG/config.py +
06-adr/ADR-001 + 12-review). Main's adr-0049 host-side references are left
byte-for-byte untouched. No design/verdict content was altered.
- Restore the ORCH-116 CHANGELOG entry that the CHANGELOG auto-merge silently
dropped (both ORCH-123 and ORCH-116 inserted at the same [Unreleased] anchor;
git kept only ORCH-123).
- Add the missing ORCH_TEST_RUNNER_* keys to .env.example (parity with the
ORCH_STAGING_RUNNER_* block; ORCH-101 canon of start keys).
Refs: ORCH-116
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The ORCH-115 deterministic staging-runner ran `docker exec` FROM INSIDE the prod
`orchestrator` container, which ships only `openssh-client git curl` — no `docker`
CLI (Dockerfile:11). `Popen(["docker", ...])` hit FileNotFoundError -> a PERMANENT
environment defect that was mis-routed as a code-fail rollback
`deploy-staging -> development` (burning developer-retries). Incident ORCH-116:
every self-hosting task reaching deploy-staging was doomed to a false rollback.
Fix (adr-0049, additive, flag-gated, never-raise, self-hosting scope; the gate /
artifact contract / STAGE_TRANSITIONS / DB schema are byte-for-byte unchanged):
- D1: build_staging_command() wraps the SAME `docker exec ... staging_check.py
... --mode stub` in `ssh <user@host> '<...>'` so it runs HOST-SIDE over the
existing trusted ssh channel (mirror self_deploy / image_freshness). New flag
staging_runner_exec_host_side (default True). No docker CLI/SDK added to the
image, docker.sock not used in-container (D2 security).
- D3: three-way classify_staging_outcome (suite-ran / permanent-env /
transient-infra), disambiguating the exit=1 collision by scanning stderr.
- D4: invariant "infra != code-fail" — permanent-env / exhausted transient-infra
end in an infra-HOLD (no rollback, no developer-retry), NOT a false FAILED
rollback (supersedes ORCH-115 D5). A really-executed failing suite still rolls
back (anti-over-tolerance). R-2 verified: a held deploy-staging task is not
rolled back by the reconciler.
- D5: prod-like preflight() of the host-side channel at startup (main.lifespan,
best-effort, never blocks).
- D8: snapshot adds permanent_env / exec_host_side / preflight.
Docs (golden source, same PR): INFRA.md execution-boundary section,
architecture/README.md, CLAUDE.md, CHANGELOG.md, .env.example.
Tests: tests/test_orch123_staging_runner_exec.py (TC-01 mandatory regression
red->green; TC-02..TC-14 + R-2). ORCH-115 anti-drift green (3 tests updated for
the D1/D4/D8 supersession). Full suite: 2131 passed.
Refs: ORCH-123
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replace the LLM `deployer` agent on the `deploy-staging` stage (self-hosting
orchestrator) with a deterministic staging-runner intercepted in launch_job
BEFORE _spawn (the deploy-finalizer / post-deploy-monitor reserved-agent
precedent). The runner executes the SAME staging suite, maps the exit-code to
`staging_status:` via the existing self_deploy.map_exit_code_to_status contract,
writes 15-staging-log.md, and initiates the UNCHANGED check_staging_status gate
exactly as a finished LLM-deployer would.
Invariant (NFR-1): this replaces only the *producer* of the artifact — the
artifact contract, the gate / _parse_staging_status / check_staging_status name,
STAGE_TRANSITIONS, the machine-verdict key `staging_status:` and the DB schema are
byte-for-byte unchanged. Additive, under a kill-switch + repo-scope CSV,
never-raise, fail-safe back to the LLM path.
Two-level outcome (D5, anti ORCH-110): suite executed -> verdict -> advance
(FAILED -> the existing deploy-staging -> development rollback + developer-retry,
same as a FAILED LLM verdict); tool-error (suite did not execute) -> bounded DEFER
-> fail-closed FAILED + alert on exhaustion (infra != code fault; never a silent
advance / false green).
First implemented slice of the LLM determinization roadmap (ORCH-118 A6,
replace-deterministic-now).
- New leaf src/staging_runner.py (never-raise; proc_group tree-kill + timeout)
- launch_job intercept + _run_staging_runner_job (mirror _run_deploy_finalizer_job)
- config: ORCH_STAGING_RUNNER_* keys (enabled/repos/timeout/infra-retry budget)
- GET /queue staging_runner observability block
- docs: llm-call-sites/roadmap/usage-policy (A6 implemented; machine blocks +
single-transport invariant intact), deployer.md (LLM branch -> fallback),
CLAUDE.md, CHANGELOG.md, overview (tech-pipeline/tech-agents/tech-quality-security),
.env.example
- tests/test_orch115_staging_runner.py (TC-01..TC-13); LLM anti-drift green (TC-14)
Refs: ORCH-115
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ORCH-118 (inventory-first, docs+tests only): publish an evidence-based map of
every place the orchestrator's control flow consumes (or can consume) an LLM
judgment, mark the control-path axis (C control-path vs P artifact-producer),
define "avoidable LLM control path" as a checkable two-bit predicate, classify
each call-site, and order the deterministic-replacement roadmap. Pin the map to
code with offline structural anti-drift tests.
- docs/architecture/llm-call-sites.md — map + machine-readable inventory block
+ control-path axis + classification + keep-LLM justifications + deterministic
non-agent paths (FR-1/FR-2/FR-3/FR-8).
- docs/architecture/llm-determinization-roadmap.md — ordered candidates BY ROLE,
savings sourced from agent_runs, recommended first slice = deployer staging
(FR-4). No fabricated follow-up Plane-IDs (R3/NFR-6).
- docs/architecture/llm-usage-policy.md — normative principle, keep/replace
criteria via the axis, definition of "avoidable LLM control path" (FR-5/FR-8).
- tests/test_llm_call_site_inventory.py — TC-01/02/03/04/05/06/09/12/13/14.
- tests/test_llm_determinization_docs.py — TC-07/08/11.
- CHANGELOG.md + docs/overview/tech-quality-security.md — golden-source sync (AC-8).
Avoidable LLM control paths = {tester, deployer}; control-path-keep = {reviewer};
not-control-path (P) = {analyst, architect, developer}. Single LLM transport =
launcher._spawn (S0); no alternative transport (TC-12). Runtime untouched:
STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema are
byte-for-byte; no replacement runners implemented (FR-7). Full suite: 2081 passed.
Refs: ORCH-118
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the root class of incident ORCH-114: a pytest/worktree process performed a
REAL write (PATCH issues state=<Done> + comment) against the PRODUCTION Plane
project, because test/staging processes inherit the live Plane token
(PLANE_HEADERS/PROJECT_ID are captured at import — a post-hoc env/token swap is a
no-op) and nothing forced them to write only to the sandbox. Symmetric to the
existing _no_telegram autouse floor.
- New pure never-raise leaf src/plane_write_guard.py (decide/audit_block/
audit_allow), wired into the 3 plane_sync write primitives (update_issue_state /
add_comment / _set_issue_state_direct) via _guard_allows_write, AT CALL TIME,
before any network step. Active ONLY in a test process (pytest in sys.modules /
PYTEST_CURRENT_TEST); live + staging runtimes (uvicorn) are a strict no-op.
- In a test process: default-deny. A write is allowed iff opt-in
(plane_test_write_enabled) AND target project in the sandbox allowlist
(plane_test_sandbox_projects, default = the one SANDBOX id). Prod is blocked even
with opt-in (allowlist sandbox-only); unresolved project -> block (fail-closed).
- Independent second layer: tests/conftest.py::_plane_sandbox_only autouse floor.
Intentionally NO prod-block kill-switch (anti back-door, NFR-6).
- Audit: block -> loud ERROR; sandbox-allow -> INFO.
- Bypass fixtures for the 3 (+1) pre-existing tests that assert on the mocked
write primitive's httpx call (header/URL/state logic), the guard is no Quality
Gate: STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema
untouched.
- Tests: tests/test_orch117_plane_write_isolation.py (TC-01 mandatory ORCH-114
regression + TC-02..TC-14). Docs: CLAUDE.md, architecture/README.md,
operations/INFRA.md, .env.example, CHANGELOG.md.
Refs: ORCH-117
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership
lease + expected-stage CAS):
P1 — documentation = golden source:
- .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS
(canon of 100% start keys, ORCH-101), next to the other gate kill-switches.
- CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags,
ADR links) so a future agent editing advance_stage/reaper/webhooks finds the
ownership invariant in the first mandatory-read doc (ORCH-078 traceability index).
P2 — should-fix:
- docs/overview/ (system showcase, ORCH-011): add transition_lease to
tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and
tech-architecture.md (components).
- ADR-001 D4 alignment: the four side-effectful-edge rollback handlers
(_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate /
_handle_image_freshness) now write `development` through the expected-stage CAS
via a shared _rollback_stage_cas helper (defence against the rollback↔done
contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the
held lease the sole owner always wins; a lost race aborts WITHOUT side effects.
Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1.
- Test isolation: make tests/test_webhooks.py order-independent by pinning the
proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only
passed by relying on import order. Drop the needless module-level ORCH_DB_PATH
setdefault in test_orch114 (fresh_db already isolates db_path).
New regression tests (TC-11): in-region rollback writes route through CAS;
rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`;
kill-switch-off rollback degenerates to the unconditional write.
ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed.
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single ownership. `advance_stage` is re-enterable and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).
Two complementary layers, both additive, under one kill-switch, never-raise:
1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
not start the heavy sub-gates at all (prevention, not post-hoc repair).
2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
a lost race aborts with NO side effect. Also closes the 6 paths that write the
stage in bypass of advance_stage (gitea x5 + plane rollback).
Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge
re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so
the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).
Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).
Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.
Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.
ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md
Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Self-deploy git pull blocked on a dirty shared main checkout (manual/abandoned
WIP from a failed/cancelled task) — incident ORCH-111: "Your local changes to
src/config.py would be overwritten by merge" wedged the prod deploy and required
manual intervention (a group risk on self-hosting).
The deploy hook (--deploy) now converges the deploy-base to a clean, current
origin/main BEFORE the pull (git fetch + reset --hard origin/main + a SCOPED
`git clean -fd`, NEVER -x), strictly preserving the rollback/log artefacts
(.deploy-prev-image-* / deploy-hook.log via -e), gitignored .env/data/*.db/build
(no -x), and sibling/.git state (out of clean scope). Gated by CHECKOUT_HYGIENE
env injected by self_deploy.build_deploy_command only when the new pure never-raise
leaf src/checkout_hygiene.py says applies(repo) (kill-switch + self-hosting scope).
Convergence after failed/cancelled is this same deploy-time self-heal — cancel_task
is NOT extended and no background janitor is introduced. Observability: the hook
writes a `hygiene` sentinel, the Phase-C finalizer reads it and sends a best-effort
Telegram alert.
Additive, under kill-switch (ORCH_CHECKOUT_HYGIENE_ENABLED, default true; off ->
bare `git pull origin main` 1:1 before ORCH-112), never-raise, self-hosting scope.
STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / DB schema / the
hook exit-code contract (0/1/2, ORCH-036) are byte-for-byte untouched.
Coverage: tests/test_deploy_checkout_hygiene.py (TC-01..TC-10; real-hook shell
simulation in a temp git repo, no network/prod/ssh, + unit). TC-01 is the
mandatory ORCH-111 regression (RED before the fix, GREEN after). Docs golden
source updated in the same PR (CLAUDE.md, CHANGELOG.md, .env.example; INFRA.md /
architecture/README.md / adr-0044 written at the architecture stage).
Refs: ORCH-112
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
On the deploy-staging -> deploy edge the live monitor stamps
agent_runs.finished_at FIRST, then runs the heavy edge sub-gates
(security/merge-gate re-test/coverage/image-freshness) in-thread for MINUTES
and only THEN _finalize_job. Reaper Tier-2 measures finished_age_s from
finished_at, so past reaper_finalize_grace_s it treated the live, long
finalizer as dead and independently re-ran the advance -> a second re-test
went red -> false rollback deploy-staging -> development while the original
finalizer concurrently merged the PR (incident ORCH-111, job 1914).
Add a process-local finalizer-ownership registry (src/finalizer_liveness.py,
never-raise): the monitor mark()s ownership right after the exit_code stamp and
clear()s it in a try/finally around the (verbatim-extracted) finalization tail,
so an exception in the monitor thread still releases ownership and a genuinely
dead finalizer is reaped. The reaper Tier-2 consults the marker only when the
kill-switch is on AND the task stage == deploy-staging AND ownership is active
-> DEFER (no second advance) and fall through to the Tier-3 backstop, which
ignores the marker (a stuck/dead finalizer is still reaped in bounded time).
In-memory is authoritative (monitor + reaper are daemon threads of one uvicorn
process); restart is covered by the startup requeue_running_jobs.
Additive, global kill-switch reaper_finalizer_liveness_enabled (default True;
false -> reaper byte-for-byte prior). STAGE_TRANSITIONS / QG_CHECKS / every
check_* / machine-verdict keys / DB schema unchanged; grace/ceiling and the
ORCH-065/109/110 budget invariant untouched; never restarts prod, never pushes
main. Observability: finalizer_defers_total + finalizer_owned in GET /queue.
Tests: tests/test_orch113_reaper_finalizer_liveness.py (TC-01..TC-08, incl. the
mandatory ORCH-111 regression: red before the fix, green after).
Refs: ORCH-113
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).
Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.
- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
pytest in its own process group (start_new_session) and tree-kills the WHOLE
group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
already CI/tester/staging-validated); fail-safe runs the re-test on any
uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
("ORCH-110","classify_retest_failure","src/merge_gate.py") to
MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
.env.example) updated in the same PR.
Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).
Refs: ORCH-110
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the observability gap between agent_hung (only tracked jobs by jobs.pid)
and orphaned pytest subprocesses the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps
running for days, starving CPU and failing merge-gate re-test — with no alert.
Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
/proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).
Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.
Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).
Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When the ORCH-109 entry was inserted above the ORCH-105 entry, the
ORCH-105 bullet had its body accidentally duplicated (the same
"слайдо-источник …" paragraph appeared twice in one bullet). Restore
the ORCH-105 entry to its canonical single-bodied form (byte-for-byte
identical to origin/main); the legitimate ORCH-109 additions are
untouched.
Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two additive, isolated launch-subsystem fixes from incident ORCH-104, without
touching STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict / DB schema.
D1 — launch-time model stamp: write the resolved model into agent_runs.model in
the SAME UPDATE as the effort stamp (ORCH-087), so the model is present from
launch, survives a timeout-kill (exit_code=-9), and is visible in-flight in
/metrics & /queue. record_usage stays an enrichment (model=COALESCE preserves the
launch stamp when the usage JSON model is None). never-raise (isolated try/except).
D3/D4 — dedicated per-role budgets: agent_timeout_developer_s=3600 /
agent_timeout_reviewer_s=3000 with a deterministic _resolve_timeout ladder
(overrides_json[agent] > dedicated role key > agent_timeout_seconds=1800; other
roles byte-for-byte). Malformed/non-positive config falls back to the global
default + WARNING (never-break). reaper_max_running_s raised 3600 -> 5400 in
lockstep to keep the ORCH-065 invariant (5400 > 3600 + 20 = 3620).
FR-4 (kill / in-flight visibility) and FR-5 (anti-salvage) are structural in the
existing code; pinned here by regression tests (tests/test_orch109_timeout_model.py,
TC-01..TC-12). Docs: .env.example, config passport, CHANGELOG, CLAUDE.md
(README/internals authored by architect in this branch).
Refs: ORCH-109
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge-gate re-test runs under the orchestrator's prod env, where the
operator legitimately set ORCH_AGENT_FALLBACK_MODEL and changed
ORCH_AGENT_MODEL_DEFAULT / ORCH_AGENT_EFFORT_*. Two ORCH-41-era tests
asserted SHIPPED defaults through the env-backed settings singleton and
failed 3/3 there, while Gitea CI (clean env) stayed green. Branch
ORCH-009 touches neither src/ nor these tests - latent non-hermetic
landmine on main, detonated by the prod env change.
- test_resolve_agent_effort.py: autouse fixture now mirrors the sibling
model-file baseline (pins shipped model/fallback fields) so the
flag-assembly tests are env-independent.
- test_resolve_agent_model.py: fixture also resets agent_fallback_model;
test_fallback_model_disabled_by_default now asserts the CLASS field
default (the actual ORCH-074 ADR-001 G4 invariant: shipped default
is ""), never-break is_valid_model asserts unchanged byte-for-byte.
Clean-env behaviour is byte-equivalent (fixtures pin exactly what an
empty env yields). Full suite: 1713 passed (was 2 failed / 1711).
Refs: ORCH-009
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Operator capability to bring a NEW project online in one pass, fully
outside the runtime and the pipeline (src/** byte-exact, no kill-switch
needed — activation is an explicit human CLI run). Reference = the
orchestrator repo itself (ORCH-52b/c/d/e canons).
* onboarding/repo-skeleton/ — parametrized kit of a new repo: 6 agent
prompt templates per canon 52d/92 (5 ru + deployer en with the
shared-host guardrail frame), reviewer doc-gate (REQUEST_CHANGES),
CLAUDE.md passport, AGENTS.md, CONTRIBUTING.md, docs/ skeleton with
mandatory operations/INFRA.md, .env.example; {{NAME}} placeholders +
stdlib render, dictionary onboarding/placeholders.json (bijection
held by tests). Canon is NOT forked: docs/_templates + docs/_standards
are live-copied from the checkout at materialization time (BR-2/D3).
* scripts/onboard_project.py — plan (default, GET-only, zero mutations)
/ apply (idempotent ensure, no delete ops at all) / verify (registry
round-trip via the actual projects._parse_projects_json, all 22 state
names incl. fail-closed Confirm Deploy/STOP, labels, webhook, kit
completeness, unresolved-placeholder scan). Closed read-only src
import list (ADR D4); state groups fixed per ADR D5 (STOP→cancelled,
terminal groups only Done/Cancelled/STOP); Gitea webhook reuses the
single global ORCH_GITEA_WEBHOOK_SECRET (TR-6); initial push ONLY
into a freshly created empty repo (INV-4 untouched); never restarts
prod / never edits .env / deletes nothing (NFR-2); secrets masked
(NFR-3); Plane CE API gaps degrade to manual-step (fail-safe).
* docs/operations/ONBOARDING.md runbook + SETUP_WEBHOOKS.md generalized
per-repo; CLAUDE.md / docs/architecture/README.md / CHANGELOG.md
updated in the same PR (golden source).
* Anti-drift tests: test_onboarding_kit.py / test_onboarding_script.py
(mocked, no network) / test_onboarding_invariants.py (snapshots of
STAGE_TRANSITIONS/QG_CHECKS, closed CLI import list, reference
.openclaw/agents/ prompts untouched). Full regression: 1713 passed.
Refs: ORCH-009
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge-gate auto_rebase_onto_main bounced this branch back: ORCH-100 landed
in main first and claimed global ADR number adr-0033 (adr-0033-sidecar-watchdog),
while this branch had created adr-0033-lessons-journal. Resolved the genuine
land race:
- rebased feature/ORCH-098-fnd onto current origin/main (linear history)
- resolved docs/architecture/README.md component-list conflict — both the
Lessons-journal and Sidecar-watchdog bullets now coexist
- renamed docs/architecture/adr/adr-0033-lessons-journal.md →
adr-0034-lessons-journal.md (next free global ADR number) + fixed the
in-file header
- updated all cross-references (CLAUDE.md, README.md, work-item ADR-001,
12-review.md) 0033→0034 for the lessons journal; ORCH-100's adr-0033
(sidecar) left intact
- recovered the ORCH-098 CHANGELOG entry silently dropped by the rebase
auto-merge (now above ORCH-100, ADR ref corrected to 0034)
No code semantics changed; src/** auto-merged cleanly (ORCH-100 did not
touch src/**). ruff: n/a locally (CI). pytest tests/ -q: 1630 passed.
Refs: ORCH-098
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the
self-hosting environment because launcher._finalize_job classifies a non-zero
exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir
defaults to the live prod dir /app/data/runs, which on the host holds REAL
accumulated agent logs; a real 2.log containing "429" flips the expected
'permanent' classification to 'transient', requeueing the job instead of
marking it 'failed'. This is ambient prod pollution, not a code fault.
Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram /
_disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir
so _run_log_path() resolves to a non-existent file and classify_log_file()
returns the documented 'permanent' default. Full suite: 1617 passed. src/**
untouched.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.
Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
schema — untouched. env_file optional so a missing .env.watchdog never breaks
`docker compose up` for the prod orchestrator.
Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up ORCH-040: legacy root:root files in /repos broke worktree creation
under uid 1000 with a raw "Permission denied" (agent never started, no diagnosis).
Three additive, kill-switch-reversible layers; STAGE_TRANSITIONS / QG_CHECKS /
check_* / machine-verdict keys / DB schema are byte-for-byte unchanged.
- D1: ensure_worktree classifies the permission class and raises an actionable
RuntimeError (cause + chown command + INFRA.md ref); non-permission errors keep
the prior raw-stderr contract; kill-switch off -> contract 1:1 as before ORCH-057.
- D2: new never-raise leaf src/fs_normalize.py — scan_ownership (TTL-cached,
early-exit per root), applies()-first scope (empty CSV -> self-hosting only),
opt-in normalize() that chowns ONLY when privileged (no-op under uid 1000).
- D3: best-effort startup detect in main.lifespan (WARNING + Telegram on mismatch,
never-fatal); read-only fs_ownership block in GET /queue; POST /fs-normalize/check.
Claim is NOT blocked — the clear early outcome is delivered by D1 at launch.
- Docs/config: .env.example flags + CHANGELOG (architecture README / adr-0031 /
INFRA.md procedure already landed on the branch).
- Tests: test_fs_normalize.py, test_git_worktree_perm.py,
test_fs_normalize_startup.py, test_api_queue.py (TC-01..TC-12). Full suite green.
Refs: ORCH-057
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
/proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
extension of the hot-path get_running_jobs()), agent_cost_totals(),
queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
by architect); .env.example ORCH_METRICS_ENABLED.
Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewer P1 (ORCH-027 attempt 2): inserting the ORCH-027 changelog
block duplicated the adjacent ORCH-095 entry — its paragraph body was
repeated verbatim, corrupting a golden-source doc and another work
item's artifact (CLAUDE.md §3). Remove the duplicate half, leaving a
single ORCH-095 body. ORCH-027 entry untouched (already correct).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Introduce a deterministic (no-LLM) coverage sub-gate that blocks coverage
degradation before a task branch merges into `main`. Existing gates judge only by
the FACT of passing (check_ci_green / check_tests_passed / merge-gate re-test), not
by completeness — so a batch autonomous run (ORCH-088) silently erodes coverage.
Pattern mirrors the security-gate (ORCH-022): leaf src/coverage_gate.py (never-raise)
+ thin check_coverage_gate in QG_CHECKS + _handle_coverage_gate splice in advance_stage,
run AFTER merge-gate (measured on the caught-up HEAD that lands in main) and BEFORE
image-freshness (fail before the expensive docker rebuild).
- measure_coverage: pytest --cov=src --cov-report=json in the per-branch worktree ->
line coverage %; None on tool error -> fail-open + WARNING by default (FR-6).
- compute_coverage_verdict (pure): absolute | baseline | both + epsilon (NFR-4 anti-flap);
baseline None -> bootstrap (absolute-only).
- coverage_baseline DB table (additive, CREATE TABLE IF NOT EXISTS) + ratchet-up in
_handle_merge_verify (deploy->done): atomic compare-and-set under merge-lease, never
decreases; bootstrap on first merge.
- Artefact 18-coverage-report.md (coverage_status: frontmatter, single source of truth);
GET /queue `coverage` block; FAIL -> Telegram; optional POST /coverage/baseline override.
- Flags ORCH_COVERAGE_* (kill-switch + self-hosting-only scope) -> enduro untouched;
STAGE_TRANSITIONS / existing check_* / verdict keys byte-for-byte unchanged (NFR-5/AC-8).
- pytest-cov==5.0.0 added to requirements.txt.
Tests: tests/test_coverage_gate.py (TC-01..TC-15). Frozen QG-registry anti-regress
tests + deploy-staging edge tests updated for the new sub-gate. Full suite green.
Docs: README / adr-0029 / PIPELINE_DOCS / 18-coverage-report.md template (architecture
stage) + CHANGELOG / CLAUDE.md / .env.example (this PR).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
render_task_tracker sends/edits the live card with parse_mode=HTML. _fmt_minutes
returns the literal "<1м" for a sub-minute stage; interpolated raw into HTML text
Telegram parsed "<1м" as an opening tag -> editMessageText 400 can't parse
entities -> edit_telegram EDIT_FAILED -> update_task_tracker early return
(anti-duplicate ORCH-087) -> the card froze (incident ORCH-093, message_id 18854).
Close the whole "unescaped data in HTML text" class per ADR-001: a module-local
_esc(x)=html.escape(str(x)) (never-raise) wraps every DATA slot (durations, status
label, model, effort, token/cost metrics) exactly once at the render boundary in
render_task_tracker/_stage_line. Source functions stay HTML-agnostic (_fmt_minutes
still returns "<1м"; escape on the boundary renders it visually identical as
<1м, so the visible format is unchanged). Intentional MARKUP slots (num_html /
link_for / _done_link / already-escaped esc_title) are NOT escaped, so the issue
number stays a clickable <a> tag and nothing is double-escaped.
A previously-frozen card auto-recovers on the next stage transition (a new safe
render edits in place, 200) — no new code, no touch to edit_telegram /
update_task_tracker / the orphan ledger, so the ORCH-087 anti-duplicate invariant
is preserved (a transient edit failure still does not spawn a new card).
STAGE_TRANSITIONS / QG_CHECKS / check_* / notification transport / DB schema are
untouched. New tests/test_tracker_html_escape.py (TC-01..TC-11); full suite green.
Refs: ORCH-095
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A DB stage=done task with 0 active jobs flapped in Plane between `Awaiting
Deploy` and `Monitoring after Deploy` instead of holding `Done` (verified live
on ORCH-061, task 47): the three deploy-phase setters were terminal-blind, so
any stale/duplicate/unknown caller under the bot token re-stamped an
intermediate status over the terminal Done, forever.
- New leaf src/deploy_status_guard.py (pure, never-raise, config-gated): decide()
-> ALLOW | CONVERGE_DONE | SUPPRESS on the entry of set_issue_awaiting_deploy /
set_issue_deploying / set_issue_monitoring. A deploy-phase status is legitimate
iff the task is non-terminal OR (done AND post-deploy window active); otherwise
done converges to Done idempotently, cancelled is suppressed (FR-2, D1/D2).
- D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in advance_stage
so window_active is True when the legitimate first Monitoring is set (the task
is already DB-done by then); a re-drive after the window closes converges to Done.
- D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when the
task became cancelled mid-window (zombie-tick guard, FR-3).
- D5: additive `reason` kwarg on the three setters + one structured log line per
verdict (work_item/caller/target/db_stage/window_active/verdict); new read-only
db.get_task_by_work_item_id; post_deploy.window_active helper.
- Flags deploy_status_guard_enabled (kill-switch -> 1:1) / deploy_status_guard_repos
(CSV; empty = self-hosting only). STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched (reads existing tasks.stage).
Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the
reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve.
Full regress green (1413). Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md
(status -> реализовано), .env.example.
Refs: ORCH-094
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
merge_pr now wraps ONLY the mutating POST /pulls/{n}/merge in a bounded
exponential-backoff retry-loop on TRANSIENT outcomes (405 "try again later",
408, any 5xx, network/timeout, and 409|422 while the PR is still mergeable);
TERMINAL outcomes (403/404/real conflict via mergeable==False) -> fast honest
False, so the ORCH-071/081 not-merged HOLD backstop is unchanged. Fixes the
ORCH-063 false HOLD + manual re-merge on Gitea's post-push mergeability hiccup.
ensure_open_pr gains an "already fully in main" guard (_branch_fully_in_main,
git merge-base --is-ancestor HEAD origin/main) BEFORE creating a PR -> new
"already-in-main" outcome avoids the garbage empty PR on a re-driven finalizer;
_handle_merge_verify skips merge_pr on that outcome and lets the authoritative
SHA-in-main check confirm -> done (not a HOLD). git error of the guard fails
OPEN to the create path.
New ORCH_MERGE_RETRY_* settings (kill-switch merge_retry_enabled -> one-shot,
max_attempts=3, backoff base=2/max=5). INV-4 (merge only via Gitea PR-merge API,
never push/force-push main), never-raise, STAGE_TRANSITIONS/QG_CHECKS/DB schema
unchanged. Docs (README merge-verify section, CLAUDE.md, CHANGELOG, .env.example)
updated in the same PR. Tests: test_merge_gate.py TC-01..12, test_config.py
TC-13, test_merge_verify.py TC-14..16; full suite green (1389).
Refs: ORCH-093
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three verified live-card defects in src/notifications.py (ORCH-067/087),
all additive and indication-only (STAGE_TRANSITIONS / QG_CHECKS / check_* /
transport / DB schema untouched; never-raise; revert = git revert):
- Деф.1 (D1): _STAGE_STATUS_LABEL covered 8 of 10 STAGE_TRANSITIONS keys —
deploy-staging and cancelled (ORCH-090) fell back to the misleading
"To Analyse". Added deploy-staging→"Deploying (staging)",
cancelled→"Cancelled"; replaced the runtime fallback for an UNMAPPED stage
with a neutral capitalized label (_neutral_stage_label). created stays an
explicit "To Analyse"; broken/None input degrades safely. Map completeness
is asserted programmatically from STAGE_TRANSITIONS.keys() (single source of
truth), not a static list.
- Деф.2 (D2): the stage-row loop drew ✅ for any stage with a finished agent
run regardless of position — after a rollback the card showed the absurd
"✅ Внедрение + 🔄 Разработка". Added read-only _pipeline_pos from the
STAGE_TRANSITIONS order and a suppression gate (✅ only when
current_pos >= _pipeline_pos(stage_key)); deploy-staging→deploy normalization
applied ONLY to the current position; is_active_stage untouched.
- Деф.3 (D3): _stage_line took only the LAST run (ORCH-069: developer 3 runs
Σ $3.98 rendered ~$0.00). It now aggregates ALL of the agent's runs with the
same per-run formulas as the task totals → strict convergence with
SUM(agent_runs) by task_id; model/effort/attempt come from the last run.
Tests: test_tracker_status_line.py (ORCH-091 TC-01..TC-03 + updated tc06);
new test_tracker_rollback_metrics.py (TC-05..TC-08). Full suite green (1370).
Docs: CHANGELOG + internals.md (architecture README already updated by architect).
Refs: ORCH-091
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Review P1: a STOP while a self-hosting task is PARKED on `deploy` awaiting the
manual `Confirm Deploy` was classified as a critical merge/deploy window solely
because the task still held the per-repo merge-lease (held from merge-gate through
deploy->done). That window is fully reversible — nothing is merged or deployed yet
(the irreversible merge_pr runs later in _handle_merge_verify, always under an
INITIATED marker). So the cancel was DEFERRED to run_deploy_finalizer, which only
runs after Phase B (Confirm Deploy) — the very step the operator pressed STOP to
avoid. Result: the deferred cancel was never applied, the task wedged non-terminal
holding the lease, blocking the repo's serial-gate (ORCH-088) and merges.
Fix: gate the merge-lease branch of cancel.in_critical_window on an actively
RUNNING actor (_task_has_running_actor). Lease held + running deploy/merge job ->
still deferred (genuine in-flight step). Lease held + no running actor (idle
deploy parking) -> NOT critical -> immediate full reset, which itself releases the
lease (step 3c) and drives the task terminal. INITIATED-marker deferral unchanged.
Also fixes review P2 (AC-6): set_task_cancel_requested now returns the first-stamp
fact (rowcount), and the deferred branch only notifies on the first transition —
a repeated STOP while still deferred no longer spams duplicate notifications.
Tests: test_d7_lease_held_idle_parking_is_not_critical,
test_d7_lease_held_with_running_actor_still_critical,
test_d7_stop_on_deploy_awaiting_confirm_full_resets,
test_d7_repeated_stop_in_critical_window_no_duplicate_notify. Full suite green (1349).
Refs: ORCH-090
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Introduce the dedicated Plane STOP status as a single declarative task-cancel
mechanism: stop the active agent (graceful SIGTERM cascade), cancel all jobs
(terminal `cancelled`, never requeued), remove the worktree + delete the remote
feature branch (never main, never force-push), drive the task to the new
system-terminal state `cancelled` and tombstone the natural keys so a later
"To Analyse" re-creates it from scratch (docs artefacts preserved). STOP during a
critical merge/deploy window is deferred until the irreversible step finishes
honestly. Also closes the relaunch hole: handle_status_start relaunch is gated to
the `analysis` stage; the only pipeline-start entry point remains "To Analyse".
Cross-cutting (adr-0026): the "task terminal" predicate is widened {done} ->
{done, cancelled} in serial_gate / task_deps / stages sink + reaper/worker
requeue guards. STAGE_TRANSITIONS exit-gates / QG_CHECKS / check_* are unchanged
(`cancelled` is a sink, not a new edge). Additive, never-raise, restart-safe,
under kill-switch ORCH_STOP_STATUS_ENABLED (off -> zero regression).
New: src/cancel.py (leaf), src/gitea.py (delete_remote_branch), tasks columns
cancelled_at/cancel_requested_at, jobs status `cancelled`, GET /queue `stop` block.
Tests: tests/test_stop_status.py (TC-01..TC-14 + D7); full suite green (1345).
Docs updated in-PR (architecture README, CLAUDE.md, README.md, .env.example,
CHANGELOG). ADR-001 D4 refinement: plane_issue_id is tombstoned too (the lookup
ORs on it) — original UUID recoverable from the parseable suffix.
Refs: ORCH-090
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add src/build_cache_pruner.py — a background daemon thread modelled 1:1 on
src/disk_watchdog.py that periodically runs STRICTLY `docker builder prune -f
--filter until=<until>` (BuildKit GC) on the HOST over ssh. It is the "second
half" of the disk-watchdog (ORCH-063): the watchdog signals, the pruner cleans.
Removes the root cause of the 07.06.2026 incident (build cache ~11GB -> disk
100% -> whole self-hosting pipeline down) automatically, без оператора.
ADR-001 (Variant A): host-over-ssh, same channel as image_freshness/self_deploy
(no docker CLI in the image). Touches ONLY the build cache — no image/system
prune, no image/container removal, never restarts the docker daemon or the prod
container (self-hosting safety). No ssh target -> tick is a no-op.
- src/config.py: ORCH_BUILD_CACHE_PRUNE_* flags + defensive validators
(interval/timeout >0, until ~ ^\d+[smhdw]?$, notify_min_gb >=0 -> safe default).
- src/main.py: start last (after disk_watchdog) / stop first in lifespan;
additive read-only build_cache_prune block in GET /queue.
- never-raise on two levels (per-command + per-tick); kill-switch
ORCH_BUILD_CACHE_PRUNE_ENABLED (false -> daemon does not start, 1:1 as before).
- STAGE_TRANSITIONS / QG_CHECKS / check_* / _parse_* / DB schema UNCHANGED;
last-run/last-result is in-memory (no migration).
- tests/test_build_cache_pruner.py: TC-01..TC-12 (23 cases, docker fully mocked).
- .env.example + CHANGELOG.md updated; INFRA.md / architecture docs already
carry the component (architecture stage).
Refs: ORCH-062
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds src/disk_watchdog.py — a background daemon thread modelled on
reconciler/job_reaper that measures host-FS fill via the mounted bind-paths
(/repos, /app/data) with shutil.disk_usage and Telegram-alerts the operator at
>= threshold (default 85%). The missing proactive signal: on 07.06.2026 the
mva154 host disk silently hit 100% and stalled the whole self-hosting pipeline.
- Pure decide_action(used_pct, threshold, prev, now, realert_s): alert on
crossing up, cooldown re-alert, single recovery below threshold (unit-tested
without a thread/timer; clock injected).
- measure_paths: shutil.disk_usage per path, dedup by st_dev, per-path
never-raise (a broken path never fails the tick).
- Config flags ORCH_DISK_MONITOR_* with defensive validation (threshold 1..100,
positive intervals -> default + warning). Kill-switch -> daemon does not start.
- Additive disk_monitor block in GET /queue; start/stop in main.lifespan.
- never-raise (per-path/per-tick/per-send); STAGE_TRANSITIONS/QG_CHECKS/check_*/
DB schema untouched, no migration (anti-spam state in-memory).
Tests: tests/test_disk_watchdog.py (TC-01..TC-12, 18 cases); full suite green
(1296). Docs: INFRA.md, .env.example, CHANGELOG.md (architecture/README.md +
ADRs authored at architecture stage).
Refs: ORCH-063
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>