src/frontmatter.py grows from a single-key reader into the full machine
contract: reader (read_frontmatter_value, unchanged), one parse primitive
(parse_frontmatter), writer (render/write_frontmatter), schema validator
(validate_schema/REQUIRED_FIELDS, warning-only by default) and a shared
strip_frontmatter helper. The five verdict gates (check_reviewer_verdict,
_parse_tests_verdict, _parse_deploy_status, _parse_staging_status,
parse_security_status) now read through the single parse_frontmatter point
instead of duplicated ad-hoc YAML logic; review_parse._strip_frontmatter and
security_gate.extract_security_findings reuse the shared helper.
Strictly backward compatible + never-raise: STAGE_TRANSITIONS, the QG_CHECKS
composition, verdict semantics (incl. ORCH-047 three-field tester + negative
token priority), reason-strings and worktree->origin/main fallback are 1:1.
The schema validator never influences a gate verdict by default; hard-fail is
reserved behind the frontmatter_validation_strict kill-switch (default False).
New formal handoff spec docs/_standards/HANDOFF_PROTOCOL.md ("stage -> required
output" + required frontmatter schema), aligned 1:1 with PIPELINE_DOCS.md.
Tests: test_frontmatter.py (TC-01..07), test_qg_verdicts.py (TC-08..15),
test_security_gate.py (TC-12), test_stages_invariants.py (TC-16). Full
tests/ green (1212).
Refs: ORCH-076
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Lift the two HUMAN gates that block an autonomous batch run (epic ORCH-088):
the BRD gate (analysis: manual Approved) and the prod-deploy gate (deploy
Phase A: manual Confirm Deploy, ORCH-059). Selective (a Plane label on the
issue), declarative, reversible, and WITHOUT touching a single technical check.
Additive, mirroring the conditional sub-gates (ORCH-035/043/058/088): leaf
src/labels.py (never-raise) + two point insertions + config flags.
STAGE_TRANSITIONS / QG_CHECKS / check_* / DB schema are NOT touched.
- autoApprove: врезка in _handle_analysis_approved_flow (files_ok branch) ->
set_issue_approved + log/Telegram/Plane-comment + advance_stage(
finished_agent=None) — the SAME path a human Approved takes (approved-via-
status -> analysis->architecture + mark_brd_review_ended). No duplicated
transition logic; re-entrancy safe.
- autoDeploy: врезка in _handle_self_deploy_phase_a after advance to deploy +
clear_state -> log/Telegram/Plane-comment + _handle_self_deploy_phase_b
(INITIATED marker, Deploying, finalizer). Only the indicative human steps are
skipped. BR-5 holds structurally: Phase A is reached only after the green edge
sub-gates, so autoDeploy can never deploy a broken build.
- plane_sync: fetch_issue_labels (None on error != []), get_project_labels
({normalized_name->uuid}, TTL cache, ambiguity sentinel), set_issue_approved.
- config flags: auto_label_enabled (kill-switch), auto_approve_label/
auto_deploy_label, auto_label_repos (empty -> self-hosting only),
auto_label_states_ttl_s. applies() (local) checked FIRST; has_label (network)
only when applies==True -> zero network / zero regression when disabled (AC-8).
- Fail-safe (never auto on doubt), transparency via log+Telegram+Plane+card,
read-only auto_labels block in GET /queue.
- Tests TC-01..TC-26 across 7 modules; docs (CLAUDE.md, architecture README,
CHANGELOG) updated in the same PR.
Refs: ORCH-089
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Этап 1 (serial e2e) пакетного автономного режима. Новая задача репо не входит
в analysis (analyst-job не выбирается, ветка не режется), пока в репо есть более
ранняя незавершённая задача (FIFO, t2.id < jobs.task_id) ИЛИ репо заморожен.
- src/serial_gate.py — новый leaf (never-raise): build_claim_clause (fail-OPEN),
is_repo_frozen (fail-CLOSED), set/clear_repo_freeze, serial_gate_applies, snapshot.
- src/db.py — идемпотентная миграция repo_freeze + serial_gate-фрагмент в claim_next_job.
- src/webhooks/plane.py + src/agents/launcher.py — отложенный срез ветки: start_pipeline
не создаёт Gitea-ветку/docs для применимого репо; релокация в _materialize_deferred_branch
на момент claim analyst-job (база = свежий origin/main с кодом предшественника, AC-6).
- src/stage_engine.py — post-deploy DEGRADED → durable per-repo freeze + Telegram-алерт.
- src/main.py — блок serial_gate в GET /queue + POST /serial-gate/unfreeze.
- src/config.py — serial_gate_enabled / serial_gate_repos / serial_gate_freeze_enabled.
FIFO-уточнение реализации (FR-2): ADR-001 D1 фиксировал t2.id != jobs.task_id; при !=
пакет одновременно созданных свежих задач взаимно блокировался бы (дедлок). t2.id <
jobs.task_id допускает самую раннюю задачу и сериализует остальные, сохраняя AC-1/R-7.
STAGE_TRANSITIONS / QG_CHECKS / check_* — без изменений. Аддитивно, под kill-switch,
never-raise, restart-safe; при выключенном флаге — нулевая регрессия (enduro не затронут).
Тесты: TC-01..TC-22 (test_serial_gate*.py + test_queue_endpoint.py); полный прогон 1114 зелёных.
Docs: README (serial gate / /queue / API / БД), CLAUDE.md, CHANGELOG.md, .env.example.
Refs: ORCH-088
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_spawn_stamps_resolved_effort упал в CI с PermissionError на '/app':
launcher._spawn хардкодил output_path='/app/data/runs/{run_id}.log' и
os.makedirs('/app/data/runs'). В контейнере /app есть, на CI-хосте
(act_runner hostexecutor) — нет, makedirs бросает -> красный CI.
Фикс корня (не только теста): базовый каталог per-run логов вынесен в
Settings.runs_dir (env ORCH_RUNS_DIR, дефолт '/app/data/runs' = прод 1:1).
Новый хелпер _run_log_path(run_id) — единый источник пути, использован в
_spawn + три прежних inline-строки логов/алертов. Тест monkeypatch-ит
settings.runs_dir на tmp_path -> окружение-независим (проверено прогоном
с принудительно недоступным /app). pytest tests/ -q: 1090 passed.
STAGE_TRANSITIONS/QG_CHECKS/схема БД не тронуты. Docs: README env-таблица,
CHANGELOG.
Refs: ORCH-087
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Закрывает F-1-пробел ORCH-068: терминал-исключение и in-memory dedup
(изначально только F-2) распространены на gate-side путь реконсилятора,
устраняя ложное «🔧 reconciler: ET-002 done разблокирована (потерян
webhook)» (особенно после рестарта).
- D1: новый _resolve_issue_status — один сетевой резолв Plane-статуса
задачи за тик (states, groups, state_uuid) после дешёвых локальных
гардов; never-raise -> ({}, {}, None) при сбое.
- D2: безусловный терминал-скип ДО Guard 2 (группа Plane completed/
cancelled, fallback на логические ключи done/cancelled, либо стадия в
БД орка ∈ {done, cancelled}); skipped_terminal_total++, не подчинён
reconcile_skip_blocked_enabled.
- D3: _is_blocked_or_needs_input переиспользует резолв D1 (опц. аргументы,
_UNSET -> самостоятельный резолв для прямых/легаси-вызовов; 1:1).
- D4: вызов _note_unblock на F-1 теперь передаёт state_uuid -> dedup
работает на обоих путях (deduped_total++ на повторе).
Анти-регресс: легитимный unblock не-терминальной застрявшей задачи
по-прежнему advance + один Telegram. STAGE_TRANSITIONS / QG_CHECKS /
схема БД / сигнатуры advance_*/_note_unblock / форма status() / новые
флаги — без изменений; never-raise сохранён.
Тесты: tests/test_reconciler.py TC-86-01..09/11,
tests/test_reconciler_plane.py TC-86-10. Полный прогон зелёный (1069).
Refs: ORCH-086
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add "disable_web_page_preview": True to the JSON payload of both
low-level Telegram primitives — send_telegram (POST /sendMessage) and
edit_telegram (POST /editMessageText). Telegram no longer expands the
Plane "Modern project management" link-preview banner under every
tracker card (bump/edit) and notify/alert message, which the default
bump mode (ORCH-067) was duplicating on each transition.
Single-point fix at the primitive level — all consumers
(update_task_tracker, notify_approve_requested, notify_error, stage
alerts from launcher/stage_engine) inherit it without code changes.
parse_mode: HTML is preserved so the ORCH-NNN issue link stays
clickable; disable_notification, bump/edit logic, the one-card-per-task
invariant, return contracts and never-raise are untouched. Unconditional,
no kill-switch (ADR-001).
Tests: tests/test_link_preview_disabled.py (TC-01..06). Docs: CHANGELOG,
CLAUDE.md, docs/architecture/README.md (Notifications component).
Refs: ORCH-080
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Close the missing invariant "by merge-verify time the branch has an open
code-PR". The pipeline created a PR only on the developer path with a fresh
worktree commit (launcher._ensure_pr), so a branch (e.g. after a manual main
restore) could reach the deploy->done merge-verify under-gate PR-less ->
merge_pr returned "no open PR" -> a FALSE HOLD (ORCH-074 incident).
- merge_gate.ensure_open_pr(repo, branch) -> (status, detail): idempotent
leaf-actor (never-raise). GET open PRs filtered head==branch AND base==main
(identical to merge_pr/ORCH-073 FR-3 — auto docs-PR is not a code-PR) ->
existed; else POST -> created; 409/422 race -> re-GET -> existed (no dup);
any other error -> failed.
- stage_engine._handle_merge_verify: врезка after validated_revision and
BEFORE merge_pr. created|existed -> proceed; failed -> honest HOLD via new
_hold_pr_create_failed (note "pr-create-failed-hold", text distinguishable
from the not-merged HOLD; task stays on deploy, NO rollback).
- launcher._ensure_pr delegated to ensure_open_pr (single PR-creation path,
shared head==branch & base==main filter); the developer-only trigger is
unchanged.
- ORCH-073 protection untouched & authoritative: merge is confirmed ONLY by
verify_merged_to_main (SHA-in-main) + check_main_regression. Real un-merged
code still HOLDs.
- Kill-switch ORCH_MERGE_VERIFY_AUTOCREATE_PR_ENABLED (default true); scope =
merge_verify_applies (self-hosting / merge_verify_repos); non-self -> no-op;
false -> ORCH-074 behaviour 1:1. No DB migration; main never push/force-push.
- Append ORCH-082 marker to MAIN_REGRESSION_MARKERS (append-only convention).
- conftest defaults the autocreate flag OFF (mirrors merge_verify_enabled) so
unrelated deploy->done tests stay 1:1 (no network).
Tests: tests/test_orch082_ensure_pr.py (TC-01..05),
tests/test_orch082_merge_verify_autocreate.py (TC-06..12). Docs: README
merge-verify block (ORCH-082), CHANGELOG, .env.example.
Refs: ORCH-082
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
resolve_agent_effort returned '' for all agents in prod because empty
ORCH_AGENT_EFFORT_*= env vars clobber pydantic class-defaults, leaving no
non-empty floor to fall back to -> --effort never reached the Claude CLI.
Add a level-4 per-role floor in resolve_agent_effort (src/agents/launcher.py):
_agent_effort_floor reads the declared class-default of agent_effort_<agent>
(model_fields[...].default), which a present-but-empty env cannot override.
Floor applies only when levels 1-3 are empty and BEFORE validation, so a typo
(non-empty) still drops to '' (never-break ORCH-41) and explicit env/override
still wins (priority preserved). config.py: agent_effort_developer high->xhigh
(single source of truth; floor follows automatically).
Refs: ORCH-081
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
G1: remove the dead `model:` line from all 6 .openclaw/agents/*.md prompts —
launcher never read it; config (agent_model_*) is the single source of truth.
G2: add is_valid_model helper (format check ^claude-…$) applied inside
resolve_agent_model's resolution cascade and at the inline --fallback-model
read in _spawn. An invalid name is logged and skipped to the next valid level
(in the limit: no --model flag), never passed to the CLI, never raises. Format
check chosen over an allowlist for forward-compatibility (ADR-001).
G3 (routing) and G4 (fallback) intentionally NOT enabled — all agents stay on
claude-opus-4-8; agent_fallback_model stays "".
Docs (golden source) updated in the same change: README model/effort table +
validation, CLAUDE.md, .env.example (ORCH_AGENT_MODEL_*/EFFORT_*/FALLBACK_MODEL),
CHANGELOG. Tests: test_agent_frontmatter_no_model.py (G1), extended
test_resolve_agent_model.py (G2 never-break).
Refs: ORCH-074
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Level A — merge/deploy serialization within one repo: reuse the existing
ORCH-043/065 merge-lease (no new mechanism); the only new logic is an
unconditional pre-merge rebase in check_branch_mergeable — under the held
lease, auto_rebase_onto_main is ALWAYS called when premerge_rebase_always
(default True), not just when the branch is behind. No-op on an up-to-date
branch (rebase keeps HEAD, force-with-lease -> "Everything up-to-date", CI
not triggered). Kill-switch off -> ORCH-043 behaviour 1:1.
Level B — declarative task dependencies: additive job_deps table
(CREATE ... IF NOT EXISTS, no live-DB migration); claim_next_job gate
(NOT EXISTS) defers a job whose depends-on tasks are not yet 'done' without
occupying a max_concurrency slot; inert on empty job_deps -> zero regression.
New leaf src/task_deps.py (never-raise): is_task_ready (fail-open), DFS cycle
detection + Blocked/alert, declare/ingest_plane_relations (db source never
hits the network on the hot path), snapshot. Telegram waiting-line, /queue
observability, reconciler skip + cycle backstop, reaper untouched.
Invariants unchanged: STAGE_TRANSITIONS, QG_CHECKS registry (dep gate is a
claim_next_job врезка, not a registered QG), DB schema of existing tables,
HTTP endpoints; non-self repos remain a no-op on empty deps/scope.
Flags: ORCH_PREMERGE_REBASE_ALWAYS, ORCH_TASK_DEPS_ENABLED, ORCH_TASK_DEPS_SOURCE.
Docs: docs/architecture/README.md, CLAUDE.md, .env.example, CHANGELOG.md,
adr-0015. Tests: tests/test_orch026_*.py (64 tests); full suite 991 green.
Refs: ORCH-026
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root-cause fix for main erosion (phantom merge): code of ORCH-067/069 reached
`done` while absent from origin/main (only their auto docs-PRs landed).
- FR-1: verify_merged_to_main confirms merge ONLY by `git merge-base
--is-ancestor <validated_sha> origin/main`; the OR-branch pr_already_merged is
removed (a merged PR no longer confirms). Empty SHA / git error -> False.
- FR-2: pr_already_merged demoted to merge_pr idempotency-guard; counts a PR only
when merged & head.ref==<branch> & base.ref=="main" (explicit in-loop filter).
- FR-3: merge_pr selects the open code-PR by head==<branch> AND base==main.
- FR-5: new deterministic check_main_regression in _handle_merge_verify (after
confirmed SHA-in-main, before done) verifies MAIN_REGRESSION_MARKERS still in
origin/main; deterministic count==0 -> alert "main regressed" + HOLD (NOT done,
no rollback); git error of the grep -> fail-open. Kill-switch
ORCH_REGRESSION_GUARD_ENABLED; non-self -> no-op.
- FR-4: root .gitattributes `CHANGELOG.md merge=union` so Unreleased edits
auto-merge on rebase without conflict (branch not rolled back).
Invariants unchanged (STAGE_TRANSITIONS, QG_CHECKS, deploy-status, merge-gate,
image-freshness, DB schema, external HTTP API); non-self repos no-op (INV-5);
never-raise (INV-1); merge only via Gitea PR-API (INV-2).
Docs: CHANGELOG, .env.example (README/ADR updated by architect). Tests:
tests/test_orch073_*.py (TC-01..18); existing merge-gate tests updated for the
new code-PR filter.
Refs: ORCH-073
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the hardcoded `len(name) > 80` cap in the QG-0 entry validation
(_qg0_errors) with a configurable Settings.qg0_title_max (env
ORCH_QG0_TITLE_MAX, default 200). The 80-char cap was a hygiene limit, not
structural, so valid 81-200 char titles were rejected without a business
reason. The limit is read dynamically per call and the error text interpolates
the active value.
Graceful degradation (AC-3, self-hosting safety): an empty/non-numeric env
value no longer crashes the process on startup. A field_validator(mode="before")
intercepts the raw env before int-parsing and falls back to 200 (never raises),
suppressing pydantic ValidationError.
Additive and backward-compatible (default 200 > old 80). Invariants unchanged:
STAGE_TRANSITIONS, QG_CHECKS registry, DB schema, slug [:30], lower limits,
soft-QG-0 warning path, API.
Refs: ORCH-069
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reconciler F-2 spammed Telegram "<wi> разблокирована" every ~120s for a
fully-synchronized Done task (incident ET-002, 191+ msgs/night) after the
ORCH-066 Plane status model merge. Two stacked defects (defense in depth):
- D1 (selection): actionable states were told apart by bare UUID, so a Done
issue aliased onto the approved UUID entered the approved branch. Now
terminal states are excluded by Plane state GROUP (completed/cancelled),
a project-independent discriminator robust to UUID aliasing; per-issue
check with a logical-key fallback when the group is unavailable.
get_project_states caches {uuid -> group} from the same /states/ fetch;
new sibling accessor get_project_state_groups.
- D2 (notification): _note_unblock fired unconditionally after _dispatch.
Now it only fires on a confirmed state change (stage before/after _dispatch;
task-appears for the start case) — handlers' contracts untouched.
- TR-3: in-memory dedup guard {issue_id -> last unblocked state} as a backstop.
- TR-4: _STATES_CACHE lived for the whole process lifetime, so a new Plane
status was invisible without a restart. Added TTL ORCH_PLANE_STATES_TTL_S
(default 300s; 0 = previous lifetime cache) reusing reload_project_states();
a failed refresh serves the stale-but-correct set, not enduro defaults.
STAGE_TRANSITIONS / QG_CHECKS / DB schema / handle_* contracts / F-1 / F-3
unchanged; never-raise preserved; self-hosting tick never restarts prod.
Observability: skipped_terminal_total / deduped_total in /queue reconcile block.
Tests: tests/test_reconciler_plane.py (TC-01..TC-10),
tests/test_plane_states_cache.py (TC-11/TC-12).
Refs: ORCH-068
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Приводит статусы доски Plane к смыслу стадий конвейера, сохраняя
инвариант «статус — индикация, а не управление». Меняется только слой B
(отображение: src/plane_sync.py + точки выставления статуса в
stage_engine.py/webhooks/plane.py/reconciler.py); слой A — машина стадий
src/stages.py::STAGE_TRANSITIONS — остаётся байт-в-байт неизменным (AC-21).
- 6 новых логических ключей статуса (to_analyse, analysis, code_review,
awaiting_deploy, deploying, monitoring) + сеттеры и диспетчер
set_issue_stage_state.
- Project-relative alias-fallback (BR-12): новый ключ деградирует на
базовый UUID того же проекта → нулевая регрессия для enduro-trails.
- Самодеплой (ORCH-036) индицирует фазы: Awaiting Deploy / Deploying;
terminal-sync для self-hosting → Monitoring after Deploy, для прочих →
терминальный Done.
- Post-deploy монитор (ORCH-021): HEALTHY → Done, DEGRADED → Blocked
(только индикация; self-hosting ALERT_ONLY, прод не трогается, BR-5).
- Reconciler: триггер старта/резюма на To Analyse; Guard 2 учитывает
новые активные ожидания без расширения skip-set на алиасах.
- never-raise контракт сеттеров и резолвера состояний сохранён.
- Раскатка — созданием статусов в Plane оператором, без kill-switch.
Инварианты не менялись: STAGE_TRANSITIONS, QG_CHECKS (12 чеков),
check_deploy_status, exit-код-контракт хука, merge-gate, схема БД.
ADR: docs/work-items/ORCH-066/06-adr/ADR-001-plane-status-model.md
Тесты: test_plane_status_model, test_plane_to_analyse_resume,
test_plane_status_failclosed + TC в существующих наборах. 774 passed.
Refs: ORCH-066
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Split the overloaded `Approved` Plane status: it served BOTH as the human BRD
gate on `analysis` AND as the silent Phase B prod-deploy trigger on `deploy`
(ORCH-036), so a routine approve could launch a self-hosting prod restart.
ORCH-059 introduces a dedicated logical status `confirm_deploy` ("Confirm
Deploy") that triggers ONLY Phase B on `deploy`; `Approved` stays purely a
pipeline gate.
- plane_sync: map "Confirm Deploy" -> "confirm_deploy" in _PLANE_NAME_TO_KEY;
intentionally absent from _DEFAULT_STATES => fail-closed (no UUID -> .get
yields None, no KeyError, no blind deploy).
- webhooks/plane: handle_issue_updated routes "Confirm Deploy" (fail-closed
.get) to new handle_confirm_deploy (guarded to stage=="deploy") ->
_try_advance_stage(confirm_deploy=True).
- stage_engine: advance_stage gains keyword-only confirm_deploy=False; Phase B
block returns early for deploy+finished_agent is None but only initiates the
deploy when confirm_deploy=True; a plain Approved is a deterministic no-op
(returns before check_deploy_status -> no false БАГ-8 rollback).
- Phase A CTA now asks the operator for "Confirm Deploy", not "Approved".
Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS, check_deploy_status, hook
exit codes, Phases A/C, merge-gate, DB schema. Conditional like ORCH-35/36
(self-hosting only). Docs updated (CLAUDE.md, architecture/README.md, CHANGELOG).
Refs: ORCH-059
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TC-17 seeded 17-security-report.md via get_worktree_path() which resolves to
settings.worktrees_dir (default /repos/_wt) -> the test wrote into the real shared
host worktree path. In CI that dir is owned by another user -> PermissionError.
Monkeypatch git_worktree.settings.worktrees_dir to tmp_path/_wt (same pattern as
test_git_worktree.py / test_merge_gate.py). Prod logic untouched.
Add a deterministic (no-LLM) security sub-gate on the deploy-staging -> deploy
edge, run FIRST (before merge-gate ORCH-043 and image-freshness ORCH-058) so it
fails cheaply before any expensive rebase/rebuild, and scans origin/main..HEAD
before rebase so a task is never blamed for a CVE introduced by an updated main.
Why: the autonomous pipeline merged branches into main with no check for a leaked
secret or a vulnerable dependency. For the self-hosting orchestrator (one shared
prod instance serving every project from a shared DB) a single leak/CVE landed in
the prod of all projects (CLAUDE.md self-hosting, section 8).
- New leaf src/security_gate.py (never-raise): gitleaks (offline, fail-closed on
tool error => secrets guarantee is unconditional) + pip-audit (best-effort;
unreachable CVE feed degrades fail-open + loud warning by default, strict via
security_dep_audit_fail_closed). Verdict lives ONLY in 17-security-report.md
YAML frontmatter (write -> read-back single source of truth); FAIL is
authoritative; missing/broken frontmatter => fail-closed.
- check_security_gate thin wrapper registered in QG_CHECKS (lazy import, no cycle).
- _handle_security_gate wired FIRST in advance_stage deploy-staging block: FAIL ->
rollback to development + developer-retry (cap MAX_DEVELOPER_RETRIES); task_desc
carries verbatim findings (ORCH-046 pattern). No merge-lease release (runs before
lease acquire). Self-hosting safe: only reads/scans/writes, never deploys.
- Conditional rollout (security_gate_enabled + security_gate_repos; empty scope ->
self-hosting only). 6 new ORCH_SECURITY_* settings.
- Infra: pinned gitleaks Go binary in Dockerfile (+curl/ca-certificates), pip-audit
in requirements.txt, versioned .gitleaks.toml at repo root.
- STAGE_TRANSITIONS and DB schema unchanged.
Docs: docs/architecture/README.md (marked realized), CLAUDE.md (artifact 17),
CHANGELOG.md. Tests: test_security_gate.py, test_qg_security.py,
test_stage_engine_security_gate.py + updated registry/edge snapshots.
Refs: ORCH-022
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes
agent_runs.exit_code FIRST, then does git push / PR / Plane comments before
_finalize_job, and the agent pid is already dead in that window — so the old
"exit_code recorded -> reap now" had no grace and could race a healthy job.
Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job)
BEFORE the atomic claim, so a reaper that lost the race had already enqueued
the next stage (dup advance / dup enqueue), violating ADR-001 Р-1.
Fix:
- Tier-2 grace: reap only once agent_runs.exit_code has been recorded for
>= reaper_finalize_grace_s (new setting, default 300s; > max finalization
window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New
finished_age_s column computed in get_running_jobs.
- claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the
reconciler pattern) to choose the terminal status, then atomically claim
'done' FIRST; only the claim winner runs the advance. A loser performs no
side effects -> no dup advance / dup enqueue.
Docs (golden source) updated in the same change: ADR-001, global adr-0011,
README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011
link). New tests cover the grace window, lost-claim no-side-effects, and the
already-advanced idempotent path.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pr_already_merged guard was defined + unit-tested but consulted by zero
production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path
consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The
actual merge actor is the LLM deployer agent (it merges the feature PR at the
start of the `deploy` stage), so on a reaper re-drive of an already-merged PR
the deployer would blindly re-merge → Gitea error → false БАГ-8 rollback; AC-11
("no second merge") was not met deterministically.
Wire the guard at the real consultation point — the deployer prompt — so it
runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR
is already merged. check_branch_mergeable is left untouched (AC-13: check_*
behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a
deploy-stage re-drive where the second-merge risk lives).
- .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule.
- src/merge_gate.py: docstring names the deployer-prompt consultation point.
- docs/architecture/README.md, CHANGELOG.md: state the consultation point so
golden-source matches implementation.
- tests/test_merge_gate.py: regression test asserting the deployer prompt wires
the guard (so it can't silently become dead code again).
pytest tests/ -q: 743 passed.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-process). For exit0 the canonical QG is
the source of truth via gate-driven advance, not "exit0".
Also proactively reclaims stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).
New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.
Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. restart-safe, never-raise per unit of background work.
Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The ORCH-058 staging rebuild (check_staging_image_fresh) builds the image with
the task git-worktree as the docker build context. A fresh worktree holds only
tracked files, but the Dockerfile did `COPY data/ ./data/` — and `data/` (the
SQLite dir) is gitignored, so it is absent from that context: `docker build`
failed with exit 1 ("BUILD-STAGING: docker build failed - aborting"), bouncing
the task off deploy-staging back to development in a loop.
The COPY was dead weight regardless: `data/` is always supplied at runtime as a
bind-mount volume (./data:/app/data, see docker-compose.yml) which shadows
anything baked into the image. Replace it with `RUN mkdir -p /app/data` so the
mountpoint exists without depending on the build context.
Regression guard: test_tc08b_dockerfile_does_not_copy_gitignored_data_dir
forbids COPY of any gitignored path (the worktree-context invariant).
Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend pipeline responsibility past deploy->done: after the terminal
transition for an applicable repo, arm a ~15min observation window that
probes prod and reacts to a degradation the restart-time health-check
missed ("green deploy, red prod").
- src/post_deploy.py: new leaf module (config + lazy qg/db only).
Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/),
no DB migration. probe_signals/classify/decide_action/run_rollback,
all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of
deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive
health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod
orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty ->
self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md),
architecture README, .env.example (ORCH_POST_DEPLOY_*).
Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The self-hosting orchestrator looped on deploy-staging -> development because
scripts/staging_check.py exited 1 on ANY failed check, so two infra-only checks
(C9a sandbox branch / C9b analyst-job — caused by SANDBOX bot accounts not being
members of the sandbox Plane project, NOT a pipeline regress) forced
staging_status: FAILED -> rollback -> loop, burning developer retries and tokens.
Direction (б) per ADR-001: classify staging checks as REAL (all pipeline checks,
fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable). New leaf
module src/staging_verdict.py (stdlib-only, never-raise): classify_check +
compute_staging_verdict fold per-check results into a tolerant-but-fail-closed
verdict — any REAL failure -> FAILED/exit1 (safety net holds under any flag);
only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list; only infra &
strict -> FAILED/exit1; any internal error -> FAILED/exit1 (never a false green).
staging_check.py now auto-classifies each check (public 3-tuple _items shape kept
as an ORCH-048 b6 regression guard), exposes categorized_items(), prints
INFRA-WAIVED/VERDICT lines, and exits via the verdict; new --strict flag forces
legacy strictness per-run. Kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED
(default true) restores legacy strict mode globally. launcher gains
action_stage_no_changes_note so "no changes to commit" on action stages is logged
as expected, not treated as under-delivery.
Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status:/
deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB
migration. Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG.
Refs: ORCH-061
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reconciler F-1 could not tell "stuck by a lost webhook" from "escalated:
max developer retries reached, waiting for a human". With CI green and a
reviewer that kept sending REQUEST_CHANGES up to the cap, every tick
re-unblocked development -> review -> rollback -> re-unblock (incident
ET-013, infinite bounce: wasted agent runs, Telegram spam, parasitic load
on the shared self-hosting instance).
Add two pre-gate guards in Reconciler._reconcile_gate_task (after the
existing analysis/no-gate/active-job/grace guards, before the gate
pre-evaluation), each an early silent return (no advance, no unblocked_total
increment, no notifications):
- Guard 1 (escalated, deterministic, no network, checked first):
developer_retry_count(task_id) >= MAX_DEVELOPER_RETRIES. Promote
stage_engine._developer_retry_count to public developer_retry_count
(single source of truth; private alias kept). Limit from the constant,
not a literal 3.
- Guard 2 (explicit human Plane gate, Variant A, no DB migration): new
never-raise plane_sync.fetch_issue_state + Reconciler._is_blocked_or_needs_input;
any error/None/unresolved project -> conservative skip. New sub-flag
ORCH_RECONCILE_SKIP_BLOCKED_ENABLED mutes only the networked Guard 2.
F-2 unchanged: Blocked/Needs Input are outside {in_progress, approved,
rejected} so they are never replayed (regression test added). DB schema,
STAGE_TRANSITIONS, QG_CHECKS, never-raise, analysis carve-out and
kill-switches untouched.
Refs: ORCH-060
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per ADR-001 step 3 / AC-4: after the freshly rebuilt staging container is
healthy, run staging_check.py --mode stub against the fresh 8501 stand BEFORE
reporting success, so the EXACT artefact BUILD-ONCE retagged to prod is the one
validated on staging. Fail-closed: staging_check rc!=0 -> exit 1 (not promoted).
- Invoked inside the container (docker exec $TARGET_SERVICE) per the canonical
signature in scripts/staging_check.py header, --base-url http://localhost:$TARGET_PORT.
- Targets ONLY 8501 (staging), never 8500 (prod) - AC-9.
- --mode stub: fast, deterministic, no LLM spend (ADR).
- Static regression test test_tc07_build_staging_runs_staging_check_stub_after_health:
asserts staging_check.py + --mode stub present, runs after health, before exit 0,
fail-closed, and never hard-codes prod 8500.
Closes reviewer P0/P1 (ORCH-058 attempt 3): the committed --build-staging hook
recomputed GIT_SHA=$(git rev-parse HEAD) in $REPO (prod clone on `main`) and built
`docker build ... "$REPO"`, ignoring the caller-supplied BUILD_CONTEXT/GIT_SHA. On
the deploy-staging -> deploy edge the PR is not yet merged, so `main` HEAD != the
validated SHA -> the staging image got the wrong revision label and Strategy-B's
guard fail-closed on EVERY valid self-deploy (AC-6 deadlock). It also only did
`docker build` + exit 0 — never recreating 8501 nor health-checking — so
rebuild_staging_image's rc=0 ("rebuilt and healthy") was a lie (AC-4 unmet).
- Hook --build-staging now honours caller BUILD_CONTEXT (validated worktree) and
GIT_SHA, recreates orchestrator-staging on the fresh image and runs the 10x6s
health-check; build/health failure -> exit 1 (FAILED contract preserved).
- image_freshness.rebuild_staging_image: document why COMPOSE_PROFILE/TARGET_SERVICE/
TARGET_PORT are intentionally omitted (hook STAGING defaults -> 8501 only, P2).
- tests: assert the caller<->hook contract (builds from $BUILD_CONTEXT, no
`git rev-parse HEAD` recompute, recreates + health-checks 8501) so the P0
regression can't pass green again (P1).
Refs: ORCH-058
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-deploy after a FAILED prod deploy wedged the task on `deploy`: the
sentinel markers (approve-requested/initiated/result) are keyed by the
stable work_item_id, so after the БАГ-8 rollback (deploy -> development)
and a developer fix, Phase B's idempotency-guard saw a STALE `initiated`
and became a no-op — the detached hook never re-launched and the
finalizer was never enqueued. Add self_deploy.clear_state (never-raise,
idempotent) and call it on the check_deploy_status FAILED rollback and at
the start of Phase A, so every fresh prod-deploy pass starts clean.
Also document the new ORCH_SELF_DEPLOY_* / ORCH_DEPLOY_* descriptors in
the canonical .env.example (CLAUDE.md rule #8, ТЗ §2.6), modelled on the
ORCH-043 merge-gate block (placeholders only, secrets not committed).
Contracts untouched: STAGE_TRANSITIONS, QG_CHECKS, _parse_deploy_status,
БАГ-8, merge-gate.
Refs: ORCH-036
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Конвейер продвигается только входящими webhook; потерянное событие (502 на
ребилде, отсутствие ретраев у Plane/Gitea, неразрезолвленный sha→branch)
оставляет задачу молча застрявшей (класс инцидента ORCH-044). Новый фоновый
daemon-поток src/reconciler.py (паттерн queue_worker) доигрывает пропущенный
переход через те же штатные гейты/обработчики, что и webhook:
- F-1 gate-side: для задач stage≠done, без активного job и age(updated_at) ≥
grace_for_stage(stage) — read-only пред-оценка канонического QG; зелёный →
stage_engine.advance_stage(..., finished_agent=None); красный → тишина (спам
нотификаций структурно невозможен). analysis F-1 не трогает (человеческий гейт).
- F-2 plane-side: опрос Plane API per-project (plane_sync.list_issues_by_state,
курсорная пагинация, never-raise) → реплей In Progress/Approved/Rejected через
существующие handle_status_start/handle_verdict (async из sync-потока, asyncio.run).
- F-3: усиление sha→branch в handle_ci_status — БД-fallback по единственной
development-задаче repo (неоднозначность → не резолвим), debug→info.
- Анти-дубль на создании (db.create_task_atomic под process-wide Lock): гонка
reconcile↔webhook не плодит второй task/branch/worktree/analyst-job (AC-4).
- F-4 observability: лог-строка разблокировки + Telegram + блок reconcile в /queue.
Старт/стоп в main.lifespan (после worker.start() / перед worker.stop()),
restart-safe, never-raise на единицу работы. Kill-switches ORCH_RECONCILE_ENABLED
/ ORCH_RECONCILE_PLANE_ENABLED + grace-настройки. Схема БД и реестры
STAGE_TRANSITIONS/QG_CHECKS не менялись.
Тесты: test_reconciler.py, test_reconciler_plane.py, test_gitea_sha_resolve.py,
test_config.py (33 новых, 563 всего зелёные). Документация обновлена (golden source):
architecture/README.md, INFRA.md, README.md, CHANGELOG.md, adr-0007 → accepted.
Refs: ORCH-053
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deterministic (no-LLM) sub-gate on the deploy-staging -> deploy edge that
catches a feature branch up to the CURRENT origin/main, re-tests the combined
tree, and serialises merges with a per-repo file lease — so two green parallel
branches can no longer break main (self-hosting safety for the orchestrator repo).
- src/merge_gate.py: branch_is_behind_main, auto_rebase_onto_main (push
--force-with-lease ONLY the task branch, NEVER main), retest_branch, and a
file merge-lease (atomic O_CREAT|O_EXCL, holder-aware release, stale reclaim).
Strict never-raise contract; all git ops in the per-branch worktree.
- src/qg/checks.py: check_branch_mergeable composes the primitives under the
lease; registered in QG_CHECKS. Conditional rollout (merge_gate_enabled /
merge_gate_repos, default self-hosting only).
- src/stage_engine.py: sub-gate hook on deploy-staging (not a new stage). PASS ->
advance; "merge-lock busy" -> DEFER (re-queue with available_at, anti-deadlock
at max_concurrency=1, capped); conflict/red re-test -> rollback to development
+ developer retry (capped by MAX_DEVELOPER_RETRIES). Lease released on
deploy->done / rollback / PR-merged webhook.
- src/db.py: enqueue_job(available_at_delay_s=...) for the defer (no schema change).
- src/webhooks/gitea.py: holder-aware lease release on PR-merged.
- src/config.py + .env.example: ORCH_MERGE_* settings.
Docs: README + adr-0006 (architect) already cover the design; CHANGELOG updated.
Tests: test_merge_gate.py, test_qg_merge_gate.py, test_merge_gate_race.py,
test_stage_engine.py::TestMergeGate, test_config.py, QG-registry snapshot.
Full suite: 535 passed.
Refs: ORCH-043
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both compose services (orchestrator, orchestrator-staging) now declare
user: "1000:1000" so pipeline artifacts (git worktree, docs/work-items
commits) are created as slin:slin on the host — git pull/reset under slin
no longer fail with permission errors. docker.sock access preserved via
group_add: ["999"]. SSH mount target aligned with the launcher-forced
HOME=/home/slin (/root/.ssh -> /home/slin/.ssh). launcher.py and Dockerfile
unchanged. INFRA.md and CHANGELOG.md updated; host-prerequisites (P-1..P-4)
documented.
Refs: ORCH-040
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ORCH-042: new ORCH_TRACKER_MODE (Settings.tracker_mode, default edit) selects
the live-tracker card behaviour. bump mode re-creates the card at the bottom of
the chat on every update (delete_telegram + send silently + repoint message_id),
keeping the "one card per task" invariant: <=1 new message per call, repoint
only on successful send, delete result never gates the send. New never-raising
delete_telegram helper. Anything != "bump" resolves to edit (zero regression).
Also russify/cosmetic-fix the card text (both modes): "Подтверждение BRD" label,
✅ after approve-gate, Russian stage labels, "📦 Внедрено". Docs updated in the
same PR (CHANGELOG, internals.md, .env.example).
Refs: ORCH-042
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
B6 false-FAILed because it built the project registry from the
launcher process-env via a host-path hack (sys.path.insert +
importlib.reload), not from the running staging instance. Run from the
host, ORCH_PROJECTS_JSON is unset -> default ET+ORCH registry -> false
FAIL -> spurious deploy-staging -> development rollback.
Variant (v) per ADR-001: remove the host-path hack; canonically run the
suite INSIDE orchestrator-staging via docker exec so src.projects
resolves from /app (PYTHONPATH) with .env.staging. Verdict logic
extracted into pure _evaluate_b6(known) -> (passed, detail) +
_known_project_ids_from_registry() / _run_b6() with deterministic FAIL on
source unavailability. deployer.md and STAGING_CHECK.md updated to the
docker exec command. src/projects.py, .env* and checks A/B4/B5/C
untouched.
Refs: ORCH-048
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
При заворотах на development task_desc теперь несёт дословный must-fix текст
(P0/P1 ревьюера, причина FAIL тестера) вместо одной ссылки на файл — developer-
агент видит суть претензий сразу и не повторяет ту же ошибку, экономя retry-
бюджет и токены общего инстанса.
- Новый defensive-модуль src/review_parse.py (never-raise): extract_review_findings
(P0/P1 из 12-review.md ## Findings), extract_test_failures (фрагмент тела
13-test-report.md: pytest output / FAIL-строки / Итог), усечение по лимиту.
- Две rollback-ветки stage_engine: встраивают текст + сохраняют ссылку на полный
файл; graceful-фоллбэк на ссылку-строку при битом/пустом артефакте.
- Последовательность отката, retry-счётчик, поля AdvanceResult, реестр QG_CHECKS
не менялись.
- Доки: README (Stage Engine / Откаты), CHANGELOG.
- Тесты: tests/test_review_parse.py, test_stage_engine.py::TestRollbackTaskDescEmbedding.
Refs: ORCH-046
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_parse_tests_verdict now accepts three equal-rank machine-readable
frontmatter fields in 13-test-report.md — result: (canonical tester
output), verdict: and status: (legacy/enduro-trails). Any one non-empty
field suffices; a negative token in any field stays authoritative.
Fixes the producer/consumer contract mismatch where the tester emits
`result: PASS` (per .openclaw/agents/tester.md) but the gate only read
verdict:/status:, causing a testing->development rollback loop until
MAX_DEVELOPER_RETRIES (observed on ORCH-17). Token sets frozen and gate
signature/QG_CHECKS unchanged for full backward compatibility.
Refs: ORCH-047
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>