Re-deploy after a FAILED prod deploy wedged the task on `deploy`: the
sentinel markers (approve-requested/initiated/result) are keyed by the
stable work_item_id, so after the БАГ-8 rollback (deploy -> development)
and a developer fix, Phase B's idempotency-guard saw a STALE `initiated`
and became a no-op — the detached hook never re-launched and the
finalizer was never enqueued. Add self_deploy.clear_state (never-raise,
idempotent) and call it on the check_deploy_status FAILED rollback and at
the start of Phase A, so every fresh prod-deploy pass starts clean.
Also document the new ORCH_SELF_DEPLOY_* / ORCH_DEPLOY_* descriptors in
the canonical .env.example (CLAUDE.md rule #8, ТЗ §2.6), modelled on the
ORCH-043 merge-gate block (placeholders only, secrets not committed).
Contracts untouched: STAGE_TRANSITIONS, QG_CHECKS, _parse_deploy_status,
БАГ-8, merge-gate.
Refs: ORCH-036
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add the optional, backward-compatible SOURCE_IMAGE branch to
orchestrator-deploy-hook.sh: when set, retag the staging-validated image
onto TARGET_IMAGE (docker tag) before `up -d --no-build` instead of
rebuilding — guarantees prod runs the exact artefact that passed staging
(AC-7 / TC-14). Unset -> prior behaviour; exit-code contract (0/1/2) and
health-loop untouched.
Update golden-source docs (AC-13): rewrite deployer.md `deploy` stage from
"paper SUCCESS" to the executable self-deploy (Phase A/B/C, no self-restart
from inside the container) and add the ORCH-036 CHANGELOG entry.
Refs: ORCH-036
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-run of deploy-staging gate (merge-gate defer cycle). Canonical
staging_check.py (mode=stub) ran inside orchestrator-staging (8501);
all 10 checks passed (exit 0). No prod (8500) container touched.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Конвейер продвигается только входящими webhook; потерянное событие (502 на
ребилде, отсутствие ретраев у Plane/Gitea, неразрезолвленный sha→branch)
оставляет задачу молча застрявшей (класс инцидента ORCH-044). Новый фоновый
daemon-поток src/reconciler.py (паттерн queue_worker) доигрывает пропущенный
переход через те же штатные гейты/обработчики, что и webhook:
- F-1 gate-side: для задач stage≠done, без активного job и age(updated_at) ≥
grace_for_stage(stage) — read-only пред-оценка канонического QG; зелёный →
stage_engine.advance_stage(..., finished_agent=None); красный → тишина (спам
нотификаций структурно невозможен). analysis F-1 не трогает (человеческий гейт).
- F-2 plane-side: опрос Plane API per-project (plane_sync.list_issues_by_state,
курсорная пагинация, never-raise) → реплей In Progress/Approved/Rejected через
существующие handle_status_start/handle_verdict (async из sync-потока, asyncio.run).
- F-3: усиление sha→branch в handle_ci_status — БД-fallback по единственной
development-задаче repo (неоднозначность → не резолвим), debug→info.
- Анти-дубль на создании (db.create_task_atomic под process-wide Lock): гонка
reconcile↔webhook не плодит второй task/branch/worktree/analyst-job (AC-4).
- F-4 observability: лог-строка разблокировки + Telegram + блок reconcile в /queue.
Старт/стоп в main.lifespan (после worker.start() / перед worker.stop()),
restart-safe, never-raise на единицу работы. Kill-switches ORCH_RECONCILE_ENABLED
/ ORCH_RECONCILE_PLANE_ENABLED + grace-настройки. Схема БД и реестры
STAGE_TRANSITIONS/QG_CHECKS не менялись.
Тесты: test_reconciler.py, test_reconciler_plane.py, test_gitea_sha_resolve.py,
test_config.py (33 новых, 563 всего зелёные). Документация обновлена (golden source):
architecture/README.md, INFRA.md, README.md, CHANGELOG.md, adr-0007 → accepted.
Refs: ORCH-053
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Live staging-stand suite (scripts/staging_check.py, stub mode) ran inside
orchestrator-staging: 10/10 checks PASS, exit code 0. Merge-gate edge
(deploy-staging → deploy) cleared for ORCH-043.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Deterministic (no-LLM) sub-gate on the deploy-staging -> deploy edge that
catches a feature branch up to the CURRENT origin/main, re-tests the combined
tree, and serialises merges with a per-repo file lease — so two green parallel
branches can no longer break main (self-hosting safety for the orchestrator repo).
- src/merge_gate.py: branch_is_behind_main, auto_rebase_onto_main (push
--force-with-lease ONLY the task branch, NEVER main), retest_branch, and a
file merge-lease (atomic O_CREAT|O_EXCL, holder-aware release, stale reclaim).
Strict never-raise contract; all git ops in the per-branch worktree.
- src/qg/checks.py: check_branch_mergeable composes the primitives under the
lease; registered in QG_CHECKS. Conditional rollout (merge_gate_enabled /
merge_gate_repos, default self-hosting only).
- src/stage_engine.py: sub-gate hook on deploy-staging (not a new stage). PASS ->
advance; "merge-lock busy" -> DEFER (re-queue with available_at, anti-deadlock
at max_concurrency=1, capped); conflict/red re-test -> rollback to development
+ developer retry (capped by MAX_DEVELOPER_RETRIES). Lease released on
deploy->done / rollback / PR-merged webhook.
- src/db.py: enqueue_job(available_at_delay_s=...) for the defer (no schema change).
- src/webhooks/gitea.py: holder-aware lease release on PR-merged.
- src/config.py + .env.example: ORCH_MERGE_* settings.
Docs: README + adr-0006 (architect) already cover the design; CHANGELOG updated.
Tests: test_merge_gate.py, test_qg_merge_gate.py, test_merge_gate_race.py,
test_stage_engine.py::TestMergeGate, test_config.py, QG-registry snapshot.
Full suite: 535 passed.
Refs: ORCH-043
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Staging check suite passed 10/10 (exit 0), run canonically inside
orchestrator-staging via the Docker Engine API (docker exec equivalent).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both compose services (orchestrator, orchestrator-staging) now declare
user: "1000:1000" so pipeline artifacts (git worktree, docs/work-items
commits) are created as slin:slin on the host — git pull/reset under slin
no longer fail with permission errors. docker.sock access preserved via
group_add: ["999"]. SSH mount target aligned with the launcher-forced
HOME=/home/slin (/root/.ssh -> /home/slin/.ssh). launcher.py and Dockerfile
unchanged. INFRA.md and CHANGELOG.md updated; host-prerequisites (P-1..P-4)
documented.
Refs: ORCH-040
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ORCH-042: new ORCH_TRACKER_MODE (Settings.tracker_mode, default edit) selects
the live-tracker card behaviour. bump mode re-creates the card at the bottom of
the chat on every update (delete_telegram + send silently + repoint message_id),
keeping the "one card per task" invariant: <=1 new message per call, repoint
only on successful send, delete result never gates the send. New never-raising
delete_telegram helper. Anything != "bump" resolves to edit (zero regression).
Also russify/cosmetic-fix the card text (both modes): "Подтверждение BRD" label,
✅ after approve-gate, Russian stage labels, "📦 Внедрено". Docs updated in the
same PR (CHANGELOG, internals.md, .env.example).
Refs: ORCH-042
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>