fix(deploy): clear stale self-deploy markers on rollback; document env

Re-deploy after a FAILED prod deploy wedged the task on `deploy`: the sentinel markers (approve-requested/initiated/result) are keyed by the stable work_item_id, so after the БАГ-8 rollback (deploy -> development) and a developer fix, Phase B's idempotency-guard saw a STALE `initiated` and became a no-op. The detached hook never re-launched and the finalizer was never enqueued.

Add self_deploy.clear_state (never-raise, idempotent) and call it on the check_deploy_status FAILED rollback and at the start of Phase A, so every fresh prod-deploy pass starts clean.

Also document the new ORCH_SELF_DEPLOY_* / ORCH_DEPLOY_* descriptors in the canonical .env.example (CLAUDE.md rule #8, ТЗ §2.6), modelled on the ORCH-043 merge-gate block (placeholders only, secrets not committed).

Contracts untouched: STAGE_TRANSITIONS, QG_CHECKS, _parse_deploy_status, БАГ-8, merge-gate.

Refs: ORCH-036
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
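
The wedge described in the message can be reproduced in miniature. The following is only a sketch under stated assumptions: `state_dir`, `write_marker`, `phase_b` and this `clear_state` are hypothetical stand-ins, not the repository's functions, and the only detail taken from the commit is the on-disk layout `<repos_dir>/.deploy-state-<repo>/<wi>/` with an `initiated` marker that the Phase B guard checks.

```python
import shutil
import tempfile
from pathlib import Path


def state_dir(root: Path, repo: str, work_item_id: str) -> Path:
    # Layout taken from the changelog entry; everything else here is illustrative.
    return root / f".deploy-state-{repo}" / work_item_id


def write_marker(root: Path, repo: str, work_item_id: str, name: str) -> None:
    d = state_dir(root, repo, work_item_id)
    d.mkdir(parents=True, exist_ok=True)
    (d / name).write_text("t")


def phase_b(root: Path, repo: str, work_item_id: str) -> str:
    """Hypothetical idempotency-guard: launch only if `initiated` is absent."""
    if (state_dir(root, repo, work_item_id) / "initiated").exists():
        return "no-op"
    write_marker(root, repo, work_item_id, "initiated")
    return "initiated"


def clear_state(root: Path, repo: str, work_item_id: str) -> bool:
    """Best-effort removal of the whole state dir; a missing dir is success."""
    try:
        shutil.rmtree(state_dir(root, repo, work_item_id))
    except FileNotFoundError:
        return True
    except OSError:
        return False
    return True


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        root, repo, wi = Path(tmp), "orchestrator", "ORCH-036"
        outcomes = [phase_b(root, repo, wi)]    # first pass launches the hook
        # Deploy FAILED and rolled back; the marker survives because it is
        # keyed by the stable work item id, so the retry is a no-op.
        outcomes.append(phase_b(root, repo, wi))
        clear_state(root, repo, wi)             # what the rollback now does
        outcomes.append(phase_b(root, repo, wi))  # the retry launches again
        return outcomes
```

The second call is the wedge: nothing is launched and nothing will ever write a new `result`. Clearing the directory between passes is what lets the third call initiate again.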
35
.env.example
@@ -36,3 +36,38 @@ ORCH_MERGE_RETEST_TARGET=tests/
ORCH_MERGE_LOCK_TIMEOUT_S=300
ORCH_MERGE_DEFER_DELAY_S=60
ORCH_MERGE_DEFER_MAX_ATTEMPTS=5
# ORCH-036: executable self-deploy of the `deploy` stage. For the self-hosting repo
# (orchestrator) the stage REALLY restarts prod (8500) via a detached host hook;
# deploy_status: SUCCESS means proven health-ok, not an LLM declaration. Three
# deterministic phases (A: request approve, B: human Approved -> detached deploy,
# C: finalizer maps hook exit-code -> deploy_status). Non-self repos: unchanged
# synchronous ssh deploy. SECRETS / host paths live ONLY on the host — do NOT commit.
# SELF_DEPLOY_ENABLED -> global kill-switch (false -> legacy synchronous deploy for all).
# SELF_DEPLOY_REPOS -> CSV of repos where Phase A/B/C is REAL; empty -> only the
# self-hosting repo (orchestrator); others -> no-op (mirrors ORCH-35).
# DEPLOY_REQUIRE_MANUAL_APPROVE -> require a human Plane "Approved" before the prod
# deploy (true on rollout; full auto is ORCH-54).
# DEPLOY_FINALIZE_DELAY_S -> delay before the first/each finalize poll (>= hook+health).
# DEPLOY_FINALIZE_MAX_ATTEMPTS -> bounded finalize-defer budget (anti-livelock).
# DEPLOY_SSH_USER / DEPLOY_SSH_HOST -> ssh target for the host hook (DEPLOY_SSH_HOST
# empty -> detached deploy will NOT launch; set on the host).
# DEPLOY_HOOK_SCRIPT -> path to the hook ON THE HOST (relative to the repo).
# DEPLOY_HOST_REPO_PATH -> orchestrator clone path on the host.
# DEPLOY_PROD_SOURCE_IMAGE -> staging-validated image, retagged build-once (no rebuild).
# DEPLOY_PROD_TARGET_SERVICE / _PORT / _IMAGE / _COMPOSE_PROFILE -> prod compose profile.
# DEPLOY_PROD_PREV_IMAGE_FILE -> prod prev-image snapshot (separate from staging's).
ORCH_SELF_DEPLOY_ENABLED=true
ORCH_SELF_DEPLOY_REPOS=
ORCH_DEPLOY_REQUIRE_MANUAL_APPROVE=true
ORCH_DEPLOY_FINALIZE_DELAY_S=90
ORCH_DEPLOY_FINALIZE_MAX_ATTEMPTS=10
ORCH_DEPLOY_SSH_USER=slin
ORCH_DEPLOY_SSH_HOST=
ORCH_DEPLOY_HOOK_SCRIPT=scripts/orchestrator-deploy-hook.sh
ORCH_DEPLOY_HOST_REPO_PATH=/home/slin/repos/orchestrator
ORCH_DEPLOY_PROD_SOURCE_IMAGE=orchestrator-orchestrator-staging
ORCH_DEPLOY_PROD_TARGET_SERVICE=orchestrator
ORCH_DEPLOY_PROD_TARGET_PORT=8500
ORCH_DEPLOY_PROD_TARGET_IMAGE=orchestrator-orchestrator
ORCH_DEPLOY_PROD_COMPOSE_PROFILE=
ORCH_DEPLOY_PROD_PREV_IMAGE_FILE=.deploy-prev-image-prod
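
The comments for `SELF_DEPLOY_ENABLED` and `SELF_DEPLOY_REPOS` describe a small decision rule: the kill-switch wins, an empty CSV means only the self-hosting repo, and any repo not listed is a no-op. A minimal sketch of that rule follows. The function name `self_deploy_applies`, the accepted truthy spellings, and the `self_repo` parameter are assumptions made for illustration; the real settings loader is not part of this diff.

```python
def self_deploy_applies(
    repo: str,
    env: dict[str, str],
    self_repo: str = "orchestrator",  # assumed name of the self-hosting repo
) -> bool:
    """Return True if Phase A/B/C should run for `repo` (hypothetical helper)."""
    # Global kill-switch: anything other than a truthy value means the legacy
    # synchronous deploy for every repo. The truthy spellings are an assumption.
    raw_enabled = env.get("ORCH_SELF_DEPLOY_ENABLED", "true").strip().lower()
    if raw_enabled not in {"1", "true", "yes"}:
        return False

    # CSV of repos; blank entries and surrounding whitespace are ignored.
    raw_repos = env.get("ORCH_SELF_DEPLOY_REPOS", "")
    repos = [r.strip() for r in raw_repos.split(",") if r.strip()]

    # Empty list: only the self-hosting repo takes the real self-deploy path.
    if not repos:
        return repo == self_repo
    return repo in repos
```

With the placeholder values from the block above (`ORCH_SELF_DEPLOY_ENABLED=true`, empty `ORCH_SELF_DEPLOY_REPOS`), this sketch returns `True` for `orchestrator` and `False` for any other repo.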
@@ -27,6 +27,7 @@
- Stage chain: `... testing → deploy-staging → deploy → done` (previously without `deploy-staging`).

### Fixed
- **Re-deploy after a rollback no longer hangs on `deploy`; `.env.example` extended** (ORCH-036, review-fix): the self-deploy sentinel markers (`approve-requested`/`initiated`/`result`) are keyed by the stable `work_item_id`, so after a FAILED deploy and the БАГ-8 rollback (`deploy → development`) they stayed on disk. Once the developer fixed the problem and the task re-entered `deploy`, Phase B's idempotency-guard saw a STALE `initiated` and became a no-op: the detached hook was not relaunched, the finalizer was not enqueued, and the task hung on `deploy` forever (violating the stage's retry contract, AC-4/AC-10; the stale `result` would also have been re-read by the new finalizer). Added `self_deploy.clear_state(repo, work_item_id)` (never-raise, idempotent, recursively removes `<repos_dir>/.deploy-state-<repo>/<wi>/`), called in the БАГ-8 rollback branch for `check_deploy_status` FAILED (`src/stage_engine.py`) and additionally at the start of Phase A (`_handle_self_deploy_phase_a`), so every new prod-deploy pass starts from a clean state. Separately: the canonical `.env.example` (CLAUDE.md rule #8, ТЗ §2.6) now carries the full block of new `ORCH_SELF_DEPLOY_*` / `ORCH_DEPLOY_*` descriptors (placeholders only, secrets are not committed), modelled on the ORCH-043 merge-gate block. The `STAGE_TRANSITIONS` / `QG_CHECKS` / `_parse_deploy_status` / БАГ-8 / merge-gate contracts are untouched. Tests: `tests/test_deploy_rollback.py::test_tc11_re_deploy_after_rollback_not_wedged`, `tests/test_deploy_hook_mapping.py::test_clear_state_removes_all_markers_and_is_idempotent`.
- **The container and agents run under the host uid (1000:1000), not root** (ORCH-040): both services in `docker-compose.yml` (`orchestrator`, `orchestrator-staging`) got `user: "1000:1000"` (slin). This removes the root cause of a problem where Claude-CLI agents launched via `subprocess.Popen` inside a root container created all pipeline artifacts (git worktree `/repos/_wt/...`, commits in `docs/work-items/...`) owned by `root:root` on the host, so `git pull`/`git reset` under slin failed with `insufficient permission for adding an object` and every deploy needed a manual `chown`. Files are now `slin:slin` from the start. Access to docker.sock is preserved via `group_add: ["999"]` (МИНА 1, NOT removed). The SSH mount is aligned with the agent's single HOME: target `/root/.ssh` → `/home/slin/.ssh` (`/home/slin/.orchestrator-ssh:/home/slin/.ssh:ro`), in sync with `HOME=/home/slin`, which the launcher forces into the Popen env and git_env; this removes a hidden mismatch between the SSH mount and the forced HOME. `src/agents/launcher.py` and `Dockerfile` were NOT changed (a numeric uid works without an `/etc/passwd` entry; `safe.directory '*'` already covers git over the bind-mount). Requires Owner host-prerequisites (P-1…P-4, outside the code): blocker P-1 is `chown -R 1000:1000 /home/slin/.claude` so that uid 1000 can read the claude creds (otherwise preflight rejects the pipeline); the prod restart of self happens only in a quiet window (the instance is shared with enduro-trails), with the staging gate as the safety net (adr-0003). ADR `docs/work-items/ORCH-040/06-adr/ADR-001-run-agents-as-host-uid.md`, global `docs/architecture/adr/adr-0005-container-runs-as-host-uid.md`; INFRA.md updated (runtime uid, volumes/SSH target, host-prerequisites). Tests: `tests/test_orch040_compose.py`.
- **Staging check B6 reads the registry from the environment of the running staging instance** (ORCH-048): block B6 "Registry: sandbox present, prod ET/ORCH absent" in `scripts/staging_check.py` produced a **false FAIL** (`prod-ET=YES(BAD!)`, `prod-ORCH=YES(BAD!)`) while isolation was in fact intact. It was the only check in the suite that did not reach the instance over HTTP; instead it imported `src.projects` locally via the host-path hack `sys.path.insert(0, "/repos/orchestrator")` + `importlib.reload`, building the registry from `ORCH_PROJECTS_JSON` in the **process env of the launching process**. When the deployer actually ran it from the host, the variable was unset → default `_DEFAULT_PROJECTS` (ET+ORCH) → false FAIL → a needless `deploy-staging → development` rollback. Fix (option "в", ADR-001): the host-path hack is removed; the suite canonically runs INSIDE the `orchestrator-staging` container via `docker exec … python3 /repos/orchestrator/scripts/staging_check.py` (`scripts/` is available only through the bind-mount, `import src.projects` resolves via `PYTHONPATH=/app` from the container's code, env comes from `.env.staging`) → B6 reads the registry of the running instance itself, with no HTTP bootstrap and no chicken-and-egg problem. The verdict logic is extracted into the pure `_evaluate_b6(known) -> (passed, detail)` (invariant `passed ⟺ SANDBOX ∈ known ∧ PROD_ET ∉ known ∧ PROD_ORCH ∉ known`, detail format preserved) plus `_known_project_ids_from_registry()` / `_run_b6()` with a deterministic FAIL when the source is unavailable (neither a false PASS nor an unhandled exception). Updated in sync: `.openclaw/agents/deployer.md` (stage command via `docker exec`) and `docs/operations/STAGING_CHECK.md`. `src/projects.py`, `.env*` and the other checks A/B4/B5/C are untouched; the `QG_CHECKS` registry and `check_staging_status` (ADR-0003) were not changed. ADR `docs/work-items/ORCH-048/06-adr/ADR-001-b6-registry-via-in-container-run.md`. Tests: `tests/test_staging_check_b6.py`.
- **The testing gate `check_tests_passed` reads `result:` on par with `verdict:`/`status:`** (ORCH-047): the parser `_parse_tests_verdict` (`src/qg/checks.py`) now accepts three equivalent machine-readable frontmatter fields of `13-test-report.md`: `result:` (the canon of the tester prompt `.openclaw/agents/tester.md`, `result: PASS|FAIL`), plus the legacy `verdict:` and `status:` (enduro-trails ET-001..ET-014); any single non-empty one is enough. This removes a contract mismatch: the tester honestly emitted `result: PASS` without `verdict:`/`status:`, the parser fell into the "no machine verdict" branch → `testing → development` rollback in a loop until `MAX_DEVELOPER_RETRIES` was exhausted (observed on ORCH-17; ORCH-016 passed only thanks to redundant duplication of the fields). Priority semantics are preserved and extended to all three fields via a combined string: a negative token in any field is authoritative (overrides a positive one), and the token sets are frozen (backward compatibility). The gate's signature, its name and the `QG_CHECKS` registry were not changed. ADR `docs/work-items/ORCH-047/06-adr/ADR-001-result-field-in-tests-gate.md`. Tests: `tests/test_qg.py::TestCheckTestsPassed`.
@@ -29,6 +29,7 @@ container (reads markers) and the host (writes ``result``):
import logging
import os
import shlex
import shutil
import subprocess

from .config import settings
@@ -160,6 +161,31 @@ def write_marker(repo: str, work_item_id: str | None, name: str, content: str =
        return False


def clear_state(repo: str, work_item_id: str | None) -> bool:
    """Remove ALL deploy-state sentinels for this work item (best-effort).

    Sentinels are keyed by ``work_item_id`` (stable for the whole task lifetime),
    so a FAILED prod-deploy leaves ``approve-requested`` / ``initiated`` / ``result``
    behind. Without cleanup, after the БАГ-8 rollback (deploy -> development) and a
    fix, the task reaching ``deploy`` again would hit Phase B's idempotency-guard:
    the STALE ``initiated`` makes it a no-op, the detached hook never re-launches and
    the task wedges on ``deploy`` forever (re-deploy-after-rollback contract broken;
    AC-4/AC-10). A stale ``result`` would likewise be mis-read by the new finalizer.
    Clearing the whole state dir restores a clean slate for the next pass. Idempotent
    (a missing dir is success). Never raises.
    """
    d = container_state_dir(repo, work_item_id)
    try:
        shutil.rmtree(d)
        logger.info("clear_state: removed deploy-state dir %s", d)
        return True
    except FileNotFoundError:
        return True
    except OSError as e:  # noqa: BLE001 - never-raise contract
        logger.warning("clear_state error for %s/%s: %s", repo, work_item_id, e)
        return False


def read_result(repo: str, work_item_id: str | None) -> tuple[bool, int | None]:
    """Read the ``result`` sentinel (hook exit-code written by the host wrapper).
@@ -622,6 +622,16 @@ def _handle_qg_failure_rollbacks(
        notify_stage_change(task_id, current_stage, "development")
        plane_notify_stage(work_item_id, current_stage, "development")
        result.rolled_back_to = "development"
        # ORCH-036: clear the deploy-state sentinels (approve-requested / initiated /
        # result) so the NEXT prod-deploy pass (after the developer fixes and the task
        # returns to `deploy`) is not wedged by Phase B's idempotency-guard reading a
        # STALE `initiated`, nor the finalizer mis-reading a STALE `result`. Markers are
        # keyed by work_item_id (stable across the rollback), so without this they
        # survive into the retry and break re-deploy-after-rollback (AC-4/AC-10).
        try:
            self_deploy.clear_state(repo, work_item_id)
        except Exception as e:  # noqa: BLE001 - defensive (clear_state never-raises anyway)
            logger.warning(f"Task {task_id}: deploy-state clear on deploy-fail failed: {e}")
        # ORCH-043: deploy failed -> no merge will complete; release the lease so the
        # next task isn't blocked until the lease ages out (holder-aware no-op).
        try:
@@ -821,6 +831,12 @@ def _handle_self_deploy_phase_a(

    if work_item_id:
        set_issue_in_review(work_item_id)
    # ORCH-036: belt-and-suspenders — wipe any STALE deploy-state markers before
    # arming a fresh approve. A prior FAILED pass clears on rollback, but clearing
    # here too guarantees the entry to every new prod-deploy pass starts clean
    # (e.g. after a crash/manual intervention), so `initiated`/`result` from an
    # earlier attempt can never leak into this one.
    self_deploy.clear_state(repo, work_item_id)
    self_deploy.write_marker(
        repo, work_item_id, self_deploy.APPROVE_REQUESTED, content=str(time.time())
    )
@@ -11,6 +11,7 @@ import os
os.environ.setdefault("ORCH_PLANE_API_TOKEN", "test-token")
os.environ.setdefault("ORCH_GITEA_TOKEN", "test-token")

from src import self_deploy  # noqa: E402
from src.self_deploy import map_exit_code_to_status, build_deploy_log  # noqa: E402


@@ -45,3 +46,21 @@ def test_deploy_log_frontmatter_carries_status():
    body_fail = build_deploy_log("ORCH-036", 2, "FAILED")
    assert "deploy_status: FAILED" in body_fail
    assert "hook_exit_code: 2" in body_fail


def test_clear_state_removes_all_markers_and_is_idempotent(monkeypatch, tmp_path):
    """clear_state wipes the whole work-item state dir (all sentinels) and treats a
    missing dir as success, so a re-deploy after rollback starts from a clean slate."""
    monkeypatch.setattr(self_deploy.settings, "repos_dir", str(tmp_path))
    repo, wi = "orchestrator", "ORCH-036"
    self_deploy.write_marker(repo, wi, self_deploy.APPROVE_REQUESTED, "t")
    self_deploy.write_marker(repo, wi, self_deploy.INITIATED, "t")
    self_deploy.write_marker(repo, wi, self_deploy.RESULT, "1")
    assert self_deploy.has_marker(repo, wi, self_deploy.INITIATED) is True

    assert self_deploy.clear_state(repo, wi) is True
    assert self_deploy.has_marker(repo, wi, self_deploy.APPROVE_REQUESTED) is False
    assert self_deploy.has_marker(repo, wi, self_deploy.INITIATED) is False
    assert self_deploy.has_marker(repo, wi, self_deploy.RESULT) is False
    # Idempotent: clearing an already-absent dir is still success (never raises).
    assert self_deploy.clear_state(repo, wi) is True
@@ -98,3 +98,44 @@ def test_tc10_failed_deploy_rolls_back_to_development(monkeypatch):
    assert stage_engine.set_issue_blocked.called
    assert stage_engine.send_telegram.called
    assert stage_engine.set_issue_done.called is False


def test_tc11_re_deploy_after_rollback_not_wedged(monkeypatch):
    """FAILED deploy -> rollback wipes stale markers so a later Phase B re-initiates.

    Regression for the re-deploy-after-rollback contract (AC-4/AC-10): markers are
    keyed by the (stable) work_item_id, so without cleanup the STALE `initiated` from
    the first failed attempt would make Phase B's idempotency-guard a no-op on the
    retry and wedge the task on `deploy` forever.
    """
    repo, wi, branch = "orchestrator", "ORCH-036", "feature/ORCH-036-x"
    # First (failed) pass left BOTH the idempotency-guard and the verdict behind.
    self_deploy.write_marker(repo, wi, self_deploy.INITIATED, "123")
    self_deploy.write_marker(repo, wi, self_deploy.RESULT, "1")
    monkeypatch.setattr(
        stage_engine, "QG_CHECKS",
        {**stage_engine.QG_CHECKS, "check_deploy_status": _fail("Deploy status: FAILED")},
    )
    task_id = _make_task("deploy")

    stage_engine.run_deploy_finalizer(
        {"task_id": task_id, "repo": repo, "id": 1, "agent": "deploy-finalizer"}
    )

    # Rollback fired AND the stale deploy-state sentinels were wiped.
    assert _stage(task_id) == "development"
    assert self_deploy.has_marker(repo, wi, self_deploy.INITIATED) is False
    assert self_deploy.has_marker(repo, wi, self_deploy.RESULT) is False
    assert self_deploy.read_result(repo, wi) == (False, None)

    # Second pass: the task reaches `deploy` again and the human re-approves. Phase B
    # must ACTUALLY initiate (no stale `initiated` -> not a no-op), proving the retry
    # is no longer wedged.
    init = MagicMock(return_value=(True, "ok"))
    monkeypatch.setattr(stage_engine.self_deploy, "initiate_deploy", init)
    result = stage_engine.AdvanceResult(from_stage="deploy")
    stage_engine._handle_self_deploy_phase_b(task_id, repo, wi, branch, result)

    assert init.called
    assert result.note == "self-deploy-initiated"
    assert self_deploy.has_marker(repo, wi, self_deploy.INITIATED) is True