fix(stage-engine): address ORCH-114 review — env/docs canon + in-region rollback CAS

Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership
lease + expected-stage CAS):

P1 — documentation = golden source:
- .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS
  (canon of 100% start keys, ORCH-101), next to the other gate kill-switches.
- CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags,
  ADR links) so a future agent editing advance_stage/reaper/webhooks finds the
  ownership invariant in the first mandatory-read doc (ORCH-078 traceability index).

P2 — should-fix:
- docs/overview/ (system showcase, ORCH-011): add transition_lease to
  tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and
  tech-architecture.md (components).
- ADR-001 D4 alignment: the four side-effectful-edge rollback handlers
  (_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate /
  _handle_image_freshness) now write `development` through the expected-stage CAS
  via a shared _rollback_stage_cas helper (defence against the rollback↔done
  contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the
  held lease the sole owner always wins; a lost race aborts WITHOUT side effects.
  Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1.
- Test isolation: make tests/test_webhooks.py order-independent by pinning the
  proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only
  passed by relying on import order. Drop the needless module-level ORCH_DB_PATH
  setdefault in test_orch114 (fresh_db already isolates db_path).

New regression tests (TC-11): in-region rollback writes route through CAS;
rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`;
kill-switch-off rollback degenerates to the unconditional write.

ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed.

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
2026-06-15 18:16:49 +03:00
committed by deployer
parent 4a6b32e61d
commit c4a97a7a28
9 changed files with 202 additions and 8 deletions

View File

@@ -1233,6 +1233,31 @@ def _merge_gate_infra_retry_impl(
)
def _rollback_stage_cas(task_id, current_stage, repo, result: AdvanceResult) -> bool:
"""ORCH-114 (ADR-001 D4): write a rollback stage (`development`) through the
expected-stage CAS — the same contract as the forward/bypass writes.
Returns True iff the write was applied (the caller proceeds with the rollback side
effects); False iff the CAS was lost (the caller MUST abort WITHOUT side effects).
These in-region rollback handlers run inside ``advance_stage`` under the held
transition-lease, so this is the sole owner and the CAS practically always wins. A
lost race means a concurrent winner already advanced this task (e.g. to ``done``) —
rolling back to ``development`` would be exactly the rollback↔done contradiction
BR-6 guards against, so we abort instead of a blind overwrite. Kill-switch off /
repo out of scope -> commit_stage_cas degenerates to the prior unconditional
``update_task_stage`` (always True) -> byte-for-byte (AC-9).
"""
if transition_lease.commit_stage_cas(task_id, current_stage, "development", repo):
return True
logger.info(
f"Task {task_id}: rollback stage-CAS lost on {current_stage}->development "
f"— aborting rollback without side effects (a concurrent winner advanced)"
)
result.note = "rollback-cas-lost"
return False
def _handle_merge_gate_rollback(
task_id, current_stage, repo, work_item_id, branch, reason, result: AdvanceResult
):
@@ -1243,7 +1268,8 @@ def _handle_merge_gate_rollback(
already released by check_branch_mergeable on failure; a defensive holder-aware
release here is a harmless no-op.
"""
update_task_stage(task_id, "development")
if not _rollback_stage_cas(task_id, current_stage, repo, result):
return
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
result.rolled_back_to = "development"
@@ -1320,7 +1346,8 @@ def _handle_security_gate(
result.qg_passed = False
result.qg_reason = reason
update_task_stage(task_id, "development")
if not _rollback_stage_cas(task_id, current_stage, repo, result):
return True
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
result.rolled_back_to = "development"
@@ -1408,7 +1435,8 @@ def _handle_coverage_gate(
result.qg_passed = False
result.qg_reason = reason
update_task_stage(task_id, "development")
if not _rollback_stage_cas(task_id, current_stage, repo, result):
return True
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
result.rolled_back_to = "development"
@@ -1488,7 +1516,8 @@ def _handle_image_freshness(
result.qg_passed = False
result.qg_reason = reason
update_task_stage(task_id, "development")
if not _rollback_stage_cas(task_id, current_stage, repo, result):
return True
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
result.rolled_back_to = "development"