fix(stage-engine): address ORCH-114 review — env/docs canon + in-region rollback CAS
Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership lease + expected-stage CAS): P1 — documentation = golden source: - .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS (canon of 100% start keys, ORCH-101), next to the other gate kill-switches. - CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags, ADR links) so a future agent editing advance_stage/reaper/webhooks finds the ownership invariant in the first mandatory-read doc (ORCH-078 traceability index). P2 — should-fix: - docs/overview/ (system showcase, ORCH-011): add transition_lease to tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and tech-architecture.md (components). - ADR-001 D4 alignment: the four side-effectful-edge rollback handlers (_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate / _handle_image_freshness) now write `development` through the expected-stage CAS via a shared _rollback_stage_cas helper (defence against the rollback↔done contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the held lease the sole owner always wins; a lost race aborts WITHOUT side effects. Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1. - Test isolation: make tests/test_webhooks.py order-independent by pinning the proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only passed by relying on import order. Drop the needless module-level ORCH_DB_PATH setdefault in test_orch114 (fresh_db already isolates db_path). New regression tests (TC-11): in-region rollback writes route through CAS; rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`; kill-switch-off rollback degenerates to the unconditional write. ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed. Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -1233,6 +1233,31 @@ def _merge_gate_infra_retry_impl(
|
||||
)
|
||||
|
||||
|
||||
def _rollback_stage_cas(task_id, current_stage, repo, result: AdvanceResult) -> bool:
|
||||
"""ORCH-114 (ADR-001 D4): write a rollback stage (`development`) through the
|
||||
expected-stage CAS — the same contract as the forward/bypass writes.
|
||||
|
||||
Returns True iff the write was applied (the caller proceeds with the rollback side
|
||||
effects); False iff the CAS was lost (the caller MUST abort WITHOUT side effects).
|
||||
|
||||
These in-region rollback handlers run inside ``advance_stage`` under the held
|
||||
transition-lease, so this is the sole owner and the CAS practically always wins. A
|
||||
lost race means a concurrent winner already advanced this task (e.g. to ``done``) —
|
||||
rolling back to ``development`` would be exactly the rollback↔done contradiction
|
||||
BR-6 guards against, so we abort instead of a blind overwrite. Kill-switch off /
|
||||
repo out of scope -> commit_stage_cas degenerates to the prior unconditional
|
||||
``update_task_stage`` (always True) -> byte-for-byte (AC-9).
|
||||
"""
|
||||
if transition_lease.commit_stage_cas(task_id, current_stage, "development", repo):
|
||||
return True
|
||||
logger.info(
|
||||
f"Task {task_id}: rollback stage-CAS lost on {current_stage}->development "
|
||||
f"— aborting rollback without side effects (a concurrent winner advanced)"
|
||||
)
|
||||
result.note = "rollback-cas-lost"
|
||||
return False
|
||||
|
||||
|
||||
def _handle_merge_gate_rollback(
|
||||
task_id, current_stage, repo, work_item_id, branch, reason, result: AdvanceResult
|
||||
):
|
||||
@@ -1243,7 +1268,8 @@ def _handle_merge_gate_rollback(
|
||||
already released by check_branch_mergeable on failure; a defensive holder-aware
|
||||
release here is a harmless no-op.
|
||||
"""
|
||||
update_task_stage(task_id, "development")
|
||||
if not _rollback_stage_cas(task_id, current_stage, repo, result):
|
||||
return
|
||||
notify_stage_change(task_id, current_stage, "development")
|
||||
plane_notify_stage(work_item_id, current_stage, "development")
|
||||
result.rolled_back_to = "development"
|
||||
@@ -1320,7 +1346,8 @@ def _handle_security_gate(
|
||||
result.qg_passed = False
|
||||
result.qg_reason = reason
|
||||
|
||||
update_task_stage(task_id, "development")
|
||||
if not _rollback_stage_cas(task_id, current_stage, repo, result):
|
||||
return True
|
||||
notify_stage_change(task_id, current_stage, "development")
|
||||
plane_notify_stage(work_item_id, current_stage, "development")
|
||||
result.rolled_back_to = "development"
|
||||
@@ -1408,7 +1435,8 @@ def _handle_coverage_gate(
|
||||
result.qg_passed = False
|
||||
result.qg_reason = reason
|
||||
|
||||
update_task_stage(task_id, "development")
|
||||
if not _rollback_stage_cas(task_id, current_stage, repo, result):
|
||||
return True
|
||||
notify_stage_change(task_id, current_stage, "development")
|
||||
plane_notify_stage(work_item_id, current_stage, "development")
|
||||
result.rolled_back_to = "development"
|
||||
@@ -1488,7 +1516,8 @@ def _handle_image_freshness(
|
||||
result.qg_passed = False
|
||||
result.qg_reason = reason
|
||||
|
||||
update_task_stage(task_id, "development")
|
||||
if not _rollback_stage_cas(task_id, current_stage, repo, result):
|
||||
return True
|
||||
notify_stage_change(task_id, current_stage, "development")
|
||||
plane_notify_stage(work_item_id, current_stage, "development")
|
||||
result.rolled_back_to = "development"
|
||||
|
||||
Reference in New Issue
Block a user