A DB stage=done task with 0 active jobs flapped in Plane between `Awaiting
Deploy` and `Monitoring after Deploy` instead of holding `Done` (verified live
on ORCH-061, task 47): the three deploy-phase setters were terminal-blind, so
any stale/duplicate/unknown caller under the bot token re-stamped an
intermediate status over the terminal Done, forever.
- New leaf src/deploy_status_guard.py (pure, never-raise, config-gated): decide()
-> ALLOW | CONVERGE_DONE | SUPPRESS on the entry of set_issue_awaiting_deploy /
set_issue_deploying / set_issue_monitoring. A deploy-phase status is legitimate
iff the task is non-terminal OR (done AND post-deploy window active); otherwise
done converges to Done idempotently, cancelled is suppressed (FR-2, D1/D2).
- D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in advance_stage
so window_active is True when the legitimate first Monitoring is set (the task
is already DB-done by then); a re-drive after the window closes converges to Done.
- D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when the
task became cancelled mid-window (zombie-tick guard, FR-3).
- D5: additive `reason` kwarg on the three setters + one structured log line per
verdict (work_item/caller/target/db_stage/window_active/verdict); new read-only
db.get_task_by_work_item_id; post_deploy.window_active helper.
- Flags deploy_status_guard_enabled (kill-switch -> 1:1) / deploy_status_guard_repos
(CSV; empty = self-hosting only). STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched (reads existing tasks.stage).
Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the
reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve.
Full regress green (1413). Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md
(status -> реализовано), .env.example.
Refs: ORCH-094
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend pipeline responsibility past deploy->done: after the terminal
transition for an applicable repo, arm a ~15min observation window that
probes prod and reacts to a degradation the restart-time health-check
missed ("green deploy, red prod").
- src/post_deploy.py: new leaf module (config + lazy qg/db only).
Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/),
no DB migration. probe_signals/classify/decide_action/run_rollback,
all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of
deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive
health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod
orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty ->
self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md),
architecture README, .env.example (ORCH_POST_DEPLOY_*).
Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>