feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021)

Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod"). - src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise. - Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job. - Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY. - Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY. - Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only. - QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12). - Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*). Refs: ORCH-021 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-07 14:16:12 +00:00
parent 2bdba532d5
commit 2f4c553fd8
12 changed files with 1322 additions and 3 deletions
--- a/.env.example
+++ b/.env.example
@@ -116,3 +116,27 @@ ORCH_RECONCILE_GRACE_DEFAULT_S=600
 ORCH_RECONCILE_GRACE_OVERRIDES_JSON=
 ORCH_RECONCILE_NOTIFY_UNBLOCK=true
 ORCH_RECONCILE_SKIP_BLOCKED_ENABLED=true
+
+# ORCH-021: post-deploy production monitoring + degradation reaction. After the
+# terminal deploy->done transition for an applicable repo, a reserved-agent job
+# `post-deploy-monitor` (no LLM, modelled on deploy-finalizer) probes prod over a
+# window and reacts to a degradation the restart-time health-check missed (class
+# "green deploy, red prod", precedent ET-8). State is in sentinel files
+# (.post-deploy-state-<repo>/<wi>/), no DB migration.
+#   MONITOR_ENABLED  -> global kill-switch; false -> pipeline is 1:1 as before ORCH-021.
+#   REPOS            -> CSV of repos where monitoring is REAL; empty -> only self-hosting.
+#   WINDOW_S         -> observation window length (~15 min).
+#   INTERVAL_S       -> seconds between probe ticks.
+#   FAIL_THRESHOLD   -> N CONSECUTIVE health failures -> DEGRADED.
+#   5XX_THRESHOLD    -> window 5xx ratio above this -> DEGRADED.
+#   AUTO_ROLLBACK    -> allow auto-rollback; acts ONLY for non-self repos. Self-hosting
+#                       is ALWAYS ALERT_ONLY (a tick NEVER restarts the prod container).
+#   BASE_URL         -> base URL of the observed prod instance.
+ORCH_POST_DEPLOY_MONITOR_ENABLED=true
+ORCH_POST_DEPLOY_REPOS=
+ORCH_POST_DEPLOY_WINDOW_S=900
+ORCH_POST_DEPLOY_INTERVAL_S=30
+ORCH_POST_DEPLOY_FAIL_THRESHOLD=3
+ORCH_POST_DEPLOY_5XX_THRESHOLD=0.5
+ORCH_POST_DEPLOY_AUTO_ROLLBACK=false
+ORCH_POST_DEPLOY_BASE_URL=http://localhost:8500