feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021)

Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod").

- src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*).

Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
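The classify rule and the self-hosting invariant described in the message above can be sketched as follows. This is a minimal illustration, not the code in src/post_deploy.py, which this view does not show. The function signatures, the `Thresholds` defaults, the `ROLLBACK` and `NONE` action names, and the repo name "orchestrator" are all assumptions made for the example; only DEGRADED, HEALTHY and ALERT_ONLY come from the commit message. "Consecutive failures" is read here as the most recent unbroken run of failed health probes, which is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"


class Action(Enum):
    NONE = "NONE"              # hypothetical name
    ALERT_ONLY = "ALERT_ONLY"  # named in the commit message
    ROLLBACK = "ROLLBACK"      # hypothetical name


@dataclass(frozen=True)
class Thresholds:
    fail_threshold: int = 3            # hypothetical default
    ratio_5xx_threshold: float = 0.05  # hypothetical default


def classify(health_ok: Sequence[bool], total_requests: int, count_5xx: int,
             t: Thresholds = Thresholds()) -> Verdict:
    """Return DEGRADED iff either rule fires; any fault falls back to HEALTHY."""
    try:
        # Rule 1: length of the most recent run of failed health probes.
        streak = 0
        for ok in reversed(health_ok):
            if ok:
                break
            streak += 1
        if streak >= t.fail_threshold:
            return Verdict.DEGRADED

        # Rule 2: 5xx ratio over the window, strictly greater than the threshold.
        # With no traffic the ratio is undefined, so the rule does not fire.
        if total_requests > 0 and count_5xx / total_requests > t.ratio_5xx_threshold:
            return Verdict.DEGRADED
        return Verdict.HEALTHY
    except Exception:
        # Fail-safe: bad or missing signals must never trigger a reaction.
        return Verdict.HEALTHY


def decide_action(verdict: Verdict, repo: str,
                  self_hosting_repo: str = "orchestrator") -> Action:
    """Map a verdict to a reaction; the self-hosting repo is never rolled back."""
    if verdict is not Verdict.DEGRADED:
        return Action.NONE
    if repo == self_hosting_repo:
        return Action.ALERT_ONLY
    return Action.ROLLBACK
```

Under these assumptions, `classify([True, False, False, False], 100, 0)` yields DEGRADED through the first rule, while a 5xx ratio of exactly 0.05 stays HEALTHY because the comparison is strict.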
2026-06-07 14:16:12 +00:00
parent a4ad55c862
commit 8273c1fc9d
12 changed files with 1322 additions and 3 deletions
--- a/src/agents/launcher.py
+++ b/src/agents/launcher.py
@@ -249,6 +249,11 @@ class AgentLauncher:
        """
        if job.get("agent") == "deploy-finalizer":
            return self._run_deploy_finalizer_job(job)
+        # ORCH-021: the reserved-agent `post-deploy-monitor` is also a
+        # DETERMINISTIC (no-LLM) tick — intercept it BEFORE _spawn and run one
+        # observation tick synchronously. Returns None (no agent_run row).
+        if job.get("agent") == "post-deploy-monitor":
+            return self._run_post_deploy_monitor_job(job)
        return self._spawn(
            job["agent"],
            job["repo"],
@@ -278,6 +283,27 @@ class AgentLauncher:
                pass
        return None

+    def _run_post_deploy_monitor_job(self, job: dict):
+        """ORCH-021: run one deterministic post-deploy monitor tick for a job.
+
+        Not an LLM spawn: there is no subprocess or monitor, so we mark the jobs
+        row done/failed here. The tick is designed never to raise, but we guard
+        anyway so a monitor fault cannot wedge the worker or starve other projects (AC-16).
+        """
+        from ..db import mark_job
+        from .. import stage_engine
+        try:
+            stage_engine.run_post_deploy_monitor(job)
+            mark_job(job["id"], "done")
+            logger.info(f"post-deploy-monitor job {job['id']} done")
+        except Exception as e:
+            logger.error(f"post-deploy-monitor job {job['id']} failed: {e}")
+            try:
+                mark_job(job["id"], "failed", error=f"post-deploy-monitor error: {e}")
+            except Exception:
+                pass
+        return None
+
    def _spawn(self, agent: str, repo: str, task_content: str = None,
               task_id: int = None, job_id: int = None) -> int:
        """Shared spawn implementation for launch() and launch_job().