feat(post-deploy): post-deploy prod monitoring + auto-rollback (ORCH-021) #65
Summary
ORCH-021: post-deploy prod monitoring + degradation reaction. Extends pipeline responsibility past deploy → done: for an applicable repo, after the terminal transition, arm a ~15 min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod", precedent ET-8).

- `src/post_deploy.py`: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (`.post-deploy-state-<repo>/<wi>/`), no DB migration. `probe_signals`/`classify`/`decide_action`/`run_rollback`, every public helper never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of `deploy-finalizer`), intercepted in `launcher.launch_job` before `_spawn`; self-requeues each tick via `enqueue_job(available_at_delay_s=interval)`.
- Deterministic classify: DEGRADED iff >= `fail_threshold` consecutive health failures OR window 5xx ratio > `5xx_threshold`; fail-safe → HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod `orchestrator` container → `orchestrator` is ALWAYS `ALERT_ONLY`.
- Docs: CHANGELOG, CLAUDE artefact list (`16-post-deploy-log.md`), architecture README, `.env.example` (`ORCH_POST_DEPLOY_*`).

Test plan

- `tests/test_post_deploy.py`: TC-01..TC-15 (unit: probe/classify/decide/applies/idempotent arm)
- `tests/test_post_deploy_integration.py`: TC-16..TC-20 (arm at deploy→done, tick lifecycle, requeue, rollback paths)
- `tests/test_deploy_terminal_sync.py::test_tc17`: ORCH-036 terminal-sync contract preserved

Refs: ORCH-021
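For reviewers, the classify rule and the self-hosting invariant from the summary can be read as the sketch below. It is illustrative only, not the code in `src/post_deploy.py`: the `Sample` shape, the `Verdict`/`Action` enums, the `ROLLBACK` and `NONE` members, and the parameter name `ratio_5xx_threshold` (standing in for `5xx_threshold`, which is not a valid Python identifier) are assumptions. It also assumes "consecutive" means the run of failures ending at the newest sample, and that any non-orchestrator repo is eligible for rollback.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"


class Action(Enum):
    NONE = "NONE"              # hypothetical name
    ALERT_ONLY = "ALERT_ONLY"
    ROLLBACK = "ROLLBACK"      # hypothetical name


@dataclass(frozen=True)
class Sample:
    """One probe tick (hypothetical shape)."""
    health_ok: bool
    requests: int = 0
    errors_5xx: int = 0


def classify(samples: Sequence[Sample], fail_threshold: int,
             ratio_5xx_threshold: float) -> Verdict:
    """DEGRADED iff >= fail_threshold trailing failures OR 5xx ratio > threshold."""
    try:
        # Length of the run of failed health checks ending at the newest sample.
        streak = 0
        for sample in reversed(samples):
            if sample.health_ok:
                break
            streak += 1
        total = sum(s.requests for s in samples)
        errors = sum(s.errors_5xx for s in samples)
        ratio = errors / total if total > 0 else 0.0
        if streak >= fail_threshold or ratio > ratio_5xx_threshold:
            return Verdict.DEGRADED
        return Verdict.HEALTHY
    except Exception:
        # Never-raise + fail-safe: bad input must not trigger a reaction.
        return Verdict.HEALTHY


def decide_action(verdict: Verdict, repo: str,
                  self_hosting_repo: str = "orchestrator") -> Action:
    """Map a verdict to a reaction; the self-hosting repo is never rolled back."""
    if verdict is not Verdict.DEGRADED:
        return Action.NONE
    if repo == self_hosting_repo:
        return Action.ALERT_ONLY
    return Action.ROLLBACK
```

Wrapping the whole body of `classify` in one `try` keeps the two properties from the summary together: the helper never raises, and any unexpected input resolves to HEALTHY rather than to a rollback.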
Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod").

- src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*).

Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The ORCH-058 staging rebuild (check_staging_image_fresh) builds the image with the task git-worktree as the docker build context. A fresh worktree holds only tracked files, but the Dockerfile did `COPY data/ ./data/`, and `data/` (the SQLite dir) is gitignored, so it is absent from that context: `docker build` failed with exit 1 ("BUILD-STAGING: docker build failed - aborting"), bouncing the task off deploy-staging back to development in a loop.

The COPY was dead weight regardless: `data/` is always supplied at runtime as a bind-mount volume (./data:/app/data, see docker-compose.yml) which shadows anything baked into the image. Replace it with `RUN mkdir -p /app/data` so the mountpoint exists without depending on the build context.

Regression guard: test_tc08b_dockerfile_does_not_copy_gitignored_data_dir forbids COPY of any gitignored path (the worktree-context invariant).

Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

f92b34e9d7 to 1c89ac9df9
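The worktree-context invariant behind the regression guard can be illustrated with a small check like the one below. It is a sketch under stated assumptions, not the body of `test_tc08b_dockerfile_does_not_copy_gitignored_data_dir`: the helper names `copy_sources` and `gitignored_copies` are hypothetical, the ignored directories are passed in as a list rather than read from `.gitignore`, and the parser handles only single-line, shell-form `COPY` instructions.

```python
import posixpath
from typing import Iterable, List


def copy_sources(dockerfile_text: str) -> List[str]:
    """Return build-context source paths of single-line, shell-form COPY lines."""
    sources: List[str] = []
    for line in dockerfile_text.splitlines():
        parts = line.strip().split()
        if len(parts) < 3 or parts[0].upper() != "COPY":
            continue
        flags = [p for p in parts[1:] if p.startswith("--")]
        if any(f.startswith("--from=") for f in flags):
            # Copies from another stage or image, not from the build context.
            continue
        args = [p for p in parts[1:] if not p.startswith("--")]
        # The last argument is the destination; everything before it is a source.
        sources.extend(args[:-1])
    return sources


def gitignored_copies(dockerfile_text: str,
                      ignored_dirs: Iterable[str]) -> List[str]:
    """List COPY sources that are, or live under, one of the ignored directories."""
    ignored = [posixpath.normpath(d) for d in ignored_dirs]
    offenders: List[str] = []
    for src in copy_sources(dockerfile_text):
        norm = posixpath.normpath(src)
        if any(norm == d or norm.startswith(d + "/") for d in ignored):
            offenders.append(src)
    return offenders
```

With `ignored_dirs=["data/"]`, a Dockerfile containing `COPY data/ ./data/` is reported, while one that uses `RUN mkdir -p /app/data` is not. A stricter guard could ask git itself whether each source is ignored (for example via `git check-ignore`) instead of relying on a hand-maintained list.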