feat(post-deploy): post-deploy prod monitoring + auto-rollback (ORCH-021) #65

Merged
admin merged 9 commits from feature/ORCH-021-post-deploy-rollback into main 2026-06-07 17:42:27 +03:00

Summary

ORCH-021 — Post-deploy prod monitoring + degradation reaction. Extends pipeline responsibility past deploy → done: for an applicable repo, after the terminal transition, arm a ~15 min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod", precedent ET-8).

  • src/post_deploy.py — new leaf module (imports only config + lazy qg/db). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals / classify / decide_action / run_rollback, every public helper never-raise.
  • Reserved-agent job post-deploy-monitor (no-LLM, Variant B, calque of deploy-finalizer) — intercepted in launcher.launch_job before _spawn; self-requeues each tick via enqueue_job(available_at_delay_s=interval).
  • Deterministic classify (BR-3): DEGRADED iff ≥ fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe → HEALTHY.
  • Self-hosting invariant (BR-5 / AC-8): a monitor tick NEVER restarts the prod orchestrator container → orchestrator is ALWAYS ALERT_ONLY.
  • Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos; empty → self-hosting only.
  • QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
  • Docs (golden source) updated in-PR: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*).
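The BR-3 rule above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/post_deploy.py: the `Verdict` enum, the parameter names, and the default thresholds (3 failures, 5% ratio) are hypothetical, and "consecutive" is read here as the longest failure run inside the window.

```python
from enum import Enum
from typing import Sequence


class Verdict(Enum):
    HEALTHY = "HEALTHY"
    DEGRADED = "DEGRADED"


def classify(
    health_ok: Sequence[bool],      # one entry per probe tick, True = health check passed
    total_requests: int,            # requests seen in the observation window
    errors_5xx: int,                # 5xx responses seen in the same window
    fail_threshold: int = 3,        # hypothetical default
    ratio_threshold: float = 0.05,  # hypothetical default
) -> Verdict:
    """Deterministic verdict; any internal error falls back to HEALTHY."""
    try:
        longest = streak = 0
        for ok in health_ok:
            # Reset the streak on a passing probe, extend it on a failing one.
            streak = 0 if ok else streak + 1
            longest = max(longest, streak)
        # An empty window carries no 5xx evidence, so treat the ratio as 0.
        ratio = errors_5xx / total_requests if total_requests > 0 else 0.0
        if longest >= fail_threshold or ratio > ratio_threshold:
            return Verdict.DEGRADED
        return Verdict.HEALTHY
    except Exception:
        # Fail-safe: a broken signal must never trigger a reaction by itself.
        return Verdict.HEALTHY
```

Note the asymmetry in the comparisons: the failure count uses `>=` while the 5xx ratio uses a strict `>`, matching the wording of the rule.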

Test plan

  • tests/test_post_deploy.py — TC-01..TC-15 (unit: probe/classify/decide/applies/idempotent arm)
  • tests/test_post_deploy_integration.py — TC-16..TC-20 (arm at deploy→done, tick lifecycle, requeue, rollback paths)
  • tests/test_deploy_terminal_sync.py::test_tc17 — ORCH-036 terminal-sync contract preserved
  • Full suite green: 700 passed
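The "idempotent arm" case in the unit tests can be pictured with a sentinel directory. This is a minimal sketch that assumes the sentinel lives under a caller-supplied root and that directory creation is the arming primitive; the function name `arm`, its signature, and its return convention are hypothetical.

```python
from pathlib import Path


def arm(state_root: Path, repo: str, work_item: str) -> bool:
    """Create the sentinel for (repo, work_item); return True only if newly armed."""
    sentinel = state_root / f".post-deploy-state-{repo}" / work_item
    try:
        sentinel.parent.mkdir(parents=True, exist_ok=True)
        # mkdir without exist_ok fails if the directory is already there,
        # so a second arm for the same work item is a no-op.
        sentinel.mkdir()
        return True
    except FileExistsError:
        return False
    except OSError:
        # Never-raise: an unwritable state dir is reported as "not armed".
        return False
```

Because the state is on disk rather than in memory or in the DB, a restarted process sees the same sentinel and does not arm a second window.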

Refs: ORCH-021

admin added 9 commits 2026-06-07 17:40:07 +03:00
Extend pipeline responsibility past deploy->done: after the terminal
transition for an applicable repo, arm a ~15min observation window that
probes prod and reacts to a degradation the restart-time health-check
missed ("green deploy, red prod").

- src/post_deploy.py: new leaf module (config + lazy qg/db only).
  Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/),
  no DB migration. probe_signals/classify/decide_action/run_rollback,
  all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of
  deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive
  health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod
  orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty ->
  self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md),
  architecture README, .env.example (ORCH_POST_DEPLOY_*).
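One possible reading of the self-hosting and conditionality bullets above, as a sketch only. Assumptions: the kill-switch is a boolean, the repo list is a comma-separated string, an empty list means "monitor only the self-hosting repo", and the action names `NONE` and `ROLLBACK` are hypothetical (only `ALERT_ONLY` appears in this PR).

```python
SELF_REPO = "orchestrator"  # the self-hosting repo named in BR-5 / AC-8


def parse_repos(csv_value: str) -> set:
    """Split a CSV setting into a set of repo names, ignoring blanks."""
    return {part.strip() for part in csv_value.split(",") if part.strip()}


def applies(repo: str, enabled: bool, repos_csv: str) -> bool:
    """Decide whether a post-deploy window should be armed for this repo."""
    if not enabled:
        return False  # kill-switch off: nothing is monitored
    allowed = parse_repos(repos_csv)
    if not allowed:
        return repo == SELF_REPO  # empty list: self-hosting only (assumed reading)
    return repo in allowed


def decide_action(repo: str, degraded: bool) -> str:
    """Map a verdict to an action without ever restarting the orchestrator."""
    if not degraded:
        return "NONE"
    if repo == SELF_REPO:
        return "ALERT_ONLY"  # invariant: a tick never restarts its own container
    return "ROLLBACK"
```

The invariant is enforced in the decision function rather than in the rollback step, so no code path that reaches a rollback can carry the orchestrator as its target.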

Refs: ORCH-021

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The ORCH-058 staging rebuild (check_staging_image_fresh) builds the image with
the task git-worktree as the docker build context. A fresh worktree holds only
tracked files, but the Dockerfile did `COPY data/ ./data/` — and `data/` (the
SQLite dir) is gitignored, so it is absent from that context: `docker build`
failed with exit 1 ("BUILD-STAGING: docker build failed - aborting"), bouncing
the task off deploy-staging back to development in a loop.

The COPY was dead weight regardless: `data/` is always supplied at runtime as a
bind-mount volume (./data:/app/data, see docker-compose.yml) which shadows
anything baked into the image. Replace it with `RUN mkdir -p /app/data` so the
mountpoint exists without depending on the build context.

Regression guard: test_tc08b_dockerfile_does_not_copy_gitignored_data_dir
forbids COPY of any gitignored path (the worktree-context invariant).
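
A guard of this kind could look roughly like the sketch below. It is not the body of test_tc08b: the helper names are hypothetical, the ignore matcher is a deliberately simplified stand-in for real gitignore semantics (a real check would more likely shell out to `git check-ignore`), and only the shell form of COPY is parsed.

```python
import fnmatch


def copy_sources(dockerfile_text: str) -> list:
    """Return the source paths of every COPY that reads from the build context."""
    sources = []
    for raw in dockerfile_text.splitlines():
        parts = raw.strip().split()
        if not parts or parts[0].upper() != "COPY":
            continue
        if any(p.startswith("--from=") for p in parts[1:]):
            continue  # copies from another build stage, not from the context
        args = [p for p in parts[1:] if not p.startswith("--")]
        sources.extend(args[:-1])  # the last argument is the destination
    return sources


def _normalize(path: str) -> str:
    while path.startswith("./"):
        path = path[2:]
    return path.strip("/")


def is_ignored(path: str, ignore_patterns: list) -> bool:
    """Simplified gitignore check: exact, directory-prefix, or glob match."""
    norm = _normalize(path)
    for pattern in ignore_patterns:
        pat = _normalize(pattern)
        if norm == pat or norm.startswith(pat + "/") or fnmatch.fnmatch(norm, pat):
            return True
    return False


def offending_copies(dockerfile_text: str, ignore_patterns: list) -> list:
    """List COPY sources that would be missing from a fresh worktree."""
    return [s for s in copy_sources(dockerfile_text) if is_ignored(s, ignore_patterns)]
```

With the old Dockerfile line `COPY data/ ./data/` and `data/` in the ignore list, `offending_copies` reports `data/`; after the switch to `RUN mkdir -p /app/data` it reports nothing.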

Refs: ORCH-021

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=313
All checks were successful
CI / test (push) Successful in 19s
CI / test (pull_request) Successful in 17s
1c89ac9df9
admin force-pushed feature/ORCH-021-post-deploy-rollback from f92b34e9d7 to 1c89ac9df9 2026-06-07 17:40:07 +03:00
admin merged commit f85e449d80 into main 2026-06-07 17:42:27 +03:00

Reference: admin/orchestrator#65