TC-17 seeded 17-security-report.md via get_worktree_path() which resolves to
settings.worktrees_dir (default /repos/_wt) -> the test wrote into the real shared
host worktree path. In CI that dir is owned by another user -> PermissionError.
Monkeypatch git_worktree.settings.worktrees_dir to tmp_path/_wt (same pattern as
test_git_worktree.py / test_merge_gate.py). Prod logic untouched.
Add a deterministic (no-LLM) security sub-gate on the deploy-staging -> deploy
edge, run FIRST (before merge-gate ORCH-043 and image-freshness ORCH-058) so it
fails cheaply before any expensive rebase/rebuild, and scans origin/main..HEAD
before rebase so a task is never blamed for a CVE introduced by an updated main.
Why: the autonomous pipeline merged branches into main with no check for a leaked
secret or a vulnerable dependency. For the self-hosting orchestrator (one shared
prod instance serving every project from a shared DB) a single leak/CVE landed in
the prod of all projects (CLAUDE.md self-hosting, section 8).
- New leaf src/security_gate.py (never-raise): gitleaks (offline, fail-closed on
tool error => secrets guarantee is unconditional) + pip-audit (best-effort;
unreachable CVE feed degrades fail-open + loud warning by default, strict via
security_dep_audit_fail_closed). Verdict lives ONLY in 17-security-report.md
YAML frontmatter (write -> read-back single source of truth); FAIL is
authoritative; missing/broken frontmatter => fail-closed.
- check_security_gate thin wrapper registered in QG_CHECKS (lazy import, no cycle).
- _handle_security_gate wired FIRST in advance_stage deploy-staging block: FAIL ->
rollback to development + developer-retry (cap MAX_DEVELOPER_RETRIES); task_desc
carries verbatim findings (ORCH-046 pattern). No merge-lease release (runs before
lease acquire). Self-hosting safe: only reads/scans/writes, never deploys.
- Conditional rollout (security_gate_enabled + security_gate_repos; empty scope ->
self-hosting only). 6 new ORCH_SECURITY_* settings.
- Infra: pinned gitleaks Go binary in Dockerfile (+curl/ca-certificates), pip-audit
in requirements.txt, versioned .gitleaks.toml at repo root.
- STAGE_TRANSITIONS and DB schema unchanged.
Docs: docs/architecture/README.md (marked realized), CLAUDE.md (artifact 17),
CHANGELOG.md. Tests: test_security_gate.py, test_qg_security.py,
test_stage_engine_security_gate.py + updated registry/edge snapshots.
Refs: ORCH-022
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Canonical staging_check.py run inside orchestrator-staging:
8/10 PASS, all REAL checks green, C9a/C9b infra-waived (ORCH-061),
exit 0 → advance.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes
agent_runs.exit_code FIRST, then does git push / PR / Plane comments before
_finalize_job, and the agent pid is already dead in that window — so the old
"exit_code recorded -> reap now" had no grace and could race a healthy job.
Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job)
BEFORE the atomic claim, so a reaper that lost the race had already enqueued
the next stage (dup advance / dup enqueue), violating ADR-001 Р-1.
Fix:
- Tier-2 grace: reap only once agent_runs.exit_code has been recorded for
>= reaper_finalize_grace_s (new setting, default 300s; > max finalization
window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New
finished_age_s column computed in get_running_jobs.
- claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the
reconciler pattern) to choose the terminal status, then atomically claim
'done' FIRST; only the claim winner runs the advance. A loser performs no
side effects -> no dup advance / dup enqueue.
Docs (golden source) updated in the same change: ADR-001, global adr-0011,
README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011
link). New tests cover the grace window, lost-claim no-side-effects, and the
already-advanced idempotent path.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The pr_already_merged guard was defined + unit-tested but consulted by zero
production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path
consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The
actual merge actor is the LLM deployer agent (it merges the feature PR at the
start of the `deploy` stage), so on a reaper re-drive of an already-merged PR
the deployer would blindly re-merge → Gitea error → false БАГ-8 rollback; AC-11
("no second merge") was not met deterministically.
Wire the guard at the real consultation point — the deployer prompt — so it
runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR
is already merged. check_branch_mergeable is left untouched (AC-13: check_*
behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a
deploy-stage re-drive where the second-merge risk lives).
- .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule.
- src/merge_gate.py: docstring names the deployer-prompt consultation point.
- docs/architecture/README.md, CHANGELOG.md: state the consultation point so
golden-source matches implementation.
- tests/test_merge_gate.py: regression test asserting the deployer prompt wires
the guard (so it can't silently become dead code again).
pytest tests/ -q: 743 passed.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes the "zombie jobs" incident class: job status was set only inside
the live launcher process, so a process death left jobs.status='running'
forever; at max_concurrency=1 one zombie blocked ALL projects' queue
(self-hosting risk). Adds a background daemon (src/job_reaper.py) with
three-tier liveness (dead-pid streak / known exit_code / max-running
backstop) whose only mutating write is an atomic terminal flip guarded by
WHERE status='running' (no double-process). For exit0 the canonical QG is
the source of truth via gate-driven advance, not "exit0".
Also proactively reclaims stale merge-lease (dead pid OR TTL) via file
delete only (no git ops), and makes merge finalization idempotent
(pr_already_merged guard + up-to-date short-circuit on re-drive).
New jobs.pid column via idempotent _ensure_column (no migration); pid
stamped in launcher._spawn after Popen. Reaper start/stop in lifespan;
"reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED,
ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S,
ORCH_LEASE_RECLAIM_ENABLED.
Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry,
check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit
codes. restart-safe, never-raise per unit of background work.
Docs: docs/architecture/README.md, CHANGELOG.md, .env.example.
Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py,
tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17),
tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed.
Refs: ORCH-065
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The ORCH-058 staging rebuild (check_staging_image_fresh) builds the image with
the task git-worktree as the docker build context. A fresh worktree holds only
tracked files, but the Dockerfile did `COPY data/ ./data/` — and `data/` (the
SQLite dir) is gitignored, so it is absent from that context: `docker build`
failed with exit 1 ("BUILD-STAGING: docker build failed - aborting"), bouncing
the task off deploy-staging back to development in a loop.
The COPY was dead weight regardless: `data/` is always supplied at runtime as a
bind-mount volume (./data:/app/data, see docker-compose.yml) which shadows
anything baked into the image. Replace it with `RUN mkdir -p /app/data` so the
mountpoint exists without depending on the build context.
Regression guard: test_tc08b_dockerfile_does_not_copy_gitignored_data_dir
forbids COPY of any gitignored path (the worktree-context invariant).
Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Extend pipeline responsibility past deploy->done: after the terminal
transition for an applicable repo, arm a ~15min observation window that
probes prod and reacts to a degradation the restart-time health-check
missed ("green deploy, red prod").
- src/post_deploy.py: new leaf module (config + lazy qg/db only).
Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/),
no DB migration. probe_signals/classify/decide_action/run_rollback,
all never-raise.
- Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of
deploy-finalizer): self-requeues each tick via enqueue_job.
- Deterministic classify: DEGRADED iff >= fail_threshold consecutive
health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY.
- Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod
orchestrator container -> orchestrator is ALWAYS ALERT_ONLY.
- Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty ->
self-hosting only.
- QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12).
- Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md),
architecture README, .env.example (ORCH_POST_DEPLOY_*).
Refs: ORCH-021
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Re-ran staging_check inside orchestrator-staging (exit 0); all REAL checks
green, C9a/C9b waived per ORCH-061.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Validated ORCH-061 infra-tolerance against live staging (8501): all REAL
checks pass, only sandbox-infra C9a/C9b fail and are waived → exit 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The self-hosting orchestrator looped on deploy-staging -> development because
scripts/staging_check.py exited 1 on ANY failed check, so two infra-only checks
(C9a sandbox branch / C9b analyst-job — caused by SANDBOX bot accounts not being
members of the sandbox Plane project, NOT a pipeline regress) forced
staging_status: FAILED -> rollback -> loop, burning developer retries and tokens.
Direction (б) per ADR-001: classify staging checks as REAL (all pipeline checks,
fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable). New leaf
module src/staging_verdict.py (stdlib-only, never-raise): classify_check +
compute_staging_verdict fold per-check results into a tolerant-but-fail-closed
verdict — any REAL failure -> FAILED/exit1 (safety net holds under any flag);
only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list; only infra &
strict -> FAILED/exit1; any internal error -> FAILED/exit1 (never a false green).
staging_check.py now auto-classifies each check (public 3-tuple _items shape kept
as an ORCH-048 b6 regression guard), exposes categorized_items(), prints
INFRA-WAIVED/VERDICT lines, and exits via the verdict; new --strict flag forces
legacy strictness per-run. Kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED
(default true) restores legacy strict mode globally. launcher gains
action_stage_no_changes_note so "no changes to commit" on action stages is logged
as expected, not treated as under-delivery.
Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status:/
deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB
migration. Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG.
Refs: ORCH-061
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>