orchestrator

Author	SHA1	Message	Date
Slava	2d20da295e	docs: init ORCH-022 business request	2026-06-07 18:04:50 +00:00
claude-bot	67e98b8296	docs(ORCH-022): staging gate log — staging_status SUCCESS Canonical staging_check.py run inside orchestrator-staging: 8/10 PASS, all REAL checks green, C9a/C9b infra-waived (ORCH-061), exit 0 → advance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 18:04:35 +00:00
stream	cad5e98892	docs(history): lessons 2026-06-07 — autonomy closure (5 tasks: ORCH-58/60/61/21/65 to prod)	2026-06-07 19:24:49 +03:00
claude-bot	930e65298c	tester(ET): auto-commit from tester run_id=324	2026-06-07 16:14:45 +00:00
claude-bot	cba67a4270	reviewer(ET): auto-commit from reviewer run_id=323	2026-06-07 16:14:45 +00:00
claude-bot	720c31393a	fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance) Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	9b7c855df3	reviewer(ET): auto-commit from reviewer run_id=321	2026-06-07 16:14:45 +00:00
claude-bot	a6b444c356	fix(merge): wire pr_already_merged guard into deployer merge path (idempotent re-merge) The pr_already_merged guard was defined + unit-tested but consulted by zero production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The actual merge actor is the LLM deployer agent (it merges the feature PR at the start of the `deploy` stage), so on a reaper re-drive of an already-merged PR the deployer would blindly re-merge → Gitea error → false BUG-8 rollback; AC-11 ("no second merge") was not met deterministically. Wire the guard at the real consultation point — the deployer prompt — so it runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR is already merged. check_branch_mergeable is left untouched (AC-13: check_* behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a deploy-stage re-drive where the second-merge risk lives). - .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule. - src/merge_gate.py: docstring names the deployer-prompt consultation point. - docs/architecture/README.md, CHANGELOG.md: state the consultation point so golden-source matches implementation. - tests/test_merge_gate.py: regression test asserting the deployer prompt wires the guard (so it can't silently become dead code again). pytest tests/ -q: 743 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	dbf14e3d5a	reviewer(ET): auto-commit from reviewer run_id=319	2026-06-07 16:14:45 +00:00
claude-bot	4bebb921ff	feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization Closes the "zombie jobs" incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 one zombie blocked ALL projects' queue (self-hosting risk). Adds a background daemon (src/job_reaper.py) with three-tier liveness (dead-pid streak / known exit_code / max-running backstop) whose only mutating write is an atomic terminal flip guarded by WHERE status='running' (no double-process). For exit0 the canonical QG is the source of truth via gate-driven advance, not "exit0". Also proactively reclaims stale merge-lease (dead pid OR TTL) via file delete only (no git ops), and makes merge finalization idempotent (pr_already_merged guard + up-to-date short-circuit on re-drive). New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; "reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED. Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. restart-safe, never-raise per unit of background work. Docs: docs/architecture/README.md, CHANGELOG.md, .env.example. Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py, tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17), tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	9f846b5a50	architect(ET): auto-commit from architect run_id=317	2026-06-07 16:14:45 +00:00
claude-bot	b760b24a48	analyst(ET): auto-commit from analyst run_id=316	2026-06-07 16:14:45 +00:00
Slava	f0ac9d5562	docs: init ORCH-065 business request	2026-06-07 16:14:45 +00:00
claude-bot	987ea810bf	docs(ORCH-065): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:22 +00:00
claude-bot	1c89ac9df9	tester(ET): auto-commit from tester run_id=313	2026-06-07 14:40:06 +00:00
claude-bot	03d899812c	reviewer(ET): auto-commit from reviewer run_id=312	2026-06-07 14:40:06 +00:00
claude-bot	b04fae748e	tester(ET): auto-commit from tester run_id=309	2026-06-07 14:40:06 +00:00
claude-bot	fbfcd84b16	reviewer(ET): auto-commit from reviewer run_id=308	2026-06-07 14:40:06 +00:00
claude-bot	2f4c553fd8	feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021) Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod"). - src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise. - Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job. - Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY. - Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY. - Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only. - QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12). - Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*). Refs: ORCH-021 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:40:06 +00:00
claude-bot	2bdba532d5	architect(ET): auto-commit from architect run_id=306	2026-06-07 14:40:06 +00:00
claude-bot	db83b89467	analyst(ET): auto-commit from analyst run_id=305	2026-06-07 14:40:06 +00:00
Slava	961c5e9eee	docs: init ORCH-021 business request	2026-06-07 14:40:06 +00:00
claude-bot	84a6f61ba8	docs(ORCH-021): staging gate SUCCESS — refresh 15-staging-log timestamp Re-ran staging_check inside orchestrator-staging (exit 0); all REAL checks green, C9a/C9b waived per ORCH-061. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:39:48 +00:00
claude-bot	1af356a343	docs(ORCH-021): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:25:00 +00:00
Slava	e18947d2d9	Merge pull request 'fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict (ORCH-061)' (#62) from feature/ORCH-061-bug-deploy-staging-development into main	2026-06-07 16:30:07 +03:00
claude-bot	bf6a0c095a	docs(ORCH-061): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Validated ORCH-061 infra-tolerance against live staging (8501): all REAL checks pass, only sandbox-infra C9a/C9b fail and are waived → exit 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 13:29:33 +00:00
claude-bot	39769bdf23	tester(ET): auto-commit from tester run_id=300	2026-06-07 13:21:17 +00:00
claude-bot	de47737f4f	reviewer(ET): auto-commit from reviewer run_id=299	2026-06-07 13:18:47 +00:00
stream	e3f7c1c272	ci: re-trigger after gitea restart (ORCH-061)	2026-06-07 13:14:14 +00:00
stream	32a7aa8c6b	ci: trigger re-run after host disk cleanup (ORCH-061)	2026-06-07 13:08:38 +00:00
claude-bot	9070489968	fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict The self-hosting orchestrator looped on deploy-staging -> development because scripts/staging_check.py exited 1 on ANY failed check, so two infra-only checks (C9a sandbox branch / C9b analyst-job — caused by SANDBOX bot accounts not being members of the sandbox Plane project, NOT a pipeline regression) forced staging_status: FAILED -> rollback -> loop, burning developer retries and tokens. Direction (b) per ADR-001: classify staging checks as REAL (all pipeline checks, fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable). New leaf module src/staging_verdict.py (stdlib-only, never-raise): classify_check + compute_staging_verdict fold per-check results into a tolerant-but-fail-closed verdict — any REAL failure -> FAILED/exit1 (safety net holds under any flag); only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list; only infra & strict -> FAILED/exit1; any internal error -> FAILED/exit1 (never a false green). staging_check.py now auto-classifies each check (public 3-tuple _items shape kept as an ORCH-048 b6 regression guard), exposes categorized_items(), prints INFRA-WAIVED/VERDICT lines, and exits via the verdict; new --strict flag forces legacy strictness per-run. Kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED (default true) restores legacy strict mode globally. launcher gains action_stage_no_changes_note so "no changes to commit" on action stages is logged as expected, not treated as under-delivery. Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status:/ deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB migration. Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG. Refs: ORCH-061 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 12:39:00 +00:00
claude-bot	1d1208c136	architect(ET): auto-commit from architect run_id=297	2026-06-07 12:22:46 +00:00
claude-bot	3ab2690a68	analyst(ET): auto-commit from analyst run_id=296	2026-06-07 12:10:46 +00:00
Slava	3806522041	docs: init ORCH-061 business request	2026-06-07 15:05:55 +03:00
claude-bot	210aef6954	deployer(ET): auto-commit from deployer run_id=293	2026-06-07 11:59:00 +00:00
claude-bot	829b914ff7	tester(ET): auto-commit from tester run_id=292	2026-06-07 11:54:59 +00:00
claude-bot	55e5e968ae	reviewer(ET): auto-commit from reviewer run_id=291	2026-06-07 11:53:34 +00:00
claude-bot	4db8276f98	fix(reconciler): skip escalated / Blocked / Needs-Input tasks in F-1 Reconciler F-1 could not tell "stuck by a lost webhook" from "escalated: max developer retries reached, waiting for a human". With CI green and a reviewer that kept sending REQUEST_CHANGES up to the cap, every tick re-unblocked development -> review -> rollback -> re-unblock (incident ET-013, infinite bounce: wasted agent runs, Telegram spam, parasitic load on the shared self-hosting instance). Add two pre-gate guards in Reconciler._reconcile_gate_task (after the existing analysis/no-gate/active-job/grace guards, before the gate pre-evaluation), each an early silent return (no advance, no unblocked_total increment, no notifications): - Guard 1 (escalated, deterministic, no network, checked first): developer_retry_count(task_id) >= MAX_DEVELOPER_RETRIES. Promote stage_engine._developer_retry_count to public developer_retry_count (single source of truth; private alias kept). Limit from the constant, not a literal 3. - Guard 2 (explicit human Plane gate, Variant A, no DB migration): new never-raise plane_sync.fetch_issue_state + Reconciler._is_blocked_or_needs_input; any error/None/unresolved project -> conservative skip. New sub-flag ORCH_RECONCILE_SKIP_BLOCKED_ENABLED mutes only the networked Guard 2. F-2 unchanged: Blocked/Needs Input are outside {in_progress, approved, rejected} so they are never replayed (regression test added). DB schema, STAGE_TRANSITIONS, QG_CHECKS, never-raise, analysis carve-out and kill-switches untouched. Refs: ORCH-060 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 11:50:02 +00:00
claude-bot	efe437a4aa	architect(ET): auto-commit from architect run_id=289	2026-06-07 11:41:02 +00:00
claude-bot	365c67f45d	analyst(ET): auto-commit from analyst run_id=288	2026-06-07 11:28:57 +00:00
Slava	d6e0df3550	docs: init ORCH-060 business request	2026-06-07 14:24:00 +03:00
claude-bot	9e810c89f0	docs(ORCH-058): staging gate FAILED (8/10) — CORRECTED root cause (harness bug, not handler) Staging check exit code 1 (C9a/C9b). Live inspection inside orchestrator-staging proves the production webhook handler is correct: get_project_states(SANDBOX).in_progress = 84a76f65..., but scripts/staging_check.py hardcodes the enduro fallback b873d9eb... => handler correctly classifies the webhook as "no pipeline action". Fix belongs in scripts/staging_check.py (resolve SANDBOX in_progress dynamically), NOT in handle_status_start or any ORCH-058 image-freshness code. Image under test = ORCH-058 merge commit `094b5e2f`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 11:05:37 +00:00
claude-bot	60e5596e94	docs(ORCH-058): staging gate re-run — staging_status FAILED (8/10, C9a/C9b) E2E pipeline not triggered on staging webhook ("no pipeline action" on state b873d9eb...); reproduces prior FAILED. Rolls task back to development. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 10:42:21 +00:00
claude-bot	637c4e9e2e	docs(ORCH-058): staging gate re-run on fresh image — staging_status FAILED (8/10) Strategy-A freshness re-validation rebuilt 8501 from merged commit `094b5e2` and re-ran staging_check; E2E C9a/C9b fail (Plane "In Progress"/started webhook -> "no pipeline action", no task/branch/analyst-job). Machine verdict FAILED -> rollback to development. Prod (8500) untouched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 10:21:37 +00:00
Slava	094b5e2f96	Merge pull request 'feat(ORCH-058): staging-image provenance before BUILD-ONCE prod retag (INV-FRESH)' (#57) from feature/ORCH-058-self-deploy-retag-staging into main	2026-06-07 13:04:07 +03:00
claude-bot	90b6c8d5a8	docs(ORCH-058): staging gate re-run — staging_status SUCCESS (10/10 PASS) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 09:52:41 +00:00
claude-bot	2221d402b1	docs(ORCH-058): staging gate log — staging_status SUCCESS (10/10 PASS) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 09:33:05 +00:00
claude-bot	6ddff5583d	fix(ORCH-058): parametrize staging_check in --build-staging + explicit staging target Round-3 review follow-up on `c53d625` (P1/P2): - P1: --build-staging now runs staging_check via parametrized STAGING_CONTAINER / STAGING_CHECK_PATH / STAGING_CHECK_MODE (default orchestrator-staging / bind-mount path / stub) instead of hardcoding $TARGET_SERVICE + the script path. docker exec runs INSIDE the staging container (ORCH-048 canonical: B6 registry isolation), after health, before exit 0. Fail-closed: any non-zero -> exit 1. STAGING only (8501). - P2a: rebuild_staging_image now passes the STAGING target EXPLICITLY (TARGET_SERVICE/TARGET_PORT/COMPOSE_PROFILE/STAGING_CONTAINER) so the self-rebuild can never drift onto prod 8500 if hook defaults change (AC-9). - P2b: TC-09 caller<->hook contract tests assert the ssh command carries GIT_SHA + BUILD_CONTEXT + the staging target and never the prod 8500 one; no-ssh-host fails closed. - P3: consolidated the three duplicate README footers into one. - Docs (golden source): DEPLOY_HOOK.md step 4 + env rows, README footer, CHANGELOG, Dockerfile ARG GIT_SHA="" comment, .env.example freshness block. Validates exactly the artefact later BUILD-ONCE retagged to prod (AC-4, ADR-001 step 3). 632 tests pass, ruff clean, bash -n OK. Refs: ORCH-058 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 09:24:38 +00:00
claude-bot	3b3d587300	docs(ORCH-058): add CHANGELOG entry, .env.example flags, fix README status Close AC-11 documentation gap left by the prior developer run: the ORCH-058 feature (staging-image provenance before BUILD-ONCE retag) was implemented and green but never recorded in the golden-source docs. - CHANGELOG.md: add the ORCH-058 [Unreleased]/Added entry (layers A+B, validated_revision anchor, check_staging_image_fresh, EXPECTED_REVISION hook guard, new ORCH_IMAGE_FRESHNESS_* flags, ADR/test refs). - .env.example (canon): document ORCH_IMAGE_FRESHNESS_ENABLED / ORCH_IMAGE_FRESHNESS_REPOS, mirroring the ORCH-036/043/053 precedent. - docs/architecture/README.md: footer note design -> implemented, aligning it with the already-updated section. Refs: ORCH-058 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 08:27:57 +00:00
claude-bot	83397570fe	developer(ET): auto-commit from developer run_id=264	2026-06-07 07:46:19 +00:00
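
The verdict rule described in commit 9070489968 (ORCH-061) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in src/staging_verdict.py: the names classify_check and compute_staging_verdict and the {C9a, C9b} allowlist come from the commit message, while the function signatures, the Verdict dataclass, the `tolerant` parameter, and the (check_id, passed) input shape are hypothetical.

```python
from dataclasses import dataclass, field

# Narrow allowlist of waivable checks, as named in the ORCH-061 commit message.
SANDBOX_INFRA = frozenset({"C9a", "C9b"})


@dataclass
class Verdict:
    """Hypothetical result shape; the real module may return something else."""
    status: str                      # "SUCCESS" or "FAILED"
    exit_code: int                   # 0 or 1
    waived: list = field(default_factory=list)


def classify_check(check_id):
    """Return "SANDBOX_INFRA" for allowlisted checks and "REAL" for all others."""
    return "SANDBOX_INFRA" if check_id in SANDBOX_INFRA else "REAL"


def compute_staging_verdict(results, tolerant=True):
    """Fold (check_id, passed) pairs into a tolerant-but-fail-closed verdict."""
    try:
        failed = [check_id for check_id, passed in results if not passed]
        real_failed = [c for c in failed if classify_check(c) == "REAL"]
        infra_failed = [c for c in failed if classify_check(c) == "SANDBOX_INFRA"]
        if real_failed:
            # Any REAL failure fails the gate regardless of the tolerance flag.
            return Verdict("FAILED", 1)
        if infra_failed and not tolerant:
            # Strict mode: infra-only failures still fail the gate.
            return Verdict("FAILED", 1)
        # Nothing failed, or only allowlisted infra checks failed and are waived.
        return Verdict("SUCCESS", 0, waived=infra_failed)
    except Exception:
        # Never raise: an internal error must not produce a false green.
        return Verdict("FAILED", 1)
```

With this sketch, an 8/10 run in which only C9a and C9b fail yields SUCCESS with both checks listed as waived, while the same run with tolerant=False, or any run with a failed REAL check, yields FAILED.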


169 Commits
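
The claim-before-act pattern from the ORCH-065 reaper commits (720c31393a, 4bebb921ff) might look like the sketch below. It assumes a SQLite table named jobs with id and status columns; the helper names claim_terminal and reap_exit0 and the `advance` callback are hypothetical stand-ins, not the functions in src/job_reaper.py. The only point illustrated is that the terminal flip is guarded by WHERE status='running', and that side effects run only for the caller whose UPDATE actually changed a row.

```python
import sqlite3


def claim_terminal(conn, job_id, new_status):
    """Atomically flip a running job to a terminal status.

    Returns True only for the caller whose UPDATE matched the row, i.e. the
    claim winner. A caller that lost the race sees rowcount == 0.
    """
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = 'running'",
        (new_status, job_id),
    )
    conn.commit()
    return cur.rowcount == 1


def reap_exit0(conn, job_id, advance):
    """Claim 'done' first; run the advance side effect only if the claim won."""
    if not claim_terminal(conn, job_id, "done"):
        return False  # lost the claim: no advance, no enqueue
    advance(job_id)
    return True
```

Calling reap_exit0 twice for the same job invokes `advance` once: the second call finds no row with status 'running' and returns without side effects, which is the "no dup advance / dup enqueue" property the commit message describes.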