orchestrator

Author	SHA1	Message	Date
post-deploy-monitor	06271b0bfb	docs(ORCH-021): post-deploy HEALTHY/NONE for ORCH-068	2026-06-08 05:49:27 +00:00
deploy-finalizer	aa4161fc78	deploy(ORCH-036): finalize SUCCESS for ORCH-068	2026-06-08 05:34:23 +00:00
claude-bot	6bbd530caa	tester(ET): auto-commit from tester run_id=351	2026-06-08 05:18:46 +00:00
claude-bot	4b03f213f7	reviewer(ET): auto-commit from reviewer run_id=349	2026-06-08 05:18:46 +00:00
claude-bot	1d72c44587	fix(reconciler): stop F-2 livelock spam on synced terminal tasks + cache TTL Reconciler F-2 spammed Telegram "<wi> разблокирована" ("<wi> unblocked") every ~120s for a fully-synchronized Done task (incident ET-002, 191+ msgs/night) after the ORCH-066 Plane status model merge. Two stacked defects (defense in depth): - D1 (selection): actionable states were told apart by bare UUID, so a Done issue aliased onto the approved UUID entered the approved branch. Now terminal states are excluded by Plane state GROUP (completed/cancelled), a project-independent discriminator robust to UUID aliasing; per-issue check with a logical-key fallback when the group is unavailable. get_project_states caches {uuid -> group} from the same /states/ fetch; new sibling accessor get_project_state_groups. - D2 (notification): _note_unblock fired unconditionally after _dispatch. Now it only fires on a confirmed state change (stage before/after _dispatch; task-appears for the start case) — handlers' contracts untouched. - TR-3: in-memory dedup guard {issue_id -> last unblocked state} as a backstop. - TR-4: _STATES_CACHE lived for the whole process lifetime, so a new Plane status was invisible without a restart. Added TTL ORCH_PLANE_STATES_TTL_S (default 300s; 0 = previous lifetime cache) reusing reload_project_states(); a failed refresh serves the stale-but-correct set, not enduro defaults. STAGE_TRANSITIONS / QG_CHECKS / DB schema / handle_* contracts / F-1 / F-3 unchanged; never-raise preserved; self-hosting tick never restarts prod. Observability: skipped_terminal_total / deduped_total in /queue reconcile block. Tests: tests/test_reconciler_plane.py (TC-01..TC-10), tests/test_plane_states_cache.py (TC-11/TC-12). Refs: ORCH-068 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-08 05:18:46 +00:00
claude-bot	0605309602	architect(ET): auto-commit from architect run_id=347	2026-06-08 05:18:46 +00:00
claude-bot	044894cbe9	analyst(ET): auto-commit from analyst run_id=346	2026-06-08 05:18:46 +00:00
Slava	cb11137a77	docs: init ORCH-068 business request	2026-06-08 05:18:46 +00:00
claude-bot	48b54051e5	docs(ORCH-068): add staging gate log (staging_status: SUCCESS)	2026-06-08 05:18:24 +00:00
claude-bot	3fb3d15cb4	docs(ORCH-066): add staging gate log (staging_status: SUCCESS) Staging check suite passed (8/10, exit 0): all REAL checks green; C9a/C9b waived as known sandbox-infra (ORCH-061). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 22:02:33 +00:00
claude-bot	4815e378d9	docs(ORCH-059): staging gate log — staging_status SUCCESS Staging check suite passed (exit 0); C9a/C9b sandbox-infra waived (ORCH-061). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 19:20:18 +00:00
claude-bot	67e98b8296	docs(ORCH-022): staging gate log — staging_status SUCCESS Canonical staging_check.py run inside orchestrator-staging: 8/10 PASS, all REAL checks green, C9a/C9b infra-waived (ORCH-061), exit 0 → advance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 18:04:35 +00:00
stream	cad5e98892	docs(history): lessons 2026-06-07 — autonomy closure (5 tasks: ORCH-58/60/61/21/65 to prod)	2026-06-07 19:24:49 +03:00
claude-bot	930e65298c	tester(ET): auto-commit from tester run_id=324	2026-06-07 16:14:45 +00:00
claude-bot	cba67a4270	reviewer(ET): auto-commit from reviewer run_id=323	2026-06-07 16:14:45 +00:00
claude-bot	720c31393a	fix(reaper): Tier-2 finalization grace + claim-before-act (no dup advance) Tier-2 reaped a LIVE, still-finalizing monitor: _monitor_agent writes agent_runs.exit_code FIRST, then does git push / PR / Plane comments before _finalize_job, and the agent pid is already dead in that window — so the old "exit_code recorded -> reap now" had no grace and could race a healthy job. Worse, _reap_known_outcome ran the advance (advance_stage -> enqueue_job) BEFORE the atomic claim, so a reaper that lost the race had already enqueued the next stage (dup advance / dup enqueue), violating ADR-001 Р-1. Fix: - Tier-2 grace: reap only once agent_runs.exit_code has been recorded for >= reaper_finalize_grace_s (new setting, default 300s; > max finalization window). A live finalizing monitor is never reaped (FR-1.3/AC-3). New finished_age_s column computed in get_running_jobs. - claim-before-act for exit0: evaluate the canonical QG READ-ONLY (the reconciler pattern) to choose the terminal status, then atomically claim 'done' FIRST; only the claim winner runs the advance. A loser performs no side effects -> no dup advance / dup enqueue. Docs (golden source) updated in the same change: ADR-001, global adr-0011, README, internals, .env.example, CHANGELOG (also fixes the P3 broken adr-0011 link). New tests cover the grace window, lost-claim no-side-effects, and the already-advanced idempotent path. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	9b7c855df3	reviewer(ET): auto-commit from reviewer run_id=321	2026-06-07 16:14:45 +00:00
claude-bot	a6b444c356	fix(merge): wire pr_already_merged guard into deployer merge path (idempotent re-merge) The pr_already_merged guard was defined + unit-tested but consulted by zero production code, while ADR-001 Р-3 / README / CHANGELOG claimed the merge path consults it before a repeat merge (reviewer P1, ORCH-065 attempt 2/3). The actual merge actor is the LLM deployer agent (it merges the feature PR at the start of the `deploy` stage), so on a reaper re-drive of an already-merged PR the deployer would blindly re-merge → Gitea error → false BUG-8 rollback; AC-11 ("no second merge") was not met deterministically. Wire the guard at the real consultation point — the deployer prompt — so it runs merge_gate.pr_already_merged before any (re-)merge and no-ops when the PR is already merged. check_branch_mergeable is left untouched (AC-13: check_* behaviour unchanged; it runs on the first deploy-staging→deploy edge, not on a deploy-stage re-drive where the second-merge risk lives). - .openclaw/agents/deployer.md: idempotent pre-merge guard step + general rule. - src/merge_gate.py: docstring names the deployer-prompt consultation point. - docs/architecture/README.md, CHANGELOG.md: state the consultation point so golden-source matches implementation. - tests/test_merge_gate.py: regression test asserting the deployer prompt wires the guard (so it can't silently become dead code again). pytest tests/ -q: 743 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	dbf14e3d5a	reviewer(ET): auto-commit from reviewer run_id=319	2026-06-07 16:14:45 +00:00
claude-bot	4bebb921ff	feat(reaper): job-reaper + stale merge-lease reclaim + idempotent merge finalization Closes the "zombie jobs" incident class: job status was set only inside the live launcher process, so a process death left jobs.status='running' forever; at max_concurrency=1 one zombie blocked ALL projects' queue (self-hosting risk). Adds a background daemon (src/job_reaper.py) with three-tier liveness (dead-pid streak / known exit_code / max-running backstop) whose only mutating write is an atomic terminal flip guarded by WHERE status='running' (no double-process). For exit0 the canonical QG is the source of truth via gate-driven advance, not "exit0". Also proactively reclaims stale merge-lease (dead pid OR TTL) via file delete only (no git ops), and makes merge finalization idempotent (pr_already_merged guard + up-to-date short-circuit on re-drive). New jobs.pid column via idempotent _ensure_column (no migration); pid stamped in launcher._spawn after Popen. Reaper start/stop in lifespan; "reaper" snapshot in GET /queue. Kill-switches: ORCH_REAPER_ENABLED, ORCH_REAPER_INTERVAL_S, ORCH_REAPER_DEAD_TICKS, ORCH_REAPER_MAX_RUNNING_S, ORCH_LEASE_RECLAIM_ENABLED. Invariants unchanged (AC-13): STAGE_TRANSITIONS, QG_CHECKS registry, check_branch_mergeable signature/behaviour, BUG-8 rollback, hook exit codes. restart-safe, never-raise per unit of background work. Docs: docs/architecture/README.md, CHANGELOG.md, .env.example. Tests: tests/test_job_reaper.py, tests/test_merge_lease_reclaim.py, tests/test_merge_gate.py (TC-16), tests/test_merge_gate_race.py (TC-17), tests/test_queue.py, tests/test_config.py (TC-19/TC-20). 742 passed. Refs: ORCH-065 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:45 +00:00
claude-bot	9f846b5a50	architect(ET): auto-commit from architect run_id=317	2026-06-07 16:14:45 +00:00
claude-bot	b760b24a48	analyst(ET): auto-commit from analyst run_id=316	2026-06-07 16:14:45 +00:00
Slava	f0ac9d5562	docs: init ORCH-065 business request	2026-06-07 16:14:45 +00:00
claude-bot	987ea810bf	docs(ORCH-065): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 16:14:22 +00:00
claude-bot	1c89ac9df9	tester(ET): auto-commit from tester run_id=313	2026-06-07 14:40:06 +00:00
claude-bot	03d899812c	reviewer(ET): auto-commit from reviewer run_id=312	2026-06-07 14:40:06 +00:00
claude-bot	b04fae748e	tester(ET): auto-commit from tester run_id=309	2026-06-07 14:40:06 +00:00
claude-bot	fbfcd84b16	reviewer(ET): auto-commit from reviewer run_id=308	2026-06-07 14:40:06 +00:00
claude-bot	2f4c553fd8	feat(post-deploy): post-deploy prod monitoring + degradation reaction (ORCH-021) Extend pipeline responsibility past deploy->done: after the terminal transition for an applicable repo, arm a ~15min observation window that probes prod and reacts to a degradation the restart-time health-check missed ("green deploy, red prod"). - src/post_deploy.py: new leaf module (config + lazy qg/db only). Sentinel-file restart-safe state (.post-deploy-state-<repo>/<wi>/), no DB migration. probe_signals/classify/decide_action/run_rollback, all never-raise. - Reserved-agent job `post-deploy-monitor` (no-LLM, Variant B, calque of deploy-finalizer): self-requeues each tick via enqueue_job. - Deterministic classify: DEGRADED iff >= fail_threshold consecutive health failures OR window 5xx ratio > 5xx_threshold; fail-safe HEALTHY. - Self-hosting invariant (BR-5/AC-8): a tick NEVER restarts the prod orchestrator container -> orchestrator is ALWAYS ALERT_ONLY. - Conditionality (ORCH-35/36/43/58): kill-switch + CSV repos, empty -> self-hosting only. - QG_CHECKS / STAGE_TRANSITIONS / schema unchanged (AC-12). - Docs: CHANGELOG, CLAUDE artefact list (16-post-deploy-log.md), architecture README, .env.example (ORCH_POST_DEPLOY_*). Refs: ORCH-021 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:40:06 +00:00
claude-bot	2bdba532d5	architect(ET): auto-commit from architect run_id=306	2026-06-07 14:40:06 +00:00
claude-bot	db83b89467	analyst(ET): auto-commit from analyst run_id=305	2026-06-07 14:40:06 +00:00
Slava	961c5e9eee	docs: init ORCH-021 business request	2026-06-07 14:40:06 +00:00
claude-bot	84a6f61ba8	docs(ORCH-021): staging gate SUCCESS — refresh 15-staging-log timestamp Re-ran staging_check inside orchestrator-staging (exit 0); all REAL checks green, C9a/C9b waived per ORCH-061. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:39:48 +00:00
claude-bot	1af356a343	docs(ORCH-021): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 14:25:00 +00:00
Slava	e18947d2d9	Merge pull request 'fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict (ORCH-061)' (#62) from feature/ORCH-061-bug-deploy-staging-development into main	2026-06-07 16:30:07 +03:00
claude-bot	bf6a0c095a	docs(ORCH-061): staging gate SUCCESS — REAL green, C9a/C9b infra-waived Validated ORCH-061 infra-tolerance against live staging (8501): all REAL checks pass, only sandbox-infra C9a/C9b fail and are waived → exit 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 13:29:33 +00:00
claude-bot	39769bdf23	tester(ET): auto-commit from tester run_id=300	2026-06-07 13:21:17 +00:00
claude-bot	de47737f4f	reviewer(ET): auto-commit from reviewer run_id=299	2026-06-07 13:18:47 +00:00
stream	e3f7c1c272	ci: re-trigger after gitea restart (ORCH-061)	2026-06-07 13:14:14 +00:00
stream	32a7aa8c6b	ci: trigger re-run after host disk cleanup (ORCH-061)	2026-06-07 13:08:38 +00:00
claude-bot	9070489968	fix(staging): tolerate sandbox-infra-only FAILs (C9a/C9b) in deploy-staging verdict The self-hosting orchestrator looped on deploy-staging -> development because scripts/staging_check.py exited 1 on ANY failed check, so two infra-only checks (C9a sandbox branch / C9b analyst-job — caused by SANDBOX bot accounts not being members of the sandbox Plane project, NOT a pipeline regress) forced staging_status: FAILED -> rollback -> loop, burning developer retries and tokens. Direction (б) per ADR-001: classify staging checks as REAL (all pipeline checks, fail-closed) vs SANDBOX_INFRA (narrow allowlist {C9a, C9b}, waivable). New leaf module src/staging_verdict.py (stdlib-only, never-raise): classify_check + compute_staging_verdict fold per-check results into a tolerant-but-fail-closed verdict — any REAL failure -> FAILED/exit1 (safety net holds under any flag); only C9a/C9b failed & tolerant -> SUCCESS/exit0 with waived list; only infra & strict -> FAILED/exit1; any internal error -> FAILED/exit1 (never a false green). staging_check.py now auto-classifies each check (public 3-tuple _items shape kept as an ORCH-048 b6 regression guard), exposes categorized_items(), prints INFRA-WAIVED/VERDICT lines, and exits via the verdict; new --strict flag forces legacy strictness per-run. Kill-switch ORCH_STAGING_INFRA_TOLERANCE_ENABLED (default true) restores legacy strict mode globally. launcher gains action_stage_no_changes_note so "no changes to commit" on action stages is logged as expected, not treated as under-delivery. Contracts unchanged: STAGE_TRANSITIONS, QG_CHECKS registry, staging_status:/ deploy_status: frontmatter, hook exit-code (0/1/2), check_staging_status; no DB migration. Docs: README, STAGING_CHECK.md, deployer.md, .env.example, CHANGELOG. Refs: ORCH-061 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 12:39:00 +00:00
claude-bot	1d1208c136	architect(ET): auto-commit from architect run_id=297	2026-06-07 12:22:46 +00:00
claude-bot	3ab2690a68	analyst(ET): auto-commit from analyst run_id=296	2026-06-07 12:10:46 +00:00
Slava	3806522041	docs: init ORCH-061 business request	2026-06-07 15:05:55 +03:00
claude-bot	210aef6954	deployer(ET): auto-commit from deployer run_id=293	2026-06-07 11:59:00 +00:00
claude-bot	829b914ff7	tester(ET): auto-commit from tester run_id=292	2026-06-07 11:54:59 +00:00
claude-bot	55e5e968ae	reviewer(ET): auto-commit from reviewer run_id=291	2026-06-07 11:53:34 +00:00
claude-bot	4db8276f98	fix(reconciler): skip escalated / Blocked / Needs-Input tasks in F-1 Reconciler F-1 could not tell "stuck by a lost webhook" from "escalated: max developer retries reached, waiting for a human". With CI green and a reviewer that kept sending REQUEST_CHANGES up to the cap, every tick re-unblocked development -> review -> rollback -> re-unblock (incident ET-013, infinite bounce: wasted agent runs, Telegram spam, parasitic load on the shared self-hosting instance). Add two pre-gate guards in Reconciler._reconcile_gate_task (after the existing analysis/no-gate/active-job/grace guards, before the gate pre-evaluation), each an early silent return (no advance, no unblocked_total increment, no notifications): - Guard 1 (escalated, deterministic, no network, checked first): developer_retry_count(task_id) >= MAX_DEVELOPER_RETRIES. Promote stage_engine._developer_retry_count to public developer_retry_count (single source of truth; private alias kept). Limit from the constant, not a literal 3. - Guard 2 (explicit human Plane gate, Variant A, no DB migration): new never-raise plane_sync.fetch_issue_state + Reconciler._is_blocked_or_needs_input; any error/None/unresolved project -> conservative skip. New sub-flag ORCH_RECONCILE_SKIP_BLOCKED_ENABLED mutes only the networked Guard 2. F-2 unchanged: Blocked/Needs Input are outside {in_progress, approved, rejected} so they are never replayed (regression test added). DB schema, STAGE_TRANSITIONS, QG_CHECKS, never-raise, analysis carve-out and kill-switches untouched. Refs: ORCH-060 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-06-07 11:50:02 +00:00
claude-bot	efe437a4aa	architect(ET): auto-commit from architect run_id=289	2026-06-07 11:41:02 +00:00
claude-bot	365c67f45d	analyst(ET): auto-commit from analyst run_id=288	2026-06-07 11:28:57 +00:00


179 Commits