orchestrator

Author	SHA1	Message	Date
post-deploy-monitor	0df409909c	docs(ORCH-021): post-deploy HEALTHY/NONE for ORCH-114	2026-06-15 19:51:06 +03:00
deploy-finalizer	eb34324852	deploy(ORCH-036): finalize SUCCESS for ORCH-114	2026-06-15 19:35:56 +03:00
claude-bot	7490f4fac4	tester(ET): auto-commit from tester run_id=714	2026-06-15 19:28:38 +03:00
claude-bot	d4eca78423	reviewer(ET): auto-commit from reviewer run_id=713	2026-06-15 19:28:38 +03:00
claude-bot	c4a97a7a28	fix(stage-engine): address ORCH-114 review — env/docs canon + in-region rollback CAS Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership lease + expected-stage CAS): P1 — documentation = golden source: - .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS (canon of 100% start keys, ORCH-101), next to the other gate kill-switches. - CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags, ADR links) so a future agent editing advance_stage/reaper/webhooks finds the ownership invariant in the first mandatory-read doc (ORCH-078 traceability index). P2 — should-fix: - docs/overview/ (system showcase, ORCH-011): add transition_lease to tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and tech-architecture.md (components). - ADR-001 D4 alignment: the four side-effectful-edge rollback handlers (_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate / _handle_image_freshness) now write `development` through the expected-stage CAS via a shared _rollback_stage_cas helper (defence against the rollback↔done contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the held lease the sole owner always wins; a lost race aborts WITHOUT side effects. Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1. - Test isolation: make tests/test_webhooks.py order-independent by pinning the proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only passed by relying on import order. Drop the needless module-level ORCH_DB_PATH setdefault in test_orch114 (fresh_db already isolates db_path). New regression tests (TC-11): in-region rollback writes route through CAS; rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`; kill-switch-off rollback degenerates to the unconditional write. 
ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed. Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:38 +03:00
claude-bot	4a6b32e61d	reviewer(ET): auto-commit from reviewer run_id=711	2026-06-15 19:28:38 +03:00
claude-bot	6ea4402942	fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114) Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful stage transitions had no single ownership. `advance_stage` is re-enterable and wrote the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors (monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the same transition independently. A concurrent or post-restart re-entry therefore re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild / prod-deploy initiation) and produced a contradictory rollback<->done (incident ORCH-111, job 1914 / PR #130). Two complementary layers, both additive, under one kill-switch, never-raise: 1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on ENTRY to the side-effectful region: a second actor that sees a LIVE owner does not start the heavy sub-gates at all (prevention, not post-hoc repair). 2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE: a lost race aborts with NO side effect. Also closes the 6 paths that write the stage in bypass of advance_stage (gitea x5 + plane rollback). Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat — a blocking 900s merge re-test cannot beat one; ADR-001 D3), making restart recovery free (a fresh boot_id renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no own TTL: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s, so the cross-cutting budget invariant ORCH-065/109/110/113 is untouched. 
Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a durable cross-path lease: the reaper consults it on all relevant paths (defer live, reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease); reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged (it remains the kill-switch-off fallback). Scope self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched). Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table schemas — byte-for-byte (one additive table, no epoch column on tasks). Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>. Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green (2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage were updated to spy on the new commit_stage_cas write path. ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md Refs: ORCH-114 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:38 +03:00
claude-bot	cc03e68847	architect(ET): auto-commit from architect run_id=709	2026-06-15 19:28:38 +03:00
claude-bot	9fcca9efbc	analyst(ET): auto-commit from analyst run_id=708	2026-06-15 19:28:38 +03:00
Slava	ab5e4c345b	docs: init ORCH-114 business request	2026-06-15 19:28:38 +03:00
claude-bot	6565d50242	deploy-staging(ORCH-114): staging gate SUCCESS (8/10 PASS, C9a/C9b infra-waived) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 19:28:13 +03:00
deploy-finalizer	285f5f05dc	deploy(ORCH-036): finalize SUCCESS for ORCH-112	2026-06-15 15:33:15 +03:00
claude-bot	344ab72f37	tester(ET): auto-commit from tester run_id=706	2026-06-15 15:15:56 +03:00
claude-bot	7f673a45f7	reviewer(ET): auto-commit from reviewer run_id=705	2026-06-15 15:15:56 +03:00
claude-bot	31b4f3fd1d	architect(ET): auto-commit from architect run_id=703	2026-06-15 15:15:56 +03:00
claude-bot	96b653d11c	architect(ET): auto-commit from architect run_id=702	2026-06-15 15:15:56 +03:00
claude-bot	860de5b0a5	analyst(ET): auto-commit from analyst run_id=701	2026-06-15 15:15:56 +03:00
Slava	c086921aa1	docs: init ORCH-112 business request	2026-06-15 15:15:56 +03:00
claude-bot	eb1b7aa056	docs(ORCH-112): staging gate log artifact — SUCCESS Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 15:14:32 +03:00
deploy-finalizer	c8faa1ec23	deploy(ORCH-036): finalize SUCCESS for ORCH-113	2026-06-15 13:51:44 +03:00
claude-bot	b62e196710	developer(ET): auto-commit from developer run_id=699	2026-06-15 13:43:22 +03:00
claude-bot	7523b843a5	tester(ET): auto-commit from tester run_id=696	2026-06-15 13:08:41 +03:00
claude-bot	adeffbb39a	reviewer(ET): auto-commit from reviewer run_id=695	2026-06-15 13:08:41 +03:00
claude-bot	1e74b9d042	architect(ET): auto-commit from architect run_id=693	2026-06-15 13:08:41 +03:00
claude-bot	425ecb7585	analyst(ET): auto-commit from analyst run_id=692	2026-06-15 13:08:41 +03:00
Slava	55e9483fb8	docs: init ORCH-113 business request	2026-06-15 13:08:41 +03:00
claude-bot	164cf2143c	docs(ORCH-113): staging gate SUCCESS — 15-staging-log.md Staging suite 8/10 PASS, REAL failed: none; C9a/C9b infra-waived (ORCH-061). staging_status: SUCCESS Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 13:07:34 +03:00
deploy-finalizer	f3cd6f4c5a	deploy(ORCH-036): finalize SUCCESS for ORCH-110	2026-06-15 11:04:55 +03:00
claude-bot	04d5671e1b	tester(ET): auto-commit from tester run_id=690	2026-06-15 10:42:34 +03:00
claude-bot	1622454d43	reviewer(ET): auto-commit from reviewer run_id=689	2026-06-15 10:42:34 +03:00
claude-bot	651b9af7c3	fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest Eliminate the false `deploy-staging -> development` rollback that fired when the merge-gate local re-test timed out (infra/resource) on a green CI + tester + staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget under CPU starvation from orphaned pytest processes -> timeout misrouted as a code fault -> developer-retry loop -> manual gate). Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push main) and the no-prod-restart rule are preserved. - D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage pytest in its own process group (start_new_session) and tree-kills the WHOLE group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak. Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the prior subprocess.run. - D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/ other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded re-queue, task stays on deploy-staging, no rollback / no developer-retry); a red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert. - D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD already CI/tester/staging-validated); fail-safe runs the re-test on any uncertainty. Flag merge_retest_skip_when_current_enabled. - D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation; reaper_max_running_s invariant preserved without change. - D6: in-process counters + read-only merge_gate block in GET /queue; appended ("ORCH-110","classify_retest_failure","src/merge_gate.py") to MAIN_REGRESSION_MARKERS. 
Docs (README/internals overview/CLAUDE/CHANGELOG/.env.example) updated in the same PR. Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after incident regression). Full suite green (1988 passed). Refs: ORCH-110 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 10:42:34 +03:00
claude-bot	cf602b4810	architect(ET): auto-commit from architect run_id=687	2026-06-15 10:42:34 +03:00
claude-bot	3a2a5063e0	analyst(ET): auto-commit from analyst run_id=686	2026-06-15 10:42:34 +03:00
Slava	fe130db788	docs: init ORCH-110 business request	2026-06-15 10:42:34 +03:00
claude-bot	e34233f323	docs(ORCH-110): staging gate SUCCESS — 15-staging-log.md 8/10 checks PASS, exit 0. C9a/C9b infra-waived (ORCH-061). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 10:41:12 +03:00
deploy-finalizer	da599e8736	deploy(ORCH-036): finalize SUCCESS for ORCH-111	2026-06-15 09:14:06 +03:00
claude-bot	d1e8346605	deploy-staging(ORCH-111): staging gate SUCCESS (8/10 PASS, C9a/C9b infra-waived) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 08:47:44 +03:00
claude-bot	3f16b77d2b	tester(ET): auto-commit from tester run_id=682	2026-06-15 08:43:55 +03:00
claude-bot	521a72e702	reviewer(ET): auto-commit from reviewer run_id=681	2026-06-15 08:31:48 +03:00
deploy-finalizer	007a9ad47d	deploy(ORCH-036): finalize FAILED for ORCH-111	2026-06-15 02:43:37 +03:00
claude-bot	27b85144c2	developer(ET): auto-commit from developer run_id=680	2026-06-15 02:43:30 +03:00
claude-bot	1fbfb941a9	tester(ET): auto-commit from tester run_id=678	2026-06-15 02:14:17 +03:00
claude-bot	96701a1a2d	reviewer(ET): auto-commit from reviewer run_id=677	2026-06-15 02:14:17 +03:00
claude-bot	2e73ccf090	feat(watchdog): proc_blocking alert for orphaned long-lived test processes Close the observability gap between agent_hung (only tracked jobs by jobs.pid) and orphaned pytest subprocesses the orchestrator launches itself (merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps running for days, starving CPU and failing merge-gate re-test — with no alert. Strictly inside the observer (watchdog/** + the watchdog compose service): - watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host), read-only, never-raise -> []; pure parsers split from I/O (tested on a fake /proc tree). Never reads /proc/<pid>/environ. - watchdog/signals.py: pure proc_signals builder, per-entity ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail. - watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead / byte-for-byte when off) + RECOVERY synthesis for a vanished process through the existing decide()/AlertState (no new anti-spam logic). - watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest), COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600, coverage_run_timeout_s=900) so a legit in-flight run never crosses it. - docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege). Anti-false-positive and no overlap with agent_hung are by construction (cmdline scope + age threshold), not fragile cross-namespace PID matching. Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block; documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/*, /metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_, machine-verdict and the DB schema are untouched; deploy rebuilds only the sidecar, prod orchestrator is not restarted (NFR-3). 
Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06), test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py (TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930). Refs: ORCH-111 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 02:14:17 +03:00
claude-bot	7298f11064	architect(ET): auto-commit from architect run_id=675	2026-06-15 02:14:17 +03:00
claude-bot	44adcba389	analyst(ET): auto-commit from analyst run_id=674	2026-06-15 02:14:17 +03:00
Slava	a0526e1def	docs: init ORCH-111 business request	2026-06-15 02:14:17 +03:00
claude-bot	afc4e641c0	docs(ORCH-111): staging gate log — SUCCESS (8/10, C9a/C9b infra-waived) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-15 02:12:59 +03:00
deploy-finalizer	f5c93aa3cc	deploy(ORCH-036): finalize SUCCESS for ORCH-109	2026-06-14 20:47:24 +03:00
claude-bot	2028b6cb14	reviewer(ET): auto-commit from reviewer run_id=671	2026-06-14 20:10:25 +03:00
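
Note on the ORCH-114 commit (6ea4402942): the message contrasts a bare `UPDATE ... WHERE id=?` with an expected-stage compare-and-swap, where a lost race aborts with no side effect. The following is only a minimal sketch of that idea, not the repository's code. It assumes SQLite and a hypothetical `tasks(id, stage)` table; the function name `update_stage_cas_sketch` and its signature are invented here for illustration.

```python
import sqlite3


def update_stage_cas_sketch(conn, task_id, expected_stage, new_stage):
    """Write new_stage only if the row is still at expected_stage.

    Hypothetical helper: returns True when this caller won the race,
    False when another writer already moved the stage.
    """
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    # rowcount is 1 only when the expected-stage predicate matched.
    return cur.rowcount == 1


def demo():
    conn = sqlite3.connect(":memory:")
    # Assumed schema, for illustration only.
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT NOT NULL)")
    conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'deploy-staging')")
    # First actor advances the task.
    won = update_stage_cas_sketch(conn, 1, "deploy-staging", "done")
    # A late rollback still expects 'deploy-staging' and must not clobber 'done'.
    lost = update_stage_cas_sketch(conn, 1, "deploy-staging", "development")
    final = conn.execute("SELECT stage FROM tasks WHERE id = 1").fetchone()[0]
    return won, lost, final
```

In this sketch the caller that gets `False` back is expected to skip its side effects, which mirrors the "rollback CAS-lost does NOT clobber `done`" regression named in c4a97a7a28.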

589 Commits
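
Note on the ORCH-110 commit (651b9af7c3): item D1 describes running the spawned pytest in its own process group (`start_new_session`) and tree-killing the whole group on timeout (`os.killpg`, SIGTERM, grace period, SIGKILL). The sketch below illustrates that pattern under stated assumptions; it is POSIX-only, it is not the contents of `src/proc_group.py`, and the name `run_in_group_sketch`, its parameters, and its return shape are hypothetical.

```python
import os
import signal
import subprocess
import sys


def run_in_group_sketch(cmd, timeout_s, grace_s=2.0):
    """Run cmd as a process-group leader; on timeout, kill the whole group.

    Hypothetical helper. Returns (returncode, timed_out).
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process-group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        pass

    # Polite stop for every process in the group, not only the leader.
    try:
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        pass
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    # Hard stop for anything in the group that ignored SIGTERM.
    try:
        os.killpg(proc.pid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # group already empty
    proc.wait()
    return proc.returncode, True


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_in_group_sketch(sleeper, timeout_s=0.5, grace_s=1.0))
```

Signalling the group rather than the single child is what addresses the orphaned grand-child pytest processes described in the ORCH-111 commit (2e73ccf090).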