Commit Graph

588 Commits

Author SHA1 Message Date
deploy-finalizer
eb34324852 deploy(ORCH-036): finalize SUCCESS for ORCH-114
All checks were successful
CI / test (push) Successful in 1m11s
2026-06-15 19:35:56 +03:00
7490f4fac4 tester(ET): auto-commit from tester run_id=714
All checks were successful
CI / test (push) Successful in 1m18s
CI / test (pull_request) Successful in 1m28s
2026-06-15 19:28:38 +03:00
d4eca78423 reviewer(ET): auto-commit from reviewer run_id=713 2026-06-15 19:28:38 +03:00
c4a97a7a28 fix(stage-engine): address ORCH-114 review — env/docs canon + in-region rollback CAS
Resolves the REQUEST_CHANGES findings on ORCH-114 (durable transition-ownership
lease + expected-stage CAS):

P1 — documentation = golden source:
- .env.example: add ORCH_TRANSITION_LEASE_ENABLED / ORCH_TRANSITION_LEASE_REPOS
  (canon of 100% start keys, ORCH-101), next to the other gate kill-switches.
- CLAUDE.md: add the ORCH-114 passport section (mechanism, invariant, flags,
  ADR links) so a future agent editing advance_stage/reaper/webhooks finds the
  ownership invariant in the first mandatory-read doc (ORCH-078 traceability index).

P2 — should-fix:
- docs/overview/ (system showcase, ORCH-011): add transition_lease to
  tech-data-model.md (helper tables), tech-observability.md (/queue blocks) and
  tech-architecture.md (components).
- ADR-001 D4 alignment: the four side-effectful-edge rollback handlers
  (_handle_merge_gate_rollback / _handle_security_gate / _handle_coverage_gate /
  _handle_image_freshness) now write `development` through the expected-stage CAS
  via a shared _rollback_stage_cas helper (defence against the rollback↔done
  contradiction, BR-6) instead of a bare unconditional update_task_stage. Under the
  held lease the sole owner always wins; a lost race aborts WITHOUT side effects.
  Kill-switch off / out-of-scope repo -> degenerates to the prior write -> 1:1.
- Test isolation: make tests/test_webhooks.py order-independent by pinning the
  proj-1 registry per-test (mirrors test_webhook_dedup.proj_registry); it had only
  passed by relying on import order. Drop the needless module-level ORCH_DB_PATH
  setdefault in test_orch114 (fresh_db already isolates db_path).

New regression tests (TC-11): in-region rollback writes route through CAS;
rollback CAS wins when at expected stage; rollback CAS-lost does NOT clobber `done`;
kill-switch-off rollback degenerates to the unconditional write.

ruff clean (src/stage_engine.py, src/transition_lease.py); full suite 2052 passed.

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:28:38 +03:00
4a6b32e61d reviewer(ET): auto-commit from reviewer run_id=711 2026-06-15 19:28:38 +03:00
6ea4402942 fix(stage-engine): durable transition-ownership lease + expected-stage CAS (ORCH-114)
Close the root class of the ORCH-110/111/112/113 incident chain: side-effectful
stage transitions had no single owner. `advance_stage` is re-entrant and wrote
the stage with a bare `UPDATE ... WHERE id=?` (no compare-and-swap), while >=5 actors
(monitor / Plane-webhook / reconciler F-1 / job-reaper / deploy-finalizer) enter the
same transition independently. A concurrent or post-restart re-entry therefore
re-applied irreversible effects (merge_pr / coverage-ratchet / image-rebuild /
prod-deploy initiation) and produced a contradictory rollback<->done (incident
ORCH-111, job 1914 / PR #130).

Two complementary layers, both additive, under one kill-switch, never-raise:
  1. Durable transition-lease (new table `transition_lease`) — owner-exclusion on
     ENTRY to the side-effectful region: a second actor that sees a LIVE owner does
     not start the heavy sub-gates at all (prevention, not post-hoc repair).
  2. Expected-stage CAS (`db.update_task_stage_cas`) — atomicity on the stage WRITE:
     a lost race aborts with NO side effect. Also closes the 6 paths that write the
     stage in bypass of advance_stage (gitea x5 + plane rollback).

Owner liveness = owner_pid + owner_boot_id (NOT a heartbeat: a blocking 900s merge
re-test cannot emit one; ADR-001 D3), making restart recovery free (a fresh boot_id
renders every prior lease stale -> reclaimed by recover_on_startup). The lease has no
TTL of its own: its hard age ceiling is the reaper Tier-3 backstop reaper_max_running_s,
so the cross-cutting budget invariant ORCH-065/109/110/113 is untouched.

Generalises ORCH-113 finalizer-liveness (process-local, Tier-2, deploy-staging) to a
durable cross-path lease: the reaper consults it on all relevant paths (defer live,
reclaim dead; Tier-3 ignores the marker -> bounded; a reap force-releases the lease);
reconciler F-1 and the Plane webhook defer on an active lease; main.lifespan calls
recover_on_startup() after requeue_running_jobs. finalizer_liveness.py is unchanged
(it remains the kill-switch-off fallback).

Scope: self-hosting (transition_lease_repos="" -> orchestrator only; enduro untouched).
Kill-switch ORCH_TRANSITION_LEASE_ENABLED=false -> CAS degenerates to the prior
unconditional update_task_stage, lease inert, reaper -> ORCH-113 fallback (byte-for-
byte pre-ORCH-114). STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys /
existing table schemas — byte-for-byte (one additive table, no epoch column on tasks).

Observability: read-only `transition_lease` block in GET /queue + a Telegram alert on
forced/stale reclaim + optional POST /transition-lease/release?work_item=<id>.

Coverage: tests/test_orch114_transition_ownership.py (TC-01 mandatory regression of
the ORCH-111 class — red before fix, green after; TC-02..TC-14). Full suite green
(2048 passed); the 4 webhook tests that spied on the removed gitea.update_task_stage
were updated to spy on the new commit_stage_cas write path.

ADR: docs/work-items/ORCH-114/06-adr/ADR-001-transition-ownership-lease-and-stage-cas.md
Cross-cutting: docs/architecture/adr/adr-0045-transition-ownership-lease-and-stage-cas.md

Refs: ORCH-114
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:28:38 +03:00
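The expected-stage CAS described in commit 6ea4402942 can be illustrated with a minimal sketch. It assumes a SQLite table `tasks(id, stage)`; the real `db.update_task_stage_cas` signature and the actual schema are not shown in this log, so the function below, its parameters, and the table layout are hypothetical.

```python
import sqlite3


def update_task_stage_cas(conn: sqlite3.Connection, task_id: int,
                          expected_stage: str, new_stage: str) -> bool:
    """Write new_stage only if the row is still at expected_stage.

    Hypothetical sketch: the real helper's signature is not shown in the log.
    Returns True when this caller won the race, False when it lost.
    """
    cur = conn.execute(
        "UPDATE tasks SET stage = ? WHERE id = ? AND stage = ?",
        (new_stage, task_id, expected_stage),
    )
    conn.commit()
    # rowcount is 1 only when the WHERE clause matched, i.e. nobody moved the
    # stage since the caller last read it.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT NOT NULL)")
conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'deploy-staging')")

# The first actor completes the transition.
won = update_task_stage_cas(conn, 1, "deploy-staging", "done")
# A late rollback that still expects 'deploy-staging' loses the race.
lost = update_task_stage_cas(conn, 1, "deploy-staging", "development")
final_stage = conn.execute("SELECT stage FROM tasks WHERE id = 1").fetchone()[0]
```

A caller that receives `False` would skip its side effects, which is how a lost rollback avoids clobbering `done`.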
cc03e68847 architect(ET): auto-commit from architect run_id=709 2026-06-15 19:28:38 +03:00
9fcca9efbc analyst(ET): auto-commit from analyst run_id=708 2026-06-15 19:28:38 +03:00
ab5e4c345b docs: init ORCH-114 business request 2026-06-15 19:28:38 +03:00
6565d50242 deploy-staging(ORCH-114): staging gate SUCCESS (8/10 PASS, C9a/C9b infra-waived)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 19:28:13 +03:00
deploy-finalizer
285f5f05dc deploy(ORCH-036): finalize SUCCESS for ORCH-112
All checks were successful
CI / test (push) Successful in 3m9s
CI / test (pull_request) Successful in 3m11s
2026-06-15 15:33:15 +03:00
344ab72f37 tester(ET): auto-commit from tester run_id=706
All checks were successful
CI / test (push) Successful in 3m59s
CI / test (pull_request) Successful in 3m9s
2026-06-15 15:15:56 +03:00
7f673a45f7 reviewer(ET): auto-commit from reviewer run_id=705 2026-06-15 15:15:56 +03:00
31b4f3fd1d architect(ET): auto-commit from architect run_id=703 2026-06-15 15:15:56 +03:00
96b653d11c architect(ET): auto-commit from architect run_id=702 2026-06-15 15:15:56 +03:00
860de5b0a5 analyst(ET): auto-commit from analyst run_id=701 2026-06-15 15:15:56 +03:00
c086921aa1 docs: init ORCH-112 business request 2026-06-15 15:15:56 +03:00
eb1b7aa056 docs(ORCH-112): staging gate log artifact — SUCCESS
All checks were successful
CI / test (pull_request) Successful in 3m52s
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:14:32 +03:00
deploy-finalizer
c8faa1ec23 deploy(ORCH-036): finalize SUCCESS for ORCH-113
All checks were successful
CI / test (push) Successful in 3m9s
CI / test (pull_request) Successful in 3m5s
2026-06-15 13:51:44 +03:00
b62e196710 developer(ET): auto-commit from developer run_id=699
All checks were successful
CI / test (push) Successful in 3m22s
CI / test (pull_request) Successful in 3m43s
2026-06-15 13:43:22 +03:00
7523b843a5 tester(ET): auto-commit from tester run_id=696
All checks were successful
CI / test (push) Successful in 4m41s
CI / test (pull_request) Successful in 4m1s
2026-06-15 13:08:41 +03:00
adeffbb39a reviewer(ET): auto-commit from reviewer run_id=695 2026-06-15 13:08:41 +03:00
1e74b9d042 architect(ET): auto-commit from architect run_id=693 2026-06-15 13:08:41 +03:00
425ecb7585 analyst(ET): auto-commit from analyst run_id=692 2026-06-15 13:08:41 +03:00
55e9483fb8 docs: init ORCH-113 business request 2026-06-15 13:08:41 +03:00
164cf2143c docs(ORCH-113): staging gate SUCCESS — 15-staging-log.md
All checks were successful
CI / test (pull_request) Successful in 3m56s
Staging suite 8/10 PASS, REAL failed: none; C9a/C9b infra-waived (ORCH-061).
staging_status: SUCCESS

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 13:07:34 +03:00
deploy-finalizer
f3cd6f4c5a deploy(ORCH-036): finalize SUCCESS for ORCH-110
All checks were successful
CI / test (push) Successful in 2m45s
CI / test (pull_request) Successful in 2m26s
2026-06-15 11:04:55 +03:00
04d5671e1b tester(ET): auto-commit from tester run_id=690
All checks were successful
CI / test (push) Successful in 4m35s
CI / test (pull_request) Successful in 4m24s
2026-06-15 10:42:34 +03:00
1622454d43 reviewer(ET): auto-commit from reviewer run_id=689 2026-06-15 10:42:34 +03:00
651b9af7c3 fix(merge-gate): tolerate re-test infra-timeout + tree-kill spawned pytest
Eliminate the false `deploy-staging -> development` rollback that fired when the
merge-gate local re-test timed out (infra/resource) on a green CI + tester +
staging branch (incident ORCH-109/PR #129: a 516.7s suite blew its 600s budget
under CPU starvation from orphaned pytest processes -> timeout misrouted as a
code fault -> developer-retry loop -> manual gate).

Additive, 5 independent kill-switches, never-raise, self-hosting scope. Untouched
byte-for-byte: STAGE_TRANSITIONS, the QG_CHECKS registry, check_branch_mergeable
name/semantics, machine-verdict keys, the DB schema. INV-4 (never push/force-push
main) and the no-prod-restart rule are preserved.

- D1: new stdlib-only leaf src/proc_group.py runs the spawned re-test/coverage
  pytest in its own process group (start_new_session) and tree-kills the WHOLE
  group on timeout (os.killpg SIGTERM->grace->SIGKILL); used by
  merge_gate.retest_branch and coverage_gate.measure_coverage. No orphan leak.
  Fallback never-break: subprocess_tree_kill_enabled=False / non-POSIX -> the
  prior subprocess.run.
- D2/D3: merge_gate.classify_retest_failure distinguishes timeout/red/lock-busy/
  other; an infra timeout routes to _handle_merge_gate_infra_retry (bounded
  re-queue, task stays on deploy-staging, no rollback / no developer-retry); a
  red re-test / conflict still rolls back (BR-6). Exhaustion -> one infra alert.
- D4: skip the local re-test when the pre-merge rebase was a proven no-op (HEAD
  already CI/tester/staging-validated); fail-safe runs the re-test on any
  uncertainty. Flag merge_retest_skip_when_current_enabled.
- D5: merge_retest_timeout_s 600 -> 900 + _resolve_retest_timeout validation;
  reaper_max_running_s invariant preserved without change.
- D6: in-process counters + read-only merge_gate block in GET /queue; appended
  ("ORCH-110","classify_retest_failure","src/merge_gate.py") to
  MAIN_REGRESSION_MARKERS. Docs (README/internals overview/CLAUDE/CHANGELOG/
  .env.example) updated in the same PR.

Tests: tests/test_orch110_*.py (TC-01..TC-12, incl. the red-before/green-after
incident regression). Full suite green (1988 passed).

Refs: ORCH-110

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:42:34 +03:00
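The process-group tree-kill from item D1 of commit 651b9af7c3 can be sketched as follows. This is an illustration under assumptions, not the contents of src/proc_group.py: the function names, the `grace_s` default, and the return convention are hypothetical, and the sketch is POSIX-only because it relies on `os.killpg`.

```python
import os
import signal
import subprocess
import sys
from typing import Optional, Sequence


def _signal_group(pgid: int, sig: int) -> None:
    """Send sig to every process in the group; ignore an already-empty group."""
    try:
        os.killpg(pgid, sig)
    except ProcessLookupError:
        pass


def run_in_own_group(cmd: Sequence[str], timeout_s: float,
                     grace_s: float = 2.0) -> Optional[int]:
    """Run cmd in a new session and kill the whole group on timeout.

    Returns the exit code, or None if the run timed out and was tree-killed.
    """
    # start_new_session=True makes the child call setsid(), so it becomes the
    # leader of a new process group whose id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pgid = proc.pid
        _signal_group(pgid, signal.SIGTERM)   # polite request first
        try:
            proc.wait(timeout=grace_s)        # grace period for the leader
        except subprocess.TimeoutExpired:
            pass
        _signal_group(pgid, signal.SIGKILL)   # sweep any survivors in the group
        proc.wait()                           # reap the leader
        return None


quick = run_in_own_group([sys.executable, "-c", "pass"], timeout_s=30)
slow = run_in_own_group(
    [sys.executable, "-c", "import time; time.sleep(30)"], timeout_s=0.5
)
```

Signalling the group rather than the single child is what lets grandchildren spawned by the re-test die with it instead of being orphaned.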
cf602b4810 architect(ET): auto-commit from architect run_id=687 2026-06-15 10:42:34 +03:00
3a2a5063e0 analyst(ET): auto-commit from analyst run_id=686 2026-06-15 10:42:34 +03:00
fe130db788 docs: init ORCH-110 business request 2026-06-15 10:42:34 +03:00
e34233f323 docs(ORCH-110): staging gate SUCCESS — 15-staging-log.md
All checks were successful
CI / test (pull_request) Successful in 3m48s
8/10 checks PASS, exit 0. C9a/C9b infra-waived (ORCH-061).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 10:41:12 +03:00
deploy-finalizer
da599e8736 deploy(ORCH-036): finalize SUCCESS for ORCH-111
All checks were successful
CI / test (push) Successful in 2m41s
CI / test (pull_request) Successful in 3m12s
2026-06-15 09:14:06 +03:00
d1e8346605 deploy-staging(ORCH-111): staging gate SUCCESS (8/10 PASS, C9a/C9b infra-waived)
All checks were successful
CI / test (push) Successful in 3m31s
CI / test (pull_request) Successful in 4m15s
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 08:47:44 +03:00
3f16b77d2b tester(ET): auto-commit from tester run_id=682
All checks were successful
CI / test (push) Successful in 3m3s
CI / test (pull_request) Successful in 3m13s
2026-06-15 08:43:55 +03:00
521a72e702 reviewer(ET): auto-commit from reviewer run_id=681
All checks were successful
CI / test (push) Successful in 3m48s
CI / test (pull_request) Successful in 4m48s
2026-06-15 08:31:48 +03:00
deploy-finalizer
007a9ad47d deploy(ORCH-036): finalize FAILED for ORCH-111
All checks were successful
CI / test (push) Successful in 3m0s
CI / test (pull_request) Successful in 3m0s
2026-06-15 02:43:37 +03:00
27b85144c2 developer(ET): auto-commit from developer run_id=680
Some checks failed
CI / test (push) Has been cancelled
CI / test (pull_request) Successful in 2m50s
2026-06-15 02:43:30 +03:00
1fbfb941a9 tester(ET): auto-commit from tester run_id=678
All checks were successful
CI / test (push) Successful in 4m22s
CI / test (pull_request) Successful in 4m27s
2026-06-15 02:14:17 +03:00
96701a1a2d reviewer(ET): auto-commit from reviewer run_id=677 2026-06-15 02:14:17 +03:00
2e73ccf090 feat(watchdog): proc_blocking alert for orphaned long-lived test processes
Close the observability gap between agent_hung (which tracks jobs only by jobs.pid)
and orphaned pytest subprocesses the orchestrator launches itself
(merge_gate.retest_branch / coverage_gate.measure_coverage). On a timeout-kill of
the agent (-9, ORCH-109) the grand-child pytest reparents onto tini and keeps
running for days, starving CPU and failing merge-gate re-test — with no alert.

Strictly inside the observer (watchdog/** + the watchdog compose service):
- watchdog/collectors/proc.py: stdlib-only /proc scan (under pid: host),
  read-only, never-raise -> []; pure parsers split from I/O (tested on a fake
  /proc tree). Never reads /proc/<pid>/environ.
- watchdog/signals.py: pure proc_signals builder, per-entity
  ("proc_blocking", pid), active iff age_s > proc_age_s; actionable RU detail.
- watchdog/core.py: opt-in tick block (gated on proc_enabled -> zero overhead /
  byte-for-byte when off) + RECOVERY synthesis for a vanished process through the
  existing decide()/AlertState (no new anti-spam logic).
- watchdog/config.py: WATCHDOG_PROC_{ENABLED(false),AGE_MIN(60),PATTERNS(pytest),
  COOLDOWN_S(1800)}; default threshold > max(merge_retest_timeout_s=600,
  coverage_run_timeout_s=900) so a legit in-flight run never crosses it.
- docker-compose.yml: pid: host on orchestrator-watchdog ONLY (read-only privilege).

Anti-false-positive and no overlap with agent_hung are by construction (cmdline
scope + age threshold), not fragile cross-namespace PID matching.

Canon synced: WATCHDOG_PROC_* in .env.watchdog.example <-> .env.example block;
documented in LITE_SETUP.md and docs/architecture/README.md (architect). src/**,
/metrics, schema_version, STAGE_TRANSITIONS, QG_CHECKS, check_*, machine-verdict
and the DB schema are untouched; deploy rebuilds only the sidecar, prod
orchestrator is not restarted (NFR-3).

Tests: tests/watchdog/test_proc_blocking_signal.py (TC-01..TC-06),
test_proc_collector.py (/proc parsing), test_tick_proc_blocking_integration.py
(TC-07), plus pid: host and proc-config assertions. Full pytest tests/ green (1930).

Refs: ORCH-111
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 02:14:17 +03:00
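The split between pure parsers and the signal builder described in commit 2e73ccf090 can be sketched as below. The `ProcInfo` type, the function signatures, and the substring matching rule are assumptions made for illustration; only the NUL-separated layout of `/proc/<pid>/cmdline`, the `("proc_blocking", pid)` entity shape, and the "active iff age_s > proc_age_s" rule come from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass(frozen=True)
class ProcInfo:
    """One scanned process (hypothetical shape, for illustration only)."""
    pid: int
    cmdline: str
    age_s: float


def parse_cmdline(raw: bytes) -> str:
    """/proc/<pid>/cmdline is NUL-separated; join the arguments with spaces."""
    parts = [p for p in raw.split(b"\0") if p]
    return " ".join(p.decode("utf-8", errors="replace") for p in parts)


def proc_signals(procs: Iterable[ProcInfo], patterns: Sequence[str],
                 proc_age_s: float) -> List[Tuple[str, int]]:
    """Return ("proc_blocking", pid) for matching processes past the threshold."""
    active = []
    for p in procs:
        matches = any(pat in p.cmdline for pat in patterns)
        if matches and p.age_s > proc_age_s:
            active.append(("proc_blocking", p.pid))
    return active


# Fake scan results: no real /proc access is needed to exercise the logic.
procs = [
    ProcInfo(101, parse_cmdline(b"python\0-m\0pytest\0tests/\0"), age_s=4 * 3600),
    ProcInfo(102, parse_cmdline(b"python\0-m\0pytest\0tests/\0"), age_s=120),
    ProcInfo(103, parse_cmdline(b"sleep\0infinity\0"), age_s=9 * 3600),
]
signals = proc_signals(procs, patterns=["pytest"], proc_age_s=60 * 60)
```

Only pid 101 is flagged: pid 102 matches the pattern but is younger than the threshold, and pid 103 is old but out of the cmdline scope.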
7298f11064 architect(ET): auto-commit from architect run_id=675 2026-06-15 02:14:17 +03:00
44adcba389 analyst(ET): auto-commit from analyst run_id=674 2026-06-15 02:14:17 +03:00
a0526e1def docs: init ORCH-111 business request 2026-06-15 02:14:17 +03:00
afc4e641c0 docs(ORCH-111): staging gate log — SUCCESS (8/10, C9a/C9b infra-waived)
All checks were successful
CI / test (pull_request) Successful in 3m27s
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 02:12:59 +03:00
deploy-finalizer
f5c93aa3cc deploy(ORCH-036): finalize SUCCESS for ORCH-109
All checks were successful
CI / test (push) Successful in 3m7s
CI / test (pull_request) Successful in 3m9s
2026-06-14 20:47:24 +03:00
2028b6cb14 reviewer(ET): auto-commit from reviewer run_id=671
All checks were successful
CI / test (push) Successful in 3m39s
CI / test (pull_request) Successful in 4m23s
2026-06-14 20:10:25 +03:00
8628e609d9 tester(ET): auto-commit from tester run_id=669
All checks were successful
CI / test (push) Successful in 4m27s
CI / test (pull_request) Successful in 4m8s
2026-06-14 14:26:11 +03:00