diff --git a/CLAUDE.md b/CLAUDE.md
index e8426bc..c1ab1e4 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -235,6 +235,44 @@ kill-switch, never-raise, fail-safe → full cycle.
 `docs/work-items/ORCH-027/06-adr/ADR-001-coverage-gate.md`,
 `docs/architecture/adr/adr-0029-coverage-gate.md`.
+## Machine lessons journal (ORCH-098)
+Step 1 ("Foundation", F2) of the self-improvement epic: formalises the free-text "lessons" from `memory/` into a
+**machine-readable structured table of pipeline deviations**, `lessons`, the foundation for the future
+retrospective agent (E2), RICE prioritiser (E3) and Стрим. A pure **observer leaf**, `src/lessons.py`
+(never-raise, kill-switch, pattern of `serial_gate`/`coverage_gate`/`metrics`): `record()`/`get()`/
+`update()`/`snapshot()`. **Invariant:** the journal is an observer, **not** a Quality Gate; recording a lesson
+never affects stage promotion: `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/
+machine-verdict/the schemas of existing tables are untouched byte for byte.
+- **Table (D1):** additive, idempotent `lessons` (`CREATE TABLE IF NOT EXISTS` in `init_db()`,
+  three indexes): context (`work_item_id`/`task_id`/`stage`/`agent`/`repo`), analysis (`root_cause`/
+  `suggestion`), status (`status`/`related_task`), **attribution present from the start and nullable** (`attribution`/
+  `target_repo`/`target_domain`, Слава's requirement of 10.06 / NFR-6, filled in later via update;
+  `_ensure_column` is forward-safe on an old table) plus `source`/`detail`. No `enum` constraints:
+  the values are forward-compatible slugs. Helpers `db.record_lesson`/`get_lessons`/`update_lesson`/
+  `lessons_snapshot`/`lessons_recent_dup_exists`.
+- **NOT scoped by repo (D2):** unlike the gate leaves (`serial_gate`/`coverage_gate` have
+  `*_repos` because they *act* on a repo), the journal is observer-only → the only control is the global
+  kill-switch `lessons_enabled` (env `ORCH_LESSONS_ENABLED`, default `True`); **`lessons_repos` is NOT
+  introduced**. The recorder writes lessons about **any** repo (including enduro-trails: the lesson is valuable for the loop);
+  the per-repo cut happens at **query time** (`get(repo=…)`). enduro is not affected (shared DB, additive table).
+- **Auto-recording of 4 types (D3):** thin best-effort injections (`source="auto"`, never-raise, dedup):
+  `gate_failure` (`stage_engine._handle_qg_failure_rollbacks`, rollback to `development`), `merge_hold`
+  (`stage_engine._handle_merge_verify` HOLD branch), `transient_retry` (`launcher._finalize_transient`
+  on **exhaustion** of the retry budget, not on every backoff), `deploy_degraded` (post-deploy
+  `DEGRADED → set_repo_freeze`, the layer-3 lesson "deploy OK / prod broken", ET-8; `attribution="unknown"`,
+  classified later).
+- **Dedup (D4):** for `source="auto"`, one indexed SELECT on `idx_lessons_wi_type`: a duplicate with the same
+  `(work_item_id, lesson_type, stage)` within the `lessons_dedup_window_s` window (env, default 3600 s) → no-op.
+  `source="manual"` is NOT deduped (the operator/Стрим can always write).
+- **Endpoints (D5):** `GET /lessons` (read-only, filters `type`/`status`/`repo`/`work_item`/`limit`),
+  `POST /lessons` (manual record, `source="manual"`), `POST /lessons/{id}` (re-classification/update);
+  read-only `lessons` key in `GET /queue`. Flag off → `{"enabled": false}`.
+- **never-raise (NFR-1):** all public functions and injections are isolated (`try/except` → warning +
+  safe default), so a journal failure does not bring down the pipeline. Self-hosting-safe: it only reads/writes
+  its own table, does not deploy, does not restart prod, does not touch `main`, spawns no processes, uses no network. Details:
+  `docs/work-items/ORCH-098/06-adr/ADR-001-lessons-journal.md`,
+  `docs/architecture/adr/adr-0034-lessons-journal.md`.
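The dedup rule (D4) described above can be sketched as follows. This is a minimal illustration, assuming a cut-down `lessons` table in an in-memory SQLite database; the functions `recent_dup_exists`, `record` and `demo` are hypothetical stand-ins written for this example, not the project's `db.lessons_recent_dup_exists` or `lessons.record`, and they omit the kill-switch and never-raise wrapping.

```python
import sqlite3

# Cut-down stand-in for the `lessons` table: only the columns the dedup needs.
SCHEMA = """
CREATE TABLE lessons (
    id           INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at   TEXT NOT NULL DEFAULT (datetime('now')),
    lesson_type  TEXT NOT NULL,
    work_item_id TEXT,
    stage        TEXT,
    source       TEXT
);
CREATE INDEX idx_lessons_wi_type ON lessons (work_item_id, lesson_type);
"""


def recent_dup_exists(conn, work_item_id, lesson_type, stage, window_s):
    """True if an auto lesson with the same key was written in the last window_s seconds."""
    row = conn.execute(
        "SELECT 1 FROM lessons "
        # `IS ?` (not `= ?`) so that a NULL work_item_id / stage still matches NULL.
        "WHERE work_item_id IS ? AND lesson_type = ? AND stage IS ? "
        "AND source = 'auto' "
        "AND created_at > datetime('now', ?) LIMIT 1",
        (work_item_id, lesson_type, stage, f"-{int(window_s)} seconds"),
    ).fetchone()
    return row is not None


def record(conn, lesson_type, *, work_item_id=None, stage=None,
           source="auto", window_s=3600):
    """Insert a lesson and return its id, or None when an auto record is deduped."""
    if source == "auto" and recent_dup_exists(
        conn, work_item_id, lesson_type, stage, window_s
    ):
        return None
    cur = conn.execute(
        "INSERT INTO lessons (lesson_type, work_item_id, stage, source) "
        "VALUES (?, ?, ?, ?)",
        (lesson_type, work_item_id, stage, source),
    )
    conn.commit()
    return cur.lastrowid


def demo():
    """Write the same key three times: auto, auto again, then manual."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    first = record(conn, "gate_failure", work_item_id="ORCH-1", stage="review")
    repeat = record(conn, "gate_failure", work_item_id="ORCH-1", stage="review")
    manual = record(conn, "gate_failure", work_item_id="ORCH-1", stage="review",
                    source="manual")
    conn.close()
    return first, repeat, manual
```

Under these assumptions `demo()` returns `(1, None, 2)`: the repeated auto record is suppressed, while the manual record with the same key is still written. Note that a record with no `work_item_id` and no `stage` dedups against every other auto record of the same type in the window, because the key then degenerates to `(NULL, lesson_type, NULL)`.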
+
 ## Conventions
 - Conventional Commits (`feat:`, `fix:`, `docs:`, `refactor:`, `test:`)
 - Branches: `feature/ORCH-NNN-slug`, `fix/ORCH-NNN-slug`
diff --git a/docs/architecture/README.md b/docs/architecture/README.md
index a80007b..278f7bc 100644
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -20,7 +20,7 @@
 - **Plane Sync** (`src/plane_sync.py`): synchronises statuses/comments to Plane. The project status resolver `get_project_states` (ORCH-10) caches `{logical_key→uuid}` per project; **ORCH-068** adds `{uuid→group}` to the cache entry (for the F-2 terminal exclusion) and a **TTL** `ORCH_PLANE_STATES_TTL_S` (default 300 s; `0` → the previous lifetime cache), so a stale status set self-heals without a process restart via the existing `reload_project_states()` (the cache bug after a new Plane status appears). The return shape of `get_project_states` is unchanged (backward compatibility).
 - **FS ownership detect** (`src/fs_normalize.py`, ORCH-057, [adr-0031](adr/adr-0031-legacy-ownership-normalization.md)): a pure **never-raise** leaf (pattern of `serial_gate`/`preflight`) that closes the ORCH-040 gap: on migration to `user: "1000:1000"`, legacy `root:root` files in `/repos` broke worktree creation under uid 1000 (`ensure_worktree` → a raw `fatal: … Permission denied`, the agent did not start). Three layers: (1) **D1**: `src/git_worktree.py::ensure_worktree` classifies the "no permission" class (`Permission denied`/`could not create leading directories`/`insufficient permission`/`EACCES`/`EPERM`) and raises an actionable `RuntimeError` with the cause plus the fixing command (non-permission errors keep the previous contract: only the wording changes, not the fact of failure); (2) **D2**: `scan_ownership(roots, target_uid=os.getuid())` walks `/repos/_wt`, `/.git/{objects,worktrees}`, `data/runs` with an early exit at the first `st_uid != target_uid`, plus a TTL cache; (3) **D3**: a best-effort call at the start of `main.lifespan` → WARNING + Telegram on mismatch (claim is **NOT** blocked: the clear early failure comes from D1 at the launch point, which knows the repo; a preflight block was rejected as repo-blind → an enduro regression). The optional `normalize()` chowns only with `CAP_CHOWN` (a no-op under uid 1000; an init container/root entrypoint were rejected: they reintroduce a root context + self-deploy compose). The actual normalisation is an **operator procedure** run as root on the host (`INFRA.md`, "uid migration"). Conditionality, `applies(repo)` first: `fs_normalize_enabled` (kill-switch) + `fs_normalize_repos` (CSV, empty → self-hosting only). Observability: the `fs_ownership` block in `GET /queue`; optionally `POST /fs-normalize/check`. `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/machine-verdict/the DB schema are untouched. Details: `docs/work-items/ORCH-057/06-adr/ADR-001-legacy-ownership-normalization.md`.
 - **Metrics endpoint** (`src/metrics.py` + `GET /metrics`, ORCH-099, [adr-0030](adr/adr-0030-metrics-endpoint.md)): a lightweight **read-only** leaf collector (`build_metrics() -> dict`, never-raise per section, pattern of `serial_gate.snapshot()`) plus a thin endpoint (in the style of `GET /queue`). It serves JSON "raw material" about the orchestrator itself (task stages / jobs queue / agent liveness / cost and tokens) as a **stable machine contract for the F1b sidecar** (`watchdog/`, a separate task: the observer is separated from the observed). It only reads the existing `tasks`/`jobs`/`agent_runs` plus in-memory snapshots (`worker.breaker`); two read-only helpers in `db.py` (`get_running_agents`/`agent_cost_totals`). It carries NO monitoring logic (thresholds/alerts/history/Telegram): that is F1b. The contract is below (§ "Raw-data endpoint `/metrics`"). Kill-switch `metrics_endpoint_enabled` (default `True`). `STAGE_TRANSITIONS`/`QG_CHECKS`/the DB schema are untouched.
-- **Lessons journal** (`src/lessons.py` + the `lessons` table, ORCH-098, design, [adr-0034](adr/adr-0034-lessons-journal.md)): a machine lessons journal (a structured database of pipeline deviations); step 1 of the self-improvement epic (domain 0 "Foundation", F2; fuel for the 8A self-learning loop), the foundation for the future retrospective agent (E2)/RICE prioritiser (E3)/Стрим. A pure **observer leaf** (never-raise, pattern of `serial_gate`/`coverage_gate`/`metrics`): `record()`/`get()`/`update()`/`snapshot()`. **An additive, idempotent `lessons` table** (`CREATE TABLE IF NOT EXISTS` in `init_db()`, restart-safe) with context fields (`work_item_id`/`task_id`/`stage`/`agent`/`repo`), analysis fields (`root_cause`/`suggestion`), status fields (`status`/`related_task`) and **attribution, present from the start and nullable** (`attribution`/`target_repo`/`target_domain`, Слава's requirement of 10.06 / NFR-6, filled in later by the retrospective agent/a human), plus `source`/`detail`; no `enum` constraints (the slugs are forward-compatible). **Auto-recording of 4 types** (`source="auto"`, best-effort, dedup within a window; `transient_retry` only on exhaustion of the retry budget) via thin injections: `gate_failure` (`stage_engine._handle_qg_failure_rollbacks`), `merge_hold` (`merge_gate._handle_merge_verify` HOLD), `transient_retry` (merge-retry/launcher transient budget-exhaustion), `deploy_degraded` (post-deploy `DEGRADED → set_repo_freeze`, the layer-3 lesson "deploy OK / prod broken", ET-8). Endpoints: `GET /lessons` (read-only, filters), `POST /lessons` (manual record), `POST /lessons/{id}` (update/re-classification), plus a read-only `lessons` key in `GET /queue`. **Divergence from the gate template:** the journal is observer-only → **NOT scoped by repo** (kill-switch `lessons_enabled` only, no `lessons_repos`); the per-repo cut happens at query time (`repo` column/filter), enduro is not affected (shared DB, additive table). `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/machine-verdict/the schemas of existing tables are untouched byte for byte (the journal takes no part in the gate decision). Kill-switch `lessons_enabled` (env `ORCH_LESSONS_ENABLED`, default `True`). Details: `docs/work-items/ORCH-098/06-adr/ADR-001-lessons-journal.md`.
+- **Lessons journal** (`src/lessons.py` + the `lessons` table, ORCH-098, implemented, [adr-0034](adr/adr-0034-lessons-journal.md)): a machine lessons journal (a structured database of pipeline deviations); step 1 of the self-improvement epic (domain 0 "Foundation", F2; fuel for the 8A self-learning loop), the foundation for the future retrospective agent (E2)/RICE prioritiser (E3)/Стрим. A pure **observer leaf** (never-raise, pattern of `serial_gate`/`coverage_gate`/`metrics`): `record()`/`get()`/`update()`/`snapshot()`. **An additive, idempotent `lessons` table** (`CREATE TABLE IF NOT EXISTS` in `init_db()`, restart-safe) with context fields (`work_item_id`/`task_id`/`stage`/`agent`/`repo`), analysis fields (`root_cause`/`suggestion`), status fields (`status`/`related_task`) and **attribution, present from the start and nullable** (`attribution`/`target_repo`/`target_domain`, Слава's requirement of 10.06 / NFR-6, filled in later by the retrospective agent/a human), plus `source`/`detail`; no `enum` constraints (the slugs are forward-compatible). **Auto-recording of 4 types** (`source="auto"`, best-effort, dedup within a window; `transient_retry` only on exhaustion of the retry budget) via thin injections: `gate_failure` (`stage_engine._handle_qg_failure_rollbacks`), `merge_hold` (`merge_gate._handle_merge_verify` HOLD), `transient_retry` (merge-retry/launcher transient budget-exhaustion), `deploy_degraded` (post-deploy `DEGRADED → set_repo_freeze`, the layer-3 lesson "deploy OK / prod broken", ET-8). Endpoints: `GET /lessons` (read-only, filters), `POST /lessons` (manual record), `POST /lessons/{id}` (update/re-classification), plus a read-only `lessons` key in `GET /queue`. **Divergence from the gate template:** the journal is observer-only → **NOT scoped by repo** (kill-switch `lessons_enabled` only, no `lessons_repos`); the per-repo cut happens at query time (`repo` column/filter), enduro is not affected (shared DB, additive table). `STAGE_TRANSITIONS`/`QG_CHECKS`/`check_*`/machine-verdict/the schemas of existing tables are untouched byte for byte (the journal takes no part in the gate decision). Kill-switch `lessons_enabled` (env `ORCH_LESSONS_ENABLED`, default `True`). Details: `docs/work-items/ORCH-098/06-adr/ADR-001-lessons-journal.md`.
 - **Sidecar-watchdog F1b** (`watchdog/` + the `orchestrator-watchdog` service, ORCH-100, [adr-0033](adr/adr-0033-sidecar-watchdog.md)): **the monitoring brain in a SEPARATE container** (the observer is separated from the observed, C-1): the code lives in the orchestrator repo (`watchdog/`), the runtime is its own image (`watchdog/Dockerfile`, `python:3.12-slim`, **stdlib-only**) plus a service in `docker-compose.yml` (`network_mode: host`, read-only `docker.sock`, `mem_limit: 128m`). On every tick it collects 4 sources: the orchestrator's `GET /metrics` (F1a/ORCH-099), the host (disk/inode/memory/CPU, stdlib), container statuses via the read-only `docker.sock` (GET-only, no `docker` SDK), and a ping of Plane/Gitea/Anthropic. Each signal → the **generalised pure** `decide(signal_active, prev, now, cooldown)` (a generalisation of `disk_watchdog.decide_action`, per-signal in-memory `AlertState`) → an alert to the sidecar's **own** Telegram channel (`WATCHDOG_TG_*`, **NOT** an import of `src/notifications.py`). A special signal, `orch_down`: `/metrics` does not respond (the observer is alive, the observed is down). Disk: the regular 85% threshold stays with `disk_watchdog` (ORCH-063, zero duplication), the sidecar covers `orch_down` plus an opt-in 97% ceiling (default off). never-raise, kill-switch `WATCHDOG_ENABLED`, strictly read-only towards the observed; `src/**`/`STAGE_TRANSITIONS`/`QG_CHECKS`/the orchestrator's DB schema are untouched. More below (§ "Sidecar-watchdog F1b"). Details: `docs/work-items/ORCH-100/06-adr/ADR-001-sidecar-watchdog.md`.
 
 ## Raw-data endpoint `/metrics` for the sidecar (ORCH-099, design)
@@ -1086,6 +1086,7 @@ Monitoring after Deploy → Done
 - `jobs`: the job queue (ORCH-1); statuses `queued|running|done|failed|cancelled` (ORCH-090: `cancelled` is the terminal outcome of STOP and is never re-queued); the `pid` column (ORCH-065) holds the agent process pid, used by the job reaper for zombie liveness detection
 - `job_deps`: declarative task dependencies (ORCH-026, Level B): `(task_id, depends_on_task_id)`, additive; the scheduler's source of truth for the "B waits for A" gate
 - `repo_freeze`: durable per-repo rollback freeze (ORCH-088, FR-5): `(id, repo, frozen_at, reason, work_item_id, cleared_at)`, additive append-only; an active freeze ⇔ a row for the repo with `cleared_at IS NULL`. Set by post-deploy `DEGRADED` (`set_repo_freeze`), cleared manually (`POST /serial-gate/unfreeze` → `cleared_at=now`). Gates serial-claim unconditionally (the degraded task is already `done`)
+- `lessons`: machine journal of pipeline deviations (ORCH-098, FR-1): `(id, created_at, updated_at, lesson_type, work_item_id, task_id, stage, agent, repo, root_cause, suggestion, status, related_task, attribution, target_repo, target_domain, source, detail)`, additive and idempotent (`CREATE TABLE IF NOT EXISTS` + three indexes); the attribution columns (`attribution`/`target_repo`/`target_domain`) are nullable and present from the start (NFR-6), with no `enum` constraints (the slugs are forward-compatible). Auto-recording of 4 types (`gate_failure`/`merge_hold`/`transient_retry`/`deploy_degraded`, `source="auto"`, dedup within the `lessons_dedup_window_s` window) plus manual recording (`source="manual"`); observer-only (takes no part in the gate decision). Leaf `src/lessons.py` is never-raise, kill-switch `lessons_enabled` (no `*_repos`: the journal is not scoped by repo, the per-repo cut happens at query time)
 
 ## Isolation (git worktree, ORCH-2)
 Each task runs in a separate git worktree, so branches do not overlap. Project repositories are kept apart under `/repos/`.
Репозитории проектов разделены под `/repos/`. @@ -1095,9 +1096,12 @@ Monitoring after Deploy → Done |--------|------|----------| | GET | `/health` | health check | | GET | `/status` | активные задачи (stage != done) | -| GET | `/queue` | очередь: counts + max_concurrency + resilience + reconcile (ORCH-053) + reaper (ORCH-065) + post_deploy (ORCH-021) + task_deps (ORCH-026) + serial_gate (ORCH-088) + auto_labels (ORCH-089) + stop (ORCH-090) + последние jobs | +| GET | `/queue` | очередь: counts + max_concurrency + resilience + reconcile (ORCH-053) + reaper (ORCH-065) + post_deploy (ORCH-021) + task_deps (ORCH-026) + serial_gate (ORCH-088) + auto_labels (ORCH-089) + stop (ORCH-090) + lessons (ORCH-098) + последние jobs | | GET | `/metrics` | ORCH-099 (FND/F1a): read-only машинное «сырьё» для sidecar F1b — конверт `schema_version`/`generated_at`/`clk_tck` + разделы `stages`/`queue`/`agents` (liveness: pid/runtime/cpu_ticks)/`cost`. never-raise по разделам; kill-switch `ORCH_METRICS_ENABLED` (дефолт `True`). Контракт — см. раздел «Сырьё-эндпоинт `/metrics`» | | POST | `/serial-gate/unfreeze` | ORCH-088 (FR-5): ручное снятие per-repo rollback-freeze (query/body `repo=`) → `{ok, repo, cleared, frozen}`; идемпотентно. Альтернатива — `UPDATE repo_freeze SET cleared_at=datetime('now') WHERE repo=? AND cleared_at IS NULL` | +| GET | `/lessons` | ORCH-098 (FR-4): read-only выборка журнала уроков; query-фильтры `type`/`status`/`repo`/`work_item`/`limit` → `{enabled, lessons:[…]}` (всегда `200`, чтение не мутирует). 
При `lessons_enabled=False` → `{enabled:false, lessons:[]}` | +| POST | `/lessons` | ORCH-098 (FR-5): ручная запись урока (JSON-тело, `lesson_type` обязателен, `source="manual"` не дедупится) → `{id}`; при выключенном флаге → `{enabled:false}` | +| POST | `/lessons/{id}` | ORCH-098 (FR-5): доклассификация/обновление урока (`status`/`attribution`/`target_*`/`related_task`/`root_cause`/`suggestion`), стампит `updated_at` → `{ok}` | | POST | `/webhook/plane` | Plane webhook | | POST | `/webhook/gitea` | Gitea webhook (push, PR, CI status) | diff --git a/src/agents/launcher.py b/src/agents/launcher.py index 15eb41d..ba1b744 100644 --- a/src/agents/launcher.py +++ b/src/agents/launcher.py @@ -1016,6 +1016,20 @@ class AgentLauncher: ) self._notify_failed(job_id, agent, job, run_id, f"transient (rate-limit) after {tattempts} attempts") + # ORCH-098 (FR-3c / D3): auto-record a `transient_retry` lesson ONLY on + # budget EXHAUSTION (not on each backoff — that would be noise; the + # valuable signal is "transients exhausted"). best-effort, never-raise, + # deduped; can't escape into the queue-worker path. + try: + from ..lessons import record as record_lesson, LessonType + record_lesson( + LessonType.TRANSIENT_RETRY, + task_id=job.get("task_id"), repo=job.get("repo"), agent=agent, + root_cause=f"transient retry budget exhausted ({tattempts}/{tmax})", + detail=err, source="auto", + ) + except Exception as e: # noqa: BLE001 - never break the queue worker + logger.warning(f"Job {job_id}: lessons transient_retry record failed: {e}") def _finalize_permanent(self, job_id, agent, run_id, exit_code, job): """Permanent (code-fault) failure -> normal attempts SINGLE kill-switch (env ORCH_LESSONS_ENABLED). + # False -> record/get/update/snapshot inert (no DB + # access), endpoints return {"enabled": false}, + # auto-record injections no-op. Default True. + # lessons_query_limit_default-> default LIMIT for GET /lessons / get() when the + # caller passes none. 
+    #   lessons_dedup_window_s     -> auto-record dedup window (s): a second auto lesson
+    #                                with the same (work_item_id, lesson_type, stage)
+    #                                inside this window is suppressed (D4). manual
+    #                                records are never deduped. Default 3600 (1h).
+    lessons_enabled: bool = True
+    lessons_query_limit_default: int = 100
+    lessons_dedup_window_s: int = 3600
+
     # ORCH-057: legacy root-owned file ownership detect + actionable worktree error
     # (follow-up ORCH-040). Three additive, kill-switch-reversible layers: (1) an
     # actionable RuntimeError in git_worktree.ensure_worktree when a worktree fails
diff --git a/src/db.py b/src/db.py
index 6aca2e0..a158985 100644
--- a/src/db.py
+++ b/src/db.py
@@ -220,10 +220,195 @@ def init_db():
             updated_at TEXT NOT NULL DEFAULT (datetime('now'))
         );
     """)
+    # ORCH-098 (FR-1, ADR-001 D1): additive machine lessons-journal: a structured
+    # table of pipeline deviations (gate-fail / merge-hold / transient-retry /
+    # post-deploy-degraded), the foundation of the self-improvement epic (E2
+    # retrospective / E3 RICE prioritiser). Purely ADDITIVE (CREATE TABLE/INDEX IF NOT
+    # EXISTS, pattern repo_freeze/coverage_baseline) -> idempotent, restart-safe on
+    # the shared prod DB; existing tables untouched (NFR-3, enduro-trails not
+    # affected). The attribution columns (attribution/target_repo/target_domain) are
+    # NULLABLE and present FROM THE START (Слава 10.06, NFR-6) so the live shared DB
+    # never needs a schema rework: an auto-recorded `unknown` lesson is classified
+    # later via update. lesson_type / attribution / target_domain carry NO enum/CHECK
+    # constraint: the values are a forward-compatible slug convention (a new lesson
+    # type never needs a migration). See docs/work-items/ORCH-098/08-data-requirements.md.
+    conn.executescript("""
+        CREATE TABLE IF NOT EXISTS lessons (
+            id            INTEGER PRIMARY KEY AUTOINCREMENT,
+            created_at    TEXT NOT NULL DEFAULT (datetime('now')),
+            updated_at    TEXT,
+            lesson_type   TEXT NOT NULL,
+            work_item_id  TEXT,
+            task_id       INTEGER,
+            stage         TEXT,
+            agent         TEXT,
+            repo          TEXT,
+            root_cause    TEXT,
+            suggestion    TEXT,
+            status        TEXT NOT NULL DEFAULT 'new',
+            related_task  TEXT,
+            attribution   TEXT,
+            target_repo   TEXT,
+            target_domain TEXT,
+            source        TEXT,
+            detail        TEXT
+        );
+        CREATE INDEX IF NOT EXISTS idx_lessons_type_status ON lessons (lesson_type, status);
+        CREATE INDEX IF NOT EXISTS idx_lessons_repo ON lessons (repo);
+        CREATE INDEX IF NOT EXISTS idx_lessons_wi_type ON lessons (work_item_id, lesson_type);
+    """)
+    # Forward-safe: on an already-created `lessons` table the attribution columns are
+    # added idempotently (_ensure_column is a no-op once present) so an old prod DB
+    # picks them up without a data migration (NFR-6, AC-2).
+    _ensure_column(conn, "lessons", "attribution", "TEXT")
+    _ensure_column(conn, "lessons", "target_repo", "TEXT")
+    _ensure_column(conn, "lessons", "target_domain", "TEXT")
     conn.commit()
     conn.close()
 
 
+# ---------------------------------------------------------------------------
+# ORCH-098 (FR-1..FR-5, ADR-001 D1): lessons-journal data-access helpers. Each opens
+# its own connection and closes it in `finally` (pattern coverage_baseline). The leaf
+# src/lessons.py wraps these in its never-raise contract; these may raise on a
+# real DB fault (the leaf swallows it).
+# ---------------------------------------------------------------------------
+# The full column set, in INSERT order. Single source of truth so record/get stay
+# in lockstep with the schema.
+_LESSON_COLUMNS = (
+    "lesson_type", "work_item_id", "task_id", "stage", "agent", "repo",
+    "root_cause", "suggestion", "status", "related_task",
+    "attribution", "target_repo", "target_domain", "source", "detail",
+)
+# Fields an update() may set (everything mutable; never id/created_at/lesson_type).
+_LESSON_UPDATABLE = (
+    "status", "attribution", "target_repo", "target_domain", "related_task",
+    "root_cause", "suggestion", "stage", "agent", "repo", "detail",
+)
+
+
+def record_lesson(**fields) -> int:
+    """Insert one lessons row; return the new id. Raises only on a real DB fault.
+
+    Only the known columns in ``_LESSON_COLUMNS`` are written; unknown keys are
+    ignored (forward-safe). ``created_at`` is stamped by the table default.
+    """
+    cols = [c for c in _LESSON_COLUMNS if c in fields]
+    if "lesson_type" not in cols:
+        raise ValueError("record_lesson requires lesson_type")
+    placeholders = ", ".join("?" for _ in cols)
+    sql = f"INSERT INTO lessons ({', '.join(cols)}) VALUES ({placeholders})"
+    conn = get_db()
+    try:
+        cur = conn.execute(sql, tuple(fields[c] for c in cols))
+        conn.commit()
+        return int(cur.lastrowid)
+    finally:
+        conn.close()
+
+
+def lessons_recent_dup_exists(work_item_id, lesson_type, stage, window_s: int) -> bool:
+    """ORCH-098 (D4): is there an auto-lesson with the same (work_item_id,
+    lesson_type, stage) within the last ``window_s`` seconds? One indexed lookup on
+    ``idx_lessons_wi_type``. Used to suppress duplicate auto-records on retries.
+    """
+    conn = get_db()
+    try:
+        row = conn.execute(
+            "SELECT 1 FROM lessons "
+            "WHERE work_item_id IS ? AND lesson_type = ? AND stage IS ? "
+            "AND source = 'auto' "
+            "AND created_at > datetime('now', ?) LIMIT 1",
+            (work_item_id, lesson_type, stage, f"-{int(window_s)} seconds"),
+        ).fetchone()
+    finally:
+        conn.close()
+    return row is not None
+
+
+def get_lessons(*, lesson_type=None, status=None, repo=None, work_item_id=None,
+                limit: int = 100) -> list[dict]:
+    """Read-only parametrised SELECT of lessons (ORDER BY id DESC LIMIT ?)."""
+    where = []
+    params: list = []
+    if lesson_type:
+        where.append("lesson_type = ?")
+        params.append(lesson_type)
+    if status:
+        where.append("status = ?")
+        params.append(status)
+    if repo:
+        where.append("repo = ?")
+        params.append(repo)
+    if work_item_id:
+        where.append("work_item_id = ?")
+        params.append(work_item_id)
+    sql = "SELECT * FROM lessons"
+    if where:
+        sql += " WHERE " + " AND ".join(where)
+    sql += " ORDER BY id DESC LIMIT ?"
+    try:
+        lim = int(limit)
+    except (TypeError, ValueError):
+        lim = 100
+    params.append(max(1, lim))
+    conn = get_db()
+    try:
+        rows = conn.execute(sql, tuple(params)).fetchall()
+    finally:
+        conn.close()
+    return [dict(r) for r in rows]
+
+
+def update_lesson(lesson_id: int, **fields) -> bool:
+    """Update mutable fields of a lesson + stamp updated_at. Returns True iff a row
+    changed. Unknown / non-updatable keys are ignored (forward-safe).
+    """
+    sets = [c for c in _LESSON_UPDATABLE if c in fields]
+    if not sets:
+        return False
+    assignments = ", ".join(f"{c} = ?" for c in sets)
+    sql = f"UPDATE lessons SET {assignments}, updated_at = datetime('now') WHERE id = ?"
+    conn = get_db()
+    try:
+        cur = conn.execute(sql, tuple(fields[c] for c in sets) + (int(lesson_id),))
+        conn.commit()
+        return (cur.rowcount or 0) > 0
+    finally:
+        conn.close()
+
+
+def lessons_snapshot(recent: int = 10) -> dict:
+    """Light GROUP BY summary (counts by type/status) + the last N lessons, for the
+    GET /queue observability block."""
+    conn = get_db()
+    try:
+        total = conn.execute("SELECT COUNT(*) FROM lessons").fetchone()[0]
+        by_type = {
+            r["lesson_type"]: r["n"]
+            for r in conn.execute(
+                "SELECT lesson_type, COUNT(*) AS n FROM lessons GROUP BY lesson_type"
+            ).fetchall()
+        }
+        by_status = {
+            r["status"]: r["n"]
+            for r in conn.execute(
+                "SELECT status, COUNT(*) AS n FROM lessons GROUP BY status"
+            ).fetchall()
+        }
+        rows = conn.execute(
+            "SELECT * FROM lessons ORDER BY id DESC LIMIT ?", (max(1, int(recent)),)
+        ).fetchall()
+    finally:
+        conn.close()
+    return {
+        "total": total,
+        "by_type": by_type,
+        "by_status": by_status,
+        "recent": [dict(r) for r in rows],
+    }
+
+
 def get_coverage_baseline(repo: str) -> float | None:
     """ORCH-027: read the per-repo coverage baseline (%, line coverage).
 
diff --git a/src/lessons.py b/src/lessons.py
new file mode 100644
index 0000000..2e3c054
--- /dev/null
+++ b/src/lessons.py
@@ -0,0 +1,191 @@
+"""ORCH-098 (FND/F2): machine lessons-journal, a never-raise observer leaf.
+
+Background
+----------
+The orchestrator runs an autonomous pipeline; when it deviates (a quality gate
+rolls a task back, a merge is held, a transient burst exhausts the retry budget,
+a post-deploy verdict comes back DEGRADED) the only trace today is free-text in
+``memory/``, which is not machine-readable, so nothing can count the patterns or
+prioritise the fixes. ORCH-098 is step 1 ("Foundation", F2) of the
+self-improvement epic: it formalises those deviations into a structured
+``lessons`` table on which the future retrospective agent (E2), the RICE
+prioritiser (E3) and Стрим will stand.
+
+Design (ADR-001, modelled on ``serial_gate`` / ``coverage_gate`` / ``metrics``)
+-------------------------------------------------------------------------------
+This is a **leaf**: it imports only ``config`` + ``db`` (lazily). It NEVER imports
+``stage_engine`` / ``merge_gate`` / ``launcher`` (anti-cycle); those choke-points
+call INTO this module, never the reverse.
+
+Two contract invariants, both load-bearing on the shared self-hosting prod DB:
+
+  * **kill-switch** (FR-6 / AC-7): ``lessons_enabled=False`` -> every public
+    function is an immediate no-op (``record→None``, ``get→[]``, ``update→False``,
+    ``snapshot→{}``) WITHOUT touching the DB; the auto-record injections become
+    no-ops; pipeline behaviour is byte-for-byte the pre-ORCH-098 behaviour.
+  * **never-raise** (NFR-1 / AC-6): with the switch on, every body runs under
+    ``try/except Exception -> logger.warning + safe default``. A journal fault
+    (a failing DB, a bad row) can NEVER propagate into the hot path that called it
+    (a rollback / HOLD / retry must complete regardless).
+
+**No repo scope (D2).** Unlike the gate leaves (``serial_gate`` / ``coverage_gate``
+/ ``bug_fast_track`` carry a ``*_repos`` CSV because they *act* on a repo), the
+journal is observer-only: writing a row never influences any repo's pipeline.
+So it records lessons about ANY repo, including enduro-trails (a degraded enduro
+deploy is a valuable self-learning signal; a repo scope would drop it). The
+repo cut lives on the READ side (``get(repo=...)`` / ``snapshot``). enduro is not
+affected (NFR-3): an observer row about enduro changes no enduro stage/gate.
+
+Self-hosting safety (NFR-7): the journal only reads/writes its own table. It never
+deploys, never restarts prod, never touches ``main``, spawns no process, opens no
+socket.
+"""
+from __future__ import annotations
+
+import logging
+
+from .config import settings
+
+logger = logging.getLogger("orchestrator.lessons")
+
+
+# ---------------------------------------------------------------------------
+# Slug conventions (NOT enum constraints: forward-compatible string slugs, D1).
+# Exposed as constants so the choke-point injections and tests share one spelling.
+# ---------------------------------------------------------------------------
+class LessonType:
+    """Canonical ``lesson_type`` slugs written by the auto-detectors (D3)."""
+    GATE_FAILURE = "gate_failure"        # QG rollback to development
+    MERGE_HOLD = "merge_hold"            # merge not verified -> task held on deploy
+    TRANSIENT_RETRY = "transient_retry"  # transient retry budget exhausted
+    DEPLOY_DEGRADED = "deploy_degraded"  # post-deploy DEGRADED -> repo freeze
+
+
+class Attribution:
+    """``attribution`` slugs (who a lesson is about; filled in later by a human /
+    the retrospective agent; auto-records leave it NULL or ``unknown``)."""
+    PLATFORM = "platform"
+    PROJECT = "project"
+    BOTH = "both"
+    UNKNOWN = "unknown"
+
+
+class Domain:
+    """``target_domain`` slugs (which improvement axis a lesson touches)."""
+    RELIABILITY = "reliability"
+    QUALITY = "quality"
+    ECONOMY = "economy"
+    FEATURES = "features"
+    SCALE = "scale"
+
+
+class Status:
+    """``status`` lifecycle slugs."""
+    NEW = "new"
+    IN_PROGRESS = "in_progress"
+    CLOSED = "closed"
+    LINKED = "linked"
+
+
+def _enabled() -> bool:
+    """Read the kill-switch; never raises (a config read fault -> treated as off)."""
+    try:
+        return bool(settings.lessons_enabled)
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("lessons: kill-switch read error: %s", e)
+        return False
+
+
+def record(lesson_type, *, work_item_id=None, task_id=None, stage=None, agent=None,
+           repo=None, root_cause=None, suggestion=None, status="new", related_task=None,
+           attribution=None, target_repo=None, target_domain=None,
+           source="auto",
+           detail=None) -> int | None:
+    """Record one lesson; return its new id, or ``None`` (no-op / error / deduped).
+
+    * Kill-switch off -> immediate ``None`` WITHOUT a DB access (FR-6 / AC-7).
+    * ``source="auto"`` records are DEDUPED (D4): a prior auto-lesson with the same
+      ``(work_item_id, lesson_type, stage)`` within ``lessons_dedup_window_s`` ->
+      ``None`` (so transient retry-storms / repeated rollbacks don't flood the
+      table). ``source="manual"`` is NEVER deduped (the operator / Стрим can always
+      write).
+    * never-raise (NFR-1 / AC-6): any DB / internal error -> ``logger.warning`` +
+      ``None``; the caller (a hot-path rollback / HOLD / retry) is untouched.
+    """
+    if not _enabled():
+        return None
+    if not lesson_type:
+        return None
+    try:
+        from . import db
+        if source == "auto":
+            try:
+                window = int(getattr(settings, "lessons_dedup_window_s", 3600) or 0)
+            except (TypeError, ValueError):
+                window = 3600
+            if window > 0 and db.lessons_recent_dup_exists(
+                work_item_id, lesson_type, stage, window
+            ):
+                logger.debug(
+                    "lessons: deduped auto %s for %s/%s (within %ss window)",
+                    lesson_type, work_item_id, stage, window,
+                )
+                return None
+        return db.record_lesson(
+            lesson_type=lesson_type, work_item_id=work_item_id, task_id=task_id,
+            stage=stage, agent=agent, repo=repo, root_cause=root_cause,
+            suggestion=suggestion, status=status, related_task=related_task,
+            attribution=attribution, target_repo=target_repo,
+            target_domain=target_domain, source=source, detail=detail,
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise contract (NFR-1 / AC-6)
+        logger.warning("lessons.record(%s) error: %s", lesson_type, e)
+        return None
+
+
+def get(*, lesson_type=None, status=None, repo=None, work_item_id=None,
+        limit=None) -> list[dict]:
+    """Read-only fetch of lessons (newest first). never-raise -> ``[]`` on error /
+    when the kill-switch is off."""
+    if not _enabled():
+        return []
+    try:
+        if limit is None:
+            limit = getattr(settings, "lessons_query_limit_default", 100)
+        from . import db
+        return db.get_lessons(
+            lesson_type=lesson_type, status=status, repo=repo,
+            work_item_id=work_item_id, limit=limit,
+        )
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("lessons.get error: %s", e)
+        return []
+
+
+def update(lesson_id, **fields) -> bool:
+    """Re-classify / re-status an existing lesson (status / attribution / target_* /
+    related_task / root_cause / suggestion). Stamps ``updated_at``. never-raise ->
+    ``False`` on error / kill-switch off."""
+    if not _enabled():
+        return False
+    try:
+        from . import db
+        return db.update_lesson(lesson_id, **fields)
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("lessons.update(%s) error: %s", lesson_id, e)
+        return False
+
+
+def snapshot() -> dict:
+    """Light read-only summary for the GET /queue ``lessons`` block. never-raise ->
+    a minimal dict (``{"enabled": False}`` when off / ``{"enabled": True}`` on
+    error)."""
+    if not _enabled():
+        return {"enabled": False}
+    try:
+        from . import db
+        out = {"enabled": True}
+        out.update(db.lessons_snapshot())
+        return out
+    except Exception as e:  # noqa: BLE001 - never-raise contract
+        logger.warning("lessons.snapshot error: %s", e)
+        return {"enabled": True}
diff --git a/src/main.py b/src/main.py
index 2ca7d28..5f0f107 100644
--- a/src/main.py
+++ b/src/main.py
@@ -1,4 +1,4 @@
-from fastapi import FastAPI
+from fastapi import FastAPI, Request
 from contextlib import asynccontextmanager
 import logging
 from .db import init_db
@@ -213,6 +213,7 @@ async def queue():
     from . import labels
     from . import cancel
     from . import bug_fast_track
+    from . import lessons
     from .disk_watchdog import disk_watchdog
     from .build_cache_pruner import build_cache_pruner
     return {
@@ -248,6 +249,10 @@ async def queue():
         # kill-switch, label, scope, bug-task counts + the structural savings metric
         # (architecture stages skipped). Additive block; never-raise.
         "bug_fast_track": bug_fast_track.snapshot(),
+        # ORCH-098 (FR-4 / AC-4): lessons-journal observability (read-only):
+        # kill-switch + counts by type/status + last N lessons. Additive block;
+        # never-raise (snapshot() returns {"enabled": ...} minimum on error).
+        "lessons": lessons.snapshot(),
         # ORCH-063 (FR-6 / AC-7): disk-watchdog observability (read-only):
         # enabled, threshold, interval, last measurement per host-path. Additive
         # block; never-raise (status() returns {"enabled": ...} minimum on error).
@@ -390,3 +395,82 @@ async def bug_fast_track_escalate(work_item: str = ""):
         except Exception:
             pass
     return {"ok": True, "work_item": work_item, "track": "full", "was": prev_track}
+
+
+# ---------------------------------------------------------------------------
+# ORCH-098 (FR-4 / FR-5, ADR-001 D5): machine lessons-journal endpoints.
+# Read-only fetch + manual record + re-classify. All never-raise; with the
+# kill-switch off they return {"enabled": false} (style of /metrics, AC-7).
+# ---------------------------------------------------------------------------
+@app.get("/lessons")
+async def lessons_list(
+    type: str = "", status: str = "", repo: str = "", work_item: str = "",
+    limit: int | None = None,
+):
+    """ORCH-098: read-only lessons fetch with optional filters (type / status / repo
+    / work_item / limit). Always 200; reading never mutates. ``lessons_enabled=False``
+    -> ``{"enabled": false}``."""
+    from .
import lessons + from .config import settings + if not getattr(settings, "lessons_enabled", True): + return {"enabled": False, "lessons": []} + rows = lessons.get( + lesson_type=(type or None), status=(status or None), repo=(repo or None), + work_item_id=(work_item or None), limit=limit, + ) + return {"enabled": True, "lessons": rows} + + +@app.post("/lessons") +async def lessons_create(request: Request): + """ORCH-098: manually record a lesson (``source="manual"``, never deduped). JSON + body: ``lesson_type`` (required) + optional context / analysis / attribution + fields. Returns ``{"id": <int or null>}`` or ``{"enabled": false}`` / + ``{"ok": false, "error": ...}``.""" + from . import lessons + from .config import settings + if not getattr(settings, "lessons_enabled", True): + return {"enabled": False} + try: + body = await request.json() + except Exception: # noqa: BLE001 - malformed body + body = {} + if not isinstance(body, dict): + body = {} + lesson_type = body.get("lesson_type") + if not lesson_type: + return {"ok": False, "error": "missing 'lesson_type'"} + # Only forward known fields; source is forced to "manual" (operator/Стрим). + allowed = ( + "work_item_id", "task_id", "stage", "agent", "repo", "root_cause", + "suggestion", "status", "related_task", "attribution", "target_repo", + "target_domain", "detail", + ) + kwargs = {k: body[k] for k in allowed if k in body} + new_id = lessons.record(lesson_type, source="manual", **kwargs) + return {"id": new_id} + + +@app.post("/lessons/{lesson_id}") +async def lessons_update(lesson_id: int, request: Request): + """ORCH-098: re-classify / re-status an existing lesson (status / attribution / + target_* / related_task / root_cause / suggestion). Lets a human / the + retrospective agent classify an auto-recorded ``unknown``. Returns + ``{"ok": bool}`` or ``{"enabled": false}``.""" + from .
import lessons + from .config import settings + if not getattr(settings, "lessons_enabled", True): + return {"enabled": False} + try: + body = await request.json() + except Exception: # noqa: BLE001 - malformed body + body = {} + if not isinstance(body, dict): + body = {} + allowed = ( + "status", "attribution", "target_repo", "target_domain", "related_task", + "root_cause", "suggestion", "stage", "agent", "repo", "detail", + ) + kwargs = {k: body[k] for k in allowed if k in body} + ok = lessons.update(lesson_id, **kwargs) + return {"ok": ok} diff --git a/src/stage_engine.py b/src/stage_engine.py index 3d4bbbb..a1e65de 100644 --- a/src/stage_engine.py +++ b/src/stage_engine.py @@ -927,6 +927,24 @@ def _handle_qg_failure_rollbacks( f"development ({reason})" ) + # ORCH-098 (FR-3a / D3): machine lessons-journal — auto-record a `gate_failure` + # lesson whenever a quality gate rolled this task back to `development` + # (reviewer REQUEST_CHANGES / tester FAIL / staging FAILED / deploy FAILED — all + # four branches above set result.rolled_back_to="development"). One best-effort + # call covers every rollback branch; lessons.record is never-raise + deduped, and + # this guard ensures even an import fault can't escape into the hot rollback path. + if result.rolled_back_to == "development": + try: + from . 
import lessons + lessons.record( + lessons.LessonType.GATE_FAILURE, + work_item_id=work_item_id, task_id=task_id, stage=current_stage, + agent=agent, repo=repo, root_cause=reason, detail=qg_name, + source="auto", + ) + except Exception as e: # noqa: BLE001 - never break the rollback path + logger.warning(f"Task {task_id}: lessons gate_failure record failed: {e}") + # --------------------------------------------------------------------------- # ORCH-043: merge-gate sub-gate on the deploy-staging -> deploy edge @@ -1726,6 +1744,19 @@ def _handle_merge_verify(task_id, repo, work_item_id, branch, result: AdvanceRes result.alerted = True result.note = "merge-not-verified-hold" result.advanced = False + # ORCH-098 (FR-3b / D3): auto-record a `merge_hold` lesson — deploy succeeded + # but `main` never got the commit, so the task is held on `deploy` (not done). + # best-effort, never-raise, deduped; can't escape into the HOLD path. + try: + from . import lessons + lessons.record( + lessons.LessonType.MERGE_HOLD, + work_item_id=work_item_id, task_id=task_id, stage="deploy", + repo=repo, root_cause="merge-not-verified-hold", detail=merge_msg, + source="auto", + ) + except Exception as e: # noqa: BLE001 - never break the HOLD + logger.warning(f"Task {task_id}: lessons merge_hold record failed: {e}") return True except Exception as e: # noqa: BLE001 - never-raise contract (INV-1/AC-7) # Any internal error -> treat as "not confirmed" -> HOLD + alert, never crash. @@ -2009,6 +2040,24 @@ def run_post_deploy_monitor(job: dict): except Exception as e: # noqa: BLE001 - never break the tick logger.warning(f"post-deploy: set_repo_freeze failed for {repo}: {e}") + # ORCH-098 (FR-3d / D3): auto-record a `deploy_degraded` lesson — "deploy OK / + # prod broken" (layer-3, ET-8). attribution left "unknown" + target_domain + # "reliability" for a human / the retrospective agent to classify later (this is + # exactly the signal Слава required the attribution columns for). 
best-effort, + # never-raise; can't escape into the monitor tick. + try: + from . import lessons + reason = f"post-deploy DEGRADED ({checks_failed}/{checks_total})" + lessons.record( + lessons.LessonType.DEPLOY_DEGRADED, + work_item_id=work_item_id, repo=repo, stage="deploy", + root_cause=reason, attribution=lessons.Attribution.UNKNOWN, + target_repo=repo, target_domain=lessons.Domain.RELIABILITY, + source="auto", + ) + except Exception as e: # noqa: BLE001 - never break the tick + logger.warning(f"post-deploy: lessons deploy_degraded record failed for {repo}: {e}") + post_deploy.write_post_deploy_log( repo, work_item_id, branch, post_deploy.DEGRADED, action_taken, settings.post_deploy_window_s, checks_total, checks_failed, diff --git a/tests/test_lessons.py b/tests/test_lessons.py new file mode 100644 index 0000000..83759b2 --- /dev/null +++ b/tests/test_lessons.py @@ -0,0 +1,396 @@ +"""ORCH-098 / TC-01..TC-12: the machine lessons-journal (src/lessons.py + db + wiring). + +Contract under test (ADR-001 §7 / acceptance-criteria): + * the `lessons` table is additive + idempotent and carries the NULLABLE + attribution columns (attribution / target_repo / target_domain) from the start; + * record() inserts a row (auto/manual) and returns its id; auto records are + deduped in a window, manual records are never deduped; + * never-raise: a failing DB -> None/[]/{}/False, never an exception into the caller; + * kill-switch off -> record/get/update/snapshot inert (no DB access); + * get_lessons filters by type/status/repo/work_item + LIMIT + ORDER BY id DESC; + * update_lesson mutates fields + stamps updated_at; unknown id is safe; + * auto-record wiring: a QG rollback to development writes a `gate_failure` lesson; + a launcher transient-budget-exhaustion writes a `transient_retry` lesson; a + failing journal never breaks the hot path; + * the HTTP endpoints (GET /lessons, POST /lessons, POST /lessons/{id}) and the + GET /queue `lessons` block behave + honour the 
kill-switch; + * pipeline invariants (STAGE_TRANSITIONS / QG_CHECKS) are structurally untouched. +""" +import os +import tempfile + +os.environ["ORCH_DB_PATH"] = os.path.join(tempfile.gettempdir(), "test_lessons.db") +os.environ.setdefault("ORCH_GITEA_TOKEN", "test-token") +os.environ.setdefault("ORCH_PLANE_API_TOKEN", "test-token") + +import pytest # noqa: E402 + +import src.db as db # noqa: E402 +from src import config as cfg # noqa: E402 +from src import lessons # noqa: E402 + +_REPO = "orchestrator" +_WI = "ORCH-098" + + +@pytest.fixture(autouse=True) +def fresh_db(tmp_path, monkeypatch): + """Isolated tmp SQLite DB + journal ON by default.""" + dbfile = tmp_path / "lessons.db" + monkeypatch.setattr(db.settings, "db_path", str(dbfile)) + monkeypatch.setattr(cfg.settings, "lessons_enabled", True, raising=False) + monkeypatch.setattr(cfg.settings, "lessons_query_limit_default", 100, raising=False) + monkeypatch.setattr(cfg.settings, "lessons_dedup_window_s", 3600, raising=False) + db.init_db() + yield + + +def _columns(): + conn = db.get_db() + try: + return {r[1] for r in conn.execute("PRAGMA table_info(lessons)").fetchall()} + finally: + conn.close() + + +# =========================================================================== +# TC-01 — additive + idempotent table with all BR-1 fields +# =========================================================================== +def test_tc01_table_idempotent_and_fields(): + # Double init must not raise nor duplicate. + db.init_db() + db.init_db() + cols = _columns() + for f in ( + "id", "created_at", "updated_at", "lesson_type", "work_item_id", "task_id", + "stage", "agent", "repo", "root_cause", "suggestion", "status", "related_task", + ): + assert f in cols, f"missing column {f}" + # No existing table mutated: tasks/jobs still present and unchanged in shape. 
+ conn = db.get_db() + try: + tabs = { + r[0] for r in conn.execute( + "SELECT name FROM sqlite_master WHERE type='table'" + ).fetchall() + } + finally: + conn.close() + assert {"tasks", "jobs", "agent_runs", "lessons"} <= tabs + + +# =========================================================================== +# TC-02 — attribution columns present from the start, nullable, set later +# =========================================================================== +def test_tc02_attribution_columns_nullable_and_settable(): + cols = _columns() + assert {"attribution", "target_repo", "target_domain"} <= cols + # A record WITHOUT attribution is accepted (NULL). + lid = lessons.record(lessons.LessonType.DEPLOY_DEGRADED, work_item_id=_WI, repo=_REPO) + assert lid is not None + rows = lessons.get(work_item_id=_WI) + assert rows[0]["attribution"] is None + # Attribution can be filled in later via update. + assert lessons.update( + lid, attribution=lessons.Attribution.PLATFORM, + target_repo=_REPO, target_domain=lessons.Domain.RELIABILITY, + ) is True + rows = lessons.get(work_item_id=_WI) + assert rows[0]["attribution"] == "platform" + assert rows[0]["target_domain"] == "reliability" + + +# =========================================================================== +# TC-03 — record() inserts and returns id, created_at filled, source honoured +# =========================================================================== +def test_tc03_record_inserts_and_returns_id(): + lid = lessons.record( + lessons.LessonType.GATE_FAILURE, work_item_id=_WI, task_id=7, stage="review", + agent="reviewer", repo=_REPO, root_cause="REQUEST_CHANGES", source="auto", + ) + assert isinstance(lid, int) and lid > 0 + rows = lessons.get(work_item_id=_WI) + assert len(rows) == 1 + r = rows[0] + assert r["lesson_type"] == "gate_failure" + assert r["task_id"] == 7 + assert r["agent"] == "reviewer" + assert r["source"] == "auto" + assert r["status"] == "new" + assert r["created_at"] + # A manual record 
with a different (work_item, type) -> distinct row. + lid2 = lessons.record("custom_manual", work_item_id="ORCH-1", source="manual") + assert lid2 is not None and lid2 != lid + + +# =========================================================================== +# TC-04 — never-raise: a failing DB -> safe defaults, no exception +# =========================================================================== +def test_tc04_never_raise_on_db_error(monkeypatch): + def boom(*a, **k): + raise RuntimeError("db down") + + monkeypatch.setattr(db, "record_lesson", boom) + monkeypatch.setattr(db, "lessons_recent_dup_exists", lambda *a, **k: False) + monkeypatch.setattr(db, "get_lessons", boom) + monkeypatch.setattr(db, "update_lesson", boom) + monkeypatch.setattr(db, "lessons_snapshot", boom) + + assert lessons.record("gate_failure", work_item_id=_WI) is None + assert lessons.get(work_item_id=_WI) == [] + assert lessons.update(1, status="closed") is False + snap = lessons.snapshot() + assert snap == {"enabled": True} # never-raise -> minimal dict, no exception + + +# =========================================================================== +# TC-05 — kill-switch: lessons_enabled=False -> inert, no DB access +# =========================================================================== +def test_tc05_kill_switch_inert(monkeypatch): + monkeypatch.setattr(cfg.settings, "lessons_enabled", False, raising=False) + + def fail(*a, **k): + raise AssertionError("DB must NOT be touched when kill-switch is off") + + monkeypatch.setattr(db, "record_lesson", fail) + monkeypatch.setattr(db, "get_lessons", fail) + monkeypatch.setattr(db, "update_lesson", fail) + monkeypatch.setattr(db, "lessons_snapshot", fail) + + assert lessons.record("gate_failure", work_item_id=_WI) is None + assert lessons.get(work_item_id=_WI) == [] + assert lessons.update(1, status="closed") is False + assert lessons.snapshot() == {"enabled": False} + + +# 
=========================================================================== +# TC-06 — get_lessons filters + limit + ORDER BY id DESC +# =========================================================================== +def test_tc06_filters_limit_order(): + # Seed rows directly via the DB helper (bypasses the leaf's auto-dedup). + for i in range(5): + db.record_lesson( + lesson_type="gate_failure", work_item_id=f"ORCH-{i}", repo=_REPO, + status="new", source="auto", + ) + db.record_lesson(lesson_type="merge_hold", work_item_id="ORCH-X", repo="enduro-trails", + status="closed", source="auto") + + # Filter by type. + gf = db.get_lessons(lesson_type="gate_failure") + assert len(gf) == 5 and all(r["lesson_type"] == "gate_failure" for r in gf) + # Filter by status. + assert len(db.get_lessons(status="closed")) == 1 + # Filter by repo. + assert len(db.get_lessons(repo="enduro-trails")) == 1 + # Filter by work_item. + assert len(db.get_lessons(work_item_id="ORCH-3")) == 1 + # LIMIT. + assert len(db.get_lessons(lesson_type="gate_failure", limit=2)) == 2 + # ORDER BY id DESC (newest first). 
+ allr = db.get_lessons(limit=100) + got_ids = [r["id"] for r in allr] + assert got_ids == sorted(got_ids, reverse=True) + + +# =========================================================================== +# TC-07 — update_lesson mutates + stamps updated_at; unknown id safe +# =========================================================================== +def test_tc07_update_and_unknown_id(): + lid = db.record_lesson(lesson_type="deploy_degraded", work_item_id=_WI, repo=_REPO, + status="new", source="auto") + before = db.get_lessons(work_item_id=_WI)[0] + assert before["updated_at"] is None + ok = db.update_lesson( + lid, status="in_progress", attribution="both", target_repo=_REPO, + target_domain="reliability", related_task="ORCH-200", + ) + assert ok is True + after = db.get_lessons(work_item_id=_WI)[0] + assert after["status"] == "in_progress" + assert after["attribution"] == "both" + assert after["related_task"] == "ORCH-200" + assert after["updated_at"] is not None + # Unknown id -> no row changed, no raise. + assert db.update_lesson(999999, status="closed") is False + # Empty update (no recognised fields) -> False, safe. + assert db.update_lesson(lid) is False + + +# =========================================================================== +# TC-07b — auto dedup vs manual always-writes (D4) +# =========================================================================== +def test_tc07b_auto_dedup_and_manual_passthrough(): + a = lessons.record("transient_retry", work_item_id=_WI, stage="deploy", source="auto") + b = lessons.record("transient_retry", work_item_id=_WI, stage="deploy", source="auto") + assert a is not None and b is None # second auto deduped in-window + # Manual is never deduped. + m1 = lessons.record("transient_retry", work_item_id=_WI, stage="deploy", source="manual") + m2 = lessons.record("transient_retry", work_item_id=_WI, stage="deploy", source="manual") + assert m1 is not None and m2 is not None and m1 != m2 + # Window=0 disables dedup. 
+ import src.config as c + c.settings.lessons_dedup_window_s = 0 + c2 = lessons.record("transient_retry", work_item_id=_WI, stage="deploy", source="auto") + assert c2 is not None + c.settings.lessons_dedup_window_s = 3600 + + +# =========================================================================== +# TC-08 — wiring: QG rollback to development writes a gate_failure lesson +# =========================================================================== +def test_tc08_gate_failure_autorecord(monkeypatch): + from src import stage_engine as se + + # All side-effecting DB / notifier / plane ops on the rollback path are patched + # to no-ops; only the lessons block reaches the (real tmp) DB — so we assert the + # WIRING (rolled_back_to -> gate_failure lesson) without standing up a full task. + for name in ("notify_stage_change", "plane_notify_stage", "send_telegram", + "set_issue_in_progress", "plane_add_comment", "update_task_stage"): + monkeypatch.setattr(se, name, lambda *a, **k: None, raising=False) + monkeypatch.setattr(se, "extract_test_failures", lambda *a, **k: "", raising=False) + monkeypatch.setattr(se, "_developer_retry_count", lambda *a, **k: 0, raising=False) + monkeypatch.setattr(se, "enqueue_job", lambda *a, **k: 123, raising=False) + + result = se.AdvanceResult() + se._handle_qg_failure_rollbacks( + 99, "testing", _REPO, "ORCH-098", "feature/ORCH-098-fnd", + agent="tester", qg_name="check_tests_passed", reason="2 failed", result=result, + ) + assert result.rolled_back_to == "development" + rows = db.get_lessons(lesson_type="gate_failure", work_item_id="ORCH-098") + assert len(rows) == 1 + r = rows[0] + assert r["stage"] == "testing" + assert r["agent"] == "tester" + assert r["repo"] == _REPO + assert r["source"] == "auto" + assert r["detail"] == "check_tests_passed" + + +# =========================================================================== +# TC-09 — wiring: launcher transient-budget-exhaustion writes a lesson; +# a failing journal never breaks 
the hot path +# =========================================================================== +def test_tc09_transient_autorecord_and_never_raise(monkeypatch): + from src.agents import launcher as lmod + + launcher = lmod.AgentLauncher() + monkeypatch.setattr(launcher, "_notify_failed", lambda *a, **k: None) + monkeypatch.setattr(launcher, "_record_outcome", lambda *a, **k: None) + monkeypatch.setattr(cfg.settings, "transient_max_attempts", 3, raising=False) + + job_id = db.enqueue_job("developer", _REPO, "task", task_id=42) + job = {"transient_attempts": 3, "task_id": 42, "repo": _REPO} + # Budget exhausted (tattempts >= tmax) -> the failed branch records the lesson. + launcher._finalize_transient(job_id, "developer", 1, 99, job, retry_after=None) + + rows = db.get_lessons(lesson_type="transient_retry") + assert len(rows) == 1 + assert rows[0]["repo"] == _REPO + assert rows[0]["agent"] == "developer" + assert rows[0]["source"] == "auto" + + # never-raise in the hot path: a failing record must not break finalisation. + def boom(*a, **k): + raise RuntimeError("journal down") + + monkeypatch.setattr(db, "record_lesson", boom) + monkeypatch.setattr(db, "lessons_recent_dup_exists", lambda *a, **k: False) + job_id2 = db.enqueue_job("developer", _REPO, "task2", task_id=43) + job2 = {"transient_attempts": 3, "task_id": 43, "repo": _REPO} + # Must NOT raise even though the journal insert blows up. 
+ launcher._finalize_transient(job_id2, "developer", 1, 99, job2, retry_after=None) + + +# =========================================================================== +# TC-10 — GET /lessons + GET /queue block; reads do not mutate +# =========================================================================== +def test_tc10_get_endpoints(monkeypatch): + from fastapi.testclient import TestClient + import src.main as main + + db.record_lesson(lesson_type="gate_failure", work_item_id=_WI, repo=_REPO, + status="new", source="auto") + db.record_lesson(lesson_type="merge_hold", work_item_id="ORCH-2", repo="enduro-trails", + status="closed", source="auto") + + client = TestClient(main.app) + + r = client.get("/lessons") + assert r.status_code == 200 + body = r.json() + assert body["enabled"] is True + assert len(body["lessons"]) == 2 + + # Filters. + r = client.get("/lessons", params={"type": "gate_failure"}) + assert len(r.json()["lessons"]) == 1 + r = client.get("/lessons", params={"repo": "enduro-trails"}) + assert len(r.json()["lessons"]) == 1 + r = client.get("/lessons", params={"limit": 1}) + assert len(r.json()["lessons"]) == 1 + + # Reads do not mutate. + assert db.lessons_snapshot()["total"] == 2 + + # GET /queue carries the read-only lessons block. + q = client.get("/queue") + assert q.status_code == 200 + assert "lessons" in q.json() + assert q.json()["lessons"]["enabled"] is True + assert q.json()["lessons"]["total"] == 2 + + +# =========================================================================== +# TC-11 — POST /lessons (manual) + POST /lessons/{id} (update); kill-switch +# =========================================================================== +def test_tc11_post_endpoints_and_killswitch(monkeypatch): + from fastapi.testclient import TestClient + import src.main as main + + client = TestClient(main.app) + + # Manual create with attribution. 
+ r = client.post("/lessons", json={ + "lesson_type": "process_gap", "work_item_id": _WI, "repo": _REPO, + "attribution": "platform", "target_domain": "quality", "root_cause": "manual note", + }) + assert r.status_code == 200 + lid = r.json()["id"] + assert isinstance(lid, int) + rows = db.get_lessons(work_item_id=_WI) + assert rows[0]["source"] == "manual" + assert rows[0]["attribution"] == "platform" + + # Missing lesson_type -> error, no row. + r = client.post("/lessons", json={"work_item_id": "X"}) + assert r.json()["ok"] is False + + # Update via POST /lessons/{id}. + r = client.post(f"/lessons/{lid}", json={"status": "closed", "related_task": "ORCH-300"}) + assert r.json()["ok"] is True + assert db.get_lessons(work_item_id=_WI)[0]["status"] == "closed" + + # Kill-switch off -> endpoints report {"enabled": false}. + monkeypatch.setattr(cfg.settings, "lessons_enabled", False, raising=False) + assert client.get("/lessons").json() == {"enabled": False, "lessons": []} + assert client.post("/lessons", json={"lesson_type": "x"}).json() == {"enabled": False} + assert client.post(f"/lessons/{lid}", json={"status": "new"}).json() == {"enabled": False} + + +# =========================================================================== +# TC-12 — pipeline invariants structurally untouched +# =========================================================================== +def test_tc12_pipeline_invariants_untouched(): + from src.stages import STAGE_TRANSITIONS + from src.qg.checks import QG_CHECKS + + # The journal must not have added/removed a stage edge or a QG check. + assert "development" in STAGE_TRANSITIONS + assert "deploy" in STAGE_TRANSITIONS + # machine-verdict QG checks still registered (sample of the canon set). + for name in ("check_ci_green", "check_tests_passed", "check_coverage_gate"): + assert name in QG_CHECKS + # The journal is NOT a quality gate — no check named after it. + assert not any("lesson" in k.lower() for k in QG_CHECKS)
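
Reviewer note: the diff above relies on `db.lessons_recent_dup_exists` and `db.record_lesson`, whose bodies are not part of this section. The sketch below is one possible way the dedup window (D4) and the never-raise contract (NFR-1) could fit together, using an in-memory SQLite database. Everything in it is an assumption made for illustration: the `created_ts` epoch column, the reduced column set, the `make_db` helper, the explicit `now` parameter (added only to make the example deterministic), and the exact SQL. It is not the project's actual implementation.

```python
import logging
import sqlite3
import time

logger = logging.getLogger("lessons_sketch")


def make_db():
    """Hypothetical helper: in-memory DB with a reduced `lessons` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lessons ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " created_ts REAL NOT NULL,"  # assumed epoch-seconds column
        " lesson_type TEXT NOT NULL,"
        " work_item_id TEXT, stage TEXT, source TEXT)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_lessons_wi_type"
        " ON lessons (work_item_id, lesson_type)"
    )
    return conn


def recent_dup_exists(conn, work_item_id, lesson_type, stage, window_s, now):
    """True if an auto lesson with the same key was written inside the window."""
    row = conn.execute(
        # `IS ?` is SQLite's NULL-safe comparison, so NULL keys also match.
        "SELECT 1 FROM lessons WHERE work_item_id IS ? AND lesson_type = ?"
        " AND stage IS ? AND source = 'auto' AND created_ts >= ? LIMIT 1",
        (work_item_id, lesson_type, stage, now - window_s),
    ).fetchone()
    return row is not None


def record(conn, lesson_type, *, work_item_id=None, stage=None,
           source="auto", window_s=3600, now=None):
    """Insert one lesson; return its id, or None (deduped / empty type / error)."""
    if not lesson_type:
        return None
    try:
        now = time.time() if now is None else now
        # Only auto records are deduped; manual records always pass through.
        if source == "auto" and window_s > 0 and recent_dup_exists(
            conn, work_item_id, lesson_type, stage, window_s, now
        ):
            return None
        cur = conn.execute(
            "INSERT INTO lessons"
            " (created_ts, lesson_type, work_item_id, stage, source)"
            " VALUES (?, ?, ?, ?, ?)",
            (now, lesson_type, work_item_id, stage, source),
        )
        conn.commit()
        return cur.lastrowid
    except Exception as e:  # never-raise: the caller's hot path stays untouched
        logger.warning("lessons sketch: record(%s) error: %s", lesson_type, e)
        return None
```

Under these assumptions, a second auto record with the same `(work_item_id, lesson_type, stage)` inside the window is dropped, a manual record is always written, and an auto record after the window has elapsed is written again. A closed connection yields `None` instead of an exception, which mirrors what TC-04 and TC-07b check against the real module.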