Compare commits

45 Commits

Author SHA1 Message Date
Dev Agent
c23f000c05 fix(preflight): check the binary the launcher actually spawns (ORCH-1)
Container ORCH_CLAUDE_BIN pointed at a non-existent /usr/bin/claude while the
launcher spawns the hardcoded /opt/claude-code/bin/claude.exe. Preflight now
follows AgentLauncher.CLAUDE_BIN (the genuinely executed path), so it no longer
falsely blocks every job in production.
2026-06-03 00:13:44 +03:00
Dev Agent
d0d47058b4 docs(resilience): document preflight/429/backoff/breaker + env vars (ORCH-1) 2026-06-03 00:12:17 +03:00
Dev Agent
a613fd8180 test(resilience): 34 tests for preflight/classifier/backoff/breaker (ORCH-1)
Covers preflight FAIL->queued + cache, transient/permanent classifier +
Retry-After, exp backoff + available_at gating, launcher transient vs permanent
finalize, circuit breaker open/half-open/closed. test_queue worker tests stub
preflight OK. Popen never spawned.
2026-06-03 00:12:17 +03:00
Dev Agent
f314ae09e5 feat(worker): preflight gate + circuit breaker + /queue resilience (ORCH-1)
QueueWorker gates claims behind preflight and the CircuitBreaker (open ->
pause, no CLI calls + Telegram alert; half-open probes one job; closed on
recovery). Wires launcher.on_outcome. /queue exposes resilience snapshot.
2026-06-03 00:12:17 +03:00
Dev Agent
90fdd19394 feat(launcher): classify failures, backoff transient retry, breaker outcome (ORCH-1)
_finalize_job classifies the run log: transient (429/overload) -> backoff
requeue via mark_job_transient with separate transient_attempts budget honouring
Retry-After; permanent -> normal attempts<max. on_outcome callback feeds the
circuit breaker. _backoff_seconds = min(2^n*base, max) | Retry-After.
2026-06-03 00:12:17 +03:00
Dev Agent
4ef87a3959 feat(resilience): cheap preflight + 429/transient error classifier (ORCH-1)
preflight.py: cached CLAUDE_BIN exists + claude --version (no tokens, no
prompt-ping). error_classifier.py: classify_log_file -> transient|permanent
from log tail + Retry-After parsing.
2026-06-03 00:12:17 +03:00
Dev Agent
0cd9b11fe0 feat(queue): resilience schema + backoff helper + config (ORCH-1)
jobs.transient_attempts + available_at columns (idempotent _ensure_column
migration); claim_next_job honours available_at; mark_job_transient (backoff
requeue with separate transient budget). Config: preflight_cache_ttl,
backoff_base/max_seconds, transient_max_attempts, breaker_threshold,
breaker_pause_seconds.
2026-06-03 00:12:17 +03:00
Dev Agent
4be168c0ec docs(queue): document job queue, /queue, env vars (ORCH-1)
ARCHITECTURE job-queue section + flow diagram, README /queue endpoint and
ORCH_MAX_CONCURRENCY/ORCH_QUEUE_POLL_INTERVAL, new docs/ORCH-1_JOB_QUEUE.md.
2026-06-02 23:58:44 +03:00
Dev Agent
2283b8898b test(queue): 19 tests for job queue lifecycle/atomicity/retry/worker (ORCH-1)
Covers enqueue->claim->mark, atomic claim (no double dispatch, 8-thread race),
retry fail->queued->failed, requeue_running_jobs, observability, worker
max_concurrency. Popen fully mocked (no real agent spawned).
2026-06-02 23:58:44 +03:00
Dev Agent
b6d4426a48 feat(worker): background queue worker + lifespan + queue-recovery + /queue (ORCH-1)
queue_worker.QueueWorker drains the queue respecting max_concurrency. main.py
lifespan: queue-recovery (requeue running jobs) after M-1 orphan-recovery, starts
worker and stops it on shutdown. New GET /queue endpoint (counts + recent jobs).
2026-06-02 23:58:44 +03:00
Dev Agent
20d6556e22 refactor(webhooks): enqueue_job instead of in-process launch (ORCH-1)
All 8 webhook launch points (plane x4, gitea x4) now enqueue a job and return
immediately instead of synchronously spawning claude in the uvicorn process.
2026-06-02 23:58:44 +03:00
Dev Agent
3345c2fa0a feat(launcher): launch_job + job-status finalize with retries (ORCH-1)
Refactor launch() into shared _spawn(); add launch_job(job) that threads job_id
through monitor/watchdog. _finalize_job marks done / requeue (attempts<max) /
failed+notify. Internal advance-chain self.launch -> enqueue_job. B-1/B-2/M-1/ORCH-2
spawn logic unchanged.
2026-06-02 23:58:44 +03:00
Dev Agent
fd3dac7d22 feat(queue): add jobs table + queue helpers and config (ORCH-1)
Persistent SQLite job queue (F-2b): jobs table + idx, atomic claim_next_job,
enqueue/mark/count/requeue/get helpers. New settings max_concurrency
(ORCH_MAX_CONCURRENCY) and queue_poll_interval (ORCH_QUEUE_POLL_INTERVAL).
2026-06-02 23:58:44 +03:00
b021ff7cb0 Merge pull request 'ORCH-6: multi-repo (project filter + repo/prefix per project)' (#2) from feature/ORCH-6-multirepo into main 2026-06-02 23:42:29 +03:00
Dev Agent
ca81f38330 docs: document multi-repo registry + ORCH-6 bugfix and incident
ORCH-6: ARCHITECTURE.md gets a project-registry section; README explains
how to add a project via ORCH_PROJECTS_JSON; BUGFIXES_2026-06-03.md
records the fix and links the 2026-06-02 webhook autorun incident.
2026-06-02 22:30:51 +03:00
Dev Agent
c1f35a2047 test(projects,webhook): cover registry resolvers + project filter
ORCH-6: test_projects.py covers resolvers and ORCH_PROJECTS_JSON parsing
(valid/malformed/fallback). test_plane_webhook.py covers the webhook
project filter via TestClient (unknown->ignored, orchestrator->orchestrator
repo, enduro->enduro-trails, independent ORCH/ET prefixes); launcher
mocked. test_webhooks.py: register proj-1 so existing ET fixtures pass.
2026-06-02 22:30:51 +03:00
Dev Agent
a6f6a43c1c fix(webhooks/gitea): ignore pushes/events for repos outside the registry
ORCH-6: get_project_by_repo None -> ignored, so events for unknown repos
do not trigger the pipeline.
2026-06-02 22:30:42 +03:00
Dev Agent
171f4eb304 fix(webhooks/plane): filter by project + resolve repo/prefix from registry
ORCH-6 / incident 2026-06-02: ignore work items from unknown Plane
projects (status=ignored) instead of funneling everything into
default_repo. Resolve repo, work-item prefix and Plane sync project from
the registry by data.project.
2026-06-02 22:30:42 +03:00
Dev Agent
a87c633003 refactor(plane_sync): parameterize project_id (backward compatible)
ORCH-6: sync functions resolve the issue PROJECT_ID via the registry
(get_project_by_repo) and accept project_id; default stays enduro so
existing ET callers keep working.
2026-06-02 22:30:42 +03:00
Dev Agent
0797f958dc feat(db): per-project work-item prefix in get_next_work_item_id
ORCH-6: get_next_work_item_id(repo, prefix="ET") numbers per (repo, prefix)
so orchestrator issues number ORCH-001 independently of the ET sequence.
Default prefix stays ET for backward compatibility.
2026-06-02 22:30:42 +03:00
Dev Agent
36d5f25f2a feat(projects): add project registry (Plane id -> repo/prefix mapping)
ORCH-6: src/projects.py introduces ProjectConfig + resolvers
(get_project_by_plane_id/by_repo, known_plane_project_ids) keyed by
Plane project uuid. Source: ORCH_PROJECTS_JSON env (config.projects_json),
with a built-in default registry (enduro-trails + orchestrator) and
robust parsing (malformed JSON/entries fall back to default).
2026-06-02 22:30:42 +03:00
Dev Agent
1ebe8afc23 feat(worktree): git worktree per task to isolate shared /repos (ORCH-2 / S-4)
- add src/git_worktree.py: ensure/remove/get_worktree_path
- config: worktrees_dir=/repos/_wt
- launcher: agent runs in per-branch worktree; task-file + commit/push in worktree; no shared checkout
- qg/checks: read artifacts + run make test from worktree (branch arg, backward-compatible)
- webhooks/plane: pass branch into QG dispatch; review fallback from worktree
- webhooks/gitea: keep read-only branch --contains in main clone (documented)
- tests: test_git_worktree.py (isolation) + update test_launcher write-task-file
- docs: ARCHITECTURE worktree section + BUGFIXES_2026-06-02_ORCH2

Preserves B-1/B-2/S-1/S-5 fixes (paths now point at worktree).
2026-06-02 21:12:06 +03:00
Dev Agent
66a37612fd docs(bugfixes): add safe.directory, init:true findings and autonomy test result 2026-06-02 20:22:51 +03:00
Dev Agent
57cca14ed3 fix(compose): init:true (PID1 reaper) to reap claude grandchild zombies (B-2) 2026-06-02 20:20:33 +03:00
Dev Agent
5de8462a13 fix(docker): trust /repos for git (safe.directory) so launcher commit/push works 2026-06-02 20:18:44 +03:00
Dev Agent
553e0aae0c docs: update QG table, task-file write, orphan recovery; add BUGFIXES_2026-06-02 2026-06-02 20:12:29 +03:00
Dev Agent
67b9f814b5 test(launcher): cover _write_task_file and reviewer verdict parsing (L-5) 2026-06-02 20:12:29 +03:00
Dev Agent
212352997e fix(main): proper orphan recovery with per-run warning + notify (M-1) 2026-06-02 20:12:29 +03:00
Dev Agent
b585701c62 fix(webhooks): dispatch new QGs; stop false Gitea CI alerts (S-1)
- plane._try_advance_stage handles check_tests_local + check_reviewer_verdict
- gitea.handle_ci_status: failure -> debug log only (CI not authoritative)
2026-06-02 20:12:29 +03:00
Dev Agent
0924783be3 fix(qg): frontmatter-only reviewer verdict + local test gate (S-5, S-1)
- check_reviewer_verdict reads verdict: from YAML frontmatter of 12-review.md only
- add check_tests_local: orchestrator runs make test in /repos/<repo>
- stages: development QG -> check_tests_local
2026-06-02 20:12:29 +03:00
Dev Agent
265a5ef1e6 fix(launcher): write task file to /repos without docker; stdout->file, no PIPE zombies (B-1, B-2)
- _write_task_file writes directly to mounted /repos/<repo>, raises on failure
- Popen stdout=log_fh at OS level; _monitor_agent simplified to proc.wait()+close
- remove PIPE reader thread and startup-timeout (watchdog by pid stays)
- dispatch check_tests_local args (repo, branch)
2026-06-02 20:12:29 +03:00
Dev Agent
f575f6bc6a chore: save WIP changes before audit fixes
- notifications: Telegram integration, richer stage/agent/QG notifications
- plane_sync: explicit Plane state IDs, needs_input/in_review/blocked helpers, links in comments
- launcher: deployer stage, model flag (opus), PR auto-create, REQUEST_CHANGES/tester/architect rollback+retry logic, partial check_reviewer_verdict path
- qg/checks: add check_reviewer_verdict (substring-based, will be hardened in S-5)
- stages: review->check_reviewer_verdict, testing->deployer agent
- config: telegram_bot_token/chat_id settings
2026-06-02 19:57:43 +03:00
claude-bot
8715dd7148 feat(deploy): SSH key mount, deploy env vars, openssh-client in image 2026-06-01 20:03:27 +03:00
Dev Agent
e27e489157 fix(plane-webhook): read issue/comment_stripped fields from Plane comment payload 2026-06-01 19:17:14 +03:00
51f7364532 feat: integrate Analyst into Plane/Orchestrator pipeline
- Add git fetch+checkout in agent launch cmd (ensures correct branch)
- Add git fetch+checkout in _monitor_agent before commit/push
- Post start comment in Plane when analyst launches
- Post :approved: request comment after analyst completes successfully
- Branch lookup moved before cmd construction for reuse
2026-05-31 20:15:01 +03:00
Dev Agent
81e0e383e0 feat(analysis): add check_analysis_approved QG with stakeholder approval requirement
- stages.py: QG renamed to check_analysis_approved (requires :approved: comment)
- qg/checks.py: new check_analysis_approved verifies files + Plane :approved: comment
- launcher.py: skip auto-advance for analysis stage (requires human approval)
- plane.py: route check_analysis_approved in _try_advance_stage
- docs/ARCHITECTURE.md: updated QG table and flow description
2026-05-31 15:19:03 +03:00
Dev Agent
0f0b984656 docs: add pipeline design backlog (audit + backlog mgmt) 2026-05-23 09:17:41 +03:00
Dev Agent
267bc58fb2 docs: update README, add ARCHITECTURE.md with full system documentation 2026-05-22 14:09:24 +03:00
Dev Agent
0ad56e1f0a fix: tini entrypoint, event routing wildcard, orphan recovery 2026-05-22 13:52:46 +03:00
Dev Agent
c326ef0ac4 docs: lessons learned ET-006 — problems and solutions 2026-05-22 13:45:40 +03:00
Dev Agent
b545665e2d feat: full pipeline fixes - CI status branch lookup, review webhook routing, auto-advance, plane sync
- handle_ci_status: fallback git branch -r --contains when branches[] empty
- webhook router: handle pull_request_approved event type
- handle_pr: map review.type to review.state for new Gitea format
- launcher: auto-advance stage after agent completion (_try_advance_stage)
- plane_sync: notify Plane on stage changes
- stages.py: stage machine with QG definitions
- notifications.py: stage change notifications
- safe.directory fix for container git operations
2026-05-22 01:57:02 +03:00
Dev Agent
b428163c32 docs: bugfixes 2026-05-21 (5 fixes for CI status, review webhook, auto-advance) 2026-05-22 01:56:47 +03:00
Dev Agent
3116ae67bb chore: clean up .gitignore, remove cached files from tracking 2026-05-19 15:58:45 +03:00
Dev Agent
95072e000f fix: tests — add setup_db fixture for init_db in test env 2026-05-19 15:58:37 +03:00
Dev Agent
8859c38a2a chore: add .gitignore, remove .env from tracking 2026-05-19 15:57:13 +03:00
39 changed files with 6283 additions and 145 deletions

10
.env

@@ -1,10 +0,0 @@
ORCH_PLANE_API_URL=http://plane-app-api-1:8000
ORCH_PLANE_API_TOKEN=
ORCH_PLANE_WORKSPACE_SLUG=
ORCH_PLANE_WEBHOOK_SECRET=
ORCH_GITEA_URL=http://localhost:3000
ORCH_GITEA_TOKEN=c81227b0dee2217f9ab3d28c3642a4578a1b9772
ORCH_GITEA_WEBHOOK_SECRET=
ORCH_CLAUDE_BIN=/usr/bin/claude
ORCH_REPOS_DIR=/home/slin/repos
ORCH_DB_PATH=/app/data/orchestrator.db

7
.gitignore (new file)

@@ -0,0 +1,7 @@
.env
.venv/
__pycache__/
*.pyc
data/
*.db
.pytest_cache/


@@ -1,7 +1,11 @@
FROM python:3.12-slim
WORKDIR /app
RUN apt-get update -qq && apt-get install -y -qq openssh-client git && rm -rf /var/lib/apt/lists/*
# git operations run as root over bind-mounted /repos (may be owned by host uid) -> trust it.
RUN git config --system --add safe.directory '*'
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ src/
RUN mkdir -p /app/data/runs
COPY src/ ./src/
COPY data/ ./data/
ENV PYTHONPATH=/app
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8500"]

212
README.md

@@ -1,70 +1,220 @@
# Multi-Agent Orchestrator
A FastAPI service that orchestrates a multi-agent development pipeline.
A FastAPI service that orchestrates a multi-agent development pipeline. It receives webhooks from Plane and Gitea, manages the task lifecycle through Quality Gates, and launches Claude CLI agents at each stage.
## What it does
## Architecture
- Receives webhooks from **Plane** (task management) and **Gitea** (git events)
- Checks Quality Gates before moving between stages
- Launches **Claude CLI** agents (analyst, architect, developer, reviewer, tester)
- Keeps an event log in SQLite
```
Plane (task mgmt) ──webhook──┐
├──► Orchestrator (FastAPI) ──► Quality Gates ──► Agent Launcher
Gitea (git events) ─webhook──┘ │ │
▼ ▼
SQLite DB Claude CLI
(events, tasks, (analyst, architect,
agent_runs) developer, reviewer, tester)
```
## Pipeline stages
```
created → analysis → architecture → development → review → testing → deploy → done
↑ │
└─── REQUEST_CHANGES ─┘ (max 3 retries)
```
| Stage | Agent | Quality Gate (exit) | Transition trigger |
|--------|-------|---------------------|------------------|
| created | — | — | Plane webhook (work_item.created) |
| analysis | analyst | BRD/TRZ/AC/TestPlan files | Push docs/ |
| architecture | architect | ADR or infra-requirements | Push docs/ |
| development | developer | check_tests_local (the orchestrator runs `make test` itself) | Auto-advance after developer |
| review | reviewer | check_reviewer_verdict (`verdict:` in the frontmatter of 12-review.md) | Auto-advance after reviewer |
| testing | tester | Test report with PASS | Auto-advance after tester |
| deploy | deployer | — | SSH deploy-hook |
| done | — | — | — |
## API Endpoints
| Method | Path | Description |
|--------|------|----------|
| GET | `/health` | Health check |
| GET | `/status` | Active tasks |
| GET | `/status` | Active tasks (stage != done) |
| GET | `/queue` | Job queue (ORCH-1): counts by status + max_concurrency + the last 10 jobs |
| POST | `/webhook/plane` | Plane webhook receiver |
| POST | `/webhook/gitea` | Gitea webhook receiver |
## Setup
## Project structure
```bash
cp .env.example .env
# Fill in the tokens in .env
```
src/
├── main.py # FastAPI app, lifespan (orphan recovery)
├── config.py # Pydantic settings (env vars)
├── db.py # SQLite: init, get_db, update_task_stage
├── stages.py # State machine (transitions, agents, QG)
├── notifications.py # Notifications (logging)
├── plane_sync.py # Status sync with the Plane API
├── queue_worker.py # ORCH-1: background queue worker (claim → launch_job)
├── agents/
│ └── launcher.py # AgentLauncher: launch/launch_job, monitor, watchdog, auto-advance
├── webhooks/
│ ├── plane.py # Plane webhook handler
│ └── gitea.py # Gitea webhook handler (push, PR, CI status)
└── qg/
└── checks.py # Quality Gate checks (filesystem + Gitea API)
data/
├── orchestrator.db # SQLite database
└── runs/ # Agent output logs ({run_id}.log)
docs/
├── ARCHITECTURE.md # Detailed architecture
├── LESSONS_ET006.md # Lessons learned from ET-006
├── BUGFIXES_2026-05-21.md # Bug fixes
└── SETUP_WEBHOOKS.md # Webhook setup
docker-compose.yml # Deployment config
Dockerfile # Python 3.12 + Docker CLI + tini
```
## Run (Docker)
## Run
### Docker (production)
```bash
docker compose up -d --build
```
## Run (dev)
### Dev
```bash
pip install -r requirements.txt
uvicorn src.main:app --reload --port 8500
```
## Tests
## Configuration
```bash
pip install pytest
pytest tests/ -v
```
## Environment variables
All variables use the `ORCH_` prefix:
| Variable | Description | Default |
|-----------|----------|---------|
| `ORCH_PLANE_API_URL` | Plane API URL | `http://localhost:8091` |
| `ORCH_PLANE_API_TOKEN` | Plane API token | — |
| `ORCH_PLANE_WEBHOOK_SECRET` | Webhook secret for verification | — |
| `ORCH_PLANE_WEBHOOK_SECRET` | Webhook secret | — |
| `ORCH_PLANE_WORKSPACE_SLUG` | Workspace slug | — |
| `ORCH_PLANE_PROJECT_ID` | Project UUID | — |
| `ORCH_GITEA_URL` | Gitea URL | `http://localhost:3000` |
| `ORCH_GITEA_TOKEN` | Gitea API token | — |
| `ORCH_GITEA_WEBHOOK_SECRET` | Gitea webhook secret | — |
| `ORCH_CLAUDE_BIN` | Path to the Claude CLI | `/usr/bin/claude` |
| `ORCH_REPOS_DIR` | Directory with repositories | `/home/slin/repos` |
| `ORCH_DB_PATH` | Path to the SQLite DB | `/app/data/orchestrator.db` |
| `ORCH_GITEA_OWNER` | Gitea repo owner | `admin` |
| `ORCH_DEFAULT_REPO` | Default repository (fallback) | `enduro-trails` |
| `ORCH_PROJECTS_JSON` | Multi-repo registry (JSON array, ORCH-6) | `""` → default in `src/projects.py` |
| `ORCH_CLAUDE_BIN` | Path to the Claude CLI | `/opt/claude-code/bin/claude.exe` |
| `ORCH_REPOS_DIR` | Repos dir (container) | `/repos` |
| `ORCH_HOST_REPOS_DIR` | Repos dir (host) | `/home/slin/repos` |
| `ORCH_DB_PATH` | SQLite path | `/app/data/orchestrator.db` |
| `ORCH_MAX_CONCURRENCY` | Number of jobs the worker runs in parallel (ORCH-1) | `1` |
| `ORCH_QUEUE_POLL_INTERVAL` | Worker queue polling interval, sec (ORCH-1) | `2.0` |
| `ORCH_PREFLIGHT_CACHE_TTL` | Preflight cache (CLI/net), sec (ORCH-1 resilience) | `45` |
| `ORCH_BACKOFF_BASE_SECONDS` | Exp-backoff base for transient failures (429) | `10` |
| `ORCH_BACKOFF_MAX_SECONDS` | Backoff ceiling | `600` |
| `ORCH_TRANSIENT_MAX_ATTEMPTS` | Retries for 429/unavailability | `5` |
| `ORCH_BREAKER_THRESHOLD` | Consecutive transient failures before the breaker opens | `3` |
| `ORCH_BREAKER_PAUSE_SECONDS` | Pause while the breaker is open | `300` |
## Architecture
## Job queue (ORCH-1 / F-2b)
Webhook handlers no longer spawn claude agents synchronously inside the uvicorn process.
Instead, they put a **job** into the persistent SQLite `jobs` table
(`enqueue_job`, immediate response), and a background worker (`src/queue_worker.py`)
picks jobs up within the `ORCH_MAX_CONCURRENCY` limit and launches the agent (`launch_job`,
the same Popen logic as before).
Benefits:
- **Restart-safe.** On startup, jobs in the `running` status are returned to `queued`
(queue-recovery in lifespan), so no work is lost.
- **Concurrency limit.** The worker never exceeds `ORCH_MAX_CONCURRENCY`.
- **Retries.** A failed job (exit≠0) is retried while `attempts < max_attempts`,
then marked `failed` with a Telegram notification.
Job statuses: `queued → running → done | failed`. Observability is available through `GET /queue`.
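A minimal sketch of how an atomic claim over a SQLite `jobs` table could look. The table layout, the `init_jobs` helper and the exact SQL are assumptions made for illustration; only the names `jobs`, `claim_next_job`, `attempts` and `available_at` come from the commit messages, and the real `src/db.py` may differ.

```python
import sqlite3
import time


def init_jobs(conn):
    # Hypothetical schema: only the columns this sketch needs.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               id INTEGER PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'queued',
               attempts INTEGER NOT NULL DEFAULT 0,
               available_at REAL NOT NULL DEFAULT 0
           )"""
    )


def claim_next_job(conn, now=None):
    """Move the oldest eligible queued job to 'running' and return its id."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock before the SELECT, so two workers
    # cannot both pick the same row.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' AND available_at <= ? "
            "ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running', attempts = attempts + 1 "
                "WHERE id = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
        return None if row is None else row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The connection has to be opened with `isolation_level=None` so that the explicit `BEGIN IMMEDIATE` is not preceded by an implicit transaction from the `sqlite3` module.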
**Resilience layer:** a cheap preflight (CLI/net, cached, no tokens) gates the claim;
429/overload is detected from the log (transient vs permanent); transient failures are retried with
exp-backoff (`available_at`, Retry-After); the circuit breaker pauses the worker after N
consecutive transient failures. Details: `docs/ORCH-1_JOB_QUEUE.md`.
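The backoff and breaker behaviour can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: the function and class names are hypothetical, the defaults mirror the documented env defaults (base 10 s, ceiling 600 s, threshold 3, pause 300 s), and it assumes that a `Retry-After` value, when present, takes precedence over the computed delay.

```python
def backoff_seconds(attempt, base=10.0, max_seconds=600.0, retry_after=None):
    """Delay before the next transient retry: min(2^n * base, max)."""
    if retry_after is not None:
        # Assumption: an explicit Retry-After wins over the computed delay.
        return float(retry_after)
    return min((2 ** attempt) * base, max_seconds)


class CircuitBreaker:
    """Opens after `threshold` consecutive transient failures."""

    def __init__(self, threshold=3, pause_seconds=300.0):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self.consecutive = 0
        self.opened_at = None

    def state(self, now):
        if self.opened_at is None:
            return "closed"
        if now - self.opened_at >= self.pause_seconds:
            return "half-open"  # the worker may probe a single job
        return "open"           # the worker pauses, no CLI calls

    def record(self, transient, now):
        if not transient:
            # Any non-transient outcome counts as recovery in this sketch.
            self.consecutive = 0
            self.opened_at = None
            return
        self.consecutive += 1
        if self.consecutive >= self.threshold:
            self.opened_at = now  # open, or re-open after a failed probe
```

With the defaults, the computed delays are 10, 20, 40, 80 seconds and so on, capped at 600 seconds.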
## Multi-repo: project registry (ORCH-6)
The orchestrator serves several repositories through a project registry
(`src/projects.py`), keyed by **Plane project id**. The Plane webhook filters events
by project (unknown project → `ignored`) and resolves `repo` / `work_item_prefix` /
the Plane project from the mapping.
By default (when `ORCH_PROJECTS_JSON` is empty) two projects are registered:
| Project | Plane project id | repo | prefix |
|--------|------------------|------|--------|
| enduro-trails | `7a79f0a9-5278-49cd-9007-9a338f238f9c` | `enduro-trails` | `ET` |
| orchestrator | `8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a` | `orchestrator` | `ORCH` |
### How to add a new project
1. Make sure the gitea repo is already cloned into `/repos/<repo>` (auto-clone is handled separately).
2. Find the Plane project uuid (from the project URL in Plane or via the Plane API).
3. Add an entry to `ORCH_PROJECTS_JSON` in `.env` (a JSON array). **Important:** if you
set `ORCH_PROJECTS_JSON`, it fully replaces the default, so list **all** the
projects you need (including enduro-trails and orchestrator):
```bash
ORCH_PROJECTS_JSON='[
{"plane_project_id":"7a79f0a9-5278-49cd-9007-9a338f238f9c","repo":"enduro-trails","work_item_prefix":"ET","name":"enduro-trails"},
{"plane_project_id":"8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a","repo":"orchestrator","work_item_prefix":"ORCH","name":"orchestrator"},
{"plane_project_id":"<new-uuid>","repo":"<new-repo>","work_item_prefix":"<PREFIX>","name":"<name>"}
]'
```
4. Rebuild: `docker compose up -d --build`.
5. Check the resolution:
```bash
docker exec orchestrator python3 -c "from src.projects import get_project_by_plane_id as g; print(g('<new-uuid>'))"
```
The `name` field is optional (defaults to `repo`). Details: `docs/ARCHITECTURE.md`.
## Key mechanisms
### Auto-advance
After an agent finishes successfully (exit_code=0), `_try_advance_stage()` checks the QG, automatically advances the task, and launches the next agent.
### Review bounce
On REQUEST_CHANGES from the reviewer, the task is rolled back to development and the developer is relaunched (up to 3 attempts). Once the attempts are exhausted, the task is escalated.
### Orphan recovery (M-1)
On container start, every run with `finished_at IS NULL` older than 35 minutes is marked exit_code=-1, a per-run warning is logged, and a Telegram notification "manual check/restart needed" is sent (not silently).
### Writing task files (B-1)
Task files `.task-*.md` are written **directly into the mounted volume `/repos/<repo>/`** (without docker). A write error raises RuntimeError (no silent failure). They are listed in the project's `.gitignore`.
### Agent logs (B-2)
The agent's stdout/stderr are redirected IMMEDIATELY to `/app/data/runs/{id}.log` at the OS level (no PIPE). The monitor thread calls `proc.wait()` → real exit_code, no zombies.
### Watchdog
Each agent has a 30-minute timeout. When it is exceeded: SIGKILL + exit_code=-9 is recorded.
### Event routing
Gitea events are routed by type:
- `push` → file check, advance architecture/development
- `pull_request*` (wildcard) → review approved/rejected, PR merge
- `status` → (legacy) Gitea CI; S-1: no longer authoritative, `failure` is logged at debug level and neither blocks nor alerts (the development QG is the local `check_tests_local`)
## Tests
```bash
pytest tests/ -v
```
Plane webhook ──┐
├──► Orchestrator ──► Quality Gates ──► Agent Launcher ──► Claude CLI
Gitea webhook ──┘ │
SQLite (events, tasks, agent_runs)
```
## Known limitations
1. **Single-task / shared `/repos` checkout**: only one task can be processed safely at a time, because all agents and `check_tests_local` run `git checkout` in the same `/repos/<repo>` → races with parallel tasks. The fix is a git worktree per task (S-4, tracked separately).
2. **Plane sync**: the issue ID mapping may be incorrect (P3, in progress)
3. **In-process daemon threads**: agents live in uvicorn threads; on restart they are caught by orphan recovery. The target design is a job queue (F-2b)
4. **Gitea CI is not configured**: the orchestrator runs the tests locally itself
3. **Tester timeout**: e2e tests with Playwright can take >25 min on heavy features
4. **No retry on API errors**: httpx calls to Gitea/Plane have no retry logic


@@ -3,11 +3,25 @@ services:
build: .
container_name: orchestrator
restart: unless-stopped
ports:
- "127.0.0.1:8500:8500"
# init: true injects docker-init (tini) as PID 1 so reparented grandchild
# processes from the claude/node subprocess tree are reaped (no zombies, B-2).
init: true
network_mode: host
volumes:
- ./data:/app/data
- /home/slin/repos:/repos:ro
- /home/slin/repos:/repos
- /var/run/docker.sock:/var/run/docker.sock
- /usr/lib/node_modules/@anthropic-ai/claude-code:/opt/claude-code:ro
- /usr/bin/node:/usr/bin/node:ro
- /home/slin/.claude:/home/slin/.claude
- /home/slin/.claude.json:/home/slin/.claude.json:ro
- /home/slin/.orchestrator-ssh:/root/.ssh:ro
env_file: .env
environment:
- ORCH_REPOS_DIR=/repos
- ORCH_HOST_REPOS_DIR=/home/slin/repos
- DEPLOY_SSH_USER=slin
- DEPLOY_SSH_HOST=127.0.0.1
- DEPLOY_HOOK_SCRIPT=/home/slin/bin/enduro-deploy-hook.sh
group_add:
- "999"

335
docs/ARCHITECTURE.md (new file)

@@ -0,0 +1,335 @@
# Orchestrator Architecture
## Overview
Orchestrator is an event-driven FastAPI service that manages the lifecycle of development tasks through a multi-agent pipeline. Each task passes through fixed stages, and a specialized Claude CLI agent works at each one.
## Components
### 1. Webhook Receivers
#### Plane Webhook (`src/webhooks/plane.py`)
- **Project filter (ORCH-6):** extracts `data.project` (Plane project uuid) and ignores the event if the project is not in the registry (`known_plane_project_ids()`) → response `{"status":"ignored","reason":"unknown project"}`. This prevents a repeat of the 2026-06-02 incident (a workspace-wide webhook without a filter).
- Receives `work_item.created`: resolves repo/prefix/Plane project from the registry by `project`, creates the task in the DB, launches the analyst
- Receives `work_item.updated`: status sync
#### Project registry (`src/projects.py`, multi-repo, ORCH-6)
Maps **Plane project id → (repo, work_item_prefix, name)**. This lets a single
orchestrator serve several repositories without mixing them up.
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectConfig:
    plane_project_id: str  # uuid of the Plane project (registry key)
    repo: str              # gitea repo name (= folder in /repos)
    work_item_prefix: str  # ET / ORCH
    name: str              # human-readable name
```
Resolvers:
- `get_project_by_plane_id(uuid) -> ProjectConfig | None`: used for filtering/resolving in the plane webhook.
- `get_project_by_repo(repo) -> ProjectConfig | None`: used when only the repo is known (gitea webhook, plane_sync).
- `known_plane_project_ids() -> set[str]`: the set of allowed projects (the filter).
**Configuration source:** env `ORCH_PROJECTS_JSON` (a JSON array of `ProjectConfig`).
If it is empty or malformed JSON, the built-in default registry (enduro-trails + orchestrator) is used
so that the system works out of the box. Parsing is robust: broken entries are skipped,
and fully invalid JSON → fallback to the default.
Multi-repo consequences:
- **repo per project:** `repo = get_project_by_plane_id(project_id).repo` instead of the hardcoded `default_repo`.
- **prefix per project:** `get_next_work_item_id(repo, prefix)` numbers independently: `ORCH-001` vs `ET-010` (`src/db.py`).
- **plane_sync into the right project:** state/comment are written to the task's own Plane project (resolved by repo via `get_project_by_repo`) rather than to a single hardcoded `PROJECT_ID` (backward compatibility is preserved by defaulting to enduro).
- **gitea webhook:** a push to a repo outside the registry → `ignored` (does not trigger the pipeline).
#### Gitea Webhook (`src/webhooks/gitea.py`)
- **push**: checks for artifacts (docs/, src/), advances the stage
- **pull_request\*** (wildcard): handles review approved/rejected, PR merge
- **status**: CI green/failure, advances development → review
### 2. State Machine (`src/stages.py`)
A linear pipeline with a single possible rollback (review → development):
```
STAGE_TRANSITIONS = {
created: → analysis (agent: None)
analysis: → architecture (agent: architect, QG: check_analysis_approved)
architecture: → development (agent: developer, QG: check_architecture_done)
development: → review (agent: reviewer, QG: check_tests_local)
review: → testing (agent: tester, QG: check_reviewer_verdict)
testing: → deploy (agent: deployer, QG: check_tests_passed)
deploy: → done (agent: None, QG: None)
}
```
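The pseudocode table above could be expressed as plain Python data along these lines. The `Transition` tuple and `pipeline_order` helper are hypothetical names for illustration, and reading `agent` as "the agent launched on entering the next stage" is an assumption; the actual `src/stages.py` may be structured differently.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    next_stage: str
    agent: Optional[str]         # assumed: agent launched on entering next_stage
    quality_gate: Optional[str]  # QG that must pass to leave the current stage


STAGE_TRANSITIONS = {
    "created": Transition("analysis", None, None),
    "analysis": Transition("architecture", "architect", "check_analysis_approved"),
    "architecture": Transition("development", "developer", "check_architecture_done"),
    "development": Transition("review", "reviewer", "check_tests_local"),
    "review": Transition("testing", "tester", "check_reviewer_verdict"),
    "testing": Transition("deploy", "deployer", "check_tests_passed"),
    "deploy": Transition("done", None, None),
}


def pipeline_order(start="created"):
    """Walk the linear pipeline from `start` to the terminal stage."""
    order = [start]
    while order[-1] in STAGE_TRANSITIONS:
        order.append(STAGE_TRANSITIONS[order[-1]].next_stage)
    return order
```

Keeping the transitions as data makes the review → development rollback the only special case that needs code outside the table.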
### 3. Quality Gates (`src/qg/checks.py`)
| Check | Verification method |
|-------|---------------|
| check_analysis_approved | Filesystem: 4 files + :approved: comment in Plane |
| check_architecture_done | Filesystem: ADR dir or infra-requirements.md |
| check_tests_local | The orchestrator itself runs `make test` in the **task worktree** `/repos/_wt/<repo>/<branch>` (judged by exit code). Replaced check_ci_green: Gitea CI is not configured. Worktree isolation → safe with parallel tasks (ORCH-2 / S-4). |
| check_reviewer_verdict | Filesystem: reads `verdict: APPROVED\|REQUEST_CHANGES` from the YAML frontmatter of `12-review.md` (only the machine-readable field, not substrings in the text) |
| check_tests_passed | Filesystem: test-report.md contains "PASS" |
| check_ci_green | (legacy) Gitea API: GET /commits/{branch}/status; no longer used as the development QG |
| check_review_approved | (legacy) Gitea API: GET /pulls/{n}/reviews; not used in STAGE_TRANSITIONS |
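To illustrate the frontmatter-only rule for `check_reviewer_verdict`, here is a small sketch that reads `verdict:` strictly from a leading `---` block and ignores the document body. `read_verdict` is a hypothetical helper, and it uses a line-based match instead of a YAML parser, which is a simplification of whatever the real check does.

```python
import re

_VERDICT_RE = re.compile(r"^verdict:\s*(APPROVED|REQUEST_CHANGES)\s*$")


def read_verdict(text):
    """Return the verdict from the leading frontmatter block, or None."""
    lines = [line.strip() for line in text.splitlines()]
    if not lines or lines[0] != "---":
        return None  # no frontmatter at the top of the file
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        return None  # unterminated frontmatter is treated as missing
    for line in lines[1:end]:
        match = _VERDICT_RE.match(line)
        if match:
            return match.group(1)
    return None
```

A review that merely mentions "APPROVED" in its prose yields `None` here, which is the point of moving away from substring matching.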
### 4. Agent Launcher (`src/agents/launcher.py`)
Launches the Claude CLI as a subprocess:
```bash
claude.exe --print --system-prompt --allowedTools Read,Write,Edit,Bash
```
Each launch:
1. Records the run in the DB (agent_runs)
2. Starts the subprocess. **stdout/stderr are redirected IMMEDIATELY to the file `/app/data/runs/{id}.log` at the OS level** (Popen `stdout=log_fh`). There is no PIPE in the orchestrator's memory → no PIPE deadlock, no reader thread, no zombies (B-2).
3. Starts a **watchdog thread** (30 min timeout → SIGKILL by pid)
4. Starts a **monitor thread**: `proc.wait()` (guaranteed reap → real exit_code in the DB) → closes log_fh → git commit/push → auto-advance
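Steps 2 to 4 can be sketched in a simplified, synchronous form. This is an illustration under assumptions, not the launcher's code: `run_logged` and `demo` are hypothetical names, the watchdog is a `threading.Timer` that calls `Popen.kill()` rather than a dedicated thread killing by pid, and the DB and git steps are left out.

```python
import os
import subprocess
import sys
import tempfile
import threading


def run_logged(cmd, log_path, timeout_seconds):
    """Run cmd with stdout/stderr written straight to log_path; return exit code."""
    with open(log_path, "wb") as log_fh:
        # The child writes to the file descriptor directly: no PIPE, no reader thread.
        proc = subprocess.Popen(cmd, stdout=log_fh, stderr=subprocess.STDOUT)
        # Watchdog: kill the child if it outlives the timeout.
        watchdog = threading.Timer(timeout_seconds, proc.kill)
        watchdog.daemon = True
        watchdog.start()
        try:
            return proc.wait()  # reaps the child and yields the real exit code
        finally:
            watchdog.cancel()


def demo():
    """Run a trivial child process and return (exit_code, log_text)."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "run.log")
        code = run_logged([sys.executable, "-c", "print('hello from agent')"], log_path, 30.0)
        with open(log_path, "r", encoding="utf-8") as fh:
            return code, fh.read()
```

Because the log file is handed to the child at spawn time, output survives even if the parent never reads it, and `proc.wait()` is the only call needed to avoid leaving a zombie.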
### 5. Auto-advance (`launcher._try_advance_stage`)
After an agent finishes successfully:
1. Determines the task's current stage
2. Checks the QG for leaving the stage
3. If the QG passes, advances the stage
4. Launches the next agent (if one is defined)
Note: the `review → testing` transition uses `check_reviewer_verdict` (read from the frontmatter of `12-review.md`); `development → review` uses `check_tests_local` (the orchestrator runs the tests itself and does not depend on Gitea CI).
### 6. Review Bounce
On REQUEST_CHANGES:
1. Counts the developer runs for the task
2. If < MAX_DEV_RETRIES (3): rolls back to development, relaunches the developer
3. If >= MAX_DEV_RETRIES: escalation (logging + notification)
## Database Schema
```sql
-- Tasks
CREATE TABLE tasks (
id INTEGER PRIMARY KEY,
work_item_id TEXT, -- Plane issue identifier (e.g. "ET-006")
plane_issue_id TEXT, -- Plane UUID
repo TEXT,
branch TEXT,
stage TEXT DEFAULT 'created',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Agent runs
CREATE TABLE agent_runs (
id INTEGER PRIMARY KEY,
task_id INTEGER REFERENCES tasks(id),
agent TEXT, -- analyst/architect/developer/reviewer/tester
started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
finished_at TIMESTAMP,
exit_code INTEGER,
output_path TEXT -- /app/data/runs/{id}.log
);
-- Raw events
CREATE TABLE events (
id INTEGER PRIMARY KEY,
source TEXT, -- plane/gitea
event_type TEXT,
payload TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
## Deployment
### Docker Compose
```yaml
services:
  orchestrator:
    build: .
    container_name: orchestrator
    restart: unless-stopped
    network_mode: host
    volumes:
      - ./data:/app/data                              # SQLite + logs
      - /home/slin/repos:/repos                       # Git repositories
      - /var/run/docker.sock:/var/run/docker.sock     # Docker CLI
      - claude-code:/opt/claude-code:ro               # Claude CLI binary
      - /home/slin/.claude:/home/slin/.claude         # Claude config
    env_file: .env
    group_add: ["999"]                                # docker group
```
### Dockerfile
- Base: python:3.12-slim
- Docker CLI (sibling containers)
- **tini** as PID 1 (proper zombie reaping)
- `git config --global safe.directory '*'`
- ENTRYPOINT: tini → uvicorn
## Data flows
### Happy path (ET-006 example)
```
1. Plane webhook: work_item.created → task created, analyst launched
2. Analyst: writes BRD/TRZ/AC/TestPlan → git push docs/
3. Plane comment :approved: → QG check_analysis_approved → PASS
4. Auto-advance: analysis → architecture, architect launched
5. Architect: writes ADR, infra-requirements → git push docs/
6. Gitea push webhook: ADR detected → QG check_architecture_done → PASS
7. Auto-advance: architecture → development, developer launched
8. Developer: writes code in src/ + tests/ → git push, creates PR
9. Gitea status webhook: CI green → QG check_ci_green → PASS
10. Auto-advance: development → review, reviewer launched
11. Reviewer: leaves a review (APPROVED or REQUEST_CHANGES)
12. Gitea PR webhook: review event → QG check_review_approved → PASS
13. Advance: review → testing, tester launched
14. Tester: runs the tests, writes test-report.md → git push
15. Auto-advance: testing → deploy (QG check_tests_passed → PASS)
16. PR merge → Gitea PR webhook: action=closed, merged=true → done
```
### Review bounce path
```
11. Reviewer: REQUEST_CHANGES
12. Gitea PR webhook: review_state=REQUEST_CHANGES, stage=review
13. Rollback: review → development, developer relaunched (attempt N/3)
14. Developer: fixes the review comments → git push
15. CI green → development → review, reviewer relaunched
16. Reviewer: APPROVED → continue happy path
```
## Resilience
| Mechanism | Description |
|-----------|-------------|
| Watchdog | Every agent: 30 min timeout → SIGKILL + exit_code=-9 |
| safe.directory | git operations work in any directory |
| Max retries | Developer: at most 3 attempts, then escalation |
| Zombie-free | stdout goes straight to a file + monitor `proc.wait()` → the process is always reaped (B-2) |
| Orphan recovery | On startup: orphan runs (finished_at IS NULL, older than 35 min) are marked exit=-1 with a per-run warning + a Telegram notification "manual check needed" (M-1) |
## Agents
Each agent is a Claude CLI invocation with:
- **System prompt**: `.openclaw/agents/{role}.md` (in the repository)
- **Task file**: `.task-{suffix}.md`, generated by the orchestrator **by writing directly into the task's worktree** `/repos/_wt/<repo>/<branch>/` (B-1, no docker; ORCH-2: into an isolated working copy, not the shared `/repos/<repo>`). Listed in the project repository's `.gitignore` (a runtime artifact, never committed).
- **Tools**: Read, Write, Edit, Bash
- **Output**: `--print` mode (all output goes to stdout after completion)
| Agent | Artifacts | Time (typical) |
|-------|-----------|----------------|
| analyst | BRD, TRZ, AC, TestPlan | 5-10 min |
| architect | ADR, infra-requirements, tech-risks | 5-10 min |
| developer | src/, tests/, PR | 15-30 min |
| reviewer | review report, PR review | 3-5 min |
| tester | test-report.md, e2e results | 10-25 min |
| deployer | merge PR + SSH deploy-hook + smoke | 5-10 min |
## Isolation via git worktree (ORCH-2 / S-4)
Each task (= one git branch) works in an **isolated git worktree** rather than in the shared
`/repos/<repo>`. This removes `git checkout` races when two tasks are active at the same time.
```
/repos/<repo>                    ← main clone (fetch / worktree management, read-only queries)
/repos/_wt/<repo>/<safe-branch>  ← worktree of a specific task (the agent's working copy)
```
Module `src/git_worktree.py`:
- `get_worktree_path(repo, branch)`: the worktree path (does not create it).
- `ensure_worktree(repo, branch)`: creates (or reuses) a worktree on the requested branch;
  for a new branch, creates it from `origin/main`. Returns the path.
- `remove_worktree(repo, branch)`: optional cleanup on `done`.

Where the worktree is used:
- **launcher**: the agent is started with `cd <worktree>` (no `git checkout` in cmd); the task file
  is written into the worktree; commit/push in `_monitor_agent` run in the worktree.
- **qg/checks**: reading agent artifacts (`check_analysis_complete`, `check_architecture_done`,
  `check_tests_passed`, `check_reviewer_verdict`) and `check_tests_local` (`make test`) all use the worktree.
  Artifact functions accept an optional `branch`; without it they fall back to the shared `/repos/<repo>`
  (backward compatibility).
- **webhooks/gitea**: `git branch -r --contains <sha>` stays in the main clone; it is a
  **read-only** query (no checkout or mutation) and creates no races.

> A branch can be checked out in only one worktree at a time;
> that is exactly the property needed: one task = one branch = one worktree.
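How a branch name maps to a worktree directory can be sketched as below. The sanitising rule is an assumption: the source names a `_safe` helper but does not show its logic, so `safe_branch` here is a hypothetical stand-in, and `WORKTREES_DIR` mirrors the `/repos/_wt` root mentioned above.

```python
import re
from pathlib import Path

WORKTREES_DIR = Path("/repos/_wt")  # root named in the text above


def safe_branch(branch: str) -> str:
    """Hypothetical sanitiser: collapse anything outside [A-Za-z0-9._-] into '-'."""
    return re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")


def get_worktree_path(repo: str, branch: str, root: Path = WORKTREES_DIR) -> Path:
    """Return the worktree directory for (repo, branch) without creating it."""
    return root / repo / safe_branch(branch)
```

Note that a flattening rule like this one can map two different branch names (for example `feature/a` and `feature-a`) to the same directory, so a real implementation would need to guard against such collisions.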
## Known limitations
- ~~Shared `/repos` checkout (races between parallel tasks).~~ **RESOLVED (ORCH-2 / S-4):**
  git worktree per task/branch, see the git worktree isolation section above.
- ~~In-process daemon threads (restart → orphans, lost work).~~ **RESOLVED (ORCH-1 / F-2b):**
  persistent jobs queue + background worker, see the job queue section (ORCH-1) below.
  The monitor/watchdog daemon threads remain for a single running agent, but on
  restart its job returns to `queued` (queue-recovery) and is picked up again.
## Job queue (ORCH-1 / F-2b)
Previously the webhook handler **synchronously** spawned `subprocess.Popen` + 2 daemon threads
right inside the uvicorn process (8 call sites). A restart meant orphans and lost work,
with no concurrency limit and no retries.
### Flow
```
webhook (plane/gitea)                      background thread (queue_worker)
        │                                              │
  enqueue_job() ---> [ jobs table ] <--- claim_next_job() (atomic queued->running)
  (instant 200       status=queued                     │
   reply)                                        launch_job(job)
                                        AgentLauncher._spawn (Popen claude)
                                        _monitor_agent (proc.wait, commit/push,
                                                       │  advance stage)
                                        _finalize_job:
                                          exit 0                   -> mark_job done
                                          exit !=0 & attempts<max  -> requeue (queued)
                                          exit !=0 & attempts>=max -> failed + Telegram
```
### The `jobs` table
| Column | Purpose |
|--------|---------|
| `status` | `queued` → `running` → `done` \| `failed` |
| `attempts` / `max_attempts` | attempt counter (incremented on claim) / retry limit (default 2) |
| `run_id` | FK to `agent_runs.id` once started |
| `task_content` | the task spec that is written into the agent's task file |
| `error` | last error |

`idx_jobs_status (status, id)`: fast FIFO selection of queued jobs.
### Atomic claim
`claim_next_job()` runs `SELECT queued ORDER BY id LIMIT 1` → `UPDATE ... WHERE id=? AND
status='queued'` and checks `rowcount`. When two ticks race, only one UPDATE
moves the row to `running` (rowcount==1); the loser takes the next job.
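The select-then-guarded-update pattern can be sketched with the standard `sqlite3` module. This is an illustration under assumptions: the `jobs` table is reduced to the `id`, `status`, and `attempts` columns, and the function takes a connection argument, which may differ from the real `claim_next_job()` signature.

```python
import sqlite3


def claim_next_job(conn: sqlite3.Connection):
    """Move the oldest queued job to running; return its id, or None if none is queued."""
    while True:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the UPDATE a no-op if another tick got there first.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running', attempts = attempts + 1 "
            "WHERE id = ? AND status = 'queued'",
            (row[0],),
        )
        conn.commit()
        if cur.rowcount == 1:
            return row[0]
        # rowcount == 0: lost the race, loop and try the next queued job.
```

The guard in the `WHERE` clause is what makes the claim safe: the earlier `SELECT` only nominates a candidate, while the `UPDATE` decides the winner.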
### Queue-recovery (restart-safe)
In the `main.py` lifespan, **after** the M-1 orphan-recovery, `requeue_running_jobs()` is called:
jobs with status `running` (the worker died during the restart) go back to `queued`.
Then the worker starts; on shutdown, `worker.stop()` (Event.set + join).
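The recovery step amounts to one UPDATE. A minimal sketch, assuming a `jobs` table with a `status` column and a connection passed in explicitly (the real `requeue_running_jobs()` may obtain its connection differently):

```python
import sqlite3


def requeue_running_jobs(conn: sqlite3.Connection) -> int:
    """Return jobs left in 'running' by a dead worker to 'queued'; report how many."""
    cur = conn.execute("UPDATE jobs SET status = 'queued' WHERE status = 'running'")
    conn.commit()
    return cur.rowcount
```

This is only safe when run before the worker starts, because at that point no job can legitimately be in `running`.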
### Config
- `ORCH_MAX_CONCURRENCY` (default 1): limit on parallel jobs.
- `ORCH_QUEUE_POLL_INTERVAL` (default 2.0): polling period.

Observability: `GET /queue` returns counts by status + the last 10 jobs.
> Compatibility: `launcher.launch()` (direct synchronous launch, `job_id=None`)
> is kept for backward compatibility. The queue uses `launch_job()`;
> both share `_spawn()` (the B-2 Popen logic is unchanged).
- **Gitea CI is not configured.** The development QG is now local (`check_tests_local`);
  Gitea CI statuses are not authoritative and do not block the pipeline.
- **Docker inside the orchestrator container is UNAVAILABLE.** Deployment goes only through
  the SSH hook `enduro-deploy-hook.sh` on the host.

docs/BACKLOG_PIPELINE.md

@@ -0,0 +1,80 @@
# Pipeline Design Backlog
Questions that need architectural work before implementation.
---
## BL-001: Testing / auditing outside a work item
**Status:** Open
**Added:** 2026-05-23
### Problem
The current pipeline is feature-driven: every run is tied to a Plane issue.
There is no mechanism for:
- A standalone UI audit (checking the application's current state)
- Regression testing without a new feature
- Periodic UI health checks
### Questions to work through
1. Do we need a separate "audit" task type in Plane?
2. Or is an audit always ad hoc, outside the orchestrator?
3. If it goes through the orchestrator, what shortened pipeline applies? (`analyst → tester` without dev/review)
4. Where should the report go? Into Plane? Into a separate docs/audits/?
5. Who initiates it: Slava via Plane, or Stream via heartbeat?
### Options
| Option | Pros | Cons |
|--------|------|------|
| Ad hoc via Stream (spawn agents) | Fast, no infrastructure | Not tracked in Plane |
| Synthetic Plane issue | Tracked | The orchestrator cannot skip stages |
| New "audit" task type in the orchestrator | Architecturally correct | Requires development |
---
## BL-002: Backlog management / tasks
**Status:** Open
**Added:** 2026-05-23
### Problem
The process is undefined: who creates tasks, where, and how they enter the pipeline.
### Questions to work through
1. **Who creates tasks in Plane?**
   - Slava directly through the Plane UI?
   - Stream creates tasks at Slava's request in chat?
   - Automatically, from keywords in Telegram?
2. **Where should tasks be created?**
   - Only in the Plane project "Enduro Trails"?
   - Stream keeps its own list in the workspace?
   - Do we need a separate inbox?
3. **What triggers the pipeline?**
   - Today: a Plane issue with a specific status → webhook → orchestrator
   - Should we add: Telegram → Stream creates a Plane issue → pipeline?
4. **Prioritization:**
   - Who decides what to pick up next?
   - Is there a sprint/kanban?
5. **Plane synchronization (see the current bug):**
   - Plane is out of sync (ET-001..ET-006 are shown incorrectly)
   - Should the plane_issue_id mapping in the orchestrator be fixed?
   - Or is Plane merely decorative, with the real tracking in orchestrator.db?
### Context
- Current chain: Plane webhook → orchestrator → agents
- Plane sync is broken (known P3 from LESSONS_ET006)
- orchestrator.db is the single source of truth for task state
---
*A document for discussing the pipeline architecture. Not a roadmap, not a spec.*


@@ -0,0 +1,62 @@
# Bugfixes: 2026-05-21
## Context
Task ET-005 (units-of-measurement switcher) got stuck on the `development → review` transition.
While diagnosing and repairing it, 5 bugs in the orchestrator were found and fixed.
## Bugs fixed
### 1. CI status webhook: empty `branches` in the payload
**File:** `src/webhooks/gitea.py` (handle_ci_status)
**Problem:** Gitea sends the CI status webhook with `branches: []`. The function returned early: it could not determine the branch and did not advance the task.
**Fix:** Fallback via `git branch -r --contains <sha>`, which determines the branch from the commit SHA. It looks for a `feature/*` branch in the output.
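The parsing half of this fallback can be sketched as a pure function over the command's output. The function name `pick_feature_branch` and the `origin` remote name are assumptions for illustration; the real handler may parse the output differently.

```python
def pick_feature_branch(git_output: str, remote: str = "origin"):
    """Return the first feature/* branch listed by `git branch -r --contains <sha>`."""
    prefix = remote + "/"
    for line in git_output.splitlines():
        name = line.strip()
        if "->" in name:
            # Skip symbolic refs such as "origin/HEAD -> origin/main".
            continue
        if name.startswith(prefix):
            name = name[len(prefix):]
        if name.startswith("feature/"):
            return name
    return None
```

Keeping the parsing separate from the `subprocess` call makes the branch-detection rule easy to test without a repository.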
### 2. git safe.directory in the container
**File:** Docker runtime (orchestrator container)
**Problem:** `subprocess.run(["git", ...])` inside the container failed with `fatal: detected dubious ownership in repository` because the repo mount belongs to a different user.
**Fix:** `git config --global --add safe.directory '*'` at container startup. The custom `env={**os.environ, "HOME": "/home/slin"}`, which broke gitconfig, was removed.
### 3. X-Gitea-Event: pull_request_approved was not routed
**File:** `src/webhooks/gitea.py` (webhook router)
**Problem:** Gitea sends the event type `pull_request_approved` when a review is approved, but the router handled only `pull_request`.
**Fix:** Routing extended to `pull_request`, `pull_request_approved`, `pull_request_review_approved`.
### 4. review.state vs review.type: the new Gitea format
**File:** `src/webhooks/gitea.py` (handle_pr)
**Problem:** The Gitea webhook sends `review.type = "pull_request_review_approved"` instead of `review.state = "APPROVED"`. The code looked only at `review.state`.
**Fix:** Map from `review.type` when `review.state` is empty: `"approved" in type → APPROVED`, `"request_changes"/"rejected" in type → REQUEST_CHANGES`.
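The mapping described in fix 4 can be sketched as a small function. The name `resolve_review_state` is hypothetical, and the `review` argument is assumed to be the `review` object from the webhook payload as a plain dict.

```python
def resolve_review_state(review: dict):
    """Prefer review.state; otherwise derive the verdict from review.type."""
    state = review.get("state")
    if state:
        return state
    review_type = (review.get("type") or "").lower()
    if "approved" in review_type:
        return "APPROVED"
    if "request_changes" in review_type or "rejected" in review_type:
        return "REQUEST_CHANGES"
    return None  # neither field carries a recognisable verdict
```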
### 5. No auto-advance after an agent finishes
**File:** `src/agents/launcher.py`
**Problem:** After the tester finished (exit_code=0), the task stayed in `testing` because there was no automatic advancement mechanism. For `development → review` the trigger is the CI status webhook, for `review → testing` it is the PR review webhook, but `testing → deploy` has no external trigger.
**Fix:** Added the `_try_advance_stage()` method to `AgentLauncher`, called from `_monitor_agent` after an agent finishes successfully. It checks the QG, advances the stage, and launches the next agent.
## Known issues (not fixed)
### dismiss_stale_approvals
Branch protection `dismiss_stale_approvals: true` on the main branch: the tester pushes a commit after review approval → the approval becomes stale → the merge is blocked.
**Workaround:** Re-approve via claude-bot after every tester push.
**Recommendation:** Either disable `dismiss_stale_approvals` or add an auto-re-approve to the orchestrator after the tester push.
## Result
ET-005 completed the full cycle: `analysis → architecture → development → review → testing → deploy → done`


@@ -0,0 +1,84 @@
# Bugfixes 2026-06-02: fixing orchestrator bugs
**Source:** `tasks/multi-agent/AUDIT_2026-06-02.md`
**Goal:** restore the autonomy of the multi-agent pipeline (ET-009: 0/6 stages ran autonomously).
**Owner:** Dev agent (Opus 4.8 Tokenator).
---
## What was fixed
### B-1: writing `.task-*.md` without docker
**Before:** `launcher._write_task_file()` wrote the file via `docker run --rm -i python:3.12-slim bash -c "cat > ..."`. There is NO `docker` binary in the container → the write failed silently → the agent read a stale task file.
**After:** a direct write to the mounted volume `/repos/<repo>/<task_file>` with a plain `open(..., "w")`. A write error raises `RuntimeError` (no silent failure).
**File:** `src/agents/launcher.py` (`_write_task_file`, call in `launch`).
**Verification:**
```bash
docker exec orchestrator python3 -c "
import sys; sys.path.insert(0,'/repos/orchestrator')
from src.agents.launcher import launcher
launcher._write_task_file('enduro-trails', '.task-test-write.md', 'hello-from-fix')
print(open('/repos/enduro-trails/.task-test-write.md').read())"
# => hello-from-fix (no docker)
```
✅ Verified: READBACK = `hello-from-fix`.
### B-2: Popen stdout → file, PIPE thread removed (zombies, lost exit_code)
**Before:** `Popen(stdout=PIPE)` + a daemon thread with `select`/`readline` + a 120 s startup timeout. → PIPE deadlock, zombies on restart, `exit_code=None` in the DB (all ET-009 runs).
**After:** `log_fh = open(output_path, "w")`; `Popen(stdout=log_fh, stderr=STDOUT)`. `_monitor_agent` is simplified to `proc.wait()` + `log_fh.close()`. The PIPE thread and the startup timeout are removed. The pid-based watchdog (`AGENT_TIMEOUT`) is kept.
**File:** `src/agents/launcher.py` (`launch`, `_monitor_agent`).
**Verification:** after a run, `SELECT exit_code FROM agent_runs ORDER BY id DESC LIMIT 1` is not NULL; `ps aux | grep defunct` is empty.
### B-3: `.task-*.md` in `.gitignore`, never committed
**Before:** task files were tracked in git (`.task-arch.md`, `.task-dev.md`, `.task-review.md`, `.task.md`) and leaked from one task to the next.
**After:** `.task*.md` added to `enduro-trails/.gitignore`; the tracked files were removed from the index (`git rm --cached`).
**File:** `enduro-trails/.gitignore` (+ untrack). The `main` branch is protected → the changes are in **PR #19** (`chore/gitignore-task-files`).
**Verification:** `git check-ignore .task.md .task-arch.md` → matched. `git add docs/ src/ tests/` (scoped) does not pick up task files.
### S-5: machine-readable reviewer verdict
**Before:** `check_reviewer_verdict` searched for the substrings `APPROVED`/`REQUEST_CHANGES` in the whole text (5000 bytes) → false positives on tables.
**After:** ONLY `verdict:` from the YAML frontmatter of `12-review.md` is read (via `yaml.safe_load`). No verdict / no frontmatter → not-approved. `reviewer.md` was updated to require the frontmatter `verdict: APPROVED|REQUEST_CHANGES`.
**Files:** `src/qg/checks.py` (`check_reviewer_verdict`), `enduro-trails/.openclaw/agents/reviewer.md` (PR #19; the working copy was applied immediately).
**Verification:** ET-009 `12-review.md` (frontmatter `verdict: APPROVED`) → `(True, 'Reviewer verdict: APPROVED')`. Unit tests cover APPROVED/REQUEST_CHANGES/no-verdict/no-frontmatter/table-in-body.
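The frontmatter-only rule can be illustrated as below. This sketch deliberately swaps in a dependency-free line reader instead of the `yaml.safe_load` call the fix uses, so it handles only a flat `verdict: VALUE` line; the name `read_verdict` is hypothetical.

```python
ALLOWED_VERDICTS = {"APPROVED", "REQUEST_CHANGES"}


def read_verdict(markdown_text: str):
    """Return the verdict from the leading frontmatter block, or None.

    Simplified stand-in for YAML parsing: only `key: value` lines are understood.
    The document body is never inspected, so tables mentioning APPROVED cannot match.
    """
    lines = markdown_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at all
    verdict = None
    for line in lines[1:]:
        if line.strip() == "---":
            # Closing delimiter reached: decide using frontmatter content only.
            return verdict if verdict in ALLOWED_VERDICTS else None
        key, sep, value = line.partition(":")
        if sep and key.strip() == "verdict":
            verdict = value.strip().strip("'\"")
    return None  # frontmatter was never closed
```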
### S-1: the test QG is run by the orchestrator itself (not Gitea CI)
**Before:** the `development → review` QG was `check_ci_green` (Gitea status). CI is not configured → always false → the automatic transition never happened, plus false "CI failed" alerts.
**After:** a new QG, `check_tests_local`: the orchestrator runs `git fetch/checkout <branch>` + `make test` in `/repos/<repo>` and judges by exit code. `stages.py`: the `development` QG → `check_tests_local`. Dispatch added to `launcher._try_advance_stage` and `webhooks/plane._try_advance_stage` (args `(repo, branch)`). `webhooks/gitea.handle_ci_status`: `failure` → debug log, no `notify_error`.
**Files:** `src/qg/checks.py`, `src/stages.py`, `src/agents/launcher.py`, `src/webhooks/plane.py`, `src/webhooks/gitea.py`.
**Pitfall (known limitation):** `check_tests_local` does a checkout in the shared `/repos`, which is unsafe with parallel tasks (S-4 worktree is handled separately).
### M-1: proper orphan recovery
**Before:** `UPDATE agent_runs SET exit_code=-1 WHERE finished_at IS NULL AND started_at < now-35min` silently wrote off zombies.
**After:** every orphan run is enumerated, marked exit=-1, logged with a per-run `warning` ("manual check needed"), and reported via a Telegram notification. No automatic restart (risk of looping). Killing by pid is impossible because the pid is not persisted in the DB (documented).
**File:** `src/main.py` (lifespan).
---
## Out of scope (separate tasks)
- S-2/S-3 (deployer rollback in the shared repo), S-4 (git worktree per task), M-3 (unified stage engine), F-2b (job queue), M-7 (webhook idempotency). `_auto_merge_pr` is dead code left in place (separate cleanup).
## Tests
- New file `tests/test_launcher.py`: 10 tests (`_write_task_file` writes/raises/no docker; `check_reviewer_verdict` frontmatter cases).
- `tests/test_qg.py`: 16 passed. `tests/test_launcher.py`: 10 passed.
- ⚠️ Pre-existing: `tests/test_webhooks.py` has failures (401/signature + cross-file env pollution) that are NOT related to these fixes and existed before the changes. Run in isolation it partly passes; in the full run there are more failures because env/DB are shared between test files. Cleaning up test_webhooks is a separate task.
## Deployment
The orchestrator was rebuilt: `cd /home/slin/repos/orchestrator && docker compose up -d --build`. Health: `{"status":"ok"}`.
---
## Also found and fixed during the autonomy test
### git safe.directory (launcher commit/push)
The test revealed that git inside the container (root) failed on the bind-mounted `/repos` with "dubious ownership" → the agent's auto-commit/push did not go through. Fix: `git config --system --add safe.directory "*"` in the Dockerfile. `_monitor_agent` commit+push now works autonomously (verified: `analyst(ET): auto-commit run_id=47` was pushed to origin).
### init:true (PID-1 reaper): finishing B-2
The direct child (bash) was reaped correctly via `proc.wait()`, BUT claude (node) spawns its own child processes; when bash exited they were reparented to PID 1 (uvicorn), which did NOT reap them → grandchild zombies. Fix: `init: true` in docker-compose.yml, so Docker injects `docker-init` (tini) as PID 1. Verified: after a real agent run, `ZOMBIE_COUNT_AFTER=0`.
## Autonomy test (Task 9): RESULT
Launched via `launcher.launch("analyst", ...)` (NOT base64). Confirmed to run autonomously:
- B-1: a fresh `.task.md` was written without docker (which docker = NO_DOCKER_BINARY)
- B-2: `exit_code=0` in `agent_runs` (run 46/47/48)
- zombies: 0 after the run (tini reaper)
- git: auto-commit + push to origin worked
- M-1: on restart, orphan recovery logged per-run entries + Telegram (runs 42/43/44 ET-009)


@@ -0,0 +1,81 @@
# ORCH-2 / S-4: git worktree per task (isolating the shared /repos)
**Date:** 2026-06-02
**Branch:** `feature/ORCH-2-worktree`
**Source:** `AUDIT_2026-06-02.md` (SERIOUS S-4), `DEV_TASK_ORCH2_WORKTREE.md`
**Owner:** Dev (Opus 4.8 Tokenator)
## Problem (S-4)
All git operations (`launcher.launch` cmd, `_monitor_agent` commit/push, `check_tests_local`)
ran `git checkout <branch>` in one shared `/repos/<repo>`. With two active tasks,
one task's checkout overwrote the other's working copy → races (on ET-009 this produced "two collectors"
and branch confusion).
## Solution
**git worktree per branch.** Each task (branch) works in an isolated working copy:
```
/repos/<repo>                    ← main clone (fetch / worktree mgmt / read-only)
/repos/_wt/<repo>/<safe-branch>  ← task worktree (the agent's working copy)
```
## Changes
| File | What |
|------|------|
| `src/config.py` | + `worktrees_dir: str = "/repos/_wt"` |
| `src/git_worktree.py` (new) | `_safe`, `get_worktree_path`, `ensure_worktree`, `remove_worktree` |
| `src/agents/launcher.py` | `launch()`: the branch is resolved up front → `ensure_worktree`; cmd = `cd <worktree>` without `git checkout`; `_write_task_file(repo, branch, ...)` writes into the worktree; `_monitor_agent` commit/push run in the worktree (checkout removed); `01-questions.md`/`10-conflict.md` are read from the worktree; the QG dispatcher passes `branch` through |
| `src/qg/checks.py` | `_repo_path(repo, branch)` helper (worktree if present, otherwise shared); artifact checks gained an optional `branch`; `check_tests_local` → `ensure_worktree` + `make test` in the worktree (the S-4 TODO was removed) |
| `src/webhooks/plane.py` | The QG dispatcher passes `branch` through; the review-file fallback is read from the worktree |
| `src/webhooks/gitea.py` | `git branch -r --contains <sha>`: confirmed read-only, left in the main clone (+ comment) |
| `tests/test_git_worktree.py` (new) | coverage of `_safe`/`get_worktree_path`/`ensure_worktree`/`remove_worktree` + isolation of two branches (real local git repos in tmp, no network) |
| `tests/test_launcher.py` | `TestWriteTaskFile` updated for the new signature (write into the worktree) |
| `docs/ARCHITECTURE.md` | "Isolation via git worktree" section; the item about shared-checkout races was removed |
## Compatibility with earlier fixes
- **B-1** (task file written without docker, direct `open()`): preserved; the path is now the worktree.
- **B-2** (Popen stdout → file, monitor `proc.wait()` without zombies): untouched.
- **S-5** (`check_reviewer_verdict`, YAML frontmatter only): untouched, only the worktree path was added.
- **S-1** (`check_tests_local`, its own `make test` instead of Gitea CI): preserved; tests now run in the worktree.

Backward compatibility of QG dispatch: artifact checks accept `branch` optionally
(default `None` → shared `/repos/<repo>`), so existing 2-arg calls/tests are not broken.
## Verification
```bash
# Tests (in a container via the image; the host .venv is broken):
IMG=$(docker inspect orchestrator --format '{{.Config.Image}}')
docker run --rm -v /home/slin/repos/orchestrator:/code -w /code --entrypoint python3 $IMG -m pytest tests/ -q
# → 37 passed, 9 failed (pre-existing test_webhooks 401/signature, NOT related to ORCH-2,
# identical to the baseline on main).
# test_git_worktree.py in isolation → 9 passed.
```
### Isolation test (in the running container)
```bash
docker exec orchestrator python3 -c "
import sys; sys.path.insert(0,'/app')
from src.git_worktree import ensure_worktree
import subprocess
p1 = ensure_worktree('enduro-trails','feature/wt-test-A')
p2 = ensure_worktree('enduro-trails','feature/wt-test-B')
b1 = subprocess.run(['git','-C',p1,'branch','--show-current'],capture_output=True,text=True).stdout.strip()
b2 = subprocess.run(['git','-C',p2,'branch','--show-current'],capture_output=True,text=True).stdout.strip()
assert p1!=p2 and b1!=b2, 'NOT ISOLATED'
print('ISOLATION OK', p1, p2, b1, b2)
"
```
(For the result of the run on the server, see below / Stream's report.)
## Limitations / notes
- The job queue (ORCH-1 / F-2b) is **not** part of this task.
- `remove_worktree` exists, but the automatic call on `done` is not wired up (optional, a separate step).


@@ -0,0 +1,82 @@
# BUGFIXES / CHANGES: 2026-06-03
## ORCH-6: Multi-repo: project filter + per-project repo mapping
**Type:** root fix for the incident + new capability (multi-repo)
**Branch:** `feature/ORCH-6-multirepo`
**Plane:** ORCH-6 (project `8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a`)
**Related incident:** [`INCIDENT_2026-06-02_webhook_autorun.txt`](./INCIDENT_2026-06-02_webhook_autorun.txt)
### Incident context
When tasks ORCH-1..7 were created in Plane (project `orchestrator`), the Plane webhook
(id `93f0c342-a614-4248-9d0f-c107276f5620`) fired on every task and started
the pipeline, but **everything went to the `enduro-trails` repo**, because `plane.py:91`
hardcoded `repo = settings.default_repo`. The webhook listened to **the whole workspace with no
project filter** and spawned the junk tasks ET-010..016.
Mitigation while the fix was pending: the Plane webhook was **deactivated** (`is_active=false`).
### Root cause
1. No filter by Plane project: any issue from any project entered the pipeline.
2. `repo` was hardcoded to the single `default_repo` (enduro-trails).
3. `work_item_prefix` was always `ET` (db.py).
4. `plane_sync` talked to a single hardcoded `PROJECT_ID` (enduro).
### What was done
| File | Change |
|------|--------|
| `src/projects.py` (new) | Project registry: `ProjectConfig` + default list (enduro-trails + orchestrator) + resolvers `get_project_by_plane_id` / `get_project_by_repo` / `known_plane_project_ids`. The override source is `ORCH_PROJECTS_JSON`; parsing is resilient (broken JSON / broken entries → fallback to the default). |
| `src/config.py` | Added `projects_json: str = ""` (env `ORCH_PROJECTS_JSON`). |
| `src/webhooks/plane.py` | **Project filter**: `data.project` not in the registry → `{"status":"ignored","reason":"unknown project"}`. `repo`/`prefix`/Plane project are resolved from the registry. Plane sync for a task goes to that task's own project. |
| `src/db.py` | `get_next_work_item_id(repo, prefix="ET")`: numbering per (repo, prefix); `ORCH-001` is independent of `ET-010`. The `ET` default is kept for backward compatibility. |
| `src/plane_sync.py` | `_resolve_project_id` + a `project_id` parameter (defaults to enduro → backward compatibility for existing calls). |
| `src/webhooks/gitea.py` | Unknown repo (`get_project_by_repo` → None) → `ignored` in 3 handlers. |
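The resilient registry parsing described for `src/projects.py` can be sketched as follows. The JSON key names, the `plane_project_id` field, and the `load_projects` function are assumptions for illustration; only `ProjectConfig`, `repo`, `work_item_prefix`, `get_project_by_plane_id`, and the two project ids come from the text.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectConfig:
    plane_project_id: str  # assumed field name
    repo: str
    work_item_prefix: str


DEFAULT_PROJECTS = [
    ProjectConfig("7a79f0a9-5278-49cd-9007-9a338f238f9c", "enduro-trails", "ET"),
    ProjectConfig("8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a", "orchestrator", "ORCH"),
]


def load_projects(raw: str):
    """Parse an ORCH_PROJECTS_JSON-style value; fall back to defaults on bad input."""
    if not raw.strip():
        return list(DEFAULT_PROJECTS)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return list(DEFAULT_PROJECTS)  # broken JSON
    if not isinstance(data, list):
        return list(DEFAULT_PROJECTS)  # not an array
    projects = []
    for item in data:
        try:
            projects.append(ProjectConfig(
                plane_project_id=str(item["plane_project_id"]),
                repo=str(item["repo"]),
                work_item_prefix=str(item["work_item_prefix"]),
            ))
        except (KeyError, TypeError):
            continue  # skip broken entries, keep the valid ones
    # If every entry was broken, fall back rather than run with an empty registry.
    return projects or list(DEFAULT_PROJECTS)


def get_project_by_plane_id(projects, plane_id: str):
    """Return the matching project, or None so the webhook can ignore the event."""
    return next((p for p in projects if p.plane_project_id == plane_id), None)
```

Returning `None` for an unknown id is what lets the webhook handler answer `ignored` instead of routing the issue to a default repo.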
### Tests
- `tests/test_projects.py` (16 tests): resolvers (by plane_id, by repo, unknown→None,
  known_plane_project_ids), parsing of `ORCH_PROJECTS_JSON` (valid / broken JSON / not an array /
  broken entries → skip / all-bad → fallback), reload with a custom JSON.
- `tests/test_plane_webhook.py` (4 tests, FastAPI TestClient, `launcher.launch` mocked):
  unknown project → `ignored` + no task/branch/agent; orchestrator project → `repo=orchestrator`,
  `ORCH-*`; enduro project → `repo=enduro-trails`, `ET-*`; independent prefixes (`ORCH-001`/`ORCH-002`
  in parallel with `ET-001`).

**Run (in a container, image `orchestrator-orchestrator`):** `57 passed`. The 9 failures in
`tests/test_webhooks.py` are **pre-existing** (webhook signature 401 / TypeError, unrelated to ORCH-6,
left untouched).
```bash
IMG=$(docker inspect orchestrator --format '{{.Config.Image}}')
docker run --rm -v /home/slin/repos/orchestrator:/code -w /code --entrypoint python3 $IMG -m pytest tests/ -q
```
### Resolver check (offline, in the running container)
```bash
docker exec orchestrator python3 -c "
from src.projects import get_project_by_plane_id, known_plane_project_ids
o = get_project_by_plane_id('8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a')
e = get_project_by_plane_id('7a79f0a9-5278-49cd-9007-9a338f238f9c')
assert o.repo=='orchestrator' and o.work_item_prefix=='ORCH'
assert e.repo=='enduro-trails' and e.work_item_prefix=='ET'
assert get_project_by_plane_id('00000000-0000-0000-0000-000000000000') is None
print('RESOLVE OK:', o.repo, e.repo, '| known:', len(known_plane_project_ids()))
"
```
### ⚠️ Important
- The Plane webhook **remains disabled** (`is_active=false`). Enabling it is a separate
  step for Stream after the PR review.
- `ORCH_PROJECTS_JSON` (if set) **fully replaces** the default, so list every project you need.
- Backward compatibility of `plane_sync` is preserved (default project_id = enduro); ET tasks are not broken.
### Re-enable webhook (after review, done by Stream)
```sql
UPDATE webhooks SET is_active=true WHERE id='93f0c342-a614-4248-9d0f-c107276f5620';
```


@@ -0,0 +1,7 @@
INCIDENT 2026-06-02: Plane webhook auto-triggered pipeline for ALL ORCH-1..7 tasks
- Plane webhook (id 93f0c342) fires on ANY issue creation in workspace, no project filter
- plane.py:91 hardcodes repo=settings.default_repo (enduro-trails)
- Result: ORCH-x tasks ran analyst/architect in WRONG repo (enduro-trails), created junk ET-010..016
- MITIGATION: Plane webhook DEACTIVATED (is_active=false) until ORCH-6 adds project filter
- ROOT FIX = ORCH-6 (multi-repo): filter by plane_project_id + repo mapping per project
- To re-enable webhook after ORCH-6: UPDATE webhooks SET is_active=true WHERE id=93f0c342...

docs/LESSONS_ET006.md

@@ -0,0 +1,190 @@
# Lessons Learned: ET-006 (GPX Upload & Visualization)
## Date: 2026-05-22
## Task: ET-006, GPX track upload and visualization
---
## What worked well
### 1. Review bounce: a real bug was found and fixed automatically
The reviewer found a P1: `Math.min.apply(null, array)` throws a RangeError on arrays with >100K elements.
The developer fixed it in 6 minutes (attempt 2), and the second review passed cleanly.
**Takeaway:** the reviewer earns its place in the pipeline: it catches bugs that unit tests miss.
### 2. Auto-advance testing → deploy
The new `_try_advance_stage()` in the launcher worked without manual intervention.
### 3. Quality of agent artifacts
- The analyst anticipated REQ-F-13 (persist GPX layers on map style switch), which prevented an architectural bounce-back
- The architect justified why a Web Worker is not an option (DOMParser is absent from WorkerGlobalScope)
- Developer: ~1300 lines of production code + 700 lines of tests, all REQs covered
- Tester: full e2e with Playwright, 48 pass / 0 fail
### 4. Full cycle with a bounce
```
analysis → architecture → development → review (REQUEST_CHANGES)
→ development (fix P1) → review (APPROVED) → testing → deploy → done
```
Time: ~6.5 hours (including waiting on the API and e2e tests)
---
## Problems found
### P1. Zombie processes after docker rebuild
**Symptom:** Monitor threads die on `docker compose up --build`, and agent processes are left as zombies.
**Impact:** Manual intervention for commit/push and advance stage.
**Root cause:** Python daemon threads do not survive a container restart, but child processes (claude.exe) are inherited by init (PID 1).
### P2. Stale reviews block the merge
**Symptom:** The tester pushes a commit after review approval → the approval becomes stale → the merge is rejected.
**Impact:** Manual re-approve before every merge.
**Root cause:** Branch protection `dismiss_stale_approvals: true`.
### P3. Plane sync 404
**Symptom:** `plane_issue_id` in the orchestrator DB does not match the real issue UUID in the Plane API.
**Impact:** State updates in Plane do not work (comments do).
**Root cause:** The webhook payload contains the ID of the webhook event object, not the issue ID.
### P4. Incomplete event routing
**Symptom:** The `pull_request_rejected` event type was not routed to `handle_pr`.
**Impact:** REQUEST_CHANGES from the reviewer did not roll the task back automatically.
**Root cause:** Gitea uses different event types: `pull_request`, `pull_request_approved`, `pull_request_rejected`.
### P5. The analyst did not start automatically
**Symptom:** After a task was created through the Plane webhook, the analyst did not start.
**Impact:** Manual analyst launch.
**Root cause:** `handle_work_item_created` had no call to `launcher.launch("analyst")`.
### P6. The tester is slow (25 min)
**Symptom:** Playwright e2e tests with headless Chromium on the GPX feature took 25 minutes.
**Impact:** Long wait; the watchdog timeout (30 min) almost fired.
**Root cause:** Rendering 700K points + installing dependencies (Playwright, shapely) at runtime.
---
## Решения
### P1. Zombie processes → Entrypoint + orphan recovery
**Решение A (быстрое):** Добавить в Dockerfile:
```dockerfile
RUN git config --global --add safe.directory '*'
```
**Solution B (complete):** Startup recovery in `main.py`:
```python
@app.on_event("startup")
async def recover_orphaned_runs():
    """Mark orphaned runs (started but never finished) as failed."""
    conn = get_db()
    orphans = conn.execute(
        "UPDATE agent_runs SET finished_at=datetime('now'), exit_code=-1 "
        "WHERE finished_at IS NULL AND started_at < datetime('now', '-35 minutes')"
    ).rowcount
    conn.commit()
    if orphans:
        logger.warning(f"Recovered {orphans} orphaned agent runs")
    # Re-check tasks stuck in intermediate stages
    stuck = conn.execute(
        "SELECT id, stage, work_item_id, repo, branch FROM tasks "
        "WHERE stage NOT IN ('done', 'created')"
    ).fetchall()
    for task in stuck:
        # Try to advance if QG passes
        ...
```
**Solution C (robust):** Use `tini` as PID 1 in the container for proper zombie reaping:
```dockerfile
RUN apt-get update && apt-get install -y tini
ENTRYPOINT ["tini", "--"]
CMD ["uvicorn", "src.main:app", ...]
```
### P2. Stale reviews → Disable dismiss or auto-re-approve
**Solution A (simple):** Disable `dismiss_stale_approvals`:
```bash
curl -X PATCH '.../branch_protections/main' -d '{"dismiss_stale_approvals": false}'
```
**Solution B (better):** Auto-re-approve in the launcher after the tester pushes:
```python
# In _monitor_agent, after a successful push for tester:
if agent == "tester":
    _reapprove_pr(repo, branch)
```
**Recommendation:** Solution A is simpler and safer. In our pipeline the reviewer already checks the code, so dismissing stale approvals adds no value.
### P3. Plane sync → Fix the ID mapping
**Solution:** On the `work_item.created` webhook, store the correct `issue_id`:
```python
# In handle_work_item_created:
plane_issue_id = data.get("id")  # This is the issue ID, not the event ID
# Verify via the Plane API: GET /issues/{id}; on 404, look the issue up by name
```
**Diagnostics:** Compare `plane_issue_id` in the DB with the real one via:
```bash
curl http://localhost:8091/api/v1/workspaces/ag_proj/projects/.../issues/?search=ET-006
```
### P4. Event routing → Wildcard for pull_request_*
**Solution:**
```python
if event_type == "push":
    await handle_push(payload)
elif event_type.startswith("pull_request"):
    await handle_pr(payload)
elif event_type == "status":
    await handle_ci_status(payload)
```
### P5. Analyst auto-launch → Already fixed
The patch is applied: `launcher.launch("analyst")` was added to `handle_work_item_created`.
### P6. Tester is slow → Pre-bake dependencies
**Solution A:** Add Playwright and its dependencies to the Dockerfile:
```dockerfile
RUN pip install playwright pytest-playwright shapely mapbox-vector-tile && \
playwright install chromium --with-deps
```
**Solution B:** Split unit/integration and e2e tests. Unit/integration tests are mandatory (fast); e2e tests are optional (enabled by a flag in the task description).
**Solution C:** Raise the tester timeout to 45 minutes:
```python
AGENT_CONFIGS = {
    "tester": {..., "timeout": 2700},  # 45 min
}
```
---
## Fix priority
| # | Problem | Priority | Effort | Solution |
|---|---------|----------|--------|----------|
| P1 | Zombie processes | HIGH | Medium | tini + startup recovery |
| P2 | Stale reviews | HIGH | Low | Disable dismiss_stale_approvals |
| P4 | Event routing | HIGH | Low | startswith("pull_request") |
| P5 | Analyst auto-launch | DONE | - | Already fixed |
| P6 | Tester timeout | MEDIUM | Medium | Pre-bake deps in the Dockerfile |
| P3 | Plane sync 404 | LOW | Medium | Fix the ID mapping |
---
## ET-006 metrics
- **Total time:** ~6.5 hours (00:20 → 06:45 UTC)
- **Agent runs:** 7 (analyst, architect, developer×2, reviewer×2, tester)
- **Manual interventions:** 4 (zombie recovery×2, PR approve, event re-trigger)
- **Code written by agents:** ~2000 lines (1300 production + 700 tests)
- **Bugs found by reviewer:** 1×P1, 3×P2, 6×P3
- **Bugs fixed by developer:** all P1 + all P2 + 3×P3

127
docs/ORCH-1_JOB_QUEUE.md Normal file

@@ -0,0 +1,127 @@
# ORCH-1 (F-2b): Persistent Job Queue
**Date:** 2026-06-02
**Branch:** `feature/ORCH-1-job-queue`
**Source:** AUDIT_2026-06-02 (B-2 / F-2b)
## Problem
Agents were launched **in-process**: `launcher.launch()` synchronously spawned a
`subprocess.Popen` plus 2 daemon threads (`_watchdog`, `_monitor_agent`) directly in
the uvicorn process, from **8 webhook call sites**. Consequences:
- **A restart is a catastrophe.** Daemon threads die, claude processes become orphans,
and the work is lost (M-1 only marked `exit=-1` and called a human).
- **No concurrency limit:** N webhooks = N simultaneous claude processes.
- **No retries:** a crashed agent simply stays dead.
## Solution
A persistent job queue (SQLite table `jobs`) plus a background worker:
1. The webhook handler enqueues a job (`enqueue_job`) → immediate 200 response.
2. The background worker (`src/queue_worker.py`, a separate daemon thread) picks up
jobs subject to `max_concurrency` (`claim_next_job`, atomic) and spawns the agent
(`launcher.launch_job`, same Popen logic).
3. On completion, `_monitor_agent` → `_finalize_job`:
- `exit 0` → `done`;
- `exit != 0` & `attempts < max_attempts` → requeue (`queued`);
- `exit != 0` & `attempts >= max_attempts` → `failed` + Telegram.
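
The three outcomes of step 3 can be sketched as a pure function. This is only an illustration: `next_job_status` is a hypothetical name that does not exist in the orchestrator, where the same decision is made inside `_finalize_job`. It assumes `attempts` has already been incremented by the claim, so it equals 1 after the first run.

```python
def next_job_status(exit_code: int, attempts: int, max_attempts: int) -> str:
    """Return the job status after an agent run (hypothetical helper)."""
    if exit_code == 0:
        return "done"
    if attempts < max_attempts:
        return "queued"  # requeue for another attempt
    return "failed"      # retry budget exhausted; the real code also notifies Telegram


if __name__ == "__main__":
    for args in [(0, 1, 2), (1, 1, 2), (1, 2, 2)]:
        print(args, "->", next_job_status(*args))
```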
## What changed
| File | Change |
|------|--------|
| `src/db.py` | `jobs` table + index; helpers `enqueue_job`, `claim_next_job` (atomic), `mark_job`, `count_running_jobs`, `requeue_running_jobs`, `get_job`, `job_status_counts`, `recent_jobs` |
| `src/config.py` | `max_concurrency` (env `ORCH_MAX_CONCURRENCY`, default 1), `queue_poll_interval` (env `ORCH_QUEUE_POLL_INTERVAL`, default 2.0) |
| `src/agents/launcher.py` | `launch()` → thin wrapper over `_spawn()`; new `launch_job(job)`; `_spawn()` (shared, `job_id` optional); monitor/watchdog accept `job_id`; new `_finalize_job()` (statuses + retries). 4 internal advance calls `self.launch` → `enqueue_job` |
| `src/webhooks/plane.py` | 4 call sites `launcher.launch` → `enqueue_job` |
| `src/webhooks/gitea.py` | 4 call sites `launcher.launch` → `enqueue_job` |
| `src/queue_worker.py` | **NEW**: `QueueWorker` (drain loop + max_concurrency + graceful stop) |
| `src/main.py` | lifespan: queue recovery (`requeue_running_jobs`) after M-1, worker start/stop; new `GET /queue` |
| `tests/test_queue.py` | **NEW**: 19 tests (lifecycle, claim atomicity, retries, requeue, observability, worker max_concurrency; Popen fully mocked) |
## Claim atomicity
```sql
SELECT id FROM jobs WHERE status='queued' ORDER BY id LIMIT 1;
UPDATE jobs SET status='running', attempts=attempts+1, started_at=datetime('now')
WHERE id=? AND status='queued'; -- rowcount==1 => claimed, ==0 => lost the race
```
Guarantee: a job is never handed out twice, even with concurrent worker ticks
(verified by `test_concurrent_claims_no_duplicate`: 8 threads, 20 jobs).
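
A minimal sketch of how the two statements above might be combined in Python with the standard `sqlite3` module. The table schema here is a simplified assumption for the demo, and the function body is a guess at the pattern rather than the actual `claim_next_job` from `src/db.py`.

```python
import sqlite3
from typing import Optional


def claim_next_job(conn: sqlite3.Connection) -> Optional[int]:
    """Claim the oldest queued job; return its id, or None if nothing was claimed."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status='queued' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE jobs SET status='running', attempts=attempts+1, "
        "started_at=datetime('now') WHERE id=? AND status='queued'",
        (row[0],),
    )
    conn.commit()
    # rowcount == 1: this caller won; rowcount == 0: another caller claimed it first.
    return row[0] if cur.rowcount == 1 else None


# Demo with an in-memory database (schema is an assumption, not the real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, "
    "attempts INTEGER DEFAULT 0, started_at TEXT)"
)
conn.execute("INSERT INTO jobs (status) VALUES ('queued')")
first = claim_next_job(conn)
second = claim_next_job(conn)
print(first, second)
```

The guard `AND status='queued'` in the `UPDATE` is what makes the claim safe: the `SELECT` alone can return the same id to two callers, but only one `UPDATE` can flip the row.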
## Preserved fixes (NOT broken)
- **B-1** task-file write (direct `open()` in the worktree): unchanged.
- **B-2** Popen → log_fh (no PIPE), monitor reap: unchanged, only wrapped.
- **M-1** orphan recovery in `main.py`: kept; queue recovery was added AFTER it.
- **ORCH-2** worktree per task: unchanged.
- **ORCH-6** project registry/filter: unchanged.
## Acceptance
| # | Check | Status |
|---|-------|--------|
| 1 | webhook enqueues a job (queued) | ✅ enqueue_job |
| 2 | worker executes queued→running→done | ✅ worker + _finalize_job |
| 3 | running ≤ max_concurrency | ✅ test_worker_respects_max_concurrency |
| 4 | retry fail→queued→failed+notify | ✅ test_finalize_job_requeue_then_fail |
| 5 | restart-safe (running→requeue) | ✅ requeue_running_jobs + lifespan |
| 6 | M-1 not broken | ✅ kept in lifespan |
| 7 | tests (new ones green, 9 pre-existing failures) | ✅ 76 passed / 9 pre-existing |
| 8 | `/queue` | ✅ counts + recent |
## Tests
```bash
IMG=$(docker inspect orchestrator --format '{{.Config.Image}}')
docker run --rm -v /home/slin/repos/orchestrator:/code -w /code \
--entrypoint python3 $IMG -m pytest tests/ -q
# 110 passed, 9 failed (pre-existing test_webhooks 401/signature/TypeError)
```
---
## Resilience layer (ADDENDUM: preflight + 429 + backoff + circuit breaker)
Queue reliability against CLI unavailability and rate limits. These are two DIFFERENT
classes of problem and are handled differently.
### A. Cheap preflight (`src/preflight.py`): burns no tokens
Before claiming, the worker checks `os.path.exists(CLAUDE_BIN)` + `claude --version`
(5 s timeout, spends NO tokens). The result is cached for `preflight_cache_ttl` (45 s).
FAIL → the worker does NOT claim (the job stays `queued`) and waits. 🚫 NO prompt-ping.
### B. 429: detected ON EXIT (`src/error_classifier.py`)
A rate limit cannot be predicted, so failures are classified from the run log. `classify_log_file`
reads the log tail (16KB), looks for `429/rate limit/overloaded/quota/503/529/timeout/...`
→ `transient` or `permanent`. It also extracts `Retry-After`.
- **transient** (429/network) → backoff retry with a SEPARATE `transient_attempts` counter
(limit `transient_max_attempts=5`), which does not consume the code-fault budget.
- **permanent** (code fault) → ordinary `attempts < max_attempts` (2), then `failed`.
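
The classification above could look roughly like this. The regular expressions cover only the markers listed in this section and are an assumption about the matching rules; `classify_text` is a hypothetical helper, and the real `classify_log_file` likely recognizes more patterns.

```python
import re
from typing import Optional, Tuple

# Markers taken from the list above; the real pattern set is probably longer.
TRANSIENT_RE = re.compile(
    r"\b429\b|rate.?limit|overloaded|quota|\b503\b|\b529\b|timeout", re.IGNORECASE
)
RETRY_AFTER_RE = re.compile(r"retry-after\D{0,5}(\d+)", re.IGNORECASE)


def classify_text(tail: str) -> Tuple[str, Optional[int]]:
    """Classify a log tail as 'transient' or 'permanent', plus Retry-After seconds."""
    match = RETRY_AFTER_RE.search(tail)
    retry_after = int(match.group(1)) if match else None
    kind = "transient" if TRANSIENT_RE.search(tail) else "permanent"
    return kind, retry_after


def classify_log_file(path: str, tail_bytes: int = 16 * 1024) -> Tuple[str, Optional[int]]:
    """Read only the last `tail_bytes` of the log and classify them."""
    with open(path, "rb") as fh:
        fh.seek(0, 2)                      # jump to end of file
        size = fh.tell()
        fh.seek(max(size - tail_bytes, 0))
        tail = fh.read().decode("utf-8", errors="replace")
    return classify_text(tail)
```

Reading only the tail keeps the check cheap on large logs, at the cost of missing an error that was printed early and followed by a lot of output.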
### C. Backoff + `available_at`
Columns `jobs.available_at TEXT` + `jobs.transient_attempts INTEGER` (migration via
`_ensure_column`). `claim_next_job`: `WHERE status='queued' AND (available_at IS NULL
OR available_at <= datetime('now'))`. On a transient failure: `available_at = now +
min(2^n * base, max)` (base=10 s, max=600 s); `Retry-After` is honored (the larger of the two is used).
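
To make the formula concrete, here is a standalone version of the calculation with the documented defaults (base=10, max=600). It mirrors `_backoff_seconds` from the launcher, but the free-function form and the name `backoff_seconds` are only for illustration.

```python
from typing import Optional


def backoff_seconds(n: int, retry_after: Optional[int] = None,
                    base: int = 10, cap: int = 600) -> int:
    """min(2^n * base, cap), raised to Retry-After (itself capped) when present."""
    backoff = min((2 ** max(n, 0)) * base, cap)
    if retry_after is not None and retry_after > 0:
        backoff = max(backoff, min(retry_after, cap))
    return int(backoff)


# The launcher passes transient_attempts + 1, so the first retry uses n=1.
schedule = [backoff_seconds(n) for n in range(1, 8)]
print(schedule)  # [20, 40, 80, 160, 320, 600, 600]
```

With `transient_max_attempts=5` the delays are 20, 40, 80, 160 and 320 seconds, so the cap of 600 s only matters when a large `Retry-After` is sent.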
### D. Circuit breaker (`CircuitBreaker` in queue_worker)
N=3 consecutive transient failures → **open**: the worker pauses for `breaker_pause_seconds=300`,
does NOT call the CLI at all, and sends a Telegram alert. After the pause → **half-open** (tries 1 job);
recovered (exit 0) → **closed**; transient again → back to open. The state lives in the worker's
memory and is exposed in `/queue.resilience`.
The launcher→breaker link goes through the `launcher.on_outcome` callback (no import cycle).
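
The state machine above can be sketched as follows. This is a simplified illustration, not the `CircuitBreaker` from `queue_worker`: the injectable `clock`, the `allow_claim` method and the handling of permanent failures (they reset the streak) are assumptions, and limiting half-open to exactly one probe job and the Telegram alert are omitted.

```python
import time


class CircuitBreaker:
    """closed -> open after N consecutive transient failures -> half-open after a pause."""

    def __init__(self, threshold=3, pause_seconds=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self._clock = clock          # injectable for tests (assumption)
        self._streak = 0             # consecutive transient failures
        self._opened_at = None       # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.pause_seconds:
            return "half-open"
        return "open"

    def allow_claim(self) -> bool:
        return self.state != "open"

    def on_outcome(self, transient: bool, recovered: bool) -> None:
        """Same keyword arguments as the launcher's on_outcome callback."""
        if recovered:
            self._streak = 0
            self._opened_at = None               # close
        elif transient:
            self._streak += 1
            if self._streak >= self.threshold or self._opened_at is not None:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
        else:
            self._streak = 0                     # assumption: permanent breaks the streak


now = [0.0]
breaker = CircuitBreaker(clock=lambda: now[0])
for _ in range(3):
    breaker.on_outcome(transient=True, recovered=False)
print(breaker.state)  # open
```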
### Config (config.py)
`preflight_cache_ttl=45`, `backoff_base_seconds=10`, `backoff_max_seconds=600`,
`transient_max_attempts=5`, `breaker_threshold=3`, `breaker_pause_seconds=300`.
### Tests
`tests/test_resilience.py`: 34 tests covering preflight (FAIL→queued, cache, force),
the classifier (transient/permanent/Retry-After), backoff (growth/cap/Retry-After,
`available_at` gating), launcher transient/permanent finalize, and the breaker
(open/half-open/closed/re-open, claim blocking).

163
docs/SETUP_WEBHOOKS.md Normal file

@@ -0,0 +1,163 @@
# Webhook Setup: Plane + Gitea → Orchestrator
## Architecture
```
Gitea (push/PR/CI) ──→ Nginx proxy ──→ Orchestrator /webhook/gitea
Plane (work_item/comment) ──→ Nginx proxy ──→ Orchestrator /webhook/plane
```
External URL: `https://openclaw.mva154.duckdns.org/orchestrator/`
Internal URL: `http://127.0.0.1:8500/`
---
## Gitea Webhook
**Created automatically via the API.**
- URL: `https://openclaw.mva154.duckdns.org/orchestrator/webhook/gitea`
- Events: `push`, `pull_request`, `status`
- Secret: the value of `ORCH_GITEA_WEBHOOK_SECRET` in `.env`
- Signature header: `X-Gitea-Signature` (HMAC-SHA256 hex digest)
### Verification
```bash
GITEA_TOKEN=$(grep ORCH_GITEA_TOKEN /home/slin/repos/orchestrator/.env | cut -d= -f2)
curl -s "http://localhost:3000/api/v1/repos/admin/enduro-trails/hooks" \
-H "Authorization: token ${GITEA_TOKEN}" | python3 -m json.tool
```
### Re-creating (if needed)
```bash
GITEA_WEBHOOK_SECRET=$(openssl rand -hex 20)
# Update in .env: ORCH_GITEA_WEBHOOK_SECRET=<new_secret>
curl -X POST "http://localhost:3000/api/v1/repos/admin/enduro-trails/hooks" \
-H "Authorization: token ${GITEA_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"type": "gitea",
"active": true,
"config": {
"url": "https://openclaw.mva154.duckdns.org/orchestrator/webhook/gitea",
"content_type": "json",
"secret": "'${GITEA_WEBHOOK_SECRET}'"
},
"events": ["push", "pull_request", "status"],
"branch_filter": "*"
}'
```
---
## Plane Webhook
**Created directly in PostgreSQL** (Plane CE does not expose the webhook API through the external /api/v1/).
- URL: `https://openclaw.mva154.duckdns.org/orchestrator/webhook/plane`
- Events: `issue` (work_item.created), `issue_comment` (comment.created)
- Secret: the value of `ORCH_PLANE_WEBHOOK_SECRET` in `.env`
- Signature header: `X-Plane-Signature` (HMAC-SHA256 hex digest)
### Verification
```bash
docker exec -e PGPASSWORD=plane plane-app-plane-db-1 psql -U plane -d plane -c \
"SELECT id, url, is_active FROM webhooks;"
```
### Manual setup via the UI (alternative)
1. Open `https://plane.mva154.duckdns.org`
2. Workspace Settings → Webhooks → Add Webhook
3. URL: `https://openclaw.mva154.duckdns.org/orchestrator/webhook/plane`
4. Secret: the value of `ORCH_PLANE_WEBHOOK_SECRET` in `.env`
5. Events: Issue, Issue Comment
6. Save
### Re-creating via SQL
```bash
PLANE_WEBHOOK_SECRET=$(openssl rand -hex 20)
# Update in .env: ORCH_PLANE_WEBHOOK_SECRET=<new_secret>
WORKSPACE_ID=$(docker exec -e PGPASSWORD=plane plane-app-plane-db-1 psql -U plane -d plane -t -A -c \
"SELECT id FROM workspaces WHERE slug='ag_proj'")
WEBHOOK_ID=$(cat /proc/sys/kernel/random/uuid)
docker exec -e PGPASSWORD=plane plane-app-plane-db-1 psql -U plane -d plane -c "
INSERT INTO webhooks (id, created_at, updated_at, deleted_at, workspace_id, url, is_active, secret_key, project, issue, module, cycle, issue_comment, is_internal, version)
VALUES ('${WEBHOOK_ID}', NOW(), NOW(), NULL, '${WORKSPACE_ID}',
'https://openclaw.mva154.duckdns.org/orchestrator/webhook/plane',
true, '${PLANE_WEBHOOK_SECRET}', true, true, false, false, true, false, 'v1');
"
```
---
## HMAC Signature Verification
Both handlers verify the signature:
- If the secret is empty in `.env`, verification is skipped (for dev/debug)
- If the secret is set, a request without a valid signature gets `401 Unauthorized`
### Signature format
| Source | Header | Algorithm | Format |
|--------|--------|-----------|--------|
| Gitea | `X-Gitea-Signature` | HMAC-SHA256 | hex digest (no prefix) |
| Plane | `X-Plane-Signature` | HMAC-SHA256 | hex digest |
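
A minimal sketch of the check described above, using only the standard library. The function name `verify_signature` is hypothetical and the actual handlers may differ; the skip-on-empty-secret branch follows the behavior documented in this section.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature: str) -> bool:
    """Compare the header value with the HMAC-SHA256 hex digest of the raw body."""
    if not secret:
        return True  # empty secret: verification is skipped (dev/debug only)
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(expected.encode(), (signature or "").encode())


body = b'{"ref":"refs/heads/test"}'
good = hmac.new(b"s3cret", body, hashlib.sha256).hexdigest()
print(verify_signature("s3cret", body, good))
print(verify_signature("s3cret", body, "deadbeef"))
```

The digest must be computed over the raw request bytes, before any JSON parsing, because re-serializing the payload can change whitespace and break the signature.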
### Testing the signature manually
```bash
SECRET=$(grep ORCH_GITEA_WEBHOOK_SECRET /home/slin/repos/orchestrator/.env | cut -d= -f2)
BODY='{"ref":"refs/heads/test","repository":{"name":"enduro-trails"},"commits":[]}'
SIG=$(echo -n "${BODY}" | openssl dgst -sha256 -hmac "${SECRET}" | awk '{print $NF}')
curl -X POST http://localhost:8500/webhook/gitea \
-H "Content-Type: application/json" \
-H "X-Gitea-Event: push" \
-H "X-Gitea-Signature: ${SIG}" \
-d "${BODY}"
# Expected: {"status":"accepted"}
```
---
## Environment variables (.env)
| Variable | Description |
|----------|-------------|
| `ORCH_GITEA_WEBHOOK_SECRET` | HMAC secret for the Gitea webhook |
| `ORCH_PLANE_WEBHOOK_SECRET` | HMAC secret for the Plane webhook |
| `ORCH_GITEA_TOKEN` | API token for Gitea |
| `ORCH_PLANE_API_TOKEN` | API token for Plane |
---
## Troubleshooting
```bash
# Orchestrator logs
docker logs orchestrator --tail 50 2>&1 | grep -i "webhook\|signature\|401"
# Events in the DB
docker exec orchestrator python3 -c "
import sqlite3
conn = sqlite3.connect('/app/data/orchestrator.db')
for r in conn.execute('SELECT id, source, event_type, timestamp FROM events ORDER BY id DESC LIMIT 10').fetchall():
    print(r)
"
# Gitea webhook delivery history
# Gitea UI → Settings → Webhooks → click webhook → Recent Deliveries
```
---
*Created: 2026-05-21 | Author: Dev agent*


@@ -2,3 +2,4 @@ fastapi==0.115.0
uvicorn[standard]==0.30.0
pydantic-settings==2.5.0
httpx==0.27.0
pytest==8.3.3


@@ -1,11 +1,21 @@
import subprocess
import os
import logging
import threading
import signal
from ..config import settings
from ..db import get_db
from ..db import get_db, get_task_by_repo_branch, update_task_stage, enqueue_job
from ..stages import get_next_stage, get_qg_for_stage, get_agent_for_stage
from ..git_worktree import ensure_worktree, get_worktree_path
from ..qg.checks import QG_CHECKS
from ..notifications import notify_stage_change, notify_qg_failure, notify_agent_started, notify_agent_finished, notify_approve_requested
from ..plane_sync import notify_stage_change as plane_notify_stage, add_comment as plane_add_comment
logger = logging.getLogger("orchestrator.launcher")
class AgentLauncher:
"""Launch Claude CLI agents for specific tasks."""
"""Launch Claude CLI agents directly (binary mounted into container)."""
AGENT_CONFIGS = {
"analyst": {
@@ -17,6 +27,7 @@ class AgentLauncher:
"system_prompt": ".openclaw/agents/architect.md",
"task_file": ".task-arch.md",
"allowed_tools": "Read,Write,Edit,Bash",
"model": "opus",
},
"developer": {
"system_prompt": ".openclaw/agents/developer.md",
@@ -27,68 +38,144 @@ class AgentLauncher:
"system_prompt": ".openclaw/agents/reviewer.md",
"task_file": ".task-review.md",
"allowed_tools": "Read,Write,Edit,Bash",
"model": "opus",
},
"tester": {
"system_prompt": ".openclaw/agents/tester.md",
"task_file": ".task-test.md",
"allowed_tools": "Read,Write,Edit,Bash",
},
"deployer": {
"task_file": ".task-deploy.md",
"system_prompt": ".openclaw/agents/deployer.md",
"allowed_tools": "Read,Write,Edit,Bash",
},
}
def launch(self, agent: str, repo: str, task_content: str = None) -> int:
CLAUDE_BIN = "/opt/claude-code/bin/claude.exe"
AGENT_TIMEOUT = 1800 # 30 minutes
def launch(self, agent: str, repo: str, task_content: str = None, task_id: int = None) -> int:
"""
Launch a Claude CLI agent.
Launch a Claude CLI agent directly (legacy synchronous path).
Kept for backward compatibility (direct callers / existing tests). The
ORCH-1 job queue uses launch_job() instead, but both share _spawn().
Args:
agent: Agent role (analyst, architect, developer, reviewer, tester)
repo: Repository name
task_content: Optional task content to write to task file
task_id: Optional task ID to associate with this run
Returns:
agent_run_id from DB
"""
return self._spawn(agent, repo, task_content, task_id, job_id=None)
def launch_job(self, job: dict) -> int:
"""ORCH-1: launch an agent for a claimed queue job.
Same spawn path as launch(), but threads job['id'] through so the monitor
can update the job's status (done / requeue / failed) and link jobs.run_id
to the agent_runs row. Returns the agent_run_id.
"""
return self._spawn(
job["agent"],
job["repo"],
job.get("task_content"),
job.get("task_id"),
job_id=job["id"],
)
def _spawn(self, agent: str, repo: str, task_content: str = None,
task_id: int = None, job_id: int = None) -> int:
"""Shared spawn implementation for launch() and launch_job().
When job_id is set, the monitor/watchdog drive the jobs table status
(ORCH-1). The claude-CLI Popen logic (B-2) and worktree/task-file logic
(B-1 / ORCH-2) are unchanged.
"""
config = self.AGENT_CONFIGS.get(agent)
if not config:
raise ValueError(f"Unknown agent: {agent}")
repo_path = os.path.join(settings.repos_dir, repo)
if not os.path.isdir(repo_path):
raise FileNotFoundError(f"Repo not found: {repo_path}")
# Main clone lives at /repos/<repo>; the agent works in an isolated worktree
# (ORCH-2 / S-4) so concurrent tasks never fight over a shared checkout.
local_repo_path = os.path.join(settings.repos_dir, repo)
if not os.path.isdir(local_repo_path):
raise FileNotFoundError(f"Repo not found: {local_repo_path}")
# Write task file if content provided
# Determine branch (needed before we touch the worktree / task file).
_br_row = get_db().execute("SELECT branch FROM tasks WHERE id=?", (task_id,)).fetchone() if task_id else None
agent_branch = _br_row[0] if _br_row else "main"
# Ensure the per-branch worktree exists and is on the right branch.
work_path = ensure_worktree(repo, agent_branch)
# Write task file if content provided (B-1: direct write; now into the worktree).
if task_content:
task_path = os.path.join(repo_path, config["task_file"])
with open(task_path, "w") as f:
f.write(task_content)
self._write_task_file(repo, agent_branch, config["task_file"], task_content)
# Record run in DB
conn = get_db()
cursor = conn.execute(
"INSERT INTO agent_runs (task_id, agent) VALUES (NULL, ?)",
(agent,),
"INSERT INTO agent_runs (task_id, agent) VALUES (?, ?)",
(task_id, agent),
)
run_id = cursor.lastrowid
conn.commit()
# Prepare output log
# ORCH-1: link this job to the agent_runs row and stamp started_at.
if job_id is not None:
conn.execute(
"UPDATE jobs SET run_id = ?, started_at = datetime('now') WHERE id = ?",
(run_id, job_id),
)
conn.commit()
# Prepare output log path
output_path = f"/app/data/runs/{run_id}.log"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
# Build shell command
# Build the claude command
task_file = config["task_file"]
system_prompt = config["system_prompt"]
allowed_tools = config["allowed_tools"]
model = config.get("model", "")
model_flag = f"--model {model} " if model else ""
# No git fetch/checkout here: ensure_worktree() already put the worktree on
# the right branch. The agent simply runs inside its isolated work_path.
cmd = (
f'cd {repo_path} && {settings.claude_bin} --print '
f'"$(cat {config["task_file"]})" '
f'--system-prompt "$(cat {config["system_prompt"]})" '
f'--allowedTools {config["allowed_tools"]}'
f'cd {work_path} && '
f'{self.CLAUDE_BIN} --print '
f'{model_flag}'
f'"$(cat {task_file})" '
f'--system-prompt "$(cat {system_prompt})" '
f'--allowedTools {allowed_tools}'
)
# Launch as background process
with open(output_path, "w") as log_file:
subprocess.Popen(
["bash", "-c", cmd],
stdout=log_file,
stderr=subprocess.STDOUT,
cwd=repo_path,
logger.info(f"Launching agent '{agent}' for repo '{repo}', run_id={run_id}")
# Launch as background process.
# B-2 fix: redirect stdout/stderr straight to the log file at the OS level.
# No PIPE in the orchestrator process -> no PIPE deadlock, no reader thread,
# no zombies. log_fh is closed by _monitor_agent after proc.wait().
log_fh = open(output_path, "w")
proc = subprocess.Popen(
["bash", "-c", cmd],
stdout=log_fh,
stderr=subprocess.STDOUT,
env={
**os.environ,
"HOME": "/home/slin",
"GIT_AUTHOR_NAME": "claude-bot",
"GIT_AUTHOR_EMAIL": "claude-bot@mva154.local",
"GIT_COMMITTER_NAME": "claude-bot",
"GIT_COMMITTER_EMAIL": "claude-bot@mva154.local",
},
)
# Update DB with output path
@@ -99,7 +186,581 @@ class AgentLauncher:
conn.commit()
conn.close()
# Start timeout watchdog
t = threading.Thread(
target=self._watchdog,
args=(proc.pid, run_id),
kwargs={"job_id": job_id},
daemon=True,
)
t.start()
# Start monitor thread (waits for completion, commits, pushes)
# agent_branch already computed above
m = threading.Thread(
target=self._monitor_agent,
args=(proc, run_id, agent, repo, agent_branch, output_path, log_fh),
kwargs={"job_id": job_id},
daemon=True,
)
m.start()
logger.info(f"Agent '{agent}' launched, pid={proc.pid}, run_id={run_id}")
notify_agent_started(run_id, agent, task_id)
return run_id
    def _watchdog(self, pid: int, run_id: int, timeout: int = None, job_id: int = None):
        """Kill agent if it exceeds timeout.
        ORCH-1: on a timeout-kill the monitor's proc.wait() returns the kill exit
        code and drives the job retry/fail logic, so the watchdog itself only needs
        to SIGKILL and record the agent_runs exit. job_id is accepted for symmetry.
        """
        import time
        if timeout is None:
            timeout = self.AGENT_TIMEOUT
        time.sleep(timeout)
        try:
            os.kill(pid, signal.SIGKILL)
            logger.warning(f"Agent run_id={run_id} killed after {timeout}s timeout")
            conn = get_db()
            conn.execute(
                "UPDATE agent_runs SET finished_at=datetime('now'), exit_code=-9 WHERE id=?",
                (run_id,),
            )
            conn.commit()
            conn.close()
        except ProcessLookupError:
            pass  # Already finished
def _monitor_agent(self, proc, run_id, agent, repo, branch, output_path=None, log_fh=None, job_id=None):
"""Wait for agent to finish, commit+push results, update DB.
B-2 fix: stdout already goes straight to the log file via Popen, so we just
block on proc.wait() (guaranteed reap -> no zombie, real exit_code) and then
close the log file handle. No PIPE, no select loop, no startup timeout here
(the watchdog still enforces the overall AGENT_TIMEOUT by pid).
"""
import time as _time
_start_ts = _time.time()
exit_code = proc.wait()
if log_fh is not None:
try:
log_fh.close()
except Exception:
pass
_duration_s = int(_time.time() - _start_ts)
logger.info(f"Agent run_id={run_id} ({agent}) finished with exit_code={exit_code}")
# Update DB
conn = get_db()
conn.execute(
"UPDATE agent_runs SET finished_at=datetime('now'), exit_code=? WHERE id=?",
(exit_code, run_id),
)
conn.commit()
# Get task_id for notification
_row = conn.execute("SELECT task_id FROM agent_runs WHERE id=?", (run_id,)).fetchone()
_task_id = _row[0] if _row else None
conn.close()
notify_agent_finished(run_id, agent, exit_code, task_id=_task_id, duration_s=_duration_s)
# Commit and push any changes — in the per-branch worktree (ORCH-2 / S-4),
# NOT in the shared /repos/<repo>. The worktree is already on `branch`
# (ensure_worktree did the checkout), so no checkout is needed here.
repo_path = get_worktree_path(repo, branch)
try:
git_env = {
**os.environ,
"HOME": "/home/slin",
"GIT_AUTHOR_NAME": "claude-bot",
"GIT_AUTHOR_EMAIL": "claude-bot@mva154.local",
"GIT_COMMITTER_NAME": "claude-bot",
"GIT_COMMITTER_EMAIL": "claude-bot@mva154.local",
}
result = subprocess.run(
["git", "-C", repo_path, "status", "--porcelain"],
capture_output=True, text=True, timeout=10, env=git_env
)
if result.stdout.strip():
# Add docs/ always
subprocess.run(
["git", "-C", repo_path, "add", "docs/"],
capture_output=True, text=True, timeout=10, env=git_env
)
# Add src/ and tests/ for developer
if agent == "developer":
subprocess.run(
["git", "-C", repo_path, "add", "src/", "tests/"],
capture_output=True, text=True, timeout=10, env=git_env
)
# Commit
commit_result = subprocess.run(
["git", "-C", repo_path, "commit", "-m",
f"{agent}(ET): auto-commit from {agent} run_id={run_id}"],
capture_output=True, text=True, timeout=30, env=git_env
)
if commit_result.returncode == 0:
push_result = subprocess.run(
["git", "-C", repo_path, "push", "origin", branch],
capture_output=True, text=True, timeout=60, env=git_env
)
if push_result.returncode == 0:
logger.info(f"Agent run_id={run_id}: committed and pushed to {branch}")
# Auto-create PR after developer pushes
if agent == "developer":
self._ensure_pr(repo, branch, run_id)
else:
logger.error(f"Agent run_id={run_id}: push failed: {push_result.stderr}")
else:
logger.warning(f"Agent run_id={run_id}: commit failed: {commit_result.stderr}")
else:
logger.info(f"Agent run_id={run_id}: no changes to commit")
except Exception as e:
logger.error(f"Agent run_id={run_id}: post-run git failed: {e}")
# Handle deployer failure (smoke/healthcheck failed) — Task 7
if exit_code != 0 and agent == "deployer":
conn = get_db()
task_row = conn.execute(
"SELECT id, work_item_id FROM tasks WHERE repo=? AND branch=?",
(repo, branch),
).fetchone()
conn.close()
if task_row:
_tid, _wid = task_row
update_task_stage(_tid, "development")
notify_stage_change(_tid, "deploy", "development")
plane_notify_stage(_wid, "deploy", "development")
from ..plane_sync import set_issue_blocked
set_issue_blocked(_wid)
plane_add_comment(
_wid,
"\u274c Deploy FAILED (smoke/healthcheck). Rolled back. Developer \u043d\u0443\u0436\u0435\u043d \u0434\u043b\u044f \u0444\u0438\u043a\u0441\u0430."
)
from ..notifications import send_telegram
send_telegram(f"\U0001f6a8 {_wid}: Deploy failed! Rolled back. Needs fix.")
# Notify on startup timeout (exit_code from kill = -9 or 137)
if exit_code != 0 and exit_code not in (None,):
conn = get_db()
task_row = conn.execute(
"SELECT id, work_item_id FROM tasks WHERE repo=? AND branch=?",
(repo, branch),
).fetchone()
conn.close()
if task_row and agent != "deployer": # deployer handled above
_tid, _wid = task_row
from ..notifications import send_telegram
send_telegram(f"\u26a0\ufe0f {_wid}: Agent {agent} failed (exit_code={exit_code}). Check logs: /app/data/runs/{run_id}.log")
# Auto-advance stage if agent finished successfully and QG passes
if exit_code == 0:
self._try_advance_stage(run_id, agent, repo, branch)
# ORCH-1: drive the job-queue status for queue-launched jobs only.
# (Legacy direct launch() has job_id=None and is unaffected.)
if job_id is not None:
self._finalize_job(job_id, agent, run_id, exit_code, output_path=output_path)
    def _backoff_seconds(self, transient_attempts: int, retry_after: int = None) -> int:
        """Exponential backoff for transient failures, honouring Retry-After.
        backoff = min(2^transient_attempts * base, max). If the server sent a
        Retry-After, use the larger of the two (never poll sooner than asked).
        """
        base = settings.backoff_base_seconds
        cap = settings.backoff_max_seconds
        backoff = min((2 ** max(transient_attempts, 0)) * base, cap)
        if retry_after is not None and retry_after > 0:
            backoff = max(backoff, min(retry_after, cap))
        return int(backoff)
    def _finalize_job(self, job_id: int, agent: str, run_id: int, exit_code, output_path=None):
        """ORCH-1: update the jobs row after the agent process finished.
        exit_code == 0 -> done (and resets the breaker streak via on_outcome).
        exit_code != 0 -> classify the failure from the run log tail (token-free):
        - TRANSIENT (429/overload/network): backoff-requeue with available_at in
          the future + a SEPARATE transient_attempts budget
          (settings.transient_max_attempts), honouring Retry-After. Reported to
          the breaker so it opens after N consecutive transient failures.
        - PERMANENT (code fault): ordinary attempts < max_attempts requeue,
          otherwise 'failed' + Telegram.
        """
        from ..db import get_job, mark_job
        from ..error_classifier import classify_log_file
        try:
            job = get_job(job_id)
            if not job:
                return
            if exit_code == 0:
                mark_job(job_id, "done", run_id=run_id)
                logger.info(f"Job {job_id} ({agent}) done (run_id={run_id})")
                self._record_outcome(transient=False, recovered=True)
                return
            # Classify the failure from the agent log tail (no token cost).
            kind, retry_after = "permanent", None
            log_path = output_path or f"/app/data/runs/{run_id}.log"
            try:
                kind, retry_after = classify_log_file(log_path)
            except Exception:
                pass
            if kind == "transient":
                self._finalize_transient(job_id, agent, run_id, exit_code, job, retry_after)
            else:
                self._finalize_permanent(job_id, agent, run_id, exit_code, job)
        except Exception as e:
            logger.error(f"Job {job_id}: _finalize_job error: {e}")
    def _finalize_transient(self, job_id, agent, run_id, exit_code, job, retry_after):
        """Transient (429/overload/net) failure -> backoff requeue or fail when budget out."""
        from ..db import mark_job, mark_job_transient
        tattempts = job.get("transient_attempts", 0)
        tmax = settings.transient_max_attempts
        err = (f"transient (429/overload) agent {agent} exit={exit_code} "
               f"(run_id={run_id}); retry_after={retry_after}")
        self._record_outcome(transient=True, recovered=False)
        if tattempts < tmax:
            backoff = self._backoff_seconds(tattempts + 1, retry_after)
            mark_job_transient(job_id, backoff, error=err)
            logger.warning(
                f"Job {job_id} ({agent}) TRANSIENT fail (exit={exit_code}), "
                f"backoff {backoff}s, transient_attempt {tattempts + 1}/{tmax}"
            )
        else:
            mark_job(job_id, "failed", run_id=run_id, error=err)
            logger.error(
                f"Job {job_id} ({agent}) failed after {tattempts} transient attempts"
            )
            self._notify_failed(job_id, agent, job, run_id,
                                f"transient (rate-limit) after {tattempts} attempts")
    def _finalize_permanent(self, job_id, agent, run_id, exit_code, job):
        """Permanent (code-fault) failure -> normal attempts<max requeue, then fail."""
        from ..db import mark_job
        attempts = job.get("attempts", 0)
        max_attempts = job.get("max_attempts", 2)
        err = f"agent {agent} exit_code={exit_code} (run_id={run_id})"
        self._record_outcome(transient=False, recovered=False)
        if attempts < max_attempts:
            mark_job(job_id, "queued", run_id=run_id, error=err)
            logger.warning(
                f"Job {job_id} ({agent}) failed (exit={exit_code}), "
                f"requeued (attempt {attempts}/{max_attempts})"
            )
        else:
            mark_job(job_id, "failed", run_id=run_id, error=err)
            logger.error(
                f"Job {job_id} ({agent}) failed permanently after "
                f"{attempts} attempts (exit={exit_code})"
            )
            self._notify_failed(job_id, agent, job, run_id,
                                f"{attempts} attempts (exit={exit_code})")
    def _notify_failed(self, job_id, agent, job, run_id, why):
        try:
            from ..notifications import send_telegram
            send_telegram(
                f"\U0001f6a8 Job {job_id} ({agent}, repo {job.get('repo')}) "
                f"failed: {why}. Logs: /app/data/runs/{run_id}.log"
            )
        except Exception:
            pass
    def _record_outcome(self, transient: bool, recovered: bool):
        """Forward the run outcome to the circuit breaker (if a worker is wired).
        Decoupled via a settable callback (set by QueueWorker.start) so the launcher
        does not hard-import the worker (avoids a cycle) and tests can run the
        launcher standalone.
        """
        cb = getattr(self, "on_outcome", None)
        if cb:
            try:
                cb(transient=transient, recovered=recovered)
            except Exception:
                pass
def _try_advance_stage(self, run_id: int, agent: str, repo: str, branch: str):
"""After agent finishes successfully, check QG and advance stage if possible."""
try:
conn = get_db()
task_row = conn.execute(
"SELECT id, stage, work_item_id FROM tasks WHERE repo=? AND branch=?",
(repo, branch),
).fetchone()
conn.close()
if not task_row:
return
task_id, current_stage, work_item_id = task_row
qg_name = get_qg_for_stage(current_stage)
next_stage = get_next_stage(current_stage)
if not next_stage:
return
# Run QG check if defined
if qg_name and qg_name in QG_CHECKS:
check_fn = QG_CHECKS[qg_name]
if qg_name in ("check_analysis_approved",):
# Requires human approval - post request comment if analyst just finished
if agent == "analyst" and qg_name == "check_analysis_approved" and work_item_id:
files_check = QG_CHECKS.get("check_analysis_complete")
if files_check:
files_ok, _ = files_check(repo, work_item_id, branch)
if files_ok:
# Full artifacts ready -> In Review
from ..plane_sync import set_issue_in_review
set_issue_in_review(work_item_id)
plane_add_comment(
work_item_id,
"\U0001f4cb BRD/spec/AC/TestPlan are ready. "
"Please review and react with :approved: to advance to Architecture."
)
notify_approve_requested(task_id)
logger.info(f"Task {task_id}: analyst finished, requested :approved: in Plane")
else:
# Check if questions file exists (in the task worktree)
import os as _os
questions_path = _os.path.join(
get_worktree_path(repo, branch),
f"docs/work-items/{work_item_id}/01-questions.md"
)
if _os.path.isfile(questions_path):
# Analyst has questions -> Needs Input
from ..plane_sync import set_issue_needs_input
set_issue_needs_input(work_item_id)
with open(questions_path, "r") as qf:
questions_text = qf.read()
plane_add_comment(
work_item_id,
f"\u2753 Analyst needs clarification:\n\n{questions_text}"
)
from ..notifications import send_telegram
send_telegram(
f"\u2753 {work_item_id}: Analyst is asking questions. Reply in Plane."
)
else:
# No artifacts and no questions
plane_add_comment(
work_item_id,
"\u26a0\ufe0f Analyst finished with no artifacts and no questions. Check the log."
)
return
elif qg_name in ("check_ci_green", "check_tests_local"):
# (repo, branch) signature — already worktree-aware.
passed, reason = check_fn(repo, branch)
elif qg_name == "check_tests_passed":
# Artifact check — pass branch so it reads from the worktree.
passed, reason = check_fn(repo, work_item_id or "", branch)
else:
# Other artifact checks (check_architecture_done, etc.) — worktree-aware.
passed, reason = check_fn(repo, work_item_id or "", branch)
if not passed:
logger.info(f"Task {task_id}: QG '{qg_name}' not passed after {agent}: {reason}")
# If reviewer says REQUEST_CHANGES, rollback to development
if agent == "reviewer" and "REQUEST_CHANGES" in reason:
update_task_stage(task_id, "development")
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
# Count retries
conn2 = get_db()
retry_count = conn2.execute(
"SELECT COUNT(*) FROM agent_runs WHERE task_id=? AND agent='developer'",
(task_id,)
).fetchone()[0]
conn2.close()
if retry_count < 3:
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
f"Stage: development\nNote: REQUEST_CHANGES from reviewer "
f"(attempt {retry_count+1}/3). Fix findings in "
f"docs/work-items/{work_item_id}/12-review.md"
)
new_job = enqueue_job("developer", repo, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: reviewer REQUEST_CHANGES, enqueued developer (job_id={new_job})")
else:
from ..notifications import send_telegram
send_telegram(f"\u26a0\ufe0f {work_item_id}: Max developer retries (3) reached. Manual intervention needed.")
logger.error(f"Task {task_id}: max retries reached")
# Task 6: Tester FAIL -> rollback to development
if agent == "tester" and qg_name == "check_tests_passed" and not passed:
update_task_stage(task_id, "development")
notify_stage_change(task_id, current_stage, "development")
plane_notify_stage(work_item_id, current_stage, "development")
from ..plane_sync import set_issue_in_progress
set_issue_in_progress(work_item_id)
plane_add_comment(
work_item_id,
f"\u274c Tests failed: {reason}. Developer restarted to fix them."
)
conn2 = get_db()
retry_count = conn2.execute(
"SELECT COUNT(*) FROM agent_runs WHERE task_id=? AND agent='developer'",
(task_id,)
).fetchone()[0]
conn2.close()
if retry_count < 3:
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
f"Stage: development\nNote: Tests FAILED. "
f"Fix failures described in docs/work-items/{work_item_id}/13-test-report.md"
)
new_job = enqueue_job("developer", repo, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: tester FAIL, enqueued developer (job_id={new_job})")
else:
from ..notifications import send_telegram
from ..plane_sync import set_issue_blocked
set_issue_blocked(work_item_id)
send_telegram(f"\U0001f6a8 {work_item_id}: Tests still failing after 3 developer retries. Manual intervention needed.")
# Task 8: Architect conflict -> rollback to analysis
if agent == "architect" and qg_name == "check_architecture_done" and not passed:
import os as _os
conflict_path = _os.path.join(
get_worktree_path(repo, branch),
f"docs/work-items/{work_item_id}/10-conflict.md"
)
if _os.path.isfile(conflict_path):
update_task_stage(task_id, "analysis")
notify_stage_change(task_id, current_stage, "analysis")
plane_notify_stage(work_item_id, current_stage, "analysis")
from ..plane_sync import set_issue_in_progress
set_issue_in_progress(work_item_id)
with open(conflict_path, "r") as cf:
conflict_text = cf.read()[:500]
plane_add_comment(
work_item_id,
f"\u26a0\ufe0f Architect found a conflict with the spec. Returning to Analysis.\n\n{conflict_text}"
)
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
f"Stage: analysis\nNote: Architect conflict. Revise TRZ. "
f"See docs/work-items/{work_item_id}/10-conflict.md"
)
new_job = enqueue_job("analyst", repo, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: architect conflict, enqueued analyst (job_id={new_job})")
return
return
elif qg_name:
return
# Advance stage
update_task_stage(task_id, next_stage)
notify_stage_change(task_id, current_stage, next_stage)
plane_notify_stage(work_item_id, current_stage, next_stage)
logger.info(f"Task {task_id}: {current_stage} -> {next_stage} (auto-advance after {agent})")
# Launch next agent if defined
next_agent = get_agent_for_stage(next_stage)
if next_agent:
task_desc = f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\nStage: {next_stage}"
new_job_id = enqueue_job(next_agent, repo, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: enqueued '{next_agent}' (job_id={new_job_id})")
except Exception as e:
logger.error(f"Auto-advance failed for run_id={run_id}: {e}")
def _ensure_pr(self, repo: str, branch: str, run_id: int):
import httpx
owner = settings.gitea_owner
headers = {"Authorization": f"token {settings.gitea_token}"}
base_url = f"{settings.gitea_url}/api/v1"
try:
resp = httpx.get(
f"{base_url}/repos/{owner}/{repo}/pulls",
params={"state": "open", "head": branch},
headers=headers, timeout=10
)
resp.raise_for_status()
prs = resp.json()
if prs:
return prs[0]["number"]
parts = branch.split("/")
title = parts[-1] if parts else branch
resp = httpx.post(
f"{base_url}/repos/{owner}/{repo}/pulls",
json={"title": f"feat: {title}", "head": branch, "base": "main",
"body": f"Auto-created by orchestrator after developer run_id={run_id}"},
headers=headers, timeout=10
)
resp.raise_for_status()
pr_number = resp.json()["number"]
logger.info(f"Created PR #{pr_number} for {branch}")
return pr_number
except Exception as e:
logger.error(f"Failed to create PR for {branch}: {e}")
return None
def _auto_merge_pr(self, repo: str, branch: str, task_id: int, work_item_id: str):
import httpx
owner = settings.gitea_owner
headers = {"Authorization": f"token {settings.gitea_token}"}
base_url = f"{settings.gitea_url}/api/v1"
try:
resp = httpx.get(
f"{base_url}/repos/{owner}/{repo}/pulls",
params={"state": "open", "head": branch},
headers=headers, timeout=10
)
resp.raise_for_status()
prs = resp.json()
if not prs:
pr_number = self._ensure_pr(repo, branch, 0)
if not pr_number:
return False
else:
pr_number = prs[0]["number"]
resp = httpx.post(
f"{base_url}/repos/{owner}/{repo}/pulls/{pr_number}/merge",
json={"Do": "merge"},
headers=headers, timeout=30
)
if resp.status_code in (200, 204):
logger.info(f"PR #{pr_number} merged for {branch}")
update_task_stage(task_id, "done")
notify_stage_change(task_id, "deploy", "done")
plane_notify_stage(work_item_id, "deploy", "done")
from ..notifications import send_telegram
send_telegram(f"\u2705 {work_item_id}: PR #{pr_number} merged! deploy -> done. Task complete.")
return True
else:
logger.error(f"Merge failed for PR #{pr_number}: {resp.status_code} {resp.text}")
from ..notifications import send_telegram
send_telegram(f"\u26a0\ufe0f {work_item_id}: Auto-merge failed (HTTP {resp.status_code}). Manual merge needed.")
return False
except Exception as e:
logger.error(f"Auto-merge failed for {branch}: {e}")
return False
def _write_task_file(self, repo: str, branch: str, task_file: str, content: str):
"""Write task file directly into the task's worktree.
B-1 fix: no docker (direct open()). ORCH-2/S-4: the target is the per-branch
worktree (/repos/_wt/<repo>/<branch>), not the shared /repos/<repo>, so the
agent reads the task file from its own isolated working copy.
Raise on failure instead of silently swallowing errors.
"""
work_path = get_worktree_path(repo, branch) # /repos/_wt/<repo>/<branch>
full_path = os.path.join(work_path, task_file)
try:
with open(full_path, "w", encoding="utf-8") as f:
f.write(content)
logger.info(f"Task file written: {full_path} ({len(content)} bytes)")
except OSError as e:
logger.error(f"Failed to write task file {full_path}: {e}")
raise RuntimeError(f"Failed to write task file: {e}")
launcher = AgentLauncher()


@@ -7,19 +7,57 @@ class Settings(BaseSettings):
plane_api_token: str = ""
plane_workspace_slug: str = ""
plane_webhook_secret: str = ""
plane_project_id: str = ""
# Gitea
gitea_url: str = "http://localhost:3000"
gitea_token: str = ""
gitea_webhook_secret: str = ""
gitea_owner: str = "admin"
default_repo: str = "enduro-trails"
# ORCH-6: multi-repo project registry. JSON array of
# {plane_project_id, repo, work_item_prefix, name}.
# Empty -> built-in default registry in src/projects.py.
projects_json: str = ""
# Claude CLI
-claude_bin: str = "/usr/bin/claude"
-repos_dir: str = "/home/slin/repos"
+claude_bin: str = "/opt/claude-code/bin/claude.exe"
+repos_dir: str = "/repos"
host_repos_dir: str = "/home/slin/repos"
worktrees_dir: str = "/repos/_wt" # ORCH-2 / S-4: isolated worktree per task/branch
# DB
db_path: str = "/app/data/orchestrator.db"
# ORCH-1 (F-2b): persistent job queue / background worker.
# max_concurrency -> max agent jobs running in parallel (env ORCH_MAX_CONCURRENCY)
# queue_poll_interval -> worker loop poll seconds (env ORCH_QUEUE_POLL_INTERVAL)
max_concurrency: int = 1
queue_poll_interval: float = 2.0
# ORCH-1b (resilience): preflight + 429/rate-limit + backoff + circuit breaker.
# preflight_cache_ttl -> cache the cheap CLI/network preflight result (seconds);
# the worker does NOT re-run `claude --version` more often
# than this (env ORCH_PREFLIGHT_CACHE_TTL).
# backoff_base_seconds -> base for exponential transient backoff.
# backoff_max_seconds -> ceiling for the transient backoff.
# transient_max_attempts -> retry budget for transient (429/overload/network)
# failures, separate from code-fault `attempts`.
# breaker_threshold -> consecutive transient failures that OPEN the breaker.
# breaker_pause_seconds -> how long the breaker stays open before half-open.
preflight_cache_ttl: int = 45
backoff_base_seconds: int = 10
backoff_max_seconds: int = 600
transient_max_attempts: int = 5
breaker_threshold: int = 3
breaker_pause_seconds: int = 300
# Telegram notifications
telegram_bot_token: str = ""
telegram_chat_id: str = ""
class Config:
env_prefix = "ORCH_"
env_file = ".env"
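
The `breaker_threshold` and `breaker_pause_seconds` settings above describe a closed / open / half-open cycle. The breaker class itself is not part of this hunk, so the following is only a minimal sketch of how such a cycle could behave. `BreakerSketch`, its method names, and the injectable `clock` are hypothetical, and the handling of permanent failures is an assumption.

```python
import time


class BreakerSketch:
    """Hypothetical three-state breaker; not the orchestrator's CircuitBreaker."""

    def __init__(self, threshold=3, pause_seconds=300, clock=time.monotonic):
        self.threshold = threshold          # mirrors breaker_threshold
        self.pause_seconds = pause_seconds  # mirrors breaker_pause_seconds
        self.clock = clock                  # injectable so the demo needs no sleeping
        self.consecutive_transient = 0
        self.opened_at = None

    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.pause_seconds:
            return "half-open"  # pause elapsed: one probe job may run
        return "open"

    def record(self, transient, recovered):
        if recovered:
            # A successful run closes the breaker and resets the streak.
            self.consecutive_transient = 0
            self.opened_at = None
        elif transient:
            self.consecutive_transient += 1
            if self.consecutive_transient >= self.threshold:
                # Opens the breaker, or re-opens it after a failed probe.
                self.opened_at = self.clock()
        # Assumption: a permanent (code-fault) failure leaves the streak untouched.


now = [0.0]  # fake clock value, advanced by hand
breaker = BreakerSketch(clock=lambda: now[0])
for _ in range(3):
    breaker.record(transient=True, recovered=False)
state_after_three = breaker.state()

now[0] = 300.0
state_after_pause = breaker.state()

breaker.record(transient=False, recovered=True)
state_after_recovery = breaker.state()
```

The keyword arguments of `record` match the `cb(transient=..., recovered=...)` call made by `_record_outcome`, so a callback of this shape could be wired to `on_outcome`.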

src/db.py

@@ -22,12 +22,14 @@ def init_db():
CREATE TABLE IF NOT EXISTS tasks (
id INTEGER PRIMARY KEY AUTOINCREMENT,
plane_id TEXT,
work_item_id TEXT,
repo TEXT NOT NULL,
branch TEXT,
stage TEXT DEFAULT 'created',
agent_running TEXT,
created_at TEXT DEFAULT (datetime('now')),
-updated_at TEXT DEFAULT (datetime('now'))
+updated_at TEXT DEFAULT (datetime('now')),
+plane_issue_id TEXT
);
CREATE TABLE IF NOT EXISTS agent_runs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
@@ -38,5 +40,334 @@ def init_db():
exit_code INTEGER,
output_path TEXT
);
-- ORCH-1 (F-2b): persistent job queue. Webhook handlers enqueue a job and
-- return immediately; a background worker claims jobs (respecting
-- max_concurrency), spawns the claude agent, and updates the status.
-- Restart-safe: running jobs are requeued on startup (queue-recovery).
CREATE TABLE IF NOT EXISTS jobs (
id INTEGER PRIMARY KEY AUTOINCREMENT,
agent TEXT NOT NULL,
repo TEXT NOT NULL,
task_id INTEGER, -- FK tasks.id (nullable)
task_content TEXT, -- written to the agent task_file
status TEXT NOT NULL DEFAULT 'queued', -- queued|running|done|failed
attempts INTEGER NOT NULL DEFAULT 0,
max_attempts INTEGER NOT NULL DEFAULT 2,
run_id INTEGER, -- agent_runs.id once started
error TEXT, -- last error message
transient_attempts INTEGER NOT NULL DEFAULT 0, -- ORCH-1 resilience: 429/transient retries
available_at TEXT, -- ORCH-1 resilience: backoff gate (claim when <= now)
created_at TEXT DEFAULT (datetime('now')),
started_at TEXT,
finished_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_jobs_status ON jobs(status, id);
""")
# Lightweight migration: add resilience columns to a pre-existing jobs table
# (CREATE TABLE IF NOT EXISTS won't add columns to an already-created table).
_ensure_column(conn, "jobs", "transient_attempts", "INTEGER NOT NULL DEFAULT 0")
_ensure_column(conn, "jobs", "available_at", "TEXT")
conn.close()
def _ensure_column(conn, table: str, column: str, decl: str):
"""Add a column to `table` if it does not already exist (idempotent migration)."""
cols = [r[1] for r in conn.execute(f"PRAGMA table_info({table})").fetchall()]
if column not in cols:
conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
conn.commit()
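
The idempotent migration above can be tried against an in-memory database. This is a small sketch that copies the `PRAGMA table_info` check; the return value and the two-column `jobs` table are added here for illustration and are not part of the diff.

```python
import sqlite3


def ensure_column(conn, table, column, decl):
    """Add `column` if missing. Returns True when a column was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    if column in cols:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
# Simulate a pre-existing table created before the resilience columns existed.
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO jobs (status) VALUES ('queued')")

first_run = ensure_column(conn, "jobs", "transient_attempts", "INTEGER NOT NULL DEFAULT 0")
second_run = ensure_column(conn, "jobs", "transient_attempts", "INTEGER NOT NULL DEFAULT 0")
# Existing rows pick up the column default.
backfilled = conn.execute("SELECT transient_attempts FROM jobs").fetchone()[0]
```

SQLite accepts `ADD COLUMN ... NOT NULL` only when a non-NULL default is supplied, which is why the declaration carries `DEFAULT 0`.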
def get_task_by_plane_id(plane_id: str) -> dict | None:
"""Find task by Plane work item ID (checks plane_id and plane_issue_id)."""
conn = get_db()
row = conn.execute(
"SELECT * FROM tasks WHERE plane_id = ? OR plane_issue_id = ?", (plane_id, plane_id)
).fetchone()
conn.close()
if row:
return dict(row)
return None
def get_task_by_repo_branch(repo: str, branch: str) -> dict | None:
"""Find task by repo and branch name."""
conn = get_db()
row = conn.execute(
"SELECT * FROM tasks WHERE repo = ? AND branch = ?", (repo, branch)
).fetchone()
conn.close()
if row:
return dict(row)
return None
def update_task_stage(task_id: int, stage: str):
"""Update task stage and timestamp."""
conn = get_db()
conn.execute(
"UPDATE tasks SET stage = ?, updated_at = datetime('now') WHERE id = ?",
(stage, task_id),
)
conn.commit()
conn.close()
def get_next_work_item_id(repo: str, prefix: str = "ET") -> str:
"""Generate next work item ID (e.g., ET-003 / ORCH-001).
ORCH-6: numbering is per (repo, prefix). The prefix comes from the project
registry (proj.work_item_prefix), so orchestrator issues number ORCH-001,
ORCH-002 independently of the ET sequence in enduro-trails. Default prefix
stays "ET" for backward compatibility with existing callers.
"""
conn = get_db()
row = conn.execute(
"SELECT work_item_id FROM tasks "
"WHERE repo = ? AND work_item_id LIKE ? AND work_item_id IS NOT NULL "
"ORDER BY id DESC LIMIT 1",
(repo, f"{prefix}-%"),
).fetchone()
conn.close()
if row and row["work_item_id"]:
# Parse <PREFIX>-003 -> 3, increment (keep the existing prefix).
existing_prefix, num = row["work_item_id"].rsplit("-", 1)
prefix = existing_prefix
next_num = int(num) + 1
else:
next_num = 1
return f"{prefix}-{next_num:03d}"
# ---------------------------------------------------------------------------
# ORCH-1 (F-2b): job queue helpers
# ---------------------------------------------------------------------------
def enqueue_job(
agent: str,
repo: str,
task_content: str | None = None,
task_id: int | None = None,
max_attempts: int = 2,
) -> int:
"""Enqueue a new job (status='queued'). Returns the new job id.
This is what webhook handlers call instead of launching an agent in-process:
it is a fast DB INSERT that returns immediately. The background worker
(queue_worker) picks the job up later.
"""
conn = get_db()
cursor = conn.execute(
"INSERT INTO jobs (agent, repo, task_id, task_content, max_attempts) "
"VALUES (?, ?, ?, ?, ?)",
(agent, repo, task_id, task_content, max_attempts),
)
job_id = cursor.lastrowid
conn.commit()
conn.close()
return job_id
def claim_next_job() -> dict | None:
"""Atomically claim the oldest queued job and mark it 'running'.
Atomicity: the UPDATE carries the `status='queued'` guard in its WHERE clause
and we check `rowcount`. If two worker ticks race for the same row, only the
first UPDATE flips it to 'running' (rowcount==1); the loser sees rowcount==0
and retries the SELECT. We rely on SQLite's default per-connection transaction
so the SELECT+UPDATE pair is consistent. Returns the claimed job dict or None
when the queue is empty.
"""
conn = get_db()
try:
while True:
row = conn.execute(
"SELECT id FROM jobs WHERE status='queued' "
"AND (available_at IS NULL OR available_at <= datetime('now')) "
"ORDER BY id LIMIT 1"
).fetchone()
if not row:
return None
job_id = row["id"]
cur = conn.execute(
"UPDATE jobs SET status='running', "
"attempts = attempts + 1, started_at = datetime('now') "
"WHERE id = ? AND status='queued'",
(job_id,),
)
conn.commit()
if cur.rowcount == 1:
claimed = conn.execute(
"SELECT * FROM jobs WHERE id = ?", (job_id,)
).fetchone()
return dict(claimed)
# Lost the race for this row; loop and try the next queued job.
finally:
conn.close()
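
The guarded claim and the `available_at` backoff gate can be exercised in isolation. The sketch below uses a reduced `jobs` schema with only the columns the claim touches, and a simplified `claim` helper that returns `None` on a lost race instead of looping as `claim_next_job` does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "status TEXT NOT NULL DEFAULT 'queued', "
    "attempts INTEGER NOT NULL DEFAULT 0, "
    "available_at TEXT)"
)
# Job 1 is still backing off for ten minutes; job 2 is ready now.
conn.execute("INSERT INTO jobs (available_at) VALUES (datetime('now', '+600 seconds'))")
conn.execute("INSERT INTO jobs DEFAULT VALUES")
conn.commit()


def claim(conn):
    """Simplified claim: returns the claimed job id, or None."""
    row = conn.execute(
        "SELECT id FROM jobs WHERE status='queued' "
        "AND (available_at IS NULL OR available_at <= datetime('now')) "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE jobs SET status='running', attempts = attempts + 1 "
        "WHERE id = ? AND status='queued'",
        (row["id"],),
    )
    conn.commit()
    return row["id"] if cur.rowcount == 1 else None


first_claim = claim(conn)   # skips the gated job 1 and takes job 2
second_claim = claim(conn)  # nothing claimable: job 1 is gated, job 2 is running

# A competing UPDATE for the already-claimed row changes nothing,
# which is what the rowcount check in claim_next_job detects.
late_rowcount = conn.execute(
    "UPDATE jobs SET status='running' WHERE id = ? AND status='queued'",
    (first_claim,),
).rowcount
```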
def mark_job_transient(job_id: int, available_at_sql_offset_seconds: int,
error: str | None = None) -> None:
"""ORCH-1 resilience: requeue a job after a *transient* failure (429/overload/net).
Increments `transient_attempts` (separate from the code-fault `attempts`),
sets status back to 'queued', and gates re-pickup via `available_at` =
now + backoff seconds. started_at/finished_at are cleared.
"""
conn = get_db()
sets = [
"status='queued'",
"transient_attempts = transient_attempts + 1",
"available_at = datetime('now', ?)",
"started_at = NULL",
"finished_at = NULL",
]
params: list = [f"+{int(available_at_sql_offset_seconds)} seconds"]
if error is not None:
sets.append("error = ?")
params.append(error)
params.append(job_id)
conn.execute(f"UPDATE jobs SET {', '.join(sets)} WHERE id = ?", params)
conn.commit()
conn.close()
def mark_job(
job_id: int,
status: str,
run_id: int | None = None,
error: str | None = None,
):
"""Update a job's status (queued|running|done|failed).
- run_id (optional): link to the agent_runs row that executed this job.
- error (optional): last error message (for failed/retry).
- 'done'/'failed' also stamp finished_at.
- 'queued' (requeue for retry) clears started_at/finished_at so the next
claim treats it as fresh.
"""
conn = get_db()
sets = ["status = ?"]
params: list = [status]
if run_id is not None:
sets.append("run_id = ?")
params.append(run_id)
if error is not None:
sets.append("error = ?")
params.append(error)
if status in ("done", "failed"):
sets.append("finished_at = datetime('now')")
elif status == "queued":
sets.append("started_at = NULL")
sets.append("finished_at = NULL")
params.append(job_id)
conn.execute(f"UPDATE jobs SET {', '.join(sets)} WHERE id = ?", params)
conn.commit()
conn.close()
def count_running_jobs() -> int:
"""Number of jobs currently in 'running' status (for max_concurrency)."""
conn = get_db()
n = conn.execute(
"SELECT COUNT(*) FROM jobs WHERE status='running'"
).fetchone()[0]
conn.close()
return int(n)
def requeue_running_jobs() -> int:
"""Queue-recovery: on startup, any job left 'running' belongs to a worker that
died on restart -> put it back to 'queued'. `attempts` is not reset here, and
claim_next_job increments it again on the next pickup, so the interrupted run
still counts against max_attempts. Returns the number of requeued jobs.
"""
conn = get_db()
cur = conn.execute(
"UPDATE jobs SET status='queued', started_at = NULL "
"WHERE status='running'"
)
conn.commit()
n = cur.rowcount
conn.close()
return int(n)
def get_job(job_id: int) -> dict | None:
"""Fetch a single job by id."""
conn = get_db()
row = conn.execute("SELECT * FROM jobs WHERE id = ?", (job_id,)).fetchone()
conn.close()
return dict(row) if row else None
def job_status_counts() -> dict:
"""Return counts grouped by status (for /queue observability)."""
conn = get_db()
rows = conn.execute(
"SELECT status, COUNT(*) AS n FROM jobs GROUP BY status"
).fetchall()
conn.close()
counts = {"queued": 0, "running": 0, "done": 0, "failed": 0}
for r in rows:
counts[r["status"]] = r["n"]
return counts
def recent_jobs(limit: int = 10) -> list[dict]:
"""Return the most recent jobs (for /queue observability)."""
conn = get_db()
rows = conn.execute(
"SELECT * FROM jobs ORDER BY id DESC LIMIT ?", (limit,)
).fetchall()
conn.close()
return [dict(r) for r in rows]
# ---------------------------------------------------------------------------
# ORCH-1b (resilience): transient backoff helpers
# ---------------------------------------------------------------------------
def requeue_job_transient(job_id: int, delay_seconds: float, error: str | None = None):
"""ORCH-1b: requeue a job after a TRANSIENT (429/overload/network) failure.
Unlike a code-fault requeue, this:
- increments `transient_attempts` (a separate budget from code-fault attempts)
- sets `available_at = now + delay_seconds` so claim_next_job won't pick it
up until the backoff window elapses
- sets status back to 'queued' and clears started_at/finished_at
delay_seconds is computed by the caller (exp backoff, capped, Retry-After).
"""
conn = get_db()
conn.execute(
"UPDATE jobs SET status='queued', "
"transient_attempts = transient_attempts + 1, "
"available_at = datetime('now', ? || ' seconds'), "
"started_at = NULL, finished_at = NULL, "
"error = COALESCE(?, error) "
"WHERE id = ?",
(f"+{int(round(delay_seconds))}", error, job_id),
)
conn.commit()
conn.close()
def compute_backoff(transient_attempts: int, retry_after: float | None = None) -> float:
"""ORCH-1b: exponential backoff (seconds) for a transient failure.
delay = min(2**transient_attempts * base, max). If the server sent a
Retry-After hint we use the larger of the two, still capped at max, so we
only retry sooner than the server asked when the hint exceeds the cap.
`transient_attempts` is the count AFTER this failure (i.e. how many transient
failures have occurred), so the first backoff uses 2**1.
"""
base = getattr(settings, "backoff_base_seconds", 10)
cap = getattr(settings, "backoff_max_seconds", 600)
exp = min((2 ** max(transient_attempts, 0)) * base, cap)
if retry_after is not None and retry_after > 0:
return float(min(max(exp, retry_after), cap))
return float(exp)
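
To make the schedule concrete, here is the same formula as a standalone function. It is a sketch that takes `base` and `cap` as parameters instead of reading `settings`, with the defaults from the config hunk (10 and 600).

```python
def backoff_seconds(transient_attempts, retry_after=None, base=10, cap=600):
    """Standalone copy of the compute_backoff formula with explicit base/cap."""
    exp = min((2 ** max(transient_attempts, 0)) * base, cap)
    if retry_after is not None and retry_after > 0:
        # Use the larger of the two delays, but never exceed the cap.
        return float(min(max(exp, retry_after), cap))
    return float(exp)


# Delays for the 1st..6th transient failure with the default settings.
schedule = [backoff_seconds(n) for n in range(1, 7)]

# A Retry-After hint larger than the exponential delay wins...
with_hint = backoff_seconds(1, retry_after=30)
# ...unless it exceeds the cap, in which case the cap wins.
hint_over_cap = backoff_seconds(1, retry_after=900)
```

With these defaults the delays double from 20 s up to 320 s and hit the 600 s ceiling on the sixth failure.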

src/error_classifier.py

@@ -0,0 +1,87 @@
"""ORCH-1 resilience: classify an agent failure as transient vs permanent.
Rate limits / overload / network blips cannot be reliably predicted in advance,
so we classify *after the run* by scanning the agent's combined stdout/stderr log
(B-2 sends both to /app/data/runs/<run_id>.log).
- transient -> 429 / rate limit / overloaded / network / quota-exhausted etc.
=> backoff + transient retry (separate counter, larger budget).
- permanent -> a genuine code fault / agent error
=> normal attempts < max_attempts, then 'failed'.
Also extracts a Retry-After hint (seconds) when the server provided one.
"""
import re
# Case-insensitive substrings/patterns that signal a transient/rate-limit issue.
_TRANSIENT_PATTERNS = [
r"\b429\b",
r"rate[\s_-]*limit",
r"rate_limit_error",
r"overloaded",
r"overloaded_error",
r"too many requests",
r"quota",
r"insufficient[_\s-]*quota",
r"retry[\s-]*after",
r"service unavailable",
r"\b503\b",
r"\b529\b",
r"timed out",
r"timeout",
r"connection (reset|refused|error|aborted)",
r"temporarily unavailable",
r"econnreset",
r"etimedout",
]
_TRANSIENT_RE = re.compile("|".join(_TRANSIENT_PATTERNS), re.IGNORECASE)
# Retry-After: header style ("Retry-After: 30") or JSON ("retry_after": 30) or
# "retry after 30 seconds". Returns the integer seconds.
_RETRY_AFTER_RE = re.compile(
r"retry[\s_-]*after[\"']?\s*[:=]?\s*[\"']?\s*(\d+)",
re.IGNORECASE,
)
def classify_text(text: str) -> str:
"""Return 'transient' or 'permanent' for a chunk of log/stderr text."""
if not text:
return "permanent"
return "transient" if _TRANSIENT_RE.search(text) else "permanent"
def parse_retry_after(text: str) -> int | None:
"""Return Retry-After seconds if present in the text, else None."""
if not text:
return None
m = _RETRY_AFTER_RE.search(text)
if m:
try:
return int(m.group(1))
except (TypeError, ValueError):
return None
return None
def classify_log_file(path: str, tail_bytes: int = 16384) -> tuple[str, int | None]:
"""Classify the tail of a log file.
Reads the last `tail_bytes` of the log (rate-limit messages appear near the
end) and returns (classification, retry_after_seconds_or_None).
On any read error, treats it as 'permanent' (no special backoff).
"""
if not path:
return "permanent", None
try:
with open(path, "rb") as f:
try:
f.seek(-tail_bytes, 2)
except OSError:
f.seek(0)
data = f.read()
text = data.decode("utf-8", errors="replace")
except Exception:
return "permanent", None
return classify_text(text), parse_retry_after(text)
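
A quick way to see how the classifier behaves is to run a few log lines through it. The sketch below uses a reduced subset of the transient patterns and the same Retry-After regex. The sample log lines are invented for illustration and are not real agent output.

```python
import re

# Subset of _TRANSIENT_PATTERNS, enough for the samples below.
TRANSIENT_RE = re.compile(
    r"\b429\b|rate[\s_-]*limit|overloaded|timeout"
    r"|connection (reset|refused|error|aborted)",
    re.IGNORECASE,
)
RETRY_AFTER_RE = re.compile(
    r"retry[\s_-]*after[\"']?\s*[:=]?\s*[\"']?\s*(\d+)",
    re.IGNORECASE,
)


def classify(text):
    return "transient" if text and TRANSIENT_RE.search(text) else "permanent"


def retry_after(text):
    m = RETRY_AFTER_RE.search(text or "")
    return int(m.group(1)) if m else None


# Hypothetical log lines.
samples = {
    "rate_limited": 'API Error: 429 {"type":"rate_limit_error"} Retry-After: 30',
    "code_fault": "Traceback (most recent call last): KeyError: 'repo'",
    # A test name containing "timeout" also matches the bare `timeout` pattern.
    "test_name": "FAILED tests/test_api.py::test_timeout_handling - AssertionError",
}
results = {name: classify(text) for name, text in samples.items()}
header_hint = retry_after(samples["rate_limited"])
json_hint = retry_after('{"retry_after": 45}')
```

The third sample suggests that broad substrings such as `timeout` or `quota` can route an ordinary test failure into the transient budget. Whether that matters depends on what the agents actually write to their logs.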

src/git_worktree.py

@@ -0,0 +1,107 @@
"""Git worktree management — isolated working copy per task/branch (ORCH-2 / S-4).
Background
----------
Previously every git operation (checkout/commit/push/test) ran in the single shared
clone ``/repos/<repo>``. With two active tasks a ``git checkout`` of one branch would
overwrite the working copy of the other -> races (see AUDIT S-4 / ET-009 "two collectors").
Solution
--------
Each task (branch) gets an isolated git worktree::
/repos/<repo> <- main clone (fetch / worktree management)
/repos/_wt/<repo>/<safe-branch> <- worktree for one task/branch (agent works here)
A branch can only be checked out in ONE worktree at a time, which is exactly the
property we want: one task = one branch = one worktree.
"""
import os
import re
import subprocess
import logging
from .config import settings
logger = logging.getLogger("orchestrator.git_worktree")
def _safe(branch: str) -> str:
"""Filesystem-safe branch name for use in a path component."""
return re.sub(r"[^A-Za-z0-9._-]", "_", branch)
def get_worktree_path(repo: str, branch: str) -> str:
"""Path of the worktree for (repo, branch). Does NOT create it."""
return os.path.join(settings.worktrees_dir, repo, _safe(branch))
def _main_repo(repo: str) -> str:
return os.path.join(settings.repos_dir, repo)
def ensure_worktree(repo: str, branch: str) -> str:
"""Create (or reuse) an isolated worktree for ``branch``. Returns its path.
Main clone stays at ``/repos/<repo>``. Worktree lives at
``/repos/_wt/<repo>/<safe-branch>``.
- If the worktree already exists, it is fetched, the branch is checked out, and
it is hard-reset to ``origin/<branch>`` when that remote branch exists
(unpushed local commits on that branch are discarded).
- If the branch exists (locally or on origin) it is checked out into a fresh
worktree; otherwise a new branch is created from ``origin/main``.
"""
main_repo = _main_repo(repo)
wt = get_worktree_path(repo, branch)
if not os.path.isdir(main_repo):
raise FileNotFoundError(f"Main repo not found: {main_repo}")
# Always refresh refs in the main clone first.
subprocess.run(["git", "-C", main_repo, "fetch", "origin"],
capture_output=True, timeout=60)
# Reuse existing worktree (.git may be a dir or a file pointer for worktrees).
if os.path.isdir(os.path.join(wt, ".git")) or os.path.isfile(os.path.join(wt, ".git")):
subprocess.run(["git", "-C", wt, "fetch", "origin"], capture_output=True, timeout=60)
subprocess.run(["git", "-C", wt, "checkout", branch], capture_output=True, timeout=30)
# Align to remote only if the remote branch exists (avoid wiping local-only work).
rb = subprocess.run(
["git", "-C", wt, "rev-parse", "--verify", "--quiet", f"origin/{branch}"],
capture_output=True,
)
if rb.returncode == 0:
subprocess.run(["git", "-C", wt, "reset", "--hard", f"origin/{branch}"],
capture_output=True, timeout=30)
logger.info(f"Worktree reused: {wt} (branch {branch})")
return wt
os.makedirs(os.path.dirname(wt), exist_ok=True)
# Try to attach an existing branch (local or remote-tracking) to the new worktree.
r = subprocess.run(["git", "-C", main_repo, "worktree", "add", wt, branch],
capture_output=True, text=True, timeout=60)
if r.returncode != 0:
# Branch doesn't exist yet — create it from origin/main.
r2 = subprocess.run(
["git", "-C", main_repo, "worktree", "add", "-b", branch, wt, "origin/main"],
capture_output=True, text=True, timeout=60,
)
if r2.returncode != 0:
raise RuntimeError(
f"git worktree add failed for {repo}:{branch}: "
f"{r.stderr.strip()} | {r2.stderr.strip()}"
)
logger.info(f"Worktree ready: {wt} (branch {branch})")
return wt
def remove_worktree(repo: str, branch: str):
"""Remove the worktree for (repo, branch) — optional cleanup when a task is done."""
main_repo = _main_repo(repo)
wt = get_worktree_path(repo, branch)
subprocess.run(["git", "-C", main_repo, "worktree", "remove", "--force", wt],
capture_output=True, timeout=30)
# Prune dangling administrative entries.
subprocess.run(["git", "-C", main_repo, "worktree", "prune"],
capture_output=True, timeout=30)
logger.info(f"Worktree removed: {wt}")
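
The mapping from branch name to worktree directory can be checked without touching git. This sketch hard-codes `WORKTREES_DIR` to the `worktrees_dir` default from the config hunk and uses `posixpath` so the result is the same on any platform. The branch names are made up for the example.

```python
import posixpath
import re

WORKTREES_DIR = "/repos/_wt"  # assumed: the default of settings.worktrees_dir


def safe(branch):
    """Same substitution as _safe: anything outside [A-Za-z0-9._-] becomes '_'."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", branch)


def worktree_path(repo, branch):
    return posixpath.join(WORKTREES_DIR, repo, safe(branch))


path = worktree_path("enduro-trails", "feature/ET-009-collectors")

# Two different branch names can sanitize to the same directory name.
collides = safe("feature/x") == safe("feature_x")
```

Because the substitution is lossy, branches that differ only in characters replaced by `_` would share a worktree path. This may never occur with the branch naming used in practice, but it is worth keeping in mind.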


@@ -1,14 +1,75 @@
from fastapi import FastAPI
from contextlib import asynccontextmanager
import logging
from .db import init_db
from .webhooks.plane import router as plane_router
from .webhooks.gitea import router as gitea_router
# Configure logging
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
)
@asynccontextmanager
async def lifespan(app: FastAPI):
init_db()
-yield
# M-1: proper orphan-recovery.
# An orphan = an agent_run with no finished_at that is older than the recovery
# window. After a uvicorn restart the monitor thread is gone, so its child claude
# process (if any) was reparented to init; we cannot kill it by pid (pid is not
# persisted). Instead of silently writing exit=-1, we: enumerate each orphan,
# mark it exit=-1, log a warning per run, and notify so a human can check/restart.
log = logging.getLogger('orchestrator')
from .db import get_db
conn = get_db()
orphan_rows = conn.execute(
"SELECT id, task_id, agent FROM agent_runs "
"WHERE finished_at IS NULL AND started_at < datetime('now', '-35 minutes')"
).fetchall()
for row in orphan_rows:
run_id, task_id, agent = row[0], row[1], row[2]
conn.execute(
"UPDATE agent_runs SET finished_at=datetime('now'), exit_code=-1 WHERE id=?",
(run_id,),
)
log.warning(
f"Orphan run {run_id} (task {task_id}, agent {agent}) recovered — "
f"manual check needed (process may have been killed on restart)"
)
conn.commit()
conn.close()
if orphan_rows:
try:
from .notifications import send_telegram
ids = ", ".join(str(r[0]) for r in orphan_rows)
send_telegram(
f"\u26a0\ufe0f Orchestrator restart: {len(orphan_rows)} orphaned agent run(s) "
f"(run_id: {ids}) marked exit=-1. Manual check/restart needed."
)
except Exception:
pass
log.warning(f"Recovered {len(orphan_rows)} orphaned agent runs")
# ORCH-1 (F-2b): queue-recovery. Any job left in 'running' status belongs to a
# worker that died on the previous restart -> put it back to 'queued' so the
# worker re-picks it up (restart-safe, no lost work). Runs AFTER M-1.
from .db import requeue_running_jobs
requeued = requeue_running_jobs()
if requeued:
log.warning(f"Queue-recovery: requeued {requeued} running job(s) after restart")
# Start the background job-queue worker (ORCH-1).
from .queue_worker import worker
worker.start()
try:
yield
finally:
# Graceful shutdown of the worker (running agents keep going; their jobs
# are requeued on next start via queue-recovery if the process dies).
worker.stop()
app = FastAPI(title="Multi-Agent Orchestrator", lifespan=lifespan)
@@ -30,3 +91,17 @@ async def status():
).fetchall()
conn.close()
return {"active_tasks": [dict(t) for t in tasks]}
@app.get("/queue")
async def queue():
"""ORCH-1: job-queue observability — status counts + recent jobs."""
from .db import job_status_counts, recent_jobs
from .queue_worker import worker
return {
"counts": job_status_counts(),
"max_concurrency": worker.max_concurrency,
"poll_interval": worker.poll_interval,
"resilience": worker.status(),
"recent": recent_jobs(10),
}
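
Reviewer note on the M-1 orphan-recovery step in `lifespan` above: the select-then-mark logic can be exercised in isolation against an in-memory SQLite database. This is a minimal sketch, not the orchestrator's code. The helper name `recover_orphans`, the `window_minutes` parameter and the reduced `agent_runs` schema are assumptions made for illustration; only the two SQL statements mirror the diff.

```python
import sqlite3


def recover_orphans(conn: sqlite3.Connection, window_minutes: int = 35) -> list[int]:
    """Mark unfinished runs older than the window as exit_code=-1 and return their ids."""
    # SQLite accepts the time modifier as a bound parameter, e.g. '-35 minutes'.
    cutoff = f"-{int(window_minutes)} minutes"
    rows = conn.execute(
        "SELECT id FROM agent_runs "
        "WHERE finished_at IS NULL AND started_at < datetime('now', ?)",
        (cutoff,),
    ).fetchall()
    ids = [row[0] for row in rows]
    conn.executemany(
        "UPDATE agent_runs SET finished_at=datetime('now'), exit_code=-1 WHERE id=?",
        [(run_id,) for run_id in ids],
    )
    conn.commit()
    return ids


def demo() -> list[int]:
    """Build a throwaway table (assumed schema) with one stale, one fresh, one finished run."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, started_at TEXT, "
        "finished_at TEXT, exit_code INTEGER)"
    )
    conn.execute("INSERT INTO agent_runs (id, started_at) VALUES (1, datetime('now', '-2 hours'))")
    conn.execute("INSERT INTO agent_runs (id, started_at) VALUES (2, datetime('now', '-5 minutes'))")
    conn.execute(
        "INSERT INTO agent_runs (id, started_at, finished_at, exit_code) "
        "VALUES (3, datetime('now', '-3 hours'), datetime('now', '-2 hours'), 0)"
    )
    return recover_orphans(conn)
```

Only the stale unfinished run is touched; a run younger than the window is left alone because its monitor thread may simply not have reported yet.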

125
src/notifications.py Normal file

@@ -0,0 +1,125 @@
"""Notifications and logging for orchestrator events."""
import logging
import httpx
logger = logging.getLogger("orchestrator")
# Lazy import to avoid circular imports at module level
_settings = None
def _get_settings():
global _settings
if _settings is None:
from .config import settings
_settings = settings
return _settings
def send_telegram(text: str):
"""Send notification to Telegram. Fire-and-forget, never raises."""
s = _get_settings()
if not s.telegram_bot_token or not s.telegram_chat_id:
return
try:
url = f"https://api.telegram.org/bot{s.telegram_bot_token}/sendMessage"
httpx.post(
url,
json={
"chat_id": s.telegram_chat_id,
"text": text,
"parse_mode": "HTML",
"disable_notification": False,
},
timeout=5,
)
except Exception:
pass # Never crash orchestrator due to notification failure
def _get_work_item_id(task_id: int) -> str:
"""Get work_item_id from DB by task_id."""
try:
from .db import get_db
conn = get_db()
row = conn.execute("SELECT work_item_id FROM tasks WHERE id=?", (task_id,)).fetchone()
conn.close()
return row[0] if row and row[0] else f"task-{task_id}"
except Exception:
return f"task-{task_id}"
def notify_stage_change(task_id: int, old_stage: str, new_stage: str, agent: str = None):
"""Log and notify stage transition."""
work_item_id = _get_work_item_id(task_id)
msg = f"\U0001f504 {work_item_id}: {old_stage} \u2192 {new_stage}"
if agent:
msg += f" (\u0437\u0430\u043f\u0443\u0449\u0435\u043d {agent})"
logger.info(msg)
send_telegram(msg)
def notify_agent_started(run_id: int, agent: str, task_id: int):
"""Notify agent launch."""
work_item_id = _get_work_item_id(task_id)
msg = f"\U0001f680 {work_item_id}: {agent} \u0437\u0430\u043f\u0443\u0449\u0435\u043d (run_id={run_id})"
logger.info(msg)
send_telegram(msg)
def notify_agent_finished(run_id: int, agent: str, exit_code: int, task_id: int = None, duration_s: int = None):
"""Notify agent completion."""
work_item_id = _get_work_item_id(task_id) if task_id else "?"
if exit_code == 0:
dur = f" ({duration_s // 60} \u043c\u0438\u043d)" if duration_s else ""
msg = f"\u2705 {work_item_id}: {agent} \u0437\u0430\u0432\u0435\u0440\u0448\u0438\u043b{dur}"
elif exit_code == -9:
msg = f"\u23f0 {work_item_id}: {agent} \u0443\u0431\u0438\u0442 \u043f\u043e \u0442\u0430\u0439\u043c\u0430\u0443\u0442\u0443 (30 \u043c\u0438\u043d)"
else:
msg = f"\u274c {work_item_id}: {agent} \u0443\u043f\u0430\u043b (exit_code={exit_code})"
logger.info(msg)
send_telegram(msg)
def notify_qg_result(task_id: int, check: str, passed: bool, reason: str = None):
"""Notify QG check result."""
work_item_id = _get_work_item_id(task_id)
if passed:
msg = f"\u2705 {work_item_id}: QG {check} \u2014 passed"
else:
msg = f"\u26a0\ufe0f {work_item_id}: QG {check} \u2014 failed: {reason}"
logger.info(msg)
send_telegram(msg)
def notify_qg_failure(task_id: int, stage: str, check: str, reason: str):
"""Log and notify QG check failure."""
work_item_id = _get_work_item_id(task_id)
msg = f"\u26a0\ufe0f {work_item_id}: QG {check} \u2014 failed: {reason}"
logger.warning(msg)
send_telegram(msg)
def notify_approve_requested(task_id: int):
"""Notify that analyst requests :approved:."""
work_item_id = _get_work_item_id(task_id)
msg = f"\U0001f4cb {work_item_id}: BRD/\u0422\u0417/AC \u0433\u043e\u0442\u043e\u0432\u044b. \u0416\u0434\u0443 :approved: \u0432 Plane"
logger.info(msg)
send_telegram(msg)
def notify_done(task_id: int):
"""Notify task completion."""
work_item_id = _get_work_item_id(task_id)
msg = f"\U0001f389 {work_item_id}: \u0437\u0430\u0434\u0430\u0447\u0430 \u0437\u0430\u0432\u0435\u0440\u0448\u0435\u043d\u0430!"
logger.info(msg)
send_telegram(msg)
def notify_error(task_id: int, error: str):
"""Log and notify error for a task."""
work_item_id = _get_work_item_id(task_id) if task_id else "system"
msg = f"\U0001f534 {work_item_id}: ERROR \u2014 {error}"
logger.error(msg)
send_telegram(msg)
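
Reviewer note on `send_telegram`: the payload is sent with `parse_mode="HTML"`, and the Telegram Bot API expects literal `<`, `>` and `&` in such text to be entity-escaped. Error strings passed to `notify_error` or `notify_qg_failure` may well contain those characters, so a message could be rejected and, given the fire-and-forget design, dropped unnoticed. The sketch below shows one possible guard using the standard library. `telegram_safe` and `format_error` are hypothetical names, not part of this change.

```python
import html


def telegram_safe(text: str) -> str:
    """Escape &, < and > so free-form text is valid under parse_mode="HTML"."""
    # quote=False: quotes only need escaping inside attribute values.
    return html.escape(text, quote=False)


def format_error(work_item_id: str, error: str) -> str:
    """Hypothetical variant of the notify_error message with the payload escaped."""
    return f"\U0001f534 {telegram_safe(work_item_id)}: ERROR - {telegram_safe(error)}"
```

Escaping at the formatting step keeps `send_telegram` itself free to carry intentional markup if a caller ever needs it.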

242
src/plane_sync.py Normal file

@@ -0,0 +1,242 @@
"""Plane API sync — update issue state and add comments."""
import logging
import httpx
from .config import settings
logger = logging.getLogger("orchestrator.plane_sync")
PLANE_BASE = f"{settings.plane_api_url}/api/v1"
PLANE_HEADERS = {"X-API-Key": settings.plane_api_token}
WORKSPACE = settings.plane_workspace_slug
PROJECT_ID = settings.plane_project_id or "7a79f0a9-5278-49cd-9007-9a338f238f9c"
def _resolve_project_id(work_item_id: str = None, project_id: str = None) -> str:
"""ORCH-6: resolve the Plane project id for a sync call.
Priority:
1. explicit project_id arg (caller already knows the project),
2. project derived from the task's repo in the DB (by work_item_id),
3. legacy default PROJECT_ID (enduro) for backward compatibility.
"""
if project_id:
return project_id
if work_item_id:
try:
from .db import get_db
from .projects import get_project_by_repo
conn = get_db()
row = conn.execute(
"SELECT repo FROM tasks WHERE work_item_id = ? ORDER BY id DESC LIMIT 1",
(work_item_id,),
).fetchone()
conn.close()
if row and row[0]:
proj = get_project_by_repo(row[0])
if proj:
return proj.plane_project_id
except Exception as e:
logger.debug(f"_resolve_project_id fallback for {work_item_id}: {e}")
return PROJECT_ID
# Plane state IDs
PLANE_STATES = {
"backlog": "113b24f6-cce8-4be9-9a22-a359b9cf0122",
"todo": "2c7d3df3-9eb9-419b-92b7-d7d560bcdd10",
"in_progress": "b873d9eb-993c-48cd-97ac-99a9b1623967",
"needs_input": "babf08a3-ff4d-41f3-a821-5491aa29a8ac",
"in_review": "38fb1f64-aa1e-48a3-92e0-0b109679046b",
"blocked": "6c4543f9-ac47-4ef7-ae0f-070020dc9920",
"done": "381a2833-3c4e-4be5-bd0f-be84cb946ad8",
"cancelled": "b1cae7f9-961d-4889-a179-f3acea697d17",
}
# Map orchestrator stages to Plane states
STAGE_TO_STATE = {
"created": PLANE_STATES["todo"],
"analysis": PLANE_STATES["in_progress"],
"architecture": PLANE_STATES["in_progress"],
"development": PLANE_STATES["in_progress"],
"review": PLANE_STATES["in_progress"],
"testing": PLANE_STATES["in_progress"],
"deploy": PLANE_STATES["in_progress"],
"done": PLANE_STATES["done"],
}
def find_issue_id(work_item_id: str, project_id: str = None) -> str | None:
"""Find Plane issue UUID by work_item_id (e.g. 'ET-002')."""
project_id = _resolve_project_id(work_item_id, project_id)
# Primary: lookup from DB (plane_issue_id column)
try:
from .db import get_db
conn = get_db()
row = conn.execute(
"SELECT plane_issue_id FROM tasks WHERE work_item_id = ? AND plane_issue_id IS NOT NULL",
(work_item_id,)
).fetchone()
conn.close()  # release the connection, as the other DB helpers in this module do
if row and row[0]:
return row[0]
except Exception as e:
logger.debug(f"DB lookup failed for {work_item_id}: {e}")
# Fallback: search via Plane API
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{project_id}/issues/"
try:
# First try search by work_item_id
resp = httpx.get(url, headers=PLANE_HEADERS, params={"search": work_item_id}, timeout=10)
resp.raise_for_status()
data = resp.json()
results = data.get("results", data if isinstance(data, list) else [])
for issue in results:
seq = issue.get("sequence_id")
identifier = f"ET-{seq:03d}" if seq else ""
if identifier == work_item_id or work_item_id in issue.get("name", ""):
return issue["id"]
# Fallback: get all issues and match by sequence_id number
if work_item_id.startswith("ET-"):
try:
target_num = int(work_item_id.split("-")[1])
except (IndexError, ValueError):
target_num = None
if target_num:
resp2 = httpx.get(url, headers=PLANE_HEADERS, timeout=10)
resp2.raise_for_status()
data2 = resp2.json()
results2 = data2.get("results", data2 if isinstance(data2, list) else [])
for issue in results2:
if issue.get("sequence_id") == target_num:
return issue["id"]
except Exception as e:
logger.error(f"Failed to find issue for {work_item_id}: {e}")
return None
def update_issue_state(work_item_id: str, stage: str, project_id: str = None):
"""Update Plane issue state based on orchestrator stage."""
state_id = STAGE_TO_STATE.get(stage)
if not state_id:
return
project_id = _resolve_project_id(work_item_id, project_id)
issue_id = find_issue_id(work_item_id, project_id)
if not issue_id:
logger.warning(f"Issue not found in Plane for {work_item_id}")
return
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{project_id}/issues/{issue_id}/"
try:
resp = httpx.patch(url, headers=PLANE_HEADERS, json={"state": state_id}, timeout=10)
resp.raise_for_status()
logger.info(f"Plane: {work_item_id} state -> {stage} ({state_id[:8]}...)")
except Exception as e:
logger.error(f"Failed to update Plane state for {work_item_id}: {e}")
def add_comment(work_item_id: str, text: str, project_id: str = None):
"""Add a comment to Plane issue."""
project_id = _resolve_project_id(work_item_id, project_id)
issue_id = find_issue_id(work_item_id, project_id)
if not issue_id:
logger.warning(f"Issue not found in Plane for {work_item_id}, skipping comment")
return
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{project_id}/issues/{issue_id}/comments/"
html = f"<p>{text}</p>"
try:
resp = httpx.post(url, headers=PLANE_HEADERS, json={"comment_html": html}, timeout=10)
resp.raise_for_status()
logger.info(f"Plane: comment added to {work_item_id}")
except Exception as e:
logger.error(f"Failed to add comment to {work_item_id}: {e}")
def set_issue_needs_input(work_item_id: str, project_id: str = None):
"""Set issue to 'Needs Input' state — waiting for stakeholder response."""
_set_issue_state_direct(work_item_id, PLANE_STATES["needs_input"], project_id)
def set_issue_in_review(work_item_id: str, project_id: str = None):
"""Set issue to 'In Review' state — waiting for :approved: or :rejected:."""
_set_issue_state_direct(work_item_id, PLANE_STATES["in_review"], project_id)
def set_issue_blocked(work_item_id: str, project_id: str = None):
"""Set issue to 'Blocked' state — manual intervention needed."""
_set_issue_state_direct(work_item_id, PLANE_STATES["blocked"], project_id)
def set_issue_in_progress(work_item_id: str, project_id: str = None):
"""Set issue to 'In Progress' state — agent working."""
_set_issue_state_direct(work_item_id, PLANE_STATES["in_progress"], project_id)
def _set_issue_state_direct(work_item_id: str, state_id: str, project_id: str = None):
"""Set issue state directly by state_id."""
project_id = _resolve_project_id(work_item_id, project_id)
issue_id = find_issue_id(work_item_id, project_id)
if not issue_id:
logger.warning(f"Issue not found in Plane for {work_item_id}")
return
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{project_id}/issues/{issue_id}/"
try:
resp = httpx.patch(url, headers=PLANE_HEADERS, json={"state": state_id}, timeout=10)
resp.raise_for_status()
logger.info(f"Plane: {work_item_id} state -> {state_id[:8]}...")
except Exception as e:
logger.error(f"Failed to update Plane state for {work_item_id}: {e}")
def notify_stage_change(work_item_id: str, old_stage: str, new_stage: str, agent: str = None, project_id: str = None):
"""Notify Plane about stage transition with links."""
project_id = _resolve_project_id(work_item_id, project_id)
update_issue_state(work_item_id, new_stage, project_id)
msg = f"🔄 Stage: {old_stage} -> {new_stage}"
if agent:
msg += f" (launching {agent})"
# Add relevant links
gitea_base = "http://git.mva154.duckdns.org"
try:
from .db import get_db
conn = get_db()
row = conn.execute(
"SELECT branch, repo FROM tasks WHERE work_item_id=?", (work_item_id,)
).fetchone()
conn.close()
if row:
branch, repo = row
msg += chr(10) + "📂 Branch: [" + branch + "](" + gitea_base + "/admin/" + repo + "/src/branch/" + branch + ")"
if new_stage in ("review", "testing", "deploy"):
import httpx as _httpx
from .config import settings
_headers = {"Authorization": f"token {settings.gitea_token}"}
_resp = _httpx.get(
f"{settings.gitea_url}/api/v1/repos/{settings.gitea_owner}/{repo}/pulls",
params={"state": "open", "head": branch},
headers=_headers, timeout=5
)
if _resp.status_code == 200:
_prs = _resp.json()
if _prs:
pr_num = _prs[0]["number"]
msg += chr(10) + "🔗 PR: [#" + str(pr_num) + "](" + gitea_base + "/admin/" + repo + "/pulls/" + str(pr_num) + ")"
except Exception:
pass
add_comment(work_item_id, msg, project_id)
def notify_qg_failure(work_item_id: str, stage: str, check: str, reason: str, project_id: str = None):
"""Notify Plane about QG failure."""
add_comment(work_item_id, f"⚠️ QG failed at {stage}: {check} - {reason}", project_id)
def notify_done(work_item_id: str, project_id: str = None):
"""Mark issue as Done in Plane."""
project_id = _resolve_project_id(work_item_id, project_id)
update_issue_state(work_item_id, "done", project_id)
add_comment(work_item_id, "✅ Task completed! PR merged and deployed.", project_id)

106
src/preflight.py Normal file

@@ -0,0 +1,106 @@
"""ORCH-1 resilience: cheap preflight check (CLI / network available?).
Goal: before the worker claims a job, confirm the claude CLI binary and runtime
are reachable WITHOUT spending any tokens. We only do local/cheap checks:
1. os.path.exists(CLAUDE_BIN) -- instant
2. `claude --version` (timeout 5s) -- spawns CLI, does NOT call the API
The result is cached for `preflight_cache_ttl` seconds so we do not re-run
`claude --version` on every worker tick.
🚫 We deliberately do NOT do a prompt ping (ping->pong) — that would burn the
rate limit and add latency. Preflight is local-only.
"""
import os
import time
import logging
import subprocess
from .config import settings
logger = logging.getLogger("orchestrator.preflight")
_VERSION_TIMEOUT = 5
class _PreflightCache:
def __init__(self):
self.ts: float = 0.0
self.ok: bool = False
self.reason: str = "not checked yet"
_cache = _PreflightCache()
def _claude_bin() -> str:
"""Resolve the claude binary preflight should check.
Must match the binary the launcher actually spawns. The launcher hardcodes
AgentLauncher.CLAUDE_BIN for the real Popen, so we prefer that; we only fall
back to settings.claude_bin / a default if it is somehow unset. (Note: the
container's ORCH_CLAUDE_BIN may point elsewhere; preflight follows the path
that is genuinely executed, not the unused env override.)
"""
try:
from .agents.launcher import AgentLauncher
launcher_bin = getattr(AgentLauncher, "CLAUDE_BIN", None)
if launcher_bin and os.path.exists(launcher_bin):
return launcher_bin
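# CLAUDE_BIN is set but missing on disk: still return it so _compute() reports
# the failure for the path the launcher would actually spawn. settings.claude_bin
# and the built-in default apply only when CLAUDE_BIN is unset.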
return launcher_bin or getattr(settings, "claude_bin", None) or "/opt/claude-code/bin/claude.exe"
except Exception:
return getattr(settings, "claude_bin", None) or "/opt/claude-code/bin/claude.exe"
def _run_version(bin_path: str) -> tuple[bool, str]:
"""`claude --version` — proves the CLI runs without touching the API."""
try:
r = subprocess.run(
[bin_path, "--version"],
capture_output=True,
text=True,
timeout=_VERSION_TIMEOUT,
)
if r.returncode == 0:
return True, (r.stdout or r.stderr or "").strip()[:120] or "ok"
return False, f"--version exit {r.returncode}: {(r.stderr or r.stdout).strip()[:120]}"
except subprocess.TimeoutExpired:
return False, f"--version timed out after {_VERSION_TIMEOUT}s"
except FileNotFoundError:
return False, "claude binary not found (FileNotFoundError)"
except Exception as e: # pragma: no cover - defensive
return False, f"--version error: {e}"
def _compute() -> tuple[bool, str]:
bin_path = _claude_bin()
if not os.path.exists(bin_path):
return False, f"CLAUDE_BIN not found: {bin_path}"
return _run_version(bin_path)
def check(force: bool = False) -> tuple[bool, str]:
"""Return (ok, reason). Cached for preflight_cache_ttl seconds.
force=True bypasses the cache (used by the breaker half-open probe / tests).
"""
now = time.time()
ttl = settings.preflight_cache_ttl
if not force and _cache.ts > 0 and (now - _cache.ts) < ttl:
return _cache.ok, _cache.reason
ok, reason = _compute()
_cache.ts = now
_cache.ok = ok
_cache.reason = reason
if not ok:
logger.warning(f"Preflight FAIL: {reason}")
return ok, reason
def reset_cache() -> None:
"""Invalidate the cache (tests / forced recheck)."""
_cache.ts = 0.0
_cache.ok = False
_cache.reason = "reset"
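
Reviewer note on the preflight cache: `check()` combines a timestamp, a TTL and a `force` flag. The same pattern is sketched below as a small class with an injectable clock, which makes the expiry logic testable without sleeping. `TTLResult` is a hypothetical name. The sketch also swaps in `time.monotonic` as the default clock, an assumption on my part: the diff uses `time.time()`, which can jump when the wall clock is adjusted.

```python
import time
from typing import Callable

Result = tuple[bool, str]


class TTLResult:
    """Cache the outcome of an expensive check for `ttl` seconds."""

    def __init__(self, compute: Callable[[], Result], ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._stamp: float | None = None  # None means "never computed"
        self._value: Result = (False, "not checked yet")

    def get(self, force: bool = False) -> Result:
        now = self._clock()
        fresh = self._stamp is not None and (now - self._stamp) < self._ttl
        if fresh and not force:
            return self._value
        # Stale, never computed, or forced: recompute and restamp.
        self._value = self._compute()
        self._stamp = now
        return self._value

    def reset(self) -> None:
        self._stamp = None
```

Using `None` as the "never computed" marker avoids relying on the timestamp being greater than zero.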

127
src/projects.py Normal file

@@ -0,0 +1,127 @@
"""ORCH-6: Project registry — map Plane project id -> repo / work-item prefix.
Root cause of the 2026-06-02 incident: the Plane webhook listened to the whole
workspace and hardcoded ``repo = settings.default_repo`` (enduro-trails). Every
issue from any project was funneled into one repo with one prefix (ET).
This module introduces a small registry keyed by the Plane project uuid so the
orchestrator can:
* filter webhooks by project (ignore unknown projects),
* resolve the gitea repo + work-item prefix for a known project,
* route Plane sync (state/comment) into the issue's own project.
Source of truth: ``settings.projects_json`` (a JSON array set via the
``ORCH_PROJECTS_JSON`` env var). If unset/empty/invalid, a built-in default
registry is used so the system works out of the box.
"""
import json
import logging
from dataclasses import dataclass
from .config import settings
logger = logging.getLogger("orchestrator.projects")
@dataclass(frozen=True)
class ProjectConfig:
plane_project_id: str # uuid of the Plane project (registry key)
repo: str # gitea repo name (== folder under /repos)
work_item_prefix: str # ET / ORCH
name: str # human-readable label
# Built-in default registry (used when ORCH_PROJECTS_JSON is empty/invalid).
# Keep enduro-trails first so existing behaviour is the safe default.
_DEFAULT_PROJECTS = [
ProjectConfig(
plane_project_id="7a79f0a9-5278-49cd-9007-9a338f238f9c",
repo="enduro-trails",
work_item_prefix="ET",
name="enduro-trails",
),
ProjectConfig(
plane_project_id="8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a",
repo="orchestrator",
work_item_prefix="ORCH",
name="orchestrator",
),
]
def _parse_projects_json(raw: str) -> list[ProjectConfig] | None:
"""Parse ORCH_PROJECTS_JSON. Returns None if empty/invalid (-> use default)."""
if not raw or not raw.strip():
return None
try:
data = json.loads(raw)
except (ValueError, TypeError) as e:
logger.error(f"ORCH_PROJECTS_JSON is not valid JSON, falling back to default: {e}")
return None
if not isinstance(data, list):
logger.error("ORCH_PROJECTS_JSON must be a JSON array, falling back to default")
return None
parsed: list[ProjectConfig] = []
for i, item in enumerate(data):
if not isinstance(item, dict):
logger.error(f"ORCH_PROJECTS_JSON[{i}] is not an object, skipping")
continue
try:
parsed.append(
ProjectConfig(
plane_project_id=str(item["plane_project_id"]),
repo=str(item["repo"]),
work_item_prefix=str(item["work_item_prefix"]),
name=str(item.get("name", item["repo"])),
)
)
except KeyError as e:
logger.error(f"ORCH_PROJECTS_JSON[{i}] missing required key {e}, skipping")
continue
if not parsed:
logger.error("ORCH_PROJECTS_JSON produced no valid entries, falling back to default")
return None
return parsed
def _load_projects() -> list[ProjectConfig]:
parsed = _parse_projects_json(getattr(settings, "projects_json", "") or "")
if parsed is not None:
logger.info(f"Project registry loaded from ORCH_PROJECTS_JSON: {len(parsed)} project(s)")
return parsed
return list(_DEFAULT_PROJECTS)
# Module-level registry, built once at import.
PROJECTS: list[ProjectConfig] = _load_projects()
_BY_PLANE_ID: dict[str, ProjectConfig] = {p.plane_project_id: p for p in PROJECTS}
_BY_REPO: dict[str, ProjectConfig] = {p.repo: p for p in PROJECTS}
def get_project_by_plane_id(plane_project_id: str) -> ProjectConfig | None:
"""Resolve project config by Plane project uuid. None if unknown."""
if not plane_project_id:
return None
return _BY_PLANE_ID.get(plane_project_id)
def get_project_by_repo(repo: str) -> ProjectConfig | None:
"""Resolve project config by gitea repo name. None if unknown."""
if not repo:
return None
return _BY_REPO.get(repo)
def known_plane_project_ids() -> set[str]:
"""Set of Plane project ids the orchestrator is configured to handle."""
return set(_BY_PLANE_ID.keys())
def reload_projects() -> None:
"""Rebuild the registry from current settings (used by tests)."""
global PROJECTS, _BY_PLANE_ID, _BY_REPO
PROJECTS = _load_projects()
_BY_PLANE_ID = {p.plane_project_id: p for p in PROJECTS}
_BY_REPO = {p.repo: p for p in PROJECTS}
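
Reviewer note on `ORCH_PROJECTS_JSON`: the expected value is a JSON array of objects with the required keys `plane_project_id`, `repo` and `work_item_prefix`, plus an optional `name`. The sketch below builds such a value and filters it the way `_parse_projects_json` skips bad entries. It is a simplified stand-in: `valid_entries` is a hypothetical helper that returns an empty list where the real parser returns `None` to trigger the default registry. The project ids are the ones from the built-in defaults in this diff.

```python
import json

REQUIRED_KEYS = ("plane_project_id", "repo", "work_item_prefix")


def valid_entries(raw: str) -> list[dict]:
    """Keep only JSON objects that carry every required key (name is optional)."""
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return []
    if not isinstance(data, list):
        return []
    return [
        entry for entry in data
        if isinstance(entry, dict) and all(key in entry for key in REQUIRED_KEYS)
    ]


# Example value for the env var: two valid projects and one entry missing keys.
EXAMPLE = json.dumps([
    {"plane_project_id": "7a79f0a9-5278-49cd-9007-9a338f238f9c",
     "repo": "enduro-trails", "work_item_prefix": "ET"},
    {"plane_project_id": "8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a",
     "repo": "orchestrator", "work_item_prefix": "ORCH", "name": "orchestrator"},
    {"repo": "missing-keys"},
])
```

Filtering `EXAMPLE` keeps the first two entries and drops the third, mirroring the per-entry "skipping" log lines in the parser.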


@@ -1,26 +1,285 @@
-# Quality Gate checks placeholder
-# Will be expanded as pipeline matures
"""Quality Gate checks — real implementations using Gitea/Plane API and filesystem."""
import os
import logging
import httpx
from ..config import settings
logger = logging.getLogger("orchestrator.qg")
from ..git_worktree import get_worktree_path, ensure_worktree
-def check_analysis_complete(task_id: int) -> bool:
-"""Check if analysis artifacts exist."""
-# TODO: verify .task-arch.md exists in repo
-return True
def _repo_path(repo: str, branch: str | None = None) -> str:
"""Resolve the working path to read agent artifacts from.
ORCH-2 / S-4: artifacts now live in the per-branch worktree. When a branch is
given and its worktree exists on disk, read from there; otherwise fall back to
the shared /repos/<repo> clone (keeps backward-compat for 2-arg callers/tests).
"""
if branch:
wt = get_worktree_path(repo, branch)
if os.path.isdir(wt):
return wt
return os.path.join(settings.repos_dir, repo)
# Shared httpx client config
GITEA_HEADERS = {"Authorization": f"token {settings.gitea_token}"}
GITEA_BASE = f"{settings.gitea_url}/api/v1"
-def check_architecture_approved(task_id: int) -> bool:
-"""Check if architecture was approved in Plane."""
-# TODO: check Plane comment for :approved:
-return False
def check_analysis_complete(repo: str, work_item_id: str, branch: str | None = None) -> tuple[bool, str]:
"""
Check if analysis artifacts exist in the repo branch.
Required files:
- docs/work-items/<work_item_id>/01-brd.md
- docs/work-items/<work_item_id>/02-trz.md
- docs/work-items/<work_item_id>/03-acceptance-criteria.md
- docs/work-items/<work_item_id>/04-test-plan.yaml
"""
required_files = [
f"docs/work-items/{work_item_id}/01-brd.md",
f"docs/work-items/{work_item_id}/02-trz.md",
f"docs/work-items/{work_item_id}/03-acceptance-criteria.md",
f"docs/work-items/{work_item_id}/04-test-plan.yaml",
]
repo_path = _repo_path(repo, branch)
missing = []
for f in required_files:
full_path = os.path.join(repo_path, f)
if not os.path.isfile(full_path):
missing.append(f)
if missing:
return False, f"Missing files: {', '.join(missing)}"
return True, "All analysis artifacts present"
-def check_ci_green(repo: str, branch: str) -> bool:
-"""Check if CI status is green for branch."""
-# TODO: query Gitea commit status API
-return False
def check_architecture_done(repo: str, work_item_id: str, branch: str | None = None) -> tuple[bool, str]:
"""
Check if architecture artifacts exist.
Required: docs/work-items/<work_item_id>/06-adr/ (at least 1 file)
OR: docs/work-items/<work_item_id>/07-infra-requirements.md
"""
repo_path = _repo_path(repo, branch)
adr_dir = os.path.join(repo_path, f"docs/work-items/{work_item_id}/06-adr")
infra_file = os.path.join(repo_path, f"docs/work-items/{work_item_id}/07-infra-requirements.md")
if os.path.isdir(adr_dir) and len(os.listdir(adr_dir)) > 0:
return True, "ADR directory exists with files"
if os.path.isfile(infra_file):
return True, "Infra requirements file exists"
return False, "No ADR directory or infra-requirements.md found"
-def check_review_approved(repo: str, pr_number: int) -> bool:
-"""Check if PR has approved review."""
-# TODO: query Gitea PR reviews API
-return False
def check_ci_green(repo: str, branch: str) -> tuple[bool, str]:
"""
Check if CI status is green for branch via Gitea API.
GET /repos/{owner}/{repo}/commits/{branch}/status
"""
owner = settings.gitea_owner
url = f"{GITEA_BASE}/repos/{owner}/{repo}/commits/{branch}/status"
try:
resp = httpx.get(url, headers=GITEA_HEADERS, timeout=10)
if resp.status_code == 404:
return False, f"Branch '{branch}' not found or no status"
resp.raise_for_status()
data = resp.json()
state = data.get("state", "unknown")
if state == "success":
return True, "CI green"
return False, f"CI state: {state}"
except httpx.HTTPError as e:
logger.error(f"Gitea API error checking CI: {e}")
return False, f"API error: {e}"
def check_review_approved(repo: str, pr_number: int) -> tuple[bool, str]:
"""
Check if PR has at least one approved review and no request_changes.
GET /repos/{owner}/{repo}/pulls/{pr_number}/reviews
"""
owner = settings.gitea_owner
url = f"{GITEA_BASE}/repos/{owner}/{repo}/pulls/{pr_number}/reviews"
try:
resp = httpx.get(url, headers=GITEA_HEADERS, timeout=10)
resp.raise_for_status()
reviews = resp.json()
approved = 0
changes_requested = 0
for review in reviews:
# Skip stale reviews (dismissed by new commits)
if review.get("stale", False):
continue
state = review.get("state", "").upper()
if state == "APPROVED":
approved += 1
elif state == "REQUEST_CHANGES":
changes_requested += 1
if changes_requested > 0:
return False, f"Changes requested ({changes_requested} reviews)"
if approved > 0:
return True, f"Approved ({approved} reviews)"
return False, "No reviews yet"
except httpx.HTTPError as e:
logger.error(f"Gitea API error checking reviews: {e}")
return False, f"API error: {e}"
def check_tests_passed(repo: str, work_item_id: str, branch: str | None = None) -> tuple[bool, str]:
"""
Check if test report exists and contains PASS indicator.
File: docs/work-items/<work_item_id>/13-test-report.md
"""
repo_path = _repo_path(repo, branch)
report_path = os.path.join(repo_path, f"docs/work-items/{work_item_id}/13-test-report.md")
if not os.path.isfile(report_path):
return False, "Test report not found"
try:
with open(report_path, "r") as f:
content = f.read()
if "PASS" in content or "All tests passed" in content:
return True, "Test report indicates PASS"
return False, "Test report exists but no PASS indicator found"
except OSError as e:
return False, f"Error reading test report: {e}"
def check_analysis_approved(repo: str, work_item_id: str, branch: str | None = None) -> tuple[bool, str]:
"""
Check if analysis is complete AND approved by stakeholder.
Requirements:
1. All analysis artifacts exist (BRD, TRZ, AC, TestPlan)
2. Stakeholder has posted :approved: comment on the Plane issue
This QG is designed to be triggered by :approved: comment handler,
so the approval check verifies file completeness as a safety gate.
"""
# First check files
files_ok, files_reason = check_analysis_complete(repo, work_item_id, branch)
if not files_ok:
return False, files_reason
# Check for :approved: comment via Plane API
try:
from ..plane_sync import find_issue_id, PLANE_BASE, PLANE_HEADERS, WORKSPACE, PROJECT_ID
from ..projects import get_project_by_repo
# ORCH-6: verify approval in the issue's own Plane project.
_proj = get_project_by_repo(repo)
_pid = _proj.plane_project_id if _proj else PROJECT_ID
issue_id = find_issue_id(work_item_id, _pid)
if not issue_id:
return False, "Cannot find Plane issue to verify approval"
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{_pid}/issues/{issue_id}/comments/"
resp = httpx.get(url, headers=PLANE_HEADERS, timeout=10)
resp.raise_for_status()
comments = resp.json()
# Handle paginated response
if isinstance(comments, dict):
comments = comments.get("results", [])
for comment in comments:
body = comment.get("comment_html", "") or comment.get("comment", "")
if ":approved:" in body:
return True, "Analysis complete and approved by stakeholder"
return False, "Analysis artifacts present but no :approved: comment found"
except Exception as e:
logger.warning(f"Failed to check approval for {work_item_id}: {e}")
# If we can't reach Plane API but files exist, allow advance
# (the :approved: handler already verified the comment exists)
return True, f"Files present; Plane API check skipped ({e})"
def check_reviewer_verdict(repo: str, work_item_id: str, branch: str | None = None) -> tuple[bool, str]:
"""
Check reviewer agent verdict from 12-review.md (S-5 fix).
Reads ONLY the machine-readable `verdict:` field from the YAML frontmatter,
so tables / prose that merely mention APPROVED or REQUEST_CHANGES no longer
cause false positives/negatives. Returns:
(True, ...) -> verdict: APPROVED
(False, ...) -> verdict: REQUEST_CHANGES, missing verdict, or no frontmatter
"""
import yaml
repo_path = _repo_path(repo, branch)
review_path = os.path.join(repo_path, f"docs/work-items/{work_item_id}/12-review.md")
if not os.path.isfile(review_path):
return False, "Review report not found (12-review.md)"
try:
with open(review_path, "r") as f:
content = f.read()
verdict = None
if content.startswith("---"):
parts = content.split("---", 2)
if len(parts) >= 3:
try:
fm = yaml.safe_load(parts[1]) or {}
except yaml.YAMLError as e:
return False, f"Invalid YAML frontmatter in review: {e}"
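# Frontmatter that parses to a non-mapping (list, scalar) has no verdict field.
verdict = str(fm.get("verdict") or "").upper().strip() if isinstance(fm, dict) else None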
if verdict == "APPROVED":
return True, "Reviewer verdict: APPROVED"
if verdict == "REQUEST_CHANGES":
return False, "Reviewer verdict: REQUEST_CHANGES"
return False, f"No machine-readable verdict in frontmatter (got: {verdict!r})"
except OSError as e:
return False, f"Error reading review: {e}"
def check_tests_local(repo: str, branch: str) -> tuple[bool, str]:
"""
S-1 fix: run the project test suite locally and judge by exit code, instead of
depending on Gitea CI (which is not configured -> always false).
ORCH-2 / S-4: tests run inside the per-branch worktree (ensure_worktree), so this
is safe for concurrent active tasks — no shared /repos checkout race.
"""
import subprocess
try:
repo_path = ensure_worktree(repo, branch)
r = subprocess.run(
["make", "test"], cwd=repo_path,
capture_output=True, text=True, timeout=600,
)
if r.returncode == 0:
return True, "Local tests passed"
tail = (r.stdout + r.stderr)[-500:]
return False, f"Local tests failed: ...{tail}"
except subprocess.TimeoutExpired:
return False, "Local tests timed out (600s)"
except Exception as e:
return False, f"Local test run error: {e}"
# Registry for dynamic lookup by name
QG_CHECKS = {
"check_analysis_approved": check_analysis_approved,
"check_analysis_complete": check_analysis_complete,
"check_architecture_done": check_architecture_done,
"check_ci_green": check_ci_green,
"check_review_approved": check_review_approved,
"check_tests_passed": check_tests_passed,
"check_reviewer_verdict": check_reviewer_verdict,
"check_tests_local": check_tests_local,
}
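
Reviewer note on `QG_CHECKS`: the registry maps a check name to a callable returning `(passed, reason)`, but the signatures differ (some checks take `work_item_id`, others `pr_number` or only `branch`). A caller doing dynamic lookup therefore has to handle unknown names and unexpected exceptions. The sketch below shows one way to do that. `run_check` is a hypothetical dispatcher and not part of this change; the stub registry stands in for the real functions.

```python
from typing import Callable

CheckResult = tuple[bool, str]


def run_check(registry: dict[str, Callable[..., CheckResult]],
              name: str, *args, **kwargs) -> CheckResult:
    """Look a check up by name and normalise every failure mode to (False, reason)."""
    check = registry.get(name)
    if check is None:
        return False, f"Unknown QG check: {name}"
    try:
        return check(*args, **kwargs)
    except Exception as exc:  # a crashing check should block the gate, not the worker
        return False, f"QG check {name} raised: {exc}"


# Stub registry for illustration only.
STUB_CHECKS: dict[str, Callable[..., CheckResult]] = {
    "always_ok": lambda repo, branch: (True, f"{repo}@{branch} ok"),
    "explodes": lambda repo, branch: 1 / 0,
}
```

Failing closed on an unknown name or an exception keeps a misconfigured pipeline from advancing a task by accident.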

246
src/queue_worker.py Normal file

@@ -0,0 +1,246 @@
"""ORCH-1 (F-2b): background job-queue worker with resilience layer.

A single background thread polls the `jobs` table and spawns agents:

    while running:
        if breaker.open and not cooled_down: sleep; continue   # don't touch CLI
        if not preflight.ok: sleep; continue                   # CLI/net down -> wait
        while count_running_jobs() < max_concurrency:
            job = claim_next_job()     # atomic queued -> running (available_at-gated)
            if not job: break
            launcher.launch_job(job)   # spawns claude (Popen) + monitor thread
        sleep(poll_interval)

Resilience (ADDENDUM):
  A. Preflight: cheap local CLI/net check (cached, no tokens) gates claiming.
  B/C. The launcher classifies failures (transient vs permanent) and applies
       backoff via available_at; the worker only needs to honour available_at
       (claim_next_job does) and react to transient outcomes via the breaker.
  D. Circuit breaker: N consecutive transient failures -> open (pause M minutes,
     no CLI calls, Telegram alert) -> half-open (probe one job) -> closed.

Design: plain daemon thread + threading.Event (the launcher already manages its
own monitor/watchdog threads + blocking Popen).
"""
import time
import logging
import threading

from .config import settings
from .db import claim_next_job, count_running_jobs
from .agents.launcher import launcher
from . import preflight

logger = logging.getLogger("orchestrator.queue_worker")

class CircuitBreaker:
    """Trips after `threshold` consecutive transient failures.

    States: closed -> (threshold transient) -> open -> (after pause) half-open
            -> (recovered) closed | (transient again) open.

    Thread-safe enough for our single-worker + monitor-thread callbacks (a lock
    guards the counters).
    """

    def __init__(self, threshold: int | None = None, pause_seconds: int | None = None):
        self.threshold = threshold if threshold is not None else settings.breaker_threshold
        self.pause_seconds = (
            pause_seconds if pause_seconds is not None else settings.breaker_pause_seconds
        )
        self._lock = threading.Lock()
        self.state = "closed"  # closed | open | half-open
        self.consecutive_transient = 0
        self.opened_at = 0.0
        self._notify = None  # optional callable(message) for alerts

    def set_notifier(self, fn):
        self._notify = fn

    def record_transient(self):
        with self._lock:
            self.consecutive_transient += 1
            if self.state == "half-open":
                # Probe failed -> re-open.
                self._open("circuit re-opened: probe job hit transient again")
            elif self.consecutive_transient >= self.threshold and self.state == "closed":
                self._open(
                    f"circuit OPEN: {self.consecutive_transient} consecutive "
                    f"transient failures; pausing {self.pause_seconds}s (no CLI calls)"
                )

    def record_recovered(self):
        with self._lock:
            self.consecutive_transient = 0
            if self.state in ("half-open", "open"):
                self.state = "closed"
                logger.info("Circuit CLOSED: recovered")

    def record_permanent(self):
        # A clean permanent (code-fault) failure breaks the transient streak.
        with self._lock:
            self.consecutive_transient = 0

    def _open(self, msg: str):
        # Called with self._lock already held by record_transient().
        self.state = "open"
        self.opened_at = time.time()
        logger.warning(msg)
        if self._notify:
            try:
                self._notify(f"\U0001f534 {msg}")
            except Exception:
                pass

    def allow_claim(self) -> bool:
        """Return True if the worker may attempt to claim/launch a job now.

        - closed -> yes.
        - open -> no until pause elapsed; then transition to half-open (yes, one probe).
        - half-open -> yes (the single probe).
        """
        with self._lock:
            if self.state == "closed":
                return True
            if self.state == "open":
                if (time.time() - self.opened_at) >= self.pause_seconds:
                    self.state = "half-open"
                    logger.info("Circuit HALF-OPEN: probing one job")
                    return True
                return False
            # half-open: allow the probe.
            return True

    def snapshot(self) -> dict:
        with self._lock:
            remaining = 0
            if self.state == "open":
                remaining = max(0, int(self.pause_seconds - (time.time() - self.opened_at)))
            return {
                "state": self.state,
                "consecutive_transient": self.consecutive_transient,
                "pause_remaining_s": remaining,
            }

class QueueWorker:
    """Background worker that drains the persistent job queue (with resilience)."""

    def __init__(self, max_concurrency: int | None = None,
                 poll_interval: float | None = None,
                 breaker: CircuitBreaker | None = None):
        self.max_concurrency = (
            max_concurrency if max_concurrency is not None else settings.max_concurrency
        )
        self.poll_interval = (
            poll_interval if poll_interval is not None else settings.queue_poll_interval
        )
        self.breaker = breaker or CircuitBreaker()
        self.last_preflight_ok = True
        self.last_preflight_reason = "not checked"
        self._stop = threading.Event()
        self._thread: threading.Thread | None = None

    # --- circuit breaker outcome callback wired into the launcher ----------
    def _on_outcome(self, transient: bool, recovered: bool):
        if recovered:
            self.breaker.record_recovered()
        elif transient:
            self.breaker.record_transient()
        else:
            self.breaker.record_permanent()

    def _drain_once(self):
        """Claim and launch jobs until concurrency is full or the queue is empty.

        Gated by the circuit breaker and preflight: if the breaker is open (and
        not yet cooled down) or preflight fails, we do NOT claim; jobs stay
        queued and no CLI/tokens are touched.
        """
        if not self.breaker.allow_claim():
            return
        ok, reason = preflight.check()
        self.last_preflight_ok = ok
        self.last_preflight_reason = reason
        if not ok:
            logger.info(f"Preflight not ok ({reason}) -> not claiming jobs this tick")
            return
        # In half-open we only probe a single job, regardless of max_concurrency.
        half_open = self.breaker.snapshot()["state"] == "half-open"
        launched = 0
        while not self._stop.is_set():
            if half_open and launched >= 1:
                return
            if count_running_jobs() >= self.max_concurrency:
                return
            job = claim_next_job()
            if not job:
                return
            launched += 1
            try:
                run_id = launcher.launch_job(job)
                logger.info(
                    f"Worker launched job {job['id']} ({job['agent']}, "
                    f"repo {job['repo']}) -> run_id={run_id}"
                )
            except Exception as e:
                # Launch itself failed (e.g. repo missing): treat as a permanent
                # launch error so the job does not wedge as 'running' forever.
                logger.error(f"Worker failed to launch job {job['id']}: {e}")
                try:
                    from .db import get_job, mark_job
                    j = get_job(job["id"])
                    attempts = j.get("attempts", 0) if j else 0
                    max_attempts = j.get("max_attempts", 2) if j else 2
                    if attempts < max_attempts:
                        mark_job(job["id"], "queued", error=f"launch error: {e}")
                    else:
                        mark_job(job["id"], "failed", error=f"launch error: {e}")
                except Exception:
                    pass

    def _run(self):
        logger.info(
            f"Queue worker started (max_concurrency={self.max_concurrency}, "
            f"poll_interval={self.poll_interval}s, breaker_threshold={self.breaker.threshold})"
        )
        while not self._stop.is_set():
            try:
                self._drain_once()
            except Exception as e:
                logger.error(f"Queue worker loop error: {e}")
            self._stop.wait(self.poll_interval)
        logger.info("Queue worker stopped")

    def start(self):
        if self._thread and self._thread.is_alive():
            return
        # Wire breaker alerting + launcher outcome callback.
        try:
            from .notifications import send_telegram
            self.breaker.set_notifier(send_telegram)
        except Exception:
            pass
        launcher.on_outcome = self._on_outcome
        self._stop.clear()
        self._thread = threading.Thread(
            target=self._run, name="queue-worker", daemon=True
        )
        self._thread.start()

    def stop(self, timeout: float = 5.0):
        self._stop.set()
        if self._thread:
            self._thread.join(timeout=timeout)

    def status(self) -> dict:
        """Resilience snapshot for /queue."""
        return {
            "breaker": self.breaker.snapshot(),
            "preflight_ok": self.last_preflight_ok,
            "preflight_reason": self.last_preflight_reason,
        }


# Module-level singleton used by the FastAPI lifespan.
worker = QueueWorker()
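
The breaker's closed -> open -> half-open -> closed cycle is easier to follow with a controllable clock. The sketch below is a simplified stand-in, not the `CircuitBreaker` class from this file: `MiniBreaker` and its injectable `clock` argument are assumptions made for illustration (the real class calls `time.time()` directly and also sends alerts).

```python
import threading
from typing import Callable


class MiniBreaker:
    """Simplified circuit breaker with an injectable clock (hypothetical)."""

    def __init__(self, threshold: int, pause_seconds: float,
                 clock: Callable[[], float]):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self.state = "closed"  # closed | open | half-open
        self.consecutive_transient = 0
        self.opened_at = 0.0

    def record_transient(self) -> None:
        with self._lock:
            self.consecutive_transient += 1
            failed_probe = self.state == "half-open"
            tripped = (self.state == "closed"
                       and self.consecutive_transient >= self.threshold)
            if failed_probe or tripped:
                self.state = "open"
                self.opened_at = self._clock()

    def record_recovered(self) -> None:
        with self._lock:
            self.consecutive_transient = 0
            self.state = "closed"

    def allow_claim(self) -> bool:
        with self._lock:
            if self.state == "open":
                if self._clock() - self.opened_at < self.pause_seconds:
                    return False          # still cooling down
                self.state = "half-open"  # pause elapsed: allow one probe
            return True


# Walk through one full cycle using a fake clock held in a list.
now = [0.0]
breaker = MiniBreaker(threshold=2, pause_seconds=60, clock=lambda: now[0])
trace = []
breaker.record_transient()
trace.append(breaker.state)          # first failure: still closed
breaker.record_transient()
trace.append(breaker.state)          # threshold reached: open
trace.append(breaker.allow_claim())  # claims refused during the pause
now[0] = 61.0
trace.append(breaker.allow_claim())  # pause elapsed: probe allowed
trace.append(breaker.state)
breaker.record_recovered()
trace.append(breaker.state)          # probe succeeded: closed again
```

Injecting the clock is only a testing convenience in this sketch; it lets the pause be exercised without sleeping.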

54
src/stages.py Normal file

@@ -0,0 +1,54 @@
"""Stage machine for orchestrator pipeline.

Stages:
    created → analysis → architecture → development → review → testing → deploy → done

Each stage defines:
    - next: the stage to advance to
    - agent: the agent to launch when entering the NEXT stage
    - qg: the quality gate check required to leave this stage
"""

STAGE_TRANSITIONS = {
    "created": {"next": "analysis", "agent": "analyst", "qg": None},
    "analysis": {"next": "architecture", "agent": "architect", "qg": "check_analysis_approved"},
    "architecture": {"next": "development", "agent": "developer", "qg": "check_architecture_done"},
    "development": {"next": "review", "agent": "reviewer", "qg": "check_tests_local"},
    "review": {"next": "testing", "agent": "tester", "qg": "check_reviewer_verdict"},
    "testing": {"next": "deploy", "agent": "deployer", "qg": "check_tests_passed"},
    "deploy": {"next": "done", "agent": None, "qg": None},
    "done": {"next": None, "agent": None, "qg": None},
}


def get_next_stage(current_stage: str) -> str | None:
    """Get the next stage after current."""
    transition = STAGE_TRANSITIONS.get(current_stage)
    if not transition:
        return None
    return transition["next"]


def get_agent_for_stage(stage: str) -> str | None:
    """Get the agent to launch when advancing FROM this stage (entering next stage)."""
    transition = STAGE_TRANSITIONS.get(stage)
    if not transition:
        return None
    return transition["agent"]


def get_qg_for_stage(current_stage: str) -> str | None:
    """Get the QG check function name required to leave current stage."""
    transition = STAGE_TRANSITIONS.get(current_stage)
    if not transition:
        return None
    return transition["qg"]


def get_previous_stage(current_stage: str) -> str | None:
    """Get the previous stage (for rollback)."""
    # Relies on dict insertion order matching pipeline order.
    stages = list(STAGE_TRANSITIONS.keys())
    idx = stages.index(current_stage) if current_stage in stages else -1
    if idx <= 0:
        return None
    return stages[idx - 1]


@@ -1,14 +1,54 @@
from fastapi import APIRouter, Request
"""Gitea webhook handlers — full implementation."""
import hmac
import subprocess
import os
import hashlib
import json
from ..db import get_db
import logging
import httpx
from fastapi import APIRouter, Request, HTTPException
from ..config import settings
from ..db import get_db, get_task_by_repo_branch, update_task_stage, enqueue_job
from ..stages import get_next_stage, get_agent_for_stage
from ..qg.checks import check_ci_green, check_review_approved
from ..notifications import notify_stage_change, notify_qg_failure, notify_error
from ..agents.launcher import launcher
from ..plane_sync import notify_stage_change as plane_notify_stage
from ..projects import get_project_by_repo
logger = logging.getLogger("orchestrator.webhooks.gitea")
router = APIRouter()
# Max retries for developer on request_changes
MAX_DEV_RETRIES = 3
def verify_gitea_signature(body: bytes, signature: str) -> bool:
"""Verify Gitea webhook HMAC-SHA256 signature."""
if not settings.gitea_webhook_secret:
return True # Skip verification if no secret configured
expected = hmac.new(
settings.gitea_webhook_secret.encode(),
body,
hashlib.sha256,
).hexdigest()
return hmac.compare_digest(expected, signature)
@router.post("/gitea")
async def gitea_webhook(request: Request):
"""Handle Gitea webhook events."""
body = await request.body()
# Verify HMAC signature
signature = request.headers.get("X-Gitea-Signature", "")
if not verify_gitea_signature(body, signature):
logger.warning("Gitea webhook: invalid signature")
raise HTTPException(status_code=401, detail="Invalid signature")
payload = json.loads(body)
# Log event
@@ -19,36 +59,253 @@ async def gitea_webhook(request: Request):
("gitea", event_type, body.decode()),
)
conn.commit()
conn.close()
if event_type == "push":
await handle_push(payload, conn)
elif event_type == "pull_request":
await handle_pr(payload, conn)
await handle_push(payload)
elif event_type.startswith("pull_request"):
await handle_pr(payload)
elif event_type == "status":
await handle_ci_status(payload, conn)
await handle_ci_status(payload)
conn.close()
return {"status": "accepted"}
async def handle_push(payload: dict, conn):
"""Push event — log for now."""
pass
async def handle_push(payload: dict):
"""
Push event:
- If stage=architecture and push contains ADR files → advance to development
- If stage=development and push contains src/ → wait for CI
"""
ref = payload.get("ref", "")
# Extract branch: refs/heads/feature/ET-003-slug → feature/ET-003-slug
if not ref.startswith("refs/heads/"):
return
branch = ref.removeprefix("refs/heads/")
repo_name = payload.get("repository", {}).get("name", settings.default_repo)
# ORCH-6: ignore pushes to repos outside the project registry.
if not get_project_by_repo(repo_name):
logger.info(f"Gitea push: ignoring unknown repo '{repo_name}'")
return
task = get_task_by_repo_branch(repo_name, branch)
if not task:
logger.debug(f"Push to '{branch}' — no matching task found")
return
task_id = task["id"]
current_stage = task["stage"]
work_item_id = task.get("work_item_id", "")
# Collect modified files from commits
modified_files = set()
for commit in payload.get("commits", []):
modified_files.update(commit.get("added", []))
modified_files.update(commit.get("modified", []))
if current_stage == "architecture":
# Check if ADR files were pushed
has_adr = any(
f"docs/work-items/{work_item_id}/06-adr/" in f
or f"docs/work-items/{work_item_id}/07-infra-requirements.md" == f
for f in modified_files
)
if has_adr:
# Advance to development
next_stage = "development"
update_task_stage(task_id, next_stage)
notify_stage_change(task_id, current_stage, next_stage)
plane_notify_stage(work_item_id, current_stage, next_stage)
agent = get_agent_for_stage(current_stage)
if agent:
try:
task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: push triggered {current_stage} → {next_stage}, enqueued '{agent}' (job_id={job_id})")
except Exception as e:
notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
elif current_stage == "development":
# Source files pushed — just log, wait for CI
has_src = any(f.startswith("src/") for f in modified_files)
if has_src:
logger.info(f"Task {task_id}: source push detected on '{branch}', waiting for CI")
async def handle_pr(payload: dict, conn):
"""PR event — check reviews, CI status."""
async def handle_ci_status(payload: dict):
"""
CI status update:
- If state=success and stage=development → advance to review, launch reviewer
- If state=failure → log
"""
state = payload.get("state", "")
# Extract branch from target_url or branches
branches = payload.get("branches", [])
branch = ""
if branches:
branch = branches[0].get("name", "")
# Alternative: find branch by SHA from tasks DB
if not branch:
sha = payload.get("sha", "")
repo_name = payload.get("repository", {}).get("name", settings.default_repo)
# Try to find task by checking git branch containing this SHA.
# ORCH-2 / S-4: this is a READ-ONLY query of remote-tracking refs in the main
# clone (no checkout / no mutation), so it is safe to keep on /repos/<repo>.
try:
result = subprocess.run(
["git", "-C", os.path.join(settings.repos_dir, repo_name),
"branch", "-r", "--contains", sha],
capture_output=True, text=True, timeout=10,
)
for line in result.stdout.strip().splitlines():
b = line.strip().replace("origin/", "")
if b.startswith("feature/"):
branch = b
break
except Exception:
pass
if not branch:
logger.debug(f"CI status event: could not determine branch for sha={sha}")
return
repo_name = payload.get("repository", {}).get("name", settings.default_repo)
# ORCH-6: ignore CI status for repos outside the project registry.
if not get_project_by_repo(repo_name):
logger.info(f"Gitea CI status: ignoring unknown repo '{repo_name}'")
return
task = get_task_by_repo_branch(repo_name, branch)
if not task:
return
task_id = task["id"]
current_stage = task["stage"]
work_item_id = task.get("work_item_id", "")
if state == "success" and current_stage == "development":
# Verify CI is actually green via API (double-check)
passed, reason = check_ci_green(repo_name, branch)
if passed:
next_stage = "review"
update_task_stage(task_id, next_stage)
notify_stage_change(task_id, current_stage, next_stage)
plane_notify_stage(work_item_id, current_stage, next_stage)
agent = get_agent_for_stage(current_stage)
if agent:
try:
task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {branch}\nStage: {next_stage}"
job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: CI green → {next_stage}, enqueued '{agent}' (job_id={job_id})")
except Exception as e:
notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
else:
notify_qg_failure(task_id, current_stage, "check_ci_green", reason)
elif state == "failure":
# S-1: Gitea CI is NOT the authoritative gate anymore (the orchestrator runs
# tests locally via check_tests_local). Gitea CI is often unconfigured, so a
# "failure"/empty status here is not actionable. Log only, do not alert.
logger.debug(f"Task {task_id}: Gitea CI state='failure' on branch '{branch}' "
f"(non-authoritative, suppressed — local tests are the gate)")
async def handle_pr(payload: dict):
"""
PR event:
- action=reviewed + approved → advance to testing, launch tester
- action=reviewed + request_changes → back to development, relaunch developer (max 3x)
- action=closed + merged → stage=done
"""
action = payload.get("action", "")
pr = payload.get("pull_request", {})
review = payload.get("review", {})
if action == "reviewed" and pr.get("state") == "approved":
# TODO: QG-5 check -> launch Tester
pass
# Get branch from PR head
head_branch = pr.get("head", {}).get("ref", "")
repo_name = payload.get("repository", {}).get("name", settings.default_repo)
if not head_branch:
return
async def handle_ci_status(payload: dict, conn):
"""CI status update — check if all green -> advance."""
state = payload.get("state", "")
if state == "success":
# TODO: Check all required contexts green -> advance stage
pass
# ORCH-6: ignore PR events for repos outside the project registry.
if not get_project_by_repo(repo_name):
logger.info(f"Gitea PR: ignoring unknown repo '{repo_name}'")
return
task = get_task_by_repo_branch(repo_name, head_branch)
if not task:
logger.debug(f"PR event for branch '{head_branch}' — no matching task")
return
task_id = task["id"]
current_stage = task["stage"]
work_item_id = task.get("work_item_id", "")
if action == "reviewed":
# Gitea sends review.state (older) or review.type (newer format)
review_state = review.get("state", "").upper()
if not review_state and review.get("type", ""):
# Map type field: "pull_request_review_approved" -> "APPROVED"
rtype = review.get("type", "")
if "approved" in rtype.lower():
review_state = "APPROVED"
elif "request_changes" in rtype.lower() or "rejected" in rtype.lower():
review_state = "REQUEST_CHANGES"
if review_state == "APPROVED" and current_stage == "review":
# Advance to testing
pr_number = pr.get("number")
passed, reason = check_review_approved(repo_name, pr_number)
if passed:
next_stage = "testing"
update_task_stage(task_id, next_stage)
notify_stage_change(task_id, current_stage, next_stage)
plane_notify_stage(work_item_id, current_stage, next_stage)
agent = get_agent_for_stage(current_stage)
if agent:
try:
task_desc = f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\nStage: {next_stage}"
job_id = enqueue_job(agent, repo_name, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: PR approved → {next_stage}, enqueued '{agent}' (job_id={job_id})")
except Exception as e:
notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
else:
notify_qg_failure(task_id, current_stage, "check_review_approved", reason)
elif review_state == "REQUEST_CHANGES" and current_stage == "review":
# Count retries
conn = get_db()
retry_count = conn.execute(
"SELECT COUNT(*) as cnt FROM agent_runs WHERE task_id = ? AND agent = 'developer'",
(task_id,),
).fetchone()["cnt"]
conn.close()
if retry_count < MAX_DEV_RETRIES:
# Back to development, relaunch developer
update_task_stage(task_id, "development")
notify_stage_change(task_id, current_stage, "development")
try:
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo_name}\nBranch: {head_branch}\n"
f"Stage: development\nNote: Changes requested in review (attempt {retry_count + 1}/{MAX_DEV_RETRIES})"
)
job_id = enqueue_job("developer", repo_name, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: changes requested, enqueued developer (attempt {retry_count + 1}, job_id={job_id})")
except Exception as e:
notify_error(task_id, f"Failed to relaunch developer: {e}")
else:
notify_error(task_id, f"Max developer retries ({MAX_DEV_RETRIES}) reached, escalating")
logger.error(f"Task {task_id}: max retries reached, needs manual intervention")
elif action == "closed" and pr.get("merged", False):
update_task_stage(task_id, "done")
notify_stage_change(task_id, current_stage, "done")
logger.info(f"Task {task_id}: PR merged, stage → done")
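
The signature check performed by `verify_gitea_signature` in this file can be exercised on its own. The sketch below is an illustration under stated assumptions: `sign` and `verify` are hypothetical helper names, and unlike the handler above (which skips verification when no secret is configured) this version fails closed on an empty secret.

```python
import hashlib
import hmac


def sign(secret: str, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (hypothetical helper)."""
    return hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()


def verify(secret: str, body: bytes, signature: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    if not secret:
        # Assumption for this sketch: reject when no secret is configured.
        return False
    expected = sign(secret, body)
    # Compare as bytes so non-ASCII header values cannot raise TypeError.
    return hmac.compare_digest(expected.encode(), signature.encode())


body = b'{"ref": "refs/heads/feature/demo"}'
good_signature = sign("s3cret", body)
accepted = verify("s3cret", body, good_signature)
tampered = verify("s3cret", body + b" ", good_signature)
no_secret = verify("", body, good_signature)
```

The signature must be computed over the exact bytes received, which is why the handler reads `request.body()` before parsing the JSON.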


@@ -1,14 +1,64 @@
from fastapi import APIRouter, Request
"""Plane webhook handlers — full implementation."""
import hmac
import hashlib
import re
import json
from ..db import get_db
import logging
import httpx
from fastapi import APIRouter, Request, HTTPException
from ..config import settings
from ..db import (
get_db,
get_task_by_plane_id,
get_next_work_item_id,
update_task_stage,
enqueue_job,
)
from ..stages import get_next_stage, get_agent_for_stage, get_qg_for_stage, get_previous_stage
from ..qg.checks import QG_CHECKS
from ..notifications import notify_stage_change, notify_qg_failure, notify_error
from ..agents.launcher import launcher
from ..plane_sync import (
notify_stage_change as plane_notify_stage,
notify_qg_failure as plane_notify_qg,
notify_done as plane_notify_done,
)
from ..projects import (
get_project_by_plane_id,
get_project_by_repo,
known_plane_project_ids,
)
logger = logging.getLogger("orchestrator.webhooks.plane")
router = APIRouter()
def verify_plane_signature(body: bytes, signature: str) -> bool:
"""Verify Plane webhook HMAC-SHA256 signature."""
if not settings.plane_webhook_secret:
return True # Skip verification if no secret configured
expected = hmac.new(
settings.plane_webhook_secret.encode(),
body,
hashlib.sha256,
).hexdigest()
return hmac.compare_digest(expected, signature)
@router.post("/plane")
async def plane_webhook(request: Request):
"""Handle Plane webhook events."""
body = await request.body()
# Verify HMAC signature
signature = request.headers.get("X-Plane-Signature", "")
if not verify_plane_signature(body, signature):
logger.warning("Plane webhook: invalid signature")
raise HTTPException(status_code=401, detail="Invalid signature")
payload = json.loads(body)
# Log event
@@ -18,32 +68,368 @@ async def plane_webhook(request: Request):
("plane", payload.get("event", "unknown"), body.decode()),
)
conn.commit()
conn.close()
event = payload.get("event")
action = payload.get("action", "")
data = payload.get("data", {})
if event == "work_item.created":
await handle_work_item_created(data, conn)
elif event == "comment.created":
await handle_comment(data, conn)
# ORCH-6: filter by Plane project. Ignore issues from unknown/unconfigured
# projects so a webhook on the whole workspace cannot funnel everything into
# the default repo (root cause of the 2026-06-02 incident).
project_id = data.get("project") or data.get("project_id") or ""
if project_id not in known_plane_project_ids():
logger.info(
f"Plane webhook: ignoring event '{event}' from unknown project "
f"'{project_id}' (known: {len(known_plane_project_ids())})"
)
return {"status": "ignored", "reason": "unknown project"}
if (event == "work_item.created") or (event == "issue" and action == "created"):
await handle_work_item_created(data, project_id)
elif (event == "comment.created") or (event == "issue_comment" and action == "created"):
await handle_comment(data, project_id)
conn.close()
return {"status": "accepted"}
async def handle_work_item_created(data: dict, conn):
"""New work item -> create task record."""
async def handle_work_item_created(data: dict, project_id: str = ""):
"""
New work item created in Plane.
QG-0: validate title, description, priority.
If valid: create branch, init docs, launch analyst.
If invalid: comment with what's missing, set Blocked.
"""
plane_id = data.get("id", "")
name = data.get("name", "untitled")
description = data.get("description_stripped", data.get("description", ""))
priority = data.get("priority", {})
priority_name = priority if isinstance(priority, str) else priority.get("name", "")
# ORCH-6: resolve repo / prefix / Plane project from the registry instead of
# the single hardcoded default_repo.
if not project_id:
project_id = data.get("project") or data.get("project_id") or ""
proj = get_project_by_plane_id(project_id)
if not proj:
logger.warning(f"handle_work_item_created: unknown project '{project_id}', ignoring {plane_id}")
return
repo = proj.repo
plane_project_id = proj.plane_project_id
# QG-0 validation
errors = []
if not name or len(name) < 5:
errors.append("Title \u0441\u043b\u0438\u0448\u043a\u043e\u043c \u043a\u043e\u0440\u043e\u0442\u043a\u0438\u0439 (\u043d\u0443\u0436\u043d\u043e >= 5 \u0441\u0438\u043c\u0432\u043e\u043b\u043e\u0432)")
if len(name) > 80:
errors.append("Title \u0441\u043b\u0438\u0448\u043a\u043e\u043c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 (\u043c\u0430\u043a\u0441\u0438\u043c\u0443\u043c 80 \u0441\u0438\u043c\u0432\u043e\u043b\u043e\u0432)")
if not description or len(description.strip()) < 20:
errors.append("Description \u0441\u043b\u0438\u0448\u043a\u043e\u043c \u043a\u043e\u0440\u043e\u0442\u043a\u0438\u0439 (\u043d\u0443\u0436\u043d\u043e >= 20 \u0441\u0438\u043c\u0432\u043e\u043b\u043e\u0432)")
if errors:
# QG-0 failed
error_text = "\u26a0\ufe0f QG-0 failed:\n" + "\n".join(f"\u2022 {e}" for e in errors)
from ..plane_sync import PLANE_BASE, PLANE_HEADERS, WORKSPACE, PLANE_STATES
import httpx as _httpx
# Post comment (ORCH-6: route to the issue's own project)
url = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{plane_project_id}/issues/{plane_id}/comments/"
try:
_httpx.post(url, headers=PLANE_HEADERS,
json={"comment_html": f"<p>{error_text}</p>"}, timeout=10)
except Exception:
pass
# Set blocked
url2 = f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{plane_project_id}/issues/{plane_id}/"
try:
_httpx.patch(url2, headers=PLANE_HEADERS,
json={"state": PLANE_STATES["blocked"]}, timeout=10)
except Exception:
pass
logger.info(f"QG-0 failed for {plane_id}: {errors}")
return
# Generate work item ID
work_item_id = get_next_work_item_id(repo, proj.work_item_prefix)
# Create slug from name
slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")[:30]
branch = f"feature/{work_item_id}-{slug}"
# Insert task into DB
conn = get_db()
conn.execute(
"INSERT INTO tasks (plane_id, repo, stage) VALUES (?, ?, ?)",
(plane_id, "enduro-trails", "analysis"),
"INSERT INTO tasks (plane_id, work_item_id, repo, branch, stage, plane_issue_id) VALUES (?, ?, ?, ?, ?, ?)",
(plane_id, work_item_id, repo, branch, "analysis", plane_id),
)
conn.commit()
conn.close()
# Create branch in Gitea
try:
await _create_gitea_branch(repo, branch)
except Exception as e:
logger.error(f"Failed to create branch '{branch}': {e}")
# Task is created, branch creation failed — log but don't crash
notify_error(0, f"Branch creation failed: {e}")
return
# Create initial docs structure via Gitea API (create file)
try:
await _create_initial_docs(repo, branch, work_item_id, name)
except Exception as e:
logger.error(f"Failed to create initial docs: {e}")
logger.info(f"Task created: {work_item_id} ({name}), branch={branch}, stage=analysis")
# Launch analyst agent
try:
task_row = get_db().execute("SELECT id FROM tasks WHERE work_item_id=?", (work_item_id,)).fetchone()
if task_row:
task_id = task_row[0]
task_desc = f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\nStage: analysis\nTitle: {name}"
job_id = enqueue_job("analyst", repo, task_desc, task_id=task_id)
logger.info(f"Task {task_id}: enqueued analyst (job_id={job_id})")
# Post start comment to Plane
from ..plane_sync import add_comment as _add_comment
_add_comment(work_item_id, "\U0001f50d Analyst \u0437\u0430\u043f\u0443\u0449\u0435\u043d. BRD/\u0422\u0417/AC/TestPlan \u0432 \u0440\u0430\u0431\u043e\u0442\u0435 (\u043e\u0436\u0438\u0434\u0430\u0439\u0442\u0435 8-15 \u043c\u0438\u043d).")
except Exception as e:
logger.error(f"Failed to launch analyst for {work_item_id}: {e}")
async def handle_comment(data: dict, conn):
"""Check for :approved: reaction -> advance stage."""
comment_body = data.get("comment", "")
async def handle_comment(data: dict, project_id: str = ""):
"""
Handle comment event — check for :approved: or :rejected:.
Advance or rollback stage accordingly.
"""
comment_body = data.get("comment_stripped", data.get("comment", data.get("body", data.get("comment_html", ""))))
plane_id = str(data.get("work_item_id") or data.get("issue_id") or data.get("issue") or "")
if not plane_id:
logger.warning("Comment event without work_item_id, skipping")
return
task = get_task_by_plane_id(plane_id)
if not task:
logger.warning(f"No task found for plane_id={plane_id}")
return
task_id = task["id"]
current_stage = task["stage"]
repo = task["repo"]
work_item_id = task.get("work_item_id", "")
branch = task.get("branch", "")
if ":rejected:" in comment_body:
# Extract reason (text after :rejected:)
reason = comment_body.split(":rejected:", 1)[-1].strip()[:300]
if current_stage == "analysis":
# Already in analysis — just relaunch analyst with rejection reason
from ..plane_sync import set_issue_in_progress
set_issue_in_progress(work_item_id)
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
f"Stage: analysis\nNote: Stakeholder REJECTED your artifacts. "
f"Reason: {reason}\nRevise and improve."
)
new_job = enqueue_job("analyst", repo, task_desc, task_id=task_id)
from ..plane_sync import add_comment as _plane_comment
_plane_comment(work_item_id, f"\U0001f504 Analyst \u043f\u0435\u0440\u0435\u0437\u0430\u043f\u0443\u0449\u0435\u043d. \u041f\u0440\u0438\u0447\u0438\u043d\u0430 \u043e\u0442\u043a\u043b\u043e\u043d\u0435\u043d\u0438\u044f: {reason}")
logger.info(f"Task {task_id}: rejected at analysis, enqueued analyst (job_id={new_job})")
else:
# Rollback to previous stage
prev_stage = get_previous_stage(current_stage)
if prev_stage:
update_task_stage(task_id, prev_stage)
from ..plane_sync import set_issue_in_progress
set_issue_in_progress(work_item_id)
notify_stage_change(task_id, current_stage, prev_stage)
plane_notify_stage(work_item_id, current_stage, prev_stage)
from ..plane_sync import add_comment as _plane_comment
_plane_comment(work_item_id, f"\U0001f504 \u041e\u0442\u043a\u0430\u0442: {current_stage} \u2192 {prev_stage}. \u041f\u0440\u0438\u0447\u0438\u043d\u0430: {reason}")
logger.info(f"Task {task_id}: rejected, rolled back {current_stage} \u2192 {prev_stage}")
return
if ":approved:" in comment_body:
# TODO: Determine which task, advance QG
pass
from ..plane_sync import set_issue_in_progress
set_issue_in_progress(work_item_id)
# Try to advance stage
await _try_advance_stage(task_id, current_stage, repo, work_item_id, branch)
return
# Task 3: If neither :approved: nor :rejected: — check if this is an answer to questions
if current_stage == "analysis":
from ..plane_sync import PLANE_STATES, set_issue_in_progress
issue_id = task.get("plane_issue_id") or task.get("plane_id")
if not issue_id:
issue_id = plane_id
if issue_id:
from ..plane_sync import PLANE_BASE, PLANE_HEADERS, WORKSPACE
from ..plane_sync import PROJECT_ID as _DEFAULT_PROJECT_ID
# ORCH-6: route to this task's own Plane project (resolved from repo).
_proj = get_project_by_repo(repo)
_pid = _proj.plane_project_id if _proj else (project_id or _DEFAULT_PROJECT_ID)
import httpx as _httpx
try:
_resp = _httpx.get(
f"{PLANE_BASE}/workspaces/{WORKSPACE}/projects/{_pid}/issues/{issue_id}/",
headers=PLANE_HEADERS, timeout=10
)
if _resp.status_code == 200:
issue_data = _resp.json()
if issue_data.get("state") == PLANE_STATES["needs_input"]:
# Task 11: Check analyst retry count (max 3 question rounds)
conn3 = get_db()
analyst_runs = conn3.execute(
"SELECT COUNT(*) FROM agent_runs WHERE task_id=? AND agent='analyst'",
(task_id,)
).fetchone()[0]
conn3.close()
if analyst_runs >= 4: # initial + 3 retries
from ..plane_sync import set_issue_blocked, add_comment as _pc
set_issue_blocked(work_item_id)
_pc(
work_item_id,
"\U0001f6a8 3 \u0440\u0430\u0443\u043d\u0434\u0430 \u0443\u0442\u043e\u0447\u043d\u0435\u043d\u0438\u0439 \u0438\u0441\u0447\u0435\u0440\u043f\u0430\u043d\u044b. Analyst \u043d\u0435 \u043c\u043e\u0436\u0435\u0442 \u0441\u0444\u043e\u0440\u043c\u0438\u0440\u043e\u0432\u0430\u0442\u044c \u0422\u0417. "
"\u0422\u0440\u0435\u0431\u0443\u0435\u0442\u0441\u044f \u0431\u043e\u043b\u0435\u0435 \u0434\u0435\u0442\u0430\u043b\u044c\u043d\u043e\u0435 \u043e\u043f\u0438\u0441\u0430\u043d\u0438\u0435 \u0438\u043b\u0438 \u0432\u0441\u0442\u0440\u0435\u0447\u0430."
)
from ..notifications import send_telegram
send_telegram(f"\U0001f6a8 {work_item_id}: 3 \u0440\u0430\u0443\u043d\u0434\u0430 \u0432\u043e\u043f\u0440\u043e\u0441\u043e\u0432 analyst'\u0430 \u0438\u0441\u0447\u0435\u0440\u043f\u0430\u043d\u044b. \u041d\u0443\u0436\u043d\u0430 \u043f\u043e\u043c\u043e\u0449\u044c.")
return
# This is an answer to analyst's questions — relaunch
set_issue_in_progress(work_item_id)
task_desc = (
f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\n"
f"Stage: analysis\nNote: Stakeholder answered your questions. "
f"Read the latest comment in Plane and revise your artifacts.\n"
f"Answer: {comment_body[:500]}"
)
new_job = enqueue_job("analyst", repo, task_desc, task_id=task_id)
from ..plane_sync import add_comment as _pc2
_pc2(work_item_id, "\U0001f504 Analyst \u043f\u0435\u0440\u0435\u0437\u0430\u043f\u0443\u0449\u0435\u043d \u0441 \u043e\u0442\u0432\u0435\u0442\u0430\u043c\u0438 \u0441\u0442\u0435\u0439\u043a\u0445\u043e\u043b\u0434\u0435\u0440\u0430.")
logger.info(f"Task {task_id}: stakeholder answered questions, enqueued analyst (job_id={new_job})")
return
except Exception as e:
logger.error(f"Failed to check issue state: {e}")
async def _try_advance_stage(
task_id: int, current_stage: str, repo: str, work_item_id: str, branch: str
):
"""Run QG check for current stage and advance if passed."""
qg_name = get_qg_for_stage(current_stage)
next_stage = get_next_stage(current_stage)
if not next_stage:
logger.info(f"Task {task_id}: already at terminal stage '{current_stage}'")
return
# Run QG check if one is required
if qg_name:
qg_func = QG_CHECKS.get(qg_name)
if not qg_func:
logger.error(f"QG function '{qg_name}' not found in registry")
return
# Determine args based on QG function
if qg_name in ("check_analysis_approved", "check_analysis_complete", "check_architecture_done", "check_tests_passed", "check_reviewer_verdict"):
# ORCH-2 / S-4: pass branch so artifacts are read from the task worktree.
passed, reason = qg_func(repo, work_item_id, branch)
elif qg_name in ("check_ci_green", "check_tests_local"):
passed, reason = qg_func(repo, branch)
elif qg_name == "check_review_approved":
# Find PR number by branch via Gitea API
import httpx as _httpx
from ..config import settings as _s
_owner = _s.gitea_owner
_url = f"{_s.gitea_url}/api/v1/repos/{_owner}/{repo}/pulls?state=open&limit=50"
_headers = {"Authorization": f"token {_s.gitea_token}"}
try:
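_resp = _httpx.get(_url, headers=_headers, timeout=10)
# Fail fast on a non-2xx answer so the except below reports the HTTP error.
_resp.raise_for_status()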
_prs = _resp.json()
_pr_number = None
for _pr in _prs:
if _pr.get("head", {}).get("ref") == branch:
_pr_number = _pr["number"]
break
if _pr_number:
passed, reason = qg_func(repo, _pr_number)
else:
# No open PR for this branch: fall back to a file-based check for a review file
import os
from ..git_worktree import get_worktree_path as _gwp
_wt = _gwp(repo, branch) if os.path.isdir(_gwp(repo, branch)) else os.path.join(_s.repos_dir, repo)
_review_path = os.path.join(_wt, f"docs/work-items/{work_item_id}/12-review.md")
_review_path2 = os.path.join(_wt, f"docs/work-items/{work_item_id}/09-review.md")
if os.path.isfile(_review_path) or os.path.isfile(_review_path2):
passed, reason = True, "Review file exists (file-based approval)"
else:
passed, reason = False, "No open PR found and no review file"
except Exception as _e:
passed, reason = False, f"Error finding PR: {_e}"
else:
passed, reason = False, f"Unknown QG: {qg_name}"
if not passed:
notify_qg_failure(task_id, current_stage, qg_name, reason)
plane_notify_qg(work_item_id, current_stage, qg_name, reason)
return
# Advance stage
update_task_stage(task_id, next_stage)
notify_stage_change(task_id, current_stage, next_stage)
plane_notify_stage(work_item_id, current_stage, next_stage)
# Launch agent associated with the current stage's transition
agent = get_agent_for_stage(current_stage)
if agent:
try:
task_desc = f"Work item: {work_item_id}\nRepo: {repo}\nBranch: {branch}\nStage: {next_stage}"
job_id = enqueue_job(agent, repo, task_desc, task_id=task_id)
plane_notify_stage(work_item_id, current_stage, next_stage, agent)
logger.info(f"Task {task_id}: enqueued agent '{agent}', job_id={job_id}")
except Exception as e:
notify_error(task_id, f"Failed to launch agent '{agent}': {e}")
logger.error(f"Agent launch failed: {e}")
async def _create_gitea_branch(repo: str, branch: str):
"""Create a new branch in Gitea from main."""
owner = settings.gitea_owner
url = f"{settings.gitea_url}/api/v1/repos/{owner}/{repo}/branches"
headers = {"Authorization": f"token {settings.gitea_token}"}
payload = {"new_branch_name": branch, "old_branch_name": "main"}
async with httpx.AsyncClient() as client:
resp = await client.post(url, json=payload, headers=headers, timeout=10)
if resp.status_code == 409:
logger.info(f"Branch '{branch}' already exists")
return
resp.raise_for_status()
logger.info(f"Created branch '{branch}' in {owner}/{repo}")
async def _create_initial_docs(repo: str, branch: str, work_item_id: str, name: str):
"""Create initial business request doc in the feature branch."""
owner = settings.gitea_owner
file_path = f"docs/work-items/{work_item_id}/00-business-request.md"
url = f"{settings.gitea_url}/api/v1/repos/{owner}/{repo}/contents/{file_path}"
headers = {"Authorization": f"token {settings.gitea_token}"}
import base64
content = f"# Business Request: {name}\n\nWork Item ID: {work_item_id}\n\n## Description\n\nTBD\n"
encoded = base64.b64encode(content.encode()).decode()
payload = {
"message": f"docs: init {work_item_id} business request",
"content": encoded,
"branch": branch,
}
async with httpx.AsyncClient() as client:
resp = await client.post(url, json=payload, headers=headers, timeout=10)
if resp.status_code in (201, 422): # 422 = already exists
return
resp.raise_for_status()
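
The `check_review_approved` branch above finds the PR by scanning the open pulls for a matching head ref. A minimal sketch of that lookup as a pure function follows, which would make it testable without HTTP. The name `find_pr_number` and the sample payload are hypothetical and are not part of this diff; only the `head.ref` / `number` keys are taken from the code above.

```python
from typing import Any, Optional


def find_pr_number(prs: list[dict[str, Any]], branch: str) -> Optional[int]:
    """Return the number of the first PR whose head ref equals `branch`, else None."""
    for pr in prs:
        # Same access pattern as the inline loop: tolerate a missing "head" key.
        if pr.get("head", {}).get("ref") == branch:
            return pr["number"]
    return None


# Hypothetical payload shaped like the fields the webhook code reads.
sample_prs = [
    {"number": 7, "head": {"ref": "feature/ET-001-x"}},
    {"number": 9, "head": {"ref": "feature/ORCH-002-y"}},
]
print(find_pr_number(sample_prs, "feature/ORCH-002-y"))
print(find_pr_number(sample_prs, "feature/missing"))
```

Keeping the lookup separate from the request would leave only the HTTP call and the file-based fallback inside the `try` block.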

152
tests/test_git_worktree.py Normal file

@@ -0,0 +1,152 @@
"""Tests for src/git_worktree (ORCH-2 / S-4): isolated worktree per task/branch.
Uses real local git repos in tmp (a bare 'origin' + a working main clone) so that
`git fetch origin`, `git worktree add`, branch creation from origin/main, reuse and
removal are all exercised without network access.
"""
import os
import subprocess
import tempfile
import pytest
# Env must be set before importing app modules (same convention as the other suites).
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator_wt.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
from src import git_worktree
from src.git_worktree import (
_safe,
get_worktree_path,
ensure_worktree,
remove_worktree,
)
def _git(cwd, *args):
return subprocess.run(["git", "-C", cwd, *args], capture_output=True, text=True)
@pytest.fixture
def repos(tmp_path, monkeypatch):
"""Build a bare 'origin' with main + a feature branch, plus a main clone at repos_dir/<repo>.
Returns the repo name. settings.repos_dir / worktrees_dir are pointed at tmp.
"""
repo = "enduro-trails"
repos_dir = tmp_path / "repos"
wt_dir = tmp_path / "repos" / "_wt"
repos_dir.mkdir(parents=True)
monkeypatch.setattr(git_worktree.settings, "repos_dir", str(repos_dir))
monkeypatch.setattr(git_worktree.settings, "worktrees_dir", str(wt_dir))
# Bare origin
origin = tmp_path / "origin.git"
subprocess.run(["git", "init", "--bare", "-b", "main", str(origin)], capture_output=True)
# Seed repo
seed = tmp_path / "seed"
seed.mkdir()
_git(str(seed), "init", "-b", "main")
_git(str(seed), "config", "user.email", "t@t")
_git(str(seed), "config", "user.name", "t")
(seed / "README.md").write_text("# seed\n")
_git(str(seed), "add", ".")
_git(str(seed), "commit", "-m", "init")
_git(str(seed), "remote", "add", "origin", str(origin))
_git(str(seed), "push", "origin", "main")
# An existing feature branch on origin
_git(str(seed), "checkout", "-b", "feature/existing")
(seed / "f.txt").write_text("feature\n")
_git(str(seed), "add", ".")
_git(str(seed), "commit", "-m", "feat")
_git(str(seed), "push", "origin", "feature/existing")
# Main clone at repos_dir/<repo>
main_clone = repos_dir / repo
subprocess.run(["git", "clone", str(origin), str(main_clone)], capture_output=True)
_git(str(main_clone), "config", "user.email", "t@t")
_git(str(main_clone), "config", "user.name", "t")
return repo
# ---------------------------------------------------------------------------
# _safe / get_worktree_path
# ---------------------------------------------------------------------------
class TestSafeAndPath:
def test_safe_replaces_slashes_and_specials(self):
assert _safe("feature/ET-001-x") == "feature_ET-001-x"
assert _safe("a b/c:d") == "a_b_c_d"
assert _safe("keep.dots-and_underscores") == "keep.dots-and_underscores"
def test_get_worktree_path(self, monkeypatch):
monkeypatch.setattr(git_worktree.settings, "worktrees_dir", "/repos/_wt")
assert get_worktree_path("repo", "feature/x") == "/repos/_wt/repo/feature_x"
# ---------------------------------------------------------------------------
# ensure_worktree
# ---------------------------------------------------------------------------
class TestEnsureWorktree:
def test_missing_main_repo_raises(self, tmp_path, monkeypatch):
monkeypatch.setattr(git_worktree.settings, "repos_dir", str(tmp_path / "nope"))
monkeypatch.setattr(git_worktree.settings, "worktrees_dir", str(tmp_path / "_wt"))
with pytest.raises(FileNotFoundError):
ensure_worktree("enduro-trails", "main")
def test_creates_worktree_for_existing_branch(self, repos):
wt = ensure_worktree(repos, "feature/existing")
assert os.path.isdir(wt)
assert wt == get_worktree_path(repos, "feature/existing")
# On the right branch
cur = _git(wt, "branch", "--show-current").stdout.strip()
assert cur == "feature/existing"
# Feature file from that branch is present (proves correct checkout)
assert os.path.isfile(os.path.join(wt, "f.txt"))
def test_creates_new_branch_from_origin_main(self, repos):
wt = ensure_worktree(repos, "feature/brand-new")
assert os.path.isdir(wt)
cur = _git(wt, "branch", "--show-current").stdout.strip()
assert cur == "feature/brand-new"
# Based on main -> README present, no feature file
assert os.path.isfile(os.path.join(wt, "README.md"))
assert not os.path.isfile(os.path.join(wt, "f.txt"))
def test_reuse_returns_same_path(self, repos):
wt1 = ensure_worktree(repos, "feature/existing")
wt2 = ensure_worktree(repos, "feature/existing")
assert wt1 == wt2
assert os.path.isdir(wt2)
def test_two_branches_are_isolated(self, repos):
a = ensure_worktree(repos, "feature/wt-A")
b = ensure_worktree(repos, "feature/wt-B")
assert a != b
ba = _git(a, "branch", "--show-current").stdout.strip()
bb = _git(b, "branch", "--show-current").stdout.strip()
assert ba == "feature/wt-A"
assert bb == "feature/wt-B"
# Writing in A must not affect B
with open(os.path.join(a, "only-a.txt"), "w") as f:
f.write("a")
assert not os.path.isfile(os.path.join(b, "only-a.txt"))
# ---------------------------------------------------------------------------
# remove_worktree
# ---------------------------------------------------------------------------
class TestRemoveWorktree:
def test_remove_deletes_worktree_dir(self, repos):
wt = ensure_worktree(repos, "feature/to-remove")
assert os.path.isdir(wt)
remove_worktree(repos, "feature/to-remove")
assert not os.path.isdir(wt)
def test_remove_nonexistent_is_noop(self, repos):
# Should not raise even if the worktree was never created.
remove_worktree(repos, "feature/never-made")
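
The real `_safe` and `get_worktree_path` live in `src/git_worktree.py`, which is not shown in this part of the diff. The sketch below is only one implementation that would satisfy the assertions in `TestSafeAndPath`. The names `safe_name` and `worktree_path`, and the choice of a regex, are assumptions.

```python
import posixpath
import re


def safe_name(branch: str) -> str:
    """Replace every character outside [A-Za-z0-9._-] with an underscore."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", branch)


def worktree_path(worktrees_dir: str, repo: str, branch: str) -> str:
    """Build <worktrees_dir>/<repo>/<safe-branch> (POSIX separators, as in the tests)."""
    return posixpath.join(worktrees_dir, repo, safe_name(branch))


print(safe_name("feature/ET-001-x"))
print(worktree_path("/repos/_wt", "repo", "feature/x"))
```

Mapping the branch to a single path component keeps each worktree one level below the repo directory, even for branch names containing slashes.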

140
tests/test_launcher.py Normal file

@@ -0,0 +1,140 @@
"""Tests for launcher critical functions and reviewer verdict parsing.
Covers the audit-2026-06-02 fixes:
- B-1: _write_task_file writes the task file with a plain open() (no docker),
and raises on write failure instead of failing silently. Since ORCH-2 / S-4 the
target is the per-branch worktree rather than /repos/<repo>/<task_file>.
- S-5: check_reviewer_verdict reads the machine-readable `verdict:` field from
the YAML frontmatter only (no fragile substring matching).
"""
import os
import tempfile
import pytest
# Override env before importing app modules (same convention as test_qg.py)
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator_launcher.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
from src.agents.launcher import AgentLauncher
from src.qg.checks import check_reviewer_verdict
# ---------------------------------------------------------------------------
# B-1: _write_task_file
# ---------------------------------------------------------------------------
class TestWriteTaskFile:
"""B-1 fix preserved + ORCH-2/S-4: task file now lands in the per-branch worktree.
_write_task_file(repo, branch, task_file, content) writes to
<worktrees_dir>/<repo>/<safe-branch>/<task_file> with a plain open() (no docker).
"""
def _wt_dir(self, tmp_path, repo, branch):
from src.git_worktree import _safe
d = tmp_path / "_wt" / repo / _safe(branch)
d.mkdir(parents=True)
return d
def test_writes_to_worktree_path(self, tmp_path, monkeypatch):
"""Task file is written to the worktree path, content matches (B-1 + S-4)."""
monkeypatch.setattr("src.git_worktree.settings.worktrees_dir", str(tmp_path / "_wt"))
wt = self._wt_dir(tmp_path, "enduro-trails", "feature/ET-001-x")
launcher = AgentLauncher()
launcher._write_task_file("enduro-trails", "feature/ET-001-x", ".task-dev.md", "hello-content")
written = wt / ".task-dev.md"
assert written.is_file()
assert written.read_text() == "hello-content"
def test_does_not_use_docker(self, tmp_path, monkeypatch):
"""No subprocess/docker call: if subprocess.run were used it would error here."""
monkeypatch.setattr("src.git_worktree.settings.worktrees_dir", str(tmp_path / "_wt"))
self._wt_dir(tmp_path, "enduro-trails", "main")
called = {"run": False}
def _fail_run(*a, **k):
called["run"] = True
raise AssertionError("subprocess.run must not be called by _write_task_file")
monkeypatch.setattr("src.agents.launcher.subprocess.run", _fail_run)
launcher = AgentLauncher()
launcher._write_task_file("enduro-trails", "main", ".task.md", "x")
assert called["run"] is False
def test_raises_on_write_failure(self, tmp_path, monkeypatch):
"""If the target worktree dir does not exist, raise RuntimeError (no silent fail)."""
monkeypatch.setattr("src.git_worktree.settings.worktrees_dir", str(tmp_path / "_wt"))
# worktree dir intentionally NOT created -> open() raises OSError
launcher = AgentLauncher()
with pytest.raises(RuntimeError):
launcher._write_task_file("nonexistent-repo", "main", ".task.md", "x")
# ---------------------------------------------------------------------------
# S-5: check_reviewer_verdict (frontmatter-only)
# ---------------------------------------------------------------------------
@pytest.fixture
def review_repo(tmp_path, monkeypatch):
monkeypatch.setattr("src.qg.checks.settings.repos_dir", str(tmp_path))
wi_dir = tmp_path / "enduro-trails" / "docs" / "work-items" / "ET-001"
wi_dir.mkdir(parents=True)
return wi_dir
def _write_review(wi_dir, text):
(wi_dir / "12-review.md").write_text(text)
class TestCheckReviewerVerdict:
def test_approved_in_frontmatter(self, review_repo):
_write_review(review_repo, "---\ntype: review\nverdict: APPROVED\n---\n# Review\nbody\n")
passed, reason = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is True
assert "APPROVED" in reason
def test_request_changes_in_frontmatter(self, review_repo):
_write_review(review_repo, "---\ntype: review\nverdict: REQUEST_CHANGES\n---\n# Review\n")
passed, reason = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is False
assert "REQUEST_CHANGES" in reason
def test_lowercase_verdict_normalized(self, review_repo):
_write_review(review_repo, "---\nverdict: approved\n---\nbody\n")
passed, _ = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is True
def test_no_verdict_field_is_not_approved(self, review_repo):
# Frontmatter present but no verdict -> must NOT approve.
_write_review(review_repo, "---\ntype: review\nstatus: done\n---\nbody\n")
passed, reason = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is False
assert "verdict" in reason.lower()
def test_no_frontmatter_is_not_approved(self, review_repo):
# APPROVED appears only in body/table text -> must NOT cause false positive (S-5).
_write_review(review_repo, "# Review\n| Finding | Status |\n|---|---|\n| F-01 | APPROVED |\n")
passed, _ = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is False
def test_request_changes_in_body_does_not_block_approved_frontmatter(self, review_repo):
# Body mentions REQUEST_CHANGES in a table, but frontmatter verdict is APPROVED.
_write_review(
review_repo,
"---\nverdict: APPROVED\n---\n# Review\n"
"| Item | Old verdict |\n|---|---|\n| x | REQUEST_CHANGES |\n",
)
passed, reason = check_reviewer_verdict("enduro-trails", "ET-001")
assert passed is True
assert "APPROVED" in reason
def test_missing_file(self, review_repo):
passed, reason = check_reviewer_verdict("enduro-trails", "ET-999")
assert passed is False
assert "not found" in reason.lower()
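
The S-5 tests above pin down the behaviour of frontmatter-only verdict parsing: only a `verdict:` key between the opening and closing `---` counts, the value is case-normalized, and body text is ignored. A minimal sketch of such a parser follows. `read_frontmatter_verdict` is a hypothetical name and this is not the code in `src/qg/checks.py`, which this diff does not show.

```python
from typing import Optional


def read_frontmatter_verdict(text: str) -> Optional[str]:
    """Return the upper-cased `verdict:` value from YAML-style frontmatter, else None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter at all: body text must never count
    verdict = None
    for line in lines[1:]:
        if line.strip() == "---":
            return verdict  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep and key.strip() == "verdict" and verdict is None:
            verdict = value.strip().upper() or None
    return None  # unterminated frontmatter is treated as absent


print(read_frontmatter_verdict("---\nverdict: approved\n---\nbody\n"))
print(read_frontmatter_verdict("# Review\n| F-01 | APPROVED |\n"))
```

Returning `None` for an unterminated block is a conservative choice made for this sketch, so that a stray `---` on the first line cannot turn body text into a verdict.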

180
tests/test_plane_webhook.py Normal file

@@ -0,0 +1,180 @@
"""ORCH-6: Plane webhook project-filter + repo-resolution tests.
Verifies the core of the 2026-06-02 incident fix:
* webhook from an UNKNOWN Plane project -> {"status": "ignored"} and no task
* webhook from the orchestrator project -> task created with repo=orchestrator
* webhook from the enduro project -> task created with repo=enduro-trails
launcher.launch is mocked so no real agents are spawned. Gitea branch/doc
creation is mocked (network). FastAPI TestClient drives the real endpoint.
This module configures its own registry via monkeypatch + reload_projects so it
is independent of ORCH_PROJECTS_JSON set by other test modules.
"""
import os
import tempfile
import pytest
# Test DB / disable signature checks (same convention as test_webhooks.py).
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator_plane.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ.setdefault("ORCH_PLANE_WEBHOOK_SECRET", "")
os.environ.setdefault("ORCH_GITEA_WEBHOOK_SECRET", "")
os.environ.setdefault("ORCH_GITEA_TOKEN", "test-token")
os.environ.setdefault("ORCH_PLANE_API_TOKEN", "test-token")
from unittest.mock import patch, AsyncMock # noqa: E402
from fastapi.testclient import TestClient # noqa: E402
from src.main import app # noqa: E402
from src.db import init_db, get_db # noqa: E402
from src import projects as P # noqa: E402
from src.projects import reload_projects # noqa: E402
ORCH_PLANE_ID = "8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a"
ENDURO_PLANE_ID = "7a79f0a9-5278-49cd-9007-9a338f238f9c"
UNKNOWN_PLANE_ID = "deadbeef-0000-0000-0000-000000000000"
client = TestClient(app)
@pytest.fixture(autouse=True)
def setup(monkeypatch):
"""Fresh DB + a known two-project registry for each test."""
# settings.db_path is resolved once at import; force it to our isolated DB so
# this suite is independent of whichever test module imported config first.
monkeypatch.setattr(P.settings, "db_path", _test_db)
import src.db as _db
monkeypatch.setattr(_db.settings, "db_path", _test_db)
if os.path.exists(_test_db):
os.unlink(_test_db)
init_db()
# The webhook signature secret may be baked into the runtime env; this suite
# focuses on the project filter, so bypass signature verification.
monkeypatch.setattr("src.webhooks.plane.verify_plane_signature", lambda body, sig: True)
registry_json = (
f'[{{"plane_project_id": "{ENDURO_PLANE_ID}", "repo": "enduro-trails",'
f' "work_item_prefix": "ET", "name": "enduro-trails"}},'
f' {{"plane_project_id": "{ORCH_PLANE_ID}", "repo": "orchestrator",'
f' "work_item_prefix": "ORCH", "name": "orchestrator"}}]'
)
monkeypatch.setattr(P.settings, "projects_json", registry_json)
reload_projects()
yield
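# monkeypatch is torn down after this fixture, so undo it first; otherwise the
# reload below would re-read the patched registry instead of the env.
monkeypatch.undo()
reload_projects() # restore from env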
if os.path.exists(_test_db):
os.unlink(_test_db)
def _post_created(plane_project_id, plane_id="wi-1", name="A valid work item title"):
return client.post(
"/webhook/plane",
json={
"event": "work_item.created",
"data": {
"id": plane_id,
"name": name,
"description_stripped": "This is a sufficiently long description.",
"project": plane_project_id,
},
},
)
# ---------------------------------------------------------------------------
# Filter: unknown project is ignored, no side effects
# ---------------------------------------------------------------------------
@patch("src.webhooks.plane.launcher")
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
def test_unknown_project_ignored(mock_branch, mock_docs, mock_launcher):
resp = _post_created(UNKNOWN_PLANE_ID, plane_id="ignore-me")
assert resp.status_code == 200
assert resp.json()["status"] == "ignored"
assert resp.json().get("reason") == "unknown project"
# No task, no branch, no agent.
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id='ignore-me'").fetchone()
conn.close()
assert task is None
mock_branch.assert_not_called()
mock_launcher.launch.assert_not_called()
# ---------------------------------------------------------------------------
# orchestrator project -> repo=orchestrator, prefix ORCH
# ---------------------------------------------------------------------------
@patch("src.webhooks.plane.launcher")
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
def test_orchestrator_project_routes_to_orchestrator_repo(mock_branch, mock_docs, mock_launcher):
mock_launcher.launch.return_value = 1
resp = _post_created(ORCH_PLANE_ID, plane_id="orch-1")
assert resp.status_code == 200
assert resp.json()["status"] == "accepted"
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id='orch-1'").fetchone()
conn.close()
assert task is not None
assert task["repo"] == "orchestrator"
assert task["work_item_id"].startswith("ORCH-")
assert task["stage"] == "analysis"
# Branch created against the orchestrator repo.
args = mock_branch.call_args.args
assert args[0] == "orchestrator"
# ---------------------------------------------------------------------------
# enduro project -> repo=enduro-trails, prefix ET
# ---------------------------------------------------------------------------
@patch("src.webhooks.plane.launcher")
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
def test_enduro_project_routes_to_enduro_repo(mock_branch, mock_docs, mock_launcher):
mock_launcher.launch.return_value = 1
resp = _post_created(ENDURO_PLANE_ID, plane_id="et-1")
assert resp.status_code == 200
assert resp.json()["status"] == "accepted"
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id='et-1'").fetchone()
conn.close()
assert task is not None
assert task["repo"] == "enduro-trails"
assert task["work_item_id"].startswith("ET-")
args = mock_branch.call_args.args
assert args[0] == "enduro-trails"
# ---------------------------------------------------------------------------
# prefixes are independent per repo (ORCH-001 vs ET-001 in parallel)
# ---------------------------------------------------------------------------
@patch("src.webhooks.plane.launcher")
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
def test_prefixes_independent_per_project(mock_branch, mock_docs, mock_launcher):
mock_launcher.launch.return_value = 1
_post_created(ORCH_PLANE_ID, plane_id="o1", name="Orchestrator item one")
_post_created(ENDURO_PLANE_ID, plane_id="e1", name="Enduro item one")
_post_created(ORCH_PLANE_ID, plane_id="o2", name="Orchestrator item two")
conn = get_db()
rows = {r["plane_id"]: r["work_item_id"] for r in
conn.execute("SELECT plane_id, work_item_id FROM tasks").fetchall()}
conn.close()
assert rows["o1"] == "ORCH-001"
assert rows["o2"] == "ORCH-002"
assert rows["e1"] == "ET-001"
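
The last test expects work item numbering to be independent per prefix (`ORCH-001`, `ORCH-002`, `ET-001`). The allocation code itself is not in this part of the diff. The sketch below shows one way to get that behaviour, assuming ids have the form `<PREFIX>-<zero-padded number>`. `next_work_item_id` is a hypothetical helper.

```python
import re
from typing import Iterable


def next_work_item_id(existing_ids: Iterable[str], prefix: str) -> str:
    """Return the next id for `prefix`, ignoring ids that belong to other prefixes."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)$")
    highest = 0
    for work_item_id in existing_ids:
        match = pattern.match(work_item_id)
        if match:
            highest = max(highest, int(match.group(1)))
    # Three-digit zero padding, matching the ids asserted in the test above.
    return f"{prefix}-{highest + 1:03d}"


existing = ["ORCH-001", "ET-001"]
print(next_work_item_id(existing, "ORCH"))
print(next_work_item_id([], "ET"))
```

A real implementation would also have to make the read-then-insert step atomic in the database. The sketch covers only the numbering rule.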

177
tests/test_projects.py Normal file

@@ -0,0 +1,177 @@
"""ORCH-6: tests for the project registry (src/projects.py).
Covers resolvers (by plane_id, by repo, unknown -> None, known ids) against the
built-in default registry, plus ORCH_PROJECTS_JSON parsing (valid + malformed
-> default fallback).
The pure parser ``_parse_projects_json`` is tested directly so we don't mutate
the module-global registry. Resolver tests run against the default registry; if
another test (e.g. test_webhooks) set ORCH_PROJECTS_JSON in the env, we restore
the default via monkeypatch + reload_projects to keep this file order-independent.
"""
import pytest
from src import projects as P
from src.projects import (
ProjectConfig,
get_project_by_plane_id,
get_project_by_repo,
known_plane_project_ids,
reload_projects,
_parse_projects_json,
_DEFAULT_PROJECTS,
)
# Known ids from the default registry / task spec.
ENDURO_PLANE_ID = "7a79f0a9-5278-49cd-9007-9a338f238f9c"
ORCH_PLANE_ID = "8da6aa25-a60e-44d6-a1e2-d8ae59aa7d6a"
@pytest.fixture
def default_registry(monkeypatch):
"""Force the default (built-in) registry regardless of ORCH_PROJECTS_JSON
that other test modules may have set in the process env."""
monkeypatch.setattr(P.settings, "projects_json", "")
reload_projects()
yield
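# monkeypatch is still active here (it is torn down after this fixture), so undo
# it before reloading; only then does the registry reflect what the env says.
monkeypatch.undo()
reload_projects()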
# ---------------------------------------------------------------------------
# Resolvers
# ---------------------------------------------------------------------------
def test_get_project_by_plane_id_orchestrator(default_registry):
proj = get_project_by_plane_id(ORCH_PLANE_ID)
assert proj is not None
assert proj.repo == "orchestrator"
assert proj.work_item_prefix == "ORCH"
assert proj.plane_project_id == ORCH_PLANE_ID
def test_get_project_by_plane_id_enduro(default_registry):
proj = get_project_by_plane_id(ENDURO_PLANE_ID)
assert proj is not None
assert proj.repo == "enduro-trails"
assert proj.work_item_prefix == "ET"
def test_get_project_by_plane_id_unknown_returns_none(default_registry):
assert get_project_by_plane_id("00000000-0000-0000-0000-000000000000") is None
def test_get_project_by_plane_id_empty_returns_none(default_registry):
assert get_project_by_plane_id("") is None
assert get_project_by_plane_id(None) is None
def test_get_project_by_repo(default_registry):
assert get_project_by_repo("enduro-trails").work_item_prefix == "ET"
assert get_project_by_repo("orchestrator").work_item_prefix == "ORCH"
def test_get_project_by_repo_unknown_returns_none(default_registry):
assert get_project_by_repo("does-not-exist") is None
assert get_project_by_repo("") is None
assert get_project_by_repo(None) is None
def test_known_plane_project_ids(default_registry):
ids = known_plane_project_ids()
assert isinstance(ids, set)
assert ENDURO_PLANE_ID in ids
assert ORCH_PLANE_ID in ids
assert len(ids) == len(_DEFAULT_PROJECTS)
# ---------------------------------------------------------------------------
# ORCH_PROJECTS_JSON parsing (pure function, no global mutation)
# ---------------------------------------------------------------------------
def test_parse_empty_returns_none():
assert _parse_projects_json("") is None
assert _parse_projects_json(" ") is None
assert _parse_projects_json(None) is None
def test_parse_valid_json():
raw = (
'[{"plane_project_id": "p-1", "repo": "repo-a", '
'"work_item_prefix": "AAA", "name": "Alpha"}]'
)
parsed = _parse_projects_json(raw)
assert parsed is not None
assert len(parsed) == 1
assert isinstance(parsed[0], ProjectConfig)
assert parsed[0].plane_project_id == "p-1"
assert parsed[0].repo == "repo-a"
assert parsed[0].work_item_prefix == "AAA"
assert parsed[0].name == "Alpha"
def test_parse_valid_json_multiple():
raw = (
'[{"plane_project_id": "p-1", "repo": "repo-a", "work_item_prefix": "A"},'
' {"plane_project_id": "p-2", "repo": "repo-b", "work_item_prefix": "B"}]'
)
parsed = _parse_projects_json(raw)
assert len(parsed) == 2
# name defaults to repo when omitted
assert parsed[0].name == "repo-a"
assert parsed[1].repo == "repo-b"
def test_parse_malformed_json_returns_none():
assert _parse_projects_json("{not valid json") is None
assert _parse_projects_json("[}") is None
def test_parse_not_an_array_returns_none():
# A JSON object (not array) is invalid -> fallback.
assert _parse_projects_json('{"plane_project_id": "p-1"}') is None
def test_parse_skips_bad_entries_keeps_good():
raw = (
'[{"repo": "missing-id"},' # missing required key -> skipped
' {"plane_project_id": "p-2", "repo": "repo-b", "work_item_prefix": "B"}]'
)
parsed = _parse_projects_json(raw)
assert parsed is not None
assert len(parsed) == 1
assert parsed[0].plane_project_id == "p-2"
def test_parse_all_bad_entries_returns_none():
# No valid entries -> None (fallback to default).
assert _parse_projects_json('[{"repo": "no-id"}, "not-an-object"]') is None
def test_reload_from_custom_json(monkeypatch):
"""End-to-end: set settings.projects_json, reload, resolvers reflect it."""
custom = (
'[{"plane_project_id": "custom-uuid", "repo": "custom-repo", '
'"work_item_prefix": "CUS", "name": "Custom"}]'
)
monkeypatch.setattr(P.settings, "projects_json", custom)
reload_projects()
try:
assert get_project_by_plane_id("custom-uuid").repo == "custom-repo"
assert get_project_by_repo("custom-repo").work_item_prefix == "CUS"
assert known_plane_project_ids() == {"custom-uuid"}
# The built-in defaults must NOT be present when JSON overrides.
assert get_project_by_plane_id(ENDURO_PLANE_ID) is None
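finally:
# Undo the patched projects_json first; a bare reload would keep the custom registry.
monkeypatch.undo()
reload_projects()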
def test_reload_invalid_json_falls_back_to_default(monkeypatch):
monkeypatch.setattr(P.settings, "projects_json", "{garbage")
reload_projects()
try:
assert get_project_by_plane_id(ENDURO_PLANE_ID) is not None
assert get_project_by_plane_id(ORCH_PLANE_ID) is not None
finally:
reload_projects()
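
The parser tests above describe a tolerant contract: empty or malformed input yields `None` (so the caller falls back to the default registry), bad entries are skipped, and `name` defaults to `repo`. A minimal sketch of a parser with that contract follows. `ProjectEntry`, `parse_projects` and the exact set of required keys are assumptions, since `src/projects.py` is not part of this excerpt.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: these three keys are mandatory; the tests only prove plane_project_id is.
REQUIRED_KEYS = ("plane_project_id", "repo", "work_item_prefix")


@dataclass(frozen=True)
class ProjectEntry:
    plane_project_id: str
    repo: str
    work_item_prefix: str
    name: str


def parse_projects(raw: Optional[str]) -> Optional[list[ProjectEntry]]:
    """Parse a JSON array of project objects; return None when nothing usable is found."""
    if not raw or not raw.strip():
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    entries = []
    for item in data:
        # Skip non-objects and objects missing a required key.
        if not isinstance(item, dict) or any(not item.get(k) for k in REQUIRED_KEYS):
            continue
        entries.append(ProjectEntry(
            plane_project_id=item["plane_project_id"],
            repo=item["repo"],
            work_item_prefix=item["work_item_prefix"],
            name=item.get("name") or item["repo"],
        ))
    return entries or None


print(parse_projects('[{"plane_project_id": "p-1", "repo": "repo-a", "work_item_prefix": "A"}]'))
```

Returning `None` rather than an empty list lets the caller treat "no usable configuration" as a single case.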

188
tests/test_qg.py Normal file

@@ -0,0 +1,188 @@
import pytest
import os
import tempfile
from unittest.mock import patch, MagicMock
import httpx
# Override DB path before importing app
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
from src.qg.checks import (
check_analysis_complete,
check_architecture_done,
check_ci_green,
check_review_approved,
check_tests_passed,
)
@pytest.fixture(autouse=True)
def setup_work_item_dir(tmp_path, monkeypatch):
"""Create temp repo structure for filesystem checks."""
monkeypatch.setattr("src.qg.checks.settings.repos_dir", str(tmp_path))
repo_dir = tmp_path / "enduro-trails"
repo_dir.mkdir()
return repo_dir
class TestCheckAnalysisComplete:
def test_all_files_present(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
wi_dir = repo_dir / "docs" / "work-items" / "ET-001"
wi_dir.mkdir(parents=True)
(wi_dir / "01-brd.md").write_text("# BRD")
(wi_dir / "02-trz.md").write_text("# TRZ")
(wi_dir / "03-acceptance-criteria.md").write_text("# AC")
(wi_dir / "04-test-plan.yaml").write_text("tests: []")
passed, reason = check_analysis_complete("enduro-trails", "ET-001")
assert passed is True
def test_missing_files(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
wi_dir = repo_dir / "docs" / "work-items" / "ET-002"
wi_dir.mkdir(parents=True)
(wi_dir / "01-brd.md").write_text("# BRD")
passed, reason = check_analysis_complete("enduro-trails", "ET-002")
assert passed is False
assert "Missing files" in reason
def test_no_directory(self, setup_work_item_dir):
passed, reason = check_analysis_complete("enduro-trails", "ET-999")
assert passed is False
class TestCheckArchitectureDone:
def test_adr_directory_with_files(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
adr_dir = repo_dir / "docs" / "work-items" / "ET-001" / "06-adr"
adr_dir.mkdir(parents=True)
(adr_dir / "001-use-postgres.md").write_text("# ADR")
passed, reason = check_architecture_done("enduro-trails", "ET-001")
assert passed is True
def test_infra_requirements(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
wi_dir = repo_dir / "docs" / "work-items" / "ET-001"
wi_dir.mkdir(parents=True)
(wi_dir / "07-infra-requirements.md").write_text("# Infra")
passed, reason = check_architecture_done("enduro-trails", "ET-001")
assert passed is True
def test_empty_adr_directory(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
adr_dir = repo_dir / "docs" / "work-items" / "ET-001" / "06-adr"
adr_dir.mkdir(parents=True)
passed, reason = check_architecture_done("enduro-trails", "ET-001")
assert passed is False
def test_nothing_present(self, setup_work_item_dir):
passed, reason = check_architecture_done("enduro-trails", "ET-001")
assert passed is False
class TestCheckCIGreen:
@patch("src.qg.checks.httpx.get")
def test_ci_success(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = {"state": "success"}
mock_resp.raise_for_status = MagicMock()
mock_get.return_value = mock_resp
passed, reason = check_ci_green("enduro-trails", "feature/ET-001-test")
assert passed is True
assert "green" in reason.lower()
@patch("src.qg.checks.httpx.get")
def test_ci_pending(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = {"state": "pending"}
mock_resp.raise_for_status = MagicMock()
mock_get.return_value = mock_resp
passed, reason = check_ci_green("enduro-trails", "feature/ET-001-test")
assert passed is False
@patch("src.qg.checks.httpx.get")
def test_ci_branch_not_found(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 404
mock_get.return_value = mock_resp
passed, reason = check_ci_green("enduro-trails", "nonexistent")
assert passed is False
class TestCheckReviewApproved:
@patch("src.qg.checks.httpx.get")
def test_approved(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = [
{"state": "APPROVED", "user": {"login": "reviewer1"}}
]
mock_resp.raise_for_status = MagicMock()
mock_get.return_value = mock_resp
passed, reason = check_review_approved("enduro-trails", 1)
assert passed is True
@patch("src.qg.checks.httpx.get")
def test_changes_requested(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = [
{"state": "REQUEST_CHANGES", "user": {"login": "reviewer1"}}
]
mock_resp.raise_for_status = MagicMock()
mock_get.return_value = mock_resp
passed, reason = check_review_approved("enduro-trails", 1)
assert passed is False
assert "Changes requested" in reason
@patch("src.qg.checks.httpx.get")
def test_no_reviews(self, mock_get):
mock_resp = MagicMock()
mock_resp.status_code = 200
mock_resp.json.return_value = []
mock_resp.raise_for_status = MagicMock()
mock_get.return_value = mock_resp
passed, reason = check_review_approved("enduro-trails", 1)
assert passed is False
class TestCheckTestsPassed:
def test_report_with_pass(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
wi_dir = repo_dir / "docs" / "work-items" / "ET-001"
wi_dir.mkdir(parents=True)
(wi_dir / "13-test-report.md").write_text("# Test Report\n\nResult: PASS\n")
passed, reason = check_tests_passed("enduro-trails", "ET-001")
assert passed is True
def test_report_without_pass(self, setup_work_item_dir):
repo_dir = setup_work_item_dir
wi_dir = repo_dir / "docs" / "work-items" / "ET-001"
wi_dir.mkdir(parents=True)
(wi_dir / "13-test-report.md").write_text("# Test Report\n\nResult: FAIL\n")
passed, reason = check_tests_passed("enduro-trails", "ET-001")
assert passed is False
def test_no_report(self, setup_work_item_dir):
passed, reason = check_tests_passed("enduro-trails", "ET-001")
assert passed is False
assert "not found" in reason.lower()
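
Review note: the filesystem checks exercised above all return a `(passed, reason)` tuple. The sketch below shows one way `check_analysis_complete` could satisfy these tests. It is an illustration only: the real function in `src/qg/checks.py` is not part of this diff, it takes `(repo, work_item_id)` and reads `settings.repos_dir`, whereas this version takes `repos_dir` explicitly, and the reason strings other than "Missing files" are assumptions.

```python
from pathlib import Path

# File names taken from test_all_files_present above.
REQUIRED_FILES = (
    "01-brd.md",
    "02-trz.md",
    "03-acceptance-criteria.md",
    "04-test-plan.yaml",
)


def check_analysis_complete(repos_dir, repo, work_item_id):
    """Return (passed, reason) for the analysis quality gate (hypothetical sketch)."""
    wi_dir = Path(repos_dir) / repo / "docs" / "work-items" / work_item_id
    if not wi_dir.is_dir():
        # Assumed wording; the tests only require passed to be False here.
        return False, f"Work item directory not found: {wi_dir}"
    missing = [name for name in REQUIRED_FILES if not (wi_dir / name).is_file()]
    if missing:
        return False, "Missing files: " + ", ".join(missing)
    return True, "All analysis files present"
```

Returning the reason alongside the boolean lets callers surface why a gate blocked a stage transition without raising.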

304
tests/test_queue.py Normal file

@@ -0,0 +1,304 @@
"""Tests for ORCH-1 (F-2b) persistent job queue.
Covers:
- enqueue_job -> claim_next_job -> mark_job lifecycle
- claim_next_job atomicity (no double-dispatch of the same job)
- retry: fail -> requeue while attempts < max_attempts, then failed
- requeue_running_jobs (queue-recovery)
- count_running_jobs / job_status_counts / recent_jobs
- QueueWorker respects max_concurrency (Popen / launch fully mocked)
The real claude/Popen is NEVER spawned: launcher.launch_job is mocked in worker
tests, and the launcher finalize logic is exercised directly via mark_job.
"""
import os
import tempfile
import pytest
# Override env before importing app modules (same convention as test_qg.py).
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator_queue.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
import src.db as db
from src.db import (
init_db,
enqueue_job,
claim_next_job,
mark_job,
count_running_jobs,
requeue_running_jobs,
get_job,
job_status_counts,
recent_jobs,
)
@pytest.fixture(autouse=True)
def fresh_db(tmp_path, monkeypatch):
"""Point the DB at a fresh per-test sqlite file and init the schema."""
dbfile = tmp_path / "queue.db"
monkeypatch.setattr(db.settings, "db_path", str(dbfile))
init_db()
yield
# ---------------------------------------------------------------------------
# enqueue / claim / mark lifecycle
# ---------------------------------------------------------------------------
class TestLifecycle:
def test_enqueue_creates_queued_job(self):
jid = enqueue_job("analyst", "enduro-trails", "task body", task_id=7)
job = get_job(jid)
assert job["status"] == "queued"
assert job["agent"] == "analyst"
assert job["repo"] == "enduro-trails"
assert job["task_content"] == "task body"
assert job["task_id"] == 7
assert job["attempts"] == 0
assert job["max_attempts"] == 2
def test_claim_marks_running_and_increments_attempts(self):
jid = enqueue_job("developer", "repo")
claimed = claim_next_job()
assert claimed is not None
assert claimed["id"] == jid
assert claimed["status"] == "running"
assert claimed["attempts"] == 1
assert count_running_jobs() == 1
def test_claim_empty_queue_returns_none(self):
assert claim_next_job() is None
def test_claim_is_fifo(self):
a = enqueue_job("analyst", "r")
b = enqueue_job("developer", "r")
assert claim_next_job()["id"] == a
assert claim_next_job()["id"] == b
def test_mark_done(self):
jid = enqueue_job("tester", "r")
claim_next_job()
mark_job(jid, "done", run_id=42)
job = get_job(jid)
assert job["status"] == "done"
assert job["run_id"] == 42
assert job["finished_at"] is not None
assert count_running_jobs() == 0
def test_mark_failed_records_error(self):
jid = enqueue_job("tester", "r")
claim_next_job()
mark_job(jid, "failed", run_id=9, error="boom")
job = get_job(jid)
assert job["status"] == "failed"
assert job["error"] == "boom"
assert job["finished_at"] is not None
# ---------------------------------------------------------------------------
# claim atomicity — no double dispatch
# ---------------------------------------------------------------------------
class TestClaimAtomicity:
def test_single_job_claimed_once(self):
jid = enqueue_job("analyst", "r")
first = claim_next_job()
second = claim_next_job()
assert first["id"] == jid
assert second is None # already running, not re-dispatched
def test_concurrent_claims_no_duplicate(self):
"""Many enqueued jobs claimed from parallel threads -> each claimed once."""
import threading
n = 20
for _ in range(n):
enqueue_job("developer", "r")
claimed_ids = []
lock = threading.Lock()
def grab():
while True:
job = claim_next_job()
if job is None:
return
with lock:
claimed_ids.append(job["id"])
threads = [threading.Thread(target=grab) for _ in range(8)]
for t in threads:
t.start()
for t in threads:
t.join()
assert len(claimed_ids) == n
assert len(set(claimed_ids)) == n # no id claimed twice
assert count_running_jobs() == n
# ---------------------------------------------------------------------------
# retry semantics (mirrors launcher._finalize_job logic)
# ---------------------------------------------------------------------------
class TestRetry:
def test_fail_requeues_while_under_max(self):
jid = enqueue_job("developer", "r", max_attempts=2)
job = claim_next_job() # attempts=1
assert job["attempts"] == 1
# attempts(1) < max(2) -> requeue
mark_job(jid, "queued", error="exit 1")
j = get_job(jid)
assert j["status"] == "queued"
assert j["error"] == "exit 1"
assert j["started_at"] is None # requeue clears started_at
def test_fail_fails_when_max_reached(self):
jid = enqueue_job("developer", "r", max_attempts=2)
claim_next_job() # attempts=1 -> requeue
mark_job(jid, "queued")
job2 = claim_next_job() # attempts=2
assert job2["attempts"] == 2
# attempts(2) >= max(2) -> failed
mark_job(jid, "failed", error="exit 1")
assert get_job(jid)["status"] == "failed"
def test_finalize_job_done(self):
"""launcher._finalize_job marks done on exit_code 0 (no Popen needed)."""
from src.agents.launcher import AgentLauncher
jid = enqueue_job("analyst", "r")
claim_next_job()
AgentLauncher()._finalize_job(jid, "analyst", run_id=5, exit_code=0)
assert get_job(jid)["status"] == "done"
def test_finalize_job_requeue_then_fail(self, monkeypatch):
from src.agents.launcher import AgentLauncher
# Silence telegram side-effect.
monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
lr = AgentLauncher()
jid = enqueue_job("developer", "r", max_attempts=2)
claim_next_job() # attempts=1
lr._finalize_job(jid, "developer", run_id=1, exit_code=2)
assert get_job(jid)["status"] == "queued" # 1 < 2 -> requeue
claim_next_job() # attempts=2
lr._finalize_job(jid, "developer", run_id=2, exit_code=2)
assert get_job(jid)["status"] == "failed" # 2 >= 2 -> failed
# ---------------------------------------------------------------------------
# queue-recovery
# ---------------------------------------------------------------------------
class TestRequeueRunning:
def test_requeue_running_jobs(self):
a = enqueue_job("analyst", "r")
b = enqueue_job("developer", "r")
claim_next_job() # a -> running
claim_next_job() # b -> running
assert count_running_jobs() == 2
n = requeue_running_jobs()
assert n == 2
assert count_running_jobs() == 0
assert get_job(a)["status"] == "queued"
assert get_job(b)["status"] == "queued"
def test_requeue_preserves_attempts(self):
jid = enqueue_job("analyst", "r")
claim_next_job() # attempts=1
requeue_running_jobs()
assert get_job(jid)["attempts"] == 1 # not reset
# ---------------------------------------------------------------------------
# observability helpers
# ---------------------------------------------------------------------------
class TestObservability:
def test_status_counts(self):
enqueue_job("analyst", "r") # first enqueued -> claimed first (FIFO) -> running
enqueue_job("developer", "r") # stays queued
claim_next_job()
counts = job_status_counts()
assert counts["running"] == 1
assert counts["queued"] == 1
assert counts["done"] == 0
assert counts["failed"] == 0
def test_recent_jobs_desc(self):
ids = [enqueue_job("analyst", "r") for _ in range(3)]
recent = recent_jobs(10)
assert [r["id"] for r in recent] == sorted(ids, reverse=True)
# ---------------------------------------------------------------------------
# QueueWorker max_concurrency (launch_job fully mocked — no real Popen)
# ---------------------------------------------------------------------------
class TestWorkerConcurrency:
@pytest.fixture(autouse=True)
def _ok_preflight(self, monkeypatch):
# ORCH-1 resilience: the worker gates claims behind preflight; in tests there
# is no claude binary, so stub preflight OK to exercise pure queue/concurrency.
monkeypatch.setattr("src.queue_worker.preflight.check", lambda *a, **k: (True, "ok"))
def test_worker_respects_max_concurrency(self, monkeypatch):
from src.queue_worker import QueueWorker
launched = []
def fake_launch_job(job):
# Simulate a long-running agent: the job stays 'running' (we do NOT
# mark it done), so the slot remains occupied.
launched.append(job["id"])
return 100 + job["id"]
monkeypatch.setattr("src.queue_worker.launcher.launch_job", fake_launch_job)
for _ in range(5):
enqueue_job("developer", "r")
w = QueueWorker(max_concurrency=2, poll_interval=0.01)
w._drain_once()
# Only max_concurrency jobs may be launched / running at once.
assert len(launched) == 2
assert count_running_jobs() == 2
def test_worker_drains_as_slots_free(self, monkeypatch):
from src.queue_worker import QueueWorker
def fake_launch_job(job):
# Immediately complete the job so the slot frees for the next claim.
mark_job(job["id"], "done", run_id=job["id"])
return job["id"]
monkeypatch.setattr("src.queue_worker.launcher.launch_job", fake_launch_job)
for _ in range(4):
enqueue_job("analyst", "r")
w = QueueWorker(max_concurrency=1, poll_interval=0.01)
w._drain_once()
# With instant completion and concurrency 1, one drain pass empties the queue.
assert job_status_counts()["done"] == 4
assert count_running_jobs() == 0
def test_worker_launch_failure_does_not_wedge_slot(self, monkeypatch):
from src.queue_worker import QueueWorker
def boom(job):
raise RuntimeError("repo missing")
monkeypatch.setattr("src.queue_worker.launcher.launch_job", boom)
monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
enqueue_job("developer", "r", max_attempts=1)
w = QueueWorker(max_concurrency=1, poll_interval=0.01)
w._drain_once()
# attempts=1 >= max_attempts=1 -> failed, not stuck running.
assert count_running_jobs() == 0
counts = job_status_counts()
assert counts["failed"] == 1
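
Review note: `TestClaimAtomicity` relies on `claim_next_job` never handing the same job to two callers. The actual `src/db.py` implementation is not shown in this section, so the following is only a sketch of one common way to get that guarantee with SQLite: take the write lock with `BEGIN IMMEDIATE` before selecting, then flip the row to `running` in the same transaction. The table layout and the helper names `open_queue`, `enqueue`, and `claim_next` are hypothetical.

```python
import sqlite3

# Hypothetical minimal schema; the real jobs table has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT NOT NULL DEFAULT 'queued',
    attempts INTEGER NOT NULL DEFAULT 0
)
"""


def open_queue(path=":memory:"):
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN IMMEDIATE / COMMIT statements below control the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn):
    """Insert a queued job and return its id."""
    return conn.execute("INSERT INTO jobs DEFAULT VALUES").lastrowid


def claim_next(conn):
    """Claim the oldest queued job; return its id, or None if nothing is queued."""
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before reading
    try:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            conn.execute("COMMIT")
            return None
        conn.execute(
            "UPDATE jobs SET status = 'running', attempts = attempts + 1 "
            "WHERE id = ?",
            (row[0],),
        )
        conn.execute("COMMIT")
        return row[0]
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Because the select and the update happen under one write lock, a second claimer blocks until the first commits and then no longer sees the row as `queued`.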

295
tests/test_resilience.py Normal file

@@ -0,0 +1,295 @@
"""ORCH-1 resilience tests: preflight, 429-classifier, backoff, circuit breaker.
No real claude/Popen is ever spawned: preflight subprocess and launcher.launch_job
are mocked. DB is a fresh per-test sqlite file.
"""
import os
import tempfile
import pytest
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator_resilience.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
import src.db as db
from src.db import (
init_db, enqueue_job, claim_next_job, get_job, count_running_jobs,
mark_job_transient,
)
from src import preflight, error_classifier
from src.error_classifier import classify_text, parse_retry_after, classify_log_file
from src.queue_worker import QueueWorker, CircuitBreaker
from src.agents.launcher import AgentLauncher
@pytest.fixture(autouse=True)
def fresh_db(tmp_path, monkeypatch):
monkeypatch.setattr(db.settings, "db_path", str(tmp_path / "res.db"))
init_db()
preflight.reset_cache()
yield
# ---------------------------------------------------------------------------
# A. Preflight
# ---------------------------------------------------------------------------
class TestPreflight:
def test_fail_when_bin_missing(self, monkeypatch):
monkeypatch.setattr(preflight, "_claude_bin", lambda: "/no/such/claude")
ok, reason = preflight.check(force=True)
assert ok is False
assert "not found" in reason.lower()
def test_ok_when_version_succeeds(self, monkeypatch, tmp_path):
fake_bin = tmp_path / "claude"
fake_bin.write_text("#!/bin/sh\necho v1\n")
monkeypatch.setattr(preflight, "_claude_bin", lambda: str(fake_bin))
monkeypatch.setattr(preflight, "_run_version", lambda b: (True, "1.2.3"))
ok, reason = preflight.check(force=True)
assert ok is True
def test_cache_does_not_recheck_within_ttl(self, monkeypatch, tmp_path):
fake_bin = tmp_path / "claude"
fake_bin.write_text("x")
monkeypatch.setattr(preflight, "_claude_bin", lambda: str(fake_bin))
monkeypatch.setattr(db.settings, "preflight_cache_ttl", 999)
calls = {"n": 0}
def counting_version(b):
calls["n"] += 1
return True, "ok"
monkeypatch.setattr(preflight, "_run_version", counting_version)
preflight.reset_cache()
preflight.check() # first -> runs version
preflight.check() # cached -> no extra version call
preflight.check()
assert calls["n"] == 1
def test_force_bypasses_cache(self, monkeypatch, tmp_path):
fake_bin = tmp_path / "claude"
fake_bin.write_text("x")
monkeypatch.setattr(preflight, "_claude_bin", lambda: str(fake_bin))
calls = {"n": 0}
monkeypatch.setattr(preflight, "_run_version",
lambda b: (calls.__setitem__("n", calls["n"] + 1), (True, "ok"))[1])
preflight.reset_cache()
preflight.check()
preflight.check(force=True)
assert calls["n"] == 2
def test_worker_does_not_claim_when_preflight_fails(self, monkeypatch):
# Preflight FAIL -> job stays queued, launch_job never called.
monkeypatch.setattr("src.queue_worker.preflight.check",
lambda *a, **k: (False, "down"))
called = {"launch": False}
monkeypatch.setattr("src.queue_worker.launcher.launch_job",
lambda job: called.__setitem__("launch", True))
jid = enqueue_job("analyst", "r")
QueueWorker(max_concurrency=1, poll_interval=0.01)._drain_once()
assert called["launch"] is False
assert get_job(jid)["status"] == "queued"
assert count_running_jobs() == 0
# ---------------------------------------------------------------------------
# B. Error classifier
# ---------------------------------------------------------------------------
class TestClassifier:
@pytest.mark.parametrize("text", [
"Error: 429 Too Many Requests",
"anthropic rate limit exceeded",
"overloaded_error: server is overloaded",
"API quota exhausted",
"503 Service Unavailable",
"connection reset by peer",
])
def test_transient_patterns(self, text):
assert classify_text(text) == "transient"
@pytest.mark.parametrize("text", [
"Traceback: KeyError 'foo'",
"SyntaxError: invalid syntax",
"assertion failed in test",
"",
])
def test_permanent_patterns(self, text):
assert classify_text(text) == "permanent"
def test_retry_after_header(self):
assert parse_retry_after("HTTP/1.1 429\nRetry-After: 42\n") == 42
def test_retry_after_json(self):
assert parse_retry_after('{"error":{"type":"rate_limit","retry_after": 7}}') == 7
def test_retry_after_absent(self):
assert parse_retry_after("just an error") is None
def test_classify_log_file(self, tmp_path):
p = tmp_path / "run.log"
p.write_text("...lots of output...\n429 rate limit. Retry-After: 30\n")
kind, ra = classify_log_file(str(p))
assert kind == "transient"
assert ra == 30
def test_classify_missing_file_is_permanent(self):
kind, ra = classify_log_file("/no/such/log")
assert kind == "permanent"
assert ra is None
# ---------------------------------------------------------------------------
# C. Backoff + available_at gating
# ---------------------------------------------------------------------------
class TestBackoff:
def test_backoff_grows_exponentially(self):
lr = AgentLauncher()
# base=10, cap=600 (defaults)
b1 = lr._backoff_seconds(1)
b2 = lr._backoff_seconds(2)
b3 = lr._backoff_seconds(3)
assert b1 == 20 # 2^1*10
assert b2 == 40 # 2^2*10
assert b3 == 80 # 2^3*10
assert b2 > b1 and b3 > b2
def test_backoff_capped(self):
lr = AgentLauncher()
assert lr._backoff_seconds(20) == 600 # capped at backoff_max_seconds
def test_retry_after_respected_when_larger(self):
lr = AgentLauncher()
# transient_attempts=1 -> base backoff 20; Retry-After=120 wins.
assert lr._backoff_seconds(1, retry_after=120) == 120
def test_retry_after_ignored_when_smaller(self):
lr = AgentLauncher()
assert lr._backoff_seconds(3, retry_after=5) == 80 # backoff bigger
def test_transient_requeue_sets_future_available_at_and_claim_skips(self):
jid = enqueue_job("developer", "r")
claim_next_job()
# Big backoff -> available_at far in the future.
mark_job_transient(jid, 3600, error="429")
job = get_job(jid)
assert job["status"] == "queued"
assert job["transient_attempts"] == 1
assert job["available_at"] is not None
# claim must NOT pick it up while available_at is in the future.
assert claim_next_job() is None
def test_transient_requeue_claimable_when_due(self):
jid = enqueue_job("developer", "r")
claim_next_job()
mark_job_transient(jid, -5, error="429") # available_at in the past
c = claim_next_job()
assert c is not None and c["id"] == jid
# ---------------------------------------------------------------------------
# D. Launcher transient/permanent finalize (no Popen)
# ---------------------------------------------------------------------------
class TestFinalizeClassified:
def test_transient_failure_backoff_requeue(self, tmp_path, monkeypatch):
monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
log = tmp_path / "1.log"
log.write_text("Error 429 rate limit exceeded\n")
jid = enqueue_job("developer", "r", max_attempts=2)
claim_next_job()
AgentLauncher()._finalize_job(jid, "developer", run_id=1, exit_code=1,
output_path=str(log))
job = get_job(jid)
assert job["status"] == "queued"
assert job["transient_attempts"] == 1
assert job["available_at"] is not None # backoff-gated
assert job["attempts"] == 1 # code-fault budget NOT burned
def test_permanent_failure_uses_normal_attempts(self, tmp_path, monkeypatch):
monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
log = tmp_path / "2.log"
log.write_text("Traceback: ValueError\n")
jid = enqueue_job("developer", "r", max_attempts=2)
claim_next_job()
AgentLauncher()._finalize_job(jid, "developer", run_id=2, exit_code=1,
output_path=str(log))
job = get_job(jid)
assert job["status"] == "queued"
assert job["transient_attempts"] == 0 # not transient
assert job["available_at"] is None # no backoff for code-fault
def test_transient_exhausts_to_failed(self, tmp_path, monkeypatch):
monkeypatch.setattr("src.notifications.send_telegram", lambda *a, **k: None)
monkeypatch.setattr(db.settings, "transient_max_attempts", 2)
log = tmp_path / "3.log"
log.write_text("overloaded_error\n")
lr = AgentLauncher()
jid = enqueue_job("developer", "r")
claim_next_job()
lr._finalize_job(jid, "developer", 1, exit_code=1, output_path=str(log))
assert get_job(jid)["status"] == "queued" # transient 1 -> requeue
# force claimable and retry
mark_job_transient(jid, -1) # makes it due; transient=2 now
claim_next_job()
lr._finalize_job(jid, "developer", 2, exit_code=1, output_path=str(log))
assert get_job(jid)["status"] == "failed" # transient budget exhausted
# ---------------------------------------------------------------------------
# E. Circuit breaker
# ---------------------------------------------------------------------------
class TestCircuitBreaker:
def test_opens_after_threshold(self):
cb = CircuitBreaker(threshold=3, pause_seconds=300)
assert cb.allow_claim() is True
cb.record_transient()
cb.record_transient()
assert cb.state == "closed"
cb.record_transient() # 3rd -> open
assert cb.state == "open"
assert cb.allow_claim() is False # paused, no CLI calls
def test_recovered_resets_streak(self):
cb = CircuitBreaker(threshold=3)
cb.record_transient()
cb.record_transient()
cb.record_recovered()
assert cb.consecutive_transient == 0
assert cb.state == "closed"
def test_half_open_after_pause_then_closed_on_success(self, monkeypatch):
cb = CircuitBreaker(threshold=2, pause_seconds=300)
cb.record_transient()
cb.record_transient() # open
assert cb.state == "open"
# Simulate the pause elapsing.
cb.opened_at -= 301
assert cb.allow_claim() is True # -> half-open (probe)
assert cb.state == "half-open"
cb.record_recovered() # probe succeeded
assert cb.state == "closed"
def test_half_open_reopens_on_transient(self):
cb = CircuitBreaker(threshold=2, pause_seconds=300)
cb.record_transient()
cb.record_transient() # open
cb.opened_at -= 301
cb.allow_claim() # half-open
assert cb.state == "half-open"
cb.record_transient() # probe failed -> re-open
assert cb.state == "open"
def test_breaker_blocks_worker_claim(self, monkeypatch):
monkeypatch.setattr("src.queue_worker.preflight.check",
lambda *a, **k: (True, "ok"))
called = {"launch": False}
monkeypatch.setattr("src.queue_worker.launcher.launch_job",
lambda job: called.__setitem__("launch", True))
cb = CircuitBreaker(threshold=1, pause_seconds=300)
cb.record_transient() # open immediately
w = QueueWorker(max_concurrency=1, poll_interval=0.01, breaker=cb)
enqueue_job("analyst", "r")
w._drain_once()
assert called["launch"] is False # breaker open -> no claim, no CLI
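
Review note: the backoff and circuit-breaker behaviour pinned down by `TestBackoff` and `TestCircuitBreaker` can be sketched as below. This is not the code under test (`AgentLauncher._backoff_seconds` and `src.queue_worker.CircuitBreaker` are outside this section); it is a minimal model that reproduces the asserted values, assuming base 10 and cap 600 as stated in the test comments. Two simplifications are assumptions of the sketch: `Retry-After` is not capped, and half-open does not limit the probe to a single job.

```python
import time


def backoff_seconds(n, base=10, cap=600, retry_after=None):
    """Exponential backoff min(2^n * base, cap); a larger Retry-After wins."""
    delay = min((2 ** n) * base, cap)
    if retry_after is not None:
        # Assumption: Retry-After is honoured as-is when it exceeds the backoff.
        delay = max(delay, retry_after)
    return delay


class CircuitBreaker:
    """closed -> open after `threshold` consecutive transient failures,
    open -> half-open once `pause_seconds` have elapsed,
    half-open -> closed on recovery, or back to open on another transient."""

    def __init__(self, threshold=3, pause_seconds=300):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self.state = "closed"
        self.consecutive_transient = 0
        self.opened_at = 0.0

    def record_transient(self):
        self.consecutive_transient += 1
        if self.state == "half-open" or self.consecutive_transient >= self.threshold:
            self.state = "open"
            self.opened_at = time.monotonic()

    def record_recovered(self):
        self.consecutive_transient = 0
        self.state = "closed"

    def allow_claim(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.pause_seconds:
                self.state = "half-open"  # let a probe through
                return True
            return False
        return True
```

Keeping the breaker as plain in-memory state makes it cheap to consult on every poll; the trade-off is that a worker restart starts again from `closed`.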


@@ -1,12 +1,41 @@
import pytest
from fastapi.testclient import TestClient
import os
import tempfile
from unittest.mock import patch, MagicMock, AsyncMock
# Override DB path before importing app
os.environ["ORCH_DB_PATH"] = os.path.join(tempfile.gettempdir(), "test_orchestrator.db")
_test_db = os.path.join(tempfile.gettempdir(), "test_orchestrator.db")
os.environ["ORCH_DB_PATH"] = _test_db
os.environ["ORCH_PLANE_WEBHOOK_SECRET"] = ""
os.environ["ORCH_GITEA_WEBHOOK_SECRET"] = ""
os.environ["ORCH_REPOS_DIR"] = tempfile.gettempdir()
os.environ["ORCH_HOST_REPOS_DIR"] = "/home/slin/repos"
os.environ["ORCH_GITEA_TOKEN"] = "test-token"
os.environ["ORCH_PLANE_API_TOKEN"] = "test-token"
os.environ["ORCH_GITEA_OWNER"] = "admin"
os.environ["ORCH_DEFAULT_REPO"] = "enduro-trails"
# ORCH-6: register the test project so the project filter lets these fixtures
# through. proj-1 maps to enduro-trails/ET, preserving the ET-001/ET-002 asserts.
os.environ["ORCH_PROJECTS_JSON"] = (
'[{"plane_project_id": "proj-1", "repo": "enduro-trails", '
'"work_item_prefix": "ET", "name": "enduro-trails"}]'
)
from fastapi.testclient import TestClient
from src.main import app
from src.db import init_db, get_db
@pytest.fixture(autouse=True)
def setup_db():
"""Ensure DB tables exist before each test."""
if os.path.exists(_test_db):
os.unlink(_test_db)
init_db()
yield
if os.path.exists(_test_db):
os.unlink(_test_db)
client = TestClient(app)
@@ -18,7 +47,16 @@ def test_health():
assert resp.json()["service"] == "orchestrator"
def test_plane_webhook_accepts():
def test_status_endpoint():
resp = client.get("/status")
assert resp.status_code == 200
assert "active_tasks" in resp.json()
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
def test_plane_webhook_creates_task(mock_docs, mock_branch):
"""work_item.created → task in DB with stage=analysis."""
resp = client.post("/webhook/plane", json={
"event": "work_item.created",
"data": {"id": "test-123", "name": "Test task", "project": "proj-1"}
@@ -26,32 +64,208 @@ def test_plane_webhook_accepts():
assert resp.status_code == 200
assert resp.json()["status"] == "accepted"
# Verify task was created
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'test-123'").fetchone()
conn.close()
assert task is not None
assert task["stage"] == "analysis"
assert task["work_item_id"] is not None
assert "feature/" in task["branch"]
def test_plane_webhook_comment():
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
def test_plane_webhook_generates_sequential_ids(mock_docs, mock_branch):
"""Multiple work items get sequential IDs."""
client.post("/webhook/plane", json={
"event": "work_item.created",
"data": {"id": "item-1", "name": "First task", "project": "proj-1"}
})
client.post("/webhook/plane", json={
"event": "work_item.created",
"data": {"id": "item-2", "name": "Second task", "project": "proj-1"}
})
conn = get_db()
tasks = conn.execute("SELECT work_item_id FROM tasks ORDER BY id").fetchall()
conn.close()
ids = [t["work_item_id"] for t in tasks]
assert ids[0] == "ET-001"
assert ids[1] == "ET-002"
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
@patch("src.webhooks.plane.launcher")
def test_plane_approved_advances_stage(mock_launcher, mock_docs, mock_branch, tmp_path, monkeypatch):
"""Comment :approved: at stage=analysis → advance to architecture."""
# Patch repos_dir for QG check
monkeypatch.setattr("src.qg.checks.settings.repos_dir", str(tmp_path))
# Create task first
client.post("/webhook/plane", json={
"event": "work_item.created",
"data": {"id": "adv-001", "name": "Advance test", "project": "proj-1"}
})
# Get the task to find work_item_id
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'adv-001'").fetchone()
conn.close()
work_item_id = task["work_item_id"]
# Create required analysis files
wi_dir = tmp_path / "enduro-trails" / "docs" / "work-items" / work_item_id
wi_dir.mkdir(parents=True)
(wi_dir / "01-brd.md").write_text("# BRD")
(wi_dir / "02-trz.md").write_text("# TRZ")
(wi_dir / "03-acceptance-criteria.md").write_text("# AC")
(wi_dir / "04-test-plan.yaml").write_text("tests: []")
# Mock launcher
mock_launcher.launch.return_value = 1
# Send approved comment
resp = client.post("/webhook/plane", json={
"event": "comment.created",
"data": {"comment": "LGTM :approved:"}
"data": {
"work_item_id": "adv-001",
"comment": "Looks good :approved:"
}
})
assert resp.status_code == 200
assert resp.json()["status"] == "accepted"
# Verify stage advanced
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'adv-001'").fetchone()
conn.close()
assert task["stage"] == "architecture"
@patch("src.webhooks.plane._create_gitea_branch", new_callable=AsyncMock)
@patch("src.webhooks.plane._create_initial_docs", new_callable=AsyncMock)
def test_plane_rejected_rolls_back(mock_docs, mock_branch):
"""Comment :rejected: rolls back stage."""
# Create task
client.post("/webhook/plane", json={
"event": "work_item.created",
"data": {"id": "rej-001", "name": "Reject test", "project": "proj-1"}
})
# Manually set stage to architecture
conn = get_db()
conn.execute("UPDATE tasks SET stage = 'architecture' WHERE plane_id = 'rej-001'")
conn.commit()
conn.close()
# Send rejected comment
resp = client.post("/webhook/plane", json={
"event": "comment.created",
"data": {
"work_item_id": "rej-001",
"comment": "Not ready :rejected:"
}
})
assert resp.status_code == 200
# Verify stage rolled back
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'rej-001'").fetchone()
conn.close()
assert task["stage"] == "analysis"
def test_gitea_webhook_push():
"""Push event is accepted."""
resp = client.post(
"/webhook/gitea",
json={"ref": "refs/heads/feature/test", "repository": {"name": "enduro-trails"}},
json={"ref": "refs/heads/feature/test", "repository": {"name": "enduro-trails"}, "commits": []},
headers={"X-Gitea-Event": "push"}
)
assert resp.status_code == 200
assert resp.json()["status"] == "accepted"
def test_gitea_webhook_pr():
@patch("src.webhooks.gitea.launcher")
def test_gitea_push_with_adr_advances_stage(mock_launcher):
"""Push with ADR files at architecture stage → advance to development."""
mock_launcher.launch.return_value = 1
# Create a task at architecture stage
conn = get_db()
conn.execute(
"INSERT INTO tasks (plane_id, work_item_id, repo, branch, stage) VALUES (?, ?, ?, ?, ?)",
("push-001", "ET-010", "enduro-trails", "feature/ET-010-test", "architecture"),
)
conn.commit()
conn.close()
# Push with ADR file
resp = client.post(
"/webhook/gitea",
json={
"action": "reviewed",
"pull_request": {"state": "approved", "number": 1}
"ref": "refs/heads/feature/ET-010-test",
"repository": {"name": "enduro-trails"},
"commits": [
{"added": ["docs/work-items/ET-010/06-adr/001-decision.md"], "modified": []}
],
},
headers={"X-Gitea-Event": "push"}
)
assert resp.status_code == 200
# Verify stage advanced
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'push-001'").fetchone()
conn.close()
assert task["stage"] == "development"
mock_launcher.launch.assert_called_once()
@patch("src.webhooks.gitea.check_ci_green")
@patch("src.webhooks.gitea.launcher")
def test_gitea_ci_success_advances_to_review(mock_launcher, mock_ci):
"""CI success at development stage → advance to review."""
mock_ci.return_value = (True, "CI green")
mock_launcher.launch.return_value = 2
# Create a task at development stage
conn = get_db()
conn.execute(
"INSERT INTO tasks (plane_id, work_item_id, repo, branch, stage) VALUES (?, ?, ?, ?, ?)",
("ci-001", "ET-011", "enduro-trails", "feature/ET-011-test", "development"),
)
conn.commit()
conn.close()
# CI status success
resp = client.post(
"/webhook/gitea",
json={
"state": "success",
"branches": [{"name": "feature/ET-011-test"}],
"repository": {"name": "enduro-trails"},
},
headers={"X-Gitea-Event": "status"}
)
assert resp.status_code == 200
# Verify stage advanced
conn = get_db()
task = conn.execute("SELECT * FROM tasks WHERE plane_id = 'ci-001'").fetchone()
conn.close()
assert task["stage"] == "review"
def test_gitea_webhook_pr():
"""PR event is accepted."""
resp = client.post(
"/webhook/gitea",
json={
"action": "opened",
"pull_request": {"head": {"ref": "feature/test"}, "number": 1},
"repository": {"name": "enduro-trails"},
},
headers={"X-Gitea-Event": "pull_request"}
)
@@ -59,7 +273,17 @@ def test_gitea_webhook_pr():
assert resp.json()["status"] == "accepted"
def test_status_endpoint():
resp = client.get("/status")
assert resp.status_code == 200
assert "active_tasks" in resp.json()
def test_plane_webhook_event_logged():
"""Events are logged in the events table."""
client.post("/webhook/plane", json={
"event": "test.event",
"data": {"foo": "bar"}
})
conn = get_db()
event = conn.execute(
"SELECT * FROM events WHERE event_type = 'test.event'"
).fetchone()
conn.close()
assert event is not None
assert event["source"] == "plane"