Commit Graph

55 Commits

Author SHA1 Message Date
2f0fd24670 Merge pull request 'ORCH-4: unified stage-engine (M-3)' (#5) from feature/ORCH-4-stage-engine into main 2026-06-03 08:59:51 +03:00
Dev Agent
6abdc220d2 test(stage): cover unified stage_engine + launcher/plane delegation
18 tests: happy-path advance per stage with correct agent (ORCH-4 fix),
QG-fail no-advance, reviewer REQUEST_CHANGES rollback+retry/alert, tester FAIL
rollback+retry/block, architect conflict rollback to analysis, analyst
approved-flow no-advance, and launcher+plane both delegating to the engine.
2026-06-03 08:56:25 +03:00
Dev Agent
51401a3ba9 refactor(launcher,plane): delegate stage advance to stage_engine
launcher._try_advance_stage and plane._try_advance_stage are now thin
wrappers over stage_engine.advance_stage. The plane webhook calls the sync
engine via asyncio.to_thread so there is exactly one implementation. The
launcher forwards finished_agent so the agent-specific rollback branches still
fire; the webhook passes None (human :approved:), matching prior behavior.

Also fixes the agent-selection bug in the launcher path: it used to enqueue
get_agent_for_stage(next_stage) (skipping a stage, e.g. analysis->architecture
launched developer instead of architect). The unified engine uses
get_agent_for_stage(current_stage), consistent with plane and gitea.
2026-06-03 08:56:25 +03:00
Dev Agent
0befc49b1e refactor(stage): extract unified stage_engine.advance_stage (M-3)
Merge the two diverged _try_advance_stage implementations (launcher sync +
plane async) into one synchronous engine. Preserves all launcher business
logic (analyst approved-flow, reviewer REQUEST_CHANGES rollback+retry, tester
FAIL rollback+retry, architect conflict rollback) and the plane
check_review_approved PR-by-branch dispatch. Unifies the QG signature
dispatch. Fixes agent selection: advancing FROM current_stage launches
get_agent_for_stage(current_stage), not next_stage.
2026-06-03 08:56:14 +03:00
fd554c8a5a Merge pull request 'ORCH-7: cleanup + hardening (M-4 dead code + M-2 graceful timeout)' (#4) from feature/ORCH-7-hardening into main 2026-06-03 08:31:26 +03:00
Dev Agent
c167c6930d test(launcher): watchdog graceful kill ordering + timeout config + M-4 removal
Cover M-2: SIGTERM-before-SIGKILL ordering, graceful exit within grace skips
SIGKILL, ProcessLookupError before SIGTERM is tolerated (no _record_kill), and
_resolve_timeout per-agent override / default / malformed-JSON fallback.
Cover M-4: _auto_merge_pr removed, _ensure_pr retained.
2026-06-03 08:28:09 +03:00
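The three _resolve_timeout cases covered above (per-agent override, default, malformed-JSON fallback) could look roughly like the sketch below. The function name resolve_timeout, its signature, and the rule that non-numeric or non-positive values fall back to the default are assumptions for illustration; only the 1800-second default and the {"reviewer": 3600} override shape come from the log.

```python
import json

DEFAULT_TIMEOUT_SECONDS = 1800  # default for agent_timeout_seconds named in the log


def resolve_timeout(agent: str, overrides_json: str,
                    default: int = DEFAULT_TIMEOUT_SECONDS) -> int:
    """Return the timeout for `agent`, falling back to `default` (hypothetical helper)."""
    try:
        overrides = json.loads(overrides_json) if overrides_json else {}
    except (ValueError, TypeError):
        # Malformed JSON: ignore the overrides entirely.
        return default
    if not isinstance(overrides, dict):
        return default
    value = overrides.get(agent)
    # Assumption: only positive numbers count as a valid override.
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        return default
    return int(value)
```

With these assumptions, an agent without an entry and a broken overrides string both resolve to the default.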
Dev Agent
49ecb48eb0 feat(launcher): graceful SIGTERM->SIGKILL + configurable agent timeout (M-2)
The watchdog used to time.sleep(timeout) then immediately SIGKILL, which cut
claude off mid-write and left half-written artifacts. It now sends SIGTERM,
polls os.kill(pid, 0) for up to agent_kill_grace_seconds, and only sends SIGKILL if
the process is still alive; ProcessLookupError is tolerated at every step.

Timeout is now configurable via config.py: agent_timeout_seconds (default 1800),
agent_kill_grace_seconds (default 20), and agent_timeout_overrides_json for
per-agent overrides (e.g. {"reviewer": 3600}). AGENT_TIMEOUT is kept as a
backward-compatible alias. The recorded exit_code stays -9, so the ORCH-1
monitor retry/fail logic is unchanged: timeout kills are classified as permanent
and requeued within max_attempts, so there is no retry loop.
2026-06-03 08:28:03 +03:00
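A minimal sketch of the SIGTERM, poll, SIGKILL sequence described in this commit, assuming a POSIX host. The function name graceful_kill, its return strings, and the injectable kill/sleep/clock parameters are hypothetical and exist only to make the ordering easy to exercise without a real process; the launcher's actual watchdog may be structured differently.

```python
import os
import signal
import time


def graceful_kill(pid, grace_seconds=20.0, poll_interval=0.5,
                  kill=os.kill, sleep=time.sleep, clock=time.monotonic):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL if still alive.

    Returns "gone" (already dead before SIGTERM), "terminated" (exited
    within the grace period) or "killed" (SIGKILL was sent).
    """
    try:
        kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"

    deadline = clock() + grace_seconds
    while clock() < deadline:
        try:
            kill(pid, 0)  # signal 0 only checks that the pid still exists
        except ProcessLookupError:
            return "terminated"
        sleep(poll_interval)

    try:
        kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        # The process exited between the last poll and the SIGKILL.
        return "terminated"
    return "killed"
```

Note that os.kill(pid, 0) also succeeds for an unreaped zombie child, so a real watchdog presumably still relies on the monitor's proc.wait() to reap the process.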
Dev Agent
237732bc64 refactor(launcher): remove dead _auto_merge_pr (M-4)
_auto_merge_pr had zero callers (merge is handled by the deployer agent).
Removed the method; _ensure_pr (still used by the auto-advance path) is kept.
2026-06-03 08:27:52 +03:00
4e52e192e4 Merge pull request 'ORCH-1 (F-2b): persistent job queue instead of in-process daemon threads' (#3) from feature/ORCH-1-job-queue into main 2026-06-03 08:09:23 +03:00
Dev Agent
c23f000c05 fix(preflight): check the binary the launcher actually spawns (ORCH-1)
Container ORCH_CLAUDE_BIN pointed at a non-existent /usr/bin/claude while the
launcher spawns the hardcoded /opt/claude-code/bin/claude.exe. Preflight now
follows AgentLauncher.CLAUDE_BIN (the genuinely executed path), so it no longer
falsely blocks every job in production.
2026-06-03 00:13:44 +03:00
Dev Agent
d0d47058b4 docs(resilience): document preflight/429/backoff/breaker + env vars (ORCH-1) 2026-06-03 00:12:17 +03:00
Dev Agent
a613fd8180 test(resilience): 34 tests for preflight/classifier/backoff/breaker (ORCH-1)
Covers preflight FAIL->queued + cache, transient/permanent classifier +
Retry-After, exp backoff + available_at gating, launcher transient vs permanent
finalize, circuit breaker open/half-open/closed. test_queue worker tests stub
preflight OK. Popen never spawned.
2026-06-03 00:12:17 +03:00
Dev Agent
f314ae09e5 feat(worker): preflight gate + circuit breaker + /queue resilience (ORCH-1)
QueueWorker gates claims behind preflight and the CircuitBreaker (open ->
pause, no CLI calls + Telegram alert; half-open probes one job; closed on
recovery). Wires launcher.on_outcome. /queue exposes resilience snapshot.
2026-06-03 00:12:17 +03:00
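The open / half-open / closed cycle mentioned here can be sketched as a small state machine. Everything below is an illustrative assumption: the class layout, the allow() method, the consecutive-failure counting, and the default values are not taken from the repository; only the state names, the single half-open probe, and the threshold/pause settings are named in the log.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=3, pause_seconds=60.0, clock=time.monotonic):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the breaker is closed
        self._probing = False    # True while the half-open probe is in flight

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.pause_seconds:
            return "half-open"
        return "open"

    def allow(self):
        """Should the worker claim a job right now?"""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self._probing:
            self._probing = True  # let exactly one probe job through
            return True
        return False

    def on_outcome(self, ok):
        """Feed a job result back into the breaker."""
        if ok:
            self._failures = 0
            self._opened_at = None
            self._probing = False
            return
        self._failures += 1
        if self._probing or self._failures >= self.threshold:
            self._opened_at = self._clock()  # (re)open and restart the pause
            self._probing = False
```

In this sketch a failed probe reopens the breaker for another full pause, which is one common choice rather than a confirmed detail of the worker.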
Dev Agent
90fdd19394 feat(launcher): classify failures, backoff transient retry, breaker outcome (ORCH-1)
_finalize_job classifies the run log: transient (429/overload) -> backoff
requeue via mark_job_transient with separate transient_attempts budget honouring
Retry-After; permanent -> normal attempts<max. on_outcome callback feeds the
circuit breaker. _backoff_seconds = min(2^n*base, max) | Retry-After.
2026-06-03 00:12:17 +03:00
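One possible reading of the backoff rule in this commit is sketched below: use Retry-After when the server supplied one, otherwise capped exponential backoff. The precedence of Retry-After, the parameter names, and the default base/max values are assumptions; the log only gives the shape min(2^n*base, max) and says Retry-After is honoured.

```python
def backoff_seconds(attempt, base=30.0, max_seconds=900.0, retry_after=None):
    """Delay before the next transient retry (hypothetical helper).

    attempt is the zero-based count of transient attempts made so far.
    """
    # Assumption: a positive Retry-After wins over the computed delay.
    if retry_after is not None and retry_after > 0:
        return float(retry_after)
    return float(min((2 ** attempt) * base, max_seconds))
```

With the assumed defaults the delays grow as 30, 60, 120, 240 seconds and then level off at 900.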
Dev Agent
4ef87a3959 feat(resilience): cheap preflight + 429/transient error classifier (ORCH-1)
preflight.py: cached CLAUDE_BIN exists + claude --version (no tokens, no
prompt-ping). error_classifier.py: classify_log_file -> transient|permanent
from log tail + Retry-After parsing.
2026-06-03 00:12:17 +03:00
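A rough sketch of a log-tail classifier in the spirit of error_classifier.py. The marker patterns, the 50-line tail, the function name classify_log_tail, and the Retry-After regex are all assumptions; the log only states that the result is transient or permanent, that 429/overload count as transient, and that Retry-After is parsed.

```python
import re

# Hypothetical markers for transient failures (429 / rate limit / overload).
TRANSIENT_RE = re.compile(r"\b429\b|rate.?limit|overloaded", re.IGNORECASE)
RETRY_AFTER_RE = re.compile(r"retry-after:?\s*(\d+)", re.IGNORECASE)


def classify_log_tail(text, tail_lines=50):
    """Return ("transient" | "permanent", retry_after_seconds or None)."""
    # Only the end of the log is inspected, so an early 429 that the agent
    # later recovered from does not mark the whole run as transient.
    tail = "\n".join(text.splitlines()[-tail_lines:])
    if not TRANSIENT_RE.search(tail):
        return "permanent", None
    match = RETRY_AFTER_RE.search(tail)
    return "transient", int(match.group(1)) if match else None
```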
Dev Agent
0cd9b11fe0 feat(queue): resilience schema + backoff helper + config (ORCH-1)
jobs.transient_attempts + available_at columns (idempotent _ensure_column
migration); claim_next_job honours available_at; mark_job_transient (backoff
requeue with separate transient budget). Config: preflight_cache_ttl,
backoff_base/max_seconds, transient_max_attempts, breaker_threshold,
breaker_pause_seconds.
2026-06-03 00:12:17 +03:00
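An idempotent column migration for SQLite, as _ensure_column is described here, is commonly written by checking PRAGMA table_info before ALTER TABLE. The sketch below assumes that approach; the function signature and the column definitions used in the usage lines are illustrative, not copied from the repository.

```python
import sqlite3


def ensure_column(conn, table, column, ddl):
    """Add `column` to `table` unless it already exists.

    Returns True if the column was added, False if it was already present.
    `table`, `column` and `ddl` must be trusted identifiers, since SQLite
    cannot bind them as parameters.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
ensure_column(conn, "jobs", "transient_attempts", "INTEGER NOT NULL DEFAULT 0")
ensure_column(conn, "jobs", "available_at", "REAL NOT NULL DEFAULT 0")
```

Running the same calls again on every startup is then harmless, which is what makes the migration safe to leave in the boot path.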
Dev Agent
4be168c0ec docs(queue): document job queue, /queue, env vars (ORCH-1)
ARCHITECTURE job-queue section + flow diagram, README /queue endpoint and
ORCH_MAX_CONCURRENCY/ORCH_QUEUE_POLL_INTERVAL, new docs/ORCH-1_JOB_QUEUE.md.
2026-06-02 23:58:44 +03:00
Dev Agent
2283b8898b test(queue): 19 tests for job queue lifecycle/atomicity/retry/worker (ORCH-1)
Covers enqueue->claim->mark, atomic claim (no double dispatch, 8-thread race),
retry fail->queued->failed, requeue_running_jobs, observability, worker
max_concurrency. Popen fully mocked (no real agent spawned).
2026-06-02 23:58:44 +03:00
Dev Agent
b6d4426a48 feat(worker): background queue worker + lifespan + queue-recovery + /queue (ORCH-1)
queue_worker.QueueWorker drains the queue respecting max_concurrency. main.py
lifespan: queue-recovery (requeue running jobs) after M-1 orphan-recovery, starts
worker and stops it on shutdown. New GET /queue endpoint (counts + recent jobs).
2026-06-02 23:58:44 +03:00
Dev Agent
20d6556e22 refactor(webhooks): enqueue_job instead of in-process launch (ORCH-1)
All 8 webhook launch points (plane x4, gitea x4) now enqueue a job and return
immediately instead of synchronously spawning claude in the uvicorn process.
2026-06-02 23:58:44 +03:00
Dev Agent
3345c2fa0a feat(launcher): launch_job + job-status finalize with retries (ORCH-1)
Refactor launch() into shared _spawn(); add launch_job(job) that threads job_id
through monitor/watchdog. _finalize_job marks done / requeue (attempts<max) /
failed+notify. Internal advance-chain self.launch -> enqueue_job. B-1/B-2/M-1/ORCH-2
spawn logic unchanged.
2026-06-02 23:58:44 +03:00
Dev Agent
fd3dac7d22 feat(queue): add jobs table + queue helpers and config (ORCH-1)
Persistent SQLite job queue (F-2b): jobs table + idx, atomic claim_next_job,
enqueue/mark/count/requeue/get helpers. New settings max_concurrency
(ORCH_MAX_CONCURRENCY) and queue_poll_interval (ORCH_QUEUE_POLL_INTERVAL).
2026-06-02 23:58:44 +03:00
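One way to get an atomic claim out of SQLite is to take the write lock up front with BEGIN IMMEDIATE, then select and update inside the same transaction. The sketch below assumes that technique and a minimal two-column jobs table; the real schema, status names beyond queued/running, and the implementation of claim_next_job are not shown in the log.

```python
import sqlite3


def claim_next_job(conn):
    """Atomically move the oldest queued job to 'running' and return its id.

    Expects a connection opened with isolation_level=None so that the
    explicit BEGIN/COMMIT below are the only transaction control.
    """
    # BEGIN IMMEDIATE takes the write lock now, so two workers cannot both
    # read the same queued row before either of them updates it.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row[0] if row is not None else None
```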
b021ff7cb0 Merge pull request 'ORCH-6: multi-repo (project filter + repo/prefix per project)' (#2) from feature/ORCH-6-multirepo into main 2026-06-02 23:42:29 +03:00
Dev Agent
ca81f38330 docs: document multi-repo registry + ORCH-6 bugfix and incident
ORCH-6: ARCHITECTURE.md gets a project-registry section; README explains
how to add a project via ORCH_PROJECTS_JSON; BUGFIXES_2026-06-03.md
records the fix and links the 2026-06-02 webhook autorun incident.
2026-06-02 22:30:51 +03:00
Dev Agent
c1f35a2047 test(projects,webhook): cover registry resolvers + project filter
ORCH-6: test_projects.py covers resolvers and ORCH_PROJECTS_JSON parsing
(valid/malformed/fallback). test_plane_webhook.py covers the webhook
project filter via TestClient (unknown->ignored, orchestrator->orchestrator
repo, enduro->enduro-trails, independent ORCH/ET prefixes); launcher
mocked. test_webhooks.py: register proj-1 so existing ET fixtures pass.
2026-06-02 22:30:51 +03:00
Dev Agent
a6f6a43c1c fix(webhooks/gitea): ignore pushes/events for repos outside the registry
ORCH-6: get_project_by_repo None -> ignored, so events for unknown repos
do not trigger the pipeline.
2026-06-02 22:30:42 +03:00
Dev Agent
171f4eb304 fix(webhooks/plane): filter by project + resolve repo/prefix from registry
ORCH-6 / incident 2026-06-02: ignore work items from unknown Plane
projects (status=ignored) instead of funneling everything into
default_repo. Resolve repo, work-item prefix and Plane sync project from
the registry by data.project.
2026-06-02 22:30:42 +03:00
Dev Agent
a87c633003 refactor(plane_sync): parameterize project_id (backward compatible)
ORCH-6: sync functions resolve the issue PROJECT_ID via the registry
(get_project_by_repo) and accept project_id; default stays enduro so
existing ET callers keep working.
2026-06-02 22:30:42 +03:00
Dev Agent
0797f958dc feat(db): per-project work-item prefix in get_next_work_item_id
ORCH-6: get_next_work_item_id(repo, prefix="ET") numbers per (repo, prefix)
so orchestrator issues number ORCH-001 independently of the ET sequence.
Default prefix stays ET for backward compatibility.
2026-06-02 22:30:42 +03:00
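The per-prefix numbering can be illustrated with a small pure function. This is only a sketch: the real get_next_work_item_id takes a repo and reads from the database, whereas the hypothetical next_work_item_id below takes the list of ids already issued for one repo. The three-digit padding is inferred from ids such as ORCH-001 and ET-006 in this log.

```python
import re


def next_work_item_id(existing_ids, prefix="ET"):
    """Return the next id for `prefix`, ignoring ids with other prefixes."""
    pattern = re.compile(rf"^{re.escape(prefix)}-(\d+)$")
    numbers = [int(m.group(1)) for m in map(pattern.match, existing_ids) if m]
    return f"{prefix}-{max(numbers, default=0) + 1:03d}"


issued = ["ET-001", "ET-006"]
print(next_work_item_id(issued, prefix="ORCH"))  # ORCH-001
print(next_work_item_id(issued))                 # ET-007
```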
Dev Agent
36d5f25f2a feat(projects): add project registry (Plane id -> repo/prefix mapping)
ORCH-6: src/projects.py introduces ProjectConfig + resolvers
(get_project_by_plane_id/by_repo, known_plane_project_ids) keyed by
Plane project uuid. Source: ORCH_PROJECTS_JSON env (config.projects_json),
with a built-in default registry (enduro-trails + orchestrator) and
robust parsing (malformed JSON/entries fall back to default).
2026-06-02 22:30:42 +03:00
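A sketch of registry parsing with a fallback to the built-in default. The JSON shape (an object keyed by Plane project uuid with repo and prefix fields), the placeholder uuids, the ProjectConfig field names, and the choice to discard the whole input when any entry is malformed are all assumptions made for illustration; the log does not show the actual format of ORCH_PROJECTS_JSON.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectConfig:
    plane_project_id: str
    repo: str
    prefix: str


# Placeholder uuids; the real Plane project ids are not in the log.
DEFAULT_REGISTRY = {
    "uuid-enduro": ProjectConfig("uuid-enduro", "enduro-trails", "ET"),
    "uuid-orchestrator": ProjectConfig("uuid-orchestrator", "orchestrator", "ORCH"),
}


def load_registry(raw):
    """Parse the registry JSON, falling back to DEFAULT_REGISTRY on any error."""
    if not raw:
        return dict(DEFAULT_REGISTRY)
    try:
        data = json.loads(raw)
        registry = {
            pid: ProjectConfig(pid, entry["repo"], entry["prefix"])
            for pid, entry in data.items()
        }
    except (ValueError, KeyError, TypeError, AttributeError):
        # Malformed JSON or malformed entries: keep the known-good default.
        return dict(DEFAULT_REGISTRY)
    return registry or dict(DEFAULT_REGISTRY)


def get_project_by_plane_id(registry, plane_id):
    """Return the project for a Plane uuid, or None for unknown projects."""
    return registry.get(plane_id)
```

A webhook handler built on this would treat a None result as "ignored", matching the project filter described in the ORCH-6 commits.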
Dev Agent
1ebe8afc23 feat(worktree): git worktree per task to isolate shared /repos (ORCH-2 / S-4)
- add src/git_worktree.py: ensure/remove/get_worktree_path
- config: worktrees_dir=/repos/_wt
- launcher: agent runs in per-branch worktree; task-file + commit/push in worktree; no shared checkout
- qg/checks: read artifacts + run make test from worktree (branch arg, backward-compatible)
- webhooks/plane: pass branch into QG dispatch; review fallback from worktree
- webhooks/gitea: keep read-only branch --contains in main clone (documented)
- tests: test_git_worktree.py (isolation) + update test_launcher write-task-file
- docs: ARCHITECTURE worktree section + BUGFIXES_2026-06-02_ORCH2

Preserves B-1/B-2/S-1/S-5 fixes (paths now point at worktree).
2026-06-02 21:12:06 +03:00
Dev Agent
66a37612fd docs(bugfixes): add safe.directory, init:true findings and autonomy test result 2026-06-02 20:22:51 +03:00
Dev Agent
57cca14ed3 fix(compose): init:true (PID1 reaper) to reap claude grandchild zombies (B-2) 2026-06-02 20:20:33 +03:00
Dev Agent
5de8462a13 fix(docker): trust /repos for git (safe.directory) so launcher commit/push works 2026-06-02 20:18:44 +03:00
Dev Agent
553e0aae0c docs: update QG table, task-file write, orphan recovery; add BUGFIXES_2026-06-02 2026-06-02 20:12:29 +03:00
Dev Agent
67b9f814b5 test(launcher): cover _write_task_file and reviewer verdict parsing (L-5) 2026-06-02 20:12:29 +03:00
Dev Agent
212352997e fix(main): proper orphan recovery with per-run warning + notify (M-1) 2026-06-02 20:12:29 +03:00
Dev Agent
b585701c62 fix(webhooks): dispatch new QGs; stop false Gitea CI alerts (S-1)
- plane._try_advance_stage handles check_tests_local + check_reviewer_verdict
- gitea.handle_ci_status: failure -> debug log only (CI not authoritative)
2026-06-02 20:12:29 +03:00
Dev Agent
0924783be3 fix(qg): frontmatter-only reviewer verdict + local test gate (S-5, S-1)
- check_reviewer_verdict reads verdict: from YAML frontmatter of 12-review.md only
- add check_tests_local: orchestrator runs make test in /repos/<repo>
- stages: development QG -> check_tests_local
2026-06-02 20:12:29 +03:00
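Reading the verdict from frontmatter only, rather than searching the whole review for a substring, could be sketched as follows. The helper name, the hand-rolled key parsing (no YAML library), and the APPROVED value are assumptions; REQUEST_CHANGES and the verdict: key are the only specifics given in the log.

```python
def parse_frontmatter_verdict(text):
    """Return the `verdict:` value from leading YAML frontmatter, or None.

    Text after the closing '---' is never inspected, so a review body that
    merely mentions a verdict word cannot change the result.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    frontmatter = []
    for line in lines[1:]:
        if line.strip() == "---":
            break
        frontmatter.append(line)
    else:
        return None  # frontmatter block was never closed
    for line in frontmatter:
        key, sep, value = line.partition(":")
        if sep and key.strip() == "verdict":
            return value.strip().strip("'\"").upper() or None
    return None


review = "---\nverdict: APPROVED\n---\nEarlier draft said REQUEST_CHANGES."
print(parse_frontmatter_verdict(review))  # APPROVED
```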
Dev Agent
265a5ef1e6 fix(launcher): write task file to /repos without docker; stdout->file, no PIPE zombies (B-1, B-2)
- _write_task_file writes directly to mounted /repos/<repo>, raises on failure
- Popen stdout=log_fh at OS level; _monitor_agent simplified to proc.wait()+close
- remove PIPE reader thread and startup-timeout (watchdog by pid stays)
- dispatch check_tests_local args (repo, branch)
2026-06-02 20:12:29 +03:00
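The stdout-to-file change can be illustrated with a short sketch: the child writes straight to the log file descriptor, so no PIPE reader thread is needed and the parent only waits and closes. The function name run_logged and the choice to merge stderr into the same file are assumptions, not details confirmed by the commit.

```python
import os
import subprocess
import sys
import tempfile


def run_logged(cmd, log_path):
    """Run `cmd` with its output written directly to `log_path`.

    Returns the child's exit code. The file handle is closed once the
    process has been waited on.
    """
    with open(log_path, "wb") as log_fh:
        # The OS duplicates log_fh into the child; the parent never reads
        # the output, so a full pipe buffer cannot stall the child.
        proc = subprocess.Popen(cmd, stdout=log_fh, stderr=subprocess.STDOUT)
        return proc.wait()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "agent.log")
        code = run_logged([sys.executable, "-c", "print('hello')"], path)
        with open(path, "rb") as fh:
            print(code, fh.read())
```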
Dev Agent
f575f6bc6a chore: save WIP changes before audit fixes
- notifications: Telegram integration, richer stage/agent/QG notifications
- plane_sync: explicit Plane state IDs, needs_input/in_review/blocked helpers, links in comments
- launcher: deployer stage, model flag (opus), PR auto-create, REQUEST_CHANGES/tester/architect rollback+retry logic, partial check_reviewer_verdict path
- qg/checks: add check_reviewer_verdict (substring-based, will be hardened in S-5)
- stages: review->check_reviewer_verdict, testing->deployer agent
- config: telegram_bot_token/chat_id settings
2026-06-02 19:57:43 +03:00
claude-bot
8715dd7148 feat(deploy): SSH key mount, deploy env vars, openssh-client in image 2026-06-01 20:03:27 +03:00
Dev Agent
e27e489157 fix(plane-webhook): read issue/comment_stripped fields from Plane comment payload 2026-06-01 19:17:14 +03:00
51f7364532 feat: integrate Analyst into Plane/Orchestrator pipeline
- Add git fetch+checkout in agent launch cmd (ensures correct branch)
- Add git fetch+checkout in _monitor_agent before commit/push
- Post start comment in Plane when analyst launches
- Post :approved: request comment after analyst completes successfully
- Branch lookup moved before cmd construction for reuse
2026-05-31 20:15:01 +03:00
Dev Agent
81e0e383e0 feat(analysis): add check_analysis_approved QG with stakeholder approval requirement
- stages.py: QG renamed to check_analysis_approved (requires :approved: comment)
- qg/checks.py: new check_analysis_approved verifies files + Plane :approved: comment
- launcher.py: skip auto-advance for analysis stage (requires human approval)
- plane.py: route check_analysis_approved in _try_advance_stage
- docs/ARCHITECTURE.md: updated QG table and flow description
2026-05-31 15:19:03 +03:00
Dev Agent
0f0b984656 docs: add pipeline design backlog (audit + backlog mgmt) 2026-05-23 09:17:41 +03:00
Dev Agent
267bc58fb2 docs: update README, add ARCHITECTURE.md with full system documentation 2026-05-22 14:09:24 +03:00
Dev Agent
0ad56e1f0a fix: tini entrypoint, event routing wildcard, orphan recovery 2026-05-22 13:52:46 +03:00
Dev Agent
c326ef0ac4 docs: lessons learned ET-006 — problems and solutions 2026-05-22 13:45:40 +03:00
Dev Agent
b545665e2d feat: full pipeline fixes - CI status branch lookup, review webhook routing, auto-advance, plane sync
- handle_ci_status: fallback git branch -r --contains when branches[] empty
- webhook router: handle pull_request_approved event type
- handle_pr: map review.type to review.state for new Gitea format
- launcher: auto-advance stage after agent completion (_try_advance_stage)
- plane_sync: notify Plane on stage changes
- stages.py: stage machine with QG definitions
- notifications.py: stage change notifications
- safe.directory fix for container git operations
2026-05-22 01:57:02 +03:00