orchestrator

Author	SHA1	Message	Date
Dev Agent	9a702a0216	feat(metrics): per-agent token/cost accounting Feature 4. claude is now launched with --output-format json; the run-log trailing result JSON is parsed (defensively, never fatal) for usage + total_cost_usd. New idempotent ALTERs add input_tokens/output_tokens/cache_read_tokens/cost_usd to agent_runs; the launcher monitor records usage per run, posts a per-agent finish comment under that agent bot (e.g. Developer gotov · 45.2k in / 12.1k out · $0.21), and the deployer posts an end-of-task summary (SUM over agent_runs GROUP BY agent) on done. New src/usage.py holds parse/format/record/summary helpers; test_usage.py covers parsing a real CLI JSON blob, NULL-on-garbage, recording, formatting, and the per-task aggregate.	2026-06-03 18:18:46 +03:00
Dev Agent	09b1c5e1b9	feat(webhook): start pipeline on In Progress status (not on create) Feature 1. work_item.created no longer starts the pipeline (soft QG-0 log only); the issue stays in the backlog until moved to In Progress. The pipeline-start body is extracted into start_pipeline(); a new issue updated handler routes a state change to In Progress -> handle_status_start, which is idempotent: an existing task for the plane_id is NOT re-created or restarted (protects handle_comment, which also flips issues to In Progress). Real Plane payload: event=issue, action=updated, data.state.id. Existing m6/plane_webhook/dedup tests updated to drive the new trigger; new test_status_trigger.py covers created-no-op / start / idempotent.	2026-06-03 18:18:26 +03:00
Dev Agent	a4668c0303	feat(plane): stage visibility on board + verdict status UUIDs Feature 3 + Feature 2 infra. Extend the global PLANE_STATES with the 6 new enduro status UUIDs (architecture/development/review/testing + approved/rejected), remap STAGE_TO_STATE so the 4 mid-pipeline stages move the issue across its own board column instead of all sitting in In Progress, and add the set_issue_stage_state() helper. Needs Input / In Review / Blocked keep their own explicit setters and stay higher priority. TODO(ORCH-10): statuses are per-project; resolve per project when more projects are onboarded.	2026-06-03 18:18:17 +03:00
Dev Agent	d305521067	feat(plane): per-agent bot authorship for comments add_comment now accepts an optional author (agent role) and POSTs under the matching Plane bot token via _headers_for(), so Plane shows the real author (Analyst/Architect/Developer/Reviewer/Tester/Deployer/Stream) instead of a single shared account. Unknown/empty roles or missing tokens fall back to the shared orchestrator token (autonomy preserved). GET/PATCH (find_issue_id, set_state) are unchanged and stay on the shared token. Call sites in stage_engine, launcher, webhooks/plane and the plane_sync notify helpers now pass author by stage role; stage transitions use stream. Adds tests/test_plane_author.py.	2026-06-03 10:53:25 +03:00
Dev Agent	30d6dd0557	feat(config): add per-agent Plane bot token settings Add 7 optional bot-token fields (plane_bot_analyst..stream) read from the ORCH_PLANE_BOT_* env vars, default empty. Required for per-agent comment authorship; empty values fall back to the shared orchestrator token.	2026-06-03 10:53:17 +03:00
Dev Agent	c431a3d055	fix(plane_sync): drop hardcoded ET- prefix in find_issue_id (M-6)	2026-06-03 10:02:15 +03:00
Dev Agent	1d978caea7	feat(webhook): derive work_item_id from Plane sequence_id (M-6)	2026-06-03 10:02:15 +03:00
Dev Agent	8f11971bfc	refactor(plane_sync): extract emoji literals to constants (L-3)	2026-06-03 09:54:43 +03:00
Dev Agent	0653c2437f	feat(launcher): prune old run logs (L-2)	2026-06-03 09:53:55 +03:00
Dev Agent	48b7707eb3	docs(stages): fix misleading STAGE_TRANSITIONS comment (L-1)	2026-06-03 09:51:46 +03:00
Dev Agent	e6a7c6de8d	feat(webhook): dedup deliveries by delivery_id (M-7)	2026-06-03 09:18:02 +03:00
Dev Agent	0b924208dc	feat(db): add events.delivery_id + partial unique index (M-7)	2026-06-03 09:18:02 +03:00
Dev Agent	51401a3ba9	refactor(launcher,plane): delegate stage advance to stage_engine launcher._try_advance_stage and plane._try_advance_stage are now thin wrappers over stage_engine.advance_stage. The plane webhook calls the sync engine via asyncio.to_thread so there is exactly one implementation. The launcher forwards finished_agent so the agent-specific rollback branches still fire; the webhook passes None (human :approved:), matching prior behavior. Also fixes the agent-selection bug in the launcher path: it used to enqueue get_agent_for_stage(next_stage) (skipping a stage, e.g. analysis->architecture launched developer instead of architect). The unified engine uses get_agent_for_stage(current_stage), consistent with plane and gitea.	2026-06-03 08:56:25 +03:00
Dev Agent	0befc49b1e	refactor(stage): extract unified stage_engine.advance_stage (M-3) Merge the two diverged _try_advance_stage implementations (launcher sync + plane async) into one synchronous engine. Preserves all launcher business logic (analyst approved-flow, reviewer REQUEST_CHANGES rollback+retry, tester FAIL rollback+retry, architect conflict rollback) and the plane check_review_approved PR-by-branch dispatch. Unifies the QG signature dispatch. Fixes agent selection: advancing FROM current_stage launches get_agent_for_stage(current_stage), not next_stage.	2026-06-03 08:56:14 +03:00
Dev Agent	49ecb48eb0	feat(launcher): graceful SIGTERM->SIGKILL + configurable agent timeout (M-2) The watchdog used to time.sleep(timeout) then immediately SIGKILL, which cut claude off mid-write and left half-written artifacts. It now sends SIGTERM, polls os.kill(pid, 0) for up to agent_kill_grace_seconds, and sends SIGKILL only if the process is still alive; ProcessLookupError is tolerated at every step. Timeout is now configurable via config.py: agent_timeout_seconds (default 1800), agent_kill_grace_seconds (default 20), and agent_timeout_overrides_json for per-agent overrides (e.g. {"reviewer": 3600}). AGENT_TIMEOUT is kept as a backward-compatible alias. The recorded exit_code stays -9 so the ORCH-1 monitor retry/fail logic is unchanged (timeout-kills classify as permanent and requeue within max_attempts, no retry loop).	2026-06-03 08:28:03 +03:00
Dev Agent	237732bc64	refactor(launcher): remove dead _auto_merge_pr (M-4) _auto_merge_pr had zero callers (merge is handled by the deployer agent). Removed the method; _ensure_pr (still used by the auto-advance path) is kept.	2026-06-03 08:27:52 +03:00
Dev Agent	c23f000c05	fix(preflight): check the binary the launcher actually spawns (ORCH-1) Container ORCH_CLAUDE_BIN pointed at a non-existent /usr/bin/claude while the launcher spawns the hardcoded /opt/claude-code/bin/claude.exe. Preflight now follows AgentLauncher.CLAUDE_BIN (the genuinely executed path), so it no longer falsely blocks every job in production.	2026-06-03 00:13:44 +03:00
Dev Agent	f314ae09e5	feat(worker): preflight gate + circuit breaker + /queue resilience (ORCH-1) QueueWorker gates claims behind preflight and the CircuitBreaker (open -> pause, no CLI calls + Telegram alert; half-open probes one job; closed on recovery). Wires launcher.on_outcome. /queue exposes resilience snapshot.	2026-06-03 00:12:17 +03:00
Dev Agent	90fdd19394	feat(launcher): classify failures, backoff transient retry, breaker outcome (ORCH-1) _finalize_job classifies the run log: transient (429/overload) -> backoff requeue via mark_job_transient with separate transient_attempts budget honouring Retry-After; permanent -> normal attempts<max. on_outcome callback feeds the circuit breaker. _backoff_seconds = min(2^n*base, max) | Retry-After.	2026-06-03 00:12:17 +03:00
Dev Agent	4ef87a3959	feat(resilience): cheap preflight + 429/transient error classifier (ORCH-1) preflight.py: cached CLAUDE_BIN exists + claude --version (no tokens, no prompt-ping). error_classifier.py: classify_log_file -> transient|permanent from log tail + Retry-After parsing.	2026-06-03 00:12:17 +03:00
Dev Agent	0cd9b11fe0	feat(queue): resilience schema + backoff helper + config (ORCH-1) jobs.transient_attempts + available_at columns (idempotent _ensure_column migration); claim_next_job honours available_at; mark_job_transient (backoff requeue with separate transient budget). Config: preflight_cache_ttl, backoff_base/max_seconds, transient_max_attempts, breaker_threshold, breaker_pause_seconds.	2026-06-03 00:12:17 +03:00
Dev Agent	b6d4426a48	feat(worker): background queue worker + lifespan + queue-recovery + /queue (ORCH-1) queue_worker.QueueWorker drains the queue respecting max_concurrency. main.py lifespan: queue-recovery (requeue running jobs) after M-1 orphan-recovery, starts worker and stops it on shutdown. New GET /queue endpoint (counts + recent jobs).	2026-06-02 23:58:44 +03:00
Dev Agent	20d6556e22	refactor(webhooks): enqueue_job instead of in-process launch (ORCH-1) All 8 webhook launch points (plane x4, gitea x4) now enqueue a job and return immediately instead of synchronously spawning claude in the uvicorn process.	2026-06-02 23:58:44 +03:00
Dev Agent	3345c2fa0a	feat(launcher): launch_job + job-status finalize with retries (ORCH-1) Refactor launch() into shared _spawn(); add launch_job(job) that threads job_id through monitor/watchdog. _finalize_job marks done / requeue (attempts<max) / failed+notify. Internal advance-chain self.launch -> enqueue_job. B-1/B-2/M-1/ORCH-2 spawn logic unchanged.	2026-06-02 23:58:44 +03:00
Dev Agent	fd3dac7d22	feat(queue): add jobs table + queue helpers and config (ORCH-1) Persistent SQLite job queue (F-2b): jobs table + idx, atomic claim_next_job, enqueue/mark/count/requeue/get helpers. New settings max_concurrency (ORCH_MAX_CONCURRENCY) and queue_poll_interval (ORCH_QUEUE_POLL_INTERVAL).	2026-06-02 23:58:44 +03:00
Dev Agent	a6f6a43c1c	fix(webhooks/gitea): ignore pushes/events for repos outside the registry ORCH-6: get_project_by_repo None -> ignored, so events for unknown repos do not trigger the pipeline.	2026-06-02 22:30:42 +03:00
Dev Agent	171f4eb304	fix(webhooks/plane): filter by project + resolve repo/prefix from registry ORCH-6 / incident 2026-06-02: ignore work items from unknown Plane projects (status=ignored) instead of funneling everything into default_repo. Resolve repo, work-item prefix and Plane sync project from the registry by data.project.	2026-06-02 22:30:42 +03:00
Dev Agent	a87c633003	refactor(plane_sync): parameterize project_id (backward compatible) ORCH-6: sync functions resolve the issue PROJECT_ID via the registry (get_project_by_repo) and accept project_id; default stays enduro so existing ET callers keep working.	2026-06-02 22:30:42 +03:00
Dev Agent	0797f958dc	feat(db): per-project work-item prefix in get_next_work_item_id ORCH-6: get_next_work_item_id(repo, prefix="ET") numbers per (repo, prefix) so orchestrator issues number ORCH-001 independently of the ET sequence. Default prefix stays ET for backward compatibility.	2026-06-02 22:30:42 +03:00
Dev Agent	36d5f25f2a	feat(projects): add project registry (Plane id -> repo/prefix mapping) ORCH-6: src/projects.py introduces ProjectConfig + resolvers (get_project_by_plane_id/by_repo, known_plane_project_ids) keyed by Plane project uuid. Source: ORCH_PROJECTS_JSON env (config.projects_json), with a built-in default registry (enduro-trails + orchestrator) and robust parsing (malformed JSON/entries fall back to default).	2026-06-02 22:30:42 +03:00
Dev Agent	1ebe8afc23	feat(worktree): git worktree per task to isolate shared /repos (ORCH-2 / S-4) - add src/git_worktree.py: ensure/remove/get_worktree_path - config: worktrees_dir=/repos/_wt - launcher: agent runs in per-branch worktree; task-file + commit/push in worktree; no shared checkout - qg/checks: read artifacts + run make test from worktree (branch arg, backward-compatible) - webhooks/plane: pass branch into QG dispatch; review fallback from worktree - webhooks/gitea: keep read-only branch --contains in main clone (documented) - tests: test_git_worktree.py (isolation) + update test_launcher write-task-file - docs: ARCHITECTURE worktree section + BUGFIXES_2026-06-02_ORCH2 Preserves B-1/B-2/S-1/S-5 fixes (paths now point at worktree).	2026-06-02 21:12:06 +03:00
Dev Agent	212352997e	fix(main): proper orphan recovery with per-run warning + notify (M-1)	2026-06-02 20:12:29 +03:00
Dev Agent	b585701c62	fix(webhooks): dispatch new QGs; stop false Gitea CI alerts (S-1) - plane._try_advance_stage handles check_tests_local + check_reviewer_verdict - gitea.handle_ci_status: failure -> debug log only (CI not authoritative)	2026-06-02 20:12:29 +03:00
Dev Agent	0924783be3	fix(qg): frontmatter-only reviewer verdict + local test gate (S-5, S-1) - check_reviewer_verdict reads verdict: from YAML frontmatter of 12-review.md only - add check_tests_local: orchestrator runs make test in /repos/<repo> - stages: development QG -> check_tests_local	2026-06-02 20:12:29 +03:00
Dev Agent	265a5ef1e6	fix(launcher): write task file to /repos without docker; stdout->file, no PIPE zombies (B-1, B-2) - _write_task_file writes directly to mounted /repos/<repo>, raises on failure - Popen stdout=log_fh at OS level; _monitor_agent simplified to proc.wait()+close - remove PIPE reader thread and startup-timeout (watchdog by pid stays) - dispatch check_tests_local args (repo, branch)	2026-06-02 20:12:29 +03:00
Dev Agent	f575f6bc6a	chore: save WIP changes before audit fixes - notifications: Telegram integration, richer stage/agent/QG notifications - plane_sync: explicit Plane state IDs, needs_input/in_review/blocked helpers, links in comments - launcher: deployer stage, model flag (opus), PR auto-create, REQUEST_CHANGES/tester/architect rollback+retry logic, partial check_reviewer_verdict path - qg/checks: add check_reviewer_verdict (substring-based, will be hardened in S-5) - stages: review->check_reviewer_verdict, testing->deployer agent - config: telegram_bot_token/chat_id settings	2026-06-02 19:57:43 +03:00
Dev Agent	e27e489157	fix(plane-webhook): read issue/comment_stripped fields from Plane comment payload	2026-06-01 19:17:14 +03:00
claude-bot	51f7364532	feat: integrate Analyst into Plane/Orchestrator pipeline - Add git fetch+checkout in agent launch cmd (ensures correct branch) - Add git fetch+checkout in _monitor_agent before commit/push - Post start comment in Plane when analyst launches - Post :approved: request comment after analyst completes successfully - Branch lookup moved before cmd construction for reuse	2026-05-31 20:15:01 +03:00
Dev Agent	81e0e383e0	feat(analysis): add check_analysis_approved QG with stakeholder approval requirement - stages.py: QG renamed to check_analysis_approved (requires :approved: comment) - qg/checks.py: new check_analysis_approved verifies files + Plane :approved: comment - launcher.py: skip auto-advance for analysis stage (requires human approval) - plane.py: route check_analysis_approved in _try_advance_stage - docs/ARCHITECTURE.md: updated QG table and flow description	2026-05-31 15:19:03 +03:00
Dev Agent	0ad56e1f0a	fix: tini entrypoint, event routing wildcard, orphan recovery	2026-05-22 13:52:46 +03:00
Dev Agent	b545665e2d	feat: full pipeline fixes - CI status branch lookup, review webhook routing, auto-advance, plane sync - handle_ci_status: fallback git branch -r --contains when branches[] empty - webhook router: handle pull_request_approved event type - handle_pr: map review.type to review.state for new Gitea format - launcher: auto-advance stage after agent completion (_try_advance_stage) - plane_sync: notify Plane on stage changes - stages.py: stage machine with QG definitions - notifications.py: stage change notifications - safe.directory fix for container git operations	2026-05-22 01:57:02 +03:00
Dev Agent	3116ae67bb	chore: clean up .gitignore, remove cached files from tracking	2026-05-19 15:58:45 +03:00
Dev Agent	95072e000f	fix: tests — add setup_db fixture for init_db in test env	2026-05-19 15:58:37 +03:00
Dev Agent	daf8cdad9e	feat: orchestrator MVP — webhooks, agent launcher, QG checks	2026-05-19 15:57:00 +03:00
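
The backoff rule recorded in commit 90fdd19394 (`_backoff_seconds = min(2^n*base, max) | Retry-After`) can be illustrated with a small sketch. This is not the repository's code: the function name `backoff_seconds`, its parameter names and the default values are hypothetical, and the sketch assumes that a server-supplied Retry-After value, when present, takes precedence over the computed exponential delay.

```python
from typing import Optional


def backoff_seconds(
    attempt: int,
    base: float = 30.0,          # hypothetical default, not taken from the repo
    max_seconds: float = 900.0,  # hypothetical default, not taken from the repo
    retry_after: Optional[float] = None,
) -> float:
    """Return the delay before a transient failure is requeued.

    Assumption: a positive Retry-After value wins over the computed delay.
    Otherwise the delay doubles with each attempt and is capped at max_seconds.
    """
    if retry_after is not None and retry_after > 0:
        return float(retry_after)
    return min((2 ** attempt) * base, max_seconds)
```

With these illustrative defaults the delay grows as 30, 60, 120, 240 seconds and is capped at 900 seconds. Whether the real helper also caps Retry-After at the maximum is not stated in the commit message.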

44 Commits
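
Commits 0b924208dc and e6a7c6de8d describe deduplicating webhook deliveries through an `events.delivery_id` column with a partial unique index. A minimal sketch of that idea, using only the standard `sqlite3` module, might look as follows. The table layout beyond `delivery_id`, the index name and the helper names `init_db` and `record_delivery` are assumptions made for illustration; the actual schema is not shown in the log.

```python
import sqlite3
from typing import Optional


def init_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical events table with a partial unique index."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            delivery_id TEXT,
            payload TEXT NOT NULL
        )
        """
    )
    # Partial index: uniqueness is enforced only for rows that carry a
    # delivery_id, so events without one are never rejected.
    conn.execute(
        """
        CREATE UNIQUE INDEX IF NOT EXISTS idx_events_delivery_id
        ON events(delivery_id)
        WHERE delivery_id IS NOT NULL
        """
    )
    conn.commit()


def record_delivery(
    conn: sqlite3.Connection, delivery_id: Optional[str], payload: str
) -> bool:
    """Insert the event; return False if this delivery_id was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (delivery_id, payload) VALUES (?, ?)",
        (delivery_id, payload),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when it was ignored.
    return cur.rowcount == 1
```

A handler built on this sketch would process the payload only when `record_delivery` returns `True`, which lets a redelivered webhook be acknowledged without starting the pipeline a second time.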