feat(reconciler): sweeper for lost webhooks (reconciliation of stuck stages)

The pipeline advances only on incoming webhooks; a lost event (a 502 during a rebuild, no retries from Plane/Gitea, an unresolved sha→branch) leaves a task silently stuck (incident class ORCH-044). A new background daemon thread, src/reconciler.py (queue_worker pattern), replays the missed transition through the same standard gates/handlers that the webhook uses:

- F-1 gate-side: for tasks with stage≠done, no active job, and age(updated_at) ≥ grace_for_stage(stage), run a read-only pre-evaluation of the canonical QG; green → stage_engine.advance_stage(..., finished_agent=None); red → silence (notification spam is structurally impossible). F-1 does not touch analysis (a human gate).
- F-2 plane-side: poll the Plane API per project (plane_sync.list_issues_by_state, cursor pagination, never-raise) → replay In Progress/Approved/Rejected through the existing handle_status_start/handle_verdict (async code called from a sync thread via asyncio.run).
- F-3: hardened sha→branch resolution in handle_ci_status: DB fallback to the single development task of the repo (ambiguity → left unresolved), log level debug→info.
- Anti-duplication on creation (db.create_task_atomic under a process-wide Lock): a reconcile↔webhook race does not spawn a second task/branch/worktree/analyst job (AC-4).
- F-4 observability: an unblock log line + Telegram + a reconcile block in /queue.

Start/stop in main.lifespan (after worker.start() / before worker.stop()), restart-safe, never-raise per unit of work. Kill switches ORCH_RECONCILE_ENABLED / ORCH_RECONCILE_PLANE_ENABLED plus grace settings. The DB schema and the STAGE_TRANSITIONS/QG_CHECKS registries are unchanged.

Tests: test_reconciler.py, test_reconciler_plane.py, test_gitea_sha_resolve.py, test_config.py (33 new, 563 total, all green).
Documentation updated (golden source): architecture/README.md, INFRA.md, README.md, CHANGELOG.md, adr-0007 → accepted.

Refs: ORCH-053
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
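
The F-1 eligibility rule described in the commit message can be sketched roughly as follows. This is an illustration only: the grace values, the `GRACE_S` table, and the `eligible_for_gate_sweep` helper are assumptions made for the example, and the actual logic lives in src/reconciler.py, which is not part of the diff shown here.

```python
# Hypothetical per-stage grace periods in seconds. The real values come from
# the grace settings mentioned in the commit message and are not shown there.
GRACE_S = {"development": 600, "review": 900}
DEFAULT_GRACE_S = 1200


def grace_for_stage(stage: str) -> int:
    """Return the grace period for a stage (illustrative values only)."""
    return GRACE_S.get(stage, DEFAULT_GRACE_S)


def eligible_for_gate_sweep(task: dict, has_active_job: bool) -> bool:
    """F-1 pre-filter: not done, not analysis, no active job, older than grace."""
    stage = task["stage"]
    if stage in ("done", "analysis"):
        # 'done' is terminal; 'analysis' is a human gate that F-1 leaves alone.
        return False
    if has_active_job:
        return False
    return task["age_s"] >= grace_for_stage(stage)
```

Only tasks that pass this filter would go on to the read-only QG pre-evaluation; a red result leaves the task untouched and produces no notification.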
93
src/db.py
@@ -1,6 +1,15 @@
import sqlite3
import threading
from .config import settings

# ORCH-053 (F-2 anti-dup): process-wide lock guarding the SELECT-exists -> INSERT
# task-creation claim. The prod topology is a single uvicorn process per DB
# (staging/prod isolated), with the webhook running in uvicorn's asyncio thread
# and the reconciler in its own thread of the SAME process -> a threading.Lock
# covers both sides of the create race without a schema migration. See
# docs/work-items/ORCH-053/06-adr/ADR-001-stuck-task-reconciler.md §4.
_CREATE_TASK_LOCK = threading.Lock()


def get_db() -> sqlite3.Connection:
    conn = sqlite3.connect(settings.db_path)
@@ -145,6 +154,90 @@ def get_task_by_repo_branch(repo: str, branch: str) -> dict | None:
    return None


def get_active_tasks_for_reconcile() -> list[dict]:
    """ORCH-053 (F-1): tasks eligible for the gate-side sweeper.

    Returns every task whose stage is not terminal ('done'), each augmented with
    ``age_s`` = seconds since ``tasks.updated_at`` (computed in SQL against UTC
    'now', matching how ``update_task_stage`` stamps ``updated_at``). The
    reconciler applies the per-stage grace and active-job guard on top.
    """
    conn = get_db()
    try:
        rows = conn.execute(
            "SELECT *, "
            "CAST(strftime('%s','now') - strftime('%s', updated_at) AS INTEGER) AS age_s "
            "FROM tasks WHERE stage != 'done'"
        ).fetchall()
    finally:
        conn.close()
    return [dict(r) for r in rows]
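
The `age_s` computation in the query above can be tried in isolation against an in-memory database. This is a minimal sketch: the three-column `tasks` table and the `demo_age_seconds` helper are assumptions made for the example and do not reflect the real schema.

```python
import sqlite3


def demo_age_seconds() -> list[dict]:
    """Run the age_s query on a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets dict(row) work, as in the diff
    try:
        conn.execute(
            "CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT, updated_at TEXT)"
        )
        # One task touched two minutes ago, one terminal task that must be skipped.
        conn.execute(
            "INSERT INTO tasks (stage, updated_at) "
            "VALUES ('development', datetime('now', '-120 seconds'))"
        )
        conn.execute(
            "INSERT INTO tasks (stage, updated_at) "
            "VALUES ('done', datetime('now', '-9999 seconds'))"
        )
        rows = conn.execute(
            "SELECT *, "
            "CAST(strftime('%s','now') - strftime('%s', updated_at) AS INTEGER) AS age_s "
            "FROM tasks WHERE stage != 'done'"
        ).fetchall()
        return [dict(r) for r in rows]
    finally:
        conn.close()
```

Both `datetime('now')` and `strftime('%s','now')` work in UTC, so the difference is a plain number of seconds as long as `updated_at` is also stored in UTC.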


def get_development_tasks_by_repo(repo: str) -> list[dict]:
    """ORCH-053 (F-3): tasks of a repo currently on the 'development' stage.

    Used as the sha->branch DB fallback in handle_ci_status: a CI-status webhook
    whose branch could not be resolved (no branches[], empty
    ``git branch -r --contains``) is matched to the unique development task of
    the repo (ambiguity -> caller leaves it unresolved).
    """
    conn = get_db()
    try:
        rows = conn.execute(
            "SELECT * FROM tasks WHERE repo = ? AND stage = 'development'", (repo,)
        ).fetchall()
    finally:
        conn.close()
    return [dict(r) for r in rows]
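
On the caller side, the "unique task or leave unresolved" rule from the docstring above might look like the sketch below. The helper name `resolve_branch_from_dev_tasks` is hypothetical; the actual fallback sits in handle_ci_status, which is not shown in this diff.

```python
from typing import Optional


def resolve_branch_from_dev_tasks(dev_tasks: list) -> Optional[str]:
    """Return the branch of the only development task, else None.

    Zero matches means there is nothing to fall back to; two or more matches
    are ambiguous, so the branch is deliberately left unresolved.
    """
    if len(dev_tasks) != 1:
        return None
    return dev_tasks[0]["branch"]
```

Refusing to guess on ambiguity trades a possibly stuck task (which the sweeper can still pick up later) for never attributing a CI status to the wrong branch.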


def create_task_atomic(
    plane_id: str,
    work_item_id: str,
    repo: str,
    branch: str,
    stage: str,
    title: str,
) -> tuple[dict, bool]:
    """ORCH-053 (AC-4): atomically claim creation of a task for a plane_id.

    Performs SELECT-exists -> INSERT under the process-wide ``_CREATE_TASK_LOCK``
    so a race between the live Plane webhook and the F-2 reconciler (both seeing
    "no task yet" for the same plane_id) cannot create two task rows / branches /
    worktrees / starter analyst jobs.

    Returns ``(row, created)``:
      * ``created=True`` -> THIS caller inserted the row and owns the follow-up
        work (branch / docs / analyst enqueue);
      * ``created=False`` -> a task for this plane_id already existed (the other
        racer won); ``row`` is the existing task and the caller must NOT duplicate
        the follow-up work.
    """
    with _CREATE_TASK_LOCK:
        conn = get_db()
        try:
            existing = conn.execute(
                "SELECT * FROM tasks WHERE plane_id = ? OR plane_issue_id = ?",
                (plane_id, plane_id),
            ).fetchone()
            if existing:
                return dict(existing), False
            cur = conn.execute(
                "INSERT INTO tasks "
                "(plane_id, work_item_id, repo, branch, stage, plane_issue_id, title) "
                "VALUES (?, ?, ?, ?, ?, ?, ?)",
                (plane_id, work_item_id, repo, branch, stage, plane_id, title),
            )
            conn.commit()
            row = conn.execute(
                "SELECT * FROM tasks WHERE id = ?", (cur.lastrowid,)
            ).fetchone()
            return dict(row), True
        finally:
            conn.close()
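
The check-then-insert claim above can be exercised with a small race harness. This is a sketch under simplifying assumptions: a two-column `tasks` table in a temporary file, with `create_once` and `race` as hypothetical stand-ins for `create_task_atomic` and its two callers.

```python
import os
import sqlite3
import tempfile
import threading

_LOCK = threading.Lock()  # stand-in for _CREATE_TASK_LOCK


def create_once(db_path: str, plane_id: str) -> bool:
    """Insert a row for plane_id unless one exists; True if this call inserted."""
    with _LOCK:
        conn = sqlite3.connect(db_path)
        try:
            found = conn.execute(
                "SELECT 1 FROM tasks WHERE plane_id = ?", (plane_id,)
            ).fetchone()
            if found:
                return False
            conn.execute("INSERT INTO tasks (plane_id) VALUES (?)", (plane_id,))
            conn.commit()
            return True
        finally:
            conn.close()


def race(n_threads: int = 8) -> tuple:
    """Run n_threads concurrent claims; return (winners, rows_in_table)."""
    fd, db_path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    try:
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, plane_id TEXT)")
        conn.commit()
        conn.close()

        results = []

        def worker() -> None:
            results.append(create_once(db_path, "PLANE-1"))

        threads = [threading.Thread(target=worker) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        conn = sqlite3.connect(db_path)
        count = conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
        conn.close()
        return sum(results), count
    finally:
        os.remove(db_path)
```

Calling `race()` should report exactly one winner and one row. As the comment on `_CREATE_TASK_LOCK` notes, a `threading.Lock` only serializes threads of a single process; a multi-process deployment would need a different guard, such as a UNIQUE constraint on the identifying column.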


def update_task_stage(task_id: int, stage: str):
    """Update task stage and timestamp."""
    conn = get_db()