fix(serial-gate): pause-without-blocking via per-task park signal (ORCH-124)

Fixes incident ORCH-116/ORCH-123: serial_gate defined a repo's "active task"
purely by machine stage (tasks.stage NOT IN ('done','cancelled')). Plane statuses
Backlog/Blocked/Needs-Input (layer-B indication, ORCH-066) do NOT change
tasks.stage (layer A), so a paused predecessor was indistinguishable from an active
one and held the FIFO gate closed against an urgent successor — the urgent fix
could not start until the paused task was formally done.

Introduces an explicit, durable, DB-resolvable per-task "park" signal — additive
nullable column tasks.paused_at (pattern of cancelled_at/track) — and a new
ORTHOGONAL scheduler "pause" axis. The serial-gate "active task" predicate becomes
`stage NOT IN ('done','cancelled') AND paused_at IS NULL` across all three points
(build_claim_clause / repo_has_active_task / _per_repo_snapshot). The terminal set
{done,cancelled} in serial_gate/task_deps/stages.py is byte-for-byte unchanged
(adr-0026 not regressed): task_deps/stages.py do NOT read paused_at, so a paused
declared dependency and an active repo_freeze STILL block (pause never bypasses
them — different axes). Anti-stale-base on resume relies on the existing deferred
branch cut (ORCH-088) + pre-merge auto_rebase_onto_main + merge-gate re-test
(ORCH-026/093/110) — no new rebase machinery.
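
The predicate change can be illustrated with a minimal sketch against an in-memory SQLite database. Only `stage` and `paused_at` come from this commit; the `repo` column, the `development` stage name, and the `blocking_task_ids` helper are hypothetical names chosen for the example, not the project's schema or API.

```python
import sqlite3

# Hypothetical minimal schema: `repo` is an assumed column name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    "id INTEGER PRIMARY KEY, repo TEXT, stage TEXT, paused_at TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (id, repo, stage, paused_at) VALUES (?, ?, ?, ?)",
    [
        # Parked predecessor: non-terminal stage, but paused_at is set.
        (1, "app", "development", "2026-06-16 10:00:00"),
        # Terminal task: never counted as active by either predicate.
        (2, "app", "done", None),
    ],
)

OLD_PREDICATE = "stage NOT IN ('done','cancelled')"
NEW_PREDICATE = "stage NOT IN ('done','cancelled') AND paused_at IS NULL"


def blocking_task_ids(predicate: str, repo: str) -> list[int]:
    """Return ids of tasks that would hold the gate closed for `repo`."""
    rows = conn.execute(
        f"SELECT id FROM tasks WHERE repo = ? AND {predicate} ORDER BY id",
        (repo,),
    ).fetchall()
    return [row[0] for row in rows]


old_blockers = blocking_task_ids(OLD_PREDICATE, "app")
new_blockers = blocking_task_ids(NEW_PREDICATE, "app")
print(old_blockers, new_blockers)
```

Under the old predicate the parked task 1 still counts as active, which is the incident; under the new predicate nothing blocks, so a successor could be claimed.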

Additive, under an independent sub-flag, never-raise, restart-safe; hot-claim
fail-OPEN and freeze fail-CLOSED preserved. STAGE_TRANSITIONS / QG_CHECKS / check_*
/ machine-verdict keys / existing table schemas are byte-for-byte untouched (this is
a queue-scheduler + observability change, not a Quality Gate).
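
As an illustration of what "additive" and "restart-safe" can mean for the new column, here is a sketch of an idempotent column add in SQLite. The project's own `_ensure_column` is not shown in this commit, so `ensure_column` below is an assumed stand-in, not its actual implementation.

```python
import sqlite3


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add `column` to `table` if it is missing; return True only when added.

    `table`, `column` and `decl` are interpolated into SQL, so they must be
    trusted constants, never user input.
    """
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False  # already present: running again changes nothing
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT)")
conn.execute("INSERT INTO tasks (id, stage) VALUES (1, 'development')")

first = ensure_column(conn, "tasks", "paused_at", "TEXT")
second = ensure_column(conn, "tasks", "paused_at", "TEXT")
existing_value = conn.execute(
    "SELECT paused_at FROM tasks WHERE id = 1"
).fetchone()[0]
```

Rows that existed before the column was added read back as NULL, which is why prior behaviour holds until the first explicit pause.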

- src/db.py: additive tasks.paused_at column (_ensure_column) + set/clear/is helpers
- src/serial_gate.py: _pause_layer_enabled() + pause-term in the 3 points; `paused`
  list + per-job `reason` (freeze>dependency>active-task>null) in the /queue snapshot
- src/config.py + .env.example: serial_gate_pause_enabled (default True; a true no-op until a task is explicitly paused)
- src/main.py: POST /serial-gate/pause|resume?work_item=<id> (modeled on unfreeze)
- tests/test_orch124_serial_gate_pause.py: TC-01 mandatory incident regression + TC-02..15
- CHANGELOG.md: [Unreleased] entry
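
The per-job `reason` precedence listed for the /queue snapshot (freeze > dependency > active-task > null) can be sketched as a small pure function. The function name, its boolean parameters, and the exact reason strings are assumptions for illustration; the real snapshot code in src/serial_gate.py is not shown here.

```python
from typing import Optional


def queue_reason(
    frozen: bool,
    blocked_by_dependency: bool,
    blocked_by_active_task: bool,
) -> Optional[str]:
    """Pick the single highest-priority reason a queued job is held back."""
    if frozen:
        return "freeze"
    if blocked_by_dependency:
        return "dependency"
    if blocked_by_active_task:
        return "active-task"
    return None  # nothing holds the job; it is claimable


# A frozen repo wins over every other reason.
all_blocked = queue_reason(True, True, True)
# A paused predecessor no longer sets blocked_by_active_task, but a declared
# dependency on it still blocks, since pause is a separate axis.
dependency_only = queue_reason(False, True, False)
claimable = queue_reason(False, False, False)
```

Reporting only the highest-priority reason keeps the snapshot to one value per job while still telling an operator which blocker to clear first.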

ADR: docs/work-items/ORCH-124/06-adr/ADR-001-serial-gate-pause-without-blocking.md
Cross-cutting: docs/architecture/adr/adr-0051-serial-gate-pause-without-blocking.md

Refs: ORCH-124

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 19:35:55 +03:00
parent de4f067655
commit 87af857082
8 changed files with 683 additions and 8 deletions

src/db.py

@@ -147,6 +147,17 @@ def init_db():
# after a successful atomic create). Read in advance_stage for the routing-override
# (skips architecture) — from the DB, NEVER from the network (NFR-4).
_ensure_column(conn, "tasks", "track", "TEXT DEFAULT 'full'")
# ORCH-124 (08-data-requirements.md, ADR-001 D2): per-task durable "park"
# signal for the serial gate. Additive, idempotent (_ensure_column is a no-op
# once present) -> safe on the live shared prod DB (enduro untouched), exactly
# like tasks.cancelled_at / tasks.cancel_requested_at / tasks.track above.
# paused_at -> NULL = not paused; ISO timestamp (datetime('now')) = an
# operator explicitly parked the task (POST /serial-gate/pause).
# Read ONLY by the serial-gate "active task" predicate (ORTHOGONAL to the
# {done,cancelled} terminal axis — task_deps/stages.py do NOT read it, adr-0026
# is untouched). All existing rows default to NULL -> pre-ORCH-124 behaviour
# holds until the first explicit operator pause.
_ensure_column(conn, "tasks", "paused_at", "TEXT")
# ORCH-026 (Level B): declarative task dependencies. job_deps stores the
# directed edge "task_id (B) is blocked-by depends_on_task_id (A)". The
# scheduler gate in claim_next_job keeps B queued until every A reaches
@@ -776,6 +787,95 @@ def get_task_track(task_id: int) -> str:
return "full"
# ---------------------------------------------------------------------------
# ORCH-124: serial-gate per-task park signal (tasks.paused_at) helpers
# ---------------------------------------------------------------------------
def set_task_paused(task_id: int) -> bool:
    """ORCH-124 (ADR-001 D7): park a task for the serial gate (idempotent).

    Stamps ``tasks.paused_at=datetime('now')`` so the serial-gate "active task"
    predicate stops counting this task as a FIFO blocker (an URGENT successor may
    overtake it). Durable (survives restart) and DB-resolvable — the hot-claim SQL
    reads it locally without any network call. Re-pausing an already-paused task
    keeps the original timestamp (``WHERE paused_at IS NULL``), so the park moment
    is stable. never-raise -> False on error (a write failure must not crash the
    operator endpoint / worker).
    """
    if task_id is None:
        return False
    try:
        conn = get_db()
        try:
            conn.execute(
                "UPDATE tasks SET paused_at=datetime('now') "
                "WHERE id=? AND paused_at IS NULL",
                (task_id,),
            )
            conn.commit()
        finally:
            conn.close()
        return True
    except Exception as e:  # noqa: BLE001 - never-raise
        import logging

        logging.getLogger("orchestrator.db").warning(
            "set_task_paused error for task %s: %s", task_id, e
        )
        return False


def clear_task_paused(task_id: int) -> bool:
    """ORCH-124 (ADR-001 D7): resume a parked task (idempotent).

    Clears ``tasks.paused_at`` back to NULL so the task re-enters the serial-gate
    FIFO (holds the gate as active again, or re-enters with a deferred branch cut —
    see ADR-001 D8). Resuming a task that is not paused is a no-op. never-raise ->
    False on error.
    """
    if task_id is None:
        return False
    try:
        conn = get_db()
        try:
            conn.execute(
                "UPDATE tasks SET paused_at=NULL WHERE id=?",
                (task_id,),
            )
            conn.commit()
        finally:
            conn.close()
        return True
    except Exception as e:  # noqa: BLE001 - never-raise
        import logging

        logging.getLogger("orchestrator.db").warning(
            "clear_task_paused error for task %s: %s", task_id, e
        )
        return False


def is_task_paused(task_id: int) -> bool:
    """ORCH-124: read whether a task is currently parked; missing/error -> False.

    Conservative fail direction (ADR-001 D9): on any read error we report "not
    paused" so the task is treated as active -> the serial gate stays CLOSED rather
    than wrongly opening (anti-stale-base safe). Mirror of ``get_task_track``.
    """
    if task_id is None:
        return False
    try:
        conn = get_db()
        try:
            row = conn.execute(
                "SELECT paused_at FROM tasks WHERE id=?", (task_id,)
            ).fetchone()
        finally:
            conn.close()
        if not row:
            return False
        return row["paused_at"] is not None
    except Exception:  # noqa: BLE001 - conservative: not paused -> stays active
        return False


# ---------------------------------------------------------------------------
# Telegram live tracker helpers (feat/telegram-live-tracker)
# ---------------------------------------------------------------------------
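
The idempotency of the pause/resume helpers in src/db.py above can be checked with a small sketch against an in-memory database. The `pause`, `resume` and `paused_at` functions here are simplified stand-ins that share one connection instead of calling `get_db()`, and they take an explicit timestamp in place of `datetime('now')` so the outcome is deterministic; both are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, paused_at TEXT)")
conn.execute("INSERT INTO tasks (id, paused_at) VALUES (1, NULL)")


def pause(task_id: int, stamp: str) -> None:
    # Same guard as set_task_paused: only stamp a task that is not yet parked.
    conn.execute(
        "UPDATE tasks SET paused_at=? WHERE id=? AND paused_at IS NULL",
        (stamp, task_id),
    )


def resume(task_id: int) -> None:
    # Same statement as clear_task_paused: unconditional reset to NULL.
    conn.execute("UPDATE tasks SET paused_at=NULL WHERE id=?", (task_id,))


def paused_at(task_id: int):
    row = conn.execute(
        "SELECT paused_at FROM tasks WHERE id=?", (task_id,)
    ).fetchone()
    return row[0] if row else None


pause(1, "2026-06-16 10:00:00")
pause(1, "2026-06-16 11:00:00")  # re-pause: guard matches no row
after_repause = paused_at(1)

resume(1)
resume(1)  # resuming an unpaused task changes nothing
after_resume = paused_at(1)
```

The second `pause` call leaves the first timestamp in place, so the park moment stays stable, and a repeated `resume` is harmless.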