fix(serial-gate): pause-without-blocking via per-task park signal (ORCH-124)

Fixes incident ORCH-116/ORCH-123: serial_gate defined a repo's "active task"
purely by machine stage (tasks.stage NOT IN ('done','cancelled')). Plane statuses
Backlog/Blocked/Needs-Input (layer-B indication, ORCH-066) do NOT change
tasks.stage (layer A), so a paused predecessor was indistinguishable from an active
one and held the FIFO gate closed against an urgent successor: the urgent fix
could not start until the paused task was formally done.

Introduces an explicit, durable, DB-resolvable per-task "park" signal: an additive
nullable column tasks.paused_at (same pattern as cancelled_at/track), plus a new
ORTHOGONAL scheduler "pause" axis. The serial-gate "active task" predicate becomes
`stage NOT IN ('done','cancelled') AND paused_at IS NULL` at all three points
(build_claim_clause / repo_has_active_task / _per_repo_snapshot). The terminal set
{done,cancelled} in serial_gate/task_deps/stages.py is byte-for-byte unchanged
(adr-0026 not regressed): task_deps/stages.py do NOT read paused_at, so a paused
declared dependency and an active repo_freeze STILL block (pause never bypasses
them; they are different axes). Anti-stale-base on resume relies on the existing
deferred branch cut (ORCH-088) + pre-merge auto_rebase_onto_main + merge-gate
re-test (ORCH-026/093/110); no new rebase machinery is added.

Additive, under an independent sub-flag, never-raise, restart-safe; hot-claim
fail-OPEN and freeze fail-CLOSED preserved. STAGE_TRANSITIONS / QG_CHECKS / check_*
/ machine-verdict keys / existing table schemas are byte-for-byte untouched (this is
a queue-scheduler + observability change, not a Quality Gate).

- src/db.py: additive tasks.paused_at column (_ensure_column) + set/clear/is helpers
- src/serial_gate.py: _pause_layer_enabled() + pause-term in the 3 points; `paused`
  list + per-job `reason` (freeze>dependency>active-task>null) in the /queue snapshot
- src/config.py + .env.example: serial_gate_pause_enabled (default True = true no-op)
- src/main.py: POST /serial-gate/pause|resume?work_item=<id> (modeled on unfreeze)
- tests/test_orch124_serial_gate_pause.py: TC-01 mandatory incident regression + TC-02..15
- CHANGELOG.md: [Unreleased] entry

ADR: docs/work-items/ORCH-124/06-adr/ADR-001-serial-gate-pause-without-blocking.md
Cross-cutting: docs/architecture/adr/adr-0051-serial-gate-pause-without-blocking.md
Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
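
The effect of the predicate change described in the message can be reproduced against an in-memory SQLite table. This is only a sketch, not the code in src/serial_gate.py: the `repo` column, the stage value `'development'`, and the function `repo_has_active_task_sketch` are assumptions made for illustration. Only the two predicate strings are taken from the commit message.

```python
import sqlite3

# Predicate text from the commit message; everything else here is illustrative.
OLD_ACTIVE = "stage NOT IN ('done','cancelled')"
NEW_ACTIVE = "stage NOT IN ('done','cancelled') AND paused_at IS NULL"


def repo_has_active_task_sketch(conn, repo, predicate):
    """Return True if any task in `repo` matches the given 'active' predicate."""
    row = conn.execute(
        f"SELECT 1 FROM tasks WHERE repo = ? AND {predicate} LIMIT 1", (repo,)
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, repo TEXT, stage TEXT, paused_at TEXT)"
)
# A predecessor that was parked while still in a non-terminal stage
# (the stage name is a made-up example).
conn.execute(
    "INSERT INTO tasks (repo, stage, paused_at) "
    "VALUES ('demo', 'development', datetime('now'))"
)
conn.commit()

blocked_before = repo_has_active_task_sketch(conn, "demo", OLD_ACTIVE)
blocked_after = repo_has_active_task_sketch(conn, "demo", NEW_ACTIVE)
print(blocked_before, blocked_after)  # True False
```

Under the old predicate the parked task still counts as active and would keep the gate closed; with the added `paused_at IS NULL` term it no longer does.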
100
src/db.py
@@ -147,6 +147,17 @@ def init_db():
     # after a successful atomic create). Read in advance_stage for the routing-override
     # (skips architecture) — from the DB, NEVER from the network (NFR-4).
     _ensure_column(conn, "tasks", "track", "TEXT DEFAULT 'full'")
+    # ORCH-124 (08-data-requirements.md, ADR-001 D2): per-task durable "park"
+    # signal for the serial gate. Additive, idempotent (_ensure_column is a no-op
+    # once present) -> safe on the live shared prod DB (enduro untouched), exactly
+    # like tasks.cancelled_at / tasks.cancel_requested_at / tasks.track above.
+    # paused_at -> NULL = not paused; ISO timestamp (datetime('now')) = an
+    # operator explicitly parked the task (POST /serial-gate/pause).
+    # Read ONLY by the serial-gate "active task" predicate (ORTHOGONAL to the
+    # {done,cancelled} terminal axis — task_deps/stages.py do NOT read it, adr-0026
+    # is untouched). All existing rows default to NULL -> pre-ORCH-124 behaviour
+    # holds until the first explicit operator pause.
+    _ensure_column(conn, "tasks", "paused_at", "TEXT")
     # ORCH-026 (Level B): declarative task dependencies. job_deps stores the
     # directed edge "task_id (B) is blocked-by depends_on_task_id (A)". The
     # scheduler gate in claim_next_job keeps B queued until every A reaches
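
The body of `_ensure_column` is not part of this diff. The comments above only state that it is additive and a no-op once the column exists. One common way to get that behaviour in SQLite is sketched below; `ensure_column_sketch` is a hypothetical stand-in, not the project's helper, and its return value is an assumption added for the demonstration.

```python
import sqlite3


def ensure_column_sketch(conn, table, column, decl):
    """Add `column` to `table` unless it already exists; return True if added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, stage TEXT)")
conn.execute("INSERT INTO tasks (stage) VALUES ('development')")

first = ensure_column_sketch(conn, "tasks", "paused_at", "TEXT")
second = ensure_column_sketch(conn, "tasks", "paused_at", "TEXT")
paused_values = [r[0] for r in conn.execute("SELECT paused_at FROM tasks")]
print(first, second, paused_values)  # True False [None]
```

The second call changes nothing, and the row that existed before the migration reads back as NULL. That matches the comment's claim that pre-ORCH-124 behaviour holds until the first explicit pause.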
@@ -776,6 +787,95 @@ def get_task_track(task_id: int) -> str:
     return "full"
 
 
+# ---------------------------------------------------------------------------
+# ORCH-124: serial-gate per-task park signal (tasks.paused_at) helpers
+# ---------------------------------------------------------------------------
+def set_task_paused(task_id: int) -> bool:
+    """ORCH-124 (ADR-001 D7): park a task for the serial gate (idempotent).
+
+    Stamps ``tasks.paused_at=datetime('now')`` so the serial-gate "active task"
+    predicate stops counting this task as a FIFO blocker (an URGENT successor may
+    overtake it). Durable (survives restart) and DB-resolvable — the hot-claim SQL
+    reads it locally without any network call. Re-pausing an already-paused task
+    keeps the original timestamp (``WHERE paused_at IS NULL``), so the park moment
+    is stable. never-raise -> False on error (a write failure must not crash the
+    operator endpoint / worker).
+    """
+    if task_id is None:
+        return False
+    try:
+        conn = get_db()
+        try:
+            conn.execute(
+                "UPDATE tasks SET paused_at=datetime('now') "
+                "WHERE id=? AND paused_at IS NULL",
+                (task_id,),
+            )
+            conn.commit()
+        finally:
+            conn.close()
+        return True
+    except Exception as e:  # noqa: BLE001 - never-raise
+        import logging
+        logging.getLogger("orchestrator.db").warning(
+            "set_task_paused error for task %s: %s", task_id, e
+        )
+        return False
+
+
+def clear_task_paused(task_id: int) -> bool:
+    """ORCH-124 (ADR-001 D7): resume a parked task (idempotent).
+
+    Clears ``tasks.paused_at`` back to NULL so the task re-enters the serial-gate
+    FIFO (holds the gate as active again, or re-enters with a deferred branch cut —
+    see ADR-001 D8). Resuming a task that is not paused is a no-op. never-raise ->
+    False on error.
+    """
+    if task_id is None:
+        return False
+    try:
+        conn = get_db()
+        try:
+            conn.execute(
+                "UPDATE tasks SET paused_at=NULL WHERE id=?",
+                (task_id,),
+            )
+            conn.commit()
+        finally:
+            conn.close()
+        return True
+    except Exception as e:  # noqa: BLE001 - never-raise
+        import logging
+        logging.getLogger("orchestrator.db").warning(
+            "clear_task_paused error for task %s: %s", task_id, e
+        )
+        return False
+
+
+def is_task_paused(task_id: int) -> bool:
+    """ORCH-124: read whether a task is currently parked; missing/error -> False.
+
+    Conservative fail direction (ADR-001 D9): on any read error we report "not
+    paused" so the task is treated as active -> the serial gate stays CLOSED rather
+    than wrongly opening (anti-stale-base safe). Mirror of ``get_task_track``.
+    """
+    if task_id is None:
+        return False
+    try:
+        conn = get_db()
+        try:
+            row = conn.execute(
+                "SELECT paused_at FROM tasks WHERE id=?", (task_id,)
+            ).fetchone()
+        finally:
+            conn.close()
+        if not row:
+            return False
+        return row["paused_at"] is not None
+    except Exception:  # noqa: BLE001 - conservative: not paused -> stays active
+        return False
+
+
 # ---------------------------------------------------------------------------
 # Telegram live tracker helpers (feat/telegram-live-tracker)
 # ---------------------------------------------------------------------------
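
The idempotency claims in the docstrings above (re-pausing keeps the original timestamp, resuming twice is harmless) follow from the two UPDATE statements alone. The sketch below replays those statements against an in-memory table. It is not the project's code: the functions take a connection and an explicit timestamp instead of calling `get_db()` and `datetime('now')`, and the timestamps are arbitrary sample values chosen so the result is deterministic.

```python
import sqlite3


def pause_sketch(conn, task_id, now):
    # Same guard as set_task_paused: only stamp rows that are not parked yet.
    conn.execute(
        "UPDATE tasks SET paused_at=? WHERE id=? AND paused_at IS NULL",
        (now, task_id),
    )


def resume_sketch(conn, task_id):
    # Same statement as clear_task_paused: unconditional reset to NULL.
    conn.execute("UPDATE tasks SET paused_at=NULL WHERE id=?", (task_id,))


def paused_at(conn, task_id):
    row = conn.execute(
        "SELECT paused_at FROM tasks WHERE id=?", (task_id,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, paused_at TEXT)")
conn.execute("INSERT INTO tasks (id) VALUES (1)")

pause_sketch(conn, 1, "2024-01-01 10:00:00")
first_stamp = paused_at(conn, 1)
pause_sketch(conn, 1, "2024-01-01 11:00:00")  # re-pause: the guard skips the row
second_stamp = paused_at(conn, 1)
resume_sketch(conn, 1)
resume_sketch(conn, 1)  # resuming an unparked task changes nothing
after_resume = paused_at(conn, 1)
print(first_stamp, second_stamp, after_resume)
```

The second pause leaves the first timestamp in place because the `paused_at IS NULL` guard matches no rows, and the column reads back as NULL after either resume.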