fix(serial-gate): pause-without-blocking via per-task park signal (ORCH-124) #144

Merged
admin merged 13 commits from feature/ORCH-124-bug-serial-gate-treats-backlog into main 2026-06-16 22:46:02 +03:00

ORCH-124: Serial-gate "pause without blocking" (per-task park signal)

Type: fix (label Bug, escalated to full-cycle) · Stage: development · Gate: check_ci_green

Problem (incident ORCH-116/ORCH-123)

serial_gate determined a repo's "active task" solely from the machine stage,
tasks.stage NOT IN ('done','cancelled'). The Plane statuses Backlog/Blocked/Needs-Input
(indication layer B, ORCH-066) do not change tasks.stage (layer A), so a paused
predecessor was indistinguishable from an active one and kept the FIFO gate closed against
an urgent successor. ORCH-116 was paused to let the ORCH-123 fix through, but the fix did
not start until ORCH-116 was formally done.

Solution (ADR-001 / adr-0051)

An explicit per-task park signal, the additive column tasks.paused_at TEXT, plus a new
orthogonal scheduler axis, "pause". For serial-gate, a task is "active" ⇔
stage NOT IN ('done','cancelled') AND paused_at IS NULL at all 3 points, under the sub-flag.

  • The terminal set {done,cancelled} is byte-for-byte unchanged (adr-0026 does not regress):
    task_deps/stages.py do not read the paused_at column, so a paused dependency and
    repo_freeze still block (pause does not bypass them; they are different axes).
  • Anti-stale-base on resume relies on existing mechanisms (the deferred branch cut from ORCH-088 +
    pre-merge auto_rebase_onto_main + merge-gate re-test ORCH-026/093/110); there is no new
    rebase machinery. A normal task (paused_at IS NULL) still holds the gate.
  • Additive, under the independent sub-flag serial_gate_pause_enabled (default True =
    a true no-op until a task is explicitly paused), never-raise, hot-claim fail-OPEN preserved.
  • STAGE_TRANSITIONS / QG_CHECKS / check_* / machine-verdict keys / existing table
    schemas are byte-for-byte untouched.
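The predicate change can be illustrated with a minimal sketch. It assumes a SQLite-style `tasks` table with hypothetical `work_item` and `repo` columns; only `stage`, `paused_at`, and the function name `repo_has_active_task` appear in this PR, and the function signature and the helper `active_task_clause` are invented here for illustration.

```python
import sqlite3


def active_task_clause(pause_enabled: bool) -> str:
    """Return the SQL predicate for "task counts as active for the serial gate"."""
    # The terminal set {done, cancelled} is the same with or without the pause layer.
    clause = "stage NOT IN ('done','cancelled')"
    if pause_enabled:
        # Orthogonal pause axis: a parked task no longer counts as active.
        clause += " AND paused_at IS NULL"
    return clause


def repo_has_active_task(conn: sqlite3.Connection, repo: str, pause_enabled: bool = True) -> bool:
    sql = f"SELECT 1 FROM tasks WHERE repo = ? AND {active_task_clause(pause_enabled)} LIMIT 1"
    return conn.execute(sql, (repo,)).fetchone() is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (work_item TEXT, repo TEXT, stage TEXT, paused_at TEXT)")
conn.execute("INSERT INTO tasks VALUES ('ORCH-116', 'orchestrator', 'development', NULL)")

# An unpaused, non-terminal task holds the gate.
blocked_before_pause = repo_has_active_task(conn, "orchestrator")

# Park the predecessor (the timestamp value is purely illustrative).
conn.execute("UPDATE tasks SET paused_at = '2000-01-01T00:00:00+00:00' WHERE work_item = 'ORCH-116'")
blocked_after_pause = repo_has_active_task(conn, "orchestrator")

# With the sub-flag off, the predicate falls back to the stage-only check.
blocked_flag_off = repo_has_active_task(conn, "orchestrator", pause_enabled=False)
```

With the pause layer on, the parked ORCH-116 row stops matching, so an urgent successor in the same repo would see an open gate; with the flag off, the old stage-only behavior returns.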

Touch points

| File | Change |
|------|--------|
| src/db.py | column tasks.paused_at (_ensure_column) + set_task_paused/clear_task_paused/is_task_paused |
| src/serial_gate.py | _pause_layer_enabled(); pause term at the 3 points; paused key + reason in the snapshot |
| src/config.py, .env.example | serial_gate_pause_enabled (env ORCH_SERIAL_GATE_PAUSE_ENABLED, default True) |
| src/main.py | POST /serial-gate/pause\|resume?work_item=<id> (modeled on unfreeze) |
| tests/test_orch124_serial_gate_pause.py | TC-01 mandatory incident regression + TC-02…TC-15 |
| CHANGELOG.md | [Unreleased] entry |

Docs (docs/architecture/README.md/internals.md, ADR adr-0051) are already in the branch (architect).
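As a rough sketch of what the src/db.py changes might look like, assuming SQLite and a `work_item` key column: the helper names come from the table above, but their signatures, return values, and the ISO-8601 UTC timestamp format are assumptions, not the PR's actual code.

```python
import sqlite3
from datetime import datetime, timezone


def ensure_column(conn: sqlite3.Connection, table: str, column: str, decl: str) -> None:
    """Add `column` to `table` only if it is missing (idempotent, additive)."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")


def set_task_paused(conn: sqlite3.Connection, work_item: str) -> bool:
    """Park a task; returns True if exactly one row was updated."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute("UPDATE tasks SET paused_at = ? WHERE work_item = ?", (now, work_item))
    return cur.rowcount == 1


def clear_task_paused(conn: sqlite3.Connection, work_item: str) -> bool:
    """Resume a task by resetting the park signal to NULL."""
    cur = conn.execute("UPDATE tasks SET paused_at = NULL WHERE work_item = ?", (work_item,))
    return cur.rowcount == 1


def is_task_paused(conn: sqlite3.Connection, work_item: str) -> bool:
    row = conn.execute("SELECT paused_at FROM tasks WHERE work_item = ?", (work_item,)).fetchone()
    return row is not None and row[0] is not None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (work_item TEXT PRIMARY KEY, stage TEXT)")
ensure_column(conn, "tasks", "paused_at", "TEXT")
ensure_column(conn, "tasks", "paused_at", "TEXT")  # second call is a no-op
conn.execute("INSERT INTO tasks (work_item, stage) VALUES ('ORCH-116', 'development')")

paused_initially = is_task_paused(conn, "ORCH-116")
set_ok = set_task_paused(conn, "ORCH-116")
paused_after_set = is_task_paused(conn, "ORCH-116")
clear_ok = clear_task_paused(conn, "ORCH-116")
paused_after_clear = is_task_paused(conn, "ORCH-116")
```

Because the column is nullable and added only when missing, existing rows keep `paused_at IS NULL`, which is why the change behaves as a no-op until a task is explicitly paused.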

Checks

  • tests/test_orch124_serial_gate_pause.py: 15 passed (TC-01 is red before the fix, green after).
  • serial-gate regression (test_serial_gate*.py, test_orch026_*): green.
  • Full pytest tests/: all green except one pre-existing failure unrelated to ORCH-124,
    test_orch123_staging_runner_exec.py::test_r2_held_deploy_staging_not_rolled_back
    (cross-test contamination: it also fails on a clean base with the changes stashed and
    passes in isolation; it concerns the ORCH-123 staging gate, which ORCH-124 does not touch).

Refs: ORCH-124

admin added 4 commits 2026-06-16 19:36:55 +03:00
docs: init ORCH-124 business request
All checks were successful
CI / test (push) Successful in 1m14s
569abee5f2
analyst(ET): auto-commit from analyst run_id=763
All checks were successful
CI / test (push) Successful in 1m9s
fef5ba15d5
architect(ET): auto-commit from architect run_id=764
All checks were successful
CI / test (push) Successful in 1m12s
de4f067655
fix(serial-gate): pause-without-blocking via per-task park signal (ORCH-124)
All checks were successful
CI / test (push) Successful in 1m12s
CI / test (pull_request) Successful in 1m17s
87af857082
Fixes incident ORCH-116/ORCH-123: serial_gate defined a repo's "active task"
purely by machine stage (tasks.stage NOT IN ('done','cancelled')). Plane statuses
Backlog/Blocked/Needs-Input (layer-B indication, ORCH-066) do NOT change
tasks.stage (layer A), so a paused predecessor was indistinguishable from an active
one and held the FIFO gate closed against an urgent successor — the urgent fix
could not start until the paused task was formally done.

Introduces an explicit, durable, DB-resolvable per-task "park" signal — additive
nullable column tasks.paused_at (pattern of cancelled_at/track) — and a new
ORTHOGONAL scheduler "pause" axis. The serial-gate "active task" predicate becomes
`stage NOT IN ('done','cancelled') AND paused_at IS NULL` across all three points
(build_claim_clause / repo_has_active_task / _per_repo_snapshot). The terminal set
{done,cancelled} in serial_gate/task_deps/stages.py is byte-for-byte unchanged
(adr-0026 not regressed): task_deps/stages.py do NOT read paused_at, so a paused
declared dependency and an active repo_freeze STILL block (pause never bypasses
them — different axes). Anti-stale-base on resume relies on the existing deferred
branch cut (ORCH-088) + pre-merge auto_rebase_onto_main + merge-gate re-test
(ORCH-026/093/110) — no new rebase machinery.

Additive, under an independent sub-flag, never-raise, restart-safe; hot-claim
fail-OPEN and freeze fail-CLOSED preserved. STAGE_TRANSITIONS / QG_CHECKS / check_*
/ machine-verdict keys / existing table schemas are byte-for-byte untouched (this is
a queue-scheduler + observability change, not a Quality Gate).

- src/db.py: additive tasks.paused_at column (_ensure_column) + set/clear/is helpers
- src/serial_gate.py: _pause_layer_enabled() + pause-term in the 3 points; `paused`
  list + per-job `reason` (freeze>dependency>active-task>null) in the /queue snapshot
- src/config.py + .env.example: serial_gate_pause_enabled (default True = true no-op)
- src/main.py: POST /serial-gate/pause|resume?work_item=<id> (modeled on unfreeze)
- tests/test_orch124_serial_gate_pause.py: TC-01 mandatory incident regression + TC-02..15
- CHANGELOG.md: [Unreleased] entry

ADR: docs/work-items/ORCH-124/06-adr/ADR-001-serial-gate-pause-without-blocking.md
Cross-cutting: docs/architecture/adr/adr-0051-serial-gate-pause-without-blocking.md

Refs: ORCH-124

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
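The per-job `reason` priority mentioned in the commit message above (freeze > dependency > active-task > null) could be sketched as follows. The function name `blocking_reason` and its boolean inputs are hypothetical; only the reason labels and their order come from the commit message.

```python
from typing import Optional


def blocking_reason(frozen: bool, dependency_open: bool, active_predecessor: bool) -> Optional[str]:
    """Return the highest-priority reason a queued job is held, or None if it may start."""
    if frozen:
        return "freeze"
    if dependency_open:
        return "dependency"
    if active_predecessor:
        return "active-task"
    return None


# Pausing the predecessor clears only the "active-task" axis...
after_pause = blocking_reason(frozen=False, dependency_open=False, active_predecessor=False)

# ...while a declared dependency on the paused task, or a repo freeze, still blocks.
paused_dependency = blocking_reason(frozen=False, dependency_open=True, active_predecessor=False)
frozen_repo = blocking_reason(frozen=True, dependency_open=True, active_predecessor=True)
```

This mirrors the "different axes" point: the pause signal feeds only the active-task input, so it cannot mask a freeze or a dependency.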
admin added 1 commit 2026-06-16 19:49:25 +03:00
reviewer(ET): auto-commit from reviewer run_id=766
All checks were successful
CI / test (push) Successful in 1m23s
CI / test (pull_request) Successful in 1m15s
7ac83a9731
admin added 1 commit 2026-06-16 19:51:09 +03:00
test(ORCH-116): test gate FAIL for ORCH-124
All checks were successful
CI / test (push) Successful in 1m23s
CI / test (pull_request) Successful in 1m15s
c7336dd9ea
admin added 1 commit 2026-06-16 20:12:34 +03:00
fix(tests): isolate repos_dir in ORCH-123 staging-runner test fixture
All checks were successful
CI / test (push) Successful in 1m13s
CI / test (pull_request) Successful in 1m12s
3a1972875f
The deterministic test-runner gate (full `pytest tests/`) failed on
test_orch123_staging_runner_exec.py::test_r2_held_deploy_staging_not_rolled_back
once ORCH-124 reached the testing stage.

Root cause (pre-existing latent regression, surfaced but not introduced by
ORCH-124): the fixture isolated `worktrees_dir` but not `repos_dir`.
`check_staging_status` falls back to `<repos_dir>/<repo>` (and its
origin/main) when the feature worktree is absent. After ORCH-123 merged,
the real `/repos/orchestrator/docs/work-items/ORCH-123/15-staging-log.md`
(verdict SUCCESS) exists on disk, so the intended-RED staging gate read it
and went green -> advance_stage was called -> the R-2 assertion failed.
Order-dependent: the test passed alone, failed in the full suite.

Fix: isolate `settings.repos_dir` to an empty tmp subdir in the fixture
(mirroring the existing worktrees_dir isolation) so the staging gate is
deterministically "not found" -> red, regardless of suite ordering. The
ORCH-123 R-2 invariant (a held deploy-staging task is never rolled back to
development, adr-0049/ADR-001 D4) is preserved and strengthened — the fix
only restores the test's stated premise. src/** / STAGE_TRANSITIONS /
QG_CHECKS / check_* untouched (test-only change).

Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
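The order-dependent failure described in this commit message can be reproduced in miniature. The sketch below assumes a hypothetical lookup `find_staging_log` that checks `<worktrees_dir>/<repo>` first and then falls back to `<repos_dir>/<repo>`; the real `check_staging_status` logic, the worktree layout, and the log file's content format are not shown in this PR and are assumptions here.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

LOG_REL = Path("docs/work-items/ORCH-123/15-staging-log.md")


def find_staging_log(settings, repo: str):
    """Hypothetical lookup: feature worktree first, then fall back to <repos_dir>/<repo>."""
    for base in (Path(settings.worktrees_dir) / repo, Path(settings.repos_dir) / repo):
        candidate = base / LOG_REL
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)

    # Stand-in for the real on-disk repo that already contains a merged staging log.
    real_repos = tmp / "real_repos"
    log = real_repos / "orchestrator" / LOG_REL
    log.parent.mkdir(parents=True)
    log.write_text("verdict: SUCCESS\n")  # content format is illustrative only

    empty_worktrees = tmp / "worktrees"
    empty_worktrees.mkdir()
    empty_repos = tmp / "repos"
    empty_repos.mkdir()

    # Before the fix: only worktrees_dir is isolated, so the fallback finds the real log.
    leaky = SimpleNamespace(worktrees_dir=empty_worktrees, repos_dir=real_repos)
    # After the fix: repos_dir is isolated as well, so the gate sees "not found".
    isolated = SimpleNamespace(worktrees_dir=empty_worktrees, repos_dir=empty_repos)

    leak_found = find_staging_log(leaky, "orchestrator") is not None
    isolated_found = find_staging_log(isolated, "orchestrator") is not None
```

In a pytest fixture the same effect would typically come from pointing both settings at subdirectories of `tmp_path`, so the result no longer depends on what exists on the host or on suite ordering.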
admin added 1 commit 2026-06-16 20:24:57 +03:00
reviewer(ET): auto-commit from reviewer run_id=768
All checks were successful
CI / test (push) Successful in 1m16s
CI / test (pull_request) Successful in 1m12s
ec932264db
admin added 1 commit 2026-06-16 21:50:50 +03:00
docs(serial-gate): sync system showcase + clean stray tags (ORCH-124)
All checks were successful
CI / test (push) Successful in 1m15s
CI / test (pull_request) Successful in 1m12s
58e5dfe55d
Addresses reviewer REQUEST_CHANGES (run 768) on ORCH-124 — docs-only,
no src/tests touched, fix scope unchanged.

P1: update docs/overview/ showcase for the new serial-gate "pause without
blocking" axis (changed task-routing functionality, ORCH-011/ORCH-079):
- tech-pipeline.md: FIFO exception "pause without blocking" next to freeze
- tech-data-model.md: durable signal tasks.paused_at on the Task row
- tech-observability.md: paused/reason in serial_gate GET /queue block +
  operator endpoints POST /serial-gate/pause|resume

P2: strip leaked tool-call trailing tags (</content>/</invoke>) from 4
golden-source docs of this PR (06-adr/ADR-001, adr-0051,
08-data-requirements.md, 10-tech-risks.md).

CHANGELOG "Доки" bullet extended accordingly. Full suite green (2178 passed);
test_system_docs.py green (machine-checked showcase facts intact).

Refs: ORCH-124
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
admin added 1 commit 2026-06-16 22:31:51 +03:00
reviewer(ET): auto-commit from reviewer run_id=772
All checks were successful
CI / test (push) Successful in 1m18s
CI / test (pull_request) Successful in 1m13s
be8ddfcd57
admin added 1 commit 2026-06-16 22:33:34 +03:00
test(ORCH-116): test gate PASS for ORCH-124
Some checks failed
CI / test (push) Has been cancelled
CI / test (pull_request) Successful in 1m16s
b61a4eb092
admin added 1 commit 2026-06-16 22:35:09 +03:00
staging(ORCH-115): staging gate SUCCESS for ORCH-124
All checks were successful
CI / test (push) Successful in 1m19s
CI / test (pull_request) Successful in 1m12s
9709aa2267
admin merged commit 8e2281aab4 into main 2026-06-16 22:46:02 +03:00
Reference: admin/orchestrator#144