fix(reconciler): stop F-2 livelock spam on synced terminal tasks + cache TTL #70

Closed
admin wants to merge 0 commits from feature/ORCH-068-bug-reconciler-livelock-unbloc into main
Owner

Summary

Fixes the reconciler F-2 livelock that spammed Telegram with <wi> разблокирована (потерян webhook) ("<wi> unblocked (webhook lost)") every ~120s for a fully synchronized Done task (incident ET-002, 191+ messages/night) after the ORCH-066 Plane status-model merge. Two stacked defects are fixed independently (defense in depth), plus a secondary states-cache bug.

  • D1 (selection): terminal states (Done/Cancelled) are excluded from the actionable set by Plane state group (completed/cancelled) — project-independent, robust to UUID aliasing after status renames. Per-issue check + logical-key fallback. get_project_states caches {uuid → group} from the same /states/ fetch; new accessor get_project_state_groups.
  • D2 (notification): _note_unblock fires only on a confirmed state change (stage before/after _dispatch; task-appears for the start case). No-op dispatch → silence. handle_* contracts untouched.
  • TR-3: in-memory dedup guard {issue_id → last unblocked state} as a backstop.
  • TR-4: _STATES_CACHE now has a TTL, ORCH_PLANE_STATES_TTL_S (default 300s; 0 → previous process-lifetime cache), so a status added to Plane after startup is picked up without a restart (reuses reload_project_states()). A failed refresh serves the stale-but-correct set, not enduro defaults.
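The TR-4 behaviour can be illustrated with a minimal sketch. This is not the project's code: `StatesCache`, its constructor arguments and the injected `clock` are hypothetical names chosen for the example, and the retry-on-next-call policy after a failed refresh is an assumption. Only the semantics (TTL refresh, `0` meaning a lifetime cache, stale data served on a failed refresh) come from the description above.

```python
import time
from typing import Callable, Dict, Optional


class StatesCache:
    """Hypothetical stand-in for _STATES_CACHE (illustration only)."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],   # returns {uuid -> group}
        ttl_s: float,                          # 0 -> cache for process lifetime
        defaults: Optional[Dict[str, str]] = None,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._defaults = defaults or {}
        self._clock = clock
        self._groups: Optional[Dict[str, str]] = None
        self._loaded_at = 0.0

    def get(self) -> Dict[str, str]:
        now = self._clock()
        # With ttl_s == 0 the entry never expires (previous behaviour).
        expired = self._ttl_s > 0 and (now - self._loaded_at) >= self._ttl_s
        if self._groups is None or expired:
            try:
                self._groups = self._fetch()
                self._loaded_at = now
            except Exception:
                # Failed refresh: keep whatever was cached before.
                # Assumption: the next call simply tries again.
                pass
        if self._groups is not None:
            return self._groups
        # Nothing was ever loaded: fall back to the defaults.
        return dict(self._defaults)
```

Injecting the clock keeps the sketch testable without sleeping; a real implementation would read the TTL from `ORCH_PLANE_STATES_TTL_S`.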

Invariants preserved: STAGE_TRANSITIONS/QG_CHECKS/DB schema/handle_*/F-1/F-3 unchanged; never-raise per unit of work; 0 jobs/0 tokens for synced tasks; self-hosting tick never restarts prod. Observability: skipped_terminal_total/deduped_total in the /queue reconcile block.
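How D1, D2 and TR-3 stack can be sketched roughly as follows. The `Reconciler` and `Issue` classes, the injected callables and the message text are hypothetical and only mirror the names used in this description. The logical-key fallback for a missing group is omitted, and keying the dedup guard on the state UUID is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional

TERMINAL_GROUPS = frozenset({"completed", "cancelled"})


@dataclass
class Issue:
    issue_id: str
    state_uuid: str


@dataclass
class Reconciler:
    state_groups: Dict[str, str]                 # {uuid -> group}
    get_stage: Callable[[str], Optional[str]]    # None -> task does not exist yet
    dispatch: Callable[[Issue], None]
    notify: Callable[[str], None]
    skipped_terminal_total: int = 0
    deduped_total: int = 0
    _last_unblocked: Dict[str, str] = field(default_factory=dict, init=False)

    def tick(self, issues: Iterable[Issue]) -> None:
        for issue in issues:
            try:
                self._reconcile_one(issue)
            except Exception:
                continue  # never-raise per unit of work

    def _reconcile_one(self, issue: Issue) -> None:
        # D1: exclude terminal issues by state group, not by bare UUID.
        if self.state_groups.get(issue.state_uuid) in TERMINAL_GROUPS:
            self.skipped_terminal_total += 1
            return
        before = self.get_stage(issue.issue_id)
        self.dispatch(issue)
        after = self.get_stage(issue.issue_id)
        # D2: a no-op dispatch stays silent.
        if after == before:
            return
        # TR-3: backstop against repeating the same unblock message.
        if self._last_unblocked.get(issue.issue_id) == issue.state_uuid:
            self.deduped_total += 1
            return
        self._last_unblocked[issue.issue_id] = issue.state_uuid
        self.notify(f"{issue.issue_id} unblocked (webhook lost)")
```

In this sketch a synced Done issue never reaches `dispatch`, so it costs no jobs and produces no message; the two counters correspond to the ones exposed in the `/queue` reconcile block.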

ADR: docs/work-items/ORCH-068/06-adr/ADR-001-reconciler-terminal-exclusion-and-cache-ttl.md

Test plan

  • pytest tests/ -q green (764 passed)
  • TC-01..TC-10 in tests/test_reconciler_plane.py (terminal exclusion incl. UUID aliasing on enduro+orchestrator, no-op silence, dedup, legit-unblock regression, never-raise, kill-switches)
  • TC-11/TC-12 in tests/test_plane_states_cache.py (TTL self-heal, ttl=0 back-compat, fallback preserved)
  • Reviewer confirms docs updated (README / ADR / CHANGELOG / .env.example)

🤖 Generated with Claude Code

admin added 6 commits 2026-06-08 08:18:48 +03:00
Reconciler F-2 spammed Telegram "<wi> разблокирована" ("<wi> unblocked") every ~120s for a
fully-synchronized Done task (incident ET-002, 191+ msgs/night) after the
ORCH-066 Plane status model merge. Two stacked defects (defense in depth):

- D1 (selection): actionable states were told apart by bare UUID, so a Done
  issue aliased onto the approved UUID entered the approved branch. Now
  terminal states are excluded by Plane state GROUP (completed/cancelled),
  a project-independent discriminator robust to UUID aliasing; per-issue
  check with a logical-key fallback when the group is unavailable.
  get_project_states caches {uuid -> group} from the same /states/ fetch;
  new sibling accessor get_project_state_groups.
- D2 (notification): _note_unblock fired unconditionally after _dispatch.
  Now it only fires on a confirmed state change (stage before/after _dispatch;
  task-appears for the start case) — handlers' contracts untouched.
- TR-3: in-memory dedup guard {issue_id -> last unblocked state} as a backstop.
- TR-4: _STATES_CACHE lived for the whole process lifetime, so a new Plane
  status was invisible without a restart. Added TTL ORCH_PLANE_STATES_TTL_S
  (default 300s; 0 = previous lifetime cache) reusing reload_project_states();
  a failed refresh serves the stale-but-correct set, not enduro defaults.

STAGE_TRANSITIONS / QG_CHECKS / DB schema / handle_* contracts / F-1 / F-3
unchanged; never-raise preserved; self-hosting tick never restarts prod.
Observability: skipped_terminal_total / deduped_total in /queue reconcile block.

Tests: tests/test_reconciler_plane.py (TC-01..TC-10),
tests/test_plane_states_cache.py (TC-11/TC-12).

Refs: ORCH-068

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=351
All checks were successful
CI / test (push) Successful in 19s
CI / test (pull_request) Successful in 21s
6bbd530caa
admin force-pushed feature/ORCH-068-bug-reconciler-livelock-unbloc from 53bc54c212 to 6bbd530caa 2026-06-08 08:18:48 +03:00 Compare
admin added 1 commit 2026-06-08 08:34:25 +03:00
deploy(ORCH-036): finalize SUCCESS for ORCH-068
All checks were successful
CI / test (push) Successful in 19s
CI / test (pull_request) Successful in 20s
aa4161fc78
admin added 1 commit 2026-06-08 08:49:29 +03:00
docs(ORCH-021): post-deploy HEALTHY/NONE for ORCH-068
All checks were successful
CI / test (push) Successful in 17s
CI / test (pull_request) Successful in 18s
06271b0bfb
Author
Owner

Superseded by #71 (restore-main 2026-06-08): ORCH-068 code restored to main after phantom-merge.

admin closed this pull request 2026-06-08 10:55:33 +03:00
Reference: admin/orchestrator#70