FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
/proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
extension of the hot-path get_running_jobs()), agent_cost_totals(),
queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
by architect); .env.example ORCH_METRICS_ENABLED.
Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A DB stage=done task with 0 active jobs flapped in Plane between `Awaiting
Deploy` and `Monitoring after Deploy` instead of holding `Done` (verified live
on ORCH-061, task 47): the three deploy-phase setters were terminal-blind, so
any stale/duplicate/unknown caller under the bot token re-stamped an
intermediate status over the terminal Done, forever.
- New leaf src/deploy_status_guard.py (pure, never-raise, config-gated): decide()
-> ALLOW | CONVERGE_DONE | SUPPRESS on the entry of set_issue_awaiting_deploy /
set_issue_deploying / set_issue_monitoring. A deploy-phase status is legitimate
iff the task is non-terminal OR (done AND post-deploy window active); otherwise
done converges to Done idempotently, cancelled is suppressed (FR-2, D1/D2).
- D3: move post_deploy.arm_monitor ABOVE the terminal-sync block in advance_stage
so window_active is True when the legitimate first Monitoring is set (the task
is already DB-done by then); a re-drive after the window closes converges to Done.
- D4: run_post_deploy_monitor no-ops without a status PATCH / re-queue when the
task became cancelled mid-window (zombie-tick guard, FR-3).
- D5: additive `reason` kwarg on the three setters + one structured log line per
verdict (work_item/caller/target/db_stage/window_active/verdict); new read-only
db.get_task_by_work_item_id; post_deploy.window_active helper.
- Flags deploy_status_guard_enabled (kill-switch -> 1:1) / deploy_status_guard_repos
(CSV; empty = self-hosting only). STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched (reads existing tasks.stage).
Tests: TC-01..TC-12 across 5 new test modules + config flags; updated the
reason-kwarg assertions in test_deploy_terminal_sync / test_deploy_approve.
Full regress green (1413). Docs: CHANGELOG, CLAUDE.md, docs/architecture/README.md
(status -> реализовано), .env.example.
Refs: ORCH-094
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
merge_pr now wraps ONLY the mutating POST /pulls/{n}/merge in a bounded
exponential-backoff retry-loop on TRANSIENT outcomes (405 "try again later",
408, any 5xx, network/timeout, and 409|422 while the PR is still mergeable);
TERMINAL outcomes (403/404/real conflict via mergeable==False) -> fast honest
False, so the ORCH-071/081 not-merged HOLD backstop is unchanged. Fixes the
ORCH-063 false HOLD + manual re-merge on Gitea's post-push mergeability hiccup.
ensure_open_pr gains an "already fully in main" guard (_branch_fully_in_main,
git merge-base --is-ancestor HEAD origin/main) BEFORE creating a PR -> new
"already-in-main" outcome avoids the garbage empty PR on a re-driven finalizer;
_handle_merge_verify skips merge_pr on that outcome and lets the authoritative
SHA-in-main check confirm -> done (not a HOLD). git error of the guard fails
OPEN to the create path.
New ORCH_MERGE_RETRY_* settings (kill-switch merge_retry_enabled -> one-shot,
max_attempts=3, backoff base=2/max=5). INV-4 (merge only via Gitea PR-merge API,
never push/force-push main), never-raise, STAGE_TRANSITIONS/QG_CHECKS/DB schema
unchanged. Docs (README merge-verify section, CLAUDE.md, CHANGELOG, .env.example)
updated in the same PR. Tests: test_merge_gate.py TC-01..12, test_config.py
TC-13, test_merge_verify.py TC-14..16; full suite green (1389).
Refs: ORCH-093
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>