feat(metrics): lightweight read-only GET /metrics raw-signal endpoint (ORCH-099)

FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
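
For illustration only: a minimal sketch of how a sidecar might turn two raw
cpu_ticks readings into CPU seconds using the clk_tck value from the envelope.
The function name and the handling of missing or decreasing readings are
assumptions, not part of this commit.

```python
from typing import Optional


def cpu_seconds_delta(
    prev_ticks: Optional[int],
    curr_ticks: Optional[int],
    clk_tck: int,
) -> Optional[float]:
    """Convert the tick difference between two snapshots into CPU seconds.

    Hypothetical sidecar-side helper. Returns None when either reading is
    missing (the endpoint reports null for dead or unreadable pids) or when
    the counter went backwards, which would suggest the pid was reused.
    """
    if prev_ticks is None or curr_ticks is None or clk_tck <= 0:
        return None
    delta = curr_ticks - prev_ticks
    if delta < 0:
        return None
    return delta / clk_tck
```

Keeping this arithmetic in the sidecar matches the split described above: the
orchestrator reports raw counters, and the observer derives rates from them.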

- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
  serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
  stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
  /proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
  extension of the hot-path get_running_jobs()), agent_cost_totals(),
  queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
  ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
  actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
  by architect); .env.example ORCH_METRICS_ENABLED.
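
The utime+stime read described for src/metrics.py could look roughly like the
sketch below. It is not the code from this commit: the function names are
hypothetical, and only the field positions (utime is field 14, stime is
field 15 of /proc/<pid>/stat) come from the Linux proc(5) layout.

```python
from typing import Optional


def parse_cpu_ticks(stat_line: str) -> Optional[int]:
    """Return utime + stime from one /proc/<pid>/stat line, or None.

    The comm field (field 2) is wrapped in parentheses and may itself contain
    spaces or parentheses, so split only the text after the LAST ')'.
    """
    try:
        rest = stat_line[stat_line.rindex(")") + 1:].split()
        # rest[0] is field 3 (state), so fields 14 and 15 are rest[11], rest[12].
        return int(rest[11]) + int(rest[12])
    except (ValueError, IndexError):
        return None


def read_cpu_ticks(pid: Optional[int]) -> Optional[int]:
    """Never-raise reader: None for a missing pid, a dead pid, or no /proc."""
    if pid is None:
        return None
    try:
        with open(f"/proc/{pid}/stat", "r") as fh:
            return parse_cpu_ticks(fh.read())
    except (OSError, ValueError):
        return None
```

Separating the parser from the file read keeps the parsing testable on hosts
without /proc.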

Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health, /status and /queue responses
are byte-for-byte unchanged.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
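
A minimal sketch of the never-raise-per-section idea, under stated assumptions:
the helper names, the None fallback, and the schema_version value of 1 are
illustrative and are not taken from src/metrics.py.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict


def safe_section(collect: Callable[[], Any], fallback: Any = None) -> Any:
    """Run one section collector; a failure degrades only that section."""
    try:
        return collect()
    except Exception:
        return fallback


def build_snapshot(collectors: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Assemble an envelope plus one guarded entry per section collector."""
    snapshot: Dict[str, Any] = {
        "schema_version": 1,  # assumed value, for illustration only
        "generated_at": datetime.now(timezone.utc).isoformat(),
    }
    for name, collect in collectors.items():
        snapshot[name] = safe_section(collect)
    return snapshot
```

With this shape a broken collector yields a null section rather than an HTTP
500, which is why the route wrapper can stay free of extra error handling.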

Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 01:58:47 +03:00
committed by orchestrator-deployer
parent 8988dca14d
commit d8793c9698
10 changed files with 739 additions and 5 deletions


@@ -213,6 +213,26 @@ async def queue():
}
@app.get("/metrics")
async def metrics():
    """ORCH-099 (FND/F1a): lightweight read-only raw-signal snapshot for the F1b sidecar.

    A versioned JSON envelope (``schema_version`` / ``generated_at`` / ``clk_tck``)
    with four raw-signal sections — ``stages`` (active task stages + age),
    ``queue`` (counts / retries / breaker / concurrency), ``agents`` (agent-liveness:
    pid / runtime / cpu_ticks), ``cost`` (per-run + aggregate tokens/cost). The
    orchestrator emits ONLY raw signal it alone knows; the stateful arbiter
    (thresholds / deltas / alerts) is the separate sidecar (BRD §1).

    Thin wrapper over ``metrics.build_metrics()`` (in the style of GET /queue): the
    collector is already strictly read-only and never-raise, so no extra error
    handling is needed here. Same access level as /queue//status. The format is the
    documented contract for the sidecar (docs/architecture/README.md).
    """
    from . import metrics as metrics_mod

    return metrics_mod.build_metrics()
@app.post("/serial-gate/unfreeze")
async def serial_gate_unfreeze(repo: str = ""):
"""ORCH-088 (FR-5, ADR-001 D4): manually clear a per-repo rollback-freeze.