feat(metrics): lightweight read-only GET /metrics raw-signal endpoint (ORCH-099) #111

Merged
admin merged 7 commits from feature/ORCH-099-fnd-f1a-metrics-agent-liveness into main 2026-06-10 02:14:40 +03:00

ORCH-099 — FND/F1a: lightweight read-only GET /metrics (raw signal for sidecar F1b)

Adds a versioned, strictly read-only, never-raise JSON endpoint GET /metrics
that exposes the orchestrator's own raw state for the future observability sidecar
F1b (watchdog/): active task stages, job queue, agent-liveness, and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows — thresholds/alerts/history live
in the separate sidecar (observer separated from observed, BRD §1).
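To illustrate the split between raw signal and interpretation, here is a minimal sketch of how a consumer such as the F1b sidecar might turn two raw `cpu_ticks` samples into CPU time using the envelope's `clk_tck`. The function names and the idea of comparing two successive samples are assumptions for illustration; the PR itself defines only the raw fields, not any consumer logic.

```python
from typing import Optional


def cpu_seconds_between(
    prev_ticks: Optional[int], curr_ticks: Optional[int], clk_tck: int
) -> Optional[float]:
    """Convert a cpu_ticks delta into seconds (hypothetical consumer helper).

    Returns None when either sample is missing (the endpoint reports null for
    a dead or unknown pid) or when the counter went backwards, which would
    suggest the pid now belongs to a different process.
    """
    if prev_ticks is None or curr_ticks is None or clk_tck <= 0:
        return None
    delta = curr_ticks - prev_ticks
    if delta < 0:
        return None
    return delta / clk_tck


def cpu_utilisation(cpu_seconds: Optional[float], wall_seconds: float) -> Optional[float]:
    """Fraction of one core used over the sampling interval, or None if unknown."""
    if cpu_seconds is None or wall_seconds <= 0:
        return None
    return cpu_seconds / wall_seconds


# Two samples taken 10 s apart on a system where clk_tck is 100.
used = cpu_seconds_between(100, 350, 100)
print(used, cpu_utilisation(used, 10.0))
```

Any threshold on the resulting utilisation (for example, deciding that an agent is stalled) would belong to the sidecar, not to the orchestrator.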

What changed

  • src/metrics.py (new leaf) — build_metrics() -> dict, never-raise per section
    (serial_gate.snapshot() pattern). Envelope schema_version/generated_at/clk_tck
    + sections stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime
    from /proc/<pid>/stat (→ null on None/dead/non-Linux pid, never raises).
  • src/main.py — thin @app.get("/metrics") wrapper (style of GET /queue).
  • src/db.py — read-only helpers get_running_agents() (dedicated SELECT, not
    an extension of the hot-path get_running_jobs()), agent_cost_totals(),
    queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
  • src/config.py — metrics_endpoint_enabled kill-switch (default True), env
    ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
    actually controls the flag.
  • docs — README API-table row + CHANGELOG entry (contract section already added by
    architect); .env.example ORCH_METRICS_ENABLED.
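The `/proc/<pid>/stat` read described above can be sketched roughly as follows. This is not the code from `src/metrics.py`; the name `read_cpu_ticks` and the blanket `except Exception` are assumptions chosen to mirror the stated "null on None/dead/non-Linux pid, never raises" behaviour.

```python
import os
from typing import Optional


def read_cpu_ticks(pid: Optional[int]) -> Optional[int]:
    """Return utime + stime in clock ticks for pid, or None if unavailable."""
    if pid is None:
        return None
    try:
        with open(f"/proc/{pid}/stat", "r", encoding="utf-8", errors="replace") as fh:
            raw = fh.read()
        # Field 2 (comm) is wrapped in parentheses and may itself contain
        # spaces or ')', so cut at the LAST ')' before splitting on whitespace.
        rest = raw[raw.rindex(")") + 1:].split()
        # rest[0] is field 3 (state), so utime (field 14) and stime (field 15)
        # sit at indexes 11 and 12.
        return int(rest[11]) + int(rest[12])
    except Exception:
        # Dead pid, missing /proc (non-Linux), permission error, odd format:
        # all of them degrade to None instead of propagating.
        return None


print(read_cpu_ticks(os.getpid()))  # an int on Linux, None elsewhere
print(read_cpu_ticks(None))
```

Splitting after the last `)` matters because a process name such as `(my worker)` would otherwise shift every later field by one.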

Invariants

Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health, /status and /queue responses stay byte-for-byte identical.
Self-hosting-safe (physically cannot affect the shared prod pipeline).
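The never-raise-per-section idea can be sketched as below: each section is collected behind its own guard, so one failing source cannot take down the whole response. The envelope and section names come from this PR; `build_metrics_sketch`, the injected `sources` mapping, the `SCHEMA_VERSION` value, and the choice of `None` for a failed section are assumptions for illustration only.

```python
import os
from datetime import datetime, timezone
from typing import Any, Callable, Dict

SCHEMA_VERSION = 1  # hypothetical value


def _safe(collect: Callable[[], Any]) -> Any:
    """Run one collector; any exception degrades that value to None."""
    try:
        return collect()
    except Exception:
        return None


def build_metrics_sketch(sources: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    payload: Dict[str, Any] = {
        "schema_version": SCHEMA_VERSION,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        # os.sysconf is missing on some platforms; _safe turns that into None.
        "clk_tck": _safe(lambda: os.sysconf("SC_CLK_TCK")),
    }
    for name in ("stages", "queue", "agents", "cost"):
        payload[name] = _safe(sources[name]) if name in sources else None
    return payload


def _broken_queue() -> Any:
    raise RuntimeError("database unavailable")


result = build_metrics_sketch({
    "stages": lambda: [],
    "queue": _broken_queue,           # fails, but only this section is lost
    "agents": lambda: [{"pid": 1234}],
    "cost": lambda: {"total": 0},
})
print(result["queue"], result["agents"])
```

Guarding per section rather than around the whole builder keeps the healthy sections available to the consumer even when one data source is down.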

Tests

tests/test_metrics.py (TC-01..TC-11: envelope/4 sections, terminal exclusion, queue
fields, liveness raw + cpu_ticks on a live pid, never-raise on pid=None/dead pid/
throwing source/unavailable breaker, cost aggregate + empty table, endpoint via handler,
read-only DB snapshot before/after, additivity vs /health, /status and /queue, empty state,
kill-switch) + env-alias tests in test_config.py. Full suite green (1482).

ADR: docs/work-items/ORCH-099/06-adr/ADR-001-metrics-endpoint.md,
cross-cutting docs/architecture/adr/adr-0030-metrics-endpoint.md.

Refs: ORCH-099

🤖 Generated with Claude Code

admin added 6 commits 2026-06-10 02:09:20 +03:00
FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).

- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
  serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
  stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
  /proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
  extension of the hot-path get_running_jobs()), agent_cost_totals(),
  queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
  ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
  actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
  by architect); .env.example ORCH_METRICS_ENABLED.

Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health, /status and /queue responses stay byte-for-byte identical.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).

Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tester(ET): auto-commit from tester run_id=546
All checks were successful
CI / test (push) Successful in 46s
CI / test (pull_request) Successful in 51s
fda1bea9b8
admin force-pushed feature/ORCH-099-fnd-f1a-metrics-agent-liveness from e0d78f3035 to fda1bea9b8 2026-06-10 02:09:20 +03:00
admin merged commit cd664b0382 into main 2026-06-10 02:14:40 +03:00
Reference: admin/orchestrator#111