feat(metrics): lightweight read-only GET /metrics raw-signal endpoint (ORCH-099)
FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
/proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
extension of the hot-path get_running_jobs()), agent_cost_totals(),
queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
by architect); .env.example ORCH_METRICS_ENABLED.
Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health, /status and /queue responses
are byte-for-byte unchanged.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
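
The commit message leaves delta computation to the sidecar. As a minimal sketch of what that consumer-side arithmetic could look like, assuming two successive `/metrics` snapshots are already parsed into dicts: the helper names `parse_generated_at` and `cpu_fraction` are hypothetical, and the guard against a shrinking tick counter (for example after pid reuse) is an added assumption rather than something this commit specifies.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_generated_at(ts: str) -> datetime:
    """Parse the envelope timestamp format ``YYYY-MM-DDTHH:MM:SSZ`` as UTC."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def _ticks_for(snapshot: dict, job_id: int) -> int | None:
    """Raw cpu_ticks of the agent entry with this job_id, or None if absent."""
    for agent in snapshot.get("agents") or []:
        if agent.get("job_id") == job_id:
            return agent.get("cpu_ticks")
    return None


def cpu_fraction(prev: dict, curr: dict, job_id: int) -> float | None:
    """Share of one CPU core used by a job between two snapshots.

    Returns None whenever the raw signal is missing (null cpu_ticks, null
    clk_tck, non-increasing timestamps) instead of guessing a value.
    """
    clk_tck = curr.get("clk_tck")
    t1, t2 = _ticks_for(prev, job_id), _ticks_for(curr, job_id)
    if not clk_tck or t1 is None or t2 is None or t2 < t1:
        return None
    wall_s = (
        parse_generated_at(curr["generated_at"])
        - parse_generated_at(prev["generated_at"])
    ).total_seconds()
    if wall_s <= 0:
        return None
    # ticks -> CPU seconds, then divide by elapsed wall-clock seconds
    return ((t2 - t1) / clk_tck) / wall_s


prev = {"generated_at": "2024-01-01T12:00:00Z", "clk_tck": 100,
        "agents": [{"job_id": 7, "cpu_ticks": 1000}]}
curr = {"generated_at": "2024-01-01T12:00:30Z", "clk_tck": 100,
        "agents": [{"job_id": 7, "cpu_ticks": 1150}]}
fraction = cpu_fraction(prev, curr, job_id=7)
```

Both timestamps come from the orchestrator's clock, so the result does not depend on the sidecar's own clock. A small fraction combined with a growing `runtime_s` is the kind of raw evidence a sidecar could treat as a "stuck" candidate.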
@@ -394,6 +394,12 @@ ORCH_COVERAGE_EPSILON=0.5
 ORCH_COVERAGE_TOOL_FAIL_CLOSED=false
 ORCH_COVERAGE_RUN_TIMEOUT_S=900
+
+# ORCH-099 (FND/F1a): operator off-switch for the read-only GET /metrics endpoint
+# (raw-signal snapshot for the F1b sidecar). Default true -> available out of the
+# box. false -> /metrics returns a minimal parsable body {"schema_version":1,
+# "enabled":false} (200, not 404). The endpoint is inert / read-only anyway.
+ORCH_METRICS_ENABLED=true
 
 # ORCH-021: post-deploy production monitoring + degradation reaction. After the
 # terminal deploy->done transition for an applicable repo, a reserved-agent job
 # `post-deploy-monitor` (no LLM, modelled on deploy-finalizer) probes prod over a
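
The off-switch above keeps the response parsable, so a consumer can tell "disabled" apart from "unreachable". A minimal sketch of that consumer-side branching follows; the function name `interpret_metrics` and the three status labels are hypothetical. It treats a missing `enabled` key as "enabled", which is an assumption, since the full envelope is not shown to carry that key.

```python
from __future__ import annotations

import json

SUPPORTED_SCHEMA_VERSION = 1  # the only contract version this sketch understands


def interpret_metrics(body: str) -> tuple[str, dict]:
    """Classify a /metrics response body as 'ok', 'disabled' or 'unsupported'."""
    doc = json.loads(body)
    if doc.get("schema_version") != SUPPORTED_SCHEMA_VERSION:
        # A bumped version signals a breaking change (rename/remove/retype).
        return "unsupported", doc
    if doc.get("enabled") is False:
        # Kill-switch body: HTTP 200 with {"schema_version":1,"enabled":false}.
        return "disabled", doc
    # Unknown extra keys are ignored on purpose: additive changes keep the version.
    return "ok", doc


status_off, _ = interpret_metrics('{"schema_version":1,"enabled":false}')
status_on, _ = interpret_metrics('{"schema_version":1,"stages":[],"future_key":42}')
status_new, _ = interpret_metrics('{"schema_version":2}')
```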
@@ -1,4 +1,4 @@
-Work item: ORCH-093
+Work item: ORCH-099
 Repo: orchestrator
-Branch: feature/ORCH-093-bug-merge-gitea-405-5xx-hold-p
+Branch: feature/ORCH-099-fnd-f1a-metrics-agent-liveness
 Stage: development
@@ -3,6 +3,13 @@
 Format: [Keep a Changelog](https://keepachangelog.com/). One entry per meaningful PR/task.
 
 ## [Unreleased]
+- **Lightweight read-only `GET /metrics`: machine-readable raw signal about the orchestrator itself for the F1b sidecar** (ORCH-099, FND/F1a, `feat`): added a versioned JSON endpoint `GET /metrics` that returns a snapshot of the orchestrator's internal state for the future separate sidecar observer F1b (`watchdog/`). The observer is separated from the observed (BRD §1): the orchestrator returns ONLY facts that it alone knows; thresholds/alerts/history/Telegram live on the F1b side. **Additive, strictly read-only, never-raise:** `STAGE_TRANSITIONS` / `QG_CHECKS` / `check_*` / machine-verdict keys / DB schema are **untouched**; `/health`, `/status` and `/queue` are byte-for-byte unchanged. ADR: `docs/work-items/ORCH-099/06-adr/ADR-001-metrics-endpoint.md`, cross-cutting `docs/architecture/adr/adr-0030-metrics-endpoint.md`.
+- **Leaf collector + thin endpoint (D1):** new `src/metrics.py` (`build_metrics() -> dict`, never-raise per section, `serial_gate.snapshot()` pattern) assembles the envelope section by section (each section in its own `try/except` → safe default `null`/`[]`/`{}` + WARNING); the `@app.get("/metrics")` endpoint in `src/main.py` is a thin wrapper that returns the result as-is (style of `GET /queue`). Testable without ASGI: sections are verified by calling `build_metrics()` directly.
+- **Envelope + `schema_version` contract (D2):** `schema_version` (starts at `1`), `generated_at` (UTC ISO-8601 in the orchestrator's clock domain → CPU deltas are immune to orchestrator↔sidecar skew, TR-3), `clk_tck` (`os.sysconf("SC_CLK_TCK")`, the tick basis). Policy: additive changes do **NOT bump** the version (the sidecar must ignore unknown keys); the version is bumped only on a breaking change (rename/remove/retype).
+- **Raw-signal sections (D3–D7):** `stages`: unfinished tasks (`stage NOT IN ('done','cancelled')`, ORCH-090) with `work_item`/`stage`/`age_in_stage_s`/`repo` (source `db.get_active_tasks_for_reconcile()` + a terminal-stage filter at the consumer; the ORCH-053/086 helper invariant is untouched). `queue`: `db.job_status_counts()` (+ the `cancelled` key by default), depth, retry raw signal (`db.queue_retry_stats()`: attempts/transient/in-backoff), `worker.breaker.snapshot()`, `max_concurrency`. `agents` (liveness): one entry per running job (new read-only `db.get_running_agents()`, a dedicated SELECT, NOT an extension of the hot-path `get_running_jobs()`): `agent`/`run_id`/`job_id`/`pid`/`runtime_s` (= `running_age_s` from `jobs.started_at`, D6)/`model`/`effort` + **raw CPU signal** `cpu_ticks` (utime+stime from `/proc/<pid>/stat`, fields 14+15; the orchestrator is stateless and does not compute the delta, the sidecar is the arbiter). `cost`: `running` (per running job, `null` until completion = honest raw signal) + `aggregate` (new `db.agent_cost_totals()`, `COALESCE(SUM(...),0)` over `agent_runs`).
+- **Never-raise raw signal for liveness (FR-6/NFR-2):** `metrics._read_cpu_ticks(pid)`: `pid is None` / no `/proc/<pid>` / dead process / non-Linux → `cpu_ticks: null` for that agent; the other fields and the whole endpoint stay intact (does NOT raise). Unavailable `worker` → `breaker: null`/`max_concurrency: null`, not a 500. Empty tables → `stages=[]`/`agents=[]`/`cost.aggregate=zeros`.
+- **Kill-switch (D8):** `src/config.py` `metrics_endpoint_enabled: bool = True` (env `ORCH_METRICS_ENABLED` via an explicit `validation_alias`, so the documented contract name actually controls the flag). `False` → `200` with the minimal body `{"schema_version":1,"enabled":false}` (NOT 404: the contract stays parsable). Default `True` → zero regression (the endpoint is available out of the box).
+- **Contract documented (AC-7):** the `/metrics` format is fixed in `docs/architecture/README.md` (section "Raw-signal endpoint `/metrics`" + a row in the API table) as a stable contract for F1b. Tests: `tests/test_metrics.py` (TC-01…TC-11: envelope/4 sections, exclusion of terminal stages, queue fields, liveness raw signal + cpu_ticks on a live pid, never-raise on `pid=None`/dead pid/raising source/unavailable breaker, cost aggregate + empty table, endpoint via handler, read-only DB snapshot before/after, additivity of `/health`, `/status` and `/queue`, empty state, kill-switch). Full regression `tests/ -q` is green (1480 → +14). Rollback: `ORCH_METRICS_ENABLED=false` (instant) or removal of the module/endpoint/helpers (no traces in the DB/schema).
 - **Deterministic test-coverage gate: protection against silent coverage degradation before merge into `main`** (ORCH-027, `feat`): the existing test gates (`check_ci_green`, `check_tests_passed`, merge-gate re-test) judge only the **fact** of passing, not **completeness**; none of them notices "300 lines of code, 0 tests", and under a batch autonomous run (ORCH-088) coverage degrades monotonically. Introduced a deterministic (no LLM) sub-gate on the `deploy-staging → deploy` edge, modelled on the security gate (ORCH-022): leaf `src/coverage_gate.py` (never-raise) + thin wrapper `check_coverage_gate` in `QG_CHECKS` + a `_handle_coverage_gate` hook in `advance_stage`. **Additive:** `STAGE_TRANSITIONS` / semantics of the existing `check_*` / machine-verdict keys (`verdict:`/`result:`/`deploy_status:`/`staging_status:`/`security_status:`) are byte-for-byte unchanged; the new DB table is additive (NFR-5/AC-8). See `docs/work-items/ORCH-027/06-adr/ADR-001-coverage-gate.md`, cross-cutting `docs/architecture/adr/adr-0029-coverage-gate.md`.
 - **Placement/order (D1, AC-2):** the sub-gate runs **AFTER merge-gate** (coverage is measured on the caught-up `auto_rebase_onto_main` HEAD, exactly the code that landed in `main`) and **BEFORE image-freshness** (fail before the expensive docker rebuild). FAIL → regular rollback to `development` (+ developer-retry increment, cap `MAX_DEVELOPER_RETRIES`) **and release of the merge-lease** (merge-gate held it on its own PASS; this mirrors the image-freshness rollback, TR-2). `STAGE_TRANSITIONS` does not change (a sub-gate, like security/merge/image-freshness).
 - **Measurement (D2, FR-1/AC-1):** `python -m pytest tests/ --cov=src --cov-report=json` in an isolated per-branch worktree (`ensure_worktree`, precedent `check_tests_local`); the metric is `totals.percent_covered` (line coverage of `src/`). The measurer is encapsulated behind `measure_coverage(repo, branch) -> float | None` (stack extensibility BR-6: jest/jacoco is a new `measure_*` branch, without rewriting the core). Timeout `coverage_run_timeout_s`. New pip dependency `pytest-cov==5.0.0` (offline at measurement time).
@@ -1009,6 +1009,7 @@ Monitoring after Deploy → Done
 | GET | `/health` | health check |
 | GET | `/status` | active tasks (stage != done) |
 | GET | `/queue` | queue: counts + max_concurrency + resilience + reconcile (ORCH-053) + reaper (ORCH-065) + post_deploy (ORCH-021) + task_deps (ORCH-026) + serial_gate (ORCH-088) + auto_labels (ORCH-089) + stop (ORCH-090) + latest jobs |
+| GET | `/metrics` | ORCH-099 (FND/F1a): read-only machine-readable raw signal for the F1b sidecar: envelope `schema_version`/`generated_at`/`clk_tck` + sections `stages`/`queue`/`agents` (liveness: pid/runtime/cpu_ticks)/`cost`. Never-raise per section; kill-switch `ORCH_METRICS_ENABLED` (default `True`). Contract: see the section "Raw-signal endpoint `/metrics`" |
 | POST | `/serial-gate/unfreeze` | ORCH-088 (FR-5): manual clearing of the per-repo rollback-freeze (query/body `repo=<repo>`) → `{ok, repo, cleared, frozen}`; idempotent. Alternative: `UPDATE repo_freeze SET cleared_at=datetime('now') WHERE repo=? AND cleared_at IS NULL` |
 | POST | `/webhook/plane` | Plane webhook |
 | POST | `/webhook/gitea` | Gitea webhook (push, PR, CI status) |
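
The "never-raise per section" property named in the `/metrics` row can be illustrated with a small sketch. The commit itself inlines a `try/except` around each section; the `guarded` and `build_snapshot` helpers below are hypothetical names used only to show the idea that one failing source degrades a single field to its safe default while the rest of the response survives.

```python
import logging

logger = logging.getLogger("metrics_sketch")


def guarded(name, collect, default):
    """Run one section collector; on any error log a warning and return default."""
    try:
        return collect()
    except Exception as exc:  # broad on purpose: the endpoint must never raise
        logger.warning("metrics section %s failed: %s", name, exc)
        return default


def build_snapshot(sources):
    """sources maps section name -> (collector, safe_default)."""
    return {
        name: guarded(name, collect, default)
        for name, (collect, default) in sources.items()
    }


def failing_queue():
    raise RuntimeError("database is locked")


snapshot = build_snapshot({
    "stages": (lambda: [{"work_item": "ORCH-099", "stage": "development"}], []),
    "queue": (failing_queue, None),
})
```

Here `snapshot["queue"]` falls back to `None` while `snapshot["stages"]` is returned unchanged, which mirrors the `null`/`[]`/`{}` defaults the contract describes.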
@@ -1,7 +1,7 @@
 import logging
 import re
 
-from pydantic import field_validator
+from pydantic import Field, field_validator
 from pydantic_settings import BaseSettings
 
 
@@ -819,6 +819,17 @@ class Settings(BaseSettings):
     # 200 (was hardcoded 80). Invalid/empty value -> default (graceful, no crash).
     qg0_title_max: int = 200
 
+    # ORCH-099 (D8): operator off-switch for the read-only GET /metrics endpoint.
+    # The env var is ORCH_METRICS_ENABLED (explicit validation_alias — the documented
+    # contract name, ADR-001 D8 / README — overriding the default ORCH_ + field-name
+    # mapping so the documented switch actually controls the flag). Default True ->
+    # the endpoint is available out of the box (zero regression vs BRD). False ->
+    # /metrics returns a minimal parsable body {"schema_version": 1, "enabled": false}
+    # (200, NOT 404) so the F1b sidecar sees the off-switch explicitly. The endpoint
+    # is inert / read-only anyway; the flag is a cheap self-hosting insurance on the
+    # shared prod instance.
+    metrics_endpoint_enabled: bool = Field(True, validation_alias="ORCH_METRICS_ENABLED")
+
     @field_validator("qg0_title_max", mode="before")
     @classmethod
     def _qg0_title_max_default(cls, v):
105
src/db.py
@@ -1133,6 +1133,100 @@ def get_running_jobs() -> list[dict]:
     return [dict(r) for r in rows]
 
 
+def get_running_agents() -> list[dict]:
+    """ORCH-099 (D5): read-only liveness snapshot of every 'running' job for /metrics.
+
+    A dedicated read-only SELECT — deliberately NOT an extension of
+    ``get_running_jobs()`` (the job-reaper hot path, ORCH-065): widening that
+    query under observability needs would migrate a foreign component's invariant.
+    Each row carries the process identity + cost context the F1b sidecar needs:
+    * ``job_id`` / ``run_id`` / ``pid`` — process identity (pid may be NULL until
+      the launcher stamps it / after the process exits);
+    * ``agent`` / ``repo`` — role and project (the sidecar is multi-project);
+    * ``running_age_s`` — seconds since ``jobs.started_at`` (the same process
+      anchor the reaper uses for backstop-liveness, D6);
+    * ``model`` / ``effort`` — cost context (LEFT JOIN ``agent_runs``);
+    * the token / ``cost_usd`` columns — current per-run accruals, usually NULL
+      until the launcher parses the CLI result JSON on finish (honest raw, TR-5).
+
+    A LEFT JOIN on ``run_id`` keeps a job with no ``agent_runs`` row. Read-only;
+    never mutates.
+    """
+    conn = get_db()
+    try:
+        rows = conn.execute(
+            "SELECT j.id AS job_id, j.run_id AS run_id, j.pid AS pid, "
+            "j.agent AS agent, j.repo AS repo, j.started_at AS started_at, "
+            "CAST(strftime('%s','now') - strftime('%s', j.started_at) AS INTEGER) "
+            " AS running_age_s, "
+            "r.model AS model, r.effort AS effort, r.cost_usd AS cost_usd, "
+            "r.input_tokens AS input_tokens, r.output_tokens AS output_tokens, "
+            "r.cache_read_tokens AS cache_read_tokens, "
+            "r.cache_creation_tokens AS cache_creation_tokens "
+            "FROM jobs j LEFT JOIN agent_runs r ON r.id = j.run_id "
+            "WHERE j.status='running'"
+        ).fetchall()
+    finally:
+        conn.close()
+    return [dict(r) for r in rows]
+
+
+def agent_cost_totals() -> dict:
+    """ORCH-099 (D7): read-only aggregate of cost / tokens over all agent_runs.
+
+    Pure ``SELECT COALESCE(SUM(...),0)`` — an empty ``agent_runs`` table yields
+    zeros, never an error (TC-06 / TC-11). Read-only; never mutates.
+    """
+    conn = get_db()
+    try:
+        row = conn.execute(
+            "SELECT "
+            "COALESCE(SUM(cost_usd),0) AS cost_usd, "
+            "COALESCE(SUM(input_tokens),0) AS input_tokens, "
+            "COALESCE(SUM(output_tokens),0) AS output_tokens, "
+            "COALESCE(SUM(cache_read_tokens),0) AS cache_read_tokens, "
+            "COALESCE(SUM(cache_creation_tokens),0) AS cache_creation_tokens "
+            "FROM agent_runs"
+        ).fetchone()
+    finally:
+        conn.close()
+    return dict(row) if row else {
+        "cost_usd": 0,
+        "input_tokens": 0,
+        "output_tokens": 0,
+        "cache_read_tokens": 0,
+        "cache_creation_tokens": 0,
+    }
+
+
+def queue_retry_stats() -> dict:
+    """ORCH-099 (D4): read-only retry raw over UNFINISHED jobs for /metrics.queue.
+
+    Aggregates ``attempts`` / ``transient_attempts`` and counts jobs currently in
+    backoff (``available_at > now``) across non-terminal jobs (status NOT IN
+    done/failed/cancelled). Read-only; never mutates.
+    """
+    conn = get_db()
+    try:
+        row = conn.execute(
+            "SELECT "
+            "COALESCE(SUM(attempts),0) AS total_attempts, "
+            "COALESCE(SUM(transient_attempts),0) AS total_transient_attempts, "
+            "COALESCE(MAX(attempts),0) AS max_attempts_seen, "
+            "COALESCE(SUM(CASE WHEN available_at IS NOT NULL "
+            " AND available_at > datetime('now') THEN 1 ELSE 0 END),0) AS in_backoff "
+            "FROM jobs WHERE status NOT IN ('done','failed','cancelled')"
+        ).fetchone()
+    finally:
+        conn.close()
+    return dict(row) if row else {
+        "total_attempts": 0,
+        "total_transient_attempts": 0,
+        "max_attempts_seen": 0,
+        "in_backoff": 0,
+    }
+
+
 def reap_running_job(
     job_id: int,
     status: str,
@@ -1185,13 +1279,20 @@ def get_job(job_id: int) -> dict | None:
 
 
 def job_status_counts() -> dict:
-    """Return counts grouped by status (for /queue observability)."""
+    """Return counts grouped by status (for /queue and /metrics observability).
+
+    ORCH-099 (D4): the default dict carries the ``cancelled`` terminal key
+    (ORCH-090, terminal set ``{done, cancelled}``) so the key is always present
+    with a 0 default instead of materialising only when a cancelled job exists.
+    Purely additive — the GROUP BY query is unchanged and pre-existing keys keep
+    their meaning (no /queue contract break).
+    """
     conn = get_db()
     rows = conn.execute(
         "SELECT status, COUNT(*) AS n FROM jobs GROUP BY status"
     ).fetchall()
     conn.close()
-    counts = {"queued": 0, "running": 0, "done": 0, "failed": 0}
+    counts = {"queued": 0, "running": 0, "done": 0, "failed": 0, "cancelled": 0}
     for r in rows:
         counts[r["status"]] = r["n"]
     return counts
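
The two SQL properties the new `src/db.py` helpers rely on can be checked in isolation. The sketch below uses an in-memory SQLite database with a deliberately simplified, hypothetical schema (only a few columns of `jobs` and `agent_runs`); it is not the project's real schema or its `get_db()` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript(
    """
    CREATE TABLE agent_runs (id INTEGER PRIMARY KEY, cost_usd REAL, input_tokens INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, run_id INTEGER, status TEXT);
    """
)

# 1. SUM over an empty table is NULL; COALESCE turns it into a usable zero.
bare = dict(conn.execute("SELECT SUM(cost_usd) AS cost_usd FROM agent_runs").fetchone())
safe = dict(
    conn.execute(
        "SELECT COALESCE(SUM(cost_usd),0) AS cost_usd, "
        "COALESCE(SUM(input_tokens),0) AS input_tokens FROM agent_runs"
    ).fetchone()
)

# 2. LEFT JOIN keeps a running job that has no agent_runs row yet.
conn.execute("INSERT INTO jobs (id, run_id, status) VALUES (1, NULL, 'running')")
joined = [
    dict(r)
    for r in conn.execute(
        "SELECT j.id AS job_id, r.cost_usd AS cost_usd "
        "FROM jobs j LEFT JOIN agent_runs r ON r.id = j.run_id "
        "WHERE j.status='running'"
    )
]
conn.close()
```

Because an aggregate query without `GROUP BY` returns exactly one row even for an empty table, the `if row else {...}` fallback in the helpers reads as a defensive default rather than a path expected in normal operation.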
20
src/main.py
@@ -213,6 +213,26 @@ async def queue():
     }
 
 
+@app.get("/metrics")
+async def metrics():
+    """ORCH-099 (FND/F1a): lightweight read-only raw-signal snapshot for the F1b sidecar.
+
+    A versioned JSON envelope (``schema_version`` / ``generated_at`` / ``clk_tck``)
+    with four raw-signal sections — ``stages`` (active task stages + age),
+    ``queue`` (counts / retries / breaker / concurrency), ``agents`` (agent-liveness:
+    pid / runtime / cpu_ticks), ``cost`` (per-run + aggregate tokens/cost). The
+    orchestrator emits ONLY raw signal it alone knows; the stateful arbiter
+    (thresholds / deltas / alerts) is the separate sidecar (BRD §1).
+
+    Thin wrapper over ``metrics.build_metrics()`` (in the style of GET /queue): the
+    collector is already strictly read-only and never-raise, so no extra error
+    handling is needed here. Same access level as /queue//status. The format is the
+    documented contract for the sidecar (docs/architecture/README.md).
+    """
+    from . import metrics as metrics_mod
+    return metrics_mod.build_metrics()
+
+
 @app.post("/serial-gate/unfreeze")
 async def serial_gate_unfreeze(repo: str = ""):
     """ORCH-088 (FR-5, ADR-001 D4): manually clear a per-repo rollback-freeze.
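
The `cpu_ticks` value served by this endpoint comes from `/proc/<pid>/stat`, whose second field (the process name) may itself contain spaces and parentheses. The sketch below shows, on a synthetic stat line, why splitting after the last `") "` is the safer read; `parse_cpu_ticks` is a hypothetical stand-alone function that takes the line as a string instead of opening `/proc`, and the sample line is made up for illustration.

```python
from __future__ import annotations


def parse_cpu_ticks(stat_line: str) -> int | None:
    """Return utime + stime (fields 14 and 15) from one /proc/<pid>/stat line."""
    rparen = stat_line.rfind(") ")  # last ") " closes the comm field
    if rparen == -1:
        return None
    rest = stat_line[rparen + 2:].split()
    try:
        # rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12]
        return int(rest[11]) + int(rest[12])
    except (IndexError, ValueError):
        return None


# Synthetic line: comm is "my agent (v2)", utime=700, stime=300.
sample = (
    "4242 (my agent (v2)) S 1 4242 4242 0 -1 4194304 120 0 0 0 "
    "700 300 0 0 20 0 1 0 12345 1000000 250"
)
robust = parse_cpu_ticks(sample)

# A naive whitespace split miscounts fields when comm contains spaces.
naive_fields = sample.split()
naive = int(naive_fields[13]) + int(naive_fields[14])
```

With this sample the robust read yields 1000 ticks, while the naive split lands on two unrelated fields and reports 0.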
276
src/metrics.py
Normal file
@@ -0,0 +1,276 @@
+"""ORCH-099 (FND/F1a): lightweight read-only ``/metrics`` raw-signal collector.
+
+A leaf module that builds a versioned JSON snapshot of the orchestrator's own
+raw state for the future observability sidecar (F1b, ``watchdog/``): active task
+stages, the job queue, agent-liveness, and cost/tokens. The orchestrator emits
+ONLY raw signal it alone knows — the sidecar is the stateful arbiter that
+computes thresholds / deltas / alerts (BRD §1, observer separated from observed).
+
+Design (ADR-001, modelled on ``serial_gate.snapshot()`` / ``cancel.snapshot()``):
+* pure, never-raise, no side effects — only reads existing tables
+  (``tasks`` / ``jobs`` / ``agent_runs``) and the in-memory worker snapshot;
+* ``build_metrics()`` assembles the envelope section-by-section, each section in
+  its own ``try/except`` with a safe default (``None`` / ``[]`` / ``{}``) so a
+  failing source degrades one field, never the whole endpoint (FR-6, NFR-2);
+* strictly read-only — no INSERT/UPDATE/DELETE/CREATE/ALTER, no process control,
+  no network. Self-hosting-safe on the shared prod instance.
+
+The endpoint ``GET /metrics`` (``src/main.py``) is a thin wrapper that returns
+``build_metrics()`` as-is.
+"""
+from __future__ import annotations
+
+import logging
+import os
+from datetime import datetime, timezone
+
+logger = logging.getLogger("orchestrator.metrics")
+
+# Contract version for the sidecar (D2). Additive changes (new field/section) do
+# NOT bump it — the sidecar MUST ignore unknown keys and tolerate missing
+# optional ones. Bumped ONLY on a breaking change (rename/remove/retype an
+# existing field).
+SCHEMA_VERSION = 1
+
+
+def _now_iso() -> str:
+    """UTC ISO-8601 snapshot timestamp (``...Z``), the orchestrator's own clock.
+
+    Same clock domain as the SQLite ``datetime('now')`` timestamps and the CPU
+    tick reads, so the sidecar's ``(cpu_ticks, generated_at)`` deltas are immune
+    to orchestrator↔sidecar clock skew (TR-3). Never raises.
+    """
+    try:
+        return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
+    except Exception as e:  # noqa: BLE001 - never-raise
+        logger.warning("metrics._now_iso error: %s", e)
+        return ""
+
+
+def _clk_tck() -> int | None:
+    """Process-global SC_CLK_TCK (ticks/second) — the basis for converting raw CPU
+    ticks to seconds on the sidecar side. ``None`` on non-Linux / failure.
+    """
+    try:
+        return int(os.sysconf("SC_CLK_TCK"))
+    except Exception as e:  # noqa: BLE001 - never-raise (non-Linux / unsupported)
+        logger.warning("metrics._clk_tck error: %s", e)
+        return None
+
+
+def _read_cpu_ticks(pid: int | None) -> int | None:
+    """Sum of utime+stime (CPU ticks) from ``/proc/<pid>/stat`` — raw liveness signal.
+
+    The orchestrator emits raw ticks and does NOT compute the delta — the sidecar
+    is the stateful arbiter (it divides ``(ticks₂−ticks₁)/clk_tck`` by the
+    ``generated_at`` delta to get a CPU fraction; a tiny fraction at a growing
+    ``runtime_s`` ⇒ a "stuck" candidate). Parsing is robust to spaces in ``comm``:
+    fields are read AFTER the closing ``") "`` of the process name (canonical
+    proc-stat read). utime = field 14, stime = field 15 → indices 11 and 12 of the
+    post-``)`` token list (fields 3.. shift by 3).
+
+    never-raise (NFR-2, AC-6): ``pid is None`` / missing ``/proc/<pid>`` (process
+    died or non-Linux) / any parse error → ``None`` (NOT an exception). The caller
+    keeps every other field and the whole endpoint intact.
+    """
+    if pid is None:
+        return None
+    try:
+        with open(f"/proc/{int(pid)}/stat", "r") as f:
+            data = f.read()
+        rparen = data.rfind(") ")
+        if rparen == -1:
+            return None
+        rest = data[rparen + 2:].split()
+        # rest[0] = state (field 3); utime = field 14 -> rest[11], stime -> rest[12]
+        return int(rest[11]) + int(rest[12])
+    except Exception:  # noqa: BLE001 - dead pid / no /proc / non-Linux -> null
+        return None
+
+
+def _build_stages() -> list:
+    """Active (non-terminal) task stages (D3, FR-1).
+
+    Source: ``db.get_active_tasks_for_reconcile()`` (``stage != 'done'`` + SQL
+    ``age_s``), with an extra ``stage NOT IN ('done','cancelled')`` filter on the
+    metrics side: that helper deliberately still returns ``cancelled`` tasks for
+    the reconciler's skip-counter (ORCH-086), but terminal tasks are not raw
+    observability signal (terminal set ``{done, cancelled}``, ORCH-090). The helper
+    invariant belongs to ORCH-053/086 — we filter at the consumer, not the source.
+    """
+    from . import db
+
+    rows = db.get_active_tasks_for_reconcile()
+    out = []
+    for t in rows:
+        if t.get("stage") in ("done", "cancelled"):
+            continue
+        out.append({
+            "work_item": t.get("work_item_id"),
+            "stage": t.get("stage"),
+            "age_in_stage_s": t.get("age_s"),
+            "repo": t.get("repo"),
+            "task_id": t.get("id"),
+        })
+    return out
+
+
+def _build_queue() -> dict:
+    """Job-queue raw signal (D4, FR-2): counts, depth, retries, breaker, concurrency.
+
+    Each sub-source is independently guarded: an uninitialised ``worker`` (e.g. in
+    a test) degrades to ``breaker: null`` / ``max_concurrency: null`` — never a 500
+    (NFR-2).
+    """
+    from . import db
+
+    counts = None
+    try:
+        counts = db.job_status_counts()
+    except Exception as e:  # noqa: BLE001
+        logger.warning("metrics queue counts error: %s", e)
+
+    retries = None
+    try:
+        retries = db.queue_retry_stats()
+    except Exception as e:  # noqa: BLE001
+        logger.warning("metrics queue retries error: %s", e)
+
+    breaker = None
+    max_concurrency = None
+    poll_interval = None
+    try:
+        from .queue_worker import worker
+        try:
+            breaker = worker.breaker.snapshot()
+        except Exception as e:  # noqa: BLE001
+            logger.warning("metrics breaker snapshot error: %s", e)
+        max_concurrency = getattr(worker, "max_concurrency", None)
+        poll_interval = getattr(worker, "poll_interval", None)
+    except Exception as e:  # noqa: BLE001 - worker not initialised
+        logger.warning("metrics worker access error: %s", e)
+
+    depth = counts.get("queued") if isinstance(counts, dict) else None
+    return {
+        "counts": counts,
+        "depth": depth,
+        "retries": retries,
+        "breaker": breaker,
+        "max_concurrency": max_concurrency,
+        "poll_interval": poll_interval,
+    }
+
+
+def _build_agents() -> list:
+    """Agent-liveness raw signal (D5/D6, FR-3).
+
+    One entry per running job from ``db.get_running_agents()`` with process
+    identity (``agent`` / ``run_id`` / ``job_id`` / ``pid``), ``runtime_s``
+    (= ``running_age_s``, anchored on ``jobs.started_at``, D6), ``model`` /
+    ``effort``, and the raw ``cpu_ticks`` from ``/proc/<pid>/stat``. ``pid is
+    None`` / dead process → ``cpu_ticks: null`` for THAT agent; the rest stays
+    intact (AC-6, TC-05).
+    """
+    from . import db
+
+    rows = db.get_running_agents()
+    out = []
+    for j in rows:
+        pid = j.get("pid")
+        out.append({
+            "agent": j.get("agent"),
+            "run_id": j.get("run_id"),
+            "job_id": j.get("job_id"),
+            "repo": j.get("repo"),
+            "pid": pid,
+            "runtime_s": j.get("running_age_s"),
+            "model": j.get("model"),
+            "effort": j.get("effort"),
+            "cpu_ticks": _read_cpu_ticks(pid),
+        })
+    return out
+
+
+def _build_cost() -> dict:
+    """Cost / token raw signal (D7, FR-4).
+
+    ``running`` — current per-running-job accruals from ``agent_runs`` (often
+    ``null`` until the job finishes and the launcher parses the CLI JSON — ``null``
+    is honest raw, NOT zero, TR-5). ``aggregate`` — summed totals over all
+    ``agent_runs`` (empty table → zeros, TC-06/TC-11).
+    """
+    from . import db
+
+    running = []
+    try:
+        for j in db.get_running_agents():
+            running.append({
+                "run_id": j.get("run_id"),
+                "job_id": j.get("job_id"),
+                "agent": j.get("agent"),
+                "cost_usd": j.get("cost_usd"),
+                "input_tokens": j.get("input_tokens"),
+                "output_tokens": j.get("output_tokens"),
+                "cache_read_tokens": j.get("cache_read_tokens"),
+                "cache_creation_tokens": j.get("cache_creation_tokens"),
+            })
+    except Exception as e:  # noqa: BLE001
+        logger.warning("metrics cost.running error: %s", e)
+        running = []
+
+    aggregate = None
+    try:
+        aggregate = db.agent_cost_totals()
+    except Exception as e:  # noqa: BLE001
+        logger.warning("metrics cost.aggregate error: %s", e)
+
+    return {"running": running, "aggregate": aggregate}
+
+
|
def build_metrics() -> dict:
    """Assemble the ``/metrics`` envelope (FR-5). never-raise (FR-6, NFR-2, AC-4).

    Each section is collected in its own ``try/except`` with a safe default so a
    failing source degrades one section, not the whole response. Honours the
    ``metrics_endpoint_enabled`` kill-switch (D8): when off, returns a minimal
    parsable body ``{"schema_version", "enabled": false}`` (200, NOT 404) so the
    sidecar sees the off-switch explicitly.
    """
    try:
        from .config import settings
        if not bool(getattr(settings, "metrics_endpoint_enabled", True)):
            return {"schema_version": SCHEMA_VERSION, "enabled": False}
    except Exception as e:  # noqa: BLE001 - kill-switch read must never break /metrics
        logger.warning("metrics kill-switch read error: %s", e)

    out: dict = {
        "schema_version": SCHEMA_VERSION,
        "generated_at": _now_iso(),
        "clk_tck": _clk_tck(),
    }

    try:
        out["stages"] = _build_stages()
    except Exception as e:  # noqa: BLE001
        logger.warning("metrics stages section error: %s", e)
        out["stages"] = []

    try:
        out["queue"] = _build_queue()
    except Exception as e:  # noqa: BLE001
        logger.warning("metrics queue section error: %s", e)
        out["queue"] = None

    try:
        out["agents"] = _build_agents()
    except Exception as e:  # noqa: BLE001
        logger.warning("metrics agents section error: %s", e)
        out["agents"] = []

    try:
        out["cost"] = _build_cost()
    except Exception as e:  # noqa: BLE001
        logger.warning("metrics cost section error: %s", e)
        out["cost"] = None

    return out
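Reviewer note: the agents section exposes raw `cpu_ticks` and the envelope exposes `clk_tck`, so converting ticks into a CPU share is left to the consumer. The following is a minimal sketch of how the F1b sidecar might do that from two successive samples. It is not part of this commit: `cpu_share` is a hypothetical helper, and it assumes `generated_at` is a string that `datetime.fromisoformat` can parse.

```python
from datetime import datetime
from typing import Optional


def cpu_share(prev: dict, curr: dict, job_id: int) -> Optional[float]:
    """Fraction of one CPU used by job_id between two /metrics samples.

    Hypothetical consumer-side helper. Returns None whenever the raw signal
    is missing (null cpu_ticks, null clk_tck, job absent from a sample,
    unparsable timestamps) instead of guessing a value.
    """
    clk_tck = curr.get("clk_tck")
    if not clk_tck:
        return None

    def ticks(sample: dict) -> Optional[int]:
        # Find this job's raw utime+stime tick count in one sample.
        for agent in sample.get("agents") or []:
            if agent.get("job_id") == job_id:
                return agent.get("cpu_ticks")
        return None

    before, after = ticks(prev), ticks(curr)
    if before is None or after is None or after < before:
        return None

    try:
        # Assumption: generated_at is ISO-8601 as accepted by fromisoformat.
        elapsed = (
            datetime.fromisoformat(curr["generated_at"])
            - datetime.fromisoformat(prev["generated_at"])
        ).total_seconds()
    except (KeyError, TypeError, ValueError):
        return None
    if elapsed <= 0:
        return None

    # ticks / clk_tck = CPU seconds; divide by wall-clock seconds for a share.
    return (after - before) / clk_tck / elapsed
```

Returning `None` rather than `0.0` for a missing sample mirrors the endpoint's own convention that `null` means "unknown", not "zero".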
@@ -320,3 +320,20 @@ def test_deploy_status_guard_settings_env_override(monkeypatch):
    s = Settings()
    assert s.deploy_status_guard_enabled is False
    assert s.deploy_status_guard_repos == "orchestrator,enduro-trails"


# ---------------------------------------------------------------------------
# ORCH-099 (D8): metrics_endpoint_enabled default + env alias ORCH_METRICS_ENABLED.
# The field carries an explicit validation_alias so the DOCUMENTED env var
# (README / ADR-001 D8) actually controls the flag, overriding the default
# ORCH_ + field-name mapping (which would otherwise be ORCH_METRICS_ENDPOINT_*).
# ---------------------------------------------------------------------------
def test_metrics_endpoint_enabled_default_true(monkeypatch):
    monkeypatch.delenv("ORCH_METRICS_ENABLED", raising=False)
    monkeypatch.delenv("ORCH_METRICS_ENDPOINT_ENABLED", raising=False)
    assert Settings().metrics_endpoint_enabled is True


def test_metrics_endpoint_enabled_reads_documented_env_alias(monkeypatch):
    monkeypatch.setenv("ORCH_METRICS_ENABLED", "false")
    assert Settings().metrics_endpoint_enabled is False
295
tests/test_metrics.py
Normal file
@@ -0,0 +1,295 @@
"""ORCH-099 (FND/F1a) — read-only GET /metrics raw-signal endpoint.

Covers the four-section envelope (TC-01..TC-04/TC-08/TC-11), never-raise by
section/field (TC-05/TC-07), the cost aggregate (TC-06), read-only invariant
(TC-09), and additivity vs /health//status//queue (TC-10).

Pattern mirrors tests/test_queue_endpoint.py: the async handler is driven via
asyncio.run(main.metrics()); the autouse conftest mutes Telegram; a per-test
fresh_db points settings.db_path at a tmp file + init_db.
"""
import asyncio
import os

import pytest

import src.db as db  # noqa: E402
from src.db import get_db, init_db  # noqa: E402
from src import config as cfg  # noqa: E402
from src import metrics as metrics_mod  # noqa: E402


@pytest.fixture(autouse=True)
def fresh_db(tmp_path, monkeypatch):
    dbfile = tmp_path / "metrics.db"
    monkeypatch.setattr(db.settings, "db_path", str(dbfile))
    monkeypatch.setattr(cfg.settings, "metrics_endpoint_enabled", True, raising=False)
    init_db()
    yield


# --- helpers ---------------------------------------------------------------
def _make_task(work_item_id="ORCH-1", repo="orchestrator",
               branch="feature/x", stage="development"):
    conn = get_db()
    cur = conn.execute(
        "INSERT INTO tasks (plane_id, work_item_id, repo, branch, stage) "
        "VALUES (?, ?, ?, ?, ?)",
        (work_item_id, work_item_id, repo, branch, stage),
    )
    tid = cur.lastrowid
    conn.commit()
    conn.close()
    return tid


def _make_agent_run(agent="developer", task_id=None, model="claude-opus-4-8",
                    effort="xhigh", cost_usd=None, input_tokens=None,
                    output_tokens=None, cache_read_tokens=None,
                    cache_creation_tokens=None, finished=False):
    conn = get_db()
    cur = conn.execute(
        "INSERT INTO agent_runs (task_id, agent, model, effort, cost_usd, "
        "input_tokens, output_tokens, cache_read_tokens, cache_creation_tokens, "
        "finished_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, "
        + ("datetime('now')" if finished else "NULL") + ")",
        (task_id, agent, model, effort, cost_usd, input_tokens, output_tokens,
         cache_read_tokens, cache_creation_tokens),
    )
    rid = cur.lastrowid
    conn.commit()
    conn.close()
    return rid


def _make_running_job(agent="developer", repo="orchestrator", task_id=None,
                      pid=None, run_id=None, age_s=0, attempts=0, max_attempts=2):
    conn = get_db()
    cur = conn.execute(
        "INSERT INTO jobs (agent, repo, task_id, status, attempts, max_attempts, "
        "run_id, pid, started_at) "
        "VALUES (?, ?, ?, 'running', ?, ?, ?, ?, datetime('now', ?))",
        (agent, repo, task_id, attempts, max_attempts, run_id, pid,
         f"-{int(age_s)} seconds"),
    )
    job_id = cur.lastrowid
    conn.commit()
    conn.close()
    return job_id


def _db_snapshot():
    """Full row snapshot of the mutable tables for the read-only assertion."""
    conn = get_db()
    snap = {}
    for table in ("tasks", "jobs", "agent_runs"):
        rows = conn.execute(f"SELECT * FROM {table} ORDER BY id").fetchall()
        snap[table] = [dict(r) for r in rows]
    conn.close()
    return snap


# --- TC-01: envelope keys --------------------------------------------------
def test_tc01_envelope_has_all_sections():
    m = metrics_mod.build_metrics()
    assert isinstance(m, dict)
    for key in ("schema_version", "generated_at", "stages", "queue", "agents", "cost"):
        assert key in m, f"missing envelope key {key!r}"
    assert m["schema_version"] == 1
    assert isinstance(m["stages"], list)
    assert isinstance(m["agents"], list)
    assert isinstance(m["queue"], dict)
    assert isinstance(m["cost"], dict)


# --- TC-02: stages section + terminal exclusion ----------------------------
def test_tc02_stages_active_only_with_fields():
    _make_task(work_item_id="ORCH-10", stage="development", repo="orchestrator")
    _make_task(work_item_id="ORCH-11", stage="done")  # terminal -> excluded
    _make_task(work_item_id="ORCH-12", stage="cancelled")  # terminal -> excluded

    stages = metrics_mod.build_metrics()["stages"]
    wis = {s["work_item"] for s in stages}
    assert "ORCH-10" in wis
    assert "ORCH-11" not in wis
    assert "ORCH-12" not in wis

    item = next(s for s in stages if s["work_item"] == "ORCH-10")
    assert item["stage"] == "development"
    assert item["repo"] == "orchestrator"
    assert isinstance(item["age_in_stage_s"], int)


# --- TC-03: queue section --------------------------------------------------
def test_tc03_queue_section_fields():
    q = metrics_mod.build_metrics()["queue"]
    assert "counts" in q
    counts = q["counts"]
    for k in ("queued", "running", "failed", "cancelled"):
        assert k in counts
    assert q["max_concurrency"] is not None
    assert "retries" in q and isinstance(q["retries"], dict)
    assert "in_backoff" in q["retries"]
    # breaker snapshot present (worker is the module singleton, initialised)
    assert q["breaker"] is not None
    for k in ("state", "consecutive_transient", "pause_remaining_s"):
        assert k in q["breaker"]


# --- TC-04: agents liveness section ----------------------------------------
def test_tc04_agents_liveness_fields():
    tid = _make_task(work_item_id="ORCH-20")
    rid = _make_agent_run(task_id=tid, model="claude-opus-4-8", effort="xhigh")
    # use our own (alive) pid so cpu_ticks is a real integer
    _make_running_job(task_id=tid, pid=os.getpid(), run_id=rid, age_s=5)

    agents = metrics_mod.build_metrics()["agents"]
    assert len(agents) == 1
    a = agents[0]
    for k in ("agent", "run_id", "job_id", "pid", "runtime_s", "model", "effort", "cpu_ticks"):
        assert k in a, f"agent entry missing {k!r}"
    assert a["agent"] == "developer"
    assert a["run_id"] == rid
    assert a["pid"] == os.getpid()
    assert isinstance(a["runtime_s"], int)
    # alive pid -> real cpu ticks (int), basis present at envelope level
    assert isinstance(a["cpu_ticks"], int)
    assert metrics_mod.build_metrics()["clk_tck"] is not None


# --- TC-05: agent-liveness never-raise on dead/None pid --------------------
def test_tc05_dead_or_none_pid_cpu_ticks_null():
    tid = _make_task(work_item_id="ORCH-21")
    rid = _make_agent_run(task_id=tid)
    # pid=None -> cpu_ticks null; a very-unlikely-live pid -> /proc absent -> null
    _make_running_job(task_id=tid, pid=None, run_id=rid)
    _make_running_job(task_id=tid, pid=999999, run_id=rid)

    m = metrics_mod.build_metrics()
    agents = m["agents"]
    assert len(agents) == 2
    for a in agents:
        assert a["cpu_ticks"] is None  # field degraded, not an exception
        assert a["agent"] == "developer"  # other fields intact
    # whole envelope still valid
    assert m["schema_version"] == 1


def test_tc05_read_cpu_ticks_helper_none_paths():
    assert metrics_mod._read_cpu_ticks(None) is None
    assert metrics_mod._read_cpu_ticks(999999) is None
    # alive pid (this process) -> int
    assert isinstance(metrics_mod._read_cpu_ticks(os.getpid()), int)


# --- TC-06: cost aggregate -------------------------------------------------
def test_tc06_cost_aggregate_sums_and_empty_zeros():
    # empty agent_runs -> zeros, not error
    agg0 = metrics_mod.build_metrics()["cost"]["aggregate"]
    for k in ("cost_usd", "input_tokens", "output_tokens",
              "cache_read_tokens", "cache_creation_tokens"):
        assert agg0[k] == 0

    tid = _make_task(work_item_id="ORCH-30")
    _make_agent_run(task_id=tid, cost_usd=1.5, input_tokens=100, output_tokens=20,
                    cache_read_tokens=5, cache_creation_tokens=7, finished=True)
    _make_agent_run(task_id=tid, cost_usd=2.5, input_tokens=200, output_tokens=30,
                    cache_read_tokens=10, cache_creation_tokens=3, finished=True)

    agg = metrics_mod.build_metrics()["cost"]["aggregate"]
    assert agg["cost_usd"] == 4.0
    assert agg["input_tokens"] == 300
    assert agg["output_tokens"] == 50
    assert agg["cache_read_tokens"] == 15
    assert agg["cache_creation_tokens"] == 10


# --- TC-07: never-raise when a section source throws -----------------------
def test_tc07_section_source_throws_degrades_not_500(monkeypatch):
    def _boom(*a, **k):
        raise RuntimeError("simulated source failure")

    # queue counts source throws -> queue.counts null, build_metrics still returns
    monkeypatch.setattr(db, "job_status_counts", _boom)
    # cost aggregate source throws -> cost.aggregate null
    monkeypatch.setattr(db, "agent_cost_totals", _boom)
    # stages source throws -> stages []
    monkeypatch.setattr(db, "get_active_tasks_for_reconcile", _boom)

    m = metrics_mod.build_metrics()
    assert m["schema_version"] == 1  # never raised
    assert m["stages"] == []
    assert m["queue"]["counts"] is None
    assert m["cost"]["aggregate"] is None


def test_tc07_breaker_unavailable_is_null(monkeypatch):
    from src import queue_worker
    # simulate an uninitialised / broken worker breaker
    monkeypatch.setattr(queue_worker.worker.breaker, "snapshot",
                        lambda: (_ for _ in ()).throw(RuntimeError("no breaker")))
    q = metrics_mod.build_metrics()["queue"]
    assert q["breaker"] is None  # null, not 500


# --- TC-08: GET /metrics via handler returns valid JSON --------------------
def test_tc08_endpoint_returns_full_payload():
    tid = _make_task(work_item_id="ORCH-40")
    rid = _make_agent_run(task_id=tid)
    _make_running_job(task_id=tid, pid=os.getpid(), run_id=rid)

    from src import main
    payload = asyncio.run(main.metrics())
    assert payload["schema_version"] == 1
    assert isinstance(payload["stages"], list) and len(payload["stages"]) == 1
    assert isinstance(payload["agents"], list) and len(payload["agents"]) == 1
    assert "aggregate" in payload["cost"]
    assert "counts" in payload["queue"]


def test_tc08_kill_switch_minimal_body(monkeypatch):
    monkeypatch.setattr(cfg.settings, "metrics_endpoint_enabled", False, raising=False)
    from src import main
    payload = asyncio.run(main.metrics())
    assert payload == {"schema_version": 1, "enabled": False}


# --- TC-09: read-only invariant --------------------------------------------
def test_tc09_metrics_is_read_only():
    tid = _make_task(work_item_id="ORCH-50")
    rid = _make_agent_run(task_id=tid, cost_usd=1.0, input_tokens=10)
    _make_running_job(task_id=tid, pid=os.getpid(), run_id=rid)

    from src import main
    before = _db_snapshot()
    asyncio.run(main.metrics())
    asyncio.run(main.metrics())  # repeat: state must not change
    after = _db_snapshot()
    assert before == after, "/metrics must not mutate any DB state"


# --- TC-10: additivity vs existing endpoints -------------------------------
def test_tc10_existing_endpoints_intact():
    from src import main
    health = asyncio.run(main.health())
    assert health["status"] == "ok"

    status = asyncio.run(main.status())
    assert "active_tasks" in status

    queue = asyncio.run(main.queue())
    for key in ("counts", "max_concurrency", "poll_interval", "resilience",
                "reconcile", "reaper", "serial_gate", "recent"):
        assert key in queue, f"/queue lost existing key {key!r}"


# --- TC-11: empty state is valid -------------------------------------------
def test_tc11_empty_state_valid():
    m = metrics_mod.build_metrics()
    assert m["stages"] == []
    assert m["agents"] == []
    assert m["cost"]["running"] == []
    agg = m["cost"]["aggregate"]
    assert all(agg[k] == 0 for k in agg)
    counts = m["queue"]["counts"]
    assert counts["queued"] == 0 and counts["running"] == 0
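Reviewer note: the tests above pin down how the body degrades. The kill-switch yields `{"schema_version", "enabled": false}`, and a failed source yields `null` in `queue`, `cost`, `queue.counts`, `queue.breaker`, or `cost.aggregate`. A rough sketch of how a consumer might separate those cases follows. `degraded_fields` is a hypothetical name and not part of this commit; it only checks the nullable fields exercised by TC-07 and by `build_metrics()`.

```python
from typing import Any, Dict, List, Optional

# Nullable spots taken from build_metrics() defaults and the TC-07 assertions.
NULLABLE_SECTIONS = ("queue", "cost")
NULLABLE_FIELDS = (("queue", "counts"), ("queue", "breaker"), ("cost", "aggregate"))


def degraded_fields(payload: Dict[str, Any]) -> Optional[List[str]]:
    """List the null sections/fields in a /metrics body.

    Returns None when the endpoint reports the kill-switch (enabled: false),
    since there is nothing to inspect in that case.
    """
    if payload.get("enabled") is False:
        return None

    nulls: List[str] = []
    for section in NULLABLE_SECTIONS:
        if payload.get(section) is None:
            nulls.append(section)
    for section, field in NULLABLE_FIELDS:
        value = payload.get(section)
        # Only look inside sections that were actually collected.
        if isinstance(value, dict) and value.get(field) is None:
            nulls.append(f"{section}.{field}")
    return nulls
```

Note that `stages` and `agents` fall back to `[]` on failure, so a consumer cannot tell a failed source from an empty one for those two sections; this sketch does not try to.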