Step 1 ("Foundation", F2) of the self-improvement epic: formalise free-text
"lessons" from memory/ into a machine-readable `lessons` table — the foundation
for the future retrospective agent (E2), the RICE prioritiser (E3) and Стрим.
- src/lessons.py: pure never-raise observer leaf (record/get/update/snapshot),
kill-switch only, NO repo scope (observer-only; records about any repo incl.
enduro; repo cut on the read side). Slug-convention constants.
- src/db.py: additive idempotent `lessons` table in init_db() (+3 indexes);
nullable attribution columns from the start (NFR-6, _ensure_column forward-safe);
helpers record_lesson/get_lessons/update_lesson/lessons_snapshot/
lessons_recent_dup_exists (auto-dedup window).
- 4 auto-detectors (best-effort, source="auto", deduped): gate_failure
(_handle_qg_failure_rollbacks), merge_hold (_handle_merge_verify HOLD),
transient_retry (launcher._finalize_transient budget-exhaustion), deploy_degraded
(post-deploy DEGRADED -> set_repo_freeze).
- src/main.py: GET /lessons, POST /lessons, POST /lessons/{id} + read-only
`lessons` block in GET /queue; off-switch -> {"enabled": false}.
- src/config.py: lessons_enabled / lessons_query_limit_default / lessons_dedup_window_s.
- tests/test_lessons.py: TC-01..TC-12 (unit + integration), all green.
- Docs: CLAUDE.md, docs/architecture/README.md (component + schema + API), CHANGELOG.
Invariant: the journal is an OBSERVER, not a Quality Gate — STAGE_TRANSITIONS /
QG_CHECKS / check_* / machine-verdict / existing table schemas are byte-for-byte
untouched; enduro not affected. never-raise on every public fn + injection.
Refs: ORCH-098
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up ORCH-040: legacy root:root files in /repos broke worktree creation
under uid 1000 with a raw "Permission denied" (agent never started, no diagnosis).
Three additive, kill-switch-reversible layers; STAGE_TRANSITIONS / QG_CHECKS /
check_* / machine-verdict keys / DB schema are byte-for-byte unchanged.
- D1: ensure_worktree classifies the permission class and raises an actionable
RuntimeError (cause + chown command + INFRA.md ref); non-permission errors keep
the prior raw-stderr contract; kill-switch off -> contract 1:1 as before ORCH-057.
- D2: new never-raise leaf src/fs_normalize.py — scan_ownership (TTL-cached,
early-exit per root), applies()-first scope (empty CSV -> self-hosting only),
opt-in normalize() that chowns ONLY when privileged (no-op under uid 1000).
- D3: best-effort startup detect in main.lifespan (WARNING + Telegram on mismatch,
never-fatal); read-only fs_ownership block in GET /queue; POST /fs-normalize/check.
Claim is NOT blocked — the clear early outcome is delivered by D1 at launch.
- Docs/config: .env.example flags + CHANGELOG (architecture README / adr-0031 /
INFRA.md procedure already landed on the branch).
- Tests: test_fs_normalize.py, test_git_worktree_perm.py,
test_fs_normalize_startup.py, test_api_queue.py (TC-01..TC-12). Full suite green.
Refs: ORCH-057
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
FND/F1a: add a versioned read-only JSON endpoint GET /metrics that exposes the
orchestrator's own raw state for the future observability sidecar F1b — active
task stages, job queue, agent-liveness (pid/runtime/cpu_ticks), and cost/tokens.
The orchestrator emits ONLY raw signal it alone knows; thresholds/alerts/history
live in the separate sidecar (observer separated from observed, BRD §1).
- src/metrics.py: new leaf collector build_metrics() (never-raise per section,
serial_gate.snapshot() pattern); envelope schema_version/generated_at/clk_tck +
stages/queue/agents/cost. _read_cpu_ticks(pid) reads utime+stime from
/proc/<pid>/stat (null on None/dead/non-Linux pid — never raises).
- src/main.py: thin @app.get("/metrics") wrapper (style of GET /queue).
- src/db.py: read-only helpers get_running_agents() (dedicated SELECT, not an
extension of the hot-path get_running_jobs()), agent_cost_totals(),
queue_retry_stats(); job_status_counts() default dict gains the cancelled key.
- src/config.py: metrics_endpoint_enabled kill-switch (default True), env
ORCH_METRICS_ENABLED via explicit validation_alias so the documented switch
actually controls the flag.
- docs: README API table row + CHANGELOG entry (contract section already added
by architect); .env.example ORCH_METRICS_ENABLED.
Strictly read-only / never-raise: STAGE_TRANSITIONS / QG_CHECKS / check_* /
machine-verdict keys / DB schema untouched; /health//status//queue byte-for-byte.
Tests: tests/test_metrics.py (TC-01..TC-11) + env-alias tests in test_config.py.
Full suite green (1482).
Refs: ORCH-099
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reviewer P1 (ORCH-027 attempt 2): inserting the ORCH-027 changelog
block duplicated the adjacent ORCH-095 entry — its paragraph body was
repeated verbatim, corrupting a golden-source doc and another work
item's artifact (CLAUDE.md §3). Remove the duplicate half, leaving a
single ORCH-095 body. ORCH-027 entry untouched (already correct).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
measure_coverage hardcoded "python" for the coverage subprocess; the prod container
and the CI runner expose "python3" (a bare "python" may be absent), and pytest-cov
lives in exactly the running interpreter's environment. Use sys.executable so the
measurement always runs under the same interpreter as the orchestrator.
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Introduce a deterministic (no-LLM) coverage sub-gate that blocks coverage
degradation before a task branch merges into `main`. Existing gates judge only by
the FACT of passing (check_ci_green / check_tests_passed / merge-gate re-test), not
by completeness — so a batch autonomous run (ORCH-088) silently erodes coverage.
Pattern mirrors the security-gate (ORCH-022): leaf src/coverage_gate.py (never-raise)
+ thin check_coverage_gate in QG_CHECKS + _handle_coverage_gate splice in advance_stage,
run AFTER merge-gate (measured on the caught-up HEAD that lands in main) and BEFORE
image-freshness (fail before the expensive docker rebuild).
- measure_coverage: pytest --cov=src --cov-report=json in the per-branch worktree ->
line coverage %; None on tool error -> fail-open + WARNING by default (FR-6).
- compute_coverage_verdict (pure): absolute | baseline | both + epsilon (NFR-4 anti-flap);
baseline None -> bootstrap (absolute-only).
- coverage_baseline DB table (additive, CREATE TABLE IF NOT EXISTS) + ratchet-up in
_handle_merge_verify (deploy->done): atomic compare-and-set under merge-lease, never
decreases; bootstrap on first merge.
- Artefact 18-coverage-report.md (coverage_status: frontmatter, single source of truth);
GET /queue `coverage` block; FAIL -> Telegram; optional POST /coverage/baseline override.
- Flags ORCH_COVERAGE_* (kill-switch + self-hosting-only scope) -> enduro untouched;
STAGE_TRANSITIONS / existing check_* / verdict keys byte-for-byte unchanged (NFR-5/AC-8).
- pytest-cov==5.0.0 added to requirements.txt.
Tests: tests/test_coverage_gate.py (TC-01..TC-15). Frozen QG-registry anti-regress
tests + deploy-staging edge tests updated for the new sub-gate. Full suite green.
Docs: README / adr-0029 / PIPELINE_DOCS / 18-coverage-report.md template (architecture
stage) + CHANGELOG / CLAUDE.md / .env.example (this PR).
Refs: ORCH-027
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>