Merge-gate re-test runs under the orchestrator's prod env, where the
operator legitimately set ORCH_AGENT_FALLBACK_MODEL and changed
ORCH_AGENT_MODEL_DEFAULT / ORCH_AGENT_EFFORT_*. Two ORCH-41-era tests
asserted SHIPPED defaults through the env-backed settings singleton and
failed 3/3 there, while Gitea CI (clean env) stayed green. Branch
ORCH-009 touches neither src/ nor these tests - latent non-hermetic
landmine on main, detonated by the prod env change.
- test_resolve_agent_effort.py: autouse fixture now mirrors the sibling
model-file baseline (pins shipped model/fallback fields) so the
flag-assembly tests are env-independent.
- test_resolve_agent_model.py: fixture also resets agent_fallback_model;
test_fallback_model_disabled_by_default now asserts the CLASS field
default (the actual ORCH-074 ADR-001 G4 invariant: shipped default
is ""), never-break is_valid_model asserts unchanged byte-for-byte.
Clean-env behaviour is byte-equivalent (fixtures pin exactly what an
empty env yields). Full suite: 1713 passed (was 2 failed / 1711).
Refs: ORCH-009
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Operator capability to bring a NEW project online in one pass, fully
outside the runtime and the pipeline (src/** byte-exact, no kill-switch
needed — activation is an explicit human CLI run). Reference = the
orchestrator repo itself (ORCH-52b/c/d/e canons).
* onboarding/repo-skeleton/ — parametrized kit of a new repo: 6 agent
prompt templates per canon 52d/92 (5 ru + deployer en with the
shared-host guardrail frame), reviewer doc-gate (REQUEST_CHANGES),
CLAUDE.md passport, AGENTS.md, CONTRIBUTING.md, docs/ skeleton with
mandatory operations/INFRA.md, .env.example; {{NAME}} placeholders +
stdlib render, dictionary onboarding/placeholders.json (bijection
held by tests). Canon is NOT forked: docs/_templates + docs/_standards
are live-copied from the checkout at materialization time (BR-2/D3).
* scripts/onboard_project.py — plan (default, GET-only, zero mutations)
/ apply (idempotent ensure, no delete ops at all) / verify (registry
round-trip via the actual projects._parse_projects_json, all 22 state
names incl. fail-closed Confirm Deploy/STOP, labels, webhook, kit
completeness, unresolved-placeholder scan). Closed read-only src
import list (ADR D4); state groups fixed per ADR D5 (STOP→cancelled,
terminal groups only Done/Cancelled/STOP); Gitea webhook reuses the
single global ORCH_GITEA_WEBHOOK_SECRET (TR-6); initial push ONLY
into a freshly created empty repo (INV-4 untouched); never restarts
prod / never edits .env / deletes nothing (NFR-2); secrets masked
(NFR-3); Plane CE API gaps degrade to manual-step (fail-safe).
* docs/operations/ONBOARDING.md runbook + SETUP_WEBHOOKS.md generalized
per-repo; CLAUDE.md / docs/architecture/README.md / CHANGELOG.md
updated in the same PR (golden source).
* Anti-drift tests: test_onboarding_kit.py / test_onboarding_script.py
(mocked, no network) / test_onboarding_invariants.py (snapshots of
STAGE_TRANSITIONS/QG_CHECKS, closed CLI import list, reference
.openclaw/agents/ prompts untouched). Full regression: 1713 passed.
Refs: ORCH-009
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Staging suite (docker exec orchestrator-staging, port 8501) exit 0.
All REAL checks green; C9a/C9b INFRA-WAIVED (ORCH-061).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Merge-gate auto_rebase_onto_main bounced this branch back: ORCH-100 landed
in main first and claimed global ADR number adr-0033 (adr-0033-sidecar-watchdog),
while this branch had created adr-0033-lessons-journal. Resolved the genuine
land race:
- rebased feature/ORCH-098-fnd onto current origin/main (linear history)
- resolved docs/architecture/README.md component-list conflict — both the
Lessons-journal and Sidecar-watchdog bullets now coexist
- renamed docs/architecture/adr/adr-0033-lessons-journal.md →
adr-0034-lessons-journal.md (next free global ADR number) + fixed the
in-file header
- updated all cross-references (CLAUDE.md, README.md, work-item ADR-001,
12-review.md) 0033→0034 for the lessons journal; ORCH-100's adr-0033
(sidecar) left intact
- recovered the ORCH-098 CHANGELOG entry silently dropped by the rebase
auto-merge (now above ORCH-100, ADR ref corrected to 0034)
No code semantics changed; src/** auto-merged cleanly (ORCH-100 did not
touch src/**). ruff: n/a locally (CI). pytest tests/ -q: 1630 passed.
Refs: ORCH-098
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Step 1 ("Foundation", F2) of the self-improvement epic: formalise free-text
"lessons" from memory/ into a machine-readable `lessons` table — the foundation
for the future retrospective agent (E2), the RICE prioritiser (E3) and Стрим.
- src/lessons.py: pure never-raise observer leaf (record/get/update/snapshot),
kill-switch only, NO repo scope (observer-only; records about any repo incl.
enduro; repo cut on the read side). Slug-convention constants.
- src/db.py: additive idempotent `lessons` table in init_db() (+3 indexes);
nullable attribution columns from the start (NFR-6, _ensure_column forward-safe);
helpers record_lesson/get_lessons/update_lesson/lessons_snapshot/
lessons_recent_dup_exists (auto-dedup window).
- 4 auto-detectors (best-effort, source="auto", deduped): gate_failure
(_handle_qg_failure_rollbacks), merge_hold (_handle_merge_verify HOLD),
transient_retry (launcher._finalize_transient budget-exhaustion), deploy_degraded
(post-deploy DEGRADED -> set_repo_freeze).
- src/main.py: GET /lessons, POST /lessons, POST /lessons/{id} + read-only
`lessons` block in GET /queue; off-switch -> {"enabled": false}.
- src/config.py: lessons_enabled / lessons_query_limit_default / lessons_dedup_window_s.
- tests/test_lessons.py: TC-01..TC-12 (unit + integration), all green.
- Docs: CLAUDE.md, docs/architecture/README.md (component + schema + API), CHANGELOG.
Invariant: the journal is an OBSERVER, not a Quality Gate — STAGE_TRANSITIONS /
QG_CHECKS / check_* / machine-verdict / existing table schemas are byte-for-byte
untouched; enduro not affected. never-raise on every public fn + injection.
Refs: ORCH-098
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
test_queue.py::TestRetry::test_finalize_job_requeue_then_fail failed in the
self-hosting environment because launcher._finalize_job classifies a non-zero
exit by reading the tail of <settings.runs_dir>/<run_id>.log. settings.runs_dir
defaults to the live prod dir /app/data/runs, which on the host holds REAL
accumulated agent logs; a real 2.log containing "429" flips the expected
'permanent' classification to 'transient', requeueing the job instead of
marking it 'failed'. This is ambient prod pollution, not a code fault.
Add an autouse _isolate_runs_dir fixture (mirroring _no_telegram /
_disable_merge_verify) that redirects settings.runs_dir to a per-test tmp dir
so _run_log_path() resolves to a non-existent file and classify_log_file()
returns the documented 'permanent' default. Full suite: 1617 passed. src/**
untouched.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the `watchdog/` package (thin Python-3.12 stdlib-only daemon) and the
`orchestrator-watchdog` compose service — the brain half of the domain-0
observability pair. F1a (ORCH-099) exposes GET /metrics raw signal; F1b reads it,
augments with host / container / dependency probes, runs each signal through a
generalised pure decision function (decide(signal_active, prev, now, cooldown),
a strict superset of disk_watchdog.decide_action) with per-signal in-memory
dedup/throttle/recovery, and alerts over its OWN independent Telegram channel.
Key properties (ADR-001):
- Observer separated from observed: separate container; /metrics not answering is
itself the master `orch_down` alarm (debounced K ticks — no flap on a hiccup).
- Strictly read-only: docker.sock GET-only + mounted :ro (double guard), host
paths :ro, no DB/disk writes, no process control — self-hosting-safe.
- never-raise on three levels (per-source/per-tick/per-send) + WATCHDOG_ENABLED
kill-switch (disabled -> inert idle-loop, not exit).
- Disk anti-duplicate (D6): disk_watchdog (ORCH-063) stays sole owner of the 85%
alert; sidecar carries orch_down + an opt-in 97% ceiling (default off).
- NO import from src/** (C-1); src/**, STAGE_TRANSITIONS, QG_CHECKS, check_*, DB
schema — untouched. env_file optional so a missing .env.watchdog never breaks
`docker compose up` for the prod orchestrator.
Tests: tests/watchdog/ (TC-01…TC-13) + full tests/ regression green (TC-14).
Docs: CHANGELOG, .env.example canon (WATCHDOG_*); architecture README + adr-0033
authored at the architecture stage.
Refs: ORCH-100
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>