diff --git a/docs/eval_system.md b/docs/eval_system.md index 093f572..8a27040 100644 --- a/docs/eval_system.md +++ b/docs/eval_system.md @@ -1,287 +1,283 @@ -# Eval System +# Eval System — User Guide -LLM-as-judge evaluation of Navi sessions. Tracks quality dynamics over time without dedicated test scenarios — analysis runs against real, unmodified sessions. +LLM-as-judge quality evaluation for Navi sessions. It analyzes real conversations (not test scenarios) along 7 axes, averaging the scores of 3 independent experts. -Status: **spec / not implemented**. +**Status:** implemented (phases 1–5). UI: `/debug/eval/`. CLI: `python -m debug.eval`. -## Goals +--- -1. See how Navi's quality changes over time across multiple axes. -2. Detect regressions after prompt/model/architecture changes. -3. Surface concrete sessions for inspection (best, worst, biggest deltas). -4. No special test fixtures — evaluation runs against real usage. +## How to open -## Non-goals - -- Absolute "correctness" of scores. We care about **dynamics**, not whether a 75 is "really a 75". -- Real-time scoring during sessions. Eval is offline. -- Verification of factual claims (judge can't run code; this is a known limit). - -## Architecture - -Three parts, deliberately decoupled: - -1. **In-app feedback signal** — like/dislike per assistant response in the main webclient, stored alongside messages. -2. **Eval runner (CLI)** — standalone, runs offline against PostgreSQL, evaluates accumulated sessions, writes scores to eval tables. Does not require the FastAPI server to be running. -3. **Eval UI (debug page)** — read-only SPA for browsing sessions / scores / charts, plus a button to trigger an eval run on the server. Lives under `debug/eval/`, served as static and pulls data through a small REST namespace `/api/eval/...`. - -### Directory layout - -Everything for the eval system lives under `debug/eval/`. 
The directory contains both the standalone Python backend (CLI + REST router) and the frontend SPA. `debug/eval/` is a Python package (`__init__.py`) so the CLI can be invoked as `python -m debug.eval` from the project root. +The web interface is available at: ``` -debug/eval/ - __init__.py - cli.py # entry: python -m debug.eval ... - judge.py # judge orchestration (3 experts, averaging) - schema.py # Pydantic models for scores / requests - db.py # asyncpg queries for feedback + evaluations - api.py # FastAPI APIRouter, mounted from navi/main.py - # GET /api/eval/sessions - # GET /api/eval/sessions/{id} - # GET /api/eval/stats - # POST /api/eval/run (background task) - # POST /api/eval/feedback (like/dislike) - index.html # frontend SPA (matches debug/index.html style) - app.js - style.css - prompts/ - expert_strict_critic.txt - expert_pragmatist.txt - expert_tech_lead.txt - rubric_v1.yaml # axes + anchors (frozen per version) - schema.sql # postgres migration (eval_v1) - README.md # ops doc — running CLI, applying migration +http://:8000/debug/eval ``` -`navi/main.py` adds two lines: include the eval router and serve `debug/eval/index.html` at `/debug/eval/`. Everything else stays out of `navi/`. +Requires PostgreSQL (`DATABASE_URL` in `.env`). If the database is not configured, the system returns 503. -The webclient (`webclient/`) gets a small addition: like/dislike thumbs on each assistant message that POST to `/api/eval/feedback`. That's the only touchpoint outside `debug/eval/`. +--- -### What the judge sees +## What is evaluated -Maximum signal — the judge gets the full session, no filtering, no compression-summary substitution. +The judge (the same LLM that Navi uses) reads the **full session transcript**: all user messages, assistant replies, tool calls, tool results, thinking blocks, and planning logs. No messages are cut or replaced with compressed summaries, so the judge sees the actual work process. 
-- **Full transcript** in original order: user / assistant / tool calls + tool results / thinking blocks / sub-agent transcripts (recursively, with depth markers) / planning phases (Phase 1 analysis, Phase 2 review, Phase 3 plan) — exactly as they appeared. -- **Per-message feedback ratings** inlined next to each assistant message ("[user reaction: 👍]" / "[user reaction: 👎]" / nothing). -- **Aggregated counts** at the top: total likes, total dislikes. -- **Profile metadata**: which profile ran, model used, planning flags state at the time. -- **Session timing**: start, end, duration, iteration count, total tokens. +### Signals the judge sees -We do **not** substitute compressed summaries for the original messages — that would hide the actual work and only let the judge grade the final outcome. The point is to grade the **process**. +1. **Full transcript** in original order. +2. User **likes/dislikes**, inlined next to each assistant message (`👍` / `👎`). +3. **Session metadata**: profile, model, thinking-mechanic flags, duration, iteration count, token count. +4. **Planning logs** (if the profile uses planning). -If a session is too long for the judge's context, the runner logs a warning and skips it (or chunks by user-turn group with explicit gaps — TBD; v1 just skips). +--- -## Signal sources +## 7 evaluation axes (Rubric v1) -When evaluating a session, the judge LLM has access to: +| Axis | What is evaluated | Scale | +|---|---|---| +| `task_complexity` | Complexity of the user's request, judged **only from the question**, not from Navi's answer | 10–100+ | +| `goal_completion` | Whether the user got what they wanted. Looks at the final answer and the user's reactions | 10–100+ | +| `tool_usage_quality` | Whether the right tools were chosen, with no unnecessary or repeated calls | 10–100+ | +| `efficiency` | Ratio of iterations to result. 
Loops, dead ends, and rework are penalized | 10–100+ | +| `communication` | Clarity, honesty, absence of hallucinations, conciseness | 10–100+ | +| `subagent_orchestration` | Quality of delegation to sub-agents (`spawn_agent`). **Null** if no sub-agents were used | 10–100+ or null | +| `self_extension` | Quality of self-extension (`write_tool`, `reload_tools`). **Null** if no tools were written | 10–100+ or null | -- Full session transcript (user / assistant / tool calls / thinking). -- Per-message likes/dislikes from the user. -- The user's own follow-up text in chat ("не работает", "переделай", "спасибо") — judge extracts implicit signal. +### Scale anchors (fixed in `rubric_v1`) -Aggregated like/dislike counts are computed before judge runs. If `likes > dislikes` → tilt toward "successful". If `dislikes > likes` → tilt toward "unsuccessful". If both 0 → judge infers from transcript only. +- **10**: catastrophe / trivial (depends on the axis) +- **30**: weak / simple +- **50**: average +- **75**: good / complex +- **100**: at the limit of what Navi can do today -## Axes +The scale is **open-ended**: if the judge sees something beyond 100, it may score 120+. That score becomes an anchor for future rubric versions. 
-Fixed set, scored 0-100 (no hard upper limit — see "Open scale" below): +--- -| Axis | Meaning | + +## 3 experts and averaging + +Each session is evaluated by **3 parallel calls** with different system prompts: + +| Expert | Slant | |---|---| -| `task_complexity` | Difficulty of what was asked, judged from the user's request alone | -| `goal_completion` | Did the user end up with what they wanted | -| `tool_usage_quality` | Right tools chosen, no thrashing, no unnecessary calls | -| `efficiency` | Iterations vs result; loops, dead-ends, redundancy | -| `communication` | Clarity of replies, no hallucinations, no excessive verbosity | -| `subagent_orchestration` | Quality of sub-agent delegation (null if no sub-agents used) | -| `self_extension` | Quality of write_tool / reload_tools usage (null if not used) | +| `strict_critic` | Looks for errors, penalizes any slip-up, scores conservatively | +| `pragmatist` | "Did the user get the result?" The outcome matters more than the path | +| `tech_lead` | Architecture, tool choice, efficiency, technical decisions | -The judge sees the planning structure as part of the transcript, but the rubric does **not** ask for separate scores per planning phase. The judge instructions deliberately stay at "did the agent reason / execute / communicate well" — the architectural details of how planning runs are not evaluated, since those are the very things we're trying to measure progress on. Coupling the rubric to current planning shape would lock the eval to today's mechanics. +The final score for each axis is the **arithmetic mean** of the three experts. Nullable axes (`subagent_orchestration`, `self_extension`) are averaged only over non-null values. -Scoring scale anchors (designed once, frozen as `rubric_v1`): +The spread between experts is also stored: a large spread means a contested/noisy session. -- **10** — trivial, near-zero effort. -- **30** — straightforward, one tool, one step. -- **50** — moderate, 2-4 steps, planning helpful. 
-- **75** — complex, multi-tool with planning, easy to fail. -- **100** — at the limit of what Navi can do today (full project tasks, multiple sub-agents, self-extension). +If an expert returns invalid JSON, **one retry** is made with a corrective message. If the retry also fails, the session is marked failed for this run. -Anchors include **2-3 real session examples** at each level (user picks them once from accumulated history). +--- -### Open scale +## Versioning -Scale is **not capped at 100**. If the judge encounters a task harder than any 100-anchor, it scores 120, 150, etc. Those become future anchors when we expand the rubric. +Each run is tied to two versions: -## Experts (multi-judge averaging) +- **`judge_version`**: version of the judge code (prompts, rendering logic, retry policy) +- **`rubric_version`**: version of the rubric (axes, anchors, descriptions) -Each session is evaluated by **3 different expert prompts**, then averaged. Different prompts produce different blind spots; averaging reduces variance and bias. +When you change an expert prompt or the rubric, **bump the version** in the code. Old scores stay in the database; new ones are written under the new version. -| Expert | Prompt slant | + +### Session statuses + +| Status | Meaning | |---|---| -| `strict_critic` | Looks for flaws, scores conservatively, penalizes weakly any slip-up | -| `pragmatist` | "Did the user end up with what they wanted, regardless of the path?" | -| `tech_lead` | Architecture / tool choice / efficiency, focused on technical decisions | +| `evaluated` | A full run exists for the current `judge_version` + `rubric_version` (all 3 experts) | +| `pending` | No evaluations at all | +| `stale` | Evaluations exist, but for older judge or rubric versions. A re-run is needed | -All three see the same transcript and the same rubric. Final per-axis score = mean across experts. Spread between experts is also stored — large spread = noisy/contested session. +--- -## Storage (PostgreSQL) +## Interface: 4 tabs -Append-only. 
Multiple evals per session are normal (re-evaluation when judge upgrades, rubric changes, or you just want a fresh take). +### 1. Sessions (session list) -```sql --- Per-message user feedback (drives the like/dislike signal) -CREATE TABLE message_feedback ( - message_id UUID PRIMARY KEY REFERENCES messages(id), - session_id UUID NOT NULL, - rating SMALLINT NOT NULL, -- +1 / -1 - created_at TIMESTAMPTZ NOT NULL DEFAULT now() -); -CREATE INDEX ON message_feedback(session_id); +Table of all sessions, newest first. Columns: +- Start time, profile, ID, name +- Message count +- 👍 / 👎 (total feedback) +- Evaluation status (`evaluated` / `pending` / `stale`) +- Averages for `goal_completion`, `tool_usage_quality`, `communication` --- One row per (session, expert, eval_run) -CREATE TABLE evaluations ( - id UUID PRIMARY KEY, - session_id UUID NOT NULL, - eval_run_id UUID NOT NULL, -- groups the 3 experts of one run - eval_date TIMESTAMPTZ NOT NULL, - judge_model TEXT NOT NULL, -- e.g. "gemma4:31b-cloud" - judge_version TEXT NOT NULL, -- snapshotted version string - rubric_version TEXT NOT NULL, -- "v1", "v2", ... - expert_id TEXT NOT NULL, -- "strict_critic" | "pragmatist" | "tech_lead" - scores JSONB NOT NULL, -- {task_complexity: 65, goal_completion: 80, ...} - comment TEXT NOT NULL -- free-form "what stood out" -); -CREATE INDEX ON evaluations(session_id); -CREATE INDEX ON evaluations(eval_date); -CREATE INDEX ON evaluations(judge_version, rubric_version); +Filters: profile, status, row limit. 
--- View: averaged scores per session per eval_run -CREATE VIEW evaluation_summary AS -SELECT - session_id, - eval_run_id, - eval_date, - judge_version, - rubric_version, - jsonb_object_agg( - axis, - avg_score - ) AS avg_scores -FROM ( - SELECT - session_id, eval_run_id, eval_date, judge_version, rubric_version, - key AS axis, - AVG((value)::numeric) AS avg_score - FROM evaluations, jsonb_each_text(scores) - GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version, key -) t -GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version; -``` +Clicking a row opens Detail with that session's ID filled in. -## Judge model policy +### 2. Detail (session details) -- Judge model is **pinned** in eval config. Don't change casually. -- When you do upgrade the judge, **re-evaluate the entire archive** with the new judge. Old scores stay (different `judge_version` row), new scores are the new baseline. -- Comparisons across `judge_version` boundaries are not meaningful — visualizations should respect this. +- Session metadata (ID, profile, time, messages, feedback) +- All evaluation runs for this session (newest first) +- Each run is a table of scores for the 7 axes, with one column per expert plus an `avg` column +- Expert comments below the table -## Rubric versioning +The **"evaluate this session"** button starts a run for this session only. -Same policy. Rubric changes (new anchors, reworded prompts) bump `rubric_version`. Old rows preserved, new ones are the live series. +### 3. Stats (statistics) + +Weekly per-axis averages over the selected period (7–90 days). + +The **"split by complexity bucket"** option splits results by task complexity (`0-25`, `26-50`, `51-75`, `76+`). It helps catch selection bias, for example when the average score rose only because simpler tasks started coming in. + +### 4. 
Run (start an evaluation) + +Form for triggering a background run: + +| Field | Description | +|---|---| +| `scope` | `unevaluated`: only unevaluated sessions, `all`: everything (re-eval), `session`: a single session | +| `session id` | Only for `scope=session` | +| `since` | Only sessions started after this date | +| `limit` | Max. number of sessions | +| `model` | Judge model (default: `gemma4:31b-cloud`) | +| `backend` | LLM backend (default: `ollama`) | + +After launch, an **Active run** panel appears showing progress: how many sessions have been processed, the status of each (`pending` / `running` / `ok` / `failed`), and average scores. + +The UI polls every 2.5 seconds. When the run finishes, the history at the bottom is updated. + +--- ## CLI -Standalone, no server dependency. +An alternative to the UI: run from the terminal. It does not require a running server, but does require access to PostgreSQL. ```bash -# Evaluate all unevaluated sessions (with current pinned judge + rubric) -python -m navi.eval run +# Evaluate all unevaluated sessions +python -m debug.eval run -# Re-evaluate everything (after judge or rubric change) -python -m navi.eval run --re-evaluate-all +# Evaluate a single session +python -m debug.eval run --session -# Evaluate a single session -python -m navi.eval run --session +# Re-evaluate everything (after a judge/rubric change) +python -m debug.eval run --re-evaluate-all -# Limit to recent -python -m navi.eval run --since 2026-04-01 +# Only sessions after a date +python -m debug.eval run --since 2026-04-01 -# Show eval for one session -python -m navi.eval show +# Dry run: show what would be evaluated without calling the LLM +python -m debug.eval run --dry-run -# Aggregate stats -python -m navi.eval stats --days 30 -python -m navi.eval stats --days 30 --by-complexity-bucket +# Change the judge model +python -m debug.eval run --model qwen3.6:27b + +# Show a session's scores +python -m debug.eval show + +# Statistics (Phase 4, currently a stub) +python -m debug.eval stats --days 30 +python -m debug.eval stats --days 30 --csv ``` -`stats` exports CSV by default; 
visualization is a separate concern (see below). +--- -## UI (`debug/eval/index.html`) +## Feedback: likes/dislikes -Single-page debug SPA in the same style as the existing `debug/index.html` (dark mono theme, no framework). Tabbed layout: +In the main chat (not in the eval UI), every assistant message has 👍 / 👎. A click sends: -### Tab 1 — Sessions -Paginated table of all sessions, newest first. Columns: started_at, profile, turns count, likes / dislikes, last avg score (or "—"), eval status (`evaluated rubric_v1` / `pending` / `stale judge_v1 → v2`). Row click → Tab 2 with that session preselected. +``` +POST /eval/feedback +{ "session_id": "...", "message_index": N, "rating": 1 } +``` -Filters at top: profile, date range, "show only unevaluated", "show only stale". +- `rating: 1`: like +- `rating: -1`: dislike +- `rating: 0`: clear -### Tab 2 — Session detail -Two-pane layout. Left: transcript (collapsed by default; user / assistant / tool-call / sub-agent indented). Right: eval results. +The index is the message's position in `session.messages` (display history, append-only, so indices are stable). -- All eval runs for this session listed (most recent first), each expandable. -- Inside an eval run: 3 expert blocks side-by-side with their per-axis scores, the spread, and free-form comment. -- Avg row at top of run with `(judge_version, rubric_version, eval_date)`. -- Action button: "Re-evaluate this session". +The judge sees these reactions in the transcript: `👍` / `👎` next to the corresponding assistant message. -### Tab 3 — Stats -Charts (server-rendered SVG or simple canvas, no chart library): +--- -1. Average score per axis over time — weekly rolling mean. -2. Score by complexity bucket (`0-25`, `26-50`, `51-75`, `76+`) — per-bucket trend, catches selection bias when overall score moves. -3. Likes / dislikes ratio per week — orthogonal sanity check. -4. Top-K worst sessions in the last 7 days — clickable, jumps to Tab 2. 
+## REST API -Filter bar: judge_version + rubric_version (mixing across versions disabled by default). +All endpoints are under the `/eval` prefix: -### Tab 4 — Run -Trigger an eval run. -- Form: scope (`all unevaluated` / `since date` / `single session id` / `re-evaluate all`), max sessions, dry-run checkbox. -- Submit → POST `/api/eval/run`, server kicks off the CLI as a background task and returns a `run_id`. -- Live log panel below subscribes to a small WS or SSE stream and prints progress: "session N/M, expert K/3, scores …". -- Run history table at the bottom: past runs with timestamp, count of sessions, judge_version, status. +| Method | Path | Description | +|---|---|---| +| POST | `/eval/feedback` | Set or clear a like or dislike | +| GET | `/eval/feedback/{session_id}` | List feedback for a session | +| GET | `/eval/sessions` | List sessions with scores and statuses | +| GET | `/eval/sessions/{session_id}` | Session details + all evaluation runs | +| GET | `/eval/stats` | Aggregated statistics | +| POST | `/eval/run` | Start a background run | +| GET | `/eval/run/{run_id}` | Status of a specific run | +| GET | `/eval/runs` | History of all runs | -CSV export available on Tab 3 for offline plotting. +--- -## Implementation phases +## Storage (PostgreSQL) -1. **Phase 1 — Feedback signal** - - `message_feedback` postgres table + migration. - - Webclient UI: thumbs up/down on each assistant message, REST POST to `/api/eval/feedback`. - - Endpoint `POST /api/eval/feedback {message_id, rating}` — upsert. -2. **Phase 2 — Eval backend skeleton** - - `navi/eval/` package with CLI entry point (`python -m navi.eval`). - - `evaluations` table + migration. - - Judge prompt templates per expert (`prompts/expert_*.txt`). - - Rubric anchors as YAML (`prompts/rubric_v1.yaml`) — anchor examples filled in by user before going live. -3. 
**Phase 3 — Run + store** - - `run` command: pick unevaluated sessions, render full transcript, fan out to 3 experts, validate JSON output against pydantic schema, persist all expert rows under one `eval_run_id`. -4. **Phase 4 — Read endpoints** - - `/api/eval/sessions`, `/sessions/{id}`, `/stats` — read-only, used by debug UI. - - `/api/eval/run` (POST) — kicks off CLI in a background task, returns `run_id`. SSE/WS stream for live progress. -5. **Phase 5 — Debug UI** - - `debug/eval/index.html` + `app.js` + `style.css` in the existing debug-SPA style. - - All four tabs (Sessions / Detail / Stats / Run) wired to the endpoints above. +Tables are created lazily on first connection (`debug/eval/schema.sql`): -CSV export from `python -m navi.eval stats --csv` is also available as a pure-CLI path for offline plotting. +```sql +-- User feedback +CREATE TABLE message_feedback ( + session_id TEXT NOT NULL, + message_index INTEGER NOT NULL, + rating SMALLINT NOT NULL CHECK (rating IN (-1, 1)), + created_at TIMESTAMPTZ NOT NULL DEFAULT now(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), + PRIMARY KEY (session_id, message_index) +); -## Costs / constraints +-- Judge evaluations +CREATE TABLE evaluations ( + id UUID PRIMARY KEY, + session_id TEXT NOT NULL, + eval_run_id UUID NOT NULL, -- groups the 3 experts of one run + eval_date TIMESTAMPTZ NOT NULL DEFAULT now(), + judge_model TEXT NOT NULL, + judge_version TEXT NOT NULL, + rubric_version TEXT NOT NULL, + expert_id TEXT NOT NULL, -- strict_critic | pragmatist | tech_lead + scores JSONB NOT NULL, -- {task_complexity: 65, ...} + comment TEXT NOT NULL DEFAULT '' +); +``` -- 3 experts × full session transcript per run. For 50-turn sessions with 50k+ token contexts that's 3 large LLM calls per session. Plan to run overnight on a small batch, not in real time. -- Judge calls go through the same backend stack (`FallbackOllamaBackend`) — so multi-server fallback applies. 
-- Eval runner should respect a `--max-tokens-per-session` guard so a runaway transcript doesn't burn the queue. +SQLite is **not supported** for the eval system; PostgreSQL only. -## Known limits / open questions +--- -- No verification of factual or code correctness — judge sees only the transcript. For "did this code actually work?" we'd need separate runtime checks; out of scope here. -- Judge bias toward verbose / confident answers is not fully mitigated by 3 experts — partial only. -- Calibration set (manual scoring of N sessions to validate judge against user) is **deliberately skipped** — we only need dynamics, not absolute correctness. Re-open if the trends turn out to be uninterpretable. -- Rubric anchors must be set with care; once the archive is large, changing the rubric forces re-eval of everything. +## Cost + +3 experts × the full transcript per session. For long sessions (50+ messages, 50k+ context tokens) that means 3 large LLM calls. Running batches overnight rather than in real time is recommended. + +A run uses the same `FallbackOllamaBackend` as the main agent, so multi-server fallback also applies to the judge. + +--- + +## Known limitations + +- The judge does not verify facts or run code; it sees only the transcript. "Does this code actually work?" is out of scope. +- The judge's systematic bias toward verbose/confident answers is partly, but not fully, offset by the 3 experts. +- Scores are **relative** (dynamics), not absolute. We deliberately do not calibrate the judge by hand, because we care about trends rather than the exact value of "75 vs 80". +- Sessions that are too long may not fit in the judge's context. Currently the run fails with an error; there is no chunking. 
+ +--- + +## System files + +``` +debug/eval/ + api.py # FastAPI router (GET/POST endpoints) + cli.py # CLI: run, show, stats + db.py # asyncpg: feedback + evaluations + judge.py # Session rendering, 3 experts, averaging + runner.py # Background runner (async tasks) + schema.py # Pydantic models + schema.sql # DDL for PostgreSQL + index.html # SPA UI (4 tabs) + prompts/ + rubric_v1.yaml # Rubric with anchors + expert_strict_critic.txt # System prompt for the strict critic + expert_pragmatist.txt # System prompt for the pragmatist + expert_tech_lead.txt # System prompt for the tech lead +``` + +The judge and all helper functions live in `debug/eval/`. None of these modules is imported into the agent's production runtime; the eval system is fully self-contained. diff --git a/docs/eval_system_design.md b/docs/eval_system_design.md new file mode 100644 index 0000000..444d66d --- /dev/null +++ b/docs/eval_system_design.md @@ -0,0 +1,287 @@ +# Eval System — Design Spec + +LLM-as-judge evaluation of Navi sessions. Tracks quality dynamics over time without dedicated test scenarios — analysis runs against real, unmodified sessions. + +Status: **spec / not implemented**. + +## Goals + +1. See how Navi's quality changes over time across multiple axes. +2. Detect regressions after prompt/model/architecture changes. +3. Surface concrete sessions for inspection (best, worst, biggest deltas). +4. No special test fixtures — evaluation runs against real usage. + +## Non-goals + +- Absolute "correctness" of scores. We care about **dynamics**, not whether a 75 is "really a 75". +- Real-time scoring during sessions. Eval is offline. +- Verification of factual claims (judge can't run code; this is a known limit). + +## Architecture + +Three parts, deliberately decoupled: + +1. **In-app feedback signal** — like/dislike per assistant response in the main webclient, stored alongside messages. +2. 
**Eval runner (CLI)** — standalone, runs offline against PostgreSQL, evaluates accumulated sessions, writes scores to eval tables. Does not require the FastAPI server to be running. +3. **Eval UI (debug page)** — read-only SPA for browsing sessions / scores / charts, plus a button to trigger an eval run on the server. Lives under `debug/eval/`, served as static and pulls data through a small REST namespace `/api/eval/...`. + +### Directory layout + +Everything for the eval system lives under `debug/eval/`. The directory contains both the standalone Python backend (CLI + REST router) and the frontend SPA. `debug/eval/` is a Python package (`__init__.py`) so the CLI can be invoked as `python -m debug.eval` from the project root. + +``` +debug/eval/ + __init__.py + cli.py # entry: python -m debug.eval ... + judge.py # judge orchestration (3 experts, averaging) + schema.py # Pydantic models for scores / requests + db.py # asyncpg queries for feedback + evaluations + api.py # FastAPI APIRouter, mounted from navi/main.py + # GET /api/eval/sessions + # GET /api/eval/sessions/{id} + # GET /api/eval/stats + # POST /api/eval/run (background task) + # POST /api/eval/feedback (like/dislike) + index.html # frontend SPA (matches debug/index.html style) + app.js + style.css + prompts/ + expert_strict_critic.txt + expert_pragmatist.txt + expert_tech_lead.txt + rubric_v1.yaml # axes + anchors (frozen per version) + schema.sql # postgres migration (eval_v1) + README.md # ops doc — running CLI, applying migration +``` + +`navi/main.py` adds two lines: include the eval router and serve `debug/eval/index.html` at `/debug/eval/`. Everything else stays out of `navi/`. + +The webclient (`webclient/`) gets a small addition: like/dislike thumbs on each assistant message that POST to `/api/eval/feedback`. That's the only touchpoint outside `debug/eval/`. + +### What the judge sees + +Maximum signal — the judge gets the full session, no filtering, no compression-summary substitution. 
+ +- **Full transcript** in original order: user / assistant / tool calls + tool results / thinking blocks / sub-agent transcripts (recursively, with depth markers) / planning phases (Phase 1 analysis, Phase 2 review, Phase 3 plan) — exactly as they appeared. +- **Per-message feedback ratings** inlined next to each assistant message ("[user reaction: 👍]" / "[user reaction: 👎]" / nothing). +- **Aggregated counts** at the top: total likes, total dislikes. +- **Profile metadata**: which profile ran, model used, planning flags state at the time. +- **Session timing**: start, end, duration, iteration count, total tokens. + +We do **not** substitute compressed summaries for the original messages — that would hide the actual work and only let the judge grade the final outcome. The point is to grade the **process**. + +If a session is too long for the judge's context, the runner logs a warning and skips it (or chunks by user-turn group with explicit gaps — TBD; v1 just skips). + +## Signal sources + +When evaluating a session, the judge LLM has access to: + +- Full session transcript (user / assistant / tool calls / thinking). +- Per-message likes/dislikes from the user. +- The user's own follow-up text in chat ("не работает", "переделай", "спасибо") — judge extracts implicit signal. + +Aggregated like/dislike counts are computed before judge runs. If `likes > dislikes` → tilt toward "successful". If `dislikes > likes` → tilt toward "unsuccessful". If both 0 → judge infers from transcript only. 
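
The tilt rule described at the end of "Signal sources" can be sketched in Python. This is a minimal illustration under stated assumptions, not the runner's actual code: the function name `feedback_tilt`, the returned labels, and the input shape (an iterable of +1 / -1 ratings) are all hypothetical. The spec only covers the "both 0" case; treating equal non-zero counts the same way is an assumption.

```python
from typing import Iterable


def feedback_tilt(ratings: Iterable[int]) -> str:
    """Turn per-message ratings (+1 like, -1 dislike) into a hint for the judge.

    Hypothetical helper: the name and the returned labels are illustrative only.
    """
    ratings = list(ratings)
    likes = sum(1 for r in ratings if r > 0)
    dislikes = sum(1 for r in ratings if r < 0)

    if likes > dislikes:
        return "successful"
    if dislikes > likes:
        return "unsuccessful"
    # No feedback at all, or a tie (assumption: ties carry no tilt),
    # so the judge infers the outcome from the transcript only.
    return "transcript_only"
```

A caller would compute this hint once per session, before the judge runs, and place it next to the aggregated like/dislike counts at the top of the rendered transcript.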
+ +## Axes + +Fixed set, scored 0-100 (no hard upper limit — see "Open scale" below): + +| Axis | Meaning | +|---|---| +| `task_complexity` | Difficulty of what was asked, judged from the user's request alone | +| `goal_completion` | Did the user end up with what they wanted | +| `tool_usage_quality` | Right tools chosen, no thrashing, no unnecessary calls | +| `efficiency` | Iterations vs result; loops, dead-ends, redundancy | +| `communication` | Clarity of replies, no hallucinations, no excessive verbosity | +| `subagent_orchestration` | Quality of sub-agent delegation (null if no sub-agents used) | +| `self_extension` | Quality of write_tool / reload_tools usage (null if not used) | + +The judge sees the planning structure as part of the transcript, but the rubric does **not** ask for separate scores per planning phase. The judge instructions deliberately stay at "did the agent reason / execute / communicate well" — the architectural details of how planning runs are not evaluated, since those are the very things we're trying to measure progress on. Coupling the rubric to current planning shape would lock the eval to today's mechanics. + +Scoring scale anchors (designed once, frozen as `rubric_v1`): + +- **10** — trivial, near-zero effort. +- **30** — straightforward, one tool, one step. +- **50** — moderate, 2-4 steps, planning helpful. +- **75** — complex, multi-tool with planning, easy to fail. +- **100** — at the limit of what Navi can do today (full project tasks, multiple sub-agents, self-extension). + +Anchors include **2-3 real session examples** at each level (user picks them once from accumulated history). + +### Open scale + +Scale is **not capped at 100**. If the judge encounters a task harder than any 100-anchor, it scores 120, 150, etc. Those become future anchors when we expand the rubric. + +## Experts (multi-judge averaging) + +Each session is evaluated by **3 different expert prompts**, then averaged. 
Different prompts produce different blind spots; averaging reduces variance and bias. + +| Expert | Prompt slant | +|---|---| +| `strict_critic` | Looks for flaws, scores conservatively, penalizes any slip-up | +| `pragmatist` | "Did the user end up with what they wanted, regardless of the path?" | +| `tech_lead` | Architecture / tool choice / efficiency, focused on technical decisions | + +All three see the same transcript and the same rubric. Final per-axis score = mean across experts. Spread between experts is also stored — large spread = noisy/contested session. + +## Storage (PostgreSQL) + +Append-only. Multiple evals per session are normal (re-evaluation when judge upgrades, rubric changes, or you just want a fresh take). + +```sql +-- Per-message user feedback (drives the like/dislike signal) +CREATE TABLE message_feedback ( + message_id UUID PRIMARY KEY REFERENCES messages(id), + session_id UUID NOT NULL, + rating SMALLINT NOT NULL, -- +1 / -1 + created_at TIMESTAMPTZ NOT NULL DEFAULT now() +); +CREATE INDEX ON message_feedback(session_id); + +-- One row per (session, expert, eval_run) +CREATE TABLE evaluations ( + id UUID PRIMARY KEY, + session_id UUID NOT NULL, + eval_run_id UUID NOT NULL, -- groups the 3 experts of one run + eval_date TIMESTAMPTZ NOT NULL, + judge_model TEXT NOT NULL, -- e.g. "gemma4:31b-cloud" + judge_version TEXT NOT NULL, -- snapshotted version string + rubric_version TEXT NOT NULL, -- "v1", "v2", ... 
+    expert_id       TEXT NOT NULL,  -- "strict_critic" | "pragmatist" | "tech_lead"
+    scores          JSONB NOT NULL, -- {task_complexity: 65, goal_completion: 80, ...}
+    comment         TEXT NOT NULL   -- free-form "what stood out"
+);
+CREATE INDEX ON evaluations(session_id);
+CREATE INDEX ON evaluations(eval_date);
+CREATE INDEX ON evaluations(judge_version, rubric_version);
+
+-- View: averaged scores per session per eval_run
+CREATE VIEW evaluation_summary AS
+SELECT
+    session_id,
+    eval_run_id,
+    eval_date,
+    judge_version,
+    rubric_version,
+    jsonb_object_agg(
+        axis,
+        avg_score
+    ) AS avg_scores
+FROM (
+    SELECT
+        session_id, eval_run_id, eval_date, judge_version, rubric_version,
+        key AS axis,
+        AVG((value)::numeric) AS avg_score
+    FROM evaluations, jsonb_each_text(scores)
+    GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version, key
+) t
+GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version;
+```
+
+## Judge model policy
+
+- Judge model is **pinned** in eval config. Don't change casually.
+- When you do upgrade the judge, **re-evaluate the entire archive** with the new judge. Old scores stay (different `judge_version` row), new scores are the new baseline.
+- Comparisons across `judge_version` boundaries are not meaningful — visualizations should respect this.
+
+## Rubric versioning
+
+Same policy. Rubric changes (new anchors, reworded prompts) bump `rubric_version`. Old rows preserved, new ones are the live series.
+
+## CLI
+
+Standalone, no server dependency.
+
+```bash
+# Evaluate all unevaluated sessions (with current pinned judge + rubric)
+python -m debug.eval run
+
+# Re-evaluate everything (after judge or rubric change)
+python -m debug.eval run --re-evaluate-all
+
+# Evaluate a single session
+python -m debug.eval run --session
+
+# Limit to recent
+python -m debug.eval run --since 2026-04-01
+
+# Show eval for one session
+python -m debug.eval show
+
+# Aggregate stats
+python -m debug.eval stats --days 30
+python -m debug.eval stats --days 30 --by-complexity-bucket
+```
+
+`stats` exports CSV by default; visualization is a separate concern (see below).
+
+## UI (`debug/eval/index.html`)
+
+Single-page debug SPA in the same style as the existing `debug/index.html` (dark mono theme, no framework). Tabbed layout:
+
+### Tab 1 — Sessions
+Paginated table of all sessions, newest first. Columns: started_at, profile, turns count, likes / dislikes, last avg score (or "—"), eval status (`evaluated rubric_v1` / `pending` / `stale judge_v1 → v2`). Row click → Tab 2 with that session preselected.
+
+Filters at top: profile, date range, "show only unevaluated", "show only stale".
+
+### Tab 2 — Session detail
+Two-pane layout. Left: transcript (collapsed by default; user / assistant / tool-call / sub-agent indented). Right: eval results.
+
+- All eval runs for this session listed (most recent first), each expandable.
+- Inside an eval run: 3 expert blocks side-by-side with their per-axis scores, the spread, and free-form comment.
+- Avg row at top of run with `(judge_version, rubric_version, eval_date)`.
+- Action button: "Re-evaluate this session".
+
+### Tab 3 — Stats
+Charts (server-rendered SVG or simple canvas, no chart library):
+
+1. Average score per axis over time — weekly rolling mean.
+2. Score by complexity bucket (`0-25`, `26-50`, `51-75`, `76+`) — per-bucket trend, catches selection bias when overall score moves.
+3. Likes / dislikes ratio per week — orthogonal sanity check.
+4. Top-K worst sessions in the last 7 days — clickable, jumps to Tab 2.
+
+Filter bar: `judge_version` + `rubric_version` (mixing across versions disabled by default).
+
+### Tab 4 — Run
+Trigger an eval run.
+- Form: scope (`all unevaluated` / `since date` / `single session id` / `re-evaluate all`), max sessions, dry-run checkbox.
+- Submit → POST `/api/eval/run`, server kicks off the CLI as a background task and returns a `run_id`.
+- Live log panel below subscribes to a small WS or SSE stream and prints progress: "session N/M, expert K/3, scores …".
+- Run history table at the bottom: past runs with timestamp, count of sessions, judge_version, status.
+
+CSV export available on Tab 3 for offline plotting.
+
+## Implementation phases
+
+1. **Phase 1 — Feedback signal**
+   - `message_feedback` PostgreSQL table + migration.
+   - Webclient UI: thumbs up/down on each assistant message, REST POST to `/api/eval/feedback`.
+   - Endpoint `POST /api/eval/feedback {message_id, rating}` — upsert.
+2. **Phase 2 — Eval backend skeleton**
+   - `debug/eval/` package with CLI entry point (`python -m debug.eval`).
+   - `evaluations` table + migration.
+   - Judge prompt templates per expert (`prompts/expert_*.txt`).
+   - Rubric anchors as YAML (`prompts/rubric_v1.yaml`) — anchor examples filled in by user before going live.
+3. **Phase 3 — Run + store**
+   - `run` command: pick unevaluated sessions, render full transcript, fan out to 3 experts, validate JSON output against the Pydantic schema, persist all expert rows under one `eval_run_id`.
+4. **Phase 4 — Read endpoints**
+   - `/api/eval/sessions`, `/sessions/{id}`, `/stats` — read-only, used by debug UI.
+   - `/api/eval/run` (POST) — kicks off CLI in a background task, returns `run_id`. SSE/WS stream for live progress.
+5. **Phase 5 — Debug UI**
+   - `debug/eval/index.html` + `app.js` + `style.css` in the existing debug-SPA style.
+   - All four tabs (Sessions / Detail / Stats / Run) wired to the endpoints above.
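The validate-and-aggregate step of Phase 3 could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual implementation: it uses only the standard library as a stand-in for the Pydantic schema, `validate_scores` and `aggregate_run` are hypothetical names, and "spread" is taken to be max minus min because the spec does not define the metric.

```python
from statistics import mean
from typing import Optional

# Axis names from the rubric table; the last two may legitimately be null.
AXES = (
    "task_complexity", "goal_completion", "tool_usage_quality",
    "efficiency", "communication",
    "subagent_orchestration", "self_extension",
)
NULLABLE = {"subagent_orchestration", "self_extension"}

Scores = dict[str, Optional[float]]


def validate_scores(raw: dict) -> Scores:
    """Check one expert's parsed JSON output (stand-in for the Pydantic schema)."""
    unknown = set(raw) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    out: Scores = {}
    for axis in AXES:
        value = raw.get(axis)
        if value is None:
            if axis not in NULLABLE:
                raise ValueError(f"missing score for {axis}")
            out[axis] = None
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric score for {axis}: {value!r}")
        if value < 0:
            raise ValueError(f"negative score for {axis}")
        out[axis] = float(value)  # open scale: no upper bound is enforced
    return out


def aggregate_run(expert_scores: dict[str, Scores]) -> dict[str, dict]:
    """Per-axis mean and spread (assumed: max - min) across experts, skipping nulls."""
    summary = {}
    for axis in AXES:
        values = [s[axis] for s in expert_scores.values() if s[axis] is not None]
        if values:
            summary[axis] = {"mean": mean(values), "spread": max(values) - min(values)}
        else:
            summary[axis] = {"mean": None, "spread": None}
    return summary
```

Null axes are skipped rather than counted as zero, which mirrors how SQL `AVG` ignores NULLs, so a session without sub-agents does not drag the mean down.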
+
+CSV export from `python -m debug.eval stats --csv` is also available as a pure-CLI path for offline plotting.
+
+## Costs / constraints
+
+- 3 experts × full session transcript per run. For 50-turn sessions with 50k+ token contexts that's 3 large LLM calls per session. Plan to run overnight on a small batch, not in real time.
+- Judge calls go through the same backend stack (`FallbackOllamaBackend`) — so multi-server fallback applies.
+- Eval runner should respect a `--max-tokens-per-session` guard so a runaway transcript doesn't burn the queue.
+
+## Known limits / open questions
+
+- No verification of factual or code correctness — judge sees only the transcript. For "did this code actually work?" we'd need separate runtime checks; out of scope here.
+- Judge bias toward verbose / confident answers is not fully mitigated by 3 experts — partial only.
+- Calibration set (manual scoring of N sessions to validate judge against user) is **deliberately skipped** — we only need dynamics, not absolute correctness. Re-open if the trends turn out to be uninterpretable.
+- Rubric anchors must be set with care; once the archive is large, changing the rubric forces re-eval of everything.
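The `--max-tokens-per-session` guard listed under Costs / constraints might look something like the sketch below. The assumptions are loud ones: the four-characters-per-token estimate is a crude heuristic rather than the judge's real tokenizer, and `estimate_tokens` and `select_within_budget` are hypothetical names that do not come from the spec.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not the judge's tokenizer


def estimate_tokens(transcript: str) -> int:
    """Approximate the token count of a rendered transcript."""
    return math.ceil(len(transcript) / CHARS_PER_TOKEN)


def select_within_budget(
    transcripts: dict[str, str], max_tokens: int
) -> tuple[list[str], list[str]]:
    """Split session ids into (to_evaluate, skipped) by estimated transcript size.

    Every expert receives the full transcript, so the per-session estimate
    is what a single judge call has to fit; a run makes three such calls.
    """
    to_evaluate: list[str] = []
    skipped: list[str] = []
    for session_id, text in transcripts.items():
        if estimate_tokens(text) <= max_tokens:
            to_evaluate.append(session_id)
        else:
            skipped.append(session_id)
    return to_evaluate, skipped
```

Skipped sessions are returned instead of silently dropped, so a runner could log them and leave them marked as pending.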