# Eval System — Design Spec

LLM-as-judge evaluation of Navi sessions. Tracks quality dynamics over time without dedicated test scenarios — analysis runs against real, unmodified sessions.

Status: **spec / not implemented**.

## Goals

1. See how Navi's quality changes over time across multiple axes.
2. Detect regressions after prompt/model/architecture changes.
3. Surface concrete sessions for inspection (best, worst, biggest deltas).
4. No special test fixtures — evaluation runs against real usage.

## Non-goals

- Absolute "correctness" of scores. We care about **dynamics**, not whether a 75 is "really a 75".
- Real-time scoring during sessions. Eval is offline.
- Verification of factual claims (judge can't run code; this is a known limit).

## Architecture

Three parts, deliberately decoupled:

1. **In-app feedback signal** — like/dislike per assistant response in the main webclient, stored alongside messages.
2. **Eval runner (CLI)** — standalone, runs offline against PostgreSQL, evaluates accumulated sessions, writes scores to eval tables. Does not require the FastAPI server to be running.
3. **Eval UI (debug page)** — SPA for browsing sessions / scores / charts, read-only apart from triggering eval runs on the server. Lives under `debug/eval/`, is served as static files, and pulls data through a small REST namespace `/api/eval/...`.

### Directory layout

Everything for the eval system lives under `debug/eval/`. The directory contains both the standalone Python backend (CLI + REST router) and the frontend SPA. `debug/eval/` is a Python package (`__init__.py`) so the CLI can be invoked as `python -m debug.eval` from the project root. Running a package with `-m` executes its `__main__.py`, so the package also needs a thin `__main__.py` that hands off to `cli.py`.

```
debug/eval/
  __init__.py
  __main__.py                   # required by python -m debug.eval; delegates to cli.py
  cli.py                        # entry: python -m debug.eval ...
  judge.py                      # judge orchestration (3 experts, averaging)
  schema.py                     # Pydantic models for scores / requests
  db.py                         # asyncpg queries for feedback + evaluations
  api.py                        # FastAPI APIRouter, mounted from navi/main.py
                                #   GET  /api/eval/sessions
                                #   GET  /api/eval/sessions/{id}
                                #   GET  /api/eval/stats
                                #   POST /api/eval/run         (background task)
                                #   POST /api/eval/feedback    (like/dislike)
  index.html                    # frontend SPA (matches debug/index.html style)
  app.js
  style.css
  prompts/
    expert_strict_critic.txt
    expert_pragmatist.txt
    expert_tech_lead.txt
    rubric_v1.yaml              # axes + anchors (frozen per version)
  schema.sql                    # postgres migration (eval_v1)
  README.md                     # ops doc — running CLI, applying migration
```

`navi/main.py` adds two lines: include the eval router and serve `debug/eval/index.html` at `/debug/eval/`. Everything else stays out of `navi/`.

The webclient (`webclient/`) gets a small addition: like/dislike thumbs on each assistant message that POST to `/api/eval/feedback`. Together with the two lines in `navi/main.py`, that is the only touchpoint outside `debug/eval/`.

### What the judge sees

Maximum signal — the judge gets the full session, no filtering, no compression-summary substitution.

- **Full transcript** in original order: user / assistant / tool calls + tool results / thinking blocks / sub-agent transcripts (recursively, with depth markers) / planning phases (Phase 1 analysis, Phase 2 review, Phase 3 plan) — exactly as they appeared.
- **Per-message feedback ratings** inlined next to each assistant message ("[user reaction: 👍]" / "[user reaction: 👎]" / nothing).
- **Aggregated counts** at the top: total likes, total dislikes.
- **Profile metadata**: which profile ran, model used, planning flags state at the time.
- **Session timing**: start, end, duration, iteration count, total tokens.

We do **not** substitute compressed summaries for the original messages — that would hide the actual work and only let the judge grade the final outcome. The point is to grade the **process**.

If a session is too long for the judge's context, the v1 runner logs a warning and skips it. Chunking by user-turn group with explicit gap markers is a possible later alternative (TBD).
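
The transcript rendering described above could look roughly like the sketch below. It only illustrates the inlined reaction markers, the aggregated counts at the top, and depth markers for nested content. The `Message` shape and the `render_transcript` name are hypothetical and not part of the spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    # Hypothetical message shape; the real schema lives in the messages table.
    role: str                      # "user", "assistant", "tool_call", ...
    text: str
    rating: Optional[int] = None   # +1 / -1 from message_feedback, None if unrated
    depth: int = 0                 # sub-agent nesting level


REACTIONS = {1: "[user reaction: 👍]", -1: "[user reaction: 👎]"}


def render_transcript(messages: list[Message]) -> str:
    """Render messages in original order with counts on top and reactions inline."""
    likes = sum(1 for m in messages if m.rating == 1)
    dislikes = sum(1 for m in messages if m.rating == -1)
    lines = [f"likes: {likes}, dislikes: {dislikes}", ""]
    for m in messages:
        indent = "  " * m.depth    # depth marker for sub-agent content
        line = f"{indent}[{m.role}] {m.text}"
        if m.role == "assistant" and m.rating in REACTIONS:
            line += f" {REACTIONS[m.rating]}"
        lines.append(line)
    return "\n".join(lines)
```

Nothing is filtered or summarized here: every message is emitted as-is, which matches the "grade the process" intent.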

## Signal sources

When evaluating a session, the judge LLM has access to:

- Full session transcript (user / assistant / tool calls / thinking).
- Per-message likes/dislikes from the user.
- The user's own follow-up text in chat ("doesn't work", "redo it", "thanks"); the judge extracts the implicit signal from it.

Aggregated like/dislike counts are computed before judge runs. If `likes > dislikes` → tilt toward "successful". If `dislikes > likes` → tilt toward "unsuccessful". If both 0 → judge infers from transcript only.
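
A minimal sketch of the tilt rule, assuming a hypothetical helper named `feedback_tilt`. The spec does not say what happens when likes and dislikes are equal and non-zero; this sketch treats a tie the same as no feedback, which is an assumption.

```python
def feedback_tilt(likes: int, dislikes: int) -> str:
    """Map aggregated feedback counts to a hint for the judge prompt."""
    if likes > dislikes:
        return "successful"
    if dislikes > likes:
        return "unsuccessful"
    # Both zero: the judge infers from the transcript only.
    # Equal non-zero counts are handled the same way here (assumption).
    return "none"
```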

## Axes

Fixed set. Scores start at 0, and 100 is the nominal top of the scale rather than a hard cap (see "Open scale" below):

| Axis | Meaning |
|---|---|
| `task_complexity` | Difficulty of what was asked, judged from the user's request alone |
| `goal_completion` | Did the user end up with what they wanted |
| `tool_usage_quality` | Right tools chosen, no thrashing, no unnecessary calls |
| `efficiency` | Iterations vs result; loops, dead-ends, redundancy |
| `communication` | Clarity of replies, no hallucinations, no excessive verbosity |
| `subagent_orchestration` | Quality of sub-agent delegation (null if no sub-agents used) |
| `self_extension` | Quality of write_tool / reload_tools usage (null if not used) |

The judge sees the planning structure as part of the transcript, but the rubric does **not** ask for separate scores per planning phase. The judge instructions deliberately stay at "did the agent reason / execute / communicate well" — the architectural details of how planning runs are not evaluated, since those are the very things we're trying to measure progress on. Coupling the rubric to current planning shape would lock the eval to today's mechanics.

Scoring scale anchors (designed once, frozen as `rubric_v1`):

- **10** — trivial, near-zero effort.
- **30** — straightforward, one tool, one step.
- **50** — moderate, 2-4 steps, planning helpful.
- **75** — complex, multi-tool with planning, easy to fail.
- **100** — at the limit of what Navi can do today (full project tasks, multiple sub-agents, self-extension).

Anchors include **2-3 real session examples** at each level (user picks them once from accumulated history).

### Open scale

Scale is **not capped at 100**. If the judge encounters a task harder than any 100-anchor, it scores 120, 150, etc. Those become future anchors when we expand the rubric.
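
To make the open scale and the nullable axes concrete, here is a stdlib-only validation sketch. The spec plans Pydantic models in `schema.py`; this stand-in only illustrates the rules (no upper bound, `null` allowed for the two optional axes) and the function name is hypothetical. Rejecting negative scores is an assumption based on the scale starting at 0.

```python
from typing import Optional

AXES = (
    "task_complexity",
    "goal_completion",
    "tool_usage_quality",
    "efficiency",
    "communication",
    "subagent_orchestration",
    "self_extension",
)
# Axes that may be null when the feature was not used in the session.
NULLABLE_AXES = {"subagent_orchestration", "self_extension"}


def validate_scores(raw: dict) -> dict[str, Optional[float]]:
    """Check one expert's scores: known axes only, >= 0, no upper cap."""
    unknown = set(raw) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    clean: dict[str, Optional[float]] = {}
    for axis in AXES:
        value = raw.get(axis)
        if value is None:
            if axis not in NULLABLE_AXES:
                raise ValueError(f"{axis} is required")
            clean[axis] = None
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{axis} must be a number")
        if value < 0:
            raise ValueError(f"{axis} must be >= 0")
        clean[axis] = float(value)  # values above 100 pass through untouched
    return clean
```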

## Experts (multi-judge averaging)

Each session is evaluated by **3 different expert prompts**, then averaged. Different prompts produce different blind spots; averaging reduces variance and dampens any single prompt's bias.

| Expert | Prompt slant |
|---|---|
| `strict_critic` | Looks for flaws, scores conservatively, penalizes even minor slip-ups |
| `pragmatist` | "Did the user end up with what they wanted, regardless of the path?" |
| `tech_lead` | Architecture / tool choice / efficiency, focused on technical decisions |

All three see the same transcript and the same rubric. Final per-axis score = mean across experts. Spread between experts is also stored — large spread = noisy/contested session.
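
The averaging step could be sketched as follows. The spec does not define "spread"; this sketch uses max minus min per axis, which is an assumption (standard deviation would be another reasonable choice). The function name is hypothetical.

```python
from statistics import mean
from typing import Optional


def aggregate_experts(
    expert_scores: dict[str, dict[str, Optional[float]]],
) -> dict[str, dict[str, Optional[float]]]:
    """Per-axis mean and spread across experts; null scores are ignored."""
    axes = sorted({axis for scores in expert_scores.values() for axis in scores})
    summary: dict[str, dict[str, Optional[float]]] = {}
    for axis in axes:
        values = [
            scores[axis]
            for scores in expert_scores.values()
            if scores.get(axis) is not None
        ]
        if not values:
            # Axis was null for every expert (e.g. no sub-agents used).
            summary[axis] = {"mean": None, "spread": None}
        else:
            summary[axis] = {
                "mean": mean(values),
                "spread": max(values) - min(values),  # assumed definition
            }
    return summary
```

A large `spread` on an axis flags the session as contested and worth a manual look.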

## Storage (PostgreSQL)

Evaluations are append-only (`message_feedback` is upserted instead, one row per message). Multiple evals per session are normal (re-evaluation when the judge is upgraded, the rubric changes, or you just want a fresh take).

```sql
-- Per-message user feedback (drives the like/dislike signal)
CREATE TABLE message_feedback (
    message_id      UUID PRIMARY KEY REFERENCES messages(id),
    session_id      UUID NOT NULL,
    rating          SMALLINT NOT NULL CHECK (rating IN (1, -1)),  -- +1 = like, -1 = dislike
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX ON message_feedback(session_id);

-- One row per (session, expert, eval_run)
CREATE TABLE evaluations (
    id              UUID PRIMARY KEY,
    session_id      UUID NOT NULL,
    eval_run_id     UUID NOT NULL,        -- groups the 3 experts of one run
    eval_date       TIMESTAMPTZ NOT NULL, -- must be identical for all experts of one run; the summary view groups on it
    judge_model     TEXT NOT NULL,        -- e.g. "gemma4:31b-cloud"
    judge_version   TEXT NOT NULL,        -- snapshotted version string
    rubric_version  TEXT NOT NULL,        -- "v1", "v2", ...
    expert_id       TEXT NOT NULL,        -- "strict_critic" | "pragmatist" | "tech_lead"
    scores          JSONB NOT NULL,       -- {task_complexity: 65, goal_completion: 80, ...}
    comment         TEXT NOT NULL         -- free-form "what stood out"
);
CREATE INDEX ON evaluations(session_id);
CREATE INDEX ON evaluations(eval_date);
CREATE INDEX ON evaluations(judge_version, rubric_version);

-- View: averaged scores per session per eval_run
CREATE VIEW evaluation_summary AS
SELECT
    session_id,
    eval_run_id,
    eval_date,
    judge_version,
    rubric_version,
    jsonb_object_agg(
        axis,
        avg_score
    ) AS avg_scores
FROM (
    SELECT
        session_id, eval_run_id, eval_date, judge_version, rubric_version,
        key AS axis,
        AVG((value)::numeric) AS avg_score
    FROM evaluations, jsonb_each_text(scores)
    GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version, key
) t
GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version;
```

## Judge model policy

- Judge model is **pinned** in eval config. Don't change casually.
- When you do upgrade the judge, **re-evaluate the entire archive** with the new judge. Old scores stay (different `judge_version` row), new scores are the new baseline.
- Comparisons across `judge_version` boundaries are not meaningful — visualizations should respect this.

## Rubric versioning

Same policy. Rubric changes (new anchors, reworded prompts) bump `rubric_version`. Old rows preserved, new ones are the live series.
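
Under these two versioning policies, a session's eval status can be derived from its stored rows and the currently pinned versions. The sketch below uses the status names that appear in the Sessions tab (`evaluated` / `pending` / `stale`). The `EvalRow` type and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRow:
    # Only the two version columns of an evaluations row matter here.
    judge_version: str
    rubric_version: str


def eval_status(rows: list[EvalRow], pinned_judge: str, pinned_rubric: str) -> str:
    """Classify a session against the currently pinned judge and rubric."""
    if not rows:
        return "pending"
    for row in rows:
        if row.judge_version == pinned_judge and row.rubric_version == pinned_rubric:
            return "evaluated"
    # Rows exist, but only for an older judge or rubric version.
    return "stale"
```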

## CLI

Standalone, no server dependency.

```bash
# Evaluate all unevaluated sessions (with current pinned judge + rubric)
python -m debug.eval run

# Re-evaluate everything (after judge or rubric change)
python -m debug.eval run --re-evaluate-all

# Evaluate a single session
python -m debug.eval run --session <uuid>

# Limit to recent
python -m debug.eval run --since 2026-04-01

# Show eval for one session
python -m debug.eval show <uuid>

# Aggregate stats
python -m debug.eval stats --days 30
python -m debug.eval stats --days 30 --by-complexity-bucket
```

`stats` exports CSV by default; visualization is a separate concern (see below).
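
One possible shape for the command-line parsing, using only `argparse` from the standard library. The subcommands and flags are the ones listed above; the `build_parser` name, the help strings, and the `dest` names are assumptions, and no handler logic is shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the run / show / stats subcommands."""
    parser = argparse.ArgumentParser(prog="python -m debug.eval")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="evaluate sessions")
    run.add_argument("--re-evaluate-all", action="store_true",
                     help="re-run every session, e.g. after a judge or rubric change")
    run.add_argument("--session", metavar="UUID", help="evaluate a single session")
    run.add_argument("--since", metavar="YYYY-MM-DD", help="only sessions from this date on")

    show = sub.add_parser("show", help="show stored evals for one session")
    show.add_argument("session", metavar="UUID")

    stats = sub.add_parser("stats", help="aggregate stats")
    stats.add_argument("--days", type=int, help="look-back window in days")
    stats.add_argument("--by-complexity-bucket", action="store_true")
    return parser
```

For example, `build_parser().parse_args(["run", "--since", "2026-04-01"])` yields a namespace with `command == "run"` and `since == "2026-04-01"`.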

## UI (`debug/eval/index.html`)

Single-page debug SPA in the same style as the existing `debug/index.html` (dark mono theme, no framework). Tabbed layout:

### Tab 1 — Sessions
Paginated table of all sessions, newest first. Columns: started_at, profile, turns count, likes / dislikes, last avg score (or "—"), eval status (`evaluated rubric_v1` / `pending` / `stale judge_v1 → v2`). Row click → Tab 2 with that session preselected.

Filters at top: profile, date range, "show only unevaluated", "show only stale".

### Tab 2 — Session detail
Two-pane layout. Left: transcript (collapsed by default; user / assistant / tool-call / sub-agent indented). Right: eval results.

- All eval runs for this session listed (most recent first), each expandable.
- Inside an eval run: 3 expert blocks side-by-side with their per-axis scores, the spread, and free-form comment.
- Avg row at top of run with `(judge_version, rubric_version, eval_date)`.
- Action button: "Re-evaluate this session".

### Tab 3 — Stats
Charts (server-rendered SVG or simple canvas, no chart library):

1. Average score per axis over time — weekly rolling mean.
2. Score by complexity bucket (`0-25`, `26-50`, `51-75`, `76+`) — per-bucket trend, catches selection bias when overall score moves.
3. Likes / dislikes ratio per week — orthogonal sanity check.
4. Top-K worst sessions in the last 7 days — clickable, jumps to Tab 2.

Filter bar: judge_version + rubric_version (mixing across versions disabled by default).
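
The complexity buckets from chart 2 could be computed as in the sketch below. Averaged scores can be fractional, so the boundaries are treated as "up to and including 25", "up to and including 50", and so on; that boundary handling and both function names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def complexity_bucket(task_complexity: float) -> str:
    """Map an averaged task_complexity score to its chart bucket."""
    if task_complexity <= 25:
        return "0-25"
    if task_complexity <= 50:
        return "26-50"
    if task_complexity <= 75:
        return "51-75"
    return "76+"  # open scale: anything above 75, including scores over 100


def mean_by_bucket(sessions: list[dict], axis: str) -> dict[str, float]:
    """Average one axis per complexity bucket; sessions are avg_scores dicts."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for avg_scores in sessions:
        value = avg_scores.get(axis)
        if value is not None:  # skip axes that were null for this session
            grouped[complexity_bucket(avg_scores["task_complexity"])].append(value)
    return {bucket: mean(values) for bucket, values in grouped.items()}
```

Comparing per-bucket means over time shows whether a moving overall score reflects a quality change or just a shift in how hard the incoming tasks are.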

### Tab 4 — Run
Trigger an eval run.
- Form: scope (`all unevaluated` / `since date` / `single session id` / `re-evaluate all`), max sessions, dry-run checkbox.
- Submit → POST `/api/eval/run`, server kicks off the CLI as a background task and returns a `run_id`.
- Live log panel below subscribes to a small WS or SSE stream and prints progress: "session N/M, expert K/3, scores …".
- Run history table at the bottom: past runs with timestamp, count of sessions, judge_version, status.

CSV export available on Tab 3 for offline plotting.

## Implementation phases

1. **Phase 1 — Feedback signal**
   - `message_feedback` postgres table + migration.
   - Webclient UI: thumbs up/down on each assistant message, REST POST to `/api/eval/feedback`.
   - Endpoint `POST /api/eval/feedback {message_id, rating}` — upsert.
2. **Phase 2 — Eval backend skeleton**
   - `debug/eval/` package with CLI entry point (`python -m debug.eval`).
   - `evaluations` table + migration.
   - Judge prompt templates per expert (`prompts/expert_*.txt`).
   - Rubric anchors as YAML (`prompts/rubric_v1.yaml`) — anchor examples filled in by user before going live.
3. **Phase 3 — Run + store**
   - `run` command: pick unevaluated sessions, render full transcript, fan out to 3 experts, validate JSON output against pydantic schema, persist all expert rows under one `eval_run_id`.
4. **Phase 4 — Read endpoints**
   - `/api/eval/sessions`, `/sessions/{id}`, `/stats` — read-only, used by debug UI.
   - `/api/eval/run` (POST) — kicks off CLI in a background task, returns `run_id`. SSE/WS stream for live progress.
5. **Phase 5 — Debug UI**
   - `debug/eval/index.html` + `app.js` + `style.css` in the existing debug-SPA style.
   - All four tabs (Sessions / Detail / Stats / Run) wired to the endpoints above.
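
The Phase 3 flow (fan out to 3 experts, validate the JSON output, group all rows under one `eval_run_id`) might be sketched like this. `call_judge` is a hypothetical stub standing in for the real judge backend, the `{"scores": ..., "comment": ...}` output shape is an assumption, and persistence is left out: the function just returns the rows that would be inserted.

```python
import asyncio
import json
import uuid

EXPERTS = ("strict_critic", "pragmatist", "tech_lead")


async def call_judge(expert_id: str, transcript: str) -> str:
    """Hypothetical stand-in for the real judge call; returns raw JSON text."""
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return json.dumps(
        {"scores": {"goal_completion": 80}, "comment": f"stub reply from {expert_id}"}
    )


def parse_judge_output(raw: str) -> dict:
    """Parse and minimally validate one expert's reply (assumed shape)."""
    parsed = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(parsed, dict):
        raise ValueError("judge output must be a JSON object")
    if not isinstance(parsed.get("scores"), dict) or not isinstance(parsed.get("comment"), str):
        raise ValueError("judge output needs a 'scores' object and a 'comment' string")
    return parsed


async def evaluate_session(session_id: str, transcript: str) -> list[dict]:
    """Run all experts concurrently and build one row per expert."""
    eval_run_id = str(uuid.uuid4())  # shared by the 3 expert rows of this run
    raw_outputs = await asyncio.gather(
        *(call_judge(expert_id, transcript) for expert_id in EXPERTS)
    )
    rows = []
    for expert_id, raw in zip(EXPERTS, raw_outputs):
        parsed = parse_judge_output(raw)
        rows.append(
            {
                "session_id": session_id,
                "eval_run_id": eval_run_id,
                "expert_id": expert_id,
                "scores": parsed["scores"],
                "comment": parsed["comment"],
            }
        )
    return rows
```

Validating all three replies before building any row keeps a run from being stored with only some of its experts.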

CSV export from `python -m debug.eval stats` (CSV is its default output) is also available as a pure-CLI path for offline plotting.

## Costs / constraints

- 3 experts × full session transcript per run. For 50-turn sessions with 50k+ token contexts that's 3 large LLM calls, or 150k+ input tokens, per session. Plan to run overnight on a small batch, not in real time.
- Judge calls go through the same backend stack (`FallbackOllamaBackend`) — so multi-server fallback applies.
- Eval runner should respect a `--max-tokens-per-session` guard so a runaway transcript doesn't burn the queue.
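
A rough sketch of the `--max-tokens-per-session` guard and the cost arithmetic above. The 4-characters-per-token estimate is a crude assumption; a real runner would presumably use the judge model's own tokenizer. All function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate: about 4 characters per token (assumption)."""
    return len(text) // 4 + 1


def within_budget(transcript: str, max_tokens_per_session: int) -> bool:
    """True if the rendered transcript fits the per-session guard."""
    return estimate_tokens(transcript) <= max_tokens_per_session


def total_judge_tokens(transcript_tokens: int, experts: int = 3) -> int:
    """Input tokens spent on one session: every expert reads the full transcript."""
    return transcript_tokens * experts
```

With a 50k-token transcript, `total_judge_tokens(50_000)` gives 150,000 input tokens for one session, which is why batch size matters more than per-call latency.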

## Known limits / open questions

- No verification of factual or code correctness — judge sees only the transcript. For "did this code actually work?" we'd need separate runtime checks; out of scope here.
- Judge bias toward verbose / confident answers is not fully mitigated by 3 experts — partial only.
- Calibration set (manual scoring of N sessions to validate judge against user) is **deliberately skipped** — we only need dynamics, not absolute correctness. Re-open if the trends turn out to be uninterpretable.
- Rubric anchors must be set with care; once the archive is large, changing the rubric forces re-eval of everything.