LLM-as-judge evaluation of Navi sessions. Tracks quality dynamics over time without dedicated test scenarios — analysis runs against real, unmodified sessions.
Status: spec / not implemented.
Three parts, deliberately decoupled:
- debug/eval/, served as static and pulls data through a small REST namespace /api/eval/....

Everything for the eval system lives under debug/eval/. The directory contains both the standalone Python backend (CLI + REST router) and the frontend SPA. debug/eval/ is a Python package (__init__.py) so the CLI can be invoked as python -m debug.eval from the project root.

    debug/eval/
        __init__.py
        cli.py                        # entry: python -m debug.eval ...
        judge.py                      # judge orchestration (3 experts, averaging)
        schema.py                     # Pydantic models for scores / requests
        db.py                         # asyncpg queries for feedback + evaluations
        api.py                        # FastAPI APIRouter, mounted from navi/main.py
                                      #   GET  /api/eval/sessions
                                      #   GET  /api/eval/sessions/{id}
                                      #   GET  /api/eval/stats
                                      #   POST /api/eval/run       (background task)
                                      #   POST /api/eval/feedback  (like/dislike)
        index.html                    # frontend SPA (matches debug/index.html style)
        app.js
        style.css
        prompts/
            expert_strict_critic.txt
            expert_pragmatist.txt
            expert_tech_lead.txt
            rubric_v1.yaml            # axes + anchors (frozen per version)
        schema.sql                    # postgres migration (eval_v1)
        README.md                     # ops doc: running CLI, applying migration

navi/main.py adds two lines: include the eval router and serve debug/eval/index.html at /debug/eval/. Everything else stays out of navi/.
The webclient (webclient/) gets a small addition: like/dislike thumbs on each assistant message that POST to /api/eval/feedback. That's the only touchpoint outside debug/eval/.
Maximum signal — the judge gets the full session, no filtering, no compression-summary substitution.
We do not substitute compressed summaries for the original messages — that would hide the actual work and only let the judge grade the final outcome. The point is to grade the process.
If a session is too long for the judge's context, the runner logs a warning and skips it (or chunks by user-turn group with explicit gaps — TBD; v1 just skips).
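The v1 skip rule could look roughly like the sketch below. The spec does not say how transcript length is measured, so the `estimate_tokens` heuristic (about four characters per token), the `select_sessions` helper, and the shape of `sessions` (session id mapped to rendered transcript text) are all assumptions made for illustration.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("eval.runner")


def estimate_tokens(text: str) -> int:
    # Assumption: ~4 characters per token. A real runner would more likely
    # use the judge model's own tokenizer.
    return (len(text) + 3) // 4


def select_sessions(sessions: Dict[str, str], context_limit: int) -> List[str]:
    """Return ids of sessions that fit the judge context; skip the rest with a warning."""
    kept = []
    for session_id, transcript in sessions.items():
        needed = estimate_tokens(transcript)
        if needed > context_limit:
            logger.warning(
                "skipping session %s: ~%d tokens exceeds limit %d",
                session_id, needed, context_limit,
            )
            continue
        kept.append(session_id)
    return kept
```

Skipped sessions are only logged here; nothing is truncated or summarized, which keeps the "full session or nothing" rule intact.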
When evaluating a session, the judge LLM has access to:
Aggregated like/dislike counts are computed before judge runs. If likes > dislikes → tilt toward "successful". If dislikes > likes → tilt toward "unsuccessful". If both 0 → judge infers from transcript only.
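A minimal sketch of that tilt rule follows. The function name `feedback_prior` and the returned labels are hypothetical. The text above only covers "more likes", "more dislikes", and "both 0"; treating a non-zero tie as "no tilt" is an assumption.

```python
from typing import Optional


def feedback_prior(likes: int, dislikes: int) -> Optional[str]:
    """Map aggregated feedback counts to a hint passed to the judge."""
    if likes > dislikes:
        return "successful"
    if dislikes > likes:
        return "unsuccessful"
    # Both zero (per the spec) or a non-zero tie (assumption): no tilt,
    # the judge infers the outcome from the transcript only.
    return None
```

For example, `feedback_prior(3, 1)` returns `"successful"` and `feedback_prior(0, 0)` returns `None`.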
Fixed set, scored 0-100 (no hard upper limit — see "Open scale" below):
| Axis | Meaning |
|---|---|
| task_complexity | Difficulty of what was asked, judged from the user's request alone |
| goal_completion | Did the user end up with what they wanted |
| tool_usage_quality | Right tools chosen, no thrashing, no unnecessary calls |
| efficiency | Iterations vs result; loops, dead-ends, redundancy |
| communication | Clarity of replies, no hallucinations, no excessive verbosity |
| subagent_orchestration | Quality of sub-agent delegation (null if no sub-agents used) |
| self_extension | Quality of write_tool / reload_tools usage (null if not used) |
The judge sees the planning structure as part of the transcript, but the rubric does not ask for separate scores per planning phase. The judge instructions deliberately stay at "did the agent reason / execute / communicate well" — the architectural details of how planning runs are not evaluated, since those are the very things we're trying to measure progress on. Coupling the rubric to current planning shape would lock the eval to today's mechanics.
Scoring scale anchors (designed once, frozen as rubric_v1):
Anchors include 2-3 real session examples at each level (user picks them once from accumulated history).
Scale is not capped at 100. If the judge encounters a task harder than any 100-anchor, it scores 120, 150, etc. Those become future anchors when we expand the rubric.
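The open scale and the two nullable axes translate into a simple validation rule: a lower bound of 0, no upper bound, and `None` allowed only where the rubric allows it. The sketch below uses plain standard-library Python as a stand-in for the Pydantic models planned for schema.py; `validate_scores` and the two constants are hypothetical names.

```python
from typing import Dict, Optional

AXES = (
    "task_complexity", "goal_completion", "tool_usage_quality",
    "efficiency", "communication", "subagent_orchestration", "self_extension",
)
# Axes that may be null when the feature was not used in the session.
NULLABLE = {"subagent_orchestration", "self_extension"}


def validate_scores(raw: Dict[str, object]) -> Dict[str, Optional[float]]:
    """Check one expert's scores: known axes only, >= 0, no upper cap."""
    unknown = set(raw) - set(AXES)
    if unknown:
        raise ValueError(f"unknown axes: {sorted(unknown)}")
    clean: Dict[str, Optional[float]] = {}
    for axis in AXES:
        value = raw.get(axis)
        if value is None:
            if axis not in NULLABLE:
                raise ValueError(f"{axis} is required")
            clean[axis] = None
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{axis} must be a number")
        if value < 0:
            raise ValueError(f"{axis} must be >= 0")
        clean[axis] = float(value)  # deliberately no upper bound: 120, 150 are valid
    return clean
```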
Each session is evaluated by 3 different expert prompts, then averaged. Different prompts produce different blind spots; averaging reduces variance and bias.
| Expert | Prompt slant |
|---|---|
| strict_critic | Looks for flaws, scores conservatively, penalizes any slip-up |
| pragmatist | "Did the user end up with what they wanted, regardless of the path?" |
| tech_lead | Architecture / tool choice / efficiency, focused on technical decisions |
All three see the same transcript and the same rubric. Final per-axis score = mean across experts. Spread between experts is also stored — large spread = noisy/contested session.
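The per-axis aggregation might be sketched as follows. The spec does not define "spread", so using max minus min is an assumption (standard deviation would work as well), and `aggregate` is a hypothetical helper name. Null scores are ignored, and an axis that every expert left null stays null.

```python
from statistics import mean
from typing import Dict, Optional

Scores = Dict[str, Optional[float]]


def aggregate(expert_scores: Dict[str, Scores]) -> Dict[str, Dict[str, Optional[float]]]:
    """Combine per-expert scores into a per-axis mean and spread."""
    axes = sorted({axis for scores in expert_scores.values() for axis in scores})
    summary: Dict[str, Dict[str, Optional[float]]] = {}
    for axis in axes:
        # Keep only experts that actually scored this axis.
        values = [s[axis] for s in expert_scores.values() if s.get(axis) is not None]
        if not values:
            summary[axis] = {"mean": None, "spread": None}
        else:
            summary[axis] = {"mean": mean(values), "spread": max(values) - min(values)}
    return summary
```

With scores of 60, 90, and 75 on one axis, this yields a mean of 75 and a spread of 30, which would flag a fairly contested session.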
Append-only. Multiple evals per session are normal (re-evaluation when judge upgrades, rubric changes, or you just want a fresh take).

    -- Per-message user feedback (drives the like/dislike signal)
    CREATE TABLE message_feedback (
        message_id  UUID PRIMARY KEY REFERENCES messages(id),
        session_id  UUID NOT NULL,
        rating      SMALLINT NOT NULL,                 -- +1 / -1
        created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );
    CREATE INDEX ON message_feedback(session_id);

    -- One row per (session, expert, eval_run)
    CREATE TABLE evaluations (
        id              UUID PRIMARY KEY,
        session_id      UUID NOT NULL,
        eval_run_id     UUID NOT NULL,         -- groups the 3 experts of one run
        eval_date       TIMESTAMPTZ NOT NULL,
        judge_model     TEXT NOT NULL,         -- e.g. "gemma4:31b-cloud"
        judge_version   TEXT NOT NULL,         -- snapshotted version string
        rubric_version  TEXT NOT NULL,         -- "v1", "v2", ...
        expert_id       TEXT NOT NULL,         -- "strict_critic" | "pragmatist" | "tech_lead"
        scores          JSONB NOT NULL,        -- {task_complexity: 65, goal_completion: 80, ...}
        comment         TEXT NOT NULL          -- free-form "what stood out"
    );
    CREATE INDEX ON evaluations(session_id);
    CREATE INDEX ON evaluations(eval_date);
    CREATE INDEX ON evaluations(judge_version, rubric_version);

    -- View: averaged scores per session per eval_run
    CREATE VIEW evaluation_summary AS
    SELECT
        session_id,
        eval_run_id,
        eval_date,
        judge_version,
        rubric_version,
        jsonb_object_agg(axis, avg_score) AS avg_scores
    FROM (
        SELECT
            session_id, eval_run_id, eval_date, judge_version, rubric_version,
            key AS axis,
            AVG((value)::numeric) AS avg_score
        FROM evaluations, jsonb_each_text(scores)
        GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version, key
    ) t
    GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version;

- judge_version row), new scores are the new baseline.
- judge_version boundaries are not meaningful; visualizations should respect this.

Same policy. Rubric changes (new anchors, reworded prompts) bump rubric_version. Old rows preserved, new ones are the live series.
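One way a stats or charting layer could respect these version boundaries is to split rows into one series per (judge_version, rubric_version) pair before doing anything else. This is a sketch only: `split_by_version` is a hypothetical helper, and the rows are assumed to be dicts shaped like the evaluation_summary view.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_by_version(rows: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Group summary rows so that no series mixes judge or rubric versions."""
    series: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for row in rows:
        series[(row["judge_version"], row["rubric_version"])].append(row)
    for points in series.values():
        # Each series is ordered in time on its own.
        points.sort(key=lambda r: r["eval_date"])
    return dict(series)
```

Each returned series can then be plotted or averaged independently, so a judge upgrade shows up as a new line rather than a jump in an old one.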
Standalone, no server dependency.

    # Evaluate all unevaluated sessions (with current pinned judge + rubric)
    python -m debug.eval run

    # Re-evaluate everything (after judge or rubric change)
    python -m debug.eval run --re-evaluate-all

    # Evaluate a single session
    python -m debug.eval run --session <uuid>

    # Limit to recent
    python -m debug.eval run --since 2026-04-01

    # Show eval for one session
    python -m debug.eval show <uuid>

    # Aggregate stats
    python -m debug.eval stats --days 30
    python -m debug.eval stats --days 30 --by-complexity-bucket

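The command surface above maps naturally onto argparse subcommands. The sketch below only covers the commands and flags listed here. The `build_parser` name, the argument types, and the absence of defaults are assumptions, and the real cli.py may be organized differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the run / show / stats subcommands."""
    parser = argparse.ArgumentParser(description="LLM-as-judge session evaluation")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="evaluate sessions")
    run.add_argument("--re-evaluate-all", action="store_true")
    run.add_argument("--session", metavar="UUID")
    run.add_argument("--since", metavar="YYYY-MM-DD")

    show = sub.add_parser("show", help="show the eval for one session")
    show.add_argument("session_id", metavar="UUID")

    stats = sub.add_parser("stats", help="aggregate stats")
    stats.add_argument("--days", type=int)
    stats.add_argument("--by-complexity-bucket", action="store_true")
    return parser
```

Parsing `["run", "--since", "2026-04-01"]` gives a namespace with `command == "run"` and `since == "2026-04-01"`; dispatching to the actual handlers is left out.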
stats exports CSV by default; visualization is a separate concern (see below).
Single-page debug SPA (debug/eval/index.html) in the same style as the existing debug/index.html (dark mono theme, no framework). Tabbed layout:
Paginated table of all sessions, newest first. Columns: started_at, profile, turns count, likes / dislikes, last avg score (or "—"), eval status (evaluated rubric_v1 / pending / stale judge_v1 → v2). Row click → Tab 2 with that session preselected.
Filters at top: profile, date range, "show only unevaluated", "show only stale".
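The eval status column and the "unevaluated" / "stale" filters could be derived from the latest evaluation's versions compared with the currently pinned ones. In this sketch, `eval_status` is a hypothetical helper, and treating a mismatch in either the judge version or the rubric version as "stale" is an assumption; the example above only shows a judge version change.

```python
from typing import Optional, Tuple


def eval_status(
    latest: Optional[Tuple[str, str]],
    current_judge: str,
    current_rubric: str,
) -> str:
    """Classify a session from its latest (judge_version, rubric_version), if any."""
    if latest is None:
        return "pending"  # never evaluated
    judge_version, rubric_version = latest
    if judge_version != current_judge or rubric_version != current_rubric:
        return "stale"  # evaluated, but under an older judge or rubric
    return "evaluated"
```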
Two-pane layout. Left: transcript (collapsed by default; user / assistant / tool-call / sub-agent indented). Right: eval results.
- (judge_version, rubric_version, eval_date).

Charts (server-rendered SVG or simple canvas, no chart library):
- 0-25, 26-50, 51-75, 76+): per-bucket trend, catches selection bias when overall score moves.

Filter bar: judge_version + rubric_version (mixing across versions disabled by default).
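Assigning a session to one of the complexity buckets listed above is a small lookup. The sketch assumes inclusive upper bounds, so a fractional average such as 25.5 lands in the next bucket up; the spec does not say how fractions are handled, and `complexity_bucket` is a hypothetical name.

```python
def complexity_bucket(task_complexity: float) -> str:
    """Map a task_complexity score to its trend bucket."""
    if task_complexity <= 25:
        return "0-25"
    if task_complexity <= 50:
        return "26-50"
    if task_complexity <= 75:
        return "51-75"
    # Open-ended top bucket: also covers scores above 100 on the open scale.
    return "76+"
```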
Trigger an eval run.
- all unevaluated / since date / single session id / re-evaluate all), max sessions, dry-run checkbox.
- /api/eval/run, server kicks off the CLI as a background task and returns a run_id.

CSV export available on Tab 3 for offline plotting.
- message_feedback postgres table + migration.
- /api/eval/feedback.
- POST /api/eval/feedback {message_id, rating}: upsert.
- debug/eval/ package with CLI entry point (python -m debug.eval).
- evaluations table + migration.
- prompts/expert_*.txt).
- prompts/rubric_v1.yaml): anchor examples filled in by user before going live.
- run command: pick unevaluated sessions, render full transcript, fan out to 3 experts, validate JSON output against pydantic schema, persist all expert rows under one eval_run_id.
- /api/eval/sessions, /sessions/{id}, /stats: read-only, used by debug UI.
- /api/eval/run (POST): kicks off CLI in a background task, returns run_id. SSE/WS stream for live progress.
- debug/eval/index.html + app.js + style.css in the existing debug-SPA style.

CSV export from python -m debug.eval stats --csv is also available as a pure-CLI path for offline plotting.
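The "persist all expert rows under one eval_run_id" step could be prepared as below: one shared run id, one row per expert, with keys matching the evaluations table columns. The `build_rows` helper and the shape of `expert_outputs` (expert id mapped to a dict with `scores` and `comment`) are assumptions, and the actual insert through asyncpg is omitted.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, List

EXPERTS = ("strict_critic", "pragmatist", "tech_lead")


def build_rows(
    session_id: str,
    expert_outputs: Dict[str, dict],
    judge_model: str,
    judge_version: str,
    rubric_version: str,
) -> List[dict]:
    """Build one evaluations row per expert, all sharing a single eval_run_id."""
    eval_run_id = str(uuid.uuid4())
    eval_date = datetime.now(timezone.utc)
    rows = []
    for expert_id in EXPERTS:
        output = expert_outputs[expert_id]  # a missing expert raises KeyError
        rows.append({
            "id": str(uuid.uuid4()),
            "session_id": session_id,
            "eval_run_id": eval_run_id,
            "eval_date": eval_date,
            "judge_model": judge_model,
            "judge_version": judge_version,
            "rubric_version": rubric_version,
            "expert_id": expert_id,
            "scores": output["scores"],
            "comment": output["comment"],
        })
    return rows
```

Building every row before writing anything makes it easier to insert the three rows in one transaction, so a run is never stored with only some of its experts.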
- FallbackOllamaBackend), so multi-server fallback applies.
- --max-tokens-per-session guard so a runaway transcript doesn't burn the queue.