root/navi-1

History for navi-1 / debug / eval

2026-05-08	193b7a5	Add pagination, search, and sorting to admin sessions

    Backend:
    - Add count_all and search_list abstract methods to SessionStore
    - Implement count_all and search_list in PgSessionStore (SQL with ILIKE)
    - Implement count_all and search_list in InMemorySessionStore
    - Update /admin/sessions to accept limit, offset, search, sort_by, sort_order
    - Return {total, limit, offset, items} from /admin/sessions

    Frontend:
    - Add search input for sessions in admin panel
    - Add clickable sortable column headers with asc/desc toggle
    - Add pagination controls (prev/next, page size selector, item count)
    - Debounce search input (300ms)

    Tests:
    - Add integration tests for pagination, offset, search, and sorting
    - All 217 tests pass

    Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 8 May
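The pagination contract described in 193b7a5 can be sketched roughly as follows. This is a minimal in-memory illustration, not the project's code: the `Session` fields, the class name `InMemorySessionStoreSketch`, the helper `list_sessions_page`, and the default values are all assumptions. Only the method names `count_all` / `search_list`, the query parameters, and the `{total, limit, offset, items}` response shape come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical fields; the real session model is not shown in the history.
    session_id: str
    profile: str
    created_at: str  # ISO timestamp, so string order matches time order


class InMemorySessionStoreSketch:
    """Illustrative stand-in, not the project's InMemorySessionStore."""

    SORTABLE = {"session_id", "profile", "created_at"}

    def __init__(self, sessions):
        self._sessions = list(sessions)

    def _matching(self, search):
        if not search:
            return list(self._sessions)
        # Case-insensitive substring match, roughly what ILIKE '%needle%' does.
        needle = search.lower()
        return [
            s for s in self._sessions
            if needle in s.session_id.lower() or needle in s.profile.lower()
        ]

    def count_all(self, search=None):
        # Total is counted after filtering but before pagination.
        return len(self._matching(search))

    def search_list(self, limit=20, offset=0, search=None,
                    sort_by="created_at", sort_order="desc"):
        if sort_by not in self.SORTABLE:
            raise ValueError(f"cannot sort by {sort_by!r}")
        if sort_order not in ("asc", "desc"):
            raise ValueError(f"bad sort order {sort_order!r}")
        rows = sorted(
            self._matching(search),
            key=lambda s: getattr(s, sort_by),
            reverse=(sort_order == "desc"),
        )
        return rows[offset:offset + limit]


def list_sessions_page(store, limit=20, offset=0, search=None,
                       sort_by="created_at", sort_order="desc"):
    """Build the {total, limit, offset, items} envelope named in the commit."""
    items = store.search_list(limit, offset, search, sort_by, sort_order)
    return {"total": store.count_all(search), "limit": limit,
            "offset": offset, "items": items}
```

Whitelisting `sort_by` matters more in a SQL-backed store, where the column name cannot be passed as a bound parameter; the sketch applies the same check for consistency.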
2026-04-28	a643119	Slim eval rubric to 3 levels with one reference per axis

    Five anchors per axis (10/30/50/75/100, even after the earlier shift) were both redundant and amplified the model's snap-to-round-numbers prior. Cut to three level descriptions per axis (weak / typical / strong) with a single non-round reference score (53) on `typical`. Re-state the scale as open-ended with no upper bound to make the "future Navi may exceed past ceilings" intent explicit.

    - rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score 53 only on typical, scale framed as fully open-ended.
    - judge.py: render_rubric_for_prompt walks the new `levels` shape and surfaces the reference score only when present.
    - expert prompts (strict_critic, pragmatist, tech_lead): drop the example output blocks (their concrete numbers were misleading the judges), rewrite the scale paragraph for the new structure.
    - schema.py: docstring no longer pins ">100" as the open-scale marker.

    User intent: dynamics, not absolute scores. Weekly aggregates over three averaged experts smooth individual snap-to-5 into continuous trends; the rubric is a calibration aid, not a grading ceiling.

    Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 28 Apr
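To illustrate the `levels` shape from a643119, here is a hedged sketch of a renderer that surfaces the reference score only where one is present. The dictionary layout, the key names (`scale`, `axes`, `levels`, `reference_score`), the axis name, and the level descriptions are assumptions made for the example; the actual rubric_v1.yaml and render_rubric_for_prompt are not shown in this history.

```python
# Hypothetical rubric structure, loosely modelled on the commit message:
# three levels per axis, a reference score on `typical` only.
RUBRIC = {
    "scale": "open-ended integers >= 0, no upper bound",
    "axes": [
        {
            "name": "task_completion",  # made-up axis name
            "levels": [
                {"label": "weak", "description": "misses the main request"},
                {"label": "typical", "description": "meets the request",
                 "reference_score": 53},
                {"label": "strong", "description": "exceeds the request"},
            ],
        },
    ],
}


def render_rubric_sketch(rubric):
    """Render the rubric as plain text for inclusion in a judge prompt."""
    lines = [f"Scale: {rubric['scale']}"]
    for axis in rubric["axes"]:
        lines.append(f"Axis: {axis['name']}")
        for level in axis["levels"]:
            line = f"  {level['label']}: {level['description']}"
            # Only levels that carry a reference score mention one.
            ref = level.get("reference_score")
            if ref is not None:
                line += f" (reference score: {ref})"
            lines.append(line)
    return "\n".join(lines)
```

Calling `print(render_rubric_sketch(RUBRIC))` emits one line per level, with the score appearing only on the `typical` line.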
2026-04-28	9d96249	Fight rubric-anchor snapping in eval judges

    Judges were clustering scores onto the rubric's round anchor values (30, 50, 75, 100) instead of producing fine-grained continuous scores, which made small differences between sessions invisible.

    - rubric_v1.yaml: shift anchors off round numbers (33/51/77/102), reframe the scale as open-ended integers ≥ 0 with illustrative level descriptions, and tell judges explicitly not to round to anchors.
    - expert prompts (strict_critic, pragmatist, tech_lead): mirror the scale framing and add an example output with deliberately non-round scores between anchors.
    - judge.py: bump expert temperature 0.2 → 0.5 so the judges produce more varied, non-deterministic scores.

    Old v1 evaluations in the DB are not comparable to new ones; user intends to wipe and re-run from scratch, so versions are not bumped.

    Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 28 Apr
2026-04-26	307f639	Add eval system Phase 5: debug UI

    Self-contained SPA at /debug/eval (route already wired in 8e0eed6). Single index.html in the existing debug/ style: vanilla JS, embedded CSS, no framework, no build step.

    Four tabs:
    - Sessions: filterable table (profile / status / limit), eval status pill, headline avg scores, click-through to detail
    - Detail: session metadata + every stored eval run, axes laid out as axis × expert grids with inline averages, expert comments, button to re-evaluate this single session
    - Stats: weekly per-axis means table, optional complexity-bucket split
    - Run: form to trigger any scope (unevaluated / single / all), live status panel polling /eval/run/{id} every 2.5s, run history with click-to-attach

    Hash routing: #detail/<session_id> deep-links to a session.

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 26 Apr
	8d5c351	Add eval system Phase 4: read endpoints and background runner

    REST surface for the debug UI:
    - GET /eval/sessions: overview list with eval status / latest avg / feedback counts (single SQL: sessions ⨝ feedback ⨝ latest run)
    - GET /eval/sessions/{id}: session detail with all evaluations
    - GET /eval/stats: weekly per-axis means; optional complexity-bucket split
    - POST /eval/run: fire-and-forget background eval, returns run_id
    - GET /eval/run/{id}, GET /eval/runs: poll progress and history

    Pulled the runner loop out of cli into runner.py so both the CLI and the REST endpoint share the same loop. State for in-flight runs lives in an in-memory registry (single-process, cleared on restart).

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 26 Apr
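The fire-and-forget run with an in-memory registry described in 8d5c351 might look something like the sketch below. All names here (`RUNS`, `start_run`, `get_run`, the state keys, the `evaluate` callback) are hypothetical; the actual runner.py is not shown, and the sketch uses only the standard library rather than any web framework.

```python
import asyncio
import uuid

# run_id -> state dict. Lives only in this process, so it is lost on restart.
RUNS = {}
# Strong references to background tasks so they are not garbage-collected.
_TASKS = set()


async def _run_loop(run_id, session_ids, evaluate):
    state = RUNS[run_id]
    state["status"] = "running"
    try:
        for session_id in session_ids:
            await evaluate(session_id)
            state["done"] += 1
        state["status"] = "finished"
    except Exception as exc:  # record the failure instead of losing it
        state["status"] = "failed"
        state["error"] = str(exc)


def start_run(session_ids, evaluate):
    """Schedule a run and return its id at once (needs a running event loop)."""
    run_id = uuid.uuid4().hex
    RUNS[run_id] = {"status": "queued", "done": 0,
                    "total": len(session_ids), "error": None}
    task = asyncio.create_task(_run_loop(run_id, list(session_ids), evaluate))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return run_id


def get_run(run_id):
    """What a polling endpoint could return; None for an unknown id."""
    return RUNS.get(run_id)


async def demo():
    async def fake_evaluate(session_id):  # stand-in for the real judge call
        await asyncio.sleep(0)

    run_id = start_run(["s1", "s2"], fake_evaluate)
    while get_run(run_id)["status"] not in ("finished", "failed"):
        await asyncio.sleep(0.01)  # the UI polls; here we poll in-process
    return get_run(run_id)
```

Running `asyncio.run(demo())` walks through the queued, running, and finished states. The trade-off of this design is the one the commit message notes: progress for in-flight runs disappears if the process restarts.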
	864261a	Add eval system Phase 3: judge runner end to end

    Fills in the stubs from Phase 2:
    - judge.render_session: full transcript with tool_call/tool_result folding, reactions inlined per assistant block, planning_logs appendix, no compression-summary substitution
    - judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry with corrective nudge on schema or parse error
    - judge.evaluate_session: asyncio.gather across the three experts
    - db.EvalDB: insert_evaluation_run (txn), list_evaluations, evaluated_session_ids, feedback_by_index helper
    - cli `run` (filters: --session, --since, --limit, --re-evaluate-all, --dry-run, --model, --backend) and `show` (groups by eval_run_id, prints per-expert axes plus averaged scores)

    Verified end-to-end against a real 10-message secretary session: all three experts returned valid JSON first try; spread between strict critic and the others surfaced as expected.

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 26 Apr
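The combination of fence-tolerant JSON parsing, a single corrective retry, and `asyncio.gather` across experts named in 864261a can be sketched as below. This is an assumption-laden illustration: the function names carry a `_sketch` suffix to mark them as hypothetical, `call_llm` is an invented callback signature, the nudge wording is made up, and schema validation (which the commit also retries on) is omitted for brevity.

```python
import asyncio
import json
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit in a fence
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_fenced_json(text):
    """Parse JSON whether or not the model wrapped it in a code fence."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())


async def run_expert_sketch(name, call_llm, transcript):
    prompt = transcript
    for attempt in range(2):  # first try plus a single retry
        reply = await call_llm(name, prompt)
        try:
            return name, parse_fenced_json(reply)
        except json.JSONDecodeError as exc:
            if attempt == 1:
                raise  # second failure: give up and surface the error
            # Corrective nudge: tell the model what went wrong.
            prompt = (f"{transcript}\n\nYour previous reply was not valid "
                      f"JSON ({exc}). Reply with a single JSON object only.")


async def evaluate_session_sketch(
        call_llm, transcript,
        experts=("strict_critic", "pragmatist", "tech_lead")):
    # The three experts are independent, so they can run concurrently.
    results = await asyncio.gather(
        *(run_expert_sketch(name, call_llm, transcript) for name in experts))
    return dict(results)
```

Because `asyncio.gather` propagates the first exception by default, one expert failing twice fails the whole session evaluation in this sketch; whether the real runner behaves that way is not stated in the history.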
	e477127	Add eval system Phase 2: rubric, expert prompts, judge skeleton

    Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale), three independent expert prompts (strict_critic / pragmatist / tech_lead) that all return the same JSON shape, and the orchestration scaffolding: schema.py (pydantic models), judge.py (rubric loader, score averaging, fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for run / show / stats.

    Real LLM calls and transcript rendering land in Phase 3; the stubs raise NotImplementedError. `python -m debug.eval` works as the entry point.

    Anchor `examples` are left empty for now; user fills them with real session_ids later without bumping rubric_version.

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 26 Apr
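As an illustration of the score averaging mentioned in e477127, the sketch below computes a per-axis mean across experts. The input shape (`{expert: {axis: score}}`), the function name, and the axis names in the usage note are assumptions; the real judge.py may average differently, for example with weights or after pydantic validation.

```python
from statistics import fmean


def average_scores_sketch(expert_scores):
    """Map {expert: {axis: score}} to {axis: mean score across experts}."""
    per_axis = {}
    for scores in expert_scores.values():
        for axis, value in scores.items():
            per_axis.setdefault(axis, []).append(value)
    # An axis scored by only some experts is averaged over those experts.
    return {axis: fmean(values) for axis, values in per_axis.items()}
```

For example, with hypothetical axes `accuracy` and `tone`, scores of 40, 55, and 61 on `accuracy` from the three experts average to 52.0.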
	5817fb9	Add eval system Phase 1: message feedback signal

    Spec at docs/eval_system.md describes the full LLM-as-judge plan; this commit lands only the in-app feedback layer:
    - debug/eval/ Python package with EvalDB (asyncpg) and FastAPI router exposing /eval/feedback (set / clear / list)
    - message_feedback postgres table keyed by (session_id, message_index)
    - thumbs up / down on each completed assistant block in the webclient, optimistic update with rollback on failure

    Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

    Eugene Sukhodolskiy committed on 26 Apr
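The set / clear / list semantics of feedback keyed by (session_id, message_index) in 5817fb9 can be modelled in memory as below. This is only a sketch of the keying idea: the class, its method names, and the `"up"` / `"down"` values are hypothetical, and the real implementation sits on a postgres table behind EvalDB.

```python
class FeedbackStoreSketch:
    """In-memory model of one feedback value per (session_id, message_index)."""

    VALUES = ("up", "down")  # assumed encoding of thumbs up / thumbs down

    def __init__(self):
        self._rows = {}  # (session_id, message_index) -> "up" | "down"

    def set_feedback(self, session_id, message_index, value):
        if value not in self.VALUES:
            raise ValueError(f"unknown feedback value {value!r}")
        # Upsert: setting again on the same message replaces the old value.
        self._rows[(session_id, message_index)] = value

    def clear_feedback(self, session_id, message_index):
        # Clearing a message that has no feedback is a no-op.
        self._rows.pop((session_id, message_index), None)

    def list_feedback(self, session_id):
        """Return {message_index: value} for one session, ordered by index."""
        return {
            index: value
            for (sid, index), value in sorted(self._rows.items())
            if sid == session_id
        }
```

The composite key means a message can hold at most one vote, so flipping from up to down is a replacement rather than a second row.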