Add eval system Phase 3 — judge runner end to end

root / navi-1


Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding,
  reactions inlined per assistant block, planning_logs appendix, no
  compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry
  with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations,
  evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all,
  --dry-run, --model, --backend) and `show` (groups by eval_run_id, prints
  per-expert axes plus averaged scores)
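The judge flow described above (fence-tolerant parse, one corrective retry, concurrent experts) could look roughly like the sketch below. It is an illustration under assumptions, not the code in debug/eval/judge.py: the `call_llm` callable, the prompt wording, the `parse_fenced_json` helper, and the function signatures are all hypothetical, and schema validation is reduced to "the reply must be a JSON object".

```python
import asyncio
import json
import re
from typing import Awaitable, Callable

# Hypothetical type: an async function that takes a prompt and returns raw model text.
LLMCall = Callable[[str], Awaitable[str]]

# Built from single backticks so this example contains no literal fence marker.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)\s*" + _FENCE, re.DOTALL)


def parse_fenced_json(raw: str) -> dict:
    """Parse a JSON object that may or may not be wrapped in a Markdown fence."""
    match = _FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    data = json.loads(payload)  # raises json.JSONDecodeError (a ValueError) on bad input
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


async def run_expert(name: str, transcript: str, call_llm: LLMCall) -> dict:
    """Ask one expert for a verdict; on a parse error, retry exactly once with a nudge."""
    prompt = f"[{name}] Evaluate this session and answer in JSON:\n{transcript}"
    raw = await call_llm(prompt)
    try:
        return parse_fenced_json(raw)
    except ValueError as exc:
        nudge = (
            f"{prompt}\n\nYour previous reply was invalid ({exc}). "
            "Reply with a single JSON object only."
        )
        # A second failure propagates to the caller; there is no further retry.
        return parse_fenced_json(await call_llm(nudge))


async def evaluate_session(
    transcript: str, call_llm: LLMCall, experts: tuple[str, ...]
) -> dict:
    """Run all experts concurrently and return their verdicts keyed by expert name."""
    results = await asyncio.gather(
        *(run_expert(name, transcript, call_llm) for name in experts)
    )
    return dict(zip(experts, results))
```

Because `asyncio.gather` returns results in the order its awaitables were passed, zipping them back onto the expert names is safe even when the calls finish out of order.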

Verified end-to-end against a real 10-message secretary session:
all three experts returned valid JSON on the first try, and the expected
spread between the strict critic and the other two experts showed up.
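For reference, the averaged scores and the spread between experts can be computed along these lines. This is a minimal sketch, assuming each expert returns a flat mapping of axis name to numeric score; the expert names, axis names, and numbers in the usage note are made up for illustration and are not taken from the verified session.

```python
from statistics import mean


def average_axes(per_expert: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each axis over the experts that reported it."""
    axes = sorted({axis for scores in per_expert.values() for axis in scores})
    return {
        axis: mean(s[axis] for s in per_expert.values() if axis in s)
        for axis in axes
    }


def axis_spread(per_expert: dict[str, dict[str, float]], axis: str) -> float:
    """Difference between the highest and lowest score given for one axis."""
    values = [s[axis] for s in per_expert.values() if axis in s]
    return max(values) - min(values)
```

With hypothetical input such as `{"strict": {"accuracy": 2.0}, "lenient": {"accuracy": 4.0}, "neutral": {"accuracy": 3.0}}`, the average for `accuracy` is 3.0 and the spread is 2.0.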

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feature/navi-code master vmkdemo

1 parent e477127 commit 864261a6cf895e6bbf76fe7c886036a55ae05282

Eugene Sukhodolskiy authored on 26 Apr

Showing 3 changed files

- debug/eval/cli.py
- debug/eval/db.py
- debug/eval/judge.py