Add eval system Phase 3: judge runner end to end

Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding,
  reactions inlined per assistant block, planning_logs appendix, no
  compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry
  with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations,
  evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all,
  --dry-run, --model, --backend) and `show` (groups by eval_run_id, prints
  per-expert axes plus averaged scores)
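
For reviewers, the run_expert / evaluate_session flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in debug/eval/judge.py: the expert names, the REQUIRED_KEYS schema, the prompt wording, and the call_llm callable are all assumptions made up for the example.

```python
import asyncio
import json
import re

# Hypothetical expert names; the commit only says there are three experts,
# one of which is a strict critic.
EXPERTS = ("strict_critic", "expert_b", "expert_c")
REQUIRED_KEYS = {"scores", "summary"}  # assumed schema, for illustration only

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the sketch readable
_FENCE_RE = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE
)


def parse_fenced_json(text: str) -> dict:
    """Parse a JSON object that may or may not be wrapped in a code fence."""
    match = _FENCE_RE.search(text)
    payload = match.group(1) if match else text.strip()
    data = json.loads(payload)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


async def run_expert(call_llm, expert: str, transcript: str) -> dict:
    """Call the model once; on a parse or schema error, retry once with a nudge."""
    prompt = f"[{expert}] Evaluate this session:\n{transcript}"
    reply = await call_llm(prompt)
    try:
        return parse_fenced_json(reply)
    except ValueError as err:
        nudge = (
            f"{prompt}\n\nYour previous reply was invalid ({err}). "
            "Reply with a single JSON object only."
        )
        # A second failure is not caught here and propagates to the caller.
        return parse_fenced_json(await call_llm(nudge))


async def evaluate_session(call_llm, transcript: str) -> dict:
    """Run all experts concurrently and key the results by expert name."""
    results = await asyncio.gather(
        *(run_expert(call_llm, name, transcript) for name in EXPERTS)
    )
    return dict(zip(EXPERTS, results))
```

asyncio.gather returns results in the order the awaitables were passed, which is why zipping them back against EXPERTS is safe in this sketch.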

Verified end to end against a real 10-message secretary session:
all three experts returned valid JSON on the first try, and the expected
score spread between the strict critic and the other two experts showed up.
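
One way to read "averaged scores" and "spread" is per-axis mean and max-minus-min across experts. The sketch below assumes that reading; the function names, axis names, and numbers are invented for illustration and are not taken from the verified session or from cli.py.

```python
from statistics import mean


def average_axes(per_expert: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each axis over the experts that scored it."""
    axes = sorted({axis for scores in per_expert.values() for axis in scores})
    return {
        axis: mean(s[axis] for s in per_expert.values() if axis in s)
        for axis in axes
    }


def axis_spread(per_expert: dict[str, dict[str, float]]) -> dict[str, float]:
    """Max minus min per axis, a simple measure of expert disagreement."""
    axes = sorted({axis for scores in per_expert.values() for axis in scores})
    spread = {}
    for axis in axes:
        values = [s[axis] for s in per_expert.values() if axis in s]
        spread[axis] = max(values) - min(values)
    return spread


# Made-up example data: a strict critic scoring lower than two other experts.
example = {
    "strict_critic": {"helpfulness": 2, "accuracy": 3},
    "expert_b": {"helpfulness": 4, "accuracy": 5},
    "expert_c": {"helpfulness": 3, "accuracy": 4},
}
averages = average_axes(example)
spreads = axis_spread(example)
```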

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e477127 commit 864261a6cf895e6bbf76fe7c886036a55ae05282
Eugene Sukhodolskiy authored on 26 Apr
Showing 3 changed files
debug/eval/cli.py
debug/eval/db.py
debug/eval/judge.py