|
Add eval system Phase 3 — judge runner end to end
Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding, reactions inlined per assistant block, planning_logs appendix, no compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations, evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all, --dry-run, --model, --backend) and `show` (groups by eval_run_id, prints per-expert axes plus averaged scores)

Verified end-to-end against a real 10-message secretary session: all three experts returned valid JSON first try; spread between strict critic and the others surfaced as expected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> |
|---|
|
|
| debug/eval/cli.py |
|---|
| debug/eval/db.py |
|---|
| debug/eval/judge.py |
|---|
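
For readers without the diff at hand, here is a minimal sketch of the pattern the commit message describes for `judge.run_expert` and `judge.evaluate_session`: a fence-tolerant JSON parse, a single retry with a corrective nudge, and `asyncio.gather` across the experts. It is not the code in `debug/eval/judge.py`. The helper `parse_fenced_json`, the `call_llm` callable, the signatures, and the nudge wording are all assumptions made for illustration. The sketch retries on parse errors only and omits the schema check.

```python
import asyncio
import json
import re

# Build the fence marker programmatically so this example nests cleanly.
FENCE = "`" * 3
# Matches a reply wrapped in a Markdown fence, with an optional language tag.
_FENCE_RE = re.compile(
    r"^" + FENCE + r"[a-zA-Z0-9_-]*\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_fenced_json(text):
    """Parse JSON that may or may not be wrapped in a Markdown code fence."""
    stripped = text.strip()
    match = _FENCE_RE.match(stripped)
    if match:
        stripped = match.group(1)
    return json.loads(stripped)


async def run_expert(call_llm, prompt):
    """Call the model once; on a parse error, retry exactly once with a nudge.

    `call_llm` is a hypothetical async callable: prompt str -> reply str.
    """
    reply = await call_llm(prompt)
    try:
        return parse_fenced_json(reply)
    except json.JSONDecodeError as exc:
        nudge = (
            f"{prompt}\n\nYour previous reply was not valid JSON "
            f"({exc.msg}). Reply with a single JSON object only."
        )
        # A second failure propagates to the caller; there is no further retry.
        return parse_fenced_json(await call_llm(nudge))


async def evaluate_session(call_llm, prompts):
    """Run every expert concurrently; `prompts` maps expert name -> prompt."""
    names = list(prompts)
    results = await asyncio.gather(
        *(run_expert(call_llm, prompts[name]) for name in names)
    )
    # gather preserves argument order, so names and results line up.
    return dict(zip(names, results))
```

Passing the model call in as a parameter keeps the sketch testable without a network backend, since a stub coroutine can stand in for the real LLM client.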