|
Add eval system Phase 2 — rubric, expert prompts, judge skeleton
Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale) and three independent expert prompts (strict_critic / pragmatist / tech_lead) that all return the same JSON shape.

Adds the orchestration scaffolding: schema.py (pydantic models), judge.py (rubric loader, score averaging, fence-tolerant JSON parser, new_run_metadata), and cli.py with argparse for run / show / stats.

Real LLM calls and transcript rendering land in Phase 3; until then the stubs raise NotImplementedError. `python -m debug.eval` works as the entry point.

Anchor `examples` are left empty for now; the user fills them in with real session_ids later without bumping rubric_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> |
|---|
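
For readers unfamiliar with the terms, here is a minimal sketch of what "fence-tolerant JSON parser" and "score averaging" could look like. It is not the code in judge.py: the function names, the `scores` key, and the axis names are all hypothetical, chosen only to illustrate the idea that an expert reply may or may not be wrapped in a Markdown code fence and that per-axis scores from the three experts are averaged.

```python
import json
import re
from statistics import mean
from typing import Dict, List

# Optional ```json ... ``` (or bare ``` ... ```) wrapper around the payload.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def parse_expert_json(raw: str) -> dict:
    """Parse a JSON object from a model reply, fenced or not (hypothetical helper)."""
    match = _FENCE_RE.search(raw)
    payload = match.group(1) if match else raw.strip()
    return json.loads(payload)


def average_scores(expert_results: List[dict]) -> Dict[str, float]:
    """Average each axis across experts.

    Assumes every result carries a 'scores' mapping with the same axis names;
    that shape is an assumption, not the schema from schema.py.
    """
    axes = expert_results[0]["scores"].keys()
    return {axis: mean(r["scores"][axis] for r in expert_results) for axis in axes}


# Hypothetical replies from the three experts, one fenced and two bare.
replies = [
    '```json\n{"scores": {"correctness": 50, "clarity": 30}}\n```',
    '{"scores": {"correctness": 75, "clarity": 30}}',
    '  {"scores": {"correctness": 100, "clarity": 75}}  ',
]
averaged = average_scores([parse_expert_json(r) for r in replies])
```

Taking the first fenced block and falling back to the raw text keeps the parser forgiving without trying to repair malformed JSON; anything that still fails to parse surfaces as a `json.JSONDecodeError`.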
| debug/eval/__main__.py 0 → 100644 |
|---|
| debug/eval/cli.py 0 → 100644 |
|---|
| debug/eval/judge.py 0 → 100644 |
|---|
| debug/eval/prompts/expert_pragmatist.txt 0 → 100644 |
|---|
| debug/eval/prompts/expert_strict_critic.txt 0 → 100644 |
|---|
| debug/eval/prompts/expert_tech_lead.txt 0 → 100644 |
|---|
| debug/eval/prompts/rubric_v1.yaml 0 → 100644 |
|---|
| debug/eval/schema.py 0 → 100644 |
|---|
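
The run / show / stats layout mentioned in the commit message could be wired up with argparse subcommands roughly as follows. This is a sketch under stated assumptions, not the contents of cli.py: the positional arguments `session_id` and `run_id`, the help strings, and the `build_parser` / `main` names are hypothetical. Only the three subcommand names, the `python -m debug.eval` entry point, and the NotImplementedError stub behavior come from the commit message.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with run / show / stats subcommands."""
    parser = argparse.ArgumentParser(prog="python -m debug.eval")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="judge one session")
    run.add_argument("session_id")  # hypothetical argument

    show = sub.add_parser("show", help="print one stored evaluation")
    show.add_argument("run_id")  # hypothetical argument

    sub.add_parser("stats", help="aggregate scores across runs")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Dispatch on the chosen subcommand; 'run' is a stub, as in Phase 2."""
    args = build_parser().parse_args(argv)
    if args.command == "run":
        # Mirrors the commit message: real LLM calls land in Phase 3.
        raise NotImplementedError("LLM calls are not implemented yet")
    print(f"{args.command}: nothing to display in this sketch")
    return 0
```

In a layout like this, `__main__.py` would typically just call `main()` and pass its return value to `sys.exit`, which is what lets `python -m debug.eval` serve as the entry point.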