Add eval system Phase 2: rubric, expert prompts, judge skeleton

Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale),
three independent expert prompts (strict_critic / pragmatist / tech_lead)
that all return the same JSON shape, and the orchestration scaffolding:
schema.py (pydantic models), judge.py (rubric loader, score averaging,
fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for
run / show / stats. Real LLM calls and transcript rendering land in
Phase 3; until then the stubs raise NotImplementedError.
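
A minimal sketch of what the fence-tolerant parser and the score averaging
in judge.py might look like. The function names (`parse_expert_json`,
`average_scores`) and the flat axis-to-score dict are assumptions made for
illustration; the actual code in this commit may differ.

```python
import json
import re
from statistics import mean

# Hypothetical helpers: the commit message only says judge.py contains a
# fence-tolerant JSON parser and score averaging, not how they are written.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def parse_expert_json(raw: str) -> dict:
    """Parse a JSON object from a model reply, with or without a Markdown fence."""
    match = _FENCE_RE.search(raw)
    # Use the fenced payload if there is one, otherwise the whole reply.
    payload = match.group(1) if match else raw.strip()
    data = json.loads(payload)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


def average_scores(expert_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each axis over the experts that scored it."""
    axes = sorted({axis for scores in expert_scores for axis in scores})
    return {
        axis: mean(s[axis] for s in expert_scores if axis in s)
        for axis in axes
    }
```

With three experts returning the same JSON shape, each reply would go
through `parse_expert_json` and the per-axis scores through
`average_scores`. Averaging only over the experts that scored an axis is
one possible policy, not necessarily the one used here.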

`python -m debug.eval` works as the entry point. Anchor `examples` are
left empty for now; the user fills them in later with real session_ids,
without bumping rubric_version.
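
The entry point could be wired roughly as follows. This is a sketch under
assumptions: the positional arguments (`session_id`, `run_id`), the help
strings, and the choice of which subcommand raises are hypothetical; only
the run / show / stats subcommands and the NotImplementedError stubs come
from the description above.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="python -m debug.eval")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="judge one session")
    run.add_argument("session_id")  # hypothetical argument name

    show = sub.add_parser("show", help="show one stored run")
    show.add_argument("run_id")  # hypothetical argument name

    sub.add_parser("stats", help="aggregate stored runs")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "run":
        # Assumed stub: real LLM calls are deferred to Phase 3.
        raise NotImplementedError("LLM calls land in Phase 3")
    print(f"{args.command}: nothing to display yet")
    return 0
```

Under this layout, `debug/eval/__main__.py` would only need to import
`main` from cli.py and call `raise SystemExit(main())`.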

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Commit e4771277847c3da2f64395883c6f65adfdf137fb (1 parent: 5817fb9)
Eugene Sukhodolskiy authored on 26 Apr
Showing 8 changed files
debug/eval/__main__.py 0 → 100644
debug/eval/cli.py 0 → 100644
debug/eval/judge.py 0 → 100644
debug/eval/prompts/expert_pragmatist.txt 0 → 100644
debug/eval/prompts/expert_strict_critic.txt 0 → 100644
debug/eval/prompts/expert_tech_lead.txt 0 → 100644
debug/eval/prompts/rubric_v1.yaml 0 → 100644
debug/eval/schema.py 0 → 100644