Add eval system Phase 2 — rubric, expert prompts, judge skeleton

root / navi-1

Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale),
three independent expert prompts (strict_critic / pragmatist / tech_lead)
that all return the same JSON shape, and the orchestration scaffolding:
schema.py (pydantic models), judge.py (rubric loader, score averaging,
fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for
run / show / stats. Real LLM calls and transcript rendering land in
Phase 3 — the stubs raise NotImplementedError.
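
A minimal sketch of what the fence-tolerant JSON parser and the score averaging in judge.py might look like. The function names `parse_fenced_json` and `average_scores`, and the per-axis dict shape, are assumptions made for illustration; the actual code in this commit may differ.

```python
import json
import re
from statistics import mean

# Build the fence marker programmatically so this snippet can itself be fenced.
_FENCE = "`" * 3
_FENCE_RE = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL | re.IGNORECASE)


def parse_fenced_json(text: str) -> dict:
    """Parse JSON from a reply that may wrap it in a Markdown code fence.

    Hypothetical helper: illustrates one way to be "fence-tolerant".
    """
    match = _FENCE_RE.search(text)
    candidate = (match.group(1) if match else text).strip()
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if stray prose surrounds it.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(candidate[start : end + 1])


def average_scores(expert_scores: list) -> dict:
    """Average each axis over the experts that scored it.

    Assumes each expert returns a flat {axis_name: score} mapping.
    """
    axes = sorted({axis for scores in expert_scores for axis in scores})
    return {
        axis: mean(s[axis] for s in expert_scores if axis in s)
        for axis in axes
    }
```

Because all three expert prompts return the same JSON shape, one parser can serve all of them, and the averaging step only needs to know the axis names.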

`python -m debug.eval` works as the entry point. Anchor `examples` are
left empty for now; the user fills them in later with real session_ids,
without bumping rubric_version.
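
For orientation, the argparse layout behind the `run` / `show` / `stats` entry point could be sketched roughly as follows. The positional arguments `session_id` and `run_id`, the help strings, and the choice to raise from `run` are assumptions for illustration, not taken from cli.py.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the three subcommands named in the commit message."""
    parser = argparse.ArgumentParser(prog="python -m debug.eval")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="judge a session (stub until Phase 3)")
    run.add_argument("session_id")  # hypothetical argument

    show = sub.add_parser("show", help="show a stored run")
    show.add_argument("run_id")  # hypothetical argument

    sub.add_parser("stats", help="aggregate stored runs")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Dispatch on the chosen subcommand; argv=None falls back to sys.argv."""
    args = build_parser().parse_args(argv)
    if args.command == "run":
        # Assumed placement of the stub: real LLM calls arrive in Phase 3.
        raise NotImplementedError("real LLM calls land in Phase 3")
    print(f"{args.command}: not wired up in this sketch")
    return 0
```

Under this sketch, `debug/eval/__main__.py` would only need to call `main()` and pass its return value to `sys.exit`.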

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feature/navi-code master vmkdemo

1 parent 5817fb9 commit e4771277847c3da2f64395883c6f65adfdf137fb

Eugene Sukhodolskiy authored on 26 Apr

Showing 8 changed files

debug/eval/__main__.py 0 → 100644
debug/eval/cli.py 0 → 100644
debug/eval/judge.py 0 → 100644
debug/eval/prompts/expert_pragmatist.txt 0 → 100644
debug/eval/prompts/expert_strict_critic.txt 0 → 100644
debug/eval/prompts/expert_tech_lead.txt 0 → 100644
debug/eval/prompts/rubric_v1.yaml 0 → 100644
debug/eval/schema.py 0 → 100644
