|
Fight rubric-anchor snapping in eval judges
Judges were clustering scores onto the rubric's round anchor values (30, 50, 75, 100) instead of producing fine-grained continuous scores, which made small differences between sessions invisible. - rubric_v1.yaml: shift anchors off round numbers (33/51/77/102), reframe the scale as open-ended integers ≥ 0 with illustrative level descriptions, and tell judges explicitly not to round to anchors. - expert prompts (strict_critic, pragmatist, tech_lead): mirror the scale framing and add an example output with deliberately non-round scores between anchors. - judge.py: bump expert temperature 0.2 → 0.5 so the judges produce more varied, non-deterministic scores. Old v1 evaluations in the DB are not comparable to new ones; user intends to wipe and re-run from scratch, so versions are not bumped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
|---|
|
|
| debug/eval/judge.py |
|---|
| debug/eval/prompts/expert_pragmatist.txt |
|---|
| debug/eval/prompts/expert_strict_critic.txt |
|---|
| debug/eval/prompts/expert_tech_lead.txt |
|---|
| debug/eval/prompts/rubric_v1.yaml |
|---|