Fight rubric-anchor snapping in eval judges
Judges were clustering scores onto the rubric's round anchor values
(30, 50, 75, 100) instead of producing fine-grained continuous scores,
which made small differences between sessions invisible.
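
One way to quantify the snapping described above is to measure how many
judge scores land exactly on an anchor. This is a minimal sketch, not code
from this commit: `anchor_hit_rate` and `most_common_scores` are
hypothetical helpers, and the only values taken from the message are the
old anchors.

```python
from collections import Counter

# Round anchors named in this commit message (the pre-change rubric).
OLD_ANCHORS = (30, 50, 75, 100)


def anchor_hit_rate(scores, anchors=OLD_ANCHORS):
    """Return the fraction of scores that land exactly on an anchor value."""
    if not scores:
        return 0.0
    anchor_set = set(anchors)
    hits = sum(1 for s in scores if s in anchor_set)
    return hits / len(scores)


def most_common_scores(scores, n=3):
    """Return the n most frequent scores as (score, count) pairs."""
    return Counter(scores).most_common(n)
```

A hit rate close to 1.0 over many sessions would suggest the judges are
snapping to anchors rather than scoring continuously.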

- rubric_v1.yaml: shift anchors off round numbers (33/51/77/102),
  reframe the scale as open-ended integers ≥ 0 with illustrative level
  descriptions, and tell judges explicitly not to round to anchors.
- expert prompts (strict_critic, pragmatist, tech_lead): mirror the
  scale framing and add an example output with deliberately non-round
  scores between anchors.
- judge.py: raise expert temperature 0.2 → 0.5 so the judges produce
  more varied, less deterministic scores.
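
The scale framing in the list above could be rendered into prompt text
roughly as follows. This is an illustrative sketch only: the function
name, the wording, and the level labels are assumptions, and the real
text lives in rubric_v1.yaml and the expert prompt files.

```python
# Shifted anchors named in this commit message.
NEW_ANCHORS = (33, 51, 77, 102)


def render_scale_note(anchors=NEW_ANCHORS):
    """Build a hypothetical scale description for a judge prompt."""
    lines = ["Score is an open-ended integer >= 0."]
    for i, value in enumerate(anchors, start=1):
        # Level labels are placeholders, not the rubric's real descriptions.
        lines.append(f"- around {value}: illustrative level {i}")
    lines.append("Anchors are reference points only; do not round to them.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_scale_note())
```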

Existing v1 evaluations in the DB are not comparable to new ones; the
user intends to wipe them and re-run from scratch, so versions are not
bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent d9e9f4d commit 9d962498c2779fa134e06397712f7fc37b251c35
Eugene Sukhodolskiy authored on 28 Apr
Showing 5 changed files
debug/eval/judge.py
debug/eval/prompts/expert_pragmatist.txt
debug/eval/prompts/expert_strict_critic.txt
debug/eval/prompts/expert_tech_lead.txt
debug/eval/prompts/rubric_v1.yaml