Fight rubric-anchor snapping in eval judges

root / navi-1


Judges were clustering scores onto the rubric's round anchor values
(30, 50, 75, 100) instead of producing fine-grained continuous scores,
which made small differences between sessions invisible.
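One way to quantify the clustering described above is to measure how many scores land exactly on an anchor. The following is a minimal sketch, not code from this commit: the helper names `anchor_hit_rate` and `distinct_ratio` are hypothetical, and the sample scores are made up for illustration.

```python
OLD_ANCHORS = (30, 50, 75, 100)  # round anchors named in the commit message


def anchor_hit_rate(scores, anchors=OLD_ANCHORS):
    """Return the fraction of scores that land exactly on an anchor value."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if s in anchors)
    return hits / len(scores)


def distinct_ratio(scores):
    """Return distinct values divided by total; low values suggest clustering."""
    if not scores:
        return 0.0
    return len(set(scores)) / len(scores)


# Made-up scores for illustration only, not real evaluation data.
snapped = [50, 75, 75, 50, 100, 75, 30, 75]
spread = [47, 71, 78, 54, 96, 69, 35, 73]

print(anchor_hit_rate(snapped), distinct_ratio(snapped))
print(anchor_hit_rate(spread), distinct_ratio(spread))
```

A hit rate near 1.0 together with a low distinct ratio would be consistent with judges snapping to anchors; after the change, the same check should be run against the new anchor values as well.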

- rubric_v1.yaml: shift anchors off round numbers (33/51/77/102),
  reframe the scale as open-ended integers ≥ 0 with illustrative level
  descriptions, and tell judges explicitly not to round to anchors.
- expert prompts (strict_critic, pragmatist, tech_lead): mirror the
  scale framing and add an example output with deliberately non-round
  scores between anchors.
- judge.py: bump expert temperature 0.2 → 0.5 so the judges produce
  more varied, non-deterministic scores.
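The three changes listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `judge.py` or `rubric_v1.yaml`: the names `NEW_ANCHORS`, `EXPERT_TEMPERATURE`, `render_scale_note`, and `parse_score` are hypothetical, and the prompt wording is invented. Only the anchor values and the temperature come from the commit message.

```python
NEW_ANCHORS = (33, 51, 77, 102)  # off-round anchors from the commit message
EXPERT_TEMPERATURE = 0.5         # raised from 0.2 per the commit message


def render_scale_note(anchors=NEW_ANCHORS):
    """Build a prompt fragment that frames anchors as illustrative only.

    The wording here is an assumption, not the text of the real prompts.
    """
    levels = ", ".join(str(a) for a in anchors)
    return (
        "Score with any integer >= 0. "
        f"The values {levels} only illustrate levels; "
        "do not round your score to them."
    )


def parse_score(raw):
    """Parse a judge's score: any integer >= 0 is valid, with no upper cap."""
    score = int(raw.strip())
    if score < 0:
        raise ValueError(f"score must be >= 0, got {score}")
    return score


print(render_scale_note())
print(parse_score(" 64 "), parse_score("110"))
```

Leaving the scale uncapped in the parser matches the open-ended framing: a score above the top anchor, such as 110, is accepted rather than clamped.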

Old v1 evaluations in the DB are not comparable to new ones; the user
intends to wipe them and re-run from scratch, so versions are not bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feature/navi-code master vmkdemo

1 parent: d9e9f4d; commit: 9d962498c2779fa134e06397712f7fc37b251c35

Eugene Sukhodolskiy authored on 28 Apr

Showing 5 changed files

- debug/eval/judge.py
- debug/eval/prompts/expert_pragmatist.txt
- debug/eval/prompts/expert_strict_critic.txt
- debug/eval/prompts/expert_tech_lead.txt
- debug/eval/prompts/rubric_v1.yaml