Slim eval rubric to 3 levels with one reference per axis
Five anchors per axis (10/30/50/75/100, even after the earlier shift)
were redundant and amplified the model's snap-to-round-numbers
prior. Cut to three level descriptions per axis (weak / typical /
strong) with a single non-round reference score (53) on `typical`.
Re-state the scale as open-ended with no upper bound to make the
"future Navi may exceed past ceilings" intent explicit.

- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score
  53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and
  surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the
  example output blocks (their concrete numbers were misleading the
  judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.
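
For illustration, the new `levels` shape and the conditional reference
score could look roughly like the sketch below. Only the function name
`render_rubric_for_prompt`, the `levels` key, the three level names and
the score 53 come from this commit; the axis name, the `description` and
`reference_score` keys, and the output wording are assumptions, not the
actual contents of rubric_v1.yaml or judge.py.

```python
from typing import Any

# Hypothetical rubric data mirroring the shape described above. The axis
# name, the "description" / "reference_score" keys and the texts are
# illustrative assumptions, not the real rubric_v1.yaml.
RUBRIC: dict[str, Any] = {
    "axes": {
        "clarity": {
            "levels": {
                "weak": {"description": "Hard to follow."},
                "typical": {"description": "Clear enough.", "reference_score": 53},
                "strong": {"description": "Precise and easy to act on."},
            }
        }
    }
}


def render_rubric_for_prompt(rubric: dict[str, Any]) -> str:
    """Render each axis as one line per level; add the reference score only if set."""
    lines: list[str] = []
    for axis_name, axis in rubric["axes"].items():
        lines.append(f"{axis_name}:")
        for level_name, level in axis["levels"].items():
            line = f"  {level_name}: {level['description']}"
            ref = level.get("reference_score")
            if ref is not None:  # weak / strong carry no number at all
                line += f" (reference: ~{ref})"
            lines.append(line)
    return "\n".join(lines)


print(render_rubric_for_prompt(RUBRIC))
```

Leaving the key out on `weak` and `strong`, rather than setting it to a
placeholder, keeps the prompt free of extra numbers the judges could
snap to.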

User intent: dynamics, not absolute scores. Weekly aggregates over
three averaged experts smooth individual snap-to-5 into continuous
trends; the rubric is a calibration aid, not a grading ceiling.
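
A minimal sketch of that smoothing, assuming one evaluation per day and
scores that all land on multiples of 5. The records, dates and the
`weekly_trend` helper are hypothetical and not part of this commit;
only the three expert names are taken from it.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (evaluation date, expert, score). Every score is
# a multiple of 5 to mimic the snap-to-5 behaviour of a single judge.
SCORES = [
    (date(2025, 4, 21), "strict_critic", 45),
    (date(2025, 4, 21), "pragmatist", 55),
    (date(2025, 4, 21), "tech_lead", 50),
    (date(2025, 4, 23), "strict_critic", 50),
    (date(2025, 4, 23), "pragmatist", 55),
    (date(2025, 4, 23), "tech_lead", 55),
    (date(2025, 4, 28), "strict_critic", 55),
    (date(2025, 4, 28), "pragmatist", 60),
    (date(2025, 4, 28), "tech_lead", 55),
    (date(2025, 4, 30), "strict_critic", 60),
    (date(2025, 4, 30), "pragmatist", 60),
    (date(2025, 4, 30), "tech_lead", 55),
]


def _avg(values):
    """Plain arithmetic mean as a float."""
    return sum(values) / len(values)


def weekly_trend(records):
    """Average the experts per evaluation, then average evaluations per ISO week."""
    per_eval = defaultdict(list)
    for day, _expert, score in records:
        per_eval[day].append(score)

    per_week = defaultdict(list)
    for day, scores in per_eval.items():
        week_key = tuple(day.isocalendar()[:2])  # (ISO year, ISO week)
        per_week[week_key].append(_avg(scores))

    return {week: _avg(vals) for week, vals in sorted(per_week.items())}


for week, value in weekly_trend(SCORES).items():
    print(week, round(value, 2))
```

With these made-up inputs the two weekly values fall between the
multiples of 5, which is the trend signal the rubric is meant to
support.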

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 9d96249 commit a643119ceea95f9d0b922e08a9bdc6f14b2746bc
Eugene Sukhodolskiy authored on 28 Apr
Showing 6 changed files
debug/eval/judge.py
debug/eval/prompts/expert_pragmatist.txt
debug/eval/prompts/expert_strict_critic.txt
debug/eval/prompts/expert_tech_lead.txt
debug/eval/prompts/rubric_v1.yaml
debug/eval/schema.py