|
Slim eval rubric to 3 levels with one reference per axis

Five anchors per axis (10/30/50/75/100, even after the earlier shift) were both redundant and amplified the model's snap-to-round-numbers prior. Cut to three level descriptions per axis (weak / typical / strong) with a single non-round reference score (53) on `typical`. Re-state the scale as open-ended with no upper bound to make the "future Navi may exceed past ceilings" intent explicit.

- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score 53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the example output blocks (their concrete numbers were misleading the judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.

User intent: dynamics, not absolute scores. Weekly aggregates over three averaged experts smooth individual snap-to-5 into continuous trends; the rubric is a calibration aid, not a grading ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
|---|
|
| debug/eval/judge.py |
|---|
| debug/eval/prompts/expert_pragmatist.txt |
|---|
| debug/eval/prompts/expert_strict_critic.txt |
|---|
| debug/eval/prompts/expert_tech_lead.txt |
|---|
| debug/eval/prompts/rubric_v1.yaml |
|---|
| debug/eval/schema.py |
|---|