Slim eval rubric to 3 levels with one reference per axis


Five anchors per axis (10/30/50/75/100, even after the earlier shift)
were redundant and amplified the model's snap-to-round-numbers
prior. Cut to three level descriptions per axis (weak / typical /
strong) with a single non-round reference score (53) on `typical`.
Restate the scale as open-ended with no upper bound to make the
"future Navi may exceed past ceilings" intent explicit.

- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score
  53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and
  surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the
  example output blocks (their concrete numbers were misleading the
  judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.
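
A minimal sketch of what the judge.py change might look like, assuming the rubric is loaded into a nested dict. Only the function name render_rubric_for_prompt, the weak / typical / strong levels, and the reference score 53 come from this commit; the key names (`axes`, `description`, `reference_score`), the axis name, the description texts, and the output layout are hypothetical.

```python
from typing import Any, Mapping

# Hypothetical in-memory form of rubric_v1.yaml; the real key names may differ.
RUBRIC: dict[str, Any] = {
    "axes": {
        "correctness": {  # axis name invented for illustration
            "levels": {
                "weak": {"description": "Misses the main requirement."},
                "typical": {
                    "description": "Meets the requirement with minor gaps.",
                    "reference_score": 53,  # the single non-round reference
                },
                "strong": {"description": "Exceeds the requirement."},
            }
        }
    }
}


def render_rubric_for_prompt(rubric: Mapping[str, Any]) -> str:
    """Render each axis and its levels as plain text for a judge prompt."""
    lines: list[str] = []
    for axis_name, axis in rubric["axes"].items():
        lines.append(f"{axis_name}:")
        for level_name, level in axis["levels"].items():
            line = f"  {level_name}: {level['description']}"
            ref = level.get("reference_score")
            if ref is not None:  # surface the score only when present
                line += f" (reference: {ref})"
            lines.append(line)
    lines.append("The scale is open-ended: there is no upper bound.")
    return "\n".join(lines)


print(render_rubric_for_prompt(RUBRIC))
```

Reading the score with `.get` means levels without a reference render as description only, so the judges see exactly one number per axis.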

User intent: track dynamics, not absolute scores. Weekly aggregates
averaged over the three experts smooth individual snap-to-5 scores
into continuous trends; the rubric is a calibration aid, not a
grading ceiling.
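
The smoothing argument above can be illustrated with a small sketch. It assumes each run yields one score per expert and that runs are grouped by ISO week; the dates, the scores, and the helper name `weekly_trend` are all made up for illustration and are not taken from the eval code.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical per-run scores from the three expert judges (illustrative only).
# Every individual score is snapped to a multiple of 5.
runs = [
    (date(2025, 4, 21), {"strict_critic": 45, "pragmatist": 55, "tech_lead": 50}),
    (date(2025, 4, 23), {"strict_critic": 50, "pragmatist": 60, "tech_lead": 55}),
    (date(2025, 4, 28), {"strict_critic": 55, "pragmatist": 60, "tech_lead": 65}),
]


def weekly_trend(runs):
    """Average the experts per run, then average the runs in each ISO week."""
    by_week = defaultdict(list)
    for day, scores in runs:
        iso = day.isocalendar()  # (ISO year, ISO week, weekday)
        by_week[(iso[0], iso[1])].append(mean(scores.values()))
    return {week: mean(values) for week, values in sorted(by_week.items())}


for week, score in weekly_trend(runs).items():
    print(week, score)
```

With these inputs the first week averages to 52.5, a value no single snapped score could take, which is the sense in which aggregation turns coarse individual scores into a continuous trend.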

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

feature/navi-code master vmkdemo

1 parent 9d96249 commit a643119ceea95f9d0b922e08a9bdc6f14b2746bc

Eugene Sukhodolskiy authored on 28 Apr

Showing 6 changed files

- debug/eval/judge.py
- debug/eval/prompts/expert_pragmatist.txt
- debug/eval/prompts/expert_strict_critic.txt
- debug/eval/prompts/expert_tech_lead.txt
- debug/eval/prompts/rubric_v1.yaml
- debug/eval/schema.py