root/navi-1

Fork: 0

root / navi-1

History for navi-1 / debug / eval / prompts / expert_strict_critic.txt

2026-04-28	a643119 Browse files » Slim eval rubric to 3 levels with one reference per axis ... Five anchors per axis (10/30/50/75/100, even after the earlier shift) were both redundant and amplified the model's snap-to-round-numbers prior. Cut to three level descriptions per axis (weak / typical / strong) with a single non-round reference score (53) on `typical`. Re-state the scale as open-ended with no upper bound to make the "future Navi may exceed past ceilings" intent explicit. - rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score 53 only on typical, scale framed as fully open-ended. - judge.py: render_rubric_for_prompt walks the new `levels` shape and surfaces the reference score only when present. - expert prompts (strict_critic, pragmatist, tech_lead): drop the example output blocks (their concrete numbers were misleading the judges), rewrite the scale paragraph for the new structure. - schema.py: docstring no longer pins ">100" as the open-scale marker. User intent: dynamics, not absolute scores. Weekly aggregates over three averaged experts smooth individual snap-to-5 into continuous trends; the rubric is a calibration aid, not a grading ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Eugene Sukhodolskiy committed on 28 Apr
9d96249 Browse files » Fight rubric-anchor snapping in eval judges ... Judges were clustering scores onto the rubric's round anchor values (30, 50, 75, 100) instead of producing fine-grained continuous scores, which made small differences between sessions invisible. - rubric_v1.yaml: shift anchors off round numbers (33/51/77/102), reframe the scale as open-ended integers ≥ 0 with illustrative level descriptions, and tell judges explicitly not to round to anchors. - expert prompts (strict_critic, pragmatist, tech_lead): mirror the scale framing and add an example output with deliberately non-round scores between anchors. - judge.py: bump expert temperature 0.2 → 0.5 so the judges produce more varied, non-deterministic scores. Old v1 evaluations in the DB are not comparable to new ones; user intends to wipe and re-run from scratch, so versions are not bumped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Eugene Sukhodolskiy committed on 28 Apr
2026-04-26	e477127 Browse files » Add eval system Phase 2 — rubric, expert prompts, judge skeleton ... Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale), three independent expert prompts (strict_critic / pragmatist / tech_lead) that all return the same JSON shape, and the orchestration scaffolding: schema.py (pydantic models), judge.py (rubric loader, score averaging, fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for run / show / stats. Real LLM calls and transcript rendering land in Phase 3 — the stubs raise NotImplementedError. `python -m debug.eval` works as the entry point. Anchor `examples` are left empty for now; user fills them with real session_ids later without bumping rubric_version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Eugene Sukhodolskiy committed on 26 Apr

2026-04-28

a643119
Browse files »

Slim eval rubric to 3 levels with one reference per axis ...

Five anchors per axis (10/30/50/75/100, even after the earlier shift)
were both redundant and amplified the model's snap-to-round-numbers
prior. Cut to three level descriptions per axis (weak / typical /
strong) with a single non-round reference score (53) on `typical`.
Re-state the scale as open-ended with no upper bound to make the
"future Navi may exceed past ceilings" intent explicit.

- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score
  53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and
  surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the
  example output blocks (their concrete numbers were misleading the
  judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.

User intent: dynamics, not absolute scores. Weekly aggregates over
three averaged experts smooth individual snap-to-5 into continuous
trends; the rubric is a calibration aid, not a grading ceiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 28 Apr

9d96249
Browse files »

Fight rubric-anchor snapping in eval judges ...

Judges were clustering scores onto the rubric's round anchor values
(30, 50, 75, 100) instead of producing fine-grained continuous scores,
which made small differences between sessions invisible.

- rubric_v1.yaml: shift anchors off round numbers (33/51/77/102),
  reframe the scale as open-ended integers ≥ 0 with illustrative level
  descriptions, and tell judges explicitly not to round to anchors.
- expert prompts (strict_critic, pragmatist, tech_lead): mirror the
  scale framing and add an example output with deliberately non-round
  scores between anchors.
- judge.py: bump expert temperature 0.2 → 0.5 so the judges produce
  more varied, non-deterministic scores.

Old v1 evaluations in the DB are not comparable to new ones; user
intends to wipe and re-run from scratch, so versions are not bumped.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 28 Apr

2026-04-26

e477127
Browse files »

Add eval system Phase 2 — rubric, expert prompts, judge skeleton ...

Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale),
three independent expert prompts (strict_critic / pragmatist / tech_lead)
that all return the same JSON shape, and the orchestration scaffolding:
schema.py (pydantic models), judge.py (rubric loader, score averaging,
fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for
run / show / stats. Real LLM calls and transcript rendering land in
Phase 3 — the stubs raise NotImplementedError.

`python -m debug.eval` works as the entry point. Anchor `examples` are
left empty for now; user fills them with real session_ids later without
bumping rubric_version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 26 Apr