navi-1/debug/eval/prompts/rubric_v1.yaml at 3fcccd5ae5e668e593bb1a660c7593e855883324

Eugene Sukhodolskiy on 28 Apr 5 KB Slim eval rubric to 3 levels with one reference per axis
# Rubric v1
#
# Each axis carries three level descriptions: weak / typical / strong.
# Only the `typical` level holds a numeric reference score (`53`) — it is
# the single calibration anchor; weak and strong have no numbers attached.
# Judges score every axis with any non-negative integer on a fully
# open-ended scale — there is no upper bound, and exceptional sessions
# may score arbitrarily high as Navi grows. There are no preferred values;
# round multiples of 5 or 10 are not expected. The reference is just a
# pin: a typical-day session lands near 53; clearly weaker sessions go
# below, clearly stronger go above with no ceiling.
#
# Conventions:
#   - Score each axis independently; a high or low score on one axis must
#     not pull another up or down.
#   - `task_complexity` is judged from the user's request alone, before
#     the response is considered.
#   - `subagent_orchestration` and `self_extension` may be null when the
#     session never used those mechanics. Do not invent zeros.
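#
# Illustrative sketch only: the key layout and the numbers below are
# hypothetical and are not defined by this file. Assuming a judge returns
# one value per axis, the scores for a somewhat-above-typical session that
# spawned no sub-agents and wrote no tools might look like:
#
#   task_complexity: 61
#   goal_completion: 58
#   tool_usage_quality: 49
#   efficiency: 47
#   communication: 66
#   subagent_orchestration: null   # mechanic unused: null, not 0
#   self_extension: null           # mechanic unused: null, not 0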

version: "v1"

axes:
  task_complexity:
    description: >
      How hard the user's request is, judged from the request alone — before
      considering how Navi handled it. Reflects ambiguity, depth, multi-step
      reasoning, and how many things have to go right. Independent of outcome.
    levels:
      - label: weak
        what: "Single fact or single tool, no real planning, expected answer is obvious."
      - label: typical
        score: 53
        what: "Multi-step task with mild ambiguity; planning helps but the path is mostly clear."
      - label: strong
        what: "Long-horizon, multi-tool, real ambiguity to resolve, several places where it could fail; possibly project-shaped with sub-agents or self-extension."

  goal_completion:
    description: >
      What fraction of the user's intent — including unstated needs they
      would care about — was actually delivered. This is a continuous
      dimension, not yes/no: even a successful response can leave gaps,
      and even a failed one can deliver part of what was asked.
    levels:
      - label: weak
        what: "User did not get what they asked for; gave up, redirected, or output was off-target."
      - label: typical
        score: 53
        what: "Most of the request was met; user has minor gaps to fill in or live with."
      - label: strong
        what: "Goal fully met, including edge cases or caveats the user didn't have to ask about."

  tool_usage_quality:
    description: >
      Whether tools were chosen appropriately, called efficiently, with
      errors handled cleanly and results reused.
    levels:
      - label: weak
        what: "Wrong tools picked, repeated identical calls, no recovery from errors, or substantial wasted work."
      - label: typical
        score: 53
        what: "Tools were appropriate; one or two avoidable detours or redundant lookups."
      - label: strong
        what: "Minimal sufficient toolset, each call has a clear purpose, errors handled gracefully, results reused."

  efficiency:
    description: >
      Iterations and total work relative to the result. Detours, loops,
      and re-do attempts cost points; tight planning saves them.
    levels:
      - label: weak
        what: "Loops, runs out of iteration budget, or never converges; many aborted or duplicated attempts."
      - label: typical
        score: 53
        what: "Linear path with minor stalls; reaches the goal without too many detours."
      - label: strong
        what: "Few wasted moves; planning anticipated the work; shortest reasonable path."

  communication:
    description: >
      Clarity, honesty, brevity, and absence of hallucinations. Penalise
      padded replies and unverified claims even when the underlying answer
      is correct. A correct answer wrapped in fluff is not strong.
    levels:
      - label: weak
        what: "Hallucinations, false claims that work was done, walls of filler, or major inaccuracies."
      - label: typical
        score: 53
        what: "Conveys the answer; some unnecessary text or minor inaccuracies, but no major errors."
      - label: strong
        what: "Direct, accurate, appropriately brief; flags genuine uncertainties; no padding."

  subagent_orchestration:
    description: >
      Quality of delegation to sub-agents via spawn_agent. Score null if no
      sub-agents were spawned in the session — do not punish absence.
    nullable: true
    levels:
      - label: weak
        what: "Sub-agent given a vague prompt; output unusable, ignored, or duplicated by the parent."
      - label: typical
        score: 53
        what: "Sub-agent helped; the delegation paid off but the prompt or hand-off wasn't clean."
      - label: strong
        what: "Clear sub-task, clean hand-off, parent uses the result without rework; no overlap."

  self_extension:
    description: >
      Quality of self-extension via write_tool / reload_tools / delete_tool.
      Score null if Navi did not modify her own tooling in this session.
    nullable: true
    levels:
      - label: weak
        what: "Tool fails to load, is in wrong format, or solves the wrong problem."
      - label: typical
        score: 53
        what: "Tool loads and works for the immediate need but is narrow or quirky."
      - label: strong
        what: "Tool is well-formed, reusable, and integrates cleanly; the manual or docs are updated."