# Rubric v1
#
# Each axis carries three level descriptions: weak / typical / strong.
# Only the `typical` level holds a numeric reference score (`53`). It is
# the single calibration anchor; weak and strong have no numbers attached.
# Judges score every axis with any non-negative integer on an open-ended
# scale: there is no upper bound, and exceptional sessions may score
# arbitrarily high as Navi grows. There are no preferred values; round
# multiples of 5 or 10 are not expected. The reference is just a pin:
# a typical-day session lands near 53, clearly weaker sessions go below,
# and clearly stronger sessions go above.
#
# Conventions:
# - Each axis is scored independently. A high or low score on one axis
#   must not pull another axis up or down.
# - `task_complexity` is judged from the user's request alone, before
#   the response is considered.
# - `subagent_orchestration` and `self_extension` may be null when the
#   session never used those mechanics. Do not invent zeros.
version: "v1"
axes:
  task_complexity:
    description: >
      How hard the user's request is, judged from the request alone, before
      considering how Navi handled it. Reflects ambiguity, depth, multi-step
      reasoning, and how many things have to go right. Independent of outcome.
    levels:
      - label: weak
        what: "Single fact or single tool, no real planning, expected answer is obvious."
      - label: typical
        score: 53
        what: "Multi-step task with mild ambiguity; planning helps but the path is mostly clear."
      - label: strong
        what: "Long-horizon, multi-tool, real ambiguity to resolve, several places where it could fail; possibly project-shaped with sub-agents or self-extension."
  goal_completion:
    description: >
      What fraction of the user's intent, including unstated needs they
      would care about, was actually delivered. This is a continuous
      dimension, not yes/no: even a successful response can leave gaps,
      and even a failed one can deliver part of what was asked.
    levels:
      - label: weak
        what: "User did not get what they asked for; gave up, redirected, or output was off-target."
      - label: typical
        score: 53
        what: "Most of the request was met; user has minor gaps to fill in or live with."
      - label: strong
        what: "Goal fully met, including edge cases or caveats the user didn't have to ask about."
  tool_usage_quality:
    description: >
      Whether tools were chosen appropriately, called efficiently, with
      errors handled cleanly and results reused.
    levels:
      - label: weak
        what: "Wrong tools picked, repeated identical calls, no recovery from errors, or substantial wasted work."
      - label: typical
        score: 53
        what: "Tools were appropriate; one or two avoidable detours or redundant lookups."
      - label: strong
        what: "Minimal sufficient toolset, each call has a clear purpose, errors handled gracefully, results reused."
  efficiency:
    description: >
      Iterations and total work relative to the result. Detours, loops,
      and re-do attempts cost points; tight planning saves them.
    levels:
      - label: weak
        what: "Loops, runs out of iteration budget, or never converges; many aborted or duplicated attempts."
      - label: typical
        score: 53
        what: "Linear path with minor stalls; reaches the goal without too many detours."
      - label: strong
        what: "Few wasted moves; planning anticipated the work; shortest reasonable path."
  communication:
    description: >
      Clarity, honesty, brevity, and absence of hallucinations. Penalise
      padded replies and unverified claims even when the underlying answer
      is correct. A correct answer wrapped in fluff is not strong.
    levels:
      - label: weak
        what: "Hallucinations, false claims that work was done, walls of filler, or major inaccuracies."
      - label: typical
        score: 53
        what: "Conveys the answer; some unnecessary text or minor inaccuracies, but no major errors."
      - label: strong
        what: "Direct, accurate, appropriately brief; flags genuine uncertainties; no padding."
  subagent_orchestration:
    description: >
      Quality of delegation to sub-agents via spawn_agent. Score null if no
      sub-agents were spawned in the session; do not punish absence.
    nullable: true
    levels:
      - label: weak
        what: "Sub-agent given a vague prompt; output unusable, ignored, or duplicated by the parent."
      - label: typical
        score: 53
        what: "Sub-agent helped; the delegation paid off but the prompt or hand-off wasn't clean."
      - label: strong
        what: "Clear sub-task, clean hand-off, parent uses the result without rework; no overlap."
  self_extension:
    description: >
      Quality of self-extension via write_tool / reload_tools / delete_tool.
      Score null if Navi did not modify her own tooling in this session.
    nullable: true
    levels:
      - label: weak
        what: "Tool fails to load, is in wrong format, or solves the wrong problem."
      - label: typical
        score: 53
        what: "Tool loads and works for the immediate need but is narrow or quirky."
      - label: strong
        what: "Tool is well-formed, reusable, integrates cleanly, manual or doc updated."