History for navi-1 / debug / eval / db.py


2026-04-26

8d5c351
Add eval system Phase 4 — read endpoints and background runner

REST surface for the debug UI:
- GET /eval/sessions  — overview list with eval status / latest avg /
  feedback counts (single SQL: sessions ⨝ feedback ⨝ latest run)
- GET /eval/sessions/{id} — session detail with all evaluations
- GET /eval/stats — weekly per-axis means; optional complexity-bucket split
- POST /eval/run — fire-and-forget background eval, returns run_id
- GET /eval/run/{id}, GET /eval/runs — poll progress and history

Pulled the runner loop out of cli into runner.py so both the CLI and
the REST endpoint share the same loop. State for in-flight runs lives
in an in-memory registry (single-process, cleared on restart).
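The fire-and-forget run with an in-memory registry could look roughly like the sketch below. It is not the contents of runner.py: `RunState`, `RUNS`, `start_run`, `get_run`, and the `evaluate` callback are hypothetical names chosen for illustration, and the status values are assumptions.

```python
import asyncio
import uuid
from dataclasses import dataclass, field


@dataclass
class RunState:
    # Hypothetical progress record; field names are illustrative only.
    run_id: str
    total: int
    done: int = 0
    status: str = "running"  # assumed values: "running" | "finished" | "cancelled"
    errors: list = field(default_factory=list)


# Process-local registry: lost on restart and not shared between workers.
RUNS = {}
# Strong references so background tasks are not garbage-collected mid-run.
_TASKS = set()


async def _run_loop(state, session_ids, evaluate):
    """Evaluate sessions one by one, recording progress in the shared state."""
    try:
        for sid in session_ids:
            try:
                await evaluate(sid)
            except Exception as exc:  # one bad session should not stop the run
                state.errors.append(f"{sid}: {exc}")
            state.done += 1
        state.status = "finished"
    except asyncio.CancelledError:
        state.status = "cancelled"
        raise


def start_run(session_ids, evaluate):
    """Schedule the loop in the background and return a run_id immediately.

    Must be called from inside a running event loop (e.g. a request handler).
    """
    run_id = uuid.uuid4().hex
    state = RunState(run_id=run_id, total=len(session_ids))
    RUNS[run_id] = state
    task = asyncio.create_task(_run_loop(state, list(session_ids), evaluate))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return run_id


def get_run(run_id):
    """Poll helper: returns the RunState, or None for an unknown run_id."""
    return RUNS.get(run_id)
```

In this shape a POST handler would call `start_run` and return the id, while the polling endpoints would only read from `RUNS`. Because the registry is a plain module-level dict, it matches the single-process, cleared-on-restart behaviour described above.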

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 26 Apr

864261a
Add eval system Phase 3 — judge runner end to end

Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding,
  reactions inlined per assistant block, planning_logs appendix, no
  compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry
  with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations,
  evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all,
  --dry-run, --model, --backend) and `show` (groups by eval_run_id, prints
  per-expert axes plus averaged scores)
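As an illustration of the fence-tolerant parse, the single corrective retry, and the gather across experts, here is a minimal sketch. It is not the code in judge.py: the `call_llm` coroutine, the `prompts` mapping, the `required_keys` check, and the wording of the nudge are all assumptions made for the example.

```python
import asyncio
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def parse_json_reply(text):
    """Extract a JSON object from a reply that may be wrapped in a code fence."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    # Tolerate chatter around the object by keeping the outermost braces only.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in reply")
    return json.loads(candidate[start:end + 1])  # JSONDecodeError is a ValueError


async def run_expert(call_llm, prompt, required_keys):
    """Call the model once; on a parse or schema error, retry exactly once."""
    reply = await call_llm(prompt)
    for attempt in range(2):
        try:
            data = parse_json_reply(reply)
            missing = [k for k in required_keys if k not in data]
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except ValueError as exc:
            if attempt == 1:
                raise  # the single retry is used up
            nudge = (
                f"{prompt}\n\nYour previous reply was invalid ({exc}). "
                "Reply with a single JSON object only."
            )
            reply = await call_llm(nudge)


async def evaluate_session(call_llm, prompts, required_keys):
    """Run every expert concurrently; prompts maps expert name to prompt text."""
    results = await asyncio.gather(
        *(run_expert(call_llm, p, required_keys) for p in prompts.values())
    )
    return dict(zip(prompts.keys(), results))
```

With `asyncio.gather` in its default mode, an expert that still fails after its retry raises out of `evaluate_session`; passing `return_exceptions=True` would be one way to keep the other experts' results instead.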

Verified end-to-end against a real 10-message secretary session:
all three experts returned valid JSON first try; spread between strict
critic and the others surfaced as expected.
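The averaged scores and the spread between experts can be sketched as below. The expert names, axis names, and score values are invented for illustration; the commit does not state the real axes or the scoring scale.

```python
from statistics import mean


def summarize(expert_scores):
    """Per-axis mean and max-min spread across experts.

    expert_scores maps expert name -> {axis name: numeric score}.
    """
    axes = sorted({axis for scores in expert_scores.values() for axis in scores})
    summary = {}
    for axis in axes:
        values = [s[axis] for s in expert_scores.values() if axis in s]
        summary[axis] = {"mean": mean(values), "spread": max(values) - min(values)}
    return summary


# Hypothetical scores: a strict critic rating lower than the other two experts.
example = {
    "strict": {"accuracy": 2, "tone": 3},
    "neutral": {"accuracy": 4, "tone": 3},
    "lenient": {"accuracy": 3, "tone": 3},
}
report = summarize(example)
```

A large spread on an axis is the signal that the experts disagree, which is the effect the verification note describes for the strict critic.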

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 26 Apr

5817fb9
Add eval system Phase 1 — message feedback signal

Spec at docs/eval_system.md describes the full LLM-as-judge plan;
this commit lands only the in-app feedback layer:
- debug/eval/ Python package with EvalDB (asyncpg) and FastAPI router
  exposing /eval/feedback (set / clear / list)
- message_feedback postgres table keyed by (session_id, message_index)
- thumbs up / down on each completed assistant block in the webclient,
  optimistic update with rollback on failure
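A feedback table keyed by (session_id, message_index) with set / clear / list operations can be sketched as follows. The real EvalDB uses asyncpg against Postgres; this example swaps in the standard-library `sqlite3` module with an in-memory database so it runs on its own. The `rating` column and the function names are assumptions, since the commit does not show the actual schema.

```python
import sqlite3

# Assumed schema: only the composite key comes from the commit message.
SCHEMA = """
CREATE TABLE IF NOT EXISTS message_feedback (
    session_id    TEXT    NOT NULL,
    message_index INTEGER NOT NULL,
    rating        INTEGER NOT NULL CHECK (rating IN (-1, 1)),
    PRIMARY KEY (session_id, message_index)
)
"""


def open_db():
    """In-memory stand-in for the Postgres database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def set_feedback(conn, session_id, message_index, rating):
    """Upsert: a second vote on the same message replaces the first."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO message_feedback (session_id, message_index, rating) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT (session_id, message_index) "
            "DO UPDATE SET rating = excluded.rating",
            (session_id, message_index, rating),
        )


def clear_feedback(conn, session_id, message_index):
    """Remove the vote for one message, if any."""
    with conn:
        conn.execute(
            "DELETE FROM message_feedback "
            "WHERE session_id = ? AND message_index = ?",
            (session_id, message_index),
        )


def list_feedback(conn, session_id):
    """Return {message_index: rating} for one session."""
    rows = conn.execute(
        "SELECT message_index, rating FROM message_feedback "
        "WHERE session_id = ? ORDER BY message_index",
        (session_id,),
    ).fetchall()
    return {index: rating for index, rating in rows}
```

The composite primary key is what makes the upsert possible: each message has at most one row, so switching from thumbs up to thumbs down updates in place. Postgres accepts the same `ON CONFLICT ... DO UPDATE` form, with `$1`-style placeholders under asyncpg.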

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Eugene Sukhodolskiy committed on 26 Apr