| 2026-05-11 |
Add deterministic line-based file editing (edit_lines), rating UI fix, and session refresh
- filesystem.py: add edit_lines action (deterministic line ops via operations array)
+ numbered param for read (1-based line numbers in output)
+ clarify four editing modes in tool description
- chat.js: fix rating IDs for streaming messages (assign h_ ID on stream_end)
- SessionList.vue: mobile pull-to-refresh with PTR_THRESHOLD=80
- AppSidebar.vue: desktop refresh button next to Conversations header
- planning.py: knowledge source assessment in Phase 1
- debug panel: MCP servers tab + resolved tools per profile
- NAVI.md: reposition as neutral quick-reference
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 11 May
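To illustrate what "deterministic line ops via operations array" and 1-based numbered reads could look like, here is a minimal sketch. The operation shapes (`op`, `start`, `end`, `after`, `lines`) and the function names are assumptions made for this example, not the actual interface in filesystem.py.

```python
from typing import Dict, List, Tuple


def read_numbered(text: str) -> str:
    """Prefix every line with its 1-based line number."""
    return "\n".join(
        f"{number}: {line}"
        for number, line in enumerate(text.splitlines(), start=1)
    )


def _span(op: Dict) -> Tuple[int, int]:
    """Translate an operation into a 0-based [lo, hi) slice of the original."""
    if op["op"] == "insert":
        return op["after"], op["after"]  # empty slice right after line `after`
    return op["start"] - 1, op["end"]  # replace / delete cover start..end


def edit_lines(text: str, operations: List[Dict]) -> str:
    """Apply operations that all refer to the ORIGINAL 1-based numbering.

    Hypothetical operation shapes:
      {"op": "replace", "start": 2, "end": 3, "lines": ["..."]}
      {"op": "delete",  "start": 4, "end": 4}
      {"op": "insert",  "after": 0, "lines": ["..."]}  # after=0 means the top
    The trailing newline of `text` is not preserved in this sketch.
    """
    lines = text.splitlines()
    # Work bottom-up so edits never shift the numbers of lines still to edit.
    ordered = sorted(operations, key=_span, reverse=True)
    upper = len(lines)
    for op in ordered:
        lo, hi = _span(op)
        if not 0 <= lo <= hi <= upper:
            raise ValueError(f"out-of-range or overlapping operation: {op}")
        lines[lo:hi] = [] if op["op"] == "delete" else op["lines"]
        upper = lo
    return "\n".join(lines)
```

Because every operation is resolved against the original numbering and applied from the bottom of the file upward, the same operations array always yields the same result regardless of the order in which the caller lists the operations.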
|
| 2026-05-08 |
Add pagination, search, and sorting to admin sessions
Backend:
- Add count_all and search_list abstract methods to SessionStore
- Implement count_all and search_list in PgSessionStore (SQL with ILIKE)
- Implement count_all and search_list in InMemorySessionStore
- Update /admin/sessions to accept limit, offset, search, sort_by, sort_order
- Return {total, limit, offset, items} from /admin/sessions
Frontend:
- Add search input for sessions in admin panel
- Add clickable sortable column headers with asc/desc toggle
- Add pagination controls (prev/next, page size selector, item count)
- Debounce search input (300ms)
Tests:
- Add integration tests for pagination, offset, search, and sorting
- All 217 tests pass
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 8 May
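The backend changes above can be sketched with an in-memory store. The method names `count_all` and `search_list` and the `{total, limit, offset, items}` response shape come from the commit message; the `Session` fields, the argument signatures, and the `admin_sessions_page` helper are assumptions for illustration only.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class Session:
    # Hypothetical fields; the real session model is not shown in the log.
    id: str
    title: str
    created_at: int


class InMemorySessionStore:
    SORTABLE = {"id", "title", "created_at"}

    def __init__(self, sessions: List[Session]) -> None:
        self._sessions = list(sessions)

    def _matching(self, search: Optional[str]) -> List[Session]:
        if not search:
            return list(self._sessions)
        needle = search.lower()  # case-insensitive, like SQL ILIKE '%needle%'
        return [s for s in self._sessions if needle in s.title.lower()]

    def count_all(self, search: Optional[str] = None) -> int:
        return len(self._matching(search))

    def search_list(self, limit: int = 20, offset: int = 0,
                    search: Optional[str] = None,
                    sort_by: str = "created_at",
                    sort_order: str = "desc") -> List[Session]:
        if sort_by not in self.SORTABLE:
            raise ValueError(f"unsupported sort column: {sort_by}")
        rows = sorted(self._matching(search),
                      key=lambda s: getattr(s, sort_by),
                      reverse=(sort_order == "desc"))
        return rows[offset:offset + limit]


def admin_sessions_page(store: InMemorySessionStore, **params) -> Dict:
    """Build the {total, limit, offset, items} payload for one page."""
    items = store.search_list(**params)
    return {
        "total": store.count_all(params.get("search")),
        "limit": params.get("limit", 20),
        "offset": params.get("offset", 0),
        "items": [asdict(s) for s in items],
    }
```

Returning `total` alongside the page lets the frontend render the item count and disable the next button without a second request.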
|
| 2026-04-28 |

Slim eval rubric to 3 levels with one reference per axis
Five anchors per axis (10/30/50/75/100, even after the earlier shift)
were both redundant and amplified the model's snap-to-round-numbers
prior. Cut to three level descriptions per axis (weak / typical /
strong) with a single non-round reference score (53) on `typical`.
Re-state the scale as open-ended with no upper bound to make the
"future Navi may exceed past ceilings" intent explicit.
- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score
53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and
surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the
example output blocks (their concrete numbers were misleading the
judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.
User intent: dynamics, not absolute scores. Weekly aggregates over
three averaged experts smooth individual snap-to-5 into continuous
trends; the rubric is a calibration aid, not a grading ceiling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 28 Apr
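A rough sketch of how a renderer might walk the new `levels` shape and surface the reference score only when present. The function name matches the commit message, but the rubric dictionary layout (`axes`, `name`, `levels`, `description`, `reference_score`) and the axis name are assumptions, not the real rubric_v1.yaml schema.

```python
from typing import Any, Dict


def render_rubric_for_prompt(rubric: Dict[str, Any]) -> str:
    """Render each axis with its levels as plain text for a judge prompt."""
    out = []
    for axis in rubric["axes"]:
        out.append(f"## {axis['name']}")
        for level in axis["levels"]:
            line = f"- {level['name']}: {level['description']}"
            reference = level.get("reference_score")
            if reference is not None:  # only `typical` carries one
                line += f" (reference score: {reference})"
            out.append(line)
    return "\n".join(out)


# Hypothetical rubric fragment: three levels, reference score on `typical`.
EXAMPLE_RUBRIC = {
    "axes": [
        {
            "name": "correctness",
            "levels": [
                {"name": "weak", "description": "frequent factual errors"},
                {"name": "typical", "description": "mostly right, minor slips",
                 "reference_score": 53},
                {"name": "strong", "description": "accurate and verified"},
            ],
        }
    ]
}

print(render_rubric_for_prompt(EXAMPLE_RUBRIC))
```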
|
Fight rubric-anchor snapping in eval judges
Judges were clustering scores onto the rubric's round anchor values
(30, 50, 75, 100) instead of producing fine-grained continuous scores,
which made small differences between sessions invisible.
- rubric_v1.yaml: shift anchors off round numbers (33/51/77/102),
reframe the scale as open-ended integers ≥ 0 with illustrative level
descriptions, and tell judges explicitly not to round to anchors.
- expert prompts (strict_critic, pragmatist, tech_lead): mirror the
scale framing and add an example output with deliberately non-round
scores between anchors.
- judge.py: bump expert temperature 0.2 → 0.5 so the judges produce
more varied, non-deterministic scores.
Old v1 evaluations in the DB are not comparable to new ones; user
intends to wipe and re-run from scratch, so versions are not bumped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 28 Apr
|
| 2026-04-26 |
Add eval system Phase 5 — debug UI
Self-contained SPA at /debug/eval (route already wired in 8e0eed6).
Single index.html in the existing debug/ style — vanilla JS, embedded
CSS, no framework, no build step. Four tabs:
- Sessions — filterable table (profile / status / limit), eval status
pill, headline avg scores, click-through to detail
- Detail — session metadata + every stored eval run, axes laid out as
axis × expert grids with inline averages, expert comments, button to
re-evaluate this single session
- Stats — weekly per-axis means table, optional complexity-bucket split
- Run — form to trigger any scope (unevaluated / single / all), live
status panel polling /eval/run/{id} every 2.5s, run history with
click-to-attach
Hash routing: #detail/<session_id> deep-links to a session.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 26 Apr
|
Add eval system Phase 4 — read endpoints and background runner
REST surface for the debug UI:
- GET /eval/sessions — overview list with eval status / latest avg /
feedback counts (single SQL: sessions ⨝ feedback ⨝ latest run)
- GET /eval/sessions/{id} — session detail with all evaluations
- GET /eval/stats — weekly per-axis means; optional complexity-bucket split
- POST /eval/run — fire-and-forget background eval, returns run_id
- GET /eval/run/{id}, GET /eval/runs — poll progress and history
Pulled the runner loop out of cli into runner.py so both the CLI and
the REST endpoint share the same loop. State for in-flight runs lives
in an in-memory registry (single-process, cleared on restart).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 26 Apr
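The fire-and-forget run with an in-memory, single-process registry might look roughly like the sketch below. The names `RUNS`, `start_run`, and the state fields are hypothetical; the real runner.py and its REST wiring are not shown in the log.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict, List, Set

# Hypothetical registry: lives in process memory, so a restart clears it.
RUNS: Dict[str, dict] = {}
_TASKS: Set[asyncio.Task] = set()  # keep references so tasks are not GC'd


async def _run(run_id: str, session_ids: List[str],
               evaluate: Callable[[str], Awaitable[None]]) -> None:
    state = RUNS[run_id]
    try:
        for session_id in session_ids:
            await evaluate(session_id)
            state["done"] += 1  # pollers see progress between sessions
        state["status"] = "finished"
    except Exception as exc:  # record the failure instead of losing it
        state["status"] = "failed"
        state["error"] = str(exc)


def start_run(session_ids: List[str],
              evaluate: Callable[[str], Awaitable[None]]) -> str:
    """Schedule the run on the current event loop and return at once."""
    run_id = uuid.uuid4().hex
    RUNS[run_id] = {"status": "running", "done": 0,
                    "total": len(session_ids), "error": None}
    task = asyncio.create_task(_run(run_id, session_ids, evaluate))
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return run_id
```

A POST handler would return the `run_id` from `start_run` immediately, and a GET handler would simply return `RUNS[run_id]`, which is what makes polling cheap. `start_run` must be called from inside a running event loop.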
|

Add eval system Phase 3 — judge runner end to end
Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding,
reactions inlined per assistant block, planning_logs appendix, no
compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry
with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations,
evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all,
--dry-run, --model, --backend) and `show` (groups by eval_run_id, prints
per-expert axes plus averaged scores)
Verified end-to-end against a real 10-message secretary session:
all three experts returned valid JSON first try; spread between strict
critic and the others surfaced as expected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 26 Apr
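The "single retry with corrective nudge" and the `asyncio.gather` fan-out across the three experts can be sketched as follows. The function names echo the commit message, but the signatures, the injected `call_llm` callable, and the nudge wording are assumptions; the schema check here is reduced to "the reply must be a JSON object".

```python
import asyncio
import json
from typing import Awaitable, Callable, Dict

# Hypothetical LLM interface: (expert name, prompt) -> raw reply text.
LLMCall = Callable[[str, str], Awaitable[str]]

EXPERTS = ("strict_critic", "pragmatist", "tech_lead")


def _parse(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data


async def run_expert(expert: str, prompt: str, call_llm: LLMCall) -> dict:
    raw = await call_llm(expert, prompt)
    try:
        return _parse(raw)
    except ValueError as err:
        # One retry only: tell the model what went wrong and ask again.
        nudge = (f"{prompt}\n\nYour previous reply could not be used "
                 f"({err}). Reply with a single JSON object only.")
        return _parse(await call_llm(expert, nudge))  # 2nd failure propagates


async def evaluate_session(prompt: str, call_llm: LLMCall) -> Dict[str, dict]:
    """Run all experts concurrently and key the results by expert name."""
    results = await asyncio.gather(
        *(run_expert(expert, prompt, call_llm) for expert in EXPERTS)
    )
    return dict(zip(EXPERTS, results))
```

`asyncio.gather` returns results in the order the awaitables were passed, so zipping with `EXPERTS` is safe even when the experts finish in a different order.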
|
Add eval system Phase 2 — rubric, expert prompts, judge skeleton
Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale),
three independent expert prompts (strict_critic / pragmatist / tech_lead)
that all return the same JSON shape, and the orchestration scaffolding:
schema.py (pydantic models), judge.py (rubric loader, score averaging,
fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for
run / show / stats. Real LLM calls and transcript rendering land in
Phase 3 — the stubs raise NotImplementedError.
`python -m debug.eval` works as the entry point. Anchor `examples` are
left empty for now; user fills them with real session_ids later without
bumping rubric_version.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 26 Apr
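As an illustration of the two judge.py helpers named above, here is a small sketch of a fence-tolerant JSON parser and per-axis score averaging. The function names, the input shapes, and the axis names are assumptions for this example rather than the project's actual code.

```python
import json
import re
from statistics import mean
from typing import Dict

# Matches a Markdown code fence (three backticks), optionally tagged `json`.
_FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_fenced_json(raw: str) -> dict:
    """Parse JSON whether or not the model wrapped it in a code fence."""
    match = _FENCE_RE.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload.strip())


def average_scores(per_expert: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each axis across experts.

    Assumes every expert scored the same set of axes.
    """
    axes = next(iter(per_expert.values())).keys()
    return {
        axis: mean(scores[axis] for scores in per_expert.values())
        for axis in axes
    }


# Hypothetical axis names and scores.
scores = {
    "strict_critic": {"clarity": 40, "accuracy": 61},
    "pragmatist": {"clarity": 58, "accuracy": 64},
    "tech_lead": {"clarity": 52, "accuracy": 67},
}
print(average_scores(scores))
```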
|
Add eval system Phase 1 — message feedback signal
Spec at docs/eval_system.md describes the full LLM-as-judge plan;
this commit lands only the in-app feedback layer:
- debug/eval/ Python package with EvalDB (asyncpg) and FastAPI router
exposing /eval/feedback (set / clear / list)
- message_feedback postgres table keyed by (session_id, message_index)
- thumbs up / down on each completed assistant block in the webclient,
optimistic update with rollback on failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 26 Apr
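The set / clear / list semantics of feedback keyed by (session_id, message_index) can be shown with an in-memory stand-in. This sketch deliberately swaps the asyncpg-backed EvalDB for a plain dictionary; the class name, method names, and the "up" / "down" values are assumptions.

```python
from typing import Dict, Tuple


class FeedbackStore:
    """In-memory stand-in for a table keyed by (session_id, message_index)."""

    VALUES = ("up", "down")  # hypothetical encoding of thumbs up / down

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, int], str] = {}

    def set_feedback(self, session_id: str, message_index: int,
                     value: str) -> None:
        if value not in self.VALUES:
            raise ValueError(f"unknown feedback value: {value}")
        # The composite key means a second vote overwrites the first.
        self._rows[(session_id, message_index)] = value

    def clear_feedback(self, session_id: str, message_index: int) -> None:
        self._rows.pop((session_id, message_index), None)

    def list_feedback(self, session_id: str) -> Dict[int, str]:
        return {
            index: value
            for (sid, index), value in sorted(self._rows.items())
            if sid == session_id
        }
```

Keying on the pair rather than on a separate feedback ID keeps at most one reaction per assistant block, which is what lets a client toggle a thumb without tracking row IDs.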
|
| 2026-04-21 |
Add instagram_engine and instagram_viewer tools (Navi-generated)
Browser automation tools for scraping public Instagram profiles using
Playwright + stealth. Registered in enabled.json and developer profile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 21 Apr
|
| 2026-04-20 |

Planning debug panel, todo auto-populate, scratchpad/persona improvements
- Planning debug panel: new Planning tab in debug/index.html shows raw
phase 1/2 outputs and token counts per planning run, stored in
session.planning_logs (new column in both SQLite and PostgreSQL)
- New GET /sessions/{id}/planning API endpoint
- PlanningDebugData internal event wires _run_planning() output into
session storage; never forwarded to WebSocket clients
- Phase 3 (plan critic) disabled — to be reworked with reflect integration
- Todo tool: auto-populated from plan steps after phase 2; model only
needs to call update/view, not set
- Scratchpad: clarified description and persona instructions; removed
context_transfer from user-facing docs (internal mechanism only)
- web_search: switched to ddgs package, SearXNG as primary backend,
DDG html-only fallback; added find_up action to filesystem tool
- Persona: added SCRATCHPAD and TODO sections with clear usage rules;
added NAVI.md project context instructions
- chat.js: fixed subagent planning event fallthrough into parent UI;
statusLabel cleared on first stream delta
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 20 Apr
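One plausible reading of the `find_up` action mentioned above is "search the starting directory and then each parent for a named file", which is how a project context file such as NAVI.md could be located from a nested working directory. The sketch below assumes that behaviour; the signature is hypothetical and the filesystem tool's real semantics may differ.

```python
from pathlib import Path
from typing import Optional


def find_up(name: str, start: Path) -> Optional[Path]:
    """Return the nearest `name` in `start` or any ancestor, else None."""
    current = start.resolve()
    # Path.parents yields ancestors from nearest to the filesystem root.
    for directory in (current, *current.parents):
        candidate = directory / name
        if candidate.exists():
            return candidate
    return None


if __name__ == "__main__":
    # Look for a NAVI.md starting from the current working directory.
    print(find_up("NAVI.md", Path.cwd()))
```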
|
| 2026-04-17 |
Add Prompts and Tools tabs to debug page
Backend:
- GET /agents/prompts — returns full built system prompt for every
profile, broken into sections (persona / profile / profiles block)
with char/token counts; mirrors Agent._build_system_prompt() exactly
- GET /agents/tools — now includes parameters schema alongside name
and description
Debug page:
- Tab bar: Context / Prompts / Tools
- Prompts tab: profile sidebar + collapsible sections per prompt part
(persona, profile prompt, profiles block), togglable tools list
- Tools tab: searchable list of all tools with description and
parameter table (name, type, description, required marker)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 17 Apr
|
Add standalone debug page at /debug
Replaces the old_webclient/debug.html with a proper self-contained
tool at debug/index.html. New features over the old page:
- Sidebar session list with profile, message count, pin indicator
- Auto-refresh toggle (3s interval)
- Refresh button
- Renders thinking blocks, is_plan and is_summary tags
- Shows tool call name on tool result messages
- Clickable image thumbnails (open full-size)
- All new fields from the current LLM context API
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy committed on 17 Apr
|