| 2026-04-29 |
Remove SQLite legacy support
SQLite is no longer supported; PostgreSQL is now required.
- Delete navi/core/sqlite_session_store.py
- Delete navi/memory/sqlite_store.py
- Remove SqliteSessionStore from navi/core/__init__.py exports
- deps.py: drop SQLite fallback, raise RuntimeError if DATABASE_URL missing
- config.py: remove db_path setting
- pyproject.toml & requirements.txt: drop aiosqlite dependency
- .gitignore: remove navi.db entry
- tech_debt_review_2026-04-29.md: mark #8 as REMOVED
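The fail-fast check described for deps.py could look roughly like the sketch below. The function name `require_database_url` and the error wording are hypothetical; only the RuntimeError on a missing DATABASE_URL comes from the commit.

```python
import os
from typing import Mapping, Optional


def require_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Return DATABASE_URL or fail loudly (hypothetical helper, no SQLite fallback)."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL", "").strip()
    if not url:
        # No silent fallback to a local SQLite file any more.
        raise RuntimeError(
            "DATABASE_URL is not set; PostgreSQL is required (SQLite support was removed)."
        )
    return url
```

Passing the environment in as a mapping keeps the check easy to test without touching the real process environment.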
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 29 Apr
|

Stability fixes batch — tech debt review 2026-04-29
Critical:
- Concurrent WS run race guard (#1)
- Tool task cancellation on generator teardown (#2)
- StopAsyncIteration kills fallback chain (#3)
- Session loading race with _lastLoadId guard (#4)
- ContentCard .match() crash on non-string result (#5)
- Image data type guard in buildMessageList (#6)
High:
- Cap WS replay buffer at 500 events (#7)
- Deduplicate memory extraction task with asyncio.Lock (#9)
- TTL-based fallback blacklisting (5 min) (#10)
- Subagent tool exception isolation (#11)
- Inline image size/count validation on WS (#12)
- Clean up orphaned file on DB insert failure (#13)
- Deep watch streamingMsg for auto-scroll (#14)
- WS_SCHEME wss:// support for HTTPS (#15)
- Sending guard against duplicate message sends (#16)
- Global unhandledrejection listener in API layer (#17)
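As an illustration of the TTL-based fallback blacklisting in #10, here is a minimal sketch. The class name `FallbackBlacklist`, its methods, and the injectable clock are assumptions for the example; only the 5-minute TTL is taken from the commit.

```python
import time
from typing import Callable, Dict, List


class FallbackBlacklist:
    """Hypothetical helper: block a failing model for a fixed time window."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def block(self, model: str) -> None:
        # Record when the block expires instead of blocking forever.
        self._expires_at[model] = self._clock() + self._ttl

    def is_blocked(self, model: str) -> bool:
        expires = self._expires_at.get(model)
        if expires is None:
            return False
        if self._clock() >= expires:
            # TTL elapsed: forget the entry so the model is retried.
            del self._expires_at[model]
            return False
        return True

    def available(self, chain: List[str]) -> List[str]:
        # Preserve the configured fallback order, skipping blocked models.
        return [m for m in chain if not self.is_blocked(m)]
```

Using a monotonic clock avoids surprises when the wall clock is adjusted; injecting it makes expiry testable without sleeping.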
Medium:
- Cap planning_logs at 20 entries (#22)
- Store cleanup_loop task reference (#23)
- BaseException → Exception in _run_with_sentinel (#24)
- Propagate SystemExit in agent loop (#25)
- Configurable output_reserve_tokens (#26)
- Always reloadSession on session_sync (#30)
- FIFO queue for confirm dialogs (#31)
- Reset body.overflow on ImageLightbox unmount (#32)
- try/finally in fallback copy (#33)
- _isConnecting guard in WS send() (#34)
Low:
- Lazy-init deps.py singletons (#36)
- Replace __import__ with direct imports (#38)
- Preserve token count 0 in ollama.py (#39)
- Clear orphaned streamingMsg on reconnect reload (#43)
- Escape single quote in UserMessage (#44)
- Polyfill-free findLast replacement (#48)
- Match <table> tags with attributes in markdown (#49)
- Attach copy buttons only when msg.done (#50)
- Fix hasMeta falsy-0 bug (#53)
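The falsy-0 class of bug behind #39 (and #53 on the frontend) can be shown with a small sketch. The helper name `read_token_count` and the sample key are illustrative only, not taken from ollama.py.

```python
from typing import Any, Mapping, Optional


def read_token_count(payload: Mapping[str, Any], key: str) -> Optional[int]:
    """Hypothetical helper: keep a legitimate count of 0 instead of dropping it."""
    value = payload.get(key)
    # Buggy variant: `payload.get(key) or None` turns 0 into None,
    # because 0 is falsy. Compare against None explicitly instead.
    if value is None:
        return None
    return int(value)
```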
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 29 Apr
|
| 2026-04-28 |
Add SVG/HTML/XML tag formatting rule to persona
Prevents model from generating doubled/escaped tags like <<svgsvg>
by explicitly instructing single-angle-bracket markup in code output.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Fix system prompt leakage into chat history; polish content cards
Backend:
- websocket.py + agent.py: separate user-visible display_message from
LLM user_message. System hints (image/file attachments) no longer leak
into session.messages or reappear after page reload.
- main.py: add ensure_tables() on startup so session_content table is
created before first publish.
- profiles: add kimi-k2.6:cloud to all model lists as fallback.
Frontend:
- ContentCard.vue: remove border-radius, add scrollbar styles, fix
metadata fallback parsing so cards survive page reload.
- content-viewers/*.html: add matching scrollbar styles.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Add content hosting system with inline viewers
Backend:
- Add navi/content/ directory for published files
- Add content_store.py with publish/list/delete/cleanup functions
- Add content_publish tool for publishing files as viewable content
- Add /content static file mount in main.py
- Add /content-viewers mount for viewer pages
- Extend ToolEvent with metadata field
- Forward metadata through websocket tool_call events
- Update Agent to include metadata in ToolEvent
Frontend:
- Add ContentCard.vue component for displaying published content
- Add viewer pages: stl.html (Three.js), svg.html, html.html, pdf.html
- Update AssistantMessage.vue to render ContentCard for content_publish
- Update chat store to preserve metadata in tool cards
- Update websocket protocol docs with metadata field
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Update memory docs to reflect pgvector + dedicated embedding backend
- Add dedicated embedding backend section (.env variables)
- Add backfill_embeddings script documentation
- Update storage methods: upsert_fact generates embeddings, search_facts
uses cosine-distance vector search and falls back to ILIKE
- Update extractor process: tool calls/results in transcript, source/confidence
- Replace memory_search/memory_forget references with unified memory tool
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Add dedicated CPU embedding server for memory backfill
- Install Ollama CPU-only on 192.168.1.168 server
- Pull nomic-embed-text:latest on server
- Create systemd service ollama-embed.service (0.0.0.0:11434)
- Add embedding_ollama_host / embedding_ollama_api_key to config.py
- Update deps.py to build separate embedding backend when host configured
- Update backfill_embeddings.py to use dedicated embedding backend
- Add _generate_embeddings batch helper and backfill_embeddings to store.py
- Backfilled 119 existing facts with embeddings
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Enrich memory extractor with tool calls/results in transcript
- _EXTRACT_SYSTEM now explains 4 transcript entry types and instructs the
LLM to trust tool results over chat and to return source/source_context
- _extract_facts builds tool_call_map, appends [Tool call] and
[Tool result] lines with truncation (500/200 chars)
- Transcript capped at 12k chars (head+tail, drop middle)
- Parse source/source_context from LLM response; map confidence:
tool_call/auto_discovery=95, user_explicit=90, default=70
- Add TODO comment about deferred semantic deduplication
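A rough sketch of the transcript cap and the confidence mapping listed above. The 12k limit and the confidence values come from the commit; the function names, the omission marker, and the even head/tail split are assumptions.

```python
from typing import Optional

# Values from the commit message; the dict layout itself is an assumption.
CONFIDENCE_BY_SOURCE = {"tool_call": 95, "auto_discovery": 95, "user_explicit": 90}
DEFAULT_CONFIDENCE = 70


def confidence_for(source: Optional[str]) -> int:
    """Map a fact's source label to a confidence score."""
    return CONFIDENCE_BY_SOURCE.get(source, DEFAULT_CONFIDENCE)


def cap_transcript(text: str, limit: int = 12_000,
                   marker: str = "\n[... middle omitted ...]\n") -> str:
    """Keep the head and tail of an over-long transcript and drop the middle."""
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    if budget <= 0:
        return text[:limit]
    head = (budget + 1) // 2   # assumed even split between head and tail
    tail = budget // 2
    # Guard tail == 0: text[-0:] would return the whole string.
    return text[:head] + marker + (text[-tail:] if tail else "")
```

Keeping both ends preserves the opening request and the most recent turns, which are usually the parts an extractor needs most.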
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Document pgvector migration in memory system
- Add PostgreSQL vs SQLite comparison table
- Document migrate_pgvector.py usage
- Update storage section to mention asyncpg + aiosqlite
Eugene Sukhodolskiy
committed
on 28 Apr
|
Add pgvector migration script for memory_facts
- ALTER TABLE memory_facts: embedding, source, confidence, expires_at, source_context
- CREATE INDEX: hnsw(embedding), expires, source+category
- Safe to run multiple times (IF NOT EXISTS)
- Reads DATABASE_URL from settings
Eugene Sukhodolskiy
committed
on 28 Apr
|
Wire pgvector semantic search into memory system
- Add vector(768) column + HNSW index to memory_facts
- Add LLMBackend.embed() with Ollama + fallback implementation
- MemoryStore: cosine-distance search with ILIKE fallback
- New memory tool params: source, confidence, expires_days, source_context
- Update extractor, sqlite_store, deps wiring
- Add pgvector to requirements
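To make the search behaviour concrete, here is an in-memory sketch of cosine-distance ranking with a case-insensitive substring fallback (the role ILIKE plays in SQL). In the real store this ranking presumably runs inside PostgreSQL via pgvector; the function names, the tuple layout, and the rule for when the fallback triggers are assumptions.

```python
import math
from typing import List, Optional, Sequence, Tuple

Fact = Tuple[str, Optional[Sequence[float]]]  # (text, embedding or None)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def search_facts(facts: List[Fact], query_text: str,
                 query_embedding: Optional[Sequence[float]],
                 limit: int = 5) -> List[str]:
    """Rank by cosine distance; fall back to substring match without embeddings."""
    if query_embedding is not None:
        scored = [(cosine_distance(query_embedding, emb), text)
                  for text, emb in facts if emb is not None]
        if scored:
            scored.sort(key=lambda pair: pair[0])
            return [text for _, text in scored[:limit]]
    # Assumed fallback condition: no query embedding or no embedded facts.
    needle = query_text.lower()
    return [text for text, _ in facts if needle in text.lower()][:limit]
```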
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|

Slim eval rubric to 3 levels with one reference per axis
Five anchors per axis (10/30/50/75/100, even after the earlier shift)
were redundant and also amplified the model's snap-to-round-numbers
prior. Cut to three level descriptions per axis (weak / typical /
strong) with a single non-round reference score (53) on `typical`.
Re-state the scale as open-ended with no upper bound to make the
"future Navi may exceed past ceilings" intent explicit.
- rubric_v1.yaml: anchors → levels (5 → 3 per axis), reference score
53 only on typical, scale framed as fully open-ended.
- judge.py: render_rubric_for_prompt walks the new `levels` shape and
surfaces the reference score only when present.
- expert prompts (strict_critic, pragmatist, tech_lead): drop the
example output blocks (their concrete numbers were misleading the
judges), rewrite the scale paragraph for the new structure.
- schema.py: docstring no longer pins ">100" as the open-scale marker.
User intent: dynamics, not absolute scores. Weekly aggregates over
three averaged experts smooth individual snap-to-5 into continuous
trends; the rubric is a calibration aid, not a grading ceiling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Fight rubric-anchor snapping in eval judges
Judges were clustering scores onto the rubric's round anchor values
(30, 50, 75, 100) instead of producing fine-grained continuous scores,
which made small differences between sessions invisible.
- rubric_v1.yaml: shift anchors off round numbers (33/51/77/102),
reframe the scale as open-ended integers ≥ 0 with illustrative level
descriptions, and tell judges explicitly not to round to anchors.
- expert prompts (strict_critic, pragmatist, tech_lead): mirror the
scale framing and add an example output with deliberately non-round
scores between anchors.
- judge.py: bump expert temperature 0.2 → 0.5 so the judges produce
more varied, non-deterministic scores.
Old v1 evaluations in the DB are not comparable to new ones; user
intends to wipe and re-run from scratch, so versions are not bumped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
Stop image_view hallucinations on inline-attached images
The model was inventing fake paths/URLs (e.g. files.oaiusercontent.com,
/home/ubuntu/navi-1/input_file_0.png) and calling image_view on them
when the user attached an image directly in chat — the image was
already in the multimodal context, but the tool description and lack
of a signal pushed the model to "load" it anyway.
- websocket.py: when a user message has inline images, append a brief
note that they are already in context.
- image_view.py: soften the description — keep proactive use for paths
and URLs the model genuinely cannot see, but tell it inline images
don't need this tool.
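The websocket.py change can be pictured as the sketch below, which also keeps the user-visible text separate from what the LLM receives. The note wording, the constant, and the function name are all hypothetical.

```python
from typing import Tuple

# Hypothetical wording; the actual note text is not given in the commit.
INLINE_IMAGE_NOTE = (
    "[Note: the attached image(s) are already visible in this message; "
    "no tool call is needed to view them.]"
)


def build_messages(text: str, image_count: int) -> Tuple[str, str]:
    """Return (display_message, user_message) for one incoming chat message."""
    display_message = text  # what the user sees and what history should show
    if image_count > 0:
        # Only the LLM-facing copy carries the system hint.
        user_message = f"{text}\n\n{INLINE_IMAGE_NOTE}"
    else:
        user_message = text
    return display_message, user_message
```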
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 28 Apr
|
| 2026-04-26 |
Rewrite eval_system.md as user guide; preserve original spec as eval_system_design.md
- docs/eval_system.md: replaced stale spec with current user guide
covering UI tabs, CLI, rubric, experts, versioning, API
- docs/eval_system_design.md: preserved original design spec for
reference
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|
Add eval system Phase 5 — debug UI
Self-contained SPA at /debug/eval (route already wired in 8e0eed6).
Single index.html in the existing debug/ style — vanilla JS, embedded
CSS, no framework, no build step. Four tabs:
- Sessions — filterable table (profile / status / limit), eval status
pill, headline avg scores, click-through to detail
- Detail — session metadata + every stored eval run, axes laid out as
axis × expert grids with inline averages, expert comments, button to
re-evaluate this single session
- Stats — weekly per-axis means table, optional complexity-bucket split
- Run — form to trigger any scope (unevaluated / single / all), live
status panel polling /eval/run/{id} every 2.5s, run history with
click-to-attach
Hash routing: #detail/<session_id> deep-links to a session.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|
changed llm
Eugene Sukhodolskiy
committed
on 26 Apr
|
Add eval system Phase 4 — read endpoints and background runner
REST surface for the debug UI:
- GET /eval/sessions — overview list with eval status / latest avg /
feedback counts (single SQL: sessions ⨝ feedback ⨝ latest run)
- GET /eval/sessions/{id} — session detail with all evaluations
- GET /eval/stats — weekly per-axis means; optional complexity-bucket split
- POST /eval/run — fire-and-forget background eval, returns run_id
- GET /eval/run/{id}, GET /eval/runs — poll progress and history
Pulled the runner loop out of cli into runner.py so both the CLI and
the REST endpoint share the same loop. State for in-flight runs lives
in an in-memory registry (single-process, cleared on restart).
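A minimal sketch of a fire-and-forget run with an in-memory registry, in the spirit of the description above. The names `RUNS`, `start_run`, and the state fields are assumptions; the real runner.py may be organised differently.

```python
import asyncio
import uuid
from typing import Awaitable, Callable, Dict, Set

RUNS: Dict[str, dict] = {}          # run_id -> mutable state, lost on restart
_TASKS: Set[asyncio.Task] = set()   # strong references so tasks are not GC'd


def start_run(job: Callable[[dict], Awaitable[None]]) -> str:
    """Schedule `job` in the background and return a run_id to poll."""
    run_id = uuid.uuid4().hex
    state = {"status": "running", "done": 0, "error": None}
    RUNS[run_id] = state

    async def _wrapper() -> None:
        try:
            await job(state)
            state["status"] = "finished"
        except Exception as exc:
            # Surface the failure to pollers instead of losing it.
            state["status"] = "failed"
            state["error"] = str(exc)

    task = asyncio.create_task(_wrapper())  # requires a running event loop
    _TASKS.add(task)
    task.add_done_callback(_TASKS.discard)
    return run_id
```

A poller then only needs to read `RUNS[run_id]`; because the registry is a plain dict, it works for a single process only, matching the limitation noted in the commit.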
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|

Add eval system Phase 3 — judge runner end to end
Fills in the stubs from Phase 2:
- judge.render_session: full transcript with tool_call/tool_result folding,
reactions inlined per assistant block, planning_logs appendix, no
compression-summary substitution
- judge.run_expert: real LLM call, fence-tolerant JSON parse, single retry
with corrective nudge on schema or parse error
- judge.evaluate_session: asyncio.gather across the three experts
- db.EvalDB: insert_evaluation_run (txn), list_evaluations,
evaluated_session_ids, feedback_by_index helper
- cli `run` (filters: --session, --since, --limit, --re-evaluate-all,
--dry-run, --model, --backend) and `show` (groups by eval_run_id, prints
per-expert axes plus averaged scores)
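The "single retry with corrective nudge" idea can be sketched synchronously as follows. `call_with_single_retry`, the `call_llm` callable, the required "axes" key, and the nudge wording are hypothetical stand-ins, not the actual judge.run_expert code.

```python
import json
from typing import Callable, List


def call_with_single_retry(call_llm: Callable[[List[str]], str], prompt: str) -> dict:
    """Ask once; on a parse or schema error, ask once more with a corrective nudge."""
    messages = [prompt]
    for attempt in range(2):
        raw = call_llm(messages)
        try:
            data = json.loads(raw)
            # Assumed minimal schema check for the example.
            if not isinstance(data, dict) or "axes" not in data:
                raise ValueError("missing 'axes' object")
            return data
        except ValueError as exc:  # json.JSONDecodeError is a ValueError subclass
            if attempt == 1:
                raise  # second failure: give up and surface the error
            nudge = f"Your previous reply was invalid ({exc}). Reply with JSON only."
            messages = messages + [raw, nudge]
    raise AssertionError("unreachable")
```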
Verified end-to-end against a real 10-message secretary session:
all three experts returned valid JSON first try; spread between strict
critic and the others surfaced as expected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|
Add eval system Phase 2 — rubric, expert prompts, judge skeleton
Drafts the v1 rubric (7 axes, anchors at 10/30/50/75/100, open scale),
three independent expert prompts (strict_critic / pragmatist / tech_lead)
that all return the same JSON shape, and the orchestration scaffolding:
schema.py (pydantic models), judge.py (rubric loader, score averaging,
fence-tolerant JSON parser, new_run_metadata), cli.py with argparse for
run / show / stats. Real LLM calls and transcript rendering land in
Phase 3 — the stubs raise NotImplementedError.
`python -m debug.eval` works as the entry point. Anchor `examples` are
left empty for now; user fills them with real session_ids later without
bumping rubric_version.
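For orientation, a fence-tolerant JSON parser and per-axis score averaging might look like this sketch. The function names and the flat `{axis: score}` shape are assumptions; the real judge.py and schema.py may differ.

```python
import json
import re
from statistics import mean
from typing import Dict, List

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE),
    re.DOTALL | re.IGNORECASE,
)


def parse_fenced_json(raw: str) -> dict:
    """Parse JSON whether or not the model wrapped it in a Markdown code fence."""
    match = _FENCE_RE.search(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload.strip())


def average_axes(expert_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each axis over the experts that scored it."""
    axes = set().union(*(scores.keys() for scores in expert_scores))
    return {
        axis: mean(s[axis] for s in expert_scores if axis in s)
        for axis in sorted(axes)
    }
```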
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|
Add eval system Phase 1 — message feedback signal
Spec at docs/eval_system.md describes the full LLM-as-judge plan;
this commit lands only the in-app feedback layer:
- debug/eval/ Python package with EvalDB (asyncpg) and FastAPI router
exposing /eval/feedback (set / clear / list)
- message_feedback postgres table keyed by (session_id, message_index)
- thumbs up / down on each completed assistant block in the webclient,
optimistic update with rollback on failure
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
on 26 Apr
|
changed llm & new ollama param
Eugene Sukhodolskiy
committed
on 26 Apr
|
| 2026-04-25 |
changed model
Eugene Sukhodolskiy
committed
on 25 Apr
|
Strengthen todo progress discipline
Eugene Sukhodolskiy
committed
on 25 Apr
|
Add structured planning review and adaptive depth
Eugene Sukhodolskiy
committed
on 25 Apr
|
Tune reflect autonomy guidance
Eugene Sukhodolskiy
committed
on 25 Apr
|
Improve compression and memory prompts
Eugene Sukhodolskiy
committed
on 25 Apr
|
Remove tool-call-like examples from prompts
Eugene Sukhodolskiy
committed
on 25 Apr
|
Improve prompt resilience and project context use
Eugene Sukhodolskiy
committed
on 25 Apr
|
Fix profile prompt inconsistencies
Eugene Sukhodolskiy
committed
on 25 Apr
|