diff --git a/docs/tech_debt_review_2026-04-29.md b/docs/tech_debt_review_2026-04-29.md new file mode 100644 index 0000000..2e1ae85 --- /dev/null +++ b/docs/tech_debt_review_2026-04-29.md @@ -0,0 +1,108 @@ +# Code Review — Technical Debt & Stability Issues + +**Date:** 2026-04-29 +**Scope:** Full backend + frontend review +**Total items:** 53 issues + +Security items are listed in a separate deferred section at the end of this document. They are **not** in scope for the current fix pass but must be addressed before any multi-user or public-facing deployment. + +--- + +## Critical — Stability / Data Loss / Crash + +| # | Problem | File | Line | Explanation | +|---|---------|------|------|-------------| +| 1 | **[FIXED] Race condition: concurrent runs overwrite each other** | `navi/api/websocket.py` | 70, 328-331, 164 | `_runs[session_id] = run` has no guard. Two tabs or rapid reconnects for the same session overwrite each other. The first run's `finally` block calls `_runs.pop()`, which accidentally removes the *second* run. Cooperative stop signals break; zombie tasks remain. | +| 2 | **[FIXED] Resource leak: tool tasks never cancelled on generator teardown** | `navi/core/agent.py` | 842-892 | Each tool call runs in `asyncio.create_task()`. If the consumer (WebSocket handler) is cancelled, `GeneratorExit` breaks the `while True` loop at `yield item` but `tool_task` is never awaited or cancelled. It continues executing filesystem/SSH/shell commands in the background with no way to stop it. | +| 3 | **[FIXED] Empty LLM stream silently kills fallback** | `navi/llm/fallback.py` | 162-165, 208-210 | `stream()` and `stream_complete()` catch `StopAsyncIteration` from an empty generator and do `return`, ending the entire fallback chain instead of trying the next server. User sees an empty response with no error. | +| 4 | **[FIXED] Frontend race condition in session loading** | `webclient/src/stores/chat.js` | 34-58 | `loadSession()` guards against reloading the *same* ID, but has no guard against two *different* concurrent loads. If user switches A→B while `loadSession(A)` is in-flight, the late completion overwrites `currentId` back to A. UI snaps to the wrong session. | +| 5 | **[FIXED] Frontend crash on non-string tool result** | `webclient/src/components/messages/ContentCard.vue` | 100-111 | `metadata` computed property calls `.match()` on `props.tool.result ?? ''`. If `result` is an object or `null`, `.match()` throws `TypeError` and crashes the component. | +| 6 | **[FIXED] Frontend crash on malformed image data** | `webclient/src/stores/chat.js` | 393 | `buildMessageList` calls `b.startsWith('data:')` on every element of `m.images`. If backend sends `null` or a non-string, the history builder crashes. | + +--- + +## High — Performance / Reliability / UX Breakage + +| # | Problem | File | Line | Explanation | +|---|---------|------|------|-------------| +| 7 | **[FIXED] Unbounded WebSocket replay buffer** | `navi/api/websocket.py` | 44-45 | `_AgentRun.events` is a plain `list[dict]` with no size limit. Large tool results (e.g. 5 MB terminal output) accumulate. Every reconnect replays the entire buffer, causing memory and bandwidth spikes. | +| 8 | **[LEGACY] SQLite opens a new connection per operation** | `navi/core/sqlite_session_store.py` | 68, 79, 89, 101, 111, 121, 129 | Every `create`, `get`, `save`, etc. does `async with aiosqlite.connect(...)`. No WAL, no busy timeout, no pooling. Under concurrent sessions this causes "database is locked" errors. | +| 9 | **[FIXED] Memory extraction task storm** | `navi/api/routes/sessions.py` | 50 | Every `POST /sessions` spawns a fire-and-forget `asyncio.create_task(_process_stale_sessions(...))`. No deduplication or rate-limiting. Rapid session creation launches many concurrent LLM calls, overwhelming the backend. | +| 10 | **[FIXED] Permanent blacklisting with no recovery** | `navi/llm/fallback.py` | 38-39 | `_dead_servers` and `_dead_models` are module-level sets. A transient network blip permanently blacklists the server until Python process restart. | +| 11 | **[FIXED] Subagent exception isolation missing** | `navi/core/agent.py` | 512 | Inside `run_ephemeral()`, tools execute with bare `await tool.execute()`. No `_run_with_sentinel` wrapper. If a tool crashes, the subagent dies and all partial progress is lost. | +| 12 | **[FIXED] No size validation on inline images** | `navi/api/websocket.py` | 297-304 | `raw_images` is taken directly from client JSON with no length or size checks. A client can send massive base64 payloads, causing memory exhaustion. | +| 13 | **[FIXED] Orphaned files when DB insert fails** | `navi/content_store.py` | 114-129 | In `publish()`, `shutil.copy2` runs first. If the subsequent DB `INSERT` fails (caught and logged), the copied file remains on disk with no metadata record, accumulating orphaned files in `navi/content/`. | +| 14 | **[FIXED] Frontend: broken streaming auto-scroll** | `webclient/src/components/chat/MessageList.vue` | 76-89 | Watchers on `chat.streamingMsg?.text` never fire during streaming because `streamingMsg` is a `shallowRef`. Vue does not track deep mutations, so the chat list does not auto-scroll as new deltas arrive. | +| 15 | **[FIXED] Frontend: mixed-content WebSocket on HTTPS** | `webclient/src/composables/useWebSocket.js` | 4-6 | WS URL is hardcoded to `ws://${location.host}`. When served over HTTPS, browsers block `ws://` as mixed content. Must switch to `wss://` when `location.protocol === 'https:'`. | +| 16 | **[FIXED] Frontend: duplicate message sends from rapid clicks** | `webclient/src/components/chat/InputBar.vue` | 141-160 | `handleSend` has no debounce or in-flight guard. `canSend` only checks `!chat.streaming`, which is not set until `stream_start`. Rapid clicks can queue multiple identical messages. | +| 17 | **[FIXED] Frontend: unhandled promise rejections across API layer** | `webclient/src/api/index.js` and callers | — | `request()` and `uploadFile()` throw on errors, but almost no callers wrap their awaits in `try/catch`. Failures become unhandled rejections and leave the UI in a broken silent state. | +| 18 | **SSH connections not closed on shutdown** | `navi/tools/ssh_exec.py` | 54-90 | `_PoolEntry` connections hang until TTL or process kill. No graceful cleanup on server shutdown. | + +--- + +## Medium — Architectural / Edge Cases / Maintainability + +| # | Problem | File | Line | Explanation | +|---|---------|------|------|-------------| +| 19 | **N+1 UPDATE queries in embedding backfill** | `navi/memory/store.py` | 170-176 | `backfill_embeddings()` loops over rows, issuing one `UPDATE` per row. Should batch via `executemany` or multi-row `UPDATE`. | +| 20 | **Unindexed ILIKE fallback search** | `navi/memory/store.py` | 282-314 | Text fallback constructs dynamic SQL with `ILIKE` on `category`, `key`, `value`. No indexes exist on these columns individually. Full table scan on large tables. | +| 21 | **Context list rebuilt from scratch every iteration** | `navi/core/agent.py` | ~683 | `_build_context()` creates a brand-new list by filtering and prepending system messages on every tool-calling iteration. O(n) per iteration, generating significant garbage. | +| 22 | **[FIXED] Unbounded `planning_logs` growth** | `navi/core/agent.py` | 651 | `session.planning_logs.append(_ev.log)` with no size limit or rotation. Long sessions with planning enabled produce huge JSON in DB, slowing `save()`. | +| 23 | **[FIXED] Background cleanup task unreferenced** | `navi/main.py` | 72 | `asyncio.create_task(cleanup_loop(...))` — the returned `Task` is not stored. In Python < 3.11, unhandled exceptions may be silently discarded. | +| 24 | **[FIXED] BaseException catches SystemExit** | `navi/core/agent.py` | 847-848 | `_run_with_sentinel` catches `BaseException`, which includes `SystemExit`. Should catch only `Exception`. | +| 25 | **[FIXED] SystemExit swallowed in agent loop** | `navi/core/agent.py` | 871 | `isinstance(r, BaseException)` includes `SystemExit`. It should propagate to allow graceful server shutdown. | +| 26 | **[FIXED] Hardcoded context output reserve** | `navi/core/agent.py` | 1521 | `output_reserve = 2048` tokens is not configurable per model. May falsely reject valid contexts for small models. | +| 27 | **Vector literal NaN/Infinity vulnerability** | `navi/memory/store.py` | 73-75 | `_vector_to_str` converts floats naively. A misbehaving embedding model returning `NaN` or `Infinity` creates invalid PostgreSQL vector syntax. | +| 28 | **Tight coupling to todo module internals** | `navi/core/agent.py` | 176-224 | `_todo_status_snapshot` directly reads `navi.tools.todo._plans`, a module-level dict. Hard to test and refactor. | +| 29 | **ContextVar values not reset in subagent path** | `navi/core/agent.py` | 350-351 | `run_ephemeral()` sets `_sid_var` and `_model_var` but never resets them. Background tasks spawned by the subagent may inherit stale values. | +| 30 | **[FIXED] Replay/reattach may miss post-finish events** | `navi/api/websocket.py` | 274-277 | If a client reconnects after a run finishes but before disconnect is fully processed, it gets `session_sync`. No push event indicates new messages were persisted while disconnected. | +| 31 | **[FIXED] Frontend: singleton confirm-dialog promise leak** | `webclient/src/composables/useConfirm.js` | 5-30 | Module-level `_resolve`. If `confirm()` is called twice before the first resolves, the second overwrites `_resolve`, causing the first promise to hang forever. | +| 32 | **[FIXED] Frontend: body overflow leak on lightbox unmount** | `webclient/src/components/ui/ImageLightbox.vue` | 26-28 | `watch(src)` sets `document.body.style.overflow = 'hidden'`. If component unmounts while open, the side effect is never reverted. Page stays permanently unscrollable. | +| 33 | **[FIXED] Frontend: DOM leak on fallback copy failure** | `webclient/src/composables/useCopy.js` | 10-17 | Fallback copy appends a `