diff --git a/docs/agent.md b/docs/agent.md
index b51ca01..153a762 100644
--- a/docs/agent.md
+++ b/docs/agent.md
@@ -84,7 +84,7 @@
 
 ## Tool-calling loop
 
-Runs up to `profile.max_iterations` times.
+Runs up to `profile.max_iterations` times. Tool schemas are built at the start of `run_stream()` from `profile.get_agent_tools()` (see [`profiles.md`](profiles.md) for the `tools.agent` / `tools.subagent` structure).
 
 ```
 Each iteration:
diff --git a/docs/api.md b/docs/api.md
index e651661..5b2a002 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -157,7 +157,6 @@
     "id": "secretary",
     "name": "Personal Secretary",
     "description": "General-purpose assistant",
-    "enabled_tools": ["todo", "mcp__navi_web__web_search", "filesystem", "..."],
     "llm_backend": "ollama",
     "model": ["gemma4:31b-cloud", "gemma4:26b-a4b-it-q4_K_M"],
     "temperature": 0.65,
@@ -167,7 +166,18 @@
     "iteration_budget_enabled": true,
     "think_enabled": true,
     "subagent_think_enabled": null,
-    "mcp_servers": {"gnexus-book": ["read", "write"]}
+    "tools": {
+      "agent": {
+        "native": ["todo", "scratchpad", "filesystem"],
+        "mcp": {
+          "navi-web": ["search"]
+        }
+      },
+      "subagent": {
+        "native": ["todo", "filesystem"],
+        "mcp": {}
+      }
+    }
   }
 ]
 ```
@@ -180,7 +190,7 @@
 ```json
 [
   {
-    "name": "mcp__navi_web__web_search",
+    "name": "mcp__navi-web__web_search",
     "description": "Search the web using DuckDuckGo.",
     "parameters": {"type": "object", "properties": {...}, "required": [...]}
   },
@@ -215,10 +225,12 @@
 **Response `200`**
 ```json
 {
-  "gnexus-book": {
+  "navi-web": {
     "connected": true,
     "tools": [
-      {"name": "gnexus-book_list_inventory", "description": "..."}
+      {"name": "mcp__navi-web__web_search", "description": "..."},
+      {"name": "mcp__navi-web__web_view", "description": "..."},
+      {"name": "mcp__navi-web__http_request", "description": "..."}
     ],
     "instructions": "MANDATORY: Before answering ANY question..."
   }
@@ -342,7 +354,7 @@
       "tool_calls": [
         {
           "id": "abc123",
-          "name": "mcp__navi_web__web_search",
+          "name": "mcp__navi-web__web_search",
           "arguments": { "query": "..." }
         }
       ]
@@ -351,7 +363,7 @@
       "role": "tool",
       "content": "tool result",
       "tool_call_id": "abc123",
-      "name": "mcp__navi_web__web_search"
+      "name": "mcp__navi-web__web_search"
     }
   ]
 }
@@ -786,7 +798,7 @@
 ```json
 {
   "type": "tool_started",
-  "tool": "mcp__navi_web__web_search",
+  "tool": "mcp__navi-web__web_search",
   "args": { "query": "weather in moscow" },
   "is_subagent": false
 }
@@ -800,7 +812,7 @@
 ```json
 {
   "type": "tool_call",
-  "tool": "mcp__navi_web__web_search",
+  "tool": "mcp__navi-web__web_search",
   "args": { "query": "weather in moscow" },
   "result": "Today +12°C, cloudy.",
   "success": true,
@@ -1241,10 +1253,18 @@
   "anti_stall_threshold": 8,
   "step_validation_enabled": false,
   "adaptive_replan_enabled": false,
-  "subagent_tools": [...],
-  "subagent_planning_enabled": false,
-  "subagent_think_enabled": null,
-  "enabled_tools": [...],
+  "tools": {
+    "agent": {
+      "native": ["todo", "scratchpad", "filesystem"],
+      "mcp": {
+        "navi-web": ["search"]
+      }
+    },
+    "subagent": {
+      "native": ["todo", "filesystem"],
+      "mcp": {}
+    }
+  },
   "context_providers": [],
   "is_admin_only": false
 }
diff --git a/docs/eval_system_design.md b/docs/eval_system_design.md
deleted file mode 100644
index 444d66d..0000000
--- a/docs/eval_system_design.md
+++ /dev/null
@@ -1,287 +0,0 @@
-# Eval System — Design Spec
-
-LLM-as-judge evaluation of Navi sessions. Tracks quality dynamics over time without dedicated test scenarios — analysis runs against real, unmodified sessions.
-
-Status: **spec / not implemented**.
-
-## Goals
-
-1. See how Navi's quality changes over time across multiple axes.
-2. Detect regressions after prompt/model/architecture changes.
-3. Surface concrete sessions for inspection (best, worst, biggest deltas).
-4. No special test fixtures — evaluation runs against real usage.
-
-## Non-goals
-
-- Absolute "correctness" of scores. We care about **dynamics**, not whether a 75 is "really a 75".
-- Real-time scoring during sessions. Eval is offline.
-- Verification of factual claims (judge can't run code; this is a known limit).
-
-## Architecture
-
-Three parts, deliberately decoupled:
-
-1. **In-app feedback signal** — like/dislike per assistant response in the main webclient, stored alongside messages.
-2. **Eval runner (CLI)** — standalone, runs offline against PostgreSQL, evaluates accumulated sessions, writes scores to eval tables. Does not require the FastAPI server to be running.
-3. **Eval UI (debug page)** — read-only SPA for browsing sessions / scores / charts, plus a button to trigger an eval run on the server. Lives under `debug/eval/`, served as static and pulls data through a small REST namespace `/api/eval/...`.
-
-### Directory layout
-
-Everything for the eval system lives under `debug/eval/`. The directory contains both the standalone Python backend (CLI + REST router) and the frontend SPA. `debug/eval/` is a Python package (`__init__.py`) so the CLI can be invoked as `python -m debug.eval` from the project root.
-
-```
-debug/eval/
-  __init__.py
-  cli.py              # entry: python -m debug.eval ...
-  judge.py            # judge orchestration (3 experts, averaging)
-  schema.py           # Pydantic models for scores / requests
-  db.py               # asyncpg queries for feedback + evaluations
-  api.py              # FastAPI APIRouter, mounted from navi/main.py
-                      #   GET  /api/eval/sessions
-                      #   GET  /api/eval/sessions/{id}
-                      #   GET  /api/eval/stats
-                      #   POST /api/eval/run       (background task)
-                      #   POST /api/eval/feedback  (like/dislike)
-  index.html          # frontend SPA (matches debug/index.html style)
-  app.js
-  style.css
-  prompts/
-    expert_strict_critic.txt
-    expert_pragmatist.txt
-    expert_tech_lead.txt
-    rubric_v1.yaml    # axes + anchors (frozen per version)
-  schema.sql          # postgres migration (eval_v1)
-  README.md           # ops doc — running CLI, applying migration
-```
-
-`navi/main.py` adds two lines: include the eval router and serve `debug/eval/index.html` at `/debug/eval/`. Everything else stays out of `navi/`.
-
-The webclient (`webclient/`) gets a small addition: like/dislike thumbs on each assistant message that POST to `/api/eval/feedback`. That's the only touchpoint outside `debug/eval/`.
-
-### What the judge sees
-
-Maximum signal — the judge gets the full session, no filtering, no compression-summary substitution.
-
-- **Full transcript** in original order: user / assistant / tool calls + tool results / thinking blocks / sub-agent transcripts (recursively, with depth markers) / planning phases (Phase 1 analysis, Phase 2 review, Phase 3 plan) — exactly as they appeared.
-- **Per-message feedback ratings** inlined next to each assistant message ("[user reaction: 👍]" / "[user reaction: 👎]" / nothing).
-- **Aggregated counts** at the top: total likes, total dislikes.
-- **Profile metadata**: which profile ran, model used, planning flags state at the time.
-- **Session timing**: start, end, duration, iteration count, total tokens.
-
-We do **not** substitute compressed summaries for the original messages — that would hide the actual work and only let the judge grade the final outcome. The point is to grade the **process**.
-
-If a session is too long for the judge's context, the runner logs a warning and skips it (or chunks by user-turn group with explicit gaps — TBD; v1 just skips).
-
-## Signal sources
-
-When evaluating a session, the judge LLM has access to:
-
-- Full session transcript (user / assistant / tool calls / thinking).
-- Per-message likes/dislikes from the user.
-- The user's own follow-up text in chat ("doesn't work", "redo it", "thanks") — judge extracts implicit signal.
-
-Aggregated like/dislike counts are computed before judge runs. If `likes > dislikes` → tilt toward "successful". If `dislikes > likes` → tilt toward "unsuccessful". If both 0 → judge infers from transcript only.
-
-## Axes
-
-Fixed set, scored 0-100 (no hard upper limit — see "Open scale" below):
-
-| Axis | Meaning |
-|---|---|
-| `task_complexity` | Difficulty of what was asked, judged from the user's request alone |
-| `goal_completion` | Did the user end up with what they wanted |
-| `tool_usage_quality` | Right tools chosen, no thrashing, no unnecessary calls |
-| `efficiency` | Iterations vs result; loops, dead-ends, redundancy |
-| `communication` | Clarity of replies, no hallucinations, no excessive verbosity |
-| `subagent_orchestration` | Quality of sub-agent delegation (null if no sub-agents used) |
-| `self_extension` | Quality of write_tool / reload_tools usage (null if not used) |
-
-The judge sees the planning structure as part of the transcript, but the rubric does **not** ask for separate scores per planning phase. The judge instructions deliberately stay at "did the agent reason / execute / communicate well" — the architectural details of how planning runs are not evaluated, since those are the very things we're trying to measure progress on. Coupling the rubric to current planning shape would lock the eval to today's mechanics.
-
-Scoring scale anchors (designed once, frozen as `rubric_v1`):
-
-- **10** — trivial, near-zero effort.
-- **30** — straightforward, one tool, one step.
-- **50** — moderate, 2-4 steps, planning helpful.
-- **75** — complex, multi-tool with planning, easy to fail.
-- **100** — at the limit of what Navi can do today (full project tasks, multiple sub-agents, self-extension).
-
-Anchors include **2-3 real session examples** at each level (user picks them once from accumulated history).
-
-### Open scale
-
-Scale is **not capped at 100**. If the judge encounters a task harder than any 100-anchor, it scores 120, 150, etc. Those become future anchors when we expand the rubric.
-
-## Experts (multi-judge averaging)
-
-Each session is evaluated by **3 different expert prompts**, then averaged. Different prompts produce different blind spots; averaging reduces variance and bias.
-
-| Expert | Prompt slant |
-|---|---|
-| `strict_critic` | Looks for flaws, scores conservatively, penalizes weakly any slip-up |
-| `pragmatist` | "Did the user end up with what they wanted, regardless of the path?" |
-| `tech_lead` | Architecture / tool choice / efficiency, focused on technical decisions |
-
-All three see the same transcript and the same rubric. Final per-axis score = mean across experts. Spread between experts is also stored — large spread = noisy/contested session.
-
-## Storage (PostgreSQL)
-
-Append-only. Multiple evals per session are normal (re-evaluation when judge upgrades, rubric changes, or you just want a fresh take).
-
-```sql
--- Per-message user feedback (drives the like/dislike signal)
-CREATE TABLE message_feedback (
-    message_id   UUID PRIMARY KEY REFERENCES messages(id),
-    session_id   UUID NOT NULL,
-    rating       SMALLINT NOT NULL,           -- +1 / -1
-    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
-);
-CREATE INDEX ON message_feedback(session_id);
-
--- One row per (session, expert, eval_run)
-CREATE TABLE evaluations (
-    id              UUID PRIMARY KEY,
-    session_id      UUID NOT NULL,
-    eval_run_id     UUID NOT NULL,            -- groups the 3 experts of one run
-    eval_date       TIMESTAMPTZ NOT NULL,
-    judge_model     TEXT NOT NULL,            -- e.g. "gemma4:31b-cloud"
-    judge_version   TEXT NOT NULL,            -- snapshotted version string
-    rubric_version  TEXT NOT NULL,            -- "v1", "v2", ...
-    expert_id       TEXT NOT NULL,            -- "strict_critic" | "pragmatist" | "tech_lead"
-    scores          JSONB NOT NULL,           -- {task_complexity: 65, goal_completion: 80, ...}
-    comment         TEXT NOT NULL             -- free-form "what stood out"
-);
-CREATE INDEX ON evaluations(session_id);
-CREATE INDEX ON evaluations(eval_date);
-CREATE INDEX ON evaluations(judge_version, rubric_version);
-
--- View: averaged scores per session per eval_run
-CREATE VIEW evaluation_summary AS
-SELECT
-    session_id,
-    eval_run_id,
-    eval_date,
-    judge_version,
-    rubric_version,
-    jsonb_object_agg(
-        axis,
-        avg_score
-    ) AS avg_scores
-FROM (
-    SELECT
-        session_id, eval_run_id, eval_date, judge_version, rubric_version,
-        key AS axis,
-        AVG((value)::numeric) AS avg_score
-    FROM evaluations, jsonb_each_text(scores)
-    GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version, key
-) t
-GROUP BY session_id, eval_run_id, eval_date, judge_version, rubric_version;
-```
-
-## Judge model policy
-
-- Judge model is **pinned** in eval config. Don't change casually.
-- When you do upgrade the judge, **re-evaluate the entire archive** with the new judge. Old scores stay (different `judge_version` row), new scores are the new baseline.
-- Comparisons across `judge_version` boundaries are not meaningful — visualizations should respect this.
-
-## Rubric versioning
-
-Same policy. Rubric changes (new anchors, reworded prompts) bump `rubric_version`. Old rows preserved, new ones are the live series.
-
-## CLI
-
-Standalone, no server dependency.
-
-```bash
-# Evaluate all unevaluated sessions (with current pinned judge + rubric)
-python -m navi.eval run
-
-# Re-evaluate everything (after judge or rubric change)
-python -m navi.eval run --re-evaluate-all
-
-# Evaluate a single session
-python -m navi.eval run --session
-
-# Limit to recent
-python -m navi.eval run --since 2026-04-01
-
-# Show eval for one session
-python -m navi.eval show
-
-# Aggregate stats
-python -m navi.eval stats --days 30
-python -m navi.eval stats --days 30 --by-complexity-bucket
-```
-
-`stats` exports CSV by default; visualization is a separate concern (see below).
-
-## UI (`debug/eval/index.html`)
-
-Single-page debug SPA in the same style as the existing `debug/index.html` (dark mono theme, no framework). Tabbed layout:
-
-### Tab 1 — Sessions
-Paginated table of all sessions, newest first. Columns: started_at, profile, turns count, likes / dislikes, last avg score (or "—"), eval status (`evaluated rubric_v1` / `pending` / `stale judge_v1 → v2`). Row click → Tab 2 with that session preselected.
-
-Filters at top: profile, date range, "show only unevaluated", "show only stale".
-
-### Tab 2 — Session detail
-Two-pane layout. Left: transcript (collapsed by default; user / assistant / tool-call / sub-agent indented). Right: eval results.
-
-- All eval runs for this session listed (most recent first), each expandable.
-- Inside an eval run: 3 expert blocks side-by-side with their per-axis scores, the spread, and free-form comment.
-- Avg row at top of run with `(judge_version, rubric_version, eval_date)`.
-- Action button: "Re-evaluate this session".
-
-### Tab 3 — Stats
-Charts (server-rendered SVG or simple canvas, no chart library):
-
-1. Average score per axis over time — weekly rolling mean.
-2. Score by complexity bucket (`0-25`, `26-50`, `51-75`, `76+`) — per-bucket trend, catches selection bias when overall score moves.
-3. Likes / dislikes ratio per week — orthogonal sanity check.
-4. Top-K worst sessions in the last 7 days — clickable, jumps to Tab 2.
-
-Filter bar: judge_version + rubric_version (mixing across versions disabled by default).
-
-### Tab 4 — Run
-Trigger an eval run.
-- Form: scope (`all unevaluated` / `since date` / `single session id` / `re-evaluate all`), max sessions, dry-run checkbox.
-- Submit → POST `/api/eval/run`, server kicks off the CLI as a background task and returns a `run_id`.
-- Live log panel below subscribes to a small WS or SSE stream and prints progress: "session N/M, expert K/3, scores …".
-- Run history table at the bottom: past runs with timestamp, count of sessions, judge_version, status.
-
-CSV export available on Tab 3 for offline plotting.
-
-## Implementation phases
-
-1. **Phase 1 — Feedback signal**
-   - `message_feedback` postgres table + migration.
-   - Webclient UI: thumbs up/down on each assistant message, REST POST to `/api/eval/feedback`.
-   - Endpoint `POST /api/eval/feedback {message_id, rating}` — upsert.
-2. **Phase 2 — Eval backend skeleton**
-   - `navi/eval/` package with CLI entry point (`python -m navi.eval`).
-   - `evaluations` table + migration.
-   - Judge prompt templates per expert (`prompts/expert_*.txt`).
-   - Rubric anchors as YAML (`prompts/rubric_v1.yaml`) — anchor examples filled in by user before going live.
-3. **Phase 3 — Run + store**
-   - `run` command: pick unevaluated sessions, render full transcript, fan out to 3 experts, validate JSON output against pydantic schema, persist all expert rows under one `eval_run_id`.
-4. **Phase 4 — Read endpoints**
-   - `/api/eval/sessions`, `/sessions/{id}`, `/stats` — read-only, used by debug UI.
-   - `/api/eval/run` (POST) — kicks off CLI in a background task, returns `run_id`. SSE/WS stream for live progress.
-5. **Phase 5 — Debug UI**
-   - `debug/eval/index.html` + `app.js` + `style.css` in the existing debug-SPA style.
-   - All four tabs (Sessions / Detail / Stats / Run) wired to the endpoints above.
-
-CSV export from `python -m navi.eval stats --csv` is also available as a pure-CLI path for offline plotting.
-
-## Costs / constraints
-
-- 3 experts × full session transcript per run. For 50-turn sessions with 50k+ token contexts that's 3 large LLM calls per session. Plan to run overnight on a small batch, not in real time.
-- Judge calls go through the same backend stack (`FallbackOllamaBackend`) — so multi-server fallback applies.
-- Eval runner should respect a `--max-tokens-per-session` guard so a runaway transcript doesn't burn the queue.
-
-## Known limits / open questions
-
-- No verification of factual or code correctness — judge sees only the transcript. For "did this code actually work?" we'd need separate runtime checks; out of scope here.
-- Judge bias toward verbose / confident answers is not fully mitigated by 3 experts — partial only.
-- Calibration set (manual scoring of N sessions to validate judge against user) is **deliberately skipped** — we only need dynamics, not absolute correctness. Re-open if the trends turn out to be uninterpretable.
-- Rubric anchors must be set with care; once the archive is large, changing the rubric forces re-eval of everything.
diff --git a/docs/index.md b/docs/index.md
index cf01232..39d52be 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -45,8 +45,9 @@
 | `navi/core/registry.py` | `build_default_registries()` — wires everything together |
 | `navi/api/websocket.py` | WebSocket handler + `POST /sessions/{id}/stop` |
 | `navi/config.py` | `Settings` — all config loaded from `.env` |
-| `navi/profiles/` | Profile definitions (`secretary`, `server_admin`, `developer`) |
+| `navi/profiles/` | Profile definitions (`secretary`, `server_admin`, `developer`, `navi_code`, etc.) |
 | `tools/` | User-defined tools (auto-discovered at startup) |
+| `clients/terminal/` | Navi Code TUI and raw CLI (`navi-code`) |
 
 ## Stack
 
@@ -56,3 +57,4 @@
 - **Database**: PostgreSQL via asyncpg
 - **Logging**: structlog
 - **Config**: pydantic-settings (reads `.env`)
+- **Terminal client**: Textual 8.x (`clients/terminal/`, command `navi-code`)
diff --git a/docs/navi_code.md b/docs/navi_code.md
index 59c5b62..873509d 100644
--- a/docs/navi_code.md
+++ b/docs/navi_code.md
@@ -66,8 +66,8 @@
 - Base: `developer`, adapted for the terminal.
 - Enabled tools:
   - Native: `terminal`, `filesystem`, `code_exec`, `spawn_agent`, `todo`, `scratchpad`, `reflect`, `list_tools`, `tool_manual`, `switch_profile`, `list_profiles`, `memory`, `schedule_recall`, `manage_recall`.
-  - MCP `navi-web`: `mcp__navi-web__web_search`, `mcp__navi-web__web_view`, `mcp__navi-web__http_request`.
-- Disabled tools: `share_file`, `content_publish`, `ssh_exec`, `gmail`.
+  - MCP: disabled (`"mcp": {}`) for a clean terminal experience.
+- Disabled tools: `share_file`, `content_publish`, `ssh_exec`, `gmail`, `image_view`, `mcp__navi-web`.
 - `planning_phase2_enabled: false`: reduces latency.
 
 ## Security
diff --git a/docs/navi_code_cli.md b/docs/navi_code_cli.md
index 2c2ff7f..697bdf6 100644
--- a/docs/navi_code_cli.md
+++ b/docs/navi_code_cli.md
@@ -59,15 +59,18 @@
 | `/export [path]` | Export the current session to Markdown; without a path, write to a temporary file and open `$EDITOR`. If the editor fails to start, the error is shown in the chat. |
 | `/themes` | Open the theme picker with live preview. |
 | `/mouse on|off` | Enable/disable mouse support (requires a restart). |
+| `/thinking` | Toggle the display of thinking blocks in the current session. |
+| `/compact` | Manually trigger context compaction for the current session. |
 | `/clear` | Clear the locally saved `session_id`. |
 | `/quit` | Exit. |
 
 ## TUI interface
 
-In interactive mode (`navi-code`) the screen is split into two parts:
+In interactive mode (`navi-code`) the screen is split into three zones:
 
 - **Left panel (`ChatPanel`)**: message history, input field, and current status.
-- **Right panel (`SessionsPanel`)**: list of sessions on the server with ID, profile, and preview columns. Clicking a row or pressing `Enter` on it switches the session.
+- **Upper right panel (`StatusPanel`)**: current profile, model, connection status, tokens, and remaining iterations.
+- **Lower right panel (`SessionsPanel`)**: list of sessions on the server with ID, profile, and preview columns. Clicking a row or pressing `Enter` on it switches the session.
 
 The session list refreshes automatically on startup and whenever `/new`, `/sessions`, or `/switch` runs.
 
diff --git a/docs/plan_navi_code.md b/docs/plan_navi_code.md
deleted file mode 100644
index 308a418..0000000
--- a/docs/plan_navi_code.md
+++ /dev/null
@@ -1,249 +0,0 @@
-# Plan: Navi Code, a local terminal client for Navi
-
-## Goal
-
-Build the "Navi Code" system: a locally run variant of Navi controlled through the terminal. No authorization (`NAVI_AUTH_ENABLED=false`), with a dedicated profile focused on working with code, the terminal, and the file system.
-
-## Out of scope for this stage
-
-- Docker packaging (deferred).
-- Image rendering, content_publish, share_file UI (deferred).
-- Authorization (we use the existing `NAVI_AUTH_ENABLED=false`).
-
-## In scope for this stage
-
-1. Create the `navi_code` profile based on `developer`.
-2. Default-profile mechanism.
-3. Prepare a bundled `.env` / configuration for local terminal mode.
-4. Prepare the persona / system prompt for Navi Code.
-5. CLI terminal client for interacting with Navi.
-6. Documentation on launching and usage.
-
----
-
-## 1. The `navi_code` profile
-
-### 1.1. Base
-
-- Copy `navi/profiles/developer/` → `navi/profiles/navi_code/`.
-- `id`: `navi_code`.
-- `name`: `"Navi Code"`.
-
-### 1.2. Tool tuning
-
-Enable:
-
-- `terminal`: the main tool.
-- `filesystem`: reading/writing files.
-- `code_exec`: code execution.
-- `spawn_agent`: for complex subtasks.
-- `list_tools`, `tool_manual`, `write_tool`, `reload_tools`: self-extension.
-- `scratchpad`, `todo`, `reflect`: for planning.
-
-Exclude (to simplify the terminal experience):
-
-- `share_file`.
-- `content_publish`.
-- `image_view`.
-- `http_request`: keep if needed, but remove by default.
-- `web_search`, `ssh_exec`: keep as an option, but do not enable by default.
-
-### 1.3. Parameter tuning
-
-- `max_iterations`: keep a high value (for example, 100), but not unlimited.
-- `temperature`: 0.3–0.4.
-- `model`: a local model by default, for example `gemma4:26b-a4b-it-q4_K_M`.
-- Planning: enable phases 1 and 3, disable phase 2 (3 advisors) to reduce latency.
-- `iteration_budget_enabled`, `goal_anchoring_enabled`, `anti_stall_enabled`: keep enabled.
-- `step_validation_enabled`: disable.
-- `adaptive_replan_enabled`: keep disabled.
-
-### 1.4. System prompt
-
-- Copy `developer/system_prompt.txt`.
-- Adapt it to the terminal context: Navi runs locally and has a terminal, a file system, and the ability to execute code.
-- Add safety instructions: ask for confirmation before destructive operations (`rm`, overwriting).
-
----
-
-## 2. Default-profile mechanism
-
-### 2.1. Options
-
-Option A: env variable (preferred):
-
-- Add `navi_default_profile_id: str = ""` to `navi/config.py`.
-- Read `NAVI_DEFAULT_PROFILE_ID` from `.env`.
-- If it is set and the profile exists, use it as the fallback when a session is created without `profile_id`.
-- Allow REST `POST /sessions` to omit `profile_id` and take the default.
-
-Option B: client-side:
-
-- The terminal client itself knows the `navi_code` profile and always sends it.
-- Simpler, but less universal.
-
-### 2.2. Decision
-
-Implement **option A**: a server-side env variable plus support for a missing `profile_id` in `POST /sessions`. This lets any client (CLI, web, script) work with the default profile.
-
----
-
-## 3. Configuration for local mode
-
-### 3.1. New/changed env variables
-
-- `NAVI_AUTH_ENABLED=false` (already exists).
-- `NAVI_DEFAULT_PROFILE_ID=navi_code` (new).
-- `NAVI_PERSONA_FILE=persona_navi_code.txt` (new persona).
-- `FS_ALLOWED_PATHS=*`.
-- `TERMINAL_ALLOWED_COMMANDS=*`.
-- `OLLAMA_HOST=http://localhost:11434`.
-- `DATABASE_URL=postgresql://navi:navipass@localhost:5432/navidb` (or the user's local setup).
-
-### 3.2. Files to prepare
-
-- `.env.navi_code.example`: example `.env` for Navi Code.
-- `persona_navi_code.txt`: global persona for Navi Code.
-
-### 3.3. What stays unchanged
-
-- The configuration structure (`navi/config.py`): only new fields are added.
-- Behavior with `NAVI_AUTH_ENABLED=true`: must not break.
-
----
-
-## 4. Navi Code persona
-
-### 4.1. Key traits
-
-- A local developer assistant.
-- Has access to the terminal, the file system, and code execution.
-- Can plan, break tasks into todos, and work with spawn_agent.
-- Asks for confirmation before dangerous operations.
-- Speaks to the user in their language (Russian/English).
-
-### 4.2. Tool usage rules
-
-- Uses `terminal` for shell commands.
-- Uses `filesystem` for reading/writing.
-- Uses `code_exec` for quick checks of small snippets.
-- Uses `scratchpad` for long-running thoughts and `todo` for planning.
-- Always calls `tool_manual("write_tool")` before `write_tool`.
-
----
-
-## 5. CLI terminal client
-
-### 5.1. Location
-
-- `navi_code_cli/` in the project root (a separate Python package).
-- Or `clients/terminal/`.
-
-### 5.2. Minimal MVP functionality
-
-- Connect to a running Navi backend over WebSocket.
-- Interactive mode (`navi-code` with no arguments → chat).
-- One-shot mode (`navi-code "task"` → run and exit).
-- Persist `session_id` between runs (`~/.navi_code/state.json`).
-- Supported commands:
-  - `/new`: new session,
-  - `/sessions`: list sessions,
-  - `/switch `: switch session,
-  - `/profile`: show the current profile,
-  - `/quit`: exit.
-
-### 5.3. Event rendering
-
-- `stream_delta`: print the text.
-- `thinking_delta` / `thinking_end`: show in a collapsible block or behind the `--show-thinking` flag.
-- `tool_started` / `tool_call`: show the tool name and result.
-- `stream_end`: end of the response.
-- `error`: shown in red.
-
-### 5.4. Dependencies
-
-- `click` or `typer`.
-- `websockets`.
-- `rich`: for colored output, markdown, tables.
-- `pydantic`: for models.
-
-### 5.5. Server interaction
-
-- `GET /agents/profiles`: check the profile.
-- `POST /sessions`: create a session (with the default profile).
-- `WS /ws/sessions/`: main chat.
-- `POST /sessions//stop`: stop generation.
-
----
-
-## 6. Documentation
-
-### 6.1. New documents
-
-- `docs/navi_code.md`: full guide to Navi Code.
-- `docs/navi_code_cli.md`: CLI documentation.
-- `docs/profiles.md`: update to add the `navi_code` profile and describe `NAVI_DEFAULT_PROFILE_ID`.
-- `docs/config.md`: update with the new env variables.
-
-### 6.2. README
-
-- Add a "Navi Code" section to the main README.
-
----
-
-## 7. Implementation order
-
-### Stage 1: Profile and configuration
-
-1. Create `navi/profiles/navi_code/` based on `developer/`.
-2. Add `navi_default_profile_id` to `navi/config.py`.
-3. Update `POST /sessions` to use the default profile.
-4. Create `persona_navi_code.txt`.
-5. Create `.env.navi_code.example`.
-6. Update the documentation (`docs/profiles.md`, `docs/config.md`).
-
-### Stage 2: CLI client
-
-1. Create the `navi_code_cli/` structure.
-2. Implement the WebSocket client.
-3. Implement interactive mode.
-4. Implement one-shot mode.
-5. Implement session state persistence.
-6. Add a README and documentation.
-
-### Stage 3: Testing and polish
-
-1. Verify session creation with the default profile.
-2. Verify that the terminal works through the CLI.
-3. Verify session persistence.
-4. Verify no-auth mode.
-5. Add unit tests for the new mechanisms.
-
----
-
-## 8. Risks and questions
-
-### Risks
-
-- **Security:** `TERMINAL_ALLOWED_COMMANDS=*` and the admin role give full access to the system. It must be clearly documented that Navi Code is intended for local use only.
-- **Ollama dependency:** the user has to run Ollama themselves. This needs to be documented.
-- **Postgres:** a local Postgres with pgvector is required. It may be worth providing a docker-compose for the database later.
-- **Private dependency `gnexus-auth-client-py`:** building/installing may require Git access. For local development the current venv is already set up.
-
-### Open questions
-
-- How exactly should the CLI handle persistent terminals? Directly in interactive mode or through a separate command?
-- Is a `/cd` command needed to change the working directory in the CLI?
-- Should we add a context provider (`cwd_provider`) or pass cwd through CLI parameters?
-
----
-
-## 9. Completion criteria
-
-- [ ] The `navi_code` profile is created and loads.
-- [ ] `NAVI_DEFAULT_PROFILE_ID` works; `POST /sessions` without `profile_id` creates a session with the default profile.
-- [ ] The Navi Code persona is wired in through `NAVI_PERSONA_FILE`.
-- [ ] The CLI client can connect and hold a conversation.
-- [ ] The CLI client persists `session_id` between runs.
-- [ ] Documentation is updated.
-- [ ] All tests pass.
diff --git a/docs/plan_navi_code_tui.md b/docs/plan_navi_code_tui.md
deleted file mode 100644
index dbb0684..0000000
--- a/docs/plan_navi_code_tui.md
+++ /dev/null
@@ -1,134 +0,0 @@
-# Plan: Navi Code TUI (OpenCode-style)
-
-Goal: turn `navi-code` from a simple click CLI into a full-screen terminal UI inspired by OpenCode, keeping the click CLI as `navi-code --raw`.
-
----
-
-## Principles
-
-- **Micro-architecture**: each component is responsible for one task and communicates through events or a bus.
-- **Extensibility**: new slash commands, widgets, renderers, and themes are added without reworking the core.
-- **Compatibility**: the TUI and the click CLI share `ws_client.py`, `api.py`, `config.py`, `state.py`.
-- **Incrementality**: each phase is a separate commit after which the product works.
-
----
-
-## Overall architecture
-
-```
-┌─────────────────────────────────────────────────────────────┐
-│                    NaviCodeApp (Textual)                    │
-│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
-│  │  ChatPanel   │  │ StatusPanel  │  │  SessionsPanel   │   │
-│  │  (messages)  │  │ (profile,    │  │  (optional/right)│   │
-│  │              │  │  model,      │  │                  │   │
-│  │              │  │  connection) │  │                  │   │
-│  └──────────────┘  └──────────────┘  └──────────────────┘   │
-│  ┌────────────────────────────────────────────────────────┐ │
-│  │ InputBox (prompt frame, slash commands, @/! parsing)   │ │
-│  └────────────────────────────────────────────────────────┘ │
-└─────────────────────────────────────────────────────────────┘
-                              │
-                              ▼
-┌─────────────────────────────────────────────────────────────┐
-│                    EventBus / Dispatcher                    │
-│ - WebSocket events → ChatPanel/StatusPanel                  │
-│ - User input → CommandParser → execute command/send message │
-└─────────────────────────────────────────────────────────────┘
-                              │
-                              ▼
-┌─────────────────────────────────────────────────────────────┐
-│                       Shared services                       │
-│ - NaviWebSocketClient                                       │
-│ - api (REST wrappers)                                       │
-│ - StateManager (~/.navi_code/state.json)                    │
-│ - Settings                                                  │
-└─────────────────────────────────────────────────────────────┘
-```
-
----
-
-## Phases
-
-### Phase 3 — TUI skeleton
-
-- Add `textual>=0.70` to `pyproject.toml`.
-- Create `clients/terminal/tui_app.py` with a basic `App`:
-  - `ChatPanel`: `ScrollableContainer` + RichLog for messages.
-  - `StatusPanel`: Static/Label with profile, session, model, status.
-  - `InputBox`: a custom input widget with an OpenCode-style frame.
-- `navi-code` launches the TUI by default; `navi-code --raw` runs the old click CLI.
-- Integrate `NaviWebSocketClient` with the Textual event loop via `asyncio.create_task` + `call_from_thread`/`post_message`.
-- Basic event rendering: `stream_delta`, `thinking_delta`/`thinking_end`, `tool_started`/`tool_call`, `error`, `stream_end`.
-- Add `TuiRenderer`, which turns WebSocket events into Rich renderables.
-- Update tests: at least one smoke test that the TUI App mounts and starts without errors.
-
-### Phase 4 — OpenCode UX
-
-- **Slash commands**: `/help`, `/new`, `/sessions`, `/switch`, `/profile`, `/thinking`, `/compact`, `/quit`, `/models`.
-- **Command palette**: `Ctrl+P`, search across commands and settings.
-- **`@` file references**: fuzzy autocomplete of files in CWD when typing `@`.
-- **`!` shell pre-command**: if a message starts with `!`, run the shell command and substitute its output.
-- **Permission prompt**: inline prompt for destructive tool calls (`rm`, overwrite, format) with Allow once / Allow always / Reject buttons. -- **Markdown/code highlighting**: `rich.markdown` + `rich.syntax` in ChatPanel. -- **Diff/artifact renderers**: an extensible `ContentRenderer` registry: code, diff, plain, image mention. - -### Phase 5 — Polish & config - -- **Mouse support**: enable it in Textual. -- **Themes**: `/themes` + `~/.navi_code/tui.json` with `theme`, `keybinds`, `diff_style`, `mouse`, `scroll_speed`. -- **SessionsPanel**: a side panel listing sessions, switchable by click or arrow keys. -- **Export**: `/export` saves the current chat as markdown and opens `$EDITOR`. -- **Advanced status panel**: tokens used, remaining iterations, backend, connection health. -- **Undo/Redo**: handled separately, if git integration proves feasible. -- **Tests**: unit + TUI integration tests via Textual Pilot. - ---- - -## Extension points - -1. **Command registry** (`clients/terminal/commands/registry.py`) - - Each slash command is a class with `name`, `aliases`, `description`, `keybind`, `async execute(ctx)`. - - Registered via the `@register_command` decorator. - -2. **Content renderers** (`clients/terminal/renderers/`) - - `BaseRenderer` → `CodeRenderer`, `DiffRenderer`, `MarkdownRenderer`, `ToolCallRenderer`, `ErrorRenderer`. - - `RendererRegistry` selects by `type`/`mime`. - -3. **Themes** (`clients/terminal/themes/`) - - `Theme` dataclass: colors for borders, background, accent, status, errors, thinking. - - `ThemeRegistry` with built-in themes and loading from `tui.json`. - -4. **Event bus** (`clients/terminal/events.py`) - - Textual-native `post_message`, but with typed events `WsEvent`, `CommandEvent`, `PermissionEvent`. - -5. **Permission engine** (`clients/terminal/permissions.py`) - - Rules keyed by tool name + action/pattern. - - `PermissionStore` persists `allow_always` in `~/.navi_code/permissions.json`. 
- ---- - -## Integration with the existing CLI - -- `cli.py` stays and gains a `--raw` flag. -- `tui_app.py` imports `Settings`, `StateManager`, `api`, `NaviWebSocketClient`. -- `render.py` stays for `--raw`; the TUI uses new renderers built on Rich. - ---- - -## Testing - -- `tests/clients/test_tui_app.py`: mounting the App, layout checks. -- `tests/clients/test_tui_commands.py`: unit tests for the command parser and registry. -- `tests/clients/test_tui_renderers.py`: rendering of different content types. -- `tests/clients/test_tui_permissions.py`: permission prompt and `allow_always`. -- Smoke test: `navi-code --help` and `navi-code --version` work in both modes. - ---- - -## Completion criteria - -- `navi-code` starts in a full-screen TUI. -- The click CLI is available via `navi-code --raw`. -- All new files are covered by tests, ruff is clean, pytest is green. -- The `docs/navi_code_cli.md` documentation is updated to cover TUI mode. diff --git a/docs/profiles.md b/docs/profiles.md index 7bbb4bf..bdb3479 100644 --- a/docs/profiles.md +++ b/docs/profiles.md @@ -39,14 +39,31 @@ | Key | Type | Default | Description | |---|---|---|---| -| `enabled_tools` | list[str] | **required** | Tool names available in the main loop | -| `subagent_tools` | list[str] | `[]` | Tools available to sub-agents spawned from this profile. Falls back to `enabled_tools` (full list) if empty. **Also acts as a whitelist for MCP tools** — only `mcp__<server>__<tool>` entries listed here are exposed to the sub-agent. If the list is non-empty and contains no `mcp__` entries, the sub-agent receives no MCP tools at all. | +| `tools` | `ToolConfig` | `{}` | **Required.** Explicit tool configuration with two scopes: `agent` and `subagent`. Each scope has `native: list[str]` (built-in and user tool names) and `mcp: dict[str, list[str]]` (MCP server groups). | -`spawn_agent` may receive an optional `profile_id`. If omitted, the subagent uses the parent session's current profile. 
If provided, the subagent uses the selected profile's model, prompt, planning flags, and `subagent_tools`/`enabled_tools` fallback. +`tools.agent` controls what the main loop sees. `tools.subagent` controls what sub-agents spawned from this profile see. If `tools.subagent` is empty, it falls back to `tools.agent`. -### MCP tools in sub-agents +`spawn_agent` may receive an optional `profile_id`. If omitted, the subagent uses the parent session's current profile. If provided, the subagent uses the selected profile's model, prompt, planning flags, and `tools.subagent` fallback. -When `subagent_tools` is non-empty, `mcp_servers` is filtered so that only MCP tools whose full name (`mcp__<server>__<tool>`) appears in `subagent_tools` are available to the sub-agent. This prevents a profile's main MCP servers from leaking into restricted sub-agent contexts. To grant a sub-agent access to a specific MCP tool, add it explicitly to `subagent_tools`, e.g. `mcp__navi_web__web_search`. +### MCP tool groups + +Inside `tools.{agent,subagent}.mcp`, each key is an MCP server name and each value is a list of group names (or `"*"` for all groups). Named groups resolve to concrete tools via the server's config in `mcp_servers.d/`. Example: + +```json +{ + "mcp": { + "navi-web": ["search", "browse", "request"] + } +} +``` + +This exposes the tools `mcp__navi-web__web_search`, `mcp__navi-web__web_view`, and `mcp__navi-web__http_request`. `*` expands to every tool advertised by that server. + +### Deprecated tool fields + +Older configs used `enabled_tools`, `subagent_tools`, and `mcp_servers` as flat top-level fields. The loader still auto-migrates them into `tools.agent` / `tools.subagent` for backward compatibility, but new profiles should use the explicit `tools` structure. + +When `tools.subagent` is non-empty, only MCP groups listed there are exposed to the sub-agent. This prevents a profile's main MCP servers from leaking into restricted sub-agent contexts. 
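The scope fallback and MCP group expansion described in this hunk can be pictured with a rough Python sketch. It is not Navi's loader: `SERVER_GROUPS`, `resolve_scope`, and `expand_mcp` are hypothetical names, the group-to-tool mapping for `navi-web` is assumed from the order of the example, and "empty" is assumed to mean a scope with neither `native` nor `mcp` entries.

```python
# Hypothetical group -> tool mapping. In Navi this would come from the
# server config in mcp_servers.d/; the pairing below is an assumption.
SERVER_GROUPS = {
    "navi-web": {
        "search": ["web_search"],
        "browse": ["web_view"],
        "request": ["http_request"],
    }
}


def resolve_scope(tools: dict, scope: str) -> dict:
    """Return the config for a scope; an empty subagent scope falls back to agent."""
    cfg = tools.get(scope) or {}
    is_empty = not cfg.get("native") and not cfg.get("mcp")
    if scope == "subagent" and is_empty:
        return tools.get("agent") or {}
    return cfg


def expand_mcp(mcp: dict, server_groups: dict) -> list:
    """Expand {server: [groups]} into full tool names of the form mcp__server__tool."""
    names = []
    for server, groups in mcp.items():
        available = server_groups.get(server, {})
        # "*" selects every group known for that server.
        selected = list(available) if "*" in groups else groups
        for group in selected:
            for tool in available.get(group, []):
                names.append(f"mcp__{server}__{tool}")
    return names


profile_tools = {
    "agent": {"native": ["todo", "filesystem"], "mcp": {"navi-web": ["search"]}},
    "subagent": {"native": ["todo"], "mcp": {}},
}
subagent_cfg = resolve_scope(profile_tools, "subagent")
print(expand_mcp(subagent_cfg.get("mcp", {}), SERVER_GROUPS))
```

With a non-empty `subagent` scope whose `mcp` is `{}`, the sub-agent ends up with no MCP tools, which matches the restriction the docs describe.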
### Thinking mechanics @@ -84,7 +101,6 @@ | `subagent_think_enabled` | bool \| None | `None` | Extended reasoning for sub-agents. `None` = inherit `think_enabled` from parent profile. | | `subagent_planning_enabled` | bool | `false` | Sub-agents spawned from this profile also run the planning pipeline before their tool loop. | | `context_providers` | list[str] | `[]` | Extra context providers to inject for this profile (by name). Global providers are always injected. | -| `mcp_servers` | dict | `{}` | MCP servers referenced by this profile. Format: `{"server_name": ["group1", "group2"]}` or `{"server_name": ["*"]}` for all tools. | | `is_admin_only` | bool | `false` | If `true`, profile is hidden from non-admin users in the profile list. | | `is_subagent_only` | bool | `false` | If `true`, profile can only be used via `spawn_agent`; `switch_profile` is blocked. Useful for narrow specialist agents that should never become the main session profile. | @@ -94,9 +110,9 @@ | ID | Name | Models (priority order) | Temp | Planning | |---|---|---|---|---| -| `secretary` | Personal Secretary | gemma4:31b-cloud → gemma4:26b-a4b-it-q4_K_M | 0.65 | Yes | +| `secretary` | Personal Secretary | gemma4:31b-cloud, qwen3.5:397b-cloud, kimi-k2.6:cloud, gemma4:26b-a4b-it-q4_K_M, qwen3.6:27b | 0.45 | Yes | | `server_admin` | Server Administrator | gemma4:31b-cloud → gemma4:26b-a4b-it-q4_K_M | 0.3 | Yes | -| `developer` | Developer | gemma4:31b-cloud → gemma4:26b-a4b-it-q4_K_M | 0.45 | Yes | +| `developer` | Developer | gemma4:31b-cloud, qwen3.5:397b-cloud, kimi-k2.6:cloud, gemma4:26b-a4b-it-q4_K_M, qwen3.6:27b | 0.35 | Yes | | `tool_developer` | Tool Developer | gemma4:31b-cloud → gemma4:26b-a4b-it-q4_K_M | 0.35 | Yes | | `discuss` | Discussion | gemma4:31b-cloud → gemma4:26b-a4b-it-q4_K_M | 0.85 | No | | `modeler_3d` | 3D Modeler | gemma4:26b-a4b-it-q4_K_M → gemma4:31b-cloud | 0.35 | Yes | @@ -109,8 +125,8 @@ Terminal-first local coding assistant. 
Designed for the Navi Code CLI and single-user local deployments: - **Native tools:** `todo`, `scratchpad`, `reflect`, `switch_profile`, `list_profiles`, `filesystem`, `code_exec`, `terminal`, `memory`, `list_tools`, `tool_manual`, `spawn_agent`, `schedule_recall`, `manage_recall`. -- **MCP tools:** disabled by default for the terminal experience. -- **Excluded:** `share_file`, `content_publish`, `ssh_exec`, `gmail`, `mcp__navi-web`. +- **MCP tools:** disabled (`"mcp": {}`) for the terminal experience. +- **Excluded:** `share_file`, `content_publish`, `ssh_exec`, `gmail`, `image_view`, `mcp__navi-web`. - **Planning:** Phase 1 and Phase 3 enabled, Phase 2 disabled to reduce latency. - **Safety:** the system prompt asks Navi to confirm destructive operations (`rm`, overwrites) before executing them. @@ -149,8 +165,18 @@ "model": ["gemma4:31b-cloud", "gemma4:26b-a4b-it-q4_K_M"], "temperature": 0.5, "max_iterations": 20, - "enabled_tools": ["todo", "scratchpad", "mcp__navi_web__web_search", "filesystem"], - "subagent_tools": ["todo", "filesystem", "terminal"], + "tools": { + "agent": { + "native": ["todo", "scratchpad", "filesystem", "terminal"], + "mcp": { + "navi-web": ["search"] + } + }, + "subagent": { + "native": ["todo", "filesystem", "terminal"], + "mcp": {} + } + }, "planning_enabled": true, "planning_mandatory": false, "planning_phase1_enabled": true, diff --git a/docs/testing.md b/docs/testing.md index 9668891..b2c95b1 100644 --- a/docs/testing.md +++ b/docs/testing.md @@ -14,54 +14,83 @@ ## Directory layout ``` -tests/ # Backend (pytest) +tests/ # Backend + terminal client (pytest) ├── conftest.py ├── conftest_factory.py ├── unit/ │ ├── api/ -│ │ └── test_session_files.py # upload/download file endpoint logic +│ │ └── test_session_files.py +│ ├── auth/ +│ │ ├── test_api_tokens.py +│ │ ├── test_deps.py +│ │ └── test_encrypt.py │ ├── core/ -│ │ ├── test_events.py # 17 tests -│ │ ├── test_context_builder.py -│ │ ├── test_compressor.py -│ │ ├── test_registry.py 
# registries, backend discovery, context provider registry -│ │ ├── test_planning.py +│ │ ├── test_agent.py │ │ ├── test_agent_context_size.py -│ │ └── test_agent_stream_guard.py +│ │ ├── test_agent_stream_guard.py +│ │ ├── test_anti_stall.py +│ │ ├── test_compressor.py +│ │ ├── test_context_builder.py +│ │ ├── test_events.py +│ │ ├── test_pg_session_store.py +│ │ ├── test_planning.py +│ │ ├── test_registry.py +│ │ ├── test_scheduler.py +│ │ └── test_tool_executor.py │ ├── llm/ -│ │ └── test_ollama.py # timeout/error classification + fallback timeout wiring +│ │ └── test_ollama.py │ ├── memory/ -│ │ ├── test_store.py -│ │ └── test_extractor.py -│ ├── tools/ -│ │ ├── test_filesystem.py -│ │ ├── test_code_exec.py -│ │ ├── test_terminal.py -│ │ ├── test_share_file.py -│ │ └── test_content_publish.py +│ │ ├── test_extractor.py +│ │ └── test_store.py │ ├── profiles/ -│ │ └── test_base.py -│ ├── config/ -│ └── test_settings.py +│ │ ├── test_base.py +│ │ └── test_overrides.py +│ ├── store/ +│ │ └── test_kv_store.py +│ ├── tools/ +│ │ ├── test_code_exec.py +│ │ ├── test_content_publish.py +│ │ ├── test_filesystem.py +│ │ ├── test_image_view.py +│ │ ├── test_memory.py +│ │ ├── test_recall_tools.py +│ │ ├── test_scratchpad.py +│ │ └── ... 
│ ├── test_content_store.py +│ ├── test_mcp.py │ └── test_startup.py ├── integration/ │ ├── conftest.py │ ├── test_api_routes.py +│ ├── test_auth_disabled.py +│ ├── test_mcp_integration.py +│ ├── test_recall_api.py +│ ├── test_scheduler_loop.py │ └── test_websocket.py -└── e2e/ - └── test_chat_flow.py +└── clients/ # Terminal client tests + ├── test_terminal_client.py + ├── test_terminal_ws.py + ├── test_tui_app.py + ├── test_tui_export.py + ├── test_tui_sessions_panel.py + ├── test_tui_settings.py + ├── test_tui_themes.py + ├── test_permissions.py + ├── test_permission_dialog.py + ├── test_shell_runner.py + ├── test_file_refs.py + └── test_diff_artifact_renderers.py webclient/tests/ # Web client (Vitest) ├── unit/ │ ├── api/ -│ │ └── index.test.js # 8 tests — request helper, verbs, errors, FormData +│ │ └── index.test.js │ ├── stores/ -│ │ ├── chat.test.js # 23 tests — buildMessageList, WS handlers, session load -│ │ ├── sessions.test.js # 6 tests — fetch, create, delete, pin sorting -│ │ └── profiles.test.js # 3 tests — fetch, selection, lookup +│ │ ├── chat.test.js +│ │ ├── sessions.test.js +│ │ └── profiles.test.js │ └── composables/ -│ └── useWebSocket.test.js # 7 tests — connect, dispatch, reconnect +│ └── useWebSocket.test.js ``` ## Mock strategy @@ -94,99 +123,18 @@ ## Coverage status -| Phase | Module | Tests | Status | -|-------|--------|-------|--------| -| 1 | `navi.core.events` | 17 | ✅ Done | -| 1 | `navi.core.compressor` | 14 | ✅ Done | -| 1 | `navi.core.registry` + `ContextProviderRegistry` | 13 | ✅ Done | -| 1 | `navi.core.context_builder` | 9 | ✅ Done | -| 1 | `navi.profiles.base` | 9 | ✅ Done | -| 2 | `navi.memory.store` | 18 | ✅ Done | -| 2 | `navi.memory.extractor` | 11 | ✅ Done | -| 3 | `navi.api.routes` | 19 | ✅ Done | -| 3 | `navi.api.routes.sessions` file endpoint logic | 5 | ✅ Basic | -| 3 | `navi.api.websocket` | 7 | ✅ Done | -| 3 | `navi.main` startup ordering | 1 | ✅ Basic | -| 4 | `navi.core.agent` | 9 | ✅ Done | -| 4 | 
`navi.core.planning` | 5 | ✅ Done | -| 5 | `navi.tools.filesystem` | 13 | ✅ Done | -| 5 | `navi.tools.code_exec` | 5 | ✅ Done | -| 5 | `navi.tools.terminal` | 4 | ✅ Done | -| 5 | `navi.tools.share_file` | 5 | ✅ Basic | -| 5 | `navi.tools.content_publish` | 4 | ✅ Basic | -| 5 | `navi.content_store` | 5 | ✅ Basic | -| 5 | `navi.llm.ollama` + fallback timeout wiring | 3 | ✅ Basic | -| 6 | `webclient/api` | 8 | ✅ Done | -| 6 | `webclient/stores/chat` | 23 | ✅ Done | -| 6 | `webclient/stores/sessions` | 6 | ✅ Done | -| 6 | `webclient/stores/profiles` | 3 | ✅ Done | -| 6 | `webclient/composables/useWebSocket` | 7 | ✅ Done | +The project is covered by backend (`pytest`), terminal-client (`pytest`), and web-client (`Vitest`) tests. Key areas with dedicated tests: -## Coverage roadmap +- **Agent loop & planning**: `tests/unit/core/test_agent*.py`, `test_planning.py`, `test_anti_stall.py`. +- **Sessions, compression, events**: `test_compressor.py`, `test_events.py`, `test_pg_session_store.py`. +- **Tools**: `tests/unit/tools/test_*.py`. +- **Memory**: `tests/unit/memory/test_*.py`. +- **Auth**: `tests/unit/auth/test_*.py`. +- **MCP & recall**: `tests/unit/test_mcp.py`, `tests/integration/test_mcp_integration.py`, `test_recall_tools.py`, `test_scheduler.py`. +- **Terminal client**: `tests/clients/test_*.py`. +- **Web client**: `webclient/tests/unit/**/*.test.js`. -This is the living plan for what still needs tests. Keep it updated whenever a -bug is fixed, a new module is added, or a planned area becomes covered. 
- -Status meanings: -- ✅ Covered enough for current risk -- 🟨 Basic coverage exists, important edge cases remain -- ⬜ Not covered yet -- 🔴 Regression target from a real bug - -### Phase 7 — Recent Regression Coverage - -| Priority | Area | Target tests | Status | -|---|---|---|---| -| P0 | `navi.content_store.ensure_tables()` | creates `session_content`, creates `idx_session_content_file`, is idempotent when index already exists | ✅ | -| P0 | `navi.content_store.publish()` | repeated publish of same `(session_id, filename)` updates one row instead of creating duplicates | ✅ 🔴 | -| P0 | `navi.main` startup | registries are initialized before `_check_embed()` so memory has an embedding backend | ✅ 🔴 | -| P0 | `navi.core.registry._discover_backends()` | primary Ollama backend receives HTTP timeout >= `LLM_COMPLETE_TIMEOUT` and `LLM_STREAM_FIRST_CHUNK_TIMEOUT` | ✅ 🔴 | -| P0 | `navi.llm.fallback.FallbackOllamaBackend` | per-server `OllamaBackend` clients receive the same expanded timeout | ✅ 🔴 | -| P1 | `navi.tools.content_publish` | missing file, directory instead of file, successful publish metadata, filename path stripping | ✅ | -| P1 | `navi.tools.share_file` | duplicate filename collision produces numbered output without overwrite | ✅ | -| P1 | `navi.api.routes.sessions` file endpoints | upload duplicate naming, forbidden extension, download path traversal, content disposition | ✅ | - -### Phase 8 — Agent Loop Behavior - -| Priority | Area | Target tests | Status | -|---|---|---|---| -| P0 | `Agent.run_stream()` planning entry | first user message forces planning; later messages follow `planning_enabled` | ⬜ | -| P0 | plan → todo bridge | numbered plan steps auto-populate todo exactly once per plan | ⬜ | -| P0 | stop handling | stop during stream prefill yields `StreamStopped` and closes LLM generator | 🟨 | -| P1 | subagent forwarding | parent forwards subagent tool events and counts subagent tokens/tool calls | ⬜ | -| P1 | adaptive replan | newly failed todo step 
injects replan prompt on next iteration | ⬜ | -| P1 | anti-stall | repeated tool calls or no todo progress inject warning after threshold | ⬜ | -| P1 | workers | post-turn workers run after `StreamEnd`; worker failure is logged and non-fatal | ⬜ | - -### Phase 9 — Memory And Embeddings - -| Priority | Area | Target tests | Status | -|---|---|---|---| -| P0 | embedding backend wiring | `get_registries()` wires dedicated `EMBEDDING_OLLAMA_HOST` backend into `MemoryStore` | ⬜ | -| P0 | pgvector detection | `_has_pgvector()` true/false paths and caching behavior | ⬜ | -| P1 | embedding generation | invalid/empty/NaN vectors are skipped before PostgreSQL update | 🟨 | -| P1 | backfill | `backfill_embeddings()` batches rows and only updates rows with valid vectors | 🟨 | -| P1 | search | vector search falls back to ILIKE when vector search unavailable or empty | 🟨 | - -### Phase 10 — WebSocket And API Lifecycles - -| Priority | Area | Target tests | Status | -|---|---|---|---| -| P0 | active run guard | duplicate message while a run is active returns `run_already_active` | 🟨 | -| P0 | reconnect replay | reconnect receives missed events and session sync after finished run | 🟨 | -| P1 | stop endpoint | `POST /sessions/{id}/stop` sets stop event and is idempotent | 🟨 | -| P1 | malformed input | oversize images, invalid file refs, and non-string payloads are rejected or sanitized | ⬜ | -| P1 | startup cleanup | session file cleanup task is started once and deletes orphaned dirs | ⬜ | - -### Phase 11 — Frontend - -| Priority | Area | Target tests | Status | -|---|---|---|---| -| P0 | `ContentCard.vue` | renders download links and inline viewer links; handles encoded filenames | ⬜ | -| P0 | streaming chat | auto-scroll and streaming message updates stay reactive | 🟨 | -| P1 | session switching | concurrent session loads cannot overwrite active session with stale response | 🟨 | -| P1 | error surfaces | API/store failures show recoverable UI state, no unhandled rejection | 🟨 | -| 
P1 | file upload UI | upload success/failure, duplicate names, large-file errors | ⬜ | +The detailed coverage roadmap lives in the test directories themselves. Add a regression test whenever a real bug is fixed. ## Running tests diff --git a/docs/tools.md b/docs/tools.md index 0eb5280..07481f0 100644 --- a/docs/tools.md +++ b/docs/tools.md @@ -8,19 +8,38 @@ Registered in `build_default_registries()` as builtins. Never removed on hot-reload. +### MCP tools (external servers) + +MCP tools are not built into Navi directly. They are provided by MCP servers configured in `mcp_servers.d/*.json` and registered at startup by `McpManager`. Each MCP tool name follows the format `mcp__<server>__<tool>`. + +| Server | Tool name | Description | +|---|---|---| +| `navi-web` | `mcp__navi-web__web_search` | Web search (SearXNG primary, DDG fallback, Brave tertiary) | +| `navi-web` | `mcp__navi-web__web_view` | Open a URL in a headless browser and return clean readable text | +| `navi-web` | `mcp__navi-web__http_request` | Raw HTTP request (GET/POST/PUT/PATCH/DELETE) | +| `navi-3d` | `mcp__navi-3d__compile_scad` | Compile an OpenSCAD script into a binary STL file | +| `navi-3d` | `mcp__navi-3d__lint_scad` | Lightweight OpenSCAD source linting before STL compilation | +| `navi-3d` | `mcp__navi-3d__render_stl` | Render preview PNG images from an STL file (up to 3 views) | +| `gnexus-creds` | `mcp__gnexus-creds__search_secrets` | Search personal secrets (UUID id, masked values) | +| `gnexus-creds` | `mcp__gnexus-creds__get_secret` | Get secret metadata and masked fields | +| `gnexus-creds` | `mcp__gnexus-creds__reveal_secret` | Decrypt and return plaintext value (audited) | +| `gnexus-creds` | `mcp__gnexus-creds__create_secret` | Create a new secret with encrypted fields | +| `gnexus-creds` | `mcp__gnexus-creds__update_secret` | Update fields/metadata of an existing secret | +| `gnexus-creds` | `mcp__gnexus-creds__set_secret_status` | Change secret status (actual / outdated / archived) | +| 
`gnexus-creds` | `mcp__gnexus-creds__archive_secret` | Permanently hide secret from MCP queries | + +MCP tools survive `reload_tools` because they are registered as external tools in `ToolRegistry`. + | Tool | Name | Description | |---|---|---| -| `WebSearchTool` | `mcp__navi_web__web_search` | DuckDuckGo search | -| `WebViewTool` | `mcp__navi_web__web_view` | Fetch and render a URL | | `FilesystemTool` | `filesystem` | Read/write/list/copy/grep/diff local files (path restrictions via config) | -| `HttpRequestTool` | `mcp__navi_web__http_request` | Generic HTTP client (GET/POST/etc.) | | `CodeExecTool` | `code_exec` | Execute Python in a subprocess sandbox | | `TerminalTool` | `terminal` | Run shell commands (command allowlist via config) | | `SshExecTool` | `ssh_exec` | SSH exec and SCP file transfer; connection pool keyed by session ID | | `ImageViewTool` | `image_view` | Load image from path/URL → resize to 1024px, convert to JPEG, return base64 for multimodal LLM | | `TodoTool` | `todo` | Per-session task checklist (set/update/read) | | `ScratchpadTool` | `scratchpad` | Per-session named working notes (write/append/read/clear) | -| `ReloadToolsTool` | `reload_tools` | Hot-reload user tools without server restart | +| `ReloadToolsTool` | `reload_tools` | Hot-reload user tools and context providers without server restart | | `ListToolsTool` | `list_tools` | Return the live tool list from registry | | `ToolManualTool` | `tool_manual` | Return manuals/{name}.md or auto-generate from schema | | `MemoryTool` | `memory` | Unified memory tool: save, search, and forget facts | @@ -29,20 +48,9 @@ | `ListProfilesTool` | `list_profiles` | List all available profiles | | `ShareFileTool` | `share_file` | Copy an existing local file into session files and return a download link | | `ContentPublishTool` | `content_publish` | Register an existing session file for inline viewing in chat | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__search_secrets` | Search personal secrets 
(UUID id, masked values) | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__get_secret` | Get secret metadata and masked fields | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__reveal_secret` | Decrypt and return plaintext value (audited) | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__create_secret` | Create a new secret with encrypted fields | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__update_secret` | Update fields/metadata of an existing secret | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__set_secret_status` | Change secret status (actual / outdated / archived) | -| `McpTool` (gnexus-creds) | `mcp__gnexus_creds__archive_secret` | Permanently hide secret from MCP queries | -| `McpTool` (navi-3d) | `mcp__navi_3d__compile_scad` | Compile an OpenSCAD script into a binary STL file | -| `McpTool` (navi-3d) | `mcp__navi_3d__lint_scad` | Lightweight OpenSCAD source linting before STL compilation | -| `McpTool` (navi-3d) | `mcp__navi_3d__render_stl` | Render preview PNG images from an STL file (up to 3 views) | -| `McpTool` (navi-web) | `mcp__navi_web__web_search` | Web search (SearXNG primary, DDG fallback, Brave tertiary) | -| `McpTool` (navi-web) | `mcp__navi_web__web_view` | Open a URL in a headless browser and return clean readable text | -| `McpTool` (navi-web) | `mcp__navi_web__http_request` | Raw HTTP request (GET/POST/PUT/PATCH/DELETE) | | `McpStatusTool` | `mcp_status` | Check connectivity and list tools for configured MCP servers | +| `CreateMcpServerTool` | `create_mcp_server` | Scaffold a new MCP server directory with boilerplate | +| `TestMcpToolTool` | `test_mcp_tool` | Execute a single MCP tool call in isolation for diagnostics | | `ReflectTool` | `reflect` | Self-reflection and analysis | | `ScheduleRecallTool` | `schedule_recall` | Schedule a headless callback for the current session (once/recurring/immediate) | | `ManageRecallTool` | `manage_recall` | Cancel, skip, or list scheduled recalls for the current session | @@ -133,9 
+141,23 @@ --- -## Self-extension via MCP servers +## Self-extension -New capabilities are added as MCP servers using `create_mcp_server`. The server scaffolding includes: +Navi supports two ways to add new capabilities: + +### User tools (simple scripts) + +For small, single-purpose helpers, use `write_tool`. It writes a Python file into `tools/` and hot-reloads it in one call. The new tool is added to `tools/enabled.json` and becomes available in all profiles from the next user message. + +Requirements for a user tool: +- Module-level `name`, `description`, `parameters`. +- `async def execute(params: dict) -> str`. + +The agent should call `tool_manual("write_tool")` before using it. + +### MCP servers (complex integrations) + +For richer integrations that need their own process, dependencies, or state, scaffold an MCP server using `create_mcp_server`. This creates: 1. A directory under `mcp-servers/{name}/`. 2. A `server.py` entrypoint with stdio transport. 3. A config file at `mcp_servers.d/{name}.json`.
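For comparison with the MCP server scaffold, the user-tool contract listed earlier in this hunk (module-level `name`, `description`, `parameters`, plus `async def execute(params: dict) -> str`) might look like the following minimal module. The `word_count` tool itself is hypothetical, and the JSON-Schema shape of `parameters` is an assumption based on the tool schemas shown in `docs/api.md`.

```python
import asyncio

# Module-level metadata required by the user-tool contract.
name = "word_count"  # hypothetical tool name, for illustration only
description = "Count whitespace-separated words in a text string."
parameters = {  # assumed to be a JSON Schema object
    "type": "object",
    "properties": {
        "text": {"type": "string", "description": "Text to count words in"},
    },
    "required": ["text"],
}


async def execute(params: dict) -> str:
    """Return the word count as a string, since tools return str."""
    text = params.get("text", "")
    return str(len(text.split()))


if __name__ == "__main__":
    # Quick local check outside of Navi.
    print(asyncio.run(execute({"text": "hot reload without restart"})))
```

A tool this small fits the `write_tool` path; anything that needs its own process, dependencies, or state is a better match for `create_mcp_server`.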