Server Architecture

High-level overview of the Navi backend for client developers. You don't need to modify the server, but understanding it helps build a better client.

Stack

Framework: FastAPI + uvicorn
LLM: Ollama (local), model gemma4:26b-a4b-it-q4_K_M (26B params, 4-bit quant)
Thinking: Model reasoning is enabled and streamed (OLLAMA_THINK=true)
Database: PostgreSQL (primary) or SQLite fallback, via asyncpg / aiosqlite
Context window: 65 536 tokens

Component overview

Browser
  │
  ├── WebSocket /ws/sessions/{id}    ← streaming agent output
  └── REST /sessions/* /agents/*     ← session management

FastAPI (navi/main.py)
  │
  ├── websocket.py  (_AgentRun, subscriber queues, stop endpoint)
  └── routes/       (sessions, agents, messages, health)
       │
       └── Agent (navi/core/agent.py)
            │
            ├── Planning phase  (one non-streaming LLM call before the tool loop)
            ├── Tool-calling loop  (stream_complete, up to 40 iterations)
            │    └── Tool execution  (built-ins + user tools)
            └── Workers  (post-response: context compression, memory extraction)
                 │
                 ├── LLM backend (Ollama)
                 ├── ToolRegistry  (built-ins + user tools from tools/)
                 ├── ProfileRegistry  (loaded from navi/profiles/*/config.json)
                 └── SessionStore  (PostgreSQL or SQLite)
                      └── MemoryStore  (long-term user facts, same DB)

Request lifecycle (WebSocket message)

Client sends {type: "message", content: "..."} over WebSocket.
Server creates _AgentRun, launches the agent task, subscribes a queue.
Agent loads session + profile, runs planning phase (if enabled).
Tool-calling loop:
- LLM streams → emits thinking_delta, then tool calls or text.
- If tool calls: executes each tool, emits tool_started → tool_call.
- If finish_reason == stop: emits stream_end, runs post-turn workers.
Events broadcast to all subscriber queues → forwarded to WebSocket.

Planning phase

When profile.planning_enabled = true (all current profiles), the agent makes an extra non-streaming LLM call before entering the tool loop. It produces a structured step-by-step plan, injects it as an assistant message in context, and emits plan_ready. Simple/direct questions are detected and skip this phase.

Two-buffer session design

Sessions have two separate message lists:

messages — full display history, never compressed. This is what GET /sessions/{id} returns and what the client shows to the user.
context — what the LLM actually sees. When context reaches ~80% of the window, a summarization worker compresses older messages, replacing them with a summary. This does NOT affect messages.

The client should always render from messages (via REST), not try to track context.

Sub-agents

The spawn_agent tool creates a nested agent run. It is synchronous and blocking — by the time tool_call for spawn_agent arrives, the sub-agent has fully completed. Sub-agent tool events are forwarded to the parent's WebSocket stream with is_subagent: true.

Profile system

Profiles live in navi/profiles/<name>/:

config.json — model, temperature, enabled tools list, planning flag
system_prompt.txt — the domain-specific system prompt

The global personality (persona.txt) is prepended to every profile's system prompt. Profile switches take effect on the next LLM call within the same run, and fully on the next user message.

Long-term memory

The agent has a persistent memory system (user facts stored in the database). The memory summary is injected as a system message at the start of each run. This is transparent to the client — no special handling needed.

Context compression

Fires automatically post-response when context_token_count / ollama_num_ctx ≥ 0.80. Emits context_compressed event. The client only needs to display it as an informational notice if desired.