# Server Architecture

High-level overview of the Navi backend for client developers. You don't need to modify the server, but understanding it helps build a better client.

## Stack

- **Framework**: FastAPI + uvicorn
- **LLM**: Ollama (local), model `gemma4:26b-a4b-it-q4_K_M` (26B params, 4-bit quant)
- **Thinking**: Model reasoning is enabled and streamed (`OLLAMA_THINK=true`)
- **Database**: PostgreSQL via asyncpg
- **Context window**: 65 536 tokens

## Component overview

```
Browser
  │
  ├── WebSocket /ws/sessions/{id}    ← streaming agent output
  └── REST /sessions/* /agents/*     ← session management

FastAPI (navi/main.py)
  │
  ├── websocket.py  (_AgentRun, subscriber queues, stop endpoint)
  └── routes/       (sessions, agents, messages, health)
       │
       └── Agent (navi/core/agent.py)
            │
            ├── Planning phase  (one non-streaming LLM call before the tool loop)
            ├── Tool-calling loop  (stream_complete, up to 40 iterations)
            │    └── Tool execution  (built-ins + user tools)
            └── Workers  (post-response: context compression, memory extraction)
                 │
                 ├── LLM backend (Ollama)
                 ├── ToolRegistry  (built-ins + user tools from tools/)
                 ├── ProfileRegistry  (loaded from navi/profiles/*/config.json)
                 └── SessionStore  (PostgreSQL)
                      └── MemoryStore  (long-term user facts, same DB)
```

## Request lifecycle (WebSocket message)

1. Client sends `{type: "message", content: "..."}` over WebSocket.
2. Server creates `_AgentRun`, launches the agent task, subscribes a queue.
3. Agent loads session + profile, runs planning phase (if enabled).
4. Tool-calling loop:
   - LLM streams → emits `thinking_delta`, then tool calls or text.
   - If tool calls: executes each tool, emits `tool_started` → `tool_call`.
   - If `finish_reason == stop`: emits `stream_end`, runs post-turn workers.
5. Events broadcast to all subscriber queues → forwarded to WebSocket.

## Planning phase

When `profile.planning_enabled = true` (all current profiles), the agent makes an extra non-streaming LLM call before entering the tool loop. It produces a structured step-by-step plan, injects it as an assistant message in context, and emits `plan_ready`. Simple/direct questions are detected and skip this phase.

## Two-buffer session design

Sessions have two separate message lists:

- **`messages`** — full display history, **never compressed**. This is what `GET /sessions/{id}` returns and what the client shows to the user.
- **`context`** — what the LLM actually sees. When context reaches ~80% of the window, a summarization worker compresses older messages, replacing them with a summary. This does NOT affect `messages`.

The client should always render from `messages` (via REST), not try to track context.

## Sub-agents

The `spawn_agent` tool creates a nested agent run. It is **synchronous and blocking** — by the time `tool_call` for `spawn_agent` arrives, the sub-agent has fully completed. Sub-agent tool events are forwarded to the parent's WebSocket stream with `is_subagent: true`.

## Profile system

Profiles live in `navi/profiles/<name>/`:
- `config.json` — model, temperature, enabled tools list, planning flag
- `system_prompt.txt` — the domain-specific system prompt

The global personality (`persona.txt`) is prepended to every profile's system prompt. Profile switches take effect on the next LLM call within the same run, and fully on the next user message.

## Long-term memory

The agent has a persistent memory system (user facts stored in the database). The memory summary is injected as a system message at the start of each run. This is transparent to the client — no special handling needed.

## Context compression

Fires automatically post-response when `context_token_count / ollama_num_ctx ≥ 0.80`. Emits `context_compressed` event. The client only needs to display it as an informational notice if desired.
