# Agent Loop

Core execution engine. File: `navi/core/agent.py`.

## Entry points

### `run_stream(session_id, user_message)` → `AsyncGenerator[AgentEvent]`
Streaming. Yields `AgentEvent` objects in real time. Used by the WebSocket handler. Runs the planning phase if `profile.planning_enabled = True`.

### `run(session_id, user_message)` → `str`
Non-streaming. Full tool-calling loop, returns final text. No planning phase.

### `run_ephemeral(user_message, profile_id)` → `tuple[str, bool]`
Non-persistent subagent. Temporary in-memory context. Called by `SpawnAgentTool`.

Returns `(result_text, completed_normally)`. `completed_normally` is `False` if the subagent hit the iteration limit or timed out.

`spawn_agent.profile_id` is optional. If omitted, `SpawnAgentTool` resolves the parent session's current profile. If provided, the subagent uses the selected profile's model, `subagent_system_prompt`, planning flags, and tool set. Its tools come from that profile's `subagent_tools`, falling back to `enabled_tools` when `subagent_tools` is empty.

When spawned from a persistent parent session, session-aware tools run under the parent session id so file tools resolve the user's session directory rather than a `subagent_*` directory.

`run_ephemeral` reads the parent session from the DB when `parent_session_id` is provided, so session-aware tools (filesystem, todo, scratchpad) operate on the parent's data.

### ContextVar restoration
`run_ephemeral` saves the parent's `current_session_id`, `current_model`, `current_user_id`, `current_user_role`, and `current_user_info` before starting and restores them in a `finally` block. This prevents background tasks or the next parent iteration from inheriting stale subagent IDs.

---

## Planning phase (`_run_planning`)

Runs before the tool loop when `profile.planning_enabled = True`.

### Phase 1 — Analysis
LLM receives the user request with a classification prompt. Outputs:
- `DIRECT` → skip planning entirely (simple request).
- A structured analysis + `REFLECT: yes/no` → continue to Phase 2 or 3.

### Phase 2 — Structured review (conditional)
Runs only when `planning_phase2_enabled = True` AND Phase 1 outputs `REFLECT: yes`.
One LLM call reviews the Phase 1 analysis and returns four sections:
- **Critic** — wrong assumptions, risks, contradictions, facts to verify
- **Pragmatist** — simpler path, unnecessary steps, better executor choices
- **Detailer** — missing requirements, source files/docs/tools to inspect, validation gaps
- **Plan Adjustments** — concrete changes Phase 3 must apply

The review is embedded into the Phase 3 prompt.

### Phase 3 — Execution plan
LLM produces milestones plus a numbered step list. Each step is assigned an executor:
- `TOOL: tool_name` — single tool call
- `AGENT: profile_id` — bounded 3+ tool-call subtask delegated to a subagent via `spawn_agent`
- `SELF` — handled inline (synthesis, context-dependent action)

Plan depth is adaptive:
- simple: 1-3 steps
- medium: 5-9 steps
- complex or autonomous: 8-15 steps
- hard maximum: 15 steps

**Comma test (enforced in prompt):** if a step description lists multiple things with "and" or commas, each item must be a separate step.

The plan is injected into `session.context` as an assistant message and saved to `session.messages` with `is_plan=True` for UI rendering. The todo list is auto-populated from the plan steps.

---

## Thinking mechanics

All flags live on `AgentProfile` and can be set per-profile in `config.json`.

| Flag | Default | What it does |
|---|---|---|
| `think_enabled` | `true` | Passes `think=True` to LLM on every main-loop call (extended reasoning) |
| `iteration_budget_enabled` | `true` | Injects remaining iteration count into context so model wraps up in time |
| `planning_phase2_enabled` | `false` | Enables Phase 2 structured review (one extra LLM call when Phase 1 outputs `REFLECT: yes`) |
| `goal_anchoring_enabled` | `true` | Injects goal-reminder system message every N iterations |
| `goal_anchoring_interval` | `5` | N for goal anchoring |
| `anti_stall_enabled` | `true` | Detects looping without todo progress and injects a warning |
| `anti_stall_threshold` | `8` | Consecutive iterations without progress before warning fires |
| `step_validation_enabled` | `false` | Blocks marking a todo step `done` without a `validation` field |
| `adaptive_replan_enabled` | `false` | When a step is marked `failed`, queues a re-plan prompt for the next iteration |
| `subagent_planning_enabled` | `false` | Subagents run their own planning phase |

---

## Tool-calling loop

Runs up to `profile.max_iterations` times.

```
Each iteration:
  1. Check stop_event → yield StreamStopped if set
  2. Build context: _build_context() injects iteration budget and goal anchor (if due)
  3. Check anti-stall: if stalled, append warning message to context
  4. Inject queued adaptive re-plan message (if a step failed last iteration)
  5. llm.stream_complete(context, tool_schemas)
     → ThinkingDelta/ThinkingEnd events during reasoning
     → TextDelta events during text generation
  6a. No tool calls → save session, yield StreamEnd, run workers, return
  6b. Tool calls → execute each, yield ToolEvent, append results to context
  7. Update anti-stall counters, detect newly-failed todo steps
  8. Check if profile switched → reload profile + tools
```

### Sub-agent event forwarding
When `spawn_agent` runs a subagent, its events arrive through `current_event_sink`. The parent drains the queue in real time, yielding subagent events marked with `is_subagent=True`.

### Cooperative stop
Stop is signalled via `current_stop_event` (an `asyncio.Event`). Checked before each LLM call, during streaming, and after tool execution. Never use `task.cancel()` — it corrupts WebSocket state.

### Streaming guard wrapper
`run_stream()` wraps the LLM generator with `_iter_stream_guarded()`, which provides two safety layers:

1. **Stop-event polling during prefill.** Ollama emits no chunks during prefill, so a plain `await` on the first token can block for minutes. The wrapper polls `stop_event` every second so the user's Stop button works even during silent prefill.
2. **Hard timeouts.** `first_chunk_timeout` (default 120 s) caps prefill wait time. `chunk_timeout` (default 60 s) caps gaps between subsequent tokens. On timeout the generator is closed, terminating the HTTP connection to Ollama so GPU load drops to idle.

| Env var | Default | Purpose |
|---|---|---|
| `LLM_STREAM_FIRST_CHUNK_TIMEOUT` | `120` | Max seconds to wait for the first token |
| `LLM_STREAM_CHUNK_TIMEOUT` | `60` | Max seconds between tokens after the first |

---

## Workers

Run sequentially after `StreamEnd`. Currently: `CompressionWorker`.

Pre-turn compression also runs at the start of `run_stream()` if `session.context_token_count` exceeds the threshold. See [`sessions.md`](sessions.md).

---

## System prompt construction (`_build_context`)

Every LLM call receives:
1. System message: `persona + "---" + profile.system_prompt` (injected fresh, never stored).
2. Optional memory message: `"## What I remember about the user\n..."`.
3. `session.context` messages (system messages stripped to avoid duplication).

Profile switches and persona changes take effect immediately.

### System prompt caching
The built system prompt string is cached per profile ID in `ContextBuilder` to avoid rebuilding on every turn. The cache is invalidated when the profile is reloaded (e.g. after `switch_profile` or hot-reload). This saves ~1–2 ms per turn for profiles with large system prompts.