Fix LLM hang: stop button during prefill, context guard, timeouts





Root cause: during prefill (processing input tokens), Ollama emits no
HTTP chunks. The body of the `async for chunk in stream_complete()` loop
does not run until the first chunk arrives, so stop_event is never
checked in the meantime and the Stop button has no effect. The complete()
calls (planning, compression) have the same problem: a blocking await
with no cancellation path.
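
A minimal sketch of the failure mode described above, not code from this
commit. slow_prefill_stream() is a hypothetical stand-in for
stream_complete() that stays silent during "prefill"; the naive consumer
can only look at stop_event once a chunk has arrived.

```python
import asyncio


async def slow_prefill_stream(prefill_seconds):
    # Hypothetical stand-in for stream_complete(): nothing is yielded
    # while the model is still processing the prompt.
    await asyncio.sleep(prefill_seconds)
    yield "first token"


async def naive_consume(stream, stop_event):
    checks = 0
    async for _chunk in stream:
        # This line is reached only after a chunk has been received.
        checks += 1
        if stop_event.is_set():
            break
    return checks


async def demo():
    stop_event = asyncio.Event()
    consumer = asyncio.ensure_future(
        naive_consume(slow_prefill_stream(0.2), stop_event)
    )
    await asyncio.sleep(0.05)
    stop_event.set()  # the user presses Stop during prefill
    await asyncio.sleep(0.05)
    still_running = not consumer.done()  # Stop has had no effect yet
    checks = await consumer
    return still_running, checks
```

Running asyncio.run(demo()) shows the consumer still blocked after Stop
was requested; stop_event is first checked only when prefill ends.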

Fixes:

_iter_stream_guarded() (agent.py, module-level):
  Wraps any stream_complete() generator. While waiting for the next
  chunk, polls stop_event every 1s using asyncio.wait(), so Stop works
  even during multi-minute prefill. On stop or timeout, calls aclose() on
  the generator, which closes the HTTP connection to Ollama → generation
  halts → GPU drops to idle. Applied to both run_stream() and
  run_ephemeral().
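
The wrapper described above might look roughly like the sketch below. It
is an illustration under assumptions, not the code in agent.py: the name
iter_stream_guarded, its parameters, and the StreamStopped exception are
hypothetical. The pending __anext__() call is cancelled before aclose(),
since an async generator cannot be closed while it is still running.

```python
import asyncio


class StreamStopped(Exception):
    """Hypothetical signal that the user pressed Stop."""


async def iter_stream_guarded(gen, stop_event, first_chunk_timeout=180.0,
                              chunk_timeout=60.0, poll_interval=1.0):
    loop = asyncio.get_running_loop()
    budget = first_chunk_timeout  # prefill budget for the first chunk
    try:
        while True:
            next_task = asyncio.ensure_future(gen.__anext__())
            deadline = loop.time() + budget
            try:
                while not next_task.done():
                    if stop_event.is_set():
                        raise StreamStopped()
                    remaining = deadline - loop.time()
                    if remaining <= 0:
                        raise asyncio.TimeoutError()
                    # Wake up at least every poll_interval seconds so
                    # stop_event is checked even when no chunk arrives.
                    await asyncio.wait(
                        {next_task}, timeout=min(poll_interval, remaining)
                    )
            except BaseException:
                # Cancel the pending read and let it settle before
                # the generator is closed in the finally block.
                next_task.cancel()
                await asyncio.wait({next_task})
                raise
            try:
                chunk = next_task.result()
            except StopAsyncIteration:
                return
            yield chunk
            budget = chunk_timeout  # inter-token budget from now on
    finally:
        await gen.aclose()
```

A caller would iterate it like the original stream, for example
`async for chunk in iter_stream_guarded(stream, stop_event): ...`, and
treat StreamStopped or asyncio.TimeoutError as the end of generation.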

_check_context_size() (Agent method):
  Estimates context tokens (chars/4 + 500 per image) before every LLM
  call. Raises ContextTooLargeError (new NaviError subclass) at 92% of
  ollama_num_ctx — before Ollama ever receives the request.
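
A rough sketch of the estimate described above (chars/4 + 500 per image,
limit at 92% of the context window). The message shape (dicts with
"content" and "images" keys), the function names, and the stand-in
NaviError base class are assumptions for illustration; whether the real
check uses >= or > at the threshold is not stated in this commit.

```python
class NaviError(Exception):
    """Stand-in for the project's base exception (assumed shape)."""


class ContextTooLargeError(NaviError):
    """Raised when the estimated context exceeds the allowed share."""


def estimate_tokens(messages):
    # Heuristic from the commit message: 4 characters per token,
    # plus a flat 500 tokens for every attached image.
    chars = sum(len(m.get("content") or "") for m in messages)
    images = sum(len(m.get("images") or []) for m in messages)
    return chars // 4 + 500 * images


def check_context_size(messages, num_ctx, threshold=0.92):
    estimated = estimate_tokens(messages)
    # Assumption: reaching the threshold exactly already counts as too large.
    if estimated >= num_ctx * threshold:
        raise ContextTooLargeError(
            f"estimated {estimated} tokens, limit is "
            f"{threshold:.0%} of {num_ctx}"
        )
    return estimated
```

The check runs locally, so an oversized request fails fast instead of
being sent to Ollama.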

_run_planning() timeouts:
  Both complete() calls (phase 1 and 2) wrapped with asyncio.wait_for().
  Timeout logged and planning skipped gracefully — execution continues.
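
The timeout handling above can be sketched as follows. complete_or_skip()
is a hypothetical helper, not a function from this commit; it shows
asyncio.wait_for() turning a hung call into a logged, skippable step.

```python
import asyncio
import logging

logger = logging.getLogger("navi.planning")  # logger name is an assumption


async def complete_or_skip(awaitable, timeout, what):
    """Await an LLM call; on timeout, log it and return None."""
    try:
        return await asyncio.wait_for(awaitable, timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for() has already cancelled the underlying call here.
        logger.warning("%s timed out after %.0fs, skipping", what, timeout)
        return None
```

A caller would treat None as "no plan" and continue with execution, for
example `plan = await complete_or_skip(llm.complete(prompt), 120, "planning phase 1")`,
where llm.complete() stands for whatever completion call the project uses.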

New config (config.py):
  llm_complete_timeout = 120s
  llm_stream_first_chunk_timeout = 180s  (prefill budget)
  llm_stream_chunk_timeout = 60s         (inter-token budget)
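
For illustration, the three settings could be grouped as below. The field
names and defaults come from this commit message; the dataclass and the
stream_budget() helper are assumptions, since the actual layout of
config.py is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMTimeouts:
    # All values are in seconds.
    llm_complete_timeout: float = 120.0
    llm_stream_first_chunk_timeout: float = 180.0  # prefill budget
    llm_stream_chunk_timeout: float = 60.0         # inter-token budget


def stream_budget(cfg, chunks_received):
    """Pick the wait budget for the next chunk of a stream."""
    if chunks_received == 0:
        return cfg.llm_stream_first_chunk_timeout
    return cfg.llm_stream_chunk_timeout
```

The first chunk gets the larger budget because prefill of a long prompt
can take much longer than the gap between generated tokens.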

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Branches: feature/navi-code, master, vmkdemo

1 parent 8b09439 commit 8c88f4987c923620aae607ba382c9b0c12183b15

Eugene Sukhodolskiy authored on 14 Apr

Showing 3 changed files:
  navi/config.py
  navi/core/agent.py
  navi/exceptions.py