| 2026-06-12 |
fix: clarify that only field values are translated, not keys
Eugene Sukhodolskiy
committed
1 day ago
|
feat: translate normalization and embeddings to Ukrainian
Eugene Sukhodolskiy
committed
1 day ago
|
feat: add PostgreSQL full-text search
- Add generated search_vector tsvector column with immutable wrapper
to_tsvector_simple() for mixed ru/ua text
- Add GIN index ix_property_listings_search_vector_gin
- Add PropertyRepository.search_fulltext() using plainto_tsquery(simple)
and ts_rank_cd() with optional filters
- Add POST /api/v1/search/fulltext endpoint (120/min rate limit)
- Add FulltextSearchRequest/Result/Response schemas
- Update alembic.ini to use Docker PostgreSQL on port 5433
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
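
Note: the full-text query described in this commit could be assembled roughly as follows. This is a minimal sketch, not the repository's code: the table name `property_listings` is inferred from the index name, and the `id`, `city` and `deal_type` columns, the `build_fulltext_query` helper and the named bind parameters are all assumptions. Only `plainto_tsquery`, `ts_rank_cd` and the `@@` match operator are standard PostgreSQL.

```python
from typing import Any, Optional


def build_fulltext_query(
    query: str,
    city: Optional[str] = None,
    deal_type: Optional[str] = None,
    limit: int = 20,
) -> tuple[str, dict[str, Any]]:
    """Build a parameterized full-text search statement and its bind values."""
    # Table and column names are guesses based on the index name in the commit.
    where = ["search_vector @@ plainto_tsquery('simple', :q)"]
    params: dict[str, Any] = {"q": query, "limit": limit}

    # Optional filters are appended only when the caller supplies them.
    if city is not None:
        where.append("city = :city")
        params["city"] = city
    if deal_type is not None:
        where.append("deal_type = :deal_type")
        params["deal_type"] = deal_type

    sql = (
        "SELECT id, "
        "ts_rank_cd(search_vector, plainto_tsquery('simple', :q)) AS rank "
        "FROM property_listings "
        f"WHERE {' AND '.join(where)} "
        "ORDER BY rank DESC "
        "LIMIT :limit"
    )
    return sql, params
```

User input only ever travels through bind parameters here; the SQL text itself is built from fixed fragments.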
|
feat: add pgvector semantic search
- Add pgvector dependency and Alembic migration (vector extension, embedding
column, HNSW index with cosine ops)
- Add nomic-embed-text embedding model to config
- Add OllamaClient.embed() method for /api/embed endpoint
- Add embedding generation stage to PropertyPipeline (_stage_embed)
- Add PropertyRepository.update_embedding() and search_similar() with
cosine distance + optional filters (deal_type, city, price range)
- Add POST /api/v1/search/similar endpoint with query embedding + filters
- Add SimilarSearchRequest/Response schemas
- Add backfill script for existing listings
- Update docker-compose.yml to pgvector/pgvector:pg16 image
- Update .env to use Docker PostgreSQL on port 5433
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
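
Note: for readers unfamiliar with the metric, cosine distance is one minus cosine similarity. The sketch below computes it in plain Python and ranks a few in-memory vectors by it. It only illustrates the ordering that a cosine-distance search produces; the helper names are hypothetical, and the actual search runs inside PostgreSQL through the HNSW index, which is approximate.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float],
    candidates: dict[str, Sequence[float]],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Exhaustively score candidates and return the closest ones first."""
    scored = [(key, cosine_distance(query, vec)) for key, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance means more similar
    return scored[:top_k]
```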
|
docs: add parser integration guide and link in README
- Create docs/PARSER_INTEGRATION.md with:
- Endpoint selection guide (/ingest vs /with-images)
- Full Python integration example with VmkCollectorClient
- Response handling (completed, invalid, failed)
- Retry logic and rate limits
- FAQ for parser developers
- Add link to README in API for parsers section
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
|
feat: multipart/form-data image upload endpoint with inline pipeline
- Add POST /api/v1/ingest/with-images accepting metadata (JSON Form) + images (UploadFile list)
- Stream images to temp storage and run pipeline inline with fresh AsyncSessionLocal
- Mark raw status=processing immediately to prevent queue worker race condition
- Add process_local_file() to ImageDownloader for handling already-downloaded images
- Split pipeline image processing: _stage_process_uploaded_images vs _stage_process_remote_images
- Process uploaded images sequentially to avoid SQLAlchemy concurrent flush errors
- Commit pipeline session explicitly after inline processing
- Clean up temp files in finally block regardless of pipeline outcome
- Add python-multipart dependency for FastAPI multipart parsing
- Update README with multipart endpoint docs and examples
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
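
Note: the "store to temp, process sequentially, clean up in finally" flow from this commit can be sketched without FastAPI. Everything named here (`process_uploads`, the `handle` callback, the `ingest_` prefix) is hypothetical; the real endpoint works with UploadFile objects and an async pipeline.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Iterable


def process_uploads(
    uploads: Iterable[tuple[str, bytes]],
    handle: Callable[[Path], str],
) -> list[str]:
    """Write each upload to a temp directory, process it, then always clean up."""
    tmp_dir = Path(tempfile.mkdtemp(prefix="ingest_"))
    try:
        results: list[str] = []
        for index, (name, data) in enumerate(uploads):
            # Keep only the final path component so a crafted filename cannot
            # escape tmp_dir; the index prefix avoids name collisions.
            target = tmp_dir / f"{index}_{Path(name).name}"
            target.write_bytes(data)
            # Files are handled one at a time, mirroring the sequential
            # processing the commit describes.
            results.append(handle(target))
        return results
    finally:
        # Runs whether handle() succeeded or raised.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```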
|
fix: use configured model name in ai_enricher instead of hardcoded llama3.2
- Remove hardcoded 'model_version': 'llama3.2' from system prompt
- Remove hardcoded 'llama3.2-mock' from mock response
- Inject actual settings.ollama_text_model into enrichment result programmatically
- Ensures DB records reflect the real model used (gemma4:e2b-it-q4_K_M)
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
|
fix: pipeline type coercion — bool, datetime, enum mappings
- Add _to_bool() for boolean fields (has_balcony, has_loggia, etc.)
- Add _to_datetime() for publish_date / archived_at
- Add _to_enum() with Russian→English mappings for all DB enums:
building_type, renovation_status, deal_type, layout,
bathroom_type, parking_type, heating_type, window_view,
metro_distance_type, listing_status
- Change pipeline except blocks to re-raise exceptions
so worker handles rollback + _mark_failed properly
- Fixes stuck 'processing' jobs caused by SQL type errors
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
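
Note: a minimal sketch of what the three coercion helpers might look like. The function bodies, the accepted truthy/falsy strings, the `DealType` values and the Russian aliases are all assumptions made for illustration; the commit only names the helpers and the enums they cover.

```python
from datetime import datetime
from enum import Enum
from typing import Any, Mapping, Optional, Type, TypeVar

E = TypeVar("E", bound=Enum)


class DealType(Enum):
    # Hypothetical values; the real DB enum may differ.
    SALE = "sale"
    RENT = "rent"


# Hypothetical Russian -> English aliases for one enum.
DEAL_TYPE_ALIASES = {"продажа": "sale", "аренда": "rent"}

_TRUE = {"true", "1", "yes", "да", "есть"}
_FALSE = {"false", "0", "no", "нет"}


def to_bool(value: Any) -> Optional[bool]:
    """Coerce loose AI output to bool; unknown input becomes None."""
    if value is None or isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return bool(value)
    text = str(value).strip().lower()
    if text in _TRUE:
        return True
    if text in _FALSE:
        return False
    return None


def to_datetime(value: Any) -> Optional[datetime]:
    """Parse an ISO 8601 string; unparseable input becomes None."""
    if value is None or isinstance(value, datetime):
        return value
    try:
        return datetime.fromisoformat(str(value).strip())
    except ValueError:
        return None


def to_enum(value: Any, enum_cls: Type[E], aliases: Mapping[str, str]) -> Optional[E]:
    """Map a raw or Russian label onto an enum member, or None if unknown."""
    if value is None:
        return None
    if isinstance(value, enum_cls):
        return value
    text = str(value).strip().lower()
    try:
        return enum_cls(aliases.get(text, text))
    except ValueError:
        return None
```

Returning None for unrecognized input, rather than raising, is one possible design; it keeps a single odd field from failing the whole record at insert time.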
|
docs: update README with Docker guide and parser API reference
- Add Docker quick-start instructions
- Add manual dev setup instructions
- Document POST /api/v1/ingest endpoint for parsers
- Add payload field reference table
- Add response codes and processing statuses
- Add curl examples
- Add configuration reference table
- Add architecture diagram
- Add logging/monitoring section
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
|
feat: dockerize app, add structured logging, fix rate limiter
Changes:
- Add Dockerfile with python:3.12-slim, alembic migrations, uvicorn
- Add .dockerignore
- Update docker-compose.yml with app service, port 8020, external Ollama
- Configure alembic/env.py to read DATABASE_URL from env
- Update .env.example with port 8020, Ollama host 192.168.1.75, gemma4 model
- Fix slowapi rate limiter: sync key_func instead of async
- Add structured JSON logging (structlog) to ingest endpoint, pipeline stages
- Fix logging output via logging.basicConfig for Docker stdout
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
|
fix: code review critical and high issues
- tenacity+structlog: replace before_sleep_log with structlog-compatible lambda
to prevent TypeError on retry
- NormalizedProperty: filter AI response dict by allowed dataclass fields before
unpacking to avoid TypeError on unknown keys
- property_pipeline: remove duplicate update_status(failed) from _stage_normalize
- security: add URL validator (SSRF protection) for ImageDownloader and archive-check
- ai prompts: replace raw <user_data> tags with JSON-serialized payload
to mitigate prompt injection
- queue_worker: wrap _process_one in try/except so DB errors don't kill the loop
- image processing: parallelize with asyncio.gather + Semaphore(3)
- ai services: unify OllamaFatalError handling — all propagate instead of swallow
- router_properties: catch only pydantic.ValidationError/ValueError in ingest,
let infrastructure errors return 500
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
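
Note: the NormalizedProperty fix above (filtering the AI response by allowed dataclass fields before unpacking) can be illustrated like this. The field list is a hypothetical subset and `build_normalized` is an invented name; only the `dataclasses.fields` technique is the point.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class NormalizedProperty:
    # Hypothetical subset of fields, for illustration only.
    title: str
    price: Optional[float] = None
    city: Optional[str] = None


def build_normalized(raw: dict[str, Any]) -> NormalizedProperty:
    """Drop keys the dataclass does not declare, then unpack the rest."""
    allowed = {f.name for f in fields(NormalizedProperty)}
    filtered = {key: value for key, value in raw.items() if key in allowed}
    # Without the filter, an extra key such as "confidence" would raise
    # TypeError: __init__() got an unexpected keyword argument.
    return NormalizedProperty(**filtered)
```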
|
feat: background queue worker for async pipeline processing
- Endpoint /ingest now only validates payload, creates raw_data with
status=pending, commits and returns 202 (no longer blocks on AI).
- QueueWorker polls DB for pending jobs every 1s, grabs one with
FOR UPDATE SKIP LOCKED, marks it processing, runs PropertyPipeline.
- PipelineFactory extracted from deps.py for reuse by both HTTP deps
and the background worker.
- Lifespan starts QueueWorker as asyncio.Task; on shutdown signals
stop_event, awaits worker termination (60s timeout) before closing
Ollama client and active_jobs.
- Worker checks pipeline result status and logs completed/invalid/failed
appropriately. Unhandled exceptions mark raw_data failed explicitly.
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
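
Note: the polling loop and stop_event handshake described in this commit might be shaped roughly as below. This is a sketch under assumptions: `fetch_job` stands in for the SELECT ... FOR UPDATE SKIP LOCKED query, `process` for PropertyPipeline, and `on_failure` for marking raw_data failed. None of these callables are the project's real interfaces.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def worker_loop(
    fetch_job: Callable[[], Awaitable[Optional[str]]],
    process: Callable[[str], Awaitable[None]],
    on_failure: Callable[[str, Exception], Awaitable[None]],
    stop_event: asyncio.Event,
    poll_interval: float = 1.0,
) -> None:
    """Poll for jobs until stop_event is set; one bad job never kills the loop."""
    while not stop_event.is_set():
        job = await fetch_job()
        if job is None:
            # Idle: sleep for poll_interval, but wake early on shutdown.
            try:
                await asyncio.wait_for(stop_event.wait(), timeout=poll_interval)
            except asyncio.TimeoutError:
                pass
            continue
        try:
            await process(job)
        except Exception as exc:
            await on_failure(job, exc)


async def demo() -> tuple[list[str], list[str]]:
    """Run the loop against in-memory stand-ins and return (done, failed)."""
    pending = ["job-1", "bad-job", "job-2"]
    done: list[str] = []
    failed: list[str] = []
    stop_event = asyncio.Event()

    async def fetch_job() -> Optional[str]:
        return pending.pop(0) if pending else None

    async def process(job: str) -> None:
        if job.startswith("bad"):
            raise RuntimeError("simulated pipeline error")
        done.append(job)
        if not pending:
            stop_event.set()  # simulate the shutdown signal

    async def on_failure(job: str, exc: Exception) -> None:
        failed.append(job)

    task = asyncio.create_task(
        worker_loop(fetch_job, process, on_failure, stop_event, poll_interval=0.01)
    )
    # Mirrors the bounded wait on shutdown (the commit mentions 60s).
    await asyncio.wait_for(task, timeout=5)
    return done, failed
```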
|
feat: classify AI errors — ValidationError→invalid, fatal/network→failed
Item 6 from REVIEW_FOLLOWUP:
- ai_normalizer: pydantic.ValidationError returns is_real_estate=False (invalid)
OllamaFatalError / unexpected exceptions now raise → pipeline status=failed
- ai_enricher: ValidationError returns None (skip enrichment), fatal returns None,
unexpected raise → pipeline status=failed
- pipeline._stage_enrich: propagate OllamaRetryableError so pipeline can mark failed
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
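
Note: the normalizer-side rule from this commit (validation problems become invalid, everything else becomes failed) reduces to an ordering of except clauses. The exception classes below are local stand-ins; in particular `SchemaValidationError` substitutes for pydantic.ValidationError so the sketch runs without dependencies, and `classify_normalizer_outcome` is an invented name.

```python
from typing import Any, Callable


class SchemaValidationError(ValueError):
    """Stand-in for pydantic.ValidationError."""


class OllamaRetryableError(Exception):
    """Stand-in: transient network or timeout problem."""


class OllamaFatalError(Exception):
    """Stand-in: non-recoverable model or API problem."""


def classify_normalizer_outcome(call: Callable[[], Any]) -> str:
    """Run one normalization attempt and map the result to a pipeline status."""
    try:
        call()
    except SchemaValidationError:
        # The model answered, but the answer does not fit the schema:
        # treat the listing as invalid rather than as an infrastructure failure.
        return "invalid"
    except Exception:
        # Fatal, network and unexpected errors all end up as failed.
        return "failed"
    return "completed"
```

The specific clause has to come first: `SchemaValidationError` is itself an `Exception`, so reversing the order would classify everything as failed.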
|
| 2026-06-11 |
feat: implement review items 8-14
- Soft-delete/archive for listings: archived_at column + archive-check endpoint
- Rate limiting on /ingest: slowapi with 60/minute per source_slug
- Prometheus metrics: /metrics endpoint + custom counters/histograms
- Graceful shutdown: track active jobs in app.state, wait up to 30s
- Prompt injection protection: wrap user data in <user_data> XML tags
- Image download size limit: 50MB max with httpx streaming
- Raw data cleanup: admin endpoint to delete completed raw data older than N days
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
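
Note: the image size limit from this commit relies on counting bytes while streaming instead of trusting a declared length. A minimal sketch, assuming the limit means 50 * 1024 * 1024 bytes (the commit only says 50MB); `chunks` stands in for whatever the HTTP client yields while streaming, and `read_limited` and `ImageTooLargeError` are hypothetical names.

```python
from typing import Iterable

MAX_IMAGE_BYTES = 50 * 1024 * 1024  # assumed interpretation of "50MB"


class ImageTooLargeError(Exception):
    """Raised as soon as the running total exceeds the limit."""


def read_limited(chunks: Iterable[bytes], max_bytes: int = MAX_IMAGE_BYTES) -> bytes:
    """Accumulate streamed chunks, aborting before the buffer passes max_bytes."""
    buffer = bytearray()
    for chunk in chunks:
        if len(buffer) + len(chunk) > max_bytes:
            raise ImageTooLargeError(f"image exceeds {max_bytes} bytes")
        buffer.extend(chunk)
    return bytes(buffer)
```

Because the check happens per chunk, an oversized download is abandoned early rather than after the whole body has been read into memory.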
|
feat: implement review items 1-7
- Decompose PropertyPipeline into 8 explicit stages with PipelineContext
- Add tenacity retry (3 attempts, exponential backoff) to OllamaClient and ImageDownloader
- Add simple in-memory circuit breaker for Ollama calls
- Resize images to 1024px before base64 encoding for vision model
- Add /health endpoint (DB, Ollama, disk checks)
- Add DB performance indexes + alembic migration
- Classify AI errors: OllamaRetryableError vs OllamaFatalError
- Add strict PayloadSchema validation for ingest endpoint
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
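
Note: a simple in-memory circuit breaker of the kind mentioned in this commit could look like the following. The thresholds, the half-open behaviour and all names are assumptions for illustration; the commit does not describe how the project's breaker is parameterized.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests need not sleep
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        """Run func unless the circuit is open and the timeout has not elapsed."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                # Open (or re-open) the circuit and restart the timeout.
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```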
|
feat: core pipeline + FastAPI API (Phases 0-6)
Implemented the full VMK Data Collector foundation and processing pipeline:
- Config, logging, exception hierarchy
- DB models (listings, raw data, images, snapshots, enrichments, custom fields)
- Alembic async migrations
- Repositories with upsert/snapshot support
- Domain entities and Pydantic schemas
- Ollama AI client with mock mode
- AI normalizer, image analyzer, enricher
- Image downloader with SHA-256 dedup
- PropertyPipeline: raw -> AI validate -> upsert/snapshot -> images -> enrich
- FastAPI app with /api/v1/ingest endpoint
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
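
Note: SHA-256 deduplication, as listed for the image downloader, amounts to keying stored images by the digest of their bytes. A minimal in-memory sketch follows; `ImageStore` and its `add` method are hypothetical, and the real downloader presumably records hashes in the database instead of a dict.

```python
import hashlib


class ImageStore:
    """Tracks which image contents have already been seen, by SHA-256 digest."""

    def __init__(self) -> None:
        self._by_hash: dict[str, str] = {}

    def add(self, url: str, content: bytes) -> tuple[str, bool]:
        """Return (digest, is_new); identical bytes from another URL are not new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self._by_hash
        if is_new:
            # Remember the first URL this content was seen at.
            self._by_hash[digest] = url
        return digest, is_new
```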
|
|
|