vmk-360-data_collector / .plan.md
@Eugene Sukhodolskiy Eugene Sukhodolskiy 1 day ago 2 KB feat: add PostgreSQL full-text search

Plan: Multipart Image Ingest (Option 1)

Goal

Add a new endpoint POST /api/v1/ingest/with-images that accepts:

  • metadata — JSON string with source_slug, external_id, payload
  • images — 0–N binary image files via multipart/form-data

The parser downloads images from its own source and uploads them directly to our service, so we no longer have to fetch them from third-party CDNs ourselves.

Architecture

1. Endpoint (router_properties.py)

  • metadata: str = Form(...) — validated as JSON, must contain source_slug + payload with title|description
  • images: list[UploadFile] = File(default=[]) — streamed to disk, not held in memory
  • Flow:
    1. Parse & validate metadata
    2. Save raw_parsing_data (status = pending)
    3. Stream each UploadFile to /var/lib/vmk/images/temp/{raw_id}/{idx}{ext}
    4. Inject _uploaded_image_paths into raw.payload
    5. Run await pipeline.process(raw.id) inline in the request handler rather than queueing it, because the images are already local and we want an immediate result
    6. Return IngestResponse
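
Step 1 of the flow (parse and validate metadata) could look roughly like the sketch below. It uses only the standard library; the function name parse_metadata and the error messages are illustrative assumptions, and the rules encoded are just the ones stated above (source_slug present, payload with a title or a description).

```python
import json


def parse_metadata(raw: str) -> dict:
    """Validate the `metadata` form field and return it as a dict.

    Hypothetical helper: raises ValueError so the endpoint can map it
    to whatever HTTP error response it prefers.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"metadata is not valid JSON: {exc.msg}") from exc

    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")

    source_slug = data.get("source_slug")
    if not isinstance(source_slug, str) or not source_slug:
        raise ValueError("metadata.source_slug is required")

    payload = data.get("payload")
    if not isinstance(payload, dict):
        raise ValueError("metadata.payload must be a JSON object")

    # The plan requires at least one of title / description.
    if not (payload.get("title") or payload.get("description")):
        raise ValueError("metadata.payload needs a title or a description")

    return data
```

In the endpoint, a ValueError from this helper would presumably be turned into a 4xx response before anything is written to raw_parsing_data.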

2. Pipeline (property_pipeline.py)

  • _stage_process_images checks raw.payload.get("_uploaded_image_paths")
    • If present → _stage_process_uploaded_images(property_id, paths)
    • If absent → _stage_process_remote_images(property_id, urls) (existing behaviour)
  • New helper _process_uploaded_one:
    • Reads local file → SHA256, width, height via Pillow
    • Moves file from temp/{raw_id}/ to permanent /{property_id}/{hash}.{ext}
    • Creates PropertyImage row with downloaded status
    • Runs AI image analysis via OllamaClient.image_to_base64
  • On successful completion: cleans up temp dir for this raw_id
  • On failure: leaves temp dir for inspection (cleanup later)
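
The hash-and-move part of _process_uploaded_one, plus the temp-dir cleanup, can be sketched with the standard library alone. This is a minimal illustration under assumptions: the function names, the images_root and temp_root parameters, and the choice to keep the original file extension are hypothetical, and reading width and height via Pillow is left out to keep the example dependency-free.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash the file in chunks so large images are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def move_to_permanent(temp_path: Path, images_root: Path, property_id: int) -> Path:
    """Move a temp upload to <images_root>/<property_id>/<sha256><ext>."""
    file_hash = sha256_of_file(temp_path)
    target_dir = images_root / str(property_id)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the extension of the uploaded file is kept, lowercased.
    target = target_dir / f"{file_hash}{temp_path.suffix.lower()}"
    shutil.move(str(temp_path), str(target))
    return target


def cleanup_temp_dir(temp_root: Path, raw_id: int) -> None:
    """Remove temp/<raw_id>/; meant to be called only after a successful run."""
    shutil.rmtree(temp_root / str(raw_id), ignore_errors=True)
```

Naming the file after its content hash means a repeated upload of the same image for the same property lands on the same path, which is one plausible reason for the {hash} naming in the plan.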

3. Image Processing Helper (image_downloader.py)

  • New method ImageDownloader.process_local_file(property_id, temp_path, order):
    • Mirrors download() return type (PropertyImageDownloadResult)
    • No HTTP, just filesystem + Pillow

4. Limits & Validation

  • Max files per request: 50 (configurable)
  • Max file size: 10 MB each (configurable)
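  • FastAPI's UploadFile is backed by a spooled temporary file that rolls over to disk once it grows past an in-memory threshold, so we only copy the stream to its destination.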
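
One way to enforce the per-file size limit while copying is to count bytes as they are streamed and abort once the cap is exceeded. The sketch below is an assumption about how this might be done, not the planned implementation: copy_with_limit, FileTooLarge, and the constant names are hypothetical. The source stream would typically be the file object exposed as upload.file on an UploadFile.

```python
from pathlib import Path
from typing import BinaryIO

MAX_FILES_PER_REQUEST = 50          # plan default, meant to be configurable
MAX_FILE_BYTES = 10 * 1024 * 1024   # 10 MB per file, meant to be configurable


class FileTooLarge(ValueError):
    """Raised when an upload exceeds the configured size limit."""


def copy_with_limit(
    src: BinaryIO,
    dest: Path,
    max_bytes: int = MAX_FILE_BYTES,
    chunk_size: int = 64 * 1024,
) -> int:
    """Stream src to dest in chunks; return the number of bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    try:
        with dest.open("wb") as out:
            while chunk := src.read(chunk_size):
                written += len(chunk)
                if written > max_bytes:
                    raise FileTooLarge(f"{dest.name} exceeds {max_bytes} bytes")
                out.write(chunk)
    except FileTooLarge:
        # Do not leave a truncated file behind in the temp dir.
        dest.unlink(missing_ok=True)
        raise
    return written
```

The file-count limit is a simple length check on the images list before any copying starts, for example rejecting the request when len(images) > MAX_FILES_PER_REQUEST.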

5. README

  • Add curl -F example
  • Add Python requests multipart example
  • Explain when to use /ingest (URLs) vs /ingest/with-images (binary)
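
A Python client example for the README might look like the sketch below. The form field names metadata and images and the endpoint path come from this plan; BASE_URL, the helper names, and the assumption that the response body is IngestResponse serialized as JSON are hypothetical. It relies on the third-party requests library, imported inside the upload function so the metadata helper can be used without it.

```python
import contextlib
import json
import mimetypes
from pathlib import Path

BASE_URL = "http://localhost:8000"  # hypothetical; replace with the real host


def build_metadata(source_slug: str, external_id: str, payload: dict) -> str:
    """Serialize the metadata form field as a JSON string."""
    return json.dumps(
        {"source_slug": source_slug, "external_id": external_id, "payload": payload},
        ensure_ascii=False,
    )


def upload_listing(metadata: str, image_paths: list[Path], base_url: str = BASE_URL) -> dict:
    """POST metadata plus image files as multipart/form-data."""
    import requests  # third-party dependency, imported lazily

    with contextlib.ExitStack() as stack:
        files = []
        for path in image_paths:
            fh = stack.enter_context(path.open("rb"))
            content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
            # Repeating the "images" field name sends several files in one request.
            files.append(("images", (path.name, fh, content_type)))
        response = requests.post(
            f"{base_url}/api/v1/ingest/with-images",
            data={"metadata": metadata},
            files=files,
            timeout=60,
        )
    response.raise_for_status()
    return response.json()
```

A parser would call build_metadata(...) once per listing and pass the result, together with the paths of the images it already downloaded, to upload_listing(...).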

Files to modify

  1. src/vmk_data_collector/api/v1/router_properties.py
  2. src/vmk_data_collector/services/property_pipeline.py
  3. src/vmk_data_collector/services/image_downloader.py
  4. README.md

Why inline pipeline instead of queue?

  • Parser already spent resources downloading images; we should not leave them in temp for an unknown queue delay.
  • Immediate feedback: parser gets property_id, snapshot_id, validation result right away.
  • Simpler state management — no orphaned temp files.

Why not base64 in JSON?

  • Base64 adds roughly 33% size overhead, produces huge JSON payloads, is harder to debug, and raises the risk of timeouts. Multipart is the standard mechanism for file uploads.
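
The 33% figure follows from base64 encoding every 3 input bytes as 4 output characters. A quick check with the standard library, using an arbitrary 3,000,000-byte buffer as a stand-in for an image:

```python
import base64

raw = bytes(3_000_000)             # stand-in for image data
encoded = base64.b64encode(raw)    # 4 output bytes per 3 input bytes

overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} -> {len(encoded)} bytes ({overhead:.1%} larger)")
# prints: 3000000 -> 4000000 bytes (33.3% larger)
```

With up to 50 images of up to 10 MB each, the same ratio applies to the whole request body, before any JSON escaping or parsing cost.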