| 2026-06-12 |
fix: correct DOM.RIA URL slugs for commercial, garages, rent_daily
- kommercheskih → kom-nedvizhimosti (commercial)
- garazhey → garazhei (garages)
- posutochnaya-arenda → posutochnaia-arenda (daily rent)
- Add rooms category (komnat)
- Update normalizer fallback title for new categories
Eugene Sukhodolskiy committed 1 day ago
|
fix: increase ingest timeout to 300s and cap images at 15 per listing
- Prevents inline pipeline timeouts on listings with 40+ photos
- 15 images is enough for AI analysis while keeping request duration reasonable
Eugene Sukhodolskiy committed 1 day ago
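
The image cap from this commit can be sketched as follows. This is only an illustration: the constant and function names are hypothetical and not taken from the repository; only the values 300 and 15 come from the commit message.

```python
# Values from the commit message; the names are hypothetical.
INGEST_TIMEOUT_S = 300   # ingest request timeout, in seconds
MAX_IMAGES = 15          # maximum images sent per listing


def cap_images(images: list[str], limit: int = MAX_IMAGES) -> list[str]:
    """Return at most `limit` image URLs, preserving their original order."""
    return images[:limit]
```

Slicing keeps the first photos in listing order, which is an assumption about which images matter most for analysis.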
|
feat: download images locally and ingest via multipart /with-images endpoint
- Update collector to support ingest_with_images using CurlMime multipart
- Crawler now downloads listing photos in parallel (5 workers) to temp files
- Fix photo URL builder: use photosnew CDN + beautifulUrl (legacy file paths return HTTP 415)
- Add fallback title generation in normalizer for PayloadSchema validation
- Increase ingest timeout to 120s for inline pipeline processing
Closes: photo ingestion via binary upload
Eugene Sukhodolskiy committed 1 day ago
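
A minimal sketch of the parallel download step described above, assuming a thread pool of 5 workers that writes each photo to a temp file. The function name `download_images` and the injected `fetch` callable are hypothetical; the HTTP client is passed in so the sketch stays independent of the real session code, and the multipart upload itself is not shown.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def download_images(
    urls: list[str],
    fetch: Callable[[str], bytes],  # hypothetical: returns the image bytes for a URL
    workers: int = 5,               # worker count named in the commit
) -> list[str]:
    """Download images in parallel and return the temp-file paths that succeeded."""

    def save(url: str) -> Optional[str]:
        try:
            data = fetch(url)
        except Exception:
            return None  # skip photos that fail to download
        fd, path = tempfile.mkstemp(suffix=".jpg")
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(save, urls))  # map() preserves input order
    return [p for p in results if p is not None]
```

The caller is responsible for deleting the temp files once the upload has finished.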
|
Align payload fields with data_collector PayloadSchema
- Rename 'photos' -> 'images' (data_collector expects images: list[str])
- Add 'price' field matching PayloadSchema price field
- Add 'contact_phone' (first phone string, not list)
- Add 'address' (combined city+district+street+building string)
- Add 'area', 'rooms', 'floor' per strict schema
- Keep extra fields (price_usd, city_name, etc.) because extra=allow
- Remove 'photos_count' in favor of checking len(images)
Fixes photo ingestion — previously data_collector rejected/ignored 'photos'
because PayloadSchema only knows 'images'.
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy committed 1 day ago
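
The field mapping listed above might look roughly like this. It is a sketch under assumptions: the output keys come from the commit message, while the input keys (`photos`, `phones`, `city`, `district`, `street`, `building`) and the function name are hypothetical stand-ins for the normalizer's real data.

```python
def build_payload(raw: dict) -> dict:
    """Map a normalized listing (hypothetical input keys) to the payload fields."""
    phones = raw.get("phones") or []
    parts = [raw.get(k) for k in ("city", "district", "street", "building")]
    payload = {
        "images": list(raw.get("photos") or []),         # renamed from 'photos'
        "price": raw.get("price"),
        "contact_phone": phones[0] if phones else None,  # first phone, not a list
        "address": ", ".join(str(p) for p in parts if p),
        "area": raw.get("area"),
        "rooms": raw.get("rooms"),
        "floor": raw.get("floor"),
    }
    # Extra fields are passed through, since the schema is said to allow extras.
    for key in ("price_usd", "city_name"):
        if key in raw:
            payload[key] = raw[key]
    return payload
```

With no separate count field, a consumer checks `len(payload["images"])` instead.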
|
Improve photo extraction robustness and add logging
- Add _extract_photos() helper to handle both list and dict formats
- _build_photo_urls() now skips non-dict entries gracefully
- Fall back to detail.main_photo when no photos array is present
- Add photos_count to payload for easy verification
- Add crawler warning log when a listing has zero photos after normalization
- Prefer detail photos (usually higher quality), fall back to catalog
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy committed 1 day ago
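
One way the extraction order described above could be written is sketched below. The function is a hypothetical illustration, not the repository's `_extract_photos()`; in particular, it assumes `photos` is either a list or a dict keyed by id, and that `main_photo` is a dict-shaped entry.

```python
def extract_photos(detail: dict, catalog: dict) -> list[dict]:
    """Return photo entries, preferring detail data over catalog data."""
    for source in (detail, catalog):      # detail first, then catalog
        photos = source.get("photos")
        if isinstance(photos, dict):      # dict format: keep only the values
            photos = list(photos.values())
        if isinstance(photos, list):
            entries = [p for p in photos if isinstance(p, dict)]  # skip non-dict entries
            if entries:
                return entries
    # Last resort: a single main photo on the detail record (assumed dict-shaped).
    main = detail.get("main_photo")
    return [main] if isinstance(main, dict) else []
```

An empty result is the case where the crawler would log its zero-photos warning.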
|
Fix city discovery: keep cities with -obl- suffix
- Remove '-obl-' from subpage indicators; cities like brovary-obl-kievskaya
and vasylkov-obl-kievskaya are real settlements with region disambiguation
- Only a slug that starts with 'obl-' is treated as a region; '-obl-' in the middle of a slug is not
- Increases discovered city count from ~307 to ~330
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy committed 1 day ago
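
The rule change above amounts to a prefix check instead of a substring check. A minimal sketch, with hypothetical function names and a made-up region slug (`obl-kievskaya`) used only for illustration:

```python
def is_region_slug(slug: str) -> bool:
    """A slug counts as a region only when it starts with 'obl-'."""
    return slug.startswith("obl-")


def keep_city_slugs(slugs: list[str]) -> list[str]:
    """Drop region slugs; keep cities that carry a region suffix for disambiguation."""
    return [s for s in slugs if not is_region_slug(s)]
```

For example, `brovary-obl-kievskaya` is kept, whereas a substring test for `-obl-` would have discarded it.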
|
Add dynamic city discovery from DOM.RIA sitemap
- Implement src/discovery.py to extract city slugs from sitemap.xml.gz
- Use heuristics to filter out regions, districts, metro stations, streets
- Merge sitemap data with homepage navigation links for completeness
- Update config.py to remove hardcoded CITY_SLUGS, use FALLBACK_CITY_NAMES
- Update main.py to auto-discover cities when --city not provided
- Add --discover-cities flag to list all discovered cities and exit
- Tested with a multi-city dry run (Kiev, Lviv, Odessa)
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy committed 1 day ago
|
Implement MVP DOM.RIA parser with curl_cffi scraping
- Add src/ modules: config, session, parser, normalizer, collector, storage, crawler, main
- Implement curl_cffi session with chrome124 impersonation and cookie warmup
- Extract __INITIAL_STATE__ using bracket-counting JSON parser
- Normalize DOM.RIA data into data_collector payload format
- Add SQLite resume storage for deduplication
- Implement rate limiting: 10s between pages/details, 2min pause every 50 pages
- Add CLI entry point with argparse for city/category/operation selection
- Add Dockerfile and requirements.txt
- Update TECH_SPEC.md with v3.0 post-MVP status and empirical data findings
- Update README.md with usage instructions
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy committed 1 day ago
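
The bracket-counting extraction of `__INITIAL_STATE__` mentioned above can be sketched as follows. This is an illustrative version, not the repository's parser: the function name is hypothetical, and it assumes the state is a JSON object literal that follows the marker in the page HTML.

```python
import json
from typing import Optional


def extract_initial_state(html: str, marker: str = "__INITIAL_STATE__") -> Optional[dict]:
    """Find the JSON object after `marker` by counting braces outside strings."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None

    depth = 0
    in_string = False
    escaped = False
    for i in range(brace, len(html)):
        ch = html[i]
        if in_string:
            # Inside a string literal, braces do not count; track escapes only.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:  # matching close of the outermost object
                return json.loads(html[brace : i + 1])
    return None  # unbalanced braces: no complete object found
```

Tracking string and escape state matters because listing text can itself contain braces or quotes, which a naive counter would miscount.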
|