| 2026-06-12 |
Align payload fields with data_collector PayloadSchema
- Rename 'photos' -> 'images' (data_collector expects images: list[str])
- Add 'price' field matching PayloadSchema price field
- Add 'contact_phone' (the first phone as a string, not a list)
- Add 'address' (combined city+district+street+building string)
- Add 'area', 'rooms', 'floor' per strict schema
- Keep extra fields (price_usd, city_name, etc.) because extra=allow
- Remove 'photos_count' in favor of checking len(images)
Fixes photo ingestion: data_collector previously rejected or ignored 'photos'
because PayloadSchema only recognizes 'images'.
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
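The field changes in the commit above can be sketched roughly as follows. This is a minimal illustration, not the committed code: the function name `align_payload` and the input keys `phones`, `district`, `street`, and `building` are assumptions, and the real PayloadSchema is not reproduced here.

```python
from typing import Any


def align_payload(raw: dict[str, Any]) -> dict[str, Any]:
    """Reshape a raw listing dict along the lines of the commit bullets.

    Hypothetical helper: the input key names are assumed, not taken
    from the repository.
    """
    # Copy first so extra keys (price_usd, city_name, ...) pass through.
    payload = dict(raw)

    # 'photos' becomes 'images'; photos_count goes away because
    # len(payload["images"]) carries the same information.
    payload["images"] = list(payload.pop("photos", None) or [])
    payload.pop("photos_count", None)

    # contact_phone is a single string: the first phone, if any.
    phones = raw.get("phones") or []
    payload["contact_phone"] = phones[0] if phones else None

    # address joins whichever of the four parts are present.
    parts = [raw.get(k) for k in ("city_name", "district", "street", "building")]
    payload["address"] = ", ".join(str(p) for p in parts if p)
    return payload
```

A consumer that needs the photo count can call `len(payload["images"])` instead of reading a separate field.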
|
Improve photo extraction robustness and add logging
- Add _extract_photos() helper to handle both list and dict formats
- _build_photo_urls() now skips non-dict entries gracefully
- Fallback to detail.main_photo when no photos array is present
- Add photos_count to payload for easy verification
- Add crawler warning log when a listing has zero photos after normalization
- Prefer detail photos (usually higher quality), fallback to catalog
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
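The extraction order described in the commit above might look something like this sketch. It is not the repository's `_extract_photos()` or `_build_photo_urls()`: the function name `collect_photo_urls`, the `url` key on each photo entry, and the placement of the `main_photo` fallback before the catalog fallback are all assumptions.

```python
import logging
from typing import Any

log = logging.getLogger("crawler")


def collect_photo_urls(detail: dict[str, Any], catalog: dict[str, Any],
                       listing_id: str) -> list[str]:
    """Hypothetical helper: detail photos first, then fallbacks."""

    def entries(source: dict[str, Any]) -> list[Any]:
        # 'photos' may arrive as a list or as a dict keyed by photo id.
        photos = source.get("photos")
        if isinstance(photos, dict):
            return list(photos.values())
        if isinstance(photos, list):
            return photos
        return []

    def urls(items: list[Any]) -> list[str]:
        # Skip anything that is not a dict with a usable 'url' (assumed key).
        return [e["url"] for e in items if isinstance(e, dict) and e.get("url")]

    result = urls(entries(detail))
    if not result and detail.get("main_photo"):
        result = [detail["main_photo"]]
    if not result:
        result = urls(entries(catalog))
    if not result:
        log.warning("listing %s has zero photos after normalization", listing_id)
    return result
```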
|
Fix city discovery: keep cities with -obl- suffix
- Remove '-obl-' from subpage indicators; cities like brovary-obl-kievskaya
and vasylkov-obl-kievskaya are real settlements with region disambiguation
- Treat a slug as a region only if it starts with 'obl-'; a mid-string '-obl-' no longer counts
- Increases discovered city count from ~307 to ~330
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
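The prefix rule from the commit above reduces to a one-line check. In this sketch the function name `is_region_slug` and the slugs `obl-kievskaya` and `lvov` are illustrative assumptions; only the two `-obl-` city slugs come from the commit message.

```python
def is_region_slug(slug: str) -> bool:
    """A slug names a region only when it starts with 'obl-'."""
    return slug.startswith("obl-")


slugs = [
    "brovary-obl-kievskaya",   # city with region disambiguation: kept
    "vasylkov-obl-kievskaya",  # city with region disambiguation: kept
    "obl-kievskaya",           # hypothetical region slug: dropped
    "lvov",                    # hypothetical plain city slug: kept
]
cities = [s for s in slugs if not is_region_slug(s)]
```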
|
Add dynamic city discovery from DOM.RIA sitemap
- Implement src/discovery.py to extract city slugs from sitemap.xml.gz
- Use heuristics to filter out regions, districts, metro stations, streets
- Merge sitemap data with homepage navigation links for completeness
- Update config.py to remove hardcoded CITY_SLUGS, use FALLBACK_CITY_NAMES
- Update main.py to auto-discover cities when --city not provided
- Add --discover-cities flag to list all discovered cities and exit
- Tested with Kiev, Lviv, Odessa multi-city dry-run
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
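For the sitemap step in the commit above, a minimal sketch of pulling candidate slugs out of a gzipped sitemap could look like the following. It is not the contents of src/discovery.py: the function name `slugs_from_sitemap` and the idea that the slug is the last path segment of each `<loc>` URL are assumptions, and the filtering heuristics are left out.

```python
import gzip
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def slugs_from_sitemap(gz_bytes: bytes) -> set[str]:
    """Return the last path segment of every <loc> URL (assumed slug position)."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    slugs: set[str] = set()
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        path = urlparse((loc.text or "").strip()).path.strip("/")
        if path:  # skip the bare site root
            slugs.add(path.split("/")[-1])
    return slugs
```

Merging with slugs taken from homepage navigation is then a set union, for example `slugs_from_sitemap(data) | homepage_slugs`, where `homepage_slugs` is a hypothetical set built elsewhere.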
|
Implement MVP DOM.RIA parser with curl_cffi scraping
- Add src/ modules: config, session, parser, normalizer, collector, storage, crawler, main
- Implement curl_cffi session with chrome124 impersonation and cookie warmup
- Extract __INITIAL_STATE__ using bracket-counting JSON parser
- Normalize DOM.RIA data into data_collector payload format
- Add SQLite resume storage for deduplication
- Implement rate limiting: 10s between pages/details, 2min pause every 50 pages
- Add CLI entry point with argparse for city/category/operation selection
- Add Dockerfile and requirements.txt
- Update TECH_SPEC.md with v3.0 post-MVP status and empirical data findings
- Update README.md with usage instructions
Co-Authored-By: Claude <noreply@anthropic.com>
Eugene Sukhodolskiy
committed
1 day ago
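The bracket-counting extraction mentioned in the commit above can be illustrated with a short sketch. It is not the code in src/parser: the function name `extract_json_object` is hypothetical, and the sketch assumes the state is a JSON object (so only curly braces are counted) that follows the `__INITIAL_STATE__` marker in the page source.

```python
import json
from typing import Any, Optional


def extract_json_object(text: str, marker: str = "__INITIAL_STATE__") -> Optional[Any]:
    """Find the first balanced {...} after `marker` and parse it as JSON."""
    start = text.find(marker)
    if start == -1:
        return None
    start = text.find("{", start)
    if start == -1:
        return None

    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Braces inside string literals must not affect the depth.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    return None  # unbalanced braces: no complete object found
```

Counting braces rather than using a regular expression keeps the match correct when the state contains nested objects or braces inside strings.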
|