Implement MVP DOM.RIA parser with curl_cffi scraping

- Add src/ modules: config, session, parser, normalizer, collector, storage, crawler, main
- Implement curl_cffi session with chrome124 impersonation and cookie warmup
- Extract __INITIAL_STATE__ using bracket-counting JSON parser
- Normalize DOM.RIA data into data_collector payload format
- Add SQLite resume storage for deduplication
- Implement rate limiting: 10 s between page/detail requests, 2 min pause every 50 pages
- Add CLI entry point with argparse for city/category/operation selection
- Add Dockerfile and requirements.txt
- Update TECH_SPEC.md with v3.0 post-MVP status and empirical data findings
- Update README.md with usage instructions
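
For reference, the bracket-counting extraction of __INITIAL_STATE__ mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code in src/parser.py: the marker string `window.__INITIAL_STATE__` and the function name `extract_initial_state` are assumptions, and the sketch assumes the state is a plain JSON object literal.

```python
import json

# Assumed marker; the real page may assign the state under a different name.
MARKER = "window.__INITIAL_STATE__"


def extract_initial_state(html, marker=MARKER):
    """Return the JSON object that follows `marker` in `html`, or None."""
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None

    depth = 0          # current nesting level of curly braces
    in_string = False  # True while scanning inside a JSON string
    escaped = False    # True right after a backslash inside a string
    for i in range(brace, len(html)):
        ch = html[i]
        if in_string:
            # Braces inside strings must not affect the depth counter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: parse the balanced slice.
                return json.loads(html[brace:i + 1])
    return None  # unbalanced input
```

Tracking string and escape state is what lets the scan skip braces that appear inside string values, which a plain regex or a naive brace counter would miscount.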

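The rate-limiting rule from the list above (10 s between requests, a 2 min pause every 50 pages) could look something like the sketch below. The names `delay_after_page` and `crawl` are hypothetical, and it assumes the long pause replaces the regular delay rather than being added to it; src/crawler.py may do this differently.

```python
import time

PAGE_DELAY_S = 10      # pause between consecutive page/detail requests
LONG_PAUSE_S = 120     # longer pause (2 min)
LONG_PAUSE_EVERY = 50  # take the longer pause after every 50th page


def delay_after_page(pages_done):
    """Seconds to wait after `pages_done` pages have been fetched."""
    if pages_done > 0 and pages_done % LONG_PAUSE_EVERY == 0:
        return LONG_PAUSE_S  # assumption: replaces the regular delay
    return PAGE_DELAY_S


def crawl(pages, fetch, sleep=time.sleep):
    """Fetch each page in order, sleeping between requests.

    `fetch` is a caller-supplied callable; `sleep` is injectable so the
    schedule can be checked without actually waiting.
    """
    for n, page in enumerate(pages, start=1):
        fetch(page)
        sleep(delay_after_page(n))
```

Passing a fake `sleep` makes the schedule testable offline.
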
Co-Authored-By: Claude <noreply@anthropic.com>
Parent: 200f006, commit: a679a818b5b5af7febd750e966f0ae4e53f49635
Eugene Sukhodolskiy authored 1 day ago
Showing 14 changed files
.gitignore 0 → 100644
Dockerfile 0 → 100644
README.md
TECH_SPEC.md 0 → 100644
requirements.txt 0 → 100644
src/__init__.py 0 → 100644
src/collector.py 0 → 100644
src/config.py 0 → 100644
src/crawler.py 0 → 100644
src/main.py 0 → 100644
src/normalizer.py 0 → 100644
src/parser.py 0 → 100644
src/session.py 0 → 100644
src/storage.py 0 → 100644