Implement MVP DOM.RIA parser with curl_cffi scraping


root / vmk-360-domria_parser

Implement MVP DOM.RIA parser with curl_cffi scraping

- Add src/ modules: config, session, parser, normalizer, collector, storage, crawler, main
- Implement curl_cffi session with chrome124 impersonation and cookie warmup
- Extract __INITIAL_STATE__ using bracket-counting JSON parser
- Normalize DOM.RIA data into data_collector payload format
- Add SQLite resume storage for deduplication
- Implement rate limiting: 10s between pages/details, 2min pause every 50 pages
- Add CLI entry point with argparse for city/category/operation selection
- Add Dockerfile and requirements.txt
- Update TECH_SPEC.md with v3.0 post-MVP status and empirical data findings
- Update README.md with usage instructions

Co-Authored-By: Claude <noreply@anthropic.com>

master

1 parent 200f006 commit a679a818b5b5af7febd750e966f0ae4e53f49635

Eugene Sukhodolskiy authored on 12 Jun

Showing 14 changed files

.gitignore 0 → 100644

Dockerfile 0 → 100644

README.md

TECH_SPEC.md 0 → 100644

requirements.txt 0 → 100644

src/__init__.py 0 → 100644

src/collector.py 0 → 100644

src/config.py 0 → 100644

src/crawler.py 0 → 100644
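
The commit message describes the rate limiting as 10s between pages/details plus a 2min pause every 50 pages. The diff itself is not shown here, so the following is only a minimal sketch of that policy. The function name `pause_after_page`, the constants, and the choice that the long pause replaces (rather than adds to) the regular delay are all assumptions, not the contents of src/crawler.py.

```python
import time
from typing import Callable

# Values taken from the commit message; the names are hypothetical.
PAGE_DELAY_S = 10        # delay between pages/details
LONG_PAUSE_S = 120       # 2 min pause
LONG_PAUSE_EVERY = 50    # applied every 50 pages


def pause_after_page(pages_done: int,
                     sleep: Callable[[float], None] = time.sleep) -> float:
    """Sleep after a fetched page and return the delay that was used.

    Assumption: on every 50th page the long pause is used instead of
    the regular delay. The real crawler may combine them differently.
    """
    if pages_done > 0 and pages_done % LONG_PAUSE_EVERY == 0:
        delay = LONG_PAUSE_S
    else:
        delay = PAGE_DELAY_S
    sleep(delay)
    return delay
```

Passing `sleep` as a parameter keeps the policy testable without actually waiting.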

src/main.py 0 → 100644
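
The commit message mentions a CLI entry point built on argparse for city/category/operation selection. A rough sketch of what such an entry point could look like follows. The flag names, the default, and the allowed values are hypothetical, since the actual options in src/main.py are not visible on this page.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with city/category/operation options.

    All flag names and values below are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(
        prog="domria_parser",
        description="Collect DOM.RIA listings for one city/category/operation.",
    )
    parser.add_argument("--city", required=True,
                        help="city to collect listings for")
    parser.add_argument("--category", default="apartments",
                        help="listing category (hypothetical default)")
    parser.add_argument("--operation", choices=["sale", "rent"], default="sale",
                        help="operation type (hypothetical choices)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.city, args.category, args.operation)
```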

src/normalizer.py 0 → 100644

src/parser.py 0 → 100644
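
The commit message says `__INITIAL_STATE__` is extracted with a bracket-counting JSON parser. Since the diff is not included here, the following is a sketch of that general technique, not the code in src/parser.py. The function name `extract_initial_state` is hypothetical, and it assumes the state is a JSON object that follows the `__INITIAL_STATE__` marker somewhere in the page HTML.

```python
import json
from typing import Any, Optional


def extract_initial_state(html: str,
                          marker: str = "__INITIAL_STATE__") -> Optional[Any]:
    """Return the JSON object that follows `marker`, or None if not found.

    Assumption: the state is a `{...}` object literal that is valid JSON.
    """
    start = html.find(marker)
    if start == -1:
        return None
    brace = html.find("{", start)
    if brace == -1:
        return None

    depth = 0           # current nesting level of curly braces
    in_string = False   # inside a double-quoted JSON string
    escaped = False     # previous character was a backslash
    for i in range(brace, len(html)):
        ch = html[i]
        if in_string:
            # Braces inside strings must not affect the depth counter.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(html[brace:i + 1])
    return None  # unbalanced braces: the object never closed
```

Counting braces while tracking string state avoids the usual failure of a regex that stops at the first `}` or `;` inside the payload.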

src/session.py 0 → 100644

src/storage.py 0 → 100644
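
The commit message lists SQLite resume storage for deduplication. One common way to do this is a table keyed by listing ID with `INSERT OR IGNORE`, sketched below. The class name `SeenStore`, the table name, and the column name are assumptions and may not match the schema used in src/storage.py.

```python
import sqlite3


class SeenStore:
    """Remember which listing IDs were already processed (hypothetical schema)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS seen (listing_id TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def mark_seen(self, listing_id: str) -> bool:
        """Record the ID; return True if it was new, False if already stored."""
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO seen (listing_id) VALUES (?)",
            (listing_id,),
        )
        self.conn.commit()
        # rowcount is 1 when a row was inserted and 0 when it was ignored.
        return cur.rowcount == 1
```

With a file path instead of `:memory:`, a restarted run can skip listings whose IDs are already in the table.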
