|
Implement MVP DOM.RIA parser with curl_cffi scraping
- Add src/ modules: config, session, parser, normalizer, collector, storage, crawler, main
- Implement curl_cffi session with chrome124 impersonation and cookie warmup
- Extract __INITIAL_STATE__ using bracket-counting JSON parser
- Normalize DOM.RIA data into data_collector payload format
- Add SQLite resume storage for deduplication
- Implement rate limiting: 10s between pages/details, 2min pause every 50 pages
- Add CLI entry point with argparse for city/category/operation selection
- Add Dockerfile and requirements.txt
- Update TECH_SPEC.md with v3.0 post-MVP status and empirical data findings
- Update README.md with usage instructions

Co-Authored-By: Claude <noreply@anthropic.com> |
|---|
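The "bracket-counting JSON parser" mentioned in the commit message could look roughly like the sketch below. It is not the code from src/parser.py, which is not shown here; the function name `extract_initial_state` and the assumption that the state is a plain JSON object literal following the `__INITIAL_STATE__` marker are illustrative only.

```python
import json
from typing import Any, Optional


def extract_initial_state(html: str, marker: str = "__INITIAL_STATE__") -> Optional[Any]:
    """Return the JSON object that follows `marker` in `html`, or None.

    Hypothetical helper: scans from the first "{" after the marker and
    counts braces until the object closes, skipping braces inside strings.
    """
    start = html.find(marker)
    if start == -1:
        return None
    start = html.find("{", start)
    if start == -1:
        return None

    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(html)):
        ch = html[i]
        if in_string:
            # Inside a JSON string: only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Balanced object found; let the json module validate it.
                return json.loads(html[start : i + 1])
    return None  # Unbalanced braces: the page was truncated or malformed.
```

Counting braces instead of using a regular expression lets the extraction stop at the exact end of the object even when the script tag contains further statements, and tracking string state avoids being thrown off by `{` or `}` inside listing text.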
| .gitignore 0 → 100644 |
|---|
| Dockerfile 0 → 100644 |
|---|
| README.md |
|---|
| TECH_SPEC.md 0 → 100644 |
|---|
| requirements.txt 0 → 100644 |
|---|
| src/__init__.py 0 → 100644 |
|---|
| src/collector.py 0 → 100644 |
|---|
| src/config.py 0 → 100644 |
|---|
| src/crawler.py 0 → 100644 |
|---|
| src/main.py 0 → 100644 |
|---|
| src/normalizer.py 0 → 100644 |
|---|
| src/parser.py 0 → 100644 |
|---|
| src/session.py 0 → 100644 |
|---|
| src/storage.py 0 → 100644 |
|---|
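The rate-limiting rule from the commit message (10 s between pages/details, a 2 min pause every 50 pages) can be sketched as follows. This is an assumption-laden illustration, not the contents of src/crawler.py: the class name `RateLimiter`, the constant names, and the choice that the long pause replaces (rather than adds to) the regular delay on every 50th page are all hypothetical.

```python
import time
from typing import Callable

PAGE_DELAY_S = 10.0      # delay between consecutive page/detail requests
LONG_PAUSE_S = 120.0     # longer pause taken once per batch
PAGES_PER_BATCH = 50     # number of pages between long pauses


class RateLimiter:
    """Hypothetical helper that sleeps after each fetched page."""

    def __init__(self, sleep: Callable[[float], None] = time.sleep) -> None:
        # The sleep function is injectable so the logic can be tested
        # without actually waiting.
        self._sleep = sleep
        self._pages = 0

    def wait_after_page(self) -> float:
        """Sleep for the appropriate delay and return its length in seconds."""
        self._pages += 1
        # Assumption: on every 50th page the long pause is used instead of
        # the regular delay, not in addition to it.
        if self._pages % PAGES_PER_BATCH == 0:
            delay = LONG_PAUSE_S
        else:
            delay = PAGE_DELAY_S
        self._sleep(delay)
        return delay
```

A crawler loop would call `wait_after_page()` once after each request; passing a recording function as `sleep` makes the schedule easy to verify in tests.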