Backfill first HKEX IPO document batch

Request:
Start progressively filling detailed information for recent HK IPO targets.

Changes:
- Add scripts/archive_hkex_documents.py to map tickers to HKEXnews stock IDs, select official prospectus and allotment-results PDFs, archive them under data/raw/{ticker}, parse high-confidence T0/T1 facts, export snapshots, and refresh sync state.
- Document the small-batch HKEX document backfill workflow in README.md and the archivist skill.
- Archive prospectus and allotment-results PDFs for 00901, 01081, 01779, 02290, 02553, and 03388.
- Fill T0 details including application dates, expected allotment date, board lot, minimum subscription amount, and offer-share counts for the six tickers.
- Fill T1 allotment-demand details including valid/successful applications, public subscription level, international placees, international subscription level, and final offer-share allocations.
- Refresh source_refs, ipo_master, offering_terms, ipo_demand, ticker_sync_state, and sync_tasks snapshots.

Verification:
- Ran archive_hkex_documents.py in a first small batch and re-ran corrected tickers after parser hardening.
- Parsed project Python scripts with ast.parse.
- Checked SQLite integrity and DB-to-snapshot row counts.
- Verified source_refs paths are repo-relative, source files exist, and SHA-256 hashes match.
- Confirmed batch field completeness for the six processed tickers.
- Ran git diff --check and git diff --cached --check.
- Checked for Python cache and SQLite transient files.

Next useful context:
- This batch added about 55MB of official HKEXnews PDFs.
- Sync state now has 16 complete stages, 1993 pending_due stages, and 42 pending_not_due stages.
- Continue with small --limit batches because HKEXnews title search can include historical or postponed offering documents for the same stock code.
2026-06-15 07:07:46 +00:00
parent c65b20a1c4
commit 993d7b26fa
23 changed files with 4908 additions and 4110 deletions
@@ -109,6 +109,18 @@ The updater archives the HKEXnews XLSX reports under `data/raw/hkex_new_listing_
Rows without an IPO offer price, such as transfers of listing, introductions, or de-SPAC transactions, are skipped by default because they are not ordinary public subscription targets.
## HKEX Document Backfill
Use the HKEX document archiver to progressively fill detailed T0/T1 facts for open sync tasks:
```bash
.venv/bin/python scripts/archive_hkex_documents.py --limit 5 --as-of 2026-06-15T08:30:00Z
```
The archiver maps stock codes to HKEXnews title-search stock IDs, downloads the selected prospectus and allotment-results PDFs under `data/raw/{ticker}/`, records `source_refs`, parses high-confidence T0/T1 fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, and refreshes `sync_tasks`.
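As an illustration of the `source_refs` step, the sketch below shows one way a record with a repo-relative path and a SHA-256 hash could be built for an archived PDF. The function name `build_source_ref` and the record keys `path` and `sha256` are assumptions made for this example, not the actual schema used by `scripts/archive_hkex_documents.py`.

```python
import hashlib
from pathlib import Path


def build_source_ref(repo_root: Path, pdf_path: Path) -> dict:
    """Return a hypothetical source_refs record for one archived PDF."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as fh:
        # Read in 1 MiB chunks so large prospectus PDFs are not loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        # Store the path relative to the repository root, with forward slashes.
        "path": pdf_path.resolve().relative_to(repo_root.resolve()).as_posix(),
        "sha256": digest.hexdigest(),
    }
```

Re-running the same function on an existing file and comparing the result with the stored record is a simple way to confirm that an archived document has not changed.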
Run in small `--limit` batches because prospectus PDFs are large and title-search results can include historical or postponed-offering documents for the same stock code.
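Small batches also make it cheap to check the database after each run. A minimal sketch, assuming the tables live in a SQLite file; the helper name `sqlite_is_intact` and any path passed to it are hypothetical and not part of the archiver:

```python
import sqlite3


def sqlite_is_intact(db_path: str) -> bool:
    """Return True when SQLite's built-in integrity check reports 'ok'."""
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check returns a single row ('ok',) for a healthy file.
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]
```

A batch that leaves this check returning `False` is worth investigating before running the next one.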
## Incremental Archive Sync
The archivist keeps a per-ticker sync ledger so repeated updates can focus on missing stages: