Backfill structured T1 demand from archived text

Request: - Use archivist to close the 137 T1 ipo_demand source-only gaps using extracted PDF text. Changes: - Add an incremental T1 demand text backfill script. - Parse existing allotment-result extracted text into ipo_demand. - Archive linked Summary PDFs from old HKEX HTML allotment-result pages. - Correct allotment-result selection to prefer primary result announcements over clarification or supplemental notices. - Add robust line-aware allotment parsing and document the workflow in archivist and README. - Record the backfill result in a report. Execution: - Selected 137 source-only T1 demand gaps. - Wrote 137 ipo_demand rows, increasing ipo_demand from 154 to 291 rows. - Archived 38 new HKEX allotment-result PDFs and extracted their text. - Confirmed an incremental rerun selects 0 gaps and writes 0 rows. Verification: - Ran git diff --cached --check. - Ran py_compile for archive_hkex_documents.py and backfill_t1_demand_from_text.py. - Checked SQLite integrity and foreign keys. - Confirmed DB row counts match CSV snapshots. - Verified no T1 complete row is missing ipo_demand. - Verified source_refs paths/files/hashes and PDF extracted-text manifest hashes. Next useful context: - T1 demand structure is complete for listed rows; 06106 and 06675 remain pending_not_due. - T2 grey-market and due price-performance gaps remain separate archivist priorities. - Analyst output should be regenerated before using the new T1 demand facts for scoring.
2026-06-15 13:59:06 +00:00
parent 33d0bc056e
commit 6d05056609
88 changed files with 55785 additions and 2456 deletions
@@ -151,6 +151,18 @@ The extractor is incremental: unchanged PDFs with matching manifest rows are ski

 Do not expect `data/extracted_text/` entries for Yahoo JSON market data or HKEX `.htm`/`.html` notices. Those are already text-like raw evidence files and are tracked under `data/raw/`.

+## T1 Demand Text Backfill
+
+When audit finds T1 rows where an allotment-results source is archived but `ipo_demand` is missing, use the text backfill script:
+
+```bash
+.venv/bin/python scripts/backfill_t1_demand_from_text.py --as-of YYYY-MM-DDTHH:MM:SSZ
+```
+
+The script is incremental. It selects only `T1_allotment` rows that are complete from source evidence but have no `ipo_demand` row. It parses archived PDF extracted text, follows old HKEX HTML allotment-result pages to their linked Summary PDFs, archives those PDFs, extracts their text, writes `ipo_demand`, exports snapshots, and refreshes sync state only when facts or sources changed.
+
+Do not infer missing demand fields. If a Summary PDF gives valid applications and public subscription but omits successful applicants or international subscription level, store the available fields and leave unavailable fields null.
+
 ## Grey-Market Source Policy

 `T2_grey_market` is not an HKEX official disclosure stage. Grey-market trading is broker or third-party OTC activity, so do not bulk archive a grey-market feed unless the source is reproducible and redistribution-safe.