
Backfill structured T1 demand from archived text

Request:
- Use archivist to close the 137 T1 ipo_demand source-only gaps using extracted PDF text.

Changes:
- Add an incremental T1 demand text backfill script.
- Parse existing allotment-result extracted text into ipo_demand.
- Archive linked Summary PDFs from old HKEX HTML allotment-result pages.
- Correct allotment-result selection to prefer primary result announcements over clarification or supplemental notices.
- Add robust line-aware allotment parsing and document the workflow in archivist and README.
- Record the backfill result in a report.

Execution:
- Selected 137 source-only T1 demand gaps.
- Wrote 137 ipo_demand rows, increasing ipo_demand from 154 to 291 rows.
- Archived 38 new HKEX allotment-result PDFs and extracted their text.
- Confirmed an incremental rerun selects 0 gaps and writes 0 rows.

Verification:
- Ran git diff --cached --check.
- Ran py_compile for archive_hkex_documents.py and backfill_t1_demand_from_text.py.
- Checked SQLite integrity and foreign keys.
- Confirmed DB row counts match CSV snapshots.
- Verified no T1 complete row is missing ipo_demand.
- Verified source_refs paths/files/hashes and PDF extracted-text manifest hashes.

Next useful context:
- T1 demand structure is complete for listed rows; 06106 and 06675 remain pending_not_due.
- T2 grey-market and due price-performance gaps remain separate archivist priorities.
- Analyst output should be regenerated before using the new T1 demand facts for scoring.
2026-06-15 13:59:06 +00:00
parent 33d0bc056e
commit 6d05056609
88 changed files with 55785 additions and 2456 deletions
@@ -0,0 +1,50 @@
# T1 Demand Text Backfill
Run date: 2026-06-15
Archive mode: `t1_demand_text_backfill`
Target: T1 allotment rows with archived allotment-results sources but missing `ipo_demand`
## Result
The T1 source-only demand gap was closed.
- Initial source-only T1 demand gaps: 137.
- `ipo_demand` rows before backfill: 154.
- `ipo_demand` rows after backfill: 291.
- T1 complete rows without `ipo_demand`: 0.
- T1 pending-not-due rows: 2 (`06106`, `06675`).
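
The gap selection behind these counts can be sketched as an anti-join. This is a minimal illustration, not the backfill script itself: the `ticker` and `doc_type` columns and the `'allotment_results'` value are assumed names, since the real schema is not shown in this report.

```python
import sqlite3


def select_source_only_gaps(conn: sqlite3.Connection) -> list[str]:
    """Return tickers that have an archived allotment-results source
    but no ipo_demand row yet.

    Column names (ticker, doc_type) and the doc_type value are
    hypothetical; adapt them to the actual schema.
    """
    rows = conn.execute(
        """
        SELECT DISTINCT s.ticker
        FROM source_refs AS s
        LEFT JOIN ipo_demand AS d ON d.ticker = s.ticker
        WHERE s.doc_type = 'allotment_results'
          AND d.ticker IS NULL          -- no demand row written yet
        ORDER BY s.ticker
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Because the query only returns tickers that still lack an `ipo_demand` row, running it again after a successful backfill yields an empty list, which is the property the incremental rerun relies on.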
## Source Handling
- Existing extracted PDF text supplied most of the backfill.
- Old HKEX HTML allotment-result pages were followed to their linked Summary PDFs.
- Clarification or supplemental notices that had been selected as allotment-results sources were corrected by archiving the primary allotment-results announcement from the same HKEX title-search window.
- New allotment-result PDF sources archived: 38.
- PDF source refs after backfill: 595.
- Extracted-text manifest rows after backfill: 595.
- Extracted-text manifest status: 595 `ok`.
## Field Policy
Only explicitly disclosed demand fields were stored.
No missing demand field was inferred from share counts or other derived calculations. For example, where a Summary PDF disclosed valid applications, public subscription, international placee count, and final share counts but omitted successful applicants or international subscription level, the omitted fields were left null.
## Verification
- SQLite integrity check: `ok`.
- Foreign-key violations: 0.
- DB row counts match CSV snapshots for `ipo_master`, `offering_terms`, `ipo_demand`, `source_refs`, `sync_runs`, `ticker_sync_state`, and `sync_tasks`.
- `source_refs`: 1,187 rows, 0 bad paths, 0 missing files, 0 hash mismatches.
- PDF manifest reconciliation: 595 PDF sources, 595 manifest rows, 0 missing manifest rows, 0 orphan manifest rows, 0 missing text files, 0 PDF hash mismatches, 0 text hash mismatches.
- An incremental rerun after the backfill selected 0 source-only gaps and wrote 0 rows.
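
The database and hash checks above can be reproduced with standard SQLite pragmas and `hashlib`. This is a sketch under assumptions: the report does not name the hash algorithm, so SHA-256 is used here purely for illustration, and the function names are hypothetical.

```python
import hashlib
import sqlite3


def check_database(conn: sqlite3.Connection) -> tuple[str, int]:
    """Return the integrity-check result and the number of FK violations."""
    # integrity_check yields a single row containing 'ok' for a healthy DB.
    integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
    # foreign_key_check yields one row per violating child row.
    violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    return integrity, len(violations)


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once.

    SHA-256 is an assumption; use whichever digest source_refs records.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A clean run corresponds to `check_database` returning `("ok", 0)` and every recomputed file hash matching the value stored for that source.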
## Remaining Non-T1 Gaps
The T1 structural gap is closed, but the historical dataset is still incomplete:
- T2 grey-market remains blocked for 291 listed tickers pending a reproducible source strategy.
- Price-performance open tasks remain for D1/D5/D20/D60.
- Context fields such as industry label, market cap, and net proceeds remain incomplete.
The v0 analysis dataset should be regenerated by `analyst` before using the new T1 demand facts for scoring or calibration.