Add PDF text extraction workflow

Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
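
A minimal sketch of the kind of extraction step scripts/extract_pdf_text.py performs, assuming pypdf's PdfReader and page.extract_text(). The helper names, the one-.txt-per-PDF naming rule, and the form-feed page separator are illustrative assumptions, not the script's actual interface.

```python
from pathlib import Path


def output_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map a PDF path to a .txt path inside out_dir (hypothetical naming rule)."""
    return out_dir / (pdf_path.stem + ".txt")


def extract_pdf_text(pdf_path: Path, out_dir: Path) -> Path:
    """Extract text page by page with pypdf and write it out as UTF-8."""
    # Imported lazily so the path helper stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield nothing for image-only pages, so fall back to "".
    pages = [page.extract_text() or "" for page in reader.pages]

    out_path = output_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Assumption: pages are separated by a form feed so page breaks stay visible.
    out_path.write_text("\n\f\n".join(pages), encoding="utf-8")
    return out_path
```

Writing to a path derived only from the PDF file name and a repo-relative output directory keeps machine-specific absolute paths out of the generated files.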

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
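
The manifest check above could be reproduced roughly as follows. This is a sketch only: the column names `path` and `sha256`, and the use of SHA-256 itself, are assumptions, since the layout of data/snapshots/extracted_text_manifest.csv is not shown in this commit.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest_path: Path, repo_root: Path) -> list[str]:
    """List manifest paths that are missing or whose hash differs."""
    mismatches = []
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Assumed columns: "path" (repo-relative) and "sha256" (hex digest).
            target = repo_root / row["path"]
            if not target.is_file() or sha256_of(target) != row["sha256"]:
                mismatches.append(row["path"])
    return mismatches
```

An empty list from find_mismatches means every listed file exists and matches its recorded hash.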
2026-06-15 06:21:16 +00:00
parent 7a8c648d87
commit eae427d85b
18 changed files with 65850 additions and 41 deletions
+4 -4
@@ -1,4 +1,4 @@
-ticker,source_id,prospectus_date,offer_price_hkd,board_lot,min_subscription_amount_hkd,global_offer_shares,hk_offer_shares_initial,international_offer_shares_initial,public_offer_pct_initial,over_allotment_offer_shares,offer_size_adjustment_offer_shares,market_cap_hkd_m,gross_proceeds_hkd_m,net_proceeds_hkd_m,issued_shares_upon_listing,data_as_of
-06106,06106_prospectus_notice_2026_06_15,2026-06-15,101.6,50,5131.24,10497300,524900,9972400,0.05,1574550,1574550,,,,,2026-06-15T06:15:00Z
-06658,06658_prospectus_2026_06_05,2026-06-05,43.58,100,4401.96,11464100,1146500,10317600,0.1,,,3434.59,499.6,440.1,78811208,2026-06-15T06:15:00Z
-06675,06675_global_offering_announcement_2026_06_09,2026-06-09,18.36,200,3709.04,53407000,5340800,48066200,0.1,8011000,,6959.2,,906.7,379041820,2026-06-15T06:15:00Z
+ticker,source_id,prospectus_date,offer_price_hkd,board_lot,min_subscription_amount_hkd,global_offer_shares,hk_offer_shares_initial,international_offer_shares_initial,public_offer_pct_initial,over_allotment_offer_shares,offer_size_adjustment_offer_shares,market_cap_hkd_m,gross_proceeds_hkd_m,net_proceeds_hkd_m,issued_shares_upon_listing,data_as_of
+06106,06106_prospectus_candidate_2026_06_15,2026-06-15,101.6,50,5131.24,10497300,524900,9972400,0.05,1574550,1574550,11226.52568,1066.52568,995.4,110497300,2026-06-15T06:15:00Z
+06658,06658_prospectus_2026_06_05,2026-06-05,43.58,100,4401.96,11464100,1146500,10317600,0.1,,,3434.59,499.6,440.1,78811208,2026-06-15T06:15:00Z
+06675,06675_global_offering_announcement_2026_06_09,2026-06-09,18.36,200,3709.04,53407000,5340800,48066200,0.1,8011000,,6959.2,,906.7,379041820,2026-06-15T06:15:00Z
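
The newly filled 06106 terms can be cross-checked arithmetically. The sketch below assumes gross proceeds equal the offer price times global_offer_shares and market capitalisation equals the offer price times issued_shares_upon_listing. The commit does not state how the figures were derived, but the values in the updated row are consistent with these formulas. net_proceeds_hkd_m (995.4) depends on listing expenses and is not re-derived here.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)

# Values copied from the updated 06106 row.
offer_price_hkd = Decimal("101.6")
global_offer_shares = 10_497_300
issued_shares_upon_listing = 110_497_300

# Assumed formulas: price x shares, expressed in HKD millions.
gross_proceeds_hkd_m = offer_price_hkd * global_offer_shares / MILLION
market_cap_hkd_m = offer_price_hkd * issued_shares_upon_listing / MILLION

# Both match the gross_proceeds_hkd_m and market_cap_hkd_m columns exactly.
assert gross_proceeds_hkd_m == Decimal("1066.52568")
assert market_cap_hkd_m == Decimal("11226.52568")
```

Decimal is used instead of float so the comparison against the CSV values is exact rather than subject to binary rounding.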