Add PDF text extraction workflow

Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
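
For reference, the extraction step can be sketched roughly as below. This is a minimal illustration, not the contents of scripts/extract_pdf_text.py: the function names, the `<stem>.txt` naming scheme, and the form-feed page separator are assumptions. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are taken from pypdf itself.

```python
from pathlib import Path


def extracted_text_path(pdf_path: Path, out_dir: Path = Path("data/extracted_text")) -> Path:
    """Map a PDF path to a repo-relative .txt path (hypothetical naming scheme)."""
    return out_dir / (pdf_path.stem + ".txt")


def extract_pdf_text(pdf_path: Path, repo_root: Path) -> Path:
    """Extract text from every page of one PDF and write it under repo_root.

    Returns the repo-relative output path so callers never record
    machine-specific absolute paths.
    """
    # Imported lazily so the path helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield an empty result for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]

    rel_path = extracted_text_path(pdf_path)
    target = repo_root / rel_path
    target.parent.mkdir(parents=True, exist_ok=True)
    # Assumed convention: separate pages with a form feed.
    target.write_text("\n\f\n".join(pages), encoding="utf-8")
    return rel_path
```

Returning the relative path rather than the absolute one keeps any manifest built from it portable across machines.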

Verification:
- Installed the python3.14-venv system package, created a local .venv, and installed the dependencies from requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
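
The path and hash check against the manifest could look something like the sketch below. The column names `text_path` and `sha256` are assumptions for illustration; the actual header of data/snapshots/extracted_text_manifest.csv is not shown in this commit message, and the hash algorithm is likewise assumed.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest: Path, repo_root: Path) -> list[str]:
    """Check every manifest row; return a list of human-readable problems."""
    problems = []
    with manifest.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            rel = Path(row["text_path"])  # assumed column name
            if rel.is_absolute():
                # Durable files should only hold repo-relative paths.
                problems.append(f"absolute path: {rel}")
                continue
            target = repo_root / rel
            if not target.is_file():
                problems.append(f"missing: {rel}")
            elif sha256_of(target) != row["sha256"]:  # assumed column name
                problems.append(f"hash mismatch: {rel}")
    return problems
```

An empty return value means every listed file exists at a relative path and matches its recorded digest.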
2026-06-15 06:21:16 +00:00
parent 7a8c648d87
commit eae427d85b
18 changed files with 65850 additions and 41 deletions
+2 -2
@@ -1,2 +1,2 @@
-demand_id,ticker,source_id,stage_date,valid_applications,successful_applications,public_oversubscription_times,international_placees,international_oversubscription_times,final_hk_offer_shares,final_international_offer_shares,data_as_of,notes
-06658_allotment_2026_06_12,06658,06658_allotment_results_2026_06_12,2026-06-12,180507,11465,6586.73,64,2.64,1146500,10317600,2026-06-15T06:15:00Z,Claw-back shown as N/A in the HKEXnews allotment results.
+demand_id,ticker,source_id,stage_date,valid_applications,successful_applications,public_oversubscription_times,international_placees,international_oversubscription_times,final_hk_offer_shares,final_international_offer_shares,data_as_of,notes
+06658_allotment_2026_06_12,06658,06658_allotment_results_2026_06_12,2026-06-12,180507,11465,6586.73,64,2.64,1146500,10317600,2026-06-15T06:15:00Z,Claw-back shown as N/A in the HKEXnews allotment results.