hk-ipo/data/snapshots/extracted_text_manifest.csv
geometrybase eae427d85b Add PDF text extraction workflow
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
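A minimal sketch of how per-page extraction with pypdf could produce the page_count, pages_with_text, and char_count columns of the manifest. This is not the contents of scripts/extract_pdf_text.py; the function names are hypothetical, and the column definitions (a page "has text" when its extracted text is non-blank, char_count is the summed length of the page texts) are assumptions.

```python
from pathlib import Path


def extract_page_texts(pdf_path: Path) -> list[str]:
    """Return one string per page, using pypdf (imported lazily)."""
    from pypdf import PdfReader  # dependency listed in requirements.txt

    reader = PdfReader(str(pdf_path))
    # extract_text() can return an empty result for image-only pages,
    # so normalise anything falsy to an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def summarize_pages(page_texts: list[str]) -> dict[str, int]:
    """Compute manifest-style counters (assumed definitions, see lead-in)."""
    return {
        "page_count": len(page_texts),
        "pages_with_text": sum(1 for text in page_texts if text.strip()),
        "char_count": sum(len(text) for text in page_texts),
    }
```

Keeping the counters in a separate pure function means they can be checked without a PDF on hand, and a gap between page_count and pages_with_text would point at scanned pages that need OCR rather than text extraction.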

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
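The path and hash check against the manifest could look roughly like the sketch below. It assumes the manifest columns shown in this file (pdf_local_path, pdf_sha256, text_local_path, text_sha256) and that paths are relative to the repository root; the function names are hypothetical and this is not necessarily how the check was actually run.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large prospectus PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, repo_root: Path) -> list[str]:
    """Return a list of human-readable problems; empty means all rows match."""
    problems: list[str] = []
    columns = (("pdf_local_path", "pdf_sha256"), ("text_local_path", "text_sha256"))
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            for path_col, hash_col in columns:
                rel = Path(row[path_col])
                if rel.is_absolute():
                    # Durable files should not carry machine-specific paths.
                    problems.append(f"{row['source_id']}: absolute path in {path_col}")
                    continue
                target = repo_root / rel
                if not target.is_file():
                    problems.append(f"{row['source_id']}: missing {rel.as_posix()}")
                elif sha256_of(target) != row[hash_col]:
                    problems.append(f"{row['source_id']}: hash mismatch for {rel.as_posix()}")
    return problems
```

Run from the repository root, a call such as `verify_manifest(Path("data/snapshots/extracted_text_manifest.csv"), Path("."))` should return an empty list when every archived PDF and extracted text file matches its recorded hash.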
2026-06-15 06:21:16 +00:00


source_id,ticker,source_type,pdf_local_path,pdf_sha256,text_local_path,text_sha256,page_count,pages_with_text,char_count,status,notes
06106_prospectus_candidate_2026_06_15,06106,prospectus,data/raw/06106/prospectus_candidate_2026-06-15.pdf,e8b129296563e43b7834be9d59ac41926fbaeb4f088da2c908b1f04b4151967b,data/extracted_text/06106/prospectus_candidate_2026-06-15.txt,8ae30cdbedb43ea17c242e31cfab6bf937e6e5b64b534051b80a77770e4d1da3,424,424,1463866,ok,
06106_prospectus_notice_2026_06_15,06106,prospectus_notice,data/raw/06106/prospectus_notice_2026-06-15.pdf,510983deaba5614975a57c5e77d3ea83af071a24609c28cd3f89914e1649bff5,data/extracted_text/06106/prospectus_notice_2026-06-15.txt,544e10f80f60aeb5d278cf64b90b2def03aab3011fbde054686dbfaf3eb314d8,11,11,24636,ok,
06658_allotment_results_2026_06_12,06658,allotment_results,data/raw/06658/allotment_results_2026-06-12.pdf,bb305cf55cc87809ecd845ea44243c4f41fcfaa31dbf496580e2ed8fc06d54a0,data/extracted_text/06658/allotment_results_2026-06-12.txt,eca3db4f06e5f228764f15f4ffed712528a094d3254150e5fcff5d49639bf7e4,16,16,33659,ok,
06658_prospectus_2026_06_05,06658,prospectus,data/raw/06658/prospectus_2026-06-05.pdf,e928dd8082e8aaf28156a46f64c98bee308d8ae4d10a9571a4531a3f9a8f0eb1,data/extracted_text/06658/prospectus_2026-06-05.txt,ec7d3817d77fb9f5a3795fbe9dd4e7d5fc927e40c5c67ef7427bf3e9b4275c8d,418,418,1198420,ok,
06675_global_offering_announcement_2026_06_09,06675,global_offering_announcement,data/raw/06675/global_offering_announcement_2026-06-09.pdf,a6b0c03d6b7a42cab0865aa0abf6dfa2dd80e6d16e392d73ddd3cd3839f7aeff,data/extracted_text/06675/global_offering_announcement_2026-06-09.txt,dea36ebe739c7dbb50121ce2cf63698d2d5a84fa2d5593fddbf0b6cdb05f4d1d,11,11,24716,ok,
06675_prospectus_2026_06_09,06675,prospectus,data/raw/06675/prospectus_2026-06-09.pdf,0c0c634786b7e7da921dd631fa7ba696043fae4ab29cf29dcc5f9e976c53b160,data/extracted_text/06675/prospectus_2026-06-09.txt,068578388742b94dd8f40381a4e6f62a88487374aa9e856b61f71acd9fcd93c2,388,388,1356301,ok,
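All six rows above report status ok with pages_with_text equal to page_count. A small sketch for flagging rows where that stops being true as the manifest grows; the function name is hypothetical, and treating pages_with_text below page_count as "incomplete" is an assumption about what the column means.

```python
import csv
from typing import TextIO


def flag_incomplete_rows(handle: TextIO) -> list[tuple[str, str]]:
    """Return (source_id, reason) pairs for rows that need a closer look."""
    flagged: list[tuple[str, str]] = []
    for row in csv.DictReader(handle):
        page_count = int(row["page_count"])
        pages_with_text = int(row["pages_with_text"])
        if row["status"] != "ok":
            flagged.append((row["source_id"], f"status={row['status']}"))
        elif pages_with_text < page_count:
            missing = page_count - pages_with_text
            flagged.append((row["source_id"], f"{missing} page(s) without text"))
        elif int(row["char_count"]) == 0:
            flagged.append((row["source_id"], "no characters extracted"))
    return flagged
```

Passing the open manifest file, for example `flag_incomplete_rows(open("data/snapshots/extracted_text_manifest.csv", newline="", encoding="utf-8"))`, gives a quick list of sources to re-extract or inspect by hand.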