Add PDF text extraction workflow

Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
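
A minimal sketch of the extraction step described above, not the actual contents of scripts/extract_pdf_text.py. It assumes pypdf's PdfReader API; the helper names and the data/extracted_text/<ticker>/<name>.txt layout are hypothetical and chosen for illustration only.

```python
from pathlib import Path


def extracted_text_path(local_path: str, out_dir: str = "data/extracted_text") -> Path:
    """Map data/raw/<ticker>/<name>.pdf to <out_dir>/<ticker>/<name>.txt.

    The output layout is an assumption made for this sketch.
    """
    src = Path(local_path)
    return Path(out_dir) / src.parent.name / (src.stem + ".txt")


def extract_pdf_text(pdf_path: Path) -> tuple[int, str]:
    """Return (page_count, text) for one PDF using pypdf."""
    # Imported lazily so the path helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield an empty string for image-only pages.
    pages = [page.extract_text() or "" for page in reader.pages]
    return len(pages), "\n\n".join(pages)


def extract_to_repo(repo_root: Path, local_path: str) -> Path:
    """Extract one repo-relative PDF and return the repo-relative output path."""
    rel_out = extracted_text_path(local_path)
    _, text = extract_pdf_text(repo_root / local_path)
    target = repo_root / rel_out
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return rel_out
```

Returning the repo-relative path rather than the resolved one keeps machine-specific absolute paths out of any manifest built from the result.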

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
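
The path and hash check above could be reproduced with something like the following sketch. The manifest column names text_path and text_sha256 are assumptions, since the real header of extracted_text_manifest.csv is not shown here; adjust them to match the file.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large extracts are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest: Path, repo_root: Path) -> list[str]:
    """Return one message per manifest row that fails a check.

    Column names 'text_path' and 'text_sha256' are hypothetical.
    """
    problems = []
    with manifest.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            rel = Path(row["text_path"])
            target = repo_root / rel
            if rel.is_absolute():
                # Durable files should only carry repo-relative paths.
                problems.append(f"absolute path: {rel}")
            elif not target.is_file():
                problems.append(f"missing: {rel}")
            elif sha256_of(target) != row["text_sha256"]:
                problems.append(f"hash mismatch: {rel}")
    return problems
```

An empty list from find_mismatches(Path("data/snapshots/extracted_text_manifest.csv"), Path(".")) would correspond to the verification passing.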
2026-06-15 06:21:16 +00:00
parent 7a8c648d87
commit eae427d85b
18 changed files with 65850 additions and 41 deletions
+7 -7
@@ -1,7 +1,7 @@
-source_id,ticker,source_type,title,path_base,local_path,url,file_sha256,source_date,archived_at,notes
-06106_prospectus_candidate_2026_06_15,06106,prospectus_candidate_pending_verification,"Shanghai Seer Intelligent Technology Co., Ltd. Prospectus Candidate",repo_root,data/raw/06106/prospectus_candidate_2026-06-15.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0615/2026061500013.pdf,e8b129296563e43b7834be9d59ac41926fbaeb4f088da2c908b1f04b4151967b,2026-06-15,2026-06-15T06:15:00Z,Downloaded from HKEXnews; document role should be verified before using for detailed fact extraction.
-06106_prospectus_notice_2026_06_15,06106,prospectus_notice,"Shanghai Seer Intelligent Technology Co., Ltd. Prospectus Notice",repo_root,data/raw/06106/prospectus_notice_2026-06-15.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0615/2026061500011.pdf,510983deaba5614975a57c5e77d3ea83af071a24609c28cd3f89914e1649bff5,2026-06-15,2026-06-15T06:15:00Z,HKEXnews announcement containing global offering terms and timetable.
-06658_allotment_results_2026_06_12,06658,allotment_results,"Liuliumei Co., Ltd. Announcement of Allotment Results",repo_root,data/raw/06658/allotment_results_2026-06-12.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0612/2026061202100.pdf,bb305cf55cc87809ecd845ea44243c4f41fcfaa31dbf496580e2ed8fc06d54a0,2026-06-12,2026-06-15T06:15:00Z,HKEXnews allotment results.
-06658_prospectus_2026_06_05,06658,prospectus,"Liuliumei Co., Ltd. Prospectus",repo_root,data/raw/06658/prospectus_2026-06-05.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0605/2026060500023.pdf,e928dd8082e8aaf28156a46f64c98bee308d8ae4d10a9571a4531a3f9a8f0eb1,2026-06-05,2026-06-15T06:15:00Z,HKEXnews prospectus.
-06675_global_offering_announcement_2026_06_09,06675,global_offering_announcement,"SENASIC Electronics Technology Co., Ltd. Global Offering Announcement",repo_root,data/raw/06675/global_offering_announcement_2026-06-09.pdf,https://www.hkexnews.hk/listedco/listconews/sehk/2026/0609/2026060900009.pdf,a6b0c03d6b7a42cab0865aa0abf6dfa2dd80e6d16e392d73ddd3cd3839f7aeff,2026-06-09,2026-06-15T06:15:00Z,HKEXnews global offering announcement.
-06675_prospectus_2026_06_09,06675,prospectus,"SENASIC Electronics Technology Co., Ltd. Prospectus",repo_root,data/raw/06675/prospectus_2026-06-09.pdf,https://www.hkexnews.hk/listedco/listconews/sehk/2026/0609/2026060900029.pdf,0c0c634786b7e7da921dd631fa7ba696043fae4ab29cf29dcc5f9e976c53b160,2026-06-09,2026-06-15T06:15:00Z,HKEXnews prospectus.
+source_id,ticker,source_type,title,path_base,local_path,url,file_sha256,source_date,archived_at,notes
+06106_prospectus_candidate_2026_06_15,06106,prospectus,"Shanghai Seer Intelligent Technology Co., Ltd. Prospectus",repo_root,data/raw/06106/prospectus_candidate_2026-06-15.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0615/2026061500013.pdf,e8b129296563e43b7834be9d59ac41926fbaeb4f088da2c908b1f04b4151967b,2026-06-15,2026-06-15T06:15:00Z,HKEXnews prospectus; verified by text extraction as a 424-page GLOBAL OFFERING document.
+06106_prospectus_notice_2026_06_15,06106,prospectus_notice,"Shanghai Seer Intelligent Technology Co., Ltd. Prospectus Notice",repo_root,data/raw/06106/prospectus_notice_2026-06-15.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0615/2026061500011.pdf,510983deaba5614975a57c5e77d3ea83af071a24609c28cd3f89914e1649bff5,2026-06-15,2026-06-15T06:15:00Z,HKEXnews announcement containing global offering terms and timetable.
+06658_allotment_results_2026_06_12,06658,allotment_results,"Liuliumei Co., Ltd. Announcement of Allotment Results",repo_root,data/raw/06658/allotment_results_2026-06-12.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0612/2026061202100.pdf,bb305cf55cc87809ecd845ea44243c4f41fcfaa31dbf496580e2ed8fc06d54a0,2026-06-12,2026-06-15T06:15:00Z,HKEXnews allotment results.
+06658_prospectus_2026_06_05,06658,prospectus,"Liuliumei Co., Ltd. Prospectus",repo_root,data/raw/06658/prospectus_2026-06-05.pdf,https://www1.hkexnews.hk/listedco/listconews/sehk/2026/0605/2026060500023.pdf,e928dd8082e8aaf28156a46f64c98bee308d8ae4d10a9571a4531a3f9a8f0eb1,2026-06-05,2026-06-15T06:15:00Z,HKEXnews prospectus.
+06675_global_offering_announcement_2026_06_09,06675,global_offering_announcement,"SENASIC Electronics Technology Co., Ltd. Global Offering Announcement",repo_root,data/raw/06675/global_offering_announcement_2026-06-09.pdf,https://www.hkexnews.hk/listedco/listconews/sehk/2026/0609/2026060900009.pdf,a6b0c03d6b7a42cab0865aa0abf6dfa2dd80e6d16e392d73ddd3cd3839f7aeff,2026-06-09,2026-06-15T06:15:00Z,HKEXnews global offering announcement.
+06675_prospectus_2026_06_09,06675,prospectus,"SENASIC Electronics Technology Co., Ltd. Prospectus",repo_root,data/raw/06675/prospectus_2026-06-09.pdf,https://www.hkexnews.hk/listedco/listconews/sehk/2026/0609/2026060900029.pdf,0c0c634786b7e7da921dd631fa7ba696043fae4ab29cf29dcc5f9e976c53b160,2026-06-09,2026-06-15T06:15:00Z,HKEXnews prospectus.