Add PDF text extraction workflow

Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
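
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the contents of scripts/extract_pdf_text.py: the helper names (`output_path_for`, `join_pages`, `extract_pdf_text`), the one-`.txt`-per-PDF naming scheme, and the form-feed page separator are all assumptions. Only `PdfReader`, `reader.pages`, and `page.extract_text()` are actual pypdf calls.

```python
from pathlib import Path


def output_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map a PDF path to a .txt path under out_dir (hypothetical naming scheme)."""
    return out_dir / (pdf_path.stem + ".txt")


def join_pages(page_texts) -> str:
    """Join per-page text with form feeds, treating missing page text as empty."""
    return "\f".join((text or "").strip() for text in page_texts)


def extract_pdf_text(pdf_path: Path, out_dir: Path) -> Path:
    """Extract text from one PDF and write it under out_dir; returns the output path."""
    # Imported lazily so the pure helpers above work even without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    text = join_pages(page.extract_text() for page in reader.pages)

    out_path = output_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Keeping the output directory as a parameter, rather than resolving an absolute path inside the function, is one way to keep recorded paths repo-relative.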

Verification:
- Installed the python3.14-venv system package, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
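
The hash check against the manifest could look something like the sketch below. The column names `text_path` and `sha256` are assumptions for illustration; the actual header of data/snapshots/extracted_text_manifest.csv is not shown in this commit view, and the function names are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_manifest(manifest_path: Path, repo_root: Path) -> list:
    """Return the manifest paths whose file is missing or whose hash differs."""
    mismatches = []
    with open(manifest_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            # Assumed columns: a repo-relative path and its expected digest.
            target = repo_root / row["text_path"]
            if not target.is_file() or sha256_of(target) != row["sha256"]:
                mismatches.append(row["text_path"])
    return mismatches
```

An empty return value means every listed file exists and matches its recorded digest.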
2026-06-15 06:21:16 +00:00
parent 7a8c648d87
commit eae427d85b
18 changed files with 65850 additions and 41 deletions
+4 -4
@@ -1,4 +1,4 @@
-ticker,company_name_en,company_name_zh,stock_short_name,exchange,board,status,listing_date,application_start_date,application_end_date,allotment_results_expected_date,industry_label,data_as_of,notes
-06106,"Shanghai Seer Intelligent Technology Co., Ltd.",上海仙工智能科技股份有限公司,,HKEX,Main Board,open_for_subscription,2026-06-24,2026-06-15,2026-06-18,2026-06-23,Industrial intelligent robots / robot controllers,2026-06-15T06:15:00Z,Seeded from HKEXnews global offering announcement; full prospectus source classification needs follow-up.
-06658,"Liuliumei Co., Ltd.",溜溜梅股份有限公司,LIULIUMEI,HKEX,Main Board,listed,2026-06-15,2026-06-05,2026-06-10,2026-06-12,Snack food / preserved fruit,2026-06-15T06:15:00Z,Seeded from HKEXnews prospectus and allotment results.
-06675,"SENASIC Electronics Technology Co., Ltd.",琻捷電子科技(江蘇)股份有限公司,,HKEX,Main Board,pending_listing,2026-06-17,2026-06-09,2026-06-12,2026-06-16,Automotive wireless sensing SoC / semiconductors,2026-06-15T06:15:00Z,Seeded from HKEXnews prospectus and global offering announcement; allotment results not yet archived.
+ticker,company_name_en,company_name_zh,stock_short_name,exchange,board,status,listing_date,application_start_date,application_end_date,allotment_results_expected_date,industry_label,data_as_of,notes
+06106,"Shanghai Seer Intelligent Technology Co., Ltd.",上海仙工智能科技股份有限公司,,HKEX,Main Board,open_for_subscription,2026-06-24,2026-06-15,2026-06-18,2026-06-23,Industrial intelligent robots / robot controllers,2026-06-15T06:15:00Z,Seeded from HKEXnews global offering announcement; full prospectus source classification needs follow-up.
+06658,"Liuliumei Co., Ltd.",溜溜梅股份有限公司,LIULIUMEI,HKEX,Main Board,listed,2026-06-15,2026-06-05,2026-06-10,2026-06-12,Snack food / preserved fruit,2026-06-15T06:15:00Z,Seeded from HKEXnews prospectus and allotment results.
+06675,"SENASIC Electronics Technology Co., Ltd.",琻捷電子科技(江蘇)股份有限公司,,HKEX,Main Board,pending_listing,2026-06-17,2026-06-09,2026-06-12,2026-06-16,Automotive wireless sensing SoC / semiconductors,2026-06-15T06:15:00Z,Seeded from HKEXnews prospectus and global offering announcement; allotment results not yet archived.