eae427d85b
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
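The manifest check listed under Verification could be sketched roughly as follows. This is not the repository's actual verification code: the column names `text_path` and `text_sha256` are hypothetical, because the manifest's real header is not shown in this commit view, and the choice of SHA-256 is likewise an assumption.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path, repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every entry matched."""
    problems = []
    with manifest_path.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            rel = row["text_path"]          # hypothetical column name
            expected = row["text_sha256"]   # hypothetical column name
            if Path(rel).is_absolute():
                # Durable files should only carry repo-relative paths.
                problems.append(f"{rel}: path is not repo-relative")
                continue
            target = repo_root / rel
            if not target.is_file():
                problems.append(f"{rel}: missing")
            elif sha256_of(target) != expected:
                problems.append(f"{rel}: hash mismatch")
    return problems
```

Calling `verify_manifest(Path("data/snapshots/extracted_text_manifest.csv"), Path("."))` from the repository root would return an empty list when every listed file exists and matches its recorded hash, provided the column names are adjusted to the real manifest header.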
4 lines
614 B
CSV
gap_id,ticker,stage,field_name,reason,expected_resolution_date,created_at,notes
06106_allotment_results_pending_2026_06_15,06106,T1_allotment,ipo_demand,Allotment results were expected on 2026-06-23 and were not available in this seed archive.,2026-06-23,2026-06-15T06:15:00Z,Update after the HKEXnews allotment results announcement is published.
06675_allotment_results_pending_2026_06_15,06675,T1_allotment,ipo_demand,Allotment results were expected on 2026-06-16 and were not available in this seed archive.,2026-06-16,2026-06-15T06:15:00Z,Update after the HKEXnews allotment results announcement is published.
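
As an illustration of how this snapshot might be consumed, the sketch below parses the two rows shown above and lists the gaps whose expected_resolution_date has been reached. The function name `due_gaps` and the inline `GAPS_CSV` constant are hypothetical and are not part of the repository. Reading every field as a string keeps the leading zeros in tickers such as 06106.

```python
import csv
import io
from datetime import date

# The header and the two data rows from the snapshot, inlined for illustration.
GAPS_CSV = """\
gap_id,ticker,stage,field_name,reason,expected_resolution_date,created_at,notes
06106_allotment_results_pending_2026_06_15,06106,T1_allotment,ipo_demand,Allotment results were expected on 2026-06-23 and were not available in this seed archive.,2026-06-23,2026-06-15T06:15:00Z,Update after the HKEXnews allotment results announcement is published.
06675_allotment_results_pending_2026_06_15,06675,T1_allotment,ipo_demand,Allotment results were expected on 2026-06-16 and were not available in this seed archive.,2026-06-16,2026-06-15T06:15:00Z,Update after the HKEXnews allotment results announcement is published.
"""


def due_gaps(csv_text: str, as_of: date) -> list[tuple[str, str, date]]:
    """Return (ticker, field_name, expected date) for gaps due on or before as_of."""
    due = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        expected = date.fromisoformat(row["expected_resolution_date"])
        if expected <= as_of:
            # ticker stays a string, so "06675" is not turned into 6675.
            due.append((row["ticker"], row["field_name"], expected))
    return due


if __name__ == "__main__":
    for ticker, field, expected in due_gaps(GAPS_CSV, date(2026, 6, 16)):
        print(ticker, field, expected.isoformat())
```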