Commit Graph

3 Commits

Author SHA1 Message Date
geometrybase eae427d85b Add PDF text extraction workflow
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.
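For orientation, the extraction step described in the changes above might look roughly like the sketch below. It is not the contents of scripts/extract_pdf_text.py: the function names and the one-.txt-per-PDF naming scheme are assumptions made for illustration. Only pypdf's PdfReader, its pages attribute, and extract_text() are real library calls.

```python
from pathlib import Path


def output_path_for(pdf_path: Path, out_dir: Path) -> Path:
    """Map e.g. archive/doc.pdf to <out_dir>/doc.txt (hypothetical naming scheme)."""
    return out_dir / (pdf_path.stem + ".txt")


def extract_pdf_text(pdf_path: Path) -> str:
    """Concatenate the text of every page using pypdf's PdfReader."""
    # Imported lazily so the path helper above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_to_file(pdf_path: Path, out_dir: Path) -> Path:
    """Write the extracted text next to the other outputs and return its path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = output_path_for(pdf_path, out_dir)
    target.write_text(extract_pdf_text(pdf_path), encoding="utf-8")
    return target
```

Passing a relative out_dir such as Path("data/extracted_text") keeps the written paths repo-relative, in line with the path rules this commit mentions.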

Verification:
- Installed the python3.14-venv system package, created a local .venv, and installed the dependencies from requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
2026-06-15 06:21:16 +00:00
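The manifest check listed under Verification in the commit above could be approximated as follows. This is a sketch under stated assumptions: the commit does not give the manifest's column names or hash algorithm, so the `path` and `sha256` columns and the use of SHA-256 are hypothetical.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large extracted texts need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: Path, repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every row matched."""
    problems = []
    with manifest.open(newline="", encoding="utf-8") as handle:
        # Assumed columns: "path" (repo-relative) and "sha256" (hex digest).
        for row in csv.DictReader(handle):
            target = repo_root / row["path"]
            if not target.is_file():
                problems.append(f"missing: {row['path']}")
            elif sha256_of(target) != row["sha256"]:
                problems.append(f"hash mismatch: {row['path']}")
    return problems
```

Resolving every manifest path against repo_root means the check behaves the same on any machine, provided the manifest itself stores only repo-relative paths.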
geometrybase 408ba59bc6 Document HK IPO project workflow
Request:
- Write a README introducing the project.

Changes:
- Describe the HK IPO research feedback loop.
- Document the stage-based workflow, project-local skills, storage model, path rules, and Git discipline.

Verification:
- Reviewed README contents with sed.
- Ran rg for machine-specific absolute path patterns; none found.
- Ran git diff --check.
2026-06-15 06:02:10 +00:00
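The search for machine-specific absolute paths mentioned in the commit above was run with rg. A rough Python equivalent is sketched below. The commit does not record the actual patterns, so the regular expression (home directories on Linux, macOS, and Windows) is an assumption.

```python
import re

# Hypothetical patterns for user-specific absolute paths.
ABSOLUTE_PATH_RE = re.compile(
    r"(/home/[^/\s]+/|/Users/[^/\s]+/|[A-Za-z]:\\Users\\)"
)


def find_absolute_paths(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line containing a match."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ABSOLUTE_PATH_RE.search(line):
            hits.append((lineno, line.strip()))
    return hits
```

An empty result for every durable file corresponds to the "none found" outcome recorded in the commit message.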
geometrybase 6907418731 first commit 2026-06-15 05:43:41 +00:00