Add PDF text extraction workflow
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
@@ -85,8 +85,20 @@ Path rules:
SQLite is the embedded source of structured facts. CSV snapshots provide readable Git diffs. Markdown memos preserve the reasoning at each decision point.

## PDF Text Extraction

Archived PDFs can be converted into searchable text files:

```bash
python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
.venv/bin/python scripts/extract_pdf_text.py
```

The extractor reads PDF paths from `data/hk_ipo.sqlite`, writes derived text files under `data/extracted_text/`, and exports `data/snapshots/extracted_text_manifest.csv` with page counts, text hashes, and extraction status.
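
The manifest makes it possible to re-check extracted text after the fact. The following is a minimal sketch of such a check, not the repository's actual verification code: the column names `text_path` and `text_sha256`, the helper names, and the choice of SHA-256 are all assumptions for illustration, and the real manifest may use different headers or a different hash.

```python
import csv
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_manifest(manifest_path: Path, repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means every row matched.

    Assumes hypothetical columns `text_path` (repo-relative) and
    `text_sha256` (hex digest of the extracted text file).
    """
    problems = []
    with manifest_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            rel = Path(row["text_path"])
            if rel.is_absolute():
                problems.append(f"{rel}: path is not repo-relative")
                continue
            target = repo_root / rel
            if not target.is_file():
                problems.append(f"{rel}: missing file")
            elif sha256_of_file(target) != row["text_sha256"]:
                problems.append(f"{rel}: hash mismatch")
    return problems
```

Under those assumptions, calling `check_manifest(Path("data/snapshots/extracted_text_manifest.csv"), Path("."))` from the repository root would return an empty list when every text file is present and unchanged.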

## Git Discipline

The repository uses automatic focused commits for completed project changes. Push only when explicitly requested.
The repository uses automatic focused commits for completed project changes.

Before committing, check that unrelated dirty files are not included and that generated durable files use repo-relative paths.
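
One way to automate the repo-relative path check is a small scan over the files about to be committed. This is only a sketch under stated assumptions: the function name `find_absolute_paths` is hypothetical, and the pattern below guesses at what a machine-specific absolute path looks like (`/home/`, `/Users/`, or a Windows drive prefix), so it should be adjusted to the machines actually in use.

```python
import re
from pathlib import Path

# Assumed markers of machine-specific absolute paths; not an exhaustive list.
ABSOLUTE_PATH_PATTERN = re.compile(r"(/home/|/Users/|[A-Za-z]:\\)")


def find_absolute_paths(files):
    """Return (file, line number, line text) for each line that looks absolute."""
    hits = []
    for path in files:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if ABSOLUTE_PATH_PATTERN.search(line):
                hits.append((str(path), number, line.strip()))
    return hits
```

The list of files could come from `git diff --cached --name-only`, so that only staged files are scanned; an empty result suggests no obvious absolute paths slipped in.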