Make PDF text extraction a standard archive step
Request:
- Add extracted PDF text generation to the archivist workflow as a standard step.

Changes:
- Run PDF text extraction automatically for newly archived HKEX PDF sources.
- Make the PDF text extractor incremental and manifest-preserving.
- Document extracted-text handling in the archivist skill and README.
- Mark generated extracted text as no-diff data evidence.
- Backfill extracted text for all archived PDF source references.

Verification:
- Ran git diff --cached --check.
- Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py.
- Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files.
- Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues.

Next useful context:
- HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.
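
The manifest check listed under Verification could be sketched roughly as follows. This is only an illustration, not the script used for this commit: the function name `audit_manifest` and the row keys `pdf_path`, `text_path`, `text_sha256`, and `status` are hypothetical, and the real manifest columns may differ.

```python
def audit_manifest(pdf_paths, manifest_rows):
    """Compare archived PDF paths against manifest rows.

    pdf_paths: iterable of PDF paths taken from source_refs.
    manifest_rows: list of dicts, one per manifest row (keys are assumed).
    Returns a dict of problem lists; all lists empty means the check passes.
    """
    by_path = {row["pdf_path"]: row for row in manifest_rows}

    # PDFs that were archived but never reached the manifest.
    missing_rows = sorted(set(pdf_paths) - set(by_path))

    # Rows whose extraction did not finish cleanly.
    not_ok = sorted(
        path for path, row in by_path.items() if row.get("status") != "ok"
    )

    # Rows that lack a text path or a text hash.
    missing_fields = sorted(
        path
        for path, row in by_path.items()
        if not row.get("text_path") or not row.get("text_sha256")
    )

    return {
        "missing_rows": missing_rows,
        "not_ok": not_ok,
        "missing_fields": missing_fields,
    }
```

A clean run in this sketch is one where every list in the returned dict is empty and the number of manifest rows equals the number of PDF paths.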
@@ -97,6 +97,8 @@ python3 -m venv .venv
The extractor reads PDF paths from `data/hk_ipo.sqlite`, writes derived text files under `data/extracted_text/`, and exports `data/snapshots/extracted_text_manifest.csv` with page counts, text hashes, and extraction status.
The extractor is incremental. If a PDF hash and manifest row are unchanged, the existing text output is reused. Use `--force` only when extraction behavior changes and all derived text should be regenerated.
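
The reuse rule above can be illustrated with a minimal sketch. It is not taken from `scripts/extract_pdf_text.py`; the function names and the manifest keys `pdf_sha256` and `status` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(pdf_path, text_path, manifest_row, force=False):
    """Decide whether a PDF must be extracted again.

    manifest_row is the dict recorded by a previous run, or None if the
    PDF has never been seen. The key names are hypothetical.
    """
    if force:
        return True  # mirrors --force: regenerate everything
    if manifest_row is None:
        return True  # no earlier record for this PDF
    if manifest_row.get("status") != "ok":
        return True  # earlier attempt failed, so try again
    if not text_path.exists():
        return True  # derived text file was deleted
    # Reuse the existing text only when the PDF bytes are unchanged.
    return manifest_row.get("pdf_sha256") != sha256_of(pdf_path)
```

Hashing the PDF content rather than comparing modification times keeps the decision stable across fresh checkouts, where timestamps change but bytes do not.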
## Recent IPO Target Refresh
Use HKEXnews annual new listing reports to seed recent subscription-relevant IPO targets:
@@ -117,7 +119,9 @@ Use the HKEX document archiver to fill detailed T0/T1 facts for open sync tasks:
```
.venv/bin/python scripts/archive_hkex_documents.py --as-of 2026-06-15T08:30:00Z
```
The archiver maps stock codes to HKEXnews title-search stock IDs, downloads the selected prospectus and allotment-results PDFs under `data/raw/{ticker}/`, records `source_refs`, parses high-confidence T0/T1 fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, and refreshes `sync_tasks`.
The archiver maps stock codes to HKEXnews title-search stock IDs, downloads the selected prospectus and allotment-results documents under `data/raw/{ticker}/`, records `source_refs`, parses high-confidence T0/T1 fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, refreshes `sync_tasks`, and extracts text for newly archived PDF sources.
HKEX `.htm`/`.html` notices and Yahoo Finance JSON market data stay in `data/raw/`; they are not copied into `data/extracted_text/`.
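
The split between PDF sources and other raw files might be expressed as a simple suffix check. The sketch below is illustrative only: the helper names are hypothetical, and the mirrored `{ticker}/name.txt` layout under `data/extracted_text/` is an assumption rather than something the README states.

```python
from pathlib import PurePosixPath


def is_pdf_source(local_path: str) -> bool:
    """True only for PDF files; .htm/.html notices and JSON data are skipped."""
    return PurePosixPath(local_path).suffix.lower() == ".pdf"


def extracted_text_path(
    local_path: str,
    raw_root: str = "data/raw",
    text_root: str = "data/extracted_text",
) -> str:
    """Map a raw PDF path to its derived text path.

    Assumes the text tree mirrors the raw tree and swaps the suffix for .txt.
    """
    relative = PurePosixPath(local_path).relative_to(raw_root)
    return str(PurePosixPath(text_root) / relative.with_suffix(".txt"))


def select_pdf_sources(paths):
    """Keep only the paths that should be handed to the text extractor."""
    return [path for path in paths if is_pdf_source(path)]
```

Filtering on the suffix keeps the HTML notices and market-data JSON files untouched in `data/raw/`, which matches the behaviour described above.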
## Price Performance Backfill