hk-ipo

geometrybase/hk-ipo

Private

Public Access

Fork 0

Commit Graph

Author	SHA1	Message	Date
geometrybase	8a0dfd88f0	Make PDF text extraction a standard archive step Request: - Add extracted PDF text generation to the archivist workflow as a standard step. Changes: - Run PDF text extraction automatically for newly archived HKEX PDF sources. - Make the PDF text extractor incremental and manifest-preserving. - Document extracted-text handling in the archivist skill and README. - Mark generated extracted text as no-diff data evidence. - Backfill extracted text for all archived PDF source references. Verification: - Ran git diff --cached --check. - Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py. - Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files. - Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues. Next useful context: - HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.	2026-06-15 13:27:41 +00:00

Author

SHA1

Message

Date

geometrybase

8a0dfd88f0

Make PDF text extraction a standard archive step

Request:
- Add extracted PDF text generation to the archivist workflow as a standard step.

Changes:
- Run PDF text extraction automatically for newly archived HKEX PDF sources.
- Make the PDF text extractor incremental and manifest-preserving.
- Document extracted-text handling in the archivist skill and README.
- Mark generated extracted text as no-diff data evidence.
- Backfill extracted text for all archived PDF source references.

Verification:
- Ran git diff --cached --check.
- Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py.
- Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files.
- Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues.

Next useful context:
- HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.

2026-06-15 13:27:41 +00:00

1 Commits