hk-ipo

geometrybase/hk-ipo

Private

Public Access

Fork 0

Commit Graph

Author	SHA1	Message	Date
geometrybase	fcb795b583	Add 02335 T0 analyst report Request: - Generate an analyst report for HK IPO ticker 02335. Changes: - Archived the official HKEXnews 02335 prospectus PDF and extracted text under project-relative data paths. - Seeded 02335 T0 prospectus facts, source references, sync state, and analysis snapshots. - Generated reports/2026-06-15_02335_T0_prospectus_analysis.md in Simplified Chinese with concrete T0/T1/T2/D1 dates and short-exit T2/D1 discipline. - Made PDF text extraction tolerant of invalid Unicode surrogate characters emitted by pypdf. Verification: - Compiled archive_hkex_documents.py, generate_ipo_report.py, build_analysis_dataset.py, extract_pdf_text.py, and update_sync_state.py. - Ran SQLite integrity_check and foreign_key_check. - Verified the archived 02335 PDF hash, extracted-text manifest row, and analysis dataset row. - Ran git diff --check. Next useful context: - 02335 is currently T0_prospectus; T1_allotment is pending for 2026-06-23.	2026-06-15 15:07:44 +00:00
geometrybase	8a0dfd88f0	Make PDF text extraction a standard archive step Request: - Add extracted PDF text generation to the archivist workflow as a standard step. Changes: - Run PDF text extraction automatically for newly archived HKEX PDF sources. - Make the PDF text extractor incremental and manifest-preserving. - Document extracted-text handling in the archivist skill and README. - Mark generated extracted text as no-diff data evidence. - Backfill extracted text for all archived PDF source references. Verification: - Ran git diff --cached --check. - Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py. - Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files. - Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues. Next useful context: - HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.	2026-06-15 13:27:41 +00:00
geometrybase	eae427d85b	Add PDF text extraction workflow Request: - Provide a way to install or develop a PDF extraction tool for archived HK IPO documents. Changes: - Add requirements.txt with pypdf as the lightweight PDF text extraction dependency. - Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files. - Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs. - Document the extraction workflow in README.md. - Ignore .venv and keep generated SQLite/Python transient files out of git. - Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms. Verification: - Installed python3.14-venv system support, created a local .venv, and installed requirements.txt. - Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py. - Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv. - Verified SQLite integrity and snapshot row counts. - Ran git diff --cached --check and searched durable files for machine-specific absolute paths.	2026-06-15 06:21:16 +00:00

Author

SHA1

Message

Date

geometrybase

fcb795b583

Add 02335 T0 analyst report

Request:
- Generate an analyst report for HK IPO ticker 02335.

Changes:
- Archived the official HKEXnews 02335 prospectus PDF and extracted text under project-relative data paths.
- Seeded 02335 T0 prospectus facts, source references, sync state, and analysis snapshots.
- Generated reports/2026-06-15_02335_T0_prospectus_analysis.md in Simplified Chinese with concrete T0/T1/T2/D1 dates and short-exit T2/D1 discipline.
- Made PDF text extraction tolerant of invalid Unicode surrogate characters emitted by pypdf.

Verification:
- Compiled archive_hkex_documents.py, generate_ipo_report.py, build_analysis_dataset.py, extract_pdf_text.py, and update_sync_state.py.
- Ran SQLite integrity_check and foreign_key_check.
- Verified the archived 02335 PDF hash, extracted-text manifest row, and analysis dataset row.
- Ran git diff --check.

Next useful context:
- 02335 is currently T0_prospectus; T1_allotment is pending for 2026-06-23.

2026-06-15 15:07:44 +00:00

geometrybase

8a0dfd88f0

Make PDF text extraction a standard archive step

Request:
- Add extracted PDF text generation to the archivist workflow as a standard step.

Changes:
- Run PDF text extraction automatically for newly archived HKEX PDF sources.
- Make the PDF text extractor incremental and manifest-preserving.
- Document extracted-text handling in the archivist skill and README.
- Mark generated extracted text as no-diff data evidence.
- Backfill extracted text for all archived PDF source references.

Verification:
- Ran git diff --cached --check.
- Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py.
- Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files.
- Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues.

Next useful context:
- HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.

2026-06-15 13:27:41 +00:00

geometrybase

eae427d85b

Add PDF text extraction workflow

Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.

2026-06-15 06:21:16 +00:00

3 Commits