4 Commits

Author SHA1 Message Date
geometrybase c65b20a1c4 Archive recent HKEX IPO targets
Request:
Use the project archivist workflow to update IPO target coverage for the most recent three-year window.

Changes:
- Add scripts/update_recent_ipo_list.py to discover HKEXnews annual new listing reports, archive XLSX sources, parse subscription-relevant IPO rows, and update SQLite plus snapshots.
- Add new_listing_report_entries to preserve annual report row-level evidence.
- Archive 2023-2026 Main Board new listing reports and 2024-2026 GEM new listing reports.
- Seed 290 report-backed IPO targets for 2023-06-15 through 2026-06-15, skipping 10 non-IPO rows without numeric offer prices.
- Refresh ipo_master, missing offering_terms fields, source_refs, ticker_sync_state, and sync_tasks.
- Add openpyxl as the XLSX parser dependency and document the archivist refresh flow.
- Limit sync summary output while keeping the full queue in SQLite and CSV snapshots.
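
The row filter described above (keeping report rows with a numeric offer price and skipping the rest) might look roughly like the sketch below. It is not the code in scripts/update_recent_ipo_list.py: the `offer_price` key, the helper names, and the sample tickers are hypothetical, and rows are assumed to have already been read from the XLSX sheet into plain dicts.

```python
import math
from typing import Iterable, Optional


def parse_offer_price(raw: object) -> Optional[float]:
    """Return the offer price as a float, or None when the cell is not numeric."""
    if isinstance(raw, bool):  # bool is an int subclass, but never a price
        return None
    if isinstance(raw, (int, float)):
        value = float(raw)
    elif isinstance(raw, str):
        try:
            value = float(raw.replace(",", "").strip())
        except ValueError:
            return None
    else:
        return None
    # Reject NaN and infinities, which float() accepts from strings.
    return value if math.isfinite(value) else None


def split_ipo_rows(rows: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split report rows into (kept, skipped) by whether offer_price is numeric."""
    kept, skipped = [], []
    for row in rows:
        price = parse_offer_price(row.get("offer_price"))  # hypothetical column key
        if price is None:
            skipped.append(row)
        else:
            kept.append({**row, "offer_price": price})
    return kept, skipped
```

Returning the skipped rows alongside the kept ones makes it easy to report a skip count next to the seeded count.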

Verification:
- Ran update_recent_ipo_list.py for 2023-06-15 to 2026-06-15 with as-of 2026-06-15T07:30:00Z.
- Parsed project Python scripts with ast.parse.
- Checked SQLite integrity and DB-to-snapshot row counts.
- Verified source_refs paths are repo-relative, files exist, and SHA-256 hashes match.
- Ran git diff --check and git diff --cached --check.
- Checked for Python cache and SQLite transient files.
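
A minimal sketch of the ast.parse check listed above, assuming it simply means "every project script parses without a SyntaxError". The function name and the choice to scan recursively are assumptions, not the exact command that was run.

```python
import ast
from pathlib import Path


def find_syntax_errors(root: Path) -> dict[str, str]:
    """Parse every *.py file under root; return {relative path: error message}."""
    errors = {}
    for path in sorted(root.rglob("*.py")):
        try:
            # ast.parse only checks syntax; it never executes the script.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            rel = path.relative_to(root).as_posix()
            errors[rel] = f"line {exc.lineno}: {exc.msg}"
    return errors
```

An empty result means all scripts parsed; calling it as `find_syntax_errors(Path("scripts"))` would cover the scripts directory named in the commits.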

Next useful context:
- ipo_master now has 293 tickers; new_listing_report_entries has 290 report-backed targets.
- Current sync queue has 2005 open tasks and 42 waiting_until_due tasks for deeper per-ticker archival stages.
2026-06-15 06:42:31 +00:00
geometrybase 08db218b6d Add archivist incremental sync state
Request:
Add archivist support for remembering which IPO archive stages have already been synced and which stages should be updated next.

Changes:
- Add sync_runs, ticker_sync_state, sync_tasks, and price_performance tables to the archive schema.
- Add scripts/update_sync_state.py to derive per-ticker stage status and rebuild the next-sync task queue.
- Export the new sync-state tables as Git-friendly CSV snapshots.
- Document the incremental archive flow in the archivist skill and README.
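
How a task ends up as open or waiting_until_due can be illustrated with a small sketch. It assumes the status depends only on a due time compared with the as-of timestamp; the function name, the `completed` flag, and the "done" label are hypothetical and may not match scripts/update_sync_state.py.

```python
from datetime import datetime, timezone


def derive_task_status(due_at: datetime, as_of: datetime, completed: bool) -> str:
    """Classify one archive stage for one ticker relative to the as-of time."""
    if completed:
        return "done"  # hypothetical label for already-synced stages
    # Due now or overdue: work can start. Otherwise keep it parked in the queue.
    return "open" if due_at <= as_of else "waiting_until_due"


# Passing as_of explicitly (rather than reading the clock) keeps reruns deterministic.
as_of = datetime(2026, 6, 15, 6, 0, tzinfo=timezone.utc)
print(derive_task_status(datetime(2026, 6, 1, tzinfo=timezone.utc), as_of, False))
print(derive_task_status(datetime(2026, 9, 1, tzinfo=timezone.utc), as_of, False))
```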

Verification:
- Ran scripts/bootstrap_historical_data.py.
- Ran scripts/update_sync_state.py with a deterministic as-of timestamp.
- Checked SQLite integrity and DB-to-snapshot row counts with Python sqlite3.
- Parsed Python scripts with ast.parse.
- Ran git diff --check and checked for temporary SQLite/cache files.
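
The SQLite integrity and DB-to-snapshot row-count check above could be sketched with the standard library as follows. The layout of one `<table>.csv` file per table with a single header row is an assumption, as is the function name.

```python
import csv
import sqlite3
from pathlib import Path


def check_archive(db_path: Path, snapshot_dir: Path, tables: list[str]) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check yields a single row containing "ok" when healthy.
        result = conn.execute("PRAGMA integrity_check").fetchone()[0]
        if result != "ok":
            problems.append(f"integrity_check: {result}")
        for table in tables:
            # Table names come from a trusted, hard-coded list, not user input.
            db_rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            csv_path = snapshot_dir / f"{table}.csv"  # assumed snapshot naming
            with csv_path.open(newline="", encoding="utf-8") as fh:
                csv_rows = sum(1 for _ in csv.DictReader(fh))  # header excluded
            if db_rows != csv_rows:
                problems.append(f"{table}: db={db_rows} snapshot={csv_rows}")
    finally:
        conn.close()
    return problems
```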

Next useful context:
- Current derived queue has 2 open tasks for 06658 and 15 waiting_until_due tasks for future stages.
2026-06-15 06:29:54 +00:00
geometrybase eae427d85b Add PDF text extraction workflow
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
2026-06-15 06:21:16 +00:00
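
The search for machine-specific absolute paths in durable files, mentioned in the verification list above, might be approximated as below. The regular expression and the suffix list are assumptions about what counts as "machine-specific" and "durable"; the actual search may have used different patterns or a shell tool.

```python
import re
from pathlib import Path

# Assumed markers of machine-specific paths: Unix home directories and
# Windows drive-letter paths such as C:\Users\...
ABSOLUTE_PATH_RE = re.compile(r"(?:/home/|/Users/|/root/|[A-Za-z]:\\)")


def find_absolute_paths(
    root: Path, suffixes: tuple[str, ...] = (".csv", ".md", ".sql", ".txt")
) -> list[tuple[str, int, str]]:
    """Return (relative file, line number, line) for each suspicious line."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if ABSOLUTE_PATH_RE.search(line):
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```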
geometrybase 7a8c648d87 Bootstrap HK IPO historical archive
Request:
- Use the project archivist workflow to update historical IPO data.

Changes:
- Add an embedded SQLite archive at data/hk_ipo.sqlite.
- Add schema/hk_ipo.schema.sql and scripts/bootstrap_historical_data.py for reproducible archive generation.
- Archive HKEXnews source PDFs for 06658, 06675, and 06106 under repo-relative data/raw paths.
- Export Git-friendly snapshots for ipo_master, offering_terms, ipo_demand, source_refs, and data_gaps.
- Add .gitignore rules for Python cache and SQLite transient files.

Verification:
- Re-ran the bootstrap script successfully.
- Ran PRAGMA integrity_check on the SQLite database.
- Verified source_refs paths are repo-relative, files exist, and SHA-256 hashes match.
- Verified snapshot row counts match SQLite table counts.
- Ran git diff --check and searched generated durable files for machine-specific absolute paths.
2026-06-15 06:13:27 +00:00
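
The source_refs verification above (paths are repo-relative, files exist, SHA-256 hashes match) can be sketched as follows. It assumes each reference is available as a (relative path, expected hex digest) pair; how source_refs actually stores these values is not shown in the commit messages, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path, PurePosixPath


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large PDFs are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source_refs(repo_root: Path, refs: list[tuple[str, str]]) -> list[str]:
    """Return one problem string per failing reference; empty means all pass."""
    problems = []
    for rel_path, expected in refs:
        pure = PurePosixPath(rel_path)
        # Reject absolute paths and anything that climbs out of the repo.
        if pure.is_absolute() or ".." in pure.parts:
            problems.append(f"{rel_path}: not repo-relative")
            continue
        target = repo_root.joinpath(*pure.parts)
        if not target.is_file():
            problems.append(f"{rel_path}: missing")
        elif sha256_of(target) != expected.lower():
            problems.append(f"{rel_path}: hash mismatch")
    return problems
```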