hk-ipo

Author	SHA1	Message	Date
geometrybase	9aab267f80	Run full HKEX document backfill Request: - Remove small-batch guidance and execute the HKEX document archiver across all open T0/T1 sync tasks in one run. Changes: - Make archive_hkex_documents.py process every open T0/T1 ticker by default when --limit is omitted. - Add per-ticker progress output and keep full refreshes moving if one ticker fails. - Suppress noisy pypdf warnings during large official document extraction. - Update archivist and README instructions to show the full-run command without batch notes. - Archive official HKEXnews prospectus and allotment-results PDFs under repo-relative data/raw paths. - Refresh hk_ipo.sqlite and CSV snapshots for parsed T0/T1 fields, source_refs, sync_runs, sync_tasks, and ticker_sync_state. Execution: - Ran .venv/bin/python scripts/archive_hkex_documents.py --as-of 2026-06-15T09:00:00Z. - Selected 284 open T0/T1 tickers, processed 210 tickers, and archived 398 source files. - Left 74 tickers as missing target docs because title search did not return target prospectus/allotment documents for this pass. Verification: - Parsed archivist scripts with Python ast. - Confirmed README, archivist skill, and archiver script no longer contain batch guidance. - Ran git diff --check. - Checked SQLite integrity and DB/snapshot row counts. - Verified 706 source_refs use relative local paths, all files exist, and SHA-256 hashes match. Next useful context: - Current source_refs count is 706 and ipo_demand count is 134. - Sync ledger now reports 414 complete, 1595 pending_due, and 42 pending_not_due states.	2026-06-15 07:57:33 +00:00
geometrybase	993d7b26fa	Backfill first HKEX IPO document batch Request: Start progressively filling detailed information for recent HK IPO targets. Changes: - Add scripts/archive_hkex_documents.py to map tickers to HKEXnews stock IDs, select official prospectus and allotment-results PDFs, archive them under data/raw/{ticker}, parse high-confidence T0/T1 facts, export snapshots, and refresh sync state. - Document the small-batch HKEX document backfill workflow in README.md and the archivist skill. - Archive prospectus and allotment-results PDFs for 00901, 01081, 01779, 02290, 02553, and 03388. - Fill T0 details including application dates, expected allotment date, board lot, minimum subscription amount, and offer-share counts for the six tickers. - Fill T1 allotment-demand details including valid/successful applications, public subscription level, international placees, international subscription level, and final offer-share allocations. - Refresh source_refs, ipo_master, offering_terms, ipo_demand, ticker_sync_state, and sync_tasks snapshots. Verification: - Ran archive_hkex_documents.py in a first small batch and re-ran corrected tickers after parser hardening. - Parsed project Python scripts with ast.parse. - Checked SQLite integrity and DB-to-snapshot row counts. - Verified source_refs paths are repo-relative, source files exist, and SHA-256 hashes match. - Confirmed batch field completeness for the six processed tickers. - Ran git diff --check and git diff --cached --check. - Checked for Python cache and SQLite transient files. Next useful context: - This batch added about 55MB of official HKEXnews PDFs. - Sync state now has 16 complete stages, 1993 pending_due stages, and 42 pending_not_due stages. - Continue with small --limit batches because HKEXnews title search can include historical or postponed offering documents for the same stock code.	2026-06-15 07:07:46 +00:00
geometrybase	c65b20a1c4	Archive recent HKEX IPO targets Request: Use the project archivist workflow to update IPO target coverage for the most recent three-year window. Changes: - Add scripts/update_recent_ipo_list.py to discover HKEXnews annual new listing reports, archive XLSX sources, parse subscription-relevant IPO rows, and update SQLite plus snapshots. - Add new_listing_report_entries to preserve annual report row-level evidence. - Archive 2023-2026 Main Board new listing reports and 2024-2026 GEM new listing reports. - Seed 290 report-backed IPO targets for 2023-06-15 through 2026-06-15, skipping 10 non-IPO rows without numeric offer prices. - Refresh ipo_master, missing offering_terms fields, source_refs, ticker_sync_state, and sync_tasks. - Add openpyxl as the XLSX parser dependency and document the archivist refresh flow. - Limit sync summary output while keeping the full queue in SQLite and CSV snapshots. Verification: - Ran update_recent_ipo_list.py for 2023-06-15 to 2026-06-15 with as-of 2026-06-15T07:30:00Z. - Parsed project Python scripts with ast.parse. - Checked SQLite integrity and DB-to-snapshot row counts. - Verified source_refs paths are repo-relative, files exist, and SHA-256 hashes match. - Ran git diff --check and git diff --cached --check. - Checked for Python cache and SQLite transient files. Next useful context: - ipo_master now has 293 tickers; new_listing_report_entries has 290 report-backed targets. - Current sync queue has 2005 open tasks and 42 waiting_until_due tasks for deeper per-ticker archival stages.	2026-06-15 06:42:31 +00:00
geometrybase	08db218b6d	Add archivist incremental sync state Request: Add archivist support for remembering which IPO archive stages have already been synced and which stages should be updated next. Changes: - Add sync_runs, ticker_sync_state, sync_tasks, and price_performance tables to the archive schema. - Add scripts/update_sync_state.py to derive per-ticker stage status and rebuild the next-sync task queue. - Export the new sync-state tables as Git-friendly CSV snapshots. - Document the incremental archive flow in the archivist skill and README. Verification: - Ran scripts/bootstrap_historical_data.py. - Ran scripts/update_sync_state.py with a deterministic as-of timestamp. - Checked SQLite integrity and DB-to-snapshot row counts with Python sqlite3. - Parsed Python scripts with ast.parse. - Ran git diff --check and checked for temporary SQLite/cache files. Next useful context: - Current derived queue has 2 open tasks for 06658 and 15 waiting_until_due tasks for future stages.	2026-06-15 06:29:54 +00:00
geometrybase	eae427d85b	Add PDF text extraction workflow Request: - Provide a way to install or develop a PDF extraction tool for archived HK IPO documents. Changes: - Add requirements.txt with pypdf as the lightweight PDF text extraction dependency. - Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files. - Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs. - Document the extraction workflow in README.md. - Ignore .venv and keep generated SQLite/Python transient files out of git. - Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms. Verification: - Installed python3.14-venv system support, created a local .venv, and installed requirements.txt. - Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py. - Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv. - Verified SQLite integrity and snapshot row counts. - Ran git diff --cached --check and searched durable files for machine-specific absolute paths.	2026-06-15 06:21:16 +00:00
geometrybase	7a8c648d87	Bootstrap HK IPO historical archive Request: - Use the project archivist workflow to update historical IPO data. Changes: - Add an embedded SQLite archive at data/hk_ipo.sqlite. - Add schema/hk_ipo.schema.sql and scripts/bootstrap_historical_data.py for reproducible archive generation. - Archive HKEXnews source PDFs for 06658, 06675, and 06106 under repo-relative data/raw paths. - Export Git-friendly snapshots for ipo_master, offering_terms, ipo_demand, source_refs, and data_gaps. - Add .gitignore rules for Python cache and SQLite transient files. Verification: - Re-ran the bootstrap script successfully. - Ran PRAGMA integrity_check on the SQLite database. - Verified source_refs paths are repo-relative, files exist, and SHA-256 hashes match. - Verified snapshot row counts match SQLite table counts. - Ran git diff --check and searched generated durable files for machine-specific absolute paths.	2026-06-15 06:13:27 +00:00

6 Commits
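
The Verification sections above repeat the same checks in every commit: SQLite integrity, DB-to-snapshot row counts, and source_refs paths that are repo-relative, exist on disk, and match their SHA-256 hashes. A minimal sketch of those checks follows. It is not the repository's own verification code. The column names local_path and sha256 and all function names are assumptions chosen for illustration; only the table name source_refs and the database path data/hk_ipo.sqlite come from the commit messages.

```python
import csv
import hashlib
import sqlite3
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_source_refs(conn, repo_root, path_col="local_path", hash_col="sha256"):
    """Return a list of problems found in source_refs (empty means all good).

    path_col and hash_col are hypothetical column names; adjust them to
    the real schema.
    """
    problems = []
    query = f"SELECT {path_col}, {hash_col} FROM source_refs"
    for rel_path, expected in conn.execute(query).fetchall():
        path = Path(rel_path)
        if path.is_absolute():
            problems.append(f"absolute path: {rel_path}")
            continue
        full = Path(repo_root) / path
        if not full.is_file():
            problems.append(f"missing file: {rel_path}")
        elif sha256_of(full) != expected:
            problems.append(f"hash mismatch: {rel_path}")
    return problems


def check_integrity(conn) -> bool:
    """Run PRAGMA integrity_check; SQLite reports a single 'ok' row on success."""
    return conn.execute("PRAGMA integrity_check").fetchone()[0] == "ok"


def table_row_count(conn, table: str) -> int:
    """Count rows in a table (table name is trusted input in this sketch)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def snapshot_row_count(csv_path) -> int:
    """Count data rows in a CSV snapshot, assuming one header row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return max(sum(1 for _ in csv.reader(handle)) - 1, 0)
```

To use it, open the database with sqlite3.connect("data/hk_ipo.sqlite"), pass the connection and the repository root to check_source_refs, and compare table_row_count against snapshot_row_count for each exported table. Returning a list of problems instead of raising on the first failure mirrors the "keep moving if one ticker fails" approach described in the log, so that one bad file does not hide the others.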