# Historical Data Completeness Audit

Audit date: 2026-06-15

Audit target: Full historical Hong Kong IPO archive in `data/hk_ipo.sqlite`

Stage scope: T0, T1, T2, D1, D5, D20, D60, source integrity, PDF extracted text, and analysis-dataset readiness

Audit status: fail for full historical completeness; pass for source integrity and PDF-derived evidence completeness

## Summary

The archive is internally consistent at the raw-source and snapshot level, and the previous PDF extracted-text blind spot is now closed. The current archive is not yet complete enough to call the historical dataset fully analysis-ready across all intended stages.

Main blockers:

- T1 structured demand is incomplete: 137 tickers have allotment-results sources archived but no parsed `ipo_demand` row.
- T2 grey-market data remains unresolved: 291 listed tickers are blocked pending a reproducible source strategy.
- Due post-listing performance gaps remain: 18 D1, 18 D5, 18 D20, and 17 D60 tasks are open.

Major non-blocking gaps:

- Industry labels are mostly missing: 285 of 293 `ipo_master` rows are blank.
- Market-cap fields are mostly missing: 285 of 293 `offering_terms` rows are blank.
- Net proceeds are missing for 108 of 293 `offering_terms` rows.

## Data Integrity

Pass.

- SQLite `PRAGMA integrity_check`: `ok`.
- SQLite foreign-key check: 0 violations.
- DB row counts match CSV snapshots for `ipo_master`, `offering_terms`, `ipo_demand`, `price_performance`, `new_listing_report_entries`, `source_refs`, `data_gaps`, `sync_runs`, `ticker_sync_state`, and `sync_tasks`.
- `source_refs`: 1,149 rows.
- Source path and file checks: 0 bad paths, 0 missing files, 0 null hashes, 0 hash mismatches.
- Source file extensions: 557 `.pdf`, 29 `.htm`, 273 `.json`, 290 `.xlsx`.

## PDF Derived Evidence

Pass.

- PDF `source_refs`: 557.
- Extracted-text manifest rows: 557.
- Manifest status counts: 557 `ok`.
- Duplicate manifest source IDs: 0.
- Missing manifest rows for PDFs: 0.
- Orphan manifest rows: 0.
- Bad manifest PDF/text paths: 0.
- Missing extracted-text files: 0.
- PDF hash mismatches: 0.
- Text hash mismatches: 0.
- Zero or empty extractions: 0.
- Incremental extractor rerun result: 557 processed, 557 skipped unchanged, 557 manifest rows.

This check is now required in `.codex/skills/hk-ipo-audit/SKILL.md` so future broad historical audits must reconcile PDF raw sources, extracted text, manifest rows, and hashes.

## Stage Completeness

IPO master sample:

- Total IPO rows: 293.
- Status counts: 291 listed, 1 open for subscription, 1 pending listing.

T0 prospectus:

- `T0_prospectus`: 293 complete.
- Offering terms rows: 293.
- No IPO is missing an `offering_terms` row.
- Offer price is present for all 293 rows.

T1 allotment:

- `T1_allotment`: 291 complete, 2 pending not due.
- Pending-not-due tickers: `06106` due 2026-06-23, `06675` due 2026-06-16.
- Allotment result sources: 291 total, including 262 PDF and 29 HTM sources.
- `ipo_demand`: 154 rows across 154 tickers.
- T1 complete but no structured demand row: 137 tickers.
- Source-only T1 gaps by source format: 108 PDF, 29 HTM.
- Demand rows with missing fields: public oversubscription 10, valid applications 6, successful applications 7, international oversubscription 1.

T2 grey market:

- `T2_grey_market`: 291 blocked, 2 pending not due.
- `data_gaps`: 291 grey-market gaps.
- This remains a blocking completeness issue unless T2 is explicitly excluded from a model or analysis stage.

Post-listing performance:

- D1: 273 complete, 18 pending due, 2 pending not due.
- D5: 272 complete, 18 pending due, 3 pending not due.
- D20: 267 complete, 18 pending due, 8 pending not due.
- D60: 251 complete, 17 pending due, 25 pending not due.
- Existing `price_performance` rows have 0 missing core price fields.

## Field Sufficiency

Major gaps remain for richer qualitative and valuation analysis:

- `ipo_master.industry_label` missing: 285 of 293.
- `ipo_master.application_start_date` missing: 45 of 293.
- `ipo_master.application_end_date` missing: 1 of 293.
- `ipo_master.allotment_results_expected_date` missing: 31 of 293.
- `offering_terms.board_lot` missing: 18 of 293.
- `offering_terms.min_subscription_amount_hkd` missing: 19 of 293.
- `offering_terms.global_offer_shares` missing: 11 of 293.
- `offering_terms.hk_offer_shares_initial` missing: 17 of 293.
- `offering_terms.international_offer_shares_initial` missing: 33 of 293.
- `offering_terms.public_offer_pct_initial` missing: 23 of 293.
- `offering_terms.market_cap_hkd_m` missing: 285 of 293.
- `offering_terms.net_proceeds_hkd_m` missing: 108 of 293.
- `offering_terms.issued_shares_upon_listing` missing: 45 of 293.

## Analysis Dataset Readiness

The v0 analysis dataset is structurally usable but not fully sufficient for T1/T2-aware feedback.

- Dataset rows: 293.
- Missing `total_score`: 0.
- Missing prospectus source path: 0.
- Missing allotment source path: 2, matching the two pending-not-due T1 tickers.
- `has_structured_t1`: 154 yes, 139 no.
- Blank `t1_add_score`: 139.
- Blank `total_score_bucket`: 139.
- Missing D1 return: 20.
- Missing industry label: 285.
- Missing market cap: 285.

Decision bands are populated for all rows, but many rows are effectively T0-only or T0-plus-partial-context because structured T1, T2, industry, and valuation fields are incomplete.

## Required Fixes

1. Route to `archivist`: use extracted PDF text and HTM raw notices to parse missing T1 demand facts for the 137 source-only allotment rows.
2. Route to `archivist`: decide and document a reproducible T2 grey-market source strategy, or formally exclude T2 from the v0 model scope.
3. Route to `archivist`: retry or source the open due D1/D5/D20/D60 price-performance tasks.
4. Route to `archivist`: backfill industry labels, market cap, net proceeds, issued shares, and missing offering-term fields.
5. Re-run `analyst` only after the relevant archive gaps are closed or explicitly marked out of scope for the model version being evaluated.

## Audit Skill Self-Correction

The prior audit skill failed to make derived evidence a mandatory check, which allowed archived PDFs without extracted text to pass unnoticed. The skill now requires broad historical audits to reconcile:

- PDF `source_refs`
- `data/snapshots/extracted_text_manifest.csv`
- `data/extracted_text/` files
- PDF hashes
- extracted text hashes

Missing derived text or manifest mismatches are now blocker issues for historical-data completeness.
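
The reconciliation described above could be sketched roughly as follows. This is a minimal illustration, not the audit skill's actual implementation: the manifest column names (`source_id`, `pdf_path`, `pdf_sha256`, `text_path`, `text_sha256`), the choice of SHA-256, and the function names are all assumptions made for the example, and the real manifest schema may differ.

```python
import hashlib
from collections import Counter
from pathlib import Path


def file_problem(path: Path, expected_sha256: str):
    """Return 'missing', 'empty', 'hash_mismatch', or None if the file checks out."""
    if not path.is_file():
        return "missing"
    data = path.read_bytes()
    if not data:
        return "empty"
    # Assumption: hashes are hex-encoded SHA-256 digests of the file bytes.
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        return "hash_mismatch"
    return None


def reconcile(pdf_source_ids, manifest_rows, root):
    """Map each issue name to the sorted list of offending source IDs.

    pdf_source_ids: IDs of PDF rows in source_refs.
    manifest_rows:  dicts using the hypothetical column names above.
    root:           directory that manifest paths are relative to.
    """
    root = Path(root)
    counts = Counter(row["source_id"] for row in manifest_rows)
    pdf_ids, manifest_ids = set(pdf_source_ids), set(counts)
    issues = {
        "missing_manifest_row": sorted(pdf_ids - manifest_ids),
        "orphan_manifest_row": sorted(manifest_ids - pdf_ids),
        "duplicate_manifest_id": sorted(i for i, n in counts.items() if n > 1),
    }
    # Check both the raw PDF and its extracted text for every manifest row.
    for row in manifest_rows:
        for kind in ("pdf", "text"):
            problem = file_problem(root / row[f"{kind}_path"], row[f"{kind}_sha256"])
            if problem:
                issues.setdefault(f"{kind}_{problem}", []).append(row["source_id"])
    return issues


def has_blockers(issues) -> bool:
    """Any non-empty issue list counts as a blocker, per the rule above."""
    return any(issues.values())
```

An audit run would feed this the PDF source IDs read from `source_refs` and the rows parsed from the manifest CSV (for instance with `csv.DictReader`), then fail the completeness check whenever `has_blockers` returns `True`.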