Request:
- Rename the corresponding skill folders to match the hk-ipo-* skill names.

Changes:
- Renamed .codex/skills/analyst to .codex/skills/hk-ipo-analyst.
- Renamed .codex/skills/archivist to .codex/skills/hk-ipo-archivist.
- Renamed .codex/skills/audit to .codex/skills/hk-ipo-audit.
- Updated the remaining repo path reference to the audit skill.

Verification:
- Ran git diff --check and git diff --cached --check.
- Validated .codex/skills contains exactly hk-ipo-analyst, hk-ipo-archivist, and hk-ipo-audit.
- Confirmed each SKILL.md frontmatter name matches its directory name.
- Confirmed current repo docs/skills have no old .codex/skills/analyst, archivist, or audit path references.
Historical Data Completeness Audit
Audit date: 2026-06-15
Audit target: Full historical Hong Kong IPO archive in data/hk_ipo.sqlite
Stage scope: T0, T1, T2, D1, D5, D20, D60, source integrity, PDF extracted text, and analysis-dataset readiness
Audit status: fail for full historical completeness; pass for source integrity and PDF-derived evidence completeness
Summary
The archive is internally consistent at the raw-source and snapshot level, and the previous PDF extracted-text blind spot is now closed. The current archive is not yet complete enough to call the historical dataset fully analysis-ready across all intended stages.
Main blockers:
- T1 structured demand is incomplete: 137 tickers have allotment-results sources archived but no parsed `ipo_demand` row.
- T2 grey-market data remains unresolved: 291 listed tickers are blocked pending a reproducible source strategy.
- Due post-listing performance gaps remain: 18 D1, 18 D5, 18 D20, and 17 D60 tasks are open.
Major non-blocking gaps:
- Industry labels are mostly missing: 285 of 293 `ipo_master` rows are blank.
- Market-cap fields are mostly missing: 285 of 293 `offering_terms` rows are blank.
- Net proceeds are missing for 108 of 293 `offering_terms` rows.
Data Integrity
Pass.
- SQLite `PRAGMA integrity_check`: `ok`.
- SQLite foreign-key check: 0 violations.
- DB row counts match CSV snapshots for `ipo_master`, `offering_terms`, `ipo_demand`, `price_performance`, `new_listing_report_entries`, `source_refs`, `data_gaps`, `sync_runs`, `ticker_sync_state`, and `sync_tasks`.
- `source_refs`: 1,149 rows.
- Source path and file checks: 0 bad paths, 0 missing files, 0 null hashes, 0 hash mismatches.
- Source file extensions: 557 `.pdf`, 29 `.htm`, 273 `.json`, 290 `.xlsx`.
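The two SQLite checks listed above can be reproduced with Python's standard `sqlite3` module. The following is a minimal sketch, not the audit skill's actual implementation; the helper name `check_sqlite_integrity` is hypothetical and only the database path comes from this report.

```python
import sqlite3


def check_sqlite_integrity(db_path: str) -> dict:
    """Run SQLite's built-in integrity and foreign-key checks.

    Hypothetical helper for illustration; not part of the audited repo.
    """
    conn = sqlite3.connect(db_path)
    try:
        # PRAGMA integrity_check yields a single row ("ok",) when no problems are found.
        integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        # PRAGMA foreign_key_check yields one row per violating row; empty means clean.
        fk_violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        conn.close()
    return {
        "integrity_ok": integrity == ["ok"],
        "fk_violation_count": len(fk_violations),
    }
```

Calling `check_sqlite_integrity("data/hk_ipo.sqlite")` would be expected to report a clean result if the archive is in the state described in this section.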
PDF Derived Evidence
Pass.
- PDF `source_refs`: 557.
- Extracted-text manifest rows: 557.
- Manifest status counts: 557 `ok`.
- Duplicate manifest source IDs: 0.
- Missing manifest rows for PDFs: 0.
- Orphan manifest rows: 0.
- Bad manifest PDF/text paths: 0.
- Missing extracted-text files: 0.
- PDF hash mismatches: 0.
- Text hash mismatches: 0.
- Zero or empty extractions: 0.
- Incremental extractor rerun result: 557 processed, 557 skipped unchanged, 557 manifest rows.
This check is now required in .codex/skills/hk-ipo-audit/SKILL.md so future broad historical audits must reconcile PDF raw sources, extracted text, manifest rows, and hashes.
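The reconciliation described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the skill's real code: the manifest row keys (`source_id`, `status`, `text_path`, `text_sha256`), the use of SHA-256, and the function name `reconcile_manifest` are all hypothetical, since the report does not specify the manifest schema or hash algorithm.

```python
import hashlib
from collections import Counter
from pathlib import Path


def reconcile_manifest(pdf_source_ids, manifest_rows):
    """Compare PDF source IDs against extracted-text manifest rows.

    Assumed row keys (hypothetical): source_id, status, text_path, text_sha256.
    """
    counts = Counter(row["source_id"] for row in manifest_rows)
    issues = {
        "duplicate_ids": sorted(sid for sid, n in counts.items() if n > 1),
        "missing_rows": sorted(set(pdf_source_ids) - set(counts)),
        "orphan_rows": sorted(set(counts) - set(pdf_source_ids)),
        "not_ok": [],
        "missing_text": [],
        "empty_text": [],
        "text_hash_mismatch": [],
    }
    for row in manifest_rows:
        sid = row["source_id"]
        if row["status"] != "ok":
            issues["not_ok"].append(sid)
        text_path = Path(row["text_path"])
        if not text_path.is_file():
            issues["missing_text"].append(sid)
            continue
        data = text_path.read_bytes()
        if not data.strip():
            issues["empty_text"].append(sid)
        # SHA-256 is an assumption; substitute the archive's actual hash function.
        if hashlib.sha256(data).hexdigest() != row["text_sha256"]:
            issues["text_hash_mismatch"].append(sid)
    return issues
```

A passing audit in the sense of this section would correspond to every list in the returned dictionary being empty.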
Stage Completeness
IPO master sample:
- Total IPO rows: 293.
- Status counts: 291 listed, 1 open for subscription, 1 pending listing.
T0 prospectus:
- `T0_prospectus`: 293 complete.
- Offering terms rows: 293.
- No IPO is missing an `offering_terms` row.
- Offer price is present for all 293 rows.
T1 allotment:
- `T1_allotment`: 291 complete, 2 pending not due.
- Pending-not-due tickers: `06106` due 2026-06-23, `06675` due 2026-06-16.
- Allotment result sources: 291 total, including 262 PDF and 29 HTM sources.
- `ipo_demand`: 154 rows across 154 tickers.
- T1 complete but no structured demand row: 137 tickers.
- Source-only T1 gaps by source format: 108 PDF, 29 HTM.
- Demand rows with missing fields: public oversubscription 10, valid applications 6, successful applications 7, international oversubscription 1.
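The "T1 complete but no structured demand row" figure is an anti-join between archived allotment sources and `ipo_demand`. A minimal sketch of such a query is shown here; the table names come from this report, but the column names `ticker` and `source_type` and the value `'allotment_results'` are assumptions for illustration, and the real schema may differ.

```python
import sqlite3

# Column names `ticker` and `source_type`, and the value 'allotment_results',
# are assumptions for illustration; the real schema may differ.
T1_GAP_QUERY = """
SELECT DISTINCT s.ticker
FROM source_refs AS s
LEFT JOIN ipo_demand AS d ON d.ticker = s.ticker
WHERE s.source_type = 'allotment_results'
  AND d.ticker IS NULL
ORDER BY s.ticker
"""


def find_source_only_t1_tickers(conn: sqlite3.Connection) -> list[str]:
    """Return tickers with an allotment source archived but no ipo_demand row."""
    return [row[0] for row in conn.execute(T1_GAP_QUERY)]
```

The `LEFT JOIN ... IS NULL` form keeps the query readable and works on any SQLite version; a `NOT EXISTS` subquery would be an equivalent alternative.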
T2 grey market:
- `T2_grey_market`: 291 blocked, 2 pending not due.
- `data_gaps`: 291 grey-market gaps.
- This remains a blocking completeness issue unless T2 is explicitly excluded from a model or analysis stage.
Post-listing performance:
- D1: 273 complete, 18 pending due, 2 pending not due.
- D5: 272 complete, 18 pending due, 3 pending not due.
- D20: 267 complete, 18 pending due, 8 pending not due.
- D60: 251 complete, 17 pending due, 25 pending not due.
- Existing `price_performance` rows have 0 missing core price fields.
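As a quick sanity check on the post-listing figures, each stage's complete, pending-due, and pending-not-due counts should add up to the 293 IPO rows. The sketch below hard-codes the counts reported in this section; the names `STAGE_COUNTS` and `unbalanced_stages` are hypothetical.

```python
TOTAL_IPOS = 293

# Counts copied from this report: (complete, pending due, pending not due).
STAGE_COUNTS = {
    "D1": (273, 18, 2),
    "D5": (272, 18, 3),
    "D20": (267, 18, 8),
    "D60": (251, 17, 25),
}


def unbalanced_stages(stage_counts, total):
    """Return {stage: summed_count} for stages whose counts do not add up to total."""
    return {
        stage: sum(counts)
        for stage, counts in stage_counts.items()
        if sum(counts) != total
    }
```

For the figures in this report, `unbalanced_stages(STAGE_COUNTS, TOTAL_IPOS)` returns an empty dictionary, meaning every stage accounts for all 293 rows.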
Field Sufficiency
Major gaps remain for richer qualitative and valuation analysis:
- `ipo_master.industry_label` missing: 285 of 293.
- `ipo_master.application_start_date` missing: 45 of 293.
- `ipo_master.application_end_date` missing: 1 of 293.
- `ipo_master.allotment_results_expected_date` missing: 31 of 293.
- `offering_terms.board_lot` missing: 18 of 293.
- `offering_terms.min_subscription_amount_hkd` missing: 19 of 293.
- `offering_terms.global_offer_shares` missing: 11 of 293.
- `offering_terms.hk_offer_shares_initial` missing: 17 of 293.
- `offering_terms.international_offer_shares_initial` missing: 33 of 293.
- `offering_terms.public_offer_pct_initial` missing: 23 of 293.
- `offering_terms.market_cap_hkd_m` missing: 285 of 293.
- `offering_terms.net_proceeds_hkd_m` missing: 108 of 293.
- `offering_terms.issued_shares_upon_listing` missing: 45 of 293.
Analysis Dataset Readiness
The v0 analysis dataset is structurally usable but not fully sufficient for T1/T2-aware feedback.
- Dataset rows: 293.
- Missing `total_score`: 0.
- Missing prospectus source path: 0.
- Missing allotment source path: 2, matching the two pending-not-due T1 tickers.
- `has_structured_t1`: 154 yes, 139 no.
- Blank `t1_add_score`: 139.
- Blank `total_score_bucket`: 139.
- Missing D1 return: 20.
- Missing industry label: 285.
- Missing market cap: 285.
Decision bands are populated for all rows, but many rows are effectively T0-only or T0-plus-partial-context because structured T1, T2, industry, and valuation fields are incomplete.
Required Fixes
- Route to `archivist`: use extracted PDF text and HTM raw notices to parse missing T1 demand facts for the 137 source-only allotment rows.
- Route to `archivist`: decide and document a reproducible T2 grey-market source strategy, or formally exclude T2 from the v0 model scope.
- Route to `archivist`: retry or source the open due D1/D5/D20/D60 price-performance tasks.
- Route to `archivist`: backfill industry labels, market cap, net proceeds, issued shares, and missing offering-term fields.
- Re-run `analyst` only after the relevant archive gaps are closed or explicitly marked out of scope for the model version being evaluated.
Audit Skill Self-Correction
The prior audit skill failed to make derived evidence a mandatory check, which allowed archived PDFs without extracted text to pass unnoticed. The skill now requires broad historical audits to reconcile:
- PDF `source_refs`
- `data/snapshots/extracted_text_manifest.csv`
- `data/extracted_text/` files
- PDF hashes
- extracted text hashes
Missing derived text or manifest mismatches are now blocker issues for historical-data completeness.