hk-ipo/reports/2026-06-15_historical_data_audit.md
geometrybase 33d0bc056e Tighten historical data audit coverage
Request:
- Use the audit skill to check historical data completeness and self-correct the audit criteria after the missed PDF extracted-text gap.

Changes:
- Add a mandatory derived-evidence checklist to the audit skill.
- Require broad historical audits to reconcile PDF source_refs, extracted text files, manifest rows, and hashes.
- Add a historical data completeness audit report for the current archive.

Findings:
- Source integrity and PDF extracted-text completeness now pass.
- Full historical completeness still fails due to incomplete structured T1 demand, unresolved T2 grey-market data, open due price-performance tasks, and missing context fields.

Verification:
- Ran SQLite integrity, foreign-key, source hash, snapshot, PDF manifest, extracted-text hash, stage coverage, and analysis-dataset checks.
- Ran scripts/extract_pdf_text.py and confirmed 557 PDF sources were skipped unchanged with 557 manifest rows.
- Ran git diff --check.
2026-06-15 13:43:22 +00:00

Historical Data Completeness Audit

Audit date: 2026-06-15
Audit target: Full historical Hong Kong IPO archive in data/hk_ipo.sqlite
Stage scope: T0, T1, T2, D1, D5, D20, D60, source integrity, PDF extracted text, and analysis-dataset readiness
Audit status: fail for full historical completeness; pass for source integrity and PDF-derived evidence completeness

Summary

The archive is internally consistent at the raw-source and snapshot level, and the previous PDF extracted-text blind spot is now closed. The current archive is not yet complete enough to call the historical dataset fully analysis-ready across all intended stages.

Main blockers:

  • T1 structured demand is incomplete: 137 tickers have allotment-results sources archived but no parsed ipo_demand row.
  • T2 grey-market data remains unresolved: 291 listed tickers are blocked pending a reproducible source strategy.
  • Post-listing performance tasks that are already due remain open: 18 D1, 18 D5, 18 D20, and 17 D60.

Major non-blocking gaps:

  • Industry labels are mostly missing: 285 of 293 ipo_master rows are blank.
  • Market-cap fields are mostly missing: 285 of 293 offering_terms rows are blank.
  • Net proceeds are missing for 108 of 293 offering_terms rows.

Data Integrity

Pass.

  • SQLite PRAGMA integrity_check: ok.
  • SQLite foreign-key check: 0 violations.
  • DB row counts match CSV snapshots for ipo_master, offering_terms, ipo_demand, price_performance, new_listing_report_entries, source_refs, data_gaps, sync_runs, ticker_sync_state, and sync_tasks.
  • source_refs: 1,149 rows.
  • Source path and file checks: 0 bad paths, 0 missing files, 0 null hashes, 0 hash mismatches.
  • Source file extensions: 557 .pdf, 29 .htm, 273 .json, 290 .xlsx.
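
The integrity checks above could be reproduced along the following lines. This is a minimal sketch, not the audit skill's actual code: the function names, the (path, recorded_hash) row shape, and the use of SHA-256 are assumptions, since the report does not state the hash algorithm or the source_refs column names.

```python
import hashlib
import sqlite3
from pathlib import Path


def check_sqlite(conn: sqlite3.Connection) -> dict:
    """Run SQLite's built-in integrity and foreign-key checks."""
    integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    fk_violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    return {"integrity": integrity, "fk_violations": len(fk_violations)}


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_sources(rows, root: Path) -> dict:
    """Classify each source row; rows are (path, recorded_hash) pairs (assumed shape)."""
    result = {"missing_files": 0, "null_hashes": 0, "hash_mismatches": 0}
    for rel_path, recorded in rows:
        file_path = root / rel_path
        if not file_path.is_file():
            result["missing_files"] += 1
        elif recorded is None:
            result["null_hashes"] += 1
        elif sha256_file(file_path) != recorded:
            result["hash_mismatches"] += 1
    return result
```

A passing run corresponds to check_sqlite returning a single "ok" integrity row with 0 foreign-key violations, and check_sources returning 0 for every counter.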

PDF Derived Evidence

Pass.

  • PDF source_refs: 557.
  • Extracted-text manifest rows: 557.
  • Manifest status counts: 557 ok.
  • Duplicate manifest source IDs: 0.
  • Missing manifest rows for PDFs: 0.
  • Orphan manifest rows: 0.
  • Bad manifest PDF/text paths: 0.
  • Missing extracted-text files: 0.
  • PDF hash mismatches: 0.
  • Text hash mismatches: 0.
  • Zero or empty extractions: 0.
  • Incremental extractor rerun result: 557 processed, 557 skipped unchanged, 557 manifest rows.

This check is now required in .codex/skills/audit/SKILL.md so future broad historical audits must reconcile PDF raw sources, extracted text, manifest rows, and hashes.
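
The reconciliation between PDF source_refs and manifest rows amounts to a set comparison. The sketch below illustrates the idea; the function name and the assumption that both sides share a common source ID are hypothetical, and the real manifest columns are not shown in this report.

```python
from collections import Counter


def reconcile_manifest(pdf_source_ids, manifest_source_ids):
    """Compare PDF source IDs against extracted-text manifest source IDs."""
    pdf_ids = set(pdf_source_ids)
    counts = Counter(manifest_source_ids)
    manifest_ids = set(counts)
    return {
        # PDFs that were archived but never got a manifest row.
        "missing_manifest_rows": sorted(pdf_ids - manifest_ids),
        # Manifest rows that point at no known PDF source.
        "orphan_manifest_rows": sorted(manifest_ids - pdf_ids),
        # Source IDs that appear more than once in the manifest.
        "duplicate_manifest_ids": sorted(i for i, n in counts.items() if n > 1),
    }
```

For the current archive, all three lists would be empty, matching the zero counts reported above.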

Stage Completeness

IPO master sample:

  • Total IPO rows: 293.
  • Status counts: 291 listed, 1 open for subscription, 1 pending listing.

T0 prospectus:

  • T0_prospectus: 293 complete.
  • Offering terms rows: 293.
  • No IPO is missing an offering_terms row.
  • Offer price is present for all 293 rows.

T1 allotment:

  • T1_allotment: 291 complete, 2 pending not due.
  • Pending-not-due tickers: 06106 due 2026-06-23, 06675 due 2026-06-16.
  • Allotment result sources: 291 total, including 262 PDF and 29 HTM sources.
  • ipo_demand: 154 rows across 154 tickers.
  • T1 complete but no structured demand row: 137 tickers.
  • Source-only T1 gaps by source format: 108 PDF, 29 HTM.
  • Demand rows with missing fields: public oversubscription 10, valid applications 6, successful applications 7, international oversubscription 1.
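
The source-only T1 gap is an anti-join: tickers with an archived allotment source but no ipo_demand row. A sketch of such a query follows. The ticker and source_type columns and the 'allotment_results' value are hypothetical placeholders, because the real schema of source_refs and ipo_demand is not shown in this report.

```python
import sqlite3

# Hypothetical column names and source_type value; adjust to the real schema.
QUERY = """
SELECT DISTINCT s.ticker
FROM source_refs AS s
LEFT JOIN ipo_demand AS d ON d.ticker = s.ticker
WHERE s.source_type = 'allotment_results'
  AND d.ticker IS NULL
ORDER BY s.ticker
"""


def source_only_t1_tickers(conn: sqlite3.Connection) -> list:
    """Return tickers that have an allotment source but no parsed demand row."""
    return [row[0] for row in conn.execute(QUERY)]
```

Under these assumptions, the length of the returned list would be expected to equal the 137 tickers reported above.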

T2 grey market:

  • T2_grey_market: 291 blocked, 2 pending not due.
  • data_gaps: 291 grey-market gaps.
  • This remains a blocking completeness issue unless T2 is explicitly excluded from a model or analysis stage.

Post-listing performance:

  • D1: 273 complete, 18 pending due, 2 pending not due.
  • D5: 272 complete, 18 pending due, 3 pending not due.
  • D20: 267 complete, 18 pending due, 8 pending not due.
  • D60: 251 complete, 17 pending due, 25 pending not due.
  • Existing price_performance rows have 0 missing core price fields.
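
One quick sanity check on the figures above is that each stage's status counts sum to the 293 IPO rows. The sketch below uses the counts from this report; the function and variable names are illustrative only.

```python
TOTAL_IPOS = 293

# (complete, pending due, pending not due), copied from the list above.
POST_LISTING = {
    "D1": (273, 18, 2),
    "D5": (272, 18, 3),
    "D20": (267, 18, 8),
    "D60": (251, 17, 25),
}


def unbalanced_stages(stage_counts, total):
    """Return the stages whose status counts do not sum to the IPO total."""
    return [stage for stage, counts in stage_counts.items() if sum(counts) != total]


print(unbalanced_stages(POST_LISTING, TOTAL_IPOS))  # prints []
```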

Field Sufficiency

Major gaps remain for richer qualitative and valuation analysis:

  • ipo_master.industry_label missing: 285 of 293.
  • ipo_master.application_start_date missing: 45 of 293.
  • ipo_master.application_end_date missing: 1 of 293.
  • ipo_master.allotment_results_expected_date missing: 31 of 293.
  • offering_terms.board_lot missing: 18 of 293.
  • offering_terms.min_subscription_amount_hkd missing: 19 of 293.
  • offering_terms.global_offer_shares missing: 11 of 293.
  • offering_terms.hk_offer_shares_initial missing: 17 of 293.
  • offering_terms.international_offer_shares_initial missing: 33 of 293.
  • offering_terms.public_offer_pct_initial missing: 23 of 293.
  • offering_terms.market_cap_hkd_m missing: 285 of 293.
  • offering_terms.net_proceeds_hkd_m missing: 108 of 293.
  • offering_terms.issued_shares_upon_listing missing: 45 of 293.
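
Per-field missing counts like those above can be gathered with one query per column. This is a sketch under the assumption that "missing" means NULL or whitespace-only text; the report does not define it, and the helper name is hypothetical.

```python
import sqlite3


def count_missing(conn: sqlite3.Connection, table: str, column: str) -> int:
    """Count rows where the column is NULL or whitespace-only text."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # quoted as identifiers; pass only trusted, hard-coded names.
    sql = (
        f'SELECT COUNT(*) FROM "{table}" '
        f'WHERE "{column}" IS NULL OR TRIM("{column}") = \'\''
    )
    return conn.execute(sql).fetchone()[0]
```

For example, count_missing(conn, "offering_terms", "market_cap_hkd_m") would be expected to return 285 against the current archive if the assumed definition of "missing" matches the one used in the audit.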

Analysis Dataset Readiness

The v0 analysis dataset is structurally usable but not yet sufficient for feedback that depends on T1 or T2 data.

  • Dataset rows: 293.
  • Missing total_score: 0.
  • Missing prospectus source path: 0.
  • Missing allotment source path: 2, matching the two pending-not-due T1 tickers.
  • has_structured_t1: 154 yes, 139 no.
  • Blank t1_add_score: 139.
  • Blank total_score_bucket: 139.
  • Missing D1 return: 20.
  • Missing industry label: 285.
  • Missing market cap: 285.

Decision bands are populated for all rows, but many rows are effectively T0-only or T0-plus-partial-context because structured T1, T2, industry, and valuation fields are incomplete.

Required Fixes

  1. Route to archivist: use extracted PDF text and HTM raw notices to parse missing T1 demand facts for the 137 source-only allotment rows.
  2. Route to archivist: decide and document a reproducible T2 grey-market source strategy, or formally exclude T2 from the v0 model scope.
  3. Route to archivist: retry or source the open due D1/D5/D20/D60 price-performance tasks.
  4. Route to archivist: backfill industry labels, market cap, net proceeds, issued shares, and missing offering-term fields.
  5. Re-run analyst only after the relevant archive gaps are closed or explicitly marked out of scope for the model version being evaluated.

Audit Skill Self-Correction

The prior audit skill failed to make derived evidence a mandatory check, which allowed archived PDFs without extracted text to pass unnoticed. The skill now requires broad historical audits to reconcile:

  • PDF source_refs
  • data/snapshots/extracted_text_manifest.csv
  • data/extracted_text/ files
  • PDF hashes
  • extracted text hashes

Missing derived text or manifest mismatches are now blocker issues for historical-data completeness.