Tighten historical data audit coverage

Request:
- Use the audit skill to check historical data completeness and self-correct the audit criteria after the missed PDF extracted-text gap.

Changes:
- Add a mandatory derived-evidence checklist to the audit skill.
- Require broad historical audits to reconcile PDF source_refs, extracted text files, manifest rows, and hashes.
- Add a historical data completeness audit report for the current archive.

Findings:
- Source integrity and PDF extracted-text completeness now pass.
- Full historical completeness still fails due to incomplete structured T1 demand, unresolved T2 grey-market data, open due price-performance tasks, and missing context fields.

Verification:
- Ran SQLite integrity, foreign-key, source hash, snapshot, PDF manifest, extracted-text hash, stage coverage, and analysis-dataset checks.
- Ran scripts/extract_pdf_text.py and confirmed 557 PDF sources were skipped unchanged with 557 manifest rows.
- Ran git diff --check.
2026-06-15 13:43:22 +00:00
parent 8a0dfd88f0
commit 33d0bc056e
2 changed files with 164 additions and 4 deletions
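
A minimal sketch of the SQLite integrity and foreign-key checks named under Verification, assuming Python's standard `sqlite3` module. The helper name `check_sqlite` is hypothetical and is not part of this commit; only the database path `data/hk_ipo.sqlite` comes from the text, and the two PRAGMA statements are standard SQLite.

```python
import sqlite3


def check_sqlite(conn: sqlite3.Connection) -> dict:
    """Run the two structural checks and summarize the result.

    Hypothetical helper for illustration; not the script used in this commit.
    """
    # PRAGMA integrity_check returns a single row containing 'ok' when the
    # database file is structurally sound, otherwise one row per problem.
    integrity_rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]

    # PRAGMA foreign_key_check returns one row per violating child row.
    # It reports violations even when foreign-key enforcement is off.
    fk_rows = conn.execute("PRAGMA foreign_key_check").fetchall()

    return {
        "integrity_ok": integrity_rows == ["ok"],
        "fk_violations": len(fk_rows),
    }
```

Under these assumptions, `check_sqlite(sqlite3.connect("data/hk_ipo.sqlite"))` should agree with the report's Data Integrity section: an `ok` integrity result and 0 foreign-key violations.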
@@ -26,6 +26,8 @@ Separate three standards:

 Do not treat a filled field as sufficient evidence by itself. A conclusion is only audit-ready when the source, stage, assumption, and reasoning chain can be followed.

+Treat derived evidence as first-class audit material, not an optional convenience. If an archived raw source is expected to generate a reusable derived artifact, the audit must reconcile the raw source, derived artifact, manifest row, and hashes.
+
 ## Stage Data Checklist

 Use the stage being audited to decide what must exist:
@@ -37,16 +39,28 @@ Use the stage being audited to decide what must exist:

 For broad historical or cross-IPO work, also check that the sample definition, inclusion/exclusion rules, and date range are explicit.

+## Derived Evidence Checklist
+
+For broad historical audits and any analysis-readiness audit:
+
+- Every archived PDF in `source_refs` must have exactly one row in `data/snapshots/extracted_text_manifest.csv`.
+- Every extracted-text manifest row must point back to an existing PDF `source_id`.
+- `pdf_sha256` in the manifest must match `source_refs.file_sha256`.
+- `text_local_path` must be repo-relative, must exist, and must match `text_sha256`.
+- Manifest extraction status must be reviewed. `error`, missing text, missing manifest rows, orphan manifest rows, hash mismatches, or non-repo-relative paths are `blocker` issues for historical-data completeness.
+- HKEX `.htm`/`.html` notices and Yahoo JSON files are raw text-like evidence under `data/raw/`; do not require `data/extracted_text/` rows for them.
+
 ## Data Audit Workflow

 1. Inspect current repo state and recent commits before auditing.
 2. Identify the ticker, report, rule version, stage, and data-as-of timestamp being audited.
 3. Load the relevant archived facts from `data/hk_ipo.sqlite`, CSV snapshots, raw source paths, memo/report files, and rule files.
 4. Check `source_refs` for repo-relative `local_path` values, existing files, and matching `file_sha256` values when present.
-5. Compare database row counts with `data/snapshots/` exports for tables used by the audit.
-6. Review `ticker_sync_state` and `sync_tasks` for the target ticker or sample. Treat open due tasks as possible blockers.
-7. Mark each required stage fact as `present`, `missing`, `stale`, `estimated`, `inferred`, or `not_applicable`.
-8. Decide whether remaining gaps are blocking or non-blocking for the specific conclusion being audited.
+5. Reconcile derived artifacts, especially PDF extracted text, against their manifests and source hashes.
+6. Compare database row counts with `data/snapshots/` exports for tables used by the audit.
+7. Review `ticker_sync_state` and `sync_tasks` for the target ticker or sample. Treat open due tasks as possible blockers.
+8. Mark each required stage fact as `present`, `missing`, `stale`, `estimated`, `inferred`, or `not_applicable`.
+9. Decide whether remaining gaps are blocking or non-blocking for the specific conclusion being audited.

 ## Logic Audit Workflow

@@ -111,6 +125,7 @@ Before finishing, confirm:

 - The audit target and stage are explicit.
 - Data completeness and data sufficiency are judged separately.
+- PDF source references are reconciled to extracted text and manifest hashes when auditing historical or analysis-ready data.
 - Missing facts are not converted into assumptions without labels.
 - Later facts are not used to validate earlier predictions.
 - Any pass/fail result names the evidence that supports it.
@@ -0,0 +1,145 @@
+# Historical Data Completeness Audit
+
+Audit date: 2026-06-15
+Audit target: Full historical Hong Kong IPO archive in `data/hk_ipo.sqlite`
+Stage scope: T0, T1, T2, D1, D5, D20, D60, source integrity, PDF extracted text, and analysis-dataset readiness
+Audit status: fail for full historical completeness; pass for source integrity and PDF-derived evidence completeness
+
+## Summary
+
+The archive is internally consistent at the raw-source and snapshot level, and the previous PDF extracted-text blind spot is now closed. The current archive is not yet complete enough to call the historical dataset fully analysis-ready across all intended stages.
+
+Main blockers:
+
+- T1 structured demand is incomplete: 137 tickers have allotment-results sources archived but no parsed `ipo_demand` row.
+- T2 grey-market data remains unresolved: 291 listed tickers are blocked pending a reproducible source strategy.
+- Due post-listing performance gaps remain: 18 D1, 18 D5, 18 D20, and 17 D60 tasks are open.
+
+Major non-blocking gaps:
+
+- Industry labels are mostly missing: 285 of 293 `ipo_master` rows are blank.
+- Market-cap fields are mostly missing: 285 of 293 `offering_terms` rows are blank.
+- Net proceeds are missing for 108 of 293 `offering_terms` rows.
+
+## Data Integrity
+
+Pass.
+
+- SQLite `PRAGMA integrity_check`: `ok`.
+- SQLite foreign-key check: 0 violations.
+- DB row counts match CSV snapshots for `ipo_master`, `offering_terms`, `ipo_demand`, `price_performance`, `new_listing_report_entries`, `source_refs`, `data_gaps`, `sync_runs`, `ticker_sync_state`, and `sync_tasks`.
+- `source_refs`: 1,149 rows.
+- Source path and file checks: 0 bad paths, 0 missing files, 0 null hashes, 0 hash mismatches.
+- Source file extensions: 557 `.pdf`, 29 `.htm`, 273 `.json`, 290 `.xlsx`.
+
+## PDF Derived Evidence
+
+Pass.
+
+- PDF `source_refs`: 557.
+- Extracted-text manifest rows: 557.
+- Manifest status counts: 557 `ok`.
+- Duplicate manifest source IDs: 0.
+- Missing manifest rows for PDFs: 0.
+- Orphan manifest rows: 0.
+- Bad manifest PDF/text paths: 0.
+- Missing extracted-text files: 0.
+- PDF hash mismatches: 0.
+- Text hash mismatches: 0.
+- Zero or empty extractions: 0.
+- Incremental extractor rerun result: 557 processed, 557 skipped unchanged, 557 manifest rows.
+
+This check is now required in `.codex/skills/audit/SKILL.md` so future broad historical audits must reconcile PDF raw sources, extracted text, manifest rows, and hashes.
+
+## Stage Completeness
+
+IPO master sample:
+
+- Total IPO rows: 293.
+- Status counts: 291 listed, 1 open for subscription, 1 pending listing.
+
+T0 prospectus:
+
+- `T0_prospectus`: 293 complete.
+- Offering terms rows: 293.
+- No IPO is missing an `offering_terms` row.
+- Offer price is present for all 293 rows.
+
+T1 allotment:
+
+- `T1_allotment`: 291 complete, 2 pending not due.
+- Pending-not-due tickers: `06106` due 2026-06-23, `06675` due 2026-06-16.
+- Allotment result sources: 291 total, including 262 PDF and 29 HTM sources.
+- `ipo_demand`: 154 rows across 154 tickers.
+- T1 complete but no structured demand row: 137 tickers.
+- Source-only T1 gaps by source format: 108 PDF, 29 HTM.
+- Demand rows with missing fields: public oversubscription 10, valid applications 6, successful applications 7, international oversubscription 1.
+
+T2 grey market:
+
+- `T2_grey_market`: 291 blocked, 2 pending not due.
+- `data_gaps`: 291 grey-market gaps.
+- This remains a blocking completeness issue unless T2 is explicitly excluded from a model or analysis stage.
+
+Post-listing performance:
+
+- D1: 273 complete, 18 pending due, 2 pending not due.
+- D5: 272 complete, 18 pending due, 3 pending not due.
+- D20: 267 complete, 18 pending due, 8 pending not due.
+- D60: 251 complete, 17 pending due, 25 pending not due.
+- Existing `price_performance` rows have 0 missing core price fields.
+
+## Field Sufficiency
+
+Major gaps remain for richer qualitative and valuation analysis:
+
+- `ipo_master.industry_label` missing: 285 of 293.
+- `ipo_master.application_start_date` missing: 45 of 293.
+- `ipo_master.application_end_date` missing: 1 of 293.
+- `ipo_master.allotment_results_expected_date` missing: 31 of 293.
+- `offering_terms.board_lot` missing: 18 of 293.
+- `offering_terms.min_subscription_amount_hkd` missing: 19 of 293.
+- `offering_terms.global_offer_shares` missing: 11 of 293.
+- `offering_terms.hk_offer_shares_initial` missing: 17 of 293.
+- `offering_terms.international_offer_shares_initial` missing: 33 of 293.
+- `offering_terms.public_offer_pct_initial` missing: 23 of 293.
+- `offering_terms.market_cap_hkd_m` missing: 285 of 293.
+- `offering_terms.net_proceeds_hkd_m` missing: 108 of 293.
+- `offering_terms.issued_shares_upon_listing` missing: 45 of 293.
+
+## Analysis Dataset Readiness
+
+The v0 analysis dataset is structurally usable but not fully sufficient for T1/T2-aware feedback.
+
+- Dataset rows: 293.
+- Missing `total_score`: 0.
+- Missing prospectus source path: 0.
+- Missing allotment source path: 2, matching the two pending-not-due T1 tickers.
+- `has_structured_t1`: 154 yes, 139 no.
+- Blank `t1_add_score`: 139.
+- Blank `total_score_bucket`: 139.
+- Missing D1 return: 20.
+- Missing industry label: 285.
+- Missing market cap: 285.
+
+Decision bands are populated for all rows, but many rows are effectively T0-only or T0-plus-partial-context because structured T1, T2, industry, and valuation fields are incomplete.
+
+## Required Fixes
+
+1. Route to `archivist`: use extracted PDF text and HTM raw notices to parse missing T1 demand facts for the 137 source-only allotment rows.
+2. Route to `archivist`: decide and document a reproducible T2 grey-market source strategy, or formally exclude T2 from the v0 model scope.
+3. Route to `archivist`: retry or source the open due D1/D5/D20/D60 price-performance tasks.
+4. Route to `archivist`: backfill industry labels, market cap, net proceeds, issued shares, and missing offering-term fields.
+5. Re-run `analyst` only after the relevant archive gaps are closed or explicitly marked out of scope for the model version being evaluated.
+
+## Audit Skill Self-Correction
+
+The prior audit skill failed to make derived evidence a mandatory check, which allowed archived PDFs without extracted text to pass unnoticed. The skill now requires broad historical audits to reconcile:
+
+- PDF `source_refs`
+- `data/snapshots/extracted_text_manifest.csv`
+- `data/extracted_text/` files
+- PDF hashes
+- extracted text hashes
+
+Missing derived text or manifest mismatches are now blocker issues for historical-data completeness.
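
The reconciliation required by the Derived Evidence Checklist can be sketched as follows. This is an illustration under stated assumptions, not the code used for this audit. The function name `reconcile_pdf_manifest`, the `status` column name, the `source_id` key on `source_refs` rows, and the issue labels are all hypothetical. The fields `pdf_sha256`, `file_sha256`, `text_local_path`, and `text_sha256` are the ones named in the checklist.

```python
import hashlib
from pathlib import Path, PurePosixPath


def sha256_file(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def reconcile_pdf_manifest(pdf_sources, manifest_rows, repo_root):
    """Return a list of (source_id, issue) tuples; an empty list means pass.

    pdf_sources:   dicts with 'source_id' and 'file_sha256' (PDF rows only).
    manifest_rows: dicts with 'source_id', 'pdf_sha256', 'text_local_path',
                   'text_sha256', and an assumed 'status' column.
    """
    issues = []
    sources = {src["source_id"]: src for src in pdf_sources}
    seen = set()

    for row in manifest_rows:
        sid = row["source_id"]
        if sid in seen:
            issues.append((sid, "duplicate_manifest_row"))
            continue
        seen.add(sid)

        src = sources.get(sid)
        if src is None:
            # Manifest row that points at no archived PDF.
            issues.append((sid, "orphan_manifest_row"))
            continue

        if row["status"] != "ok":
            issues.append((sid, "extraction_status_not_ok"))
        if row["pdf_sha256"] != src["file_sha256"]:
            issues.append((sid, "pdf_hash_mismatch"))

        rel = PurePosixPath(row["text_local_path"])
        if rel.is_absolute() or ".." in rel.parts:
            issues.append((sid, "path_not_repo_relative"))
            continue

        text_path = Path(repo_root) / rel
        if not text_path.is_file():
            issues.append((sid, "missing_text_file"))
        elif sha256_file(text_path) != row["text_sha256"]:
            issues.append((sid, "text_hash_mismatch"))

    # Archived PDFs that never received a manifest row.
    for sid in sorted(sources.keys() - seen):
        issues.append((sid, "missing_manifest_row"))

    return issues
```

In this sketch every returned issue corresponds to a blocker listed in the checklist, so a pass like the one reported above would mean an empty list for all 557 PDF sources.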