geometrybase 7cbdd533b0 Add A/H share-class mapping workflow
Request:
- Add a repeatable mechanism so HK IPO reports detect issuers that already have Mainland A shares.
- Include a third internet/official-exchange cross-check layer beyond structured history and prospectus scans.

Changes:
- Add listed_share_classes schema support for same-issuer A-share mappings and evidence links.
- Add scripts/archive_a_share_mappings.py to scan prospectus extracted text, reject sponsor/portfolio/cornerstone false positives, archive optional official web evidence and A-share/FX quote evidence, and export snapshots on write.
- Surface a_share_* fields in the analysis dataset and single-ticker report output.
- Update hk-ipo analyst/archivist skill rules and scheduled refresh prompt to require the three-layer A/H mapping check.

Verification:
- python3 -m py_compile scripts/archive_a_share_mappings.py scripts/build_analysis_dataset.py scripts/generate_ipo_report.py
- .venv/bin/python scripts/archive_a_share_mappings.py --as-of 2026-06-24T00:00:00Z --tickers 00668,01688,03661,09630 --dry-run
- .venv/bin/python scripts/build_analysis_dataset.py --db /tmp/hk_ipo_ah_dataset_test.sqlite --dataset /tmp/hk_ipo_ah_dataset_test.csv --report /tmp/hk_ipo_ah_model_test.md --as-of 2026-06-24T00:00:00Z
- .venv/bin/python scripts/generate_ipo_report.py 09630 --dataset /tmp/hk_ipo_ah_dataset_test.csv --stdout --as-of 2026-06-24T00:00:00Z
- git diff --check

Next useful context:
- Dry-run detected 00668->300866.SZ, 01688->002600.SZ, 03661->300661.SZ, and 09630->688630.SH.
- A false positive 01688->300476.SZ from a cornerstone investor parent was rejected by the issuer-context filter.
2026-06-24 07:21:21 +00:00

name: hk-ipo-archivist
description: Use for Hong Kong IPO fact archiving in this project: downloading or recording prospectuses, allotment results, listing facts, market data, source references, file hashes, SQLite updates, and CSV snapshots. Do not use for investment conclusions, subscription recommendations, score interpretation, or research memos.

HK IPO Archivist

Purpose

Maintain the project-local Hong Kong IPO evidence archive and structured fact database. This skill owns facts, sources, database updates, path hygiene, and reproducible snapshots.

It does not decide whether an IPO is worth subscribing to. Route judgment, scoring, prediction cards, review cards, and reports to hk-ipo-analyst.

Project Storage Contract

Use repo-relative paths everywhere. Never store machine-specific absolute paths.

  • Resolve the repo root at runtime, for example with git rev-parse --show-toplevel.
  • Store paths without a leading ./.
  • Store paths with POSIX separators, such as data/raw/06658/prospectus.pdf.
  • Store path_base = "repo_root" when a table needs an explicit base.
  • Store file_sha256 for archived source files whenever practical.
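
The path and hash rules above can be sketched as follows, using only the Python standard library. The helper names to_repo_relative and sha256_of_file are hypothetical and are not taken from the project's scripts; repo_root is assumed to come from git rev-parse --show-toplevel.

```python
import hashlib
from pathlib import Path, PurePosixPath


def to_repo_relative(path: Path, repo_root: Path) -> str:
    """Return a POSIX-style path relative to repo_root, with no leading './'."""
    # relative_to raises ValueError when the file lives outside the repo,
    # which prevents an absolute path from being stored by accident.
    rel = path.resolve().relative_to(repo_root.resolve())
    return PurePosixPath(*rel.parts).as_posix()


def sha256_of_file(path: Path) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The first helper produces the value to store as the relative path; the second produces the value to store as file_sha256.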

Expected project layout:

data/
  hk_ipo.sqlite
  raw/
  snapshots/
memos/
reports/
rules/
schema/
scripts/
references/

Responsibilities

  • Archive primary source files under data/raw/{ticker}/.
  • Record source references, URLs, as-of timestamps, relative paths, and hashes.
  • Update embedded SQLite tables for IPO facts.
  • Export Git-friendly CSV snapshots after database updates.
  • Maintain sync_runs, ticker_sync_state, and sync_tasks so repeated syncs know what is already archived and what remains pending.
  • Use HKEXnews annual new listing reports to seed broad recent-IPO target coverage before collecting deeper per-ticker documents.
  • Preserve raw source files; do not overwrite without first checking whether the contents changed.
  • Label missing, stale, inconsistent, or estimated fields explicitly.
  • Use audit findings to prioritize historical data gaps before expanding analysis coverage.

Boundaries

Do not write:

  • Subscription decisions.
  • Investment ratings.
  • Scoring interpretations.
  • Prediction cards.
  • Review conclusions.
  • Rule-change recommendations.

If a user asks for both data update and analysis, complete the archive/update step first, then hand the frozen as-of dataset to hk-ipo-analyst.

Workflow

  1. Inspect current repo state and recent commits before changing files.
  2. Identify the IPO ticker, company, stage, and source documents needed.
  3. Save raw source files under data/raw/{ticker}/ using descriptive names.
  4. Compute hashes for archived files.
  5. Insert or update structured facts in data/hk_ipo.sqlite.
  6. Record every source in the source reference table using repo-relative paths.
  7. Extract text for archived PDF sources with scripts/extract_pdf_text.py.
  8. Refresh sync state with scripts/update_sync_state.py after fact updates.
  9. Export key tables to data/snapshots/ for readable Git diffs.
  10. Verify path rules, required fields, hash checks, extracted text manifest, sync state, and snapshot generation.
  11. Commit only the related archive/database/snapshot changes.

Incremental Sync State

Use ticker_sync_state as the per-ticker stage ledger and sync_tasks as the next-sync queue.

Stages:

  • T0_prospectus
  • T0_5_market_heat
  • T1_allotment
  • T2_grey_market
  • D1
  • D5
  • D20
  • D60

Status values:

  • complete: required facts or source files are archived.
  • pending_not_due: the stage is expected in the future.
  • pending_due: the stage is due and should be updated on the next sync.
  • blocked: the missing data has no known resolution date or needs manual intervention.
  • not_applicable: the stage does not apply.
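
As an illustration only, the status values above could be derived from a few inputs like this. The function classify_stage and its parameters are hypothetical; the real logic lives in scripts/update_sync_state.py and may differ.

```python
from datetime import date
from typing import Optional


def classify_stage(
    *,
    applies: bool,
    has_evidence: bool,
    due_on: Optional[date],
    as_of: date,
    needs_manual: bool = False,
) -> str:
    """Map stage inputs to one of the documented status values."""
    if not applies:
        return "not_applicable"
    if has_evidence:
        return "complete"
    # No known resolution date, or manual intervention required.
    if needs_manual or due_on is None:
        return "blocked"
    if due_on > as_of:
        return "pending_not_due"
    return "pending_due"
```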

Default incremental flow:

python3 scripts/update_sync_state.py

Then update only rows in sync_tasks whose task_status is open or blocked. Do not re-download existing source files unless the upstream source changed or the stored hash no longer matches.
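
A minimal sketch of selecting the actionable queue, assuming sync_tasks has ticker and stage columns alongside task_status. Those two column names and the function name actionable_tasks are assumptions made for illustration; check schema/ for the real definition.

```python
import sqlite3


def actionable_tasks(conn: sqlite3.Connection) -> list:
    """Return (ticker, stage, task_status) rows that still need work."""
    query = (
        "SELECT ticker, stage, task_status FROM sync_tasks "
        "WHERE task_status IN ('open', 'blocked') "
        "ORDER BY ticker, stage"
    )
    return conn.execute(query).fetchall()
```

Typical use would be to open data/hk_ipo.sqlite with sqlite3.connect and iterate over the returned rows, skipping any ticker whose stored hashes still match.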

Audit-Driven Gap Closure

When hk-ipo-audit finds historical data gaps, close them in this order unless the user specifies otherwise:

  1. Integrity blockers: missing raw files, bad hashes, absolute paths, broken snapshots, or failed foreign keys.
  2. Stage blockers: open due T0_prospectus and T1_allotment tasks that prevent stage-correct analysis.
  3. Outcome blockers: due D1, D5, D20, and D60 price performance needed for feedback and review.
  4. Context fields: industry labels, market cap, net proceeds, timetable gaps, and other comparison fields.
  5. Hard-to-source signals: T2_grey_market, only after a reproducible source strategy is available.

After each gap-closure run, refresh sync_tasks, export snapshots, and report what remains open. Do not mark unavailable data complete just to reduce the queue.
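
The closure order above can be expressed as a sort key. This is a sketch under stated assumptions: each gap is a dict with hypothetical ticker, stage, and integrity_issue keys, and any gap that is neither an integrity issue nor tied to a listed stage is treated as a context-field gap.

```python
# Tiers follow the numbered order in this section (1 = close first).
STAGE_TIER = {
    "T0_prospectus": 2,
    "T1_allotment": 2,
    "D1": 3,
    "D5": 3,
    "D20": 3,
    "D60": 3,
    "T2_grey_market": 5,
}


def gap_tier(gap: dict) -> int:
    """Return the closure tier for one audit finding."""
    if gap.get("integrity_issue"):
        return 1
    # Anything not mapped to a stage is assumed to be a context field (tier 4).
    return STAGE_TIER.get(gap.get("stage"), 4)


def order_gaps(gaps: list) -> list:
    """Sort gaps by tier, then by ticker for a stable, readable queue."""
    return sorted(gaps, key=lambda gap: (gap_tier(gap), gap["ticker"]))
```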

Recent IPO Target Coverage

Use the recent IPO updater when the user asks to update a broad date range of HK IPO targets:

.venv/bin/python scripts/update_recent_ipo_list.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD --as-of YYYY-MM-DDTHH:MM:SSZ

The script discovers HKEXnews annual new listing report XLSX files, archives them under data/raw/hkex_new_listing_reports/, inserts new_listing_report_entries, updates ipo_master and missing offering_terms fields, records report-backed source_refs, exports snapshots, and refreshes sync state.

By default, exclude report rows without a numeric IPO offer price because transfers, introductions, and de-SPAC transactions are not ordinary public subscription targets.
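
The default exclusion can be pictured as a small filter. This is a hedged sketch, not the updater's actual code: the offer_price key is a hypothetical column name, and treating zero or negative prices as non-numeric is an added assumption.

```python
import math


def has_numeric_offer_price(value) -> bool:
    """True when value parses as a finite, positive price."""
    if value is None or isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return math.isfinite(value) and value > 0
    text = str(value).replace(",", "").strip()
    try:
        price = float(text)
    except ValueError:
        # Covers blanks and labels such as "N/A" used for non-priced listings.
        return False
    return math.isfinite(price) and price > 0


def ordinary_ipo_rows(rows: list, price_key: str = "offer_price") -> list:
    """Keep only report rows that carry a numeric IPO offer price."""
    return [row for row in rows if has_numeric_offer_price(row.get(price_key))]
```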

HKEX Document Backfill

Use the document archiver to fill detailed T0/T1 facts from official HKEXnews documents:

.venv/bin/python scripts/archive_hkex_documents.py --as-of YYYY-MM-DDTHH:MM:SSZ

The script resolves HKEXnews stock IDs, archives prospectus and allotment-results documents under data/raw/{ticker}/, updates source_refs, parses high-confidence fields into ipo_master, offering_terms, and ipo_demand, exports snapshots, and refreshes sync state.

The document archiver should use HKEXnews date-window title search around the IPO timetable, not only the latest title-search page. IPO documents for active listed companies are often buried behind later post-listing announcements. Treat official HKEXnews .pdf, .htm, and .html allotment-result notices as valid archived sources; parse structured demand facts only where parser coverage is reliable.

PDF text extraction is a standard HKEX document post-processing step. scripts/archive_hkex_documents.py extracts text for newly archived PDFs by default after source references are written:

.venv/bin/python scripts/extract_pdf_text.py

The extractor is incremental: unchanged PDFs with matching manifest rows are skipped, and data/snapshots/extracted_text_manifest.csv is preserved and updated. Use --force only when parser behavior changes and derived text should be regenerated.

Do not expect data/extracted_text/ entries for Yahoo JSON market data or HKEX .htm/.html notices. Those are already text-like raw evidence files and are tracked under data/raw/.

A/H Share-Class Mapping Archive

When current or recently closed IPO candidates may already have Mainland A Shares or another onshore listed share class, refresh the structured mapping archive before analyst reports consume the dataset:

.venv/bin/python scripts/archive_a_share_mappings.py --as-of YYYY-MM-DDTHH:MM:SSZ --web-cross-check --archive-quotes

The archive uses three evidence layers:

  1. Structured prior mappings in listed_share_classes and data/snapshots/listed_share_classes.csv.
  2. Issuer-context prospectus text scans from data/snapshots/extracted_text_manifest.csv.
  3. Internet search or supported official exchange pages as public cross-check evidence.

Prospectus text remains the primary source for same-issuer identity. Public web evidence supports the mapping when it is reproducible and can be archived as a source_refs row. If a web cross-check is not supported for a market or exchange page, leave web_source_id blank and make the analyst report state the web-evidence gap rather than inventing a source.
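
To make the issuer-context idea concrete, here is a deliberately simple sketch. It is not the logic of scripts/archive_a_share_mappings.py: the keyword lists, the context window, and the function names are all assumptions, and the exchange inference covers only Shanghai (codes starting with 6) and Shenzhen (codes starting with 0 or 3).

```python
import re
from typing import Optional

CODE_RE = re.compile(r"(?<!\d)(\d{6})(?!\d)")
# Hypothetical hint lists; a real filter would need a reviewed vocabulary.
REJECT_HINTS = ("cornerstone", "sponsor", "portfolio compan", "comparable", "shareholder")
ISSUER_HINTS = ("our a shares", "a shares of our company", "a shares of the company")


def exchange_suffix(code: str) -> Optional[str]:
    """Infer SH or SZ from the leading digit; None for anything else."""
    if code.startswith("6"):
        return "SH"
    if code.startswith(("0", "3")):
        return "SZ"
    return None


def candidate_mappings(text: str, window: int = 120) -> list:
    """Return A-share codes whose nearby text points at the issuer itself."""
    results = []
    for match in CODE_RE.finditer(text):
        suffix = exchange_suffix(match.group(1))
        if suffix is None:
            continue
        start = max(0, match.start() - window)
        context = text[start : match.end() + window].lower()
        # Reject codes that sit next to third-party language.
        if any(hint in context for hint in REJECT_HINTS):
            continue
        if not any(hint in context for hint in ISSUER_HINTS):
            continue
        mapping = f"{match.group(1)}.{suffix}"
        if mapping not in results:
            results.append(mapping)
    return results
```

A keyword window like this errs toward rejecting ambiguous mentions, which matches the rule that a missing mapping is better than a wrong one.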

For detected same-issuer A-share mappings, archive recent A-share quote data and HKD/CNY FX evidence when --archive-quotes is used so the analyst can compute A/H discount or premium without relying on stale manual prices. Do not use sponsor, portfolio-company, shareholder, or comparable-company stock codes as same-issuer mappings.
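
The A/H comparison itself is simple arithmetic once the quotes and FX rate are archived. In this sketch the function name and the quoting convention are assumptions: cny_per_hkd means how many CNY one HKD buys, so an archived rate quoted the other way round must be inverted first.

```python
def h_premium_over_a(h_price_hkd: float, a_price_cny: float, cny_per_hkd: float) -> float:
    """Return the H-share premium over the A share as a fraction.

    A negative result is an H-share discount, for example -0.25 means the
    H share trades 25% below the A share after currency conversion.
    """
    if min(h_price_hkd, a_price_cny, cny_per_hkd) <= 0:
        raise ValueError("prices and FX rate must be positive")
    h_price_cny = h_price_hkd * cny_per_hkd
    return h_price_cny / a_price_cny - 1.0
```

Both prices should share the same as-of date as the FX evidence; otherwise the result mixes stale and fresh inputs.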

T1 Demand Text Backfill

When audit finds T1 rows where an allotment-results source is archived but ipo_demand is missing, use the text backfill script:

.venv/bin/python scripts/backfill_t1_demand_from_text.py --as-of YYYY-MM-DDTHH:MM:SSZ

The script is incremental. It selects only T1_allotment rows that are complete from source evidence but have no ipo_demand row. It parses archived PDF extracted text, follows old HKEX HTML allotment-result pages to their linked Summary PDFs, archives those PDFs, extracts their text, writes ipo_demand, exports snapshots, and refreshes sync state only when facts or sources changed.

Do not infer missing demand fields. If a Summary PDF gives valid applications and public subscription but omits successful applicants or international subscription level, store the available fields and leave unavailable fields null.
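
One way to picture the null-handling rule: build the row from whatever the parser found and leave everything else as None. The field names below are hypothetical stand-ins for the real ipo_demand columns.

```python
# Hypothetical field names; the real ipo_demand columns may differ.
DEMAND_FIELDS = (
    "valid_applications",
    "public_subscription_multiple",
    "successful_applicants",
    "international_subscription_multiple",
)


def build_demand_row(ticker: str, parsed: dict) -> dict:
    """Copy parsed fields and keep unavailable ones as None (SQL NULL)."""
    row = {"ticker": ticker}
    for field in DEMAND_FIELDS:
        # dict.get returns None for fields the Summary PDF omitted.
        row[field] = parsed.get(field)
    return row
```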

T0.5 Market Heat Archive

When the user asks to include subscription-period heat before official T1 allotment results, archive a reproducible market-heat snapshot:

.venv/bin/python scripts/archive_t0_5_market_heat.py --as-of YYYY-MM-DDTHH:MM:SSZ --tickers 01392,06067

When the user can still place, amend, or cancel an order near the subscription cutoff, archive the late actionable snapshot as T0_95_final_heat:

.venv/bin/python scripts/archive_t0_5_market_heat.py --stage T0_95_final_heat --as-of YYYY-MM-DDTHH:MM:SSZ --tickers 01392,06067

The script stores the raw page under data/raw/market_heat/, records per-ticker source_refs, writes structured rows to ipo_market_heat, exports data/snapshots/ipo_market_heat.csv, and refreshes sync state. The default source_type is t0_5_market_heat.

For T0.95 runs, source_type is t0_95_final_heat and ipo_market_heat.stage is T0_95_final_heat. Use this stage only when the snapshot was observed before the user's actual executable order cutoff; otherwise store it as ordinary T0_5_market_heat or post-hoc research evidence.
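
The cutoff rule can be sketched as a guard applied before rows are written. The function resolve_heat_stage is hypothetical, and falling back to T0_5_market_heat (rather than labelling the snapshot post-hoc research evidence) is one of the two options the rule allows.

```python
from datetime import datetime


def resolve_heat_stage(requested_stage: str, observed_at: datetime, order_cutoff: datetime) -> str:
    """Downgrade a T0.95 request when the snapshot was not actionable."""
    if observed_at.tzinfo is None or order_cutoff.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    if requested_stage == "T0_95_final_heat" and observed_at >= order_cutoff:
        # Observed at or after the executable cutoff: not a final actionable snapshot.
        return "T0_5_market_heat"
    return requested_stage
```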

Market-heat data is non-official and live. It may include broker-aggregated margin subscription multiples or similar estimates. Never store it as ipo_demand, never treat it as final HKEX subscription data, and always preserve provider, source URL, raw path, observed_at, and the intended decision stage.

External IPO History Archive

When a historical third-party table is useful for coverage checks or calibration research, archive it separately from official HKEX data:

.venv/bin/python scripts/archive_ipohk_history.py --as-of YYYY-MM-DDTHH:MM:SSZ

The ipohk archive stores raw JSON under data/raw/external_history/, writes structured rows to external_ipo_history, and exports data/snapshots/external_ipo_history.csv.

Treat this as external historical context. Fields such as final oversubscription, one-lot win rate, grey-market return, and first-day return are not T0.5 margin snapshots and must not be backfilled into ipo_market_heat.

Grey-Market Source Policy

T2_grey_market is not an HKEX official disclosure stage. Grey-market trading is broker or third-party OTC activity, so do not bulk archive a grey-market feed unless the source is reproducible and redistribution-safe.

Accept a T2 source only when one of these conditions is met:

  • A licensed vendor or broker export is provided for this project and may be stored in Git.
  • A user-provided evidence file is added under data/raw/{ticker}/ with clear source notes.
  • A public historical source has stable ticker/date records and clear reuse terms.

Until one of those conditions is met, mark due T2 tasks as blocked data gaps instead of repeatedly leaving them as open sync failures:

.venv/bin/python scripts/mark_grey_market_gaps.py --as-of YYYY-MM-DDTHH:MM:SSZ

Do not mark T2 complete from screenshots, unsourced forum posts, or proprietary pages whose terms prohibit copying or redistribution.

Price Performance Backfill

Use the price-performance archiver to fill due D1, D5, D20, and D60 review checkpoints:

.venv/bin/python scripts/archive_price_performance.py --as-of YYYY-MM-DDTHH:MM:SSZ

The script archives one raw market-data response per ticker under data/raw/{ticker}/, records it in source_refs, writes structured rows into price_performance, exports snapshots, and refreshes sync state. Checkpoints use the configured calendar due date and the next available trading day in the archived market data.
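
A small sketch of the checkpoint lookup, assuming "next available trading day" means the first archived trading date on or after the calendar due date. The function name is hypothetical and the due date is taken as given, since how it is configured is not shown here.

```python
from bisect import bisect_left
from datetime import date
from typing import Iterable, Optional


def checkpoint_trading_day(due_date: date, trading_days: Iterable[date]) -> Optional[date]:
    """Return the first trading day on or after due_date, or None if not yet available."""
    days = sorted(trading_days)
    index = bisect_left(days, due_date)
    # index == len(days) means the archived data ends before the due date.
    return days[index] if index < len(days) else None
```

Returning None keeps the checkpoint pending instead of silently using an earlier price.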

Quality Checks

Before finishing, confirm:

  • No stored local path is absolute.
  • No stored local path starts with ./.
  • Raw files referenced by the database exist.
  • Source hashes match current file contents.
  • Extracted text exists or has a manifest status for archived PDF source references.
  • CSV snapshots reflect the database update.
  • sync_tasks reflects only missing or future work, not completed stages.
  • Any unavailable field is marked as a data gap rather than invented.
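
The first two checks can be automated with a small validator. This is a sketch with a hypothetical function name; it flags Windows-style absolute paths and backslashes as well, in line with the POSIX-separator rule in the storage contract.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_rule_violations(stored_path: str) -> list:
    """Return a list of storage-contract problems for one stored path."""
    problems = []
    if stored_path.startswith("./"):
        problems.append("leading ./")
    if "\\" in stored_path:
        problems.append("non-POSIX separator")
    # Check both flavours so /home/... and C:\... are each caught.
    if PurePosixPath(stored_path).is_absolute() or PureWindowsPath(stored_path).is_absolute():
        problems.append("absolute path")
    return problems
```

An empty list means the path passes these two rules; file existence and hash comparison still need separate checks against data/raw/.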