7cbdd533b0
Request:
- Add a repeatable mechanism so HK IPO reports detect issuers that already have Mainland A shares.
- Include a third internet/official-exchange cross-check layer beyond structured history and prospectus scans.

Changes:
- Add listed_share_classes schema support for same-issuer A-share mappings and evidence links.
- Add scripts/archive_a_share_mappings.py to scan prospectus extracted text, reject sponsor/portfolio/cornerstone false positives, archive optional official web evidence and A-share/FX quote evidence, and export snapshots on write.
- Surface a_share_* fields in the analysis dataset and single-ticker report output.
- Update hk-ipo analyst/archivist skill rules and scheduled refresh prompt to require the three-layer A/H mapping check.

Verification:
- python3 -m py_compile scripts/archive_a_share_mappings.py scripts/build_analysis_dataset.py scripts/generate_ipo_report.py
- .venv/bin/python scripts/archive_a_share_mappings.py --as-of 2026-06-24T00:00:00Z --tickers 00668,01688,03661,09630 --dry-run
- .venv/bin/python scripts/build_analysis_dataset.py --db /tmp/hk_ipo_ah_dataset_test.sqlite --dataset /tmp/hk_ipo_ah_dataset_test.csv --report /tmp/hk_ipo_ah_model_test.md --as-of 2026-06-24T00:00:00Z
- .venv/bin/python scripts/generate_ipo_report.py 09630 --dataset /tmp/hk_ipo_ah_dataset_test.csv --stdout --as-of 2026-06-24T00:00:00Z
- git diff --check

Next useful context:
- Dry-run detected 00668->300866.SZ, 01688->002600.SZ, 03661->300661.SZ, and 09630->688630.SH.
- A false positive 01688->300476.SZ from a cornerstone investor parent was rejected by the issuer-context filter.
262 lines
13 KiB
Markdown
---
name: hk-ipo-archivist
description: >-
  Use for Hong Kong IPO fact archiving in this project: downloading or recording
  prospectuses, allotment results, listing facts, market data, source
  references, file hashes, SQLite updates, and CSV snapshots. Do not use for
  investment conclusions, subscription recommendations, score interpretation,
  or research memos.
---

# HK IPO Archivist

## Purpose

Maintain the project-local Hong Kong IPO evidence archive and structured fact database. This skill owns facts, sources, database updates, path hygiene, and reproducible snapshots.

It does not decide whether an IPO is worth subscribing for. Route judgment, scoring, prediction cards, review cards, and reports to `hk-ipo-analyst`.

## Project Storage Contract

Use repo-relative paths everywhere. Never store machine-specific absolute paths.

- Resolve the repo root at runtime, for example with `git rev-parse --show-toplevel`.
- Store paths without a leading `./`.
- Store paths with POSIX separators, such as `data/raw/06658/prospectus.pdf`.
- Store `path_base = "repo_root"` when a table needs an explicit base.
- Store `file_sha256` for archived source files whenever practical.

Expected project layout:

```text
data/
  hk_ipo.sqlite
  raw/
  snapshots/
memos/
reports/
rules/
schema/
scripts/
references/
```

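As a minimal sketch of the storage contract above, the two helpers below normalize a file path to the repo-relative POSIX form and compute a `file_sha256` value. The function names `to_repo_relative` and `file_sha256` are illustrative assumptions, not the names used by the project scripts.

```python
import hashlib
from pathlib import Path


def to_repo_relative(path: Path, repo_root: Path) -> str:
    """Return a POSIX-style path relative to repo_root, with no leading './'.

    Hypothetical helper: raises ValueError when the file is outside the repo.
    """
    relative = path.resolve().relative_to(repo_root.resolve())
    return relative.as_posix()


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

In practice the repo root would come from `git rev-parse --show-toplevel`, and both values would be stored alongside the source reference.
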
## Responsibilities

- Archive primary source files under `data/raw/{ticker}/`.
- Record source references, URLs, as-of timestamps, relative paths, and hashes.
- Update embedded SQLite tables for IPO facts.
- Export Git-friendly CSV snapshots after database updates.
- Maintain `sync_runs`, `ticker_sync_state`, and `sync_tasks` so repeated syncs know what is already archived and what remains pending.
- Use HKEXnews annual new listing reports to seed broad recent-IPO target coverage before collecting deeper per-ticker documents.
- Preserve raw source files; do not overwrite without first checking whether the contents changed.
- Label missing, stale, inconsistent, or estimated fields explicitly.
- Use audit findings to prioritize historical data gaps before expanding analysis coverage.

## Boundaries

Do not write:

- Subscription decisions.
- Investment ratings.
- Scoring interpretations.
- Prediction cards.
- Review conclusions.
- Rule-change recommendations.

If a user asks for both data update and analysis, complete the archive/update step first, then hand the frozen as-of dataset to `hk-ipo-analyst`.

## Workflow

1. Inspect current repo state and recent commits before changing files.
2. Identify the IPO ticker, company, stage, and source documents needed.
3. Save raw source files under `data/raw/{ticker}/` using descriptive names.
4. Compute hashes for archived files.
5. Insert or update structured facts in `data/hk_ipo.sqlite`.
6. Record every source in the source reference table using repo-relative paths.
7. Extract text for archived PDF sources with `scripts/extract_pdf_text.py`.
8. Refresh sync state with `scripts/update_sync_state.py` after fact updates.
9. Export key tables to `data/snapshots/` for readable Git diffs.
10. Verify path rules, required fields, hash checks, extracted text manifest, sync state, and snapshot generation.
11. Commit only the related archive/database/snapshot changes.

## Incremental Sync State

Use `ticker_sync_state` as the per-ticker stage ledger and `sync_tasks` as the next-sync queue.

Stages:

- `T0_prospectus`
- `T0_5_market_heat`
- `T1_allotment`
- `T2_grey_market`
- `D1`
- `D5`
- `D20`
- `D60`

Status values:

- `complete`: required facts or source files are archived.
- `pending_not_due`: the stage is expected in the future.
- `pending_due`: the stage is due and should be updated on the next sync.
- `blocked`: the missing data has no known resolution date or needs manual intervention.
- `not_applicable`: the stage does not apply.

Default incremental flow:

```bash
python3 scripts/update_sync_state.py
```

Then update only rows in `sync_tasks` whose `task_status` is `open` or `blocked`. Do not re-download existing source files unless the upstream source changed or the stored hash no longer matches.

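The status values above can be read as a small decision rule. The sketch below is one possible reading, not the logic of `scripts/update_sync_state.py`: the function name `stage_status` and its parameters are assumptions, and a missing due date is treated as `blocked` because no resolution date is known.

```python
from datetime import datetime, timezone
from typing import Optional


def stage_status(
    is_archived: bool,
    due_at: Optional[datetime],
    as_of: datetime,
    applicable: bool = True,
) -> str:
    """Map one stage's facts to a status value (hypothetical helper)."""
    if not applicable:
        return "not_applicable"
    if is_archived:
        return "complete"
    if due_at is None:
        # Assumption: no known due date means the gap needs manual attention.
        return "blocked"
    return "pending_due" if due_at <= as_of else "pending_not_due"


if __name__ == "__main__":
    as_of = datetime(2026, 6, 24, tzinfo=timezone.utc)
    due = datetime(2026, 6, 1, tzinfo=timezone.utc)
    print(stage_status(False, due, as_of))
```
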
## Audit-Driven Gap Closure

When `hk-ipo-audit` finds historical data gaps, close them in this order unless the user specifies otherwise:

1. Integrity blockers: missing raw files, bad hashes, absolute paths, broken snapshots, or failed foreign keys.
2. Stage blockers: open due `T0_prospectus` and `T1_allotment` tasks that prevent stage-correct analysis.
3. Outcome blockers: due `D1`, `D5`, `D20`, and `D60` price performance needed for feedback and review.
4. Context fields: industry labels, market cap, net proceeds, timetable gaps, and other comparison fields.
5. Hard-to-source signals: `T2_grey_market`, only after a reproducible source strategy is available.

After each gap-closure run, refresh `sync_tasks`, export snapshots, and report what remains open. Do not mark unavailable data complete just to reduce the queue.

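The closure order above amounts to a stable sort on a fixed priority. A minimal sketch follows; the category labels and the `category` key are hypothetical names chosen for illustration, since the audit output format is not described here.

```python
from typing import Dict, List

# Hypothetical category labels, listed in the closure order described above.
GAP_PRIORITY: Dict[str, int] = {
    "integrity": 0,
    "stage": 1,
    "outcome": 2,
    "context": 3,
    "hard_to_source": 4,
}


def order_gaps(gaps: List[dict]) -> List[dict]:
    """Sort audit gaps by closure priority; ties keep their original order."""
    return sorted(gaps, key=lambda gap: GAP_PRIORITY[gap["category"]])
```

Because `sorted` is stable, gaps within the same category stay in the order the audit reported them.
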
## Recent IPO Target Coverage

Use the recent IPO updater when the user asks to update a broad date range of HK IPO targets:

```bash
.venv/bin/python scripts/update_recent_ipo_list.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD --as-of YYYY-MM-DDTHH:MM:SSZ
```

The script discovers HKEXnews annual new listing report XLSX files, archives them under `data/raw/hkex_new_listing_reports/`, inserts `new_listing_report_entries`, updates `ipo_master` and missing `offering_terms` fields, records report-backed `source_refs`, exports snapshots, and refreshes sync state.

By default, exclude report rows without a numeric IPO offer price because transfers, introductions, and de-SPAC transactions are not ordinary public subscription targets.

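The default exclusion rule can be sketched as a numeric-price filter. This is an illustration under assumptions: the `offer_price` key and both function names are hypothetical, and the real updater may parse report cells differently.

```python
import math
from typing import Iterable, List, Optional


def parse_offer_price(raw: object) -> Optional[float]:
    """Return a positive finite price, or None when the cell is not numeric."""
    if raw is None or isinstance(raw, bool):
        return None
    try:
        value = float(str(raw).replace(",", "").strip())
    except ValueError:
        return None
    return value if math.isfinite(value) and value > 0 else None


def subscription_targets(rows: Iterable[dict]) -> List[dict]:
    """Keep only report rows that carry a numeric IPO offer price."""
    return [row for row in rows if parse_offer_price(row.get("offer_price")) is not None]
```
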
## HKEX Document Backfill

Use the document archiver to fill detailed T0/T1 facts from official HKEXnews documents:

```bash
.venv/bin/python scripts/archive_hkex_documents.py --as-of YYYY-MM-DDTHH:MM:SSZ
```

The script resolves HKEXnews stock IDs, archives prospectus and allotment-results documents under `data/raw/{ticker}/`, updates `source_refs`, parses high-confidence fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, and refreshes sync state.

The document archiver should use HKEXnews date-window title search around the IPO timetable, not only the latest title-search page. IPO documents for active listed companies are often buried behind later post-listing announcements. Treat official HKEXnews `.pdf`, `.htm`, and `.html` allotment-result notices as valid archived sources; parse structured demand facts only where parser coverage is reliable.

PDF text extraction is a standard HKEX document post-processing step. `scripts/archive_hkex_documents.py` extracts text for newly archived PDFs by default after source references are written:

```bash
.venv/bin/python scripts/extract_pdf_text.py
```

The extractor is incremental: unchanged PDFs with matching manifest rows are skipped, and `data/snapshots/extracted_text_manifest.csv` is preserved and updated. Use `--force` only when parser behavior changes and derived text should be regenerated.

Do not expect `data/extracted_text/` entries for Yahoo JSON market data or HKEX `.htm`/`.html` notices. Those are already text-like raw evidence files and are tracked under `data/raw/`.

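The incremental behavior of the extractor can be pictured as a hash comparison against the manifest. The sketch below is an assumption about how such a check could look; the `source_sha256` column name and the `needs_extraction` function are hypothetical and may not match `extracted_text_manifest.csv`.

```python
from typing import Mapping, Optional


def needs_extraction(
    pdf_sha256: str,
    manifest_row: Optional[Mapping[str, str]],
    force: bool = False,
) -> bool:
    """Decide whether a PDF should be (re)extracted.

    Extract when forced, when no manifest row exists yet, or when the
    recorded hash differs from the current file hash.
    """
    if force or manifest_row is None:
        return True
    return manifest_row.get("source_sha256") != pdf_sha256
```
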
## A/H Share-Class Mapping Archive

When current or recently closed IPO candidates may already have Mainland A Shares or another onshore listed share class, refresh the structured mapping archive before analyst reports consume the dataset:

```bash
.venv/bin/python scripts/archive_a_share_mappings.py --as-of YYYY-MM-DDTHH:MM:SSZ --web-cross-check --archive-quotes
```

The archive uses three evidence layers:

1. Structured prior mappings in `listed_share_classes` and `data/snapshots/listed_share_classes.csv`.
2. Issuer-context prospectus text scans from `data/snapshots/extracted_text_manifest.csv`.
3. Internet search or supported official exchange pages as public cross-check evidence.

Prospectus text remains the primary source for same-issuer identity. Public web evidence supports the mapping when it is reproducible and can be archived as a `source_refs` row. If a web cross-check is not supported for a market or exchange page, leave `web_source_id` blank and make the analyst report state the web-evidence gap rather than inventing a source.

For detected same-issuer A-share mappings, archive recent A-share quote data and HKD/CNY FX evidence when `--archive-quotes` is used so the analyst can compute A/H discount or premium without relying on stale manual prices. Do not use sponsor, portfolio-company, shareholder, or comparable-company stock codes as same-issuer mappings.

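For context on why both quote and FX evidence are archived, the arithmetic the analyst side needs is small. The sketch below assumes one common convention (H-share price converted to CNY, compared against the A-share price); the function name, parameter names, and sign convention are assumptions, not the analyst skill's definition.

```python
def h_share_premium(h_price_hkd: float, a_price_cny: float, cny_per_hkd: float) -> float:
    """Return the H-share premium over the A share as a fraction.

    Negative values mean the H share trades at a discount to the A share.
    All three inputs must be positive.
    """
    if min(h_price_hkd, a_price_cny, cny_per_hkd) <= 0:
        raise ValueError("prices and FX rate must be positive")
    h_price_cny = h_price_hkd * cny_per_hkd  # convert the H price into CNY
    return h_price_cny / a_price_cny - 1.0
```

With illustrative numbers only, an H price of HKD 10.00, an A price of CNY 12.00, and 0.90 CNY per HKD give 9.00 / 12.00 - 1 = -0.25, a 25% H-share discount.
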
## T1 Demand Text Backfill

When audit finds T1 rows where an allotment-results source is archived but `ipo_demand` is missing, use the text backfill script:

```bash
.venv/bin/python scripts/backfill_t1_demand_from_text.py --as-of YYYY-MM-DDTHH:MM:SSZ
```

The script is incremental. It selects only `T1_allotment` rows that are complete from source evidence but have no `ipo_demand` row. It parses archived PDF extracted text, follows old HKEX HTML allotment-result pages to their linked Summary PDFs, archives those PDFs, extracts their text, writes `ipo_demand`, exports snapshots, and refreshes sync state only when facts or sources changed.

Do not infer missing demand fields. If a Summary PDF gives valid applications and public subscription but omits successful applicants or international subscription level, store the available fields and leave unavailable fields null.

## T0.5 Market Heat Archive

When the user asks to include subscription-period heat before official T1 allotment results, archive a reproducible market-heat snapshot:

```bash
.venv/bin/python scripts/archive_t0_5_market_heat.py --as-of YYYY-MM-DDTHH:MM:SSZ --tickers 01392,06067
```

When the user can still place, amend, or cancel an order near the subscription cutoff, archive the late actionable snapshot as `T0_95_final_heat`:

```bash
.venv/bin/python scripts/archive_t0_5_market_heat.py --stage T0_95_final_heat --as-of YYYY-MM-DDTHH:MM:SSZ --tickers 01392,06067
```

The script stores the raw page under `data/raw/market_heat/`, records per-ticker `source_refs`, writes structured rows to `ipo_market_heat`, exports `data/snapshots/ipo_market_heat.csv`, and refreshes sync state. The default `source_type` is `t0_5_market_heat`.

For T0.95 runs, `source_type` is `t0_95_final_heat` and `ipo_market_heat.stage` is `T0_95_final_heat`. Use this stage only when the snapshot was observed before the user's actual executable order cutoff; otherwise store it as ordinary `T0_5_market_heat` or post-hoc research evidence.

Market-heat data is non-official and live. It may include broker-aggregated margin subscription multiples or similar estimates. Never store it as `ipo_demand`, never treat it as final HKEX subscription data, and always preserve provider, source URL, raw path, `observed_at`, and the intended decision stage.

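The stage rule for late snapshots can be sketched as a timestamp comparison. This is an illustrative reading, assuming timezone-aware timestamps; the `heat_stage` function and its parameters are hypothetical and are not part of `scripts/archive_t0_5_market_heat.py`.

```python
from datetime import datetime, timezone


def heat_stage(observed_at: datetime, order_cutoff: datetime, requested_final: bool) -> str:
    """Pick the stage label for a market-heat snapshot.

    T0_95_final_heat applies only when it was requested and the snapshot was
    observed before the user's executable order cutoff.
    """
    if requested_final and observed_at < order_cutoff:
        return "T0_95_final_heat"
    return "T0_5_market_heat"


if __name__ == "__main__":
    cutoff = datetime(2026, 6, 24, 4, 0, tzinfo=timezone.utc)
    seen = datetime(2026, 6, 24, 3, 30, tzinfo=timezone.utc)
    print(heat_stage(seen, cutoff, requested_final=True))
```
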
## External IPO History Archive

When a historical third-party table is useful for coverage checks or calibration research, archive it separately from official HKEX data:

```bash
.venv/bin/python scripts/archive_ipohk_history.py --as-of YYYY-MM-DDTHH:MM:SSZ
```

The ipohk archive stores raw JSON under `data/raw/external_history/`, writes structured rows to `external_ipo_history`, and exports `data/snapshots/external_ipo_history.csv`.

Treat this as external historical context. Fields such as final oversubscription, one-lot win rate, grey-market return, and first-day return are not T0.5 margin snapshots and must not be backfilled into `ipo_market_heat`.

## Grey-Market Source Policy

`T2_grey_market` is not an HKEX official disclosure stage. Grey-market trading is broker or third-party OTC activity, so do not bulk archive a grey-market feed unless the source is reproducible and redistribution-safe.

Accept a T2 source only when one of these conditions is met:

- A licensed vendor or broker export is provided for this project and may be stored in Git.
- A user-provided evidence file is added under `data/raw/{ticker}/` with clear source notes.
- A public historical source has stable ticker/date records and clear reuse terms.

Until one of those conditions is met, mark due T2 tasks as blocked data gaps instead of repeatedly leaving them as open sync failures:

```bash
.venv/bin/python scripts/mark_grey_market_gaps.py --as-of YYYY-MM-DDTHH:MM:SSZ
```

Do not mark T2 complete from screenshots, unsourced forum posts, or proprietary pages whose terms prohibit copying or redistribution.

## Price Performance Backfill

Use the price-performance archiver to fill due `D1`, `D5`, `D20`, and `D60` review checkpoints:

```bash
.venv/bin/python scripts/archive_price_performance.py --as-of YYYY-MM-DDTHH:MM:SSZ
```

The script archives one raw market-data response per ticker under `data/raw/{ticker}/`, records it in `source_refs`, writes structured rows into `price_performance`, exports snapshots, and refreshes sync state. Checkpoints use the configured calendar due date and the next available trading day in the archived market data.

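The checkpoint rule can be illustrated with a binary search over archived trading dates. The sketch assumes "next available trading day" means the first trading day on or after the calendar due date; the `checkpoint_trading_day` name is hypothetical and the real script may resolve dates differently.

```python
from bisect import bisect_left
from datetime import date
from typing import Optional, Sequence


def checkpoint_trading_day(due_date: date, trading_days: Sequence[date]) -> Optional[date]:
    """Return the first archived trading day on or after due_date.

    trading_days must be sorted ascending. Returns None when the archived
    market data does not yet reach the due date.
    """
    index = bisect_left(trading_days, due_date)
    return trading_days[index] if index < len(trading_days) else None
```

A `None` result corresponds to a checkpoint that stays pending until newer market data is archived, rather than one filled with an earlier price.
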
## Quality Checks

Before finishing, confirm:

- No stored local path is absolute.
- No stored local path starts with `./`.
- Raw files referenced by the database exist.
- Source hashes match current file contents.
- Extracted text exists or has a manifest status for archived PDF source references.
- CSV snapshots reflect the database update.
- `sync_tasks` reflects only missing or future work, not completed stages.
- Any unavailable field is marked as a data gap rather than invented.

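The first two path checks above can be automated with a small validator. This is a sketch under assumptions: `path_violations` is a hypothetical helper, and the backslash check reflects the POSIX-separator rule from the storage contract rather than a documented project check.

```python
from pathlib import PurePosixPath, PureWindowsPath
from typing import List


def path_violations(stored_path: str) -> List[str]:
    """List the storage-contract rules that a stored path string breaks."""
    problems: List[str] = []
    # Catch both POSIX-style (/home/...) and Windows-style (C:\...) absolutes.
    if PurePosixPath(stored_path).is_absolute() or PureWindowsPath(stored_path).is_absolute():
        problems.append("absolute")
    if stored_path.startswith("./"):
        problems.append("leading ./")
    if "\\" in stored_path:
        problems.append("non-POSIX separator")
    return problems
```

Running it over every stored path column and failing on any non-empty result gives a repeatable pre-commit check.
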