Add external IPO history to heat model

Request:
- Add historical data around T0.5 margin heat and rebuild the model.

Changes:
- Add external_ipo_history to store third-party historical IPO records separately from true T0.5 market-heat snapshots.
- Add scripts/archive_ipohk_history.py to archive ipohk structured listed IPO history.
- Archive 807 ipohk rows, including final oversubscription, one-lot win rate, grey-market return, and first-day return where available.
- Extend the v0 analysis dataset with true T0.5 market-heat columns and separate external final-heat columns.
- Rebuild reports/2026-06-15_analysis_model_v0.md with T0.5 coverage and external final-heat calibration.
- Add a Chinese report explaining why historical final oversubscription cannot be treated as T0.5 margin snapshots.
- Update analyst and archivist skills to keep T0.5 and external final history separate.

Verification:
- .venv/bin/python -m py_compile scripts/build_analysis_dataset.py scripts/archive_ipohk_history.py scripts/archive_t0_5_market_heat.py
- .venv/bin/python scripts/build_analysis_dataset.py --as-of 2026-06-15T19:20:00Z
- Python sqlite3 PRAGMA integrity_check returned ok and foreign_key_check returned zero rows.
- Confirmed 807 external_ipo_history rows, 792 rows with external final oversubscription, 5 true T0.5 market-heat rows, and 297 analysis dataset rows.
- git diff --cached --check

Next useful context:
- True T0.5 historical backtesting still requires ongoing frozen margin-heat snapshots during each IPO subscription window.
2026-06-15 16:06:56 +00:00
parent 222f55c140
commit 943eab27cb
12 changed files with 1589 additions and 299 deletions
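
The SQLite checks listed under Verification in the commit message could be reproduced with a small script along these lines. This is a minimal sketch rather than the command actually run for this commit: the function name `check_sqlite_health` is hypothetical, and the database path is not stated in the commit, so the usage example points at an in-memory database.

```python
import sqlite3


def check_sqlite_health(db_path: str) -> dict:
    """Run PRAGMA integrity_check and PRAGMA foreign_key_check on one database.

    db_path is an assumption: the commit does not name the database file.
    """
    conn = sqlite3.connect(db_path)
    try:
        # integrity_check yields a single row containing "ok" when no problems are found.
        integrity = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        # foreign_key_check yields one row per violating record; no rows means no violations.
        fk_violations = conn.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        conn.close()
    return {
        "integrity_ok": integrity == ["ok"],
        "integrity": integrity,
        "fk_violations": fk_violations,
    }


if __name__ == "__main__":
    # Placeholder target; replace ":memory:" with the project's real database path.
    print(check_sqlite_health(":memory:"))
```

Returning the raw rows alongside the boolean keeps any failure details available for the verification log.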
@@ -1,7 +1,7 @@
 # HK IPO Analysis Model v0

 - Model version: `ipo_score_v0`
- Analysis as of: `2026-06-15T18:20:00Z`
+- Analysis as of: `2026-06-15T19:20:00Z`
 - Rule file: `rules/ipo_score_v0.yaml`
 - Dataset: `data/snapshots/analysis_model_v0_dataset.csv`

@@ -21,6 +21,11 @@ The model is built for a short IPO allocation trade: sell in T2 grey market when
 - Rows with offer size: 297
 - Rows with public oversubscription: 281
 - Rows with international oversubscription: 277
+- Rows with T0.5 margin heat snapshots: 5
+- Rows with T0.5 margin heat and D1 labels: 0
+- Rows matched to external ipohk history: 102
+- Rows with external final oversubscription: 95
+- Rows with external final oversubscription and D1 labels: 85
 - Rows pending T1 structure: 6 (01392, 02335, 06067, 06106, 06132, 06675)
 - T1 field-level blanks: public oversubscription 10, international oversubscription 14, valid applications 6, successful applications 18

@@ -47,6 +52,29 @@ T1 adds allotment-stage demand: public subscription, international placing deman
 | total_gte_26 | 59 | 94.9% | 88.1% | 86.7 | 80.0 |
 | total_lt_0 | 68 | 61.8% | 23.5% | 0.4 | 1.0 |

+## T0.5 Market Heat
+
+T0.5 uses archived subscription-period margin heat snapshots. These are non-official live signals and are kept separate from T1 allotment demand. The current archive is not yet a historical training set: it has too few rows and no D1 labels for calibration.
+
+- T0.5 margin rows: 5
+- T0.5 rows with D1 labels: 0
+
+## External Final Heat Proxy
+
+The ipohk history archive adds final public oversubscription, one-lot win rate, grey-market return, and first-day return where available. These fields are useful for coverage checks and post-hoc calibration, but they are not T0.5 inputs because they are final or near-final history.
+
+- External history rows matched into this dataset: 102
+- Matched rows with final oversubscription: 95
+- Matched rows with final oversubscription and D1 labels: 85
+
+| Bucket | N | D1 positive | D1 >= 10% | Avg D1 return | Median D1 return |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| external_os_1000x_to_5000x | 33 | 93.9% | 78.8% | 60.4 | 44.2 |
+| external_os_100x_to_1000x | 21 | 61.9% | 38.1% | 8.8 | 4.2 |
+| external_os_10x_to_100x | 7 | 28.6% | 14.3% | -23.0 | -21.9 |
+| external_os_gte_5000x | 18 | 83.3% | 72.2% | 101.7 | 89.7 |
+| external_os_lt_10x | 6 | 50.0% | 16.7% | 4.7 | -4.1 |
+
 ## Current Read

 After the T1 demand text backfill, the strongest v0 T1 bucket is `total_gte_26` with 59 historical D1 observations and a 94.9% D1 positive rate. The model is most useful after allotment results are available; T0 is a watchlist filter rather than a final subscription call.
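
The External Final Heat Proxy table in the report above groups matched rows by final oversubscription and summarizes D1 outcomes per bucket. A calculation of that shape might look like the sketch below. It is not the code in `scripts/build_analysis_dataset.py`: the bucket names come from the report, but the boundary handling (lower bound inclusive, upper bound exclusive), the function names, and the assumption that D1 returns are expressed in percent are all illustrative.

```python
from collections import defaultdict
from statistics import mean, median

# (exclusive upper bound, bucket name); boundary inclusivity is an assumption.
BUCKETS = [
    (10, "external_os_lt_10x"),
    (100, "external_os_10x_to_100x"),
    (1000, "external_os_100x_to_1000x"),
    (5000, "external_os_1000x_to_5000x"),
]


def bucket_for(oversubscription: float) -> str:
    """Map a final oversubscription multiple to its bucket name."""
    for upper, name in BUCKETS:
        if oversubscription < upper:
            return name
    return "external_os_gte_5000x"


def summarize(rows):
    """Summarize D1 returns per bucket.

    rows: iterable of (final_oversubscription, d1_return_pct) pairs.
    Pairs with a missing value on either side are skipped, matching the
    report's "final oversubscription and D1 labels" row count.
    """
    grouped = defaultdict(list)
    for oversubscription, d1_return in rows:
        if oversubscription is None or d1_return is None:
            continue
        grouped[bucket_for(oversubscription)].append(d1_return)

    summary = {}
    for name, returns in grouped.items():
        n = len(returns)
        summary[name] = {
            "n": n,
            "d1_positive": sum(r > 0 for r in returns) / n,
            "d1_gte_10": sum(r >= 10 for r in returns) / n,
            "avg_d1": mean(returns),
            "median_d1": median(returns),
        }
    return summary
```

Because these inputs are final or near-final values, the output is suitable only for post-hoc calibration, as the report states, and should not be fed into T0 or T0.5 features.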
@@ -0,0 +1,58 @@
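+# 2026-06-15 T0.5 Historical Data and Model Rebuild Notes
+
+## Conclusion
+
+This change wires reproducible historical data into the model, but that data cannot be described as "complete historical T0.5 margin multiples".
+
+- True `T0_5_market_heat`: currently limited to live snapshots from the 华盛 (Huasheng) and 捷利 (Jieli) pages; 5 rows have been archived, all for IPOs currently open for subscription.
+- Structured historical data: `ipohk` provides 807 historical listing records, including final oversubscription multiple, one-lot win rate, grey-market return, and first-day return.
+- Key limitation: the `ipohk` "oversubscription multiple" is a final or near-final result, not a T0.5 margin snapshot taken during the subscription window, so it must not be backfilled into T0.5.
+
+The model rebuild therefore handles the data in two layers:
+
+1. `t0_5_*` fields: true margin heat during the subscription window, sourced only from archived snapshots.
+2. `external_*` fields: historical final heat and outcome references, used only for coverage checks and post-hoc calibration.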
+
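+## Data Added in This Change
+
+| Dataset | Rows | Purpose | Stage safety |
+| --- | ---: | --- | --- |
+| `ipo_market_heat` | 5 | T0.5 margin heat for IPOs currently open for subscription | Usable at T0.5, but the historical sample is too small |
+| `external_ipo_history` | 807 | Historical final oversubscription, win rate, grey-market and first-day performance | Post-hoc reference only; not usable for T0/T0.5 decisions |
+| `analysis_model_v0_dataset.csv` | 297 | Main model training/analysis dataset | T0.5 and external fields added |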
+
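+## Model Coverage After Rebuild
+
+- Main model IPO rows: 297
+- D1 labels: 273
+- True T0.5 margin snapshots: 5
+- True T0.5 snapshots with D1 labels: 0
+- Rows matched to `ipohk` history records: 102
+- Matched rows with `ipohk` final oversubscription multiple: 95
+- Rows with `ipohk` final oversubscription multiple and D1 labels: 85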
+
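+## How to Read the Model Results
+
+True T0.5 cannot yet be backtested statistically: all 5 rows are IPOs that have not listed yet, so none has a D1 outcome.
+
+The `ipohk` historical final oversubscription multiple can validate one direction: whether higher final heat corresponds to a better D1 win rate. It cannot answer whether the margin multiple visible midway through subscription was predictive, because it is not a frozen snapshot from that time.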
+
+## Updated Files
+
+- Raw historical data: `data/raw/external_history/ipohk_listed_20260615T191000Z.json`
+- Structured historical data: `data/snapshots/external_ipo_history.csv`
+- Main model dataset: `data/snapshots/analysis_model_v0_dataset.csv`
+- Model report: `reports/2026-06-15_analysis_model_v0.md`
+- History archive script: `scripts/archive_ipohk_history.py`
+
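+## Next Steps
+
+Making T0.5 genuinely backtestable requires continuous sampling of subscription-window snapshots, starting today.
+
+Suggested sampling cadence:
+
+- T0.5 early: near the close of the first subscription day.
+- T0.5 mid: midway through the subscription window.
+- T0.5 final: half a day before the deadline, or the evening before.
+
+Going forward, keep at least one and up to three frozen snapshots per IPO, then run the `heat_miss`, `structure_miss`, and `market_window_miss` reviews once T1/T2/D1 results are out.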
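
The sampling cadence proposed in the new report (early, mid, and final snapshots per IPO) could be modelled with an immutable record so that a captured snapshot cannot be edited after the fact. The sketch below is only an illustration of that idea: `HeatSnapshot`, its field names, and `missing_stages` are hypothetical and do not reflect the schema used by `scripts/archive_t0_5_market_heat.py`.

```python
from dataclasses import dataclass
from datetime import datetime

# Stage labels taken from the report's suggested cadence.
STAGES = ("early", "mid", "final")


@dataclass(frozen=True)
class HeatSnapshot:
    """One frozen T0.5 margin-heat observation (hypothetical field names)."""

    stock_code: str
    stage: str
    margin_multiple: float
    captured_at: datetime

    def __post_init__(self):
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        # Require timezone-aware timestamps so snapshots can be ordered unambiguously.
        if self.captured_at.tzinfo is None:
            raise ValueError("captured_at must be timezone-aware")


def missing_stages(snapshots, stock_code):
    """Return the cadence stages not yet captured for one IPO, in cadence order."""
    seen = {s.stage for s in snapshots if s.stock_code == stock_code}
    return [stage for stage in STAGES if stage not in seen]
```

A daily job could call `missing_stages` for each IPO in its subscription window to decide which snapshot still needs to be taken; `frozen=True` makes later attribute assignment raise an error, which matches the "frozen snapshot" requirement.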