hk-ipo/README.md
geometrybase 8a0dfd88f0 Make PDF text extraction a standard archive step
Request:
- Add extracted PDF text generation to the archivist workflow as a standard step.

Changes:
- Run PDF text extraction automatically for newly archived HKEX PDF sources.
- Make the PDF text extractor incremental and manifest-preserving.
- Document extracted-text handling in the archivist skill and README.
- Mark generated extracted text as no-diff data evidence.
- Backfill extracted text for all archived PDF source references.

Verification:
- Ran git diff --cached --check.
- Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py.
- Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files.
- Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues.

Next useful context:
- HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.
2026-06-15 13:27:41 +00:00


# HK IPO
HK IPO is a project for building a repeatable, auditable research workflow for Hong Kong new listing subscription decisions.
The project is designed around a feedback loop:
1. Archive IPO facts and source documents.
2. Freeze the analysis that was possible at each decision stage.
3. Compare predictions with post-listing outcomes.
4. Improve the scoring rules only from reviewed evidence.
## Goals
- Maintain a local, Git-tracked history of Hong Kong IPO data.
- Separate factual archiving from investment judgment.
- Keep every subscription decision tied to the information available at that time.
- Review actual IPO outcomes against prior predictions.
- Build a better IPO scoring process through structured error attribution.
## Workflow
Each IPO is evaluated by stage:
- `T0_prospectus`: prospectus and offer terms only.
- `T1_allotment`: allotment results, public subscription, placing, allocation, and final pricing.
- `T2_grey_market`: grey-market result and immediate pre-listing context.
- `D1`, `D5`, `D20`, `D60`: post-listing review checkpoints.
The key discipline is to avoid hindsight leakage. A T0 prediction should only use T0 information, even after the IPO has listed.
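
The stage discipline above can be sketched as a simple filter. This is a minimal illustration, not the project's implementation: the `Fact` record, the `facts_visible_at` helper, and the sample values are hypothetical, and the stage order is assumed to follow the order in which the stages are listed.

```python
from dataclasses import dataclass

# Stage names come from the workflow list; the ordering is assumed to be
# the order in which they are listed there.
STAGE_ORDER = [
    "T0_prospectus", "T1_allotment", "T2_grey_market",
    "D1", "D5", "D20", "D60",
]


@dataclass(frozen=True)
class Fact:
    """Hypothetical record: one archived fact and the stage it became known."""
    name: str
    value: object
    stage: str


def facts_visible_at(facts, stage):
    """Return only the facts that were available at or before `stage`."""
    cutoff = STAGE_ORDER.index(stage)  # ValueError for an unknown stage
    return [f for f in facts if STAGE_ORDER.index(f.stage) <= cutoff]


# Illustrative values only.
facts = [
    Fact("offer_price_range", "10.0-12.0", "T0_prospectus"),
    Fact("public_subscription_multiple", 35.2, "T1_allotment"),
    Fact("d1_return", 0.08, "D1"),
]
t0_inputs = facts_visible_at(facts, "T0_prospectus")
print([f.name for f in t0_inputs])
```

Filtering by the stage at which a fact became known, rather than by the current date, is what keeps a T0 card reproducible after listing.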
## Project Skills
This repository includes project-local Codex skills under `.codex/skills/`.
### `archivist`
Owns facts and source control:
- archive prospectuses, allotment results, listing facts, and market data;
- record source URLs, as-of timestamps, repo-relative paths, and file hashes;
- update the embedded SQLite database;
- export Git-friendly CSV snapshots.
It does not make investment recommendations.
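
As a rough sketch of what a source reference might carry, the snippet below hashes an archived file and records it next to its URL, as-of timestamp, and repo-relative path. The field names, the `make_source_ref` helper, the example URL, and the choice of SHA-256 are all assumptions; the README does not specify the hash algorithm or the `source_refs` schema.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_source_ref(repo_root, rel_path, source_url, as_of):
    """Build a hypothetical source reference; only the relative path is stored."""
    return {
        "source_url": source_url,
        "as_of": as_of,
        "path": rel_path,
        "sha256": sha256_of_file(Path(repo_root) / rel_path),
    }


# Demo against a temporary directory standing in for the repository root.
with tempfile.TemporaryDirectory() as repo_root:
    rel_path = "data/raw/06658/prospectus.pdf"
    target = Path(repo_root) / rel_path
    target.parent.mkdir(parents=True)
    target.write_bytes(b"%PDF-1.4 placeholder bytes")
    ref = make_source_ref(
        repo_root, rel_path,
        "https://example.com/prospectus.pdf",  # placeholder URL
        "2026-06-15T08:30:00Z",
    )
print(ref["path"], ref["sha256"][:12])
```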
### `analyst`
Owns IPO judgment and review:
- produce T0/T1/T2 prediction cards;
- score IPO candidates;
- compare multiple IPOs;
- write research memos and review cards;
- classify forecast errors;
- recommend scoring-rule updates.
It should use archived facts when available and keep prediction cards append-only.
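
One way to keep prediction cards append-only is to refuse to overwrite an existing card file. The sketch below assumes one Markdown file per card and uses a hypothetical `write_card_once` helper and file layout; the skill itself may enforce this differently.

```python
import tempfile
from pathlib import Path


def write_card_once(card_path, text):
    """Create a prediction card, refusing to overwrite an existing one."""
    card_path = Path(card_path)
    card_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError if the file already exists.
    with card_path.open("x", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    card = Path(tmp) / "06658" / "T0_prospectus.md"  # hypothetical layout
    write_card_once(card, "T0 card: first version\n")
    try:
        write_card_once(card, "T0 card: rewritten with hindsight\n")
        overwritten = True
    except FileExistsError:
        overwritten = False
    stored = card.read_text(encoding="utf-8")
print(overwritten, stored.strip())
```

A later revision would then be written as a new card file rather than as an edit to the old one.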
## Storage Model
The project is intended to be self-contained and portable across machines. Durable paths should always be relative to the repository root.
Expected layout:
```text
data/
  hk_ipo.sqlite
  raw/
  snapshots/
memos/
reports/
rules/
schema/
scripts/
references/
```
Path rules:
- store paths like `data/raw/06658/prospectus.pdf`;
- do not store absolute paths;
- do not store paths with a leading `./`;
- use POSIX `/` separators;
- store file hashes for archived source documents when practical.
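
The path rules above can be checked mechanically. This is a minimal sketch with a hypothetical `is_durable_path` helper; rejecting `..` segments is an extra assumption that the rules do not state.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_durable_path(path: str) -> bool:
    """Return True if `path` looks like a repo-relative POSIX path."""
    if not path or "\\" in path:
        return False  # empty, or uses Windows separators
    if path.startswith("./") or path.startswith("/"):
        return False  # leading "./" or absolute POSIX path
    windows_view = PureWindowsPath(path)
    if windows_view.is_absolute() or windows_view.drive:
        return False  # drive-letter paths such as C:/repo/data
    # Assumption: parent-directory segments are also unwanted.
    return ".." not in PurePosixPath(path).parts


print(is_durable_path("data/raw/06658/prospectus.pdf"))
print(is_durable_path("./data/raw/06658/prospectus.pdf"))
```

A check like this could run before rows are written to SQLite or exported to CSV, so that a machine-specific path never reaches a commit.
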
SQLite is the embedded source of structured facts. CSV snapshots provide readable Git diffs. Markdown memos preserve the reasoning at each decision point.
## PDF Text Extraction
Archived PDFs can be converted into searchable text files:
```bash
python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
.venv/bin/python scripts/extract_pdf_text.py
```
The extractor reads PDF paths from `data/hk_ipo.sqlite`, writes derived text files under `data/extracted_text/`, and exports `data/snapshots/extracted_text_manifest.csv` with page counts, text hashes, and extraction status.
The extractor is incremental. If a PDF hash and manifest row are unchanged, the existing text output is reused. Use `--force` only when extraction behavior changes and all derived text should be regenerated.
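
The incremental rule can be pictured as a small decision function. The sketch below is not taken from `scripts/extract_pdf_text.py`: the `needs_extraction` helper and the `pdf_hash` manifest field are hypothetical, and re-extracting when the text file is missing is an assumption.

```python
def needs_extraction(pdf_hash, manifest_row, text_exists, force=False):
    """Decide whether a PDF should be (re)extracted.

    `manifest_row` is the previous manifest entry for this PDF as a dict,
    or None if the PDF has never been extracted.
    """
    if force:
        return True   # --force regenerates everything
    if manifest_row is None:
        return True   # new PDF, no manifest row yet
    if not text_exists:
        return True   # assumption: a missing text file triggers a rerun
    # Unchanged hash plus an existing row means the text is reused.
    return manifest_row.get("pdf_hash") != pdf_hash


previous = {"pdf_hash": "abc123", "status": "ok"}
print(needs_extraction("abc123", previous, text_exists=True))
print(needs_extraction("def456", previous, text_exists=True))
```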
## Recent IPO Target Refresh
Use HKEXnews annual new listing reports to seed recent subscription-relevant IPO targets:
```bash
.venv/bin/python scripts/update_recent_ipo_list.py --start-date 2023-06-15 --end-date 2026-06-15 --as-of 2026-06-15T07:30:00Z
```
The updater archives the HKEXnews XLSX reports under `data/raw/hkex_new_listing_reports/`, records report-backed source references, writes `new_listing_report_entries`, updates `ipo_master` and missing `offering_terms` fields, exports CSV snapshots, and refreshes `sync_tasks`.
Rows without an IPO offer price, such as transfers of listing, introductions, or de-SPAC transactions, are skipped by default because they are not ordinary public subscription targets.
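
The default skip rule amounts to a filter on the offer price. The sketch below uses hypothetical row dictionaries with made-up tickers and an assumed `offer_price` field; the real report columns may be named differently.

```python
def is_subscription_target(row):
    """Keep only rows that carry an IPO offer price."""
    price = row.get("offer_price")
    return price is not None and price != ""


# Illustrative rows only.
report_rows = [
    {"ticker": "00001", "listing_type": "IPO", "offer_price": 12.5},
    {"ticker": "00002", "listing_type": "Transfer of listing", "offer_price": None},
    {"ticker": "00003", "listing_type": "Introduction", "offer_price": ""},
]
targets = [row["ticker"] for row in report_rows if is_subscription_target(row)]
print(targets)
```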
## HKEX Document Backfill
Use the HKEX document archiver to fill detailed T0/T1 facts for open sync tasks:
```bash
.venv/bin/python scripts/archive_hkex_documents.py --as-of 2026-06-15T08:30:00Z
```
The archiver maps stock codes to HKEXnews title-search stock IDs, downloads the selected prospectus and allotment-results documents under `data/raw/{ticker}/`, records `source_refs`, parses high-confidence T0/T1 fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, refreshes `sync_tasks`, and extracts text for newly archived PDF sources.
HKEX `.htm`/`.html` notices and Yahoo Finance JSON market data stay in `data/raw/`; they are not copied into `data/extracted_text/`.
## Price Performance Backfill
Use the price-performance archiver to fill due D1/D5/D20/D60 review checkpoints:
```bash
.venv/bin/python scripts/archive_price_performance.py --as-of 2026-06-15T10:00:00Z
```
The archiver stores raw Yahoo Finance chart responses under `data/raw/{ticker}/`, records source references and hashes, writes structured rows into `price_performance`, exports snapshots, and refreshes `sync_tasks`.
## Analysis Model
Use the analyst model builder to digest archived data into a stage-safe scoring dataset and calibration report:
```bash
.venv/bin/python scripts/build_analysis_dataset.py --as-of 2026-06-15T13:00:00Z
```
The v0 model is documented in `rules/ipo_score_v0.yaml`. It writes `data/snapshots/analysis_model_v0_dataset.csv` and `reports/2026-06-15_analysis_model_v0.md`.
The model separates T0 prospectus inputs from T1 allotment inputs. D1/D5/D20/D60 returns are labels for calibration and review, not prediction inputs.
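
A small guard can make the inputs-versus-labels split explicit. This sketch assumes label columns are named with a `d1_`, `d5_`, `d20_`, or `d60_` prefix, which is a hypothetical naming convention rather than the documented dataset schema.

```python
# Assumed naming convention for post-listing label columns.
LABEL_PREFIXES = ("d1_", "d5_", "d20_", "d60_")


def leaked_labels(feature_columns):
    """Return any feature columns that look like post-listing labels."""
    return [c for c in feature_columns if c.lower().startswith(LABEL_PREFIXES)]


# Hypothetical column names.
t0_features = ["offer_price_mid", "cornerstone_ratio"]
bad_features = ["offer_price_mid", "d5_return"]
print(leaked_labels(t0_features))
print(leaked_labels(bad_features))
```

Running a check like this before scoring would flag a feature list that accidentally includes an outcome column.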
## Incremental Archive Sync
The archivist keeps a per-ticker sync ledger so repeated updates can focus on missing stages:
```bash
python3 scripts/update_sync_state.py
```
This writes `ticker_sync_state` and `sync_tasks` into `data/hk_ipo.sqlite`, then exports `data/snapshots/ticker_sync_state.csv`, `data/snapshots/sync_tasks.csv`, and `data/snapshots/sync_runs.csv`.
Use `sync_tasks` as the next-sync queue. Tasks marked `open` are due now; tasks marked `waiting_until_due` are known future updates.
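
Reading the queue could look like the following sketch. It builds an in-memory table with an assumed `sync_tasks` schema (`ticker`, `stage`, `status`, `due_at`) and illustrative rows; only the table name and the two status values come from this README.

```python
import sqlite3

# In-memory stand-in for data/hk_ipo.sqlite with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sync_tasks (ticker TEXT, stage TEXT, status TEXT, due_at TEXT)"
)
conn.executemany(
    "INSERT INTO sync_tasks VALUES (?, ?, ?, ?)",
    [
        ("06658", "T1_allotment", "open", "2026-06-10"),
        ("06658", "D20", "waiting_until_due", "2026-07-08"),
        ("01234", "D1", "open", "2026-06-12"),  # illustrative ticker
    ],
)


def open_tasks(connection):
    """Return tasks that are due now, oldest due date first."""
    cursor = connection.execute(
        "SELECT ticker, stage FROM sync_tasks "
        "WHERE status = 'open' ORDER BY due_at, ticker"
    )
    return cursor.fetchall()


queue = open_tasks(conn)
print(queue)
```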
## Git Discipline
Each completed project change is committed automatically as its own focused commit.
Before committing, check that unrelated dirty files are not included and that generated durable files use repo-relative paths.