hk-ipo/README.md
geometrybase eae427d85b Add PDF text extraction workflow
Request:
- Provide a way to install or develop a PDF extraction tool for archived HK IPO documents.

Changes:
- Add requirements.txt with pypdf as the lightweight PDF text extraction dependency.
- Add scripts/extract_pdf_text.py to extract text from PDF source_refs into repo-relative data/extracted_text files.
- Add extracted text outputs and an extracted_text_manifest snapshot for the six archived HKEXnews PDFs.
- Document the extraction workflow in README.md.
- Ignore .venv and keep generated SQLite/Python transient files out of git.
- Use extracted text to verify the 06106 full prospectus, update source_refs, remove the related data gap, and fill 06106 offering terms.

Verification:
- Installed python3.14-venv system support, created a local .venv, and installed requirements.txt.
- Re-ran scripts/bootstrap_historical_data.py and scripts/extract_pdf_text.py.
- Verified extracted text paths and hashes against data/snapshots/extracted_text_manifest.csv.
- Verified SQLite integrity and snapshot row counts.
- Ran git diff --cached --check and searched durable files for machine-specific absolute paths.
2026-06-15 06:21:16 +00:00


# HK IPO
HK IPO is a project for building a repeatable, auditable research workflow for subscription decisions on new Hong Kong listings.
The project is designed around a feedback loop:
1. Archive IPO facts and source documents.
2. Freeze the analysis that was possible at each decision stage.
3. Compare predictions with post-listing outcomes.
4. Improve the scoring rules only from reviewed evidence.
## Goals
- Maintain a local, Git-tracked history of Hong Kong IPO data.
- Separate factual archiving from investment judgment.
- Keep every subscription decision tied to the information available at that time.
- Review actual IPO outcomes against prior predictions.
- Build a better IPO scoring process through structured error attribution.
## Workflow
Each IPO is evaluated by stage:
- `T0_prospectus`: prospectus and offer terms only.
- `T1_allotment`: allotment results, public subscription, placing, allocation, and final pricing.
- `T2_grey_market`: grey-market result and immediate pre-listing context.
- `D1`, `D5`, `D20`, `D60`: post-listing review checkpoints.
The key discipline is to avoid hindsight leakage. A T0 prediction should only use T0 information, even after the IPO has listed.
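One way to enforce this is to filter archived facts by stage before building a prediction. The sketch below is illustrative only: the stage order comes from the list above, while `facts_visible_at` and the fact field names are hypothetical and not part of this repository.

```python
# Stage order as listed in the Workflow section.
STAGES = ["T0_prospectus", "T1_allotment", "T2_grey_market", "D1", "D5", "D20", "D60"]


def facts_visible_at(facts, stage):
    """Return only the facts recorded at or before the given stage."""
    cutoff = STAGES.index(stage)
    return [fact for fact in facts if STAGES.index(fact["stage"]) <= cutoff]


# Hypothetical fact records; the real schema may differ.
facts = [
    {"stage": "T0_prospectus", "field": "offer_price_range"},
    {"stage": "T1_allotment", "field": "final_price"},
    {"stage": "D1", "field": "first_day_close"},
]

# A T0 prediction sees prospectus-stage facts only, even after listing.
t0_view = facts_visible_at(facts, "T0_prospectus")
```

Filtering by stage label rather than by the current date means a T0 card can be regenerated later without picking up post-listing data.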
## Project Skills
This repository includes project-local Codex skills under `.codex/skills/`.
### `archivist`
Owns facts and source control:
- archive prospectuses, allotment results, listing facts, and market data;
- record source URLs, as-of timestamps, repo-relative paths, and file hashes;
- update the embedded SQLite database;
- export Git-friendly CSV snapshots.
It does not make investment recommendations.
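As a rough sketch of the record-keeping above, the following builds a source record with a repo-relative path and a SHA-256 hash. The function names and the record keys (`path`, `sha256`, `source_url`, `as_of`) are assumptions for illustration; the actual database schema is not shown here.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_record(repo_root, file_path, source_url, as_of):
    """Build a hypothetical source record keyed by a repo-relative POSIX path."""
    root = Path(repo_root).resolve()
    target = Path(file_path).resolve()
    return {
        "path": target.relative_to(root).as_posix(),  # raises ValueError if outside the repo
        "sha256": sha256_of(target),
        "source_url": source_url,
        "as_of": as_of,
    }
```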
### `analyst`
Owns IPO judgment and review:
- produce T0/T1/T2 prediction cards;
- score IPO candidates;
- compare multiple IPOs;
- write research memos and review cards;
- classify forecast errors;
- recommend scoring-rule updates.
It should use archived facts when available and keep prediction cards append-only.
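A simple way to keep cards append-only is to refuse to overwrite an existing file. This is a minimal sketch, assuming one Markdown file per stock code and stage; the helper name and file naming scheme are hypothetical.

```python
from pathlib import Path


def write_card_once(directory, stock_code, stage, body):
    """Create a prediction card, failing if a card for this stage already exists."""
    target = Path(directory) / f"{stock_code}_{stage}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError instead of silently replacing the card.
    with open(target, "x", encoding="utf-8") as handle:
        handle.write(body)
    return target
```

A correction would then be written as a new card or review note rather than an edit to the original.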
## Storage Model
The project is intended to be self-contained and portable across machines. Durable paths should always be relative to the repository root.
Expected layout:
```text
data/
  hk_ipo.sqlite
  raw/
  snapshots/
memos/
reports/
rules/
schema/
scripts/
references/
```
Path rules:
- store paths like `data/raw/06658/prospectus.pdf`;
- do not store absolute paths;
- do not store paths with a leading `./`;
- use POSIX `/` separators;
- store file hashes for archived source documents when practical.
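The first four path rules can be checked mechanically. The validator below is a sketch under those rules only; `is_durable_path` is a hypothetical helper, not an existing script in this repository.

```python
from pathlib import PureWindowsPath


def is_durable_path(value):
    """Return True if the path is repo-relative, POSIX-style, and has no leading './'."""
    if not value:
        return False
    if "\\" in value:  # POSIX "/" separators only
        return False
    if value.startswith("./"):  # no leading "./"
        return False
    # Reject POSIX absolute paths and Windows drive-letter paths such as "C:/...".
    if value.startswith("/") or PureWindowsPath(value).drive:
        return False
    return True
```

For example, `is_durable_path("data/raw/06658/prospectus.pdf")` is accepted, while `"./data/raw/06658/prospectus.pdf"` and any path starting with `/` are rejected.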
SQLite is the embedded source of structured facts. CSV snapshots provide readable Git diffs. Markdown memos preserve the reasoning at each decision point.
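To illustrate how a SQLite table can become a diff-friendly CSV snapshot, here is a minimal sketch using only the standard library. The `ipo` table, its columns, and the placeholder names are assumptions for the example; the real schema lives under `schema/`.

```python
import csv
import io
import sqlite3


def export_table_csv(conn, table, order_by, out):
    """Write one table as CSV with a header row and a deterministic row order."""
    # Identifiers are interpolated, so they must come from trusted code, not user input.
    cursor = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{order_by}"')
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical table used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ipo (stock_code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO ipo VALUES (?, ?)",
    [("06658", "Example B"), ("06106", "Example A")],
)

buffer = io.StringIO()
export_table_csv(conn, "ipo", "stock_code", buffer)
```

Sorting by a stable key and fixing the line terminator keeps the exported file byte-identical between runs when the data has not changed, so Git diffs show only real changes.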
## PDF Text Extraction
Archived PDFs can be converted into searchable text files:
```bash
python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
.venv/bin/python scripts/extract_pdf_text.py
```
The extractor reads PDF paths from `data/hk_ipo.sqlite`, writes derived text files under `data/extracted_text/`, and exports `data/snapshots/extracted_text_manifest.csv` with page counts, text hashes, and extraction status.
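The core of such an extractor can be sketched with `pypdf`, the dependency named in `requirements.txt`. This is not the code in `scripts/extract_pdf_text.py`; the function names, the manifest keys, and the status values are assumptions for illustration.

```python
import hashlib


def extract_pages(pdf_path):
    """Return one text string per page using pypdf."""
    from pypdf import PdfReader  # imported lazily so manifest_row works without pypdf

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def manifest_row(pdf_path, text_path, pages):
    """Build a hypothetical manifest row: page count, text hash, and status."""
    text = "\n".join(pages)
    return {
        "pdf_path": pdf_path,
        "text_path": text_path,
        "page_count": len(pages),
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Scanned PDFs without a text layer yield no text, so flag them.
        "status": "ok" if text.strip() else "empty",
    }
```

Hashing the extracted text, rather than only the PDF, makes it possible to detect when a re-run produces different output from the same source file.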
## Git Discipline
The repository uses automatic, focused commits: each completed project change is committed on its own.
Before committing, check that unrelated dirty files are not included and that generated durable files use repo-relative paths.