# HK IPO

HK IPO is a project for building a repeatable, auditable research workflow for Hong Kong new listing subscription decisions.

The project is designed around a feedback loop:

1. Archive IPO facts and source documents.
2. Freeze the analysis that was possible at each decision stage.
3. Compare predictions with post-listing outcomes.
4. Improve the scoring rules only from reviewed evidence.

## Goals

- Maintain a local, Git-tracked history of Hong Kong IPO data.
- Separate factual archiving from investment judgment.
- Keep every subscription decision tied to the information available at that time.
- Review actual IPO outcomes against prior predictions.
- Build a better IPO scoring process through structured error attribution.

## Workflow

Each IPO is evaluated by stage:

- `T0_prospectus`: prospectus and offer terms only.
- `T1_allotment`: allotment results, public subscription, placing, allocation, and final pricing.
- `T2_grey_market`: grey-market result and immediate pre-listing context.
- `D1`, `D5`, `D20`, `D60`: post-listing review checkpoints.

The key discipline is to avoid hindsight leakage. A T0 prediction should only use T0 information, even after the IPO has listed.

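
The stage discipline can be illustrated with a small filter. This is a minimal sketch, not the project's implementation: the stage names come from the list above, while `visible_facts` and the `(stage, fact)` tuple shape are hypothetical. It also assumes that a later stage may use information from earlier stages.

```python
# Stage names are taken from the README; their ordering here is the decision timeline.
STAGES = ("T0_prospectus", "T1_allotment", "T2_grey_market", "D1", "D5", "D20", "D60")


def visible_facts(facts, decision_stage):
    """Return only the facts recorded at or before decision_stage.

    `facts` is a hypothetical list of (stage, fact) tuples.
    """
    cutoff = STAGES.index(decision_stage)  # raises ValueError for an unknown stage
    return [(stage, fact) for stage, fact in facts if STAGES.index(stage) <= cutoff]


facts = [
    ("T0_prospectus", "offer price range"),
    ("T1_allotment", "public subscription multiple"),
    ("D1", "first-day close"),
]
t0_view = visible_facts(facts, "T0_prospectus")
print(t0_view)
```

Filtering by stage at read time means a T0 card can be regenerated after listing without later facts leaking into it.
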
## Project Skills

This repository includes project-local Codex skills under `.codex/skills/`.

### `archivist`

Owns facts and source control:

- archive prospectuses, allotment results, listing facts, and market data;
- record source URLs, as-of timestamps, repo-relative paths, and file hashes;
- update the embedded SQLite database;
- export Git-friendly CSV snapshots.

It does not make investment recommendations.

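
As an illustration of the provenance fields the `archivist` records, the sketch below builds one source record. The field names, the `build_source_record` helper, and the choice of SHA-256 are assumptions made for the example; the actual schema may differ.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_source_record(repo_root, relative_path, source_url):
    """Assemble source URL, as-of timestamp, repo-relative path, and file hash.

    All field names are hypothetical.
    """
    return {
        "source_url": source_url,
        "as_of": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "path": Path(relative_path).as_posix(),
        "sha256": sha256_of_file(Path(repo_root) / relative_path),
    }
```

The hash is computed from the file on disk, but only the repo-relative path is stored, so the record stays valid when the repository is cloned elsewhere.
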
### `analyst`

Owns IPO judgment and review:

- produce T0/T1/T2 prediction cards;
- score IPO candidates;
- compare multiple IPOs;
- write research memos and review cards;
- classify forecast errors;
- recommend scoring-rule updates.

It should use archived facts when available and keep prediction cards append-only.

## Storage Model

The project is intended to be self-contained and portable across machines. Durable paths should always be relative to the repository root.

Expected layout:

```text
data/
  hk_ipo.sqlite
  raw/
  snapshots/
memos/
reports/
rules/
schema/
scripts/
references/
```

Path rules:

- store paths like `data/raw/06658/prospectus.pdf`;
- do not store absolute paths;
- do not store paths with a leading `./`;
- use POSIX `/` separators;
- store file hashes for archived source documents when practical.

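
The path rules above lend themselves to a mechanical check. The following is a minimal sketch under the assumption that stored paths are plain strings; `is_durable_path` is a hypothetical helper, not part of the repository.

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_durable_path(path):
    """Return True if a stored path follows the repo-relative path rules."""
    if not path or "\\" in path:
        return False  # empty, or uses Windows separators instead of POSIX "/"
    if path.startswith("./"):
        return False  # leading "./" is not allowed
    if PurePosixPath(path).is_absolute():
        return False  # POSIX absolute path such as /home/...
    if PureWindowsPath(path).is_absolute():
        return False  # drive-letter absolute path such as C:/...
    return True


print(is_durable_path("data/raw/06658/prospectus.pdf"))
```

Running such a check over snapshot CSVs before committing is one way to catch machine-specific paths early.
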
SQLite is the embedded source of structured facts. CSV snapshots provide readable Git diffs. Markdown memos preserve the reasoning at each decision point.

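
To show why CSV snapshots diff cleanly, here is a hedged sketch of exporting one SQLite table with a fixed row order and fixed line endings. The `export_snapshot` function and any table or column names passed to it are hypothetical; the project's own scripts may work differently.

```python
import csv
import sqlite3


def export_snapshot(connection, table, order_by, csv_path):
    """Write one table to CSV in a stable row order and return the row count.

    Identifiers are interpolated into the SQL, so pass trusted names only.
    """
    cursor = connection.execute(f'SELECT * FROM "{table}" ORDER BY "{order_by}"')
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, lineterminator="\n")  # same bytes on every OS
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Sorting on a stable key and pinning the line terminator keeps re-exports byte-identical when the data has not changed, so Git shows only real differences.
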
## PDF Text Extraction

Archived PDFs can be converted into searchable text files:

```bash
python3 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
.venv/bin/python scripts/extract_pdf_text.py
```

The extractor reads PDF paths from `data/hk_ipo.sqlite`, writes derived text files under `data/extracted_text/`, and exports `data/snapshots/extracted_text_manifest.csv` with page counts, text hashes, and extraction status.

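
The manifest fields can be pictured with a small sketch that summarises already-extracted page text. It does not reproduce `scripts/extract_pdf_text.py`: the `manifest_row` helper, the column names, the status values, and the use of SHA-256 are all assumptions for illustration.

```python
import hashlib


def manifest_row(pdf_path, page_texts):
    """Summarise one extraction result; page_texts holds one string per PDF page."""
    text = "\n".join(page_texts)
    return {
        "pdf_path": pdf_path,  # repo-relative path of the source PDF
        "page_count": len(page_texts),
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        # Hypothetical status values: image-only PDFs can yield no text at all.
        "status": "ok" if text.strip() else "empty_text",
    }


row = manifest_row("data/raw/06658/prospectus.pdf", ["page one", "page two"])
print(row["page_count"], row["status"])
```

Hashing the extracted text rather than the PDF lets a later run detect when extraction output changes even though the source file has not.
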
## Git Discipline

The repository uses automatic focused commits for completed project changes.

Before committing, check that unrelated dirty files are not included and that generated durable files use repo-relative paths.