hk-ipo/.codex/skills/archivist/SKILL.md
geometrybase 8a0dfd88f0 Make PDF text extraction a standard archive step
Request:
- Add extracted PDF text generation to the archivist workflow as a standard step.

Changes:
- Run PDF text extraction automatically for newly archived HKEX PDF sources.
- Make the PDF text extractor incremental and manifest-preserving.
- Document extracted-text handling in the archivist skill and README.
- Mark generated extracted text as no-diff data evidence.
- Backfill extracted text for all archived PDF source references.

Verification:
- Ran git diff --cached --check.
- Ran .venv/bin/python -m py_compile scripts/extract_pdf_text.py scripts/archive_hkex_documents.py.
- Ran full PDF extraction, then confirmed an incremental rerun skips unchanged files.
- Verified 557 PDF source_refs, 557 manifest rows, all status ok, and zero missing text/hash/path issues.

Next useful context:
- HKEX HTML notices and Yahoo JSON market data remain under data/raw and are not expected in data/extracted_text.
2026-06-15 13:27:41 +00:00


---
name: archivist
description: "Use for Hong Kong IPO fact archiving in this project: downloading or recording prospectuses, allotment results, listing facts, market data, source references, file hashes, SQLite updates, and CSV snapshots. Do not use for investment conclusions, subscription recommendations, score interpretation, or research memos."
---
# HK IPO Archivist
## Purpose
Maintain the project-local Hong Kong IPO evidence archive and structured fact database. This skill owns facts, sources, database updates, path hygiene, and reproducible snapshots.
It does not decide whether an IPO is worth subscribing for. Route judgment, scoring, prediction cards, review cards, and reports to `analyst`.
## Project Storage Contract
Use repo-relative paths everywhere. Never store machine-specific absolute paths.
- Resolve the repo root at runtime, for example with `git rev-parse --show-toplevel`.
- Store paths without a leading `./`.
- Store paths with POSIX separators, such as `data/raw/06658/prospectus.pdf`.
- Store `path_base = "repo_root"` when a table needs an explicit base.
- Store `file_sha256` for archived source files whenever practical.
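
The path and hash rules above can be sketched in Python as follows. This is a minimal illustration, not the project's actual helper code: the function names `to_repo_relative` and `compute_file_sha256` are hypothetical, and the repo root is assumed to have been resolved already (for example from `git rev-parse --show-toplevel`).

```python
import hashlib
from pathlib import Path


def to_repo_relative(path: Path, repo_root: Path) -> str:
    """Return a POSIX-style path relative to repo_root, without a leading './'.

    Raises ValueError if the path lies outside the repo root.
    """
    relative = path.resolve().relative_to(repo_root.resolve())
    return relative.as_posix()


def compute_file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is what would be stored next to `path_base = "repo_root"`; the absolute form never needs to leave the running process.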
Expected project layout:
```text
data/
    hk_ipo.sqlite
    raw/
    snapshots/
memos/
reports/
rules/
schema/
scripts/
references/
```
## Responsibilities
- Archive primary source files under `data/raw/{ticker}/`.
- Record source references, URLs, as-of timestamps, relative paths, and hashes.
- Update embedded SQLite tables for IPO facts.
- Export Git-friendly CSV snapshots after database updates.
- Maintain `sync_runs`, `ticker_sync_state`, and `sync_tasks` so repeated syncs know what is already archived and what remains pending.
- Use HKEXnews annual new listing reports to seed broad recent-IPO target coverage before collecting deeper per-ticker documents.
- Preserve raw source files; do not overwrite without first checking whether the contents changed.
- Label missing, stale, inconsistent, or estimated fields explicitly.
- Use audit findings to prioritize historical data gaps before expanding analysis coverage.
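
The "do not overwrite without checking" rule can be sketched as a hash comparison before any write. The function name `archive_if_new` and its three return labels are hypothetical; how a changed upstream file should be versioned is left to the caller, since this skill does not specify it.

```python
import hashlib
from pathlib import Path


def archive_if_new(dest: Path, content: bytes) -> str:
    """Write content to dest only when no file exists there yet.

    Returns "created", "unchanged", or "changed". On "changed" the existing
    file is left untouched so the caller can decide how to handle it.
    """
    new_hash = hashlib.sha256(content).hexdigest()
    if dest.exists():
        old_hash = hashlib.sha256(dest.read_bytes()).hexdigest()
        return "unchanged" if old_hash == new_hash else "changed"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(content)
    return "created"
```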
## Boundaries
Do not write:
- Subscription decisions.
- Investment ratings.
- Scoring interpretations.
- Prediction cards.
- Review conclusions.
- Rule-change recommendations.
If a user asks for both data update and analysis, complete the archive/update step first, then hand the frozen as-of dataset to `analyst`.
## Workflow
1. Inspect current repo state and recent commits before changing files.
2. Identify the IPO ticker, company, stage, and source documents needed.
3. Save raw source files under `data/raw/{ticker}/` using descriptive names.
4. Compute hashes for archived files.
5. Insert or update structured facts in `data/hk_ipo.sqlite`.
6. Record every source in the source reference table using repo-relative paths.
7. Extract text for archived PDF sources with `scripts/extract_pdf_text.py`.
8. Refresh sync state with `scripts/update_sync_state.py` after fact updates.
9. Export key tables to `data/snapshots/` for readable Git diffs.
10. Verify path rules, required fields, hash checks, extracted text manifest, sync state, and snapshot generation.
11. Commit only the related archive/database/snapshot changes.
## Incremental Sync State
Use `ticker_sync_state` as the per-ticker stage ledger and `sync_tasks` as the next-sync queue.
Stages:
- `T0_prospectus`
- `T1_allotment`
- `T2_grey_market`
- `D1`
- `D5`
- `D20`
- `D60`
Status values:
- `complete`: required facts or source files are archived.
- `pending_not_due`: the stage is expected in the future.
- `pending_due`: the stage is due and should be updated on the next sync.
- `blocked`: the missing data has no known resolution date or needs manual intervention.
- `not_applicable`: the stage does not apply.
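
As a rough illustration of how the five status values relate, the sketch below derives a status from a few inputs. It is not the logic of `scripts/update_sync_state.py`; the function name, its parameters, and the choice to treat a stage as due when its due date equals the as-of date are all assumptions.

```python
from datetime import date
from typing import Optional


def stage_status(
    archived: bool,
    due: Optional[date],
    as_of: date,
    applicable: bool = True,
) -> str:
    """Map one ticker stage to a status value from the list above."""
    if not applicable:
        return "not_applicable"
    if archived:
        return "complete"
    if due is None:
        # No known resolution date: needs manual follow-up.
        return "blocked"
    # Assumption: a stage due on the as-of date counts as due.
    return "pending_due" if due <= as_of else "pending_not_due"
```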
Default incremental flow:
```bash
python3 scripts/update_sync_state.py
```
Then update only rows in `sync_tasks` whose `task_status` is `open` or `blocked`. Do not re-download existing source files unless the upstream source changed or the stored hash no longer matches.
## Audit-Driven Gap Closure
When `audit` finds historical data gaps, close them in this order unless the user specifies otherwise:
1. Integrity blockers: missing raw files, bad hashes, absolute paths, broken snapshots, or failed foreign keys.
2. Stage blockers: open due `T0_prospectus` and `T1_allotment` tasks that prevent stage-correct analysis.
3. Outcome blockers: due `D1`, `D5`, `D20`, and `D60` price performance needed for feedback and review.
4. Context fields: industry labels, market cap, net proceeds, timetable gaps, and other comparison fields.
5. Hard-to-source signals: `T2_grey_market`, only after a reproducible source strategy is available.
After each gap-closure run, refresh `sync_tasks`, export snapshots, and report what remains open. Do not mark unavailable data complete just to reduce the queue.
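
The default closure order can be expressed as a simple rank lookup. In this sketch the category labels and the `category` key are hypothetical names chosen for illustration; they are not columns from the project schema.

```python
# Default priority, highest first, mirroring the numbered list above.
GAP_PRIORITY = ["integrity", "stage", "outcome", "context", "hard_to_source"]


def order_gaps(gaps: list) -> list:
    """Sort gap records by default priority, keeping input order within a category."""
    rank = {name: index for index, name in enumerate(GAP_PRIORITY)}
    # sorted() is stable, so gaps in the same category keep their relative order.
    return sorted(gaps, key=lambda gap: rank[gap["category"]])
```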
## Recent IPO Target Coverage
Use the recent IPO updater when the user asks to update a broad date range of HK IPO targets:
```bash
.venv/bin/python scripts/update_recent_ipo_list.py --start-date YYYY-MM-DD --end-date YYYY-MM-DD --as-of YYYY-MM-DDTHH:MM:SSZ
```
The script discovers HKEXnews annual new listing report XLSX files, archives them under `data/raw/hkex_new_listing_reports/`, inserts `new_listing_report_entries`, updates `ipo_master` and missing `offering_terms` fields, records report-backed `source_refs`, exports snapshots, and refreshes sync state.
By default, exclude report rows without a numeric IPO offer price because transfers, introductions, and de-SPAC transactions are not ordinary public subscription targets.
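
A minimal sketch of the "numeric IPO offer price" filter follows. The function name is hypothetical, and two details are assumptions rather than documented behavior of `scripts/update_recent_ipo_list.py`: thousands separators are stripped before parsing, and only finite prices above zero count as numeric.

```python
import math


def has_numeric_offer_price(value) -> bool:
    """Return True when a report cell holds a usable numeric offer price."""
    if value is None:
        return False
    try:
        price = float(str(value).replace(",", "").strip())
    except ValueError:
        # Placeholders such as "N/A", "-", or an empty cell land here.
        return False
    # Assumption: NaN, infinity, zero, and negatives are not real offer prices.
    return math.isfinite(price) and price > 0
```

Rows failing this check would be the transfers, introductions, and de-SPAC transactions that the default excludes.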
## HKEX Document Backfill
Use the document archiver to fill detailed T0/T1 facts from official HKEXnews documents:
```bash
.venv/bin/python scripts/archive_hkex_documents.py --as-of YYYY-MM-DDTHH:MM:SSZ
```
The script resolves HKEXnews stock IDs, archives prospectus and allotment-results documents under `data/raw/{ticker}/`, updates `source_refs`, parses high-confidence fields into `ipo_master`, `offering_terms`, and `ipo_demand`, exports snapshots, and refreshes sync state.
The document archiver should use HKEXnews date-window title search around the IPO timetable, not only the latest title-search page. IPO documents for active listed companies are often buried behind later post-listing announcements. Treat official HKEXnews `.pdf`, `.htm`, and `.html` allotment-result notices as valid archived sources; parse structured demand facts only where parser coverage is reliable.
PDF text extraction is a standard HKEX document post-processing step. `scripts/archive_hkex_documents.py` extracts text for newly archived PDFs by default after source references are written. To run the extractor on its own, for example to backfill previously archived PDFs:
```bash
.venv/bin/python scripts/extract_pdf_text.py
```
The extractor is incremental: unchanged PDFs with matching manifest rows are skipped, and `data/snapshots/extracted_text_manifest.csv` is preserved and updated. Use `--force` only when parser behavior changes and derived text should be regenerated.
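
The skip rule can be pictured as a lookup against the manifest. This is a sketch under stated assumptions, not the code in `scripts/extract_pdf_text.py`: the manifest column names `source_path`, `file_sha256`, and `status` are hypothetical, as are the function names.

```python
import csv
import io


def load_manifest(csv_text: str) -> dict:
    """Index manifest rows by source path (column names are assumed)."""
    return {row["source_path"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def needs_extraction(
    source_path: str,
    current_sha256: str,
    manifest: dict,
    force: bool = False,
) -> bool:
    """Decide whether a PDF should be (re)extracted."""
    row = manifest.get(source_path)
    if force or row is None:
        return True
    # Re-extract when the PDF changed or the last attempt did not succeed.
    return row["file_sha256"] != current_sha256 or row["status"] != "ok"
```

Rows for skipped PDFs stay in the manifest unchanged, which is what keeps the CSV diff small on an incremental rerun.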
Do not expect `data/extracted_text/` entries for Yahoo JSON market data or HKEX `.htm`/`.html` notices. Those are already text-like raw evidence files and are tracked under `data/raw/`.
## Grey-Market Source Policy
`T2_grey_market` is not an HKEX official disclosure stage. Grey-market trading is broker or third-party OTC activity, so do not bulk archive a grey-market feed unless the source is reproducible and redistribution-safe.
Accept a T2 source only when one of these conditions is met:
- A licensed vendor or broker export is provided for this project and may be stored in Git.
- A user-provided evidence file is added under `data/raw/{ticker}/` with clear source notes.
- A public historical source has stable ticker/date records and clear reuse terms.
Until one of those conditions is met, mark due T2 tasks as blocked data gaps instead of repeatedly leaving them as open sync failures:
```bash
.venv/bin/python scripts/mark_grey_market_gaps.py --as-of YYYY-MM-DDTHH:MM:SSZ
```
Do not mark T2 complete from screenshots, unsourced forum posts, or proprietary pages whose terms prohibit copying or redistribution.
## Price Performance Backfill
Use the price-performance archiver to fill due `D1`, `D5`, `D20`, and `D60` review checkpoints:
```bash
.venv/bin/python scripts/archive_price_performance.py --as-of YYYY-MM-DDTHH:MM:SSZ
```
The script archives one raw market-data response per ticker under `data/raw/{ticker}/`, records it in `source_refs`, writes structured rows into `price_performance`, exports snapshots, and refreshes sync state. Checkpoints use the configured calendar due date and the next available trading day in the archived market data.
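
One way to read the checkpoint rule is "the first trading day on or after the calendar due date". The sketch below implements that reading only; it is an assumption about what `scripts/archive_price_performance.py` does, and the function name is hypothetical.

```python
from bisect import bisect_left
from datetime import date
from typing import Iterable, Optional


def checkpoint_trading_day(due: date, trading_days: Iterable[date]) -> Optional[date]:
    """Return the first archived trading day on or after the due date.

    Returns None when the archived market data does not reach the due date yet.
    """
    days = sorted(trading_days)
    index = bisect_left(days, due)
    return days[index] if index < len(days) else None
```

A `None` result corresponds to a checkpoint that cannot be filled from the archived response and should stay open rather than be estimated.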
## Quality Checks
Before finishing, confirm:
- No stored local path is absolute.
- No stored local path starts with `./`.
- Raw files referenced by the database exist.
- Source hashes match current file contents.
- Extracted text exists or has a manifest status for archived PDF source references.
- CSV snapshots reflect the database update.
- `sync_tasks` reflects only missing or future work, not completed stages.
- Any unavailable field is marked as a data gap rather than invented.
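
The first two path checks lend themselves to a small validator. The following is a sketch with a hypothetical function name and problem labels; it also flags backslashes, on the assumption that any backslash in a stored path signals a non-POSIX separator.

```python
from pathlib import PurePosixPath, PureWindowsPath


def path_problems(stored: str) -> list:
    """Return the path-hygiene problems found in one stored path string."""
    problems = []
    # Check both flavors so "/home/..." and "C:\\..." are each caught.
    if PurePosixPath(stored).is_absolute() or PureWindowsPath(stored).is_absolute():
        problems.append("absolute")
    if stored.startswith("./"):
        problems.append("leading ./")
    if "\\" in stored:
        problems.append("non-POSIX separator")
    return problems
```

Running this over every stored local path and requiring an empty result for each row gives a quick pass or fail signal before committing.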