gsd: plan M001 (Chrysopedia Foundation) with 5 slices and S01 task breakdown

Milestone: Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
Slices:
  S01: Docker Compose + Database + Whisper Script (5 tasks)
  S02: Transcript Ingestion API
  S03: LLM Extraction Pipeline + Qdrant Integration
  S04: Review Queue Admin UI
  S05: Search-First Web UI

Requirements: R001-R015 covering all spec sections.
Decisions: D001 (tech stack), D002 (Docker conventions), D003 (storage layer)
jlightner 2026-03-29 21:39:04 +00:00
parent 8b506a95ca
commit e15dd97b73
15 changed files with 415 additions and 1 deletion

.gitignore

@@ -1,2 +1,6 @@
.bg-shell/
.gsd/
.gsd/gsd.db
.gsd/gsd.db-shm
.gsd/gsd.db-wal
.gsd/event-log.jsonl
.gsd/state-manifest.json

.gsd/DECISIONS.md

@@ -0,0 +1,9 @@
# Decisions Register
<!-- Append-only. Never edit or remove existing rows.
To reverse a decision, add a new row that supersedes it.
Read this file at the start of any planning or research phase. -->
| # | When | Scope | Decision | Choice | Rationale | Revisable? | Made By |
|---|------|-------|----------|--------|-----------|------------|---------|
| D001 | | architecture | Docker Compose project naming and path conventions | xpltd_chrysopedia with bind mounts at /vmPool/r/services/chrysopedia_*, compose at /vmPool/r/compose/chrysopedia/ | XPLTD lore: compose projects at /vmPool/r/compose/{name}/, service data at /vmPool/r/services/{service}_{role}/, project naming follows xpltd_{name} pattern. Network will be a dedicated bridge subnet avoiding existing 172.16-172.23 and 172.29-172.30 ranges. | Yes | agent |
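D001's subnet constraint can be checked mechanically. A minimal sketch using Python's stdlib `ipaddress` module — the candidate subnets and the /16 granularity of the reserved ranges are assumptions for illustration, not part of D001:

```python
import ipaddress

# Ranges D001 treats as already in use (assumed here to be /16 blocks):
# 172.16-172.23 and 172.29-172.30.
reserved = [ipaddress.ip_network(f"172.{n}.0.0/16")
            for n in list(range(16, 24)) + [29, 30]]

def is_free(candidate: str) -> bool:
    """Return True if the candidate subnet overlaps none of the reserved ranges."""
    net = ipaddress.ip_network(candidate)
    return not any(net.overlaps(r) for r in reserved)

print(is_free("172.20.0.0/24"))  # → False (inside the 172.16-172.23 block)
print(is_free("172.28.0.0/24"))  # → True (outside every reserved range)
```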

.gsd/REQUIREMENTS.md

@@ -0,0 +1,91 @@
# Requirements
## R001 — Whisper Transcription Pipeline
**Status:** active
**Description:** Desktop Python script that accepts video files (MP4/MKV), extracts audio via ffmpeg, runs Whisper large-v3 on RTX 4090, and outputs timestamped transcript JSON with segment-level timestamps and word-level timing. Must be resumable.
**Validation:** Script processes a sample video and produces valid JSON with timestamped segments.
**Primary Owner:** M001/S01
## R002 — Transcript Ingestion API
**Status:** active
**Description:** FastAPI endpoint that accepts transcript JSON uploads, creates/updates Creator and Source Video records, and stores transcript data in PostgreSQL. Handles new creator detection from folder names.
**Validation:** POST transcript JSON → 200 OK, records created in DB, file stored on filesystem.
**Primary Owner:** M001/S02
## R003 — LLM-Powered Extraction Pipeline (Stages 2-5)
**Status:** active
**Description:** Background worker pipeline: transcript segmentation → key moment extraction → classification/tagging → technique page synthesis. Uses OpenAI-compatible API with primary (DGX Sparks Qwen) and fallback (local Ollama) endpoints. Pipeline must be resumable per-video per-stage.
**Validation:** End-to-end: transcript JSON in → technique pages with key moments, tags, and cross-references out.
**Primary Owner:** M001/S03
## R004 — Review Queue UI
**Status:** active
**Description:** Admin interface for reviewing extracted key moments: approve, edit+approve, split, merge, reject. Organized by source video for contextual review. Includes mode toggle (review vs auto-publish).
**Validation:** Admin can review, edit, and approve/reject moments; mode toggle controls whether new moments require review.
**Primary Owner:** M001/S04
## R005 — Search-First Web UI
**Status:** active
**Description:** Landing page with prominent search bar, live typeahead (results after 2-3 chars), scope toggle (All/Topics/Creators), and two navigation cards (Topics, Creators). Recently added section. Search powered by Qdrant semantic search with keyword fallback.
**Validation:** User types query → results appear within 500ms, grouped by type, with clickable navigation.
**Primary Owner:** M001/S05
## R006 — Technique Page Display
**Status:** active
**Description:** Core content unit: header (tags, title, creator, meta), study guide prose (organized by sub-aspects with signal chain blocks and quotes), key moments index (timestamped list), related techniques, plugins referenced. Amber banner for livestream-sourced content.
**Validation:** Technique page renders with all sections populated from synthesized data.
**Primary Owner:** M001/S05
## R007 — Creators Browse Page
**Status:** active
**Description:** Filterable creator list with genre filter pills, type-to-narrow, sort options (randomized default, alphabetical, view count). Each row: name, genre tags, technique count, video count, view count. Links to creator detail page.
**Validation:** Page loads with randomized order, genre filtering works, clicking row navigates to creator detail.
**Primary Owner:** M001/S05
## R008 — Topics Browse Page
**Status:** active
**Description:** Two-level topic hierarchy (6 top-level categories → sub-topics). Filter input, genre filter pills. Each sub-topic shows technique count and creator count. Clicking sub-topic shows technique pages.
**Validation:** Hierarchy renders, filtering works, sub-topic links show correct technique pages.
**Primary Owner:** M001/S05
## R009 — Qdrant Vector Search Integration
**Status:** active
**Description:** Embed key moment summaries, technique page content, and transcript segments in Qdrant using configurable embedding model (nomic-embed-text default). Power semantic search with metadata filtering.
**Validation:** Semantic search returns relevant results for natural language queries; embeddings update when content changes.
**Primary Owner:** M001/S03
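The search behavior R009 powers (semantic first, keyword fallback per R005) can be sketched without Qdrant. A stdlib-only illustration of the ranking logic — the function names, score threshold, and document shape are assumptions; the real system would query Qdrant with metadata filters:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec, query_text, docs, threshold=0.3):
    # Semantic pass: rank stored embeddings by similarity to the query vector.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    hits = [d for d in ranked if cosine(query_vec, d["vec"]) >= threshold]
    if hits:
        return hits
    # Keyword fallback: plain substring match when semantic search finds nothing.
    return [d for d in docs if query_text.lower() in d["text"].lower()]

docs = [
    {"text": "Sidechain compression on the bass bus", "vec": [1.0, 0.0]},
    {"text": "Long reverb tails on vocals", "vec": [0.0, 1.0]},
]
print(len(search([0.9, 0.1], "sidechain", docs)))  # → 1 (one semantic hit)
```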
## R010 — Docker Compose Deployment
**Status:** active
**Description:** Single docker-compose.yml packaging API, web UI, PostgreSQL, and worker services. Follows XPLTD conventions: bind mounts at /vmPool/r/services/, compose at /vmPool/r/compose/chrysopedia/, xpltd_chrysopedia project name, dedicated Docker network.
**Validation:** `docker compose up -d` brings up all services; data persists across restarts.
**Primary Owner:** M001/S01
## R011 — Canonical Tag System
**Status:** active
**Description:** Editable canonical tag list (config file) with aliases. Pipeline references tags during classification. New tags can be proposed by LLM and queued for admin approval or auto-added within existing categories.
**Validation:** Tag list is editable; pipeline uses canonical tags consistently; alias normalization works.
**Primary Owner:** M001/S03
## R012 — Incremental Content Addition
**Status:** active
**Description:** System handles ongoing content: new videos processed through pipeline, new creators auto-detected, existing technique pages updated when new moments are added for same creator+topic.
**Validation:** Adding a new video for an existing creator updates their technique pages; new creator folder creates new Creator record.
**Primary Owner:** M001/S03
## R013 — Prompt Template System
**Status:** active
**Description:** Extraction prompts (stages 2-5) stored as editable configuration files, not hardcoded. Admin can edit prompts and re-run extraction on specific or all videos for calibration.
**Validation:** Prompt files are editable; re-processing a video with updated prompts produces different output.
**Primary Owner:** M001/S03
## R014 — Creator Equity
**Status:** active
**Description:** No creator is privileged in the UI. Default sort on Creators page is randomized on every page load. All creators get equal visual weight.
**Validation:** Refreshing Creators page shows different order each time; no creator gets larger/bolder display.
**Primary Owner:** M001/S05
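R014's equal-footing rule is easy to break with a cached sort. A sketch of per-request ordering — the function name and sort keys are illustrative; an `ORDER BY random()` at the SQL layer would satisfy the same requirement:

```python
import random

def creators_in_display_order(creators, sort="random"):
    if sort == "random":
        # Fresh permutation on every call, so no creator is ever privileged.
        return random.sample(creators, k=len(creators))
    if sort == "alphabetical":
        return sorted(creators, key=str.lower)
    raise ValueError(f"unknown sort: {sort}")
```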
## R015 — 30-Second Retrieval Target
**Status:** active
**Description:** A producer mid-session can find a specific technique in under 30 seconds from Alt+Tab to reading the key insight.
**Validation:** Timed test: Alt+Tab → search → read technique → under 30 seconds.
**Primary Owner:** M001/S05

.gsd/STATE.md

@@ -0,0 +1,18 @@
# GSD State
**Active Milestone:** M001: Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
**Active Slice:** S01: Docker Compose + Database + Whisper Script
**Phase:** evaluating-gates
**Requirements Status:** 0 active · 0 validated · 0 deferred · 0 out of scope
## Milestone Registry
- 🔄 **M001:** Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
## Recent Decisions
- None recorded
## Blockers
- None
## Next Action
Evaluate 3 quality gates for S01 before execution.

@@ -0,0 +1,13 @@
# M001: Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
## Vision
Stand up the complete Chrysopedia stack: Docker Compose deployment on ub01, PostgreSQL data model, FastAPI backend with transcript ingestion, Whisper transcription script for the desktop, LLM extraction pipeline (stages 2-5), review queue, Qdrant integration, and the search-first web UI with technique pages, creators, and topics browsing. By the end, a video file can be transcribed → ingested → extracted → reviewed → searched and read in the web UI.
## Slice Overview
| ID | Slice | Risk | Depends | Done | After this |
|----|-------|------|---------|------|------------|
| S01 | Docker Compose + Database + Whisper Script | low | — | ⬜ | docker compose up -d starts all services on ub01; Whisper script transcribes a sample video to JSON |
| S02 | Transcript Ingestion API | low | S01 | ⬜ | POST a transcript JSON file to the API; Creator and Source Video records appear in PostgreSQL |
| S03 | LLM Extraction Pipeline + Qdrant Integration | high | S02 | ⬜ | A transcript JSON triggers stages 2-5: segmentation → extraction → classification → synthesis. Technique pages with key moments appear in DB. Qdrant has searchable embeddings. |
| S04 | Review Queue Admin UI | medium | S03 | ⬜ | Admin views pending key moments, approves/edits/rejects them, toggles between review and auto mode |
| S05 | Search-First Web UI | medium | S03 | ⬜ | User searches for a technique, gets semantic results in <500ms, clicks through to a full technique page with study guide prose, key moments, and related links |

@@ -0,0 +1,81 @@
# S01: Docker Compose + Database + Whisper Script
**Goal:** Deployable infrastructure: Docker Compose project with PostgreSQL (full schema), FastAPI skeleton, and desktop Whisper transcription script
**Demo:** After this: docker compose up -d starts all services on ub01; Whisper script transcribes a sample video to JSON
## Tasks
- [ ] **T01: Project scaffolding and Docker Compose**
  1. Create project directory structure:
     - backend/ (FastAPI app)
     - frontend/ (React app, placeholder)
     - whisper/ (desktop transcription script)
     - docker/ (Dockerfiles)
     - prompts/ (editable prompt templates)
     - config/ (canonical tags, settings)
  2. Write docker-compose.yml with services:
     - chrysopedia-api (FastAPI, Uvicorn)
     - chrysopedia-web (React, nginx)
     - chrysopedia-db (PostgreSQL 16)
     - chrysopedia-worker (Celery)
     - chrysopedia-redis (Redis for Celery broker)
  3. Follow XPLTD conventions: bind mounts, project naming xpltd_chrysopedia, dedicated bridge network
  4. Create .env.example with all required env vars
  5. Write Dockerfiles for API and web services
  - Estimate: 2-3 hours
  - Files: docker-compose.yml, .env.example, docker/Dockerfile.api, docker/Dockerfile.web, backend/main.py, backend/requirements.txt
  - Verify: docker compose config validates without errors
- [ ] **T02: PostgreSQL schema and migrations**
  1. Create SQLAlchemy models for all 7 entities:
     - Creator (id, name, slug, genres, folder_name, view_count, timestamps)
     - SourceVideo (id, creator_id FK, filename, file_path, duration, content_type enum, transcript_path, processing_status enum, timestamps)
     - TranscriptSegment (id, source_video_id FK, start_time, end_time, text, segment_index, topic_label)
     - KeyMoment (id, source_video_id FK, technique_page_id FK nullable, title, summary, start/end time, content_type enum, plugins, review_status enum, raw_transcript, timestamps)
     - TechniquePage (id, creator_id FK, title, slug, topic_category, topic_tags, summary, body_sections JSONB, signal_chains JSONB, plugins, source_quality enum, view_count, review_status enum, timestamps)
     - RelatedTechniqueLink (id, source_page_id FK, target_page_id FK, relationship enum)
     - Tag (id, name, category, aliases)
  2. Set up Alembic for migrations
  3. Create initial migration
  4. Add seed data for canonical tags (6 top-level categories)
  - Estimate: 2-3 hours
  - Files: backend/models.py, backend/database.py, alembic.ini, alembic/versions/*.py, config/canonical_tags.yaml
  - Verify: alembic upgrade head succeeds; all 7 tables exist with correct columns and constraints
- [ ] **T03: FastAPI application skeleton with health checks**
  1. Set up FastAPI app with:
     - CORS middleware
     - Database session dependency
     - Health check endpoint (/health)
     - API versioning prefix (/api/v1)
  2. Create Pydantic schemas for all entities
  3. Implement basic CRUD endpoints:
     - GET /api/v1/creators
     - GET /api/v1/creators/{slug}
     - GET /api/v1/videos
     - GET /api/v1/health
  4. Add structured logging
  5. Configure environment variable loading from .env
  - Estimate: 1-2 hours
  - Files: backend/main.py, backend/schemas.py, backend/routers/__init__.py, backend/routers/health.py, backend/routers/creators.py, backend/config.py
  - Verify: curl http://localhost:8000/health returns 200; curl http://localhost:8000/api/v1/creators returns empty list
- [ ] **T04: Whisper transcription script**
  1. Create Python script whisper/transcribe.py that:
     - Accepts video file path (or directory for batch mode)
     - Extracts audio via ffmpeg (subprocess)
     - Runs Whisper large-v3 with segment-level and word-level timestamps
     - Outputs JSON matching the spec format (source_file, creator_folder, duration, segments with words)
     - Supports resumability: checks if output JSON already exists, skips
  2. Create whisper/requirements.txt (openai-whisper, ffmpeg-python)
  3. Write output to a configurable output directory
  4. Add CLI arguments: --input, --output-dir, --model (default large-v3), --device (default cuda)
  5. Include progress logging for long transcriptions
  - Estimate: 1-2 hours
  - Files: whisper/transcribe.py, whisper/requirements.txt, whisper/README.md
  - Verify: python whisper/transcribe.py --help shows usage; script validates ffmpeg is available
- [ ] **T05: Integration verification and documentation**
  1. Write README.md with:
     - Project overview
     - Architecture diagram (text)
     - Setup instructions (Docker Compose + desktop Whisper)
     - Environment variable documentation
     - Development workflow
  2. Verify Docker Compose stack starts with: docker compose up -d
  3. Verify PostgreSQL schema with: alembic upgrade head
  4. Verify API health check responds
  5. Create sample transcript JSON for testing subsequent slices
  - Estimate: 1 hour
  - Files: README.md, tests/fixtures/sample_transcript.json
  - Verify: docker compose config validates; README covers all setup steps; sample transcript JSON is valid

@@ -0,0 +1,40 @@
---
estimated_steps: 16
estimated_files: 6
skills_used: []
---
# T01: Project scaffolding and Docker Compose
1. Create project directory structure:
   - backend/ (FastAPI app)
   - frontend/ (React app, placeholder)
   - whisper/ (desktop transcription script)
   - docker/ (Dockerfiles)
   - prompts/ (editable prompt templates)
   - config/ (canonical tags, settings)
2. Write docker-compose.yml with services:
   - chrysopedia-api (FastAPI, Uvicorn)
   - chrysopedia-web (React, nginx)
   - chrysopedia-db (PostgreSQL 16)
   - chrysopedia-worker (Celery)
   - chrysopedia-redis (Redis for Celery broker)
3. Follow XPLTD conventions: bind mounts, project naming xpltd_chrysopedia, dedicated bridge network
4. Create .env.example with all required env vars
5. Write Dockerfiles for API and web services
## Inputs
- `chrysopedia-spec.md`
- `XPLTD lore conventions`
## Expected Output
- `docker-compose.yml`
- `.env.example`
- `docker/Dockerfile.api`
- `backend/main.py`
## Verification
docker compose config validates without errors

@@ -0,0 +1,34 @@
---
estimated_steps: 11
estimated_files: 5
skills_used: []
---
# T02: PostgreSQL schema and migrations
1. Create SQLAlchemy models for all 7 entities:
   - Creator (id, name, slug, genres, folder_name, view_count, timestamps)
   - SourceVideo (id, creator_id FK, filename, file_path, duration, content_type enum, transcript_path, processing_status enum, timestamps)
   - TranscriptSegment (id, source_video_id FK, start_time, end_time, text, segment_index, topic_label)
   - KeyMoment (id, source_video_id FK, technique_page_id FK nullable, title, summary, start/end time, content_type enum, plugins, review_status enum, raw_transcript, timestamps)
   - TechniquePage (id, creator_id FK, title, slug, topic_category, topic_tags, summary, body_sections JSONB, signal_chains JSONB, plugins, source_quality enum, view_count, review_status enum, timestamps)
   - RelatedTechniqueLink (id, source_page_id FK, target_page_id FK, relationship enum)
   - Tag (id, name, category, aliases)
2. Set up Alembic for migrations
3. Create initial migration
4. Add seed data for canonical tags (6 top-level categories)
## Inputs
- `chrysopedia-spec.md section 6 (Data Model)`
## Expected Output
- `backend/models.py`
- `backend/database.py`
- `alembic/versions/001_initial.py`
- `config/canonical_tags.yaml`
## Verification
alembic upgrade head succeeds; all 7 tables exist with correct columns and constraints
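Two of the seven entities sketched as SQLAlchemy models — column types and the SQLite smoke test are assumptions for illustration; the real schema targets PostgreSQL 16 with proper enum types and Alembic migrations:

```python
from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String,
                        create_engine, func)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Creator(Base):
    __tablename__ = "creators"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    slug = Column(String, unique=True, nullable=False)
    folder_name = Column(String, unique=True)  # matched during ingestion
    view_count = Column(Integer, default=0)
    created_at = Column(DateTime, server_default=func.now())
    videos = relationship("SourceVideo", back_populates="creator")

class SourceVideo(Base):
    __tablename__ = "source_videos"
    id = Column(Integer, primary_key=True)
    creator_id = Column(Integer, ForeignKey("creators.id"), nullable=False)
    filename = Column(String, nullable=False)
    processing_status = Column(String, default="pending")  # real schema: enum
    creator = relationship("Creator", back_populates="videos")

# Smoke test: the DDL emits cleanly against in-memory SQLite.
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
print(sorted(Base.metadata.tables))  # → ['creators', 'source_videos']
```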

@@ -0,0 +1,37 @@
---
estimated_steps: 13
estimated_files: 6
skills_used: []
---
# T03: FastAPI application skeleton with health checks
1. Set up FastAPI app with:
   - CORS middleware
   - Database session dependency
   - Health check endpoint (/health)
   - API versioning prefix (/api/v1)
2. Create Pydantic schemas for all entities
3. Implement basic CRUD endpoints:
   - GET /api/v1/creators
   - GET /api/v1/creators/{slug}
   - GET /api/v1/videos
   - GET /api/v1/health
4. Add structured logging
5. Configure environment variable loading from .env
## Inputs
- `backend/models.py`
- `backend/database.py`
## Expected Output
- `backend/main.py`
- `backend/schemas.py`
- `backend/routers/creators.py`
- `backend/config.py`
## Verification
curl http://localhost:8000/health returns 200; curl http://localhost:8000/api/v1/creators returns empty list

@@ -0,0 +1,32 @@
---
estimated_steps: 10
estimated_files: 3
skills_used: []
---
# T04: Whisper transcription script
1. Create Python script whisper/transcribe.py that:
   - Accepts video file path (or directory for batch mode)
   - Extracts audio via ffmpeg (subprocess)
   - Runs Whisper large-v3 with segment-level and word-level timestamps
   - Outputs JSON matching the spec format (source_file, creator_folder, duration, segments with words)
   - Supports resumability: checks if output JSON already exists, skips
2. Create whisper/requirements.txt (openai-whisper, ffmpeg-python)
3. Write output to a configurable output directory
4. Add CLI arguments: --input, --output-dir, --model (default large-v3), --device (default cuda)
5. Include progress logging for long transcriptions
## Inputs
- `chrysopedia-spec.md section 7.2 Stage 1`
## Expected Output
- `whisper/transcribe.py`
- `whisper/requirements.txt`
- `whisper/README.md`
## Verification
python whisper/transcribe.py --help shows usage; script validates ffmpeg is available
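The CLI surface and resumability check from T04, sketched with stdlib only — the actual Whisper and ffmpeg invocations are omitted, and `should_skip` is an assumed name for the skip-if-output-exists rule:

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="Transcribe videos with Whisper")
    p.add_argument("--input", required=True, help="Video file, or directory for batch mode")
    p.add_argument("--output-dir", default="transcripts")
    p.add_argument("--model", default="large-v3")
    p.add_argument("--device", default="cuda")
    return p

def should_skip(video: Path, output_dir: Path) -> bool:
    """Resumability: a video whose transcript JSON already exists is skipped."""
    return (output_dir / f"{video.stem}.json").exists()

args = build_parser().parse_args(["--input", "set.mkv"])
print(args.model)  # → large-v3
```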

@@ -0,0 +1,31 @@
---
estimated_steps: 10
estimated_files: 2
skills_used: []
---
# T05: Integration verification and documentation
1. Write README.md with:
   - Project overview
   - Architecture diagram (text)
   - Setup instructions (Docker Compose + desktop Whisper)
   - Environment variable documentation
   - Development workflow
2. Verify Docker Compose stack starts with: docker compose up -d
3. Verify PostgreSQL schema with: alembic upgrade head
4. Verify API health check responds
5. Create sample transcript JSON for testing subsequent slices
## Inputs
- `All T01-T04 outputs`
## Expected Output
- `README.md`
- `tests/fixtures/sample_transcript.json`
## Verification
docker compose config validates; README covers all setup steps; sample transcript JSON is valid
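The fixture's shape can be sketched from the field names the plan lists (source_file, creator_folder, duration, segments with words) — all values below are hypothetical placeholders, not spec data:

```python
import json

sample = {
    "source_file": "mixdown_tutorial.mp4",
    "creator_folder": "example_creator",
    "duration": 12.5,
    "segments": [
        {
            "start": 0.0,
            "end": 4.2,
            "text": "Let's look at the sidechain setup.",
            "words": [
                {"word": "Let's", "start": 0.0, "end": 0.4},
                {"word": "look", "start": 0.4, "end": 0.7},
            ],
        }
    ],
}

# Round-trip to confirm the fixture is valid JSON.
assert json.loads(json.dumps(sample)) == sample
```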

@@ -0,0 +1,6 @@
# S02: Transcript Ingestion API
**Goal:** FastAPI endpoints for transcript upload, creator management, and source video tracking
**Demo:** After this: POST a transcript JSON file to the API; Creator and Source Video records appear in PostgreSQL
## Tasks

@@ -0,0 +1,6 @@
# S03: LLM Extraction Pipeline + Qdrant Integration
**Goal:** Complete LLM pipeline with editable prompt templates, canonical tag system, Qdrant embedding, and resumable processing
**Demo:** After this: A transcript JSON triggers stages 2-5: segmentation → extraction → classification → synthesis. Technique pages with key moments appear in DB. Qdrant has searchable embeddings.
## Tasks

@@ -0,0 +1,6 @@
# S04: Review Queue Admin UI
**Goal:** Functional review workflow for calibrating extraction quality
**Demo:** After this: Admin views pending key moments, approves/edits/rejects them, toggles between review and auto mode
## Tasks

@@ -0,0 +1,6 @@
# S05: Search-First Web UI
**Goal:** Complete public-facing UI: landing page, live search, technique pages, creators browse, topics browse
**Demo:** After this: User searches for a technique, gets semantic results in <500ms, clicks through to a full technique page with study guide prose, key moments, and related links
## Tasks