# Chrysopedia

> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.

---

## Information Flow

Content moves through six stages from raw video to searchable knowledge:

```
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 1 · Transcription                                  [Desktop / GPU] │
│                                                                          │
│ Video files → Whisper large-v3 (CUDA) → JSON transcripts                 │
│ Output: timestamped segments with speaker text                           │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │ JSON files (manual or folder watcher)
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 2 · Ingestion                                      [API + Watcher] │
│                                                                          │
│ POST /api/v1/ingest ← watcher auto-submits from /watch folder            │
│   • Validate JSON structure                                              │
│   • Compute content hash (SHA-256) for deduplication                     │
│   • Find-or-create Creator from folder name                              │
│   • Upsert SourceVideo (exact filename → content hash → fuzzy match)     │
│   • Bulk-insert TranscriptSegment rows                                   │
│   • Dispatch pipeline to Celery worker                                   │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │ Celery task: run_pipeline(video_id)
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 3 · LLM Extraction Pipeline                        [Celery Worker] │
│                                                                          │
│ Four sequential LLM stages, each with its own prompt template:           │
│                                                                          │
│   3a. Segmentation — Split transcript into semantic topic boundaries     │
│       Model: chat (fast)         Prompt: stage2_segmentation.txt         │
│                                                                          │
│   3b. Extraction — Identify key moments (title, summary, timestamps)     │
│       Model: reasoning (think)   Prompt: stage3_extraction.txt           │
│                                                                          │
│   3c. Classification — Assign content types + extract plugin names       │
│       Model: chat (fast)         Prompt: stage4_classification.txt       │
│                                                                          │
│   3d. Synthesis — Compose technique pages from approved moments          │
│       Model: reasoning (think)   Prompt: stage5_synthesis.txt            │
│                                                                          │
│ Each stage emits PipelineEvent rows (tokens, duration, model, errors)    │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │ KeyMoment rows (review_status: pending)
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 4 · Review & Curation                                   [Admin UI] │
│                                                                          │
│ Admin reviews extracted KeyMoments before they become technique pages:   │
│   • Approve — moment proceeds to synthesis                               │
│   • Edit — correct title, summary, content type, plugins, then approve   │
│   • Reject — moment is excluded from knowledge base                      │
│ (When REVIEW_MODE=false, moments auto-approve and skip this stage)       │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │ Approved moments → Stage 3d synthesis
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 5 · Knowledge Base                                        [Web UI] │
│                                                                          │
│ TechniquePages — the primary output:                                     │
│   • Structured body sections, signal chains, plugin lists                │
│   • Linked to source KeyMoments with video timestamps                    │
│   • Cross-referenced via RelatedTechniqueLinks                           │
│   • Versioned (snapshots before each re-synthesis)                       │
│   • Organized by topic taxonomy (6 categories from canonical_tags.yaml)  │
└────────────────────────────────┬─────────────────────────────────────────┘
                                 │
                                 ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ STAGE 6 · Search & Retrieval                                    [Web UI] │
│                                                                          │
│   • Semantic search: query → embedding → Qdrant vector similarity        │
│   • Keyword fallback: ILIKE search on title/summary (300ms timeout)      │
│   • Browse by topic hierarchy, creator, or content type                  │
│   • Typeahead search from home page (debounced, top 5 results)           │
└──────────────────────────────────────────────────────────────────────────┘
```

---

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│ Desktop (GPU workstation — hal0022)                                      │
│ whisper/transcribe.py → JSON transcripts → copy to /watch folder         │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Docker Compose: xpltd_chrysopedia (ub01)                                 │
│ Network: chrysopedia (172.32.0.0/24)                                     │
│                                                                          │
│  ┌─────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐   │
│  │ PostgreSQL  │  │ Redis       │  │ Qdrant        │  │ Ollama       │   │
│  │ :5433       │  │ broker +    │  │ vector DB     │  │ embeddings   │   │
│  │ 10 entities │  │ cache       │  │ semantic      │  │ nomic-embed  │   │
│  └──────┬──────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘   │
│         │                │                 │                 │           │
│  ┌──────┴────────────────┴─────────────────┴─────────────────┴──────┐    │
│  │                           FastAPI (API)                          │    │
│  │   Ingest · Pipeline control · Review · Search · CRUD · Reports   │    │
│  └──────────────────────────────┬───────────────────────────────────┘    │
│                                 │                                        │
│  ┌──────────────┐  ┌────────────┴───┐  ┌──────────────────────────┐      │
│  │ Watcher      │  │ Celery Worker  │  │ Web UI (React)           │      │
│  │ /watch →     │  │ LLM pipeline   │  │ nginx → :8096            │      │
│  │ auto-ingest  │  │ stages 2-5     │  │ search-first interface   │      │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
```

### Services

| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store |
| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker + feature flag cache |
| `chrysopedia-qdrant` | `qdrant/qdrant:v1.13.2` | — | Vector DB for semantic search |
| `chrysopedia-ollama` | `ollama/ollama` | — | Embedding model server (nomic-embed-text) |
| `chrysopedia-api` | `Dockerfile.api` | `8000` | FastAPI REST API |
| `chrysopedia-worker` | `Dockerfile.api` | — | Celery worker (LLM pipeline) |
| `chrysopedia-watcher` | `Dockerfile.api` | — | Folder monitor → auto-ingest |
| `chrysopedia-web` | `Dockerfile.web` | `8096 → 80` | React frontend (nginx) |

### Data Model

| Entity | Purpose |
|--------|---------|
| **Creator** | Artists/producers whose content is indexed |
| **SourceVideo** | Video files processed by the pipeline (with content hash dedup) |
| **TranscriptSegment** | Timestamped text segments from Whisper |
| **KeyMoment** | Discrete insights extracted by LLM analysis |
| **TechniquePage** | Synthesized knowledge pages — the primary output |
| **TechniquePageVersion** | Snapshots before re-synthesis overwrites |
| **RelatedTechniqueLink** | Cross-references between technique pages |
| **Tag** | Hierarchical topic taxonomy |
| **ContentReport** | User-submitted content issues |
| **PipelineEvent** | Structured pipeline execution logs (tokens, timing, errors) |

---

## Quick Start

### Prerequisites

- Docker ≥ 24.0 and Docker Compose ≥ 2.20
- Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)

### Setup

```bash
# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env  # edit with real values

# Start the stack
docker compose up -d

# Run database migrations
docker exec chrysopedia-api alembic upgrade head

# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text

# Verify
curl http://localhost:8096/health
```

### Transcribe videos

```bash
cd whisper && pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

See [`whisper/README.md`](whisper/README.md) for full transcription docs.

---

## Environment Variables

Copy `.env.example` to `.env`.
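For example, a minimal `.env` might look like the following. All values here are illustrative placeholders — use `.env.example` as the authoritative template; the container hostnames and ports shown are assumptions, not confirmed compose settings:

```bash
# Hypothetical example values — copy .env.example and adjust for your hosts
POSTGRES_USER=chrysopedia
POSTGRES_PASSWORD=change-me
POSTGRES_DB=chrysopedia

LLM_API_URL=https://llm.example.com/v1        # any OpenAI-compatible endpoint
LLM_API_KEY=sk-placeholder
LLM_MODEL=example-model

EMBEDDING_API_URL=http://chrysopedia-ollama:11434   # 11434 is Ollama's default port
EMBEDDING_MODEL=nomic-embed-text

QDRANT_URL=http://chrysopedia-qdrant:6333           # 6333 is Qdrant's default port
QDRANT_COLLECTION=chrysopedia

REVIEW_MODE=true    # gate key moments behind admin review
DEBUG_MODE=false
```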
Key groups:

| Group | Variables | Notes |
|-------|-----------|-------|
| **Database** | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` | Default user: `chrysopedia` |
| **LLM** | `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL` | OpenAI-compatible endpoint |
| **LLM Fallback** | `LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL` | Automatic failover |
| **Per-Stage Models** | `LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY` | `chat` for fast stages, `thinking` for reasoning |
| **Embedding** | `EMBEDDING_API_URL`, `EMBEDDING_MODEL` | Ollama nomic-embed-text |
| **Vector DB** | `QDRANT_URL`, `QDRANT_COLLECTION` | Container-internal |
| **Features** | `REVIEW_MODE`, `DEBUG_MODE` | Review gate + LLM I/O capture |
| **Storage** | `TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH` | Container bind mounts |

---

## API Endpoints

### Public

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (DB connectivity) |
| GET | `/api/v1/search?q=&scope=&limit=` | Semantic + keyword search |
| GET | `/api/v1/techniques` | List technique pages |
| GET | `/api/v1/techniques/{slug}` | Technique detail + key moments |
| GET | `/api/v1/techniques/{slug}/versions` | Version history |
| GET | `/api/v1/creators` | List creators (sort, genre filter) |
| GET | `/api/v1/creators/{slug}` | Creator detail |
| GET | `/api/v1/topics` | Topic hierarchy with counts |
| GET | `/api/v1/videos` | List source videos |
| POST | `/api/v1/reports` | Submit content report |

### Admin

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/review/queue` | Review queue (status filter) |
| POST | `/api/v1/review/moments/{id}/approve` | Approve key moment |
| POST | `/api/v1/review/moments/{id}/reject` | Reject key moment |
| PUT | `/api/v1/review/moments/{id}` | Edit key moment |
| POST | `/api/v1/admin/pipeline/trigger/{video_id}` | Trigger/retrigger pipeline |
| GET | `/api/v1/admin/pipeline/events/{video_id}` | Pipeline event log |
| GET | `/api/v1/admin/pipeline/token-summary/{video_id}` | Token usage by stage |
| GET | `/api/v1/admin/pipeline/worker-status` | Celery worker status |
| PUT | `/api/v1/admin/pipeline/debug-mode` | Toggle debug mode |

### Ingest

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/ingest` | Upload Whisper JSON transcript |

---

## Development

```bash
# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head
```

### Project Structure

```
chrysopedia/
├── backend/                    # FastAPI application
│   ├── main.py                 # Entry point, middleware, router mounting
│   ├── config.py               # Pydantic Settings (all env vars)
│   ├── models.py               # SQLAlchemy ORM models
│   ├── schemas.py              # Pydantic request/response schemas
│   ├── worker.py               # Celery app configuration
│   ├── watcher.py              # Transcript folder watcher service
│   ├── search_service.py       # Semantic search + keyword fallback
│   ├── routers/                # API endpoint handlers
│   ├── pipeline/               # LLM pipeline stages + clients
│   │   ├── stages.py           # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py       # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                   # React + TypeScript + Vite
│   └── src/
│       ├── pages/              # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/         # Shared UI components
│       └── api/                # Typed API clients
├── whisper/                    # Desktop transcription (Whisper large-v3)
├── docker/                     # Dockerfiles + nginx config
├── alembic/                    # Database migrations
├── config/                     # canonical_tags.yaml (topic taxonomy)
├── prompts/                    # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example
```

---

## Deployment (ub01)

```bash
ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d
```

| Resource | Location |
|----------|----------|
| Web UI | `http://ub01:8096` |
| API | `http://ub01:8096/health` |
| PostgreSQL | `ub01:5433` |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |
| Persistent data | `/vmPool/r/services/chrysopedia_*` |

XPLTD conventions: `xpltd_chrysopedia` project name, dedicated bridge network (`172.32.0.0/24`), bind mounts under `/vmPool/r/services/`, PostgreSQL on port `5433`.
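For reference, the content-hash deduplication performed at ingest (Stage 2 under Information Flow) can be sketched in a few lines. This is a hedged illustration only: the payload field names (`creator`, `filename`, `segments`, `content_hash`) and the exact bytes hashed are assumptions for the example, not the implemented schema.

```python
import hashlib
import json


def content_hash(transcript: dict) -> str:
    """SHA-256 over a canonical JSON rendering, so the same transcript
    always hashes identically regardless of key order or whitespace."""
    canonical = json.dumps(transcript, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_ingest_payload(creator: str, filename: str, segments: list[dict]) -> dict:
    """Assemble a request body for POST /api/v1/ingest (field names assumed)."""
    transcript = {"filename": filename, "segments": segments}
    return {
        "creator": creator,
        "filename": filename,
        "content_hash": content_hash(transcript),
        "segments": segments,
    }


segments = [
    {"start": 0.0, "end": 4.2, "text": "Let's talk about sidechain compression."},
    {"start": 4.2, "end": 9.8, "text": "First, set the attack fairly fast."},
]
payload = build_ingest_payload("example-creator", "sidechain_basics.mp4", segments)

# Re-building the payload from the same segments yields the same hash,
# which is what lets the API detect a duplicate before dispatching work.
assert payload["content_hash"] == build_ingest_payload(
    "example-creator", "sidechain_basics.mp4", segments
)["content_hash"]
```

Posting this dict as JSON to `/api/v1/ingest` (for example with `requests.post`) would then let the API compare `content_hash` against existing `SourceVideo` rows before any Celery pipeline work runs.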