Chrysopedia
From chrysopoeia (alchemical transmutation of base material into gold) + encyclopedia.
Chrysopedia transmutes raw video content into refined, searchable production knowledge.
A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.
Information Flow
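Content moves through six stages from raw video to searchable knowledge. The four LLM sub-stages shown as 3a-3d below are numbered 2-5 in prompt file names, environment variables, and code (for example, stage2_segmentation.txt drives 3a and stage5_synthesis.txt drives 3d).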
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 1 · Transcription [Desktop / GPU] │
│ │
│ Video files → Whisper large-v3 (CUDA) → JSON transcripts │
│ Output: timestamped segments with speaker text │
└────────────────────────────────┬────────────────────────────────────────┘
│ JSON files (manual or folder watcher)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 2 · Ingestion [API + Watcher] │
│ │
│ POST /api/v1/ingest ← watcher auto-submits from /watch folder │
│ • Validate JSON structure │
│ • Compute content hash (SHA-256) for deduplication │
│ • Find-or-create Creator from folder name │
│ • Upsert SourceVideo (exact filename → content hash → fuzzy match) │
│ • Bulk-insert TranscriptSegment rows │
│ • Dispatch pipeline to Celery worker │
└────────────────────────────────┬────────────────────────────────────────┘
│ Celery task: run_pipeline(video_id)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 3 · LLM Extraction Pipeline [Celery Worker] │
│ │
│ Four sequential LLM stages, each with its own prompt template: │
│ │
│ 3a. Segmentation — Split transcript into semantic topic boundaries │
│ Model: chat (fast) Prompt: stage2_segmentation.txt │
│ │
│ 3b. Extraction — Identify key moments (title, summary, timestamps) │
│ Model: reasoning (think) Prompt: stage3_extraction.txt │
│ │
│ 3c. Classification — Assign content types + extract plugin names │
│ Model: chat (fast) Prompt: stage4_classification.txt │
│ │
│ 3d. Synthesis — Compose technique pages from approved moments │
│ Model: reasoning (think) Prompt: stage5_synthesis.txt │
│ │
│ Each stage emits PipelineEvent rows (tokens, duration, model, errors) │
└────────────────────────────────┬────────────────────────────────────────┘
│ KeyMoment rows (review_status: pending)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 4 · Review & Curation [Admin UI] │
│ │
│ Admin reviews extracted KeyMoments before they become technique pages: │
│ • Approve — moment proceeds to synthesis │
│ • Edit — correct title, summary, content type, plugins, then approve │
│ • Reject — moment is excluded from knowledge base │
│ (When REVIEW_MODE=false, moments auto-approve and skip this stage) │
└────────────────────────────────┬────────────────────────────────────────┘
│ Approved moments → Stage 3d synthesis
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 5 · Knowledge Base [Web UI] │
│ │
│ TechniquePages — the primary output: │
│ • Structured body sections, signal chains, plugin lists │
│ • Linked to source KeyMoments with video timestamps │
│ • Cross-referenced via RelatedTechniqueLinks │
│ • Versioned (snapshots before each re-synthesis) │
│ • Organized by topic taxonomy (6 categories from canonical_tags.yaml) │
└────────────────────────────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 6 · Search & Retrieval [Web UI] │
│ │
│ • Semantic search: query → embedding → Qdrant vector similarity │
│ • Keyword fallback: ILIKE search on title/summary (300ms timeout) │
│ • Browse by topic hierarchy, creator, or content type │
│ • Typeahead search from home page (debounced, top 5 results) │
└─────────────────────────────────────────────────────────────────────────┘
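Stage 6 pairs semantic search with a keyword fallback. A minimal sketch of that control flow follows. It assumes the 300 ms budget bounds the semantic path (the diagram does not say which side it applies to) and that an empty semantic result also triggers the fallback. `semantic_search` and `keyword_search` are hypothetical stand-ins for the Qdrant lookup and the ILIKE query, not functions taken from search_service.py.

```python
import asyncio

# Assumption: the 300 ms timeout applies to the semantic (vector) path.
SEMANTIC_TIMEOUT_S = 0.3


async def search(query, semantic_search, keyword_search, timeout=SEMANTIC_TIMEOUT_S):
    """Try semantic search first; fall back to keyword search on timeout, error, or no hits."""
    try:
        results = await asyncio.wait_for(semantic_search(query), timeout=timeout)
        if results:
            return {"mode": "semantic", "results": results}
    except (asyncio.TimeoutError, ConnectionError):
        pass  # vector path too slow or unreachable; fall through to keywords
    return {"mode": "keyword", "results": await keyword_search(query)}


# Hypothetical stand-ins used only to exercise the control flow.
async def fast_semantic(query):
    return [f"semantic hit for {query}"]


async def slow_semantic(query):
    await asyncio.sleep(1.0)  # simulates a vector lookup that misses its budget
    return [f"semantic hit for {query}"]


async def keyword(query):
    return [f"keyword hit for {query}"]
```

Calling `asyncio.run(search("sidechain", slow_semantic, keyword, timeout=0.05))` takes the keyword branch, while the same call with `fast_semantic` stays on the semantic branch.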
Architecture
┌──────────────────────────────────────────────────────────────────────────┐
│ Desktop (GPU workstation — hal0022) │
│ whisper/transcribe.py → JSON transcripts → copy to /watch folder │
└────────────────────────────┬─────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Docker Compose: xpltd_chrysopedia (ub01) │
│ Network: chrysopedia (172.32.0.0/24) │
│ │
│ ┌────────────┐ ┌─────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Qdrant │ │ Ollama │ │
│ │ :5433 │ │ broker + │ │ vector DB │ │ embeddings │ │
│ │ 10 entities │ │ cache │ │ semantic │ │ nomic-embed │ │
│ └─────┬───────┘ └──────┬──────┘ └───────┬───────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐ │
│ │ FastAPI (API) │ │
│ │ Ingest · Pipeline control · Review · Search · CRUD · Reports │ │
│ └──────────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌────────────┴───┐ ┌──────────────────────────┐ │
│ │ Watcher │ │ Celery Worker │ │ Web UI (React) │ │
│ │ /watch → │ │ LLM pipeline │ │ nginx → :8096 │ │
│ │ auto-ingest │ │ stages 2-5 │ │ search-first interface │ │
│ └──────────────┘ └────────────────┘ └──────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
Services
| Service | Image | Port | Purpose |
|---|---|---|---|
| chrysopedia-db | postgres:16-alpine | 5433 → 5432 | Primary data store |
| chrysopedia-redis | redis:7-alpine | - | Celery broker + feature flag cache |
| chrysopedia-qdrant | qdrant/qdrant:v1.13.2 | - | Vector DB for semantic search |
| chrysopedia-ollama | ollama/ollama | - | Embedding model server (nomic-embed-text) |
| chrysopedia-api | Dockerfile.api | 8000 | FastAPI REST API |
| chrysopedia-worker | Dockerfile.api | - | Celery worker (LLM pipeline) |
| chrysopedia-watcher | Dockerfile.api | - | Folder monitor → auto-ingest |
| chrysopedia-web | Dockerfile.web | 8096 → 80 | React frontend (nginx) |
Data Model
| Entity | Purpose |
|---|---|
| Creator | Artists/producers whose content is indexed |
| SourceVideo | Video files processed by the pipeline (with content hash dedup) |
| TranscriptSegment | Timestamped text segments from Whisper |
| KeyMoment | Discrete insights extracted by LLM analysis |
| TechniquePage | Synthesized knowledge pages (the primary output) |
| TechniquePageVersion | Snapshots before re-synthesis overwrites |
| RelatedTechniqueLink | Cross-references between technique pages |
| Tag | Hierarchical topic taxonomy |
| ContentReport | User-submitted content issues |
| PipelineEvent | Structured pipeline execution logs (tokens, timing, errors) |
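The SourceVideo deduplication order given in Stage 2 (exact filename, then content hash, then fuzzy match) can be sketched as below. This is an illustration under assumptions, not the project's implementation: the record shape (`filename`, `content_hash` keys), the use of `difflib` for fuzzy matching, and the 0.9 threshold are all hypothetical.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(raw: bytes) -> str:
    """SHA-256 hex digest of the uploaded transcript bytes."""
    return hashlib.sha256(raw).hexdigest()


def _similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def match_video(filename, digest, existing, fuzzy_threshold=0.9):
    """Return (record, reason); record is None when no existing video matches.

    `existing` is a list of dicts with hypothetical keys 'filename' and
    'content_hash'. Checks run in the documented order.
    """
    for rec in existing:
        if rec["filename"] == filename:
            return rec, "filename"
    for rec in existing:
        if rec["content_hash"] == digest:
            return rec, "content_hash"
    # Assumption: fuzzy matching compares filenames; the threshold is illustrative.
    best = max(existing, key=lambda rec: _similarity(rec["filename"], filename), default=None)
    if best is not None and _similarity(best["filename"], filename) >= fuzzy_threshold:
        return best, "fuzzy"
    return None, "new"
```

Checking the cheap exact comparisons first means the fuzzy pass only runs for files that are renamed and re-transcribed, which is where a near-match on the name is the only remaining signal.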
Quick Start
Prerequisites
- Docker ≥ 24.0 and Docker Compose ≥ 2.20
- Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)
Setup
# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env # edit with real values
# Start the stack
docker compose up -d
# Run database migrations
docker exec chrysopedia-api alembic upgrade head
# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text
# Verify
curl http://localhost:8096/health
Transcribe videos
cd whisper && pip install -r requirements.txt
# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts
See whisper/README.md for full transcription docs.
Environment Variables
Copy .env.example to .env. Key groups:
| Group | Variables | Notes |
|---|---|---|
| Database | POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB | Default user: chrysopedia |
| LLM | LLM_API_URL, LLM_API_KEY, LLM_MODEL | OpenAI-compatible endpoint |
| LLM Fallback | LLM_FALLBACK_URL, LLM_FALLBACK_MODEL | Automatic failover |
| Per-Stage Models | LLM_STAGE{2-5}_MODEL, LLM_STAGE{2-5}_MODALITY | chat for fast stages, thinking for reasoning |
| Embedding | EMBEDDING_API_URL, EMBEDDING_MODEL | Ollama nomic-embed-text |
| Vector DB | QDRANT_URL, QDRANT_COLLECTION | Container-internal |
| Features | REVIEW_MODE, DEBUG_MODE | Review gate + LLM I/O capture |
| Storage | TRANSCRIPT_STORAGE_PATH, VIDEO_METADATA_PATH | Container bind mounts |
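The per-stage variables expand to LLM_STAGE2_MODEL through LLM_STAGE5_MODEL (and the matching _MODALITY names). One way to resolve them is sketched below using only the standard library. Two behaviours here are assumptions rather than documented facts: falling back to LLM_MODEL when a stage has no override, and defaulting the modality to chat. The real settings live in backend/config.py (Pydantic Settings) and may differ.

```python
import os
from dataclasses import dataclass

STAGES = (2, 3, 4, 5)  # internal numbering of the four LLM stages


@dataclass(frozen=True)
class StageModel:
    model: str
    modality: str  # "chat" or "thinking"


def load_stage_models(env=None):
    """Resolve the model and modality for each stage from environment variables."""
    env = os.environ if env is None else env
    default_model = env.get("LLM_MODEL", "")
    resolved = {}
    for stage in STAGES:
        resolved[stage] = StageModel(
            # Assumption: an unset per-stage model falls back to LLM_MODEL.
            model=env.get(f"LLM_STAGE{stage}_MODEL") or default_model,
            # Assumption: an unset modality defaults to "chat".
            modality=env.get(f"LLM_STAGE{stage}_MODALITY") or "chat",
        )
    return resolved
```

Passing a plain dict instead of `os.environ` makes the resolution easy to check without touching the real environment, for example `load_stage_models({"LLM_MODEL": "base", "LLM_STAGE3_MODEL": "big"})`.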
API Endpoints
Public
| Method | Path | Description |
|---|---|---|
| GET | /health | Health check (DB connectivity) |
| GET | /api/v1/search?q=&scope=&limit= | Semantic + keyword search |
| GET | /api/v1/techniques | List technique pages |
| GET | /api/v1/techniques/{slug} | Technique detail + key moments |
| GET | /api/v1/techniques/{slug}/versions | Version history |
| GET | /api/v1/creators | List creators (sort, genre filter) |
| GET | /api/v1/creators/{slug} | Creator detail |
| GET | /api/v1/topics | Topic hierarchy with counts |
| GET | /api/v1/videos | List source videos |
| POST | /api/v1/reports | Submit content report |
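As a usage illustration for the search endpoint, the sketch below builds the query string from the three documented parameters (q, scope, limit) and fetches the result. The base URL is an assumption: the Quick Start reaches /health through port 8096, so the example presumes the same host also proxies /api/v1. The response schema and the accepted scope values are not documented here, so the parsed JSON is returned untouched.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def search_url(base_url, q, scope=None, limit=None):
    """Build a /api/v1/search URL, omitting parameters that were not supplied."""
    params = {"q": q}
    if scope is not None:
        params["scope"] = scope
    if limit is not None:
        params["limit"] = limit
    return f"{base_url.rstrip('/')}/api/v1/search?{urlencode(params)}"


def search(base_url, q, **kwargs):
    """Fetch search results; returns whatever JSON the API sends back."""
    with urlopen(search_url(base_url, q, **kwargs), timeout=5) as resp:
        return json.load(resp)
```

For example, `search_url("http://localhost:8096", "parallel compression", limit=5)` yields `http://localhost:8096/api/v1/search?q=parallel+compression&limit=5`.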
Admin
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/review/queue | Review queue (status filter) |
| POST | /api/v1/review/moments/{id}/approve | Approve key moment |
| POST | /api/v1/review/moments/{id}/reject | Reject key moment |
| PUT | /api/v1/review/moments/{id} | Edit key moment |
| POST | /api/v1/admin/pipeline/trigger/{video_id} | Trigger/retrigger pipeline |
| GET | /api/v1/admin/pipeline/events/{video_id} | Pipeline event log |
| GET | /api/v1/admin/pipeline/token-summary/{video_id} | Token usage by stage |
| GET | /api/v1/admin/pipeline/worker-status | Celery worker status |
| PUT | /api/v1/admin/pipeline/debug-mode | Toggle debug mode |
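The review endpoints implement the Stage 4 gate: approve, edit then approve, or reject, with auto-approval when REVIEW_MODE=false. The sketch below models those transitions on plain dicts. Only the `pending` status appears in this README; the `approved` and `rejected` values, the dict keys, and the function names are hypothetical.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    PENDING = "pending"    # shown in the flow diagram
    APPROVED = "approved"  # assumed value
    REJECTED = "rejected"  # assumed value


# Fields the Edit action may correct, per Stage 4; key names are assumptions.
EDITABLE = {"title", "summary", "content_type", "plugins"}


def initial_status(review_mode: bool) -> ReviewStatus:
    """New moments wait for review unless REVIEW_MODE is off."""
    return ReviewStatus.PENDING if review_mode else ReviewStatus.APPROVED


def apply_decision(moment: dict, action: str, edits=None) -> dict:
    """Return a copy of `moment` with the admin decision applied."""
    updated = dict(moment)
    if action == "approve":
        updated["review_status"] = ReviewStatus.APPROVED
    elif action == "edit":
        unknown = set(edits or {}) - EDITABLE
        if unknown:
            raise ValueError(f"fields not editable: {sorted(unknown)}")
        updated.update(edits or {})
        updated["review_status"] = ReviewStatus.APPROVED  # edit implies approval
    elif action == "reject":
        updated["review_status"] = ReviewStatus.REJECTED
    else:
        raise ValueError(f"unknown action: {action}")
    return updated
```

Returning a copy rather than mutating the input keeps the original moment available, which suits a workflow where the pre-edit values may still be needed for auditing.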
Ingest
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/ingest | Upload Whisper JSON transcript |
Development
# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000
# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head
Project Structure
chrysopedia/
├── backend/ # FastAPI application
│ ├── main.py # Entry point, middleware, router mounting
│ ├── config.py # Pydantic Settings (all env vars)
│ ├── models.py # SQLAlchemy ORM models
│ ├── schemas.py # Pydantic request/response schemas
│ ├── worker.py # Celery app configuration
│ ├── watcher.py # Transcript folder watcher service
│ ├── search_service.py # Semantic search + keyword fallback
│ ├── routers/ # API endpoint handlers
│ ├── pipeline/ # LLM pipeline stages + clients
│ │ ├── stages.py # Stages 2-5 (Celery tasks)
│ │ ├── llm_client.py # OpenAI-compatible LLM client
│ │ ├── embedding_client.py
│ │ └── qdrant_client.py
│ └── tests/
├── frontend/ # React + TypeScript + Vite
│ └── src/
│ ├── pages/ # Home, Search, Technique, Creator, Topic, Admin
│ ├── components/ # Shared UI components
│ └── api/ # Typed API clients
├── whisper/ # Desktop transcription (Whisper large-v3)
├── docker/ # Dockerfiles + nginx config
├── alembic/ # Database migrations
├── config/ # canonical_tags.yaml (topic taxonomy)
├── prompts/ # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example
Deployment (ub01)
ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d
| Resource | Location |
|---|---|
| Web UI | http://ub01:8096 |
| API | http://ub01:8096/health |
| PostgreSQL | ub01:5433 |
| Compose config | /vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml |
| Persistent data | /vmPool/r/services/chrysopedia_* |
XPLTD conventions: xpltd_chrysopedia project name, dedicated bridge network (172.32.0.0/24), bind mounts under /vmPool/r/services/, PostgreSQL on port 5433.