Knowledge base for music production techniques — extracted from video content via LLM pipeline


Chrysopedia

From chrysopoeia (alchemical transmutation of base material into gold) + encyclopedia. Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.

Information Flow

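Content moves through six stages from raw video to searchable knowledge. The prompt filenames, the `LLM_STAGE{2-5}_*` variables, and the Celery tasks use an internal numbering of stages 2-5, which corresponds to sub-stages 3a-3d in the diagram: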

 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 1 · Transcription                            [Desktop / GPU]    │
 │                                                                         │
 │  Video files → Whisper large-v3 (CUDA) → JSON transcripts              │
 │  Output: timestamped segments with speaker text                         │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ JSON files (manual or folder watcher)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 2 · Ingestion                                [API + Watcher]    │
 │                                                                         │
 │  POST /api/v1/ingest ← watcher auto-submits from /watch folder         │
 │  • Validate JSON structure                                              │
 │  • Compute content hash (SHA-256) for deduplication                     │
 │  • Find-or-create Creator from folder name                              │
 │  • Upsert SourceVideo (exact filename → content hash → fuzzy match)     │
 │  • Bulk-insert TranscriptSegment rows                                   │
 │  • Dispatch pipeline to Celery worker                                   │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Celery task: run_pipeline(video_id)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 3 · LLM Extraction Pipeline                  [Celery Worker]    │
 │                                                                         │
 │  Four sequential LLM stages, each with its own prompt template:         │
 │                                                                         │
 │  3a. Segmentation — Split transcript into semantic topic boundaries     │
 │      Model: chat (fast)         Prompt: stage2_segmentation.txt         │
 │                                                                         │
 │  3b. Extraction — Identify key moments (title, summary, timestamps)     │
 │      Model: reasoning (think)   Prompt: stage3_extraction.txt           │
 │                                                                         │
 │  3c. Classification — Assign content types + extract plugin names       │
 │      Model: chat (fast)         Prompt: stage4_classification.txt       │
 │                                                                         │
 │  3d. Synthesis — Compose technique pages from approved moments          │
 │      Model: reasoning (think)   Prompt: stage5_synthesis.txt            │
 │                                                                         │
 │  Each stage emits PipelineEvent rows (tokens, duration, model, errors)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ KeyMoment rows (review_status: pending)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 4 · Review & Curation                        [Admin UI]         │
 │                                                                         │
 │  Admin reviews extracted KeyMoments before they become technique pages: │
 │  • Approve — moment proceeds to synthesis                               │
 │  • Edit — correct title, summary, content type, plugins, then approve   │
 │  • Reject — moment is excluded from knowledge base                      │
 │  (When REVIEW_MODE=false, moments auto-approve and skip this stage)     │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Approved moments → Stage 3d synthesis
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 5 · Knowledge Base                           [Web UI]           │
 │                                                                         │
 │  TechniquePages — the primary output:                                   │
 │  • Structured body sections, signal chains, plugin lists                │
 │  • Linked to source KeyMoments with video timestamps                    │
 │  • Cross-referenced via RelatedTechniqueLinks                           │
 │  • Versioned (snapshots before each re-synthesis)                       │
 │  • Organized by topic taxonomy (6 categories from canonical_tags.yaml)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 6 · Search & Retrieval                       [Web UI]           │
 │                                                                         │
 │  • Semantic search: query → embedding → Qdrant vector similarity        │
 │  • Keyword fallback: ILIKE search on title/summary (300ms timeout)      │
 │  • Browse by topic hierarchy, creator, or content type                  │
 │  • Typeahead search from home page (debounced, top 5 results)           │
 └─────────────────────────────────────────────────────────────────────────┘
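
The Stage 2 matching order (exact filename → content hash → fuzzy match) can be sketched roughly as below. This is a minimal illustration, not the project's ingest code: `VideoRecord`, `content_hash`, `find_existing`, and the 0.9 similarity threshold are hypothetical, and hashing the raw uploaded bytes is an assumption about what the content hash covers.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class VideoRecord:
    """Hypothetical stand-in for a SourceVideo row."""
    filename: str
    content_hash: str


def content_hash(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw transcript bytes."""
    return hashlib.sha256(raw).hexdigest()


def find_existing(
    filename: str,
    digest: str,
    known: List[VideoRecord],
    fuzzy_threshold: float = 0.9,  # assumed cutoff, not taken from the project
) -> Optional[VideoRecord]:
    """Look up an existing record: exact filename, then hash, then fuzzy name."""
    # 1. Exact filename match.
    for rec in known:
        if rec.filename == filename:
            return rec
    # 2. Same content under a different filename.
    for rec in known:
        if rec.content_hash == digest:
            return rec
    # 3. Closest filename by similarity ratio, if it clears the threshold.
    best, best_score = None, 0.0
    for rec in known:
        score = SequenceMatcher(None, rec.filename.lower(), filename.lower()).ratio()
        if score > best_score:
            best, best_score = rec, score
    return best if best_score >= fuzzy_threshold else None
```

A caller would update the returned record when one is found and insert a new one when the result is `None`, which is what makes the operation an upsert.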

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Desktop (GPU workstation — hal0022)                                     │
│  whisper/transcribe.py → JSON transcripts → copy to /watch folder        │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_chrysopedia (ub01)                                │
│  Network: chrysopedia (172.32.0.0/24)                                    │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │    Qdrant     │  │    Ollama    │  │
│  │  :5433      │  │  broker +   │  │  vector DB    │  │  embeddings  │  │
│  │ 10 entities │  │  cache      │  │  semantic     │  │  nomic-embed │  │
│  └─────┬───────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘  │
│        │                 │                 │                 │           │
│  ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐  │
│  │                         FastAPI (API)                              │  │
│  │  Ingest · Pipeline control · Review · Search · CRUD · Reports     │  │
│  └──────────────────────────────┬────────────────────────────────────┘  │
│                                 │                                       │
│  ┌──────────────┐  ┌────────────┴───┐  ┌──────────────────────────┐    │
│  │   Watcher    │  │  Celery Worker │  │     Web UI (React)       │    │
│  │  /watch →    │  │  LLM pipeline  │  │  nginx → :8096           │    │
│  │  auto-ingest │  │  stages 2-5    │  │  search-first interface  │    │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────┘

Services

Service	Image	Port	Purpose
`chrysopedia-db`	`postgres:16-alpine`	`5433 → 5432`	Primary data store
`chrysopedia-redis`	`redis:7-alpine`	—	Celery broker + feature flag cache
`chrysopedia-qdrant`	`qdrant/qdrant:v1.13.2`	—	Vector DB for semantic search
`chrysopedia-ollama`	`ollama/ollama`	—	Embedding model server (nomic-embed-text)
`chrysopedia-api`	`Dockerfile.api`	`8000`	FastAPI REST API
`chrysopedia-worker`	`Dockerfile.api`	—	Celery worker (LLM pipeline)
`chrysopedia-watcher`	`Dockerfile.api`	—	Folder monitor → auto-ingest
`chrysopedia-web`	`Dockerfile.web`	`8096 → 80`	React frontend (nginx)

Data Model

Entity	Purpose
Creator	Artists/producers whose content is indexed
SourceVideo	Video files processed by the pipeline (with content hash dedup)
TranscriptSegment	Timestamped text segments from Whisper
KeyMoment	Discrete insights extracted by LLM analysis
TechniquePage	Synthesized knowledge pages — the primary output
TechniquePageVersion	Snapshots before re-synthesis overwrites
RelatedTechniqueLink	Cross-references between technique pages
Tag	Hierarchical topic taxonomy
ContentReport	User-submitted content issues
PipelineEvent	Structured pipeline execution logs (tokens, timing, errors)
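
To illustrate how a KeyMoment's review status might gate synthesis (Stage 4 in the flow diagram), here is a small sketch. The `pending` status and the `REVIEW_MODE=false` auto-approve behaviour come from the diagram; the `approved` and `rejected` values, the dataclass fields, and `apply_review_gate` are hypothetical names chosen for illustration, not the project's ORM model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ReviewStatus(str, Enum):
    PENDING = "pending"    # documented initial state
    APPROVED = "approved"  # assumed value
    REJECTED = "rejected"  # assumed value


@dataclass
class KeyMoment:
    """Hypothetical, simplified stand-in for the KeyMoment entity."""
    title: str
    review_status: ReviewStatus = ReviewStatus.PENDING


def apply_review_gate(moments: List[KeyMoment], review_mode: bool) -> List[KeyMoment]:
    """Return the moments that may proceed to synthesis.

    With review_mode off, pending moments are auto-approved;
    rejected moments are excluded either way.
    """
    if not review_mode:
        for moment in moments:
            if moment.review_status is ReviewStatus.PENDING:
                moment.review_status = ReviewStatus.APPROVED
    return [m for m in moments if m.review_status is ReviewStatus.APPROVED]
```

With `review_mode=True`, a freshly extracted moment stays `pending` and is held back until an admin approves it.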

Quick Start

Prerequisites

Docker ≥ 24.0 and Docker Compose ≥ 2.20
Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)

Setup

# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env    # edit with real values

# Start the stack
docker compose up -d

# Run database migrations
docker exec chrysopedia-api alembic upgrade head

# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text

# Verify
curl http://localhost:8096/health
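
In a scripted setup the containers may need a few seconds before the health check passes, so polling can be more convenient than a single `curl`. The helper below is a hedged sketch using only the standard library; `wait_for_health`, the retry count, and the delay are illustrative assumptions, and it treats any HTTP 200 response as healthy without inspecting the body.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(
    url: str = "http://localhost:8096/health",
    attempts: int = 10,     # assumed retry budget
    delay_s: float = 2.0,   # assumed pause between tries
) -> bool:
    """Poll the health endpoint until it answers 200 or attempts run out."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or a non-2xx status: retry.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Calling `wait_for_health()` after `docker compose up -d` returns `True` once the stack responds and `False` if it never does within the retry budget.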

Transcribe videos

cd whisper && pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts

See whisper/README.md for full transcription docs.

Environment Variables

Copy .env.example to .env. Key groups:

Group	Variables	Notes
Database	`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`	Default user: `chrysopedia`
LLM	`LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL`	OpenAI-compatible endpoint
LLM Fallback	`LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL`	Automatic failover
Per-Stage Models	`LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY`	`chat` for fast stages, `thinking` for reasoning
Embedding	`EMBEDDING_API_URL`, `EMBEDDING_MODEL`	Ollama nomic-embed-text
Vector DB	`QDRANT_URL`, `QDRANT_COLLECTION`	Container-internal
Features	`REVIEW_MODE`, `DEBUG_MODE`	Review gate + LLM I/O capture
Storage	`TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH`	Container bind mounts
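
As an illustration of the per-stage variables, the sketch below resolves a model and modality for one pipeline stage. It assumes `LLM_STAGE{2-5}_MODEL` expands to `LLM_STAGE2_MODEL` through `LLM_STAGE5_MODEL`, that an unset stage model falls back to `LLM_MODEL`, and that the default modality is `chat`. Those fallbacks and the `stage_llm_config` helper are assumptions, not behaviour confirmed by `backend/config.py`.

```python
import os
from typing import Mapping, Tuple


def stage_llm_config(stage: int, env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (model, modality) for pipeline stage 2, 3, 4, or 5."""
    if stage not in (2, 3, 4, 5):
        raise ValueError(f"unknown pipeline stage: {stage}")
    # Assumed fallback: use the global LLM_MODEL when no stage override is set.
    model = env.get(f"LLM_STAGE{stage}_MODEL") or env.get("LLM_MODEL", "")
    # Assumed default: treat a stage without an explicit modality as "chat".
    modality = env.get(f"LLM_STAGE{stage}_MODALITY", "chat")
    return model, modality
```

Passing a plain dict as `env` makes the lookup easy to check without touching the real environment, for example `stage_llm_config(3, {"LLM_MODEL": "base", "LLM_STAGE3_MODALITY": "thinking"})`.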

API Endpoints

Public

Method	Path	Description
GET	`/health`	Health check (DB connectivity)
GET	`/api/v1/search?q=&scope=&limit=`	Semantic + keyword search
GET	`/api/v1/techniques`	List technique pages
GET	`/api/v1/techniques/{slug}`	Technique detail + key moments
GET	`/api/v1/techniques/{slug}/versions`	Version history
GET	`/api/v1/creators`	List creators (sort, genre filter)
GET	`/api/v1/creators/{slug}`	Creator detail
GET	`/api/v1/topics`	Topic hierarchy with counts
GET	`/api/v1/videos`	List source videos
POST	`/api/v1/reports`	Submit content report
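
The search endpoint combines semantic search with a keyword fallback. One way such a fallback could be structured is sketched below. The README mentions a 300 ms timeout but does not say which call it bounds; this sketch assumes it applies to the semantic path, and it also assumes that an empty semantic result triggers the fallback. `search_with_fallback` and the two callables are hypothetical placeholders for the Qdrant and ILIKE queries.

```python
import asyncio
from typing import Awaitable, Callable, List

SearchFn = Callable[[str], Awaitable[List[str]]]


async def search_with_fallback(
    query: str,
    semantic: SearchFn,
    keyword: SearchFn,
    timeout_s: float = 0.3,  # assumed to bound the semantic call
) -> List[str]:
    """Try semantic search first; use keyword search on timeout, error, or no hits."""
    try:
        results = await asyncio.wait_for(semantic(query), timeout=timeout_s)
        if results:
            return results
    except (asyncio.TimeoutError, ConnectionError):
        # Embedding or vector service too slow or unreachable.
        pass
    return await keyword(query)
```

The design keeps the keyword path as the last resort, so a slow or unavailable vector service degrades result quality rather than failing the request.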

Admin

Method	Path	Description
GET	`/api/v1/review/queue`	Review queue (status filter)
POST	`/api/v1/review/moments/{id}/approve`	Approve key moment
POST	`/api/v1/review/moments/{id}/reject`	Reject key moment
PUT	`/api/v1/review/moments/{id}`	Edit key moment
POST	`/api/v1/admin/pipeline/trigger/{video_id}`	Trigger/retrigger pipeline
GET	`/api/v1/admin/pipeline/events/{video_id}`	Pipeline event log
GET	`/api/v1/admin/pipeline/token-summary/{video_id}`	Token usage by stage
GET	`/api/v1/admin/pipeline/worker-status`	Celery worker status
PUT	`/api/v1/admin/pipeline/debug-mode`	Toggle debug mode
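
The token-summary endpoint reports token usage by stage. A rough idea of that aggregation over PipelineEvent rows is sketched below. The field names (`stage`, `total_tokens`, `duration_ms`) and the `token_summary` function are hypothetical; the real model and response schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class PipelineEvent:
    """Hypothetical, simplified stand-in for a PipelineEvent row."""
    stage: str
    total_tokens: int
    duration_ms: int


def token_summary(events: Iterable[PipelineEvent]) -> Dict[str, int]:
    """Sum token counts per stage across all events for one video."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        totals[event.stage] += event.total_tokens
    return dict(totals)
```

In a database-backed implementation the same result would more likely come from a `GROUP BY` query than from in-memory summation.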

Ingest

Method	Path	Description
POST	`/api/v1/ingest`	Upload Whisper JSON transcript

Development

# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head

Project Structure

chrysopedia/
├── backend/                 # FastAPI application
│   ├── main.py              # Entry point, middleware, router mounting
│   ├── config.py            # Pydantic Settings (all env vars)
│   ├── models.py            # SQLAlchemy ORM models
│   ├── schemas.py           # Pydantic request/response schemas
│   ├── worker.py            # Celery app configuration
│   ├── watcher.py           # Transcript folder watcher service
│   ├── search_service.py    # Semantic search + keyword fallback
│   ├── routers/             # API endpoint handlers
│   ├── pipeline/            # LLM pipeline stages + clients
│   │   ├── stages.py        # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py    # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                # React + TypeScript + Vite
│   └── src/
│       ├── pages/           # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/      # Shared UI components
│       └── api/             # Typed API clients
├── whisper/                 # Desktop transcription (Whisper large-v3)
├── docker/                  # Dockerfiles + nginx config
├── alembic/                 # Database migrations
├── config/                  # canonical_tags.yaml (topic taxonomy)
├── prompts/                 # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example

Deployment (ub01)

ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d

Resource	Location
Web UI	`http://ub01:8096`
API health check	`http://ub01:8096/health`
PostgreSQL	`ub01:5433`
Compose config	`/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml`
Persistent data	`/vmPool/r/services/chrysopedia_*`

XPLTD conventions: xpltd_chrysopedia project name, dedicated bridge network (172.32.0.0/24), bind mounts under /vmPool/r/services/, PostgreSQL on port 5433.