Knowledge base for music production techniques — extracted from video content via LLM pipeline

Chrysopedia

From chrysopoeia (alchemical transmutation of base material into gold) + encyclopedia. Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.


Information Flow

Content moves through six stages from raw video to searchable knowledge:

 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 1 · Transcription                            [Desktop / GPU]    │
 │                                                                         │
 │  Video files → Whisper large-v3 (CUDA) → JSON transcripts              │
 │  Output: timestamped segments with speaker text                         │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ JSON files (manual or folder watcher)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 2 · Ingestion                                [API + Watcher]    │
 │                                                                         │
 │  POST /api/v1/ingest ← watcher auto-submits from /watch folder         │
 │  • Validate JSON structure                                              │
 │  • Compute content hash (SHA-256) for deduplication                     │
 │  • Find-or-create Creator from folder name                              │
 │  • Upsert SourceVideo (exact filename → content hash → fuzzy match)     │
 │  • Bulk-insert TranscriptSegment rows                                   │
 │  • Dispatch pipeline to Celery worker                                   │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Celery task: run_pipeline(video_id)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 3 · LLM Extraction Pipeline                  [Celery Worker]    │
 │                                                                         │
 │  Four sequential LLM stages, each with its own prompt template:         │
 │                                                                         │
 │  3a. Segmentation — Split transcript into semantic topic boundaries     │
 │      Model: chat (fast)         Prompt: stage2_segmentation.txt         │
 │                                                                         │
 │  3b. Extraction — Identify key moments (title, summary, timestamps)     │
 │      Model: reasoning (think)   Prompt: stage3_extraction.txt           │
 │                                                                         │
 │  3c. Classification — Assign content types + extract plugin names       │
 │      Model: chat (fast)         Prompt: stage4_classification.txt       │
 │                                                                         │
 │  3d. Synthesis — Compose technique pages from approved moments          │
 │      Model: reasoning (think)   Prompt: stage5_synthesis.txt            │
 │                                                                         │
 │  Each stage emits PipelineEvent rows (tokens, duration, model, errors)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ KeyMoment rows (review_status: pending)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 4 · Review & Curation                        [Admin UI]         │
 │                                                                         │
 │  Admin reviews extracted KeyMoments before they become technique pages: │
 │  • Approve — moment proceeds to synthesis                               │
 │  • Edit — correct title, summary, content type, plugins, then approve   │
 │  • Reject — moment is excluded from knowledge base                      │
 │  (When REVIEW_MODE=false, moments auto-approve and skip this stage)     │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Approved moments → Stage 3d synthesis
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 5 · Knowledge Base                           [Web UI]           │
 │                                                                         │
 │  TechniquePages — the primary output:                                   │
 │  • Structured body sections, signal chains, plugin lists                │
 │  • Linked to source KeyMoments with video timestamps                    │
 │  • Cross-referenced via RelatedTechniqueLinks                           │
 │  • Versioned (snapshots before each re-synthesis)                       │
 │  • Organized by topic taxonomy (6 categories from canonical_tags.yaml)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 6 · Search & Retrieval                       [Web UI]           │
 │                                                                         │
 │  • Semantic search: query → embedding → Qdrant vector similarity        │
 │  • Keyword fallback: ILIKE search on title/summary (300ms timeout)      │
 │  • Browse by topic hierarchy, creator, or content type                  │
 │  • Typeahead search from home page (debounced, top 5 results)           │
 └─────────────────────────────────────────────────────────────────────────┘
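The Stage 6 search behaviour can be sketched as a semantic lookup with a keyword fallback. This is a minimal illustration, not the code in search_service.py: the function names are hypothetical, and it assumes the 300 ms budget applies to the semantic path and that an empty semantic result also triggers the fallback.

```python
import asyncio

# Assumption: the 300 ms budget bounds the embedding + vector lookup.
SEMANTIC_TIMEOUT_S = 0.3


async def search(query, semantic_search, keyword_search, timeout=SEMANTIC_TIMEOUT_S):
    """Try semantic search first; fall back to keyword search.

    `semantic_search` and `keyword_search` are hypothetical async callables
    that take a query string and return a list of hits.
    """
    try:
        hits = await asyncio.wait_for(semantic_search(query), timeout=timeout)
        if hits:
            return {"mode": "semantic", "results": hits}
    except (asyncio.TimeoutError, ConnectionError):
        # Embedding service or vector DB was slow or unreachable.
        pass
    return {"mode": "keyword", "results": await keyword_search(query)}
```

Passing the two backends in as callables keeps the fallback logic testable without a running Qdrant or Ollama container.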

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Desktop (GPU workstation — hal0022)                                     │
│  whisper/transcribe.py → JSON transcripts → copy to /watch folder        │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_chrysopedia (ub01)                                │
│  Network: chrysopedia (172.32.0.0/24)                                    │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │    Qdrant     │  │    Ollama    │  │
│  │  :5433      │  │  broker +   │  │  vector DB    │  │  embeddings  │  │
│  │ 10 entities │  │  cache      │  │  semantic     │  │  nomic-embed │  │
│  └─────┬───────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘  │
│        │                 │                 │                 │           │
│  ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐  │
│  │                         FastAPI (API)                              │  │
│  │  Ingest · Pipeline control · Review · Search · CRUD · Reports     │  │
│  └──────────────────────────────┬────────────────────────────────────┘  │
│                                 │                                       │
│  ┌──────────────┐  ┌────────────┴───┐  ┌──────────────────────────┐    │
│  │   Watcher    │  │  Celery Worker │  │     Web UI (React)       │    │
│  │  /watch →    │  │  LLM pipeline  │  │  nginx → :8096           │    │
│  │  auto-ingest │  │  stages 2-5    │  │  search-first interface  │    │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────┘

Services

| Service | Image | Port | Purpose |
| --- | --- | --- | --- |
| chrysopedia-db | postgres:16-alpine | 5433 → 5432 | Primary data store |
| chrysopedia-redis | redis:7-alpine | | Celery broker + feature flag cache |
| chrysopedia-qdrant | qdrant/qdrant:v1.13.2 | | Vector DB for semantic search |
| chrysopedia-ollama | ollama/ollama | | Embedding model server (nomic-embed-text) |
| chrysopedia-api | Dockerfile.api | 8000 | FastAPI REST API |
| chrysopedia-worker | Dockerfile.api | | Celery worker (LLM pipeline) |
| chrysopedia-watcher | Dockerfile.api | | Folder monitor → auto-ingest |
| chrysopedia-web | Dockerfile.web | 8096 → 80 | React frontend (nginx) |

Data Model

| Entity | Purpose |
| --- | --- |
| Creator | Artists/producers whose content is indexed |
| SourceVideo | Video files processed by the pipeline (with content hash dedup) |
| TranscriptSegment | Timestamped text segments from Whisper |
| KeyMoment | Discrete insights extracted by LLM analysis |
| TechniquePage | Synthesized knowledge pages — the primary output |
| TechniquePageVersion | Snapshots before re-synthesis overwrites |
| RelatedTechniqueLink | Cross-references between technique pages |
| Tag | Hierarchical topic taxonomy |
| ContentReport | User-submitted content issues |
| PipelineEvent | Structured pipeline execution logs (tokens, timing, errors) |
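The SourceVideo deduplication order described for ingestion (exact filename, then content hash, then fuzzy match) could look roughly like the sketch below. It is an assumption-laden illustration: the row shape, the 0.9 threshold, and the use of difflib for fuzzy matching are all hypothetical, and the source does not say which bytes are hashed.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(raw: bytes) -> str:
    """SHA-256 hex digest of the transcript bytes (assumed hash input)."""
    return hashlib.sha256(raw).hexdigest()


def match_source_video(filename, digest, existing, fuzzy_threshold=0.9):
    """Return (row, reason) for the first match, or (None, "new").

    `existing` is a list of dicts with 'filename' and 'content_hash' keys,
    a hypothetical stand-in for SourceVideo rows.
    """
    # 1. Exact filename match.
    for row in existing:
        if row["filename"] == filename:
            return row, "filename"
    # 2. Same content under a different name.
    for row in existing:
        if row["content_hash"] == digest:
            return row, "content_hash"
    # 3. Fuzzy filename match (difflib is swapped in for illustration).
    best, best_score = None, 0.0
    for row in existing:
        score = SequenceMatcher(None, row["filename"], filename).ratio()
        if score > best_score:
            best, best_score = row, score
    if best is not None and best_score >= fuzzy_threshold:
        return best, "fuzzy"
    return None, "new"
```

Checking the cheap exact comparisons before the fuzzy pass means a renamed but otherwise identical transcript is caught by its hash rather than by a similarity guess.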

Quick Start

Prerequisites

  • Docker ≥ 24.0 and Docker Compose ≥ 2.20
  • Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)

Setup

# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env    # edit with real values

# Start the stack
docker compose up -d

# Run database migrations
docker exec chrysopedia-api alembic upgrade head

# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text

# Verify
curl http://localhost:8096/health

Transcribe videos

cd whisper && pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts

See whisper/README.md for full transcription docs.


Environment Variables

Copy .env.example to .env. Key groups:

| Group | Variables | Notes |
| --- | --- | --- |
| Database | POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB | Default user: chrysopedia |
| LLM | LLM_API_URL, LLM_API_KEY, LLM_MODEL | OpenAI-compatible endpoint |
| LLM Fallback | LLM_FALLBACK_URL, LLM_FALLBACK_MODEL | Automatic failover |
| Per-Stage Models | LLM_STAGE{2-5}_MODEL, LLM_STAGE{2-5}_MODALITY | chat for fast stages, thinking for reasoning |
| Embedding | EMBEDDING_API_URL, EMBEDDING_MODEL | Ollama nomic-embed-text |
| Vector DB | QDRANT_URL, QDRANT_COLLECTION | Container-internal |
| Features | REVIEW_MODE, DEBUG_MODE | Review gate + LLM I/O capture |
| Storage | TRANSCRIPT_STORAGE_PATH, VIDEO_METADATA_PATH | Container bind mounts |
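As an illustration of how the per-stage variables might be resolved, the sketch below reads LLM_STAGE{n}_MODEL and LLM_STAGE{n}_MODALITY for stages 2 to 5. The function name is hypothetical, and two behaviours are assumptions rather than documented facts: falling back to LLM_MODEL when a stage variable is unset, and defaulting the modality to chat for stages 2 and 4 and thinking for stages 3 and 5 (mirroring the pipeline diagram).

```python
import os


def stage_llm_config(stage: int, env=os.environ) -> dict:
    """Resolve model and modality for one pipeline stage (2-5)."""
    if stage not in (2, 3, 4, 5):
        raise ValueError(f"unknown pipeline stage: {stage}")
    # Assumed defaults: segmentation (2) and classification (4) are the
    # fast "chat" stages; extraction (3) and synthesis (5) use "thinking".
    default_modality = "chat" if stage in (2, 4) else "thinking"
    return {
        # Assumed fallback: use the global LLM_MODEL when no override is set.
        "model": env.get(f"LLM_STAGE{stage}_MODEL") or env.get("LLM_MODEL"),
        "modality": env.get(f"LLM_STAGE{stage}_MODALITY") or default_modality,
    }
```

The actual settings live in backend/config.py (Pydantic Settings), so treat this as a reading aid for the table rather than the project's loader.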

API Endpoints

Public

| Method | Path | Description |
| --- | --- | --- |
| GET | /health | Health check (DB connectivity) |
| GET | /api/v1/search?q=&scope=&limit= | Semantic + keyword search |
| GET | /api/v1/techniques | List technique pages |
| GET | /api/v1/techniques/{slug} | Technique detail + key moments |
| GET | /api/v1/techniques/{slug}/versions | Version history |
| GET | /api/v1/creators | List creators (sort, genre filter) |
| GET | /api/v1/creators/{slug} | Creator detail |
| GET | /api/v1/topics | Topic hierarchy with counts |
| GET | /api/v1/videos | List source videos |
| POST | /api/v1/reports | Submit content report |
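A client would typically build the search URL from the three query parameters listed above. The helper below is a hypothetical sketch: the accepted values for scope are not documented here, and the base URL is whatever host serves the API in your setup.

```python
from urllib.parse import urlencode


def search_url(base_url: str, q: str, scope=None, limit=None) -> str:
    """Build a GET /api/v1/search URL, omitting unset optional parameters."""
    params = {"q": q}
    if scope is not None:
        params["scope"] = scope  # valid scope values are not specified here
    if limit is not None:
        params["limit"] = limit
    return f"{base_url.rstrip('/')}/api/v1/search?{urlencode(params)}"


# Example: a five-result query against a local stack.
print(search_url("http://localhost:8096", "parallel compression", limit=5))
```

urlencode handles escaping, so queries containing spaces or ampersands do not need manual quoting.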

Admin

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/review/queue | Review queue (status filter) |
| POST | /api/v1/review/moments/{id}/approve | Approve key moment |
| POST | /api/v1/review/moments/{id}/reject | Reject key moment |
| PUT | /api/v1/review/moments/{id} | Edit key moment |
| POST | /api/v1/admin/pipeline/trigger/{video_id} | Trigger/retrigger pipeline |
| GET | /api/v1/admin/pipeline/events/{video_id} | Pipeline event log |
| GET | /api/v1/admin/pipeline/token-summary/{video_id} | Token usage by stage |
| GET | /api/v1/admin/pipeline/worker-status | Celery worker status |
| PUT | /api/v1/admin/pipeline/debug-mode | Toggle debug mode |
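The three review actions map onto the moment endpoints above. The sketch below only constructs the requests and does not send them. The helper name, the JSON body fields for an edit, and the absence of authentication headers are assumptions, since the request schemas are not shown here.

```python
import json
from urllib.request import Request


def review_request(base_url: str, moment_id, action: str, changes=None) -> Request:
    """Build (but do not send) a request for a review action on a key moment."""
    root = f"{base_url.rstrip('/')}/api/v1/review/moments/{moment_id}"
    if action in ("approve", "reject"):
        return Request(f"{root}/{action}", method="POST")
    if action == "edit":
        # Hypothetical body: the real edit schema is defined in backend/schemas.py.
        body = json.dumps(changes or {}).encode("utf-8")
        return Request(root, data=body, method="PUT",
                       headers={"Content-Type": "application/json"})
    raise ValueError(f"unknown review action: {action}")
```

A request built this way can be sent with urllib.request.urlopen once the stack is running.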

Ingest

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/ingest | Upload Whisper JSON transcript |

Development

# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head

Project Structure

chrysopedia/
├── backend/                 # FastAPI application
│   ├── main.py              # Entry point, middleware, router mounting
│   ├── config.py            # Pydantic Settings (all env vars)
│   ├── models.py            # SQLAlchemy ORM models
│   ├── schemas.py           # Pydantic request/response schemas
│   ├── worker.py            # Celery app configuration
│   ├── watcher.py           # Transcript folder watcher service
│   ├── search_service.py    # Semantic search + keyword fallback
│   ├── routers/             # API endpoint handlers
│   ├── pipeline/            # LLM pipeline stages + clients
│   │   ├── stages.py        # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py    # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                # React + TypeScript + Vite
│   └── src/
│       ├── pages/           # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/      # Shared UI components
│       └── api/             # Typed API clients
├── whisper/                 # Desktop transcription (Whisper large-v3)
├── docker/                  # Dockerfiles + nginx config
├── alembic/                 # Database migrations
├── config/                  # canonical_tags.yaml (topic taxonomy)
├── prompts/                 # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example

Deployment (ub01)

ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d

| Resource | Location |
| --- | --- |
| Web UI | http://ub01:8096 |
| API | http://ub01:8096/health |
| PostgreSQL | ub01:5433 |
| Compose config | /vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml |
| Persistent data | /vmPool/r/services/chrysopedia_* |

XPLTD conventions: xpltd_chrysopedia project name, dedicated bridge network (172.32.0.0/24), bind mounts under /vmPool/r/services/, PostgreSQL on port 5433.