# Chrysopedia
> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.
---
## Information Flow
Content moves through six stages from raw video to searchable knowledge:
```
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 1 · Transcription                                   [Desktop / GPU] │
│                                                                           │
│   Video files → Whisper large-v3 (CUDA) → JSON transcripts                │
│   Output: timestamped segments with speaker text                          │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │ JSON files (manual or folder watcher)
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 2 · Ingestion                                       [API + Watcher] │
│                                                                           │
│   POST /api/v1/ingest ← watcher auto-submits from /watch folder           │
│   • Validate JSON structure                                               │
│   • Compute content hash (SHA-256) for deduplication                      │
│   • Find-or-create Creator from folder name                               │
│   • Upsert SourceVideo (exact filename → content hash → fuzzy match)      │
│   • Bulk-insert TranscriptSegment rows                                    │
│   • Dispatch pipeline to Celery worker                                    │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │ Celery task: run_pipeline(video_id)
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 3 · LLM Extraction Pipeline                         [Celery Worker] │
│                                                                           │
│   Four sequential LLM stages, each with its own prompt template:          │
│                                                                           │
│   3a. Segmentation   — Split transcript into semantic topic boundaries    │
│       Model: chat (fast)        Prompt: stage2_segmentation.txt           │
│                                                                           │
│   3b. Extraction     — Identify key moments (title, summary, timestamps)  │
│       Model: reasoning (think)  Prompt: stage3_extraction.txt             │
│                                                                           │
│   3c. Classification — Assign content types + extract plugin names        │
│       Model: chat (fast)        Prompt: stage4_classification.txt         │
│                                                                           │
│   3d. Synthesis      — Compose technique pages from approved moments      │
│       Model: reasoning (think)  Prompt: stage5_synthesis.txt              │
│                                                                           │
│   Each stage emits PipelineEvent rows (tokens, duration, model, errors)   │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │ KeyMoment rows (review_status: pending)
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 4 · Review & Curation                                    [Admin UI] │
│                                                                           │
│   Admin reviews extracted KeyMoments before they become technique pages:  │
│   • Approve — moment proceeds to synthesis                                │
│   • Edit — correct title, summary, content type, plugins, then approve    │
│   • Reject — moment is excluded from knowledge base                       │
│   (When REVIEW_MODE=false, moments auto-approve and skip this stage)      │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │ Approved moments → Stage 3d synthesis
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 5 · Knowledge Base                                         [Web UI] │
│                                                                           │
│   TechniquePages — the primary output:                                    │
│   • Structured body sections, signal chains, plugin lists                 │
│   • Linked to source KeyMoments with video timestamps                     │
│   • Cross-referenced via RelatedTechniqueLinks                            │
│   • Versioned (snapshots before each re-synthesis)                        │
│   • Organized by topic taxonomy (6 categories from canonical_tags.yaml)   │
└─────────────────────────────────┬─────────────────────────────────────────┘
                                  │
┌───────────────────────────────────────────────────────────────────────────┐
│ STAGE 6 · Search & Retrieval                                     [Web UI] │
│                                                                           │
│   • Semantic search: query → embedding → Qdrant vector similarity         │
│   • Keyword fallback: ILIKE search on title/summary (300ms timeout)       │
│   • Browse by topic hierarchy, creator, or content type                   │
│   • Typeahead search from home page (debounced, top 5 results)            │
└───────────────────────────────────────────────────────────────────────────┘
```
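The SHA-256 dedup step in Stage 2 can be sketched as follows. This is a simplified illustration, not the actual ingest code: which fields contribute to the hash (here, segment text only, so re-uploads with shifted timestamps still match) is an assumption.

```python
import hashlib
import json

def content_hash(segments: list[dict]) -> str:
    """Deterministic SHA-256 over the transcript text, so the same video
    uploaded under a different filename is detected as a duplicate.
    NOTE: hashing only the text (not timestamps) is an illustrative choice;
    the real implementation may hash different fields."""
    canonical = json.dumps([s["text"] for s in segments], sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

segments = [
    {"start": 0.0, "end": 4.2, "text": "Let's look at sidechain compression."},
    {"start": 4.2, "end": 9.8, "text": "Route the kick to the compressor input."},
]
h = content_hash(segments)
print(h[:12])  # stable hex-digest prefix used for the dedup lookup
```

Because the hash ignores timestamps in this sketch, a re-transcription that shifts segment boundaries but keeps the same text still maps to the same SourceVideo.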
---
## Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ Desktop (GPU workstation — hal0022)                                      │
│   whisper/transcribe.py → JSON transcripts → copy to /watch folder       │
└────────────────────────────┬─────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│                 Docker Compose: xpltd_chrysopedia (ub01)                 │
│                   Network: chrysopedia (172.32.0.0/24)                   │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐    │
│  │ PostgreSQL │  │    Redis    │  │    Qdrant     │  │    Ollama    │    │
│  │   :5433    │  │  broker +   │  │   vector DB   │  │  embeddings  │    │
│  │ 7 entities │  │    cache    │  │   semantic    │  │ nomic-embed  │    │
│  └─────┬──────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘    │
│        │                │                 │                 │            │
│  ┌─────┴────────────────┴─────────────────┴─────────────────┴───────┐    │
│  │                          FastAPI (API)                           │    │
│  │   Ingest · Pipeline control · Review · Search · CRUD · Reports   │    │
│  └─────────────────────────────┬────────────────────────────────────┘    │
│                                │                                         │
│  ┌──────────────┐  ┌───────────┴────┐  ┌──────────────────────────┐      │
│  │   Watcher    │  │ Celery Worker  │  │      Web UI (React)      │      │
│  │  /watch →    │  │  LLM pipeline  │  │      nginx → :8096       │      │
│  │ auto-ingest  │  │   stages 2-5   │  │  search-first interface  │      │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
```
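The semantic-search path through these services (query → Ollama embedding → Qdrant similarity lookup) can be sketched with plain HTTP calls. This is an illustrative sketch, not the code in `search_service.py`: the endpoint paths follow the public Ollama and Qdrant REST APIs, while the container hostnames and collection name are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://chrysopedia-ollama:11434"  # assumed container hostname
QDRANT_URL = "http://chrysopedia-qdrant:6333"   # assumed container hostname

def _post(url: str, payload: dict) -> dict:
    # Minimal JSON-over-HTTP helper using only the standard library.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def embed(text: str) -> list[float]:
    # Ollama embeddings endpoint; the model name comes from EMBEDDING_MODEL.
    body = _post(f"{OLLAMA_URL}/api/embeddings",
                 {"model": "nomic-embed-text", "prompt": text})
    return body["embedding"]

def search_payload(vector: list[float], limit: int = 10) -> dict:
    # Request body for Qdrant's POST /collections/<name>/points/search.
    return {"vector": vector, "limit": limit, "with_payload": True}

def semantic_search(query: str, limit: int = 10) -> list[dict]:
    vec = embed(query)
    body = _post(f"{QDRANT_URL}/collections/chrysopedia/points/search",
                 search_payload(vec, limit))
    return body["result"]
```

In production the collection name is read from `QDRANT_COLLECTION`; `semantic_search("sidechain pumping")` would return scored points whose payloads reference technique pages.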
### Services
| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store |
| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker + feature flag cache |
| `chrysopedia-qdrant` | `qdrant/qdrant:v1.13.2` | — | Vector DB for semantic search |
| `chrysopedia-ollama` | `ollama/ollama` | — | Embedding model server (nomic-embed-text) |
| `chrysopedia-api` | `Dockerfile.api` | `8000` | FastAPI REST API |
| `chrysopedia-worker` | `Dockerfile.api` | — | Celery worker (LLM pipeline) |
| `chrysopedia-watcher` | `Dockerfile.api` | — | Folder monitor → auto-ingest |
| `chrysopedia-web` | `Dockerfile.web` | `8096 → 80` | React frontend (nginx) |
### Data Model
| Entity | Purpose |
|--------|---------|
| **Creator** | Artists/producers whose content is indexed |
| **SourceVideo** | Video files processed by the pipeline (with content hash dedup) |
| **TranscriptSegment** | Timestamped text segments from Whisper |
| **KeyMoment** | Discrete insights extracted by LLM analysis |
| **TechniquePage** | Synthesized knowledge pages — the primary output |
| **TechniquePageVersion** | Snapshots before re-synthesis overwrites |
| **RelatedTechniqueLink** | Cross-references between technique pages |
| **Tag** | Hierarchical topic taxonomy |
| **ContentReport** | User-submitted content issues |
| **PipelineEvent** | Structured pipeline execution logs (tokens, timing, errors) |
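The review gate between KeyMoment and TechniquePage can be sketched in plain Python. The real definitions are SQLAlchemy models in `backend/models.py`; the field names and the `pending → approved / rejected` status values below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class KeyMoment:
    # One LLM-extracted insight, gated by the Stage 4 review workflow.
    title: str
    summary: str
    start_s: float
    end_s: float
    review_status: str = "pending"  # assumed values: pending / approved / rejected

@dataclass
class TechniquePage:
    # The primary output; synthesis consumes approved moments only.
    slug: str
    title: str
    moments: list[KeyMoment] = field(default_factory=list)

    def approved_moments(self) -> list[KeyMoment]:
        return [m for m in self.moments if m.review_status == "approved"]

page = TechniquePage(slug="sidechain-compression", title="Sidechain Compression")
page.moments = [
    KeyMoment("Kick routing", "Route kick to sidechain input", 12.0, 45.0, "approved"),
    KeyMoment("Off-topic aside", "Unrelated chatter", 60.0, 70.0, "rejected"),
]
print(len(page.approved_moments()))  # 1
```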
---
## Quick Start
### Prerequisites
- Docker ≥ 24.0 and Docker Compose ≥ 2.20
- Python 3.10+ and an NVIDIA GPU with CUDA (for Whisper transcription)
### Setup
```bash
# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env # edit with real values
# Start the stack
docker compose up -d
# Run database migrations
docker exec chrysopedia-api alembic upgrade head
# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text
# Verify
curl http://localhost:8096/health
```
### Transcribe videos
```bash
cd whisper && pip install -r requirements.txt
# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
See [`whisper/README.md`](whisper/README.md) for full transcription docs.
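The ingest endpoint consumes the JSON these scripts emit. The authoritative schema lives in `backend/schemas.py`; the shape below is an assumed illustration of a Whisper-style transcript (timestamped segments), together with a minimal structural check of the kind Stage 2 performs.

```python
# Hypothetical transcript payload — field names are illustrative, not the
# authoritative schema (see backend/schemas.py).
transcript = {
    "video": "Deep House Bass Design.mp4",
    "creator": "Example Creator",
    "language": "en",
    "segments": [
        {"start": 0.0, "end": 6.5, "text": "Today we're designing a deep house bass."},
        {"start": 6.5, "end": 14.2, "text": "Start with a detuned saw and a low-pass filter."},
    ],
}

def validate(t: dict) -> list[str]:
    """Return a list of structural problems (empty means valid)."""
    errors = []
    if not isinstance(t.get("segments"), list) or not t["segments"]:
        errors.append("segments must be a non-empty list")
    for i, seg in enumerate(t.get("segments", [])):
        if not all(k in seg for k in ("start", "end", "text")):
            errors.append(f"segment {i} missing start/end/text")
        elif seg["end"] < seg["start"]:
            errors.append(f"segment {i} has end before start")
    return errors

print(validate(transcript))  # []
```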
---
## Environment Variables
Copy `.env.example` to `.env`. Key groups:
| Group | Variables | Notes |
|-------|-----------|-------|
| **Database** | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` | Default user: `chrysopedia` |
| **LLM** | `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL` | OpenAI-compatible endpoint |
| **LLM Fallback** | `LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL` | Automatic failover |
| **Per-Stage Models** | `LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY` | `chat` for fast stages, `thinking` for reasoning |
| **Embedding** | `EMBEDDING_API_URL`, `EMBEDDING_MODEL` | Ollama nomic-embed-text |
| **Vector DB** | `QDRANT_URL`, `QDRANT_COLLECTION` | Container-internal |
| **Features** | `REVIEW_MODE`, `DEBUG_MODE` | Review gate + LLM I/O capture |
| **Storage** | `TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH` | Container bind mounts |
---
## API Endpoints
### Public
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (DB connectivity) |
| GET | `/api/v1/search?q=&scope=&limit=` | Semantic + keyword search |
| GET | `/api/v1/techniques` | List technique pages |
| GET | `/api/v1/techniques/{slug}` | Technique detail + key moments |
| GET | `/api/v1/techniques/{slug}/versions` | Version history |
| GET | `/api/v1/creators` | List creators (sort, genre filter) |
| GET | `/api/v1/creators/{slug}` | Creator detail |
| GET | `/api/v1/topics` | Topic hierarchy with counts |
| GET | `/api/v1/videos` | List source videos |
| POST | `/api/v1/reports` | Submit content report |
### Admin
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/review/queue` | Review queue (status filter) |
| POST | `/api/v1/review/moments/{id}/approve` | Approve key moment |
| POST | `/api/v1/review/moments/{id}/reject` | Reject key moment |
| PUT | `/api/v1/review/moments/{id}` | Edit key moment |
| POST | `/api/v1/admin/pipeline/trigger/{video_id}` | Trigger/retrigger pipeline |
| GET | `/api/v1/admin/pipeline/events/{video_id}` | Pipeline event log |
| GET | `/api/v1/admin/pipeline/token-summary/{video_id}` | Token usage by stage |
| GET | `/api/v1/admin/pipeline/worker-status` | Celery worker status |
| PUT | `/api/v1/admin/pipeline/debug-mode` | Toggle debug mode |
### Ingest
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/ingest` | Upload Whisper JSON transcript |
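A transcript can also be submitted programmatically. A hedged standard-library sketch: the endpoint path comes from the table above, but the base URL (nginx on `:8096` proxying the API, as in Quick Start) and the payload shape are assumptions.

```python
import json
import urllib.request

API_BASE = "http://localhost:8096"  # assumed: nginx front end; the API itself is on :8000

def build_request(transcript: dict) -> urllib.request.Request:
    # Construct (but do not send) the ingest request.
    return urllib.request.Request(
        f"{API_BASE}/api/v1/ingest",
        data=json.dumps(transcript).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ingest(transcript: dict) -> dict:
    # Send the request; requires the stack to be running.
    with urllib.request.urlopen(build_request(transcript)) as resp:
        return json.load(resp)

req = build_request({"segments": [{"start": 0.0, "end": 2.0, "text": "hello"}]})
print(req.full_url, req.get_method())  # http://localhost:8096/api/v1/ingest POST
```

The watcher service does the same thing automatically for files dropped into the `/watch` folder.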
---
## Development
```bash
# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000
# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head
```
### Project Structure
```
chrysopedia/
├── backend/                  # FastAPI application
│   ├── main.py               # Entry point, middleware, router mounting
│   ├── config.py             # Pydantic Settings (all env vars)
│   ├── models.py             # SQLAlchemy ORM models
│   ├── schemas.py            # Pydantic request/response schemas
│   ├── worker.py             # Celery app configuration
│   ├── watcher.py            # Transcript folder watcher service
│   ├── search_service.py     # Semantic search + keyword fallback
│   ├── routers/              # API endpoint handlers
│   ├── pipeline/             # LLM pipeline stages + clients
│   │   ├── stages.py         # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py     # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                 # React + TypeScript + Vite
│   └── src/
│       ├── pages/            # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/       # Shared UI components
│       └── api/              # Typed API clients
├── whisper/                  # Desktop transcription (Whisper large-v3)
├── docker/                   # Dockerfiles + nginx config
├── alembic/                  # Database migrations
├── config/                   # canonical_tags.yaml (topic taxonomy)
├── prompts/                  # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example
```
---
## Deployment (ub01)
```bash
ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d
```
| Resource | Location |
|----------|----------|
| Web UI | `http://ub01:8096` |
| API | `http://ub01:8096/health` |
| PostgreSQL | `ub01:5433` |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |
| Persistent data | `/vmPool/r/services/chrysopedia_*` |
XPLTD conventions: `xpltd_chrysopedia` project name, dedicated bridge network (`172.32.0.0/24`), bind mounts under `/vmPool/r/services/`, PostgreSQL on port `5433`.