
# Chrysopedia

> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction and retrieval system for electronic music production content. It transcribes video libraries with Whisper, extracts key moments and techniques through LLM analysis, and serves a search-first web UI for mid-session retrieval.

---

## Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│  Desktop (GPU workstation)                                       │
│  ┌──────────────┐                                                │
│  │ whisper/      │  Transcribes video → JSON (Whisper large-v3)  │
│  │ transcribe.py │  Runs locally with CUDA, outputs to /data     │
│  └──────┬───────┘                                                │
│         │ JSON transcripts                                       │
└─────────┼────────────────────────────────────────────────────────┘
          │
          ▼
┌──────────────────────────────────────────────────────────────────┐
│  Docker Compose (xpltd_chrysopedia) — Server (e.g. ub01)         │
│                                                                  │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────────┐   │
│  │ chrysopedia-db │  │chrysopedia-redis│  │ chrysopedia-api  │   │
│  │ PostgreSQL 16  │  │  Redis 7        │  │ FastAPI + Uvicorn│   │
│  │ :5433→5432     │  │                 │  │ :8000            │   │
│  └────────────────┘  └─────────────────┘  └────────┬─────────┘   │
│                                                    │             │
│  ┌──────────────────┐  ┌──────────────────────┐    │             │
│  │ chrysopedia-web  │  │ chrysopedia-worker   │    │             │
│  │ React + nginx    │  │ Celery (LLM pipeline)│    │             │
│  │ :3000→80         │  │                      │    │             │
│  └──────────────────┘  └──────────────────────┘    │             │
│                                                    │             │
│  Network: chrysopedia (172.24.0.0/24)              │             │
└──────────────────────────────────────────────────────────────────┘
```

### Services

| Service              | Image / Build          | Port          | Purpose                                    |
|----------------------|------------------------|---------------|--------------------------------------------|
| `chrysopedia-db`     | `postgres:16-alpine`   | `5433 → 5432` | Primary data store (7-entity schema)       |
| `chrysopedia-redis`  | `redis:7-alpine`       | —             | Celery broker / cache                      |
| `chrysopedia-api`    | `docker/Dockerfile.api`| `8000`        | FastAPI REST API                           |
| `chrysopedia-worker` | `docker/Dockerfile.api`| —             | Celery worker for LLM pipeline stages 2-5  |
| `chrysopedia-web`    | `docker/Dockerfile.web`| `3000 → 80`  | React frontend (nginx)                     |

### Data Model (7 entities)

- **Creator** — artists/producers whose content is indexed
- **SourceVideo** — original video files processed by the pipeline
- **TranscriptSegment** — timestamped text segments from Whisper
- **KeyMoment** — discrete insights extracted by LLM analysis
- **TechniquePage** — synthesized knowledge pages (primary output)
- **RelatedTechniqueLink** — cross-references between technique pages
- **Tag** — hierarchical topic/genre taxonomy
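
The relationships between these entities can be pictured with a small sketch. The dataclasses below are illustrative only: apart from the creator `slug` (which appears in the API paths) every field name is an assumption made for this example, and the real ORM definitions in `backend/models.py` remain authoritative.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only: field names are assumptions, not the real schema.


@dataclass
class Creator:
    slug: str
    name: str


@dataclass
class SourceVideo:
    creator: Creator
    filename: str


@dataclass
class TranscriptSegment:
    video: SourceVideo
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    text: str


@dataclass
class KeyMoment:
    video: SourceVideo
    start_s: float
    end_s: float
    summary: str


@dataclass
class Tag:
    name: str
    parent: Optional["Tag"] = None  # hierarchical taxonomy


@dataclass
class TechniquePage:
    title: str
    moments: list[KeyMoment] = field(default_factory=list)
    tags: list[Tag] = field(default_factory=list)


@dataclass
class RelatedTechniqueLink:
    source: TechniquePage
    target: TechniquePage


def creators_for(page: TechniquePage) -> set[str]:
    """Return the slugs of every creator whose videos feed a technique page."""
    return {moment.video.creator.slug for moment in page.moments}
```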

---

## Prerequisites

- **Docker** ≥ 24.0 and **Docker Compose** ≥ 2.20
- **Python 3.10+** (for the Whisper transcription script)
- **ffmpeg** (for audio extraction)
- **NVIDIA GPU + CUDA** (recommended for Whisper; CPU fallback available)

---

## Quick Start

### 1. Clone and configure

```bash
git clone <repository-url>
cd content-to-kb-automator

# Create environment file from template
cp .env.example .env
# Edit .env with your actual values (see Environment Variables below)
```

### 2. Start the Docker Compose stack

```bash
docker compose up -d
```

This starts PostgreSQL, Redis, the API server, the Celery worker, and the web UI.

### 3. Run database migrations

```bash
# From inside the API container:
docker compose exec chrysopedia-api alembic upgrade head

# Or locally (requires Python venv with backend deps):
alembic upgrade head
```

### 4. Verify the stack

```bash
# Health check (with DB connectivity)
curl http://localhost:8000/health

# API health (lightweight, no DB)
curl http://localhost:8000/api/v1/health

# Docker Compose status
docker compose ps
```
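
For scripted checks, the same two endpoints can be probed from Python. This is a minimal sketch using only the standard library; the base URL is an assumption matching the default port mapping, and the helper names are hypothetical.

```python
import urllib.error
import urllib.request

# Paths taken from the curl commands above.
HEALTH_PATHS = ("/health", "/api/v1/health")


def health_urls(base_url: str = "http://localhost:8000") -> list[str]:
    """Build the full URL for each health endpoint."""
    return [base_url.rstrip("/") + path for path in HEALTH_PATHS]


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def report(base_url: str = "http://localhost:8000") -> dict[str, bool]:
    """Map each health URL to whether it responded successfully."""
    return {url: is_healthy(url) for url in health_urls(base_url)}
```

Calling `report()` once the stack is up returns one boolean per endpoint, which makes it easy to tell a database problem (`/health` failing) from an API problem (both failing).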

### 5. Transcribe videos (desktop)

```bash
cd whisper
pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch (all videos in a directory)
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

See [`whisper/README.md`](whisper/README.md) for full transcription documentation.

---

## Environment Variables

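Create `.env` from `.env.example`. The defaults below are intended for local development; replace the placeholder secrets (`POSTGRES_PASSWORD`, `LLM_API_KEY`, `APP_SECRET_KEY`) before deploying.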

### Database

| Variable           | Default        | Description                     |
|--------------------|----------------|---------------------------------|
| `POSTGRES_USER`    | `chrysopedia`  | PostgreSQL username             |
| `POSTGRES_PASSWORD`| `changeme`     | PostgreSQL password             |
| `POSTGRES_DB`      | `chrysopedia`  | Database name                   |
| `DATABASE_URL`     | *(composed)*   | Full async connection string    |
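
As a rough illustration of what "composed" could mean, the sketch below assembles a connection string from the `POSTGRES_*` variables. The `postgresql+asyncpg` scheme, the `chrysopedia-db` host, and the function name are assumptions for this example; check `backend/config.py` for the actual logic.

```python
import os
from urllib.parse import quote


def compose_database_url(env=None, host="chrysopedia-db", port=5432) -> str:
    """Assemble an async SQLAlchemy-style URL from the POSTGRES_* variables."""
    env = os.environ if env is None else env
    explicit = env.get("DATABASE_URL")
    if explicit:
        # An explicitly set DATABASE_URL wins over the composed value.
        return explicit
    user = env.get("POSTGRES_USER", "chrysopedia")
    password = env.get("POSTGRES_PASSWORD", "changeme")
    db = env.get("POSTGRES_DB", "chrysopedia")
    # Percent-encode credentials so characters like '@' or '/' stay valid.
    creds = f"{quote(user, safe='')}:{quote(password, safe='')}"
    # Assumption: the asyncpg driver is used for the async engine.
    return f"postgresql+asyncpg://{creds}@{host}:{port}/{db}"
```

From the host machine, the mapped port applies instead: `compose_database_url(host="localhost", port=5433)`.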

### Services

| Variable        | Default                            | Description              |
|-----------------|------------------------------------|--------------------------|
| `REDIS_URL`     | `redis://chrysopedia-redis:6379/0` | Redis connection string  |

### LLM Configuration

| Variable            | Default                                   | Description                        |
|---------------------|-------------------------------------------|------------------------------------|
| `LLM_API_URL`       | `https://friend-openwebui.example.com/api`| Primary LLM endpoint (OpenAI-compatible) |
| `LLM_API_KEY`       | `sk-changeme`                             | API key for primary LLM            |
| `LLM_MODEL`         | `qwen2.5-72b`                             | Primary model name                 |
| `LLM_FALLBACK_URL`  | `http://localhost:11434/v1`               | Fallback LLM endpoint (Ollama)     |
| `LLM_FALLBACK_MODEL`| `qwen2.5:14b-q8_0`                       | Fallback model name                |
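
The primary/fallback pairing can be sketched as follows. This assumes the fallback is tried whenever the primary call raises; the `call` parameter is a hypothetical stand-in for whatever HTTP client the worker actually uses, and the real retry policy may differ.

```python
from typing import Callable

# A caller takes (endpoint_url, model, prompt) and returns completion text.
# It is a hypothetical placeholder, not part of the project's code.
Caller = Callable[[str, str, str], str]


def complete_with_fallback(
    prompt: str,
    call: Caller,
    primary: tuple[str, str],
    fallback: tuple[str, str],
) -> str:
    """Try the primary (url, model) pair, then the fallback if it raises."""
    errors = []
    for url, model in (primary, fallback):
        try:
            return call(url, model, prompt)
        except Exception as exc:  # sketch: a real client would narrow this
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all LLM endpoints failed: " + "; ".join(errors))
```

With the defaults above, `primary` would be `(LLM_API_URL, LLM_MODEL)` and `fallback` would be `(LLM_FALLBACK_URL, LLM_FALLBACK_MODEL)`.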

### Embedding / Vector

| Variable              | Default                       | Description              |
|-----------------------|-------------------------------|--------------------------|
| `EMBEDDING_API_URL`   | `http://localhost:11434/v1`   | Embedding endpoint       |
| `EMBEDDING_MODEL`     | `nomic-embed-text`            | Embedding model name     |
| `QDRANT_URL`          | `http://qdrant:6333`          | Qdrant vector DB URL     |
| `QDRANT_COLLECTION`   | `chrysopedia`                 | Qdrant collection name   |
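
If the embedding endpoint follows the OpenAI-compatible `/embeddings` convention (an assumption; the default URL points at Ollama's `/v1` base, which offers that route), a request could be prepared as sketched below. The function names are hypothetical, and nothing is sent over the network here.

```python
import json
import urllib.request


def build_embedding_request(
    texts,
    base_url: str = "http://localhost:11434/v1",
    model: str = "nomic-embed-text",
) -> urllib.request.Request:
    """Prepare (but do not send) a POST to the /embeddings route."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload: dict) -> list[list[float]]:
    """Extract vectors from an OpenAI-style response, ordered by index."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Passing the request to `urllib.request.urlopen` and the decoded JSON body to `parse_embeddings` yields one vector per input text, ready to be written to the Qdrant collection.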

### Application

| Variable                 | Default                          | Description                    |
|--------------------------|----------------------------------|--------------------------------|
| `APP_ENV`                | `production`                     | Environment (`development` / `production`) |
| `APP_LOG_LEVEL`          | `info`                           | Log level                      |
| `APP_SECRET_KEY`         | `changeme-generate-a-real-secret`| Application secret key         |
| `TRANSCRIPT_STORAGE_PATH`| `/data/transcripts`             | Transcript JSON storage path   |
| `VIDEO_METADATA_PATH`    | `/data/video_meta`              | Video metadata storage path    |
| `REVIEW_MODE`            | `true`                           | Enable human review workflow   |
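
To show how these variables and their defaults fit together, here is a standard-library sketch of a settings loader. The project itself uses pydantic-settings in `backend/config.py`; the class and helper below are a simplified, hypothetical stand-in, and `APP_SECRET_KEY` is deliberately left out so the example never handles secrets.

```python
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class AppSettings:
    app_env: str
    log_level: str
    transcript_storage_path: str
    video_metadata_path: str
    review_mode: bool

    @classmethod
    def from_env(cls, env=None) -> "AppSettings":
        """Read the variables from the table above, applying its defaults."""
        env = os.environ if env is None else env
        return cls(
            app_env=env.get("APP_ENV", "production"),
            log_level=env.get("APP_LOG_LEVEL", "info"),
            transcript_storage_path=env.get(
                "TRANSCRIPT_STORAGE_PATH", "/data/transcripts"
            ),
            video_metadata_path=env.get("VIDEO_METADATA_PATH", "/data/video_meta"),
            review_mode=_as_bool(env.get("REVIEW_MODE", "true")),
        )
```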

---

## Development Workflow

### Local development (without Docker)

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate

# Install backend dependencies
pip install -r backend/requirements.txt

# Start PostgreSQL and Redis (via Docker)
docker compose up -d chrysopedia-db chrysopedia-redis

# Run migrations
alembic upgrade head

# Start the API server with hot-reload
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000
```

### Database migrations

```bash
# Create a new migration after model changes
alembic revision --autogenerate -m "describe_change"

# Apply all pending migrations
alembic upgrade head

# Rollback one migration
alembic downgrade -1
```

### Project structure

```
content-to-kb-automator/
├── backend/               # FastAPI application
│   ├── main.py            # App entry point, middleware, routers
│   ├── config.py          # pydantic-settings configuration
│   ├── database.py        # SQLAlchemy async engine + session
│   ├── models.py          # 7-entity ORM models
│   ├── schemas.py         # Pydantic request/response schemas
│   ├── routers/           # API route handlers
│   │   ├── health.py      # /health (DB check)
│   │   ├── creators.py    # /api/v1/creators
│   │   └── videos.py      # /api/v1/videos
│   └── requirements.txt   # Python dependencies
├── whisper/               # Desktop transcription script
│   ├── transcribe.py      # Whisper CLI tool
│   ├── requirements.txt   # Whisper + ffmpeg deps
│   └── README.md          # Transcription documentation
├── docker/                # Dockerfiles
│   ├── Dockerfile.api     # FastAPI + Celery image
│   ├── Dockerfile.web     # React + nginx image
│   └── nginx.conf         # nginx reverse proxy config
├── alembic/               # Database migrations
│   ├── env.py             # Migration environment
│   └── versions/          # Migration scripts
├── config/                # Configuration files
│   └── canonical_tags.yaml # 6 topic categories + genre taxonomy
├── prompts/               # LLM prompt templates (editable)
├── frontend/              # React web UI (placeholder)
├── tests/                 # Test fixtures and test suites
│   └── fixtures/          # Sample data for testing
├── docker-compose.yml     # Full stack definition
├── alembic.ini            # Alembic configuration
├── .env.example           # Environment variable template
└── chrysopedia-spec.md    # Full project specification
```

---

## API Endpoints

| Method | Path                        | Description                     |
|--------|-----------------------------|---------------------------------|
| GET    | `/health`                   | Health check with DB connectivity |
| GET    | `/api/v1/health`            | Lightweight health (no DB)      |
| GET    | `/api/v1/creators`          | List all creators               |
| GET    | `/api/v1/creators/{slug}`   | Get creator by slug             |
| GET    | `/api/v1/videos`            | List all source videos          |
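
A small client for the versioned endpoints might look like the sketch below. The helper names and the base URL are assumptions, and the response shape is not specified here, so the example simply returns whatever JSON the API sends back.

```python
import json
import urllib.request
from urllib.parse import quote


def endpoint(base_url: str, *segments: str) -> str:
    """Join path segments under /api/v1, percent-encoding each one."""
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{base_url.rstrip('/')}/api/v1/{path}"


def get_json(url: str, timeout: float = 10.0):
    """GET a URL and decode the JSON body (no response schema is assumed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `get_json(endpoint("http://localhost:8000", "creators", "some-artist"))` would request `/api/v1/creators/some-artist`, where `some-artist` is a made-up slug.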

---

## XPLTD Conventions

This project follows XPLTD infrastructure conventions:

- **Docker project name:** `xpltd_chrysopedia`
- **Bind mounts:** persistent data stored under `/vmPool/r/services/`
- **Network:** dedicated bridge `chrysopedia` (`172.24.0.0/24`)
- **PostgreSQL host port:** `5433` (avoids conflict with system PostgreSQL on `5432`)

---

## Deployment (ub01)

The production stack runs on **ub01.a.xpltd.co**:

```bash
# Clone (first time only — requires SSH agent forwarding)
ssh -A ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git clone git@github.com:xpltdco/chrysopedia.git .

# Create .env from template
cp .env.example .env
# Edit .env with production secrets

# Build and start
docker compose build
docker compose up -d

# Run migrations
docker exec chrysopedia-api alembic upgrade head

# Pull embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text
```

### Service URLs

| Service | URL |
|---------|-----|
| Web UI | http://ub01:8096 |
| API Health | http://ub01:8096/health |
| PostgreSQL | ub01:5433 |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |

### Update Workflow

```bash
ssh -A ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull
docker compose build && docker compose up -d
```