# Chrysopedia
> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction and retrieval system for electronic music production content. Transcribes video libraries with Whisper, extracts key moments and techniques with LLM analysis, and serves a search-first web UI for mid-session retrieval.

---
## Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│  Desktop (GPU workstation)                                       │
│  ┌───────────────┐                                               │
│  │ whisper/      │  Transcribes video → JSON (Whisper large-v3)  │
│  │ transcribe.py │  Runs locally with CUDA, outputs to /data    │
│  └──────┬────────┘                                               │
│         │ JSON transcripts                                       │
└─────────┼────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────┐
│  Docker Compose (xpltd_chrysopedia) — Server (e.g. ub01)         │
│                                                                  │
│  ┌────────────────┐  ┌─────────────────┐  ┌──────────────────┐   │
│  │ chrysopedia-db │  │chrysopedia-redis│  │ chrysopedia-api  │   │
│  │ PostgreSQL 16  │  │ Redis 7         │  │ FastAPI + Uvicorn│   │
│  │ :5433→5432     │  │                 │  │ :8000            │   │
│  └────────────────┘  └─────────────────┘  └────────┬─────────┘   │
│                                                    │             │
│  ┌──────────────────┐  ┌──────────────────────┐    │             │
│  │ chrysopedia-web  │  │ chrysopedia-worker   │    │             │
│  │ React + nginx    │  │ Celery (LLM pipeline)│    │             │
│  │ :3000→80         │  │                      │    │             │
│  └──────────────────┘  └──────────────────────┘    │             │
│                                                    │             │
│  Network: chrysopedia (172.24.0.0/24)              │             │
└──────────────────────────────────────────────────────────────────┘
```
### Services
| Service | Image / Build | Port | Purpose |
|----------------------|------------------------|---------------|--------------------------------------------|
| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store (7-entity schema) |
| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker / cache |
| `chrysopedia-api` | `docker/Dockerfile.api`| `8000` | FastAPI REST API |
| `chrysopedia-worker` | `docker/Dockerfile.api`| — | Celery worker for LLM pipeline stages 2-5 |
| `chrysopedia-web` | `docker/Dockerfile.web`| `3000 → 80` | React frontend (nginx) |
### Data Model (7 entities)
- **Creator** — artists/producers whose content is indexed
- **SourceVideo** — original video files processed by the pipeline
- **TranscriptSegment** — timestamped text segments from Whisper
- **KeyMoment** — discrete insights extracted by LLM analysis
- **TechniquePage** — synthesized knowledge pages (primary output)
- **RelatedTechniqueLink** — cross-references between technique pages
- **Tag** — hierarchical topic/genre taxonomy
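
To make the "hierarchical" part of the taxonomy concrete, here is a minimal sketch of a tag that points at an optional parent. The `Tag` dataclass, its fields, and the example tag names are all hypothetical and chosen for illustration; the real ORM model lives in `backend/models.py` and may be structured differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tag:
    """Hypothetical stand-in for the Tag entity: a name plus an optional parent."""

    name: str
    parent: Optional[Tag] = None

    def path(self) -> list[str]:
        """Return tag names from the root of the taxonomy down to this tag."""
        names: list[str] = []
        node: Optional[Tag] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


# Example taxonomy values are invented for illustration only.
topic = Tag("Sound design")
subtopic = Tag("Bass", parent=topic)
print(" > ".join(subtopic.path()))
```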
---
## Prerequisites
- **Docker** ≥ 24.0 and **Docker Compose** ≥ 2.20
- **Python 3.10+** (for the Whisper transcription script)
- **ffmpeg** (for audio extraction)
- **NVIDIA GPU + CUDA** (recommended for Whisper; CPU fallback available)
---
## Quick Start
### 1. Clone and configure
```bash
git clone <repository-url>
cd content-to-kb-automator
# Create environment file from template
cp .env.example .env
# Edit .env with your actual values (see Environment Variables below)
```
### 2. Start the Docker Compose stack
```bash
docker compose up -d
```
This starts PostgreSQL, Redis, the API server, the Celery worker, and the web UI.
### 3. Run database migrations
```bash
# From inside the API container:
docker compose exec chrysopedia-api alembic upgrade head
# Or locally (requires Python venv with backend deps):
alembic upgrade head
```
### 4. Verify the stack
```bash
# Health check (with DB connectivity)
curl http://localhost:8000/health
# API health (lightweight, no DB)
curl http://localhost:8000/api/v1/health
# Docker Compose status
docker compose ps
```
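For scripted checks, the two `curl` probes above can be wrapped in a few lines of Python. This is a minimal sketch using only the standard library. It assumes both endpoints answer with HTTP 200 when healthy, which this README does not state explicitly, and `http_status` and `check_stack` are illustrative names, not part of the project.

```python
import urllib.error
import urllib.request
from typing import Callable, Sequence

# Endpoints taken from the curl commands above; host and port assume the
# default local Compose setup.
HEALTH_URLS = (
    "http://localhost:8000/health",
    "http://localhost:8000/api/v1/health",
)


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (for example 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def check_stack(
    urls: Sequence[str] = HEALTH_URLS,
    fetch: Callable[[str], int] = http_status,
) -> dict[str, bool]:
    """Map each URL to True when it answered with HTTP 200 (assumed to mean healthy)."""
    return {url: fetch(url) == 200 for url in urls}
```

Calling `check_stack()` with no arguments probes both endpoints and returns a dictionary keyed by URL. The `fetch` parameter exists so the function can be exercised without a running stack.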
### 5. Transcribe videos (desktop)
```bash
cd whisper
pip install -r requirements.txt
# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
# Batch (all videos in a directory)
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
See [`whisper/README.md`](whisper/README.md) for full transcription documentation.

---
## Environment Variables
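Create `.env` from `.env.example`. The defaults work for local development, but the placeholder secrets (`POSTGRES_PASSWORD`, `LLM_API_KEY`, `APP_SECRET_KEY`) should be replaced before any real deployment.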
### Database
| Variable | Default | Description |
|--------------------|----------------|---------------------------------|
| `POSTGRES_USER` | `chrysopedia` | PostgreSQL username |
| `POSTGRES_PASSWORD`| `changeme` | PostgreSQL password |
| `POSTGRES_DB` | `chrysopedia` | Database name |
| `DATABASE_URL` | *(composed)* | Full async connection string |
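
The table lists `DATABASE_URL` as composed from the other values. The sketch below shows what that composition might look like. The `postgresql+asyncpg` driver prefix, the `chrysopedia-db` host, the container port `5432`, and the helper name `build_database_url` are assumptions made for illustration; `backend/config.py` may build the URL differently.

```python
from urllib.parse import quote


def build_database_url(
    user: str = "chrysopedia",
    password: str = "changeme",
    database: str = "chrysopedia",
    host: str = "chrysopedia-db",  # assumed: the Compose service name
    port: int = 5432,  # assumed: the container-side port
    driver: str = "postgresql+asyncpg",  # assumed async driver
) -> str:
    """Compose a SQLAlchemy-style URL, percent-encoding the credentials."""
    return (
        f"{driver}://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{database}"
    )


print(build_database_url())
```

Percent-encoding the credentials matters once the password contains characters such as `@` or `/`, which would otherwise be read as URL delimiters.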
### Services
| Variable | Default | Description |
|-----------------|------------------------------------|--------------------------|
| `REDIS_URL` | `redis://chrysopedia-redis:6379/0` | Redis connection string |
### LLM Configuration
| Variable | Default | Description |
|---------------------|-------------------------------------------|------------------------------------|
| `LLM_API_URL` | `https://friend-openwebui.example.com/api`| Primary LLM endpoint (OpenAI-compatible) |
| `LLM_API_KEY` | `sk-changeme` | API key for primary LLM |
| `LLM_MODEL` | `qwen2.5-72b` | Primary model name |
| `LLM_FALLBACK_URL` | `http://localhost:11434/v1` | Fallback LLM endpoint (Ollama) |
| `LLM_FALLBACK_MODEL`| `qwen2.5:14b-q8_0` | Fallback model name |
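
This README does not describe when the fallback endpoint takes over. One plausible reading is "try the primary, then the fallback on any failure", sketched below. The `LLMEndpoint` class, the `complete_with_fallback` function, and the `send` transport callable are all hypothetical; only the URLs and model names come from the table above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LLMEndpoint:
    """One OpenAI-compatible endpoint and the model to request from it."""

    url: str
    model: str
    api_key: str = ""


# Values mirror the defaults in the table above.
PRIMARY = LLMEndpoint("https://friend-openwebui.example.com/api", "qwen2.5-72b", "sk-changeme")
FALLBACK = LLMEndpoint("http://localhost:11434/v1", "qwen2.5:14b-q8_0")


def complete_with_fallback(
    prompt: str,
    send: Callable[[LLMEndpoint, str], str],
    endpoints: Sequence[LLMEndpoint] = (PRIMARY, FALLBACK),
) -> str:
    """Try each endpoint in order and return the first successful completion.

    `send` is a hypothetical transport function that performs the request
    and raises an exception on failure.
    """
    errors: list[str] = []
    for endpoint in endpoints:
        try:
            return send(endpoint, prompt)
        except Exception as exc:  # broad on purpose: any failure moves to the next endpoint
            errors.append(f"{endpoint.url}: {exc}")
    raise RuntimeError("all LLM endpoints failed: " + "; ".join(errors))
```

Passing the transport in as `send` keeps the ordering logic separate from the HTTP client, so the same function works whichever client library the worker actually uses.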
### Embedding / Vector
| Variable | Default | Description |
|-----------------------|-------------------------------|--------------------------|
| `EMBEDDING_API_URL` | `http://localhost:11434/v1` | Embedding endpoint |
| `EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name |
| `QDRANT_URL` | `http://qdrant:6333` | Qdrant vector DB URL |
| `QDRANT_COLLECTION` | `chrysopedia` | Qdrant collection name |
### Application
| Variable | Default | Description |
|--------------------------|----------------------------------|--------------------------------|
| `APP_ENV` | `production` | Environment (`development` / `production`) |
| `APP_LOG_LEVEL` | `info` | Log level |
| `APP_SECRET_KEY` | `changeme-generate-a-real-secret`| Application secret key |
| `TRANSCRIPT_STORAGE_PATH`| `/data/transcripts` | Transcript JSON storage path |
| `VIDEO_METADATA_PATH` | `/data/video_meta` | Video metadata storage path |
| `REVIEW_MODE` | `true` | Enable human review workflow |
---
## Development Workflow
### Local development (without Docker)
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install backend dependencies
pip install -r backend/requirements.txt
# Start PostgreSQL and Redis (via Docker)
docker compose up -d chrysopedia-db chrysopedia-redis
# Run migrations
alembic upgrade head
# Start the API server with hot-reload
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000
```
### Database migrations
```bash
# Create a new migration after model changes
alembic revision --autogenerate -m "describe_change"
# Apply all pending migrations
alembic upgrade head
# Rollback one migration
alembic downgrade -1
```
### Project structure
```
content-to-kb-automator/
├── backend/ # FastAPI application
│ ├── main.py # App entry point, middleware, routers
│ ├── config.py # pydantic-settings configuration
│ ├── database.py # SQLAlchemy async engine + session
│ ├── models.py # 7-entity ORM models
│ ├── schemas.py # Pydantic request/response schemas
│ ├── routers/ # API route handlers
│ │ ├── health.py # /health (DB check)
│ │ ├── creators.py # /api/v1/creators
│ │ └── videos.py # /api/v1/videos
│ └── requirements.txt # Python dependencies
├── whisper/ # Desktop transcription script
│ ├── transcribe.py # Whisper CLI tool
│ ├── requirements.txt # Whisper + ffmpeg deps
│ └── README.md # Transcription documentation
├── docker/ # Dockerfiles
│ ├── Dockerfile.api # FastAPI + Celery image
│ ├── Dockerfile.web # React + nginx image
│ └── nginx.conf # nginx reverse proxy config
├── alembic/ # Database migrations
│ ├── env.py # Migration environment
│ └── versions/ # Migration scripts
├── config/ # Configuration files
│ └── canonical_tags.yaml # 6 topic categories + genre taxonomy
├── prompts/ # LLM prompt templates (editable)
├── frontend/ # React web UI (placeholder)
├── tests/ # Test fixtures and test suites
│ └── fixtures/ # Sample data for testing
├── docker-compose.yml # Full stack definition
├── alembic.ini # Alembic configuration
├── .env.example # Environment variable template
└── chrysopedia-spec.md # Full project specification
```
---
## API Endpoints
| Method | Path | Description |
|--------|-----------------------------|---------------------------------|
| GET | `/health` | Health check with DB connectivity |
| GET | `/api/v1/health` | Lightweight health (no DB) |
| GET | `/api/v1/creators` | List all creators |
| GET | `/api/v1/creators/{slug}` | Get creator by slug |
| GET | `/api/v1/videos` | List all source videos |
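
As a rough usage sketch, the creator endpoints can be called from Python with the standard library alone. The base URL assumes the default local API port, the helper names are illustrative, and the shape of the returned JSON is not documented here, so the functions simply return whatever the API sends back.

```python
import json
import urllib.request
from typing import Any, Callable
from urllib.parse import quote

API_BASE = "http://localhost:8000"  # assumes the default local API port


def fetch_json(url: str, timeout: float = 10.0) -> Any:
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_creators(base: str = API_BASE, fetch: Callable[[str], Any] = fetch_json) -> Any:
    """Call GET /api/v1/creators and return the decoded payload as-is."""
    return fetch(f"{base}/api/v1/creators")


def get_creator(slug: str, base: str = API_BASE, fetch: Callable[[str], Any] = fetch_json) -> Any:
    """Call GET /api/v1/creators/{slug}, percent-encoding the slug."""
    return fetch(f"{base}/api/v1/creators/{quote(slug, safe='')}")
```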
---
## XPLTD Conventions
This project follows XPLTD infrastructure conventions:
- **Docker project name:** `xpltd_chrysopedia`
- **Bind mounts:** persistent data stored under `/vmPool/r/services/`
- **Network:** dedicated bridge `chrysopedia` (`172.32.0.0/24`)
- **PostgreSQL host port:** `5433` (avoids conflict with system PostgreSQL on `5432`)
---
## Deployment (ub01)
The production stack runs on **ub01.a.xpltd.co**:
```bash
# Clone (first time only — requires SSH agent forwarding)
ssh -A ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git clone git@github.com:xpltdco/chrysopedia.git .
# Create .env from template
cp .env.example .env
# Edit .env with production secrets
# Build and start
docker compose build
docker compose up -d
# Run migrations
docker exec chrysopedia-api alembic upgrade head
# Pull embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text
```
### Service URLs
| Service | URL |
|---------|-----|
| Web UI | http://ub01:8096 |
| API Health | http://ub01:8096/health |
| PostgreSQL | ub01:5433 |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |
### Update Workflow
```bash
ssh -A ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull
docker compose build && docker compose up -d
```