diff --git a/README.md b/README.md index f0b4568..58d6e84 100644 --- a/README.md +++ b/README.md @@ -3,320 +3,318 @@ > From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*. > Chrysopedia transmutes raw video content into refined, searchable production knowledge. -A self-hosted knowledge extraction and retrieval system for electronic music production content. Transcribes video libraries with Whisper, extracts key moments and techniques with LLM analysis, and serves a search-first web UI for mid-session retrieval. +A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval. + +--- + +## Information Flow + +Content moves through six stages from raw video to searchable knowledge: + +``` + ┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 1 · Transcription [Desktop / GPU] │ + │ │ + │ Video files → Whisper large-v3 (CUDA) → JSON transcripts │ + │ Output: timestamped segments with speaker text │ + └────────────────────────────────┬────────────────────────────────────────┘ + │ JSON files (manual or folder watcher) + ▼ + ┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 2 · Ingestion [API + Watcher] │ + │ │ + │ POST /api/v1/ingest ← watcher auto-submits from /watch folder │ + │ • Validate JSON structure │ + │ • Compute content hash (SHA-256) for deduplication │ + │ • Find-or-create Creator from folder name │ + │ • Upsert SourceVideo (exact filename → content hash → fuzzy match) │ + │ • Bulk-insert TranscriptSegment rows │ + │ • Dispatch pipeline to Celery worker │ + └────────────────────────────────┬────────────────────────────────────────┘ + │ Celery task: run_pipeline(video_id) + ▼ + 
┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 3 · LLM Extraction Pipeline [Celery Worker] │ + │ │ + │ Four sequential LLM stages, each with its own prompt template: │ + │ │ + │ 3a. Segmentation — Split transcript into semantic topic boundaries │ + │ Model: chat (fast) Prompt: stage2_segmentation.txt │ + │ │ + │ 3b. Extraction — Identify key moments (title, summary, timestamps) │ + │ Model: reasoning (think) Prompt: stage3_extraction.txt │ + │ │ + │ 3c. Classification — Assign content types + extract plugin names │ + │ Model: chat (fast) Prompt: stage4_classification.txt │ + │ │ + │ 3d. Synthesis — Compose technique pages from approved moments │ + │ Model: reasoning (think) Prompt: stage5_synthesis.txt │ + │ │ + │ Each stage emits PipelineEvent rows (tokens, duration, model, errors) │ + └────────────────────────────────┬────────────────────────────────────────┘ + │ KeyMoment rows (review_status: pending) + ▼ + ┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 4 · Review & Curation [Admin UI] │ + │ │ + │ Admin reviews extracted KeyMoments before they become technique pages: │ + │ • Approve — moment proceeds to synthesis │ + │ • Edit — correct title, summary, content type, plugins, then approve │ + │ • Reject — moment is excluded from knowledge base │ + │ (When REVIEW_MODE=false, moments auto-approve and skip this stage) │ + └────────────────────────────────┬────────────────────────────────────────┘ + │ Approved moments → Stage 3d synthesis + ▼ + ┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 5 · Knowledge Base [Web UI] │ + │ │ + │ TechniquePages — the primary output: │ + │ • Structured body sections, signal chains, plugin lists │ + │ • Linked to source KeyMoments with video timestamps │ + │ • Cross-referenced via RelatedTechniqueLinks │ + │ • Versioned (snapshots before each re-synthesis) │ + │ • Organized by topic taxonomy (6 categories from 
canonical_tags.yaml) │ + └────────────────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌─────────────────────────────────────────────────────────────────────────┐ + │ STAGE 6 · Search & Retrieval [Web UI] │ + │ │ + │ • Semantic search: query → embedding → Qdrant vector similarity │ + │ • Keyword fallback: ILIKE search on title/summary (300ms timeout) │ + │ • Browse by topic hierarchy, creator, or content type │ + │ • Typeahead search from home page (debounced, top 5 results) │ + └─────────────────────────────────────────────────────────────────────────┘ +``` --- ## Architecture ``` -┌──────────────────────────────────────────────────────────────────┐ -│ Desktop (GPU workstation) │ -│ ┌──────────────┐ │ -│ │ whisper/ │ Transcribes video → JSON (Whisper large-v3) │ -│ │ transcribe.py │ Runs locally with CUDA, outputs to /data │ -│ └──────┬───────┘ │ -│ │ JSON transcripts │ -└─────────┼────────────────────────────────────────────────────────┘ - │ - ▼ -┌──────────────────────────────────────────────────────────────────┐ -│ Docker Compose (xpltd_chrysopedia) — Server (e.g. 
ub01) │ -│ │ -│ ┌────────────────┐ ┌────────────────┐ ┌──────────────────┐ │ -│ │ chrysopedia-db │ │chrysopedia-redis│ │ chrysopedia-api │ │ -│ │ PostgreSQL 16 │ │ Redis 7 │ │ FastAPI + Uvicorn│ │ -│ │ :5433→5432 │ │ │ │ :8000 │ │ -│ └────────────────┘ └────────────────┘ └────────┬─────────┘ │ -│ │ │ -│ ┌──────────────────┐ ┌──────────────────────┐ │ │ -│ │ chrysopedia-web │ │ chrysopedia-worker │ │ │ -│ │ React + nginx │ │ Celery (LLM pipeline)│ │ │ -│ │ :3000→80 │ │ │ │ │ -│ └──────────────────┘ └──────────────────────┘ │ │ -│ │ │ -│ Network: chrysopedia (172.24.0.0/24) │ │ -└──────────────────────────────────────────────────────────────────┘ +┌──────────────────────────────────────────────────────────────────────────┐ +│ Desktop (GPU workstation — hal0022) │ +│ whisper/transcribe.py → JSON transcripts → copy to /watch folder │ +└────────────────────────────┬─────────────────────────────────────────────┘ + │ + ▼ +┌──────────────────────────────────────────────────────────────────────────┐ +│ Docker Compose: xpltd_chrysopedia (ub01) │ +│ Network: chrysopedia (172.32.0.0/24) │ +│ │ +│ ┌────────────┐ ┌─────────────┐ ┌───────────────┐ ┌──────────────┐ │ +│ │ PostgreSQL │ │ Redis │ │ Qdrant │ │ Ollama │ │ +│ │ :5433 │ │ broker + │ │ vector DB │ │ embeddings │ │ +│ │ 10 entities │ │ cache │ │ semantic │ │ nomic-embed │ │ +│ └─────┬───────┘ └──────┬──────┘ └───────┬───────┘ └──────┬───────┘ │ +│ │ │ │ │ │ +│ ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐ │ +│ │ FastAPI (API) │ │ +│ │ Ingest · Pipeline control · Review · Search · CRUD · Reports │ │ +│ └──────────────────────────────┬────────────────────────────────────┘ │ +│ │ │ +│ ┌──────────────┐ ┌────────────┴───┐ ┌──────────────────────────┐ │ +│ │ Watcher │ │ Celery Worker │ │ Web UI (React) │ │ +│ │ /watch → │ │ LLM pipeline │ │ nginx → :8096 │ │ +│ │ auto-ingest │ │ stages 2-5 │ │ search-first interface │ │ +│ └──────────────┘ └────────────────┘ └──────────────────────────┘ │ 
+└──────────────────────────────────────────────────────────────────────────┘ ``` ### Services -| Service | Image / Build | Port | Purpose | -|----------------------|------------------------|---------------|--------------------------------------------| -| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store (7 entity schema) | -| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker / cache | -| `chrysopedia-api` | `docker/Dockerfile.api`| `8000` | FastAPI REST API | -| `chrysopedia-worker` | `docker/Dockerfile.api`| — | Celery worker for LLM pipeline stages 2-5 | -| `chrysopedia-web` | `docker/Dockerfile.web`| `3000 → 80` | React frontend (nginx) | +| Service | Image | Port | Purpose | +|---------|-------|------|---------| +| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store | +| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker + feature flag cache | +| `chrysopedia-qdrant` | `qdrant/qdrant:v1.13.2` | — | Vector DB for semantic search | +| `chrysopedia-ollama` | `ollama/ollama` | — | Embedding model server (nomic-embed-text) | +| `chrysopedia-api` | `Dockerfile.api` | `8000` | FastAPI REST API | +| `chrysopedia-worker` | `Dockerfile.api` | — | Celery worker (LLM pipeline) | +| `chrysopedia-watcher` | `Dockerfile.api` | — | Folder monitor → auto-ingest | +| `chrysopedia-web` | `Dockerfile.web` | `8096 → 80` | React frontend (nginx) | -### Data Model (7 entities) +### Data Model -- **Creator** — artists/producers whose content is indexed -- **SourceVideo** — original video files processed by the pipeline -- **TranscriptSegment** — timestamped text segments from Whisper -- **KeyMoment** — discrete insights extracted by LLM analysis -- **TechniquePage** — synthesized knowledge pages (primary output) -- **RelatedTechniqueLink** — cross-references between technique pages -- **Tag** — hierarchical topic/genre taxonomy - ---- - -## Prerequisites - -- **Docker** ≥ 24.0 and **Docker Compose** ≥ 2.20 -- 
**Python 3.10+** (for the Whisper transcription script) -- **ffmpeg** (for audio extraction) -- **NVIDIA GPU + CUDA** (recommended for Whisper; CPU fallback available) +| Entity | Purpose | +|--------|---------| +| **Creator** | Artists/producers whose content is indexed | +| **SourceVideo** | Video files processed by the pipeline (with content hash dedup) | +| **TranscriptSegment** | Timestamped text segments from Whisper | +| **KeyMoment** | Discrete insights extracted by LLM analysis | +| **TechniquePage** | Synthesized knowledge pages — the primary output | +| **TechniquePageVersion** | Snapshots before re-synthesis overwrites | +| **RelatedTechniqueLink** | Cross-references between technique pages | +| **Tag** | Hierarchical topic taxonomy | +| **ContentReport** | User-submitted content issues | +| **PipelineEvent** | Structured pipeline execution logs (tokens, timing, errors) | --- ## Quick Start -### 1. Clone and configure +### Prerequisites + +- Docker ≥ 24.0 and Docker Compose ≥ 2.20 +- Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription) + +### Setup ```bash -git clone -cd content-to-kb-automator +# Clone and configure +git clone git@github.com:xpltdco/chrysopedia.git +cd chrysopedia +cp .env.example .env # edit with real values -# Create environment file from template -cp .env.example .env -# Edit .env with your actual values (see Environment Variables below) -``` - -### 2. Start the Docker Compose stack - -```bash +# Start the stack docker compose up -d + +# Run database migrations +docker exec chrysopedia-api alembic upgrade head + +# Pull the embedding model (first time only) +docker exec chrysopedia-ollama ollama pull nomic-embed-text + +# Verify +curl http://localhost:8096/health ``` -This starts PostgreSQL, Redis, the API server, the Celery worker, and the web UI. - -### 3. 
Run database migrations +### Transcribe videos ```bash -# From inside the API container: -docker compose exec chrysopedia-api alembic upgrade head - -# Or locally (requires Python venv with backend deps): -alembic upgrade head -``` - -### 4. Verify the stack - -```bash -# Health check (with DB connectivity) -curl http://localhost:8000/health - -# API health (lightweight, no DB) -curl http://localhost:8000/api/v1/health - -# Docker Compose status -docker compose ps -``` - -### 5. Transcribe videos (desktop) - -```bash -cd whisper -pip install -r requirements.txt +cd whisper && pip install -r requirements.txt # Single file python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts -# Batch (all videos in a directory) +# Batch python transcribe.py --input ./videos/ --output-dir ./transcripts ``` -See [`whisper/README.md`](whisper/README.md) for full transcription documentation. +See [`whisper/README.md`](whisper/README.md) for full transcription docs. --- ## Environment Variables -Create `.env` from `.env.example`. All variables have sensible defaults for local development. +Copy `.env.example` to `.env`. 
Key groups: -### Database - -| Variable | Default | Description | -|--------------------|----------------|---------------------------------| -| `POSTGRES_USER` | `chrysopedia` | PostgreSQL username | -| `POSTGRES_PASSWORD`| `changeme` | PostgreSQL password | -| `POSTGRES_DB` | `chrysopedia` | Database name | -| `DATABASE_URL` | *(composed)* | Full async connection string | - -### Services - -| Variable | Default | Description | -|-----------------|------------------------------------|--------------------------| -| `REDIS_URL` | `redis://chrysopedia-redis:6379/0` | Redis connection string | - -### LLM Configuration - -| Variable | Default | Description | -|---------------------|-------------------------------------------|------------------------------------| -| `LLM_API_URL` | `https://friend-openwebui.example.com/api`| Primary LLM endpoint (OpenAI-compatible) | -| `LLM_API_KEY` | `sk-changeme` | API key for primary LLM | -| `LLM_MODEL` | `qwen2.5-72b` | Primary model name | -| `LLM_FALLBACK_URL` | `http://localhost:11434/v1` | Fallback LLM endpoint (Ollama) | -| `LLM_FALLBACK_MODEL`| `qwen2.5:14b-q8_0` | Fallback model name | - -### Embedding / Vector - -| Variable | Default | Description | -|-----------------------|-------------------------------|--------------------------| -| `EMBEDDING_API_URL` | `http://localhost:11434/v1` | Embedding endpoint | -| `EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name | -| `QDRANT_URL` | `http://qdrant:6333` | Qdrant vector DB URL | -| `QDRANT_COLLECTION` | `chrysopedia` | Qdrant collection name | - -### Application - -| Variable | Default | Description | -|--------------------------|----------------------------------|--------------------------------| -| `APP_ENV` | `production` | Environment (`development` / `production`) | -| `APP_LOG_LEVEL` | `info` | Log level | -| `APP_SECRET_KEY` | `changeme-generate-a-real-secret`| Application secret key | -| `TRANSCRIPT_STORAGE_PATH`| `/data/transcripts` | Transcript JSON 
storage path | -| `VIDEO_METADATA_PATH` | `/data/video_meta` | Video metadata storage path | -| `REVIEW_MODE` | `true` | Enable human review workflow | - ---- - -## Development Workflow - -### Local development (without Docker) - -```bash -# Create virtual environment -python -m venv .venv -source .venv/bin/activate - -# Install backend dependencies -pip install -r backend/requirements.txt - -# Start PostgreSQL and Redis (via Docker) -docker compose up -d chrysopedia-db chrysopedia-redis - -# Run migrations -alembic upgrade head - -# Start the API server with hot-reload -cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000 -``` - -### Database migrations - -```bash -# Create a new migration after model changes -alembic revision --autogenerate -m "describe_change" - -# Apply all pending migrations -alembic upgrade head - -# Rollback one migration -alembic downgrade -1 -``` - -### Project structure - -``` -content-to-kb-automator/ -├── backend/ # FastAPI application -│ ├── main.py # App entry point, middleware, routers -│ ├── config.py # pydantic-settings configuration -│ ├── database.py # SQLAlchemy async engine + session -│ ├── models.py # 7-entity ORM models -│ ├── schemas.py # Pydantic request/response schemas -│ ├── routers/ # API route handlers -│ │ ├── health.py # /health (DB check) -│ │ ├── creators.py # /api/v1/creators -│ │ └── videos.py # /api/v1/videos -│ └── requirements.txt # Python dependencies -├── whisper/ # Desktop transcription script -│ ├── transcribe.py # Whisper CLI tool -│ ├── requirements.txt # Whisper + ffmpeg deps -│ └── README.md # Transcription documentation -├── docker/ # Dockerfiles -│ ├── Dockerfile.api # FastAPI + Celery image -│ ├── Dockerfile.web # React + nginx image -│ └── nginx.conf # nginx reverse proxy config -├── alembic/ # Database migrations -│ ├── env.py # Migration environment -│ └── versions/ # Migration scripts -├── config/ # Configuration files -│ └── canonical_tags.yaml # 6 topic categories + genre 
taxonomy -├── prompts/ # LLM prompt templates (editable) -├── frontend/ # React web UI (placeholder) -├── tests/ # Test fixtures and test suites -│ └── fixtures/ # Sample data for testing -├── docker-compose.yml # Full stack definition -├── alembic.ini # Alembic configuration -├── .env.example # Environment variable template -└── chrysopedia-spec.md # Full project specification -``` +| Group | Variables | Notes | +|-------|-----------|-------| +| **Database** | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` | Default user: `chrysopedia` | +| **LLM** | `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL` | OpenAI-compatible endpoint | +| **LLM Fallback** | `LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL` | Automatic failover | +| **Per-Stage Models** | `LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY` | `chat` for fast stages, `thinking` for reasoning | +| **Embedding** | `EMBEDDING_API_URL`, `EMBEDDING_MODEL` | Ollama nomic-embed-text | +| **Vector DB** | `QDRANT_URL`, `QDRANT_COLLECTION` | Container-internal | +| **Features** | `REVIEW_MODE`, `DEBUG_MODE` | Review gate + LLM I/O capture | +| **Storage** | `TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH` | Container bind mounts | --- ## API Endpoints -| Method | Path | Description | -|--------|-----------------------------|---------------------------------| -| GET | `/health` | Health check with DB connectivity | -| GET | `/api/v1/health` | Lightweight health (no DB) | -| GET | `/api/v1/creators` | List all creators | -| GET | `/api/v1/creators/{slug}` | Get creator by slug | -| GET | `/api/v1/videos` | List all source videos | +### Public + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/health` | Health check (DB connectivity) | +| GET | `/api/v1/search?q=&scope=&limit=` | Semantic + keyword search | +| GET | `/api/v1/techniques` | List technique pages | +| GET | `/api/v1/techniques/{slug}` | Technique detail + key moments | +| GET | `/api/v1/techniques/{slug}/versions` | Version history | +| GET | 
`/api/v1/creators` | List creators (sort, genre filter) | +| GET | `/api/v1/creators/{slug}` | Creator detail | +| GET | `/api/v1/topics` | Topic hierarchy with counts | +| GET | `/api/v1/videos` | List source videos | +| POST | `/api/v1/reports` | Submit content report | + +### Admin + +| Method | Path | Description | +|--------|------|-------------| +| GET | `/api/v1/review/queue` | Review queue (status filter) | +| POST | `/api/v1/review/moments/{id}/approve` | Approve key moment | +| POST | `/api/v1/review/moments/{id}/reject` | Reject key moment | +| PUT | `/api/v1/review/moments/{id}` | Edit key moment | +| POST | `/api/v1/admin/pipeline/trigger/{video_id}` | Trigger/retrigger pipeline | +| GET | `/api/v1/admin/pipeline/events/{video_id}` | Pipeline event log | +| GET | `/api/v1/admin/pipeline/token-summary/{video_id}` | Token usage by stage | +| GET | `/api/v1/admin/pipeline/worker-status` | Celery worker status | +| PUT | `/api/v1/admin/pipeline/debug-mode` | Toggle debug mode | + +### Ingest + +| Method | Path | Description | +|--------|------|-------------| +| POST | `/api/v1/ingest` | Upload Whisper JSON transcript | --- -## XPLTD Conventions +## Development -This project follows XPLTD infrastructure conventions: +```bash +# Local backend (with Docker services) +python -m venv .venv && source .venv/bin/activate +pip install -r backend/requirements.txt +docker compose up -d chrysopedia-db chrysopedia-redis +alembic upgrade head +cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000 -- **Docker project name:** `xpltd_chrysopedia` -- **Bind mounts:** persistent data stored under `/vmPool/r/services/` -- **Network:** dedicated bridge `chrysopedia` (`172.32.0.0/24`) -- **PostgreSQL host port:** `5433` (avoids conflict with system PostgreSQL on `5432`) +# Database migrations +alembic revision --autogenerate -m "describe_change" +alembic upgrade head +``` + +### Project Structure + +``` +chrysopedia/ +├── backend/ # FastAPI application +│ ├── 
main.py # Entry point, middleware, router mounting +│ ├── config.py # Pydantic Settings (all env vars) +│ ├── models.py # SQLAlchemy ORM models +│ ├── schemas.py # Pydantic request/response schemas +│ ├── worker.py # Celery app configuration +│ ├── watcher.py # Transcript folder watcher service +│ ├── search_service.py # Semantic search + keyword fallback +│ ├── routers/ # API endpoint handlers +│ ├── pipeline/ # LLM pipeline stages + clients +│ │ ├── stages.py # Stages 2-5 (Celery tasks) +│ │ ├── llm_client.py # OpenAI-compatible LLM client +│ │ ├── embedding_client.py +│ │ └── qdrant_client.py +│ └── tests/ +├── frontend/ # React + TypeScript + Vite +│ └── src/ +│ ├── pages/ # Home, Search, Technique, Creator, Topic, Admin +│ ├── components/ # Shared UI components +│ └── api/ # Typed API clients +├── whisper/ # Desktop transcription (Whisper large-v3) +├── docker/ # Dockerfiles + nginx config +├── alembic/ # Database migrations +├── config/ # canonical_tags.yaml (topic taxonomy) +├── prompts/ # LLM prompt templates (editable at runtime) +├── docker-compose.yml +└── .env.example +``` --- ## Deployment (ub01) -The production stack runs on **ub01.a.xpltd.co**: - ```bash -# Clone (first time only — requires SSH agent forwarding) -ssh -A ub01 +ssh ub01 cd /vmPool/r/repos/xpltdco/chrysopedia -git clone git@github.com:xpltdco/chrysopedia.git . 
- - -# Create .env from template -cp .env.example .env -# Edit .env with production secrets - -# Build and start -docker compose build -docker compose up -d - -# Run migrations -docker exec chrysopedia-api alembic upgrade head - -# Pull embedding model (first time only) -docker exec chrysopedia-ollama ollama pull nomic-embed-text +git pull && docker compose build && docker compose up -d ``` -### Service URLs -| Service | URL | -|---------|-----| -| Web UI | http://ub01:8096 | -| API Health | http://ub01:8096/health | -| PostgreSQL | ub01:5433 | +| Resource | Location | +|----------|----------| +| Web UI | `http://ub01:8096` | +| API health | `http://ub01:8096/health` | +| PostgreSQL | `ub01:5433` | | Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` | +| Persistent data | `/vmPool/r/services/chrysopedia_*` | -### Update Workflow -```bash -ssh -A ub01 -cd /vmPool/r/repos/xpltdco/chrysopedia -git pull -docker compose build && docker compose up -d -``` +XPLTD conventions: `xpltd_chrysopedia` project name, dedicated bridge network (`172.32.0.0/24`), bind mounts under `/vmPool/r/services/`, PostgreSQL on port `5433`.
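
After `docker compose up -d` the containers may need a few seconds before the health endpoint responds, so a post-deploy check is easier to script with a short retry loop than with a single `curl`. The following is a minimal sketch, not part of the repository: the function names, the retry counts, and the assumption that a healthy stack answers `/health` with HTTP 200 are all illustrative.

```python
"""Post-deploy smoke check (illustrative sketch, not part of the repo)."""
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable

# Health URL taken from the deployment table; adjust for your host.
HEALTH_URL = "http://ub01:8096/health"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET on `url`, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, DNS failure, timeout, malformed response.
        return 0


def wait_for_health(
    url: str,
    attempts: int = 10,
    delay: float = 3.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `url` until it returns 200 (assumed to mean healthy).

    `fetch` and `sleep` are injectable so the loop can be tested offline.
    """
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)  # no sleep after the final failed attempt
    return False
```

Calling `wait_for_health(HEALTH_URL)` from a deploy script and exiting non-zero when it returns `False` lets a failed rollout stop before anyone relies on the web UI. Keeping the HTTP call behind the `fetch` parameter is a design choice that makes the retry logic testable without a running stack.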