jlightner 4b0914b12b fix: restore complete project tree from ub01 canonical state

Auto-mode commit 7aa33cd accidentally deleted 78 files (14,814 lines) during M005
execution. Subsequent commits rebuilt some frontend files but backend/, alembic/,
tests/, whisper/, docker configs, and prompts were never restored in this repo.

This commit restores the full project tree by syncing from ub01's working directory,
which has all M001-M007 features running in production containers.

Restored: backend/ (config, models, routers, database, redis, search_service, worker),
alembic/ (6 migrations), docker/ (Dockerfiles, nginx, compose), prompts/ (4 stages),
tests/, whisper/, README.md, .env.example, chrysopedia-spec.md

2026-03-31 02:10:41 +00:00


Chrysopedia

From chrysopoeia (alchemical transmutation of base material into gold) + encyclopedia. Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.

Information Flow

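Content moves through six stages from raw video to searchable knowledge. Sub-stages 3a-3d below are the pipeline's internal stages 2-5, which is why the prompt files, the LLM_STAGE{2-5}_* variables, and pipeline/stages.py use that numbering: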

 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 1 · Transcription                            [Desktop / GPU]    │
 │                                                                         │
 │  Video files → Whisper large-v3 (CUDA) → JSON transcripts              │
 │  Output: timestamped segments with speaker text                         │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ JSON files (manual or folder watcher)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 2 · Ingestion                                [API + Watcher]    │
 │                                                                         │
 │  POST /api/v1/ingest ← watcher auto-submits from /watch folder         │
 │  • Validate JSON structure                                              │
 │  • Compute content hash (SHA-256) for deduplication                     │
 │  • Find-or-create Creator from folder name                              │
 │  • Upsert SourceVideo (exact filename → content hash → fuzzy match)     │
 │  • Bulk-insert TranscriptSegment rows                                   │
 │  • Dispatch pipeline to Celery worker                                   │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Celery task: run_pipeline(video_id)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 3 · LLM Extraction Pipeline                  [Celery Worker]    │
 │                                                                         │
 │  Four sequential LLM stages, each with its own prompt template:         │
 │                                                                         │
 │  3a. Segmentation — Split transcript into semantic topic boundaries     │
 │      Model: chat (fast)         Prompt: stage2_segmentation.txt         │
 │                                                                         │
 │  3b. Extraction — Identify key moments (title, summary, timestamps)     │
 │      Model: reasoning (think)   Prompt: stage3_extraction.txt           │
 │                                                                         │
 │  3c. Classification — Assign content types + extract plugin names       │
 │      Model: chat (fast)         Prompt: stage4_classification.txt       │
 │                                                                         │
 │  3d. Synthesis — Compose technique pages from approved moments          │
 │      Model: reasoning (think)   Prompt: stage5_synthesis.txt            │
 │                                                                         │
 │  Each stage emits PipelineEvent rows (tokens, duration, model, errors)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ KeyMoment rows (review_status: pending)
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 4 · Review & Curation                        [Admin UI]         │
 │                                                                         │
 │  Admin reviews extracted KeyMoments before they become technique pages: │
 │  • Approve — moment proceeds to synthesis                               │
 │  • Edit — correct title, summary, content type, plugins, then approve   │
 │  • Reject — moment is excluded from knowledge base                      │
 │  (When REVIEW_MODE=false, moments auto-approve and skip this stage)     │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │ Approved moments → Stage 3d synthesis
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 5 · Knowledge Base                           [Web UI]           │
 │                                                                         │
 │  TechniquePages — the primary output:                                   │
 │  • Structured body sections, signal chains, plugin lists                │
 │  • Linked to source KeyMoments with video timestamps                    │
 │  • Cross-referenced via RelatedTechniqueLinks                           │
 │  • Versioned (snapshots before each re-synthesis)                       │
 │  • Organized by topic taxonomy (6 categories from canonical_tags.yaml)  │
 └────────────────────────────────┬────────────────────────────────────────┘
                                  │
                                  ▼
 ┌─────────────────────────────────────────────────────────────────────────┐
 │  STAGE 6 · Search & Retrieval                       [Web UI]           │
 │                                                                         │
 │  • Semantic search: query → embedding → Qdrant vector similarity        │
 │  • Keyword fallback: ILIKE search on title/summary (300ms timeout)      │
 │  • Browse by topic hierarchy, creator, or content type                  │
 │  • Typeahead search from home page (debounced, top 5 results)           │
 └─────────────────────────────────────────────────────────────────────────┘
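
The deduplication steps in Stage 2 can be illustrated with a minimal sketch. It is not the project's implementation: `VideoRecord`, `find_existing`, the use of `difflib` for the fuzzy step, and the 0.9 threshold are all assumptions made for illustration. Only the SHA-256 content hash and the match order (exact filename, then content hash, then fuzzy match) come from the diagram above.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class VideoRecord:
    """Hypothetical stand-in for a SourceVideo row."""
    filename: str
    content_hash: str


def content_hash(raw: bytes) -> str:
    """Return the SHA-256 hex digest of the raw transcript bytes."""
    return hashlib.sha256(raw).hexdigest()


def _similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two filenames (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_existing(
    filename: str,
    digest: str,
    known: list[VideoRecord],
    fuzzy_threshold: float = 0.9,  # assumed value, not taken from the project
) -> Optional[VideoRecord]:
    """Look up an existing record: exact filename, then hash, then fuzzy name."""
    # 1. Exact filename match.
    for rec in known:
        if rec.filename == filename:
            return rec
    # 2. Same content uploaded under a different name.
    for rec in known:
        if rec.content_hash == digest:
            return rec
    # 3. Closest filename, accepted only above the threshold.
    best = max(known, key=lambda r: _similarity(r.filename, filename), default=None)
    if best is not None and _similarity(best.filename, filename) >= fuzzy_threshold:
        return best
    return None
```

If `find_existing` returns `None`, the caller would create a new record; otherwise it would update the matched one, which is what makes the operation an upsert.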

Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│  Desktop (GPU workstation — hal0022)                                     │
│  whisper/transcribe.py → JSON transcripts → copy to /watch folder        │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│  Docker Compose: xpltd_chrysopedia (ub01)                                │
│  Network: chrysopedia (172.32.0.0/24)                                    │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐  │
│  │  PostgreSQL │  │    Redis    │  │    Qdrant     │  │    Ollama    │  │
│  │  :5433      │  │  broker +   │  │  vector DB    │  │  embeddings  │  │
│  │  10 entities│  │  cache      │  │  semantic     │  │  nomic-embed │  │
│  └─────┬───────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘  │
│        │                 │                 │                 │           │
│  ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐  │
│  │                         FastAPI (API)                              │  │
│  │  Ingest · Pipeline control · Review · Search · CRUD · Reports     │  │
│  └──────────────────────────────┬────────────────────────────────────┘  │
│                                 │                                       │
│  ┌──────────────┐  ┌────────────┴───┐  ┌──────────────────────────┐    │
│  │   Watcher    │  │  Celery Worker │  │     Web UI (React)       │    │
│  │  /watch →    │  │  LLM pipeline  │  │  nginx → :8096           │    │
│  │  auto-ingest │  │  stages 2-5    │  │  search-first interface  │    │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────┘

Services

Service	Image	Port	Purpose
`chrysopedia-db`	`postgres:16-alpine`	`5433 → 5432`	Primary data store
`chrysopedia-redis`	`redis:7-alpine`	—	Celery broker + feature flag cache
`chrysopedia-qdrant`	`qdrant/qdrant:v1.13.2`	—	Vector DB for semantic search
`chrysopedia-ollama`	`ollama/ollama`	—	Embedding model server (nomic-embed-text)
`chrysopedia-api`	`Dockerfile.api`	`8000`	FastAPI REST API
`chrysopedia-worker`	`Dockerfile.api`	—	Celery worker (LLM pipeline)
`chrysopedia-watcher`	`Dockerfile.api`	—	Folder monitor → auto-ingest
`chrysopedia-web`	`Dockerfile.web`	`8096 → 80`	React frontend (nginx)

Data Model

Entity	Purpose
Creator	Artists/producers whose content is indexed
SourceVideo	Video files processed by the pipeline (with content hash dedup)
TranscriptSegment	Timestamped text segments from Whisper
KeyMoment	Discrete insights extracted by LLM analysis
TechniquePage	Synthesized knowledge pages — the primary output
TechniquePageVersion	Snapshots before re-synthesis overwrites
RelatedTechniqueLink	Cross-references between technique pages
Tag	Hierarchical topic taxonomy
ContentReport	User-submitted content issues
PipelineEvent	Structured pipeline execution logs (tokens, timing, errors)

Quick Start

Prerequisites

Docker ≥ 24.0 and Docker Compose ≥ 2.20
Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)

Setup

# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env    # edit with real values

# Start the stack
docker compose up -d

# Run database migrations
docker exec chrysopedia-api alembic upgrade head

# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text

# Verify
curl http://localhost:8096/health

Transcribe videos

cd whisper && pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts

See whisper/README.md for full transcription docs.

Environment Variables

Copy .env.example to .env. Key groups:

Group	Variables	Notes
Database	`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`	Default user: `chrysopedia`
LLM	`LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL`	OpenAI-compatible endpoint
LLM Fallback	`LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL`	Automatic failover
Per-Stage Models	`LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY`	`chat` for fast stages, `thinking` for reasoning
Embedding	`EMBEDDING_API_URL`, `EMBEDDING_MODEL`	Ollama nomic-embed-text
Vector DB	`QDRANT_URL`, `QDRANT_COLLECTION`	Container-internal
Features	`REVIEW_MODE`, `DEBUG_MODE`	Review gate + LLM I/O capture
Storage	`TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH`	Container bind mounts
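
As an illustration of how the per-stage variables might be resolved, here is a small standard-library sketch. The variable names come from the table above. The `StageLLMConfig` type, the fallback to `LLM_MODEL` when a stage has no override, and the default modality of `chat` are assumptions; the project's own config.py (Pydantic Settings) may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class StageLLMConfig:
    """Hypothetical container for one pipeline stage's LLM settings."""
    stage: int
    model: str
    modality: str


def stage_config(stage: int, env: Mapping[str, str] = os.environ) -> StageLLMConfig:
    """Resolve the model and modality for pipeline stage 2, 3, 4, or 5."""
    if stage not in (2, 3, 4, 5):
        raise ValueError(f"unknown pipeline stage: {stage}")
    # Assumption: a missing or empty per-stage model falls back to LLM_MODEL.
    model = env.get(f"LLM_STAGE{stage}_MODEL") or env.get("LLM_MODEL", "")
    # Assumption: stages default to the fast "chat" modality.
    modality = env.get(f"LLM_STAGE{stage}_MODALITY", "chat")
    return StageLLMConfig(stage=stage, model=model, modality=modality)
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.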

API Endpoints

Public

Method	Path	Description
GET	`/health`	Health check (DB connectivity)
GET	`/api/v1/search?q=&scope=&limit=`	Semantic + keyword search
GET	`/api/v1/techniques`	List technique pages
GET	`/api/v1/techniques/{slug}`	Technique detail + key moments
GET	`/api/v1/techniques/{slug}/versions`	Version history
GET	`/api/v1/creators`	List creators (sort, genre filter)
GET	`/api/v1/creators/{slug}`	Creator detail
GET	`/api/v1/topics`	Topic hierarchy with counts
GET	`/api/v1/videos`	List source videos
POST	`/api/v1/reports`	Submit content report
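
The search endpoint combines semantic search with a keyword fallback. The sketch below shows one way such a fallback could be structured. It assumes the 300ms timeout bounds the semantic path and that an empty semantic result also triggers the fallback; neither detail is confirmed here. `search_with_fallback` and the two injected callables are hypothetical names, not functions from search_service.py.

```python
import asyncio
from typing import Awaitable, Callable

# A search backend takes (query, limit) and returns a list of result titles.
SearchFn = Callable[[str, int], Awaitable[list[str]]]


async def search_with_fallback(
    query: str,
    limit: int,
    semantic: SearchFn,
    keyword: SearchFn,
    timeout_s: float = 0.3,  # assumed to correspond to the 300ms timeout
) -> tuple[str, list[str]]:
    """Try semantic search first; use keyword search on timeout, error, or no hits."""
    try:
        hits = await asyncio.wait_for(semantic(query, limit), timeout=timeout_s)
        if hits:
            return "semantic", hits
    except (asyncio.TimeoutError, ConnectionError):
        # The embedding or vector service was slow or unreachable.
        pass
    return "keyword", await keyword(query, limit)
```

Returning the name of the path that produced the results makes it straightforward to log how often the fallback fires.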

Admin

Method	Path	Description
GET	`/api/v1/review/queue`	Review queue (status filter)
POST	`/api/v1/review/moments/{id}/approve`	Approve key moment
POST	`/api/v1/review/moments/{id}/reject`	Reject key moment
PUT	`/api/v1/review/moments/{id}`	Edit key moment
POST	`/api/v1/admin/pipeline/trigger/{video_id}`	Trigger/retrigger pipeline
GET	`/api/v1/admin/pipeline/events/{video_id}`	Pipeline event log
GET	`/api/v1/admin/pipeline/token-summary/{video_id}`	Token usage by stage
GET	`/api/v1/admin/pipeline/worker-status`	Celery worker status
PUT	`/api/v1/admin/pipeline/debug-mode`	Toggle debug mode
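
The review endpoints above imply a small state model for key moments. The following sketch is an assumption-laden illustration: only the `pending` status and the REVIEW_MODE auto-approve behavior appear in this README, while the `approved` and `rejected` values and all function names are hypothetical.

```python
from enum import Enum


class ReviewStatus(str, Enum):
    PENDING = "pending"    # documented in the information flow
    APPROVED = "approved"  # assumed value
    REJECTED = "rejected"  # assumed value


# Maps a review action to the status it produces.
_ACTIONS = {"approve": ReviewStatus.APPROVED, "reject": ReviewStatus.REJECTED}


def initial_status(review_mode: bool) -> ReviewStatus:
    """New moments wait for review unless REVIEW_MODE is off."""
    return ReviewStatus.PENDING if review_mode else ReviewStatus.APPROVED


def apply_action(action: str) -> ReviewStatus:
    """Return the status a moment takes after an admin action."""
    if action not in _ACTIONS:
        raise ValueError(f"unknown review action: {action}")
    return _ACTIONS[action]


def feeds_synthesis(status: ReviewStatus) -> bool:
    """Only approved moments proceed to technique page synthesis."""
    return status is ReviewStatus.APPROVED
```

Editing a moment is left out because it changes the moment's fields rather than its status; per the review stage description, an edit is followed by an approve.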

Ingest

Method	Path	Description
POST	`/api/v1/ingest`	Upload Whisper JSON transcript
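
Before submitting a transcript, a client-side structure check can catch malformed files early. This sketch assumes the transcript follows the usual Whisper output shape, a top-level `segments` list whose items carry `start`, `end`, and `text`. The exact schema accepted by the ingest endpoint is not documented here, so treat the field names and the `validate_transcript` helper as assumptions.

```python
from typing import Any


def validate_transcript(doc: Any) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks valid."""
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    segments = doc.get("segments")
    if not isinstance(segments, list) or not segments:
        return ["'segments' must be a non-empty list"]

    errors: list[str] = []
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            errors.append(f"segment {i}: must be an object")
            continue
        start, end, text = seg.get("start"), seg.get("end"), seg.get("text")
        if not isinstance(start, (int, float)) or not isinstance(end, (int, float)):
            errors.append(f"segment {i}: 'start' and 'end' must be numbers")
        elif end < start:
            errors.append(f"segment {i}: 'end' is earlier than 'start'")
        if not isinstance(text, str):
            errors.append(f"segment {i}: 'text' must be a string")
    return errors
```

A file would typically be loaded with `json.load` and passed to `validate_transcript` before it is copied to the watch folder or posted to the API.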

Development

# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head

Project Structure

chrysopedia/
├── backend/                 # FastAPI application
│   ├── main.py              # Entry point, middleware, router mounting
│   ├── config.py            # Pydantic Settings (all env vars)
│   ├── models.py            # SQLAlchemy ORM models
│   ├── schemas.py           # Pydantic request/response schemas
│   ├── worker.py            # Celery app configuration
│   ├── watcher.py           # Transcript folder watcher service
│   ├── search_service.py    # Semantic search + keyword fallback
│   ├── routers/             # API endpoint handlers
│   ├── pipeline/            # LLM pipeline stages + clients
│   │   ├── stages.py        # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py    # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                # React + TypeScript + Vite
│   └── src/
│       ├── pages/           # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/      # Shared UI components
│       └── api/             # Typed API clients
├── whisper/                 # Desktop transcription (Whisper large-v3)
├── docker/                  # Dockerfiles + nginx config
├── alembic/                 # Database migrations
├── config/                  # canonical_tags.yaml (topic taxonomy)
├── prompts/                 # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example

Deployment (ub01)

ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d

Resource	Location
Web UI	`http://ub01:8096`
API health check	`http://ub01:8096/health`
PostgreSQL	`ub01:5433`
Compose config	`/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml`
Persistent data	`/vmPool/r/services/chrysopedia_*`

XPLTD conventions: xpltd_chrysopedia project name, dedicated bridge network (172.32.0.0/24), bind mounts under /vmPool/r/services/, PostgreSQL on port 5433.
