
# Chrysopedia — Project Specification
> **Etymology:** From *chrysopoeia* (the alchemical transmutation of base material into gold) + *encyclopedia* (an organized body of knowledge). Chrysopedia transmutes raw video content into refined, searchable production knowledge.
---
## 1. Project overview
### 1.1 Problem statement
Hundreds of hours of educational video content from electronic music producers sit on local storage — tutorials, livestreams, track breakdowns, and deep dives covering techniques in sound design, mixing, arrangement, synthesis, and more. This content is extremely valuable but nearly impossible to retrieve: videos are unsearchable, unchaptered, and undocumented. A 4-hour livestream may contain 6 minutes of actionable gold buried among tangents and chat interaction. The current retrieval method is "scrub through from memory and hope" — or more commonly, the knowledge is simply lost.
### 1.2 Solution
Chrysopedia is a self-hosted knowledge extraction and retrieval system that:
1. **Transcribes** video content using local Whisper inference
2. **Extracts** key moments, techniques, and insights using LLM analysis
3. **Classifies** content by topic, creator, plugins, and production stage
4. **Synthesizes** knowledge across multiple sources into coherent technique pages
5. **Serves** a fast, search-first web UI for mid-session retrieval
The system transforms raw video files into a browsable, searchable knowledge base with direct timestamp links back to source material.
### 1.3 Design principles
- **Search-first.** The primary interaction is typing a query and getting results in seconds. Browse is secondary, for exploration.
- **Surgical retrieval.** A producer mid-session should be able to Alt+Tab, find the technique they need, absorb the key insight, and get back to their DAW in under 2 minutes.
- **Creator equity.** No artist is privileged in the UI. All creators get equal visual weight. Default sort is randomized.
- **Dual-axis navigation.** Content is accessible by Topic (technique/production stage) and by Creator (artist), with both paths being first-class citizens.
- **Incremental, not one-time.** The system must handle ongoing content additions, not just an initial batch.
- **Self-hosted and portable.** Packaged as a Docker Compose project, deployable on existing infrastructure.
### 1.4 Name and identity
- **Project name:** Chrysopedia
- **Suggested subdomain:** `chrysopedia.xpltd.co`
- **Docker project name:** `chrysopedia`
---
## 2. Content inventory and source material
### 2.1 Current state
- **Volume:** 100–500 video files
- **Creators:** 50+ distinct artists/producers
- **Formats:** Primarily MP4/MKV, mixed quality and naming conventions
- **Organization:** Folders per artist, filenames loosely descriptive
- **Location:** Local desktop storage (not yet on the hypervisor/NAS)
- **Content types:**
- Full-length tutorials (30 min–4 hrs, structured walkthroughs)
- Livestream recordings (long, unstructured, conversational)
- Track breakdowns / start-to-finish productions
### 2.2 Content characteristics
The audio track carries the vast majority of the value. Visual demonstrations (screen recordings of DAW work) are useful context but are not the primary extraction target. The transcript is the primary ore.
**Structured content** (tutorials, breakdowns) tends to have natural topic boundaries — the producer announces what they're about to cover, then demonstrates. These are easier to segment.
**Unstructured content** (livestreams) is chaotic: tangents, chat interaction, rambling, with gems appearing without warning. The extraction pipeline must handle both structured and unstructured content using semantic understanding, not just topic detection from speaker announcements.
---
## 3. Terminology
| Term | Definition |
|------|-----------|
| **Creator** | An artist, producer, or educator whose video content is in the system. Formerly "artist" — renamed for flexibility. |
| **Technique page** | The primary knowledge unit: a structured page covering one technique or concept from one creator, compiled from one or more source videos. |
| **Key moment** | A discrete, timestamped insight extracted from a video — a specific technique, setting, or piece of reasoning worth capturing. |
| **Topic** | A production domain or concept category (e.g., "sound design," "mixing," "snare design"). Organized hierarchically. |
| **Genre** | A broad musical style tag (e.g., "dubstep," "drum & bass," "halftime"). Stored as metadata on Creators, not on techniques. Used as a filter across all views. |
| **Source video** | An original video file that has been processed by the pipeline. |
| **Transcript** | The timestamped text output of Whisper processing a source video's audio. |
---
## 4. User experience
### 4.1 UX philosophy
The system is accessed via Alt+Tab from a DAW on the same desktop machine. Every design decision optimizes for speed of retrieval and minimal cognitive load. The interface should feel like a tool, not a destination.
**Primary access method:** Same machine, Alt+Tab to browser.
### 4.2 Landing page (Launchpad)
The landing page is a decision point, not a dashboard. Minimal, focused, fast.
**Layout (top to bottom):**
1. **Search bar** — prominent, full-width, with live typeahead (results appear after 2–3 characters). This is the primary interaction for most visits. Scope toggle tabs below the search input: `All | Topics | Creators`
2. **Two navigation cards** — side-by-side:
- **Topics** — "Browse by technique, production stage, or concept" with count of total techniques and categories
- **Creators** — "Browse by artist, filterable by genre" with count of total creators and genres
3. **Recently added** — a short list of the most recently processed/published technique pages with creator name, topic tag, and relative timestamp
**Future feature (not v1):** Trending / popular section alongside recently added, driven by view counts and cross-reference frequency.
### 4.3 Live search (typeahead)
The search bar is the primary interface. Behavior:
- Results begin appearing after 2–3 characters typed
- Scope toggle: `All | Topics | Creators` — filters what types of results appear
- **"All" scope** groups results by type:
- **Topics** — technique pages matching the query, showing title, creator name(s), parent topic tag
- **Key moments** — individual timestamped insights matching the query, showing moment title, creator, source file, and timestamp. Clicking jumps to the technique page (or eventually direct to the video moment)
- **Creators** — creator names matching the query
- **"Topics" scope** — shows only technique pages
- **"Creators" scope** — shows only creator matches
- Genre filter is accessible on Creators scope and cross-filters Topics scope (using creator-level genre metadata)
- Search is semantic where possible (powered by Qdrant vector search), with keyword fallback
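The "semantic where possible, with keyword fallback" behavior can be sketched as a small dispatch function. This is a minimal sketch, not the implementation: `semantic_search` and `keyword_search` are hypothetical stand-ins for a Qdrant vector query and a relational full-text query.

```python
from typing import Callable

def search(query: str,
           semantic_search: Callable[[str], list],
           keyword_search: Callable[[str], list],
           min_semantic_hits: int = 1) -> list:
    """Try semantic retrieval first; fall back to keyword matching
    when the vector backend fails or returns nothing useful."""
    try:
        hits = semantic_search(query)
    except Exception:
        hits = []  # vector backend down or unreachable
    if len(hits) >= min_semantic_hits:
        return hits
    return keyword_search(query)
```

The same shape works for typeahead and for the full search results page; only the result grouping differs.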
### 4.4 Technique page (A+C hybrid format)
The core content unit. Each technique page covers one technique or concept from one creator. The format adapts by content type but follows a consistent structure.
**Layout (top to bottom):**
1. **Header:**
- Topic tags (e.g., "sound design," "drums," "snare")
- Technique title (e.g., "Snare design")
- Creator name
- Meta line: "Compiled from N sources · M key moments · Last updated [date]"
- Source quality warning (amber banner) if content came from an unstructured livestream
2. **Study guide prose (Section A):**
- Organized by sub-aspects of the technique (e.g., "Layer construction," "Saturation & character," "Mix context")
- Rich prose capturing:
- The specific technique/method described (highest priority)
- Exact settings, plugins, and parameters when the creator was *teaching* the setting (not incidental use)
- The reasoning/philosophy behind choices when the creator explains *why*
- Signal chain blocks rendered in monospace when a creator walks through a routing chain
- Direct quotes of creator opinions/warnings when they add value (e.g., "He says it 'smears the transient into mush'")
3. **Key moments index (Section C):**
- Compact list of individual timestamped insights
- Each row: moment title, source video filename, clickable timestamp
- Sorted chronologically within each source video
4. **Related techniques:**
- Links to related technique pages — same technique by other creators, adjacent techniques by the same creator, general/cross-creator technique pages
- Renders as clickable pill-shaped tags
5. **Plugins referenced:**
- List of all plugins/tools mentioned in the technique page
- Each is a clickable tag that could lead to "all techniques referencing this plugin" (future: dedicated plugin pages)
**Content type adaptation:**
- **Technique-heavy content** (sound design, specific methods): Full A+C treatment with signal chains, plugin details, parameter specifics
- **Philosophy/workflow content** (mixdown approach, creative process): More prose-heavy, fewer signal chain blocks, but same overall structure. These pages are still browsable but also serve as rich context for future RAG/chat retrieval
- **Livestream-sourced content:** Amber warning banner noting source quality. Timestamps may land in messy context with tangents nearby
### 4.5 Creators browse page
Accessed from the landing page "Creators" card.
**Layout:**
- Page title: "Creators" with total count
- Filter input: type-to-narrow the list
- Genre filter pills: `All genres | Bass music | Drum & bass | Dubstep | Halftime | House | IDM | Neuro | Techno | ...` — clicking a genre filters the list to creators tagged with that genre
- Sort options: Randomized (default, re-shuffled on every page load), Alphabetical, View count
- Creator list: flat, equal-weight rows. Each row shows:
- Creator name
- Genre tags (multiple allowed)
- Technique count
- Video count
- View count (sum of activity across all content derived from this creator)
- Clicking a row navigates to that creator's detail page (list of all their technique pages)
**Default sort is randomized on every page load** to prevent discovery bias. Users can toggle to alphabetical or sort by view count.
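The sort behavior above can be sketched as a single ordering function, assuming each creator row is a dict with `name` and `view_count` keys (an illustrative shape, not the final schema):

```python
import random

def order_creators(creators: list[dict], sort: str = "random") -> list[dict]:
    """Return the creator list in the requested order.

    The default is a fresh shuffle on every call (i.e. every page load),
    so no creator is privileged by position."""
    rows = list(creators)  # never mutate the caller's list
    if sort == "random":
        random.shuffle(rows)
    elif sort == "alphabetical":
        rows.sort(key=lambda c: c["name"].lower())
    elif sort == "views":
        rows.sort(key=lambda c: c["view_count"], reverse=True)
    return rows
```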
### 4.6 Topics browse page
Accessed from the landing page "Topics" card.
**Layout:**
- Page title: "Topics" with total technique count
- Filter input: type-to-narrow
- Genre filter pills (uses creator-level genre metadata to filter): show only techniques from creators tagged with the selected genre
- **Two-level hierarchy displayed:**
- **Top-level categories:** Sound design, Mixing, Synthesis, Arrangement, Workflow, Mastering
- **Sub-topics within each:** clicking a top-level category expands or navigates to show sub-topics (e.g., Sound Design → Bass, Drums, Pads, Leads, FX, Foley; Drums → Kick, Snare, Hi-hat, Percussion)
- Each sub-topic shows: technique count, number of creators covering it
- Clicking a sub-topic shows all technique pages in that category, filterable by creator and genre
### 4.7 Search results page
For complex queries that go beyond typeahead (e.g., hitting Enter after typing a full query).
**Layout:**
- Search bar at top (retains query)
- Scope tabs: `All results (N) | Techniques (N) | Key moments (N) | Creators (N)`
- Results split into two tiers:
- **Technique pages** — first-class results with title, creator, summary snippet, tags, moment count, plugin list
- **Also mentioned in** — cross-references where the search term appears inside other technique pages (e.g., searching "snare" surfaces "drum bus processing" because it mentions snare bus techniques)
---
## 5. Taxonomy and topic hierarchy
### 5.1 Top-level categories
These are broad production stages/domains. They should cover the full scope of music production education:
| Category | Description | Example sub-topics |
|----------|-------------|-------------------|
| Sound design | Creating and shaping sounds from scratch or samples | Bass, drums (kick, snare, hi-hat, percussion), pads, leads, FX, foley, vocals, textures |
| Mixing | Balancing, processing, and spatializing elements in a session | EQ, compression, bus processing, reverb/delay, stereo imaging, gain staging, automation |
| Synthesis | Methods of generating sound | FM, wavetable, granular, additive, subtractive, modular, physical modeling |
| Arrangement | Structuring a track from intro to outro | Song structure, transitions, tension/release, energy flow, breakdowns, drops |
| Workflow | Creative process, session management, productivity | DAW setup, templates, creative process, collaboration, file management, resampling |
| Mastering | Final stage processing for release | Limiting, stereo width, loudness, format delivery, referencing |
### 5.2 Sub-topic management
Sub-topics are not rigidly pre-defined. The extraction pipeline proposes sub-topic tags during classification, and the taxonomy grows organically as content is processed. However, the system maintains a **canonical tag list** that the LLM references during classification to ensure consistency (e.g., always "snare" not sometimes "snare drum" and sometimes "snare design").
The canonical tag list is editable by the administrator and should be stored as a configuration file that the pipeline references. New tags can be proposed by the pipeline and queued for admin approval, or auto-added if they fit within an existing top-level category.
### 5.3 Genre taxonomy
Genres are broad, general-level tags. Sub-genre classification is explicitly out of scope to avoid complexity.
**Initial genre set (expandable):**
Bass music, Drum & bass, Dubstep, Halftime, House, Techno, IDM, Glitch, Downtempo, Neuro, Ambient, Experimental, Cinematic
**Rules:**
- Genres are metadata on Creators, not on techniques
- A Creator can have multiple genre tags
- Genre is available as a filter on both the Creators browse page and the Topics browse page (filtering Topics by genre shows techniques from creators tagged with that genre)
- Genre tags are assigned during initial creator setup (manually or LLM-suggested based on content analysis) and can be edited by the administrator
---
## 6. Data model
### 6.1 Core entities
**Creator**
```
id UUID
name string (display name, e.g., "KOAN Sound")
slug string (URL-safe, e.g., "koan-sound")
genres string[] (e.g., ["glitch hop", "neuro", "bass music"])
folder_name string (matches the folder name on disk for source mapping)
view_count integer (aggregated from child technique page views)
created_at timestamp
updated_at timestamp
```
**Source Video**
```
id UUID
creator_id FK → Creator
filename string (original filename)
file_path string (path on disk)
duration_seconds integer
content_type enum: tutorial | livestream | breakdown | short_form
transcript_path string (path to transcript JSON)
processing_status enum: pending | transcribed | extracted | reviewed | published
created_at timestamp
updated_at timestamp
```
**Transcript Segment**
```
id UUID
source_video_id FK → Source Video
start_time float (seconds)
end_time float (seconds)
text text
segment_index integer (order within video)
topic_label string (LLM-assigned topic label for this segment)
```
**Key Moment**
```
id UUID
source_video_id FK → Source Video
technique_page_id FK → Technique Page (nullable until assigned)
title string (e.g., "Three-layer snare construction")
summary text (1-3 sentence description)
start_time float (seconds)
end_time float (seconds)
content_type enum: technique | settings | reasoning | workflow
plugins string[] (plugin names detected)
review_status enum: pending | approved | edited | rejected
raw_transcript text (the original transcript text for this segment)
created_at timestamp
updated_at timestamp
```
**Technique Page**
```
id UUID
creator_id FK → Creator
title string (e.g., "Snare design")
slug string (URL-safe)
topic_category string (top-level: "sound design")
topic_tags string[] (sub-topics: ["drums", "snare", "layering", "saturation"])
summary text (synthesized overview paragraph)
body_sections JSONB (structured prose sections with headings)
signal_chains JSONB[] (structured signal chain representations)
plugins string[] (all plugins referenced across all moments)
source_quality enum: structured | mixed | unstructured (derived from source video types)
view_count integer
review_status enum: draft | reviewed | published
created_at timestamp
updated_at timestamp
```
**Related Technique Link**
```
id UUID
source_page_id FK → Technique Page
target_page_id FK → Technique Page
relationship enum: same_technique_other_creator | same_creator_adjacent | general_cross_reference
```
**Tag (canonical)**
```
id UUID
name string (e.g., "snare")
category string (parent top-level category: "sound design")
aliases string[] (alternative phrasings the LLM should normalize: ["snare drum", "snare design"])
```
### 6.2 Storage layer
| Store | Purpose | Technology |
|-------|---------|------------|
| Relational DB | All structured data (creators, videos, moments, technique pages, tags) | PostgreSQL (preferred) or SQLite for initial simplicity |
| Vector DB | Semantic search embeddings for transcripts, key moments, and technique page content | Qdrant (already running on hypervisor) |
| File store | Raw transcript JSON files, source video reference metadata | Local filesystem on hypervisor, organized by creator slug |
### 6.3 Vector embeddings
The following content gets embedded in Qdrant for semantic search:
- Key moment summaries (with metadata: creator, topic, timestamp, source video)
- Technique page summaries and body sections
- Transcript segments (for future RAG/chat retrieval)
Embedding model: configurable. Can use a local model via Ollama (e.g., `nomic-embed-text`) or an API-based model. The embedding endpoint should be a configurable URL, same pattern as the LLM endpoint.
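Shaping content for the vector store can be sketched as follows. This stays offline for illustration: the actual embedding call (to the configurable endpoint) and the Qdrant upsert are left to the caller, and the input dict shape is assumed, not final.

```python
def build_embedding_records(moments: list[dict]) -> list[dict]:
    """Shape approved key moments into records ready for embedding and
    upsert into a vector collection: the text to embed plus the metadata
    payload stored alongside the vector."""
    records = []
    for m in moments:
        records.append({
            "id": m["id"],
            "text": m["summary"],      # the string that gets embedded
            "payload": {               # metadata stored next to the vector
                "creator": m["creator"],
                "topic": m["topic"],
                "timestamp": m["start_time"],
                "source_video": m["source_video"],
            },
        })
    return records
```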
---
## 7. Pipeline architecture
### 7.1 Infrastructure topology
```
Desktop (RTX 4090) Hypervisor (Docker host)
┌─────────────────────┐ ┌─────────────────────────────────┐
│ Video files (local) │ │ Chrysopedia Docker Compose │
│ Whisper (local GPU) │──2.5GbE──────▶│ ├─ API / pipeline service │
│ Output: transcript │ (text only) │ ├─ Web UI │
│ JSON files │ │ ├─ PostgreSQL │
└─────────────────────┘ │ ├─ Qdrant (existing) │
│ └─ File store │
└────────────┬────────────────────┘
│ API calls (text)
┌─────────────▼────────────────────┐
│ Friend's DGX Sparks │
│ Qwen via Open WebUI API │
│ (2Gb fiber, high uptime) │
└──────────────────────────────────┘
```
**Bandwidth analysis:** Transcript JSON files are 200–500 KB each. At 50Mbit upload, the entire library's transcripts could transfer in under a minute. The bandwidth constraint is irrelevant for this workload. The only large files (videos) stay on the desktop.
**Future centralization:** The Docker Compose project should be structured so that when all hardware is co-located, the only change is config (moving Whisper into the compose stack and pointing file paths to local storage). No architectural rewrite.
### 7.2 Processing stages
#### Stage 1: Audio extraction and transcription (Desktop)
**Tool:** Whisper large-v3 running locally on RTX 4090
**Input:** Video file (MP4/MKV)
**Process:**
1. Extract audio track from video (ffmpeg → WAV or direct pipe)
2. Run Whisper with word-level or segment-level timestamps
3. Output: JSON file with timestamped transcript
**Output format:**
```json
{
"source_file": "Skope — Sound Design Masterclass pt2.mp4",
"creator_folder": "Skope",
"duration_seconds": 7243,
"segments": [
{
"start": 0.0,
"end": 4.52,
"text": "Hey everyone welcome back to part two...",
"words": [
{"word": "Hey", "start": 0.0, "end": 0.28},
{"word": "everyone", "start": 0.32, "end": 0.74}
]
}
]
}
```
**Performance estimate:** Whisper large-v3 on a 4090 processes audio at roughly 10-20x real-time. A 2-hour video takes ~6-12 minutes to transcribe. For 300 videos averaging 1.5 hours each, the initial transcription pass is roughly 15-40 hours of GPU time.
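The Stage 1 script can be sketched as two pieces: building the ffmpeg invocation, and wrapping Whisper's segments in the output format above. The Whisper call itself is omitted here; the mono/16 kHz settings reflect Whisper's expected input.

```python
import json
from pathlib import Path

def extract_audio_cmd(video: Path, wav: Path) -> list[str]:
    """ffmpeg invocation: drop video, downmix to mono 16 kHz WAV."""
    return ["ffmpeg", "-y", "-i", str(video),
            "-vn", "-ac", "1", "-ar", "16000", str(wav)]

def write_transcript(segments: list[dict], video: Path, out_dir: Path) -> Path:
    """Wrap Whisper segments in the transcript JSON format above.
    The creator folder is inferred from the video's parent directory."""
    doc = {
        "source_file": video.name,
        "creator_folder": video.parent.name,
        "duration_seconds": int(max((s["end"] for s in segments), default=0)),
        "segments": segments,
    }
    out = out_dir / (video.stem + ".json")
    out.write_text(json.dumps(doc, indent=2))
    return out
```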
#### Stage 2: Transcript segmentation (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks, or local Ollama as fallback)
**Input:** Full timestamped transcript JSON
**Process:** The LLM analyzes the transcript to identify topic boundaries — points where the creator shifts from one subject to another. Output is a segmented transcript with topic labels per segment.
**This stage can use a lighter model** if needed (segmentation is more mechanical than extraction). However, for simplicity in v1, use the same model endpoint as stages 3-5.
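Parsing the segmentation output can be sketched as below, assuming the prompt asks the LLM for a JSON array of `{start, end, topic_label}` objects (a hypothetical response format, to be fixed by the actual prompt template). LLMs occasionally wrap JSON in markdown fences, so the parser strips those defensively.

```python
import json

def parse_segmentation(llm_output: str) -> list[dict]:
    """Parse Stage 2 LLM output into labelled segments, validating that
    each segment carries the fields the rest of the pipeline needs."""
    text = llm_output.strip()
    if text.startswith("```"):
        # strip markdown code fences and an optional "json" language tag
        text = text.strip("`").removeprefix("json").strip()
    segments = json.loads(text)
    for seg in segments:
        if not {"start", "end", "topic_label"} <= seg.keys():
            raise ValueError(f"malformed segment: {seg}")
    return segments
```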
#### Stage 3: Key moment extraction (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Individual transcript segments from Stage 2
**Process:** The LLM reads each segment and identifies actionable insights. The extraction prompt should distinguish between:
- **Instructional content** (the creator is *teaching* something) → extract as a key moment
- **Incidental content** (the creator is *using* a tool without explaining it) → skip
- **Philosophical/reasoning content** (the creator explains *why* they make a choice) → extract with `content_type: reasoning`
- **Settings/parameters** (specific plugin settings, values, configurations being demonstrated) → extract with `content_type: settings`
**Extraction rule for plugin detail:** Capture plugin names and settings when the creator is *teaching* the setting — spending time explaining why they chose it, what it does, how to configure it. Skip incidental plugin usage (a plugin is visible but not discussed).
#### Stage 4: Classification and tagging (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Extracted key moments from Stage 3
**Process:** Each moment is classified with:
- Top-level topic category
- Sub-topic tags (referencing the canonical tag list)
- Plugin names (normalized to canonical names)
- Content type classification
The LLM is provided the canonical tag list as context and instructed to use existing tags where possible, proposing new tags only when no existing tag fits.
#### Stage 5: Synthesis (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** All approved/published key moments for a given creator + topic combination
**Process:** When multiple key moments from the same creator cover overlapping or related topics, the synthesis stage merges them into a coherent technique page. This includes:
- Writing the overview summary paragraph
- Organizing body sections by sub-aspect
- Generating signal chain blocks where applicable
- Identifying related technique pages for cross-linking
- Compiling the plugin reference list
This stage runs whenever new key moments are approved for a creator+topic combination that already has a technique page (updating it), or when enough moments accumulate to warrant a new page.
### 7.3 LLM endpoint configuration
The pipeline talks to an **OpenAI-compatible API endpoint** (which both Ollama and Open WebUI expose). The LLM is not hardcoded — it's configured via environment variables:
```
LLM_API_URL=https://friend-openwebui.example.com/api
LLM_API_KEY=sk-...
LLM_MODEL=qwen2.5-72b
LLM_FALLBACK_URL=http://localhost:11434/v1 # local Ollama
LLM_FALLBACK_MODEL=qwen2.5:14b-q8_0
```
The pipeline should attempt the primary endpoint first and fall back to the local model if the primary is unavailable.
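Assembling the request from those environment variables can be sketched as follows. This builds the URL and body for an OpenAI-compatible `/chat/completions` call; whether the exact path matches a given Open WebUI deployment is an assumption to verify against its API docs.

```python
import os

def chat_request(prompt: str, use_fallback: bool = False) -> tuple[str, dict]:
    """Build the URL and JSON body for an OpenAI-compatible
    chat-completions call from the environment variables above."""
    if use_fallback:
        base = os.environ.get("LLM_FALLBACK_URL", "http://localhost:11434/v1")
        model = os.environ.get("LLM_FALLBACK_MODEL", "qwen2.5:14b-q8_0")
    else:
        base = os.environ["LLM_API_URL"]
        model = os.environ["LLM_MODEL"]
    url = base.rstrip("/") + "/chat/completions"
    body = {"model": model,
            "messages": [{"role": "user", "content": prompt}]}
    return url, body
```

A thin wrapper would POST to the primary URL and retry with `use_fallback=True` on any connection error.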
### 7.4 Embedding endpoint configuration
Same configurable pattern:
```
EMBEDDING_API_URL=http://localhost:11434/v1
EMBEDDING_MODEL=nomic-embed-text
```
### 7.5 Processing estimates for initial seeding
| Stage | Per video | 300 videos total |
|-------|----------|-----------------|
| Transcription (Whisper, 4090) | 6–12 min | 30–60 hours |
| Segmentation (LLM) | ~1 min | ~5 hours |
| Extraction (LLM) | ~2 min | ~10 hours |
| Classification (LLM) | ~30 sec | ~2.5 hours |
| Synthesis (LLM) | ~2 min per technique page | Varies by page count |
**Recommendation:** Tell the DGX Sparks friend to expect a weekend of sustained processing for the initial seed. The pipeline must be **resumable** — if it drops, it picks up from the last successfully processed video/stage, not from the beginning.
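Resumability falls out of the `processing_status` field on Source Video: a run skips anything already past its target stage, so re-issuing a dropped run is a no-op for finished work. A minimal sketch, with `run_stage` as a hypothetical stand-in for the real stage workers:

```python
STAGE_ORDER = ["pending", "transcribed", "extracted", "reviewed", "published"]

def resume(videos: list[dict], run_stage) -> list[str]:
    """Advance every video one status step, skipping completed ones.
    Returns the ids actually processed this run."""
    processed = []
    for v in videos:
        idx = STAGE_ORDER.index(v["processing_status"])
        if idx >= len(STAGE_ORDER) - 1:
            continue  # already published; nothing to do
        run_stage(v)
        v["processing_status"] = STAGE_ORDER[idx + 1]
        processed.append(v["id"])
    return processed
```

In the real pipeline the status update must be committed per video, not per batch, so a crash mid-run loses at most one video's progress.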
---
## 8. Review and approval workflow
### 8.1 Modes
The system supports two modes:
- **Review mode (initial calibration):** All extracted key moments enter a review queue. The administrator reviews, edits, approves, or rejects each moment before it's published.
- **Auto mode (post-calibration):** Extracted moments are published automatically. The review queue still exists but functions as an audit log rather than a gate.
The mode is a system-level toggle. The transition from review to auto mode happens when the administrator is satisfied with extraction quality — typically after reviewing the first several videos and tuning prompts.
### 8.2 Review queue interface
The review UI is part of the Chrysopedia web application (an admin section, not a separate tool).
**Queue view:**
- Counts: pending, approved, edited, rejected
- Filter tabs: Pending | Approved | Edited | Rejected
- Items organized by source video (review all moments from one video in sequence for context)
**Individual moment review:**
- Extracted moment: title, timestamp range, summary, tags, plugins detected
- Raw transcript segment displayed alongside for comparison
- Five actions:
- **Approve** — publish as-is
- **Edit & approve** — modify summary, tags, timestamp, or plugins, then publish
- **Split** — the moment actually contains two distinct insights; split into two separate moments
- **Merge with adjacent** — the system over-segmented; combine with the next or previous moment
- **Reject** — not a key moment; discard
### 8.3 Prompt tuning
The extraction prompts (stages 2-5) should be stored as editable configuration, not hardcoded. If review reveals systematic issues (e.g., the LLM consistently misclassifies mixing techniques as sound design), the administrator should be able to:
1. Edit the prompt templates
2. Re-run extraction on specific videos or all videos
3. Review the new output
This is the "calibration loop" — run pipeline, review output, tune prompts, re-run, repeat until quality is sufficient for auto mode.
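Keeping prompts on disk is what makes this loop possible without code changes. A minimal loading sketch, assuming one template file per stage with `${variable}` placeholders (the `prompts/` layout and filename pattern here are illustrative):

```python
from pathlib import Path
from string import Template

def load_prompt(prompts_dir: Path, stage: str, **values) -> str:
    """Load an editable prompt template (e.g. prompts/stage3_extraction.txt)
    and substitute pipeline variables like the canonical tag list."""
    template = Template((prompts_dir / f"{stage}.txt").read_text())
    return template.safe_substitute(**values)
```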
---
## 9. New content ingestion workflow
### 9.1 Adding new videos
The ongoing workflow for adding new content after initial seeding:
1. **Drop file:** Place new video file(s) in the appropriate creator folder on the desktop (or create a new folder for a new creator)
2. **Trigger transcription:** Run the Whisper transcription stage on the new file(s). This could be a manual CLI command, a watched-folder daemon, or an n8n workflow trigger.
3. **Ship transcript:** Transfer the transcript JSON to the hypervisor (automated via the pipeline)
4. **Process:** Stages 2-5 run automatically on the new transcript
5. **Review or auto-publish:** Depending on mode, moments enter the review queue or publish directly
6. **Synthesis update:** If the new content covers a topic that already has a technique page for this creator, the synthesis stage updates the existing page. If it's a new topic, a new technique page is created.
### 9.2 Adding new creators
When a new creator's content is added:
1. Create a new folder on the desktop with the creator's name
2. Add video files
3. The pipeline detects the new folder name and creates a Creator record
4. Genre tags can be auto-suggested by the LLM based on content analysis, or manually assigned by the administrator
5. Process videos as normal
### 9.3 Watched folder (optional, future)
For maximum automation, a filesystem watcher on the desktop could detect new video files and automatically trigger the transcription pipeline. This is a nice-to-have for v2, not a v1 requirement. In v1, transcription is triggered manually.
---
## 10. Deployment and infrastructure
### 10.1 Docker Compose project
The entire Chrysopedia stack (excluding Whisper, which runs on the desktop GPU) is packaged as a single `docker-compose.yml`:
```yaml
# Indicative structure — not final
services:
chrysopedia-api:
# FastAPI or similar — handles pipeline orchestration, API endpoints
chrysopedia-web:
# Web UI — React, Svelte, or similar SPA
chrysopedia-db:
# PostgreSQL
chrysopedia-qdrant:
# Only if not using the existing Qdrant instance
chrysopedia-worker:
# Background job processor for pipeline stages 2-5
```
### 10.2 Existing infrastructure integration
**IMPORTANT:** The implementing agent should reference **XPLTD Lore** when making deployment decisions. This includes:
- Existing Docker conventions, naming patterns, and network configuration
- The hypervisor's current resource allocation and available capacity (~60 containers already running)
- Existing Qdrant instance (may be shared or a new collection created)
- Existing n8n instance (potential for workflow triggers)
- Storage paths and volume mount conventions
- Any reverse proxy or DNS configuration patterns
Do not assume infrastructure details — consult XPLTD Lore for how applications are typically deployed in this environment.
### 10.3 Whisper on desktop
Whisper runs separately on the desktop with the RTX 4090. It is NOT part of the Docker Compose stack (for now). It should be packaged as a simple Python script or lightweight container that:
1. Accepts a video file path (or watches a directory)
2. Extracts audio via ffmpeg
3. Runs Whisper large-v3
4. Outputs transcript JSON
5. Ships the JSON to the hypervisor (SCP, rsync, or API upload to the Chrysopedia API)
**Future centralization:** When all hardware is co-located, Whisper can be added to the Docker Compose stack with GPU passthrough, and the video files can be mounted directly. The pipeline should be designed so this migration is a config change, not a rewrite.
### 10.4 Network considerations
- Desktop ↔ Hypervisor: 2.5GbE (ample for transcript JSON transfer)
- Hypervisor ↔ DGX Sparks: Internet (50Mbit up from Chrysopedia side, 2Gb fiber on the DGX side). Transcript text payloads are tiny; this is not a bottleneck.
- Web UI: Served from hypervisor, accessed via local network (same machine Alt+Tab) or from other devices on the network. Eventually shareable with external users.
---
## 11. Technology recommendations
These are recommendations, not mandates. The implementing agent should evaluate alternatives based on current best practices and XPLTD Lore.
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Transcription | Whisper large-v3 (local, 4090) | Best accuracy, local processing keeps media files on-network |
| LLM inference | Qwen via Open WebUI API (DGX Sparks) | Free, powerful, high uptime. Ollama on 4090 as fallback |
| Embedding | nomic-embed-text via Ollama (local) | Good quality, runs easily alongside other local models |
| Vector DB | Qdrant | Already running on hypervisor |
| Relational DB | PostgreSQL | Robust, good JSONB support for flexible schema fields |
| API framework | FastAPI (Python) | Strong async support, good for pipeline orchestration |
| Web UI | React or Svelte SPA | Fast, component-based, good for search-heavy UIs |
| Background jobs | Celery with Redis, or a simpler task queue | Pipeline stages 2-5 run as background jobs |
| Audio extraction | ffmpeg | Universal, reliable |
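To make the embedding and vector DB recommendations concrete, here is a rough sketch of how a pipeline stage might embed a technique summary with nomic-embed-text via Ollama's HTTP API and upsert it into Qdrant. Hostnames and the collection name are placeholders; treat this as an illustration under those assumptions, not the final service code.

```python
"""Embed text via Ollama and upsert the vector into Qdrant (HTTP APIs)."""
import json
import urllib.request

OLLAMA = "http://localhost:11434"          # Ollama default port
QDRANT = "http://hypervisor.local:6333"    # placeholder Qdrant host
COLLECTION = "techniques"                  # placeholder collection name


def _json_request(url: str, payload: dict, method: str = "POST") -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"}, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def embed(text: str) -> list[float]:
    # Ollama embeddings endpoint takes {"model": ..., "prompt": ...}
    out = _json_request(f"{OLLAMA}/api/embeddings",
                        {"model": "nomic-embed-text", "prompt": text})
    return out["embedding"]


def qdrant_point(point_id: int, vector: list[float], payload: dict) -> dict:
    # A Qdrant point: id + vector + arbitrary payload for filtering/display
    return {"id": point_id, "vector": vector, "payload": payload}


def upsert_technique(point_id: int, text: str, payload: dict) -> dict:
    point = qdrant_point(point_id, embed(text), payload)
    return _json_request(f"{QDRANT}/collections/{COLLECTION}/points",
                         {"points": [point]}, method="PUT")
```

Keeping the embedding call behind a small function also makes it cheap to swap models later without touching the Qdrant side.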
---
## 12. Open questions and future considerations
These items are explicitly out of scope for v1 but should be considered in architectural decisions:
### 12.1 Chat / RAG retrieval
Not required for v1, but the system should be **architected to support it easily.** The Qdrant embeddings and structured knowledge base provide the foundation. A future chat interface could use the Qwen instance (or any compatible LLM) with RAG over the Chrysopedia knowledge base to answer natural language questions like "How does Skope approach snare design differently from Au5?"
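A rough sketch of that future RAG flow, purely illustrative: embed the question, retrieve top technique chunks from Qdrant, and pass them as context to Qwen through an OpenAI-compatible chat endpoint (Open WebUI exposes one; the model id, endpoint path, and the `retrieve` callable here are assumptions).

```python
"""Future RAG chat sketch: retrieved chunks become grounded LLM context."""
import json
import urllib.request


def build_prompt(chunks: list[str], question: str) -> str:
    # Number each retrieved technique chunk so the model can cite sources
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return ("Answer the question using only the context below. "
            "Cite chunk numbers.\n\n"
            f"{context}\n\nQuestion: {question}")


def ask(question: str, retrieve, chat_url: str, api_key: str) -> str:
    """`retrieve(question) -> list[str]` would wrap the Qdrant search."""
    body = {
        "model": "qwen",  # placeholder model id on the Open WebUI side
        "messages": [{"role": "user",
                      "content": build_prompt(retrieve(question), question)}],
    }
    req = urllib.request.Request(
        chat_url, data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Nothing in v1 needs this, but note that the only new piece is the prompt assembly; retrieval and inference reuse infrastructure the spec already calls for.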
### 12.2 Direct video playback
v1 provides file paths and timestamps ("Skope — Sound Design Masterclass pt2.mp4 @ 1:42:30"). Future versions could embed video playback directly in the web UI, jumping to the exact timestamp. This requires the video files to be network-accessible from the web UI, which depends on centralizing storage.
### 12.3 Access control
Not needed for v1. The system is initially for personal/local use. Future versions may add authentication for sharing with friends or external users. The architecture should not preclude this (e.g., don't hardcode single-user assumptions into the data model).
### 12.4 Multi-user features
Eventually: user-specific bookmarks, personal notes on technique pages, view history, and personalized "trending" based on individual usage patterns.
### 12.5 Content types beyond video
The extraction pipeline is fundamentally transcript-based. It could be extended to process podcast episodes, audio-only recordings, or even written tutorials/blog posts with minimal architectural changes.
### 12.6 Plugin knowledge base
Plugins referenced across all technique pages could be promoted to a first-class entity with their own browse page: "All techniques that reference Serum" or "Signal chains using Pro-Q 3." The data model already captures plugin references — this is primarily a UI feature.
---
## 13. Success criteria
The system is successful when:
1. **A producer mid-session can find a specific technique in under 30 seconds** — from Alt+Tab to reading the key insight
2. **The extraction pipeline correctly identifies 80%+ of key moments** without human intervention (post-calibration)
3. **New content can be added and processed within hours**, not days
4. **The knowledge base grows more useful over time** — cross-references and related techniques create a web of connected knowledge that surfaces unexpected insights
5. **The system runs reliably on existing infrastructure** without requiring significant new hardware or ongoing cloud costs
---
## 14. Implementation phases
### Phase 1: Foundation
- Set up Docker Compose project with PostgreSQL, API service, and web UI skeleton
- Implement Whisper transcription script for desktop
- Build transcript ingestion endpoint on the API
- Implement basic Creator and Source Video management
### Phase 2: Extraction pipeline
- Implement stages 2-5 (segmentation, extraction, classification, synthesis)
- Build the review queue UI
- Process a small batch of videos (5-10) for calibration
- Tune extraction prompts based on review feedback
### Phase 3: Knowledge UI
- Build the search-first web UI: landing page, live search, technique pages
- Implement Qdrant integration for semantic search
- Build Creators and Topics browse pages
- Implement related technique cross-linking
### Phase 4: Initial seeding
- Process the full video library through the pipeline
- Review and approve extractions (transitioning toward auto mode)
- Populate the canonical tag list and genre taxonomy
- Build out cross-references and related technique links
### Phase 5: Polish and ongoing
- Transition to auto mode for new content
- Implement view count tracking
- Optimize search ranking and relevance
- Begin sharing with trusted external users
---
*This specification was developed through collaborative ideation between the project owner and Claude. The implementing agent should treat this as a comprehensive guide while exercising judgment on technical implementation details, consulting XPLTD Lore for infrastructure conventions, and adapting to discoveries made during development.*