3 Pipeline
jlightner edited this page 2026-04-04 06:55:06 -05:00


A multi-stage, LLM-powered extraction pipeline that transforms video transcripts into structured technique articles (stage 3 is currently reserved).

Pipeline Stages

Video File
    ↓
[Desktop] Whisper large-v3 (RTX 4090) → transcript JSON
    ↓
[Watcher/API] Ingest → SourceVideo + TranscriptSegments in PostgreSQL
    ↓
Stage 1: Transcript Segmentation — chunk transcript into logical segments
    ↓
Stage 2: Key Moment Extraction — identify teachable moments with timestamps
    ↓
Stage 3: (reserved)
    ↓
Stage 4: Classification & Tagging — assign topic_category + topic_tags per moment
    ↓
Stage 5: Technique Page Synthesis — compose study guide articles from moments
    ↓
Stage 6: Embed & Index — generate embeddings, upsert to Qdrant (non-blocking)
    ↓
Stage 7: Highlight Detection — score key moments as highlight candidates

Stage Details

Stage 1: Transcript Segmentation

  • Chunks raw transcript into logical segments
  • Input: TranscriptSegments from DB
  • Output: Segmented data for stage 2

Stage 2: Key Moment Extraction

  • Identifies teachable moments with titles, summaries, timestamps
  • Uses LLM with prompt template from prompts/ directory
  • Output: KeyMoment records in PostgreSQL

Stage 4: Classification & Tagging

  • Assigns topic_category and topic_tags to each key moment
  • References canonical tag list (canonical_tags.yaml) with aliases
  • Output: Classification data stored in Redis (chrysopedia:classification:{video_id}, 24h TTL)
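
The Redis write described above can be sketched as follows. This assumes a redis-py style client (anything exposing `.set`); the function name and return value are illustrative, not the project's actual code.

```python
import json

# Hypothetical sketch of how stage 4 might persist classification output,
# using the key pattern and 24h TTL described above.
CLASSIFICATION_TTL = 24 * 3600  # 24 hours, in seconds

def store_classification(client, video_id: str, results: dict) -> str:
    """Write per-video classification results under an expiring Redis key."""
    key = f"chrysopedia:classification:{video_id}"
    client.set(key, json.dumps(results), ex=CLASSIFICATION_TTL)
    return key
```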

Stage 5: Technique Page Synthesis

  • Composes study guide articles from classified key moments
  • Handles multi-source merging: new video moments merge into existing technique pages
  • Uses offset-based citation indexing (existing [0]-[N-1], new [N]-[N+M-1])
  • Creates pre-overwrite version snapshot before mutating existing pages (D018)
  • Output: TechniquePage records with body_sections (v2 format), signal_chains, plugins
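
The offset-based citation scheme above can be illustrated with a minimal sketch: existing citations keep indices [0]..[N-1], and markers in newly synthesized text are shifted by N so the merged page has one contiguous index space. The function name and source representation are assumptions for illustration.

```python
import re

def merge_citations(existing_sources: list, new_sources: list, new_body: str):
    """Shift [i] citation markers in new text by the count of existing sources."""
    offset = len(existing_sources)  # N: first index available for new sources
    shifted = re.sub(
        r"\[(\d+)\]",
        lambda m: f"[{int(m.group(1)) + offset}]",
        new_body,
    )
    return existing_sources + new_sources, shifted
```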

Stage 6: Embed & Index

  • Generates embeddings via Ollama (nomic-embed-text)
  • Embedding text enriched with creator_name and topic_tags (D023)
  • Upserts to Qdrant with deterministic UUIDs based on content
  • Non-blocking: Failures log WARNING but don't fail the pipeline (D005)
  • Can be re-triggered independently via /admin/pipeline/reindex-all
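
Deterministic content-based UUIDs can be sketched with `uuid.uuid5`: hashing the content into the point ID means re-indexing the same moment overwrites the same Qdrant point instead of creating a duplicate. The namespace seed here is an assumption, not the project's actual value.

```python
import uuid

# Derive a project namespace from a fixed seed (hypothetical seed string).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "chrysopedia")

def point_id(content: str) -> str:
    """Deterministic UUID for a Qdrant point, derived from its content."""
    return str(uuid.uuid5(NAMESPACE, content))
```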

Stage 7: Highlight Detection (M021/S04)

  • Scores every KeyMoment in a video using 7 weighted heuristic dimensions
  • Pure function scoring: duration_fitness (0.25), content_type (0.20), specificity_density (0.20), plugin_richness (0.10), transcript_energy (0.10), source_quality (0.10), video_type (0.05)
  • Celery task stage_highlight_detection with bind=True, max_retries=3
  • Bulk upserts via INSERT ON CONFLICT on named constraint uq_highlight_candidate_moment
  • Output: HighlightCandidate records in PostgreSQL with composite score and per-dimension breakdown
  • See Highlights for full scoring details and API endpoints
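
The pure-function composite score can be sketched directly from the weights listed above. Each dimension is assumed to be pre-normalized to [0, 1]; the dict-based call shape is an assumption for illustration.

```python
# Weights from the bullet list above; they sum to 1.0.
WEIGHTS = {
    "duration_fitness": 0.25,
    "content_type": 0.20,
    "specificity_density": 0.20,
    "plugin_richness": 0.10,
    "transcript_energy": 0.10,
    "source_quality": 0.10,
    "video_type": 0.05,
}

def composite_score(dimensions: dict) -> float:
    """Weighted sum over the seven heuristic dimensions (missing dims score 0)."""
    return sum(WEIGHTS[name] * dimensions.get(name, 0.0) for name in WEIGHTS)
```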

LLM Configuration

  • Primary LLM: DGX Sparks Qwen (OpenAI-compatible API)
  • Fallback LLM: Local Ollama
  • Embedding model: nomic-embed-text (Ollama)
  • Model routing: per-stage configuration (chat vs. thinking models)
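
Primary/fallback routing can be sketched as below, assuming both endpoints speak the OpenAI-compatible chat API. The client objects, model names, and error handling are illustrative, not the project's actual configuration.

```python
def chat_with_fallback(primary, fallback, messages,
                       primary_model="qwen", fallback_model="llama3"):
    """Try the primary endpoint; on any failure, retry against local Ollama."""
    try:
        return primary.chat.completions.create(
            model=primary_model, messages=messages
        )
    except Exception:
        # Any primary failure (network, timeout, 5xx) falls through here.
        return fallback.chat.completions.create(
            model=fallback_model, messages=messages
        )
```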

Prompt Template System

  • Prompt files stored in prompts/ directory (D013)
  • Templates use XML-style content fencing
  • Editable without code changes — pipeline reads from disk at runtime
  • SHA-256 hashes tracked in TechniquePageVersion.pipeline_metadata for reproducibility
  • Re-process after prompt edits via POST /admin/pipeline/trigger/{video_id}
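
The hash tracked in pipeline_metadata can be sketched as a plain SHA-256 over the on-disk template text, which pins exactly which prompt produced a given page version. The function name is an assumption.

```python
import hashlib

def prompt_sha256(template_text: str) -> str:
    """SHA-256 hex digest of a prompt template, for reproducibility tracking."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()
```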

Pipeline Admin Features

  • Debug mode: Redis-backed toggle captures full LLM I/O (system prompt, user prompt, response) in pipeline_events
  • Token tracking: Per-event and per-video token usage visible in admin UI
  • Stale page detection: Identifies pages needing regeneration
  • Bulk operations: Bulk resynthesize, wipe all output, reindex all
  • Worker status: Real-time Celery worker health check

Prompt Quality Toolkit

CLI tool (python -m pipeline.quality) with:

  • LLM fitness suite — 9 tests (Mandelbrot reasoning, JSON compliance, instruction following)
  • 5-dimension quality scorer with voice preservation dial
  • Automated prompt A/B optimization loop — LLM-powered variant generation, iterative scoring, leaderboard
  • Multi-stage support for pipeline stages 2-5 with per-stage rubrics and fixtures

Shorts Pipeline Modules (M024/S04)

caption_generator.py

Converts Whisper word-level timings to ASS (Advanced SubStation Alpha) subtitles with karaoke highlighting.

  • generate_ass_captions() — main entry point, produces ASS file with \k tags for word-by-word highlighting
  • Clip-relative timing — offsets word timestamps by clip_start for accurate subtitle sync
  • Non-blocking — failures log WARNING, never fail the parent short generation stage
  • 17 unit tests covering time formatting, ASS structure, clip offset math, karaoke duration, empty/whitespace handling, custom styles, negative time clamping
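
The karaoke assembly above can be sketched as follows. ASS `\k` tags take a highlight duration in centiseconds, so each word contributes `(end - start) * 100`; timestamps are offset by clip_start and clamped at zero, as the bullets describe. The function name and word-dict shape are assumptions.

```python
def karaoke_line(words, clip_start: float) -> str:
    """Build one ASS dialogue text with per-word \\k karaoke tags."""
    parts = []
    for w in words:  # each w: {"word": str, "start": float, "end": float}
        start = max(w["start"] - clip_start, 0.0)  # clip-relative, clamped
        end = max(w["end"] - clip_start, 0.0)
        centis = max(int(round((end - start) * 100)), 0)  # \k unit: 10 ms
        parts.append(f"{{\\k{centis}}}{w['word']}")
    return "".join(parts)
```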

card_renderer.py

ffmpeg-based intro/outro card generation and concatenation pipeline.

  • render_card() — builds lavfi command using color + drawtext filters
  • render_card_to_file() — executes ffmpeg to produce card video segment
  • build_concat_list() — writes file manifest for ffmpeg concat demuxer
  • concat_segments() — runs ffmpeg concat demuxer with -c copy
  • parse_template_config() — JSONB normalizer with defaults for missing/null fields
  • Cards include silent audio track via anullsrc for codec-compatible concat with audio main clips
  • Non-blocking — card render failures log WARNING, shorts proceed without intro/outro
  • 28 unit tests
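
The concat manifest can be sketched as below: ffmpeg's concat demuxer reads a text file with one `file '<path>'` line per segment, which the `-f concat -c copy` step then consumes. The function name differs from build_concat_list and the quote escaping follows ffmpeg's single-quote convention; both are assumptions for illustration.

```python
from pathlib import Path

def build_concat_manifest(segments: list, out_path: str) -> str:
    """Write an ffmpeg concat-demuxer manifest listing segment files in order."""
    lines = []
    for seg in segments:
        escaped = seg.replace("'", "'\\''")  # ffmpeg-style single-quote escape
        lines.append(f"file '{escaped}'")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```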

shorts_generator.py Updates (M024/S01, S04)

  • extract_clip() accepts optional ass_path for subtitle burn-in via ffmpeg -vf ass= filter
  • extract_clip_with_template() — orchestrates intro/main/outro concatenation using card_renderer
  • stage_generate_shorts now loads transcripts for captions, loads creator templates for cards, and generates a share_token on completion via secrets.token_urlsafe(8)

Key Design Decisions

  • Sync clients in Celery (D004): openai.OpenAI, QdrantClient, sync SQLAlchemy. Avoids nested event loop errors.
  • Non-blocking embedding (D005): Stage 6 failures don't block core pipeline output.
  • Redis for stage 4 data: Classification results in Redis with 24h TTL, not DB columns.
  • Best-effort versioning (D018): Version snapshot failure doesn't block page update.

Transcript Watcher

Standalone service (watcher.py) monitors /vmPool/r/services/chrysopedia_watch/ for new transcript JSON files:

  • Uses watchdog.observers.polling.PollingObserver for ZFS reliability
  • Validates file structure, waits for size stability (handles partial SCP writes)
  • POSTs to ingest API on file detection
  • Moves processed files to processed/, failures to failed/ with .error sidecar
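
The size-stability check can be sketched as below: a file copied over SCP may still be growing when first observed, so the watcher re-reads its size until two consecutive reads agree before POSTing to the ingest API. The function name and retry parameters are assumptions.

```python
import os
import time

def wait_for_stable_size(path: str, interval: float = 1.0, attempts: int = 30) -> bool:
    """Return True once the file's size stops changing between polls."""
    last = -1
    for _ in range(attempts):
        size = os.path.getsize(path)
        if size == last:
            return True  # unchanged across one interval: treat as fully written
        last = size
        time.sleep(interval)
    return False  # never stabilized within the allotted attempts
```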

See also: Architecture, Data-Model, Deployment