Pipeline
jlightner edited this page 2026-04-04 06:55:06 -05:00
Table of Contents
- Pipeline
- Pipeline Stages
- Stage Details
- Stage 1: Transcript Segmentation
- Stage 2: Key Moment Extraction
- Stage 4: Classification & Tagging
- Stage 5: Technique Page Synthesis
- Stage 6: Embed & Index
- Stage 7: Highlight Detection (M021/S04)
- LLM Configuration
- Prompt Template System
- Pipeline Admin Features
- Prompt Quality Toolkit
- Shorts Pipeline Modules (M024/S04)
- Key Design Decisions
- Transcript Watcher
Pipeline
A 6-stage, LLM-powered extraction pipeline that transforms video transcripts into structured technique articles.
Pipeline Stages
Video File
↓
[Desktop] Whisper large-v3 (RTX 4090) → transcript JSON
↓
[Watcher/API] Ingest → SourceVideo + TranscriptSegments in PostgreSQL
↓
Stage 1: Transcript Segmentation — chunk transcript into logical segments
↓
Stage 2: Key Moment Extraction — identify teachable moments with timestamps
↓
Stage 3: (reserved)
↓
Stage 4: Classification & Tagging — assign topic_category + topic_tags per moment
↓
Stage 5: Technique Page Synthesis — compose study guide articles from moments
↓
Stage 6: Embed & Index — generate embeddings, upsert to Qdrant (non-blocking)
Stage Details
Stage 1: Transcript Segmentation
- Chunks raw transcript into logical segments
- Input: TranscriptSegments from DB
- Output: Segmented data for stage 2
Stage 2: Key Moment Extraction
- Identifies teachable moments with titles, summaries, timestamps
- Uses LLM with prompt template from `prompts/` directory
- Output: KeyMoment records in PostgreSQL
Stage 4: Classification & Tagging
- Assigns topic_category and topic_tags to each key moment
- References canonical tag list (`canonical_tags.yaml`) with aliases
- Output: Classification data stored in Redis (`chrysopedia:classification:{video_id}`, 24h TTL)
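The Redis handoff above can be sketched as follows. This is a minimal illustration, not the project's actual code: `store_classification`, `load_classification`, and `classification_key` are hypothetical helper names, and `r` stands for any redis-py-compatible client.

```python
import json

CLASSIFICATION_TTL_SECONDS = 24 * 60 * 60  # the 24h TTL noted above


def classification_key(video_id: str) -> str:
    # Key pattern documented for stage 4 output
    return f"chrysopedia:classification:{video_id}"


def store_classification(r, video_id: str, moments: list) -> None:
    # SET with EX so the entry expires on its own; stage 5 reads it within 24h
    r.set(classification_key(video_id), json.dumps(moments),
          ex=CLASSIFICATION_TTL_SECONDS)


def load_classification(r, video_id: str):
    raw = r.get(classification_key(video_id))
    return json.loads(raw) if raw is not None else None
```

Keeping stage 4 output in Redis rather than DB columns (see Key Design Decisions) means expired intermediates clean themselves up.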
Stage 5: Technique Page Synthesis
- Composes study guide articles from classified key moments
- Handles multi-source merging: new video moments merge into existing technique pages
- Uses offset-based citation indexing (existing [0]-[N-1], new [N]-[N+M-1])
- Creates pre-overwrite version snapshot before mutating existing pages (D018)
- Output: TechniquePage records with body_sections (v2 format), signal_chains, plugins
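The offset-based citation indexing can be illustrated with a small sketch; `merge_citations` is a hypothetical name, and the real stage 5 merge logic is more involved.

```python
def merge_citations(existing_sources, new_sources):
    """Append new sources after the existing ones.

    Existing citations keep indices [0]..[N-1]; the M new sources take
    [N]..[N+M-1], so references in already-synthesized text stay valid.
    Returns the merged source list plus a map from each new source's
    local index to its global citation index.
    """
    offset = len(existing_sources)
    index_map = {i: offset + i for i in range(len(new_sources))}
    return existing_sources + new_sources, index_map
```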
Stage 6: Embed & Index
- Generates embeddings via Ollama (nomic-embed-text)
- Embedding text enriched with creator_name and topic_tags (D023)
- Upserts to Qdrant with deterministic UUIDs based on content
- Non-blocking: Failures log WARNING but don't fail the pipeline (D005)
- Can be re-triggered independently via `/admin/pipeline/reindex-all`
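Content-derived point IDs can be built with `uuid5`, as in this sketch. The `point_id_for` name and the namespace choice are assumptions for illustration, not the project's actual code.

```python
import uuid


def point_id_for(content: str) -> str:
    # uuid5 hashes the content under a fixed namespace, so the same text
    # always maps to the same Qdrant point ID: a re-run of stage 6 upserts
    # over the existing vector instead of creating a duplicate.
    # NAMESPACE_URL is a stand-in; any fixed UUID namespace works.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, content))
```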
Stage 7: Highlight Detection (M021/S04)
- Scores every KeyMoment in a video using 7 weighted heuristic dimensions
- Pure function scoring: duration_fitness (0.25), content_type (0.20), specificity_density (0.20), plugin_richness (0.10), transcript_energy (0.10), source_quality (0.10), video_type (0.05)
- Celery task `stage_highlight_detection` with `bind=True`, `max_retries=3`
- Bulk upserts via `INSERT ON CONFLICT` on named constraint `uq_highlight_candidate_moment`
- Output: HighlightCandidate records in PostgreSQL with composite score and per-dimension breakdown
- See Highlights for full scoring details and API endpoints
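The weighted composite can be sketched as a pure function over the seven dimension scores, assuming each is normalized to [0, 1]; `composite_score` is a hypothetical name for illustration.

```python
# Weights from the scoring heuristic above; they sum to 1.0
WEIGHTS = {
    "duration_fitness": 0.25,
    "content_type": 0.20,
    "specificity_density": 0.20,
    "plugin_richness": 0.10,
    "transcript_energy": 0.10,
    "source_quality": 0.10,
    "video_type": 0.05,
}


def composite_score(dimensions: dict) -> float:
    # Because the weights sum to 1.0, the composite stays in [0, 1]
    # whenever every dimension score does; missing dimensions count as 0.0
    return sum(w * dimensions.get(name, 0.0) for name, w in WEIGHTS.items())
```

Being a pure function of its inputs, the scorer is trivially unit-testable and safe to re-run, which fits the Celery retry policy above.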
LLM Configuration
| Setting | Value |
|---|---|
| Primary LLM | DGX Sparks Qwen (OpenAI-compatible API) |
| Fallback LLM | Local Ollama |
| Embedding model | nomic-embed-text (Ollama) |
| Model routing | Per-stage configuration (chat vs thinking models) |
Prompt Template System
- Prompt files stored in `prompts/` directory (D013)
- Templates use XML-style content fencing
- Editable without code changes — pipeline reads from disk at runtime
- SHA-256 hashes tracked in TechniquePageVersion.pipeline_metadata for reproducibility
- Re-process after prompt edits via `POST /admin/pipeline/trigger/{video_id}`
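The reproducibility hash is straightforward to compute; `prompt_hash` is a hypothetical helper name for this sketch.

```python
import hashlib
from pathlib import Path


def prompt_hash(path: Path) -> str:
    # Hash the exact on-disk bytes, so any edit to a prompt template yields
    # a new hash recorded in TechniquePageVersion.pipeline_metadata
    return hashlib.sha256(path.read_bytes()).hexdigest()
```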
Pipeline Admin Features
- Debug mode: Redis-backed toggle captures full LLM I/O (system prompt, user prompt, response) in pipeline_events
- Token tracking: Per-event and per-video token usage visible in admin UI
- Stale page detection: Identifies pages needing regeneration
- Bulk operations: Bulk resynthesize, wipe all output, reindex all
- Worker status: Real-time Celery worker health check
Prompt Quality Toolkit
CLI tool (`python -m pipeline.quality`) with:
- LLM fitness suite — 9 tests (Mandelbrot reasoning, JSON compliance, instruction following)
- 5-dimension quality scorer with voice preservation dial
- Automated prompt A/B optimization loop — LLM-powered variant generation, iterative scoring, leaderboard
- Multi-stage support for pipeline stages 2-5 with per-stage rubrics and fixtures
Shorts Pipeline Modules (M024/S04)
caption_generator.py
Converts Whisper word-level timings to ASS (Advanced SubStation Alpha) subtitles with karaoke highlighting.
- `generate_ass_captions()` — main entry point, produces ASS file with `\k` tags for word-by-word highlighting
- Clip-relative timing — offsets word timestamps by `clip_start` for accurate subtitle sync
- Non-blocking — failures log WARNING, never fail the parent short generation stage
- 17 unit tests covering time formatting, ASS structure, clip offset math, karaoke duration, empty/whitespace handling, custom styles, negative time clamping
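The clip-relative `\k` conversion can be sketched as below. `karaoke_line` is a hypothetical helper, not the module's actual API; ASS `\k` durations are in centiseconds.

```python
def karaoke_line(words, clip_start: float) -> str:
    """Render one ASS dialogue line with per-word \\k karaoke tags.

    `words` items look like Whisper word timings, in absolute seconds:
    {"word": "kick", "start": 12.3, "end": 12.7}
    """
    parts = []
    for w in words:
        start = max(0.0, w["start"] - clip_start)  # clip-relative, clamped at 0
        end = max(start, w["end"] - clip_start)
        centiseconds = round((end - start) * 100)  # \k counts centiseconds
        parts.append(f"{{\\k{centiseconds}}}{w['word']}")
    return " ".join(parts)
```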
card_renderer.py
ffmpeg-based intro/outro card generation and concatenation pipeline.
- `render_card()` — builds lavfi command using `color` + `drawtext` filters
- `render_card_to_file()` — executes ffmpeg to produce card video segment
- `build_concat_list()` — writes file manifest for ffmpeg concat demuxer
- `concat_segments()` — runs ffmpeg concat demuxer with `-c copy`
- `parse_template_config()` — JSONB normalizer with defaults for missing/null fields
- Cards include silent audio track via `anullsrc` for codec-compatible concat with main clips that carry audio
- Non-blocking — card render failures log WARNING, shorts proceed without intro/outro
- 28 unit tests
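The manifest step of the concat pipeline can be sketched as follows. `build_concat_manifest` is a hypothetical stand-in for `build_concat_list()`; escaping of quotes in unusual paths is omitted here.

```python
from pathlib import Path


def build_concat_manifest(segment_paths, out_path: Path) -> Path:
    # The ffmpeg concat demuxer reads one "file '<path>'" line per segment;
    # the manifest is then consumed with:
    #   ffmpeg -f concat -safe 0 -i list.txt -c copy out.mp4
    lines = [f"file '{p}'" for p in segment_paths]
    out_path.write_text("\n".join(lines) + "\n")
    return out_path
```

Using the concat demuxer with `-c copy` avoids re-encoding, which is why every segment (including cards, via `anullsrc`) must share compatible codecs and stream layouts.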
shorts_generator.py Updates (M024/S01, S04)
- `extract_clip()` accepts optional `ass_path` for subtitle burn-in via ffmpeg `-vf ass=` filter
- `extract_clip_with_template()` — orchestrates intro/main/outro concatenation using card_renderer
- `stage_generate_shorts` now: loads transcripts for captions, loads creator templates for cards, generates `share_token` on completion via `secrets.token_urlsafe(8)`
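The `share_token` call is a one-liner; this sketch (with a hypothetical `make_share_token` wrapper) just shows the shape of the output.

```python
import secrets


def make_share_token() -> str:
    # 8 random bytes, base64url-encoded with padding stripped -> 11 chars
    # drawn from the URL-safe alphabet (A-Z, a-z, 0-9, '-', '_')
    return secrets.token_urlsafe(8)
```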
Key Design Decisions
- Sync clients in Celery (D004): openai.OpenAI, QdrantClient, sync SQLAlchemy. Avoids nested event loop errors.
- Non-blocking embedding (D005): Stage 6 failures don't block core pipeline output.
- Redis for stage 4 data: Classification results in Redis with 24h TTL, not DB columns.
- Best-effort versioning (D018): Version snapshot failure doesn't block page update.
Transcript Watcher
Standalone service (`watcher.py`) monitors `/vmPool/r/services/chrysopedia_watch/` for new transcript JSON files:
- Uses `watchdog.observers.polling.PollingObserver` for ZFS reliability
- Validates file structure, waits for size stability (handles partial SCP writes)
- POSTs to ingest API on file detection
- Moves processed files to `processed/`, failures to `failed/` with `.error` sidecar
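The size-stability wait can be sketched as a polling loop; `wait_for_stable_size` is a hypothetical name, and the real watcher's poll counts and intervals may differ.

```python
import os
import time


def wait_for_stable_size(path: str, checks: int = 3, interval: float = 1.0) -> bool:
    """Return True once the file size is unchanged across `checks` polls.

    Guards against ingesting a transcript mid-SCP: a partially written
    file keeps growing, so we proceed only after consecutive identical
    size readings. Returns False if the file disappears.
    """
    last = -1
    stable = 0
    while stable < checks:
        try:
            size = os.path.getsize(path)
        except OSError:
            return False  # file vanished or became unreadable
        if size == last:
            stable += 1
        else:
            stable = 1
            last = size
        if stable < checks:
            time.sleep(interval)
    return True
```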
See also: Architecture, Data-Model, Deployment