3 Pipeline
jlightner edited this page 2026-04-04 06:55:06 -05:00


A multi-stage, LLM-powered extraction pipeline that transforms video transcripts into structured technique articles (stage 3 is currently reserved).

Pipeline Stages

Video File
    ↓
[Desktop] Whisper large-v3 (RTX 4090) → transcript JSON
    ↓
[Watcher/API] Ingest → SourceVideo + TranscriptSegments in PostgreSQL
    ↓
Stage 1: Transcript Segmentation — chunk transcript into logical segments
    ↓
Stage 2: Key Moment Extraction — identify teachable moments with timestamps
    ↓
Stage 3: (reserved)
    ↓
Stage 4: Classification & Tagging — assign topic_category + topic_tags per moment
    ↓
Stage 5: Technique Page Synthesis — compose study guide articles from moments
    ↓
Stage 6: Embed & Index — generate embeddings, upsert to Qdrant (non-blocking)
    ↓
Stage 7: Highlight Detection — score key moments as highlight candidates

Stage Details

Stage 1: Transcript Segmentation

  • Chunks raw transcript into logical segments
  • Input: TranscriptSegments from DB
  • Output: Segmented data for stage 2

Stage 2: Key Moment Extraction

  • Identifies teachable moments with titles, summaries, timestamps
  • Uses LLM with prompt template from prompts/ directory
  • Output: KeyMoment records in PostgreSQL

Stage 4: Classification & Tagging

  • Assigns topic_category and topic_tags to each key moment
  • References canonical tag list (canonical_tags.yaml) with aliases
  • Output: Classification data stored in Redis (chrysopedia:classification:{video_id}, 24h TTL)
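
The Redis write described above can be sketched as follows. This assumes a redis-py style client (anything exposing `.set`); the function name and return value are illustrative, not the project's actual code.

```python
import json

# Hypothetical sketch of how stage 4 might persist classification output,
# using the key pattern and 24h TTL described above.
CLASSIFICATION_TTL = 24 * 3600  # 24 hours, in seconds

def store_classification(client, video_id: str, results: dict) -> str:
    """Write per-video classification results under an expiring Redis key."""
    key = f"chrysopedia:classification:{video_id}"
    client.set(key, json.dumps(results), ex=CLASSIFICATION_TTL)
    return key
```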

Stage 5: Technique Page Synthesis

  • Composes study guide articles from classified key moments
  • Handles multi-source merging: new video moments merge into existing technique pages
  • Uses offset-based citation indexing (existing [0]-[N-1], new [N]-[N+M-1])
  • Creates pre-overwrite version snapshot before mutating existing pages (D018)
  • Output: TechniquePage records with body_sections (v2 format), signal_chains, plugins
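
The offset-based citation scheme above can be illustrated with a minimal sketch: existing citations keep indices [0]..[N-1], and markers in newly synthesized text are shifted by N so the merged page has one contiguous index space. The function name and source representation are assumptions for illustration.

```python
import re

def merge_citations(existing_sources: list, new_sources: list, new_body: str):
    """Shift [i] citation markers in new text by the count of existing sources."""
    offset = len(existing_sources)  # N: first index available for new sources
    shifted = re.sub(
        r"\[(\d+)\]",
        lambda m: f"[{int(m.group(1)) + offset}]",
        new_body,
    )
    return existing_sources + new_sources, shifted
```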

Stage 6: Embed & Index

  • Generates embeddings via Ollama (nomic-embed-text)
  • Embedding text enriched with creator_name and topic_tags (D023)
  • Upserts to Qdrant with deterministic UUIDs based on content
  • Non-blocking: Failures log WARNING but don't fail the pipeline (D005)
  • Can be re-triggered independently via /admin/pipeline/reindex-all
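
Deterministic content-based UUIDs can be sketched with `uuid.uuid5`: hashing the content into the point ID means re-indexing the same moment overwrites the same Qdrant point instead of creating a duplicate. The namespace seed here is an assumption, not the project's actual value.

```python
import uuid

# Derive a project namespace from a fixed seed (hypothetical seed string).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "chrysopedia")

def point_id(content: str) -> str:
    """Deterministic UUID for a Qdrant point, derived from its content."""
    return str(uuid.uuid5(NAMESPACE, content))
```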

Stage 7: Highlight Detection (M021/S04)

  • Scores every KeyMoment in a video using 7 weighted heuristic dimensions
  • Pure function scoring: duration_fitness (0.25), content_type (0.20), specificity_density (0.20), plugin_richness (0.10), transcript_energy (0.10), source_quality (0.10), video_type (0.05)
  • Celery task stage_highlight_detection with bind=True, max_retries=3
  • Bulk upserts via INSERT ON CONFLICT on named constraint uq_highlight_candidate_moment
  • Output: HighlightCandidate records in PostgreSQL with composite score and per-dimension breakdown
  • See Highlights for full scoring details and API endpoints
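
The pure-function composite score can be sketched directly from the weights listed above. Each dimension is assumed to be pre-normalized to [0, 1]; the dict-based call shape is an assumption for illustration.

```python
# Weights from the bullet list above; they sum to 1.0.
WEIGHTS = {
    "duration_fitness": 0.25,
    "content_type": 0.20,
    "specificity_density": 0.20,
    "plugin_richness": 0.10,
    "transcript_energy": 0.10,
    "source_quality": 0.10,
    "video_type": 0.05,
}

def composite_score(dimensions: dict) -> float:
    """Weighted sum over the seven heuristic dimensions (missing dims score 0)."""
    return sum(WEIGHTS[name] * dimensions.get(name, 0.0) for name in WEIGHTS)
```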

LLM Configuration

  • Primary LLM: DGX Sparks Qwen (OpenAI-compatible API)
  • Fallback LLM: Local Ollama
  • Embedding model: nomic-embed-text (Ollama)
  • Model routing: per-stage configuration (chat vs. thinking models)
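
Primary/fallback routing can be sketched as below, assuming both endpoints speak the OpenAI-compatible chat API. The client objects, model names, and error handling are illustrative, not the project's actual configuration.

```python
def chat_with_fallback(primary, fallback, messages,
                       primary_model="qwen", fallback_model="llama3"):
    """Try the primary endpoint; on any failure, retry against local Ollama."""
    try:
        return primary.chat.completions.create(
            model=primary_model, messages=messages
        )
    except Exception:
        # Any primary failure (network, timeout, 5xx) falls through here.
        return fallback.chat.completions.create(
            model=fallback_model, messages=messages
        )
```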

Prompt Template System

  • Prompt files stored in prompts/ directory (D013)
  • Templates use XML-style content fencing
  • Editable without code changes — pipeline reads from disk at runtime
  • SHA-256 hashes tracked in TechniquePageVersion.pipeline_metadata for reproducibility
  • Re-process after prompt edits via POST /admin/pipeline/trigger/{video_id}
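
The hash tracked in pipeline_metadata can be sketched as a plain SHA-256 over the on-disk template text, which pins exactly which prompt produced a given page version. The function name is an assumption.

```python
import hashlib

def prompt_sha256(template_text: str) -> str:
    """SHA-256 hex digest of a prompt template, for reproducibility tracking."""
    return hashlib.sha256(template_text.encode("utf-8")).hexdigest()
```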

Pipeline Admin Features

  • Debug mode: Redis-backed toggle captures full LLM I/O (system prompt, user prompt, response) in pipeline_events
  • Token tracking: Per-event and per-video token usage visible in admin UI
  • Stale page detection: Identifies pages needing regeneration
  • Bulk operations: Bulk resynthesize, wipe all output, reindex all
  • Worker status: Real-time Celery worker health check

Prompt Quality Toolkit

CLI tool (python -m pipeline.quality) with:

  • LLM fitness suite — 9 tests (Mandelbrot reasoning, JSON compliance, instruction following)
  • 5-dimension quality scorer with voice preservation dial
  • Automated prompt A/B optimization loop — LLM-powered variant generation, iterative scoring, leaderboard
  • Multi-stage support for pipeline stages 2-5 with per-stage rubrics and fixtures

Shorts Pipeline Modules (M024/S04)

caption_generator.py

Converts Whisper word-level timings to ASS (Advanced SubStation Alpha) subtitles with karaoke highlighting.

  • generate_ass_captions() — main entry point, produces ASS file with \k tags for word-by-word highlighting
  • Clip-relative timing — offsets word timestamps by clip_start for accurate subtitle sync
  • Non-blocking — failures log WARNING, never fail the parent short generation stage
  • 17 unit tests covering time formatting, ASS structure, clip offset math, karaoke duration, empty/whitespace handling, custom styles, negative time clamping
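
The karaoke assembly above can be sketched as follows. ASS `\k` tags take a highlight duration in centiseconds, so each word contributes `(end - start) * 100`; timestamps are offset by clip_start and clamped at zero, as the bullets describe. The function name and word-dict shape are assumptions.

```python
def karaoke_line(words, clip_start: float) -> str:
    """Build one ASS dialogue text with per-word \\k karaoke tags."""
    parts = []
    for w in words:  # each w: {"word": str, "start": float, "end": float}
        start = max(w["start"] - clip_start, 0.0)  # clip-relative, clamped
        end = max(w["end"] - clip_start, 0.0)
        centis = max(int(round((end - start) * 100)), 0)  # \k unit: 10 ms
        parts.append(f"{{\\k{centis}}}{w['word']}")
    return "".join(parts)
```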

card_renderer.py

ffmpeg-based intro/outro card generation and concatenation pipeline.

  • render_card() — builds lavfi command using color + drawtext filters
  • render_card_to_file() — executes ffmpeg to produce card video segment
  • build_concat_list() — writes file manifest for ffmpeg concat demuxer
  • concat_segments() — runs ffmpeg concat demuxer with -c copy
  • parse_template_config() — JSONB normalizer with defaults for missing/null fields
  • Cards include silent audio track via anullsrc for codec-compatible concat with audio main clips
  • Non-blocking — card render failures log WARNING, shorts proceed without intro/outro
  • 28 unit tests
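
The concat manifest can be sketched as below: ffmpeg's concat demuxer reads a text file with one `file '<path>'` line per segment, which the `-f concat -c copy` step then consumes. The function name differs from build_concat_list and the quote escaping follows ffmpeg's single-quote convention; both are assumptions for illustration.

```python
from pathlib import Path

def build_concat_manifest(segments: list, out_path: str) -> str:
    """Write an ffmpeg concat-demuxer manifest listing segment files in order."""
    lines = []
    for seg in segments:
        escaped = seg.replace("'", "'\\''")  # ffmpeg-style single-quote escape
        lines.append(f"file '{escaped}'")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out_path
```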

shorts_generator.py Updates (M024/S01, S04)

  • extract_clip() accepts optional ass_path for subtitle burn-in via ffmpeg -vf ass= filter
  • extract_clip_with_template() — orchestrates intro/main/outro concatenation using card_renderer
  • stage_generate_shorts now loads transcripts for captions, loads creator templates for cards, and generates a share_token on completion via secrets.token_urlsafe(8)

Key Design Decisions

  • Sync clients in Celery (D004): openai.OpenAI, QdrantClient, sync SQLAlchemy. Avoids nested event loop errors.
  • Non-blocking embedding (D005): Stage 6 failures don't block core pipeline output.
  • Redis for stage 4 data: Classification results in Redis with 24h TTL, not DB columns.
  • Best-effort versioning (D018): Version snapshot failure doesn't block page update.

Transcript Watcher

Standalone service (watcher.py) monitors /vmPool/r/services/chrysopedia_watch/ for new transcript JSON files:

  • Uses watchdog.observers.polling.PollingObserver for ZFS reliability
  • Validates file structure, waits for size stability (handles partial SCP writes)
  • POSTs to ingest API on file detection
  • Moves processed files to processed/, failures to failed/ with .error sidecar
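
The size-stability check can be sketched as below: a file copied over SCP may still be growing when first observed, so the watcher re-reads its size until two consecutive reads agree before POSTing to the ingest API. The function name and retry parameters are assumptions.

```python
import os
import time

def wait_for_stable_size(path: str, interval: float = 1.0, attempts: int = 30) -> bool:
    """Return True once the file's size stops changing between polls."""
    last = -1
    for _ in range(attempts):
        size = os.path.getsize(path)
        if size == last:
            return True  # unchanged across one interval: treat as fully written
        last = size
        time.sleep(interval)
    return False  # never stabilized within the allotted attempts
```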

See also: Architecture, Data-Model, Deployment