xpltdco/chrysopedia

Highlight Detection

Heuristic scoring engine that ranks KeyMoment records into highlight candidates using 10 weighted dimensions. Originally added in M021/S04 with 7 dimensions, expanded to 10 in M022/S05.

Overview

Highlight detection scores every KeyMoment in a video to identify the most "highlightable" segments — moments that would work well as standalone clips or featured content. The scoring is a pure function (no ML model, no external API) based on 10 dimensions derived from existing KeyMoment metadata and word-level transcript timing data.

Scoring Dimensions

The weights sum to 1.0; the three audio proxy weights (marked ~) are approximate. Each dimension produces a score from 0.0 to 1.0.

Dimension	Weight	What It Measures
`duration_fitness`	0.20	Piecewise linear curve peaking at 30–60 seconds (ideal clip length)
`content_type`	0.16	Content type favorability: tutorial > tip > walkthrough > exploration
`specificity_density`	0.16	Regex-based counting of specific units, ratios, and named parameters in summary text
`plugin_richness`	0.08	Number of plugins/VSTs referenced (more = more actionable)
`transcript_energy`	0.08	Teaching-phrase detection in transcript text (e.g., "the trick is", "key thing")
`source_quality`	0.08	Source quality rating: high=1.0, medium=0.6, low=0.3
`video_type`	0.02	Video type favorability mapping
`speech_rate_variance`	~0.07	Coefficient of variation of words-per-second in 5s sliding windows
`pause_density`	~0.08	Count and weight of inter-word gaps (>0.5s short, >1.0s long)
`speaking_pace`	~0.07	Bell-curve fitness around optimal 3–5 WPS teaching pace
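The table above implies a weighted sum. The following is a minimal sketch of that combination, not the project's actual code: `WEIGHTS` and `composite_score` are hypothetical names, and the three audio weights use the approximate values from the table (0.07, 0.08, 0.07), which may differ from the exact values in `highlight_scorer.py`.

```python
# Hypothetical weight table copied from the documentation above.
# The last three values are approximate ("~" in the table).
WEIGHTS = {
    "duration_fitness": 0.20,
    "content_type": 0.16,
    "specificity_density": 0.16,
    "plugin_richness": 0.08,
    "transcript_energy": 0.08,
    "source_quality": 0.08,
    "video_type": 0.02,
    "speech_rate_variance": 0.07,
    "pause_density": 0.08,
    "speaking_pace": 0.07,
}


def composite_score(breakdown: dict) -> float:
    """Combine per-dimension scores (each 0.0-1.0) into one weighted score."""
    total = sum(weight * breakdown[name] for name, weight in WEIGHTS.items())
    # Clamp to guard against floating-point drift just outside [0, 1].
    return max(0.0, min(1.0, total))
```

Because the weights sum to 1.0, a moment that scores 0.5 on every dimension gets a composite score of 0.5.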

Audio Proxy Dimensions (M022/S05)

The three new dimensions (speech_rate_variance, pause_density, speaking_pace) are derived from word-level transcript timing data — not raw audio. This provides meaningful speech-pattern signals without requiring librosa or audio processing dependencies.

Neutral fallback: When word_timings is unavailable (the transcript has no word-level data), all three audio proxy dimensions default to 0.5, a neutral score. Moments without timing data are therefore not penalized, and callers that do not pass word_timings keep working. To make room for the audio dimensions' combined weight of 0.22, the weights of the original 7 dimensions were reduced proportionally (D041).
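As an illustration of the neutral fallback and of one audio proxy, here is a rough sketch of a speech-rate-variance score. It is not the project's implementation: the function name, the `"start"` key on each word dict, the 2.5s window step, and the choice to clamp the coefficient of variation at 1.0 are all assumptions made for the example.

```python
from statistics import mean, pstdev

NEUTRAL_SCORE = 0.5  # documented fallback when word timings are missing


def speech_rate_variance_sketch(word_timings, window=5.0, step=2.5):
    """Coefficient of variation of words-per-second over sliding windows.

    Assumes each word is a dict with a "start" time in seconds.
    The step size and the clamp to [0, 1] are illustrative choices.
    """
    if not word_timings:
        return NEUTRAL_SCORE

    starts = sorted(word["start"] for word in word_timings)
    rates = []
    t = starts[0]
    while t <= starts[-1]:
        # Count words whose start time falls inside [t, t + window).
        count = sum(1 for s in starts if t <= s < t + window)
        rates.append(count / window)
        t += step

    average = mean(rates)
    if average == 0:
        return NEUTRAL_SCORE
    return min(pstdev(rates) / average, 1.0)
```

Passing `None` or an empty list returns the neutral 0.5, which mirrors the fallback behaviour described above.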

Duration Fitness Curve

Uses a piecewise linear curve (not a Gaussian) for predictability:

0–10s → low score (too short)
10–30s → ramp up
30–60s → peak score (1.0)
60–120s → gradual decline
120s+ → low score (too long for a highlight)
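The breakpoints above can be sketched as a small function. The breakpoints (10, 30, 60, 120 seconds) and the peak of 1.0 come from the list; the value used for "low score" (0.1 here) and the exact slopes are assumptions, since this page does not state them.

```python
LOW_SCORE = 0.1  # assumed value for "low score"; not specified on this page


def duration_fitness_sketch(seconds: float) -> float:
    """Piecewise linear fitness that peaks for 30-60 second moments."""
    if seconds < 10:
        return LOW_SCORE  # too short
    if seconds < 30:
        # Ramp up linearly from LOW_SCORE at 10s to 1.0 at 30s.
        return LOW_SCORE + (1.0 - LOW_SCORE) * (seconds - 10) / 20
    if seconds <= 60:
        return 1.0  # ideal clip length
    if seconds < 120:
        # Decline linearly from 1.0 at 60s to LOW_SCORE at 120s.
        return 1.0 - (1.0 - LOW_SCORE) * (seconds - 60) / 60
    return LOW_SCORE  # too long for a highlight
```

With these assumed values, a 20-second and a 90-second moment both land halfway up their respective slopes.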

Data Model

HighlightCandidate

Field	Type	Notes
id	UUID PK
key_moment_id	FK → KeyMoment	Unique constraint (`highlight_candidates_key_moment_id_key`)
source_video_id	FK → SourceVideo	Indexed
score	Float	Composite score 0.0–1.0
score_breakdown	JSONB	Per-dimension scores (10 fields)
duration_secs	Float	Cached from KeyMoment for display
status	Enum(HighlightStatus)	candidate / approved / rejected
trim_start	Float	Nullable — trim start offset in seconds (M022/S01)
trim_end	Float	Nullable — trim end offset in seconds (M022/S01)
created_at	Timestamp
updated_at	Timestamp

HighlightStatus Enum

Value	Meaning
`candidate`	Scored but not reviewed
`approved`	Approved as a highlight (by an admin or the owning creator)
`rejected`	Rejected (by an admin or the owning creator)
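For orientation, the two tables above can be mirrored in plain Python. This is a dependency-free sketch using dataclasses, not the SQLAlchemy model in `backend/models.py`; the foreign keys are assumed to be UUIDs, and the default values are illustrative.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4


class HighlightStatus(str, Enum):
    CANDIDATE = "candidate"  # scored but not reviewed
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class HighlightCandidate:
    key_moment_id: UUID        # unique per KeyMoment (assumed UUID type)
    source_video_id: UUID      # assumed UUID type
    score: float               # composite score, 0.0-1.0
    score_breakdown: dict      # per-dimension scores (10 fields)
    duration_secs: float       # cached from the KeyMoment
    status: HighlightStatus = HighlightStatus.CANDIDATE
    trim_start: Optional[float] = None  # seconds, nullable
    trim_end: Optional[float] = None    # seconds, nullable
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```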

Database Indexes

source_video_id — filter by video
score DESC — rank ordering
status — filter by review state

Migrations

019_add_highlight_candidates.py — Creates table with indexes and unique constraint
021_add_highlight_trim_columns.py — Adds trim_start and trim_end columns (M022/S01)

API Endpoints

Admin Endpoints

All admin endpoints live under /api/v1/admin/highlights/ and require admin access. Paths in the table below are shown without the /api/v1 prefix.

Method	Path	Purpose
POST	`/admin/highlights/detect/{video_id}`	Score all KeyMoments for a video, upsert candidates
POST	`/admin/highlights/detect-all`	Score all videos (triggers Celery tasks)
GET	`/admin/highlights/candidates`	Paginated candidate list, sorted by score DESC
GET	`/admin/highlights/candidates/{id}`	Single candidate with full `score_breakdown`

Creator Endpoints (M022/S01)

Creator-scoped highlight review. Requires JWT auth with creator ownership verification.

Method	Path	Purpose
GET	`/api/v1/creator/highlights`	List highlights for authenticated creator (status/shorts_only filters, score DESC)
GET	`/api/v1/creator/highlights/{id}`	Detail with score_breakdown and key_moment
PATCH	`/api/v1/creator/highlights/{id}/status`	Update status (approve/reject) with ownership verification
PATCH	`/api/v1/creator/highlights/{id}/trim`	Update trim_start/trim_end (validation: non-negative, start < end)
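The trim validation rule (non-negative, start < end) can be expressed in a few lines. This is a sketch of the rule only: `validate_trim` is a hypothetical helper name, and the real endpoint may perform the check in a Pydantic schema or elsewhere.

```python
def validate_trim(trim_start: float, trim_end: float) -> None:
    """Raise ValueError unless 0 <= trim_start < trim_end."""
    if trim_start < 0 or trim_end < 0:
        raise ValueError("trim_start and trim_end must be non-negative")
    if trim_start >= trim_end:
        raise ValueError("trim_start must be less than trim_end")
```

For example, `validate_trim(2.0, 14.5)` passes silently, while `validate_trim(10.0, 10.0)` raises because the trimmed range would be empty.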

Score Breakdown Response

{
  "duration_fitness": 0.95,
  "content_type_weight": 0.80,
  "specificity_density": 0.72,
  "plugin_richness": 0.60,
  "transcript_energy": 0.85,
  "source_quality_weight": 1.00,
  "video_type_weight": 0.50,
  "speech_rate_variance_score": 0.057,
  "pause_density_score": 0.0,
  "speaking_pace_score": 1.0
}

Pipeline Integration

Celery Task: `stage_highlight_detection`

Binding: bind=True, max_retries=3
Session: Uses _get_sync_session (sync SQLAlchemy, per D004)
Flow: Load KeyMoments for video → load transcript JSON → extract word timings per moment → score each via score_moment() → bulk upsert via INSERT ON CONFLICT on constraint highlight_candidates_key_moment_id_key
Transcript handling: Loads transcript JSON once per video via SourceVideo.transcript_path. Accepts both {segments: [...]} and bare [...] JSON formats.
Fallback: If transcript is missing or malformed, word_timings=None and scorer uses neutral values for audio dimensions
Events: Emits pipeline_events rows for start/complete/error with candidate count in payload
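The transcript handling and fallback behaviour described above might look roughly like this. It is a minimal sketch rather than the task's code: `load_segments` is a hypothetical helper, and returning `None` stands in for the "missing or malformed" case that leads to `word_timings=None`.

```python
import json
from pathlib import Path


def load_segments(transcript_path):
    """Return the transcript's segment list, or None if unusable.

    Accepts both {"segments": [...]} and a bare [...] top-level list.
    """
    try:
        data = json.loads(Path(transcript_path).read_text(encoding="utf-8"))
    except (OSError, ValueError, TypeError):
        # Missing file, invalid JSON, or no path at all.
        return None

    if isinstance(data, dict):
        data = data.get("segments")
    return data if isinstance(data, list) else None
```

Loading once per video and passing the result to each per-moment extraction keeps file I/O out of the scoring loop.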

Scoring Function

score_moment() in backend/pipeline/highlight_scorer.py is a pure function: no DB access, no side effects. It takes a KeyMoment-like dict and an optional word_timings list and returns (score, breakdown_dict). This separation makes unit testing easy (62 tests that run in 0.09s).

Word Timing Extraction

extract_word_timings() filters word-level timing dicts from transcript JSON by time window. Used by the Celery task to extract timings per KeyMoment before scoring.
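A time-window filter of this kind could be sketched as follows. The transcript shape is an assumption (segments that carry a `"words"` list whose items have `"start"` and `"end"` in seconds), and the function is named differently from the real one to make clear it is only an illustration.

```python
def extract_word_timings_sketch(segments, start: float, end: float) -> list:
    """Collect word dicts that fall entirely inside [start, end].

    Assumes each segment is a dict with an optional "words" list and
    each word dict has numeric "start" and "end" keys.
    """
    words = []
    for segment in segments or []:
        for word in segment.get("words", []):
            if word["start"] >= start and word["end"] <= end:
                words.append(word)
    return words
```

Requiring a word to lie fully inside the window is one possible policy; including words that merely overlap the boundary would be an equally reasonable choice.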

Frontend: Highlight Review Queue (M022/S01)

Route: /creator/highlights (JWT-protected, lazy-loaded)

Components

Filter tabs — All / Shorts / Approved / Rejected
Candidate cards — Key moment title, duration, composite score, status badge
Score breakdown bars — Visual bars for each of the 10 scoring dimensions (fetched lazily on expand)
Action buttons — Approve / Discard with ownership verification
Inline trim panel — Validated number inputs for trim_start / trim_end
Sidebar link — Star icon in creator dashboard SidebarNav

Design Decisions

Pure function scoring — No DB or side effects in score_moment(), enabling fast unit tests
Piecewise linear duration — Predictable behavior vs. Gaussian bell curve
Neutral fallback at 0.5 — New audio dimensions don't penalize moments without word-level timing data (D041)
Proportional weight reduction — Original 7 dimensions reduced proportionally to make room for 0.22 audio weight
Lazy detail fetch — Score breakdown fetched on expand, not on list load (avoids N+1)
Creator-scoped router — Ownership verification pattern reusable for future creator endpoints

Key Files

backend/pipeline/highlight_scorer.py — Pure scoring function with 10 dimensions, word timing extraction
backend/pipeline/highlight_schemas.py — Pydantic schemas (HighlightScoreBreakdown with 10 fields)
backend/pipeline/stages.py — stage_highlight_detection Celery task
backend/routers/highlights.py — 4 admin API endpoints
backend/routers/creator_highlights.py — 4 creator-scoped endpoints (M022/S01)
backend/models.py — HighlightCandidate model with trim columns
alembic/versions/019_add_highlight_candidates.py — Initial migration
alembic/versions/021_add_highlight_trim_columns.py — Trim columns migration
backend/pipeline/test_highlight_scorer.py — 62 unit tests
frontend/src/pages/HighlightQueue.tsx — Creator review queue page
frontend/src/api/highlights.ts — Highlight API client

See also: Pipeline, Data-Model, API-Surface, Frontend
