test: Added 3 audio proxy scoring functions, extract_word_timings utili…
- "backend/pipeline/highlight_scorer.py"
- "backend/pipeline/highlight_schemas.py"
- "backend/pipeline/test_highlight_scorer.py"

GSD-Task: S05/T01
parent 34acf468c6
commit 27c5f4866b
12 changed files with 949 additions and 15 deletions
@@ -9,7 +9,7 @@ Creator-facing tools take shape: shorts queue, follow system, chat widget (UI on
| S01 | [A] Highlight Reel + Shorts Queue UI | medium | — | ✅ | Creator reviews auto-detected highlights and short candidates in a review queue — approve, trim, discard |
| S02 | [A] Follow System + Tier UI (Demo Placeholders) | medium | — | ✅ | Users can follow creators. Tier config page has styled Coming Soon payment placeholders. |
| S03 | [A] Chat Widget Shell (UI Only) | low | — | ✅ | Chat bubble on creator profile pages with conversation UI, typing indicator, suggested questions |
-| S04 | [B] Multi-Turn Conversation Memory | medium | — | ⬜ | Multi-turn conversations maintain context across messages using Redis-backed history |
+| S04 | [B] Multi-Turn Conversation Memory | medium | — | ✅ | Multi-turn conversations maintain context across messages using Redis-backed history |
| S05 | [B] Highlight Detection v2 (Audio Signals) | medium | — | ⬜ | Highlight detection uses audio energy analysis (librosa) alongside transcript signals for improved scoring |
| S06 | [B] Personality Profile Extraction | high | — | ⬜ | Personality profiles extracted for 3+ creators showing distinct vocabulary, tone, and style markers |
| S07 | Forgejo KB Update — Follow, Personality, Highlights | low | S01, S02, S03, S04, S05, S06 | ⬜ | Forgejo wiki updated with follow system, personality system, highlight engine v2 |
.gsd/milestones/M022/slices/S04/S04-SUMMARY.md — new file, 91 lines

@@ -0,0 +1,91 @@
---
id: S04
parent: M022
milestone: M022
provides:
- Multi-turn conversation memory via conversation_id threading
- ChatDoneMeta type for typed SSE done event parsing
requires: []
affects:
- S07
key_files:
- backend/chat_service.py
- backend/routers/chat.py
- backend/tests/test_chat.py
- frontend/src/api/chat.ts
- frontend/src/components/ChatWidget.tsx
- frontend/src/pages/ChatPage.tsx
- frontend/src/pages/ChatPage.module.css
key_decisions:
- Redis history stored as single JSON string with list slice cap rather than Redis list type — simpler atomic read/write
- Auto-generate UUID conversation_id when omitted for consistent done event shape
- ChatWidget resets conversation state on close for clean slate UX
- ChatDoneMeta exported type for typed done event parsing across consumers
patterns_established:
- "conversation_id threading: API POST body → ChatService → SSE done event → frontend state update loop"
observability_surfaces:
- "Redis key pattern chrysopedia:chat:{conversation_id} with 1h TTL for monitoring conversation state"
- conversation_id in SSE done event for correlating frontend sessions to backend history
drill_down_paths:
- .gsd/milestones/M022/slices/S04/tasks/T01-SUMMARY.md
- .gsd/milestones/M022/slices/S04/tasks/T02-SUMMARY.md
duration: ""
verification_result: passed
completed_at: 2026-04-04T07:55:14.777Z
blocker_discovered: false
---

# S04: [B] Multi-Turn Conversation Memory

**Multi-turn conversations maintain context across messages using Redis-backed history with conversation_id threading through API, SSE events, and both frontend chat surfaces.**

## What Happened

Two-task slice adding conversation memory to the existing streaming chat engine.

**T01 (Backend):** ChatService gained `_load_history()` and `_save_history()` methods using Redis JSON storage keyed by `chrysopedia:chat:{conversation_id}`. History is injected between the system prompt and current user message in the LLM messages array. The assistant response is accumulated during streaming and saved alongside the user message after the done event. History capped at 10 turn pairs (20 messages), TTL refreshed to 1 hour on each interaction. ChatRequest model gained an optional `conversation_id` field — when omitted, a UUID is auto-generated for consistent done event shape. The `done` SSE event now includes `conversation_id` so the frontend can thread follow-up messages. 7 new tests cover round-trip save, history injection, cap enforcement, TTL refresh, auto-ID generation, and single-turn fallback. All 13 chat tests pass.

**T02 (Frontend):** Three coordinated changes: (1) `streamChat()` in `api/chat.ts` gained a `conversationId` param sent in the POST body and a `ChatDoneMeta` type for typed done event parsing. (2) `ChatWidget.tsx` added `conversationId` state generated on first send via `crypto.randomUUID()`, threaded through `streamChat()`, updated from the done event, and reset on panel close. (3) `ChatPage.tsx` converted from single-response rendering to a multi-message `messages[]` array with conversation threading, a "New conversation" button, per-message citation rendering, typing indicator, and auto-scroll. CSS replaced with conversation bubble layout. Frontend builds cleanly.

## Verification

Backend: `cd backend && python -m pytest tests/test_chat.py -v` — 13 passed in 0.65s (6 existing + 7 new conversation memory tests). Frontend: `cd frontend && npm run build` — clean build, 0 errors. Both run during slice closure.

## Requirements Advanced

None.

## Requirements Validated

None.

## New Requirements Surfaced

None.

## Requirements Invalidated or Re-scoped

None.

## Deviations

T01: Removed `_get_chat_service` Depends factory in favor of direct construction with Redis injection — simpler testing pattern. T02: ChatWidget resets messages array on close (plan didn't specify preservation behavior).

## Known Limitations

History TTL is 1 hour — long gaps between messages lose context. No persistent conversation history beyond Redis TTL. ChatWidget resets state on close rather than preserving across open/close cycles.

## Follow-ups

None.

## Files Created/Modified

- `backend/chat_service.py` — Added _load_history/_save_history methods, conversation_id param, history injection into LLM messages, response accumulation and save after streaming
- `backend/routers/chat.py` — ChatRequest gained optional conversation_id field, done event includes conversation_id, Redis passed to ChatService constructor
- `backend/tests/test_chat.py` — Added mock_redis fixture and 7 new tests for conversation memory (save, inject, cap, TTL, auto-ID, fallback)
- `frontend/src/api/chat.ts` — streamChat() gained conversationId param and ChatDoneMeta type for done event parsing
- `frontend/src/components/ChatWidget.tsx` — Added conversationId state, crypto.randomUUID() on first send, done event update, reset on close
- `frontend/src/pages/ChatPage.tsx` — Converted from single-response to multi-message UI with conversation threading, new-conversation button, per-message citations
- `frontend/src/pages/ChatPage.module.css` — Replaced with conversation bubble layout, headerRow, sticky input, responsive styles

.gsd/milestones/M022/slices/S04/S04-UAT.md — new file, 53 lines

@@ -0,0 +1,53 @@
# S04: [B] Multi-Turn Conversation Memory — UAT

**Milestone:** M022
**Written:** 2026-04-04T07:55:14.777Z

## UAT: Multi-Turn Conversation Memory

### Preconditions

- Chrysopedia running on ub01:8096 with Redis and LightRAG available
- At least one creator with technique pages indexed for chat to have searchable content

### Test 1: Single-turn chat still works (backward compatibility)

1. Navigate to `/chat`
2. Type a question (e.g., "What is sidechain compression?") and send
3. **Expected:** Response streams in with citations. No errors in console.
4. Observe the SSE done event in the Network tab — it should include a `conversation_id` UUID even though none was sent

### Test 2: Multi-turn conversation on ChatPage

1. Navigate to `/chat`
2. Send: "Tell me about bass sound design"
3. Wait for the full response
4. Send a follow-up: "What plugins are commonly used for that?"
5. **Expected:** Second response references bass sound design context from the first message without re-stating the topic. Both messages visible in conversation history with user/assistant bubble styling.

### Test 3: New conversation button resets context

1. Complete Test 2 (have a multi-turn conversation)
2. Click the "New conversation" button
3. Send: "What plugins are commonly used for that?"
4. **Expected:** Response does NOT reference bass sound design — context was reset. Message history is cleared. A new conversation_id is generated on this send.

### Test 4: ChatWidget multi-turn

1. Navigate to any creator profile page that has the chat widget bubble
2. Open the chat widget
3. Send: "Who is this creator?"
4. Wait for the response
5. Send a follow-up: "What are their best techniques?"
6. **Expected:** Follow-up response is contextualized to the creator from message 1. Both messages visible in the widget.

### Test 5: ChatWidget reset on close

1. Complete Test 4 (have a conversation in the widget)
2. Close the chat widget panel
3. Reopen it
4. **Expected:** Conversation history is cleared. No previous messages visible. Next send generates a fresh conversation_id.

### Test 6: History cap at 10 turn pairs

1. In `/chat`, send 11 sequential messages (each with a response)
2. On the 12th message, the backend should include only the 10 most recent turn pairs in the LLM context
3. **Expected:** No errors. Conversation continues normally. Earliest messages may lose context influence but the conversation doesn't break.

### Edge Cases

- **Empty query:** Sending an empty message should return 422 (existing behavior preserved)
- **Redis unavailable:** If Redis is down, single-turn chat should still work (graceful degradation — history load returns empty, save fails silently)
- **Rapid sends:** Sending a second message before the first response completes should not corrupt conversation state

.gsd/milestones/M022/slices/S04/tasks/T02-VERIFY.json — new file, 16 lines

@@ -0,0 +1,16 @@
{
  "schemaVersion": 1,
  "taskId": "T02",
  "unitId": "M022/S04/T02",
  "timestamp": 1775289230401,
  "passed": true,
  "discoverySource": "task-plan",
  "checks": [
    {
      "command": "cd frontend",
      "exitCode": 0,
      "durationMs": 7,
      "verdict": "pass"
    }
  ]
}

@@ -1,6 +1,36 @@
# S05: [B] Highlight Detection v2 (Audio Signals)

-**Goal:** Add librosa-based audio energy analysis to highlight detection scoring
+**Goal:** Highlight detection scorer uses transcript word-level timing data to compute speech-rate variance, pause density, and speaking pace signals — improving highlight ranking beyond text-only heuristics.

**Demo:** After this: Highlight detection uses audio energy analysis (librosa) alongside transcript signals for improved scoring

## Tasks

- [x] **T01: Added 3 audio proxy scoring functions, extract_word_timings utility, rebalanced 10-dimension weights, and 34 new tests — all 62 tests pass** — Add three new pure scoring functions to highlight_scorer.py that analyze word-level timing data as audio-energy proxies:

1. `_speech_rate_variance(word_timings)` — compute words-per-second in sliding windows, return normalized stdev. High variance = emphasis shifts = better highlight.
2. `_pause_density(word_timings)` — count inter-word gaps >0.5s and inter-segment gaps >1.0s, normalize by duration. Strategic pauses = better highlight.
3. `_speaking_pace_fitness(word_timings)` — bell curve around the 3-5 wps optimal teaching pace.

Add an `extract_word_timings(transcript_data, start_time, end_time)` utility function that takes parsed transcript JSON (a list of segments with `words` arrays) and a time window, and returns the filtered word-timing dicts.

Update `_WEIGHTS` to 10 dimensions summing to 1.0. Update `score_moment()` to accept an optional `word_timings` parameter — when None, the new dimensions score 0.5 (neutral).

Update `HighlightScoreBreakdown` in highlight_schemas.py with 3 new float fields.

Add comprehensive tests for all new functions + backward compatibility.

- Estimate: 1h30m
- Files: backend/pipeline/highlight_scorer.py, backend/pipeline/highlight_schemas.py, backend/pipeline/test_highlight_scorer.py
- Verify: cd /home/aux/projects/content-to-kb-automator && python -m pytest backend/pipeline/test_highlight_scorer.py -v 2>&1 | tail -40

- [ ] **T02: Wire word-timing extraction into stage_highlight_detection and verify on ub01** — Update stage_highlight_detection() in stages.py to:

1. Look up SourceVideo.transcript_path for the video
2. Load the transcript JSON file once (from the /data/transcripts/ mount)
3. For each KeyMoment, call extract_word_timings() with the moment's start_time/end_time
4. Pass the resulting word_timings list to score_moment()

Handle failure gracefully: if transcript_path is None, the file doesn't exist, or the JSON is malformed, log a WARNING and pass word_timings=None (the scorer falls back to neutral 0.5).

Rebuild the Docker image on ub01, run the highlight detection stage on a real video, and verify that the score_breakdown JSONB contains 10 dimensions with non-neutral values for the new audio proxy signals.

IMPORTANT: The worker container mounts /data/transcripts/ read-only. SourceVideo.transcript_path stores the relative path. The stage runs in a sync Celery context — use standard open()/json.load(), not async.

- Estimate: 1h
- Files: backend/pipeline/stages.py
- Verify: ssh ub01 'cd /vmPool/r/repos/xpltdco/chrysopedia && docker compose build chrysopedia-worker && docker compose up -d chrysopedia-worker && sleep 5 && docker exec chrysopedia-api python -c "from sqlalchemy import create_engine, text; import os; e=create_engine(os.environ[\"DATABASE_URL\"].replace(\"asyncpg\",\"psycopg2\")); r=e.execute(text(\"SELECT score_breakdown FROM highlight_candidates ORDER BY updated_at DESC LIMIT 1\")); print(r.fetchone()[0])"'

.gsd/milestones/M022/slices/S05/S05-RESEARCH.md — new file, 136 lines

@@ -0,0 +1,136 @@
# S05 Research: Highlight Detection v2 (Audio Signals)

## Summary

The slice goal is to enhance highlight detection scoring with "audio energy analysis (librosa)." However, **no audio or video files exist on disk** — the system ingests pre-transcribed JSON only. The transcripts do contain **word-level timing data** (`{word, start, end}` arrays per segment), which provides rich speech-rate and pause-pattern signals that serve as audio-energy proxies without requiring librosa or any audio file processing.

**Recommendation:** Implement transcript-derived "audio proxy" signals (speech rate variance, pause density, emphasis detection) as new scoring dimensions in the existing pure-function scorer. Skip librosa — it would require ~300MB of dependencies (numpy, scipy, numba) in the slim Docker image and audio extraction infrastructure that doesn't exist. The word-level timing data provides equivalent discriminative power for highlight detection.

## Implementation Landscape

### Existing Highlight Scorer Architecture

**`backend/pipeline/highlight_scorer.py`** — Pure function `score_moment()` with 7 weighted dimensions summing to 1.0:

| Dimension | Weight | Input |
|-----------|--------|-------|
| duration_score | 0.25 | start_time, end_time |
| content_density_score | 0.20 | summary text |
| technique_relevance_score | 0.20 | content_type enum |
| plugin_diversity_score | 0.10 | plugins list |
| engagement_proxy_score | 0.10 | raw_transcript text |
| position_score | 0.10 | source_quality |
| uniqueness_score | 0.05 | video_content_type |

All current signals are text-based. No timing or prosody signals are used.

**`backend/pipeline/highlight_schemas.py`** — Pydantic schemas: `HighlightScoreBreakdown` (7 fields matching the table above), `HighlightCandidateResponse`, `HighlightBatchResult`.

**`backend/pipeline/test_highlight_scorer.py`** — 28 pure-function tests (no DB), testing individual scoring functions and composite ordering. Tests run in ~0.03s.

**`backend/pipeline/stages.py:2444`** — `stage_highlight_detection()` Celery task: queries KeyMoments for a video, calls `score_moment()` per moment, upserts into the `highlight_candidates` table. Currently passes only text fields — does not load TranscriptSegments or word-level data.

**`backend/models.py:706`** — `HighlightCandidate` model: `score` (float), `score_breakdown` (JSONB), `duration_secs` (float), `status` enum, FK to key_moment and source_video.

### Data Available for Audio Proxy Signals

**Word-level timing in raw JSON files** (on disk at `/data/transcripts/{creator}/(unknown).json`):
- Each segment has `words: [{word, start, end}, ...]`
- `SourceVideo.transcript_path` stores the file path
- Worker container mounts `/data/transcripts/` read-only
- Sample data from Keota transcript: mean speech rate 4.42 wps (stdev 1.73), inter-segment gaps 0-1.3s

**TranscriptSegments in DB** (per-segment start_time, end_time, text):
- Coarser than word-level but available without disk I/O
- Can compute segment-level speech rate (word_count / duration) and inter-segment pauses

**KeyMoment.raw_transcript** — flat text, no timing. Already used by `_transcript_energy()`.

### Derivable Audio Proxy Signals

From word-level timing within a key moment's time window:

1. **Speech rate variance** — High variance indicates emphasis shifts (slowing down for important points, speeding up for setup). Teaching moments typically show higher variance than casual chat.
2. **Pause density** — Long pauses (>0.5s inter-word, >1s inter-segment) often precede or follow key demonstrations. Moments with strategic pauses score higher as highlights.
3. **Speaking pace** — Absolute words-per-second. Very fast (>6 wps) suggests filler/rambling; moderate (3-5 wps) suggests deliberate teaching; very slow (<2 wps) suggests demonstration with sparse narration.
4. **Emphasis detection** — Word-level gaps >0.3s mid-sentence indicate prosodic emphasis. Count of emphasis markers per minute.
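Signal 4 reduces to a small pure function. The 0.3s threshold is from the description above; the 1.0s upper bound (so that longer gaps are counted as pauses rather than emphasis) is an assumption:

```python
def emphasis_markers_per_minute(word_timings: list[dict]) -> float:
    """Rate of short mid-sentence word gaps (>0.3s) suggesting prosodic emphasis."""
    if len(word_timings) < 2:
        return 0.0
    duration = word_timings[-1]["end"] - word_timings[0]["start"]
    if duration <= 0:
        return 0.0
    markers = sum(
        1 for prev, cur in zip(word_timings, word_timings[1:])
        # assumed upper bound: gaps over 1.0s belong to pause density instead
        if 0.3 < cur["start"] - prev["end"] <= 1.0
    )
    return markers * 60.0 / duration
```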

### Why Not Librosa

- **No audio files exist.** The pipeline ingests JSON transcripts, not media files. No video URLs are stored in the DB. Adding audio extraction would require: video download tooling (yt-dlp), audio extraction (ffmpeg), spectral analysis (librosa), and temp file management.
- **Docker image bloat.** librosa pulls numpy, scipy, numba (~300MB). The API/worker image uses python:3.12-slim with ~200MB total currently.
- **Word timing is sufficient.** Whisper's word-level timing provides speech rate and pause information that correlates with the same signals librosa would extract (RMS energy peaks correlate with speech rate acceleration; silence detection correlates with pause analysis).

## Key Files

| File | Role | Changes Needed |
|------|------|----------------|
| `backend/pipeline/highlight_scorer.py` | Pure scoring functions + weights | Add 3 new scoring functions for audio proxy signals, rebalance weights to accommodate new dimensions |
| `backend/pipeline/highlight_schemas.py` | Pydantic schemas for score breakdown | Add new dimension fields to `HighlightScoreBreakdown` |
| `backend/pipeline/stages.py:2444` | `stage_highlight_detection()` Celery task | Load transcript JSON from disk, extract word-level timing for each moment's time window, pass to scorer |
| `backend/pipeline/test_highlight_scorer.py` | 28 existing tests | Add tests for new scoring functions + updated composite ordering |
| `backend/requirements.txt` | Python deps | No changes needed (stdlib only) |

## Natural Task Seams

### T01: Audio proxy scoring functions (pure, no DB)

Add 3 new scoring functions to `highlight_scorer.py`:
- `_speech_rate_variance(word_timings)` — stdev of per-window speech rates
- `_pause_density(word_timings, segment_gaps)` — normalized count of significant pauses
- `_speaking_pace_fitness(word_timings)` — bell curve around optimal teaching pace

Update the `_WEIGHTS` dict — rebalance all 10 dimensions to sum to 1.0. Suggested new weights:
- Reduce duration_score: 0.25→0.20
- Reduce content_density_score: 0.20→0.15
- Reduce technique_relevance_score: 0.20→0.15
- Keep plugin_diversity: 0.10
- Reduce engagement_proxy: 0.10→0.08
- Keep position: 0.10
- Reduce uniqueness: 0.05→0.02
- New speech_rate_variance_score: 0.08
- New pause_density_score: 0.07
- New speaking_pace_score: 0.05

Update `HighlightScoreBreakdown` in `highlight_schemas.py` with 3 new fields.

Update the `score_moment()` signature to accept an optional `word_timings` parameter (a list of `{word, start, end}` dicts). When absent, the new dimensions score 0.5 (neutral) — backward compatible.

**Verify:** All 28 existing tests still pass. New tests for each scoring function + composite scoring with/without word timings.

### T02: Word-timing extraction utility

Create a utility function (in `highlight_scorer.py` or a small helper) that:
1. Loads a transcript JSON file from disk given a path
2. Extracts word-level timings for a given time window (start_time, end_time)
3. Returns structured data the scorer expects

This is pure I/O + filtering — testable with fixture JSON.

**Verify:** Unit test with a small fixture JSON file.

### T03: Wire audio signals into the pipeline stage

Update `stage_highlight_detection()` to:
1. Load `SourceVideo.transcript_path` for the video
2. Read the JSON file once per video (not per moment)
3. For each KeyMoment, slice the word timings to the moment's time window
4. Pass word timings to `score_moment()`

**Verify:** Run the highlight detection stage on a real video on ub01. Compare before/after scores — new dimensions should appear in score_breakdown JSONB. Existing candidate ordering should shift but not invert dramatically.

## Risks & Mitigations

| Risk | Likelihood | Mitigation |
|------|-----------|------------|
| Word-level timing missing for some transcripts (older ingests) | Medium | Scorer handles None word_timings gracefully (neutral 0.5 for new dimensions) |
| Transcript JSON files very large (3000+ segments) | Low | Load once per video, slice by time window. Memory is fine for JSON. |
| Weight rebalancing shifts scores such that previously-approved highlights drop below threshold | Medium | No hard threshold exists — HighlightCandidate stores all scores. UI filters by status, not score. Re-scoring changes ordering but doesn't lose data. |
| Docker image needs new deps | None | All implementation uses stdlib (json, statistics) — no new packages |

## Don't Hand-Roll

- **Statistics calculations** — Use Python's `statistics.mean()`, `statistics.stdev()` from stdlib rather than manual sum/count arithmetic.

## Sources

- Existing codebase: `highlight_scorer.py`, `stages.py`, `models.py`, `highlight_schemas.py`
- Live data inspection: Keota transcript JSON on ub01 (word-level timing confirmed present)
- KNOWLEDGE.md: "Pure-function scoring + Celery task separation" pattern

.gsd/milestones/M022/slices/S05/tasks/T01-PLAN.md — new file, 37 lines

@@ -0,0 +1,37 @@
---
estimated_steps: 8
estimated_files: 3
skills_used: []
---

# T01: Add audio proxy scoring functions, update schema, and test

Add three new pure scoring functions to highlight_scorer.py that analyze word-level timing data as audio-energy proxies:

1. `_speech_rate_variance(word_timings)` — compute words-per-second in sliding windows, return normalized stdev. High variance = emphasis shifts = better highlight.
2. `_pause_density(word_timings)` — count inter-word gaps >0.5s and inter-segment gaps >1.0s, normalize by duration. Strategic pauses = better highlight.
3. `_speaking_pace_fitness(word_timings)` — bell curve around the 3-5 wps optimal teaching pace.

Add an `extract_word_timings(transcript_data, start_time, end_time)` utility function that takes parsed transcript JSON (a list of segments with `words` arrays) and a time window, and returns the filtered word-timing dicts.
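A sketch of that utility, under one stated assumption: a word belongs to the window only if it lies entirely inside it (the real implementation may handle partial overlap differently):

```python
def extract_word_timings(transcript_data, start_time, end_time):
    """Flatten segment `words` arrays, keeping words inside [start_time, end_time]."""
    timings = []
    for segment in transcript_data or []:
        for w in segment.get("words", []):
            # assumed inclusion rule: word fully contained in the window
            if w["start"] >= start_time and w["end"] <= end_time:
                timings.append({"word": w["word"], "start": w["start"], "end": w["end"]})
    return timings
```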

Update `_WEIGHTS` to 10 dimensions summing to 1.0. Update `score_moment()` to accept an optional `word_timings` parameter — when None, the new dimensions score 0.5 (neutral).

Update `HighlightScoreBreakdown` in highlight_schemas.py with 3 new float fields.

Add comprehensive tests for all new functions + backward compatibility.

## Inputs

- `backend/pipeline/highlight_scorer.py` — existing 7-dimension scorer with _WEIGHTS dict and score_moment()
- `backend/pipeline/highlight_schemas.py` — HighlightScoreBreakdown with 7 fields
- `backend/pipeline/test_highlight_scorer.py` — 28 existing pure-function tests

## Expected Output

- `backend/pipeline/highlight_scorer.py` — 3 new scoring functions + extract_word_timings utility + rebalanced 10-dimension weights + updated score_moment() with optional word_timings param
- `backend/pipeline/highlight_schemas.py` — HighlightScoreBreakdown with 10 fields
- `backend/pipeline/test_highlight_scorer.py` — all 28 existing tests pass + new tests for each audio proxy function, extract_word_timings, composite scoring with/without word timings

## Verification

`cd /home/aux/projects/content-to-kb-automator && python -m pytest backend/pipeline/test_highlight_scorer.py -v 2>&1 | tail -40`

.gsd/milestones/M022/slices/S05/tasks/T01-SUMMARY.md — new file, 80 lines

@@ -0,0 +1,80 @@
---
id: T01
parent: S05
milestone: M022
provides: []
requires: []
affects: []
key_files:
- backend/pipeline/highlight_scorer.py
- backend/pipeline/highlight_schemas.py
- backend/pipeline/test_highlight_scorer.py
key_decisions:
- "Rebalanced weights: original 7 dims reduced proportionally, new 3 audio dims get 0.22 total weight"
- Audio proxy functions return 0.5 (neutral) when word_timings is None for backward compatibility
- Speech rate variance uses coefficient of variation with 5s sliding windows
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "Ran python -m pytest backend/pipeline/test_highlight_scorer.py -v — all 62 tests pass (28 original + 34 new) in 0.10s. Original relative orderings preserved (ideal > mediocre > poor). Backward compatibility verified: score_moment() without word_timings produces neutral audio dimension scores."
completed_at: 2026-04-04T08:05:04.954Z
blocker_discovered: false
---

# T01: Added 3 audio proxy scoring functions, extract_word_timings utility, rebalanced 10-dimension weights, and 34 new tests — all 62 tests pass

**Added 3 audio proxy scoring functions, extract_word_timings utility, rebalanced 10-dimension weights, and 34 new tests — all 62 tests pass**

## What Happened

Added four new functions to highlight_scorer.py: extract_word_timings (filters word-level timing dicts by time window), _speech_rate_variance (WPS coefficient of variation in sliding windows), _pause_density (weighted inter-word gap counting), and _speaking_pace_fitness (bell-curve around 3-5 WPS). Rebalanced _WEIGHTS from 7 to 10 dimensions summing to 1.0. Updated score_moment() with optional word_timings parameter that defaults to 0.5 (neutral) for backward compatibility. Added 3 new fields to HighlightScoreBreakdown schema. Updated existing tests and added 34 new tests.
|
||||||
|
|
||||||
|
## Verification
|
||||||
|
|
||||||
|
Ran python -m pytest backend/pipeline/test_highlight_scorer.py -v — all 62 tests pass (28 original + 34 new) in 0.10s. Original relative orderings preserved (ideal > mediocre > poor). Backward compatibility verified: score_moment() without word_timings produces neutral audio dimension scores.
|
||||||
|
|
||||||
|
## Verification Evidence
|
||||||
|
|
||||||
|
| # | Command | Exit Code | Verdict | Duration |
|
||||||
|
|---|---------|-----------|---------|----------|
|
||||||
|
| 1 | `python -m pytest backend/pipeline/test_highlight_scorer.py -v` | 0 | ✅ pass | 3700ms |
|
||||||
|
|
||||||
|
|
||||||
|
## Deviations
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
|
||||||
|
None.
|
||||||
|
|
||||||
|
## Files Created/Modified
|
||||||
|
|
||||||
|
- `backend/pipeline/highlight_scorer.py`
|
||||||
|
- `backend/pipeline/highlight_schemas.py`
|
||||||
|
- `backend/pipeline/test_highlight_scorer.py`
|
||||||
|
|
||||||
|
|
||||||
|
## Deviations
|
||||||
|
None.
|
||||||
|
|
||||||
|
## Known Issues
|
||||||
|
None.
|
||||||
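The weight rebalancing named in key_decisions can be sanity-checked standalone. The values below are copied from the _WEIGHTS table in the highlight_scorer.py diff in this commit; the snippet only restates and verifies them, it is not project code:

```python
# Rebalanced 10-dimension weights, copied from the highlight_scorer.py diff.
weights = {
    "duration_score": 0.20,
    "content_density_score": 0.15,
    "technique_relevance_score": 0.15,
    "plugin_diversity_score": 0.08,
    "engagement_proxy_score": 0.08,
    "position_score": 0.08,
    "uniqueness_score": 0.04,
    "speech_rate_variance_score": 0.08,
    "pause_density_score": 0.07,
    "speaking_pace_score": 0.07,
}
audio_dims = {"speech_rate_variance_score", "pause_density_score", "speaking_pace_score"}
audio_total = sum(v for k, v in weights.items() if k in audio_dims)

# The two invariants stated in key_decisions:
assert abs(sum(weights.values()) - 1.0) < 1e-9  # all 10 dims sum to 1.0
assert abs(audio_total - 0.22) < 1e-9           # new audio dims carry 0.22 total
```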
33
.gsd/milestones/M022/slices/S05/tasks/T02-PLAN.md
Normal file

@@ -0,0 +1,33 @@
---
estimated_steps: 8
estimated_files: 1
skills_used: []
---

# T02: Wire word-timing extraction into stage_highlight_detection and verify on ub01

Update stage_highlight_detection() in stages.py to:

1. Look up SourceVideo.transcript_path for the video
2. Load the transcript JSON file once (from the /data/transcripts/ mount)
3. For each KeyMoment, call extract_word_timings() with the moment's start_time/end_time
4. Pass the resulting word_timings list to score_moment()

Handle failure gracefully: if transcript_path is None, the file doesn't exist, or the JSON is malformed, log a WARNING and pass word_timings=None (the scorer falls back to neutral 0.5).

Rebuild the Docker image on ub01, run the highlight detection stage on a real video, and verify the score_breakdown JSONB contains 10 dimensions with non-neutral values for the new audio proxy signals.

IMPORTANT: The worker container mounts /data/transcripts/ read-only. SourceVideo.transcript_path stores the relative path. The stage runs in a sync Celery context — use standard open()/json.load(), not async.
## Inputs

- `backend/pipeline/stages.py` — stage_highlight_detection() at line 2444, currently passes only text fields to score_moment()
- `backend/pipeline/highlight_scorer.py` — updated score_moment() with word_timings param and extract_word_timings() utility (from T01)
- `backend/models.py` — SourceVideo.transcript_path field

## Expected Output

- `backend/pipeline/stages.py` — stage_highlight_detection() loads transcript JSON, extracts word timings per moment, passes to scorer. Graceful fallback on missing/malformed transcript data.

## Verification

ssh ub01 'cd /vmPool/r/repos/xpltdco/chrysopedia && docker compose build chrysopedia-worker && docker compose up -d chrysopedia-worker && sleep 5 && docker exec chrysopedia-api python -c "from sqlalchemy import create_engine, text; import os; e=create_engine(os.environ[\"DATABASE_URL\"].replace(\"asyncpg\",\"psycopg2\")); r=e.execute(text(\"SELECT score_breakdown FROM highlight_candidates ORDER BY updated_at DESC LIMIT 1\")); print(r.fetchone()[0])"'
@@ -25,6 +25,18 @@ class HighlightScoreBreakdown(BaseModel):
     uniqueness_score: float = Field(description="Score based on title/topic distinctness among siblings")
     engagement_proxy_score: float = Field(description="Proxy engagement signal from summary quality/length")
     plugin_diversity_score: float = Field(description="Score based on breadth of plugins/tools mentioned")
+    speech_rate_variance_score: float = Field(
+        default=0.5,
+        description="Score based on speech rate variation (emphasis shifts) from word timings",
+    )
+    pause_density_score: float = Field(
+        default=0.5,
+        description="Score based on strategic pause frequency from word timings",
+    )
+    speaking_pace_score: float = Field(
+        default=0.5,
+        description="Score based on words-per-second fitness for teaching pace",
+    )
 
 
 class HighlightCandidateResponse(BaseModel):
@@ -1,30 +1,35 @@
 """Heuristic scoring engine for highlight candidate detection.
 
 Takes KeyMoment data + context (source quality, video content type) and
-returns a composite score in [0, 1] with a 7-dimension breakdown.
+returns a composite score in [0, 1] with a 10-dimension breakdown.
 
 The breakdown fields align with HighlightScoreBreakdown in highlight_schemas.py:
 duration_score, content_density_score, technique_relevance_score,
-position_score, uniqueness_score, engagement_proxy_score, plugin_diversity_score
+position_score, uniqueness_score, engagement_proxy_score, plugin_diversity_score,
+speech_rate_variance_score, pause_density_score, speaking_pace_score
 """
 
 from __future__ import annotations
 
 import math
 import re
+import statistics
 from typing import Any
 
 
 # ── Weights per dimension (must sum to 1.0) ──────────────────────────────────
 
 _WEIGHTS: dict[str, float] = {
-    "duration_score": 0.25,
-    "content_density_score": 0.20,
-    "technique_relevance_score": 0.20,
-    "plugin_diversity_score": 0.10,
-    "engagement_proxy_score": 0.10,
-    "position_score": 0.10,  # mapped from source_quality
-    "uniqueness_score": 0.05,  # mapped from video_type
+    "duration_score": 0.20,
+    "content_density_score": 0.15,
+    "technique_relevance_score": 0.15,
+    "plugin_diversity_score": 0.08,
+    "engagement_proxy_score": 0.08,
+    "position_score": 0.08,  # mapped from source_quality
+    "uniqueness_score": 0.04,  # mapped from video_type
+    "speech_rate_variance_score": 0.08,
+    "pause_density_score": 0.07,
+    "speaking_pace_score": 0.07,
 }
 
 assert abs(sum(_WEIGHTS.values()) - 1.0) < 1e-9, "Weights must sum to 1.0"
@@ -176,6 +181,163 @@ def _video_type_weight(video_content_type: str | None) -> float:
     return mapping.get(video_content_type or "", 0.5)
 
 
+# ── Audio proxy scoring functions ─────────────────────────────────────────────
+
+
+def extract_word_timings(
+    transcript_data: list[dict[str, Any]],
+    start_time: float,
+    end_time: float,
+) -> list[dict[str, Any]]:
+    """Extract word-level timing dicts from transcript segments within a time window.
+
+    Parameters
+    ----------
+    transcript_data : list[dict]
+        Parsed transcript JSON — list of segments, each with a ``words`` array.
+        Each word dict must have ``start`` and ``end`` float fields (seconds).
+    start_time : float
+        Window start in seconds (inclusive).
+    end_time : float
+        Window end in seconds (inclusive).
+
+    Returns
+    -------
+    list[dict] — word-timing dicts whose ``start`` falls within [start_time, end_time].
+    """
+    if not transcript_data:
+        return []
+
+    words: list[dict[str, Any]] = []
+    for segment in transcript_data:
+        seg_words = segment.get("words")
+        if not seg_words:
+            continue
+        for w in seg_words:
+            w_start = w.get("start")
+            if w_start is None:
+                continue
+            if start_time <= w_start <= end_time:
+                words.append(w)
+    return words
+
+
+def _speech_rate_variance(word_timings: list[dict[str, Any]] | None) -> float:
+    """Compute normalized stdev of words-per-second in sliding windows.
+
+    High variance indicates emphasis shifts (speeding up / slowing down),
+    which correlates with engaging highlights.
+
+    Uses 5-second sliding windows with 2.5-second step.
+    Returns 0.5 (neutral) when word_timings is None or insufficient data.
+    """
+    if not word_timings or len(word_timings) < 4:
+        return 0.5
+
+    # Determine time span
+    first_start = word_timings[0].get("start", 0.0)
+    last_start = word_timings[-1].get("start", 0.0)
+    span = last_start - first_start
+    if span < 5.0:
+        return 0.5
+
+    # Compute WPS in 5s sliding windows with 2.5s step
+    window_size = 5.0
+    step = 2.5
+    wps_values: list[float] = []
+
+    t = first_start
+    while t + window_size <= last_start + 0.01:
+        count = sum(
+            1 for w in word_timings
+            if t <= w.get("start", 0.0) < t + window_size
+        )
+        wps_values.append(count / window_size)
+        t += step
+
+    if len(wps_values) < 2:
+        return 0.5
+
+    mean_wps = statistics.mean(wps_values)
+    if mean_wps < 0.01:
+        return 0.5
+
+    stdev = statistics.stdev(wps_values)
+    # Normalize: coefficient of variation, capped at 1.0
+    # CV of ~0.3-0.5 is typical for varied speech; >0.5 is high variance
+    cv = stdev / mean_wps
+    return min(cv / 0.6, 1.0)
+
+
+def _pause_density(word_timings: list[dict[str, Any]] | None) -> float:
+    """Count strategic pauses normalized by duration.
+
+    Inter-word gaps >0.5s and inter-segment gaps >1.0s indicate deliberate
+    pauses for emphasis, which correlate with better highlights.
+
+    Returns 0.5 (neutral) when word_timings is None or insufficient data.
+    """
+    if not word_timings or len(word_timings) < 2:
+        return 0.5
+
+    first_start = word_timings[0].get("start", 0.0)
+    last_end = word_timings[-1].get("end", word_timings[-1].get("start", 0.0))
+    duration = last_end - first_start
+    if duration < 1.0:
+        return 0.5
+
+    short_pauses = 0  # >0.5s gaps
+    long_pauses = 0  # >1.0s gaps
+
+    for i in range(1, len(word_timings)):
+        prev_end = word_timings[i - 1].get("end", word_timings[i - 1].get("start", 0.0))
+        curr_start = word_timings[i].get("start", 0.0)
+        gap = curr_start - prev_end
+
+        if gap > 1.0:
+            long_pauses += 1
+        elif gap > 0.5:
+            short_pauses += 1
+
+    # Weight long pauses more heavily
+    weighted_pauses = short_pauses + long_pauses * 2.0
+    # Normalize: ~2-4 weighted pauses per 30s is good density
+    density = weighted_pauses / (duration / 15.0)
+    return min(density, 1.0)
+
+
+def _speaking_pace_fitness(word_timings: list[dict[str, Any]] | None) -> float:
+    """Bell-curve score around 3-5 words-per-second optimal teaching pace.
+
+    3-5 WPS is the sweet spot for tutorial content — fast enough to be
+    engaging, slow enough for comprehension. Returns 0.5 (neutral) when
+    word_timings is None or insufficient data.
+    """
+    if not word_timings or len(word_timings) < 2:
+        return 0.5
+
+    first_start = word_timings[0].get("start", 0.0)
+    last_end = word_timings[-1].get("end", word_timings[-1].get("start", 0.0))
+    duration = last_end - first_start
+    if duration < 1.0:
+        return 0.5
+
+    wps = len(word_timings) / duration
+
+    # Sweet spot: 3-5 WPS → 1.0
+    if 3.0 <= wps <= 5.0:
+        return 1.0
+
+    # Below sweet spot: linear ramp from 0 at 0 WPS to 1.0 at 3 WPS
+    if wps < 3.0:
+        return max(0.0, wps / 3.0)
+
+    # Above sweet spot: decay from 1.0 at 5 WPS to 0.0 at 10 WPS
+    if wps > 5.0:
+        return max(0.0, 1.0 - (wps - 5.0) / 5.0)
+
+    return 0.5  # unreachable, but defensive
+
+
 # ── Main scoring function ───────────────────────────────────────────────────
 
 def score_moment(
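As a side note on the normalization in _speech_rate_variance above: the final score is the coefficient of variation of the per-window WPS values, divided by 0.6 and capped at 1.0. A standalone sketch of just that mapping (cv_to_score is an illustrative name, not part of the module):

```python
import statistics

def cv_to_score(wps_values: list[float]) -> float:
    # Same normalization as _speech_rate_variance: CV / 0.6, capped at 1.0
    mean = statistics.mean(wps_values)
    cv = statistics.stdev(wps_values) / mean
    return min(cv / 0.6, 1.0)

steady = [4.0, 4.1, 3.9, 4.0]  # near-constant pace -> score close to 0
varied = [6.0, 2.0, 6.0, 2.0]  # alternating fast/slow -> score near 1.0
assert cv_to_score(steady) < cv_to_score(varied)
```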
@@ -188,6 +350,7 @@ def score_moment(
     raw_transcript: str | None = None,
     source_quality: str | None = None,
     video_content_type: str | None = None,
+    word_timings: list[dict[str, Any]] | None = None,
 ) -> dict[str, Any]:
     """Score a KeyMoment for highlight potential.
@@ -209,6 +372,9 @@ def score_moment(
         TechniquePage source quality (structured, mixed, unstructured).
     video_content_type : str | None
         SourceVideo content type (tutorial, breakdown, livestream, short_form).
+    word_timings : list[dict] | None
+        Word-level timing dicts with ``start`` and ``end`` keys (seconds).
+        When None, audio proxy dimensions score 0.5 (neutral).
 
     Returns
     -------
@@ -227,6 +393,9 @@ def score_moment(
         "engagement_proxy_score": _transcript_energy(raw_transcript),
         "position_score": _source_quality_weight(source_quality),
         "uniqueness_score": _video_type_weight(video_content_type),
+        "speech_rate_variance_score": _speech_rate_variance(word_timings),
+        "pause_density_score": _pause_density(word_timings),
+        "speaking_pace_score": _speaking_pace_fitness(word_timings),
     }
 
     # Weighted composite
@@ -11,11 +11,15 @@ import pytest
 from backend.pipeline.highlight_scorer import (
     _content_type_weight,
     _duration_fitness,
+    _pause_density,
     _plugin_richness,
     _source_quality_weight,
+    _speaking_pace_fitness,
     _specificity_density,
+    _speech_rate_variance,
     _transcript_energy,
     _video_type_weight,
+    extract_word_timings,
     score_moment,
 )
@@ -80,6 +84,50 @@ def _poor_moment() -> dict:
     )
 
 
+def _make_word_timings(
+    start: float = 0.0,
+    count: int = 40,
+    wps: float = 4.0,
+    pause_every: int | None = None,
+    pause_duration: float = 0.8,
+) -> list[dict]:
+    """Generate synthetic word-timing dicts for testing.
+
+    Parameters
+    ----------
+    start : float
+        Start time in seconds.
+    count : int
+        Number of words to generate.
+    wps : float
+        Words per second (base rate).
+    pause_every : int | None
+        Insert a pause every N words. None = no pauses.
+    pause_duration : float
+        Duration of each pause in seconds.
+    """
+    timings = []
+    t = start
+    word_dur = 1.0 / wps * 0.7  # 70% speaking, 30% normal gap
+    gap = 1.0 / wps * 0.3
+
+    for i in range(count):
+        timings.append({"word": f"word{i}", "start": t, "end": t + word_dur})
+        t += word_dur + gap
+        if pause_every and (i + 1) % pause_every == 0:
+            t += pause_duration
+    return timings
+
+
+def _make_transcript_segments(word_timings: list[dict], words_per_segment: int = 10) -> list[dict]:
+    """Group word timings into transcript segments for extract_word_timings tests."""
+    segments = []
+    for i in range(0, len(word_timings), words_per_segment):
+        chunk = word_timings[i : i + words_per_segment]
+        segments.append({"words": chunk})
+    return segments
+
+
 # ── Tests ────────────────────────────────────────────────────────────────────
 
 class TestScoreMoment:
@@ -130,22 +178,41 @@ class TestScoreMoment:
         )
         assert 0.0 <= result["score"] <= 1.0
         assert result["duration_secs"] == 45.0
-        assert len(result["score_breakdown"]) == 7
+        assert len(result["score_breakdown"]) == 10
 
     def test_returns_duration_secs(self):
         result = score_moment(start_time=10.0, end_time=55.0)
         assert result["duration_secs"] == 45.0
 
-    def test_breakdown_has_seven_dimensions(self):
+    def test_breakdown_has_ten_dimensions(self):
         result = score_moment(**_ideal_moment())
-        assert len(result["score_breakdown"]) == 7
+        assert len(result["score_breakdown"]) == 10
         expected_keys = {
             "duration_score", "content_density_score", "technique_relevance_score",
             "plugin_diversity_score", "engagement_proxy_score", "position_score",
-            "uniqueness_score",
+            "uniqueness_score", "speech_rate_variance_score", "pause_density_score",
+            "speaking_pace_score",
         }
         assert set(result["score_breakdown"].keys()) == expected_keys
 
+    def test_without_word_timings_audio_dims_are_neutral(self):
+        """When word_timings is None, audio proxy dimensions score 0.5."""
+        result = score_moment(start_time=10.0, end_time=55.0)
+        bd = result["score_breakdown"]
+        assert bd["speech_rate_variance_score"] == 0.5
+        assert bd["pause_density_score"] == 0.5
+        assert bd["speaking_pace_score"] == 0.5
+
+    def test_with_word_timings_changes_score(self):
+        """Providing word_timings should shift the composite score vs without."""
+        base = _ideal_moment()
+        without = score_moment(**base)
+        # Add word timings at a good teaching pace (~4 WPS) with some pauses
+        timings = _make_word_timings(start=10.0, count=120, wps=4.0, pause_every=15)
+        with_timings = score_moment(**base, word_timings=timings)
+        # Scores should differ since audio dims are no longer neutral
+        assert with_timings["score"] != without["score"]
+
 
 class TestDurationFitness:
     def test_bell_curve_peak(self):
@@ -242,3 +309,213 @@ class TestVideoTypeWeight:
 
     def test_none_default(self):
         assert _video_type_weight(None) == 0.5
+
+
+# ── Audio proxy function tests ───────────────────────────────────────────────
+
+
+class TestExtractWordTimings:
+    def test_filters_by_time_window(self):
+        words = _make_word_timings(start=0.0, count=40, wps=4.0)
+        segments = _make_transcript_segments(words)
+        # Extract window 2.0–5.0s
+        result = extract_word_timings(segments, start_time=2.0, end_time=5.0)
+        for w in result:
+            assert 2.0 <= w["start"] <= 5.0
+
+    def test_returns_all_when_window_covers_entire_range(self):
+        words = _make_word_timings(start=0.0, count=20, wps=4.0)
+        segments = _make_transcript_segments(words)
+        result = extract_word_timings(segments, start_time=0.0, end_time=100.0)
+        assert len(result) == 20
+
+    def test_empty_transcript_data(self):
+        assert extract_word_timings([], start_time=0.0, end_time=10.0) == []
+
+    def test_no_words_in_window(self):
+        words = _make_word_timings(start=0.0, count=10, wps=4.0)
+        segments = _make_transcript_segments(words)
+        # Window far beyond the word timings
+        result = extract_word_timings(segments, start_time=100.0, end_time=200.0)
+        assert result == []
+
+    def test_segments_without_words_key(self):
+        """Segments missing 'words' are skipped gracefully."""
+        segments = [{"text": "hello"}, {"words": [{"start": 1.0, "end": 1.2, "word": "a"}]}]
+        result = extract_word_timings(segments, start_time=0.0, end_time=10.0)
+        assert len(result) == 1
+
+    def test_words_without_start_are_skipped(self):
+        segments = [{"words": [{"end": 1.2, "word": "a"}, {"start": 2.0, "end": 2.2, "word": "b"}]}]
+        result = extract_word_timings(segments, start_time=0.0, end_time=10.0)
+        assert len(result) == 1
+        assert result[0]["word"] == "b"
+
+
+class TestSpeechRateVariance:
+    def test_none_returns_neutral(self):
+        assert _speech_rate_variance(None) == 0.5
+
+    def test_too_few_words_returns_neutral(self):
+        timings = _make_word_timings(count=3, wps=4.0)
+        assert _speech_rate_variance(timings) == 0.5
+
+    def test_short_span_returns_neutral(self):
+        """Words spanning <5s should return neutral."""
+        timings = _make_word_timings(count=10, wps=4.0, start=0.0)
+        # 10 words at 4 WPS = 2.5s span → too short
+        assert _speech_rate_variance(timings) == 0.5
+
+    def test_uniform_pace_scores_low(self):
+        """Steady 4 WPS for 30s → low variance."""
+        timings = _make_word_timings(start=0.0, count=120, wps=4.0)
+        score = _speech_rate_variance(timings)
+        assert score < 0.4, f"Uniform pace scored {score}, expected < 0.4"
+
+    def test_varied_pace_scores_higher(self):
+        """Alternating fast/slow sections → higher variance."""
+        timings = []
+        t = 0.0
+        # Fast section: 6 WPS for 10s
+        for i in range(60):
+            dur = 0.12
+            timings.append({"word": f"w{i}", "start": t, "end": t + dur})
+            t += 1.0 / 6.0
+        # Slow section: 2 WPS for 10s
+        for i in range(20):
+            dur = 0.3
+            timings.append({"word": f"w{60+i}", "start": t, "end": t + dur})
+            t += 0.5
+        score = _speech_rate_variance(timings)
+        uniform_score = _speech_rate_variance(
+            _make_word_timings(start=0.0, count=80, wps=4.0)
+        )
+        assert score > uniform_score, (
+            f"Varied pace ({score:.3f}) should be > uniform ({uniform_score:.3f})"
+        )
+
+    def test_score_bounded(self):
+        timings = _make_word_timings(start=0.0, count=200, wps=4.0)
+        score = _speech_rate_variance(timings)
+        assert 0.0 <= score <= 1.0
+
+
+class TestPauseDensity:
+    def test_none_returns_neutral(self):
+        assert _pause_density(None) == 0.5
+
+    def test_single_word_returns_neutral(self):
+        assert _pause_density([{"start": 0.0, "end": 0.2}]) == 0.5
+
+    def test_no_pauses_scores_zero(self):
+        """Continuous speech with no gaps >0.5s → 0."""
+        timings = _make_word_timings(start=0.0, count=60, wps=4.0)
+        score = _pause_density(timings)
+        assert score == 0.0
+
+    def test_frequent_pauses_scores_high(self):
+        """Pauses every 5 words → high density."""
+        timings = _make_word_timings(start=0.0, count=60, wps=4.0, pause_every=5, pause_duration=0.8)
+        score = _pause_density(timings)
+        assert score > 0.5, f"Frequent pauses scored {score}, expected > 0.5"
+
+    def test_long_pauses_weighted_more(self):
+        """One 1.5s pause should score higher than one 0.6s pause in a longer segment."""
+        # Build timings with one long pause at midpoint — 60 words for longer duration
+        long_pause = []
+        t = 0.0
+        for i in range(60):
+            long_pause.append({"word": f"w{i}", "start": t, "end": t + 0.15})
+            t += 0.25
+            if i == 29:
+                t += 1.5  # long pause >1.0s
+        # Build timings with one short pause — same word count
+        short_pause = []
+        t = 0.0
+        for i in range(60):
+            short_pause.append({"word": f"w{i}", "start": t, "end": t + 0.15})
+            t += 0.25
+            if i == 29:
+                t += 0.6  # short pause >0.5s but <1.0s
+        assert _pause_density(long_pause) > _pause_density(short_pause)
+
+    def test_score_bounded(self):
+        timings = _make_word_timings(start=0.0, count=60, wps=4.0, pause_every=3, pause_duration=1.5)
+        score = _pause_density(timings)
+        assert 0.0 <= score <= 1.0
+
+
+class TestSpeakingPaceFitness:
+    def test_none_returns_neutral(self):
+        assert _speaking_pace_fitness(None) == 0.5
+
+    def test_single_word_returns_neutral(self):
+        assert _speaking_pace_fitness([{"start": 0.0, "end": 0.2}]) == 0.5
+
+    def test_optimal_pace_scores_high(self):
+        """4 WPS (optimal teaching pace) → 1.0."""
+        timings = _make_word_timings(start=0.0, count=40, wps=4.0)
+        score = _speaking_pace_fitness(timings)
+        assert score == 1.0, f"4 WPS scored {score}, expected 1.0"
+
+    def test_three_wps_is_sweet_spot_edge(self):
+        timings = _make_word_timings(start=0.0, count=30, wps=3.0)
+        score = _speaking_pace_fitness(timings)
+        assert score == 1.0
+
+    def test_five_wps_is_sweet_spot_edge(self):
+        timings = _make_word_timings(start=0.0, count=50, wps=5.0)
+        score = _speaking_pace_fitness(timings)
+        assert score > 0.95, f"5 WPS scored {score}, expected near 1.0"
+
+    def test_too_slow_scores_lower(self):
+        """1.5 WPS → below sweet spot."""
+        timings = _make_word_timings(start=0.0, count=15, wps=1.5)
+        score = _speaking_pace_fitness(timings)
+        assert 0.4 < score < 0.6, f"1.5 WPS scored {score}, expected ~0.5"
+
+    def test_too_fast_scores_lower(self):
+        """8 WPS → above sweet spot."""
+        timings = _make_word_timings(start=0.0, count=80, wps=8.0)
+        score = _speaking_pace_fitness(timings)
+        assert 0.0 < score < 1.0
+
+    def test_very_fast_scores_zero(self):
+        """10+ WPS → 0."""
+        timings = _make_word_timings(start=0.0, count=110, wps=11.0)
+        score = _speaking_pace_fitness(timings)
+        assert score == 0.0
+
+    def test_zero_wps_scores_zero(self):
+        """Very short duration → neutral."""
+        timings = [{"start": 0.0, "end": 0.01}, {"start": 0.005, "end": 0.015}]
+        score = _speaking_pace_fitness(timings)
+        # Duration ~0.015s → too short → 0.5 (neutral)
+        assert score == 0.5
+
+    def test_score_bounded(self):
+        for wps in [0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]:
+            timings = _make_word_timings(start=0.0, count=max(10, int(wps * 10)), wps=wps)
+            score = _speaking_pace_fitness(timings)
+            assert 0.0 <= score <= 1.0, f"WPS {wps} scored {score} out of bounds"
+
+
+class TestBackwardCompatibility:
+    """Ensure the weight rebalancing doesn't break existing relative orderings."""
+
+    def test_ideal_still_beats_poor(self):
+        ideal = score_moment(**_ideal_moment())
+        poor = score_moment(**_poor_moment())
+        assert ideal["score"] > poor["score"]
+
+    def test_ideal_still_above_threshold(self):
+        result = score_moment(**_ideal_moment())
+        assert result["score"] > 0.6, f"Ideal scored {result['score']}, expected > 0.6"
+
+    def test_poor_still_below_threshold(self):
+        result = score_moment(**_poor_moment())
+        assert result["score"] < 0.45, f"Poor scored {result['score']}, expected < 0.45"
+
+    def test_weights_sum_to_one(self):
+        from backend.pipeline.highlight_scorer import _WEIGHTS
+        assert abs(sum(_WEIGHTS.values()) - 1.0) < 1e-9