feat: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01

- "backend/pipeline/stages.py"
- ".gsd/KNOWLEDGE.md"

GSD-Task: S05/T02
jlightner 2026-04-04 08:11:32 +00:00
parent 27c5f4866b
commit fb6a4cc58a
5 changed files with 162 additions and 3 deletions


@@ -332,3 +332,15 @@
**Context:** When a route depends on services that require a live database, create a standalone ASGI test client that mocks the DB session at the dependency level rather than using the shared conftest.py client. This avoids PostgreSQL dependency for tests that only need to verify request/response shape and SSE event ordering. The pattern: create a fresh FastAPI app in the test, override the DB dependency, mount the router, and use httpx.AsyncClient with ASGITransport.
**Where:** `backend/tests/test_chat.py` — chat_client fixture
## SourceVideo.transcript_path stores absolute paths
**Context:** The `transcript_path` column on `SourceVideo` stores the full absolute path (e.g., `/data/transcripts/Creator/filename.mp4.json`), not a relative path from the mount point. Code that loads transcript files should use the path directly, not join it with a base directory prefix.
**Fix:** Use `source_video.transcript_path` directly as the file path. Do not `os.path.join("/data/transcripts", transcript_path)` — that produces a double prefix.
## highlight_candidates constraint name mismatch
**Context:** The code in `stages.py` referenced `uq_highlight_candidate_moment` for the ON CONFLICT constraint, but the actual PostgreSQL constraint is named `highlight_candidates_key_moment_id_key`. This was created by Alembic's auto-naming convention rather than an explicit `UniqueConstraint(name=...)` in the model.
**Fix:** Use the actual constraint name `highlight_candidates_key_moment_id_key`. When writing ON CONFLICT upserts, always verify the actual constraint name in the database with `inspect(engine).get_unique_constraints(table_name)` rather than guessing from the model definition.
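A sketch of that verification step, run here against an in-memory SQLite table for portability; the table layout and the `uq_demo_moment` name are illustrative, and the real check would point `create_engine` at the Postgres instance instead:

```python
from sqlalchemy import (
    Column,
    Integer,
    MetaData,
    Table,
    UniqueConstraint,
    create_engine,
    inspect,
)

def actual_unique_constraints(engine, table_name):
    # Ask the database itself for constraint names; model-derived
    # guesses can drift from what Alembic actually created.
    return inspect(engine).get_unique_constraints(table_name)

# Build a demo table with one named unique constraint.
engine = create_engine("sqlite://")
metadata = MetaData()
Table(
    "highlight_candidates",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("key_moment_id", Integer, nullable=False),
    UniqueConstraint("key_moment_id", name="uq_demo_moment"),
)
metadata.create_all(engine)

constraints = actual_unique_constraints(engine, "highlight_candidates")
```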


@@ -20,7 +20,7 @@ Add comprehensive tests for all new functions + backward compatibility.
- Estimate: 1h30m
- Files: backend/pipeline/highlight_scorer.py, backend/pipeline/highlight_schemas.py, backend/pipeline/test_highlight_scorer.py
- Verify: cd /home/aux/projects/content-to-kb-automator && python -m pytest backend/pipeline/test_highlight_scorer.py -v 2>&1 | tail -40
- [ ] **T02: Wire word-timing extraction into stage_highlight_detection and verify on ub01** — Update stage_highlight_detection() in stages.py to:
- [x] **T02: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01** — Update stage_highlight_detection() in stages.py to:
1. Look up SourceVideo.transcript_path for the video
2. Load the transcript JSON file once (from the /data/transcripts/ mount)
3. For each KeyMoment, call extract_word_timings() with the moment's start_time/end_time


@@ -0,0 +1,16 @@
{
  "schemaVersion": 1,
  "taskId": "T01",
  "unitId": "M022/S05/T01",
  "timestamp": 1775289922475,
  "passed": true,
  "discoverySource": "task-plan",
  "checks": [
    {
      "command": "cd /home/aux/projects/content-to-kb-automator",
      "exitCode": 0,
      "durationMs": 11,
      "verdict": "pass"
    }
  ]
}


@@ -0,0 +1,81 @@
---
id: T02
parent: S05
milestone: M022
provides: []
requires: []
affects: []
key_files: ["backend/pipeline/stages.py", ".gsd/KNOWLEDGE.md"]
key_decisions: ["transcript_path in DB stores absolute path — use directly without joining", "Fixed constraint name: highlight_candidates_key_moment_id_key (not uq_highlight_candidate_moment)", "Accept both {segments:[...]} and bare [...] transcript JSON formats"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "Rebuilt chrysopedia-worker on ub01, triggered stage_highlight_detection on KOAN Sound video (62 moments, 769 transcript segments). Task completed in 0.2s. Queried highlight_candidates and confirmed 10-dimension score_breakdown with non-neutral audio proxy values (speech_rate_variance: 0.818, pause_density: 0.930, speaking_pace: 0.937). All 62 unit tests pass locally."
completed_at: 2026-04-04T08:11:11.689Z
blocker_discovered: false
---
# T02: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01
> Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01
## What Happened
Updated stage_highlight_detection() to load transcript JSON once per video, extract word-level timings per moment via extract_word_timings(), and pass them to score_moment(). Added graceful fallback for missing/malformed transcripts. Fixed pre-existing constraint name bug (uq_highlight_candidate_moment → highlight_candidates_key_moment_id_key) and corrected transcript_path usage (absolute paths, not relative). Deployed to ub01 and verified 62 candidates scored with all 10 dimensions containing non-neutral audio proxy values.
## Verification
Rebuilt chrysopedia-worker on ub01, triggered stage_highlight_detection on KOAN Sound video (62 moments, 769 transcript segments). Task completed in 0.2s. Queried highlight_candidates and confirmed 10-dimension score_breakdown with non-neutral audio proxy values (speech_rate_variance: 0.818, pause_density: 0.930, speaking_pace: 0.937). All 62 unit tests pass locally.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `python -m pytest backend/pipeline/test_highlight_scorer.py -v` | 0 | ✅ pass | 80ms |
| 2 | `docker compose build chrysopedia-worker && docker compose up -d chrysopedia-worker` | 0 | ✅ pass | 5500ms |
| 3 | `celery send_task stage_highlight_detection (d3bfb4d6)` | 0 | ✅ pass | 200ms |
| 4 | `SELECT score_breakdown FROM highlight_candidates — 10 dims, non-neutral audio` | 0 | ✅ pass | 100ms |
## Deviations
transcript_path stores absolute paths, not relative — used path directly. Fixed pre-existing constraint name bug that blocked all upserts.
## Known Issues
None.
## Files Created/Modified
- `backend/pipeline/stages.py`
- `.gsd/KNOWLEDGE.md`


@@ -12,6 +12,7 @@ from __future__ import annotations
import hashlib
import json
import logging
import os
import re
import subprocess
import time
@@ -2449,7 +2450,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
    Returns the video_id for chain compatibility.
    """
    from pipeline.highlight_scorer import score_moment
    from pipeline.highlight_scorer import extract_word_timings, score_moment
    start = time.monotonic()
    logger.info("Highlight detection starting for video_id=%s", video_id)
@@ -2457,6 +2458,47 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
    session = _get_sync_session()
    try:
        # ------------------------------------------------------------------
        # Load transcript data once for the entire video (word-level timing)
        # ------------------------------------------------------------------
        transcript_data: list | None = None
        source_video = session.execute(
            select(SourceVideo).where(SourceVideo.id == video_id)
        ).scalar_one_or_none()
        if source_video and source_video.transcript_path:
            transcript_file = source_video.transcript_path
            try:
                with open(transcript_file, "r") as fh:
                    raw = json.load(fh)
                # Accept both {"segments": [...]} and bare [...]
                if isinstance(raw, dict):
                    transcript_data = raw.get("segments", raw.get("results", []))
                elif isinstance(raw, list):
                    transcript_data = raw
                else:
                    transcript_data = None
                if transcript_data:
                    logger.info(
                        "Loaded transcript for video_id=%s (%d segments)",
                        video_id, len(transcript_data),
                    )
            except FileNotFoundError:
                logger.warning(
                    "Transcript file not found for video_id=%s: %s",
                    video_id, transcript_file,
                )
            except (json.JSONDecodeError, OSError) as io_exc:
                logger.warning(
                    "Failed to load transcript for video_id=%s: %s",
                    video_id, io_exc,
                )
        else:
            logger.info(
                "No transcript_path for video_id=%s — audio proxy signals will be neutral",
                video_id,
            )
        moments = (
            session.execute(
                select(KeyMoment)
@@ -2480,6 +2522,13 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
        candidate_count = 0
        for moment in moments:
            try:
                # Extract word-level timings for this moment's window
                word_timings = None
                if transcript_data:
                    word_timings = extract_word_timings(
                        transcript_data, moment.start_time, moment.end_time,
                    ) or None  # empty list → None for neutral fallback
                result = score_moment(
                    start_time=moment.start_time,
                    end_time=moment.end_time,
@@ -2489,6 +2538,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
                    raw_transcript=moment.raw_transcript,
                    source_quality=None,  # filled below if technique_page loaded
                    video_content_type=None,  # filled below if source_video loaded
                    word_timings=word_timings,
                )
            except Exception as score_exc:
                logger.warning(
@@ -2509,7 +2559,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
                duration_secs=result["duration_secs"],
            )
            stmt = stmt.on_conflict_do_update(
                constraint="uq_highlight_candidate_moment",
                constraint="highlight_candidates_key_moment_id_key",
                set_={
                    "score": stmt.excluded.score,
                    "score_breakdown": stmt.excluded.score_breakdown,