feat: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01

- "backend/pipeline/stages.py"
- ".gsd/KNOWLEDGE.md"

GSD-Task: S05/T02
jlightner 2026-04-04 08:11:32 +00:00
parent 27c5f4866b
commit fb6a4cc58a
5 changed files with 162 additions and 3 deletions


@@ -332,3 +332,15 @@
**Context:** When a route depends on services that require a live database, create a standalone ASGI test client that mocks the DB session at the dependency level rather than using the shared conftest.py client. This avoids PostgreSQL dependency for tests that only need to verify request/response shape and SSE event ordering. The pattern: create a fresh FastAPI app in the test, override the DB dependency, mount the router, and use httpx.AsyncClient with ASGITransport.
**Where:** `backend/tests/test_chat.py` — chat_client fixture
## SourceVideo.transcript_path stores absolute paths
**Context:** The `transcript_path` column on `SourceVideo` stores the full absolute path (e.g., `/data/transcripts/Creator/filename.mp4.json`), not a relative path from the mount point. Code that loads transcript files should use the path directly, not join it with a base directory prefix.
**Fix:** Use `source_video.transcript_path` directly as the file path. Do not `os.path.join("/data/transcripts", transcript_path)` — that produces a double prefix.
## highlight_candidates constraint name mismatch
**Context:** The code in `stages.py` referenced `uq_highlight_candidate_moment` for the ON CONFLICT constraint, but the actual PostgreSQL constraint is named `highlight_candidates_key_moment_id_key`. This was created by Alembic's auto-naming convention rather than an explicit `UniqueConstraint(name=...)` in the model.
**Fix:** Use the actual constraint name `highlight_candidates_key_moment_id_key`. When writing ON CONFLICT upserts, always verify the actual constraint name in the database with `inspect(engine).get_unique_constraints(table_name)` rather than guessing from the model definition.
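A sketch of that verification step, run here against an in-memory SQLite table for portability; the table layout and the `uq_demo_moment` name are illustrative, and the real check would point `create_engine` at the Postgres instance instead:

```python
from sqlalchemy import (
    Column,
    Integer,
    MetaData,
    Table,
    UniqueConstraint,
    create_engine,
    inspect,
)

def actual_unique_constraints(engine, table_name):
    # Ask the database itself for constraint names; model-derived
    # guesses can drift from what Alembic actually created.
    return inspect(engine).get_unique_constraints(table_name)

# Build a demo table with one named unique constraint.
engine = create_engine("sqlite://")
metadata = MetaData()
Table(
    "highlight_candidates",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("key_moment_id", Integer, nullable=False),
    UniqueConstraint("key_moment_id", name="uq_demo_moment"),
)
metadata.create_all(engine)

constraints = actual_unique_constraints(engine, "highlight_candidates")
```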


@@ -20,7 +20,7 @@ Add comprehensive tests for all new functions + backward compatibility.
- Estimate: 1h30m
- Files: backend/pipeline/highlight_scorer.py, backend/pipeline/highlight_schemas.py, backend/pipeline/test_highlight_scorer.py
- Verify: cd /home/aux/projects/content-to-kb-automator && python -m pytest backend/pipeline/test_highlight_scorer.py -v 2>&1 | tail -40
- [ ] **T02: Wire word-timing extraction into stage_highlight_detection and verify on ub01** — Update stage_highlight_detection() in stages.py to:
- [x] **T02: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01** — Update stage_highlight_detection() in stages.py to:
1. Look up SourceVideo.transcript_path for the video
2. Load the transcript JSON file once (from the /data/transcripts/ mount)
3. For each KeyMoment, call extract_word_timings() with the moment's start_time/end_time


@@ -0,0 +1,16 @@
{
  "schemaVersion": 1,
  "taskId": "T01",
  "unitId": "M022/S05/T01",
  "timestamp": 1775289922475,
  "passed": true,
  "discoverySource": "task-plan",
  "checks": [
    {
      "command": "cd /home/aux/projects/content-to-kb-automator",
      "exitCode": 0,
      "durationMs": 11,
      "verdict": "pass"
    }
  ]
}


@@ -0,0 +1,81 @@
---
id: T02
parent: S05
milestone: M022
provides: []
requires: []
affects: []
key_files: ["backend/pipeline/stages.py", ".gsd/KNOWLEDGE.md"]
key_decisions: ["transcript_path in DB stores absolute path — use directly without joining", "Fixed constraint name: highlight_candidates_key_moment_id_key (not uq_highlight_candidate_moment)", "Accept both {segments:[...]} and bare [...] transcript JSON formats"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "Rebuilt chrysopedia-worker on ub01, triggered stage_highlight_detection on KOAN Sound video (62 moments, 769 transcript segments). Task completed in 0.2s. Queried highlight_candidates and confirmed 10-dimension score_breakdown with non-neutral audio proxy values (speech_rate_variance: 0.818, pause_density: 0.930, speaking_pace: 0.937). All 62 unit tests pass locally."
completed_at: 2026-04-04T08:11:11.689Z
blocker_discovered: false
---
# T02: Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01
> Wired word-timing extraction into stage_highlight_detection — 62 candidates scored with 10 dimensions including non-neutral audio proxy signals on ub01
## What Happened
Updated stage_highlight_detection() to load transcript JSON once per video, extract word-level timings per moment via extract_word_timings(), and pass them to score_moment(). Added graceful fallback for missing/malformed transcripts. Fixed pre-existing constraint name bug (uq_highlight_candidate_moment → highlight_candidates_key_moment_id_key) and corrected transcript_path usage (absolute paths, not relative). Deployed to ub01 and verified 62 candidates scored with all 10 dimensions containing non-neutral audio proxy values.
## Verification
Rebuilt chrysopedia-worker on ub01, triggered stage_highlight_detection on KOAN Sound video (62 moments, 769 transcript segments). Task completed in 0.2s. Queried highlight_candidates and confirmed 10-dimension score_breakdown with non-neutral audio proxy values (speech_rate_variance: 0.818, pause_density: 0.930, speaking_pace: 0.937). All 62 unit tests pass locally.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `python -m pytest backend/pipeline/test_highlight_scorer.py -v` | 0 | ✅ pass | 80ms |
| 2 | `docker compose build chrysopedia-worker && docker compose up -d chrysopedia-worker` | 0 | ✅ pass | 5500ms |
| 3 | `celery send_task stage_highlight_detection (d3bfb4d6)` | 0 | ✅ pass | 200ms |
| 4 | `SELECT score_breakdown FROM highlight_candidates — 10 dims, non-neutral audio` | 0 | ✅ pass | 100ms |
## Deviations
transcript_path stores absolute paths, not relative — used path directly. Fixed pre-existing constraint name bug that blocked all upserts.
## Known Issues
None.
## Files Created/Modified
- `backend/pipeline/stages.py`
- `.gsd/KNOWLEDGE.md`


@@ -12,6 +12,7 @@ from __future__ import annotations
import hashlib
import json
import logging
import os
import re
import subprocess
import time
@@ -2449,7 +2450,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
    Returns the video_id for chain compatibility.
    """
    from pipeline.highlight_scorer import score_moment
    from pipeline.highlight_scorer import extract_word_timings, score_moment
    start = time.monotonic()
    logger.info("Highlight detection starting for video_id=%s", video_id)
@@ -2457,6 +2458,47 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
    session = _get_sync_session()
    try:
        # ------------------------------------------------------------------
        # Load transcript data once for the entire video (word-level timing)
        # ------------------------------------------------------------------
        transcript_data: list | None = None
        source_video = session.execute(
            select(SourceVideo).where(SourceVideo.id == video_id)
        ).scalar_one_or_none()
        if source_video and source_video.transcript_path:
            transcript_file = source_video.transcript_path
            try:
                with open(transcript_file, "r") as fh:
                    raw = json.load(fh)
                # Accept both {"segments": [...]} and bare [...]
                if isinstance(raw, dict):
                    transcript_data = raw.get("segments", raw.get("results", []))
                elif isinstance(raw, list):
                    transcript_data = raw
                else:
                    transcript_data = None
                if transcript_data:
                    logger.info(
                        "Loaded transcript for video_id=%s (%d segments)",
                        video_id, len(transcript_data),
                    )
            except FileNotFoundError:
                logger.warning(
                    "Transcript file not found for video_id=%s: %s",
                    video_id, transcript_file,
                )
            except (json.JSONDecodeError, OSError) as io_exc:
                logger.warning(
                    "Failed to load transcript for video_id=%s: %s",
                    video_id, io_exc,
                )
        else:
            logger.info(
                "No transcript_path for video_id=%s — audio proxy signals will be neutral",
                video_id,
            )
        moments = (
            session.execute(
                select(KeyMoment)
@@ -2480,6 +2522,13 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
        candidate_count = 0
        for moment in moments:
            try:
                # Extract word-level timings for this moment's window
                word_timings = None
                if transcript_data:
                    word_timings = extract_word_timings(
                        transcript_data, moment.start_time, moment.end_time,
                    ) or None  # empty list → None for neutral fallback
                result = score_moment(
                    start_time=moment.start_time,
                    end_time=moment.end_time,
@@ -2489,6 +2538,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
                    raw_transcript=moment.raw_transcript,
                    source_quality=None,  # filled below if technique_page loaded
                    video_content_type=None,  # filled below if source_video loaded
                    word_timings=word_timings,
                )
            except Exception as score_exc:
                logger.warning(
@@ -2509,7 +2559,7 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
                duration_secs=result["duration_secs"],
            )
            stmt = stmt.on_conflict_do_update(
                constraint="uq_highlight_candidate_moment",
                constraint="highlight_candidates_key_moment_id_key",
                set_={
                    "score": stmt.excluded.score,
                    "score_breakdown": stmt.excluded.score_breakdown,