diff --git a/.gsd/KNOWLEDGE.md b/.gsd/KNOWLEDGE.md index 02f10d5..4407bcb 100644 --- a/.gsd/KNOWLEDGE.md +++ b/.gsd/KNOWLEDGE.md @@ -308,3 +308,15 @@ **Context:** LightRAG's `/query/data` endpoint accepts `ll_keywords` (list of strings) that bias retrieval toward matching content without hard filtering. For creator-scoped search, pass the creator's name as a keyword; for domain-scoped, pass the topic category. Combine with post-filtering for strict creator scoping (request 3x results, filter locally by creator_id). **Where:** `backend/search_service.py` — `_creator_scoped_search()`, `_domain_scoped_search()` + +## Named unique constraints for Celery upsert targeting + +**Context:** When a Celery task needs idempotent writes (re-running on same input updates rather than duplicates), use a named unique constraint on the natural key column and target it with `INSERT ... ON CONFLICT ON CONSTRAINT DO UPDATE`. The named constraint approach is more explicit than targeting the column directly and works reliably with SQLAlchemy's `insert().on_conflict_do_update(constraint=...)`. + +**Where:** `backend/pipeline/stages.py` — `stage_highlight_detection`, constraint `uq_highlight_candidate_moment` on `key_moment_id` + +## Pure-function scoring + Celery task separation + +**Context:** Keep scoring logic as a pure function (no DB, no side effects) in a separate module from the Celery task that calls it. This enables unit testing with 28 tests running in 0.03s (no DB fixtures needed). The Celery task handles DB reads, calls the pure function, and writes results. Use lazy imports inside the Celery task function body to avoid circular imports at module load time. + +**Where:** `backend/pipeline/highlight_scorer.py` (pure), `backend/pipeline/stages.py` (Celery wiring) diff --git a/.gsd/PROJECT.md b/.gsd/PROJECT.md index d6daa93..1cf8448 100644 --- a/.gsd/PROJECT.md +++ b/.gsd/PROJECT.md @@ -60,6 +60,8 @@ Nineteen milestones complete. Phase 2 foundations are in place. 
M019 delivered c - **Creator dashboard shell** — Protected /creator/* routes with sidebar nav (Dashboard, Settings). Profile edit and password change forms. Code-split with React.lazy. - **Consent infrastructure** — Per-video consent toggles (allow_embed, allow_search, allow_kb, allow_download, allow_remix) with versioned audit trail. VideoConsent and ConsentAuditLog models with Alembic migration 017. 5 API endpoints with ownership verification and admin bypass. +- **Highlight detection v1** — Heuristic scoring engine with 7 weighted dimensions (duration fitness, content type, specificity density, plugin richness, transcript energy, source quality, video type) scores KeyMoment data into ranked highlight candidates stored in `highlight_candidates` table. Celery task for batch processing, 4 admin API endpoints for triggering detection and listing/inspecting candidates. 28 unit tests. + - **Web media player** — Custom video player page at `/watch/:videoId` with HLS playback (lazy-loaded hls.js), speed controls (0.5–2x), volume, seek, fullscreen, keyboard shortcuts, and synchronized transcript sidebar with binary search active segment detection and auto-scroll. Technique page key moment timestamps link directly to the watch page. Video + transcript API endpoints with creator info. - **LightRAG graph-enhanced retrieval** — Running as chrysopedia-lightrag service on port 9621. Uses DGX Sparks for LLM (entity extraction, summarization), Ollama nomic-embed-text for embeddings, Qdrant for vector storage, NetworkX for graph storage. 12 music production entity types configured. Exposed via REST API at /documents/text (ingest) and /query (retrieval with local/global/mix/hybrid modes). @@ -99,3 +101,4 @@ Nineteen milestones complete. Phase 2 foundations are in place. 
M019 delivered c | M018 | Phase 2 Research & Documentation — Site Audit and Forgejo Wiki Bootstrap | ✅ Complete | | M019 | Foundations — Auth, Consent & LightRAG | ✅ Complete | | M020 | Core Experiences — Player, Impersonation & Knowledge Routing | 🔄 Active | +| M021 | Intelligence Online — Chat, Chapters & Search Cutover | 🔄 Active | diff --git a/.gsd/milestones/M021/M021-ROADMAP.md b/.gsd/milestones/M021/M021-ROADMAP.md index 31d8458..5534dd1 100644 --- a/.gsd/milestones/M021/M021-ROADMAP.md +++ b/.gsd/milestones/M021/M021-ROADMAP.md @@ -9,7 +9,7 @@ LightRAG becomes the primary search engine. Chat engine goes live (encyclopedic | S01 | [B] LightRAG Search Cutover | high | — | ✅ | Primary search backed by LightRAG. Old system remains as automatic fallback. | | S02 | [B] Creator-Scoped Retrieval Cascade | medium | S01 | ✅ | Question on Keota's profile first checks Keota's content, then sound design domain, then full KB, then graceful fallback | | S03 | [B] Chat Engine MVP | high | S02 | ✅ | User asks a question, receives a streamed response with citations linking to source videos and technique pages | -| S04 | [B] Highlight Detection v1 | medium | — | ⬜ | Scored highlight candidates generated from existing pipeline data for a sample of videos | +| S04 | [B] Highlight Detection v1 | medium | — | ✅ | Scored highlight candidates generated from existing pipeline data for a sample of videos | | S05 | [A] Audio Mode + Chapter Markers | medium | — | ⬜ | Media player with waveform visualization in audio mode and chapter markers on the timeline | | S06 | [A] Auto-Chapters Review UI | low | — | ⬜ | Creator reviews detected chapters: drag boundaries, rename, reorder, approve for publication | | S07 | [A] Impersonation Polish + Write Mode | low | — | ⬜ | Impersonation write mode with confirmation modal. Audit log admin view shows all sessions. 
| diff --git a/.gsd/milestones/M021/slices/S04/S04-SUMMARY.md b/.gsd/milestones/M021/slices/S04/S04-SUMMARY.md new file mode 100644 index 0000000..9114bc2 --- /dev/null +++ b/.gsd/milestones/M021/slices/S04/S04-SUMMARY.md @@ -0,0 +1,108 @@ +--- +id: S04 +parent: M021 +milestone: M021 +provides: + - highlight_candidates table with scored KeyMoment data + - Admin API for listing/inspecting highlight candidates + - Celery task stage_highlight_detection for batch scoring + - HighlightCandidateResponse Pydantic schema for downstream consumers +requires: + [] +affects: + - S08 +key_files: + - backend/models.py + - alembic/versions/019_add_highlight_candidates.py + - backend/pipeline/highlight_schemas.py + - backend/pipeline/highlight_scorer.py + - backend/pipeline/test_highlight_scorer.py + - backend/pipeline/stages.py + - backend/routers/highlights.py + - backend/main.py +key_decisions: + - UNIQUE constraint on key_moment_id enforced at both ORM and named constraint level for upsert targeting + - Duration fitness uses piecewise linear rather than Gaussian bell curve for predictability + - Lazy import of score_moment inside Celery task to avoid circular imports at module load + - Upsert uses named constraint uq_highlight_candidate_moment for ON CONFLICT targeting + - 7 scoring dimensions mapped to HighlightScoreBreakdown schema fields for API/DB consistency +patterns_established: + - Heuristic scoring as pure function (no DB, no side effects) with separate Celery task for DB integration — enables easy unit testing + - Named unique constraint for upsert targeting pattern (uq_highlight_candidate_moment) — reusable for future pipeline stages that need idempotent writes +observability_surfaces: + - pipeline_events rows for highlight_detection stage (start/complete/error with candidate count in payload) + - GET /api/v1/admin/highlights/candidates — paginated list sorted by score desc + - GET /api/v1/admin/highlights/candidates/{id} — detail with full score_breakdown 
+drill_down_paths: + - .gsd/milestones/M021/slices/S04/tasks/T01-SUMMARY.md + - .gsd/milestones/M021/slices/S04/tasks/T02-SUMMARY.md + - .gsd/milestones/M021/slices/S04/tasks/T03-SUMMARY.md +duration: "" +verification_result: passed +completed_at: 2026-04-04T05:37:31.104Z +blocker_discovered: false +--- + +# S04: [B] Highlight Detection v1 + +**Heuristic scoring engine scores KeyMoment data into ranked highlight candidates via 7 weighted dimensions, stored in a new highlight_candidates table, exposed through 4 admin API endpoints, and triggerable via Celery task.** + +## What Happened + +Built the complete highlight detection pipeline in three tasks: + +**T01 — Data Foundation.** Added `HighlightStatus` enum (candidate/approved/rejected) and `HighlightCandidate` ORM model to models.py with UUID PK, unique FK to key_moments, score (float 0-1), score_breakdown (JSONB), duration_secs, status, and timestamps. Alembic migration 019 creates the table with indexes on source_video_id, score DESC, and status. Created Pydantic schemas: `HighlightScoreBreakdown` (7 float fields), `HighlightCandidateResponse`, and `HighlightBatchResult`. + +**T02 — Scoring Engine.** Implemented `score_moment()` pure function in highlight_scorer.py with 7 weighted dimensions: duration_fitness (0.25, piecewise-linear ramp peaking at 30-60s), content_type_weight (0.20), specificity_density (0.20, regex unit/ratio counting), plugin_richness (0.10), transcript_energy (0.10, teaching-phrase detection), source_quality_weight (0.10), video_type_weight (0.05). Weights sum to 1.0. All 28 unit tests pass covering ideal/mediocre/poor ordering, edge cases (None/empty fields), and per-dimension behavior. + +**T03 — Runtime Wiring.** Added `stage_highlight_detection` Celery task following existing patterns (bind=True, max_retries=3, _get_sync_session, _emit_event start/complete/error). Task loads KeyMoments for a video, scores each, and bulk-upserts via INSERT ON CONFLICT on the named constraint. 
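The bulk-upsert described in T03 might look roughly like the following sketch. Only the constraint name `uq_highlight_candidate_moment` comes from this summary; the table definition and column set are illustrative stand-ins for the real model.

```python
from sqlalchemy import Column, Float, MetaData, String, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import insert

metadata = MetaData()

# Illustrative stand-in for the real highlight_candidates table.
highlight_candidates = Table(
    "highlight_candidates",
    metadata,
    Column("key_moment_id", String, primary_key=True),
    Column("score", Float),
)

def build_upsert(rows):
    """INSERT ... ON CONFLICT ON CONSTRAINT uq_highlight_candidate_moment DO UPDATE.

    Re-running on the same key_moment_id updates the row instead of duplicating it.
    """
    stmt = insert(highlight_candidates).values(rows)
    return stmt.on_conflict_do_update(
        constraint="uq_highlight_candidate_moment",  # target the named UNIQUE constraint
        set_={"score": stmt.excluded.score},
    )

# Compile against the PostgreSQL dialect to see the emitted ON CONFLICT clause.
sql = str(
    build_upsert([{"key_moment_id": "km-1", "score": 0.8}])
    .compile(dialect=postgresql.dialect())
)
```

Targeting the named constraint (rather than `index_elements=["key_moment_id"]`) keeps the upsert explicit about which uniqueness guarantee it relies on.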
Created highlights router with 4 endpoints: POST detect/{video_id}, POST detect-all, GET candidates (paginated, score desc), GET candidates/{id}. Router registered in main.py. + +## Verification + +All 7 slice-level verification checks pass: +1. Model import (HighlightCandidate, HighlightStatus) — OK +2. Schema import (HighlightCandidateResponse, HighlightScoreBreakdown, HighlightBatchResult) — OK +3. Migration revision resolves to 019_add_highlight_candidates — OK +4. 28/28 scorer unit tests pass in 0.03s +5. Celery task import (stage_highlight_detection) — OK +6. Router import (highlights.router) — OK +7. Router registration confirmed in main.py app routes + +## Requirements Advanced + +None. + +## Requirements Validated + +None. + +## New Requirements Surfaced + +None. + +## Requirements Invalidated or Re-scoped + +None. + +## Deviations + +None. + +## Known Limitations + +Scoring is heuristic-only — no ML model or user feedback loop yet. Duration fitness uses piecewise linear (not Gaussian) for predictability. No integration tests against a live database (unit tests use pure functions only). + +## Follow-ups + +Run migration 019 on ub01 production database. Trigger detect-all endpoint on existing videos to populate initial candidates. Consider adding feedback loop (approved/rejected status) to tune weights in a future milestone. 
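The pure-function scoring split described in T02 (and the piecewise-linear duration fitness noted under Known Limitations) can be sketched as follows. The weight values and dimension names come from this summary; the helper body and the neutral 0.5 stubs for the other six dimensions are simplifications, not the real heuristics.

```python
# Weighted sum of seven dimension scores, each in [0, 1]. No DB access,
# no side effects — unit-testable without fixtures.
WEIGHTS = {
    "duration_fitness": 0.25,
    "content_type_weight": 0.20,
    "specificity_density": 0.20,
    "plugin_richness": 0.10,
    "transcript_energy": 0.10,
    "source_quality_weight": 0.10,
    "video_type_weight": 0.05,
}

def duration_fitness(secs):
    """Piecewise linear: ramps up to 1.0 across 0-30s, flat 30-60s, decays after."""
    if secs is None or secs <= 0:
        return 0.0
    if secs < 30:
        return secs / 30
    if secs <= 60:
        return 1.0
    return max(0.0, 1.0 - (secs - 60) / 240)

def score_moment(moment):
    """Score a KeyMoment-like dict; returns (score, per-dimension breakdown)."""
    dims = {
        "duration_fitness": duration_fitness(moment.get("duration_secs")),
        # Remaining dimensions stubbed at a neutral 0.5 for illustration.
        **{k: 0.5 for k in WEIGHTS if k != "duration_fitness"},
    }
    score = sum(WEIGHTS[k] * v for k, v in dims.items())
    return score, dims
```

Because every dimension is bounded in [0, 1] and the weights sum to 1.0, the composite score is also bounded in [0, 1], which is what the `score` column and API contract assume.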
+ +## Files Created/Modified + +- `backend/models.py` — Added HighlightStatus enum and HighlightCandidate ORM model +- `alembic/versions/019_add_highlight_candidates.py` — Migration 019: highlight_candidates table with indexes +- `backend/pipeline/highlight_schemas.py` — Pydantic schemas for scoring breakdown, API response, batch result +- `backend/pipeline/highlight_scorer.py` — Pure-function scoring engine with 7 weighted dimensions +- `backend/pipeline/test_highlight_scorer.py` — 28 unit tests for scoring engine +- `backend/pipeline/stages.py` — Added stage_highlight_detection Celery task +- `backend/routers/highlights.py` — 4 admin API endpoints for highlight detection +- `backend/main.py` — Registered highlights router diff --git a/.gsd/milestones/M021/slices/S04/S04-UAT.md b/.gsd/milestones/M021/slices/S04/S04-UAT.md new file mode 100644 index 0000000..3c3b06d --- /dev/null +++ b/.gsd/milestones/M021/slices/S04/S04-UAT.md @@ -0,0 +1,64 @@ +# S04: [B] Highlight Detection v1 — UAT + +**Milestone:** M021 +**Written:** 2026-04-04T05:37:31.104Z + +## UAT: Highlight Detection v1 + +### Preconditions +- Chrysopedia API running on ub01:8096 +- Migration 019 applied (`docker exec chrysopedia-api alembic upgrade head`) +- At least one source video with extracted KeyMoments in the database + +### Test 1: Model & Schema Imports +1. Run: `python -c "from backend.models import HighlightCandidate, HighlightStatus; print(HighlightStatus.candidate.value, HighlightStatus.approved.value, HighlightStatus.rejected.value)"` +2. **Expected:** Prints `candidate approved rejected` +3. Run: `python -c "from backend.pipeline.highlight_schemas import HighlightScoreBreakdown; print(list(HighlightScoreBreakdown.model_fields.keys()))"` +4. **Expected:** 7 field names printed + +### Test 2: Scoring Engine Ordering +1. Run: `python -m pytest backend/pipeline/test_highlight_scorer.py::TestScoreMoment::test_ordering_is_sensible -v` +2. 
**Expected:** PASSED — ideal (45s technique, 3 plugins) > mediocre > poor (300s reasoning, 0 plugins) + +### Test 3: Scoring Edge Cases +1. Run: `python -m pytest backend/pipeline/test_highlight_scorer.py::TestScoreMoment::test_missing_optional_fields -v` +2. **Expected:** PASSED — None transcript and None plugins don't crash, score in [0,1] + +### Test 4: Full Test Suite +1. Run: `python -m pytest backend/pipeline/test_highlight_scorer.py -v` +2. **Expected:** 28/28 tests pass + +### Test 5: Trigger Detection for Single Video +1. Pick a video_id from the database: `curl http://ub01:8096/api/v1/admin/pipeline/videos?limit=1` +2. POST: `curl -X POST http://ub01:8096/api/v1/admin/highlights/detect/{video_id}` +3. **Expected:** 200 with `{"task_id": "..."}` response +4. Wait 10s for worker to process +5. GET: `curl http://ub01:8096/api/v1/admin/highlights/candidates?limit=5` +6. **Expected:** Array of candidates with scores in [0,1], score_breakdown with 7 dimensions + +### Test 6: Trigger Detection for All Videos +1. POST: `curl -X POST http://ub01:8096/api/v1/admin/highlights/detect-all` +2. **Expected:** 200 with count of dispatched tasks +3. Wait 30s, then GET candidates endpoint +4. **Expected:** Candidates from multiple videos, sorted by score desc + +### Test 7: Candidate Detail +1. From Test 5/6 results, pick a candidate_id +2. GET: `curl http://ub01:8096/api/v1/admin/highlights/candidates/{candidate_id}` +3. **Expected:** Full candidate with score_breakdown showing all 7 dimension scores + +### Test 8: Idempotent Re-run +1. Re-trigger detection for the same video_id as Test 5 +2. Wait for completion +3. GET candidates for that video +4. **Expected:** Same number of candidates (upsert, not duplicate). Scores may differ only if data changed. + +### Test 9: 404 on Missing Candidate +1. GET: `curl http://ub01:8096/api/v1/admin/highlights/candidates/00000000-0000-0000-0000-000000000000` +2. **Expected:** 404 response + +### Test 10: Pagination +1. 
GET: `curl "http://ub01:8096/api/v1/admin/highlights/candidates?skip=0&limit=2"` +2. **Expected:** At most 2 candidates returned +3. GET with skip=2: `curl "http://ub01:8096/api/v1/admin/highlights/candidates?skip=2&limit=2"` +4. **Expected:** Next page of candidates (different from first page if enough exist) diff --git a/.gsd/milestones/M021/slices/S04/tasks/T03-VERIFY.json b/.gsd/milestones/M021/slices/S04/tasks/T03-VERIFY.json new file mode 100644 index 0000000..b9a2117 --- /dev/null +++ b/.gsd/milestones/M021/slices/S04/tasks/T03-VERIFY.json @@ -0,0 +1,9 @@ +{ + "schemaVersion": 1, + "taskId": "T03", + "unitId": "M021/S04/T03", + "timestamp": 1775280970845, + "passed": true, + "discoverySource": "none", + "checks": [] +} diff --git a/.gsd/milestones/M021/slices/S05/S05-PLAN.md b/.gsd/milestones/M021/slices/S05/S05-PLAN.md index 9e8e727..463fdd8 100644 --- a/.gsd/milestones/M021/slices/S05/S05-PLAN.md +++ b/.gsd/milestones/M021/slices/S05/S05-PLAN.md @@ -1,6 +1,126 @@ # S05: [A] Audio Mode + Chapter Markers -**Goal:** Add audio-only waveform mode and chapter marker timeline UI to the media player +**Goal:** Media player renders an audio waveform (via wavesurfer.js) when no video URL is available, and chapter markers derived from KeyMoment data appear on the seek bar timeline. **Demo:** After this: Media player with waveform visualization in audio mode and chapter markers on the timeline ## Tasks +- [x] **T01: Added media streaming endpoint and chapters endpoint to videos router, plus fetchChapters frontend API client** — Add two new endpoints to `backend/routers/videos.py`: + +1. **`GET /videos/{video_id}/stream`** — Serves the media file at `SourceVideo.file_path` via `FileResponse`. Validates the video exists and `file_path` is set. Returns 404 if video not found or no file. Guesses media type from file extension (audio/wav, audio/mpeg, video/mp4, etc.). + +2. 
**`GET /videos/{video_id}/chapters`** — Returns KeyMoment records for the video as chapter markers, sorted by `start_time`. Uses a new `ChapterMarkerRead` schema with fields: `id`, `title`, `start_time`, `end_time`, `content_type`. + +Also adds `fetchChapters()` to the frontend API client so downstream tasks can consume it. + +## Steps + +1. In `backend/schemas.py`, add `ChapterMarkerRead` Pydantic model (id: UUID, title: str, start_time: float, end_time: float, content_type: str) with `model_config = ConfigDict(from_attributes=True)`. Add `ChaptersResponse` with `video_id: UUID` and `chapters: list[ChapterMarkerRead]`. +2. In `backend/routers/videos.py`, add `GET /videos/{video_id}/stream` endpoint: query `SourceVideo` by id, check `file_path` exists and is a real file on disk, return `FileResponse(video.file_path, media_type=guessed_type)`. Import `os.path` and `mimetypes`. Return 404 with detail if video not found or file missing. +3. In `backend/routers/videos.py`, add `GET /videos/{video_id}/chapters` endpoint: query `KeyMoment` records where `source_video_id == video_id`, order by `start_time`. Verify video exists first (404 if not). Return `ChaptersResponse`. +4. In `frontend/src/api/videos.ts`, add `Chapter` interface (id, title, start_time, end_time, content_type) and `ChaptersResponse` interface. Add `fetchChapters(videoId: string)` function following the `fetchTranscript` pattern. +5. Verify: run `python -c "from routers.videos import router"` in backend dir to confirm imports compile. 
+ +## Must-Haves + +- [ ] `ChapterMarkerRead` schema in `backend/schemas.py` +- [ ] Stream endpoint serves file from `file_path` with correct content-type +- [ ] Stream endpoint returns 404 when video not found or file_path missing/invalid +- [ ] Chapters endpoint returns KeyMoments sorted by start_time +- [ ] `fetchChapters()` added to frontend API client + +## Verification + +- `cd backend && python -c "from routers.videos import router; print('ok')"` exits 0 +- `grep -q 'def get_video_chapters' backend/routers/videos.py` confirms endpoint exists +- `grep -q 'def stream_video' backend/routers/videos.py` confirms stream endpoint exists +- `grep -q 'fetchChapters' frontend/src/api/videos.ts` confirms API client function exists + - Estimate: 45m + - Files: backend/routers/videos.py, backend/schemas.py, frontend/src/api/videos.ts + - Verify: cd /home/aux/projects/content-to-kb-automator/backend && python -c "from routers.videos import router; print('ok')" && grep -q 'fetchChapters' /home/aux/projects/content-to-kb-automator/frontend/src/api/videos.ts +- [ ] **T02: Audio waveform component with wavesurfer.js + WatchPage integration** — Install wavesurfer.js, create the AudioWaveform component, widen useMediaSync to support HTMLMediaElement, and wire the waveform into WatchPage as a replacement for VideoPlayer when no video URL is available. + +## Steps + +1. Install wavesurfer.js: `cd frontend && npm install wavesurfer.js` +2. In `frontend/src/hooks/useMediaSync.ts`, widen the ref type from `HTMLVideoElement` to `HTMLMediaElement`. Change `useRef<HTMLVideoElement>` to `useRef<HTMLMediaElement>`. Update the `MediaSyncState` interface's `videoRef` type to `React.RefObject<HTMLMediaElement>`. All HTMLMediaElement APIs (play, pause, currentTime, volume, etc.) are identical — no behavioral changes needed. +3. 
In `frontend/src/components/VideoPlayer.tsx`, no type change is needed: the file already casts `videoRef as React.RefObject<HTMLVideoElement>` on line ~108, and that cast remains valid since HTMLVideoElement extends HTMLMediaElement. +4. Create `frontend/src/components/AudioWaveform.tsx`: + - Props: `mediaSync: MediaSyncState`, `src: string` (the stream URL) + - Render a hidden `