diff --git a/API-Surface.md b/API-Surface.md
index 3b76d53..f09f61c 100644
--- a/API-Surface.md
+++ b/API-Surface.md
@@ -1,6 +1,6 @@
 # API Surface
 
-50 API endpoints grouped by domain. All served by FastAPI under `/api/v1/`.
+61 API endpoints grouped by domain. All served by FastAPI under `/api/v1/`.
 
 ## Public Endpoints (10)
 
@@ -26,11 +26,19 @@ title, slug, topic_category, topic_tags, summary, body_sections, body_sections_f
 | Method | Path | Response Shape | Notes |
 |--------|------|---------------|-------|
 | GET | `/api/v1/creators?sort=&genre=` | `{items, total, offset, limit}` | sort: random\|alpha\|views |
-| GET | `/api/v1/creators/{slug}` | 16-field object | Includes genre_breakdown, techniques, social_links |
+| GET | `/api/v1/creators/{slug}` | 16-field object | Includes genre_breakdown, techniques, social_links, follower_count, personality_profile |
 | GET | `/api/v1/topics` | `[{name, description, sub_topics}]` | ⚠️ Bare list (not paginated) |
 | GET | `/api/v1/topics/{cat}/{sub}` | `{items, total, offset, limit}` | Subtopic techniques |
 | GET | `/api/v1/topics/{cat}` | `{items, total, offset, limit}` | Category techniques |
 
+## Chat Endpoint (1)
+
+| Method | Path | Auth | Purpose |
+|--------|------|------|---------|
+| POST | `/api/v1/chat` | None | Streaming Q&A — SSE response with sources, tokens, done event. See [[Chat-Engine]] |
+
+**Request fields:** `query` (required, 1-1000 chars), `creator` (optional slug/UUID), `conversation_id` (optional UUID for multi-turn threading)
+
 ## Auth Endpoints (4)
 
 All under prefix `/api/v1/auth/`. JWT-protected except registration and login.
@@ -42,6 +50,28 @@ All under prefix `/api/v1/auth/`. JWT-protected except registration and login.
 | GET | `/auth/me` | Bearer JWT | Current user profile. Returns UserResponse. |
 | PUT | `/auth/me` | Bearer JWT | Update display_name and/or password (requires current_password for password changes). Returns UserResponse. |
 
+## Follow Endpoints (4) — M022/S02
+
+All require Bearer JWT.
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/api/v1/follows/{creator_id}` | Follow a creator (idempotent via INSERT ON CONFLICT DO NOTHING) |
+| DELETE | `/api/v1/follows/{creator_id}` | Unfollow a creator |
+| GET | `/api/v1/follows/{creator_id}/status` | Check if current user follows this creator |
+| GET | `/api/v1/follows/me` | List creators the current user follows |
+
+## Creator Highlight Endpoints (4) — M022/S01
+
+Creator-scoped highlight review. Requires Bearer JWT with creator ownership.
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| GET | `/api/v1/creator/highlights` | List highlights for authenticated creator (status/shorts_only filters) |
+| GET | `/api/v1/creator/highlights/{id}` | Detail with score_breakdown and key_moment |
+| PATCH | `/api/v1/creator/highlights/{id}/status` | Update status (approve/reject) |
+| PATCH | `/api/v1/creator/highlights/{id}/trim` | Update trim_start/trim_end |
+
 ## Consent Endpoints (5)
 
 All under prefix `/api/v1/consent/`. All require Bearer JWT.
@@ -54,16 +84,6 @@ All under prefix `/api/v1/consent/`. All require Bearer JWT.
 | GET | `/consent/videos/{video_id}/history` | Creator (owner) or Admin | Versioned audit trail of consent changes for a video. |
 | GET | `/consent/admin/summary` | Admin only | Aggregate consent flag counts across all videos. |
 
-### Consent Fields
-
-Three boolean consent flags per video, each independently toggleable:
-
-| Field | Default | Meaning |
-|-------|---------|---------|
-| `kb_inclusion` | false | Allow indexing into knowledge base |
-| `training_usage` | false | Allow use for model training |
-| `public_display` | true | Allow public display on site |
-
 ## Report Endpoints (3)
 
 | Method | Path | Purpose |
@@ -72,7 +92,9 @@ Three boolean consent flags per video, each independently toggleable:
 | GET | `/api/v1/admin/reports` | List all reports |
 | PATCH | `/api/v1/admin/reports/{id}` | Update report status |
 
-## Pipeline Admin Endpoints (20+)
+## Admin Endpoints
+
+### Pipeline Admin (20+)
 
 All under prefix `/api/v1/admin/pipeline/`.
 
@@ -100,52 +122,20 @@ All under prefix `/api/v1/admin/pipeline/`.
 | POST | `/admin/pipeline/creator-profile/{creator_id}` | Update creator profile |
 | POST | `/admin/pipeline/avatar-fetch/{creator_id}` | Fetch creator avatar |
 
-## Other Endpoints (2)
+### Highlight Admin (4)
 
-| Method | Path | Notes |
-|--------|------|-------|
-| POST | `/api/v1/ingest` | Transcript upload |
-| GET | `/api/v1/videos` | ⚠️ Bare list (not paginated) |
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/admin/highlights/detect/{video_id}` | Score all KeyMoments for a video |
+| POST | `/admin/highlights/detect-all` | Score all videos |
+| GET | `/admin/highlights/candidates` | Paginated candidate list |
+| GET | `/admin/highlights/candidates/{id}` | Single candidate with score_breakdown |
 
-## Response Conventions
+### Personality Extraction (1) — M022/S06
 
-**Standard paginated response:**
-```json
-{
-  "items": [...],
-  "total": 83,
-  "offset": 0,
-  "limit": 20
-}
-```
-
-**Known inconsistencies:**
-- `GET /topics` returns bare list instead of paginated dict
-- `GET /videos` returns bare list instead of paginated dict
-- Search uses `items` key (not `results`)
-- `/techniques/random` returns JSON `{slug}` (not HTTP redirect)
-
-**New endpoints should follow the `{items, total, offset, limit}` paginated pattern.**
-
-## Authentication
-
-JWT-based authentication added in M019. See [[Authentication]] for full details.
-
-- **Public endpoints** (search, browse, techniques) require no auth
-- **Auth endpoints** (`/auth/register`, `/auth/login`) are open; `/auth/me` requires Bearer JWT
-- **Consent endpoints** require Bearer JWT with ownership verification (creator must own the video, or be admin)
-- **Admin endpoints** (`/admin/*`) are accessible to anyone with network access (auth planned for future milestone)
-
----
-
-*See also: [[Architecture]], [[Data-Model]], [[Frontend]], [[Authentication]]*
-utput` | Delete all pipeline output |
-| POST | `/admin/pipeline/optimize-prompt` | Trigger prompt optimization |
-| POST | `/admin/pipeline/reindex-all` | Rebuild Qdrant index |
-| GET | `/admin/pipeline/worker-status` | Celery worker health |
-| GET | `/admin/pipeline/recent-activity` | Recent pipeline events |
-| POST | `/admin/pipeline/creator-profile/{creator_id}` | Update creator profile |
-| POST | `/admin/pipeline/avatar-fetch/{creator_id}` | Fetch creator avatar |
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/api/v1/admin/creators/{slug}/extract-profile` | Queue personality profile extraction task |
 
 ## Other Endpoints (2)
 
@@ -178,9 +168,11 @@ utput` | Delete all pipeline output |
 JWT-based authentication added in M019. See [[Authentication]] for full details.

-- **Public endpoints** (search, browse, techniques) require no auth
+- **Public endpoints** (search, browse, techniques, chat) require no auth
 - **Auth endpoints** (`/auth/register`, `/auth/login`) are open; `/auth/me` requires Bearer JWT
-- **Consent endpoints** require Bearer JWT with ownership verification (creator must own the video, or be admin)
+- **Follow endpoints** require Bearer JWT
+- **Creator endpoints** (`/creator/*`) require Bearer JWT with creator ownership verification
+- **Consent endpoints** require Bearer JWT with ownership verification
 - **Admin endpoints** (`/admin/*`) are accessible to anyone with network access (auth planned for future milestone)
 
 ---
 
diff --git a/Chat-Engine.md b/Chat-Engine.md
index 2d9a85e..8353f9a 100644
--- a/Chat-Engine.md
+++ b/Chat-Engine.md
@@ -1,29 +1,33 @@
 # Chat Engine
 
-Streaming question-answering interface backed by LightRAG retrieval and LLM completion. Added in M021/S03.
+Streaming question-answering interface backed by LightRAG retrieval and LLM completion. Added in M021/S03, expanded with multi-turn memory in M022/S04 and chat widget in M022/S03.
 
 ## Architecture
 
 ```
-User types question in ChatPage
+User types question in ChatPage or ChatWidget
     │
     ▼
-POST /api/v1/chat { query: "...", creator?: "..." }
+POST /api/v1/chat { query, creator?, conversation_id? }
     │
     ▼
-ChatService.stream(query, creator?)
+ChatService.stream(query, creator?, conversation_id?)
     │
-    ├─ 1. Retrieve: SearchService.search(query, creator)
+    ├─ 1. Load history: Redis chrysopedia:chat:{conversation_id}
+    │
+    ├─ 2. Retrieve: SearchService.search(query, creator)
     │    └─ Uses 4-tier cascade if creator provided (see [[Search-Retrieval]])
     │
-    ├─ 2. Prompt: Assemble numbered context block into encyclopedic system prompt
+    ├─ 3. Prompt: System prompt + history + numbered context + user message
     │    └─ Sources formatted as [1] Title — Summary for citation mapping
     │
-    ├─ 3. Stream: openai.AsyncOpenAI with stream=True
+    ├─ 4. Stream: openai.AsyncOpenAI with stream=True
     │    └─ Tokens streamed as SSE events in real-time
     │
+    ├─ 5. Save history: Append user message + assistant response to Redis
+    │
     ▼
-SSE response → ChatPage renders tokens + citation links
+SSE response → ChatPage/ChatWidget renders tokens + citation links
 ```
 
 ## SSE Protocol
@@ -34,10 +38,28 @@ The chat endpoint returns a `text/event-stream` response with four event types i
 |-------|---------|------|
 | `sources` | `[{title, slug, creator_name, summary}]` | First — citation metadata for link rendering |
 | `token` | `string` (text chunk) | Repeated — streamed LLM completion tokens |
-| `done` | `{cascade_tier: "creator"\|"domain"\|"global"\|"none"\|""}` | Once — signals completion, includes which retrieval tier answered |
+| `done` | `{cascade_tier, conversation_id}` | Once — signals completion, includes retrieval tier and conversation ID |
 | `error` | `{message: string}` | On failure — emitted if LLM errors mid-stream |
 
-The `cascade_tier` in the `done` event reveals which tier of the retrieval cascade served the context (see [[Search-Retrieval]]).
+The `cascade_tier` in the `done` event reveals which tier of the retrieval cascade served the context. The `conversation_id` enables the frontend to thread follow-up messages.
+
+## Multi-Turn Conversation Memory (M022/S04)
+
+### Redis Storage
+
+- **Key pattern:** `chrysopedia:chat:{conversation_id}`
+- **Format:** Single JSON string containing a list of `{role, content}` message dicts
+- **TTL:** 1 hour, refreshed on each interaction
+- **Cap:** 10 turn pairs (20 messages) — oldest pairs trimmed when exceeded
+
+### Conversation Flow
+
+1. Client sends `conversation_id` in POST body (or omits for new conversation)
+2. Server auto-generates UUID when `conversation_id` is omitted
+3. History loaded from Redis and injected between system prompt and user message
+4. Assistant response accumulated during streaming
+5. User message + assistant response appended to history in Redis
+6. `conversation_id` returned in SSE `done` event for threading
 
 ## Citation Format
 
@@ -45,7 +67,7 @@ The LLM is instructed to reference sources using numbered citations `[N]` in its
 
 - `[1]` → links to `/techniques/:slug` for the corresponding source
 - Multiple citations supported: `[1][3]` or `[1,3]`
-- Citation regex: `/\[(\d+)\]/g` parsed locally in ChatPage
+- Citation regex: `/\[(\d+)\]/g` parsed locally in both ChatPage and ChatWidget
 
 ## API Endpoint
 
@@ -55,6 +77,7 @@ The LLM is instructed to reference sources using numbered citations `[N]` in its
 |-------|------|----------|------------|
 | `query` | string | Yes | 1–1000 characters |
 | `creator` | string | No | Creator UUID or slug for scoped retrieval |
+| `conversation_id` | string | No | UUID for multi-turn threading. Auto-generated if omitted. |
 
 **Response:** `text/event-stream` (SSE)
 
@@ -65,9 +88,11 @@ The LLM is instructed to reference sources using numbered citations `[N]` in its
 
 Located in `backend/chat_service.py`. The retrieve-prompt-stream pipeline:
 
-1. **Retrieve** — Calls `SearchService.search()` with the query and optional creator parameter. Gets back ranked technique page results with the cascade_tier.
-2. **Prompt** — Builds a numbered context block from search results. System prompt instructs the LLM to act as a music production encyclopedia, cite sources with `[N]` notation, and stay grounded in the provided context.
-3. **Stream** — Opens an async streaming completion via `openai.AsyncOpenAI` (configured to point at DGX Sparks Qwen or local Ollama). Yields SSE events as tokens arrive.
+1. **Load History** — `_load_history()` reads from Redis key `chrysopedia:chat:{conversation_id}`. Returns empty list if key absent.
+2. **Retrieve** — Calls `SearchService.search()` with the query and optional creator parameter. Gets back ranked technique page results with the cascade_tier.
+3. **Prompt** — Builds message array: system prompt → conversation history → numbered context block → user message. System prompt instructs the LLM to act as a music production encyclopedia, cite sources with `[N]` notation, and stay grounded in the provided context.
+4. **Stream** — Opens an async streaming completion via `openai.AsyncOpenAI`. Yields SSE events as tokens arrive.
+5. **Save History** — `_save_history()` appends the user message and accumulated assistant response to Redis. Trims to 10 turn pairs if exceeded. Refreshes TTL to 1 hour.
 
 Error handling: If the LLM fails mid-stream (after some tokens have been sent), an `error` event is emitted so the frontend can display a failure message rather than leaving the response hanging.
 
@@ -77,38 +102,63 @@ Route: `/chat` (lazy-loaded, code-split)
 
 ### Components
 
-- **Text input + submit button** — Query entry with Enter-to-submit
+- **Multi-message conversation UI** — Messages array with conversation bubble layout
+- **Conversation threading** — `conversationId` state, "New conversation" button to reset
 - **Streaming message display** — Accumulates tokens with blinking cursor animation during streaming
-- **Citation markers** — `[N]` parsed to superscript links targeting `/techniques/:slug`
-- **Source list** — Numbered sources with creator attribution displayed below the response
-- **States:** Loading (streaming indicator), error (message display), empty (placeholder prompt)
+- **Typing indicator** — Three-dot animation while streaming
+- **Citation markers** — `[N]` parsed to superscript links targeting `/techniques/:slug` (per-message)
+- **Source list** — Numbered sources with creator attribution displayed below each response
+- **Auto-scroll** — Scrolls to bottom as new tokens arrive
 
 ### SSE Client
 
 Located in `frontend/src/api/chat.ts`. Uses `fetch()` + `ReadableStream` with typed callbacks:
 
 ```typescript
-streamChat(query, creator?, {
+streamChat(query, {
   onSources: (sources) => void,
   onToken: (token) => void,
-  onDone: (data) => void,
+  onDone: (data: ChatDoneMeta) => void,
   onError: (error) => void,
-})
+}, creatorName?, conversationId?)
 ```
 
+`ChatDoneMeta` type includes `cascade_tier` and `conversation_id` fields.
+
+## Frontend: ChatWidget (M022/S03)
+
+Floating chat bubble on creator detail pages. Fixed-position bottom-right.
+
+### Behavior
+
+- **Bubble** → click → **slide-up panel** with conversation UI
+- Creator-scoped: passes `creatorName` to `streamChat()` for retrieval cascade
+- **Suggested questions** generated client-side from technique titles and categories
+- **Typing indicator** — three-dot animation during streaming
+- **Citation links** — parsed from response, linked to technique pages
+- **Responsive** — full-width below 640px, 400px panel on desktop
+- **Conversation threading** — `conversationId` generated via `crypto.randomUUID()` on first send, threaded through `streamChat()`, updated from done event
+- **Reset on close** — messages and conversationId cleared when panel closes
+
 ## Key Files
 
-- `backend/chat_service.py` — ChatService retrieve-prompt-stream pipeline
-- `backend/routers/chat.py` — POST /api/v1/chat endpoint
-- `frontend/src/api/chat.ts` — SSE client utility
-- `frontend/src/pages/ChatPage.tsx` — Chat UI page component
-- `frontend/src/pages/ChatPage.module.css` — Chat page styles
+- `backend/chat_service.py` — ChatService with history load/save, retrieve-prompt-stream pipeline
+- `backend/routers/chat.py` — POST /api/v1/chat endpoint with conversation_id support
+- `backend/tests/test_chat.py` — 13 tests (6 streaming + 7 conversation memory)
+- `frontend/src/api/chat.ts` — SSE client with conversationId param and ChatDoneMeta type
+- `frontend/src/pages/ChatPage.tsx` — Multi-message conversation UI
+- `frontend/src/pages/ChatPage.module.css` — Conversation bubble layout styles
+- `frontend/src/components/ChatWidget.tsx` — Floating chat widget component
+- `frontend/src/components/ChatWidget.module.css` — Widget styles (38 custom property refs)
 
 ## Design Decisions
 
-- **Standalone ASGI test client pattern** — Tests use mocked DB to avoid PostgreSQL dependency, enabling fast CI runs
-- **Patch `openai.AsyncOpenAI` constructor** rather than instance attribute for reliable test mocking
-- **Local citation regex** in ChatPage rather than importing from utils — link targets differ from technique page citations
+- **Redis JSON string** — Conversation history stored as single JSON value (atomic read/write) rather than Redis list type
+- **Auto-generate conversation_id** — Server creates UUID when client omits it, ensuring consistent `done` event shape
+- **Widget resets on close** — Clean slate UX; no persistence across open/close cycles
+- **Client-side suggested questions** — Generated from technique titles/categories without API call
+- **Citation parsing duplicated** — ChatPage and ChatWidget each parse citations independently (extracted utility deferred)
+- **Standalone ASGI test client** — Tests use mocked DB to avoid PostgreSQL dependency
 
 ---
 
diff --git a/Data-Model.md b/Data-Model.md
index fdfb15d..05b321a 100644
--- a/Data-Model.md
+++ b/Data-Model.md
@@ -1,6 +1,6 @@
 # Data Model
 
-18 SQLAlchemy models in `backend/models.py`.
+20 SQLAlchemy models in `backend/models.py`.

 ## Entity Relationship Overview
 
@@ -17,6 +17,8 @@ Creator (1) ──→ (N) SourceVideo (1) ──→ (N) TranscriptSegment
   │     ├──→ (N) RelatedTechniqueLink
   │     └──→ (M:N) SourceVideo (via TechniquePageVideo)
   │
+  ├──→ (N) CreatorFollow ←── User
+  │
   └──→ (0..1) User ──→ (N) InviteCode (created_by)
 ```
 
@@ -34,6 +36,7 @@ Creator (1) ──→ (N) SourceVideo (1) ──→ (N) TranscriptSegment
 | bio | Text | Admin-editable |
 | social_links | JSONB | Platform → URL mapping |
 | featured | Boolean | For homepage spotlight |
+| personality_profile | JSONB | LLM-extracted personality data (M022/S06). See [[Personality-Profiles]] |
 
 ### SourceVideo
 
@@ -101,6 +104,33 @@ Creator (1) ──→ (N) SourceVideo (1) ──→ (N) TranscriptSegment
 | content_snapshot | JSONB | Full page state at version time |
 | pipeline_metadata | JSONB | Prompt SHA-256 hashes, model config |
 
+### HighlightCandidate
+
+| Field | Type | Notes |
+|-------|------|-------|
+| id | UUID PK | |
+| key_moment_id | FK → KeyMoment | Unique constraint |
+| source_video_id | FK → SourceVideo | Indexed |
+| score | Float | Composite score 0.0–1.0 |
+| score_breakdown | JSONB | Per-dimension scores (10 fields, see [[Highlights]]) |
+| duration_secs | Float | Cached from KeyMoment |
+| status | Enum(HighlightStatus) | candidate / approved / rejected |
+| trim_start | Float | Nullable — trim offset in seconds (M022/S01) |
+| trim_end | Float | Nullable — trim offset in seconds (M022/S01) |
+| created_at | Timestamp | |
+| updated_at | Timestamp | |
+
+### CreatorFollow (M022/S02)
+
+| Field | Type | Notes |
+|-------|------|-------|
+| id | UUID PK | |
+| user_id | FK → User | Part of unique constraint |
+| creator_id | FK → Creator | Part of unique constraint |
+| created_at | Timestamp | |
+
+Unique constraint on `(user_id, creator_id)`. Idempotent follow via `INSERT ON CONFLICT DO NOTHING`.
+
 ## Authentication & User Models
 
 ### User
@@ -192,20 +222,17 @@ Append-only versioned record of per-field consent changes.
 | **HighlightStatus** | candidate, approved, rejected (M021/S04) |
 | **ChapterStatus** | draft, approved, hidden (M021/S06) |
 
+## Migrations
+
+| Migration | Description |
+|-----------|-------------|
+| 019 | Add highlight_candidates table |
+| 021 | Add trim_start/trim_end to highlight_candidates (M022/S01) |
+| 022 | Add creator_follows table (M022/S02) |
+| 023 | Add personality_profile JSONB to creators (M022/S06) |
+
 ## Schema Notes
 
-- **No Alembic migrations** — schema changes currently require manual DDL
-- **body_sections_format** discriminator enables v1/v2 format coexistence (D024)
-- **topic_category casing** is inconsistent across records (e.g., "Sound design" vs "Sound Design") — known data quality issue
-- **Stage 4 classification data** (per-moment topic_tags) stored in Redis with 24h TTL, not DB columns
-- **Timestamp convention:** `datetime.now(timezone.utc).replace(tzinfo=None)` — asyncpg rejects timezone-aware datetimes for TIMESTAMP WITHOUT TIME ZONE columns (D002)
-- **User passwords** are stored as bcrypt hashes via `bcrypt.hashpw()`
-- **Consent audit** uses version numbers assigned in application code (`max(version) + 1` per video_consent_id)
-
----
-
-*See also: [[Architecture]], [[API-Surface]], [[Pipeline]], [[Authentication]]*
- changes currently require manual DDL
 - **body_sections_format** discriminator enables v1/v2 format coexistence (D024)
 - **topic_category casing** is inconsistent across records (e.g., "Sound design" vs "Sound Design") — known data quality issue
 - **Stage 4 classification data** (per-moment topic_tags) stored in Redis with 24h TTL, not DB columns
diff --git a/Decisions.md b/Decisions.md
index 4357bec..e99a1f4 100644
--- a/Decisions.md
+++ b/Decisions.md
@@ -31,12 +31,26 @@ Architectural and pattern decisions made during Chrysopedia development. Append-
 | D034 | Documentation strategy | Forgejo wiki, KB slice at end of every milestone | Incremental docs stay current; final pass in M025 |
 | D035 | File/object storage | MinIO (S3-compatible) self-hosted | Docker-native, signed URLs, fits existing infrastructure |
 
-## M021 Decisions
+## Authentication & Infrastructure Decisions
 
 | # | When | Decision | Choice | Rationale |
 |---|------|----------|--------|-----------|
-| D039 | M021/S01 | LightRAG scoring strategy | Position-based (1.0 → 0.5 descending), sequential Qdrant fallback | `/query/data` has no numeric relevance score; retrieval order is the only signal |
-| D040 | M021/S02 | Creator-scoped retrieval strategy | 4-tier cascade: creator → domain → global → none | Progressive widening ensures results while preferring creator context; `ll_keywords` for soft scoping; 3x oversampling for post-filter survival |
+| D036 | M019/S02 | JWT auth configuration | HS256 with existing app_secret_key, 24h expiry, OAuth2PasswordBearer | Reuses existing secret; integrates with FastAPI dependency injection |
+| D037 | — | Search impressions query | Exact case-insensitive title match via EXISTS subquery against SearchLog | MVP approach; expandable to ILIKE later |
+| D038 | — | Primary git remote | git.xpltd.co (Forgejo) instead of github.com | Consolidating on self-hosted Forgejo; wiki already there |
+
+## Search & Retrieval Decisions
+
+| # | When | Decision | Choice | Rationale |
+|---|------|----------|--------|-----------|
+| D039 | M021/S01 | LightRAG scoring strategy | Position-based (1.0 → 0.5 descending), sequential Qdrant fallback | `/query/data` has no numeric relevance score |
+| D040 | M021/S02 | Creator-scoped retrieval | 4-tier cascade: creator → domain → global → none | Progressive widening; `ll_keywords` for soft scoping; 3x oversampling for post-filter survival |
+
+## M022 Decisions
+
+| # | When | Decision | Choice | Rationale |
+|---|------|----------|--------|-----------|
+| D041 | M022/S05 | Highlight scorer weight distribution | 10 dimensions: original 7 reduced proportionally, 3 audio proxy dims get 0.22 total weight. Neutral fallback (0.5) when word_timings unavailable. | Audio proxy signals from word-level timing data; neutral fallback preserves backward compatibility |
 
 ## UI/UX Decisions
 
diff --git a/Frontend.md b/Frontend.md
index 77ef29b..9f6eec6 100644
--- a/Frontend.md
+++ b/Frontend.md
@@ -10,10 +10,13 @@ React 18 + TypeScript + Vite SPA. No UI library, no state management library, no
 | `/search` | SearchResults | Public | Sort, highlights, partial matches |
 | `/techniques/:slug` | TechniquePage | Public | v2 body sections, ToC sidebar, citations |
 | `/creators` | CreatorsBrowse | Public | Random default sort, genre filters |
-| `/creators/:slug` | CreatorDetail | Public | Avatar, stats, technique list |
+| `/creators/:slug` | CreatorDetail | Public | Avatar, stats, technique list, follow button, personality profile, chat widget |
 | `/topics` | TopicsBrowse | Public | 7 category cards, expandable sub-topics |
 | `/topics/:category/:subtopic` | SubTopicPage | Public | Creator-grouped techniques |
+| `/chat` | ChatPage | Public | Multi-message conversation UI with threading |
 | `/about` | About | Public | Static project info |
+| `/creator/highlights` | HighlightQueue | Creator JWT | Highlight review queue with filter tabs (M022/S01) |
+| `/creator/tiers` | CreatorTiers | Creator JWT | Free/Pro/Premium tier cards with Coming Soon modals (M022/S02) |
 | `/admin/reports` | AdminReports | Admin* | Content reports |
 | `/admin/pipeline` | AdminPipeline | Admin* | Pipeline management |
 | `/admin/techniques` | AdminTechniquePages | Admin* | Technique page admin |
@@ -38,6 +41,51 @@ React 18 + TypeScript + Vite SPA. No UI library, no state management library, no
 | CopyLinkButton | Clipboard copy with tooltip |
 | SocialIcons | Social media link icons (9 platforms) |
 | ReportIssueModal | Content report submission |
+| ChatWidget | Floating chat bubble on creator pages — SSE streaming, citations, suggested questions (M022/S03) |
+| PersonalityProfile | Collapsible creator personality display — 3 sub-cards (Teaching Style, Vocabulary, Style) (M022/S06) |
+
+## Feature Pages (M022)
+
+### HighlightQueue (M022/S01)
+
+Creator-scoped highlight review page at `/creator/highlights`.
+
+- **Filter tabs** — All / Shorts / Approved / Rejected
+- **Candidate cards** — Title, duration, composite score, status badge
+- **Score breakdown bars** — 10-dimension visual bars (fetched lazily on expand)
+- **Action buttons** — Approve / Discard with ownership verification
+- **Inline trim panel** — Validated trim_start / trim_end inputs
+- **Files:** `HighlightQueue.tsx`, `HighlightQueue.module.css`, `highlights.ts` (API)
+
+### CreatorTiers (M022/S02)
+
+Tier configuration at `/creator/tiers`.
+
+- **Three cards** — Free (active), Pro, Premium
+- **Coming Soon modals** — Styled placeholders per D033 (Stripe deferred to Phase 3)
+- **Files:** `CreatorTiers.tsx`, `CreatorTiers.module.css`
+
+### ChatWidget (M022/S03)
+
+Floating chat on creator detail pages.
+
+- **Fixed-position bubble** (bottom-right) → slide-up conversation panel
+- **Creator-scoped** — passes creatorName to streamChat() for retrieval cascade
+- **Suggested questions** — client-side from technique titles/categories
+- **Streaming SSE** — tokens, citations, typing indicator
+- **Responsive** — full-width below 640px, 400px panel on desktop
+- **Conversation threading** — conversationId via crypto.randomUUID(), resets on close
+- **Files:** `ChatWidget.tsx`, `ChatWidget.module.css`
+
+### PersonalityProfile (M022/S06)
+
+Collapsible personality display on creator detail pages.
+
+- **Grid-template-rows animation** — 0fr → 1fr for smooth expand/collapse
+- **Three sub-cards:** Teaching Style, Vocabulary, Style
+- **Pill badges** for phrases/terms, checkmark/cross for boolean markers
+- **Gracefully hidden** when profile is null
+- **Files:** `PersonalityProfile.tsx`, styles in `App.css`
 
 ## Hooks
 
@@ -45,19 +93,22 @@ React 18 + TypeScript + Vite SPA. No UI library, no state management library, no
 |------|---------|
 | useCountUp | Animated counter for homepage stats |
 | useSortPreference | Persists sort preference in localStorage |
-| useDocumentTitle | Sets `` per page (all 10 pages instrumented) |
+| useDocumentTitle | Sets `<title>` per page (all pages instrumented) |
 
 ## State Management
 
-Local component state only (`useState`/`useEffect`). No Redux, Zustand, Context providers, or external state management library.
+Local component state only (`useState`/`useEffect`). No Redux, Zustand, Context providers, or external state management library. AuthProvider context for JWT auth state.
 
 ## API Client
 
-Two API modules:
+API modules:
 
 - `public-client.ts` (~600 lines) — typed `request<T>` helper for REST endpoints
-- `chat.ts` — SSE streaming client for POST /api/v1/chat using `fetch()` + `ReadableStream`
-- `videos.ts` — chapter management functions (fetchChapters, fetchCreatorChapters, updateChapter, reorderChapters, approveChapters)
-- `auth.ts` — authentication + impersonation functions including `fetchImpersonationLog()`
+- `chat.ts` — SSE streaming client for POST /api/v1/chat using `fetch()` + `ReadableStream`, `ChatDoneMeta` type
+- `videos.ts` — chapter management functions
+- `auth.ts` — authentication + impersonation functions
+- `highlights.ts` — creator highlight review functions (M022/S01)
+- `follows.ts` — follow/unfollow/status/list functions (M022/S02)
+- `creators.ts` — creator detail with personality_profile and follower_count types (M022/S02, S06)
 
 Relative `/api/v1` base URL (nginx proxies to API container).
@@ -66,26 +117,13 @@ Relative `/api/v1` base URL (nginx proxies to API container).
 | Property | Value |
 |----------|-------|
 | File | `frontend/src/App.css` |
-| Lines | 5,820 |
-| Unique classes | ~589 |
+| Lines | ~6,500+ |
 | Naming | BEM (`block__element--modifier`) |
 | Theme | Dark-only (no light mode) |
 | Custom properties | 77 in `:root` (D017) |
 | Accent color | Cyan `#22d3ee` |
 | Font stack | System fonts |
-| Preprocessor | None |
-| CSS Modules | None |
-
-### Custom Property Categories (77 total)
-
-- **Surface colors:** page background, card backgrounds, nav, footer, input
-- **Text colors:** primary, secondary, muted, inverse, link, heading
-- **Accent colors:** primary cyan, hover/active, focus rings
-- **Badge colors:** Per-category pairs (bg + text) for 7 topic categories
-- **Status colors:** Success/warning/error/info
-- **Border colors:** Default, hover, focus, divider
-- **Shadow colors:** Elevation, glow effects
-- **Overlay colors:** Modal/dropdown overlays
+| CSS Modules | Used for new components (HighlightQueue, CreatorTiers, ChatWidget, ChatPage) |
 
 ### Breakpoints
 
@@ -93,7 +131,7 @@ Relative `/api/v1` base URL (nginx proxies to API container).
 |-----------|-------|
 | 480px | Narrow mobile — compact cards |
 | 600px | Wider mobile — grid adjustments |
-| 640px | Small tablet — content width |
+| 640px | Small tablet / chat widget responsive break |
 | 768px | Desktop ↔ mobile transition — sidebar collapse |
 
 ### Layout Patterns
@@ -114,10 +152,3 @@ Relative `/api/v1` base URL (nginx proxies to API container).
 ---
 
 *See also: [[Architecture]], [[API-Surface]], [[Development-Guide]]*
-*See also: [[Architecture]], [[API-Surface]], [[Development-Guide]]*
-ocalhost:8001`
-- **Production:** nginx serves static `dist/` bundle, proxies `/api` to FastAPI container
-
----
-
-*See also: [[Architecture]], [[API-Surface]], [[Development-Guide]]*
diff --git a/Highlights.md b/Highlights.md
index 337b2b1..ae8b7ac 100644
--- a/Highlights.md
+++ b/Highlights.md
@@ -1,10 +1,10 @@
 # Highlight Detection
 
-Heuristic scoring engine that ranks KeyMoment records into highlight candidates using 7 weighted dimensions. Added in M021/S04.
+Heuristic scoring engine that ranks KeyMoment records into highlight candidates using 10 weighted dimensions. Originally added in M021/S04 with 7 dimensions, expanded to 10 in M022/S05.
 
 ## Overview
 
-Highlight detection scores every KeyMoment in a video to identify the most "highlightable" segments — moments that would work well as standalone clips or featured content. The scoring is a pure function (no ML model, no external API) based on 7 dimensions derived from existing KeyMoment metadata.
+Highlight detection scores every KeyMoment in a video to identify the most "highlightable" segments — moments that would work well as standalone clips or featured content. The scoring is a pure function (no ML model, no external API) based on 10 dimensions derived from existing KeyMoment metadata and word-level transcript timing data.
 
 ## Scoring Dimensions
 
@@ -12,13 +12,22 @@ Total weight sums to 1.0. Each dimension produces a 0.0–1.0 score.

 | Dimension | Weight | What It Measures |
 |-----------|--------|-----------------|
-| `duration_fitness` | 0.25 | Piecewise linear curve peaking at 30–60 seconds (ideal clip length) |
-| `content_type` | 0.20 | Content type favorability: tutorial > tip > walkthrough > exploration |
-| `specificity_density` | 0.20 | Regex-based counting of specific units, ratios, and named parameters in summary text |
-| `plugin_richness` | 0.10 | Number of plugins/VSTs referenced (more = more actionable) |
-| `transcript_energy` | 0.10 | Teaching-phrase detection in transcript text (e.g., "the trick is", "key thing") |
-| `source_quality` | 0.10 | Source quality rating: high=1.0, medium=0.6, low=0.3 |
-| `video_type` | 0.05 | Video type favorability mapping |
+| `duration_fitness` | 0.20 | Piecewise linear curve peaking at 30–60 seconds (ideal clip length) |
+| `content_type` | 0.16 | Content type favorability: tutorial > tip > walkthrough > exploration |
+| `specificity_density` | 0.16 | Regex-based counting of specific units, ratios, and named parameters in summary text |
+| `plugin_richness` | 0.08 | Number of plugins/VSTs referenced (more = more actionable) |
+| `transcript_energy` | 0.08 | Teaching-phrase detection in transcript text (e.g., "the trick is", "key thing") |
+| `source_quality` | 0.08 | Source quality rating: high=1.0, medium=0.6, low=0.3 |
+| `video_type` | 0.02 | Video type favorability mapping |
+| `speech_rate_variance` | ~0.07 | Coefficient of variation of words-per-second in 5s sliding windows |
+| `pause_density` | ~0.08 | Count and weight of inter-word gaps (>0.5s short, >1.0s long) |
+| `speaking_pace` | ~0.07 | Bell-curve fitness around optimal 3–5 WPS teaching pace |
+
+### Audio Proxy Dimensions (M022/S05)
+
+The three new dimensions (speech_rate_variance, pause_density, speaking_pace) are derived from **word-level transcript timing data** — not raw audio. This provides meaningful speech-pattern signals without requiring librosa or audio processing dependencies.
+
+**Neutral fallback:** When `word_timings` are unavailable (no word-level data in transcript), all three audio proxy dimensions default to **0.5** (neutral score). This preserves backward compatibility — existing scoring paths are unaffected. The weights of the original 7 dimensions were reduced proportionally to accommodate the new 0.22 total weight for audio dimensions (D041).

 ### Duration Fitness Curve

@@ -36,12 +45,14 @@ Uses piecewise linear (not Gaussian) for predictability:
 | Field | Type | Notes |
 |-------|------|-------|
 | id | UUID PK | |
-| key_moment_id | FK → KeyMoment | Unique constraint (`uq_highlight_candidate_moment`) |
+| key_moment_id | FK → KeyMoment | Unique constraint (`highlight_candidates_key_moment_id_key`) |
 | source_video_id | FK → SourceVideo | Indexed |
 | score | Float | Composite score 0.0–1.0 |
-| score_breakdown | JSONB | Per-dimension scores (7 fields) |
+| score_breakdown | JSONB | Per-dimension scores (10 fields) |
 | duration_secs | Float | Cached from KeyMoment for display |
 | status | Enum(HighlightStatus) | candidate / approved / rejected |
+| trim_start | Float | Nullable — trim start offset in seconds (M022/S01) |
+| trim_end | Float | Nullable — trim end offset in seconds (M022/S01) |
 | created_at | Timestamp | |
 | updated_at | Timestamp | |

@@ -59,12 +70,15 @@ Uses piecewise linear (not Gaussian) for predictability:
 - `score` DESC — rank ordering
 - `status` — filter by review state

-### Migration
+### Migrations

-Alembic migration `019_add_highlight_candidates.py` creates the table with all indexes and the named unique constraint.
+- `019_add_highlight_candidates.py` — Creates table with indexes and unique constraint
+- `021_add_highlight_trim_columns.py` — Adds trim_start and trim_end columns (M022/S01)

 ## API Endpoints

+### Admin Endpoints
+
 All under `/api/v1/admin/highlights/`. Admin access.

 | Method | Path | Purpose |
@@ -74,37 +88,31 @@ All under `/api/v1/admin/highlights/`. Admin access.
 | GET | `/admin/highlights/candidates` | Paginated candidate list, sorted by score DESC |
 | GET | `/admin/highlights/candidates/{id}` | Single candidate with full `score_breakdown` |

-### Detect Response
+### Creator Endpoints (M022/S01)
+
+Creator-scoped highlight review. Requires JWT auth with creator ownership verification.
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| GET | `/api/v1/creator/highlights` | List highlights for authenticated creator (status/shorts_only filters, score DESC) |
+| GET | `/api/v1/creator/highlights/{id}` | Detail with score_breakdown and key_moment |
+| PATCH | `/api/v1/creator/highlights/{id}/status` | Update status (approve/reject) with ownership verification |
+| PATCH | `/api/v1/creator/highlights/{id}/trim` | Update trim_start/trim_end (validation: non-negative, start < end) |
+
+### Score Breakdown Response

 ```json
 {
-  "video_id": "uuid",
-  "candidates_created": 12,
-  "candidates_updated": 0
-}
-```
-
-### Candidate Response
-
-```json
-{
-  "id": "uuid",
-  "key_moment_id": "uuid",
-  "source_video_id": "uuid",
-  "score": 0.847,
-  "score_breakdown": {
-    "duration_fitness": 0.95,
-    "content_type_weight": 0.80,
-    "specificity_density": 0.72,
-    "plugin_richness": 0.60,
-    "transcript_energy": 0.85,
-    "source_quality_weight": 1.00,
-    "video_type_weight": 0.50
-  },
-  "duration_secs": 45.0,
-  "status": "candidate",
-  "created_at": "...",
-  "updated_at": "..."
+  "duration_fitness": 0.95,
+  "content_type_weight": 0.80,
+  "specificity_density": 0.72,
+  "plugin_richness": 0.60,
+  "transcript_energy": 0.85,
+  "source_quality_weight": 1.00,
+  "video_type_weight": 0.50,
+  "speech_rate_variance_score": 0.057,
+  "pause_density_score": 0.0,
+  "speaking_pace_score": 1.0
 }
 ```

@@ -114,30 +122,55 @@ All under `/api/v1/admin/highlights/`. Admin access.

 - **Binding:** `bind=True, max_retries=3`
 - **Session:** Uses `_get_sync_session` (sync SQLAlchemy, per D004)
-- **Flow:** Load KeyMoments for video → score each via `score_moment()` → bulk upsert via `INSERT ON CONFLICT` on named constraint `uq_highlight_candidate_moment`
+- **Flow:** Load KeyMoments for video → load transcript JSON → extract word timings per moment → score each via `score_moment()` → bulk upsert via `INSERT ON CONFLICT` on constraint `highlight_candidates_key_moment_id_key`
+- **Transcript handling:** Loads transcript JSON once per video via `SourceVideo.transcript_path`. Accepts both `{segments: [...]}` and bare `[...]` JSON formats.
+- **Fallback:** If transcript is missing or malformed, `word_timings=None` and scorer uses neutral values for audio dimensions
 - **Events:** Emits `pipeline_events` rows for start/complete/error with candidate count in payload

 ### Scoring Function

-`score_moment()` in `backend/pipeline/highlight_scorer.py` is a **pure function** — no DB access, no side effects. Takes a KeyMoment-like dict, returns `(score, breakdown_dict)`. This separation enables easy unit testing (28 tests, runs in 0.03s).
+`score_moment()` in `backend/pipeline/highlight_scorer.py` is a **pure function** — no DB access, no side effects. Takes a KeyMoment-like dict and optional `word_timings` list, returns `(score, breakdown_dict)`. This separation enables easy unit testing (62 tests, runs in 0.09s).
+
+### Word Timing Extraction
+
+`extract_word_timings()` filters word-level timing dicts from transcript JSON by time window. Used by the Celery task to extract timings per KeyMoment before scoring.
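A minimal sketch of the time-window filter described for `extract_word_timings()`, under stated assumptions: each word dict is assumed to carry a numeric `start` key in seconds, and word entries are assumed to sit under a `words` key on each segment. The names `extract_word_timings_sketch` and `_iter_words` are hypothetical, and the real function in `highlight_scorer.py` may differ in signature and behavior.

```python
from typing import Any, Iterable, List, Optional


def _iter_words(transcript: Any) -> Iterable[dict]:
    """Yield word dicts from either {"segments": [...]} or a bare [...] list."""
    if isinstance(transcript, dict):
        segments = transcript.get("segments", [])
    elif isinstance(transcript, list):
        segments = transcript
    else:
        return  # malformed transcript: yield nothing
    for segment in segments:
        if not isinstance(segment, dict):
            continue
        # Assumption: word-level entries live under a "words" key.
        for word in segment.get("words") or []:
            if isinstance(word, dict):
                yield word


def extract_word_timings_sketch(
    transcript: Any, start_time: float, end_time: float
) -> Optional[List[dict]]:
    """Return words whose start lies in [start_time, end_time), or None if none match.

    Returning None (an assumption) lets a scorer fall back to neutral values.
    """
    words = [
        w
        for w in _iter_words(transcript)
        if isinstance(w.get("start"), (int, float))
        and start_time <= w["start"] < end_time
    ]
    return words or None
```

Returning `None` rather than an empty list is a design choice in this sketch only: it mirrors the documented fallback, where missing word timings lead the scorer to use neutral values for the audio dimensions.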
+
+## Frontend: Highlight Review Queue (M022/S01)
+
+Route: `/creator/highlights` (JWT-protected, lazy-loaded)
+
+### Components
+
+- **Filter tabs** — All / Shorts / Approved / Rejected
+- **Candidate cards** — Key moment title, duration, composite score, status badge
+- **Score breakdown bars** — Visual bars for each of the 10 scoring dimensions (fetched lazily on expand)
+- **Action buttons** — Approve / Discard with ownership verification
+- **Inline trim panel** — Validated number inputs for trim_start / trim_end
+- **Sidebar link** — Star icon in creator dashboard SidebarNav

 ## Design Decisions

 - **Pure function scoring** — No DB or side effects in `score_moment()`, enabling fast unit tests
 - **Piecewise linear duration** — Predictable behavior vs. Gaussian bell curve
-- **Named unique constraint** — `uq_highlight_candidate_moment` enables idempotent upserts via `ON CONFLICT`
-- **Lazy import** — `score_moment` imported inside Celery task to avoid circular imports at module load
+- **Neutral fallback at 0.5** — New audio dimensions don't penalize moments without word-level timing data (D041)
+- **Proportional weight reduction** — Original 7 dimensions reduced proportionally to make room for 0.22 audio weight
+- **Lazy detail fetch** — Score breakdown fetched on expand, not on list load (avoids N+1)
+- **Creator-scoped router** — Ownership verification pattern reusable for future creator endpoints

 ## Key Files

-- `backend/pipeline/highlight_scorer.py` — Pure scoring function with 7 dimensions
-- `backend/pipeline/highlight_schemas.py` — Pydantic schemas (HighlightScoreBreakdown, HighlightCandidateResponse, HighlightBatchResult)
+- `backend/pipeline/highlight_scorer.py` — Pure scoring function with 10 dimensions, word timing extraction
+- `backend/pipeline/highlight_schemas.py` — Pydantic schemas (HighlightScoreBreakdown with 10 fields)
 - `backend/pipeline/stages.py` — `stage_highlight_detection` Celery task
 - `backend/routers/highlights.py` — 4 admin API endpoints
-- `backend/models.py` — HighlightCandidate model, HighlightStatus enum
-- `alembic/versions/019_add_highlight_candidates.py` — Migration
-- `backend/pipeline/test_highlight_scorer.py` — 28 unit tests
+- `backend/routers/creator_highlights.py` — 4 creator-scoped endpoints (M022/S01)
+- `backend/models.py` — HighlightCandidate model with trim columns
+- `alembic/versions/019_add_highlight_candidates.py` — Initial migration
+- `alembic/versions/021_add_highlight_trim_columns.py` — Trim columns migration
+- `backend/pipeline/test_highlight_scorer.py` — 62 unit tests
+- `frontend/src/pages/HighlightQueue.tsx` — Creator review queue page
+- `frontend/src/api/highlights.ts` — Highlight API client

 ---

-*See also: [[Pipeline]], [[Data-Model]], [[API-Surface]]*
+*See also: [[Pipeline]], [[Data-Model]], [[API-Surface]], [[Frontend]]*
diff --git a/Home.md b/Home.md
index 94864b1..fb9d468 100644
--- a/Home.md
+++ b/Home.md
@@ -8,12 +8,38 @@ Producers can search for specific techniques and find timestamped key moments, s

 - [[Architecture]] — System architecture, Docker services, network topology
 - [[Data-Model]] — SQLAlchemy models, relationships, enums
-- [[API-Surface]] — All 41 API endpoints grouped by domain
+- [[API-Surface]] — All 60+ API endpoints grouped by domain
 - [[Frontend]] — Routes, components, hooks, CSS architecture
 - [[Pipeline]] — 6-stage LLM extraction pipeline, prompt system
+- [[Chat-Engine]] — Streaming Q&A with multi-turn memory
+- [[Highlights]] — 10-dimension highlight detection and review queue
+- [[Personality-Profiles]] — LLM-extracted creator teaching personality
+- [[Search-Retrieval]] — LightRAG + Qdrant retrieval cascade
 - [[Deployment]] — Docker Compose setup, rebuild commands
 - [[Development-Guide]] — Local dev setup, common gotchas
-- [[Decisions]] — Architectural decisions register (D001–D035)
+- [[Decisions]] — Architectural decisions register (D001–D041)
+
+## Features
+
+### Core
+- **Technique Pages** — LLM-synthesized study guides with v2 body sections, signal chains, citations
+- **Search** — LightRAG primary + Qdrant fallback with 4-tier creator-scoped cascade
+- **Pipeline** — 6-stage LLM extraction (transcripts → key moments → classification → synthesis → embedding)
+- **Player** — Audio player with chapter markers
+
+### Creator Tools
+- **Follow System** — User-to-creator follows with follower counts (M022)
+- **Personality Profiles** — LLM-extracted teaching style, vocabulary, and tone analysis (M022)
+- **Creator Tiers** — Free/Pro/Premium tier configuration with Coming Soon placeholders (M022)
+- **Highlight Detection v2** — 10-dimension scoring with audio proxy signals, creator review queue (M022)
+- **Chat Widget** — Floating creator-scoped chat bubble with streaming SSE and citations (M022)
+- **Multi-Turn Chat Memory** — Redis-backed conversation history with conversation_id threading (M022)
+- **Creator Dashboard** — Video management, chapter editing, consent controls
+
+### Platform
+- **Authentication** — JWT with invite codes, admin/creator roles
+- **Consent System** — Per-video granular consent with audit trail
+- **Impersonation** — Admin-to-creator context switching with audit log

 ## Current Scale

@@ -31,16 +57,11 @@ Producers can search for specific techniques and find timestamped key moments, s
 | Database | PostgreSQL 16 |
 | Cache/Broker | Redis 7 |
 | Vector Store | Qdrant 1.13.2 |
+| RAG Framework | LightRAG + NetworkX |
 | Embeddings | Ollama (nomic-embed-text) |
 | LLM | OpenAI-compatible API (DGX Sparks Qwen primary, local Ollama fallback) |
 | Deployment | Docker Compose on ub01, nginx reverse proxy on nuc01 |

 ---

-*Last updated: 2026-04-04 — M021 chat engine, retrieval cascade, highlights, audio mode, chapters, impersonation write mode*
-inx reverse proxy on nuc01 |
-
----
-
-*Last updated: 2026-04-03 — M018/S02 initial bootstrap*
- M018/S02 initial bootstrap*
+*Last updated: 2026-04-04 — M022 follow system, personality profiles, highlight v2, chat widget, multi-turn memory, creator tiers*
diff --git a/Personality-Profiles.md b/Personality-Profiles.md
new file mode 100644
index 0000000..3a81f1f
--- /dev/null
+++ b/Personality-Profiles.md
@@ -0,0 +1,132 @@
+# Personality Profiles
+
+LLM-powered extraction of creator teaching personality from transcript analysis. Added in M022/S06.
+
+## Overview
+
+Personality profiles capture each creator's distinctive teaching style — vocabulary patterns, tonal qualities, and stylistic markers — by analyzing their transcript corpus with a structured LLM extraction pipeline. Profiles are stored as JSONB on the Creator model and displayed on creator detail pages.
+
+## Extraction Pipeline
+
+### Transcript Sampling
+
+Three-tier sampling strategy based on total transcript size:
+
+| Tier | Condition | Strategy |
+|------|-----------|----------|
+| Small | < 20K chars | Use all transcript text |
+| Medium | 20K–60K chars | 300-character excerpts per key moment |
+| Large | > 60K chars | Topic-diverse random sampling via Redis classification data |
+
+Large-tier sampling uses deterministic seeding and pulls from across topic categories to ensure the profile reflects the creator's full range, not just their most common topic.
+
+### LLM Extraction
+
+The prompt template at `prompts/personality_extraction.txt` instructs the LLM to analyze transcript excerpts and produce structured JSON. The LLM response is parsed and validated with a Pydantic model before storage.
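The three-tier rule from the Transcript Sampling table can be sketched as a small selector. This is an illustration only: the function name `pick_sampling_tier` and both constants are hypothetical, "20K" and "60K" are read as 20,000 and 60,000 characters, and the exact boundary values are assigned to the medium tier because the table uses strict `<` and `>` for the other two.

```python
SMALL_MAX_CHARS = 20_000  # strictly below this: use all transcript text
LARGE_MIN_CHARS = 60_000  # strictly above this: topic-diverse random sampling


def pick_sampling_tier(total_chars: int) -> str:
    """Map a creator's total transcript size to a sampling tier name."""
    if total_chars < 0:
        raise ValueError("total_chars must be non-negative")
    if total_chars < SMALL_MAX_CHARS:
        return "small"
    if total_chars <= LARGE_MIN_CHARS:
        return "medium"  # 300-character excerpts per key moment
    return "large"
```

Keeping the thresholds as named constants makes the boundary behavior easy to test and to adjust if the real cutoffs differ from this reading of the table.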
+
+**Celery task:** `extract_personality_profile` in `backend/pipeline/stages.py`
+- Joins KeyMoment → SourceVideo to load transcripts
+- Samples transcripts per the tier strategy
+- Calls LLM with `response_model=object` for JSON mode
+- Validates response with `PersonalityProfile` Pydantic model
+- Stores result as JSONB on Creator row
+- Emits pipeline_events for observability
+
+### Error Handling
+
+- Zero-transcript creators: early return, no profile
+- Invalid JSON from LLM: retry
+- Pydantic validation failure: retry
+- Pipeline events track start/complete/error
+
+## PersonalityProfile Schema
+
+Stored as `Creator.personality_profile` JSONB column. Nested structure:
+
+### VocabularyProfile
+
+| Field | Type | Description |
+|-------|------|-------------|
+| signature_phrases | list[str] | Characteristic phrases the creator uses repeatedly |
+| jargon_level | str | How technical their language is (e.g., "high", "moderate") |
+| filler_words | list[str] | Common filler words/phrases |
+| distinctive_terms | list[str] | Unique terminology or coined phrases |
+
+### ToneProfile
+
+| Field | Type | Description |
+|-------|------|-------------|
+| formality | str | Formal to casual spectrum |
+| energy | str | Energy level descriptor |
+| humor | str | Humor style/frequency |
+| teaching_style | str | Overall teaching approach |
+
+### StyleMarkersProfile
+
+| Field | Type | Description |
+|-------|------|-------------|
+| explanation_approach | str | How they explain concepts |
+| analogies | bool | Whether they use analogies frequently |
+| sound_words | bool | Whether they use onomatopoeia / sound words |
+| audience_engagement | str | How they address / engage viewers |
+
+### Metadata
+
+Each profile includes extraction metadata:
+
+| Field | Description |
+|-------|-------------|
+| extracted_at | ISO timestamp of extraction |
+| transcript_sample_size | Number of characters sampled |
+| model_used | LLM model identifier |
+
+## API
+
+### Admin Trigger
+
+| Method | Path | Purpose |
+|--------|------|---------|
+| POST | `/api/v1/admin/creators/{slug}/extract-profile` | Queue personality extraction task |
+
+Returns immediately — extraction runs asynchronously via Celery. Check `pipeline_events` for status.
+
+### Creator Detail
+
+`GET /api/v1/creators/{slug}` includes `personality_profile` field (null if not yet extracted).
+
+## Frontend Component
+
+`PersonalityProfile.tsx` — collapsible section on creator detail pages.
+
+### Layout
+
+- **Collapsible header** with chevron toggle (CSS `grid-template-rows: 0fr/1fr` animation)
+- **Three sub-cards:**
+  - **Teaching Style** — formality, energy, humor, teaching_style, explanation_approach, audience_engagement
+  - **Vocabulary** — jargon_level summary, signature_phrases pills, filler_words pills, distinctive_terms pills
+  - **Style** — analogies (checkmark/cross), sound_words (checkmark/cross), summary paragraph
+- **Metadata footer** — extraction date, sample size
+
+Handles null profiles gracefully (renders nothing).
+
+## Key Files
+
+- `prompts/personality_extraction.txt` — LLM prompt template
+- `backend/pipeline/stages.py` — `extract_personality_profile` Celery task, `_sample_creator_transcripts()` helper
+- `backend/schemas.py` — PersonalityProfile, VocabularyProfile, ToneProfile, StyleMarkersProfile Pydantic models
+- `backend/models.py` — Creator.personality_profile JSONB column
+- `backend/routers/admin.py` — POST /admin/creators/{slug}/extract-profile endpoint
+- `backend/routers/creators.py` — Passthrough in GET /creators/{slug}
+- `alembic/versions/023_add_personality_profile.py` — Migration
+- `frontend/src/components/PersonalityProfile.tsx` — Collapsible profile component
+- `frontend/src/api/creators.ts` — TypeScript interfaces for profile sub-objects
+
+## Design Decisions
+
+- **3-tier transcript sampling** — Balances coverage vs. token cost. Topic-diverse random sampling for large creators prevents profile skew toward dominant topic.
+- **Admin trigger endpoint** — On-demand extraction rather than automatic on ingest. Profiles are expensive (large LLM call) and only needed once per creator.
+- **JSONB storage** — Profile schema may evolve; JSONB avoids migration for every field change.
+
+---
+
+*See also: [[Data-Model]], [[API-Surface]], [[Frontend]], [[Pipeline]]*
diff --git a/_Sidebar.md b/_Sidebar.md
index 4c63794..0d8f5bc 100644
--- a/_Sidebar.md
+++ b/_Sidebar.md
@@ -14,6 +14,7 @@
 - [[Chat-Engine]]
 - [[Search-Retrieval]]
 - [[Highlights]]
+- [[Personality-Profiles]]

 **Reference**
 - [[API-Surface]]