Chat Engine
jlightner edited this page 2026-04-04 10:31:50 -05:00

Streaming question-answering interface backed by LightRAG retrieval and LLM completion. Added in M021/S03, expanded with multi-turn memory in M022/S04 and chat widget in M022/S03.

Architecture

User types question in ChatPage or ChatWidget
        │
        ▼
POST /api/v1/chat  { query, creator?, conversation_id? }
        │
        ▼
ChatService.stream(query, creator?, conversation_id?)
        │
        ├─ 1. Load history: Redis chrysopedia:chat:{conversation_id}
        │
        ├─ 2. Retrieve: SearchService.search(query, creator)
        │     └─ Uses 4-tier cascade if creator provided (see [[Search-Retrieval]])
        │
        ├─ 3. Prompt: System prompt + history + numbered context + user message
        │     └─ Sources formatted as [1] Title — Summary for citation mapping
        │
        ├─ 4. Stream: openai.AsyncOpenAI with stream=True
        │     └─ Tokens streamed as SSE events in real-time
        │
        ├─ 5. Save history: Append user message + assistant response to Redis
        │
        ▼
SSE response → ChatPage/ChatWidget renders tokens + citation links

SSE Protocol

The chat endpoint returns a text/event-stream response with four event types in strict order:

| Event | Payload | When |
| --- | --- | --- |
| sources | [{title, slug, creator_name, summary, source_video_id, start_time, end_time, video_filename}] | First — citation metadata for link rendering. Video fields added in M024/S05 for timestamp badges. |
| token | string (text chunk) | Repeated — streamed LLM completion tokens |
| done | {cascade_tier, conversation_id} | Once — signals completion, includes retrieval tier and conversation ID |
| error | {message: string} | On failure — emitted if the LLM errors mid-stream |

The cascade_tier in the done event reveals which tier of the retrieval cascade served the context. The conversation_id enables the frontend to thread follow-up messages.
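The wire framing behind these events is plain SSE text: an `event:` line naming the type, a `data:` line with the payload, and a blank line as terminator. A minimal sketch (the `sse_event` helper name is illustrative, not the backend's actual function):

```python
import json

def sse_event(event: str, data) -> str:
    """Frame a payload as one Server-Sent Events message.

    String payloads (token chunks) pass through as-is; structured
    payloads (sources, done, error) are JSON-encoded.
    """
    payload = data if isinstance(data, str) else json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"

# The event types in the order the endpoint emits them:
first = sse_event("sources", [{"title": "Reese Bass", "slug": "reese-bass"}])
chunk = sse_event("token", "A reese bass is")
last = sse_event("done", {"cascade_tier": 2, "conversation_id": "abc"})
```

The blank-line terminator is what lets the frontend's `ReadableStream` reader split the byte stream back into discrete events.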

Multi-Turn Conversation Memory (M022/S04)

Redis Storage

  • Key pattern: chrysopedia:chat:{conversation_id}
  • Format: Single JSON string containing a list of {role, content} message dicts
  • TTL: 1 hour, refreshed on each interaction
  • Cap: 10 turn pairs (20 messages) — oldest pairs trimmed when exceeded
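The storage rules above can be sketched in a few lines. This uses a plain dict standing in for Redis; the real service would use GET/SET with a 1-hour TTL refreshed on every call, and the function name is hypothetical:

```python
import json

MAX_TURN_PAIRS = 10  # cap from the spec: 20 messages total

def append_turn(store: dict, conversation_id: str,
                user_msg: str, assistant_msg: str) -> list:
    """Append one user/assistant turn pair, trimming the oldest pairs past the cap."""
    key = f"chrysopedia:chat:{conversation_id}"
    history = json.loads(store.get(key, "[]"))
    history.append({"role": "user", "content": user_msg})
    history.append({"role": "assistant", "content": assistant_msg})
    # Keep only the newest 10 pairs (20 messages), dropping the oldest first.
    history = history[-MAX_TURN_PAIRS * 2:]
    store[key] = json.dumps(history)  # single JSON string → atomic read/write
    return history

store = {}  # stand-in for Redis
for i in range(12):
    history = append_turn(store, "abc", f"question {i}", f"answer {i}")
# 12 pairs appended, but only the newest 10 pairs survive the trim
```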

Conversation Flow

  1. Client sends conversation_id in POST body (or omits for new conversation)
  2. Server auto-generates UUID when conversation_id is omitted
  3. History loaded from Redis and injected between system prompt and user message
  4. Assistant response accumulated during streaming
  5. User message + assistant response appended to history in Redis
  6. conversation_id returned in SSE done event for threading

Citation Format

The LLM is instructed to reference sources using numbered citations [N] in its response. The frontend parses these into superscript links:

  • [1] → links to /techniques/:slug for the corresponding source
  • Multiple citations supported: [1][3] or [1,3]
  • Citation regex: /\[(\d+)\]/g parsed by shared parseChatCitations() utility (M024/S05)
  • Timestamp badges — when start_time is defined, source cards show a badge linking to /watch/:id?t=N (M024/S05)
  • Video filename — displayed as subtle metadata on source cards (M024/S05)
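The citation mapping is simple to illustrate. A Python sketch of the same logic the frontend's `parseChatCitations()` applies (the `link_citations` helper here is illustrative, not the actual TypeScript implementation):

```python
import re

# Same pattern as the frontend's /\[(\d+)\]/g
CITATION_RE = re.compile(r"\[(\d+)\]")

def link_citations(text: str, sources: list) -> list:
    """Map [N] markers in a response to technique-page URLs."""
    links = []
    for match in CITATION_RE.finditer(text):
        n = int(match.group(1))
        if 1 <= n <= len(sources):  # citations are 1-based into the sources list
            links.append((n, f"/techniques/{sources[n - 1]['slug']}"))
    return links

sources = [{"slug": "reese-bass"}, {"slug": "sidechain"}]
links = link_citations("Detune two saws [1], then duck it [2].", sources)
# links → [(1, '/techniques/reese-bass'), (2, '/techniques/sidechain')]
```

Out-of-range markers (e.g. `[9]` with two sources) are silently skipped rather than rendered as dead links.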

API Endpoint

POST /api/v1/chat

| Field | Type | Required | Validation |
| --- | --- | --- | --- |
| query | string | Yes | 1–1000 characters |
| creator | string | No | Creator UUID or slug for scoped retrieval |
| conversation_id | string | No | UUID for multi-turn threading. Auto-generated if omitted. |

Response: text/event-stream (SSE)

Error responses:

  • 422 — Empty or missing query, or query exceeding 1000 characters
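The validation rules amount to a few checks. A hedged sketch (the real backend presumably uses a FastAPI/Pydantic request model; this standalone function is illustrative):

```python
import uuid

def validate_chat_request(body: dict) -> dict:
    """Validate a POST /api/v1/chat body per the field table above."""
    query = (body.get("query") or "").strip()
    if not query or len(query) > 1000:
        raise ValueError("422: query must be 1-1000 characters")
    return {
        "query": query,
        "creator": body.get("creator"),
        # Auto-generate a conversation_id when the client omits one,
        # so the done event always carries a thread ID.
        "conversation_id": body.get("conversation_id") or str(uuid.uuid4()),
    }
```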

Backend: ChatService

Located in backend/chat_service.py. The retrieve-prompt-stream pipeline:

  1. Load History — _load_history() reads from the Redis key chrysopedia:chat:{conversation_id}. Returns an empty list if the key is absent.
  2. Retrieve — Calls SearchService.search() with the query and optional creator parameter. Gets back ranked technique page results with the cascade_tier.
  3. Prompt — Builds message array: system prompt → conversation history → numbered context block → user message. System prompt instructs the LLM to act as a music production encyclopedia, cite sources with [N] notation, and stay grounded in the provided context.
  4. Stream — Opens an async streaming completion via openai.AsyncOpenAI. Yields SSE events as tokens arrive.
  5. Save History — _save_history() appends the user message and accumulated assistant response to Redis, trims to 10 turn pairs if exceeded, and refreshes the TTL to 1 hour.

Error handling: If the LLM fails mid-stream (after some tokens have been sent), an error event is emitted so the frontend can display a failure message rather than leaving the response hanging.
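The message-array assembly in step 3 can be sketched as follows. Whether the numbered context block rides in a second system message, as shown here, or is folded into the user turn is an assumption; the function name is illustrative:

```python
def build_messages(system_prompt, history, sources, user_query):
    """Assemble the completion messages in the documented order:
    system prompt → conversation history → numbered context → user message."""
    # Sources are numbered [1], [2], ... in the "[N] Title — Summary" format
    # so the LLM's citations map back to the sources event.
    context = "\n".join(
        f"[{i}] {s['title']} — {s['summary']}"
        for i, s in enumerate(sources, start=1)
    )
    return (
        [{"role": "system", "content": system_prompt}]
        + history
        + [{"role": "system", "content": f"Context:\n{context}"}]
        + [{"role": "user", "content": user_query}]
    )

msgs = build_messages(
    "You are a music production encyclopedia. Cite sources as [N].",
    [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}],
    [{"title": "Reese Bass", "summary": "Detuned saw layering."}],
    "How do I widen it?",
)
```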

Frontend: ChatPage

Route: /chat (lazy-loaded, code-split)

Components

  • Multi-message conversation UI — Messages array with conversation bubble layout
  • Conversation threading — conversationId state, "New conversation" button to reset
  • Streaming message display — Accumulates tokens with blinking cursor animation during streaming
  • Typing indicator — Three-dot animation while streaming
  • Citation markers — [N] parsed to superscript links targeting /techniques/:slug (per-message)
  • Source list — Numbered sources with creator attribution displayed below each response
  • Auto-scroll — Scrolls to bottom as new tokens arrive

SSE Client

Located in frontend/src/api/chat.ts. Uses fetch() + ReadableStream with typed callbacks:

streamChat(query, {
  onSources: (sources) => void,
  onToken: (token) => void,
  onDone: (data: ChatDoneMeta) => void,
  onError: (error) => void,
}, creatorName?, conversationId?)

ChatDoneMeta type includes cascade_tier and conversation_id fields.

Frontend: ChatWidget (M022/S03)

Floating chat bubble on creator detail pages. Fixed-position bottom-right.

Behavior

  • Bubble → click → slide-up panel with conversation UI
  • Creator-scoped: passes creatorName to streamChat() for retrieval cascade
  • Suggested questions generated client-side from technique titles and categories
  • Typing indicator — three-dot animation during streaming
  • Citation links — parsed from response, linked to technique pages
  • Responsive — full-width below 640px, 400px panel on desktop
  • Conversation threading — conversationId generated via crypto.randomUUID() on first send, threaded through streamChat(), updated from the done event
  • Reset on close — messages and conversationId cleared when panel closes

Personality Interpolation (M023/S02, S04)

The chat engine modulates the system prompt based on personality_weight (0.0–1.0) sent in the request. When a creator is specified, their personality_profile JSONB is loaded and progressively injected into the system prompt.

5-Tier System (D044)

| Weight Range | Tier | Profile Fields Included | Instruction |
| --- | --- | --- | --- |
| < 0.2 | None | (none) | Pure encyclopedic (no personality block) |
| 0.2–0.39 | Subtle Reference | Basic tone | "Subtly reference this creator's style" |
| 0.4–0.59 | Creator Tone | + descriptors, explanation approach | "Adopt this creator's teaching tone" |
| 0.6–0.79 | Creator Voice | + signature phrases (count scaled with weight) | "Channel this creator's voice" |
| 0.8–0.89 | Full Voice | + vocabulary, style markers | "Speak in this creator's voice" |
| 0.9–1.0 | Full Embodiment | + summary paragraph | "Fully embody this creator" |

Temperature Scaling

Linear: temperature = 0.3 + weight * 0.2

  • weight=0.0 → temperature 0.3 (encyclopedic, precise)
  • weight=1.0 → temperature 0.5 (still grounded, slight creative variance)
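Both mappings fit in a few lines. The tier boundaries and temperature formula below follow the table and the stated linear scaling; the function names are illustrative:

```python
def personality_tier(weight: float) -> str:
    """Map personality_weight to the 5-tier system (D044)."""
    if weight < 0.2:
        return "None"
    if weight < 0.4:
        return "Subtle Reference"
    if weight < 0.6:
        return "Creator Tone"
    if weight < 0.8:
        return "Creator Voice"
    if weight < 0.9:
        return "Full Voice"
    return "Full Embodiment"

def temperature_for(weight: float) -> float:
    """Linear scaling: 0.3 at weight 0.0, rising to 0.5 at weight 1.0."""
    return 0.3 + weight * 0.2

# personality_tier(0.7) → "Creator Voice"; temperature_for(0.7) ≈ 0.44
```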

Request Format

{
  "query": "How do I make a reese bass?",
  "creator": "creator-slug",
  "personality_weight": 0.7,
  "conversation_id": "uuid"
}

Graceful Fallback

If the creator has no personality_profile (null JSONB), the system falls back to pure encyclopedic mode regardless of weight value. DB errors during profile fetch are caught and logged — never crash the stream.

Citation Metadata Propagation (M024/S05)

End-to-end video metadata flow from search results through SSE to frontend source cards.

Backend

  • search_service.py — _enrich_qdrant_results() and _keyword_search_and() now batch-fetch SourceVideo filenames and include source_video_id, start_time, end_time, video_filename in result dicts. Non-key_moment types get empty/None values for uniform dict shape.
  • chat_service.py — _build_sources() passes all four video fields through to SSE source events.

Frontend

  • ChatSource interface (in api/chat.ts) extended with source_video_id, start_time, end_time, video_filename fields.
  • utils/chatCitations.tsx — shared parseChatCitations() replaces duplicate implementations in ChatPage and ChatWidget. Accepts CSS module styles as Record<string, string>.
  • utils/formatTime.ts — shared hour-aware time formatter used across timestamp badges and player controls.
  • Source cards now show: timestamp badge (links to /watch/:id?t=N when start_time defined) and video filename metadata.
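The hour-aware formatting and badge deep-link can be sketched as follows. This is a Python rendering of what utils/formatTime.ts presumably does; the actual TypeScript implementation may differ:

```python
def format_time(total_seconds: float) -> str:
    """Hour-aware timestamp: '2:05' under an hour, '1:02:05' at or above."""
    seconds = int(total_seconds)
    h, rem = divmod(seconds, 3600)
    m, s = divmod(rem, 60)
    if h:
        return f"{h}:{m:02d}:{s:02d}"
    return f"{m}:{s:02d}"

def watch_url(video_id: str, start_time: float) -> str:
    """Deep link for a timestamp badge: /watch/:id?t=N (whole seconds)."""
    return f"/watch/{video_id}?t={int(start_time)}"

# format_time(125) → '2:05'; format_time(3725) → '1:02:05'
```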

LLM Fallback Resilience (M025/S08)

ChatService maintains two AsyncOpenAI clients: primary (DGX endpoint) and fallback (Ollama). When the primary create() call fails with APIConnectionError, APITimeoutError, or InternalServerError, the entire streaming call is retried with the fallback client. The SSE done event includes fallback_used: true/false so the frontend and usage logging know which model actually served the response.

  • Config: LLM_FALLBACK_URL and LLM_FALLBACK_MODEL in docker-compose.yml
  • Logging: chat_llm_fallback WARNING when primary fails and fallback activates
  • Usage tracking: ChatUsageLog.model records actual model name (primary or fallback)
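The retry shape is worth sketching. The exception names below follow the openai SDK, but the clients are stand-in callables rather than real AsyncOpenAI objects, and the function name is illustrative:

```python
# Stand-ins for the openai SDK exception classes the service catches.
class APIConnectionError(Exception): ...
class APITimeoutError(Exception): ...
class InternalServerError(Exception): ...

RETRIABLE = (APIConnectionError, APITimeoutError, InternalServerError)

def stream_with_fallback(primary, fallback, messages):
    """Try the primary LLM client; on a retriable failure, rerun the
    whole call on the fallback and report which client served it."""
    try:
        return primary(messages), False   # fallback_used=False
    except RETRIABLE:
        # Real code would log a chat_llm_fallback WARNING here.
        return fallback(messages), True   # fallback_used=True

def flaky_primary(messages):
    raise APITimeoutError("DGX endpoint timed out")

def ollama_fallback(messages):
    return "response from fallback model"

text, fallback_used = stream_with_fallback(flaky_primary, ollama_fallback, [])
# fallback_used → True
```

Because the failure can occur before any token is sent, retrying the entire call (rather than resuming mid-stream) keeps the SSE event sequence consistent for the client.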

Refined System Prompt (M025/S09)

The system prompt was rewritten from 5 lines to a structured template covering:

  • Citation density: Cite every factual claim inline with [N] markers
  • Response format: Short paragraphs, bullet lists for step-by-step, bold key terms
  • Domain terminology: Music production context awareness
  • Conflicting sources: Present both perspectives with attribution
  • Response length: 2-4 paragraphs default, adjust to query complexity

Kept under 20 lines using markdown headers for structure. All 26 existing chat tests pass unchanged.

Chat Quality Evaluation Toolkit (M025/S09)

A 5-dimension LLM-as-judge evaluation framework:

  • Scorer (backend/pipeline/quality/chat_scorer.py): Grades responses on citation_accuracy, response_structure, domain_expertise, source_grounding, personality_fidelity
  • Eval Harness (backend/pipeline/quality/chat_eval.py): SSE-parsing runner that calls the live chat endpoint and feeds responses to the scorer
  • Test Suite (backend/pipeline/quality/fixtures/chat_test_suite.yaml): 10 queries across technical, conceptual, creator-scoped, and cross-creator categories
  • CLI: python -m pipeline.quality chat_eval subcommand for automated evaluation runs
  • Baseline report: Documented in S09-QUALITY-REPORT.md with JSON results

Key Files

  • backend/chat_service.py — ChatService with history load/save, retrieve-prompt-stream pipeline
  • backend/routers/chat.py — POST /api/v1/chat endpoint with conversation_id support
  • backend/tests/test_chat.py — 13 tests (6 streaming + 7 conversation memory)
  • frontend/src/api/chat.ts — SSE client with conversationId param and ChatDoneMeta type
  • frontend/src/pages/ChatPage.tsx — Multi-message conversation UI
  • frontend/src/pages/ChatPage.module.css — Conversation bubble layout styles
  • frontend/src/components/ChatWidget.tsx — Floating chat widget component
  • frontend/src/components/ChatWidget.module.css — Widget styles (38 custom property refs)
  • frontend/src/utils/chatCitations.tsx — Shared citation parser (M024/S05)
  • frontend/src/utils/formatTime.ts — Shared time formatter (M024/S05)

Design Decisions

  • Redis JSON string — Conversation history stored as single JSON value (atomic read/write) rather than Redis list type
  • Auto-generate conversation_id — Server creates UUID when client omits it, ensuring consistent done event shape
  • Widget resets on close — Clean slate UX; no persistence across open/close cycles
  • Client-side suggested questions — Generated from technique titles/categories without API call
  • 5-tier interpolation — Progressive field inclusion replaces 3-tier step function (D044 supersedes D043)
  • Shared citation parsing — parseChatCitations() in utils/chatCitations.tsx replaces duplicate implementations (M024/S05)
  • Standalone ASGI test client — Tests use mocked DB to avoid PostgreSQL dependency

See also: Search-Retrieval, API-Surface, Frontend