test: Created chat-specific LLM-as-judge scorer (5 dimensions), SSE-par…

- "backend/pipeline/quality/chat_scorer.py"
- "backend/pipeline/quality/chat_eval.py"
- "backend/pipeline/quality/fixtures/chat_test_suite.yaml"
- "backend/pipeline/quality/__main__.py"

GSD-Task: S09/T01
jlightner 2026-04-04 14:43:52 +00:00
parent 160adc24bf
commit 846db2aad5
14 changed files with 1397 additions and 2 deletions


@@ -13,7 +13,7 @@ Production hardening, mobile polish, creator onboarding, and formal validation.
| S05 | [B] AI Transparency Page | low | — | ✅ | Creator sees all entities, relationships, and technique pages derived from their content |
| S06 | [B] Graph Backend Evaluation | low | — | ✅ | Benchmark report: NetworkX vs Neo4j at current and projected entity counts |
| S07 | [A] Data Export (GDPR-Style) | medium | — | ✅ | Creator downloads a ZIP with all derived content, entities, and relationships |
| S08 | [B] Load Testing + Fallback Resilience | medium | — | | 10 concurrent chat sessions maintain acceptable latency. DGX down → Ollama fallback works. |
| S09 | [B] Prompt Optimization Pass | low | — | ⬜ | Chat quality reviewed across creators. Personality fidelity assessed. |
| S10 | Requirement Validation (R015, R037-R041) | low | — | ⬜ | R015, R037, R038, R039, R041 formally validated and signed off |
| S11 | Forgejo KB Final — Complete Documentation | low | S01, S02, S03, S04, S05, S06, S07, S08, S09, S10 | ⬜ | Forgejo wiki complete with newcomer onboarding guide covering entire platform |


@@ -0,0 +1,92 @@
---
id: S08
parent: M025
milestone: M025
provides:
- ChatService automatic LLM fallback (primary→Ollama)
- Load test script for chat SSE endpoint with latency statistics
requires:
[]
affects:
- S09
- S10
- S11
key_files:
- backend/chat_service.py
- backend/tests/test_chat.py
- docker-compose.yml
- scripts/load_test_chat.py
key_decisions:
- Catch APIConnectionError, APITimeoutError, and InternalServerError on primary create() then retry with fallback — matches sync LLMClient pattern
- Used httpx streaming + asyncio.gather for concurrent SSE load testing — no external tools needed
patterns_established:
- Async LLM fallback pattern: try primary streaming create(), on transient error reset state and retry with fallback client, propagate fallback_used flag through SSE and usage logging
observability_surfaces:
- chat_llm_fallback WARNING log when primary fails and fallback activates
- fallback_used field in SSE done event
- Usage log records actual model name (primary or fallback)
drill_down_paths:
- .gsd/milestones/M025/slices/S08/tasks/T01-SUMMARY.md
- .gsd/milestones/M025/slices/S08/tasks/T02-SUMMARY.md
duration: ""
verification_result: passed
completed_at: 2026-04-04T14:34:54.040Z
blocker_discovered: false
---
# S08: [B] Load Testing + Fallback Resilience
**ChatService now auto-falls back from primary to secondary LLM endpoint on connection/timeout/server errors, and a standalone load test script measures concurrent chat SSE latency with p50/p95/max statistics.**
## What Happened
Two tasks delivered the slice goal: resilient LLM fallback and a load testing tool.
T01 added automatic primary→fallback LLM endpoint switching in ChatService. When the primary AsyncOpenAI client fails with APIConnectionError, APITimeoutError, or InternalServerError during streaming, the entire create() call is retried with a fallback client pointing at the Ollama endpoint (configured via LLM_FALLBACK_URL/LLM_FALLBACK_MODEL in docker-compose.yml). The fallback_used boolean propagates through the SSE done event and usage logging so operators can see when fallback activates. This mirrors the pattern already established in the sync LLMClient used by pipeline stages. Five unit tests cover the connection-error and server-error paths plus existing fallback scenarios, all passing.
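The retry flow can be sketched as below. This is a minimal illustration, not the shipped code: the client factories, token stream, and locally defined exception classes are stand-ins (the real ChatService catches the openai SDK's APIConnectionError/APITimeoutError/InternalServerError and retries against the Ollama endpoint).

```python
import asyncio

# Stand-ins for the openai SDK exception classes so the sketch runs standalone.
class APIConnectionError(Exception): pass
class APITimeoutError(Exception): pass
class InternalServerError(Exception): pass

TRANSIENT_ERRORS = (APIConnectionError, APITimeoutError, InternalServerError)

async def stream_with_fallback(primary_create, fallback_create):
    """Try the primary streaming create(); on a transient error, retry the
    whole call with the fallback client. Yields (token, fallback_used)."""
    fallback_used = False
    try:
        stream = await primary_create()
    except TRANSIENT_ERRORS:
        # chat_llm_fallback: primary failed, switching to fallback endpoint
        fallback_used = True
        stream = await fallback_create()
    async for token in stream:
        yield token, fallback_used

async def _demo():
    # Primary is down; fallback streams three tokens.
    async def primary():
        raise APIConnectionError("primary unreachable")
    async def fallback():
        async def gen():
            for t in ("hello", " ", "world"):
                yield t
        return gen()
    return [tok async for tok, _used in stream_with_fallback(primary, fallback)]

tokens = asyncio.run(_demo())
print(tokens)  # ['hello', ' ', 'world']
```

The key property, matching the summary above, is that the fallback decision happens once per request and the `fallback_used` flag travels with every yielded token so it can be surfaced in the SSE done event and usage log.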
T02 created scripts/load_test_chat.py — a standalone asyncio+httpx script that fires N concurrent POST requests to the chat SSE endpoint, parses the event stream to measure time-to-first-token (TTFT) and total response time, and reports min/p50/p95/max statistics. Supports --auth-token (to avoid rate limiting), --output (JSON for CI), and --dry-run (offline SSE parsing verification). The dry-run mode validates the SSE parsing logic without a live server.
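The concurrency and statistics side of the script can be sketched as follows. This is a simplified stand-in: the real script opens streamed POSTs to the chat SSE endpoint via `httpx.AsyncClient().stream(...)`, whereas here `asyncio.sleep` substitutes for network time, and the nearest-rank percentile is one reasonable choice among several.

```python
import asyncio
import time

def percentile(values, p):
    """Nearest-rank percentile over a sorted copy (p in 0..100)."""
    s = sorted(values)
    idx = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[idx]

async def one_request(delay):
    """Stand-in for one streamed chat request; returns (ttft_ms, total_ms)."""
    start = time.monotonic()
    await asyncio.sleep(delay)           # pretend: time to first token
    ttft = (time.monotonic() - start) * 1000
    await asyncio.sleep(delay)           # pretend: rest of the stream
    total = (time.monotonic() - start) * 1000
    return ttft, total

async def run_load_test(concurrency=10):
    # Fire all requests concurrently, as the script does with asyncio.gather.
    results = await asyncio.gather(
        *(one_request(0.01) for _ in range(concurrency))
    )
    ttfts = [r[0] for r in results]
    return {
        "min": min(ttfts),
        "p50": percentile(ttfts, 50),
        "p95": percentile(ttfts, 95),
        "max": max(ttfts),
    }

stats = asyncio.run(run_load_test())
print(sorted(stats))  # ['max', 'min', 'p50', 'p95']
```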
## Verification
All slice-level verification checks passed:
1. `cd backend && python -m pytest tests/test_chat.py -v -k fallback` — 5/5 passed (0.47s)
2. `python scripts/load_test_chat.py --help` — exits 0, shows all flags
3. `python scripts/load_test_chat.py --dry-run` — parses mock SSE correctly (3 tokens, 1 success, stats printed)
4. docker-compose.yml contains LLM_FALLBACK_URL and LLM_FALLBACK_MODEL in API environment
5. chat_service.py contains fallback client initialization, try/except with retry, fallback_used in done event
## Requirements Advanced
None.
## Requirements Validated
None.
## New Requirements Surfaced
None.
## Requirements Invalidated or Re-scoped
None.
## Deviations
Test mock factory uses call_count=2/3 instead of 1/2 because patching chat_service.openai.AsyncOpenAI intercepts SearchService's constructor call as well (shared module object). Minor implementation detail, no impact on coverage.
## Known Limitations
Running 10 concurrent unauthenticated requests from one IP will hit the default rate limit (10/hour). Load test requires --auth-token or temporarily raised rate limit for meaningful results.
## Follow-ups
None.
## Files Created/Modified
- `backend/chat_service.py` — Added _fallback_openai client, try/except with retry on primary failure, fallback_used in done event and usage log
- `backend/tests/test_chat.py` — Added test_chat_fallback_on_connection_error and test_chat_fallback_on_internal_server_error
- `docker-compose.yml` — Added LLM_FALLBACK_URL and LLM_FALLBACK_MODEL to API environment
- `scripts/load_test_chat.py` — New standalone async load test script with SSE parsing, latency stats, dry-run mode


@@ -0,0 +1,120 @@
# S08: [B] Load Testing + Fallback Resilience — UAT
**Milestone:** M025
**Written:** 2026-04-04T14:34:54.040Z
## UAT: S08 — Load Testing + Fallback Resilience
### Preconditions
- Chrysopedia stack running (API, worker, DB, Redis, Qdrant, Ollama)
- At least one creator with chat-ready content in the database
- Access to docker-compose.yml and ability to stop/start containers
---
### TC-01: Fallback activates when primary LLM is unreachable
**Steps:**
1. SSH to ub01, stop or misconfigure the primary LLM endpoint (e.g., set LLM_URL to an unreachable host in the API container env)
2. Restart the API container: `docker compose restart chrysopedia-api`
3. Open the web UI, navigate to a creator's chat page
4. Send a chat message: "What techniques does this creator use?"
5. Observe the response streams back successfully
6. Check API logs: `docker logs chrysopedia-api 2>&1 | grep chat_llm_fallback`
**Expected:**
- Chat response completes (tokens stream via SSE)
- API log shows WARNING with `chat_llm_fallback primary failed (APIConnectionError: ...)`
- SSE done event contains `"fallback_used": true`
---
### TC-02: Fallback activates on 500 Internal Server Error
**Steps:**
1. Configure primary LLM endpoint to a service that returns 500 (or mock via test)
2. Run: `cd backend && python -m pytest tests/test_chat.py -v -k test_chat_fallback_on_internal_server_error`
**Expected:**
- Test passes
- SSE events include token data and done event with `fallback_used: true`
---
### TC-03: Normal operation uses primary (no fallback)
**Steps:**
1. Ensure primary LLM endpoint is healthy
2. Send a chat message through the web UI
3. Check API logs for absence of `chat_llm_fallback` WARNING
**Expected:**
- Response streams normally
- No fallback warning in logs
- SSE done event contains `"fallback_used": false`
---
### TC-04: Load test script dry-run validates SSE parsing
**Steps:**
1. Run: `python scripts/load_test_chat.py --dry-run`
**Expected:**
- Output shows "Dry-run mode: parsing mock SSE response..."
- Reports 1 success, 0 errors
- Shows statistics table with TTFT and Total columns
- Exits 0
---
### TC-05: Load test script help and flags
**Steps:**
1. Run: `python scripts/load_test_chat.py --help`
**Expected:**
- Shows usage with --url, --concurrency, --query, --auth-token, --output, --dry-run flags
- Documents rate limit note
- Exits 0
---
### TC-06: Load test with 10 concurrent sessions (live)
**Steps:**
1. Create an auth token or temporarily raise rate limit
2. Run: `python scripts/load_test_chat.py --concurrency 10 --auth-token <token> --output /tmp/load_results.json`
**Expected:**
- 10 requests fire concurrently
- Results table shows per-request TTFT and total time
- Statistics show min/p50/p95/max for both metrics
- JSON output file written with structured results
- All 10 succeed (0 errors) under normal load
---
### TC-07: Load test JSON output format
**Steps:**
1. Run: `python scripts/load_test_chat.py --dry-run --output /tmp/dry_results.json`
2. Inspect: `cat /tmp/dry_results.json | python -m json.tool`
**Expected:**
- Valid JSON with `summary` (containing ttft and total stats) and `requests` array
- Each request entry has status, ttft_ms, total_ms, tokens, error fields
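A shape sketch of the expected output file, with illustrative values (the field names follow the expectations above; the numbers are made up):

```python
import json

# Illustrative shape of the --output JSON; values are placeholders.
results = {
    "summary": {
        "ttft": {"min": 180.0, "p50": 220.0, "p95": 410.0, "max": 450.0},
        "total": {"min": 900.0, "p50": 1100.0, "p95": 1800.0, "max": 2000.0},
    },
    "requests": [
        {"status": 200, "ttft_ms": 220.0, "total_ms": 1100.0,
         "tokens": 42, "error": None},
    ],
}

# Round-trips cleanly through json, as TC-07's json.tool check requires.
assert json.loads(json.dumps(results)) == results
```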
---
### Edge Cases
### TC-08: Both primary and fallback fail
**Steps:**
1. Run: `cd backend && python -m pytest tests/test_chat.py -v` (existing tests cover double-failure path)
2. Or: configure both LLM_URL and LLM_FALLBACK_URL to unreachable hosts, send chat message
**Expected:**
- SSE stream emits an `event: error` with appropriate message
- No crash or hang — clean error delivery to client


@@ -0,0 +1,22 @@
{
"schemaVersion": 1,
"taskId": "T02",
"unitId": "M025/S08/T02",
"timestamp": 1775313209389,
"passed": true,
"discoverySource": "task-plan",
"checks": [
{
"command": "python scripts/load_test_chat.py --help",
"exitCode": 0,
"durationMs": 83,
"verdict": "pass"
},
{
"command": "echo 'Script OK'",
"exitCode": 0,
"durationMs": 10,
"verdict": "pass"
}
]
}


@@ -1,6 +1,62 @@
# S09: [B] Prompt Optimization Pass
**Goal:** Chat quality reviewed across creators with structured evaluation, prompt refined for better citation/structure/domain guidance, personality fidelity assessed at multiple weight levels.
**Demo:** After this: Chat quality reviewed across creators. Personality fidelity assessed.
## Tasks
- [x] **T01: Created chat-specific LLM-as-judge scorer (5 dimensions), SSE-parsing eval harness, 10-query test suite, and chat_eval CLI subcommand** — Create a chat-specific quality evaluation module extending the existing pipeline/quality/ toolkit pattern. The scorer uses LLM-as-judge with a chat-specific rubric covering 5 dimensions: citation_accuracy (are citations real and correctly numbered), response_structure (concise, well-organized, uses appropriate formatting), domain_expertise (music production terminology used naturally), source_grounding (claims backed by provided sources, no fabrication), and personality_fidelity (at weight>0, response reflects creator voice proportional to weight).
The evaluation script sends queries to the live chat HTTP endpoint (configurable base URL), parses SSE responses, then scores each response using the LLM judge. It accepts a YAML/JSON test suite defining queries, expected creator scopes, and personality weights.
Follow the existing scorer.py pattern: rubric as a multi-line string constant, ScoreResult dataclass, dimension-level float scores 0.0-1.0, composite average.
Steps:
1. Read `backend/pipeline/quality/scorer.py` for the scoring pattern (StageConfig, rubric format, ScoreResult dataclass, _parse_scores)
2. Create `backend/pipeline/quality/chat_scorer.py` with ChatScoreResult dataclass (5 dimensions), chat-specific rubric prompt, and ChatScoreRunner class that takes an LLM judge client and scores a (query, response, sources, personality_weight, creator_name) tuple
3. Create `backend/pipeline/quality/chat_eval.py` with evaluation harness: loads a test suite YAML, calls the chat endpoint via httpx, parses SSE events, collects (query, accumulated_response, sources, metadata), feeds each to ChatScoreRunner, writes results JSON
4. Create `backend/pipeline/quality/fixtures/chat_test_suite.yaml` with 8-10 representative queries: 2 technical how-to, 2 conceptual, 2 creator-specific (with personality weights 0.0 and 0.7), 2 cross-creator
5. Wire `chat_eval` subcommand into `backend/pipeline/quality/__main__.py`
6. Verify the module imports cleanly: `cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner'`
- Estimate: 2h
- Files: backend/pipeline/quality/chat_scorer.py, backend/pipeline/quality/chat_eval.py, backend/pipeline/quality/fixtures/chat_test_suite.yaml, backend/pipeline/quality/__main__.py
- Verify: cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner, ChatScoreResult; from pipeline.quality.chat_eval import ChatEvalRunner; print("OK")'
- [ ] **T02: Refine chat system prompt and verify no test regressions** — Improve the `_SYSTEM_PROMPT_TEMPLATE` in `backend/chat_service.py` based on the gaps identified in research: the current prompt is 5 lines with no guidance on citation density, response structure, domain awareness, conflicting source handling, or response length.
The refined prompt should:
- Guide citation density: cite every factual claim, prefer inline citations [N] immediately after the claim
- Set response structure: use short paragraphs, bullet lists for steps/lists, bold key terms on first mention
- Add domain awareness: mention music production context, handle audio/synth/mixing terminology naturally
- Handle conflicting sources: when sources disagree, present both perspectives with their citations
- Set response length: aim for concise answers (2-4 paragraphs), expand only when the question warrants detail
- Preserve the existing constraint: ONLY use numbered sources, do not invent facts
Keep the prompt under 30 lines — this is chat, not synthesis. The personality block is appended separately and should not be duplicated here.
Steps:
1. Read `backend/chat_service.py` — locate `_SYSTEM_PROMPT_TEMPLATE` (around line 37)
2. Read `backend/tests/test_chat.py` to understand what the tests assert about the prompt/response format
3. Rewrite `_SYSTEM_PROMPT_TEMPLATE` with the improvements above, keeping `{context_block}` placeholder
4. Run existing chat tests: `cd backend && python -m pytest tests/test_chat.py -v` — all must pass
5. If any tests fail due to prompt content assertions, update the assertions to match the new prompt while preserving the intent of the test
- Estimate: 1h
- Files: backend/chat_service.py, backend/tests/test_chat.py
- Verify: cd backend && python -m pytest tests/test_chat.py -v
- [ ] **T03: Run chat evaluation, assess personality fidelity, write quality report** — Execute the chat evaluation harness against the live Chrysopedia chat endpoint on ub01, assess personality fidelity across weight levels for multiple creators, and write a quality report documenting all findings.
This task requires the live stack running on ub01 (API at http://ub01:8096). If the endpoint is unreachable, use manual curl-based evaluation with representative queries and score responses by inspection.
Steps:
1. Read `backend/pipeline/quality/chat_eval.py` and `backend/pipeline/quality/fixtures/chat_test_suite.yaml` from T01
2. Read `backend/chat_service.py` to review the refined prompt from T02
3. Attempt to run the evaluation: `ssh ub01 'cd /vmPool/r/repos/xpltdco/chrysopedia && docker exec chrysopedia-api python -m pipeline.quality chat_eval --base-url http://localhost:8000 --output /app/pipeline/quality/results/chat_eval_baseline.json'` — if this fails due to stack not running or endpoint issues, fall back to manual curl evaluation
4. For personality fidelity: test at least 2 creators with personality profiles, querying the same question at weights 0.0, 0.5, 0.8, 1.0. Verify progressive personality injection is visible in responses.
5. If automated eval ran: copy results back with `scp ub01:/vmPool/r/repos/xpltdco/chrysopedia/backend/pipeline/quality/results/chat_eval_*.json backend/pipeline/quality/results/`
6. Write quality report to `.gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md` covering:
- Chat quality baseline scores (per-dimension if automated eval succeeded, qualitative if manual)
- Prompt changes summary (before/after comparison)
- Personality fidelity assessment per weight tier
- Recommendations for future improvements
7. If automated eval produced results, also write the raw JSON to `backend/pipeline/quality/results/`
- Estimate: 1.5h
- Files: backend/pipeline/quality/results/chat_eval_baseline.json, .gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md
- Verify: test -f .gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md && wc -l .gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md | awk '{exit ($1 < 30)}'


@@ -0,0 +1,124 @@
# S09 Research — Prompt Optimization Pass
## Summary
This slice reviews chat quality across creators and assesses personality fidelity. The codebase already has a mature quality toolkit (`pipeline/quality/`) with scorer, optimizer, variant generator, and voice dial — all targeting **pipeline stages 2-5** (content synthesis). However, **no equivalent scoring/optimization exists for the chat system prompt**. The chat prompt in `chat_service.py` is a static template that hasn't been through the same rigor as the stage 5 synthesis prompt (which went through 100 variants and a formal optimization loop).
The work divides into two independent streams:
1. **Chat prompt quality review** — evaluate the `_SYSTEM_PROMPT_TEMPLATE` in `chat_service.py` and the context block construction
2. **Personality fidelity assessment** — evaluate the `_build_personality_block()` function and its tiered weight system against actual creator personality profiles
## Requirement Targets
- No active requirements explicitly own this slice. It supports overall chat quality which feeds into R015 (30-Second Retrieval Target) — better chat responses reduce the time-to-insight.
## Implementation Landscape
### Chat System Architecture
The chat system has these components:
| Component | File | Purpose |
|---|---|---|
| System prompt template | `backend/chat_service.py:_SYSTEM_PROMPT_TEMPLATE` | Static template with `{context_block}` placeholder |
| Context block builder | `backend/chat_service.py:_build_context_block()` | Formats search results as numbered `[N] Title by Creator` blocks |
| Personality injector | `backend/chat_service.py:_inject_personality()` | Queries Creator.personality_profile from DB, appends voice block |
| Personality block builder | `backend/chat_service.py:_build_personality_block()` | 5-tier progressive personality injection based on weight 0.0-1.0 |
| Search cascade | `backend/search_service.py:search()` | Creator→Domain→Global→None cascade with LightRAG + keyword + Qdrant |
| Chat router | `backend/routers/chat.py` | Rate limiting, SSE streaming, personality_weight parameter |
| Frontend widget | `frontend/src/components/ChatWidget.tsx` | Slider for personality_weight, SSE consumption |
### Current Chat Prompt (verbatim)
```
You are Chrysopedia, an expert encyclopedic assistant for music production techniques.
Answer the user's question using ONLY the numbered sources below. Cite sources by
writing [N] inline (e.g. [1], [2]) where N is the source number. If the sources
do not contain enough information, say so honestly — do not invent facts.
Sources:
{context_block}
```
This is minimal — 5 lines. The pipeline stage 5 synthesis prompt is 251 lines and went through formal optimization. The chat prompt has room for improvement in:
- Citation format guidance (when to cite, how many citations per claim)
- Response length/format guidance (concise vs detailed)
- Music production domain awareness
- Handling of conflicting sources
- Response structure (should it use headers, bullet points, etc.)
### Personality Block System
The `_build_personality_block()` function uses 5 tiers:
- **< 0.2**: No personality (empty string)
- **0.2-0.39**: Subtle hint — "subtly reference {name}'s communication style"
- **0.4-0.59**: Adopt tone — descriptors, explanation_approach, audience_engagement
- **0.6-0.79**: Creator voice — signature phrases (count scaled by weight)
- **0.8-0.89**: Full embody — distinctive_terms, sound_descriptions, self-references, pacing
- **>= 0.9**: + full summary paragraph
The personality data lives in `Creator.personality_profile` (JSONB column), extracted by `prompts/personality_extraction.txt`.
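The tier thresholds above can be sketched as a single function. This is an assumption-laden reconstruction, not the actual `_build_personality_block()`: the profile field names (`descriptors`, `signature_phrases`, `summary`), the creator name, and the exact wording are all hypothetical placeholders; only the five weight bands follow the tiers documented here.

```python
def build_personality_block(weight, profile, name):
    """Sketch of the 5-tier progressive injection; field names are assumed."""
    if weight < 0.2:
        return ""                                           # no personality
    if weight < 0.4:
        return f"Subtly reference {name}'s communication style."
    if weight < 0.6:
        return (f"Adopt {name}'s tone: "
                + ", ".join(profile.get("descriptors", [])))
    if weight < 0.8:
        # Number of signature phrases scales with the weight.
        phrases = profile.get("signature_phrases", [])
        n = max(1, int(weight * len(phrases)))
        return (f"Write in {name}'s voice. Signature phrases: "
                + "; ".join(phrases[:n]))
    block = f"Fully embody {name}: use their distinctive terms and pacing."
    if weight >= 0.9:
        block += " " + profile.get("summary", "")           # + full summary
    return block

# Hypothetical creator profile for illustration.
profile = {"descriptors": ["warm", "direct"],
           "signature_phrases": ["let's dial it in", "trust your ears"],
           "summary": "Hands-on synth educator."}
print(repr(build_personality_block(0.1, profile, "Ava")))   # ''
print(build_personality_block(0.5, profile, "Ava"))
```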
### Voice Dial (Stage 5)
`pipeline/quality/voice_dial.py` provides a 3-band modifier for stage 5 synthesis:
- Low (0-0.33): Clinical, suppress quotes
- Mid (0.34-0.66): Base prompt unchanged
- High (0.67-1.0): Maximum voice, prioritize exact words
This is separate from the chat personality system. The chat personality block is injected into the chat prompt at runtime; the voice dial modifies the stage 5 synthesis prompt at build time.
### Existing Quality Toolkit
The `pipeline/quality/` module provides:
- `scorer.py` — LLM-as-judge with per-stage rubrics (stages 2-5)
- `optimizer.py` — Automated generate→score→select loop
- `variant_generator.py` — LLM-powered prompt mutation
- `voice_dial.py` — Voice preservation modifier for stage 5
- `fitness.py` — LLM fitness tests
- `__main__.py` — CLI with `fitness`, `score`, `optimize`, `apply` subcommands
Previous optimization results: `backend/pipeline/quality/results/optimize_stage5_20260401_100005.json` shows stage 5 went from composite 0.95 to 1.0 across 3 iterations with 2 variants each. 100 stage 5 variants exist in `prompts/stage5_variants/`.
### Test Coverage
`backend/tests/test_chat.py` has integration tests for SSE format, citation numbering, creator forwarding, error events, and multi-turn memory. Uses standalone ASGI client with mocked DB. Tests verify protocol correctness but NOT response quality.
## Recommendation
### Approach: Manual review + structured evaluation script + prompt refinement
This is a **quality assessment and refinement** slice, not a feature build. The work should be:
1. **Build a chat quality evaluation harness** — extend the existing quality toolkit with a chat-specific scoring rubric and evaluation script that can test the chat prompt against sample queries across multiple creators. Dimensions: citation accuracy, response conciseness, domain expertise, personality fidelity (at various weight levels), source grounding.
2. **Run evaluations** — test the current chat prompt against a set of representative queries (technical how-to, conceptual explanation, creator-specific, cross-creator) with and without personality injection.
3. **Refine the chat prompt** — based on evaluation results, improve `_SYSTEM_PROMPT_TEMPLATE` with better guidance on citation density, response structure, domain terminology, and conciseness.
4. **Assess personality fidelity** — evaluate the personality injection at multiple weight levels for at least 2-3 creators, documenting where the tiered system works well and where it breaks down.
5. **Document findings** — write a quality report summarizing chat quality baselines, personality fidelity assessment, and any prompt changes made.
### Natural Task Seams
| Task | Scope | Files | Risk |
|---|---|---|---|
| T01: Chat quality evaluation harness | New scoring rubric + eval script for chat responses | `backend/pipeline/quality/chat_scorer.py`, `backend/pipeline/quality/chat_eval.py` | Low — extends known pattern |
| T02: Run chat quality evaluation | Execute eval against live chat endpoint across creators | Quality results output | Low — execution, not code |
| T03: Chat prompt refinement | Improve `_SYSTEM_PROMPT_TEMPLATE` based on T02 findings | `backend/chat_service.py` | Low — prompt editing |
| T04: Personality fidelity assessment | Evaluate personality tiers across creators, document findings | Assessment report | Low — evaluation, not code |
### Key Constraints
- Chat evaluation requires a running LLM endpoint (DGX Qwen or Ollama fallback)
- Personality fidelity assessment needs creators with populated `personality_profile` JSONB
- The chat prompt is short enough that manual refinement is more appropriate than the automated optimization loop (which was designed for the much longer stage 5 prompt)
- Changes to `_SYSTEM_PROMPT_TEMPLATE` affect all users immediately — no versioning mechanism like the stage 5 variants system
### Verification Strategy
- Before/after comparison of chat responses for the same queries
- Personality fidelity: compare responses at weight=0.0, 0.5, 0.8, 1.0 for the same query+creator — personality should be progressively more apparent
- Citation accuracy: responses should cite numbered sources, no hallucinated citations
- No regressions in existing `test_chat.py` tests


@@ -0,0 +1,38 @@
---
estimated_steps: 10
estimated_files: 4
skills_used: []
---
# T01: Build chat quality evaluation harness and scoring rubric
Create a chat-specific quality evaluation module extending the existing pipeline/quality/ toolkit pattern. The scorer uses LLM-as-judge with a chat-specific rubric covering 5 dimensions: citation_accuracy (are citations real and correctly numbered), response_structure (concise, well-organized, uses appropriate formatting), domain_expertise (music production terminology used naturally), source_grounding (claims backed by provided sources, no fabrication), and personality_fidelity (at weight>0, response reflects creator voice proportional to weight).
The evaluation script sends queries to the live chat HTTP endpoint (configurable base URL), parses SSE responses, then scores each response using the LLM judge. It accepts a YAML/JSON test suite defining queries, expected creator scopes, and personality weights.
Follow the existing scorer.py pattern: rubric as a multi-line string constant, ScoreResult dataclass, dimension-level float scores 0.0-1.0, composite average.
Steps:
1. Read `backend/pipeline/quality/scorer.py` for the scoring pattern (StageConfig, rubric format, ScoreResult dataclass, _parse_scores)
2. Create `backend/pipeline/quality/chat_scorer.py` with ChatScoreResult dataclass (5 dimensions), chat-specific rubric prompt, and ChatScoreRunner class that takes an LLM judge client and scores a (query, response, sources, personality_weight, creator_name) tuple
3. Create `backend/pipeline/quality/chat_eval.py` with evaluation harness: loads a test suite YAML, calls the chat endpoint via httpx, parses SSE events, collects (query, accumulated_response, sources, metadata), feeds each to ChatScoreRunner, writes results JSON
4. Create `backend/pipeline/quality/fixtures/chat_test_suite.yaml` with 8-10 representative queries: 2 technical how-to, 2 conceptual, 2 creator-specific (with personality weights 0.0 and 0.7), 2 cross-creator
5. Wire `chat_eval` subcommand into `backend/pipeline/quality/__main__.py`
6. Verify the module imports cleanly: `cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner'`
## Inputs
- `backend/pipeline/quality/scorer.py`
- `backend/pipeline/quality/__main__.py`
- `backend/chat_service.py`
## Expected Output
- `backend/pipeline/quality/chat_scorer.py`
- `backend/pipeline/quality/chat_eval.py`
- `backend/pipeline/quality/fixtures/chat_test_suite.yaml`
- `backend/pipeline/quality/__main__.py`
## Verification
cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner, ChatScoreResult; from pipeline.quality.chat_eval import ChatEvalRunner; print("OK")'
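The scorer's result shape described above (dimension-level float scores 0.0-1.0, composite average) can be sketched minimally. The real ChatScoreResult in chat_scorer.py carries the rubric and judge plumbing; this stripped-down version only shows the scoring contract.

```python
from dataclasses import dataclass

# The five dimensions named in the task description.
DIMENSIONS = (
    "citation_accuracy",
    "response_structure",
    "domain_expertise",
    "source_grounding",
    "personality_fidelity",
)

@dataclass
class ChatScoreResult:
    scores: dict  # dimension name -> float in 0.0-1.0

    @property
    def composite(self) -> float:
        """Composite is the plain average across dimensions."""
        return sum(self.scores.values()) / len(self.scores)

result = ChatScoreResult(scores={d: 0.8 for d in DIMENSIONS})
print(round(result.composite, 2))  # 0.8
```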


@@ -0,0 +1,84 @@
---
id: T01
parent: S09
milestone: M025
provides: []
requires: []
affects: []
key_files: ["backend/pipeline/quality/chat_scorer.py", "backend/pipeline/quality/chat_eval.py", "backend/pipeline/quality/fixtures/chat_test_suite.yaml", "backend/pipeline/quality/__main__.py"]
key_decisions: ["Reused ScoreResult pattern (generic scores dict + composite) rather than subclassing — keeps chat scorer independent", "Used synchronous httpx for SSE parsing — matches LLMClient sync pattern", "Personality fidelity dimension scores differently based on weight=0 vs weight>0"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "Import verification passed: ChatScoreRunner, ChatScoreResult, ChatEvalRunner all importable. CLI subcommand renders help correctly. YAML test suite loads all 10 cases with correct categories, personality weights, and creator assignments."
completed_at: 2026-04-04T14:43:48.992Z
blocker_discovered: false
---
# T01: Created chat-specific LLM-as-judge scorer (5 dimensions), SSE-parsing eval harness, 10-query test suite, and chat_eval CLI subcommand
**Created chat-specific LLM-as-judge scorer (5 dimensions), SSE-parsing eval harness, 10-query test suite, and chat_eval CLI subcommand**
## What Happened
Built three new files extending the pipeline/quality toolkit for chat response evaluation: chat_scorer.py with ChatScoreResult and ChatScoreRunner (5 dimensions: citation_accuracy, response_structure, domain_expertise, source_grounding, personality_fidelity), chat_eval.py with ChatEvalRunner that calls the live chat SSE endpoint via httpx and scores responses, and a 10-query YAML test suite covering technical, conceptual, creator-scoped (weight=0 and 0.7), and cross-creator categories. Wired chat_eval subcommand into the quality CLI.
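The SSE-parsing half of ChatEvalRunner can be sketched as below. The event names (`token`, `done`) and payload fields match the behavior described elsewhere in this document (done event carrying `fallback_used`), but the exact wire format is an assumption; the raw string here is a mock, as in the script's dry-run mode.

```python
import json

def parse_sse(raw):
    """Minimal SSE parser sketch: accumulate token events, capture the
    done payload. Blank lines delimit events per the SSE framing rules."""
    tokens, done, event = [], None, None
    for line in raw.splitlines():
        if line.startswith("event:"):
            event = line.split(":", 1)[1].strip()
        elif line.startswith("data:"):
            payload = json.loads(line.split(":", 1)[1].strip())
            if event == "done":
                done = payload
            else:
                tokens.append(payload.get("token", ""))
        elif line == "":
            event = None   # event boundary
    return "".join(tokens), done

# Mock stream, in the spirit of the dry-run mode.
raw = ('event: token\ndata: {"token": "Use"}\n\n'
       'event: token\ndata: {"token": " sidechain"}\n\n'
       'event: done\ndata: {"fallback_used": false}\n\n')
text, done = parse_sse(raw)
print(text, done["fallback_used"])  # Use sidechain False
```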
## Verification
Import verification passed: ChatScoreRunner, ChatScoreResult, ChatEvalRunner all importable. CLI subcommand renders help correctly. YAML test suite loads all 10 cases with correct categories, personality weights, and creator assignments.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner, ChatScoreResult; from pipeline.quality.chat_eval import ChatEvalRunner; print("OK")'` | 0 | ✅ pass | 800ms |
| 2 | `cd backend && python -m pipeline.quality chat_eval --help` | 0 | ✅ pass | 600ms |
| 3 | `YAML suite load test (10 cases)` | 0 | ✅ pass | 500ms |
## Deviations
None.
## Known Issues
None.
## Files Created/Modified
- `backend/pipeline/quality/chat_scorer.py`
- `backend/pipeline/quality/chat_eval.py`
- `backend/pipeline/quality/fixtures/chat_test_suite.yaml`
- `backend/pipeline/quality/__main__.py`


@@ -0,0 +1,41 @@
---
estimated_steps: 15
estimated_files: 2
skills_used: []
---
# T02: Refine chat system prompt and verify no test regressions
Improve the `_SYSTEM_PROMPT_TEMPLATE` in `backend/chat_service.py` based on the gaps identified in research: the current prompt is 5 lines with no guidance on citation density, response structure, domain awareness, conflicting source handling, or response length.
The refined prompt should:
- Guide citation density: cite every factual claim, prefer inline citations [N] immediately after the claim
- Set response structure: use short paragraphs, bullet lists for steps/lists, bold key terms on first mention
- Add domain awareness: mention music production context, handle audio/synth/mixing terminology naturally
- Handle conflicting sources: when sources disagree, present both perspectives with their citations
- Set response length: aim for concise answers (2-4 paragraphs), expand only when the question warrants detail
- Preserve the existing constraint: ONLY use numbered sources, do not invent facts
Keep the prompt under 30 lines — this is chat, not synthesis. The personality block is appended separately and should not be duplicated here.
Steps:
1. Read `backend/chat_service.py` — locate `_SYSTEM_PROMPT_TEMPLATE` (around line 37)
2. Read `backend/tests/test_chat.py` to understand what the tests assert about the prompt/response format
3. Rewrite `_SYSTEM_PROMPT_TEMPLATE` with the improvements above, keeping `{context_block}` placeholder
4. Run existing chat tests: `cd backend && python -m pytest tests/test_chat.py -v` — all must pass
5. If any tests fail due to prompt content assertions, update the assertions to match the new prompt while preserving the intent of the test
## Inputs
- `backend/chat_service.py`
- `backend/tests/test_chat.py`
- `backend/pipeline/quality/chat_scorer.py`
## Expected Output
- `backend/chat_service.py`
- `backend/tests/test_chat.py`
## Verification
cd backend && python -m pytest tests/test_chat.py -v


@ -0,0 +1,39 @@
---
estimated_steps: 14
estimated_files: 2
skills_used: []
---
# T03: Run chat evaluation, assess personality fidelity, write quality report
Execute the chat evaluation harness against the live Chrysopedia chat endpoint on ub01, assess personality fidelity across weight levels for multiple creators, and write a quality report documenting all findings.
This task requires the live stack to be running on ub01 (API at http://ub01:8096). If the endpoint is unreachable, fall back to manual curl-based evaluation with representative queries and score the responses by inspection.
Steps:
1. Read `backend/pipeline/quality/chat_eval.py` and `backend/pipeline/quality/fixtures/chat_test_suite.yaml` from T01
2. Read `backend/chat_service.py` to review the refined prompt from T02
3. Attempt to run the evaluation: `ssh ub01 'cd /vmPool/r/repos/xpltdco/chrysopedia && docker exec chrysopedia-api python -m pipeline.quality chat_eval --base-url http://localhost:8000 --output /app/pipeline/quality/results/chat_eval_baseline.json'` — if this fails due to stack not running or endpoint issues, fall back to manual curl evaluation
4. For personality fidelity: test at least 2 creators with personality profiles, querying the same question at weights 0.0, 0.5, 0.8, 1.0. Verify progressive personality injection is visible in responses.
5. If automated eval ran: copy results back with `scp ub01:/vmPool/r/repos/xpltdco/chrysopedia/backend/pipeline/quality/results/chat_eval_*.json backend/pipeline/quality/results/`
6. Write quality report to `.gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md` covering:
- Chat quality baseline scores (per-dimension if automated eval succeeded, qualitative if manual)
- Prompt changes summary (before/after comparison)
- Personality fidelity assessment per weight tier
- Recommendations for future improvements
7. If automated eval produced results, also write the raw JSON to `backend/pipeline/quality/results/`
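The personality sweep in step 4 can be scripted as well. A hypothetical helper reusing the payload shape from `chat_eval.py` (`query`, `creator`, `personality_weight`; the field is only sent when the weight is positive):

```python
def build_payload(query: str, creator: str, weight: float) -> dict:
    """Build a chat request payload — weight is omitted at 0.0, as chat_eval.py does."""
    payload = {"query": query, "creator": creator}
    if weight > 0:
        payload["personality_weight"] = weight
    return payload

# The four weight tiers from step 4, applied to one creator-scoped query.
WEIGHTS = (0.0, 0.5, 0.8, 1.0)
payloads = [
    build_payload("How do you approach bass sound design?", "KEOTA", w)
    for w in WEIGHTS
]
```

Posting each payload and diffing the responses side by side makes progressive personality injection easy to eyeball.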
## Inputs
- `backend/pipeline/quality/chat_eval.py`
- `backend/pipeline/quality/fixtures/chat_test_suite.yaml`
- `backend/chat_service.py`
## Expected Output
- `.gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md`
- `backend/pipeline/quality/results/chat_eval_baseline.json`
## Verification
test -f .gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md && wc -l .gsd/milestones/M025/slices/S09/S09-QUALITY-REPORT.md | awk '{exit ($1 < 30)}'


@ -18,6 +18,8 @@ from pathlib import Path
from config import get_settings
from pipeline.llm_client import LLMClient
from .chat_eval import ChatEvalRunner
from .chat_scorer import ChatScoreRunner
from .fitness import FitnessRunner
from .optimizer import OptimizationLoop, OptimizationResult
from .scorer import DIMENSIONS, STAGE_CONFIGS, ScoreRunner
@ -260,6 +262,36 @@ def main() -> int:
help="Write the winning prompt back to the stage's prompt file (backs up the original first)",
)
# -- chat_eval subcommand --
chat_parser = sub.add_parser(
"chat_eval",
help="Evaluate chat quality across a test suite of queries",
)
chat_parser.add_argument(
"--suite",
type=str,
required=True,
help="Path to a chat test suite YAML/JSON file",
)
chat_parser.add_argument(
"--base-url",
type=str,
default="http://localhost:8096",
help="Chat API base URL (default: http://localhost:8096)",
)
chat_parser.add_argument(
"--output",
type=str,
default="backend/pipeline/quality/results/",
help="Output path for results JSON (default: backend/pipeline/quality/results/)",
)
chat_parser.add_argument(
"--timeout",
type=float,
default=120.0,
help="Request timeout in seconds (default: 120)",
)
args = parser.parse_args()
if args.command is None:
@ -281,6 +313,9 @@ def main() -> int:
if args.command == "apply":
return _run_apply(args)
if args.command == "chat_eval":
return _run_chat_eval(args)
return 0
@ -558,5 +593,54 @@ def _run_apply(args: argparse.Namespace) -> int:
return 0 if success else 1
def _run_chat_eval(args: argparse.Namespace) -> int:
"""Execute the chat_eval subcommand — evaluate chat quality across a test suite."""
suite_path = Path(args.suite)
if not suite_path.exists():
print(f"Error: suite file not found: {args.suite}", file=sys.stderr)
return 1
# Load test cases
try:
cases = ChatEvalRunner.load_suite(suite_path)
except Exception as exc:
print(f"Error loading test suite: {exc}", file=sys.stderr)
return 1
if not cases:
print("Error: test suite contains no queries", file=sys.stderr)
return 1
print(f"\n Chat Evaluation: {len(cases)} queries from {suite_path}")
print(f" Endpoint: {args.base_url}")
# Build scorer and runner
settings = get_settings()
client = LLMClient(settings)
scorer = ChatScoreRunner(client)
runner = ChatEvalRunner(
scorer=scorer,
base_url=args.base_url,
timeout=args.timeout,
)
# Execute
results = runner.run_suite(cases)
# Print summary
runner.print_summary(results)
# Write results
try:
json_path = runner.write_results(results, args.output)
print(f" Results written to: {json_path}")
except OSError as exc:
print(f" Warning: failed to write results: {exc}", file=sys.stderr)
# Exit code: 0 if at least one scored, 1 if all errored
scored = [r for r in results if r.score and not r.score.error and not r.request_error]
return 0 if scored else 1
if __name__ == "__main__":
sys.exit(main())


@ -0,0 +1,352 @@
"""Chat evaluation harness — sends queries to the live chat endpoint, scores responses.
Loads a test suite (YAML or JSON), calls the chat HTTP endpoint for each query,
parses SSE events to collect response text and sources, then scores each using
ChatScoreRunner. Writes results to a JSON file.
Usage:
python -m pipeline.quality chat_eval --suite fixtures/chat_test_suite.yaml
python -m pipeline.quality chat_eval --suite fixtures/chat_test_suite.yaml --base-url http://ub01:8096
"""
from __future__ import annotations
import json
import logging
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any
import httpx
from pipeline.llm_client import LLMClient
from pipeline.quality.chat_scorer import CHAT_DIMENSIONS, ChatScoreResult, ChatScoreRunner
logger = logging.getLogger(__name__)
_DEFAULT_BASE_URL = "http://localhost:8096"
_CHAT_ENDPOINT = "/api/chat"
_REQUEST_TIMEOUT = 120.0 # seconds — LLM streaming can be slow
@dataclass
class ChatTestCase:
"""A single test case from the test suite."""
query: str
creator: str | None = None
personality_weight: float = 0.0
category: str = "general"
description: str = ""
@dataclass
class ChatEvalResult:
"""Result of evaluating a single test case."""
test_case: ChatTestCase
response: str = ""
sources: list[dict] = field(default_factory=list)
cascade_tier: str = ""
score: ChatScoreResult | None = None
request_error: str | None = None
latency_seconds: float = 0.0
class ChatEvalRunner:
"""Runs a chat evaluation suite against a live endpoint."""
def __init__(
self,
scorer: ChatScoreRunner,
base_url: str = _DEFAULT_BASE_URL,
timeout: float = _REQUEST_TIMEOUT,
) -> None:
self.scorer = scorer
self.base_url = base_url.rstrip("/")
self.timeout = timeout
@staticmethod
def load_suite(path: str | Path) -> list[ChatTestCase]:
"""Load test cases from a YAML or JSON file.
Expected format (YAML):
queries:
- query: "How do I sidechain a bass?"
creator: null
personality_weight: 0.0
category: technical
description: "Basic sidechain compression question"
"""
filepath = Path(path)
text = filepath.read_text(encoding="utf-8")
if filepath.suffix in (".yaml", ".yml"):
try:
import yaml
except ImportError:
raise ImportError(
"PyYAML is required to load YAML test suites. "
"Install with: pip install pyyaml"
)
data = yaml.safe_load(text)
else:
data = json.loads(text)
queries = data.get("queries", [])
cases: list[ChatTestCase] = []
for q in queries:
cases.append(ChatTestCase(
query=q["query"],
creator=q.get("creator"),
personality_weight=float(q.get("personality_weight", 0.0)),
category=q.get("category", "general"),
description=q.get("description", ""),
))
return cases
def run_suite(self, cases: list[ChatTestCase]) -> list[ChatEvalResult]:
"""Execute all test cases sequentially, scoring each response."""
results: list[ChatEvalResult] = []
for i, case in enumerate(cases, 1):
print(f"\n [{i}/{len(cases)}] {case.category}: {case.query[:60]}...")
result = self._run_single(case)
results.append(result)
if result.request_error:
print(f" ✗ Request error: {result.request_error}")
elif result.score and result.score.error:
print(f" ✗ Scoring error: {result.score.error}")
elif result.score:
print(f" ✓ Composite: {result.score.composite:.3f} "
f"(latency: {result.latency_seconds:.1f}s)")
return results
def _run_single(self, case: ChatTestCase) -> ChatEvalResult:
"""Execute a single test case: call endpoint, parse SSE, score."""
eval_result = ChatEvalResult(test_case=case)
# Call the chat endpoint
t0 = time.monotonic()
try:
response_text, sources, cascade_tier = self._call_chat_endpoint(case)
eval_result.latency_seconds = round(time.monotonic() - t0, 2)
except Exception as exc:
eval_result.latency_seconds = round(time.monotonic() - t0, 2)
eval_result.request_error = str(exc)
logger.error("chat_eval_request_error query=%r error=%s", case.query, exc)
return eval_result
eval_result.response = response_text
eval_result.sources = sources
eval_result.cascade_tier = cascade_tier
if not response_text:
eval_result.request_error = "Empty response from chat endpoint"
return eval_result
# Score the response
eval_result.score = self.scorer.score_response(
query=case.query,
response=response_text,
sources=sources,
personality_weight=case.personality_weight,
creator_name=case.creator,
)
return eval_result
def _call_chat_endpoint(
self, case: ChatTestCase
) -> tuple[str, list[dict], str]:
"""Call the chat SSE endpoint and parse the event stream.
Returns (accumulated_text, sources_list, cascade_tier).
"""
url = f"{self.base_url}{_CHAT_ENDPOINT}"
payload: dict[str, Any] = {"query": case.query}
if case.creator:
payload["creator"] = case.creator
if case.personality_weight > 0:
payload["personality_weight"] = case.personality_weight
sources: list[dict] = []
accumulated = ""
cascade_tier = ""
with httpx.Client(timeout=self.timeout) as client:
with client.stream("POST", url, json=payload) as resp:
resp.raise_for_status()
buffer = ""
for chunk in resp.iter_text():
buffer += chunk
# Parse SSE events from buffer
while "\n\n" in buffer:
event_block, buffer = buffer.split("\n\n", 1)
event_type, event_data = self._parse_sse_event(event_block)
if event_type == "sources":
sources = event_data if isinstance(event_data, list) else []
elif event_type == "token":
accumulated += event_data if isinstance(event_data, str) else str(event_data)
elif event_type == "done":
if isinstance(event_data, dict):
cascade_tier = event_data.get("cascade_tier", "")
elif event_type == "error":
msg = event_data.get("message", str(event_data)) if isinstance(event_data, dict) else str(event_data)
raise RuntimeError(f"Chat endpoint returned error: {msg}")
return accumulated, sources, cascade_tier
@staticmethod
def _parse_sse_event(block: str) -> tuple[str, Any]:
"""Parse a single SSE event block into (event_type, data)."""
event_type = ""
data_lines: list[str] = []
for line in block.strip().splitlines():
if line.startswith("event: "):
event_type = line[7:].strip()
elif line.startswith("data: "):
data_lines.append(line[6:])
elif line.startswith("data:"):
data_lines.append(line[5:])
raw_data = "\n".join(data_lines)
try:
parsed = json.loads(raw_data)
except (json.JSONDecodeError, ValueError):
parsed = raw_data # plain text token
return event_type, parsed
@staticmethod
def write_results(
results: list[ChatEvalResult],
output_path: str | Path,
) -> str:
"""Write evaluation results to a JSON file. Returns the path."""
out = Path(output_path)
out.parent.mkdir(parents=True, exist_ok=True)
timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
if out.is_dir():
filepath = out / f"chat_eval_{timestamp}.json"
else:
filepath = out
# Build serializable payload
entries: list[dict] = []
for r in results:
entry: dict[str, Any] = {
"query": r.test_case.query,
"creator": r.test_case.creator,
"personality_weight": r.test_case.personality_weight,
"category": r.test_case.category,
"description": r.test_case.description,
"response_length": len(r.response),
"source_count": len(r.sources),
"cascade_tier": r.cascade_tier,
"latency_seconds": r.latency_seconds,
}
if r.request_error:
entry["error"] = r.request_error
elif r.score:
entry["scores"] = r.score.scores
entry["composite"] = r.score.composite
entry["justifications"] = r.score.justifications
entry["scoring_time"] = r.score.elapsed_seconds
if r.score.error:
entry["scoring_error"] = r.score.error
entries.append(entry)
# Summary stats
scored = [e for e in entries if "composite" in e]
avg_composite = (
sum(e["composite"] for e in scored) / len(scored) if scored else 0.0
)
dim_avgs: dict[str, float] = {}
for dim in CHAT_DIMENSIONS:
vals = [e["scores"][dim] for e in scored if dim in e.get("scores", {})]
dim_avgs[dim] = round(sum(vals) / len(vals), 3) if vals else 0.0
payload = {
"timestamp": timestamp,
"total_queries": len(results),
"scored_queries": len(scored),
"errors": len(results) - len(scored),
"average_composite": round(avg_composite, 3),
"dimension_averages": dim_avgs,
"results": entries,
}
filepath.write_text(json.dumps(payload, indent=2), encoding="utf-8")
return str(filepath)
@staticmethod
def print_summary(results: list[ChatEvalResult]) -> None:
"""Print a summary table of evaluation results."""
print("\n" + "=" * 72)
print(" CHAT EVALUATION SUMMARY")
print("=" * 72)
scored = [r for r in results if r.score and not r.score.error and not r.request_error]
errored = [r for r in results if r.request_error or (r.score and r.score.error)]
if not scored:
print("\n No successfully scored responses.\n")
if errored:
print(f" Errors: {len(errored)}")
for r in errored:
err = r.request_error or (r.score.error if r.score else "unknown")
print(f" - {r.test_case.query[:50]}: {err}")
print("=" * 72 + "\n")
return
# Header
print(f"\n {'Category':<12s} {'Query':<30s} {'Comp':>5s} {'Cite':>5s} {'Struct':>6s} {'Domain':>6s} {'Ground':>6s} {'Person':>6s}")
print(f" {'─'*12} {'─'*30} {'─'*5} {'─'*5} {'─'*6} {'─'*6} {'─'*6} {'─'*6}")
for r in scored:
s = r.score
assert s is not None
q = r.test_case.query[:30]
cat = r.test_case.category[:12]
print(
f" {cat:<12s} {q:<30s} "
f"{s.composite:5.2f} "
f"{s.citation_accuracy:5.2f} "
f"{s.response_structure:6.2f} "
f"{s.domain_expertise:6.2f} "
f"{s.source_grounding:6.2f} "
f"{s.personality_fidelity:6.2f}"
)
# Averages
avg_comp = sum(r.score.composite for r in scored) / len(scored)
avg_dims = {}
for dim in CHAT_DIMENSIONS:
vals = [r.score.scores.get(dim, 0.0) for r in scored]
avg_dims[dim] = sum(vals) / len(vals)
print(f"\n {'AVERAGE':<12s} {'':30s} "
f"{avg_comp:5.2f} "
f"{avg_dims['citation_accuracy']:5.2f} "
f"{avg_dims['response_structure']:6.2f} "
f"{avg_dims['domain_expertise']:6.2f} "
f"{avg_dims['source_grounding']:6.2f} "
f"{avg_dims['personality_fidelity']:6.2f}")
if errored:
print(f"\n Errors: {len(errored)}")
for r in errored:
err = r.request_error or (r.score.error if r.score else "unknown")
print(f" - {r.test_case.query[:50]}: {err}")
print("=" * 72 + "\n")


@ -0,0 +1,271 @@
"""Chat-specific quality scorer — LLM-as-judge evaluation for chat responses.
Scores chat responses across 5 dimensions:
- citation_accuracy: Are citations real and correctly numbered?
- response_structure: Concise, well-organized, uses appropriate formatting?
- domain_expertise: Music production terminology used naturally?
- source_grounding: Claims backed by provided sources, no fabrication?
- personality_fidelity: At weight>0, response reflects creator voice?
Run via: python -m pipeline.quality chat_eval --suite <path>
"""
from __future__ import annotations
import json
import logging
import time
from dataclasses import dataclass, field
import openai
from pipeline.llm_client import LLMClient
logger = logging.getLogger(__name__)
CHAT_DIMENSIONS = [
"citation_accuracy",
"response_structure",
"domain_expertise",
"source_grounding",
"personality_fidelity",
]
CHAT_RUBRIC = """\
You are an expert evaluator of AI chat response quality for a music production knowledge base.
You will be given:
1. The user's query
2. The assistant's response
3. The numbered source citations that were provided to the assistant
4. The personality_weight (0.0 = encyclopedic, >0 = creator voice expected)
5. The creator_name (if any)
Evaluate the response across these 5 dimensions, scoring each 0.0 to 1.0:
**citation_accuracy**: Citations are real, correctly numbered, and point to relevant sources
- 0.9-1.0: Every [N] citation references a real source number, citations are placed next to the claim they support, no phantom citations
- 0.5-0.7: Most citations are valid but some are misplaced or reference non-existent source numbers
- 0.0-0.3: Many phantom citations, wrong numbers, or citations placed randomly without connection to claims
**response_structure**: Response is concise, well-organized, uses appropriate formatting
- 0.9-1.0: Clear paragraphs, uses bullet lists for steps/lists, bold for key terms, appropriate length (not padded)
- 0.5-0.7: Readable but could be better organized; a wall of text, or missing formatting where it would help
- 0.0-0.3: Disorganized, excessively long or too terse, no formatting, hard to scan
**domain_expertise**: Music production terminology used naturally and correctly
- 0.9-1.0: Uses correct audio/synth/mixing terminology, explains technical terms when appropriate, sounds like a knowledgeable producer
- 0.5-0.7: Generally correct but some terminology is vague ("adjust the sound" vs "shape the transient") or misused
- 0.0-0.3: Generic language, avoids domain terminology, or uses terms incorrectly
**source_grounding**: Claims are backed by provided sources, no fabrication
- 0.9-1.0: Every factual claim traces to a provided source, no invented details (plugin names, settings, frequencies not in sources)
- 0.5-0.7: Mostly grounded but 1-2 claims seem embellished or not directly from sources
- 0.0-0.3: Contains hallucinated specifics such as settings, plugin names, or techniques not present in any source
**personality_fidelity**: When personality_weight > 0, response reflects the creator's voice proportional to the weight
- If personality_weight == 0: Score based on neutral encyclopedic tone (should NOT show personality). Neutral, informative = 1.0. Forced personality = 0.5.
- If personality_weight > 0 and personality_weight < 0.5: Subtle personality hints expected. Score higher if tone is lightly flavored but still mainly encyclopedic.
- If personality_weight >= 0.5: Clear creator voice expected. Score higher for signature phrases, teaching style, energy matching the named creator.
- If no creator_name is provided: Score 1.0 if response is neutral/encyclopedic, lower if it adopts an unexplained persona.
Return ONLY a JSON object with this exact structure:
{
"citation_accuracy": <float 0.0-1.0>,
"response_structure": <float 0.0-1.0>,
"domain_expertise": <float 0.0-1.0>,
"source_grounding": <float 0.0-1.0>,
"personality_fidelity": <float 0.0-1.0>,
"justifications": {
"citation_accuracy": "<1-2 sentence justification>",
"response_structure": "<1-2 sentence justification>",
"domain_expertise": "<1-2 sentence justification>",
"source_grounding": "<1-2 sentence justification>",
"personality_fidelity": "<1-2 sentence justification>"
}
}
"""
@dataclass
class ChatScoreResult:
"""Outcome of scoring a chat response across quality dimensions."""
scores: dict[str, float] = field(default_factory=dict)
composite: float = 0.0
justifications: dict[str, str] = field(default_factory=dict)
elapsed_seconds: float = 0.0
error: str | None = None
# Convenience properties
@property
def citation_accuracy(self) -> float:
return self.scores.get("citation_accuracy", 0.0)
@property
def response_structure(self) -> float:
return self.scores.get("response_structure", 0.0)
@property
def domain_expertise(self) -> float:
return self.scores.get("domain_expertise", 0.0)
@property
def source_grounding(self) -> float:
return self.scores.get("source_grounding", 0.0)
@property
def personality_fidelity(self) -> float:
return self.scores.get("personality_fidelity", 0.0)
class ChatScoreRunner:
"""Scores chat responses using LLM-as-judge evaluation."""
def __init__(self, client: LLMClient) -> None:
self.client = client
def score_response(
self,
query: str,
response: str,
sources: list[dict],
personality_weight: float = 0.0,
creator_name: str | None = None,
) -> ChatScoreResult:
"""Score a single chat response against the 5 chat quality dimensions.
Parameters
----------
query:
The user's original query.
response:
The assistant's accumulated response text.
sources:
List of source citation dicts (as emitted by the SSE sources event).
personality_weight:
0.0 = encyclopedic mode, >0 = personality mode.
creator_name:
Creator name, if this was a creator-scoped query.
Returns
-------
ChatScoreResult with per-dimension scores.
"""
sources_block = json.dumps(sources, indent=2) if sources else "(no sources)"
user_prompt = (
f"## User Query\n\n{query}\n\n"
f"## Assistant Response\n\n{response}\n\n"
f"## Sources Provided\n\n```json\n{sources_block}\n```\n\n"
f"## Metadata\n\n"
f"- personality_weight: {personality_weight}\n"
f"- creator_name: {creator_name or '(none)'}\n\n"
f"Score this chat response across all 5 dimensions."
)
t0 = time.monotonic()
try:
from pydantic import BaseModel as _BM
resp = self.client.complete(
system_prompt=CHAT_RUBRIC,
user_prompt=user_prompt,
response_model=_BM,
modality="chat",
)
elapsed = round(time.monotonic() - t0, 2)
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
elapsed = round(time.monotonic() - t0, 2)
return ChatScoreResult(
elapsed_seconds=elapsed,
error=f"Cannot reach LLM judge. Error: {exc}",
)
raw_text = str(resp).strip()
try:
parsed = json.loads(raw_text)
except json.JSONDecodeError:
logger.error("Malformed chat judge response (not JSON): %.300s", raw_text)
return ChatScoreResult(
elapsed_seconds=elapsed,
error=f"Malformed judge response. Raw excerpt: {raw_text[:200]}",
)
return self._parse_scores(parsed, elapsed)
def _parse_scores(self, parsed: dict, elapsed: float) -> ChatScoreResult:
"""Extract and validate scores from parsed JSON judge response."""
scores: dict[str, float] = {}
justifications: dict[str, str] = {}
raw_justifications = parsed.get("justifications", {})
if not isinstance(raw_justifications, dict):
raw_justifications = {}
for dim in CHAT_DIMENSIONS:
raw = parsed.get(dim)
if raw is None:
logger.warning("Missing dimension '%s' in chat judge response", dim)
scores[dim] = 0.0
justifications[dim] = "(missing from judge response)"
continue
try:
val = float(raw)
scores[dim] = max(0.0, min(1.0, val))
except (TypeError, ValueError):
logger.warning("Invalid value for '%s': %r", dim, raw)
scores[dim] = 0.0
justifications[dim] = f"(invalid value: {raw!r})"
continue
justifications[dim] = str(raw_justifications.get(dim, ""))
composite = sum(scores.values()) / len(CHAT_DIMENSIONS) if CHAT_DIMENSIONS else 0.0
return ChatScoreResult(
scores=scores,
composite=round(composite, 3),
justifications=justifications,
elapsed_seconds=elapsed,
)
def print_report(self, result: ChatScoreResult, query: str = "") -> None:
"""Print a formatted chat scoring report to stdout."""
print("\n" + "=" * 60)
print(" CHAT QUALITY SCORE REPORT")
if query:
print(f" Query: {query[:60]}{'...' if len(query) > 60 else ''}")
print("=" * 60)
if result.error:
print(f"\n ✗ Error: {result.error}\n")
print("=" * 60 + "\n")
return
for dim in CHAT_DIMENSIONS:
score = result.scores.get(dim, 0.0)
filled = int(score * 20)
bar = "█" * filled + "░" * (20 - filled)  # filled/empty segments of a 20-char meter
justification = result.justifications.get(dim, "")
print(f"\n {dim.replace('_', ' ').title()}")
print(f" Score: {score:.2f} {bar}")
if justification:
# Simple word wrap at ~56 chars
words = justification.split()
lines: list[str] = []
current = ""
for word in words:
if current and len(current) + len(word) + 1 > 56:
lines.append(current)
current = word
else:
current = f"{current} {word}" if current else word
if current:
lines.append(current)
for line in lines:
print(f" {line}")
print("\n" + "-" * 60)
print(f" Composite: {result.composite:.3f}")
print(f" Time: {result.elapsed_seconds}s")
print("=" * 60 + "\n")


@ -0,0 +1,72 @@
# Chat quality evaluation test suite
# 10 representative queries across 4 categories:
# - technical: How-to questions about specific production techniques
# - conceptual: Broader understanding questions about audio concepts
# - creator: Creator-scoped queries at different personality weights
# - cross_creator: Queries spanning multiple creators' approaches
queries:
# ── Technical how-to (2) ────────────────────────────────────────────
- query: "How do I set up sidechain compression on a bass synth using a kick drum as the trigger?"
creator: null
personality_weight: 0.0
category: technical
description: "Common sidechain compression setup — expects specific settings (ratio, attack, release)"
- query: "What are the best EQ settings for cleaning up a muddy vocal recording?"
creator: null
personality_weight: 0.0
category: technical
description: "Vocal EQ technique — expects frequency ranges, Q values, cut/boost guidance"
# ── Conceptual (2) ─────────────────────────────────────────────────
- query: "What is the difference between parallel compression and serial compression, and when should I use each?"
creator: null
personality_weight: 0.0
category: conceptual
description: "Conceptual comparison — expects clear definitions, use cases, pros/cons"
- query: "How does sample rate affect sound quality in music production?"
creator: null
personality_weight: 0.0
category: conceptual
description: "Audio fundamentals — expects Nyquist, aliasing, practical guidance"
# ── Creator-specific: encyclopedic (2) ──────────────────────────────
- query: "How does this creator approach sound design for bass sounds?"
creator: "KEOTA"
personality_weight: 0.0
category: creator_encyclopedic
description: "Creator-scoped query at weight=0 — should be neutral/encyclopedic about KEOTA's techniques"
- query: "What mixing techniques does this creator recommend for achieving width in a mix?"
creator: "Mr. Bill"
personality_weight: 0.0
category: creator_encyclopedic
description: "Creator-scoped query at weight=0 — neutral tone about Mr. Bill's approach"
# ── Creator-specific: personality (2) ───────────────────────────────
- query: "How does this creator approach sound design for bass sounds?"
creator: "KEOTA"
personality_weight: 0.7
category: creator_personality
description: "Same query as above but at weight=0.7 — should reflect KEOTA's voice and teaching style"
- query: "What mixing techniques does this creator recommend for achieving width in a mix?"
creator: "Mr. Bill"
personality_weight: 0.7
category: creator_personality
description: "Same query as above but at weight=0.7 — should reflect Mr. Bill's voice"
# ── Cross-creator (2) ──────────────────────────────────────────────
- query: "What are the different approaches to layering synth sounds across creators?"
creator: null
personality_weight: 0.0
category: cross_creator
description: "Cross-creator comparison — should cite multiple creators' techniques"
- query: "How do different producers approach drum processing and what plugins do they prefer?"
creator: null
personality_weight: 0.0
category: cross_creator
description: "Cross-creator comparison on drums — expects multiple perspectives with citations"