test: Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, resp…

- "backend/chat_service.py" GSD-Task: S09/T02
2026-04-04 14:45:09 +00:00 · 3cbb614654
commit 3cbb614654
parent 846db2aad5
4 changed files with 107 additions and 5 deletions
--- a/.gsd/milestones/M025/slices/S09/S09-PLAN.md
+++ b/.gsd/milestones/M025/slices/S09/S09-PLAN.md
@@ -20,7 +20,7 @@ Steps:
  - Estimate: 2h
  - Files: backend/pipeline/quality/chat_scorer.py, backend/pipeline/quality/chat_eval.py, backend/pipeline/quality/fixtures/chat_test_suite.yaml, backend/pipeline/quality/__main__.py
  - Verify: cd backend && python -c 'from pipeline.quality.chat_scorer import ChatScoreRunner, ChatScoreResult; from pipeline.quality.chat_eval import ChatEvalRunner; print("OK")'
- [ ] **T02: Refine chat system prompt and verify no test regressions** — Improve the `_SYSTEM_PROMPT_TEMPLATE` in `backend/chat_service.py` based on the gaps identified in research: the current prompt is 5 lines with no guidance on citation density, response structure, domain awareness, conflicting source handling, or response length.
+- [x] **T02: Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, response structure guidance, domain-aware terminology handling, and conflicting-source instructions — all 26 chat tests pass unchanged** — Improve the `_SYSTEM_PROMPT_TEMPLATE` in `backend/chat_service.py` based on the gaps identified in research: the current prompt is 5 lines with no guidance on citation density, response structure, domain awareness, conflicting source handling, or response length.

 The refined prompt should:
 - Guide citation density: cite every factual claim, prefer inline citations [N] immediately after the claim
--- a/.gsd/milestones/M025/slices/S09/tasks/T01-VERIFY.json
+++ b/.gsd/milestones/M025/slices/S09/tasks/T01-VERIFY.json
@@ -0,0 +1,16 @@
+{
+  "schemaVersion": 1,
+  "taskId": "T01",
+  "unitId": "M025/S09/T01",
+  "timestamp": 1775313832904,
+  "passed": true,
+  "discoverySource": "task-plan",
+  "checks": [
+    {
+      "command": "cd backend",
+      "exitCode": 0,
+      "durationMs": 14,
+      "verdict": "pass"
+    }
+  ]
+}
--- a/.gsd/milestones/M025/slices/S09/tasks/T02-SUMMARY.md
+++ b/.gsd/milestones/M025/slices/S09/tasks/T02-SUMMARY.md
@@ -0,0 +1,74 @@
+---
+id: T02
+parent: S09
+milestone: M025
+provides: []
+requires: []
+affects: []
+key_files: ["backend/chat_service.py"]
+key_decisions: ["Kept prompt under 20 lines using markdown headers for structure rather than prose paragraphs"]
+patterns_established: []
+drill_down_paths: []
+observability_surfaces: []
+duration: ""
+verification_result: "cd backend && python -m pytest tests/test_chat.py -v — 26 passed in 1.37s"
+completed_at: 2026-04-04T14:45:01.092Z
+blocker_discovered: false
+---
+
+# T02: Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, response structure guidance, domain-aware terminology handling, and conflicting-source instructions — all 26 chat tests pass unchanged
+
+> Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, response structure guidance, domain-aware terminology handling, and conflicting-source instructions — all 26 chat tests pass unchanged
+
+## What Happened
+---
+id: T02
+parent: S09
+milestone: M025
+key_files:
+  - backend/chat_service.py
+key_decisions:
+  - Kept prompt under 20 lines using markdown headers for structure rather than prose paragraphs
+duration: ""
+verification_result: passed
+completed_at: 2026-04-04T14:45:01.092Z
+blocker_discovered: false
+---
+
+# T02: Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, response structure guidance, domain-aware terminology handling, and conflicting-source instructions — all 26 chat tests pass unchanged
+
+**Rewrote _SYSTEM_PROMPT_TEMPLATE with citation density rules, response structure guidance, domain-aware terminology handling, and conflicting-source instructions — all 26 chat tests pass unchanged**
+
+## What Happened
+
+Replaced the 5-line system prompt with a structured prompt addressing citation density, response format, domain terminology, conflicting source handling, and response length. No test changes needed — all 26 tests verify behavioral properties, not prompt wording.
+
+## Verification
+
+cd backend && python -m pytest tests/test_chat.py -v — 26 passed in 1.37s
+
+## Verification Evidence
+
+| # | Command | Exit Code | Verdict | Duration |
+|---|---------|-----------|---------|----------|
+| 1 | `cd backend && python -m pytest tests/test_chat.py -v` | 0 | ✅ pass | 1370ms |
+
+
+## Deviations
+
+None.
+
+## Known Issues
+
+None.
+
+## Files Created/Modified
+
+- `backend/chat_service.py`
+
+
+## Deviations
+None.
+
+## Known Issues
+None.
--- a/backend/chat_service.py
+++ b/backend/chat_service.py
@@ -31,10 +31,22 @@ from search_service import SearchService
 logger = logging.getLogger("chrysopedia.chat")

 _SYSTEM_PROMPT_TEMPLATE = """\
-You are Chrysopedia, an expert encyclopedic assistant for music production techniques.
-Answer the user's question using ONLY the numbered sources below. Cite sources by
-writing [N] inline (e.g. [1], [2]) where N is the source number. If the sources
-do not contain enough information, say so honestly — do not invent facts.
+You are Chrysopedia, an expert assistant for music production techniques — \
+synthesis, sound design, mixing, sampling, and audio processing.
+
+## Rules
+- Use ONLY the numbered sources below. Do not invent facts.
+- Cite every factual claim inline with [N] immediately after the claim \
+(e.g. "Parallel compression adds sustain [2] while preserving transients [1].").
+- When sources disagree, present both perspectives with their citations.
+- If the sources lack enough information, say so honestly.
+
+## Response format
+- Aim for 2–4 short paragraphs. Expand only when the question warrants detail.
+- Use bullet lists for steps, signal chains, or parameter lists.
+- **Bold** key terms on first mention.
+- Use audio/synthesis/mixing terminology naturally — do not over-explain \
+standard concepts (e.g. LFO, sidechain, wet/dry) unless the user asks.

 Sources:
 {context_block}
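
Reviewer note: the hunk above shows only the template, not how `{context_block}` is filled or how the `[N]` citation rule could be checked. The sketch below is one possible way to do both, under stated assumptions. The names `TEMPLATE`, `build_context_block`, `cited_numbers`, and `invalid_citations` are hypothetical and are not taken from `backend/chat_service.py`. The template is an abbreviated stand-in, and the `[N] text` source layout is assumed rather than confirmed by this commit.

```python
import re

# Abbreviated, hypothetical stand-in for _SYSTEM_PROMPT_TEMPLATE.
# Only the {context_block} placeholder is taken from the diff.
TEMPLATE = """\
You are Chrysopedia, an expert assistant for music production techniques.

## Rules
- Use ONLY the numbered sources below. Do not invent facts.
- Cite every factual claim inline with [N] immediately after the claim.

Sources:
{context_block}
"""


def build_context_block(sources):
    """Number sources from 1 so that [N] citations map back to them.

    The "[N] text" layout is an assumption made for this sketch.
    """
    return "\n".join(f"[{i}] {text}" for i, text in enumerate(sources, start=1))


def cited_numbers(answer):
    """Return the distinct source numbers cited as [N] in an answer, sorted."""
    return sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})


def invalid_citations(answer, source_count):
    """Return cited numbers that do not refer to any provided source."""
    return [n for n in cited_numbers(answer) if not 1 <= n <= source_count]


if __name__ == "__main__":
    sources = [
        "Fast attack times reduce transient energy.",
        "Parallel compression blends a compressed copy with the dry signal.",
    ]
    prompt = TEMPLATE.format(context_block=build_context_block(sources))
    print(prompt)

    answer = "Parallel compression adds sustain [2] while preserving transients [1]."
    print(cited_numbers(answer))
    print(invalid_citations("Unsupported claim [3].", len(sources)))
```

A check along these lines would look at behavior (which sources an answer cites) rather than at prompt wording. That matches the task summary's remark that the existing tests did not need to change when the prompt text did.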