feat: Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, C…
- "backend/pipeline/quality/scorer.py" - "backend/pipeline/quality/__main__.py" - "backend/pipeline/quality/fixtures/sample_moments.json" - "backend/pipeline/quality/fixtures/__init__.py" GSD-Task: S02/T01
This commit is contained in:
parent
c27cd77ae6
commit
5223772756
13 changed files with 1036 additions and 4 deletions
|
|
@ -6,7 +6,7 @@ A fully automated CLI tool that tests FYN-LLM fitness, scores pipeline output ac
|
|||
## Slice Overview
|
||||
| ID | Slice | Risk | Depends | Done | After this |
|
||||
|----|-------|------|---------|------|------------|
|
||||
| S01 | General FYN-LLM Fitness Suite | medium | — | ⬜ | Run `python -m pipeline.quality fitness` — outputs pass/fail for Mandelbrot question, JSON compliance, instruction following, and diverse prompt battery against live FYN-LLM |
|
||||
| S01 | General FYN-LLM Fitness Suite | medium | — | ✅ | Run `python -m pipeline.quality fitness` — outputs pass/fail for Mandelbrot question, JSON compliance, instruction following, and diverse prompt battery against live FYN-LLM |
|
||||
| S02 | Stage 5 Quality Scorer & Voice Preservation Dial | high | S01 | ⬜ | Run scorer on a reference article — outputs composite score across 5 dimensions. Run same article at voice_level 0.2 vs 0.8 — voice preservation score differs meaningfully |
|
||||
| S03 | Prompt Variant Generator & Automated A/B Loop | high | S02 | ⬜ | Run `python -m pipeline.quality optimize --stage 5 --iterations 10` — generates prompt variants, scores each against reference articles, outputs leaderboard and score trajectory chart |
|
||||
| S04 | Expand to Pipeline Stages 2-4 | medium | S03 | ⬜ | Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring |
|
||||
|
|
|
|||
91
.gsd/milestones/M013/slices/S01/S01-SUMMARY.md
Normal file
91
.gsd/milestones/M013/slices/S01/S01-SUMMARY.md
Normal file
|
|
@ -0,0 +1,91 @@
|
|||
---
|
||||
id: S01
|
||||
parent: M013
|
||||
milestone: M013
|
||||
provides:
|
||||
- pipeline.quality package with FitnessRunner and argparse CLI entry point
|
||||
- TestResult dataclass for structured test results
|
||||
requires:
|
||||
[]
|
||||
affects:
|
||||
- S02
|
||||
- S03
|
||||
key_files:
|
||||
- backend/pipeline/quality/__init__.py
|
||||
- backend/pipeline/quality/__main__.py
|
||||
- backend/pipeline/quality/fitness.py
|
||||
key_decisions:
|
||||
- Connectivity probe as first action in run_all() for fast failure before running tests
|
||||
- Mandelbrot test uses thinking modality to exercise both LLM modes
|
||||
- Generous validation thresholds to avoid flaky failures from LLM variance
|
||||
patterns_established:
|
||||
- pipeline.quality package structure with argparse subcommands — S02/S03 add score and optimize subcommands to the same CLI
|
||||
- TestResult dataclass as standard return type for all quality tests — carries name, passed, elapsed_seconds, token_count, detail
|
||||
observability_surfaces:
|
||||
- none
|
||||
drill_down_paths:
|
||||
- .gsd/milestones/M013/slices/S01/tasks/T01-SUMMARY.md
|
||||
duration: ""
|
||||
verification_result: passed
|
||||
completed_at: 2026-04-01T08:46:13.884Z
|
||||
blocker_discovered: false
|
||||
---
|
||||
|
||||
# S01: General FYN-LLM Fitness Suite
|
||||
|
||||
**Built pipeline.quality package with FitnessRunner CLI — 9 tests across 4 categories (Mandelbrot reasoning, JSON compliance, instruction following, diverse battery), clean connectivity error handling, exits 0/1.**
|
||||
|
||||
## What Happened
|
||||
|
||||
Created the `backend/pipeline/quality/` package as the foundation for the M013 quality assurance toolkit. The package has three files: `__init__.py`, `__main__.py` (argparse CLI with `fitness` subcommand), and `fitness.py` (FitnessRunner class).
|
||||
|
||||
FitnessRunner implements 9 tests across 4 categories:
|
||||
1. **Mandelbrot reasoning** (1 test) — asks about the Mandelbrot set area, checks for key concepts. Uses thinking modality to exercise both LLM modes.
|
||||
2. **JSON compliance** (2 tests) — simple and nested JSON with Pydantic validation. Catches malformed/empty responses gracefully.
|
||||
3. **Instruction following** (3 tests) — bullet count, keyword inclusion, lowercase constraint. Programmatic compliance checks.
|
||||
4. **Diverse battery** (3 tests) — summarization, classification, extraction. Tests practical pipeline-relevant capabilities.
|
||||
|
||||
The CLI runs `python -m pipeline.quality fitness`, prints a formatted pass/fail report with per-test timing and token counts, and exits 0 (all pass) or 1 (any failure or connectivity error). A connectivity pre-check probes the LLM endpoint before running tests, providing a clear error message with the endpoint URL on failure — no tracebacks.
|
||||
|
||||
The argparse structure is designed for extension: S02 and S03 will add `score` and `optimize` subcommands to the same CLI.
|
||||
|
||||
## Verification
|
||||
|
||||
1. Import check: `cd backend && python -c 'from pipeline.quality.fitness import FitnessRunner; print("import ok")'` — exits 0, prints "import ok".
|
||||
2. CLI connectivity error: `cd backend && python -m pipeline.quality fitness` — prints clear "Cannot reach LLM endpoint" message with URLs, exits 1, no traceback.
|
||||
3. Help output: `python -m pipeline.quality --help` — shows fitness subcommand with description.
|
||||
4. No-subcommand: `python -m pipeline.quality` — prints usage, exits 1.
|
||||
|
||||
## Requirements Advanced
|
||||
|
||||
None.
|
||||
|
||||
## Requirements Validated
|
||||
|
||||
None.
|
||||
|
||||
## New Requirements Surfaced
|
||||
|
||||
None.
|
||||
|
||||
## Requirements Invalidated or Re-scoped
|
||||
|
||||
None.
|
||||
|
||||
## Deviations
|
||||
|
||||
None.
|
||||
|
||||
## Known Limitations
|
||||
|
||||
Cannot verify actual test pass/fail behavior without a live FYN-LLM endpoint — only connectivity error path is testable on this machine.
|
||||
|
||||
## Follow-ups
|
||||
|
||||
None.
|
||||
|
||||
## Files Created/Modified
|
||||
|
||||
- `backend/pipeline/quality/__init__.py` — Empty package init
|
||||
- `backend/pipeline/quality/__main__.py` — Argparse CLI with fitness subcommand, extensible for score/optimize
|
||||
- `backend/pipeline/quality/fitness.py` — FitnessRunner class with 9 tests, 4 categories, connectivity pre-check, formatted report output
|
||||
52
.gsd/milestones/M013/slices/S01/S01-UAT.md
Normal file
52
.gsd/milestones/M013/slices/S01/S01-UAT.md
Normal file
|
|
@ -0,0 +1,52 @@
|
|||
# S01: General FYN-LLM Fitness Suite — UAT
|
||||
|
||||
**Milestone:** M013
|
||||
**Written:** 2026-04-01T08:46:13.884Z
|
||||
|
||||
## UAT: S01 — General FYN-LLM Fitness Suite
|
||||
|
||||
### Preconditions
|
||||
- Backend source at `backend/` with `pipeline/quality/` package present
|
||||
- Python 3.x with project dependencies installed
|
||||
- No live LLM endpoint required for error-path tests (tests 1-4)
|
||||
- Live FYN-LLM endpoint required for functional tests (tests 5-7)
|
||||
|
||||
### Test Cases
|
||||
|
||||
#### TC1: Import Check
|
||||
1. Run `cd backend && python -c 'from pipeline.quality.fitness import FitnessRunner; print("ok")'`
|
||||
2. **Expected:** Prints "ok", exits 0
|
||||
|
||||
#### TC2: CLI Help
|
||||
1. Run `cd backend && python -m pipeline.quality --help`
|
||||
2. **Expected:** Shows usage with `fitness` subcommand listed and described
|
||||
|
||||
#### TC3: No Subcommand
|
||||
1. Run `cd backend && python -m pipeline.quality`
|
||||
2. **Expected:** Prints usage message, exits non-zero
|
||||
|
||||
#### TC4: Connectivity Error (no LLM running)
|
||||
1. Ensure no LLM is running on localhost
|
||||
2. Run `cd backend && python -m pipeline.quality fitness`
|
||||
3. **Expected:** Prints "Cannot reach LLM endpoint at {url}" with both primary and fallback URLs. Exits 1. No Python traceback visible in output.
|
||||
|
||||
#### TC5: Full Fitness Run (requires live FYN-LLM)
|
||||
1. Ensure FYN-LLM is reachable at configured endpoint
|
||||
2. Run `cd backend && python -m pipeline.quality fitness`
|
||||
3. **Expected:** Prints formatted report with 9 test results across 4 categories. Each test shows name, PASS/FAIL, elapsed time. Summary line at end shows pass count / total. Exits 0 if all pass, 1 if any fail.
|
||||
|
||||
#### TC6: Mandelbrot Thinking Mode (requires live FYN-LLM)
|
||||
1. Run fitness suite with live LLM
|
||||
2. **Expected:** Mandelbrot test uses thinking modality (visible in verbose output or code inspection). Tests LLM's reasoning capability, not just chat completion.
|
||||
|
||||
#### TC7: JSON Compliance with Malformed Response (requires live FYN-LLM)
|
||||
1. If LLM returns non-JSON for a JSON test
|
||||
2. **Expected:** Test marks as FAIL with clear detail message ("Failed to parse JSON" or similar). No crash or traceback.
|
||||
|
||||
### Edge Cases
|
||||
|
||||
#### EC1: Empty LLM Response
|
||||
- If LLM returns empty string, affected tests should FAIL with descriptive detail, not crash with IndexError or similar.
|
||||
|
||||
#### EC2: Timeout
|
||||
- If LLM endpoint is reachable but extremely slow, the connectivity probe should eventually time out and report the failure clearly.
|
||||
16
.gsd/milestones/M013/slices/S01/tasks/T01-VERIFY.json
Normal file
16
.gsd/milestones/M013/slices/S01/tasks/T01-VERIFY.json
Normal file
|
|
@ -0,0 +1,16 @@
|
|||
{
|
||||
"schemaVersion": 1,
|
||||
"taskId": "T01",
|
||||
"unitId": "M013/S01/T01",
|
||||
"timestamp": 1775033105800,
|
||||
"passed": true,
|
||||
"discoverySource": "task-plan",
|
||||
"checks": [
|
||||
{
|
||||
"command": "cd backend",
|
||||
"exitCode": 0,
|
||||
"durationMs": 7,
|
||||
"verdict": "pass"
|
||||
}
|
||||
]
|
||||
}
|
||||
|
|
@ -1,6 +1,116 @@
|
|||
# S02: Stage 5 Quality Scorer & Voice Preservation Dial
|
||||
|
||||
**Goal:** Build the measurement instrument: structural + content + voice + readability + preference scoring for stage 5 output, with voice_level dial in the prompt
|
||||
**Goal:** A `score` CLI subcommand that evaluates Stage 5 synthesis output across 5 quality dimensions using LLM-as-judge, plus a `--voice-level` parameter that modifies the synthesis prompt to dial voice preservation intensity — with the scorer proving the dial works by producing measurably different voice preservation scores at different levels.
|
||||
**Demo:** After this: Run scorer on a reference article — outputs composite score across 5 dimensions. Run same article at voice_level 0.2 vs 0.8 — voice preservation score differs meaningfully
|
||||
|
||||
## Tasks
|
||||
- [x] **T01: Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, CLI score subcommand, and realistic 6-moment sample fixture** — ## Description
|
||||
|
||||
Create the scorer module that evaluates a Stage 5 technique page across 5 quality dimensions using an LLM judge call. Wire it into the existing CLI as the `score` subcommand. Build a realistic fixture file for offline testing.
|
||||
|
||||
## Failure Modes
|
||||
|
||||
| Dependency | On error | On timeout | On malformed response |
|
||||
|------------|----------|-----------|----------------------|
|
||||
| LLM endpoint | Print clear connectivity error with URL, exit 1 (same as fitness) | Same — openai timeout maps to APITimeoutError | Parse judge JSON response; if malformed, log raw excerpt and return zero scores with detail message |
|
||||
|
||||
## Steps
|
||||
|
||||
1. Read `backend/pipeline/quality/fitness.py` to understand the TestResult/FitnessRunner pattern and report formatting.
|
||||
2. Create `backend/pipeline/quality/scorer.py` with:
|
||||
- `ScoreResult` dataclass: `structural`, `content_specificity`, `voice_preservation`, `readability`, `factual_fidelity` (each float 0.0–1.0), `composite` (weighted average, default equal weights), `justifications` (dict of dimension → string), `elapsed_seconds` (float).
|
||||
- `ScoreRunner` class taking an `LLMClient` in __init__.
|
||||
- `ScoreRunner.score_page(page_json: dict, moments: list[dict]) -> ScoreResult` method:
|
||||
- Builds a scoring rubric prompt (hardcoded string in scorer.py — not a separate prompt file) that asks the LLM to evaluate the page against the source moments across the 5 dimensions.
|
||||
- The rubric should specify what each dimension measures (see S02-RESEARCH.md for definitions) and instruct the LLM to return JSON: `{"structural": 0.8, "content_specificity": 0.7, ..., "justifications": {"structural": "...", ...}}`.
|
||||
- Calls `self.client.complete()` with modality="chat", response_model=BaseModel (JSON mode).
|
||||
- Parses JSON response, validates all 5 dimension keys present and values in [0.0, 1.0].
|
||||
- Computes composite as mean of 5 dimensions.
|
||||
- Returns `ScoreResult`.
|
||||
- `ScoreRunner.print_report(result: ScoreResult)` — formatted report matching fitness report style: header bar, per-dimension score with justification excerpt, composite score, timing.
|
||||
3. Create `backend/pipeline/quality/fixtures/sample_moments.json` with:
|
||||
- `{"creator_name": "ExampleCreator", "topic_category": "Sound design", "moments": [...]}` — 5-6 realistic moments with `summary`, `transcript_excerpt`, `topic_tags`, `topic_category`, `start_time`, `end_time` fields. Content about a concrete music production technique (e.g., snare layering or bass resampling). Include direct quotes and specific plugin/setting mentions so voice preservation scoring has signal.
|
||||
4. Create `backend/pipeline/quality/fixtures/__init__.py` (empty).
|
||||
5. Update `backend/pipeline/quality/__main__.py`:
|
||||
- Add `score` subcommand to argparse with args: `--file` (path to moments JSON), `--slug` (technique slug — placeholder, just store the arg, actual DB loading deferred), `--voice-level` (float, optional, default None — wired in T02).
|
||||
- When `args.command == "score"`: validate that exactly one of --file or --slug is provided. If --file, load JSON, extract `moments` and `creator_name`. Create `ScoreRunner(client)`. If `--voice-level` is None, call `score_page()` directly with the moments. Print the report.
|
||||
- For now, the --slug path prints "DB loading not yet implemented" and exits 1.
|
||||
6. Verify: `cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')"` exits 0.
|
||||
7. Verify: `cd backend && python -m pipeline.quality score --help` shows --file, --slug, --voice-level.
|
||||
8. Verify: `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json` — hits connectivity error, prints endpoint URL, exits 1 with no traceback.
|
||||
|
||||
## Must-Haves
|
||||
|
||||
- [ ] ScoreResult dataclass with 5 float dimensions + composite + justifications + elapsed_seconds
|
||||
- [ ] ScoreRunner.score_page() sends rubric + page + moments to LLM, parses JSON response
|
||||
- [ ] Formatted report output with per-dimension scores and justifications
|
||||
- [ ] Fixture JSON with 5+ realistic moments including transcript excerpts and plugin mentions
|
||||
- [ ] `score` subcommand wired in __main__.py with --file, --slug, --voice-level args
|
||||
- [ ] Connectivity error handled cleanly (same pattern as fitness)
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')"` — exits 0
|
||||
- `cd backend && python -m pipeline.quality score --help` — shows all three args
|
||||
- `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json` — connectivity error with URL, exit 1, no traceback
|
||||
- `cd backend && python -c "import json; d=json.load(open('pipeline/quality/fixtures/sample_moments.json')); assert 'moments' in d and len(d['moments']) >= 5"` — fixture valid
|
||||
- Estimate: 1.5h
|
||||
- Files: backend/pipeline/quality/scorer.py, backend/pipeline/quality/__main__.py, backend/pipeline/quality/fixtures/sample_moments.json, backend/pipeline/quality/fixtures/__init__.py
|
||||
- Verify: cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')" && python -m pipeline.quality score --help && python -c "import json; d=json.load(open('pipeline/quality/fixtures/sample_moments.json')); assert 'moments' in d and len(d['moments']) >= 5"
|
||||
- [ ] **T02: Implement voice dial prompt modifier and re-synthesis scoring flow** — ## Description
|
||||
|
||||
Build the voice dial module that modifies the stage 5 synthesis prompt based on a voice_level parameter (0.0–1.0), and wire it into the scorer so `--voice-level` triggers re-synthesis from source moments before scoring. This completes the slice by enabling the key demo: running the scorer at voice_level 0.2 vs 0.8 produces measurably different voice preservation scores.
|
||||
|
||||
## Failure Modes
|
||||
|
||||
| Dependency | On error | On timeout | On malformed response |
|
||||
|------------|----------|-----------|----------------------|
|
||||
| LLM endpoint (re-synthesis) | Same connectivity error pattern — print URL, exit 1 | Same | If synthesis returns unparseable JSON, log raw excerpt and exit 1 with message |
|
||||
| LLM endpoint (scoring judge) | Same as T01 scorer | Same | Same as T01 — zero scores with detail |
|
||||
| prompts/stage5_synthesis.txt | FileNotFoundError caught, print "Prompt file not found: {path}", exit 1 | N/A | N/A |
|
||||
|
||||
## Steps
|
||||
|
||||
1. Read the existing `prompts/stage5_synthesis.txt` to understand what voice language is already present (the research says it's roughly voice_level 0.6-0.7 baseline).
|
||||
2. Read `backend/pipeline/stages.py` functions `_load_prompt()`, `_get_stage_config()`, and `_synthesize_chunk()` to understand the exact synthesis call pattern — system_prompt + `<creator>...</creator>\n<moments>...</moments>` user prompt format.
|
||||
3. Create `backend/pipeline/quality/voice_dial.py` with:
|
||||
- `VoiceDial` class:
|
||||
- `__init__(self, base_prompt: str)` — stores the base stage 5 system prompt.
|
||||
- `modify(self, voice_level: float) -> str` — returns the modified system prompt.
|
||||
- 3 bands: low (0.0–0.33), mid (0.34–0.66), high (0.67–1.0).
|
||||
- Low band: append instruction to suppress direct quotes, write in neutral third-person encyclopedia style, avoid attributing opinions, minimize personality markers.
|
||||
- Mid band: return base prompt unmodified (the existing prompt already has moderate voice preservation).
|
||||
- High band: append instruction to maximize direct quotes from transcript, preserve every memorable phrase, prioritize creator's exact words over paraphrase, include personality and strong opinions.
|
||||
- Band boundaries at 0.33 and 0.67. Within each band, no continuous interpolation — just the band's modifier.
|
||||
4. Add `ScoreRunner.synthesize_and_score(moments: list[dict], creator_name: str, voice_level: float) -> ScoreResult` method to `backend/pipeline/quality/scorer.py`:
|
||||
- Loads stage5_synthesis.txt via `_load_prompt('stage5_synthesis.txt')` (import from pipeline.stages).
|
||||
- Creates `VoiceDial(base_prompt)` and calls `modify(voice_level)` to get modified prompt.
|
||||
- Gets stage config via `_get_stage_config(5)` for model_override and modality.
|
||||
- Builds user prompt in the same format as `_synthesize_chunk`: `<creator>{name}</creator>\n<moments>\n{moments_json}\n</moments>`.
|
||||
- Calls `self.client.complete()` with the modified prompt, parses response as SynthesisResult using `self.client.parse_response(raw, SynthesisResult)`.
|
||||
- If synthesis returns valid pages, takes the first page and calls `self.score_page()` on it.
|
||||
- Returns the ScoreResult.
|
||||
5. Update `backend/pipeline/quality/__main__.py` — in the `score` command handler:
|
||||
- If `--voice-level` is provided and `--file` is used: load moments from file, call `runner.synthesize_and_score(moments, creator_name, voice_level)` instead of `score_page()`.
|
||||
- If `--voice-level` is provided without `--file`: error — voice-level requires moments input (--file or --slug with DB).
|
||||
6. Verify: `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; print('import ok')"` exits 0.
|
||||
7. Verify: `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base prompt'); low = vd.modify(0.1); mid = vd.modify(0.5); high = vd.modify(0.9); assert low != mid; assert high != mid; assert 'suppress' in low.lower() or 'neutral' in low.lower(); assert 'quote' in high.lower() or 'direct' in high.lower(); print('dial ok')"` — voice dial produces distinct prompts per band.
|
||||
8. Verify: `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json --voice-level 0.3` — connectivity error, exit 1, no traceback.
|
||||
|
||||
## Must-Haves
|
||||
|
||||
- [ ] VoiceDial class with 3 bands (low/mid/high) producing distinct prompt modifications
|
||||
- [ ] Low band suppresses voice, high band amplifies it, mid band passes through unmodified
|
||||
- [ ] ScoreRunner.synthesize_and_score() re-synthesizes from moments using modified prompt, then scores
|
||||
- [ ] --voice-level wired into CLI and triggers re-synthesis flow
|
||||
- [ ] Stage 5 prompt loaded from prompts/stage5_synthesis.txt (not hardcoded)
|
||||
- [ ] Synthesis output parsed as SynthesisResult (reuses existing schema)
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; print('import ok')"` — exits 0
|
||||
- `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base'); assert vd.modify(0.1) != vd.modify(0.5) != vd.modify(0.9); print('bands ok')"` — three distinct outputs
|
||||
- `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json --voice-level 0.3` — connectivity error, exit 1, no traceback
|
||||
- Estimate: 1.5h
|
||||
- Files: backend/pipeline/quality/voice_dial.py, backend/pipeline/quality/scorer.py, backend/pipeline/quality/__main__.py
|
||||
- Verify: cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base'); assert vd.modify(0.1) != vd.modify(0.5); assert vd.modify(0.5) != vd.modify(0.9); print('bands ok')" && python -m pipeline.quality score --help
|
||||
|
|
|
|||
115
.gsd/milestones/M013/slices/S02/S02-RESEARCH.md
Normal file
115
.gsd/milestones/M013/slices/S02/S02-RESEARCH.md
Normal file
|
|
@ -0,0 +1,115 @@
|
|||
# S02 Research: Stage 5 Quality Scorer & Voice Preservation Dial
|
||||
|
||||
## Summary
|
||||
|
||||
This slice adds a `score` subcommand to the existing `pipeline.quality` CLI that evaluates Stage 5 synthesis output (technique pages) across 5 dimensions, plus a `voice_level` parameter that modifies the stage 5 prompt to dial voice preservation up or down — with the scorer proving the dial works by producing measurably different voice preservation scores.
|
||||
|
||||
The work is **targeted research** — known patterns (LLM-as-judge scoring, prompt interpolation), applied to well-understood codebase structures.
|
||||
|
||||
## Recommendation
|
||||
|
||||
Build a `scorer.py` module in `backend/pipeline/quality/` alongside the existing `fitness.py`. The scorer:
|
||||
|
||||
1. Takes a reference technique page (either from DB by slug, or from a JSON file) and scores it across 5 dimensions using LLM-as-judge evaluation.
|
||||
2. Optionally re-synthesizes the page with a modified prompt (voice_level dial injected into the stage 5 system prompt) before scoring.
|
||||
3. Outputs a structured report with per-dimension scores and a composite.
|
||||
|
||||
The voice_level dial modifies the stage 5 synthesis prompt by interpolating a voice-emphasis instruction. At 0.0 ("clinical"), the prompt suppresses direct quotes and creator voice. At 1.0 ("maximum voice"), it amplifies quote preservation, creator opinions, and personality. The existing prompt (stage5_synthesis.txt) already has strong voice preservation language — that's roughly voice_level 0.6-0.7 baseline.
|
||||
|
||||
## Implementation Landscape
|
||||
|
||||
### Existing code to build on
|
||||
|
||||
| File | What it provides | How S02 uses it |
|
||||
|------|-----------------|-----------------|
|
||||
| `backend/pipeline/quality/__main__.py` | argparse CLI with subcommand pattern | Add `score` subcommand with `--slug`, `--file`, `--voice-level` args |
|
||||
| `backend/pipeline/quality/fitness.py` | `TestResult` dataclass, `FitnessRunner` pattern | Follow same pattern: `ScoreRunner` class, `ScoreResult` dataclass |
|
||||
| `backend/pipeline/stages.py` → `_synthesize_chunk()` | Calls LLM with stage5 prompt + moments | Scorer can reuse `_load_prompt()`, `_get_llm_client()`, `_get_stage_config()` |
|
||||
| `backend/pipeline/schemas.py` → `SynthesizedPage`, `SynthesisResult` | Pydantic models for stage 5 output | Scorer validates re-synthesized output against these |
|
||||
| `backend/config.py` → `Settings` | All LLM endpoint config | Scorer uses same LLM client config |
|
||||
| `prompts/stage5_synthesis.txt` | The full synthesis prompt (~4KB) | Base prompt for voice_level interpolation |
|
||||
| `backend/pipeline/llm_client.py` → `LLMClient.complete()` | LLM call with fallback, modality support | Used for both re-synthesis and scoring judge calls |
|
||||
|
||||
### The 5 scoring dimensions
|
||||
|
||||
Based on the stage 5 prompt's quality guidelines, these are the natural scoring dimensions:
|
||||
|
||||
1. **Structural quality** — Does the page have well-named sections (not generic "Overview"/"Tips"), appropriate section count, 2-5 paragraphs per section, proper body_sections structure?
|
||||
2. **Content specificity** — Does the prose contain concrete details (frequencies, ratios, ms values, plugin settings) vs vague generalities? Ratio of specific claims to filler sentences.
|
||||
3. **Voice preservation** — Does the page preserve the creator's actual words, opinions, warnings? Presence of direct quotes, attributed opinions, personality.
|
||||
4. **Readability / flow** — Does the page read as synthesis (not concatenation)? Logical section ordering, merged related info, no redundancy, contradiction handling.
|
||||
5. **Factual fidelity** — Does the page avoid inventing information not in the source moments? No phantom specifics, no hallucinated plugin names or settings.
|
||||
|
||||
Each dimension scored 0.0–1.0 by the LLM judge. Composite = weighted average (weights configurable, default equal).
|
||||
|
||||
### Voice level dial implementation
|
||||
|
||||
The dial works by interpolating a voice-emphasis modifier into the stage 5 system prompt before the LLM call. This is NOT a new prompt file — it's a runtime prefix/suffix injected into the existing prompt.
|
||||
|
||||
**voice_level 0.0 ("clinical"):** Append instruction to suppress direct quotes, write in neutral third-person encyclopedia style, avoid attributing opinions.
|
||||
|
||||
**voice_level 0.5 ("balanced"):** No modification — use the base prompt as-is. The existing prompt already asks for moderate voice preservation.
|
||||
|
||||
**voice_level 1.0 ("maximum voice"):** Append instruction to maximize direct quotes from transcript, preserve every memorable phrase, prioritize creator's exact words over paraphrase, include personality and strong opinions even if tangential.
|
||||
|
||||
Linear interpolation between these anchor points selects which modifier text to inject and how strongly. In practice, 3 bands (low/mid/high) with blended instructions is simpler and more reliable than continuous interpolation with an LLM.
|
||||
|
||||
### Scoring approach: LLM-as-judge
|
||||
|
||||
The scorer sends the technique page JSON + the source moments to the LLM with a scoring rubric prompt. The rubric prompt is a new file (`prompts/scoring_rubric.txt` or hardcoded in scorer.py — hardcoded is simpler for iteration, can extract later).
|
||||
|
||||
The judge LLM call returns a JSON object with per-dimension scores and brief justifications. Using the same LLM that generated the content is acceptable for relative scoring (comparing voice_level 0.2 vs 0.8 on the same content) even though it's not ideal for absolute quality assessment.
|
||||
|
||||
### Reference article sourcing
|
||||
|
||||
The scorer needs a technique page to evaluate. Two modes:
|
||||
- `--slug <slug>`: Load from DB (requires DB connection). Loads the TechniquePage + its KeyMoments + classification data.
|
||||
- `--file <path>`: Load from a JSON file (no DB needed). The file contains the synthesized page JSON + the source moments array. This mode is useful for testing without a running stack.
|
||||
|
||||
For the voice_level comparison demo, the scorer needs to re-synthesize from source moments, not just score an existing page. So the flow is:
|
||||
1. Load reference moments (from DB or file)
|
||||
2. Synthesize at voice_level X using modified prompt
|
||||
3. Score the result
|
||||
4. Repeat at voice_level Y
|
||||
5. Compare scores
|
||||
|
||||
### Data needed for reference test
|
||||
|
||||
The scorer needs at least one set of real key moments to re-synthesize. Options:
|
||||
- Connect to the production DB on ub01 and pull a technique page + its moments
|
||||
- Create a fixture JSON file with sample moments for offline testing
|
||||
|
||||
A fixture file is better for the CLI demo — no DB dependency. The fixture should contain a realistic set of 5-8 moments with actual transcript excerpts, plugin mentions, and classification data.
|
||||
|
||||
### File structure after S02
|
||||
|
||||
```
|
||||
backend/pipeline/quality/
|
||||
__init__.py (existing)
|
||||
__main__.py (modified — add 'score' subcommand)
|
||||
fitness.py (existing, untouched)
|
||||
scorer.py (NEW — ScoreRunner, ScoreResult, scoring logic)
|
||||
voice_dial.py (NEW — voice level prompt modifier)
|
||||
fixtures/ (NEW — sample moments JSON for offline testing)
|
||||
sample_moments.json
|
||||
```
|
||||
|
||||
### Natural task seams
|
||||
|
||||
1. **T01: ScoreResult dataclass + scoring rubric + ScoreRunner skeleton** — Define the 5-dimension scoring schema, write the LLM judge rubric prompt, implement `ScoreRunner.score_page()` that sends a page to the judge and returns structured scores. Wire into CLI as `score` subcommand with `--slug` and `--file` args.
|
||||
|
||||
2. **T02: Voice dial implementation + re-synthesis flow** — Implement `voice_dial.py` with prompt modifier logic (3 bands). Add `--voice-level` arg to the `score` subcommand. When voice_level is provided, re-synthesize the page from source moments using the modified prompt before scoring. Build the sample fixture file.
|
||||
|
||||
3. **T03: Integration verification + report formatting** — End-to-end test: score a reference article, run at voice_level 0.2 vs 0.8, verify scores differ meaningfully on the voice preservation dimension. Format the output report (similar to fitness report style). Verify CLI help text.
|
||||
|
||||
### Risks
|
||||
|
||||
| Risk | Likelihood | Mitigation |
|
||||
|------|-----------|------------|
|
||||
| LLM judge scores don't differentiate voice levels | Medium | The scoring rubric must be specific about what constitutes voice preservation (direct quotes present, attributed opinions, personality markers). If the judge is too coarse, sharpen the rubric with examples. |
|
||||
| No live LLM endpoint accessible from dev machine | High | Same as S01 — we can build and verify structure/imports locally, but actual scoring requires FYN-LLM. The fixture file allows verifying the data flow without a live endpoint. |
|
||||
| Re-synthesis is expensive (full stage 5 LLM call per voice level) | Low | Expected — this is a quality tool, not a hot path. Each run takes ~30-60s per synthesis. Acceptable for a CLI benchmarking tool. |
|
||||
|
||||
### Skills check
|
||||
|
||||
No external skills needed. This is pure Python (dataclasses, Pydantic, argparse) using the existing LLM client. No new libraries required.
|
||||
76
.gsd/milestones/M013/slices/S02/tasks/T01-PLAN.md
Normal file
76
.gsd/milestones/M013/slices/S02/tasks/T01-PLAN.md
Normal file
|
|
@ -0,0 +1,76 @@
|
|||
---
|
||||
estimated_steps: 41
|
||||
estimated_files: 4
|
||||
skills_used: []
|
||||
---
|
||||
|
||||
# T01: Build ScoreRunner with 5-dimension LLM-as-judge scoring and CLI integration
|
||||
|
||||
## Description
|
||||
|
||||
Create the scorer module that evaluates a Stage 5 technique page across 5 quality dimensions using an LLM judge call. Wire it into the existing CLI as the `score` subcommand. Build a realistic fixture file for offline testing.
|
||||
|
||||
## Failure Modes
|
||||
|
||||
| Dependency | On error | On timeout | On malformed response |
|
||||
|------------|----------|-----------|----------------------|
|
||||
| LLM endpoint | Print clear connectivity error with URL, exit 1 (same as fitness) | Same — openai timeout maps to APITimeoutError | Parse judge JSON response; if malformed, log raw excerpt and return zero scores with detail message |
|
||||
|
||||
## Steps
|
||||
|
||||
1. Read `backend/pipeline/quality/fitness.py` to understand the TestResult/FitnessRunner pattern and report formatting.
|
||||
2. Create `backend/pipeline/quality/scorer.py` with:
|
||||
- `ScoreResult` dataclass: `structural`, `content_specificity`, `voice_preservation`, `readability`, `factual_fidelity` (each float 0.0–1.0), `composite` (weighted average, default equal weights), `justifications` (dict of dimension → string), `elapsed_seconds` (float).
|
||||
- `ScoreRunner` class taking an `LLMClient` in __init__.
|
||||
- `ScoreRunner.score_page(page_json: dict, moments: list[dict]) -> ScoreResult` method:
|
||||
- Builds a scoring rubric prompt (hardcoded string in scorer.py — not a separate prompt file) that asks the LLM to evaluate the page against the source moments across the 5 dimensions.
|
||||
- The rubric should specify what each dimension measures (see S02-RESEARCH.md for definitions) and instruct the LLM to return JSON: `{"structural": 0.8, "content_specificity": 0.7, ..., "justifications": {"structural": "...", ...}}`.
|
||||
- Calls `self.client.complete()` with modality="chat", response_model=BaseModel (JSON mode).
|
||||
- Parses JSON response, validates all 5 dimension keys present and values in [0.0, 1.0].
|
||||
- Computes composite as mean of 5 dimensions.
|
||||
- Returns `ScoreResult`.
|
||||
- `ScoreRunner.print_report(result: ScoreResult)` — formatted report matching fitness report style: header bar, per-dimension score with justification excerpt, composite score, timing.
|
||||
3. Create `backend/pipeline/quality/fixtures/sample_moments.json` with:
|
||||
- `{"creator_name": "ExampleCreator", "topic_category": "Sound design", "moments": [...]}` — 5-6 realistic moments with `summary`, `transcript_excerpt`, `topic_tags`, `topic_category`, `start_time`, `end_time` fields. Content about a concrete music production technique (e.g., snare layering or bass resampling). Include direct quotes and specific plugin/setting mentions so voice preservation scoring has signal.
|
||||
4. Create `backend/pipeline/quality/fixtures/__init__.py` (empty).
|
||||
5. Update `backend/pipeline/quality/__main__.py`:
|
||||
- Add `score` subcommand to argparse with args: `--file` (path to moments JSON), `--slug` (technique slug — placeholder, just store the arg, actual DB loading deferred), `--voice-level` (float, optional, default None — wired in T02).
|
||||
- When `args.command == "score"`: validate that exactly one of --file or --slug is provided. If --file, load JSON, extract `moments` and `creator_name`. Create `ScoreRunner(client)`. If `--voice-level` is None, call `score_page()` directly with the moments. Print the report.
|
||||
- For now, the --slug path prints "DB loading not yet implemented" and exits 1.
|
||||
6. Verify: `cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')"` exits 0.
|
||||
7. Verify: `cd backend && python -m pipeline.quality score --help` shows --file, --slug, --voice-level.
|
||||
8. Verify: `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json` — hits connectivity error, prints endpoint URL, exits 1 with no traceback.
|
||||
|
||||
## Must-Haves
|
||||
|
||||
- [ ] ScoreResult dataclass with 5 float dimensions + composite + justifications + elapsed_seconds
|
||||
- [ ] ScoreRunner.score_page() sends rubric + page + moments to LLM, parses JSON response
|
||||
- [ ] Formatted report output with per-dimension scores and justifications
|
||||
- [ ] Fixture JSON with 5+ realistic moments including transcript excerpts and plugin mentions
|
||||
- [ ] `score` subcommand wired in __main__.py with --file, --slug, --voice-level args
|
||||
- [ ] Connectivity error handled cleanly (same pattern as fitness)
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')"` — exits 0
|
||||
- `cd backend && python -m pipeline.quality score --help` — shows all three args
|
||||
- `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json` — connectivity error with URL, exit 1, no traceback
|
||||
- `cd backend && python -c "import json; d=json.load(open('pipeline/quality/fixtures/sample_moments.json')); assert 'moments' in d and len(d['moments']) >= 5"` — fixture valid
|
||||
|
||||
## Inputs
|
||||
|
||||
- `backend/pipeline/quality/__main__.py`
|
||||
- `backend/pipeline/quality/fitness.py`
|
||||
- `backend/pipeline/llm_client.py`
|
||||
- `backend/pipeline/schemas.py`
|
||||
|
||||
## Expected Output
|
||||
|
||||
- `backend/pipeline/quality/scorer.py`
|
||||
- `backend/pipeline/quality/__main__.py`
|
||||
- `backend/pipeline/quality/fixtures/sample_moments.json`
|
||||
- `backend/pipeline/quality/fixtures/__init__.py`
|
||||
|
||||
## Verification
|
||||
|
||||
cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')" && python -m pipeline.quality score --help && python -c "import json; d=json.load(open('pipeline/quality/fixtures/sample_moments.json')); assert 'moments' in d and len(d['moments']) >= 5"
|
||||
84
.gsd/milestones/M013/slices/S02/tasks/T01-SUMMARY.md
Normal file
84
.gsd/milestones/M013/slices/S02/tasks/T01-SUMMARY.md
Normal file
|
|
@ -0,0 +1,84 @@
|
|||
---
|
||||
id: T01
|
||||
parent: S02
|
||||
milestone: M013
|
||||
provides: []
|
||||
requires: []
|
||||
affects: []
|
||||
key_files: ["backend/pipeline/quality/scorer.py", "backend/pipeline/quality/__main__.py", "backend/pipeline/quality/fixtures/sample_moments.json", "backend/pipeline/quality/fixtures/__init__.py"]
|
||||
key_decisions: ["Hardcoded scoring rubric in scorer.py for iteration speed", "Used mutually_exclusive_group for --file/--slug"]
|
||||
patterns_established: []
|
||||
drill_down_paths: []
|
||||
observability_surfaces: []
|
||||
duration: ""
|
||||
verification_result: "All 4 verification commands pass: import check exits 0, --help shows all 3 args, connectivity error exits 1 with URL and no traceback, fixture validates with 6 moments."
|
||||
completed_at: 2026-04-01T08:53:38.205Z
|
||||
blocker_discovered: false
|
||||
---
|
||||
|
||||
# T01: Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, CLI score subcommand, and realistic 6-moment sample fixture
|
||||
|
||||
> Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, CLI score subcommand, and realistic 6-moment sample fixture
|
||||
|
||||
## What Happened
|
||||
---
|
||||
id: T01
|
||||
parent: S02
|
||||
milestone: M013
|
||||
key_files:
|
||||
- backend/pipeline/quality/scorer.py
|
||||
- backend/pipeline/quality/__main__.py
|
||||
- backend/pipeline/quality/fixtures/sample_moments.json
|
||||
- backend/pipeline/quality/fixtures/__init__.py
|
||||
key_decisions:
|
||||
- Hardcoded scoring rubric in scorer.py for iteration speed
|
||||
- Used mutually_exclusive_group for --file/--slug
|
||||
duration: ""
|
||||
verification_result: passed
|
||||
completed_at: 2026-04-01T08:53:38.205Z
|
||||
blocker_discovered: false
|
||||
---
|
||||
|
||||
# T01: Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, CLI score subcommand, and realistic 6-moment sample fixture
|
||||
|
||||
**Built ScoreRunner with 5-dimension LLM-as-judge scoring rubric, CLI score subcommand, and realistic 6-moment sample fixture**
|
||||
|
||||
## What Happened
|
||||
|
||||
Created scorer.py with ScoreResult dataclass (5 float dimensions + composite + justifications + elapsed_seconds + error) and ScoreRunner class that sends technique page JSON and source moments to an LLM judge with a detailed scoring rubric. Updated __main__.py with score subcommand (--file, --slug mutually exclusive; --voice-level optional for T02). Created fixtures/sample_moments.json with 6 realistic music production moments including transcript excerpts with specific plugins/settings.
|
||||
|
||||
## Verification
|
||||
|
||||
All 4 verification commands pass: import check exits 0, --help shows all 3 args, connectivity error exits 1 with URL and no traceback, fixture validates with 6 moments.
|
||||
|
||||
## Verification Evidence
|
||||
|
||||
| # | Command | Exit Code | Verdict | Duration |
|
||||
|---|---------|-----------|---------|----------|
|
||||
| 1 | `cd backend && python -c "from pipeline.quality.scorer import ScoreRunner, ScoreResult; print('import ok')"` | 0 | ✅ pass | 1000ms |
|
||||
| 2 | `cd backend && python -m pipeline.quality score --help` | 0 | ✅ pass | 1000ms |
|
||||
| 3 | `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json` | 1 | ✅ pass | 3000ms |
|
||||
| 4 | `cd backend && python -c "import json; d=json.load(open('pipeline/quality/fixtures/sample_moments.json')); assert 'moments' in d and len(d['moments']) >= 5"` | 0 | ✅ pass | 1000ms |
|
||||
|
||||
|
||||
## Deviations
|
||||
|
||||
None.
|
||||
|
||||
## Known Issues
|
||||
|
||||
None.
|
||||
|
||||
## Files Created/Modified
|
||||
|
||||
- `backend/pipeline/quality/scorer.py`
|
||||
- `backend/pipeline/quality/__main__.py`
|
||||
- `backend/pipeline/quality/fixtures/sample_moments.json`
|
||||
- `backend/pipeline/quality/fixtures/__init__.py`
|
||||
|
||||
|
||||
## Deviations
|
||||
None.
|
||||
|
||||
## Known Issues
|
||||
None.
|
||||
82
.gsd/milestones/M013/slices/S02/tasks/T02-PLAN.md
Normal file
82
.gsd/milestones/M013/slices/S02/tasks/T02-PLAN.md
Normal file
|
|
@ -0,0 +1,82 @@
|
|||
---
|
||||
estimated_steps: 45
|
||||
estimated_files: 3
|
||||
skills_used: []
|
||||
---
|
||||
|
||||
# T02: Implement voice dial prompt modifier and re-synthesis scoring flow
|
||||
|
||||
## Description
|
||||
|
||||
Build the voice dial module that modifies the stage 5 synthesis prompt based on a voice_level parameter (0.0–1.0), and wire it into the scorer so `--voice-level` triggers re-synthesis from source moments before scoring. This completes the slice by enabling the key demo: running the scorer at voice_level 0.2 vs 0.8 produces measurably different voice preservation scores.
|
||||
|
||||
## Failure Modes
|
||||
|
||||
| Dependency | On error | On timeout | On malformed response |
|
||||
|------------|----------|-----------|----------------------|
|
||||
| LLM endpoint (re-synthesis) | Same connectivity error pattern — print URL, exit 1 | Same | If synthesis returns unparseable JSON, log raw excerpt and exit 1 with message |
|
||||
| LLM endpoint (scoring judge) | Same as T01 scorer | Same | Same as T01 — zero scores with detail |
|
||||
| prompts/stage5_synthesis.txt | FileNotFoundError caught, print "Prompt file not found: {path}", exit 1 | N/A | N/A |
|
||||
|
||||
## Steps
|
||||
|
||||
1. Read the existing `prompts/stage5_synthesis.txt` to understand what voice language is already present (the research says it's roughly voice_level 0.6-0.7 baseline).
|
||||
2. Read `backend/pipeline/stages.py` functions `_load_prompt()`, `_get_stage_config()`, and `_synthesize_chunk()` to understand the exact synthesis call pattern — system_prompt + `<creator>...</creator>\n<moments>...</moments>` user prompt format.
|
||||
3. Create `backend/pipeline/quality/voice_dial.py` with:
|
||||
- `VoiceDial` class:
|
||||
- `__init__(self, base_prompt: str)` — stores the base stage 5 system prompt.
|
||||
- `modify(self, voice_level: float) -> str` — returns the modified system prompt.
|
||||
- 3 bands: low (0.0–0.33), mid (0.34–0.66), high (0.67–1.0).
|
||||
- Low band: append instruction to suppress direct quotes, write in neutral third-person encyclopedia style, avoid attributing opinions, minimize personality markers.
|
||||
- Mid band: return base prompt unmodified (the existing prompt already has moderate voice preservation).
|
||||
- High band: append instruction to maximize direct quotes from transcript, preserve every memorable phrase, prioritize creator's exact words over paraphrase, include personality and strong opinions.
|
||||
- Band boundaries at 0.33 and 0.67. Within each band, no continuous interpolation — just the band's modifier.
|
||||
4. Add `ScoreRunner.synthesize_and_score(moments: list[dict], creator_name: str, voice_level: float) -> ScoreResult` method to `backend/pipeline/quality/scorer.py`:
|
||||
- Loads stage5_synthesis.txt via `_load_prompt('stage5_synthesis.txt')` (import from pipeline.stages).
|
||||
- Creates `VoiceDial(base_prompt)` and calls `modify(voice_level)` to get modified prompt.
|
||||
- Gets stage config via `_get_stage_config(5)` for model_override and modality.
|
||||
- Builds user prompt in the same format as `_synthesize_chunk`: `<creator>{name}</creator>\n<moments>\n{moments_json}\n</moments>`.
|
||||
- Calls `self.client.complete()` with the modified prompt, parses response as SynthesisResult using `self.client.parse_response(raw, SynthesisResult)`.
|
||||
- If synthesis returns valid pages, takes the first page and calls `self.score_page()` on it.
|
||||
- Returns the ScoreResult.
|
||||
5. Update `backend/pipeline/quality/__main__.py` — in the `score` command handler:
|
||||
- If `--voice-level` is provided and `--file` is used: load moments from file, call `runner.synthesize_and_score(moments, creator_name, voice_level)` instead of `score_page()`.
|
||||
- If `--voice-level` is provided without `--file`: error — voice-level requires moments input (--file or --slug with DB).
|
||||
6. Verify: `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; print('import ok')"` exits 0.
|
||||
7. Verify: `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base prompt'); low = vd.modify(0.1); mid = vd.modify(0.5); high = vd.modify(0.9); assert low != mid; assert high != mid; assert 'suppress' in low.lower() or 'neutral' in low.lower(); assert 'quote' in high.lower() or 'direct' in high.lower(); print('dial ok')"` — voice dial produces distinct prompts per band.
|
||||
8. Verify: `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json --voice-level 0.3` — connectivity error, exit 1, no traceback.
|
||||
|
||||
## Must-Haves
|
||||
|
||||
- [ ] VoiceDial class with 3 bands (low/mid/high) producing distinct prompt modifications
|
||||
- [ ] Low band suppresses voice, high band amplifies it, mid band passes through unmodified
|
||||
- [ ] ScoreRunner.synthesize_and_score() re-synthesizes from moments using modified prompt, then scores
|
||||
- [ ] --voice-level wired into CLI and triggers re-synthesis flow
|
||||
- [ ] Stage 5 prompt loaded from prompts/stage5_synthesis.txt (not hardcoded)
|
||||
- [ ] Synthesis output parsed as SynthesisResult (reuses existing schema)
|
||||
|
||||
## Verification
|
||||
|
||||
- `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; print('import ok')"` — exits 0
|
||||
- `cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base'); assert vd.modify(0.1) != vd.modify(0.5) != vd.modify(0.9); print('bands ok')"` — three distinct outputs
|
||||
- `cd backend && python -m pipeline.quality score --file pipeline/quality/fixtures/sample_moments.json --voice-level 0.3` — connectivity error, exit 1, no traceback
|
||||
|
||||
## Inputs
|
||||
|
||||
- `backend/pipeline/quality/scorer.py`
|
||||
- `backend/pipeline/quality/__main__.py`
|
||||
- `backend/pipeline/quality/fixtures/sample_moments.json`
|
||||
- `backend/pipeline/llm_client.py`
|
||||
- `backend/pipeline/stages.py`
|
||||
- `backend/pipeline/schemas.py`
|
||||
- `prompts/stage5_synthesis.txt`
|
||||
|
||||
## Expected Output
|
||||
|
||||
- `backend/pipeline/quality/voice_dial.py`
|
||||
- `backend/pipeline/quality/scorer.py`
|
||||
- `backend/pipeline/quality/__main__.py`
|
||||
|
||||
## Verification
|
||||
|
||||
cd backend && python -c "from pipeline.quality.voice_dial import VoiceDial; vd = VoiceDial('base'); assert vd.modify(0.1) != vd.modify(0.5); assert vd.modify(0.5) != vd.modify(0.9); print('bands ok')" && python -m pipeline.quality score --help
|
||||
|
|
@ -1,16 +1,22 @@
|
|||
"""FYN-LLM fitness test suite.
|
||||
"""FYN-LLM quality assurance toolkit.
|
||||
|
||||
Run with: python -m pipeline.quality fitness
|
||||
Subcommands:
|
||||
fitness — Run LLM fitness tests across four categories
|
||||
score — Score a Stage 5 technique page across 5 quality dimensions
|
||||
|
||||
Run with: python -m pipeline.quality <command>
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
|
||||
from config import get_settings
|
||||
from pipeline.llm_client import LLMClient
|
||||
|
||||
from .fitness import FitnessRunner
|
||||
from .scorer import ScoreRunner
|
||||
|
||||
|
||||
def main() -> int:
|
||||
|
|
@ -23,6 +29,29 @@ def main() -> int:
|
|||
# -- fitness subcommand --
|
||||
sub.add_parser("fitness", help="Run LLM fitness tests across four categories")
|
||||
|
||||
# -- score subcommand --
|
||||
score_parser = sub.add_parser(
|
||||
"score",
|
||||
help="Score a Stage 5 technique page across 5 quality dimensions",
|
||||
)
|
||||
source_group = score_parser.add_mutually_exclusive_group(required=True)
|
||||
source_group.add_argument(
|
||||
"--file",
|
||||
type=str,
|
||||
help="Path to a moments JSON file (creator_name, moments array)",
|
||||
)
|
||||
source_group.add_argument(
|
||||
"--slug",
|
||||
type=str,
|
||||
help="Technique slug to load from the database",
|
||||
)
|
||||
score_parser.add_argument(
|
||||
"--voice-level",
|
||||
type=float,
|
||||
default=None,
|
||||
help="Voice preservation dial (0.0=clinical, 1.0=maximum voice). Triggers re-synthesis before scoring.",
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.command is None:
|
||||
|
|
@ -35,6 +64,66 @@ def main() -> int:
|
|||
runner = FitnessRunner(client)
|
||||
return runner.run_all()
|
||||
|
||||
if args.command == "score":
|
||||
return _run_score(args)
|
||||
|
||||
return 0
|
||||
|
||||
|
||||
def _run_score(args: argparse.Namespace) -> int:
|
||||
"""Execute the score subcommand."""
|
||||
# -- Load source data --
|
||||
if args.slug:
|
||||
print("DB loading not yet implemented", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
try:
|
||||
with open(args.file) as f:
|
||||
data = json.load(f)
|
||||
except FileNotFoundError:
|
||||
print(f"File not found: {args.file}", file=sys.stderr)
|
||||
return 1
|
||||
except json.JSONDecodeError as exc:
|
||||
print(f"Invalid JSON in {args.file}: {exc}", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
moments = data.get("moments", [])
|
||||
creator_name = data.get("creator_name", "Unknown")
|
||||
|
||||
if not moments:
|
||||
print("No moments found in input file", file=sys.stderr)
|
||||
return 1
|
||||
|
||||
# -- Build page stub from moments for scoring --
|
||||
# When --voice-level is set, T02 will re-synthesize. For now, build a
|
||||
# minimal page representation from the moments so the scorer has
|
||||
# something to evaluate.
|
||||
page_json = {
|
||||
"title": f"{creator_name} — Technique Page",
|
||||
"creator_name": creator_name,
|
||||
"summary": f"Technique page synthesized from {len(moments)} key moments.",
|
||||
"body_sections": [
|
||||
{
|
||||
"heading": m.get("topic_tags", ["Technique"])[0] if m.get("topic_tags") else "Technique",
|
||||
"content": m.get("summary", "") + "\n\n" + m.get("transcript_excerpt", ""),
|
||||
}
|
||||
for m in moments
|
||||
],
|
||||
}
|
||||
|
||||
settings = get_settings()
|
||||
client = LLMClient(settings)
|
||||
runner = ScoreRunner(client)
|
||||
|
||||
print(f"\nScoring page for '{creator_name}' ({len(moments)} moments)...")
|
||||
|
||||
result = runner.score_page(page_json, moments)
|
||||
|
||||
if result.error:
|
||||
runner.print_report(result)
|
||||
return 1
|
||||
|
||||
runner.print_report(result)
|
||||
return 0
|
||||
|
||||
|
||||
|
|
|
|||
0
backend/pipeline/quality/fixtures/__init__.py
Normal file
0
backend/pipeline/quality/fixtures/__init__.py
Normal file
54
backend/pipeline/quality/fixtures/sample_moments.json
Normal file
54
backend/pipeline/quality/fixtures/sample_moments.json
Normal file
|
|
@ -0,0 +1,54 @@
|
|||
{
|
||||
"creator_name": "KOAN Sound",
|
||||
"topic_category": "Sound design",
|
||||
"moments": [
|
||||
{
|
||||
"summary": "Layering snare transients by combining a high-frequency click from a Popcorn Snare with a mid-body from a pitched-down 808 rim shot, blending at -6dB relative offset.",
|
||||
"transcript_excerpt": "So what I'll do is take the Popcorn Snare — that's got this really sharp click at like 4k — and then I layer underneath it a rim shot pitched down maybe 3 semitones. You blend those together and suddenly you've got this snare that cuts through everything but still has weight.",
|
||||
"topic_tags": ["snare layering", "transient design", "sample stacking"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 124.5,
|
||||
"end_time": 158.2
|
||||
},
|
||||
{
|
||||
"summary": "Using Serum's noise oscillator with the 'Analog_Crackle' wavetable at 12% mix to add organic texture to bass patches, followed by OTT at 30% depth for glue.",
|
||||
"transcript_excerpt": "One trick I always come back to is Serum's noise osc with Analog_Crackle. You don't want it loud — like 12 percent mix — just enough that the bass feels alive. Then slap OTT on there at maybe 30 percent depth and it glues the whole thing together without squashing it.",
|
||||
"topic_tags": ["bass design", "Serum", "OTT", "texture"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 203.1,
|
||||
"end_time": 241.7
|
||||
},
|
||||
{
|
||||
"summary": "Resampling technique: bounce a bass patch to audio, chop the best 2 bars, then re-pitch in Simpler with warp off for tighter timing and consistent tone.",
|
||||
"transcript_excerpt": "I'll resample everything. Bounce it down, find the two bars that sound best, throw it in Simpler with warp completely off. Now you've got this tight, consistent thing where every hit is exactly the same energy. The pitch tracking is way more predictable too.",
|
||||
"topic_tags": ["resampling", "Ableton", "Simpler", "bass production"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 312.0,
|
||||
"end_time": 349.8
|
||||
},
|
||||
{
|
||||
"summary": "Parallel compression chain for drums using Ableton's Drum Buss at 40% drive into a return track with Valhalla Room at 1.2s decay, mixed at -12dB.",
|
||||
"transcript_excerpt": "The parallel chain is dead simple — Drum Buss, crank the drive to about 40 percent, send that to a return with Valhalla Room. Keep the decay short, like 1.2 seconds. Mix it in at minus 12 and your drums just... breathe. They've got this room sound without getting washy.",
|
||||
"topic_tags": ["parallel compression", "drum processing", "Valhalla Room", "Drum Buss"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 421.3,
|
||||
"end_time": 462.1
|
||||
},
|
||||
{
|
||||
"summary": "Frequency-specific sidechain using Trackspacer plugin instead of volume ducking, targeting only 100-300Hz so the bass ducks under the kick without losing high-end presence.",
|
||||
"transcript_excerpt": "Everyone does volume sidechain but honestly Trackspacer changed everything for me. You set it to only affect 100 to 300 Hz so when the kick hits, the bass ducks just in that low-mid range. The top end of the bass stays right there — you keep all the character and harmonics, you just clear the mud.",
|
||||
"topic_tags": ["sidechaining", "Trackspacer", "frequency ducking", "mixing"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 498.7,
|
||||
"end_time": 534.2
|
||||
},
|
||||
{
|
||||
"summary": "Using Ableton's Utility plugin to check mono compatibility at every stage, specifically toggling mono on the sub bus to catch phase cancellation from layered bass patches.",
|
||||
"transcript_excerpt": "I'm almost paranoid about mono. I've got Utility on the sub bus and I'm flipping to mono constantly. If your layered bass sounds thin in mono you've got phase issues — doesn't matter how fat it sounds in stereo, it'll collapse on a club system.",
|
||||
"topic_tags": ["mono compatibility", "phase checking", "club mixing", "Utility"],
|
||||
"topic_category": "Sound design",
|
||||
"start_time": 567.0,
|
||||
"end_time": 598.4
|
||||
}
|
||||
]
|
||||
}
|
||||
263
backend/pipeline/quality/scorer.py
Normal file
263
backend/pipeline/quality/scorer.py
Normal file
|
|
@ -0,0 +1,263 @@
|
|||
"""Stage 5 quality scorer — LLM-as-judge evaluation across 5 dimensions.
|
||||
|
||||
Evaluates a synthesized technique page against source moments on:
|
||||
1. Structural quality — section naming, count, paragraph depth
|
||||
2. Content specificity — concrete details vs vague generalities
|
||||
3. Voice preservation — direct quotes, attributed opinions, personality
|
||||
4. Readability / flow — synthesis quality, logical ordering, no redundancy
|
||||
5. Factual fidelity — no hallucinated specifics, grounded in source moments
|
||||
|
||||
Run via: python -m pipeline.quality score --file <path>
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import logging
|
||||
import time
|
||||
from dataclasses import dataclass, field
|
||||
|
||||
import openai
|
||||
from pydantic import BaseModel
|
||||
|
||||
from pipeline.llm_client import LLMClient
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
|
||||
# ── Scoring rubric (hardcoded for iteration speed) ───────────────────────────
|
||||
|
||||
SCORING_RUBRIC = """\
|
||||
You are an expert evaluator of synthesized technique articles for music production education.
|
||||
|
||||
You will be given:
|
||||
1. A synthesized technique page (JSON with title, summary, body_sections)
|
||||
2. The source key moments (transcript excerpts, summaries, tags) used to create it
|
||||
|
||||
Evaluate the page across these 5 dimensions, scoring each 0.0 to 1.0:
|
||||
|
||||
**structural** — Section naming and organization
|
||||
- 0.9-1.0: Well-named specific sections (not generic "Overview"/"Tips"), appropriate count (3-6), 2-5 paragraphs per section
|
||||
- 0.5-0.7: Acceptable structure but some generic section names or uneven depth
|
||||
- 0.0-0.3: Poor structure — too few/many sections, generic names, single-paragraph sections
|
||||
|
||||
**content_specificity** — Concrete technical details
|
||||
- 0.9-1.0: Rich in frequencies (Hz), time values (ms), ratios, plugin names, specific settings, dB values
|
||||
- 0.5-0.7: Some specific details but padded with vague statements ("adjust to taste", "experiment with settings")
|
||||
- 0.0-0.3: Mostly vague generalities with few concrete values from the source material
|
||||
|
||||
**voice_preservation** — Creator's authentic voice
|
||||
- 0.9-1.0: Direct quotes preserved, opinions attributed to creator by name, personality and strong views retained
|
||||
- 0.5-0.7: Some paraphrased references to creator's views but few direct quotes
|
||||
- 0.0-0.3: Encyclopedia style — creator's voice completely smoothed out, no attribution
|
||||
|
||||
**readability** — Synthesis quality and flow
|
||||
- 0.9-1.0: Reads as a cohesive article, related info merged, logical flow, no redundancy or contradiction
|
||||
- 0.5-0.7: Generally readable but some awkward transitions or minor repetition
|
||||
- 0.0-0.3: Feels like concatenated bullet points, disjointed, redundant passages
|
||||
|
||||
**factual_fidelity** — Grounded in source material
|
||||
- 0.9-1.0: Every claim traceable to source moments, no invented plugin names/settings/techniques
|
||||
- 0.5-0.7: Mostly grounded but 1-2 details seem embellished or not directly from sources
|
||||
- 0.0-0.3: Contains hallucinated specifics — plugin names, settings, or techniques not in sources
|
||||
|
||||
Return ONLY a JSON object with this exact structure:
|
||||
{
|
||||
"structural": <float 0.0-1.0>,
|
||||
"content_specificity": <float 0.0-1.0>,
|
||||
"voice_preservation": <float 0.0-1.0>,
|
||||
"readability": <float 0.0-1.0>,
|
||||
"factual_fidelity": <float 0.0-1.0>,
|
||||
"justifications": {
|
||||
"structural": "<1-2 sentence justification>",
|
||||
"content_specificity": "<1-2 sentence justification>",
|
||||
"voice_preservation": "<1-2 sentence justification>",
|
||||
"readability": "<1-2 sentence justification>",
|
||||
"factual_fidelity": "<1-2 sentence justification>"
|
||||
}
|
||||
}
|
||||
"""
|
||||
|
||||
DIMENSIONS = [
|
||||
"structural",
|
||||
"content_specificity",
|
||||
"voice_preservation",
|
||||
"readability",
|
||||
"factual_fidelity",
|
||||
]
|
||||
|
||||
|
||||
# ── Result type ──────────────────────────────────────────────────────────────
|
||||
|
||||
@dataclass
|
||||
class ScoreResult:
|
||||
"""Outcome of scoring a technique page across 5 quality dimensions."""
|
||||
|
||||
structural: float = 0.0
|
||||
content_specificity: float = 0.0
|
||||
voice_preservation: float = 0.0
|
||||
readability: float = 0.0
|
||||
factual_fidelity: float = 0.0
|
||||
composite: float = 0.0
|
||||
justifications: dict[str, str] = field(default_factory=dict)
|
||||
elapsed_seconds: float = 0.0
|
||||
error: str | None = None
|
||||
|
||||
|
||||
# ── Runner ───────────────────────────────────────────────────────────────────
|
||||
|
||||
class ScoreRunner:
|
||||
"""Scores a Stage 5 technique page using LLM-as-judge evaluation."""
|
||||
|
||||
def __init__(self, client: LLMClient) -> None:
|
||||
self.client = client
|
||||
|
||||
def score_page(
|
||||
self,
|
||||
page_json: dict,
|
||||
moments: list[dict],
|
||||
) -> ScoreResult:
|
||||
"""Evaluate a technique page against source moments.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
page_json:
|
||||
Synthesized page dict (title, summary, body_sections).
|
||||
moments:
|
||||
Source key moments with transcript_excerpt, summary, etc.
|
||||
|
||||
Returns
|
||||
-------
|
||||
ScoreResult with per-dimension scores and justifications.
|
||||
"""
|
||||
# Build the user prompt with the page and source moments
|
||||
user_prompt = (
|
||||
"## Synthesized Technique Page\n\n"
|
||||
f"```json\n{json.dumps(page_json, indent=2)}\n```\n\n"
|
||||
"## Source Key Moments\n\n"
|
||||
f"```json\n{json.dumps(moments, indent=2)}\n```\n\n"
|
||||
"Score this page across all 5 dimensions."
|
||||
)
|
||||
|
||||
t0 = time.monotonic()
|
||||
try:
|
||||
resp = self.client.complete(
|
||||
system_prompt=SCORING_RUBRIC,
|
||||
user_prompt=user_prompt,
|
||||
response_model=BaseModel, # triggers JSON mode
|
||||
modality="chat",
|
||||
)
|
||||
elapsed = round(time.monotonic() - t0, 2)
|
||||
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
|
||||
elapsed = round(time.monotonic() - t0, 2)
|
||||
url = self.client.settings.llm_api_url
|
||||
fallback = self.client.settings.llm_fallback_url
|
||||
return ScoreResult(
|
||||
elapsed_seconds=elapsed,
|
||||
error=(
|
||||
f"Cannot reach LLM endpoint at {url} (fallback {fallback}). "
|
||||
f"Error: {exc}"
|
||||
),
|
||||
)
|
||||
|
||||
# Parse the LLM judge response
|
||||
raw_text = str(resp).strip()
|
||||
try:
|
||||
parsed = json.loads(raw_text)
|
||||
except json.JSONDecodeError:
|
||||
logger.error("Malformed judge response (not JSON): %.300s", raw_text)
|
||||
return ScoreResult(
|
||||
elapsed_seconds=elapsed,
|
||||
error=f"Malformed judge response (not valid JSON). Raw excerpt: {raw_text[:200]}",
|
||||
)
|
||||
|
||||
return self._parse_scores(parsed, elapsed)
|
||||
|
||||
def _parse_scores(self, parsed: dict, elapsed: float) -> ScoreResult:
|
||||
"""Extract and validate scores from parsed JSON response."""
|
||||
scores: dict[str, float] = {}
|
||||
justifications: dict[str, str] = {}
|
||||
|
||||
raw_justifications = parsed.get("justifications", {})
|
||||
if not isinstance(raw_justifications, dict):
|
||||
raw_justifications = {}
|
||||
|
||||
for dim in DIMENSIONS:
|
||||
raw = parsed.get(dim)
|
||||
if raw is None:
|
||||
logger.warning("Missing dimension '%s' in judge response", dim)
|
||||
scores[dim] = 0.0
|
||||
justifications[dim] = "(missing from judge response)"
|
||||
continue
|
||||
|
||||
try:
|
||||
val = float(raw)
|
||||
scores[dim] = max(0.0, min(1.0, val)) # clamp
|
||||
except (TypeError, ValueError):
|
||||
logger.warning("Invalid value for '%s': %r", dim, raw)
|
||||
scores[dim] = 0.0
|
||||
justifications[dim] = f"(invalid value: {raw!r})"
|
||||
continue
|
||||
|
||||
justifications[dim] = str(raw_justifications.get(dim, ""))
|
||||
|
||||
composite = sum(scores.values()) / len(DIMENSIONS)
|
||||
|
||||
return ScoreResult(
|
||||
structural=scores["structural"],
|
||||
content_specificity=scores["content_specificity"],
|
||||
voice_preservation=scores["voice_preservation"],
|
||||
readability=scores["readability"],
|
||||
factual_fidelity=scores["factual_fidelity"],
|
||||
composite=round(composite, 3),
|
||||
justifications=justifications,
|
||||
elapsed_seconds=elapsed,
|
||||
)
|
||||
|
||||
def print_report(self, result: ScoreResult) -> None:
|
||||
"""Print a formatted scoring report to stdout."""
|
||||
print("\n" + "=" * 60)
|
||||
print(" STAGE 5 QUALITY SCORE REPORT")
|
||||
print("=" * 60)
|
||||
|
||||
if result.error:
|
||||
print(f"\n ✗ Error: {result.error}\n")
|
||||
print("=" * 60 + "\n")
|
||||
return
|
||||
|
||||
for dim in DIMENSIONS:
|
||||
score = getattr(result, dim)
|
||||
bar = self._score_bar(score)
|
||||
justification = result.justifications.get(dim, "")
|
||||
print(f"\n {dim.replace('_', ' ').title()}")
|
||||
print(f" Score: {score:.2f} {bar}")
|
||||
if justification:
|
||||
# Wrap justification at ~60 chars
|
||||
for line in self._wrap(justification, 56):
|
||||
print(f" {line}")
|
||||
|
||||
print("\n" + "-" * 60)
|
||||
print(f" Composite: {result.composite:.3f}")
|
||||
print(f" Time: {result.elapsed_seconds}s")
|
||||
print("=" * 60 + "\n")
|
||||
|
||||
@staticmethod
|
||||
def _score_bar(score: float, width: int = 20) -> str:
|
||||
"""Render a visual bar for a 0-1 score."""
|
||||
filled = int(score * width)
|
||||
return "█" * filled + "░" * (width - filled)
|
||||
|
||||
@staticmethod
|
||||
def _wrap(text: str, width: int) -> list[str]:
|
||||
"""Simple word wrap."""
|
||||
words = text.split()
|
||||
lines: list[str] = []
|
||||
current = ""
|
||||
for word in words:
|
||||
if current and len(current) + len(word) + 1 > width:
|
||||
lines.append(current)
|
||||
current = word
|
||||
else:
|
||||
current = f"{current} {word}" if current else word
|
||||
if current:
|
||||
lines.append(current)
|
||||
return lines
|
||||
Loading…
Add table
Reference in a new issue