feat: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics,…

- "backend/pipeline/quality/scorer.py"
- "backend/pipeline/quality/variant_generator.py"

GSD-Task: S04/T01
This commit is contained in:
jlightner 2026-04-01 09:20:24 +00:00
parent 84e85a52b3
commit e740798f7c
11 changed files with 874 additions and 79 deletions

View file

@ -8,5 +8,5 @@ A fully automated CLI tool that tests FYN-LLM fitness, scores pipeline output ac
|----|-------|------|---------|------|------------| |----|-------|------|---------|------|------------|
| S01 | General FYN-LLM Fitness Suite | medium | — | ✅ | Run `python -m pipeline.quality fitness` — outputs pass/fail for Mandelbrot question, JSON compliance, instruction following, and diverse prompt battery against live FYN-LLM | | S01 | General FYN-LLM Fitness Suite | medium | — | ✅ | Run `python -m pipeline.quality fitness` — outputs pass/fail for Mandelbrot question, JSON compliance, instruction following, and diverse prompt battery against live FYN-LLM |
| S02 | Stage 5 Quality Scorer & Voice Preservation Dial | high | S01 | ✅ | Run scorer on a reference article — outputs composite score across 5 dimensions. Run same article at voice_level 0.2 vs 0.8 — voice preservation score differs meaningfully | | S02 | Stage 5 Quality Scorer & Voice Preservation Dial | high | S01 | ✅ | Run scorer on a reference article — outputs composite score across 5 dimensions. Run same article at voice_level 0.2 vs 0.8 — voice preservation score differs meaningfully |
| S03 | Prompt Variant Generator & Automated A/B Loop | high | S02 | | Run `python -m pipeline.quality optimize --stage 5 --iterations 10` — generates prompt variants, scores each against reference articles, outputs leaderboard and score trajectory chart | | S03 | Prompt Variant Generator & Automated A/B Loop | high | S02 | | Run `python -m pipeline.quality optimize --stage 5 --iterations 10` — generates prompt variants, scores each against reference articles, outputs leaderboard and score trajectory chart |
| S04 | Expand to Pipeline Stages 2-4 | medium | S03 | ⬜ | Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring | | S04 | Expand to Pipeline Stages 2-4 | medium | S03 | ⬜ | Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring |

View file

@ -0,0 +1,96 @@
---
id: S03
parent: M013
milestone: M013
provides:
- optimize CLI subcommand for automated prompt A/B testing
- PromptVariantGenerator for LLM-powered prompt mutation
- OptimizationLoop + OptimizationResult for iterative optimization with full history
- Leaderboard table, ASCII trajectory chart, and timestamped JSON result output
requires:
- slice: S02
provides: ScoreRunner, ScoreResult, LLMClient, 5-dimension scoring rubric, VoiceDial pattern
affects:
- S04
key_files:
- backend/pipeline/quality/variant_generator.py
- backend/pipeline/quality/optimizer.py
- backend/pipeline/quality/__main__.py
- backend/pipeline/quality/results/.gitkeep
key_decisions:
- OptimizationLoop bypasses VoiceDial — owns the full prompt text directly to avoid double-application
- Variant validation uses both length diff and line-level symmetric difference to catch trivial mutations
- Reporting functions live in __main__.py rather than a separate reporting.py — keeps surface area small
patterns_established:
- Meta-prompt pattern: LLM acts as prompt engineer, receives current prompt + per-dimension scores + rubric summary, outputs a modified variant targeting weakest dimensions
- Variant validation gate: min-diff threshold + format marker check before scoring, invalid variants logged and skipped
- Optimization history capture: full iteration×variant matrix stored in OptimizationResult for downstream leaderboard/charting/JSON export
observability_surfaces:
- none
drill_down_paths:
- .gsd/milestones/M013/slices/S03/tasks/T01-SUMMARY.md
- .gsd/milestones/M013/slices/S03/tasks/T02-SUMMARY.md
duration: ""
verification_result: passed
completed_at: 2026-04-01T09:12:07.490Z
blocker_discovered: false
---
# S03: Prompt Variant Generator & Automated A/B Loop
**Automated prompt optimization loop: LLM-powered variant generation, iterative scoring, CLI with leaderboard/trajectory output, and JSON result persistence.**
## What Happened
Built two core modules and wired them into the existing quality CLI. `variant_generator.py` provides `PromptVariantGenerator` — given a base prompt and its per-dimension scores, it calls the LLM with a meta-prompt to generate N variants targeting the weakest scoring dimensions. Each variant is validated: must differ from the base by ≥50 chars (via line-level symmetric difference), must preserve JSON format markers (`SynthesisResult`, `"pages"`). Invalid variants are logged and skipped.
`optimizer.py` provides `OptimizationLoop` — loads the base stage 5 prompt and fixture data, runs a baseline score, then iterates: generate variants → score each via synthesis + the existing 5-dimension scorer → keep the best → repeat. The loop handles LLM errors gracefully (errored variants are skipped, not fatal). `OptimizationResult` captures the full history (iteration, variant index, prompt text, all scores) for downstream reporting.
The `optimize` CLI subcommand accepts `--stage`, `--iterations`, `--variants-per-iter`, `--file`, and `--output-dir`. Stage validation restricts to stage 5 (others print a clear message and exit 1). After the loop completes, three reporting functions fire: `print_leaderboard()` shows top 5 variants ranked by composite score with per-dimension breakdown; `print_trajectory()` renders a 15-row ASCII chart of best composite per iteration; `write_results_json()` persists the full result with config metadata to a timestamped JSON file.
One deviation from the plan: `OptimizationLoop._score_variant()` does its own synthesis call rather than delegating to `ScoreRunner.synthesize_and_score()`, because the loop owns the full prompt text directly and bypassing VoiceDial avoids double-application of voice modifiers.
## Verification
All slice-level verification checks passed:
1. `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"` → exit 0
2. `python -c "from pipeline.quality.optimizer import OptimizationLoop, OptimizationResult; print('optimizer ok')"` → exit 0
3. `python -m pipeline.quality optimize --help` → shows all 5 args (--stage, --iterations, --variants-per-iter, --file, --output-dir)
4. `python -m pipeline.quality optimize --stage 3 --iterations 1 --file ...` → prints "only stage 5 is supported" and exits 1
5. `backend/pipeline/quality/results/.gitkeep` exists
## Requirements Advanced
- R013 — Prompt optimization loop provides automated mechanism to improve prompt templates — generates variants, scores them, and identifies winners
## Requirements Validated
None.
## New Requirements Surfaced
None.
## Requirements Invalidated or Re-scoped
None.
## Deviations
OptimizationLoop._score_variant() performs its own synthesis call instead of delegating to ScoreRunner.synthesize_and_score(). This avoids double-application of VoiceDial modifiers since the optimization loop owns the full prompt text directly.
## Known Limitations
Only stage 5 optimization is supported — other stages print an error and exit 1. This is by design; S04 will extend to stages 2-4.
## Follow-ups
S04 extends optimization to pipeline stages 2-4 with stage-appropriate scoring dimensions.
## Files Created/Modified
- `backend/pipeline/quality/variant_generator.py` — New module: PromptVariantGenerator with meta-prompt, LLM-powered variant generation, and validation (min-diff + format markers)
- `backend/pipeline/quality/optimizer.py` — New module: OptimizationLoop (generate→score→select cycles) and OptimizationResult dataclass with full history
- `backend/pipeline/quality/__main__.py` — Added optimize subparser, print_leaderboard(), print_trajectory(), write_results_json() reporting functions
- `backend/pipeline/quality/results/.gitkeep` — Created results output directory

View file

@ -0,0 +1,73 @@
# S03: Prompt Variant Generator & Automated A/B Loop — UAT
**Milestone:** M013
**Written:** 2026-04-01T09:12:07.490Z
## UAT: S03 — Prompt Variant Generator & Automated A/B Loop
### Preconditions
- Working directory: project root (content-to-kb-automator)
- Python environment with backend dependencies installed
- `backend/pipeline/quality/fixtures/sample_moments.json` exists (created in S02)
### Test 1: Module Imports
**Steps:**
1. Run `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('ok')"`
2. Run `python -c "from pipeline.quality.optimizer import OptimizationLoop, OptimizationResult; print('ok')"`
**Expected:** Both print 'ok' and exit 0.
### Test 2: CLI Help Output
**Steps:**
1. Run `python -m pipeline.quality optimize --help`
**Expected:** Output shows all 5 arguments: --stage (default 5), --iterations (default 10), --variants-per-iter (default 2), --file (required), --output-dir (default backend/pipeline/quality/results/).
### Test 3: Stage Validation — Unsupported Stage
**Steps:**
1. Run `python -m pipeline.quality optimize --stage 3 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`
**Expected:** Prints error containing "only stage 5" and exits with code 1.
### Test 4: Stage Validation — Stage 2
**Steps:**
1. Run `python -m pipeline.quality optimize --stage 2 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`
**Expected:** Same rejection as Test 3 — prints error containing "only stage 5" and exits 1.
### Test 5: Missing Fixture File
**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 1 --file /nonexistent/path.json`
**Expected:** Prints error about missing file and exits with non-zero code. No traceback.
### Test 6: Missing Required --file Arg
**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 1`
**Expected:** argparse error about required --file argument.
### Test 7: Results Directory
**Steps:**
1. Verify `backend/pipeline/quality/results/.gitkeep` exists
**Expected:** File exists, directory is tracked in git.
### Test 8: End-to-End Optimization (requires LLM connectivity)
**Preconditions:** FYN-LLM reachable at configured endpoint.
**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 2 --variants-per-iter 1 --file backend/pipeline/quality/fixtures/sample_moments.json --output-dir /tmp/chrysopedia_test_results/`
2. Check stdout for iteration progress lines
3. Check stdout for leaderboard table (top variants by composite score)
4. Check stdout for ASCII trajectory chart
5. Check `/tmp/chrysopedia_test_results/` for a JSON file matching `optimize_stage5_*.json`
6. Validate JSON contains keys: best_prompt, best_scores, history, config, elapsed_seconds
**Expected:** Loop runs 2 iterations, generates 1 variant per iteration, scores each, prints leaderboard and trajectory, writes JSON result file with all expected keys.
### Edge Case 9: LLM Unreachable
**Steps:**
1. Set LLM endpoint to an unreachable host (e.g., modify config temporarily)
2. Run `python -m pipeline.quality optimize --stage 5 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`
**Expected:** Clean error message about connectivity failure. No Python traceback shown to user.

View file

@ -0,0 +1,16 @@
{
"schemaVersion": 1,
"taskId": "T02",
"unitId": "M013/S03/T02",
"timestamp": 1775034642864,
"passed": true,
"discoverySource": "task-plan",
"checks": [
{
"command": "python -m pipeline.quality optimize --help",
"exitCode": 0,
"durationMs": 496,
"verdict": "pass"
}
]
}

View file

@ -1,6 +1,55 @@
# S04: Expand to Pipeline Stages 2-4 # S04: Expand to Pipeline Stages 2-4
**Goal:** Apply the quality framework to stages 2 (segmentation), 3 (extraction), and 4 (classification) with stage-specific scoring criteria **Goal:** Extend the prompt optimization loop from stage-5-only to stages 2-4, with stage-appropriate scoring rubrics, fixture formats, and variant validation — so `optimize --stage N` works for any pipeline stage.
**Demo:** After this: Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring **Demo:** After this: Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring
## Tasks ## Tasks
- [x] **T01: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic** — Build the STAGE_CONFIGS registry in scorer.py that maps each pipeline stage (2-5) to its scoring rubric, dimensions list, format markers, fixture key requirements, prompt file name, and output schema class. Generalize ScoreResult to use a `scores: dict[str, float]` field instead of 5 named fields (keep backward compat via properties). Add a `score_stage_output()` method to ScoreRunner that accepts arbitrary stage output + input and scores using the stage's rubric. Update variant_generator.py to accept format markers as a parameter rather than using the hardcoded `_FORMAT_MARKERS` list, and generalize the meta-prompt to work for any stage (not just synthesis).
## Context
The existing scorer has a hardcoded `SCORING_RUBRIC` with 5 stage-5 dimensions (structural, content_specificity, voice_preservation, readability, factual_fidelity). `ScoreResult` has these as named float fields. The variant generator has hardcoded `_FORMAT_MARKERS = ['SynthesisResult', '"pages"', 'body_sections', 'title', 'summary']` and a `VARIANT_META_PROMPT` that references 'synthesis prompt' language.
Stages 2-4 need different rubrics:
- Stage 2 (segmentation): coverage_completeness, topic_specificity, boundary_accuracy, summary_quality
- Stage 3 (extraction): moment_richness, timestamp_accuracy, content_type_correctness, summary_actionability, plugin_normalization
- Stage 4 (classification): category_accuracy, tag_completeness, tag_specificity, coverage, no_overlap
Format markers per stage:
- Stage 2: `'segments'`, `'start_index'`, `'end_index'`, `'topic_label'`
- Stage 3: `'moments'`, `'content_type'`, `'raw_transcript'`, `'plugins'`
- Stage 4: `'classifications'`, `'moment_index'`, `'topic_category'`, `'topic_tags'`
- Stage 5: `'SynthesisResult'`, `'"pages"'`, `'body_sections'`, `'title'`, `'summary'` (existing)
Prompt files: `stage2_segmentation.txt`, `stage3_extraction.txt`, `stage4_classification.txt`, `stage5_synthesis.txt`
Schemas: `SegmentationResult`, `ExtractionResult`, `ClassificationResult`, `SynthesisResult` (all in `pipeline.schemas`)
- Estimate: 1.5h
- Files: backend/pipeline/quality/scorer.py, backend/pipeline/quality/variant_generator.py
- Verify: cd backend && python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')" && python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"
- [ ] **T02: Generalize optimizer, create stage 2-4 fixtures, wire CLI, verify end-to-end** — Make OptimizationLoop stage-aware: generalize _load_fixture() to validate stage-specific keys, generalize _score_variant() to call the correct prompt and parse the correct schema per stage, and pass stage-appropriate format markers to the variant generator. Create minimal fixture JSON files for stages 2-4. Remove the stage-5 gate in __main__.py's _run_optimize(), add validation for stages 2-5. Verify all stages import and CLI accepts them.
## Context
The optimizer currently has three stage-5-specific hardcodings:
1. `_load_fixture()` expects `creator_name` and `moments` keys — stages 2-4 have different input shapes
2. `_score_variant()` calls synthesis via `SynthesisResult` schema and formats output as a technique page for the scorer
3. The `run()` method loads `stage{N}_synthesis.txt` — stages 2-4 use different prompt file names
The CLI's `_run_optimize()` rejects `args.stage != 5` with an error.
Stage fixture shapes (from research):
- Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}` — segments of a transcript
- Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}` — a topic group to extract moments from
- Stage 4: `{moments: [{title, summary, content_type, plugins}], taxonomy: "..."}` — moments to classify
- Stage 5: `{creator_name, moments: [...]}` (existing sample_moments.json)
The STAGE_CONFIGS registry from T01 provides the prompt filename, schema class, dimensions, and format markers per stage. This task uses that registry to dispatch correctly.
After changes, `OptimizationLoop.run()` should:
1. Load the prompt file from `STAGE_CONFIGS[stage]['prompt_file']`
2. Load and validate fixture using `STAGE_CONFIGS[stage]['fixture_keys']`
3. In `_score_variant()`, use the stage's schema to parse LLM output, then format it for the scorer's `score_stage_output()`
4. Pass stage-appropriate format markers to variant generator
- Estimate: 2h
- Files: backend/pipeline/quality/optimizer.py, backend/pipeline/quality/__main__.py, backend/pipeline/quality/fixtures/sample_segments.json, backend/pipeline/quality/fixtures/sample_topic_group.json, backend/pipeline/quality/fixtures/sample_classifications.json
- Verify: cd backend && python -c "from pipeline.quality.optimizer import OptimizationLoop; print('optimizer ok')" && python -c "from pipeline.quality.__main__ import main; print('cli ok')" && python -m pipeline.quality optimize --stage 2 --iterations 1 --file pipeline/quality/fixtures/sample_segments.json --help 2>&1 | head -1 && python -m pipeline.quality optimize --stage 6 --file x 2>&1 | grep -q 'stage' && echo 'stage6 rejected ok'

View file

@ -0,0 +1,89 @@
# S04 Research: Expand to Pipeline Stages 2-4
## Depth: Targeted
Known patterns (optimization loop, variant generation, scoring) applied to new stages. Main complexity is defining stage-appropriate scoring rubrics and fixture formats.
## Summary
Extending the optimizer from stage-5-only to stages 2-4 requires four changes per stage: (1) a scoring rubric tailored to what that stage produces, (2) fixture data matching the stage's input shape, (3) format markers for variant validation, and (4) a score-variant pathway that calls the right LLM prompt and parses the right schema.
The existing architecture in `optimizer.py` and `variant_generator.py` is well-factored for this — the main hardcoding is in the scorer's `SCORING_RUBRIC` (stage-5-specific dimensions) and the optimizer's `_score_variant` method (calls synthesis → scores technique page).
## Requirement Coverage
**R013** (Prompt Template System) — already validated but this slice extends it with automated optimization for stages 2-4, strengthening the "re-run extraction on specific videos for calibration" aspect.
## Implementation Landscape
### What exists
| File | Role | Stage-5 Coupling |
|---|---|---|
| `backend/pipeline/quality/scorer.py` | `SCORING_RUBRIC`, `DIMENSIONS`, `ScoreResult`, `ScoreRunner` | SCORING_RUBRIC and DIMENSIONS are hardcoded to 5 stage-5 dimensions. ScoreResult dataclass has those 5 as named fields. `score_page()` expects page JSON shape. `synthesize_and_score()` loads stage5_synthesis.txt. |
| `backend/pipeline/quality/optimizer.py` | `OptimizationLoop`, `OptimizationResult` | `run()` loads `stage{N}_synthesis.txt` (already parameterized). `_score_variant()` calls synthesis and expects `SynthesisResult` schema. Fixture loader expects `{creator_name, moments}`. |
| `backend/pipeline/quality/variant_generator.py` | `PromptVariantGenerator`, `VARIANT_META_PROMPT` | `VARIANT_META_PROMPT` references "synthesis prompt" and synthesis-specific language. `_FORMAT_MARKERS` are `["SynthesisResult", '"pages"', "body_sections", "title", "summary"]` — stage-5-specific. |
| `backend/pipeline/quality/__main__.py` | CLI | `_run_optimize()` rejects `args.stage != 5` with an error message. |
| `backend/pipeline/quality/fixtures/sample_moments.json` | Test fixture | Stage 5 format: `{creator_name, topic_category, moments: [...]}` |
### Prompt files and schemas per stage
| Stage | Prompt File | Input Shape | Output Schema | What to Score |
|---|---|---|---|---|
| 2 (segmentation) | `stage2_segmentation.txt` | Transcript segments: `[idx] (start-end) text` | `SegmentationResult{segments: [{start_index, end_index, topic_label, summary}]}` | Coverage completeness (no gaps/overlaps), topic label specificity, segment boundary accuracy, summary quality |
| 3 (extraction) | `stage3_extraction.txt` | Topic group segments: `(start-end) text` with topic label | `ExtractionResult{moments: [{title, summary, start_time, end_time, content_type, plugins, raw_transcript}]}` | Moment richness (detail density), timestamp accuracy, content_type correctness, summary actionability, plugin name normalization |
| 4 (classification) | `stage4_classification.txt` | Moments list + taxonomy text | `ClassificationResult{classifications: [{moment_index, topic_category, topic_tags, content_type_override}]}` | Category accuracy, tag completeness, tag specificity, coverage (all moments classified), no-overlap (one category per moment) |
### Format markers per stage
Stage 2: `"segments"`, `start_index`, `end_index`, `topic_label`
Stage 3: `"moments"`, `content_type`, `raw_transcript`, `plugins`
Stage 4: `"classifications"`, `moment_index`, `topic_category`, `topic_tags`
### Architecture approach
Two viable approaches:
**A. Stage-specific scorer classes** — Create `Stage2Scorer`, `Stage3Scorer`, `Stage4Scorer` alongside the existing `ScoreRunner` (which becomes `Stage5Scorer`). Each has its own rubric, dimensions, and `score_output()` method. `OptimizationLoop` dispatches to the right scorer based on `self.stage`.
**B. Parameterized scorer with rubric registry** — Keep one `ScoreRunner` class but make it accept a rubric config (dimensions list, rubric text, output parser). A `STAGE_CONFIGS` dict maps stage number → config.
**Recommendation: B (registry).** The scoring flow is identical across stages (send rubric + output + input to LLM judge, parse scores). Only the rubric text, dimensions, format markers, and fixture→LLM-input formatting differ. A registry keeps the surface area small.
### Key design decisions
1. **ScoreResult generalization** — Currently has 5 named fields (`structural`, `content_specificity`, etc.). For stages 2-4 with different dimensions, either: (a) use a generic `scores: dict[str, float]` field, or (b) keep the named fields for stage 5 and add a generic dict for others. Option (a) is cleaner — stage 5's named fields can be populated from the dict.
2. **Fixture format per stage** — Each stage needs different fixture data:
- Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}`
- Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}`
- Stage 4: `{moments: [...], taxonomy: "..."}`
The optimizer's `_load_fixture()` and `_score_variant()` need to be stage-aware.
3. **Variant generator meta-prompt** — The `VARIANT_META_PROMPT` currently references synthesis-specific language. It needs to be generalized or have per-stage variants. The core pattern (analyze scores → improve weakest dimensions → preserve format) is the same.
4. **Fixture creation** — Need sample fixtures for stages 2-4. Can extract from the existing sample_moments.json data (moments → stage 3/4 input) and from a real transcript (stage 2 input). Alternatively, create synthetic minimal fixtures.
### Files to modify
- `backend/pipeline/quality/scorer.py` — Add stage configs registry (rubric, dimensions, format markers per stage), generalize `ScoreResult`, add `score_stage_output()` method
- `backend/pipeline/quality/optimizer.py` — Generalize `_score_variant()` to dispatch per stage, generalize fixture loading
- `backend/pipeline/quality/variant_generator.py` — Generalize meta-prompt and format markers per stage
- `backend/pipeline/quality/__main__.py` — Remove stage-5 gate, add stage validation (2-5 only)
- `backend/pipeline/quality/fixtures/` — Add sample fixtures for stages 2-4
### Natural task seams
1. **Scorer generalization + rubric registry** — Define stage 2-4 scoring dimensions and rubrics, generalize ScoreResult, add stage config registry. This is the foundation.
2. **Optimizer + variant generator generalization** — Make OptimizationLoop stage-aware (fixture loading, score dispatch), generalize variant generator format markers and meta-prompt.
3. **Fixtures + CLI integration** — Create stage 2-4 fixtures, remove the stage-5 gate in CLI, wire everything together, verify end-to-end.
### Verification
- `python -m pipeline.quality optimize --stage 2 --iterations 1 --file <stage2_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 3 --iterations 1 --file <stage3_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 4 --iterations 1 --file <stage4_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 5 --iterations 1 --file <stage5_fixture>` still works (no regression)
- `python -m pipeline.quality optimize --stage 6` → error message
- Import checks for all modified modules pass

View file

@ -0,0 +1,42 @@
---
estimated_steps: 14
estimated_files: 2
skills_used: []
---
# T01: Generalize scorer with stage config registry and update variant generator
Build the STAGE_CONFIGS registry in scorer.py that maps each pipeline stage (2-5) to its scoring rubric, dimensions list, format markers, fixture key requirements, prompt file name, and output schema class. Generalize ScoreResult to use a `scores: dict[str, float]` field instead of 5 named fields (keep backward compat via properties). Add a `score_stage_output()` method to ScoreRunner that accepts arbitrary stage output + input and scores using the stage's rubric. Update variant_generator.py to accept format markers as a parameter rather than using the hardcoded `_FORMAT_MARKERS` list, and generalize the meta-prompt to work for any stage (not just synthesis).
## Context
The existing scorer has a hardcoded `SCORING_RUBRIC` with 5 stage-5 dimensions (structural, content_specificity, voice_preservation, readability, factual_fidelity). `ScoreResult` has these as named float fields. The variant generator has hardcoded `_FORMAT_MARKERS = ['SynthesisResult', '"pages"', 'body_sections', 'title', 'summary']` and a `VARIANT_META_PROMPT` that references 'synthesis prompt' language.
Stages 2-4 need different rubrics:
- Stage 2 (segmentation): coverage_completeness, topic_specificity, boundary_accuracy, summary_quality
- Stage 3 (extraction): moment_richness, timestamp_accuracy, content_type_correctness, summary_actionability, plugin_normalization
- Stage 4 (classification): category_accuracy, tag_completeness, tag_specificity, coverage, no_overlap
Format markers per stage:
- Stage 2: `'segments'`, `'start_index'`, `'end_index'`, `'topic_label'`
- Stage 3: `'moments'`, `'content_type'`, `'raw_transcript'`, `'plugins'`
- Stage 4: `'classifications'`, `'moment_index'`, `'topic_category'`, `'topic_tags'`
- Stage 5: `'SynthesisResult'`, `'"pages"'`, `'body_sections'`, `'title'`, `'summary'` (existing)
Prompt files: `stage2_segmentation.txt`, `stage3_extraction.txt`, `stage4_classification.txt`, `stage5_synthesis.txt`
Schemas: `SegmentationResult`, `ExtractionResult`, `ClassificationResult`, `SynthesisResult` (all in `pipeline.schemas`)
## Inputs
- ``backend/pipeline/quality/scorer.py` — existing ScoreRunner, SCORING_RUBRIC, DIMENSIONS, ScoreResult`
- ``backend/pipeline/quality/variant_generator.py` — existing PromptVariantGenerator, VARIANT_META_PROMPT, _FORMAT_MARKERS`
- ``backend/pipeline/schemas.py` — SegmentationResult, ExtractionResult, ClassificationResult, SynthesisResult schemas`
## Expected Output
- ``backend/pipeline/quality/scorer.py` — STAGE_CONFIGS registry, generalized ScoreResult with scores dict, score_stage_output() method on ScoreRunner`
- ``backend/pipeline/quality/variant_generator.py` — generalized generate() accepting format_markers parameter, stage-agnostic VARIANT_META_PROMPT`
## Verification
cd backend && python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')" && python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"

View file

@ -0,0 +1,79 @@
---
id: T01
parent: S04
milestone: M013
provides: []
requires: []
affects: []
key_files: ["backend/pipeline/quality/scorer.py", "backend/pipeline/quality/variant_generator.py"]
key_decisions: ["Used backward-compat properties on ScoreResult instead of migrating all callers", "Made VARIANT_META_PROMPT a template with {dimension_descriptions} filled per-stage"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "All three verification commands pass: STAGE_CONFIGS has entries for stages 2-5, ScoreResult works with scores dict, backward-compat getattr works, StageConfig.get_schema() resolves all schema classes, PromptVariantGenerator imports cleanly."
completed_at: 2026-04-01T09:20:20.599Z
blocker_discovered: false
---
# T01: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic
> Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic
## What Happened
---
id: T01
parent: S04
milestone: M013
key_files:
- backend/pipeline/quality/scorer.py
- backend/pipeline/quality/variant_generator.py
key_decisions:
- Used backward-compat properties on ScoreResult instead of migrating all callers
- Made VARIANT_META_PROMPT a template with {dimension_descriptions} filled per-stage
duration: ""
verification_result: passed
completed_at: 2026-04-01T09:20:20.600Z
blocker_discovered: false
---
# T01: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic
**Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic**
## What Happened
Built STAGE_CONFIGS registry mapping stages 2-5 to StageConfig objects with rubrics, dimensions, format markers, fixture keys, prompt file names, and schema classes. Generalized ScoreResult from named float fields to a scores dict with backward-compat properties for stage 5. Added score_stage_output() to ScoreRunner for arbitrary stage scoring. Updated variant_generator.py with templatized meta-prompt and format_markers/stage parameters on generate().
## Verification
All three verification commands pass: STAGE_CONFIGS has entries for stages 2-5, ScoreResult works with scores dict, backward-compat getattr works, StageConfig.get_schema() resolves all schema classes, PromptVariantGenerator imports cleanly.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')"` | 0 | ✅ pass | 1000ms |
| 2 | `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"` | 0 | ✅ pass | 1000ms |
| 3 | `python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, DIMENSIONS; [cfg.get_schema() for cfg in STAGE_CONFIGS.values()]; r = ScoreResult(scores={'structural': 0.8}, composite=0.8); assert r.structural == 0.8; print('compat ok')"` | 0 | ✅ pass | 1000ms |
## Deviations
Added SCORING_RUBRIC backward-compat alias. Made VARIANT_META_PROMPT a template string with {dimension_descriptions} placeholder.
## Known Issues
__main__.py line 148 uses getattr pattern that only works for stage 5 — will need updating when optimize CLI is generalized.
## Files Created/Modified
- `backend/pipeline/quality/scorer.py`
- `backend/pipeline/quality/variant_generator.py`
## Deviations
Added SCORING_RUBRIC backward-compat alias. Made VARIANT_META_PROMPT a template string with {dimension_descriptions} placeholder.
## Known Issues
__main__.py line 148 uses getattr pattern that only works for stage 5 — will need updating when optimize CLI is generalized.

View file

@ -0,0 +1,52 @@
---
estimated_steps: 18
estimated_files: 5
skills_used: []
---
# T02: Generalize optimizer, create stage 2-4 fixtures, wire CLI, verify end-to-end
Make OptimizationLoop stage-aware: generalize _load_fixture() to validate stage-specific keys, generalize _score_variant() to call the correct prompt and parse the correct schema per stage, and pass stage-appropriate format markers to the variant generator. Create minimal fixture JSON files for stages 2-4. Remove the stage-5 gate in __main__.py's _run_optimize(), add validation for stages 2-5. Verify all stages import and CLI accepts them.
## Context
The optimizer currently has three stage-5-specific hardcodings:
1. `_load_fixture()` expects `creator_name` and `moments` keys — stages 2-4 have different input shapes
2. `_score_variant()` calls synthesis via `SynthesisResult` schema and formats output as a technique page for the scorer
3. The `run()` method loads `stage{N}_synthesis.txt` — stages 2-4 use different prompt file names
The CLI's `_run_optimize()` rejects `args.stage != 5` with an error.
Stage fixture shapes (from research):
- Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}` — segments of a transcript
- Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}` — a topic group to extract moments from
- Stage 4: `{moments: [{title, summary, content_type, plugins}], taxonomy: "..."}` — moments to classify
- Stage 5: `{creator_name, moments: [...]}` (existing sample_moments.json)
The STAGE_CONFIGS registry from T01 provides the prompt filename, schema class, dimensions, and format markers per stage. This task uses that registry to dispatch correctly.
After changes, `OptimizationLoop.run()` should:
1. Load the prompt file from `STAGE_CONFIGS[stage]['prompt_file']`
2. Load and validate fixture using `STAGE_CONFIGS[stage]['fixture_keys']`
3. In `_score_variant()`, use the stage's schema to parse LLM output, then format it for the scorer's `score_stage_output()`
4. Pass stage-appropriate format markers to variant generator
## Inputs
- ``backend/pipeline/quality/scorer.py` — STAGE_CONFIGS registry and score_stage_output() from T01`
- ``backend/pipeline/quality/variant_generator.py` — generalized generate() from T01`
- ``backend/pipeline/quality/optimizer.py` — existing OptimizationLoop`
- ``backend/pipeline/quality/__main__.py` — existing _run_optimize with stage-5 gate`
- ``backend/pipeline/quality/fixtures/sample_moments.json` — existing stage 5 fixture for regression check`
## Expected Output
- ``backend/pipeline/quality/optimizer.py` — stage-aware OptimizationLoop with generalized _load_fixture, _score_variant, and format marker dispatch`
- ``backend/pipeline/quality/__main__.py` — _run_optimize accepts stages 2-5, rejects others`
- ``backend/pipeline/quality/fixtures/sample_segments.json` — stage 2 fixture`
- ``backend/pipeline/quality/fixtures/sample_topic_group.json` — stage 3 fixture`
- ``backend/pipeline/quality/fixtures/sample_classifications.json` — stage 4 fixture`
## Verification
cd backend && python -c "from pipeline.quality.optimizer import OptimizationLoop; print('optimizer ok')" && python -c "from pipeline.quality.__main__ import main; print('cli ok')" && python -m pipeline.quality optimize --stage 2 --iterations 1 --file pipeline/quality/fixtures/sample_segments.json --help 2>&1 | head -1 && python -m pipeline.quality optimize --stage 6 --file x 2>&1 | grep -q 'stage' && echo 'stage6 rejected ok'

View file

@ -1,11 +1,7 @@
"""Stage 5 quality scorer — LLM-as-judge evaluation across 5 dimensions. """Multi-stage quality scorer — LLM-as-judge evaluation with per-stage rubrics.
Evaluates a synthesized technique page against source moments on: Supports stages 2-5, each with its own scoring dimensions, rubric, format
1. Structural quality section naming, count, paragraph depth markers, fixture key requirements, prompt file name, and output schema.
2. Content specificity concrete details vs vague generalities
3. Voice preservation direct quotes, attributed opinions, personality
4. Readability / flow synthesis quality, logical ordering, no redundancy
5. Factual fidelity no hallucinated specifics, grounded in source moments
Run via: python -m pipeline.quality score --file <path> Run via: python -m pipeline.quality score --file <path>
""" """
@ -16,6 +12,7 @@ import logging
import sys import sys
import time import time
from dataclasses import dataclass, field from dataclasses import dataclass, field
from typing import Any
import openai import openai
from pydantic import BaseModel from pydantic import BaseModel
@ -26,9 +23,177 @@ from pipeline.quality.voice_dial import VoiceDial
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
# ── Scoring rubric (hardcoded for iteration speed) ─────────────────────────── # ── Per-stage configuration registry ─────────────────────────────────────────
SCORING_RUBRIC = """\ class StageConfig:
"""Configuration for scoring a specific pipeline stage."""
def __init__(
self,
stage: int,
dimensions: list[str],
rubric: str,
format_markers: list[str],
fixture_keys: list[str],
prompt_file: str,
schema_class: str,
) -> None:
self.stage = stage
self.dimensions = dimensions
self.rubric = rubric
self.format_markers = format_markers
self.fixture_keys = fixture_keys
self.prompt_file = prompt_file
self.schema_class = schema_class
def get_schema(self) -> type[BaseModel]:
"""Import and return the Pydantic schema class for this stage."""
from pipeline import schemas
return getattr(schemas, self.schema_class)
# ── Stage rubrics ────────────────────────────────────────────────────────────
_STAGE_2_RUBRIC = """\
You are an expert evaluator of transcript segmentation quality for educational content.
You will be given:
1. A segmentation result (JSON with segments, each having start_index, end_index, topic_label, summary)
2. The source transcript segments used as input
Evaluate the segmentation across these 4 dimensions, scoring each 0.0 to 1.0:
**coverage_completeness** All transcript content accounted for
- 0.9-1.0: Every transcript segment is covered by exactly one topic segment, no gaps or overlaps
- 0.5-0.7: Minor gaps or overlaps, but most content is covered
- 0.0-0.3: Large gaps significant transcript segments are not assigned to any topic
**topic_specificity** Topic labels are descriptive and useful
- 0.9-1.0: Labels are specific and descriptive (e.g., "Sidechain compression on kick-bass" not "Audio processing")
- 0.5-0.7: Labels are somewhat specific but could be more descriptive
- 0.0-0.3: Labels are generic or meaningless ("Topic 1", "Discussion", "Audio")
**boundary_accuracy** Segment boundaries align with actual topic transitions
- 0.9-1.0: Boundaries fall at natural topic transitions, segments are coherent units
- 0.5-0.7: Most boundaries are reasonable but some segments mix distinct topics
- 0.0-0.3: Boundaries seem arbitrary, segments contain unrelated content
**summary_quality** Summaries accurately describe segment content
- 0.9-1.0: Summaries capture the key points of each segment concisely and accurately
- 0.5-0.7: Summaries are acceptable but miss some key points or are too vague
- 0.0-0.3: Summaries are inaccurate, too generic, or missing
Return ONLY a JSON object with this exact structure:
{
"coverage_completeness": <float 0.0-1.0>,
"topic_specificity": <float 0.0-1.0>,
"boundary_accuracy": <float 0.0-1.0>,
"summary_quality": <float 0.0-1.0>,
"justifications": {
"coverage_completeness": "<1-2 sentence justification>",
"topic_specificity": "<1-2 sentence justification>",
"boundary_accuracy": "<1-2 sentence justification>",
"summary_quality": "<1-2 sentence justification>"
}
}
"""
_STAGE_3_RUBRIC = """\
You are an expert evaluator of key moment extraction quality for educational content.
You will be given:
1. An extraction result (JSON with moments, each having title, summary, start_time, end_time, content_type, plugins, raw_transcript)
2. The source topic segments used as input
Evaluate the extraction across these 5 dimensions, scoring each 0.0 to 1.0:
**moment_richness** Extracted moments capture substantial, distinct insights
- 0.9-1.0: Each moment represents a meaningful, distinct technique or concept with detailed summary
- 0.5-0.7: Moments are valid but some are thin or overlap significantly with others
- 0.0-0.3: Moments are trivial, redundant, or miss the main techniques discussed
**timestamp_accuracy** Time ranges are plausible and well-bounded
- 0.9-1.0: Start/end times form reasonable ranges, no zero-length or absurdly long spans
- 0.5-0.7: Most timestamps are reasonable but some spans seem too wide or narrow
- 0.0-0.3: Timestamps appear arbitrary or many are zero/identical
**content_type_correctness** Content types match the actual moment content
- 0.9-1.0: Each moment's content_type (technique/settings/reasoning/workflow) accurately describes it
- 0.5-0.7: Most are correct but 1-2 are miscategorized
- 0.0-0.3: Content types seem randomly assigned or all the same
**summary_actionability** Summaries provide actionable, specific information
- 0.9-1.0: Summaries contain concrete details (values, settings, steps) that a practitioner could follow
- 0.5-0.7: Summaries describe the topic but lack specific actionable details
- 0.0-0.3: Summaries are vague ("discusses compression") with no actionable information
**plugin_normalization** Plugin/tool names are correctly identified and normalized
- 0.9-1.0: Plugin names match standard names, no duplicates, captures all mentioned tools
- 0.5-0.7: Most plugins captured but some are misspelled, duplicated, or missed
- 0.0-0.3: Plugin list is mostly empty, contains non-plugins, or has many errors
Return ONLY a JSON object with this exact structure:
{
"moment_richness": <float 0.0-1.0>,
"timestamp_accuracy": <float 0.0-1.0>,
"content_type_correctness": <float 0.0-1.0>,
"summary_actionability": <float 0.0-1.0>,
"plugin_normalization": <float 0.0-1.0>,
"justifications": {
"moment_richness": "<1-2 sentence justification>",
"timestamp_accuracy": "<1-2 sentence justification>",
"content_type_correctness": "<1-2 sentence justification>",
"summary_actionability": "<1-2 sentence justification>",
"plugin_normalization": "<1-2 sentence justification>"
}
}
"""
_STAGE_4_RUBRIC = """\
You are an expert evaluator of content classification quality for educational content.
You will be given:
1. A classification result (JSON with classifications, each having moment_index, topic_category, topic_tags)
2. The source extracted moments used as input
Evaluate the classification across these 4 dimensions, scoring each 0.0 to 1.0:
**category_accuracy** Topic categories are appropriate and meaningful
- 0.9-1.0: Categories accurately reflect the primary topic of each moment, using domain-appropriate labels
- 0.5-0.7: Most categories are reasonable but some are too broad or slightly off
- 0.0-0.3: Categories are generic ("Music"), incorrect, or all the same
**tag_completeness** All relevant tags are captured
- 0.9-1.0: Tags capture the key concepts, tools, and techniques in each moment comprehensively
- 0.5-0.7: Main tags are present but secondary concepts or tools are missed
- 0.0-0.3: Tags are sparse, missing major concepts mentioned in the moments
**tag_specificity** Tags are specific enough to be useful for search/filtering
- 0.9-1.0: Tags are specific ("sidechain compression", "Pro-Q 3") not generic ("audio", "mixing")
- 0.5-0.7: Mix of specific and generic tags
- 0.0-0.3: Tags are too generic to meaningfully distinguish moments
**coverage** All moments are classified
- 0.9-1.0: Every moment_index from the input has a corresponding classification entry
- 0.5-0.7: Most moments classified but 1-2 are missing
- 0.0-0.3: Many moments are not classified
Return ONLY a JSON object with this exact structure:
{
"category_accuracy": <float 0.0-1.0>,
"tag_completeness": <float 0.0-1.0>,
"tag_specificity": <float 0.0-1.0>,
"coverage": <float 0.0-1.0>,
"justifications": {
"category_accuracy": "<1-2 sentence justification>",
"tag_completeness": "<1-2 sentence justification>",
"tag_specificity": "<1-2 sentence justification>",
"coverage": "<1-2 sentence justification>"
}
}
"""
_STAGE_5_RUBRIC = """\
You are an expert evaluator of synthesized technique articles for music production education. You are an expert evaluator of synthesized technique articles for music production education.
You will be given: You will be given:
@ -79,73 +244,142 @@ Return ONLY a JSON object with this exact structure:
} }
""" """
DIMENSIONS = [ # Backward-compat alias used by synthesize_and_score and external references
"structural", SCORING_RUBRIC = _STAGE_5_RUBRIC
"content_specificity",
"voice_preservation", # Build the stage configs registry
"readability", STAGE_CONFIGS: dict[int, StageConfig] = {
"factual_fidelity", 2: StageConfig(
] stage=2,
dimensions=["coverage_completeness", "topic_specificity", "boundary_accuracy", "summary_quality"],
rubric=_STAGE_2_RUBRIC,
format_markers=["segments", "start_index", "end_index", "topic_label"],
fixture_keys=["transcript_segments"],
prompt_file="stage2_segmentation.txt",
schema_class="SegmentationResult",
),
3: StageConfig(
stage=3,
dimensions=["moment_richness", "timestamp_accuracy", "content_type_correctness", "summary_actionability", "plugin_normalization"],
rubric=_STAGE_3_RUBRIC,
format_markers=["moments", "content_type", "raw_transcript", "plugins"],
fixture_keys=["topic_segments"],
prompt_file="stage3_extraction.txt",
schema_class="ExtractionResult",
),
4: StageConfig(
stage=4,
dimensions=["category_accuracy", "tag_completeness", "tag_specificity", "coverage"],
rubric=_STAGE_4_RUBRIC,
format_markers=["classifications", "moment_index", "topic_category", "topic_tags"],
fixture_keys=["extracted_moments"],
prompt_file="stage4_classification.txt",
schema_class="ClassificationResult",
),
5: StageConfig(
stage=5,
dimensions=["structural", "content_specificity", "voice_preservation", "readability", "factual_fidelity"],
rubric=SCORING_RUBRIC,
format_markers=["SynthesisResult", '"pages"', "body_sections", "title", "summary"],
fixture_keys=["key_moments", "creator_name"],
prompt_file="stage5_synthesis.txt",
schema_class="SynthesisResult",
),
}
# Backward-compatible alias: stage 5 dimensions list
DIMENSIONS = STAGE_CONFIGS[5].dimensions
# ── Result type ────────────────────────────────────────────────────────────── # ── Result type ──────────────────────────────────────────────────────────────
@dataclass @dataclass
class ScoreResult: class ScoreResult:
"""Outcome of scoring a technique page across 5 quality dimensions.""" """Outcome of scoring a stage output across quality dimensions.
structural: float = 0.0 Uses a generic ``scores`` dict keyed by dimension name. Stage 5's
content_specificity: float = 0.0 original named fields (structural, content_specificity, ) are
voice_preservation: float = 0.0 preserved as properties for backward compatibility.
readability: float = 0.0 """
factual_fidelity: float = 0.0
scores: dict[str, float] = field(default_factory=dict)
composite: float = 0.0 composite: float = 0.0
justifications: dict[str, str] = field(default_factory=dict) justifications: dict[str, str] = field(default_factory=dict)
elapsed_seconds: float = 0.0 elapsed_seconds: float = 0.0
error: str | None = None error: str | None = None
# ── Backward-compat properties for stage 5 named dimensions ──────
@property
def structural(self) -> float:
return self.scores.get("structural", 0.0)
@property
def content_specificity(self) -> float:
return self.scores.get("content_specificity", 0.0)
@property
def voice_preservation(self) -> float:
return self.scores.get("voice_preservation", 0.0)
@property
def readability(self) -> float:
return self.scores.get("readability", 0.0)
@property
def factual_fidelity(self) -> float:
return self.scores.get("factual_fidelity", 0.0)
# ── Runner ─────────────────────────────────────────────────────────────────── # ── Runner ───────────────────────────────────────────────────────────────────
class ScoreRunner: class ScoreRunner:
"""Scores a Stage 5 technique page using LLM-as-judge evaluation.""" """Scores pipeline stage outputs using LLM-as-judge evaluation."""
def __init__(self, client: LLMClient) -> None: def __init__(self, client: LLMClient) -> None:
self.client = client self.client = client
def score_page( # ── Generic stage scorer ─────────────────────────────────────────────
def score_stage_output(
self, self,
page_json: dict, stage: int,
moments: list[dict], output_json: dict | list,
input_json: dict | list,
) -> ScoreResult: ) -> ScoreResult:
"""Evaluate a technique page against source moments. """Score an arbitrary stage's output against its input.
Parameters Parameters
---------- ----------
page_json: stage:
Synthesized page dict (title, summary, body_sections). Pipeline stage number (2-5).
moments: output_json:
Source key moments with transcript_excerpt, summary, etc. The stage output to evaluate (parsed JSON).
input_json:
The stage input / source material.
Returns Returns
------- -------
ScoreResult with per-dimension scores and justifications. ScoreResult with per-dimension scores for the requested stage.
""" """
# Build the user prompt with the page and source moments if stage not in STAGE_CONFIGS:
return ScoreResult(error=f"No config for stage {stage}. Valid: {sorted(STAGE_CONFIGS)}")
cfg = STAGE_CONFIGS[stage]
user_prompt = ( user_prompt = (
"## Synthesized Technique Page\n\n" "## Stage Output\n\n"
f"```json\n{json.dumps(page_json, indent=2)}\n```\n\n" f"```json\n{json.dumps(output_json, indent=2)}\n```\n\n"
"## Source Key Moments\n\n" "## Stage Input\n\n"
f"```json\n{json.dumps(moments, indent=2)}\n```\n\n" f"```json\n{json.dumps(input_json, indent=2)}\n```\n\n"
"Score this page across all 5 dimensions." f"Score this stage {stage} output across all {len(cfg.dimensions)} dimensions."
) )
t0 = time.monotonic() t0 = time.monotonic()
try: try:
resp = self.client.complete( resp = self.client.complete(
system_prompt=SCORING_RUBRIC, system_prompt=cfg.rubric,
user_prompt=user_prompt, user_prompt=user_prompt,
response_model=BaseModel, # triggers JSON mode response_model=BaseModel,
modality="chat", modality="chat",
) )
elapsed = round(time.monotonic() - t0, 2) elapsed = round(time.monotonic() - t0, 2)
@ -155,13 +389,9 @@ class ScoreRunner:
fallback = self.client.settings.llm_fallback_url fallback = self.client.settings.llm_fallback_url
return ScoreResult( return ScoreResult(
elapsed_seconds=elapsed, elapsed_seconds=elapsed,
error=( error=f"Cannot reach LLM endpoint at {url} (fallback {fallback}). Error: {exc}",
f"Cannot reach LLM endpoint at {url} (fallback {fallback}). "
f"Error: {exc}"
),
) )
# Parse the LLM judge response
raw_text = str(resp).strip() raw_text = str(resp).strip()
try: try:
parsed = json.loads(raw_text) parsed = json.loads(raw_text)
@ -172,10 +402,27 @@ class ScoreRunner:
error=f"Malformed judge response (not valid JSON). Raw excerpt: {raw_text[:200]}", error=f"Malformed judge response (not valid JSON). Raw excerpt: {raw_text[:200]}",
) )
return self._parse_scores(parsed, elapsed, cfg.dimensions)
# ── Stage 5 convenience (backward compat) ────────────────────────────
def score_page(
self,
page_json: dict,
moments: list[dict],
) -> ScoreResult:
"""Evaluate a stage 5 technique page against source moments."""
return self.score_stage_output(
stage=5,
output_json=page_json,
input_json=moments,
)
return self._parse_scores(parsed, elapsed) return self._parse_scores(parsed, elapsed)
def _parse_scores(self, parsed: dict, elapsed: float) -> ScoreResult: def _parse_scores(self, parsed: dict, elapsed: float, dimensions: list[str] | None = None) -> ScoreResult:
"""Extract and validate scores from parsed JSON response.""" """Extract and validate scores from parsed JSON response."""
dims = dimensions or DIMENSIONS
scores: dict[str, float] = {} scores: dict[str, float] = {}
justifications: dict[str, str] = {} justifications: dict[str, str] = {}
@ -183,7 +430,7 @@ class ScoreRunner:
if not isinstance(raw_justifications, dict): if not isinstance(raw_justifications, dict):
raw_justifications = {} raw_justifications = {}
for dim in DIMENSIONS: for dim in dims:
raw = parsed.get(dim) raw = parsed.get(dim)
if raw is None: if raw is None:
logger.warning("Missing dimension '%s' in judge response", dim) logger.warning("Missing dimension '%s' in judge response", dim)
@ -202,14 +449,10 @@ class ScoreRunner:
justifications[dim] = str(raw_justifications.get(dim, "")) justifications[dim] = str(raw_justifications.get(dim, ""))
composite = sum(scores.values()) / len(DIMENSIONS) composite = sum(scores.values()) / len(dims) if dims else 0.0
return ScoreResult( return ScoreResult(
structural=scores["structural"], scores=scores,
content_specificity=scores["content_specificity"],
voice_preservation=scores["voice_preservation"],
readability=scores["readability"],
factual_fidelity=scores["factual_fidelity"],
composite=round(composite, 3), composite=round(composite, 3),
justifications=justifications, justifications=justifications,
elapsed_seconds=elapsed, elapsed_seconds=elapsed,
@ -318,10 +561,13 @@ class ScoreRunner:
result.elapsed_seconds = round(result.elapsed_seconds + elapsed_synth, 2) result.elapsed_seconds = round(result.elapsed_seconds + elapsed_synth, 2)
return result return result
def print_report(self, result: ScoreResult) -> None: def print_report(self, result: ScoreResult, stage: int = 5) -> None:
"""Print a formatted scoring report to stdout.""" """Print a formatted scoring report to stdout."""
dims = STAGE_CONFIGS[stage].dimensions if stage in STAGE_CONFIGS else list(result.scores.keys())
stage_label = f"STAGE {stage}" if stage in STAGE_CONFIGS else "QUALITY"
print("\n" + "=" * 60) print("\n" + "=" * 60)
print(" STAGE 5 QUALITY SCORE REPORT") print(f" {stage_label} QUALITY SCORE REPORT")
print("=" * 60) print("=" * 60)
if result.error: if result.error:
@ -329,8 +575,8 @@ class ScoreRunner:
print("=" * 60 + "\n") print("=" * 60 + "\n")
return return
for dim in DIMENSIONS: for dim in dims:
score = getattr(result, dim) score = result.scores.get(dim, 0.0)
bar = self._score_bar(score) bar = self._score_bar(score)
justification = result.justifications.get(dim, "") justification = result.justifications.get(dim, "")
print(f"\n {dim.replace('_', ' ').title()}") print(f"\n {dim.replace('_', ' ').title()}")

View file

@ -4,13 +4,17 @@ Uses a meta-prompt to instruct the LLM to act as a prompt engineer,
analyzing per-dimension scores and producing targeted prompt mutations analyzing per-dimension scores and producing targeted prompt mutations
that improve the weakest scoring dimensions while preserving the JSON that improve the weakest scoring dimensions while preserving the JSON
output format required by downstream parsing. output format required by downstream parsing.
Supports any pipeline stage (2-5) callers pass the stage's dimensions
and format markers so the meta-prompt and validation adapt automatically.
""" """
from __future__ import annotations from __future__ import annotations
import logging import logging
from typing import Sequence
from pipeline.llm_client import LLMClient from pipeline.llm_client import LLMClient
from pipeline.quality.scorer import DIMENSIONS, ScoreResult from pipeline.quality.scorer import DIMENSIONS, STAGE_CONFIGS, ScoreResult
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@ -18,29 +22,24 @@ logger = logging.getLogger(__name__)
# ── Meta-prompt for variant generation ──────────────────────────────────────── # ── Meta-prompt for variant generation ────────────────────────────────────────
VARIANT_META_PROMPT = """\ VARIANT_META_PROMPT = """\
You are an expert prompt engineer specializing in LLM-powered content synthesis. You are an expert prompt engineer specializing in LLM-powered content processing pipelines.
Your task: given a synthesis prompt and its quality evaluation scores, produce an Your task: given a pipeline stage prompt and its quality evaluation scores, produce an
improved variant of the prompt that targets the weakest-scoring dimensions while improved variant of the prompt that targets the weakest-scoring dimensions while
maintaining or improving the others. maintaining or improving the others.
## Scoring Dimensions (each 0.01.0) ## Scoring Dimensions (each 0.01.0)
- **structural** Section naming, count (3-6), paragraph depth (2-5 per section) {dimension_descriptions}
- **content_specificity** Concrete details: frequencies, time values, ratios, plugin names, dB values
- **voice_preservation** Direct quotes preserved, opinions attributed to creator by name, personality retained
- **readability** Cohesive article flow, related info merged, no redundancy or contradiction
- **factual_fidelity** Every claim traceable to source material, no hallucinated specifics
## Rules ## Rules
1. Focus your changes on the weakest 1-2 dimensions. Don't dilute the prompt by trying to fix everything. 1. Focus your changes on the weakest 1-2 dimensions. Don't dilute the prompt by trying to fix everything.
2. Add specific, actionable instructions not vague encouragements. 2. Add specific, actionable instructions not vague encouragements.
3. **CRITICAL: You MUST preserve the JSON output format section of the prompt EXACTLY as-is.** 3. **CRITICAL: You MUST preserve the JSON output format section of the prompt EXACTLY as-is.**
The prompt contains instructions about outputting a JSON object with a specific schema The prompt contains instructions about outputting a JSON object with a specific schema.
(SynthesisResult with "pages" containing title, summary, body_sections, etc.).
Do NOT modify, remove, or rephrase any part of the JSON format instructions. Do NOT modify, remove, or rephrase any part of the JSON format instructions.
Your changes should target the prose synthesis guidelines only. Your changes should target the processing/analysis guidelines only.
4. Keep the overall prompt length within 2x of the original. Don't bloat it. 4. Keep the overall prompt length within 2x of the original. Don't bloat it.
5. Make substantive changes rewording a sentence or adding one adjective is not enough. 5. Make substantive changes rewording a sentence or adding one adjective is not enough.
@ -50,9 +49,38 @@ Return ONLY the full modified prompt text. No explanation, no markdown fences, n
Just the complete prompt that could be used directly as a system prompt. Just the complete prompt that could be used directly as a system prompt.
""" """
# Dimension descriptions per stage, used to fill the meta-prompt template.
_DIMENSION_DESCRIPTIONS: dict[int, str] = {
2: (
"- **coverage_completeness** — All transcript content accounted for, no gaps or overlaps\n"
"- **topic_specificity** — Topic labels are descriptive and useful, not generic\n"
"- **boundary_accuracy** — Segment boundaries align with actual topic transitions\n"
"- **summary_quality** — Summaries accurately describe segment content"
),
3: (
"- **moment_richness** — Extracted moments capture substantial, distinct insights\n"
"- **timestamp_accuracy** — Time ranges are plausible and well-bounded\n"
"- **content_type_correctness** — Content types match the actual moment content\n"
"- **summary_actionability** — Summaries provide actionable, specific information\n"
"- **plugin_normalization** — Plugin/tool names are correctly identified and normalized"
),
4: (
"- **category_accuracy** — Topic categories are appropriate and meaningful\n"
"- **tag_completeness** — All relevant tags are captured\n"
"- **tag_specificity** — Tags are specific enough to be useful for search/filtering\n"
"- **coverage** — All moments are classified"
),
5: (
"- **structural** — Section naming, count (3-6), paragraph depth (2-5 per section)\n"
"- **content_specificity** — Concrete details: frequencies, time values, ratios, plugin names, dB values\n"
"- **voice_preservation** — Direct quotes preserved, opinions attributed to creator by name, personality retained\n"
"- **readability** — Cohesive article flow, related info merged, no redundancy or contradiction\n"
"- **factual_fidelity** — Every claim traceable to source material, no hallucinated specifics"
),
}
# Format markers that must survive variant generation — if any of these
# are present in the base prompt, the variant must also contain them. # Legacy default format markers for stage 5
_FORMAT_MARKERS = ["SynthesisResult", '"pages"', "body_sections", "title", "summary"] _FORMAT_MARKERS = ["SynthesisResult", '"pages"', "body_sections", "title", "summary"]
@ -71,6 +99,9 @@ class PromptVariantGenerator:
base_prompt: str, base_prompt: str,
scores: ScoreResult, scores: ScoreResult,
n: int = 2, n: int = 2,
*,
format_markers: Sequence[str] | None = None,
stage: int = 5,
) -> list[str]: ) -> list[str]:
"""Generate up to *n* valid prompt variants. """Generate up to *n* valid prompt variants.
@ -83,27 +114,48 @@ class PromptVariantGenerator:
Parameters Parameters
---------- ----------
base_prompt: base_prompt:
The current best synthesis prompt text. The current best prompt text for the target stage.
scores: scores:
ScoreResult from the most recent evaluation of *base_prompt*. ScoreResult from the most recent evaluation of *base_prompt*.
n: n:
Number of variants to attempt generating. Number of variants to attempt generating.
format_markers:
Override format markers for validation. When *None*, uses the
markers from ``STAGE_CONFIGS[stage]`` (falling back to stage 5
defaults for backward compat).
stage:
Pipeline stage number (2-5), used to select dimension
descriptions for the meta-prompt and default format markers.
Returns Returns
------- -------
list[str] list[str]
Valid variant prompt strings (may be fewer than *n*). Valid variant prompt strings (may be fewer than *n*).
""" """
user_prompt = self._build_user_prompt(base_prompt, scores) # Resolve format markers and dimensions for the target stage
if format_markers is not None:
markers = list(format_markers)
elif stage in STAGE_CONFIGS:
markers = STAGE_CONFIGS[stage].format_markers
else:
markers = _FORMAT_MARKERS
dimensions = STAGE_CONFIGS[stage].dimensions if stage in STAGE_CONFIGS else DIMENSIONS
# Build the system prompt with stage-appropriate dimension descriptions
dim_desc = _DIMENSION_DESCRIPTIONS.get(stage, _DIMENSION_DESCRIPTIONS[5])
system_prompt = VARIANT_META_PROMPT.format(dimension_descriptions=dim_desc)
user_prompt = self._build_user_prompt(base_prompt, scores, dimensions)
# Identify which format markers are actually present in the base # Identify which format markers are actually present in the base
required_markers = [m for m in _FORMAT_MARKERS if m in base_prompt] required_markers = [m for m in markers if m in base_prompt]
variants: list[str] = [] variants: list[str] = []
for i in range(n): for i in range(n):
logger.info("Generating variant %d/%d...", i + 1, n) logger.info("Generating variant %d/%d (stage %d)...", i + 1, n, stage)
try: try:
raw = self.client.complete( raw = self.client.complete(
system_prompt=VARIANT_META_PROMPT, system_prompt=system_prompt,
user_prompt=user_prompt, user_prompt=user_prompt,
response_model=None, # free-form text, not JSON response_model=None, # free-form text, not JSON
modality="chat", modality="chat",
@ -127,11 +179,12 @@ class PromptVariantGenerator:
# ── Internal helpers ────────────────────────────────────────────────── # ── Internal helpers ──────────────────────────────────────────────────
def _build_user_prompt(self, base_prompt: str, scores: ScoreResult) -> str: def _build_user_prompt(self, base_prompt: str, scores: ScoreResult, dimensions: list[str] | None = None) -> str:
"""Build the user message describing the current prompt and its scores.""" """Build the user message describing the current prompt and its scores."""
dims = dimensions or DIMENSIONS
# Build per-dimension score lines, sorted worst-first # Build per-dimension score lines, sorted worst-first
dim_lines: list[str] = [] dim_lines: list[str] = []
dim_scores = [(d, getattr(scores, d, 0.0)) for d in DIMENSIONS] dim_scores = [(d, scores.scores.get(d, 0.0)) for d in dims]
dim_scores.sort(key=lambda x: x[1]) dim_scores.sort(key=lambda x: x[1])
for dim, val in dim_scores: for dim, val in dim_scores: