feat: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics,…

- "backend/pipeline/quality/scorer.py"
- "backend/pipeline/quality/variant_generator.py"

GSD-Task: S04/T01

parent 84e85a52b3
commit e740798f7c

11 changed files with 874 additions and 79 deletions
@@ -8,5 +8,5 @@ A fully automated CLI tool that tests FYN-LLM fitness, scores pipeline output ac
 |----|-------|------|---------|------|------------|
 | S01 | General FYN-LLM Fitness Suite | medium | — | ✅ | Run `python -m pipeline.quality fitness` — outputs pass/fail for Mandelbrot question, JSON compliance, instruction following, and diverse prompt battery against live FYN-LLM |
 | S02 | Stage 5 Quality Scorer & Voice Preservation Dial | high | S01 | ✅ | Run scorer on a reference article — outputs composite score across 5 dimensions. Run same article at voice_level 0.2 vs 0.8 — voice preservation score differs meaningfully |
-| S03 | Prompt Variant Generator & Automated A/B Loop | high | S02 | ⬜ | Run `python -m pipeline.quality optimize --stage 5 --iterations 10` — generates prompt variants, scores each against reference articles, outputs leaderboard and score trajectory chart |
+| S03 | Prompt Variant Generator & Automated A/B Loop | high | S02 | ✅ | Run `python -m pipeline.quality optimize --stage 5 --iterations 10` — generates prompt variants, scores each against reference articles, outputs leaderboard and score trajectory chart |
 | S04 | Expand to Pipeline Stages 2-4 | medium | S03 | ⬜ | Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring |
.gsd/milestones/M013/slices/S03/S03-SUMMARY.md (new file, 96 lines)

@@ -0,0 +1,96 @@
---
id: S03
parent: M013
milestone: M013
provides:
  - optimize CLI subcommand for automated prompt A/B testing
  - PromptVariantGenerator for LLM-powered prompt mutation
  - OptimizationLoop + OptimizationResult for iterative optimization with full history
  - Leaderboard table, ASCII trajectory chart, and timestamped JSON result output
requires:
  - slice: S02
    provides: ScoreRunner, ScoreResult, LLMClient, 5-dimension scoring rubric, VoiceDial pattern
affects:
  - S04
key_files:
  - backend/pipeline/quality/variant_generator.py
  - backend/pipeline/quality/optimizer.py
  - backend/pipeline/quality/__main__.py
  - backend/pipeline/quality/results/.gitkeep
key_decisions:
  - OptimizationLoop bypasses VoiceDial — owns the full prompt text directly to avoid double-application
  - Variant validation uses both length diff and line-level symmetric difference to catch trivial mutations
  - Reporting functions live in __main__.py rather than a separate reporting.py — keeps surface area small
patterns_established:
  - "Meta-prompt pattern: LLM acts as prompt engineer, receives current prompt + per-dimension scores + rubric summary, outputs a modified variant targeting weakest dimensions"
  - "Variant validation gate: min-diff threshold + format marker check before scoring, invalid variants logged and skipped"
  - "Optimization history capture: full iteration×variant matrix stored in OptimizationResult for downstream leaderboard/charting/JSON export"
observability_surfaces:
  - none
drill_down_paths:
  - .gsd/milestones/M013/slices/S03/tasks/T01-SUMMARY.md
  - .gsd/milestones/M013/slices/S03/tasks/T02-SUMMARY.md
duration: ""
verification_result: passed
completed_at: 2026-04-01T09:12:07.490Z
blocker_discovered: false
---

# S03: Prompt Variant Generator & Automated A/B Loop

**Automated prompt optimization loop: LLM-powered variant generation, iterative scoring, CLI with leaderboard/trajectory output, and JSON result persistence.**

## What Happened

Built two core modules and wired them into the existing quality CLI. `variant_generator.py` provides `PromptVariantGenerator` — given a base prompt and its per-dimension scores, it calls the LLM with a meta-prompt to generate N variants targeting the weakest scoring dimensions. Each variant is validated: it must differ from the base by ≥50 chars (via line-level symmetric difference) and must preserve JSON format markers (`SynthesisResult`, `"pages"`). Invalid variants are logged and skipped.
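
A minimal sketch of that validation gate, assuming the base prompt and markers are plain strings (the function and parameter names here are illustrative, not the module's actual API):

```python
def is_valid_variant(base: str, variant: str, markers: list[str],
                     min_diff_chars: int = 50) -> bool:
    """Reject trivial mutations and format-breaking rewrites before scoring."""
    # Line-level symmetric difference: lines that appear in only one version.
    changed = set(base.splitlines()) ^ set(variant.splitlines())
    if sum(len(line) for line in changed) < min_diff_chars:
        return False  # too close to the base prompt to be worth an LLM scoring call
    # Every required format marker (e.g. 'SynthesisResult', '"pages"') must survive.
    return all(marker in variant for marker in markers)
```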

`optimizer.py` provides `OptimizationLoop` — loads the base stage 5 prompt and fixture data, runs a baseline score, then iterates: generate variants → score each via synthesis + the existing 5-dimension scorer → keep the best → repeat. The loop handles LLM errors gracefully (errored variants are skipped, not fatal). `OptimizationResult` captures the full history (iteration, variant index, prompt text, all scores) for downstream reporting.
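
The control flow, sketched under the assumption that variant generation and scoring are the only LLM touchpoints (method and field names are paraphrased from this summary, not guaranteed to match the module; the result field names mirror the JSON keys listed in the UAT):

```python
import logging

logger = logging.getLogger(__name__)

def optimize(loop) -> "OptimizationResult":
    best_prompt = loop.load_base_prompt()         # stage 5 prompt text
    best = loop.score(best_prompt)                # baseline composite score
    history = [(0, 0, best_prompt, best)]
    for iteration in range(1, loop.iterations + 1):
        variants = loop.generator.generate(best_prompt, best.scores)
        for idx, variant in enumerate(variants):
            try:
                result = loop.score(variant)      # synthesis + 5-dimension judge
            except Exception as exc:              # errored variants are skipped, not fatal
                logger.warning("iteration %d variant %d failed: %s", iteration, idx, exc)
                continue
            history.append((iteration, idx, variant, result))
            if result.composite > best.composite:
                best_prompt, best = variant, result   # keep the best
    return OptimizationResult(best_prompt=best_prompt, best_scores=best, history=history)
```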

The `optimize` CLI subcommand accepts `--stage`, `--iterations`, `--variants-per-iter`, `--file`, and `--output-dir`. Stage validation restricts to stage 5 (others print a clear message and exit 1). After the loop completes, three reporting functions fire: `print_leaderboard()` shows top 5 variants ranked by composite score with per-dimension breakdown; `print_trajectory()` renders a 15-row ASCII chart of best composite per iteration; `write_results_json()` persists the full result with config metadata to a timestamped JSON file.
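
The 15-row chart suggests a fixed-height render along these lines (a sketch under that assumption, not the actual implementation):

```python
def print_trajectory(best_per_iteration: list[float], rows: int = 15) -> None:
    """Render the best composite score per iteration as a fixed-height ASCII chart."""
    lo, hi = min(best_per_iteration), max(best_per_iteration)
    span = (hi - lo) or 1.0  # avoid division by zero on a flat trajectory
    for row in range(rows, 0, -1):
        threshold = lo + span * row / rows
        bars = "".join("#" if value >= threshold else " " for value in best_per_iteration)
        print(f"{threshold:5.2f} | {bars}")
    print("      +" + "-" * len(best_per_iteration))
```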

One deviation from the plan: `OptimizationLoop._score_variant()` does its own synthesis call rather than delegating to `ScoreRunner.synthesize_and_score()`, because the loop owns the full prompt text directly and bypassing VoiceDial avoids double-application of voice modifiers.

## Verification

All slice-level verification checks passed:

1. `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"` → exit 0
2. `python -c "from pipeline.quality.optimizer import OptimizationLoop, OptimizationResult; print('optimizer ok')"` → exit 0
3. `python -m pipeline.quality optimize --help` → shows all 5 args (--stage, --iterations, --variants-per-iter, --file, --output-dir)
4. `python -m pipeline.quality optimize --stage 3 --iterations 1 --file ...` → prints "only stage 5 is supported" and exits 1
5. `backend/pipeline/quality/results/.gitkeep` exists

## Requirements Advanced

- R013 — Prompt optimization loop provides automated mechanism to improve prompt templates — generates variants, scores them, and identifies winners

## Requirements Validated

None.

## New Requirements Surfaced

None.

## Requirements Invalidated or Re-scoped

None.

## Deviations

OptimizationLoop._score_variant() performs its own synthesis call instead of delegating to ScoreRunner.synthesize_and_score(). This avoids double-application of VoiceDial modifiers since the optimization loop owns the full prompt text directly.

## Known Limitations

Only stage 5 optimization is supported — other stages print an error and exit 1. This is by design; S04 will extend to stages 2-4.

## Follow-ups

S04 extends optimization to pipeline stages 2-4 with stage-appropriate scoring dimensions.

## Files Created/Modified

- `backend/pipeline/quality/variant_generator.py` — New module: PromptVariantGenerator with meta-prompt, LLM-powered variant generation, and validation (min-diff + format markers)
- `backend/pipeline/quality/optimizer.py` — New module: OptimizationLoop (generate→score→select cycles) and OptimizationResult dataclass with full history
- `backend/pipeline/quality/__main__.py` — Added optimize subparser, print_leaderboard(), print_trajectory(), write_results_json() reporting functions
- `backend/pipeline/quality/results/.gitkeep` — Created results output directory

.gsd/milestones/M013/slices/S03/S03-UAT.md (new file, 73 lines)

@@ -0,0 +1,73 @@
# S03: Prompt Variant Generator & Automated A/B Loop — UAT

**Milestone:** M013
**Written:** 2026-04-01T09:12:07.490Z

## UAT: S03 — Prompt Variant Generator & Automated A/B Loop

### Preconditions

- Working directory: project root (content-to-kb-automator)
- Python environment with backend dependencies installed
- `backend/pipeline/quality/fixtures/sample_moments.json` exists (created in S02)

### Test 1: Module Imports

**Steps:**
1. Run `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('ok')"`
2. Run `python -c "from pipeline.quality.optimizer import OptimizationLoop, OptimizationResult; print('ok')"`

**Expected:** Both print 'ok' and exit 0.

### Test 2: CLI Help Output

**Steps:**
1. Run `python -m pipeline.quality optimize --help`

**Expected:** Output shows all 5 arguments: --stage (default 5), --iterations (default 10), --variants-per-iter (default 2), --file (required), --output-dir (default backend/pipeline/quality/results/).
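
Those defaults imply subparser wiring along these lines (a sketch; the actual `__main__.py` may differ):

```python
import argparse

def add_optimize_parser(subparsers) -> None:
    """Register the 'optimize' subcommand with the defaults listed above."""
    p = subparsers.add_parser("optimize", help="automated prompt A/B optimization")
    p.add_argument("--stage", type=int, default=5)
    p.add_argument("--iterations", type=int, default=10)
    p.add_argument("--variants-per-iter", type=int, default=2)
    p.add_argument("--file", required=True, help="fixture JSON path")
    p.add_argument("--output-dir", default="backend/pipeline/quality/results/")
```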

### Test 3: Stage Validation — Unsupported Stage

**Steps:**
1. Run `python -m pipeline.quality optimize --stage 3 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`

**Expected:** Prints error containing "only stage 5" and exits with code 1.

### Test 4: Stage Validation — Stage 2

**Steps:**
1. Run `python -m pipeline.quality optimize --stage 2 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`

**Expected:** Same rejection as Test 3 — prints error containing "only stage 5" and exits 1.

### Test 5: Missing Fixture File

**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 1 --file /nonexistent/path.json`

**Expected:** Prints error about missing file and exits with non-zero code. No traceback.

### Test 6: Missing Required --file Arg

**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 1`

**Expected:** argparse error about required --file argument.

### Test 7: Results Directory

**Steps:**
1. Verify `backend/pipeline/quality/results/.gitkeep` exists

**Expected:** File exists, directory is tracked in git.

### Test 8: End-to-End Optimization (requires LLM connectivity)

**Preconditions:** FYN-LLM reachable at configured endpoint.

**Steps:**
1. Run `python -m pipeline.quality optimize --stage 5 --iterations 2 --variants-per-iter 1 --file backend/pipeline/quality/fixtures/sample_moments.json --output-dir /tmp/chrysopedia_test_results/`
2. Check stdout for iteration progress lines
3. Check stdout for leaderboard table (top variants by composite score)
4. Check stdout for ASCII trajectory chart
5. Check `/tmp/chrysopedia_test_results/` for a JSON file matching `optimize_stage5_*.json`
6. Validate JSON contains keys: best_prompt, best_scores, history, config, elapsed_seconds

**Expected:** Loop runs 2 iterations, generates 1 variant per iteration, scores each, prints leaderboard and trajectory, writes JSON result file with all expected keys.
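
A quick scripted check for steps 5-6, assuming exactly one result file was written (paths as in the test):

```python
import glob
import json

[path] = glob.glob("/tmp/chrysopedia_test_results/optimize_stage5_*.json")
with open(path) as fh:
    result = json.load(fh)
expected = {"best_prompt", "best_scores", "history", "config", "elapsed_seconds"}
missing = expected - result.keys()
assert not missing, f"missing keys: {missing}"
print("result JSON ok")
```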

### Edge Case 9: LLM Unreachable

**Steps:**
1. Set LLM endpoint to an unreachable host (e.g., modify config temporarily)
2. Run `python -m pipeline.quality optimize --stage 5 --iterations 1 --file backend/pipeline/quality/fixtures/sample_moments.json`

**Expected:** Clean error message about connectivity failure. No Python traceback shown to user.
.gsd/milestones/M013/slices/S03/tasks/T02-VERIFY.json (new file, 16 lines)

@@ -0,0 +1,16 @@
{
  "schemaVersion": 1,
  "taskId": "T02",
  "unitId": "M013/S03/T02",
  "timestamp": 1775034642864,
  "passed": true,
  "discoverySource": "task-plan",
  "checks": [
    {
      "command": "python -m pipeline.quality optimize --help",
      "exitCode": 0,
      "durationMs": 496,
      "verdict": "pass"
    }
  ]
}
@@ -1,6 +1,55 @@
# S04: Expand to Pipeline Stages 2-4

-**Goal:** Apply the quality framework to stages 2 (segmentation), 3 (extraction), and 4 (classification) with stage-specific scoring criteria
+**Goal:** Extend the prompt optimization loop from stage-5-only to stages 2-4, with stage-appropriate scoring rubrics, fixture formats, and variant validation — so `optimize --stage N` works for any pipeline stage.
**Demo:** After this: Run `python -m pipeline.quality optimize --stage 3 --iterations 5` — optimizes extraction prompts with stage-appropriate scoring

## Tasks

- [x] **T01: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic** — Build the STAGE_CONFIGS registry in scorer.py that maps each pipeline stage (2-5) to its scoring rubric, dimensions list, format markers, fixture key requirements, prompt file name, and output schema class. Generalize ScoreResult to use a `scores: dict[str, float]` field instead of 5 named fields (keep backward compat via properties). Add a `score_stage_output()` method to ScoreRunner that accepts arbitrary stage output + input and scores using the stage's rubric. Update variant_generator.py to accept format markers as a parameter rather than using the hardcoded `_FORMAT_MARKERS` list, and generalize the meta-prompt to work for any stage (not just synthesis).

## Context

The existing scorer has a hardcoded `SCORING_RUBRIC` with 5 stage-5 dimensions (structural, content_specificity, voice_preservation, readability, factual_fidelity). `ScoreResult` has these as named float fields. The variant generator has hardcoded `_FORMAT_MARKERS = ['SynthesisResult', '"pages"', 'body_sections', 'title', 'summary']` and a `VARIANT_META_PROMPT` that references 'synthesis prompt' language.

Stages 2-4 need different rubrics:

- Stage 2 (segmentation): coverage_completeness, topic_specificity, boundary_accuracy, summary_quality
- Stage 3 (extraction): moment_richness, timestamp_accuracy, content_type_correctness, summary_actionability, plugin_normalization
- Stage 4 (classification): category_accuracy, tag_completeness, tag_specificity, coverage, no_overlap

Format markers per stage:

- Stage 2: `'segments'`, `'start_index'`, `'end_index'`, `'topic_label'`
- Stage 3: `'moments'`, `'content_type'`, `'raw_transcript'`, `'plugins'`
- Stage 4: `'classifications'`, `'moment_index'`, `'topic_category'`, `'topic_tags'`
- Stage 5: `'SynthesisResult'`, `'"pages"'`, `'body_sections'`, `'title'`, `'summary'` (existing)

Prompt files: `stage2_segmentation.txt`, `stage3_extraction.txt`, `stage4_classification.txt`, `stage5_synthesis.txt`
Schemas: `SegmentationResult`, `ExtractionResult`, `ClassificationResult`, `SynthesisResult` (all in `pipeline.schemas`)

- Estimate: 1.5h
- Files: backend/pipeline/quality/scorer.py, backend/pipeline/quality/variant_generator.py
- Verify: cd backend && python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')" && python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"

- [ ] **T02: Generalize optimizer, create stage 2-4 fixtures, wire CLI, verify end-to-end** — Make OptimizationLoop stage-aware: generalize _load_fixture() to validate stage-specific keys, generalize _score_variant() to call the correct prompt and parse the correct schema per stage, and pass stage-appropriate format markers to the variant generator. Create minimal fixture JSON files for stages 2-4. Remove the stage-5 gate in __main__.py's _run_optimize(), add validation for stages 2-5. Verify all stages import and CLI accepts them.

## Context

The optimizer currently has three stage-5-specific hardcodings:

1. `_load_fixture()` expects `creator_name` and `moments` keys — stages 2-4 have different input shapes
2. `_score_variant()` calls synthesis via the `SynthesisResult` schema and formats output as a technique page for the scorer
3. The `run()` method loads `stage{N}_synthesis.txt` — stages 2-4 use different prompt file names

The CLI's `_run_optimize()` rejects `args.stage != 5` with an error.

Stage fixture shapes (from research; a minimal stage 2 example follows the list):

- Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}` — segments of a transcript
- Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}` — a topic group to extract moments from
- Stage 4: `{moments: [{title, summary, content_type, plugins}], taxonomy: "..."}` — moments to classify
- Stage 5: `{creator_name, moments: [...]}` (existing sample_moments.json)
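
For concreteness, a minimal stage 2 fixture matching that shape might look like this (the values are hypothetical):

```json
{
  "transcript_segments": [
    {"index": 0, "start_time": 0.0, "end_time": 12.5,
     "text": "Today we're looking at sidechain compression on the kick and bass."},
    {"index": 1, "start_time": 12.5, "end_time": 31.0,
     "text": "Set the ratio around 4:1 and time the release to the kick's tail."}
  ]
}
```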

The STAGE_CONFIGS registry from T01 provides the prompt filename, schema class, dimensions, and format markers per stage. This task uses that registry to dispatch correctly.

After changes, `OptimizationLoop.run()` should (see the sketch after this list):

1. Load the prompt file from `STAGE_CONFIGS[stage].prompt_file`
2. Load and validate the fixture using `STAGE_CONFIGS[stage].fixture_keys`
3. In `_score_variant()`, use the stage's schema to parse LLM output, then format it for the scorer's `score_stage_output()`
4. Pass stage-appropriate format markers to the variant generator
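
A sketch of that dispatch using the T01 StageConfig attributes (helper names like `_load_prompt` and `_format_input` are illustrative, and the parsed-model handling is an assumption):

```python
class OptimizationLoop:
    def run(self) -> "OptimizationResult":
        cfg = STAGE_CONFIGS[self.stage]                  # registry from T01
        prompt = self._load_prompt(cfg.prompt_file)      # e.g. stage3_extraction.txt
        fixture = self._load_fixture(cfg.fixture_keys)   # validates stage-specific keys
        self.generator.format_markers = cfg.format_markers
        ...  # generate → score → keep-best loop as in S03

    def _score_variant(self, prompt: str, cfg, fixture: dict) -> "ScoreResult":
        raw = self.client.complete(
            system_prompt=prompt,
            user_prompt=self._format_input(fixture),
            response_model=cfg.get_schema(),             # parse with the stage's schema
        )
        return self.scorer.score_stage_output(
            stage=cfg.stage,
            output_json=raw.model_dump(),                # assumes a parsed Pydantic model
            input_json=fixture,
        )
```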

- Estimate: 2h
- Files: backend/pipeline/quality/optimizer.py, backend/pipeline/quality/__main__.py, backend/pipeline/quality/fixtures/sample_segments.json, backend/pipeline/quality/fixtures/sample_topic_group.json, backend/pipeline/quality/fixtures/sample_classifications.json
- Verify: cd backend && python -c "from pipeline.quality.optimizer import OptimizationLoop; print('optimizer ok')" && python -c "from pipeline.quality.__main__ import main; print('cli ok')" && python -m pipeline.quality optimize --stage 2 --iterations 1 --file pipeline/quality/fixtures/sample_segments.json --help 2>&1 | head -1 && python -m pipeline.quality optimize --stage 6 --file x 2>&1 | grep -q 'stage' && echo 'stage6 rejected ok'

.gsd/milestones/M013/slices/S04/S04-RESEARCH.md (new file, 89 lines)

@@ -0,0 +1,89 @@
# S04 Research: Expand to Pipeline Stages 2-4

## Depth: Targeted

Known patterns (optimization loop, variant generation, scoring) applied to new stages. The main complexity is defining stage-appropriate scoring rubrics and fixture formats.

## Summary

Extending the optimizer from stage-5-only to stages 2-4 requires four changes per stage: (1) a scoring rubric tailored to what that stage produces, (2) fixture data matching the stage's input shape, (3) format markers for variant validation, and (4) a score-variant pathway that calls the right LLM prompt and parses the right schema.

The existing architecture in `optimizer.py` and `variant_generator.py` is well-factored for this — the main hardcoding is in the scorer's `SCORING_RUBRIC` (stage-5-specific dimensions) and the optimizer's `_score_variant` method (calls synthesis → scores technique page).

## Requirement Coverage

**R013** (Prompt Template System) — already validated, but this slice extends it with automated optimization for stages 2-4, strengthening the "re-run extraction on specific videos for calibration" aspect.

## Implementation Landscape

### What exists

| File | Role | Stage-5 Coupling |
|---|---|---|
| `backend/pipeline/quality/scorer.py` | `SCORING_RUBRIC`, `DIMENSIONS`, `ScoreResult`, `ScoreRunner` | SCORING_RUBRIC and DIMENSIONS are hardcoded to 5 stage-5 dimensions. ScoreResult dataclass has those 5 as named fields. `score_page()` expects page JSON shape. `synthesize_and_score()` loads stage5_synthesis.txt. |
| `backend/pipeline/quality/optimizer.py` | `OptimizationLoop`, `OptimizationResult` | `run()` loads `stage{N}_synthesis.txt` (already parameterized). `_score_variant()` calls synthesis and expects `SynthesisResult` schema. Fixture loader expects `{creator_name, moments}`. |
| `backend/pipeline/quality/variant_generator.py` | `PromptVariantGenerator`, `VARIANT_META_PROMPT` | `VARIANT_META_PROMPT` references "synthesis prompt" and synthesis-specific language. `_FORMAT_MARKERS` are `["SynthesisResult", '"pages"', "body_sections", "title", "summary"]` — stage-5-specific. |
| `backend/pipeline/quality/__main__.py` | CLI | `_run_optimize()` rejects `args.stage != 5` with an error message. |
| `backend/pipeline/quality/fixtures/sample_moments.json` | Test fixture | Stage 5 format: `{creator_name, topic_category, moments: [...]}` |

### Prompt files and schemas per stage

| Stage | Prompt File | Input Shape | Output Schema | What to Score |
|---|---|---|---|---|
| 2 (segmentation) | `stage2_segmentation.txt` | Transcript segments: `[idx] (start-end) text` | `SegmentationResult{segments: [{start_index, end_index, topic_label, summary}]}` | Coverage completeness (no gaps/overlaps), topic label specificity, segment boundary accuracy, summary quality |
| 3 (extraction) | `stage3_extraction.txt` | Topic group segments: `(start-end) text` with topic label | `ExtractionResult{moments: [{title, summary, start_time, end_time, content_type, plugins, raw_transcript}]}` | Moment richness (detail density), timestamp accuracy, content_type correctness, summary actionability, plugin name normalization |
| 4 (classification) | `stage4_classification.txt` | Moments list + taxonomy text | `ClassificationResult{classifications: [{moment_index, topic_category, topic_tags, content_type_override}]}` | Category accuracy, tag completeness, tag specificity, coverage (all moments classified), no-overlap (one category per moment) |

### Format markers per stage

Stage 2: `"segments"`, `start_index`, `end_index`, `topic_label`
Stage 3: `"moments"`, `content_type`, `raw_transcript`, `plugins`
Stage 4: `"classifications"`, `moment_index`, `topic_category`, `topic_tags`

### Architecture approach

Two viable approaches:

**A. Stage-specific scorer classes** — Create `Stage2Scorer`, `Stage3Scorer`, `Stage4Scorer` alongside the existing `ScoreRunner` (which becomes `Stage5Scorer`). Each has its own rubric, dimensions, and `score_output()` method. `OptimizationLoop` dispatches to the right scorer based on `self.stage`.

**B. Parameterized scorer with rubric registry** — Keep one `ScoreRunner` class but make it accept a rubric config (dimensions list, rubric text, output parser). A `STAGE_CONFIGS` dict maps stage number → config.

**Recommendation: B (registry).** The scoring flow is identical across stages (send rubric + output + input to LLM judge, parse scores). Only the rubric text, dimensions, format markers, and fixture→LLM-input formatting differ. A registry keeps the surface area small.

### Key design decisions

1. **ScoreResult generalization** — Currently has 5 named fields (`structural`, `content_specificity`, etc.). For stages 2-4 with different dimensions, either: (a) use a generic `scores: dict[str, float]` field, or (b) keep the named fields for stage 5 and add a generic dict for others. Option (a) is cleaner — stage 5's named fields can be populated from the dict.

2. **Fixture format per stage** — Each stage needs different fixture data:
   - Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}`
   - Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}`
   - Stage 4: `{moments: [...], taxonomy: "..."}`

   The optimizer's `_load_fixture()` and `_score_variant()` need to be stage-aware.

3. **Variant generator meta-prompt** — The `VARIANT_META_PROMPT` currently references synthesis-specific language. It needs to be generalized or given per-stage variants. The core pattern (analyze scores → improve weakest dimensions → preserve format) is the same; a sketch follows this list.

4. **Fixture creation** — Need sample fixtures for stages 2-4. Can extract from the existing sample_moments.json data (moments → stage 3/4 input) and from a real transcript (stage 2 input). Alternatively, create synthetic minimal fixtures.
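
T01's summary records that `VARIANT_META_PROMPT` became a template with a `{dimension_descriptions}` placeholder; a sketch of that shape (the wording and the helper are illustrative, not the shipped code):

```python
VARIANT_META_PROMPT = """\
You are an expert prompt engineer. You will be given a pipeline prompt,
its most recent per-dimension quality scores, and a rubric summary.

The prompt is scored on these dimensions:
{dimension_descriptions}

Rewrite the prompt to improve its weakest dimensions. Preserve the required
output format markers exactly. Return ONLY the rewritten prompt text.
"""

def build_meta_prompt(dimension_descriptions: dict[str, str]) -> str:
    # Maps dimension name -> one-line description, e.g. derived from the stage's rubric.
    lines = "\n".join(f"- {name}: {desc}" for name, desc in dimension_descriptions.items())
    return VARIANT_META_PROMPT.format(dimension_descriptions=lines)
```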

### Files to modify

- `backend/pipeline/quality/scorer.py` — Add stage configs registry (rubric, dimensions, format markers per stage), generalize `ScoreResult`, add `score_stage_output()` method
- `backend/pipeline/quality/optimizer.py` — Generalize `_score_variant()` to dispatch per stage, generalize fixture loading
- `backend/pipeline/quality/variant_generator.py` — Generalize meta-prompt and format markers per stage
- `backend/pipeline/quality/__main__.py` — Remove stage-5 gate, add stage validation (2-5 only)
- `backend/pipeline/quality/fixtures/` — Add sample fixtures for stages 2-4

### Natural task seams

1. **Scorer generalization + rubric registry** — Define stage 2-4 scoring dimensions and rubrics, generalize ScoreResult, add stage config registry. This is the foundation.
2. **Optimizer + variant generator generalization** — Make OptimizationLoop stage-aware (fixture loading, score dispatch), generalize variant generator format markers and meta-prompt.
3. **Fixtures + CLI integration** — Create stage 2-4 fixtures, remove the stage-5 gate in the CLI, wire everything together, verify end-to-end.

### Verification

- `python -m pipeline.quality optimize --stage 2 --iterations 1 --file <stage2_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 3 --iterations 1 --file <stage3_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 4 --iterations 1 --file <stage4_fixture>` runs without error
- `python -m pipeline.quality optimize --stage 5 --iterations 1 --file <stage5_fixture>` still works (no regression)
- `python -m pipeline.quality optimize --stage 6` → error message
- Import checks for all modified modules pass

.gsd/milestones/M013/slices/S04/tasks/T01-PLAN.md (new file, 42 lines)

@@ -0,0 +1,42 @@
---
estimated_steps: 14
estimated_files: 2
skills_used: []
---

# T01: Generalize scorer with stage config registry and update variant generator

Build the STAGE_CONFIGS registry in scorer.py that maps each pipeline stage (2-5) to its scoring rubric, dimensions list, format markers, fixture key requirements, prompt file name, and output schema class. Generalize ScoreResult to use a `scores: dict[str, float]` field instead of 5 named fields (keep backward compat via properties). Add a `score_stage_output()` method to ScoreRunner that accepts arbitrary stage output + input and scores using the stage's rubric. Update variant_generator.py to accept format markers as a parameter rather than using the hardcoded `_FORMAT_MARKERS` list, and generalize the meta-prompt to work for any stage (not just synthesis).

## Context

The existing scorer has a hardcoded `SCORING_RUBRIC` with 5 stage-5 dimensions (structural, content_specificity, voice_preservation, readability, factual_fidelity). `ScoreResult` has these as named float fields. The variant generator has hardcoded `_FORMAT_MARKERS = ['SynthesisResult', '"pages"', 'body_sections', 'title', 'summary']` and a `VARIANT_META_PROMPT` that references 'synthesis prompt' language.

Stages 2-4 need different rubrics:

- Stage 2 (segmentation): coverage_completeness, topic_specificity, boundary_accuracy, summary_quality
- Stage 3 (extraction): moment_richness, timestamp_accuracy, content_type_correctness, summary_actionability, plugin_normalization
- Stage 4 (classification): category_accuracy, tag_completeness, tag_specificity, coverage, no_overlap

Format markers per stage:

- Stage 2: `'segments'`, `'start_index'`, `'end_index'`, `'topic_label'`
- Stage 3: `'moments'`, `'content_type'`, `'raw_transcript'`, `'plugins'`
- Stage 4: `'classifications'`, `'moment_index'`, `'topic_category'`, `'topic_tags'`
- Stage 5: `'SynthesisResult'`, `'"pages"'`, `'body_sections'`, `'title'`, `'summary'` (existing)

Prompt files: `stage2_segmentation.txt`, `stage3_extraction.txt`, `stage4_classification.txt`, `stage5_synthesis.txt`
Schemas: `SegmentationResult`, `ExtractionResult`, `ClassificationResult`, `SynthesisResult` (all in `pipeline.schemas`)

## Inputs

- `backend/pipeline/quality/scorer.py` — existing ScoreRunner, SCORING_RUBRIC, DIMENSIONS, ScoreResult
- `backend/pipeline/quality/variant_generator.py` — existing PromptVariantGenerator, VARIANT_META_PROMPT, _FORMAT_MARKERS
- `backend/pipeline/schemas.py` — SegmentationResult, ExtractionResult, ClassificationResult, SynthesisResult schemas

## Expected Output

- `backend/pipeline/quality/scorer.py` — STAGE_CONFIGS registry, generalized ScoreResult with scores dict, score_stage_output() method on ScoreRunner
- `backend/pipeline/quality/variant_generator.py` — generalized generate() accepting format_markers parameter, stage-agnostic VARIANT_META_PROMPT

## Verification

cd backend && python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')" && python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"
.gsd/milestones/M013/slices/S04/tasks/T01-SUMMARY.md (new file, 79 lines)

@@ -0,0 +1,79 @@
---
id: T01
parent: S04
milestone: M013
provides: []
requires: []
affects: []
key_files: ["backend/pipeline/quality/scorer.py", "backend/pipeline/quality/variant_generator.py"]
key_decisions: ["Used backward-compat properties on ScoreResult instead of migrating all callers", "Made VARIANT_META_PROMPT a template with {dimension_descriptions} filled per-stage"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "All three verification commands pass: STAGE_CONFIGS has entries for stages 2-5, ScoreResult works with scores dict, backward-compat getattr works, StageConfig.get_schema() resolves all schema classes, PromptVariantGenerator imports cleanly."
completed_at: 2026-04-01T09:20:20.599Z
blocker_discovered: false
---

# T01: Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic

**Added STAGE_CONFIGS registry (stages 2-5) with per-stage rubrics, generalized ScoreResult to scores dict, and made variant generator stage-agnostic**

## What Happened

Built the STAGE_CONFIGS registry mapping stages 2-5 to StageConfig objects with rubrics, dimensions, format markers, fixture keys, prompt file names, and schema classes. Generalized ScoreResult from named float fields to a scores dict with backward-compat properties for stage 5. Added score_stage_output() to ScoreRunner for arbitrary stage scoring. Updated variant_generator.py with a templatized meta-prompt and format_markers/stage parameters on generate().

## Verification

All three verification commands pass: STAGE_CONFIGS has entries for stages 2-5, ScoreResult works with a scores dict, backward-compat getattr works, StageConfig.get_schema() resolves all schema classes, and PromptVariantGenerator imports cleanly.

## Verification Evidence

| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, ScoreRunner, DIMENSIONS; assert 2 in STAGE_CONFIGS and 3 in STAGE_CONFIGS and 4 in STAGE_CONFIGS and 5 in STAGE_CONFIGS; r = ScoreResult(scores={'structural': 0.8, 'readability': 0.7}, composite=0.75); print('scorer ok')"` | 0 | ✅ pass | 1000ms |
| 2 | `python -c "from pipeline.quality.variant_generator import PromptVariantGenerator; print('generator ok')"` | 0 | ✅ pass | 1000ms |
| 3 | `python -c "from pipeline.quality.scorer import STAGE_CONFIGS, ScoreResult, DIMENSIONS; [cfg.get_schema() for cfg in STAGE_CONFIGS.values()]; r = ScoreResult(scores={'structural': 0.8}, composite=0.8); assert r.structural == 0.8; print('compat ok')"` | 0 | ✅ pass | 1000ms |

## Deviations

Added SCORING_RUBRIC backward-compat alias. Made VARIANT_META_PROMPT a template string with {dimension_descriptions} placeholder.

## Known Issues

__main__.py line 148 uses a getattr pattern that only works for stage 5 — it will need updating when the optimize CLI is generalized.

## Files Created/Modified

- `backend/pipeline/quality/scorer.py`
- `backend/pipeline/quality/variant_generator.py`
.gsd/milestones/M013/slices/S04/tasks/T02-PLAN.md (new file, 52 lines)

@@ -0,0 +1,52 @@
---
estimated_steps: 18
estimated_files: 5
skills_used: []
---

# T02: Generalize optimizer, create stage 2-4 fixtures, wire CLI, verify end-to-end

Make OptimizationLoop stage-aware: generalize _load_fixture() to validate stage-specific keys, generalize _score_variant() to call the correct prompt and parse the correct schema per stage, and pass stage-appropriate format markers to the variant generator. Create minimal fixture JSON files for stages 2-4. Remove the stage-5 gate in __main__.py's _run_optimize(), add validation for stages 2-5. Verify all stages import and CLI accepts them.

## Context

The optimizer currently has three stage-5-specific hardcodings:

1. `_load_fixture()` expects `creator_name` and `moments` keys — stages 2-4 have different input shapes
2. `_score_variant()` calls synthesis via the `SynthesisResult` schema and formats output as a technique page for the scorer
3. The `run()` method loads `stage{N}_synthesis.txt` — stages 2-4 use different prompt file names

The CLI's `_run_optimize()` rejects `args.stage != 5` with an error.

Stage fixture shapes (from research):

- Stage 2: `{transcript_segments: [{index, start_time, end_time, text}]}` — segments of a transcript
- Stage 3: `{topic_label, segments: [{start_time, end_time, text}]}` — a topic group to extract moments from
- Stage 4: `{moments: [{title, summary, content_type, plugins}], taxonomy: "..."}` — moments to classify
- Stage 5: `{creator_name, moments: [...]}` (existing sample_moments.json)

The STAGE_CONFIGS registry from T01 provides the prompt filename, schema class, dimensions, and format markers per stage. This task uses that registry to dispatch correctly.

After changes, `OptimizationLoop.run()` should:

1. Load the prompt file from `STAGE_CONFIGS[stage].prompt_file`
2. Load and validate the fixture using `STAGE_CONFIGS[stage].fixture_keys`
3. In `_score_variant()`, use the stage's schema to parse LLM output, then format it for the scorer's `score_stage_output()`
4. Pass stage-appropriate format markers to the variant generator

## Inputs

- `backend/pipeline/quality/scorer.py` — STAGE_CONFIGS registry and score_stage_output() from T01
- `backend/pipeline/quality/variant_generator.py` — generalized generate() from T01
- `backend/pipeline/quality/optimizer.py` — existing OptimizationLoop
- `backend/pipeline/quality/__main__.py` — existing _run_optimize with stage-5 gate
- `backend/pipeline/quality/fixtures/sample_moments.json` — existing stage 5 fixture for regression check

## Expected Output

- `backend/pipeline/quality/optimizer.py` — stage-aware OptimizationLoop with generalized _load_fixture, _score_variant, and format marker dispatch
- `backend/pipeline/quality/__main__.py` — _run_optimize accepts stages 2-5, rejects others
- `backend/pipeline/quality/fixtures/sample_segments.json` — stage 2 fixture
- `backend/pipeline/quality/fixtures/sample_topic_group.json` — stage 3 fixture
- `backend/pipeline/quality/fixtures/sample_classifications.json` — stage 4 fixture

## Verification

cd backend && python -c "from pipeline.quality.optimizer import OptimizationLoop; print('optimizer ok')" && python -c "from pipeline.quality.__main__ import main; print('cli ok')" && python -m pipeline.quality optimize --stage 2 --iterations 1 --file pipeline/quality/fixtures/sample_segments.json --help 2>&1 | head -1 && python -m pipeline.quality optimize --stage 6 --file x 2>&1 | grep -q 'stage' && echo 'stage6 rejected ok'

backend/pipeline/quality/scorer.py

@@ -1,11 +1,7 @@
-"""Stage 5 quality scorer — LLM-as-judge evaluation across 5 dimensions.
+"""Multi-stage quality scorer — LLM-as-judge evaluation with per-stage rubrics.

-Evaluates a synthesized technique page against source moments on:
-1. Structural quality — section naming, count, paragraph depth
-2. Content specificity — concrete details vs vague generalities
-3. Voice preservation — direct quotes, attributed opinions, personality
-4. Readability / flow — synthesis quality, logical ordering, no redundancy
-5. Factual fidelity — no hallucinated specifics, grounded in source moments
+Supports stages 2-5, each with its own scoring dimensions, rubric, format
+markers, fixture key requirements, prompt file name, and output schema.

Run via: python -m pipeline.quality score --file <path>
"""

@@ -16,6 +12,7 @@ import logging
import sys
import time
from dataclasses import dataclass, field
from typing import Any

import openai
from pydantic import BaseModel

@@ -26,9 +23,177 @@ from pipeline.quality.voice_dial import VoiceDial
logger = logging.getLogger(__name__)


-# ── Scoring rubric (hardcoded for iteration speed) ───────────────────────────
+# ── Per-stage configuration registry ─────────────────────────────────────────

-SCORING_RUBRIC = """\
+class StageConfig:
+    """Configuration for scoring a specific pipeline stage."""

    def __init__(
        self,
        stage: int,
        dimensions: list[str],
        rubric: str,
        format_markers: list[str],
        fixture_keys: list[str],
        prompt_file: str,
        schema_class: str,
    ) -> None:
        self.stage = stage
        self.dimensions = dimensions
        self.rubric = rubric
        self.format_markers = format_markers
        self.fixture_keys = fixture_keys
        self.prompt_file = prompt_file
        self.schema_class = schema_class

    def get_schema(self) -> type[BaseModel]:
        """Import and return the Pydantic schema class for this stage."""
        from pipeline import schemas
        return getattr(schemas, self.schema_class)


# ── Stage rubrics ────────────────────────────────────────────────────────────

_STAGE_2_RUBRIC = """\
You are an expert evaluator of transcript segmentation quality for educational content.

You will be given:
1. A segmentation result (JSON with segments, each having start_index, end_index, topic_label, summary)
2. The source transcript segments used as input

Evaluate the segmentation across these 4 dimensions, scoring each 0.0 to 1.0:

**coverage_completeness** — All transcript content accounted for
- 0.9-1.0: Every transcript segment is covered by exactly one topic segment, no gaps or overlaps
- 0.5-0.7: Minor gaps or overlaps, but most content is covered
- 0.0-0.3: Large gaps — significant transcript segments are not assigned to any topic

**topic_specificity** — Topic labels are descriptive and useful
- 0.9-1.0: Labels are specific and descriptive (e.g., "Sidechain compression on kick-bass" not "Audio processing")
- 0.5-0.7: Labels are somewhat specific but could be more descriptive
- 0.0-0.3: Labels are generic or meaningless ("Topic 1", "Discussion", "Audio")

**boundary_accuracy** — Segment boundaries align with actual topic transitions
- 0.9-1.0: Boundaries fall at natural topic transitions, segments are coherent units
- 0.5-0.7: Most boundaries are reasonable but some segments mix distinct topics
- 0.0-0.3: Boundaries seem arbitrary, segments contain unrelated content

**summary_quality** — Summaries accurately describe segment content
- 0.9-1.0: Summaries capture the key points of each segment concisely and accurately
- 0.5-0.7: Summaries are acceptable but miss some key points or are too vague
- 0.0-0.3: Summaries are inaccurate, too generic, or missing

Return ONLY a JSON object with this exact structure:
{
  "coverage_completeness": <float 0.0-1.0>,
  "topic_specificity": <float 0.0-1.0>,
  "boundary_accuracy": <float 0.0-1.0>,
  "summary_quality": <float 0.0-1.0>,
  "justifications": {
    "coverage_completeness": "<1-2 sentence justification>",
    "topic_specificity": "<1-2 sentence justification>",
    "boundary_accuracy": "<1-2 sentence justification>",
    "summary_quality": "<1-2 sentence justification>"
  }
}
"""

_STAGE_3_RUBRIC = """\
You are an expert evaluator of key moment extraction quality for educational content.

You will be given:
1. An extraction result (JSON with moments, each having title, summary, start_time, end_time, content_type, plugins, raw_transcript)
2. The source topic segments used as input

Evaluate the extraction across these 5 dimensions, scoring each 0.0 to 1.0:

**moment_richness** — Extracted moments capture substantial, distinct insights
- 0.9-1.0: Each moment represents a meaningful, distinct technique or concept with detailed summary
- 0.5-0.7: Moments are valid but some are thin or overlap significantly with others
- 0.0-0.3: Moments are trivial, redundant, or miss the main techniques discussed

**timestamp_accuracy** — Time ranges are plausible and well-bounded
- 0.9-1.0: Start/end times form reasonable ranges, no zero-length or absurdly long spans
- 0.5-0.7: Most timestamps are reasonable but some spans seem too wide or narrow
- 0.0-0.3: Timestamps appear arbitrary or many are zero/identical

**content_type_correctness** — Content types match the actual moment content
- 0.9-1.0: Each moment's content_type (technique/settings/reasoning/workflow) accurately describes it
- 0.5-0.7: Most are correct but 1-2 are miscategorized
- 0.0-0.3: Content types seem randomly assigned or all the same

**summary_actionability** — Summaries provide actionable, specific information
- 0.9-1.0: Summaries contain concrete details (values, settings, steps) that a practitioner could follow
- 0.5-0.7: Summaries describe the topic but lack specific actionable details
- 0.0-0.3: Summaries are vague ("discusses compression") with no actionable information

**plugin_normalization** — Plugin/tool names are correctly identified and normalized
- 0.9-1.0: Plugin names match standard names, no duplicates, captures all mentioned tools
- 0.5-0.7: Most plugins captured but some are misspelled, duplicated, or missed
- 0.0-0.3: Plugin list is mostly empty, contains non-plugins, or has many errors

Return ONLY a JSON object with this exact structure:
{
  "moment_richness": <float 0.0-1.0>,
  "timestamp_accuracy": <float 0.0-1.0>,
  "content_type_correctness": <float 0.0-1.0>,
  "summary_actionability": <float 0.0-1.0>,
  "plugin_normalization": <float 0.0-1.0>,
  "justifications": {
    "moment_richness": "<1-2 sentence justification>",
    "timestamp_accuracy": "<1-2 sentence justification>",
    "content_type_correctness": "<1-2 sentence justification>",
    "summary_actionability": "<1-2 sentence justification>",
    "plugin_normalization": "<1-2 sentence justification>"
  }
}
"""

_STAGE_4_RUBRIC = """\
You are an expert evaluator of content classification quality for educational content.

You will be given:
1. A classification result (JSON with classifications, each having moment_index, topic_category, topic_tags)
2. The source extracted moments used as input

Evaluate the classification across these 4 dimensions, scoring each 0.0 to 1.0:

**category_accuracy** — Topic categories are appropriate and meaningful
- 0.9-1.0: Categories accurately reflect the primary topic of each moment, using domain-appropriate labels
- 0.5-0.7: Most categories are reasonable but some are too broad or slightly off
- 0.0-0.3: Categories are generic ("Music"), incorrect, or all the same

**tag_completeness** — All relevant tags are captured
- 0.9-1.0: Tags capture the key concepts, tools, and techniques in each moment comprehensively
- 0.5-0.7: Main tags are present but secondary concepts or tools are missed
- 0.0-0.3: Tags are sparse, missing major concepts mentioned in the moments

**tag_specificity** — Tags are specific enough to be useful for search/filtering
- 0.9-1.0: Tags are specific ("sidechain compression", "Pro-Q 3") not generic ("audio", "mixing")
- 0.5-0.7: Mix of specific and generic tags
- 0.0-0.3: Tags are too generic to meaningfully distinguish moments

**coverage** — All moments are classified
- 0.9-1.0: Every moment_index from the input has a corresponding classification entry
- 0.5-0.7: Most moments classified but 1-2 are missing
- 0.0-0.3: Many moments are not classified

Return ONLY a JSON object with this exact structure:
{
  "category_accuracy": <float 0.0-1.0>,
  "tag_completeness": <float 0.0-1.0>,
  "tag_specificity": <float 0.0-1.0>,
  "coverage": <float 0.0-1.0>,
  "justifications": {
    "category_accuracy": "<1-2 sentence justification>",
    "tag_completeness": "<1-2 sentence justification>",
    "tag_specificity": "<1-2 sentence justification>",
    "coverage": "<1-2 sentence justification>"
  }
}
"""

_STAGE_5_RUBRIC = """\
You are an expert evaluator of synthesized technique articles for music production education.

You will be given:

@@ -79,73 +244,142 @@ Return ONLY a JSON object with this exact structure:
}
"""

-DIMENSIONS = [
-    "structural",
-    "content_specificity",
-    "voice_preservation",
-    "readability",
-    "factual_fidelity",
-]
+# Backward-compat alias used by synthesize_and_score and external references
+SCORING_RUBRIC = _STAGE_5_RUBRIC

# Build the stage configs registry
STAGE_CONFIGS: dict[int, StageConfig] = {
    2: StageConfig(
        stage=2,
        dimensions=["coverage_completeness", "topic_specificity", "boundary_accuracy", "summary_quality"],
        rubric=_STAGE_2_RUBRIC,
        format_markers=["segments", "start_index", "end_index", "topic_label"],
        fixture_keys=["transcript_segments"],
        prompt_file="stage2_segmentation.txt",
        schema_class="SegmentationResult",
    ),
    3: StageConfig(
        stage=3,
        dimensions=["moment_richness", "timestamp_accuracy", "content_type_correctness", "summary_actionability", "plugin_normalization"],
        rubric=_STAGE_3_RUBRIC,
        format_markers=["moments", "content_type", "raw_transcript", "plugins"],
        fixture_keys=["topic_segments"],
        prompt_file="stage3_extraction.txt",
        schema_class="ExtractionResult",
    ),
    4: StageConfig(
        stage=4,
        dimensions=["category_accuracy", "tag_completeness", "tag_specificity", "coverage"],
        rubric=_STAGE_4_RUBRIC,
        format_markers=["classifications", "moment_index", "topic_category", "topic_tags"],
        fixture_keys=["extracted_moments"],
        prompt_file="stage4_classification.txt",
        schema_class="ClassificationResult",
    ),
    5: StageConfig(
        stage=5,
        dimensions=["structural", "content_specificity", "voice_preservation", "readability", "factual_fidelity"],
        rubric=SCORING_RUBRIC,
        format_markers=["SynthesisResult", '"pages"', "body_sections", "title", "summary"],
        fixture_keys=["key_moments", "creator_name"],
        prompt_file="stage5_synthesis.txt",
        schema_class="SynthesisResult",
    ),
}

# Backward-compatible alias: stage 5 dimensions list
DIMENSIONS = STAGE_CONFIGS[5].dimensions


# ── Result type ──────────────────────────────────────────────────────────────

@dataclass
class ScoreResult:
-    """Outcome of scoring a technique page across 5 quality dimensions."""
+    """Outcome of scoring a stage output across quality dimensions.

-    structural: float = 0.0
-    content_specificity: float = 0.0
-    voice_preservation: float = 0.0
-    readability: float = 0.0
-    factual_fidelity: float = 0.0
+    Uses a generic ``scores`` dict keyed by dimension name. Stage 5's
+    original named fields (structural, content_specificity, …) are
+    preserved as properties for backward compatibility.
+    """
+
+    scores: dict[str, float] = field(default_factory=dict)
    composite: float = 0.0
    justifications: dict[str, str] = field(default_factory=dict)
    elapsed_seconds: float = 0.0
    error: str | None = None

    # ── Backward-compat properties for stage 5 named dimensions ──────
    @property
    def structural(self) -> float:
        return self.scores.get("structural", 0.0)

    @property
    def content_specificity(self) -> float:
        return self.scores.get("content_specificity", 0.0)

    @property
    def voice_preservation(self) -> float:
        return self.scores.get("voice_preservation", 0.0)

    @property
    def readability(self) -> float:
        return self.scores.get("readability", 0.0)

    @property
    def factual_fidelity(self) -> float:
        return self.scores.get("factual_fidelity", 0.0)


# ── Runner ───────────────────────────────────────────────────────────────────

class ScoreRunner:
-    """Scores a Stage 5 technique page using LLM-as-judge evaluation."""
+    """Scores pipeline stage outputs using LLM-as-judge evaluation."""

    def __init__(self, client: LLMClient) -> None:
        self.client = client

-    def score_page(
+    # ── Generic stage scorer ─────────────────────────────────────────────
+
+    def score_stage_output(
        self,
-        page_json: dict,
-        moments: list[dict],
+        stage: int,
+        output_json: dict | list,
+        input_json: dict | list,
    ) -> ScoreResult:
-        """Evaluate a technique page against source moments.
+        """Score an arbitrary stage's output against its input.

        Parameters
        ----------
-        page_json:
-            Synthesized page dict (title, summary, body_sections).
-        moments:
-            Source key moments with transcript_excerpt, summary, etc.
+        stage:
+            Pipeline stage number (2-5).
+        output_json:
+            The stage output to evaluate (parsed JSON).
+        input_json:
+            The stage input / source material.

        Returns
        -------
-        ScoreResult with per-dimension scores and justifications.
+        ScoreResult with per-dimension scores for the requested stage.
        """
-        # Build the user prompt with the page and source moments
+        if stage not in STAGE_CONFIGS:
+            return ScoreResult(error=f"No config for stage {stage}. Valid: {sorted(STAGE_CONFIGS)}")
+
+        cfg = STAGE_CONFIGS[stage]
+
        user_prompt = (
-            "## Synthesized Technique Page\n\n"
-            f"```json\n{json.dumps(page_json, indent=2)}\n```\n\n"
-            "## Source Key Moments\n\n"
-            f"```json\n{json.dumps(moments, indent=2)}\n```\n\n"
-            "Score this page across all 5 dimensions."
+            "## Stage Output\n\n"
+            f"```json\n{json.dumps(output_json, indent=2)}\n```\n\n"
+            "## Stage Input\n\n"
+            f"```json\n{json.dumps(input_json, indent=2)}\n```\n\n"
+            f"Score this stage {stage} output across all {len(cfg.dimensions)} dimensions."
        )

        t0 = time.monotonic()
        try:
            resp = self.client.complete(
-                system_prompt=SCORING_RUBRIC,
+                system_prompt=cfg.rubric,
                user_prompt=user_prompt,
-                response_model=BaseModel,  # triggers JSON mode
+                response_model=BaseModel,
                modality="chat",
            )
            elapsed = round(time.monotonic() - t0, 2)

@@ -155,13 +389,9 @@
            fallback = self.client.settings.llm_fallback_url
            return ScoreResult(
                elapsed_seconds=elapsed,
-                error=(
-                    f"Cannot reach LLM endpoint at {url} (fallback {fallback}). "
-                    f"Error: {exc}"
-                ),
+                error=f"Cannot reach LLM endpoint at {url} (fallback {fallback}). Error: {exc}",
            )

        # Parse the LLM judge response
        raw_text = str(resp).strip()
        try:
            parsed = json.loads(raw_text)
|
||||
|
|
@@ -172,10 +402,27 @@ class ScoreRunner:
                 error=f"Malformed judge response (not valid JSON). Raw excerpt: {raw_text[:200]}",
             )
 
-        return self._parse_scores(parsed, elapsed)
+        return self._parse_scores(parsed, elapsed, cfg.dimensions)
+
+    # ── Stage 5 convenience (backward compat) ────────────────────────────
+
+    def score_page(
+        self,
+        page_json: dict,
+        moments: list[dict],
+    ) -> ScoreResult:
+        """Evaluate a stage 5 technique page against source moments."""
+        return self.score_stage_output(
+            stage=5,
+            output_json=page_json,
+            input_json=moments,
+        )
 
-    def _parse_scores(self, parsed: dict, elapsed: float) -> ScoreResult:
+    def _parse_scores(self, parsed: dict, elapsed: float, dimensions: list[str] | None = None) -> ScoreResult:
"""Extract and validate scores from parsed JSON response."""
|
||||
dims = dimensions or DIMENSIONS
|
||||
scores: dict[str, float] = {}
|
||||
justifications: dict[str, str] = {}
|
||||
|
||||
|
|
@@ -183,7 +430,7 @@ class ScoreRunner:
         if not isinstance(raw_justifications, dict):
             raw_justifications = {}
 
-        for dim in DIMENSIONS:
+        for dim in dims:
             raw = parsed.get(dim)
             if raw is None:
                 logger.warning("Missing dimension '%s' in judge response", dim)
@@ -202,14 +449,10 @@ class ScoreRunner:
 
             justifications[dim] = str(raw_justifications.get(dim, ""))
 
-        composite = sum(scores.values()) / len(DIMENSIONS)
+        composite = sum(scores.values()) / len(dims) if dims else 0.0
 
         return ScoreResult(
-            structural=scores["structural"],
-            content_specificity=scores["content_specificity"],
-            voice_preservation=scores["voice_preservation"],
-            readability=scores["readability"],
-            factual_fidelity=scores["factual_fidelity"],
+            scores=scores,
             composite=round(composite, 3),
             justifications=justifications,
             elapsed_seconds=elapsed,
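For reference, a judge response that would pass through `_parse_scores` cleanly has one float per dimension plus a `justifications` dict, and the composite is the plain mean over the stage's dimensions. The keys mirror the parsing code above; the values below are invented for illustration:

```python
# Illustrative judge response for a 4-dimension stage (stage 4); values invented.
parsed = {
    "category_accuracy": 0.8,
    "tag_completeness": 0.6,
    "tag_specificity": 0.7,
    "coverage": 0.9,
    "justifications": {"category_accuracy": "Categories match moment content."},
}
dims = ["category_accuracy", "tag_completeness", "tag_specificity", "coverage"]
scores = {d: float(parsed[d]) for d in dims}
composite = sum(scores.values()) / len(dims)  # (0.8 + 0.6 + 0.7 + 0.9) / 4 = 0.75
```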
@@ -318,10 +561,13 @@ class ScoreRunner:
         result.elapsed_seconds = round(result.elapsed_seconds + elapsed_synth, 2)
         return result
 
-    def print_report(self, result: ScoreResult) -> None:
+    def print_report(self, result: ScoreResult, stage: int = 5) -> None:
         """Print a formatted scoring report to stdout."""
+        dims = STAGE_CONFIGS[stage].dimensions if stage in STAGE_CONFIGS else list(result.scores.keys())
+        stage_label = f"STAGE {stage}" if stage in STAGE_CONFIGS else "QUALITY"
+
         print("\n" + "=" * 60)
-        print(" STAGE 5 QUALITY SCORE REPORT")
+        print(f" {stage_label} QUALITY SCORE REPORT")
         print("=" * 60)
 
         if result.error:
@@ -329,8 +575,8 @@ class ScoreRunner:
         print("=" * 60 + "\n")
         return
 
-        for dim in DIMENSIONS:
-            score = getattr(result, dim)
+        for dim in dims:
+            score = result.scores.get(dim, 0.0)
             bar = self._score_bar(score)
             justification = result.justifications.get(dim, "")
             print(f"\n {dim.replace('_', ' ').title()}")
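A usage sketch for the generalized runner, assuming `LLMClient` is default-constructible (its construction is not shown in this diff) and using toy stage 3 payloads:

```python
# Usage sketch; LLMClient() default construction is an assumption.
from pipeline.llm_client import LLMClient
from pipeline.quality.scorer import ScoreRunner

runner = ScoreRunner(LLMClient())
result = runner.score_stage_output(
    stage=3,
    output_json=[{"summary": "Cut 300 Hz to reduce boxiness"}],  # toy payload
    input_json={"transcript": "...segment text..."},             # toy payload
)
runner.print_report(result, stage=3)  # per-dimension bars plus composite

# Existing stage 5 callers are unaffected:
# runner.score_page(page_json=page, moments=moments)
```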
diff --git a/backend/pipeline/quality/variant_generator.py b/backend/pipeline/quality/variant_generator.py
@@ -4,13 +4,17 @@ Uses a meta-prompt to instruct the LLM to act as a prompt engineer,
 analyzing per-dimension scores and producing targeted prompt mutations
 that improve the weakest scoring dimensions while preserving the JSON
 output format required by downstream parsing.
+
+Supports any pipeline stage (2-5) — callers pass the stage's dimensions
+and format markers so the meta-prompt and validation adapt automatically.
 """
 from __future__ import annotations
 
 import logging
+from typing import Sequence
 
 from pipeline.llm_client import LLMClient
-from pipeline.quality.scorer import DIMENSIONS, ScoreResult
+from pipeline.quality.scorer import DIMENSIONS, STAGE_CONFIGS, ScoreResult
 
 logger = logging.getLogger(__name__)
 
@@ -18,29 +22,24 @@ logger = logging.getLogger(__name__)
 # ── Meta-prompt for variant generation ────────────────────────────────────────
 
 VARIANT_META_PROMPT = """\
-You are an expert prompt engineer specializing in LLM-powered content synthesis.
+You are an expert prompt engineer specializing in LLM-powered content processing pipelines.
 
-Your task: given a synthesis prompt and its quality evaluation scores, produce an
+Your task: given a pipeline stage prompt and its quality evaluation scores, produce an
 improved variant of the prompt that targets the weakest-scoring dimensions while
 maintaining or improving the others.
 
 ## Scoring Dimensions (each 0.0–1.0)
 
-- **structural** — Section naming, count (3-6), paragraph depth (2-5 per section)
-- **content_specificity** — Concrete details: frequencies, time values, ratios, plugin names, dB values
-- **voice_preservation** — Direct quotes preserved, opinions attributed to creator by name, personality retained
-- **readability** — Cohesive article flow, related info merged, no redundancy or contradiction
-- **factual_fidelity** — Every claim traceable to source material, no hallucinated specifics
+{dimension_descriptions}
 
 ## Rules
 
 1. Focus your changes on the weakest 1-2 dimensions. Don't dilute the prompt by trying to fix everything.
 2. Add specific, actionable instructions — not vague encouragements.
 3. **CRITICAL: You MUST preserve the JSON output format section of the prompt EXACTLY as-is.**
-   The prompt contains instructions about outputting a JSON object with a specific schema
-   (SynthesisResult with "pages" containing title, summary, body_sections, etc.).
+   The prompt contains instructions about outputting a JSON object with a specific schema.
    Do NOT modify, remove, or rephrase any part of the JSON format instructions.
-   Your changes should target the prose synthesis guidelines only.
+   Your changes should target the processing/analysis guidelines only.
 4. Keep the overall prompt length within 2x of the original. Don't bloat it.
 5. Make substantive changes — rewording a sentence or adding one adjective is not enough.
 
@@ -50,9 +49,38 @@ Return ONLY the full modified prompt text. No explanation, no markdown fences, n
 Just the complete prompt that could be used directly as a system prompt.
 """
 
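Rules 3-5 imply a validation gate on every generated variant: the format markers must survive, the length must stay within 2x, and the change must be substantive. The real helper is not shown in this diff; a minimal sketch under that assumption (the `_is_valid_variant` name is hypothetical):

```python
# Hypothetical validation gate; enforces the meta-prompt's own rules 3-5.
def _is_valid_variant(variant: str, base_prompt: str, required_markers: list[str]) -> bool:
    if any(marker not in variant for marker in required_markers):
        return False  # JSON format instructions were damaged (rule 3)
    if len(variant) > 2 * len(base_prompt):
        return False  # bloated beyond the 2x cap (rule 4)
    if variant.strip() == base_prompt.strip():
        return False  # no substantive change (rule 5)
    return True
```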
+# Dimension descriptions per stage, used to fill the meta-prompt template.
+_DIMENSION_DESCRIPTIONS: dict[int, str] = {
+    2: (
+        "- **coverage_completeness** — All transcript content accounted for, no gaps or overlaps\n"
+        "- **topic_specificity** — Topic labels are descriptive and useful, not generic\n"
+        "- **boundary_accuracy** — Segment boundaries align with actual topic transitions\n"
+        "- **summary_quality** — Summaries accurately describe segment content"
+    ),
+    3: (
+        "- **moment_richness** — Extracted moments capture substantial, distinct insights\n"
+        "- **timestamp_accuracy** — Time ranges are plausible and well-bounded\n"
+        "- **content_type_correctness** — Content types match the actual moment content\n"
+        "- **summary_actionability** — Summaries provide actionable, specific information\n"
+        "- **plugin_normalization** — Plugin/tool names are correctly identified and normalized"
+    ),
+    4: (
+        "- **category_accuracy** — Topic categories are appropriate and meaningful\n"
+        "- **tag_completeness** — All relevant tags are captured\n"
+        "- **tag_specificity** — Tags are specific enough to be useful for search/filtering\n"
+        "- **coverage** — All moments are classified"
+    ),
+    5: (
+        "- **structural** — Section naming, count (3-6), paragraph depth (2-5 per section)\n"
+        "- **content_specificity** — Concrete details: frequencies, time values, ratios, plugin names, dB values\n"
+        "- **voice_preservation** — Direct quotes preserved, opinions attributed to creator by name, personality retained\n"
+        "- **readability** — Cohesive article flow, related info merged, no redundancy or contradiction\n"
+        "- **factual_fidelity** — Every claim traceable to source material, no hallucinated specifics"
+    ),
+}
+
-# Format markers that must survive variant generation — if any of these
-# are present in the base prompt, the variant must also contain them.
+# Legacy default format markers for stage 5
 _FORMAT_MARKERS = ["SynthesisResult", '"pages"', "body_sections", "title", "summary"]
 
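Filling the template for a non-default stage mirrors what `generate()` does below, using the module globals defined above:

```python
# Mirrors the resolution logic in generate(); stage 3 chosen for illustration.
dim_desc = _DIMENSION_DESCRIPTIONS.get(3, _DIMENSION_DESCRIPTIONS[5])
system_prompt = VARIANT_META_PROMPT.format(dimension_descriptions=dim_desc)
# The "## Scoring Dimensions" section now lists moment_richness,
# timestamp_accuracy, content_type_correctness, summary_actionability,
# and plugin_normalization instead of the stage 5 synthesis dimensions.
```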
@@ -71,6 +99,9 @@ class PromptVariantGenerator:
         base_prompt: str,
         scores: ScoreResult,
         n: int = 2,
+        *,
+        format_markers: Sequence[str] | None = None,
+        stage: int = 5,
     ) -> list[str]:
         """Generate up to *n* valid prompt variants.
 
@@ -83,27 +114,48 @@ class PromptVariantGenerator:
         Parameters
         ----------
         base_prompt:
-            The current best synthesis prompt text.
+            The current best prompt text for the target stage.
         scores:
             ScoreResult from the most recent evaluation of *base_prompt*.
         n:
             Number of variants to attempt generating.
+        format_markers:
+            Override format markers for validation. When *None*, uses the
+            markers from ``STAGE_CONFIGS[stage]`` (falling back to stage 5
+            defaults for backward compat).
+        stage:
+            Pipeline stage number (2-5), used to select dimension
+            descriptions for the meta-prompt and default format markers.
 
         Returns
         -------
         list[str]
             Valid variant prompt strings (may be fewer than *n*).
         """
-        user_prompt = self._build_user_prompt(base_prompt, scores)
+        # Resolve format markers and dimensions for the target stage
+        if format_markers is not None:
+            markers = list(format_markers)
+        elif stage in STAGE_CONFIGS:
+            markers = STAGE_CONFIGS[stage].format_markers
+        else:
+            markers = _FORMAT_MARKERS
+
+        dimensions = STAGE_CONFIGS[stage].dimensions if stage in STAGE_CONFIGS else DIMENSIONS
+
+        # Build the system prompt with stage-appropriate dimension descriptions
+        dim_desc = _DIMENSION_DESCRIPTIONS.get(stage, _DIMENSION_DESCRIPTIONS[5])
+        system_prompt = VARIANT_META_PROMPT.format(dimension_descriptions=dim_desc)
+
+        user_prompt = self._build_user_prompt(base_prompt, scores, dimensions)
         # Identify which format markers are actually present in the base
-        required_markers = [m for m in _FORMAT_MARKERS if m in base_prompt]
+        required_markers = [m for m in markers if m in base_prompt]
 
         variants: list[str] = []
         for i in range(n):
-            logger.info("Generating variant %d/%d...", i + 1, n)
+            logger.info("Generating variant %d/%d (stage %d)...", i + 1, n, stage)
             try:
                 raw = self.client.complete(
-                    system_prompt=VARIANT_META_PROMPT,
+                    system_prompt=system_prompt,
                     user_prompt=user_prompt,
                     response_model=None,  # free-form text, not JSON
                     modality="chat",
@@ -127,11 +179,12 @@ class PromptVariantGenerator:
 
     # ── Internal helpers ──────────────────────────────────────────────────
 
-    def _build_user_prompt(self, base_prompt: str, scores: ScoreResult) -> str:
+    def _build_user_prompt(self, base_prompt: str, scores: ScoreResult, dimensions: list[str] | None = None) -> str:
         """Build the user message describing the current prompt and its scores."""
+        dims = dimensions or DIMENSIONS
         # Build per-dimension score lines, sorted worst-first
         dim_lines: list[str] = []
-        dim_scores = [(d, getattr(scores, d, 0.0)) for d in DIMENSIONS]
+        dim_scores = [(d, scores.scores.get(d, 0.0)) for d in dims]
         dim_scores.sort(key=lambda x: x[1])
 
         for dim, val in dim_scores:
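Putting the scorer and the generator together, a sketch of one optimization step for stage 3; the constructor signatures, the placeholder prompt, and the toy payloads are assumptions:

```python
# Sketch of a single optimize step; constructor args and payloads are assumed.
from pipeline.llm_client import LLMClient
from pipeline.quality.scorer import ScoreRunner
from pipeline.quality.variant_generator import PromptVariantGenerator

client = LLMClient()                        # assumed default-constructible
runner = ScoreRunner(client)
generator = PromptVariantGenerator(client)  # assumed single-arg constructor

segment = {"transcript": "...stage 3 input..."}           # toy input
output = [{"summary": "Cut 300 Hz to reduce boxiness"}]   # toy stage output
base_prompt = "...current stage 3 extraction prompt..."   # placeholder

baseline = runner.score_stage_output(stage=3, output_json=output, input_json=segment)
# Dimensions and format markers resolve from STAGE_CONFIGS[3] automatically.
variants = generator.generate(base_prompt, baseline, n=2, stage=3)
# The optimization loop would re-run stage 3 with each variant, score the
# results, and promote the best performer to the next iteration.
```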