feat: Refactored keyword_search to multi-token AND with cross-field mat…
- `backend/search_service.py`
- `backend/schemas.py`
- `backend/routers/search.py`
- `backend/tests/test_search.py`

GSD-Task: S01/T01
Parent: c344b8c670
Commit: 84e7a9906c

247 changed files with 714 additions and 20340 deletions
`.artifacts/feature-synthesis-chunking.md` (normal file, 111 lines)
@@ -0,0 +1,111 @@

# Feature: Stage 5 Synthesis Chunking for Large Category Groups

## Problem

Stage 5 synthesis sends all key moments for a given `(video, topic_category)` group to the LLM in a single call. When a video produces a large number of moments in one category, the prompt exceeds what the model can process into a valid structured response.

**Concrete failure:** COPYCATT's "Sound Design - Everything In 2 Hours Speedrun" (2,026 transcript segments) produced 198 moments classified as "Sound design" (175) / "Sound Design" (23 — casing inconsistency). The synthesis prompt for that category was ~42k tokens. The model (`fyn-llm-agent-think`, 128k context) accepted the prompt but returned only 5,407 completion tokens with `finish=stop` — valid JSON that was structurally incomplete, failing Pydantic `SynthesisResult` validation. The pipeline retried and failed identically each time.

The other 37 videos in the corpus (up to 930 segments, ~60 moments per category max) all synthesized successfully.

## Root Causes

Two independent issues compound into this failure:

### 1. No chunking in stage 5 synthesis

`stage5_synthesis()` in `backend/pipeline/stages.py` iterates over `groups[category]` and builds one prompt containing ALL moments for that category. There's no upper bound on how many moments go into a single LLM call.

**Location:** `stages.py` lines ~850-875 — the `for category, moment_group in groups.items()` loop builds the full `moments_text` without splitting.

### 2. Inconsistent category casing from stage 4
Stage 4 classification produces `"Sound design"` and `"Sound Design"` as separate categories for the same video. Stage 5 groups by exact string match, so they stay separate. Casing is not the root cause of the overload (even on its own, the 175-moment group is too large), but the inconsistency scatters related moments across near-duplicate categories, fragmenting what should be one canonical group.
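The exact-match grouping can be shown in a few lines. The sample moments below are invented for illustration; only `topic_category` matters here:

```python
from collections import defaultdict

# Invented sample data: three moments in one casing, one in the other.
moments = [{"topic_category": "Sound design"} for _ in range(3)]
moments.append({"topic_category": "Sound Design"})

groups = defaultdict(list)
for moment in moments:
    # Exact string match: casing variants land in separate groups.
    groups[moment["topic_category"]].append(moment)

print({k: len(v) for k, v in groups.items()})
# {'Sound design': 3, 'Sound Design': 1}
```

On the COPYCATT data, this same mechanism produces the 175 / 23 split described above.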
**Location:** Classification output stored in Redis at `chrysopedia:classification:{video_id}`. The `topic_category` values come directly from the LLM with no normalization.

## Proposed Changes

### Change 1: Chunked synthesis with merge pass

Split large category groups into chunks before sending to the LLM. Each chunk produces technique pages independently, then a lightweight merge step combines pages with overlapping topics.

**In `stage5_synthesis()` (`backend/pipeline/stages.py`):**

1. After grouping moments by category, check each group's size against a configurable threshold (e.g., `SYNTHESIS_CHUNK_SIZE = 30` moments).

2. Groups at or below the threshold: process as today — single LLM call.

3. Groups above the threshold: split into chunks of `SYNTHESIS_CHUNK_SIZE` moments, ordered by `start_time` (preserving chronological context). Each chunk gets its own synthesis LLM call, producing its own `SynthesisResult` with 1+ pages.

4. After all chunks for a category are processed, collect the resulting pages. Pages with the same or very similar slugs (e.g., Levenshtein distance < 3, or shared slug prefix before the creator suffix) should be merged. The merge is a second LLM call with a simpler prompt: "Here are N partial technique pages on the same topic from the same creator. Merge them into a single cohesive page, combining body sections, deduplicating signal chains and plugins, and writing a unified summary." This merge prompt is much smaller than the original 198-moment prompt because it takes synthesized prose as input, not raw moment data.

5. If no pages share slugs across chunks, keep them all — they represent genuinely distinct sub-topics the LLM identified within the category.
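Steps 1-5 can be condensed into two plain helpers. This is a minimal sketch: `chunk_moments` and `slug_groups` are hypothetical names, and the Levenshtein check is a textbook DP implementation standing in for whatever matcher is actually adopted.

```python
SYNTHESIS_CHUNK_SIZE = 30  # mirrors the proposed config default


def chunk_moments(moments, size=SYNTHESIS_CHUNK_SIZE):
    """Split a category group into chronologically ordered chunks (step 3)."""
    ordered = sorted(moments, key=lambda m: m["start_time"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def levenshtein(a, b):
    """Textbook edit-distance DP; slugs are short, so this stays cheap."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def slug_groups(pages, max_distance=3):
    """Bucket partial pages whose slugs are near-duplicates (step 4).

    Buckets with a single page are kept as-is (step 5); buckets with
    two or more pages go to the merge LLM call.
    """
    groups = []
    for page in pages:
        for group in groups:
            if levenshtein(page["slug"], group[0]["slug"]) < max_distance:
                group.append(page)
                break
        else:
            groups.append([page])
    return groups
```

For the COPYCATT case, 198 moments would yield 7 chunks (6 × 30 + 18), and hypothetical near-duplicate slugs such as `reese-bass-copycatt` / `reese-basses-copycatt` (distance 2) would land in the same merge bucket.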
**New config setting in `backend/config.py`:**

```python
synthesis_chunk_size: int = 30  # Max moments per synthesis LLM call
```

**New prompt file:** `prompts/stage5_merge.txt` — instructions for combining partial technique pages into a unified page. Much simpler than the full synthesis prompt since it operates on already-synthesized prose rather than raw moments.

**Token budget consideration:** 30 moments × ~200 tokens each (title + summary + metadata + transcript excerpt) = ~6k tokens of moment data + ~2k system prompt = ~8k input tokens. Well within what the model handles reliably. The merge call takes 2-4 partial pages of prose (~3-5k tokens total) — also very manageable.
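The budget above relies on a token estimator; the common ~4 characters/token heuristic reproduces the same arithmetic. The implementation below is an assumed stand-in, not the project's actual `estimate_tokens` helper:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 chars/token rule of thumb.

    Assumed stand-in for the project's helper; real counts depend on
    the model's tokenizer and can differ noticeably either way.
    """
    return max(1, len(text) // 4)


# Back-of-envelope check matching the budget above:
# 30 moments x ~800 chars (~200 tokens) each = ~6k tokens of moment data.
assert estimate_tokens("x" * 800 * 30) == 6000
```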
### Change 2: Category casing normalization in stage 4

Normalize `topic_category` values before storing classification results in Redis.

**In `stage4_classification()` (`backend/pipeline/stages.py`):**

After parsing the `ClassificationResult` from the LLM, apply title-case normalization to each moment's `topic_category`:

```python
category = cls_result.topic_category.strip().title()
# "Sound design" -> "Sound Design"
# "sound design" -> "Sound Design"
# "SOUND DESIGN" -> "Sound Design"
```
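One caveat worth noting: `str.title()` uppercases the first letter of every alphabetic run and lowercases the rest, so it also re-cases acronyms and stylized names. The examples below are illustrative and worth checking against the category vocabulary the classifier actually emits:

```python
# The intended normalizations behave as expected:
assert "Sound design".strip().title() == "Sound Design"
assert "SOUND DESIGN".strip().title() == "Sound Design"

# But acronyms and hyphenated styles get re-cased too:
assert "EQ tips".title() == "Eq Tips"
assert "lo-fi drums".title() == "Lo-Fi Drums"
```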
This is a one-line fix. It prevents the "Sound design" / "Sound Design" split: for the COPYCATT video, the 175 + 23 moments collapse into a single normalized "Sound Design" group of 198. That group is still too large for one call without chunking, but normalization eliminates the class of bug where moments scatter across near-duplicate categories.
**Also apply in stage 5 as a safety net:** When building the `groups` dict, normalize the category key:

```python
category = cls_info.get("topic_category", "Uncategorized").strip().title()
```

This handles data already in Redis from prior stage 4 runs without requiring reprocessing.

### Change 3: Estimated token pre-check before LLM call

Before making the synthesis LLM call, estimate the total tokens (prompt + expected output) and log a warning if it exceeds a safety threshold. This doesn't block the call — chunking handles the splitting — but it provides observability for tuning `SYNTHESIS_CHUNK_SIZE`.

**In the synthesis loop, after building `user_prompt`:**

```python
estimated_input = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
if estimated_input > 15000:
    logger.warning(
        "Stage 5: Large synthesis input for category '%s' video_id=%s: "
        "~%d input tokens, %d moments. Consider reducing SYNTHESIS_CHUNK_SIZE.",
        category, video_id, estimated_input, len(moment_group),
    )
```
## Files to Modify

| File | Change |
|------|--------|
| `backend/pipeline/stages.py` | Chunk logic in `stage5_synthesis()`, casing normalization in `stage4_classification()` and `stage5_synthesis()` grouping |
| `backend/pipeline/llm_client.py` | No changes needed — `estimate_max_tokens()` already handles per-call estimation |
| `backend/config.py` | Add `synthesis_chunk_size: int = 30` setting |
| `prompts/stage5_merge.txt` | New prompt for merging partial technique pages |
| `backend/schemas.py` | No changes — `SynthesisResult` schema works for both chunk and merge calls |

## Testing

1. **Unit test:** Mock the LLM and verify that a 90-moment group gets split into 3 chunks of 30, each producing a `SynthesisResult`, followed by a merge call.

2. **Integration test:** Retrigger the COPYCATT "Sound Design - Everything In 2 Hours Speedrun" video and confirm it completes stage 5 without `LLMTruncationError`.

3. **Regression test:** Retrigger a small video (e.g., Skope "Understanding Waveshapers", 9 moments) and confirm behavior is unchanged — no chunking triggered, same output.
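The unit test in item 1 hinges on chunk boundaries, which can be pinned down without any LLM involvement. A minimal sketch, with `chunk_moments` as a hypothetical name for the splitting helper:

```python
def chunk_moments(moments, size=30):
    # Hypothetical helper: chronological chunks of at most `size` moments.
    ordered = sorted(moments, key=lambda m: m["start_time"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def test_chunk_boundaries():
    make = lambda n: [{"start_time": float(i)} for i in range(n)]
    assert len(chunk_moments(make(30))) == 1                 # at threshold: one call
    assert [len(c) for c in chunk_moments(make(31))] == [30, 1]
    assert [len(c) for c in chunk_moments(make(90))] == [30, 30, 30]


test_chunk_boundaries()
```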
## Rollback

`SYNTHESIS_CHUNK_SIZE` can be set very high (e.g., 9999) to effectively disable chunking without a code change. The casing normalization is backward-compatible — it only affects new pipeline runs.
@@ -1,28 +0,0 @@
# GSD State

**Active Milestone:** M011
**Active Slice:** None
**Phase:** complete
**Requirements Status:** 0 active · 0 validated · 0 deferred · 0 out of scope

## Milestone Registry

- ✅ **M001:** Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
- ✅ **M002:**
- ✅ **M003:**
- ✅ **M004:**
- ✅ **M005:**
- ✅ **M006:**
- ✅ **M007:**
- ✅ **M008:**
- ✅ **M009:** Homepage & First Impression
- ✅ **M010:** Discovery, Navigation & Visual Identity
- ✅ **M011:**

## Recent Decisions

- None recorded

## Blockers

- None

## Next Action

All milestones complete.
File diff suppressed because one or more lines are too long
Some files were not shown because too many files have changed in this diff.