feat: Refactored keyword_search to multi-token AND with cross-field mat…
- `backend/search_service.py`
- `backend/schemas.py`
- `backend/routers/search.py`
- `backend/tests/test_search.py`

GSD-Task: S01/T01
Parent: c344b8c670
Commit: 84e7a9906c

247 changed files with 714 additions and 20340 deletions
`.artifacts/feature-synthesis-chunking.md` (normal file, 111 lines)
@@ -0,0 +1,111 @@

# Feature: Stage 5 Synthesis Chunking for Large Category Groups

## Problem

Stage 5 synthesis sends all key moments for a given `(video, topic_category)` group to the LLM in a single call. When a video produces a large number of moments in one category, the prompt exceeds what the model can process into a valid structured response.

**Concrete failure:** COPYCATT's "Sound Design - Everything In 2 Hours Speedrun" (2,026 transcript segments) produced 198 moments classified as "Sound design" (175) / "Sound Design" (23 — casing inconsistency). The synthesis prompt for that category was ~42k tokens. The model (`fyn-llm-agent-think`, 128k context) accepted the prompt but returned only 5,407 completion tokens with `finish=stop` — valid JSON that was structurally incomplete, failing Pydantic `SynthesisResult` validation. The pipeline retried and failed identically each time.

The other 37 videos in the corpus (up to 930 segments, ~60 moments per category max) all synthesized successfully.

## Root Causes

Two independent issues compound into this failure:

### 1. No chunking in stage 5 synthesis

`stage5_synthesis()` in `backend/pipeline/stages.py` iterates over `groups[category]` and builds one prompt containing ALL moments for that category. There's no upper bound on how many moments go into a single LLM call.

**Location:** `stages.py` lines ~850-875 — the `for category, moment_group in groups.items()` loop builds the full `moments_text` without splitting.

### 2. Inconsistent category casing from stage 4
Stage 4 classification produces `"Sound design"` and `"Sound Design"` as separate categories for the same video. Stage 5 groups by exact string match, so they stay separate. Casing is not the root cause of the overload (even on its own, the 175-moment group is too large), but the inconsistency scatters related moments across near-duplicate categories, fragmenting what should be one canonical group.
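The exact-match grouping can be shown in a few lines. The sample moments below are invented for illustration; only `topic_category` matters here:

```python
from collections import defaultdict

# Invented sample data: three moments in one casing, one in the other.
moments = [{"topic_category": "Sound design"} for _ in range(3)]
moments.append({"topic_category": "Sound Design"})

groups = defaultdict(list)
for moment in moments:
    # Exact string match: casing variants land in separate groups.
    groups[moment["topic_category"]].append(moment)

print({k: len(v) for k, v in groups.items()})
# {'Sound design': 3, 'Sound Design': 1}
```

On the COPYCATT data, this same mechanism produces the 175 / 23 split described above.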
**Location:** Classification output stored in Redis at `chrysopedia:classification:{video_id}`. The `topic_category` values come directly from the LLM with no normalization.

## Proposed Changes

### Change 1: Chunked synthesis with merge pass

Split large category groups into chunks before sending to the LLM. Each chunk produces technique pages independently, then a lightweight merge step combines pages with overlapping topics.

**In `stage5_synthesis()` (`backend/pipeline/stages.py`):**

1. After grouping moments by category, check each group's size against a configurable threshold (e.g., `SYNTHESIS_CHUNK_SIZE = 30` moments).

2. Groups at or below the threshold: process as today — single LLM call.

3. Groups above the threshold: split into chunks of `SYNTHESIS_CHUNK_SIZE` moments, ordered by `start_time` (preserving chronological context). Each chunk gets its own synthesis LLM call, producing its own `SynthesisResult` with 1+ pages.

4. After all chunks for a category are processed, collect the resulting pages. Pages with the same or very similar slugs (e.g., Levenshtein distance < 3, or shared slug prefix before the creator suffix) should be merged. The merge is a second LLM call with a simpler prompt: "Here are N partial technique pages on the same topic from the same creator. Merge them into a single cohesive page, combining body sections, deduplicating signal chains and plugins, and writing a unified summary." This merge prompt is much smaller than the original 198-moment prompt because it takes synthesized prose as input, not raw moment data.

5. If no pages share slugs across chunks, keep them all — they represent genuinely distinct sub-topics the LLM identified within the category.
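Steps 1-5 can be condensed into two plain helpers. This is a minimal sketch: `chunk_moments` and `slug_groups` are hypothetical names, and the Levenshtein check is a textbook DP implementation standing in for whatever matcher is actually adopted.

```python
SYNTHESIS_CHUNK_SIZE = 30  # mirrors the proposed config default


def chunk_moments(moments, size=SYNTHESIS_CHUNK_SIZE):
    """Split a category group into chronologically ordered chunks (step 3)."""
    ordered = sorted(moments, key=lambda m: m["start_time"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def levenshtein(a, b):
    """Textbook edit-distance DP; slugs are short, so this stays cheap."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def slug_groups(pages, max_distance=3):
    """Bucket partial pages whose slugs are near-duplicates (step 4).

    Buckets with a single page are kept as-is (step 5); buckets with
    two or more pages go to the merge LLM call.
    """
    groups = []
    for page in pages:
        for group in groups:
            if levenshtein(page["slug"], group[0]["slug"]) < max_distance:
                group.append(page)
                break
        else:
            groups.append([page])
    return groups
```

For the COPYCATT case, 198 moments would yield 7 chunks (6 × 30 + 18), and hypothetical near-duplicate slugs such as `reese-bass-copycatt` / `reese-basses-copycatt` (distance 2) would land in the same merge bucket.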
**New config setting in `backend/config.py`:**

```python
synthesis_chunk_size: int = 30  # Max moments per synthesis LLM call
```

**New prompt file:** `prompts/stage5_merge.txt` — instructions for combining partial technique pages into a unified page. Much simpler than the full synthesis prompt since it operates on already-synthesized prose rather than raw moments.

**Token budget consideration:** 30 moments × ~200 tokens each (title + summary + metadata + transcript excerpt) = ~6k tokens of moment data + ~2k system prompt = ~8k input tokens. Well within what the model handles reliably. The merge call takes 2-4 partial pages of prose (~3-5k tokens total) — also very manageable.
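The budget above relies on a token estimator; the common ~4 characters/token heuristic reproduces the same arithmetic. The implementation below is an assumed stand-in, not the project's actual `estimate_tokens` helper:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the ~4 chars/token rule of thumb.

    Assumed stand-in for the project's helper; real counts depend on
    the model's tokenizer and can differ noticeably either way.
    """
    return max(1, len(text) // 4)


# Back-of-envelope check matching the budget above:
# 30 moments x ~800 chars (~200 tokens) each = ~6k tokens of moment data.
assert estimate_tokens("x" * 800 * 30) == 6000
```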
### Change 2: Category casing normalization in stage 4

Normalize `topic_category` values before storing classification results in Redis.

**In `stage4_classification()` (`backend/pipeline/stages.py`):**

After parsing the `ClassificationResult` from the LLM, apply title-case normalization to each moment's `topic_category`:

```python
category = cls_result.topic_category.strip().title()
# "Sound design" -> "Sound Design"
# "sound design" -> "Sound Design"
# "SOUND DESIGN" -> "Sound Design"
```
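One caveat worth noting: `str.title()` uppercases the first letter of every alphabetic run and lowercases the rest, so it also re-cases acronyms and stylized names. The examples below are illustrative and worth checking against the category vocabulary the classifier actually emits:

```python
# The intended normalizations behave as expected:
assert "Sound design".strip().title() == "Sound Design"
assert "SOUND DESIGN".strip().title() == "Sound Design"

# But acronyms and hyphenated styles get re-cased too:
assert "EQ tips".title() == "Eq Tips"
assert "lo-fi drums".title() == "Lo-Fi Drums"
```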
This is a one-line fix. It prevents the "Sound design" / "Sound Design" split: for the COPYCATT video, the 175 + 23 moments collapse into a single normalized "Sound Design" group of 198. That group is still too large for one call without chunking, but normalization eliminates the class of bug where moments scatter across near-duplicate categories.
**Also apply in stage 5 as a safety net:** When building the `groups` dict, normalize the category key:

```python
category = cls_info.get("topic_category", "Uncategorized").strip().title()
```

This handles data already in Redis from prior stage 4 runs without requiring reprocessing.

### Change 3: Estimated token pre-check before LLM call

Before making the synthesis LLM call, estimate the total tokens (prompt + expected output) and log a warning if it exceeds a safety threshold. This doesn't block the call — chunking handles the splitting — but it provides observability for tuning `SYNTHESIS_CHUNK_SIZE`.

**In the synthesis loop, after building `user_prompt`:**

```python
estimated_input = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
if estimated_input > 15000:
    logger.warning(
        "Stage 5: Large synthesis input for category '%s' video_id=%s: "
        "~%d input tokens, %d moments. Consider reducing SYNTHESIS_CHUNK_SIZE.",
        category, video_id, estimated_input, len(moment_group),
    )
```
## Files to Modify

| File | Change |
|------|--------|
| `backend/pipeline/stages.py` | Chunk logic in `stage5_synthesis()`, casing normalization in `stage4_classification()` and `stage5_synthesis()` grouping |
| `backend/pipeline/llm_client.py` | No changes needed — `estimate_max_tokens()` already handles per-call estimation |
| `backend/config.py` | Add `synthesis_chunk_size: int = 30` setting |
| `prompts/stage5_merge.txt` | New prompt for merging partial technique pages |
| `backend/schemas.py` | No changes — `SynthesisResult` schema works for both chunk and merge calls |

## Testing

1. **Unit test:** Mock the LLM and verify that a 90-moment group gets split into 3 chunks of 30, each producing a `SynthesisResult`, followed by a merge call.

2. **Integration test:** Retrigger the COPYCATT "Sound Design - Everything In 2 Hours Speedrun" video and confirm it completes stage 5 without `LLMTruncationError`.

3. **Regression test:** Retrigger a small video (e.g., Skope "Understanding Waveshapers", 9 moments) and confirm behavior is unchanged — no chunking triggered, same output.
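The unit test in item 1 hinges on chunk boundaries, which can be pinned down without any LLM involvement. A minimal sketch, with `chunk_moments` as a hypothetical name for the splitting helper:

```python
def chunk_moments(moments, size=30):
    # Hypothetical helper: chronological chunks of at most `size` moments.
    ordered = sorted(moments, key=lambda m: m["start_time"])
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def test_chunk_boundaries():
    make = lambda n: [{"start_time": float(i)} for i in range(n)]
    assert len(chunk_moments(make(30))) == 1                 # at threshold: one call
    assert [len(c) for c in chunk_moments(make(31))] == [30, 1]
    assert [len(c) for c in chunk_moments(make(90))] == [30, 30, 30]


test_chunk_boundaries()
```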
## Rollback

`SYNTHESIS_CHUNK_SIZE` can be set very high (e.g., 9999) to effectively disable chunking without a code change. The casing normalization is backward-compatible — it only affects new pipeline runs.
@@ -1,28 +0,0 @@
# GSD State

**Active Milestone:** M011
**Active Slice:** None
**Phase:** complete
**Requirements Status:** 0 active · 0 validated · 0 deferred · 0 out of scope

## Milestone Registry

- ✅ **M001:** Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
- ✅ **M002:**
- ✅ **M003:**
- ✅ **M004:**
- ✅ **M005:**
- ✅ **M006:**
- ✅ **M007:**
- ✅ **M008:**
- ✅ **M009:** Homepage & First Impression
- ✅ **M010:** Discovery, Navigation & Visual Identity
- ✅ **M011:**

## Recent Decisions

- None recorded

## Blockers

- None

## Next Action

All milestones complete.
File diff suppressed because one or more lines are too long
Some files were not shown because too many files have changed in this diff.