fix: restore complete project tree from ub01 canonical state

Auto-mode commit 7aa33cd accidentally deleted 78 files (14,814 lines) during M005
execution. Subsequent commits rebuilt some frontend files but backend/, alembic/,
tests/, whisper/, docker configs, and prompts were never restored in this repo.

This commit restores the full project tree by syncing from ub01's working directory,
which has all M001-M007 features running in production containers.

Restored: backend/ (config, models, routers, database, redis, search_service, worker),
alembic/ (6 migrations), docker/ (Dockerfiles, nginx, compose), prompts/ (4 stages),
tests/, whisper/, README.md, .env.example, chrysopedia-spec.md
jlightner 2026-03-31 02:10:41 +00:00
parent f6dcc80dbf
commit 4b0914b12b
120 changed files with 12812 additions and 163 deletions

.env.example (Normal file, 52 lines)

@@ -0,0 +1,52 @@
# ─── Chrysopedia Environment Variables ───
# Copy to .env and fill in secrets before docker compose up
# PostgreSQL
POSTGRES_USER=chrysopedia
POSTGRES_PASSWORD=changeme
POSTGRES_DB=chrysopedia
# Redis (Celery broker) — container-internal, no secret needed
REDIS_URL=redis://chrysopedia-redis:6379/0
# LLM endpoint (OpenAI-compatible — OpenWebUI on FYN DGX)
LLM_API_URL=https://chat.forgetyour.name/api/v1
LLM_API_KEY=sk-changeme
LLM_MODEL=fyn-llm-agent-chat
LLM_FALLBACK_URL=https://chat.forgetyour.name/api/v1
LLM_FALLBACK_MODEL=fyn-llm-agent-chat
# Per-stage LLM model overrides (optional — defaults to LLM_MODEL)
# Modality: "chat" = standard JSON mode, "thinking" = reasoning model (strips <think> tags)
# Stages 2 (segmentation) and 4 (classification) are mechanical — use fast chat model
# Stages 3 (extraction) and 5 (synthesis) need reasoning — use thinking model
LLM_STAGE2_MODEL=fyn-llm-agent-chat
LLM_STAGE2_MODALITY=chat
LLM_STAGE3_MODEL=fyn-llm-agent-think
LLM_STAGE3_MODALITY=thinking
LLM_STAGE4_MODEL=fyn-llm-agent-chat
LLM_STAGE4_MODALITY=chat
LLM_STAGE5_MODEL=fyn-llm-agent-think
LLM_STAGE5_MODALITY=thinking
# Max tokens for LLM responses (OpenWebUI defaults to 1000 — pipeline needs much more)
LLM_MAX_TOKENS=65536
# Embedding endpoint (Ollama container in the compose stack)
EMBEDDING_API_URL=http://chrysopedia-ollama:11434/v1
EMBEDDING_MODEL=nomic-embed-text
# Qdrant (container-internal)
QDRANT_URL=http://chrysopedia-qdrant:6333
QDRANT_COLLECTION=chrysopedia
# Application
APP_ENV=production
APP_LOG_LEVEL=info
# File storage paths (inside container, bind-mounted to /vmPool/r/services/chrysopedia_data)
TRANSCRIPT_STORAGE_PATH=/data/transcripts
VIDEO_METADATA_PATH=/data/video_meta
# Review mode toggle (true = moments require admin review before publishing)
REVIEW_MODE=true
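The per-stage overrides above fall back to `LLM_MODEL` when unset, and a stage with `MODALITY=thinking` implies stripping `<think>` blocks from the model's response before parsing. A minimal sketch of that resolution logic, assuming the env var names from this file (the helper functions themselves are hypothetical, not taken from the restored backend):

```python
import os
import re


def resolve_stage_llm(stage: int) -> tuple[str, str]:
    """Return (model, modality) for a pipeline stage, falling back to
    LLM_MODEL and "chat" when no per-stage override is set."""
    model = os.environ.get(f"LLM_STAGE{stage}_MODEL") or os.environ["LLM_MODEL"]
    modality = os.environ.get(f"LLM_STAGE{stage}_MODALITY") or "chat"
    return model, modality


def strip_think_tags(text: str) -> str:
    """Thinking models emit <think>...</think> reasoning blocks;
    drop them so only the JSON payload remains."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
```

With the example values above, stage 3 resolves to `fyn-llm-agent-think` in thinking mode, while a stage with no override stays on the default chat model.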


@@ -1,18 +1,18 @@
 # GSD State
-**Active Milestone:** M007: M007
-**Active Slice:** S02: Debug Payload Viewer — Inline View, Copy, and Export in Admin UI
-**Phase:** evaluating-gates
+**Active Milestone:** M007: M007:
+**Active Slice:** None
+**Phase:** complete
 **Requirements Status:** 0 active · 0 validated · 0 deferred · 0 out of scope
 ## Milestone Registry
 - ✅ **M001:** Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
-- ✅ **M002:** M002: Chrysopedia Deployment — GitHub, ub01 Docker Stack, and Production Wiring
-- ✅ **M003:** M003: Domain + DNS + Per-Stage LLM Model Routing
-- ✅ **M004:** M004: UI Polish, Bug Fixes, Technique Page Redesign, and Article Versioning
-- ✅ **M005:** M005: Pipeline Dashboard, Technique Page Redesign, Key Moment Cards
-- ✅ **M006:** M006: Admin Nav, Pipeline Log Views, Commit SHA, Tag Polish, Topics Redesign, Footer
-- 🔄 **M007:** M007
+- ✅ **M002:** M002:
+- ✅ **M003:** M003:
+- ✅ **M004:** M004:
+- ✅ **M005:** M005:
+- ✅ **M006:** M006:
+- **M007:** M007:
 ## Recent Decisions
 - None recorded
@@ -21,4 +21,4 @@
 - None
 ## Next Action
-Evaluate 3 quality gate(s) for S02 before execution.
+All milestones complete.

File diff suppressed because one or more lines are too long (22 files)


@@ -1,7 +0,0 @@
-{
-  "pid": 2052340,
-  "startedAt": "2026-03-30T18:59:38.188Z",
-  "unitType": "execute-task",
-  "unitId": "M007/S02/T01",
-  "unitStartedAt": "2026-03-30T18:59:38.188Z"
-}


@@ -388,3 +388,114 @@
{"ts":"2026-03-30T18:59:38.123Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":1,"eventType":"iteration-start","data":{"iteration":10}}
{"ts":"2026-03-30T18:59:38.151Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S02/T01"}}
{"ts":"2026-03-30T18:59:38.165Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S02/T01"}}
{"ts":"2026-03-30T19:07:23.525Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S02/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":3}}
{"ts":"2026-03-30T19:07:28.921Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":1,"eventType":"iteration-start","data":{"iteration":11}}
{"ts":"2026-03-30T19:07:28.968Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S02"}}
{"ts":"2026-03-30T19:07:28.982Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S02"}}
{"ts":"2026-03-30T19:10:28.704Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":3}}
{"ts":"2026-03-30T19:10:28.809Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":5,"eventType":"iteration-end","data":{"iteration":11}}
{"ts":"2026-03-30T19:10:28.810Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":1,"eventType":"iteration-start","data":{"iteration":12}}
{"ts":"2026-03-30T19:10:28.848Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:10:28.866Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:12:58.159Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":3}}
{"ts":"2026-03-30T19:12:58.261Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":5,"eventType":"iteration-end","data":{"iteration":12}}
{"ts":"2026-03-30T19:12:58.261Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":1,"eventType":"iteration-start","data":{"iteration":13}}
{"ts":"2026-03-30T19:12:58.281Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:12:58.291Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:15:08.081Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":3}}
{"ts":"2026-03-30T19:15:08.184Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":5,"eventType":"iteration-end","data":{"iteration":13}}
{"ts":"2026-03-30T19:15:08.185Z","flowId":"5fc6dd58-03fd-4861-a7a4-083a1c4964a8","seq":1,"eventType":"iteration-start","data":{"iteration":14}}
{"ts":"2026-03-30T19:15:08.212Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":1,"eventType":"iteration-start","data":{"iteration":15}}
{"ts":"2026-03-30T19:15:08.244Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S03/T01"}}
{"ts":"2026-03-30T19:15:08.259Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S03/T01"}}
{"ts":"2026-03-30T19:17:47.626Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S03/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":3}}
{"ts":"2026-03-30T19:17:47.868Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":5,"eventType":"iteration-end","data":{"iteration":15}}
{"ts":"2026-03-30T19:17:47.869Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":1,"eventType":"iteration-start","data":{"iteration":16}}
{"ts":"2026-03-30T19:17:47.902Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S03/T02"}}
{"ts":"2026-03-30T19:17:47.920Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S03/T02"}}
{"ts":"2026-03-30T19:24:39.796Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S03/T02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":3}}
{"ts":"2026-03-30T19:24:39.954Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":5,"eventType":"iteration-end","data":{"iteration":16}}
{"ts":"2026-03-30T19:24:39.954Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":1,"eventType":"iteration-start","data":{"iteration":17}}
{"ts":"2026-03-30T19:24:40.081Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:24:40.099Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:26:38.422Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":3}}
{"ts":"2026-03-30T19:26:38.524Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":5,"eventType":"iteration-end","data":{"iteration":17}}
{"ts":"2026-03-30T19:26:38.524Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":1,"eventType":"iteration-start","data":{"iteration":18}}
{"ts":"2026-03-30T19:26:38.665Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:26:38.679Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:29:03.963Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":3}}
{"ts":"2026-03-30T19:29:04.064Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":5,"eventType":"iteration-end","data":{"iteration":18}}
{"ts":"2026-03-30T19:29:04.064Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":1,"eventType":"iteration-start","data":{"iteration":19}}
{"ts":"2026-03-30T19:29:04.160Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:29:04.171Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:31:01.891Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":3}}
{"ts":"2026-03-30T19:31:01.994Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":5,"eventType":"iteration-end","data":{"iteration":19}}
{"ts":"2026-03-30T19:31:01.994Z","flowId":"49a2c337-a403-42ab-b778-5b45bcd525dd","seq":1,"eventType":"iteration-start","data":{"iteration":20}}
{"ts":"2026-03-30T19:31:02.112Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":1,"eventType":"iteration-start","data":{"iteration":21}}
{"ts":"2026-03-30T19:31:02.216Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S04/T01"}}
{"ts":"2026-03-30T19:31:02.226Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S04/T01"}}
{"ts":"2026-03-30T19:34:11.113Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S04/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":3}}
{"ts":"2026-03-30T19:34:11.315Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":1,"eventType":"iteration-start","data":{"iteration":22}}
{"ts":"2026-03-30T19:34:11.422Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S04/T02"}}
{"ts":"2026-03-30T19:34:11.433Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S04/T02"}}
{"ts":"2026-03-30T19:36:47.725Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S04/T02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":3}}
{"ts":"2026-03-30T19:36:47.886Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":1,"eventType":"iteration-start","data":{"iteration":23}}
{"ts":"2026-03-30T19:36:47.959Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:36:47.970Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:37:54.150Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":3}}
{"ts":"2026-03-30T19:37:54.252Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":5,"eventType":"iteration-end","data":{"iteration":23}}
{"ts":"2026-03-30T19:37:54.252Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":1,"eventType":"iteration-start","data":{"iteration":24}}
{"ts":"2026-03-30T19:37:54.362Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:37:54.371Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:39:29.263Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":3}}
{"ts":"2026-03-30T19:39:29.365Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":5,"eventType":"iteration-end","data":{"iteration":24}}
{"ts":"2026-03-30T19:39:29.365Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":1,"eventType":"iteration-start","data":{"iteration":25}}
{"ts":"2026-03-30T19:39:29.507Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:39:29.525Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:40:07.521Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":3}}
{"ts":"2026-03-30T19:40:07.641Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":5,"eventType":"iteration-end","data":{"iteration":25}}
{"ts":"2026-03-30T19:40:07.641Z","flowId":"bf2e38c7-8617-4669-b7fa-99bf7bcf95e7","seq":1,"eventType":"iteration-start","data":{"iteration":26}}
{"ts":"2026-03-30T19:40:07.723Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":1,"eventType":"iteration-start","data":{"iteration":27}}
{"ts":"2026-03-30T19:40:07.818Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S05/T01"}}
{"ts":"2026-03-30T19:40:07.829Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S05/T01"}}
{"ts":"2026-03-30T19:41:40.986Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S05/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":3}}
{"ts":"2026-03-30T19:41:41.202Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":5,"eventType":"iteration-end","data":{"iteration":27}}
{"ts":"2026-03-30T19:41:41.203Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":1,"eventType":"iteration-start","data":{"iteration":28}}
{"ts":"2026-03-30T19:41:41.340Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:41:41.356Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:42:41.642Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":3}}
{"ts":"2026-03-30T19:42:41.744Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":5,"eventType":"iteration-end","data":{"iteration":28}}
{"ts":"2026-03-30T19:42:41.745Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":1,"eventType":"iteration-start","data":{"iteration":29}}
{"ts":"2026-03-30T19:42:41.878Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:42:41.895Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:44:50.594Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":3}}
{"ts":"2026-03-30T19:44:50.696Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":5,"eventType":"iteration-end","data":{"iteration":29}}
{"ts":"2026-03-30T19:44:50.696Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":1,"eventType":"iteration-start","data":{"iteration":30}}
{"ts":"2026-03-30T19:44:50.771Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:44:50.779Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:46:02.833Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":3}}
{"ts":"2026-03-30T19:46:02.935Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":5,"eventType":"iteration-end","data":{"iteration":30}}
{"ts":"2026-03-30T19:46:02.935Z","flowId":"070dddd6-32d6-4439-9287-e35a3c12423a","seq":1,"eventType":"iteration-start","data":{"iteration":31}}
{"ts":"2026-03-30T19:46:03.073Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":1,"eventType":"iteration-start","data":{"iteration":32}}
{"ts":"2026-03-30T19:46:03.212Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S06/T01"}}
{"ts":"2026-03-30T19:46:03.228Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S06/T01"}}
{"ts":"2026-03-30T19:48:29.975Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S06/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":3}}
{"ts":"2026-03-30T19:48:30.283Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":5,"eventType":"iteration-end","data":{"iteration":32}}
{"ts":"2026-03-30T19:48:30.283Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":1,"eventType":"iteration-start","data":{"iteration":33}}
{"ts":"2026-03-30T19:48:30.402Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:48:30.414Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:49:21.353Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":3}}
{"ts":"2026-03-30T19:49:21.455Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":5,"eventType":"iteration-end","data":{"iteration":33}}
{"ts":"2026-03-30T19:49:21.455Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":1,"eventType":"iteration-start","data":{"iteration":34}}
{"ts":"2026-03-30T19:49:21.575Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":2,"eventType":"dispatch-match","rule":"validating-milestone → validate-milestone","data":{"unitType":"validate-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:49:21.589Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":3,"eventType":"unit-start","data":{"unitType":"validate-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:51:17.420Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":4,"eventType":"unit-end","data":{"unitType":"validate-milestone","unitId":"M007","status":"completed","artifactVerified":true},"causedBy":{"flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":3}}
{"ts":"2026-03-30T19:51:17.522Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":5,"eventType":"iteration-end","data":{"iteration":34}}
{"ts":"2026-03-30T19:51:17.522Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":1,"eventType":"iteration-start","data":{"iteration":35}}
{"ts":"2026-03-30T19:51:17.712Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":2,"eventType":"dispatch-match","rule":"completing-milestone → complete-milestone","data":{"unitType":"complete-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:51:17.729Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":3,"eventType":"unit-start","data":{"unitType":"complete-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:53:11.667Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":4,"eventType":"unit-end","data":{"unitType":"complete-milestone","unitId":"M007","status":"completed","artifactVerified":true},"causedBy":{"flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":3}}
{"ts":"2026-03-30T19:53:11.849Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":5,"eventType":"iteration-end","data":{"iteration":35}}
{"ts":"2026-03-30T19:53:11.849Z","flowId":"04e80f1b-8d6e-4e44-9dcf-bc7a619cd7f3","seq":1,"eventType":"iteration-start","data":{"iteration":36}}
{"ts":"2026-03-30T19:53:11.949Z","flowId":"2d8e4e33-914e-476e-bdd7-1d19ae05fe36","seq":0,"eventType":"worktree-merge-start","data":{"milestoneId":"M007","mode":"none"}}
{"ts":"2026-03-30T19:53:12.018Z","flowId":"04e80f1b-8d6e-4e44-9dcf-bc7a619cd7f3","seq":2,"eventType":"terminal","data":{"reason":"milestone-complete","milestoneId":"M007"}}


@@ -4317,6 +4317,929 @@
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S02/T01",
"model": "claude-opus-4-6",
"startedAt": 1774897178165,
"finishedAt": 1774897643382,
"tokens": {
"input": 75,
"output": 15023,
"cacheRead": 5881554,
"cacheWrite": 56779,
"total": 5953431
},
"cost": 3.6715957500000007,
"toolCalls": 72,
"assistantMessages": 70,
"userMessages": 0,
"apiRequests": 70,
"promptCharCount": 11242,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-slice",
"id": "M007/S02",
"model": "claude-opus-4-6",
"startedAt": 1774897648982,
"finishedAt": 1774897828586,
"tokens": {
"input": 23,
"output": 5166,
"cacheRead": 1522090,
"cacheWrite": 27172,
"total": 1554451
},
"cost": 1.0601349999999998,
"toolCalls": 24,
"assistantMessages": 21,
"userMessages": 0,
"apiRequests": 21,
"promptCharCount": 34491,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "research-slice",
"id": "M007/S03",
"model": "claude-opus-4-6",
"startedAt": 1774897828866,
"finishedAt": 1774897978047,
"tokens": {
"input": 20,
"output": 4934,
"cacheRead": 1307100,
"cacheWrite": 26644,
"total": 1338698
},
"cost": 0.943525,
"toolCalls": 28,
"assistantMessages": 18,
"userMessages": 0,
"apiRequests": 18,
"promptCharCount": 24967,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "plan-slice",
"id": "M007/S03",
"model": "claude-opus-4-6",
"startedAt": 1774897978291,
"finishedAt": 1774898107975,
"tokens": {
"input": 11,
"output": 6149,
"cacheRead": 695356,
"cacheWrite": 22742,
"total": 724258
},
"cost": 0.6435955,
"toolCalls": 17,
"assistantMessages": 10,
"userMessages": 0,
"apiRequests": 10,
"promptCharCount": 34934,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S03/T01",
"model": "claude-opus-4-6",
"startedAt": 1774898108259,
"finishedAt": 1774898267492,
"tokens": {
"input": 23,
"output": 7314,
"cacheRead": 1487428,
"cacheWrite": 18545,
"total": 1513310
},
"cost": 1.04258525,
"toolCalls": 24,
"assistantMessages": 22,
"userMessages": 0,
"apiRequests": 22,
"promptCharCount": 14814,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S03/T02",
"model": "claude-opus-4-6",
"startedAt": 1774898267920,
"finishedAt": 1774898679681,
"tokens": {
"input": 42,
"output": 9726,
"cacheRead": 3015457,
"cacheWrite": 24404,
"total": 3049629
},
"cost": 1.9036135,
"toolCalls": 53,
"assistantMessages": 41,
"userMessages": 0,
"apiRequests": 41,
"promptCharCount": 14515,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-slice",
"id": "M007/S03",
"model": "claude-opus-4-6",
"startedAt": 1774898680099,
"finishedAt": 1774898798309,
"tokens": {
"input": 21,
"output": 4972,
"cacheRead": 995363,
"cacheWrite": 18063,
"total": 1018419
},
"cost": 0.7349802500000001,
"toolCalls": 14,
"assistantMessages": 14,
"userMessages": 0,
"apiRequests": 14,
"promptCharCount": 22108,
"baselineCharCount": 21533,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "research-slice",
"id": "M007/S04",
"model": "claude-opus-4-6",
"startedAt": 1774898798679,
"finishedAt": 1774898943839,
"tokens": {
"input": 20,
"output": 5462,
"cacheRead": 1552827,
"cacheWrite": 44713,
"total": 1603022
},
"cost": 1.19251975,
"toolCalls": 28,
"assistantMessages": 18,
"userMessages": 0,
"apiRequests": 18,
"promptCharCount": 29245,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "plan-slice",
"id": "M007/S04",
"model": "claude-opus-4-6",
"startedAt": 1774898944171,
"finishedAt": 1774899061785,
"tokens": {
"input": 12,
"output": 4072,
"cacheRead": 803090,
"cacheWrite": 23523,
"total": 830697
},
"cost": 0.6504237500000001,
"toolCalls": 11,
"assistantMessages": 11,
"userMessages": 0,
"apiRequests": 11,
"promptCharCount": 38843,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S04/T01",
"model": "claude-opus-4-6",
"startedAt": 1774899062226,
"finishedAt": 1774899251015,
"tokens": {
"input": 35,
"output": 6982,
"cacheRead": 2429704,
"cacheWrite": 25695,
"total": 2462416
},
"cost": 1.5501707500000002,
"toolCalls": 34,
"assistantMessages": 32,
"userMessages": 0,
"apiRequests": 32,
"promptCharCount": 12185,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S04/T02",
"model": "claude-opus-4-6",
"startedAt": 1774899251433,
"finishedAt": 1774899407608,
"tokens": {
"input": 31,
"output": 6247,
"cacheRead": 2011068,
"cacheWrite": 19944,
"total": 2037290
},
"cost": 1.2865140000000002,
"toolCalls": 26,
"assistantMessages": 28,
"userMessages": 0,
"apiRequests": 28,
"promptCharCount": 13007,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-slice",
"id": "M007/S04",
"model": "claude-opus-4-6",
"startedAt": 1774899407970,
"finishedAt": 1774899474030,
"tokens": {
"input": 8,
"output": 2726,
"cacheRead": 399468,
"cacheWrite": 14096,
"total": 416298
},
"cost": 0.356024,
"toolCalls": 6,
"assistantMessages": 6,
"userMessages": 0,
"apiRequests": 6,
"promptCharCount": 34632,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "research-slice",
"id": "M007/S05",
"model": "claude-opus-4-6",
"startedAt": 1774899474371,
"finishedAt": 1774899569155,
"tokens": {
"input": 17,
"output": 3338,
"cacheRead": 1002977,
"cacheWrite": 13503,
"total": 1019835
},
"cost": 0.66941725,
"toolCalls": 17,
"assistantMessages": 15,
"userMessages": 0,
"apiRequests": 15,
"promptCharCount": 24953,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "plan-slice",
"id": "M007/S05",
"model": "claude-opus-4-6",
"startedAt": 1774899569525,
"finishedAt": 1774899607404,
"tokens": {
"input": 4,
"output": 1766,
"cacheRead": 194606,
"cacheWrite": 13276,
"total": 209652
},
"cost": 0.22444799999999998,
"toolCalls": 4,
"assistantMessages": 3,
"userMessages": 0,
"apiRequests": 3,
"promptCharCount": 31862,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S05/T01",
"model": "claude-opus-4-6",
"startedAt": 1774899607829,
"finishedAt": 1774899700894,
"tokens": {
"input": 21,
"output": 3670,
"cacheRead": 1146065,
"cacheWrite": 9763,
"total": 1159519
},
"cost": 0.72590625,
"toolCalls": 16,
"assistantMessages": 18,
"userMessages": 0,
"apiRequests": 18,
"promptCharCount": 12517,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-slice",
"id": "M007/S05",
"model": "claude-opus-4-6",
"startedAt": 1774899701356,
"finishedAt": 1774899761536,
"tokens": {
"input": 9,
"output": 2635,
"cacheRead": 469361,
"cacheWrite": 12904,
"total": 484909
},
"cost": 0.3812505,
"toolCalls": 12,
"assistantMessages": 7,
"userMessages": 0,
"apiRequests": 7,
"promptCharCount": 34154,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "research-slice",
"id": "M007/S06",
"model": "claude-opus-4-6",
"startedAt": 1774899761895,
"finishedAt": 1774899890474,
"tokens": {
"input": 22,
"output": 4634,
"cacheRead": 1471879,
"cacheWrite": 28131,
"total": 1504666
},
"cost": 1.0277182499999997,
"toolCalls": 27,
"assistantMessages": 20,
"userMessages": 0,
"apiRequests": 20,
"promptCharCount": 27474,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "plan-slice",
"id": "M007/S06",
"model": "claude-opus-4-6",
"startedAt": 1774899890779,
"finishedAt": 1774899962709,
"tokens": {
"input": 10,
"output": 3024,
"cacheRead": 628271,
"cacheWrite": 16437,
"total": 647742
},
"cost": 0.49251675,
"toolCalls": 9,
"assistantMessages": 9,
"userMessages": 0,
"apiRequests": 9,
"promptCharCount": 35215,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "execute-task",
"id": "M007/S06/T01",
"model": "claude-opus-4-6",
"startedAt": 1774899963228,
"finishedAt": 1774900109857,
"tokens": {
"input": 18,
"output": 3969,
"cacheRead": 1096570,
"cacheWrite": 10791,
"total": 1111348
},
"cost": 0.7150437500000001,
"toolCalls": 20,
"assistantMessages": 17,
"userMessages": 0,
"apiRequests": 17,
"promptCharCount": 11551,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-slice",
"id": "M007/S06",
"model": "claude-opus-4-6",
"startedAt": 1774900110414,
"finishedAt": 1774900161244,
"tokens": {
"input": 8,
"output": 2176,
"cacheRead": 330857,
"cacheWrite": 11758,
"total": 344799
},
"cost": 0.293356,
"toolCalls": 4,
"assistantMessages": 5,
"userMessages": 0,
"apiRequests": 5,
"promptCharCount": 33871,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "validate-milestone",
"id": "M007",
"model": "claude-opus-4-6",
"startedAt": 1774900161589,
"finishedAt": 1774900277303,
"tokens": {
"input": 15,
"output": 4391,
"cacheRead": 930967,
"cacheWrite": 19894,
"total": 955267
},
"cost": 0.6996709999999999,
"toolCalls": 19,
"assistantMessages": 13,
"userMessages": 0,
"apiRequests": 13,
"promptCharCount": 33712,
"baselineCharCount": 21851,
"skills": [
"accessibility",
"agent-browser",
"best-practices",
"code-optimizer",
"core-web-vitals",
"create-gsd-extension",
"create-skill",
"create-workflow",
"debug-like-expert",
"frontend-design",
"github-workflows",
"lint",
"make-interfaces-feel-better",
"react-best-practices",
"review",
"test",
"userinterface-wiki",
"web-design-guidelines",
"web-quality-audit"
],
"cacheHitRate": 100
},
{
"type": "complete-milestone",
"id": "M007",
"model": "claude-opus-4-6",
"startedAt": 1774900277729,
"finishedAt": 1774900391973,
"tokens": {
"input": 15,
"output": 4910,
"cacheRead": 927877,
"cacheWrite": 18700,
"total": 951502
},
"cost": 0.7036385000000001,
"toolCalls": 17,
"assistantMessages": 13,
"userMessages": 0,
"apiRequests": 13,
"cacheHitRate": 100
}
]
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-milestone",
"unitId": "M007",
"startedAt": 1774900277729,
"updatedAt": 1774900277729,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774900277729,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-slice",
"unitId": "M007/S02",
"startedAt": 1774897648982,
"updatedAt": 1774897648983,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774897648982,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-slice",
"unitId": "M007/S03",
"startedAt": 1774898680099,
"updatedAt": 1774898680100,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774898680099,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-slice",
"unitId": "M007/S04",
"startedAt": 1774899407970,
"updatedAt": 1774899407970,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899407970,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-slice",
"unitId": "M007/S05",
"startedAt": 1774899701356,
"updatedAt": 1774899701357,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899701356,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "complete-slice",
"unitId": "M007/S06",
"startedAt": 1774900110414,
"updatedAt": 1774900110415,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774900110414,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S03/T01",
"startedAt": 1774898108259,
"updatedAt": 1774898108260,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774898108259,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S03/T02",
"startedAt": 1774898267920,
"updatedAt": 1774898267921,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774898267920,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S04/T01",
"startedAt": 1774899062226,
"updatedAt": 1774899062226,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899062226,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S04/T02",
"startedAt": 1774899251433,
"updatedAt": 1774899251433,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899251433,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S05/T01",
"startedAt": 1774899607829,
"updatedAt": 1774899607829,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899607829,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "execute-task",
"unitId": "M007/S06/T01",
"startedAt": 1774899963228,
"updatedAt": 1774899963228,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899963228,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "plan-slice",
"unitId": "M007/S03",
"startedAt": 1774897978291,
"updatedAt": 1774897978291,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774897978291,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "plan-slice",
"unitId": "M007/S04",
"startedAt": 1774898944171,
"updatedAt": 1774898944171,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774898944171,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "plan-slice",
"unitId": "M007/S05",
"startedAt": 1774899569525,
"updatedAt": 1774899569525,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899569525,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "plan-slice",
"unitId": "M007/S06",
"startedAt": 1774899890779,
"updatedAt": 1774899890780,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899890779,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "research-slice",
"unitId": "M007/S03",
"startedAt": 1774897828866,
"updatedAt": 1774897828867,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774897828866,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "research-slice",
"unitId": "M007/S04",
"startedAt": 1774898798679,
"updatedAt": 1774898798679,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774898798679,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "research-slice",
"unitId": "M007/S05",
"startedAt": 1774899474371,
"updatedAt": 1774899474372,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899474371,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "research-slice",
"unitId": "M007/S06",
"startedAt": 1774899761895,
"updatedAt": 1774899761895,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774899761895,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -0,0 +1,15 @@
{
"version": 1,
"unitType": "validate-milestone",
"unitId": "M007",
"startedAt": 1774900161589,
"updatedAt": 1774900161589,
"phase": "dispatched",
"wrapupWarningSent": false,
"continueHereFired": false,
"timeoutAt": null,
"lastProgressAt": 1774900161589,
"progressCount": 0,
"lastProgressKind": "dispatch",
"recoveryAttempts": 0
}

View file

@ -46,30 +46,3 @@ docker logs -f chrysopedia-worker
# View API logs
docker logs -f chrysopedia-api
```
## Remote Host: hal0022 (Whisper Transcription)
- **Host alias:** `hal0022`
- **IP:** 10.0.0.131
- **OS:** Windows (domain-joined to a.xpltd.co)
- **SSH user:** `a\jlightner`
- **SSH key:** `~/.ssh/hal0022_ed25519`
- **Role:** GPU workstation for Whisper transcription of video content
### Connecting
```bash
ssh hal0022
```
SSH config is already set up in `~/.ssh/config` on dev01.
### Content Location on hal0022
Video source files reside at:
```
A:\Education\Artist Streams & Content
```
Note: This is a Windows path. When accessing via SSH, use the path format expected by the shell available on hal0022 (e.g. backslashes under cmd.exe or PowerShell, forward slashes under Git Bash).

320
README.md Normal file
View file

@ -0,0 +1,320 @@
# Chrysopedia
> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.
A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.
---
## Information Flow
Content moves through six stages from raw video to searchable knowledge:
```
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 1 · Transcription [Desktop / GPU] │
│ │
│ Video files → Whisper large-v3 (CUDA) → JSON transcripts │
│ Output: timestamped segments with speaker text │
└────────────────────────────────┬────────────────────────────────────────┘
│ JSON files (manual or folder watcher)
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 2 · Ingestion [API + Watcher] │
│ │
│ POST /api/v1/ingest ← watcher auto-submits from /watch folder │
│ • Validate JSON structure │
│ • Compute content hash (SHA-256) for deduplication │
│ • Find-or-create Creator from folder name │
│ • Upsert SourceVideo (exact filename → content hash → fuzzy match) │
│ • Bulk-insert TranscriptSegment rows │
│ • Dispatch pipeline to Celery worker │
└────────────────────────────────┬────────────────────────────────────────┘
│ Celery task: run_pipeline(video_id)
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 3 · LLM Extraction Pipeline [Celery Worker] │
│ │
│ Four sequential LLM stages, each with its own prompt template: │
│ │
│ 3a. Segmentation — Split transcript into semantic topic boundaries │
│ Model: chat (fast) Prompt: stage2_segmentation.txt │
│ │
│ 3b. Extraction — Identify key moments (title, summary, timestamps) │
│ Model: reasoning (think) Prompt: stage3_extraction.txt │
│ │
│ 3c. Classification — Assign content types + extract plugin names │
│ Model: chat (fast) Prompt: stage4_classification.txt │
│ │
│ 3d. Synthesis — Compose technique pages from approved moments │
│ Model: reasoning (think) Prompt: stage5_synthesis.txt │
│ │
│ Each stage emits PipelineEvent rows (tokens, duration, model, errors) │
└────────────────────────────────┬────────────────────────────────────────┘
│ KeyMoment rows (review_status: pending)
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 4 · Review & Curation [Admin UI] │
│ │
│ Admin reviews extracted KeyMoments before they become technique pages: │
│ • Approve — moment proceeds to synthesis │
│ • Edit — correct title, summary, content type, plugins, then approve │
│ • Reject — moment is excluded from knowledge base │
│ (When REVIEW_MODE=false, moments auto-approve and skip this stage) │
└────────────────────────────────┬────────────────────────────────────────┘
│ Approved moments → Stage 3d synthesis
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 5 · Knowledge Base [Web UI] │
│ │
│ TechniquePages — the primary output: │
│ • Structured body sections, signal chains, plugin lists │
│ • Linked to source KeyMoments with video timestamps │
│ • Cross-referenced via RelatedTechniqueLinks │
│ • Versioned (snapshots before each re-synthesis) │
│ • Organized by topic taxonomy (6 categories from canonical_tags.yaml) │
└────────────────────────────────┬────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│ STAGE 6 · Search & Retrieval [Web UI] │
│ │
│ • Semantic search: query → embedding → Qdrant vector similarity │
│ • Keyword fallback: ILIKE search on title/summary (300ms timeout) │
│ • Browse by topic hierarchy, creator, or content type │
│ • Typeahead search from home page (debounced, top 5 results) │
└─────────────────────────────────────────────────────────────────────────┘
```
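The Stage 2 deduplication step above can be sketched as a hash over a canonical JSON encoding, so that key order and whitespace differences never defeat dedup. This is a minimal sketch, not the actual ingest implementation; the segment field names are illustrative:

```python
import hashlib
import json

def content_hash(transcript: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so the same transcript
    always hashes identically regardless of key order or whitespace."""
    canonical = json.dumps(transcript, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Reordered keys produce the same hash; changed text does not.
a = content_hash({"segments": [{"start": 0.0, "text": "hi"}]})
b = content_hash({"segments": [{"text": "hi", "start": 0.0}]})
c = content_hash({"segments": [{"start": 0.0, "text": "bye"}]})
```

An incoming transcript whose hash matches an existing `SourceVideo` can then be treated as a re-upload rather than a new video.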
---
## Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ Desktop (GPU workstation — hal0022) │
│ whisper/transcribe.py → JSON transcripts → copy to /watch folder │
└────────────────────────────┬─────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ Docker Compose: xpltd_chrysopedia (ub01) │
│ Network: chrysopedia (172.32.0.0/24) │
│ │
│ ┌────────────┐ ┌─────────────┐ ┌───────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Qdrant │ │ Ollama │ │
│ │ :5433 │ │ broker + │ │ vector DB │ │ embeddings │ │
│ │ 7 entities │ │ cache │ │ semantic │ │ nomic-embed │ │
│ └─────┬───────┘ └──────┬──────┘ └───────┬───────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌─────┴─────────────────┴─────────────────┴─────────────────┴────────┐ │
│ │ FastAPI (API) │ │
│ │ Ingest · Pipeline control · Review · Search · CRUD · Reports │ │
│ └──────────────────────────────┬────────────────────────────────────┘ │
│ │ │
│ ┌──────────────┐ ┌────────────┴───┐ ┌──────────────────────────┐ │
│ │ Watcher │ │ Celery Worker │ │ Web UI (React) │ │
│ │ /watch → │ │ LLM pipeline │ │ nginx → :8096 │ │
│ │ auto-ingest │ │ stages 2-5 │ │ search-first interface │ │
│ └──────────────┘ └────────────────┘ └──────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────┘
```
### Services
| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store |
| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker + feature flag cache |
| `chrysopedia-qdrant` | `qdrant/qdrant:v1.13.2` | — | Vector DB for semantic search |
| `chrysopedia-ollama` | `ollama/ollama` | — | Embedding model server (nomic-embed-text) |
| `chrysopedia-api` | `Dockerfile.api` | `8000` | FastAPI REST API |
| `chrysopedia-worker` | `Dockerfile.api` | — | Celery worker (LLM pipeline) |
| `chrysopedia-watcher` | `Dockerfile.api` | — | Folder monitor → auto-ingest |
| `chrysopedia-web` | `Dockerfile.web` | `8096 → 80` | React frontend (nginx) |
### Data Model
| Entity | Purpose |
|--------|---------|
| **Creator** | Artists/producers whose content is indexed |
| **SourceVideo** | Video files processed by the pipeline (with content hash dedup) |
| **TranscriptSegment** | Timestamped text segments from Whisper |
| **KeyMoment** | Discrete insights extracted by LLM analysis |
| **TechniquePage** | Synthesized knowledge pages — the primary output |
| **TechniquePageVersion** | Snapshots before re-synthesis overwrites |
| **RelatedTechniqueLink** | Cross-references between technique pages |
| **Tag** | Hierarchical topic taxonomy |
| **ContentReport** | User-submitted content issues |
| **PipelineEvent** | Structured pipeline execution logs (tokens, timing, errors) |
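The `KeyMoment` review lifecycle (Stage 4) amounts to a small state machine: pending moments can be edited in place, then approved or rejected, and terminal states are final. A sketch under assumed status values; the real ORM model in `backend/models.py` may differ:

```python
from dataclasses import dataclass

# Assumed status values; terminal states allow no further transitions.
ALLOWED = {"pending": {"approved", "rejected"}, "approved": set(), "rejected": set()}

@dataclass
class KeyMoment:
    title: str
    summary: str
    review_status: str = "pending"

    def edit(self, **changes):
        # Edit-then-approve: corrections are applied while still pending.
        for key, value in changes.items():
            setattr(self, key, value)

    def approve(self):
        self._move("approved")

    def reject(self):
        self._move("rejected")

    def _move(self, target: str):
        if target not in ALLOWED[self.review_status]:
            raise ValueError(f"cannot go {self.review_status} -> {target}")
        self.review_status = target

m = KeyMoment("Sidechain trick", "Duck the pad under the kick")
m.edit(title="Sidechain compression trick")
m.approve()
```

With `REVIEW_MODE=false`, the pipeline would simply call the equivalent of `approve()` immediately after extraction.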
---
## Quick Start
### Prerequisites
- Docker ≥ 24.0 and Docker Compose ≥ 2.20
- Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)
### Setup
```bash
# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env # edit with real values
# Start the stack
docker compose up -d
# Run database migrations
docker exec chrysopedia-api alembic upgrade head
# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text
# Verify
curl http://localhost:8096/health
```
### Transcribe videos
```bash
cd whisper && pip install -r requirements.txt
# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
See [`whisper/README.md`](whisper/README.md) for full transcription docs.
---
## Environment Variables
Copy `.env.example` to `.env`. Key groups:
| Group | Variables | Notes |
|-------|-----------|-------|
| **Database** | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` | Default user: `chrysopedia` |
| **LLM** | `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL` | OpenAI-compatible endpoint |
| **LLM Fallback** | `LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL` | Automatic failover |
| **Per-Stage Models** | `LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY` | `chat` for fast stages, `thinking` for reasoning |
| **Embedding** | `EMBEDDING_API_URL`, `EMBEDDING_MODEL` | Ollama nomic-embed-text |
| **Vector DB** | `QDRANT_URL`, `QDRANT_COLLECTION` | Container-internal |
| **Features** | `REVIEW_MODE`, `DEBUG_MODE` | Review gate + LLM I/O capture |
| **Storage** | `TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH` | Container bind mounts |
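The per-stage override scheme above resolves each stage's model and modality, falling back to the global `LLM_MODEL` and `chat` when no override is set. The variable names come from this README; the resolution logic is a sketch, not the actual `config.py`:

```python
import os

def stage_model(stage: int, env=os.environ) -> tuple[str, str]:
    """Resolve (model, modality) for a pipeline stage, falling back to
    the global LLM_MODEL and 'chat' when no override is set."""
    model = env.get(f"LLM_STAGE{stage}_MODEL") or env.get("LLM_MODEL", "")
    modality = env.get(f"LLM_STAGE{stage}_MODALITY", "chat")
    return model, modality

env = {
    "LLM_MODEL": "fyn-llm-agent-chat",
    "LLM_STAGE3_MODEL": "fyn-llm-agent-think",
    "LLM_STAGE3_MODALITY": "thinking",
}
```

Stage 3 picks up its reasoning-model override; stage 2, with no override present, inherits the global chat model.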
---
## API Endpoints
### Public
| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (DB connectivity) |
| GET | `/api/v1/search?q=&scope=&limit=` | Semantic + keyword search |
| GET | `/api/v1/techniques` | List technique pages |
| GET | `/api/v1/techniques/{slug}` | Technique detail + key moments |
| GET | `/api/v1/techniques/{slug}/versions` | Version history |
| GET | `/api/v1/creators` | List creators (sort, genre filter) |
| GET | `/api/v1/creators/{slug}` | Creator detail |
| GET | `/api/v1/topics` | Topic hierarchy with counts |
| GET | `/api/v1/videos` | List source videos |
| POST | `/api/v1/reports` | Submit content report |
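The search endpoint's semantic-first, keyword-fallback behavior can be sketched with a time budget: try the vector path, and fall back to the keyword (ILIKE-style) path on timeout or error. The 300 ms budget is from this README; the function names and fallback trigger are assumptions, not the actual `search_service.py`:

```python
import asyncio

async def search(q, semantic, keyword, budget=0.3):
    """Try semantic search within the time budget; on timeout or
    error, fall back to the keyword path."""
    try:
        return await asyncio.wait_for(semantic(q), timeout=budget)
    except (asyncio.TimeoutError, RuntimeError):
        return await keyword(q)

async def slow_semantic(q):
    # Simulates an unresponsive vector backend.
    await asyncio.sleep(1.0)
    return ["semantic hit"]

async def keyword(q):
    return [f"keyword match for {q!r}"]

results = asyncio.run(search("sidechain", slow_semantic, keyword))
```

When Qdrant answers inside the budget, its results win; a slow or failing vector backend degrades to keyword search instead of an error page.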
### Admin
| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/review/queue` | Review queue (status filter) |
| POST | `/api/v1/review/moments/{id}/approve` | Approve key moment |
| POST | `/api/v1/review/moments/{id}/reject` | Reject key moment |
| PUT | `/api/v1/review/moments/{id}` | Edit key moment |
| POST | `/api/v1/admin/pipeline/trigger/{video_id}` | Trigger/retrigger pipeline |
| GET | `/api/v1/admin/pipeline/events/{video_id}` | Pipeline event log |
| GET | `/api/v1/admin/pipeline/token-summary/{video_id}` | Token usage by stage |
| GET | `/api/v1/admin/pipeline/worker-status` | Celery worker status |
| PUT | `/api/v1/admin/pipeline/debug-mode` | Toggle debug mode |
### Ingest
| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/ingest` | Upload Whisper JSON transcript |
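A minimal client-side check before POSTing a transcript to `/api/v1/ingest` can catch malformed files early. The exact ingest schema is not documented here, so the field names below mirror Whisper's JSON output and are assumptions:

```python
def validate_transcript(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks ingestable."""
    errors = []
    segs = payload.get("segments")
    if not isinstance(segs, list) or not segs:
        errors.append("missing or empty 'segments'")
        return errors
    for i, seg in enumerate(segs):
        for key in ("start", "end", "text"):
            if key not in seg:
                errors.append(f"segment {i} missing {key!r}")
        if "start" in seg and "end" in seg and seg["end"] < seg["start"]:
            errors.append(f"segment {i} has end < start")
    return errors

ok = validate_transcript({"segments": [{"start": 0.0, "end": 2.5, "text": "hello"}]})
bad = validate_transcript({"segments": [{"start": 3.0, "end": 1.0}]})
```

The API performs its own validation on ingest; a check like this just avoids a round trip for obviously broken files.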
---
## Development
```bash
# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000
# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head
```
### Project Structure
```
chrysopedia/
├── backend/ # FastAPI application
│ ├── main.py # Entry point, middleware, router mounting
│ ├── config.py # Pydantic Settings (all env vars)
│ ├── models.py # SQLAlchemy ORM models
│ ├── schemas.py # Pydantic request/response schemas
│ ├── worker.py # Celery app configuration
│ ├── watcher.py # Transcript folder watcher service
│ ├── search_service.py # Semantic search + keyword fallback
│ ├── routers/ # API endpoint handlers
│ ├── pipeline/ # LLM pipeline stages + clients
│ │ ├── stages.py # Stages 2-5 (Celery tasks)
│ │ ├── llm_client.py # OpenAI-compatible LLM client
│ │ ├── embedding_client.py
│ │ └── qdrant_client.py
│ └── tests/
├── frontend/ # React + TypeScript + Vite
│ └── src/
│ ├── pages/ # Home, Search, Technique, Creator, Topic, Admin
│ ├── components/ # Shared UI components
│ └── api/ # Typed API clients
├── whisper/ # Desktop transcription (Whisper large-v3)
├── docker/ # Dockerfiles + nginx config
├── alembic/ # Database migrations
├── config/ # canonical_tags.yaml (topic taxonomy)
├── prompts/ # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example
```
---
## Deployment (ub01)
```bash
ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d
```
| Resource | Location |
|----------|----------|
| Web UI | `http://ub01:8096` |
| API health check | `http://ub01:8096/health` |
| PostgreSQL | `ub01:5433` |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |
| Persistent data | `/vmPool/r/services/chrysopedia_*` |
XPLTD conventions: `xpltd_chrysopedia` project name, dedicated bridge network (`172.32.0.0/24`), bind mounts under `/vmPool/r/services/`, PostgreSQL on port `5433`.
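A compose fragment following these conventions might look like the sketch below. Only the project name, subnet, host port, and bind-mount prefix are stated above; the image tag, service keys, and volume target are illustrative assumptions, not the actual compose file.

```yaml
name: xpltd_chrysopedia

networks:
  default:
    driver: bridge
    ipam:
      config:
        - subnet: 172.32.0.0/24   # dedicated bridge per XPLTD convention

services:
  chrysopedia-db:
    image: postgres:16            # image tag assumed
    ports:
      - "5433:5432"               # host port 5433 per XPLTD convention
    volumes:
      - /vmPool/r/services/chrysopedia_db:/var/lib/postgresql/data
```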

── alembic.ini (new file, 37 lines) ──
# Chrysopedia — Alembic configuration
[alembic]
script_location = alembic
sqlalchemy.url = postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia
[loggers]
keys = root,sqlalchemy,alembic
[handlers]
keys = console
[formatters]
keys = generic
[logger_root]
level = WARN
handlers = console
[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine
[logger_alembic]
level = INFO
handlers =
qualname = alembic
[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic
[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S

── alembic/env.py (new file, 72 lines) ──
"""Alembic env.py — async migration runner for Chrysopedia."""
import asyncio
import os
import sys
from logging.config import fileConfig
from alembic import context
from sqlalchemy import pool
from sqlalchemy.ext.asyncio import async_engine_from_config
# Ensure the backend package is importable
# When running locally: alembic/ sits beside backend/, so ../backend works
# When running in Docker: alembic/ is inside /app/ alongside the backend modules
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "backend"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
from database import Base # noqa: E402
import models # noqa: E402, F401 — registers all tables on Base.metadata
config = context.config
if config.config_file_name is not None:
fileConfig(config.config_file_name)
target_metadata = Base.metadata
# Allow DATABASE_URL env var to override alembic.ini
url_override = os.getenv("DATABASE_URL")
if url_override:
config.set_main_option("sqlalchemy.url", url_override)
def run_migrations_offline() -> None:
"""Run migrations in 'offline' mode — emit SQL to stdout."""
url = config.get_main_option("sqlalchemy.url")
context.configure(
url=url,
target_metadata=target_metadata,
literal_binds=True,
dialect_opts={"paramstyle": "named"},
)
with context.begin_transaction():
context.run_migrations()
def do_run_migrations(connection):
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()
async def run_async_migrations() -> None:
"""Run migrations in 'online' mode with an async engine."""
connectable = async_engine_from_config(
config.get_section(config.config_ini_section, {}),
prefix="sqlalchemy.",
poolclass=pool.NullPool,
)
async with connectable.connect() as connection:
await connection.run_sync(do_run_migrations)
await connectable.dispose()
def run_migrations_online() -> None:
asyncio.run(run_async_migrations())
if context.is_offline_mode():
run_migrations_offline()
else:
run_migrations_online()

── alembic/script.py.mako (new file, 25 lines) ──
"""${message}
Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
${imports if imports else ""}
# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}
def upgrade() -> None:
${upgrades if upgrades else "pass"}
def downgrade() -> None:
${downgrades if downgrades else "pass"}

── alembic migration 001_initial (new file, 171 lines) ──
"""initial schema — 7 core entities
Revision ID: 001_initial
Revises:
Create Date: 2026-03-29
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
# revision identifiers, used by Alembic.
revision: str = "001_initial"
down_revision: Union[str, None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
# ── Enum types ───────────────────────────────────────────────────────
content_type = sa.Enum(
"tutorial", "livestream", "breakdown", "short_form",
name="content_type",
)
processing_status = sa.Enum(
"pending", "transcribed", "extracted", "reviewed", "published",
name="processing_status",
)
key_moment_content_type = sa.Enum(
"technique", "settings", "reasoning", "workflow",
name="key_moment_content_type",
)
review_status = sa.Enum(
"pending", "approved", "edited", "rejected",
name="review_status",
)
source_quality = sa.Enum(
"structured", "mixed", "unstructured",
name="source_quality",
)
page_review_status = sa.Enum(
"draft", "reviewed", "published",
name="page_review_status",
)
relationship_type = sa.Enum(
"same_technique_other_creator", "same_creator_adjacent", "general_cross_reference",
name="relationship_type",
)
# ── creators ─────────────────────────────────────────────────────────
op.create_table(
"creators",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("name", sa.String(255), nullable=False),
sa.Column("slug", sa.String(255), nullable=False, unique=True),
sa.Column("genres", ARRAY(sa.String), nullable=True),
sa.Column("folder_name", sa.String(255), nullable=False),
sa.Column("view_count", sa.Integer, nullable=False, server_default="0"),
sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
)
# ── source_videos ────────────────────────────────────────────────────
op.create_table(
"source_videos",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("creator_id", UUID(as_uuid=True), sa.ForeignKey("creators.id", ondelete="CASCADE"), nullable=False),
sa.Column("filename", sa.String(500), nullable=False),
sa.Column("file_path", sa.String(1000), nullable=False),
sa.Column("duration_seconds", sa.Integer, nullable=True),
sa.Column("content_type", content_type, nullable=False),
sa.Column("transcript_path", sa.String(1000), nullable=True),
sa.Column("processing_status", processing_status, nullable=False, server_default="pending"),
sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
)
op.create_index("ix_source_videos_creator_id", "source_videos", ["creator_id"])
# ── transcript_segments ──────────────────────────────────────────────
op.create_table(
"transcript_segments",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("source_video_id", UUID(as_uuid=True), sa.ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False),
sa.Column("start_time", sa.Float, nullable=False),
sa.Column("end_time", sa.Float, nullable=False),
sa.Column("text", sa.Text, nullable=False),
sa.Column("segment_index", sa.Integer, nullable=False),
sa.Column("topic_label", sa.String(255), nullable=True),
)
op.create_index("ix_transcript_segments_video_id", "transcript_segments", ["source_video_id"])
# ── technique_pages (must come before key_moments due to FK) ─────────
op.create_table(
"technique_pages",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("creator_id", UUID(as_uuid=True), sa.ForeignKey("creators.id", ondelete="CASCADE"), nullable=False),
sa.Column("title", sa.String(500), nullable=False),
sa.Column("slug", sa.String(500), nullable=False, unique=True),
sa.Column("topic_category", sa.String(255), nullable=False),
sa.Column("topic_tags", ARRAY(sa.String), nullable=True),
sa.Column("summary", sa.Text, nullable=True),
sa.Column("body_sections", JSONB, nullable=True),
sa.Column("signal_chains", JSONB, nullable=True),
sa.Column("plugins", ARRAY(sa.String), nullable=True),
sa.Column("source_quality", source_quality, nullable=True),
sa.Column("view_count", sa.Integer, nullable=False, server_default="0"),
sa.Column("review_status", page_review_status, nullable=False, server_default="draft"),
sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
)
op.create_index("ix_technique_pages_creator_id", "technique_pages", ["creator_id"])
op.create_index("ix_technique_pages_topic_category", "technique_pages", ["topic_category"])
# ── key_moments ──────────────────────────────────────────────────────
op.create_table(
"key_moments",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("source_video_id", UUID(as_uuid=True), sa.ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False),
sa.Column("technique_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="SET NULL"), nullable=True),
sa.Column("title", sa.String(500), nullable=False),
sa.Column("summary", sa.Text, nullable=False),
sa.Column("start_time", sa.Float, nullable=False),
sa.Column("end_time", sa.Float, nullable=False),
sa.Column("content_type", key_moment_content_type, nullable=False),
sa.Column("plugins", ARRAY(sa.String), nullable=True),
sa.Column("review_status", review_status, nullable=False, server_default="pending"),
sa.Column("raw_transcript", sa.Text, nullable=True),
sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
)
op.create_index("ix_key_moments_source_video_id", "key_moments", ["source_video_id"])
op.create_index("ix_key_moments_technique_page_id", "key_moments", ["technique_page_id"])
# ── related_technique_links ──────────────────────────────────────────
op.create_table(
"related_technique_links",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("source_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
sa.Column("target_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
sa.Column("relationship", relationship_type, nullable=False),
sa.UniqueConstraint("source_page_id", "target_page_id", "relationship", name="uq_technique_link"),
)
# ── tags ─────────────────────────────────────────────────────────────
op.create_table(
"tags",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("name", sa.String(255), nullable=False, unique=True),
sa.Column("category", sa.String(255), nullable=False),
sa.Column("aliases", ARRAY(sa.String), nullable=True),
)
op.create_index("ix_tags_category", "tags", ["category"])
def downgrade() -> None:
op.drop_table("tags")
op.drop_table("related_technique_links")
op.drop_table("key_moments")
op.drop_table("technique_pages")
op.drop_table("transcript_segments")
op.drop_table("source_videos")
op.drop_table("creators")
# Drop enum types
for name in [
"relationship_type", "page_review_status", "source_quality",
"review_status", "key_moment_content_type", "processing_status",
"content_type",
]:
sa.Enum(name=name).drop(op.get_bind(), checkfirst=True)

── alembic migration 002_technique_page_versions (new file, 39 lines) ──
"""technique_page_versions table for article versioning
Revision ID: 002_technique_page_versions
Revises: 001_initial
Create Date: 2026-03-30
"""
from typing import Sequence, Union
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB, UUID
# revision identifiers, used by Alembic.
revision: str = "002_technique_page_versions"
down_revision: Union[str, None] = "001_initial"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None
def upgrade() -> None:
op.create_table(
"technique_page_versions",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
sa.Column("technique_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
sa.Column("version_number", sa.Integer, nullable=False),
sa.Column("content_snapshot", JSONB, nullable=False),
sa.Column("pipeline_metadata", JSONB, nullable=True),
sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
)
op.create_index(
"ix_technique_page_versions_page_version",
"technique_page_versions",
["technique_page_id", "version_number"],
unique=True,
)
def downgrade() -> None:
op.drop_table("technique_page_versions")

── alembic migration 003_content_reports (new file, 47 lines) ──
"""Create content_reports table.
Revision ID: 003_content_reports
Revises: 002_technique_page_versions
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID
revision = "003_content_reports"
down_revision = "002_technique_page_versions"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.create_table(
"content_reports",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.func.gen_random_uuid()),
sa.Column("content_type", sa.String(50), nullable=False),
sa.Column("content_id", UUID(as_uuid=True), nullable=True),
sa.Column("content_title", sa.String(500), nullable=True),
sa.Column("report_type", sa.Enum(
"inaccurate", "missing_info", "wrong_attribution", "formatting", "other",
name="report_type", create_constraint=True,
), nullable=False),
sa.Column("description", sa.Text(), nullable=False),
sa.Column("status", sa.Enum(
"open", "acknowledged", "resolved", "dismissed",
name="report_status", create_constraint=True,
), nullable=False, server_default="open"),
sa.Column("admin_notes", sa.Text(), nullable=True),
sa.Column("page_url", sa.String(1000), nullable=True),
sa.Column("created_at", sa.DateTime(), server_default=sa.func.now(), nullable=False),
sa.Column("resolved_at", sa.DateTime(), nullable=True),
)
op.create_index("ix_content_reports_status_created", "content_reports", ["status", "created_at"])
op.create_index("ix_content_reports_content", "content_reports", ["content_type", "content_id"])
def downgrade() -> None:
op.drop_index("ix_content_reports_content")
op.drop_index("ix_content_reports_status_created")
op.drop_table("content_reports")
sa.Enum(name="report_status").drop(op.get_bind(), checkfirst=True)
sa.Enum(name="report_type").drop(op.get_bind(), checkfirst=True)

── alembic migration 004_pipeline_events (new file, 37 lines) ──
"""Create pipeline_events table.
Revision ID: 004_pipeline_events
Revises: 003_content_reports
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID, JSONB
revision = "004_pipeline_events"
down_revision = "003_content_reports"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.create_table(
"pipeline_events",
sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.func.gen_random_uuid()),
sa.Column("video_id", UUID(as_uuid=True), nullable=False, index=True),
sa.Column("stage", sa.String(50), nullable=False),
sa.Column("event_type", sa.String(30), nullable=False),
sa.Column("prompt_tokens", sa.Integer(), nullable=True),
sa.Column("completion_tokens", sa.Integer(), nullable=True),
sa.Column("total_tokens", sa.Integer(), nullable=True),
sa.Column("model", sa.String(100), nullable=True),
sa.Column("duration_ms", sa.Integer(), nullable=True),
sa.Column("payload", JSONB(), nullable=True),
sa.Column("created_at", sa.DateTime(), server_default=sa.func.now(), nullable=False),
)
# Composite index for event log queries (video + newest first)
op.create_index("ix_pipeline_events_video_created", "pipeline_events", ["video_id", "created_at"])
def downgrade() -> None:
op.drop_index("ix_pipeline_events_video_created")
op.drop_table("pipeline_events")
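The token-summary admin endpoint presumably aggregates these token columns per stage; the grouping logic amounts to the following pure-Python sketch over rows shaped like the table above (the stage names and this helper are hypothetical, not the actual router code):

```python
from collections import defaultdict


def token_summary(events: list[dict]) -> dict[str, dict[str, int]]:
    """Group pipeline events by stage and sum token counts (None counts as 0)."""
    summary: dict[str, dict[str, int]] = defaultdict(
        lambda: {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    )
    for ev in events:
        row = summary[ev["stage"]]
        for key in row:
            row[key] += ev.get(key) or 0
    return dict(summary)


events = [
    {"stage": "stage3_extraction", "prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500},
    {"stage": "stage3_extraction", "prompt_tokens": 800, "completion_tokens": 200, "total_tokens": 1000},
    {"stage": "stage5_synthesis", "prompt_tokens": 2000, "completion_tokens": 500, "total_tokens": 2500},
]
assert token_summary(events)["stage3_extraction"]["total_tokens"] == 2500
```

In production the same aggregation would be a `GROUP BY stage` over the `ix_pipeline_events_video_created` index rather than an in-memory pass.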

── alembic migration 005_content_hash (new file, 29 lines) ──
"""Add content_hash to source_videos for duplicate detection.
Revision ID: 005_content_hash
Revises: 004_pipeline_events
"""
from alembic import op
import sqlalchemy as sa
revision = "005_content_hash"
down_revision = "004_pipeline_events"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"source_videos",
sa.Column("content_hash", sa.String(64), nullable=True),
)
op.create_index(
"ix_source_videos_content_hash",
"source_videos",
["content_hash"],
)
def downgrade() -> None:
op.drop_index("ix_source_videos_content_hash")
op.drop_column("source_videos", "content_hash")
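The `String(64)` width suggests a SHA-256 hex digest of the transcript content; a minimal sketch of how duplicate detection could use it (this hashing helper is hypothetical and not part of the restored code):

```python
import hashlib


def content_hash(transcript_bytes: bytes) -> str:
    """Return a 64-char hex digest suitable for the content_hash column."""
    return hashlib.sha256(transcript_bytes).hexdigest()


# Identical transcript uploads map to the same digest, so a pre-insert
# lookup on ix_source_videos_content_hash can skip re-ingesting them.
h1 = content_hash(b'{"segments": []}')
h2 = content_hash(b'{"segments": []}')
assert h1 == h2 and len(h1) == 64
```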

── alembic migration 006_debug_columns (new file, 33 lines) ──
"""Add debug LLM I/O capture columns to pipeline_events.
Revision ID: 006_debug_columns
Revises: 005_content_hash
"""
from alembic import op
import sqlalchemy as sa
revision = "006_debug_columns"
down_revision = "005_content_hash"
branch_labels = None
depends_on = None
def upgrade() -> None:
op.add_column(
"pipeline_events",
sa.Column("system_prompt_text", sa.Text(), nullable=True),
)
op.add_column(
"pipeline_events",
sa.Column("user_prompt_text", sa.Text(), nullable=True),
)
op.add_column(
"pipeline_events",
sa.Column("response_text", sa.Text(), nullable=True),
)
def downgrade() -> None:
op.drop_column("pipeline_events", "response_text")
op.drop_column("pipeline_events", "user_prompt_text")
op.drop_column("pipeline_events", "system_prompt_text")

── backend/config.py (new file, 85 lines) ──
"""Application configuration loaded from environment variables."""
from functools import lru_cache
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
"""Chrysopedia API settings.
Values are loaded from environment variables (or .env file via
pydantic-settings' dotenv support).
"""
# Database
database_url: str = "postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia"
# Redis
redis_url: str = "redis://localhost:6379/0"
# Application
app_env: str = "development"
app_log_level: str = "info"
app_secret_key: str = "changeme-generate-a-real-secret"
# CORS
cors_origins: list[str] = ["*"]
# LLM endpoint (OpenAI-compatible)
llm_api_url: str = "http://localhost:11434/v1"
llm_api_key: str = "sk-placeholder"
llm_model: str = "fyn-llm-agent-chat"
llm_fallback_url: str = "http://localhost:11434/v1"
llm_fallback_model: str = "fyn-llm-agent-chat"
# Per-stage model overrides (optional — falls back to llm_model / "chat")
llm_stage2_model: str | None = "fyn-llm-agent-chat" # segmentation — mechanical, fast chat
llm_stage2_modality: str = "chat"
llm_stage3_model: str | None = "fyn-llm-agent-think" # extraction — reasoning
llm_stage3_modality: str = "thinking"
llm_stage4_model: str | None = "fyn-llm-agent-chat" # classification — mechanical, fast chat
llm_stage4_modality: str = "chat"
llm_stage5_model: str | None = "fyn-llm-agent-think" # synthesis — reasoning
llm_stage5_modality: str = "thinking"
# Dynamic token estimation — each stage calculates max_tokens from input size
llm_max_tokens_hard_limit: int = 32768 # Hard ceiling for dynamic estimator
llm_max_tokens: int = 65536 # Fallback when no estimate is provided
# Embedding endpoint
embedding_api_url: str = "http://localhost:11434/v1"
embedding_model: str = "nomic-embed-text"
embedding_dimensions: int = 768
# Qdrant
qdrant_url: str = "http://localhost:6333"
qdrant_collection: str = "chrysopedia"
# Prompt templates
prompts_path: str = "./prompts"
# Review mode — when True, extracted moments go to review queue before publishing
review_mode: bool = True
# Debug mode — when True, pipeline captures full LLM prompts and responses
debug_mode: bool = False
# File storage
transcript_storage_path: str = "/data/transcripts"
video_metadata_path: str = "/data/video_meta"
# Git commit SHA (set at Docker build time or via env var)
git_commit_sha: str = "unknown"
model_config = {
"env_file": ".env",
"env_file_encoding": "utf-8",
"case_sensitive": False,
}
@lru_cache
def get_settings() -> Settings:
"""Return cached application settings (singleton)."""
return Settings()
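The per-stage overrides above imply a fallback rule: a stage model of `None` falls back to `llm_model`, and an unset modality defaults to `"chat"`. A small resolver sketch (this helper is hypothetical, shown only to illustrate the semantics; `SimpleNamespace` stands in for `Settings`):

```python
from types import SimpleNamespace


def resolve_stage_llm(settings, stage: int) -> tuple[str, str]:
    """Pick (model, modality) for a pipeline stage, falling back to the default chat model."""
    model = getattr(settings, f"llm_stage{stage}_model", None) or settings.llm_model
    modality = getattr(settings, f"llm_stage{stage}_modality", "chat")
    return model, modality


cfg = SimpleNamespace(
    llm_model="fyn-llm-agent-chat",
    llm_stage2_model=None,          # unset override falls back to llm_model
    llm_stage2_modality="chat",
    llm_stage3_model="fyn-llm-agent-think",
    llm_stage3_modality="thinking",
)
assert resolve_stage_llm(cfg, 3) == ("fyn-llm-agent-think", "thinking")
assert resolve_stage_llm(cfg, 2) == ("fyn-llm-agent-chat", "chat")
```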

── backend/database.py (new file, 26 lines) ──
"""Database engine, session factory, and declarative base for Chrysopedia."""
import os
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase
DATABASE_URL = os.getenv(
"DATABASE_URL",
"postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia",
)
engine = create_async_engine(DATABASE_URL, echo=False, pool_pre_ping=True)
async_session = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)
class Base(DeclarativeBase):
"""Declarative base for all ORM models."""
pass
async def get_session() -> AsyncSession: # type: ignore[misc]
"""FastAPI dependency that yields an async DB session."""
async with async_session() as session:
yield session

── backend/main.py (new file, 95 lines) ──
"""Chrysopedia API — Knowledge extraction and retrieval system.
Entry point for the FastAPI application. Configures middleware,
structured logging, and mounts versioned API routers.
"""
import logging
import sys
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from config import get_settings
from routers import creators, health, ingest, pipeline, reports, review, search, techniques, topics, videos
def _setup_logging() -> None:
"""Configure structured logging to stdout."""
settings = get_settings()
level = getattr(logging, settings.app_log_level.upper(), logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
logging.Formatter(
fmt="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
datefmt="%Y-%m-%dT%H:%M:%S",
)
)
root = logging.getLogger()
root.setLevel(level)
# Avoid duplicate handlers on reload
root.handlers.clear()
root.addHandler(handler)
# Quiet noisy libraries
logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)
@asynccontextmanager
async def lifespan(app: FastAPI): # noqa: ARG001
"""Application lifespan: setup on startup, teardown on shutdown."""
_setup_logging()
logger = logging.getLogger("chrysopedia")
settings = get_settings()
logger.info(
"Chrysopedia API starting (env=%s, log_level=%s)",
settings.app_env,
settings.app_log_level,
)
yield
logger.info("Chrysopedia API shutting down")
app = FastAPI(
title="Chrysopedia API",
description="Knowledge extraction and retrieval for music production content",
version="0.1.0",
lifespan=lifespan,
)
# ── Middleware ────────────────────────────────────────────────────────────────
settings = get_settings()
app.add_middleware(
CORSMiddleware,
allow_origins=settings.cors_origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
# ── Routers ──────────────────────────────────────────────────────────────────
# Root-level health (no prefix)
app.include_router(health.router)
# Versioned API
app.include_router(creators.router, prefix="/api/v1")
app.include_router(ingest.router, prefix="/api/v1")
app.include_router(pipeline.router, prefix="/api/v1")
app.include_router(review.router, prefix="/api/v1")
app.include_router(reports.router, prefix="/api/v1")
app.include_router(search.router, prefix="/api/v1")
app.include_router(techniques.router, prefix="/api/v1")
app.include_router(topics.router, prefix="/api/v1")
app.include_router(videos.router, prefix="/api/v1")
@app.get("/api/v1/health")
async def api_health():
"""Lightweight version-prefixed health endpoint (no DB check)."""
return {"status": "ok", "version": "0.1.0"}

── backend/models.py (new file, 419 lines) ──
"""SQLAlchemy ORM models for the Chrysopedia knowledge base.
Seven entities matching chrysopedia-spec.md §6.1:
Creator, SourceVideo, TranscriptSegment, KeyMoment,
TechniquePage, RelatedTechniqueLink, Tag
"""
from __future__ import annotations
import enum
import uuid
from datetime import datetime, timezone
from sqlalchemy import (
Enum,
Float,
ForeignKey,
Integer,
String,
Text,
UniqueConstraint,
func,
)
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy.orm import relationship as sa_relationship
from database import Base
# ── Enums ────────────────────────────────────────────────────────────────────
class ContentType(str, enum.Enum):
"""Source video content type."""
tutorial = "tutorial"
livestream = "livestream"
breakdown = "breakdown"
short_form = "short_form"
class ProcessingStatus(str, enum.Enum):
"""Pipeline processing status for a source video."""
pending = "pending"
transcribed = "transcribed"
extracted = "extracted"
reviewed = "reviewed"
published = "published"
class KeyMomentContentType(str, enum.Enum):
"""Content classification for a key moment."""
technique = "technique"
settings = "settings"
reasoning = "reasoning"
workflow = "workflow"
class ReviewStatus(str, enum.Enum):
"""Human review status for key moments."""
pending = "pending"
approved = "approved"
edited = "edited"
rejected = "rejected"
class SourceQuality(str, enum.Enum):
"""Derived source quality for technique pages."""
structured = "structured"
mixed = "mixed"
unstructured = "unstructured"
class PageReviewStatus(str, enum.Enum):
"""Review lifecycle for technique pages."""
draft = "draft"
reviewed = "reviewed"
published = "published"
class RelationshipType(str, enum.Enum):
"""Types of links between technique pages."""
same_technique_other_creator = "same_technique_other_creator"
same_creator_adjacent = "same_creator_adjacent"
general_cross_reference = "general_cross_reference"
# ── Helpers ──────────────────────────────────────────────────────────────────
def _uuid_pk() -> Mapped[uuid.UUID]:
return mapped_column(
UUID(as_uuid=True),
primary_key=True,
default=uuid.uuid4,
server_default=func.gen_random_uuid(),
)
def _now() -> datetime:
"""Return current UTC time as a naive datetime (no tzinfo).
PostgreSQL TIMESTAMP WITHOUT TIME ZONE columns require naive datetimes.
asyncpg rejects timezone-aware datetimes for such columns.
"""
return datetime.now(timezone.utc).replace(tzinfo=None)
# ── Models ───────────────────────────────────────────────────────────────────
class Creator(Base):
__tablename__ = "creators"
id: Mapped[uuid.UUID] = _uuid_pk()
name: Mapped[str] = mapped_column(String(255), nullable=False)
slug: Mapped[str] = mapped_column(String(255), unique=True, nullable=False)
genres: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
folder_name: Mapped[str] = mapped_column(String(255), nullable=False)
view_count: Mapped[int] = mapped_column(Integer, default=0, server_default="0")
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
updated_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now(), onupdate=_now
)
# relationships
videos: Mapped[list[SourceVideo]] = sa_relationship(back_populates="creator")
technique_pages: Mapped[list[TechniquePage]] = sa_relationship(back_populates="creator")
class SourceVideo(Base):
__tablename__ = "source_videos"
id: Mapped[uuid.UUID] = _uuid_pk()
creator_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("creators.id", ondelete="CASCADE"), nullable=False
)
filename: Mapped[str] = mapped_column(String(500), nullable=False)
file_path: Mapped[str] = mapped_column(String(1000), nullable=False)
duration_seconds: Mapped[int] = mapped_column(Integer, nullable=True)
content_type: Mapped[ContentType] = mapped_column(
Enum(ContentType, name="content_type", create_constraint=True),
nullable=False,
)
transcript_path: Mapped[str | None] = mapped_column(String(1000), nullable=True)
content_hash: Mapped[str | None] = mapped_column(String(64), nullable=True, index=True)
processing_status: Mapped[ProcessingStatus] = mapped_column(
Enum(ProcessingStatus, name="processing_status", create_constraint=True),
default=ProcessingStatus.pending,
server_default="pending",
)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
updated_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now(), onupdate=_now
)
# relationships
creator: Mapped[Creator] = sa_relationship(back_populates="videos")
segments: Mapped[list[TranscriptSegment]] = sa_relationship(back_populates="source_video")
key_moments: Mapped[list[KeyMoment]] = sa_relationship(back_populates="source_video")
class TranscriptSegment(Base):
__tablename__ = "transcript_segments"
id: Mapped[uuid.UUID] = _uuid_pk()
source_video_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False
)
start_time: Mapped[float] = mapped_column(Float, nullable=False)
end_time: Mapped[float] = mapped_column(Float, nullable=False)
text: Mapped[str] = mapped_column(Text, nullable=False)
segment_index: Mapped[int] = mapped_column(Integer, nullable=False)
topic_label: Mapped[str | None] = mapped_column(String(255), nullable=True)
# relationships
source_video: Mapped[SourceVideo] = sa_relationship(back_populates="segments")
class KeyMoment(Base):
__tablename__ = "key_moments"
id: Mapped[uuid.UUID] = _uuid_pk()
source_video_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False
)
technique_page_id: Mapped[uuid.UUID | None] = mapped_column(
ForeignKey("technique_pages.id", ondelete="SET NULL"), nullable=True
)
title: Mapped[str] = mapped_column(String(500), nullable=False)
summary: Mapped[str] = mapped_column(Text, nullable=False)
start_time: Mapped[float] = mapped_column(Float, nullable=False)
end_time: Mapped[float] = mapped_column(Float, nullable=False)
content_type: Mapped[KeyMomentContentType] = mapped_column(
Enum(KeyMomentContentType, name="key_moment_content_type", create_constraint=True),
nullable=False,
)
plugins: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
review_status: Mapped[ReviewStatus] = mapped_column(
Enum(ReviewStatus, name="review_status", create_constraint=True),
default=ReviewStatus.pending,
server_default="pending",
)
raw_transcript: Mapped[str | None] = mapped_column(Text, nullable=True)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
updated_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now(), onupdate=_now
)
# relationships
source_video: Mapped[SourceVideo] = sa_relationship(back_populates="key_moments")
technique_page: Mapped[TechniquePage | None] = sa_relationship(
back_populates="key_moments", foreign_keys=[technique_page_id]
)
class TechniquePage(Base):
__tablename__ = "technique_pages"
id: Mapped[uuid.UUID] = _uuid_pk()
creator_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("creators.id", ondelete="CASCADE"), nullable=False
)
title: Mapped[str] = mapped_column(String(500), nullable=False)
slug: Mapped[str] = mapped_column(String(500), unique=True, nullable=False)
topic_category: Mapped[str] = mapped_column(String(255), nullable=False)
topic_tags: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
summary: Mapped[str | None] = mapped_column(Text, nullable=True)
body_sections: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
signal_chains: Mapped[list | None] = mapped_column(JSONB, nullable=True)
plugins: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
source_quality: Mapped[SourceQuality | None] = mapped_column(
Enum(SourceQuality, name="source_quality", create_constraint=True),
nullable=True,
)
view_count: Mapped[int] = mapped_column(Integer, default=0, server_default="0")
review_status: Mapped[PageReviewStatus] = mapped_column(
Enum(PageReviewStatus, name="page_review_status", create_constraint=True),
default=PageReviewStatus.draft,
server_default="draft",
)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
updated_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now(), onupdate=_now
)
# relationships
creator: Mapped[Creator] = sa_relationship(back_populates="technique_pages")
key_moments: Mapped[list[KeyMoment]] = sa_relationship(
back_populates="technique_page", foreign_keys=[KeyMoment.technique_page_id]
)
versions: Mapped[list[TechniquePageVersion]] = sa_relationship(
back_populates="technique_page", order_by="TechniquePageVersion.version_number"
)
outgoing_links: Mapped[list[RelatedTechniqueLink]] = sa_relationship(
foreign_keys="RelatedTechniqueLink.source_page_id", back_populates="source_page"
)
incoming_links: Mapped[list[RelatedTechniqueLink]] = sa_relationship(
foreign_keys="RelatedTechniqueLink.target_page_id", back_populates="target_page"
)
class RelatedTechniqueLink(Base):
__tablename__ = "related_technique_links"
__table_args__ = (
UniqueConstraint("source_page_id", "target_page_id", "relationship", name="uq_technique_link"),
)
id: Mapped[uuid.UUID] = _uuid_pk()
source_page_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
)
target_page_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
)
relationship: Mapped[RelationshipType] = mapped_column(
Enum(RelationshipType, name="relationship_type", create_constraint=True),
nullable=False,
)
# relationships
source_page: Mapped[TechniquePage] = sa_relationship(
foreign_keys=[source_page_id], back_populates="outgoing_links"
)
target_page: Mapped[TechniquePage] = sa_relationship(
foreign_keys=[target_page_id], back_populates="incoming_links"
)
class TechniquePageVersion(Base):
"""Snapshot of a TechniquePage before a pipeline re-synthesis overwrites it."""
__tablename__ = "technique_page_versions"
id: Mapped[uuid.UUID] = _uuid_pk()
technique_page_id: Mapped[uuid.UUID] = mapped_column(
ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
)
version_number: Mapped[int] = mapped_column(Integer, nullable=False)
content_snapshot: Mapped[dict] = mapped_column(JSONB, nullable=False)
pipeline_metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
# relationships
technique_page: Mapped[TechniquePage] = sa_relationship(
back_populates="versions"
)
class Tag(Base):
__tablename__ = "tags"
id: Mapped[uuid.UUID] = _uuid_pk()
name: Mapped[str] = mapped_column(String(255), unique=True, nullable=False)
category: Mapped[str] = mapped_column(String(255), nullable=False)
aliases: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
# ── Content Report Enums ─────────────────────────────────────────────────────
class ReportType(str, enum.Enum):
"""Classification of user-submitted content reports."""
inaccurate = "inaccurate"
missing_info = "missing_info"
wrong_attribution = "wrong_attribution"
formatting = "formatting"
other = "other"
class ReportStatus(str, enum.Enum):
"""Triage status for content reports."""
open = "open"
acknowledged = "acknowledged"
resolved = "resolved"
dismissed = "dismissed"
# ── Content Report ───────────────────────────────────────────────────────────
class ContentReport(Base):
"""User-submitted report about a content issue.
Generic: content_type + content_id can reference any entity
(technique_page, key_moment, creator, or general).
"""
__tablename__ = "content_reports"
id: Mapped[uuid.UUID] = _uuid_pk()
content_type: Mapped[str] = mapped_column(
String(50), nullable=False, doc="Entity type: technique_page, key_moment, creator, general"
)
content_id: Mapped[uuid.UUID | None] = mapped_column(
UUID(as_uuid=True), nullable=True, doc="FK to the reported entity (null for general reports)"
)
content_title: Mapped[str | None] = mapped_column(
String(500), nullable=True, doc="Snapshot of entity title at report time"
)
report_type: Mapped[ReportType] = mapped_column(
Enum(ReportType, name="report_type", create_constraint=True),
nullable=False,
)
description: Mapped[str] = mapped_column(Text, nullable=False)
status: Mapped[ReportStatus] = mapped_column(
Enum(ReportStatus, name="report_status", create_constraint=True),
default=ReportStatus.open,
server_default="open",
)
admin_notes: Mapped[str | None] = mapped_column(Text, nullable=True)
page_url: Mapped[str | None] = mapped_column(
String(1000), nullable=True, doc="URL the user was on when reporting"
)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
resolved_at: Mapped[datetime | None] = mapped_column(nullable=True)
# ── Pipeline Event ───────────────────────────────────────────────────────────
class PipelineEvent(Base):
"""Structured log entry for pipeline execution.
Captures per-stage start/complete/error/llm_call events with
token usage and optional response payloads for debugging.
"""
__tablename__ = "pipeline_events"
id: Mapped[uuid.UUID] = _uuid_pk()
video_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True), nullable=False, index=True,
)
stage: Mapped[str] = mapped_column(
String(50), nullable=False, doc="stage2_segmentation, stage3_extraction, etc."
)
event_type: Mapped[str] = mapped_column(
String(30), nullable=False, doc="start, complete, error, llm_call"
)
prompt_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
completion_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
total_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
model: Mapped[str | None] = mapped_column(String(100), nullable=True)
duration_ms: Mapped[int | None] = mapped_column(Integer, nullable=True)
payload: Mapped[dict | None] = mapped_column(
JSONB, nullable=True, doc="LLM response content, error details, stage metadata"
)
created_at: Mapped[datetime] = mapped_column(
default=_now, server_default=func.now()
)
# Debug mode — full LLM I/O capture columns
system_prompt_text: Mapped[str | None] = mapped_column(Text, nullable=True)
user_prompt_text: Mapped[str | None] = mapped_column(Text, nullable=True)
response_text: Mapped[str | None] = mapped_column(Text, nullable=True)

@@ -0,0 +1,88 @@
"""Synchronous embedding client using the OpenAI-compatible /v1/embeddings API.
Uses ``openai.OpenAI`` (sync) since Celery tasks run synchronously.
Handles connection failures gracefully; embedding is non-blocking for the pipeline.
"""
from __future__ import annotations
import logging
import openai
from config import Settings
logger = logging.getLogger(__name__)
class EmbeddingClient:
"""Sync embedding client backed by an OpenAI-compatible /v1/embeddings endpoint."""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self._client = openai.OpenAI(
base_url=settings.embedding_api_url,
api_key=settings.llm_api_key,
)
def embed(self, texts: list[str]) -> list[list[float]]:
"""Generate embedding vectors for a batch of texts.
Parameters
----------
texts:
List of strings to embed.
Returns
-------
list[list[float]]
Embedding vectors. Returns empty list on connection/timeout errors
so the pipeline can continue without embeddings.
"""
if not texts:
return []
try:
response = self._client.embeddings.create(
model=self.settings.embedding_model,
input=texts,
)
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
logger.warning(
"Embedding API unavailable (%s: %s). Skipping %d texts.",
type(exc).__name__,
exc,
len(texts),
)
return []
except openai.APIError as exc:
logger.warning(
"Embedding API error (%s: %s). Skipping %d texts.",
type(exc).__name__,
exc,
len(texts),
)
return []
vectors = [item.embedding for item in response.data]
# Validate dimensions
expected_dim = self.settings.embedding_dimensions
for i, vec in enumerate(vectors):
if len(vec) != expected_dim:
logger.warning(
"Embedding dimension mismatch at index %d: expected %d, got %d. "
"Returning empty list.",
i,
expected_dim,
len(vec),
)
return []
logger.info(
"Generated %d embeddings (dim=%d) using model=%s",
len(vectors),
expected_dim,
self.settings.embedding_model,
)
return vectors
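The dimension check above is all-or-nothing: one bad vector discards the whole batch. A minimal standalone restatement of that contract (the function name is hypothetical, not part of the module):

```python
def validate_dimensions(vectors: list[list[float]], expected_dim: int) -> list[list[float]]:
    """Return the batch unchanged if every vector matches expected_dim, else []."""
    for vec in vectors:
        if len(vec) != expected_dim:
            return []  # mirror EmbeddingClient: any mismatch drops the whole batch
    return vectors

print(validate_dimensions([[0.1, 0.2], [0.3, 0.4]], expected_dim=2))  # batch unchanged
print(validate_dimensions([[0.1, 0.2], [0.3]], expected_dim=2))       # []
```

Dropping the batch (rather than padding or filtering) keeps Qdrant payloads and vectors aligned downstream.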

@@ -0,0 +1,328 @@
"""Synchronous LLM client with primary/fallback endpoint logic.
Uses the OpenAI-compatible API (works with Ollama, vLLM, OpenWebUI, etc.).
Celery tasks run synchronously, so this uses ``openai.OpenAI`` (not Async).
Supports two modalities:
- **chat**: Standard JSON mode with ``response_format: {"type": "json_object"}``
- **thinking**: For reasoning models that emit ``<think>...</think>`` blocks
before their answer. Skips ``response_format``, appends JSON instructions to
the system prompt, and strips think tags from the response.
"""
from __future__ import annotations
import logging
import re
from typing import TYPE_CHECKING, TypeVar
if TYPE_CHECKING:
from collections.abc import Callable
import openai
from pydantic import BaseModel
from config import Settings
logger = logging.getLogger(__name__)
T = TypeVar("T", bound=BaseModel)
# ── Think-tag stripping ──────────────────────────────────────────────────────
_THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)
def strip_think_tags(text: str) -> str:
"""Remove ``<think>...</think>`` blocks from LLM output.
Thinking/reasoning models often prefix their JSON with a reasoning trace
wrapped in ``<think>`` tags. This strips all such blocks (including
multiline and multiple occurrences) and returns the cleaned text.
Handles:
- Single ``<think>...</think>`` block
- Multiple blocks in one response
- Multiline content inside think tags
- Responses with no think tags (passthrough)
- Empty input (passthrough)
"""
if not text:
return text
cleaned = _THINK_PATTERN.sub("", text)
return cleaned.strip()
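A quick standalone check of the stripping behavior, re-declaring the regex so the snippet runs on its own:

```python
import re

# Same pattern as _THINK_PATTERN above, re-declared so this runs standalone.
think_pattern = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think(text: str) -> str:
    if not text:
        return text
    return think_pattern.sub("", text).strip()

print(strip_think('<think>step 1...\nstep 2...</think>{"segments": []}'))
# {"segments": []}
print(strip_think('<think>a</think>x <think>b</think>y'))  # x y
```

The non-greedy `.*?` with `re.DOTALL` is what makes multiple blocks and multiline reasoning traces both work.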
# ── Token estimation ─────────────────────────────────────────────────────────
# Stage-specific output multipliers: estimated output tokens as a ratio of input tokens.
# These are empirically tuned based on observed pipeline behavior.
_STAGE_OUTPUT_RATIOS: dict[str, float] = {
"stage2_segmentation": 0.3, # Compact topic groups — much smaller than input
"stage3_extraction": 1.2, # Detailed moments with summaries — can exceed input
"stage4_classification": 0.15, # Index + category + tags per moment — very compact
"stage5_synthesis": 1.5, # Full prose technique pages — heaviest output
}
# Minimum floor so we never send a trivially small max_tokens
_MIN_MAX_TOKENS = 2048
def estimate_tokens(text: str) -> int:
"""Estimate token count from text using a chars-per-token heuristic.
    Uses 3.5 chars/token, which is conservative for English + JSON markup.
"""
if not text:
return 0
return max(1, int(len(text) / 3.5))
def estimate_max_tokens(
system_prompt: str,
user_prompt: str,
stage: str | None = None,
hard_limit: int = 32768,
) -> int:
"""Estimate the max_tokens parameter for an LLM call.
Calculates expected output size based on input size and stage-specific
multipliers. The result is clamped between _MIN_MAX_TOKENS and hard_limit.
Parameters
----------
system_prompt:
The system prompt text.
user_prompt:
The user prompt text (transcript, moments, etc.).
stage:
Pipeline stage name (e.g. "stage3_extraction"). If None or unknown,
uses a default 1.0x multiplier.
hard_limit:
        Absolute ceiling; never exceed this value.
Returns
-------
int
Estimated max_tokens value to pass to the LLM API.
"""
input_tokens = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
ratio = _STAGE_OUTPUT_RATIOS.get(stage or "", 1.0)
estimated_output = int(input_tokens * ratio)
# Add a 20% buffer for JSON overhead and variability
estimated_output = int(estimated_output * 1.2)
# Clamp to [_MIN_MAX_TOKENS, hard_limit]
result = max(_MIN_MAX_TOKENS, min(estimated_output, hard_limit))
logger.info(
"Token estimate: input≈%d, stage=%s, ratio=%.2f, estimated_output=%d, max_tokens=%d (hard_limit=%d)",
input_tokens, stage or "default", ratio, estimated_output, result, hard_limit,
)
return result
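Worked through by hand for a stage-3 call (the input size is illustrative): a 35,000-character prompt estimates to 10,000 input tokens; the 1.2x stage ratio and the 20% buffer give 14,400, which sits inside the [2048, 32768] clamp. The same arithmetic, restated standalone:

```python
def estimate(input_chars: int, ratio: float, hard_limit: int = 32768, floor: int = 2048) -> int:
    input_tokens = max(1, int(input_chars / 3.5))            # chars-per-token heuristic
    estimated_output = int(int(input_tokens * ratio) * 1.2)  # stage ratio, then 20% buffer
    return max(floor, min(estimated_output, hard_limit))

print(estimate(35_000, ratio=1.2))   # stage3_extraction -> 14400
print(estimate(1_000, ratio=0.15))   # tiny stage4 call clamps up to the 2048 floor
```

Small inputs always hit the floor, so even a short stage-4 classification call gets a workable `max_tokens`.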
class LLMClient:
"""Sync LLM client that tries a primary endpoint and falls back on failure."""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self._primary = openai.OpenAI(
base_url=settings.llm_api_url,
api_key=settings.llm_api_key,
)
self._fallback = openai.OpenAI(
base_url=settings.llm_fallback_url,
api_key=settings.llm_api_key,
)
# ── Core completion ──────────────────────────────────────────────────
def complete(
self,
system_prompt: str,
user_prompt: str,
response_model: type[BaseModel] | None = None,
modality: str = "chat",
model_override: str | None = None,
on_complete: "Callable | None" = None,
max_tokens: int | None = None,
) -> str:
"""Send a chat completion request, falling back on connection/timeout errors.
Parameters
----------
system_prompt:
System message content.
user_prompt:
User message content.
response_model:
If provided and modality is "chat", ``response_format`` is set to
``{"type": "json_object"}``. For "thinking" modality, JSON
instructions are appended to the system prompt instead.
modality:
Either "chat" (default) or "thinking". Thinking modality skips
response_format and strips ``<think>`` tags from output.
model_override:
Model name to use instead of the default. If None, uses the
configured default for the endpoint.
max_tokens:
Override for max_tokens on this call. If None, falls back to
the configured ``llm_max_tokens`` from settings.
Returns
-------
str
Raw completion text from the model (think tags stripped if thinking).
"""
kwargs: dict = {}
effective_system = system_prompt
if modality == "thinking":
# Thinking models often don't support response_format: json_object.
# Instead, append explicit JSON instructions to the system prompt.
if response_model is not None:
json_schema_hint = (
"\n\nYou MUST respond with ONLY valid JSON. "
"No markdown code fences, no explanation, no preamble — "
"just the raw JSON object."
)
effective_system = system_prompt + json_schema_hint
else:
# Chat modality — use standard JSON mode
if response_model is not None:
kwargs["response_format"] = {"type": "json_object"}
messages = [
{"role": "system", "content": effective_system},
{"role": "user", "content": user_prompt},
]
primary_model = model_override or self.settings.llm_model
fallback_model = self.settings.llm_fallback_model
effective_max_tokens = max_tokens if max_tokens is not None else self.settings.llm_max_tokens
logger.info(
"LLM request: model=%s, modality=%s, response_model=%s, max_tokens=%d",
primary_model,
modality,
response_model.__name__ if response_model else None,
effective_max_tokens,
)
# --- Try primary endpoint ---
try:
response = self._primary.chat.completions.create(
model=primary_model,
messages=messages,
max_tokens=effective_max_tokens,
**kwargs,
)
raw = response.choices[0].message.content or ""
usage = getattr(response, "usage", None)
if usage:
logger.info(
"LLM response: prompt_tokens=%s, completion_tokens=%s, total=%s, content_len=%d, finish=%s",
usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
len(raw), response.choices[0].finish_reason,
)
if modality == "thinking":
raw = strip_think_tags(raw)
if on_complete is not None:
try:
on_complete(
model=primary_model,
prompt_tokens=usage.prompt_tokens if usage else None,
completion_tokens=usage.completion_tokens if usage else None,
total_tokens=usage.total_tokens if usage else None,
content=raw,
finish_reason=response.choices[0].finish_reason if response.choices else None,
)
except Exception as cb_exc:
logger.warning("on_complete callback failed: %s", cb_exc)
return raw
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
logger.warning(
"Primary LLM endpoint failed (%s: %s), trying fallback at %s",
type(exc).__name__,
exc,
self.settings.llm_fallback_url,
)
# --- Try fallback endpoint ---
try:
response = self._fallback.chat.completions.create(
model=fallback_model,
messages=messages,
max_tokens=effective_max_tokens,
**kwargs,
)
raw = response.choices[0].message.content or ""
usage = getattr(response, "usage", None)
if usage:
logger.info(
"LLM response (fallback): prompt_tokens=%s, completion_tokens=%s, total=%s, content_len=%d, finish=%s",
usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
len(raw), response.choices[0].finish_reason,
)
if modality == "thinking":
raw = strip_think_tags(raw)
if on_complete is not None:
try:
on_complete(
model=fallback_model,
prompt_tokens=usage.prompt_tokens if usage else None,
completion_tokens=usage.completion_tokens if usage else None,
total_tokens=usage.total_tokens if usage else None,
content=raw,
finish_reason=response.choices[0].finish_reason if response.choices else None,
is_fallback=True,
)
except Exception as cb_exc:
logger.warning("on_complete callback failed: %s", cb_exc)
return raw
except (openai.APIConnectionError, openai.APITimeoutError, openai.APIError) as exc:
logger.error(
"Fallback LLM endpoint also failed (%s: %s). Giving up.",
type(exc).__name__,
exc,
)
raise
# ── Response parsing ─────────────────────────────────────────────────
def parse_response(self, text: str, model: type[T]) -> T:
"""Parse raw LLM output as JSON and validate against a Pydantic model.
Parameters
----------
text:
Raw JSON string from the LLM.
model:
Pydantic model class to validate against.
Returns
-------
T
Validated Pydantic model instance.
Raises
------
pydantic.ValidationError
If the JSON doesn't match the schema.
ValueError
If the text is not valid JSON.
"""
try:
return model.model_validate_json(text)
except Exception:
logger.error(
"Failed to parse LLM response as %s. Response text: %.500s",
model.__name__,
text,
)
raise
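End to end, a thinking-modality response is stripped and then validated. A sketch of that flow, using stdlib `json` in place of the Pydantic model so it has no dependencies (the raw payload is invented for illustration):

```python
import json
import re

raw = '<think>The transcript has two topics...</think>{"segments": [{"topic_label": "EQ basics"}]}'

# strip_think_tags, then parse_response (which would hand this to a Pydantic model)
cleaned = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
parsed = json.loads(cleaned)

print(parsed["segments"][0]["topic_label"])  # EQ basics
```

If stripping were skipped, `json.loads` would fail on the leading `<think>` block, which is exactly the failure mode `parse_response` logs before re-raising.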

@@ -0,0 +1,184 @@
"""Qdrant vector database manager for collection lifecycle and point upserts.
Handles collection creation (idempotent) and batch upserts for technique pages
and key moments. Connection failures are non-blocking; the pipeline continues
without search indexing.
"""
from __future__ import annotations
import logging
import uuid
from qdrant_client import QdrantClient
from qdrant_client.http import exceptions as qdrant_exceptions
from qdrant_client.models import Distance, PointStruct, VectorParams
from config import Settings
logger = logging.getLogger(__name__)
class QdrantManager:
"""Manages a Qdrant collection for Chrysopedia technique-page and key-moment vectors."""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self._client = QdrantClient(url=settings.qdrant_url)
self._collection = settings.qdrant_collection
# ── Collection management ────────────────────────────────────────────
def ensure_collection(self) -> None:
"""Create the collection if it does not already exist.
Uses cosine distance and the configured embedding dimensions.
"""
try:
if self._client.collection_exists(self._collection):
logger.info("Qdrant collection '%s' already exists.", self._collection)
return
self._client.create_collection(
collection_name=self._collection,
vectors_config=VectorParams(
size=self.settings.embedding_dimensions,
distance=Distance.COSINE,
),
)
logger.info(
"Created Qdrant collection '%s' (dim=%d, cosine).",
self._collection,
self.settings.embedding_dimensions,
)
except qdrant_exceptions.UnexpectedResponse as exc:
logger.warning(
"Qdrant error during ensure_collection (%s). Skipping.",
exc,
)
except Exception as exc:
logger.warning(
"Qdrant connection failed during ensure_collection (%s: %s). Skipping.",
type(exc).__name__,
exc,
)
# ── Low-level upsert ─────────────────────────────────────────────────
def upsert_points(self, points: list[PointStruct]) -> None:
"""Upsert a batch of pre-built PointStruct objects."""
if not points:
return
try:
self._client.upsert(
collection_name=self._collection,
points=points,
)
logger.info(
"Upserted %d points to Qdrant collection '%s'.",
len(points),
self._collection,
)
except qdrant_exceptions.UnexpectedResponse as exc:
logger.warning(
"Qdrant upsert failed (%s). %d points skipped.",
exc,
len(points),
)
except Exception as exc:
logger.warning(
"Qdrant upsert connection error (%s: %s). %d points skipped.",
type(exc).__name__,
exc,
len(points),
)
# ── High-level upserts ───────────────────────────────────────────────
def upsert_technique_pages(
self,
pages: list[dict],
vectors: list[list[float]],
) -> None:
"""Build and upsert PointStructs for technique pages.
Each page dict must contain:
page_id, creator_id, title, topic_category, topic_tags, summary
Parameters
----------
pages:
Metadata dicts, one per technique page.
vectors:
Corresponding embedding vectors (same order as pages).
"""
if len(pages) != len(vectors):
logger.warning(
"Technique-page count (%d) != vector count (%d). Skipping upsert.",
len(pages),
len(vectors),
)
return
points = []
for page, vector in zip(pages, vectors):
point = PointStruct(
id=str(uuid.uuid4()),
vector=vector,
payload={
"type": "technique_page",
"page_id": page["page_id"],
"creator_id": page["creator_id"],
"title": page["title"],
"topic_category": page["topic_category"],
"topic_tags": page.get("topic_tags") or [],
"summary": page.get("summary") or "",
},
)
points.append(point)
self.upsert_points(points)
def upsert_key_moments(
self,
moments: list[dict],
vectors: list[list[float]],
) -> None:
"""Build and upsert PointStructs for key moments.
Each moment dict must contain:
moment_id, source_video_id, title, start_time, end_time, content_type
Parameters
----------
moments:
Metadata dicts, one per key moment.
vectors:
Corresponding embedding vectors (same order as moments).
"""
if len(moments) != len(vectors):
logger.warning(
"Key-moment count (%d) != vector count (%d). Skipping upsert.",
len(moments),
len(vectors),
)
return
points = []
for moment, vector in zip(moments, vectors):
point = PointStruct(
id=str(uuid.uuid4()),
vector=vector,
payload={
"type": "key_moment",
"moment_id": moment["moment_id"],
"source_video_id": moment["source_video_id"],
"title": moment["title"],
"start_time": moment["start_time"],
"end_time": moment["end_time"],
"content_type": moment["content_type"],
},
)
points.append(point)
self.upsert_points(points)
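Both high-level upsert helpers share the same all-or-nothing count guard. A minimal standalone restatement, with plain dicts standing in for `PointStruct`:

```python
def build_points(metas: list[dict], vectors: list[list[float]]) -> list[dict]:
    if len(metas) != len(vectors):
        return []  # skip the whole batch rather than misalign payloads and vectors
    return [{"payload": m, "vector": v} for m, v in zip(metas, vectors)]

print(len(build_points([{"title": "a"}, {"title": "b"}], [[0.1], [0.2]])))  # 2
print(build_points([{"title": "a"}], []))  # [] -- mismatch drops the batch
```

Since `zip` silently truncates to the shorter sequence, checking lengths up front is what prevents a payload being stored against the wrong vector.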

@@ -0,0 +1,99 @@
"""Pydantic schemas for pipeline stage inputs and outputs.
Stage 2 (Segmentation): groups transcript segments by topic.
Stage 3 (Extraction): extracts key moments from segments.
Stage 4 (Classification): classifies moments by category/tags.
Stage 5 (Synthesis): generates technique pages from classified moments.
"""
from __future__ import annotations
from pydantic import BaseModel, Field
# ── Stage 2: Segmentation ───────────────────────────────────────────────────
class TopicSegment(BaseModel):
"""A contiguous group of transcript segments sharing a topic."""
start_index: int = Field(description="First transcript segment index in this group")
end_index: int = Field(description="Last transcript segment index in this group (inclusive)")
topic_label: str = Field(description="Short label describing the topic")
summary: str = Field(description="Brief summary of what is discussed")
class SegmentationResult(BaseModel):
"""Full output of stage 2 (segmentation)."""
segments: list[TopicSegment]
# ── Stage 3: Extraction ─────────────────────────────────────────────────────
class ExtractedMoment(BaseModel):
"""A single key moment extracted from a topic segment group."""
title: str = Field(description="Concise title for the moment")
summary: str = Field(description="Detailed summary of the technique/concept")
start_time: float = Field(description="Start time in seconds")
end_time: float = Field(description="End time in seconds")
content_type: str = Field(description="One of: technique, settings, reasoning, workflow")
plugins: list[str] = Field(default_factory=list, description="Plugins/tools mentioned")
raw_transcript: str = Field(default="", description="Raw transcript text for this moment")
class ExtractionResult(BaseModel):
"""Full output of stage 3 (extraction)."""
moments: list[ExtractedMoment]
# ── Stage 4: Classification ─────────────────────────────────────────────────
class ClassifiedMoment(BaseModel):
"""Classification metadata for a single extracted moment."""
moment_index: int = Field(description="Index into ExtractionResult.moments")
topic_category: str = Field(description="High-level topic category")
topic_tags: list[str] = Field(default_factory=list, description="Specific topic tags")
content_type_override: str | None = Field(
default=None,
description="Override for content_type if classification disagrees with extraction",
)
class ClassificationResult(BaseModel):
"""Full output of stage 4 (classification)."""
classifications: list[ClassifiedMoment]
# ── Stage 5: Synthesis ───────────────────────────────────────────────────────
class SynthesizedPage(BaseModel):
"""A technique page synthesized from classified moments."""
title: str = Field(description="Page title")
slug: str = Field(description="URL-safe slug")
topic_category: str = Field(description="Primary topic category")
topic_tags: list[str] = Field(default_factory=list, description="Associated tags")
summary: str = Field(description="Page summary / overview paragraph")
body_sections: dict = Field(
default_factory=dict,
description="Structured body content as section_name -> content mapping",
)
signal_chains: list[dict] = Field(
default_factory=list,
description="Signal chain descriptions (for audio/music production contexts)",
)
plugins: list[str] = Field(default_factory=list, description="Plugins/tools referenced")
source_quality: str = Field(
default="mixed",
description="One of: structured, mixed, unstructured",
)
class SynthesisResult(BaseModel):
"""Full output of stage 5 (synthesis)."""
pages: list[SynthesizedPage]
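A stage-2 response that would satisfy `SegmentationResult` looks like the following; the payload is invented for illustration and is checked here with stdlib `json` rather than Pydantic, so the snippet has no dependencies:

```python
import json

stage2_json = """
{"segments": [
  {"start_index": 0, "end_index": 14, "topic_label": "Vocal chain overview",
   "summary": "Walks through the full vocal processing chain."},
  {"start_index": 15, "end_index": 42, "topic_label": "De-essing settings",
   "summary": "Threshold and band choices for the de-esser."}
]}
"""

data = json.loads(stage2_json)
# Every segment carries the four fields TopicSegment requires.
assert all({"start_index", "end_index", "topic_label", "summary"} <= seg.keys()
           for seg in data["segments"])
print(len(data["segments"]))  # 2
```

In the pipeline proper this string would go through `SegmentationResult.model_validate_json`, which also enforces field types, not just presence.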

@@ -12,6 +12,7 @@ from __future__ import annotations
import hashlib
import json
import logging
import subprocess
import time
from collections import defaultdict
from pathlib import Path
@@ -24,6 +25,7 @@ from sqlalchemy.orm import Session, sessionmaker
from config import get_settings
from models import (
Creator,
KeyMoment,
KeyMomentContentType,
PipelineEvent,
@@ -34,7 +36,7 @@ from models import (
TranscriptSegment,
)
from pipeline.embedding_client import EmbeddingClient
from pipeline.llm_client import LLMClient
from pipeline.llm_client import LLMClient, estimate_max_tokens
from pipeline.qdrant_client import QdrantManager
from pipeline.schemas import (
ClassificationResult,
@@ -60,6 +62,9 @@ def _emit_event(
model: str | None = None,
duration_ms: int | None = None,
payload: dict | None = None,
system_prompt_text: str | None = None,
user_prompt_text: str | None = None,
response_text: str | None = None,
) -> None:
"""Persist a pipeline event to the DB. Best-effort -- failures logged, not raised."""
try:
@@ -75,6 +80,9 @@ def _emit_event(
model=model,
duration_ms=duration_ms,
payload=payload,
system_prompt_text=system_prompt_text,
user_prompt_text=user_prompt_text,
response_text=response_text,
)
session.add(event)
session.commit()
@@ -84,8 +92,34 @@
logger.warning("Failed to emit pipeline event: %s", exc)
def _make_llm_callback(video_id: str, stage: str):
"""Create an on_complete callback for LLMClient that emits llm_call events."""
def _is_debug_mode() -> bool:
"""Check if debug mode is enabled via Redis. Falls back to config setting."""
try:
import redis
settings = get_settings()
r = redis.from_url(settings.redis_url)
val = r.get("chrysopedia:debug_mode")
r.close()
if val is not None:
return val.decode().lower() == "true"
except Exception:
pass
return getattr(get_settings(), "debug_mode", False)
def _make_llm_callback(
video_id: str,
stage: str,
system_prompt: str | None = None,
user_prompt: str | None = None,
):
"""Create an on_complete callback for LLMClient that emits llm_call events.
When debug mode is enabled, captures full system prompt, user prompt,
and response text on each llm_call event.
"""
debug = _is_debug_mode()
def callback(*, model=None, prompt_tokens=None, completion_tokens=None,
total_tokens=None, content=None, finish_reason=None,
is_fallback=False, **_kwargs):
@@ -105,6 +139,9 @@ def _make_llm_callback(video_id: str, stage: str):
"finish_reason": finish_reason,
"is_fallback": is_fallback,
},
system_prompt_text=system_prompt if debug else None,
user_prompt_text=user_prompt if debug else None,
response_text=content if debug else None,
)
return callback
@@ -271,9 +308,11 @@ def stage2_segmentation(self, video_id: str) -> str:
llm = _get_llm_client()
model_override, modality = _get_stage_config(2)
logger.info("Stage 2 using model=%s, modality=%s", model_override or "default", modality)
raw = llm.complete(system_prompt, user_prompt, response_model=SegmentationResult, on_complete=_make_llm_callback(video_id, "stage2_segmentation"),
modality=modality, model_override=model_override)
hard_limit = get_settings().llm_max_tokens_hard_limit
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage2_segmentation", hard_limit=hard_limit)
logger.info("Stage 2 using model=%s, modality=%s, max_tokens=%d", model_override or "default", modality, max_tokens)
raw = llm.complete(system_prompt, user_prompt, response_model=SegmentationResult, on_complete=_make_llm_callback(video_id, "stage2_segmentation", system_prompt=system_prompt, user_prompt=user_prompt),
modality=modality, model_override=model_override, max_tokens=max_tokens)
result = _safe_parse_llm_response(raw, SegmentationResult, llm, system_prompt, user_prompt,
modality=modality, model_override=model_override)
@@ -345,6 +384,7 @@ def stage3_extraction(self, video_id: str) -> str:
system_prompt = _load_prompt("stage3_extraction.txt")
llm = _get_llm_client()
model_override, modality = _get_stage_config(3)
hard_limit = get_settings().llm_max_tokens_hard_limit
logger.info("Stage 3 using model=%s, modality=%s", model_override or "default", modality)
total_moments = 0
@@ -362,8 +402,9 @@
f"<segment>\n{segment_text}\n</segment>"
)
raw = llm.complete(system_prompt, user_prompt, response_model=ExtractionResult, on_complete=_make_llm_callback(video_id, "stage3_extraction"),
modality=modality, model_override=model_override)
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage3_extraction", hard_limit=hard_limit)
raw = llm.complete(system_prompt, user_prompt, response_model=ExtractionResult, on_complete=_make_llm_callback(video_id, "stage3_extraction", system_prompt=system_prompt, user_prompt=user_prompt),
modality=modality, model_override=model_override, max_tokens=max_tokens)
result = _safe_parse_llm_response(raw, ExtractionResult, llm, system_prompt, user_prompt,
modality=modality, model_override=model_override)
@@ -474,9 +515,11 @@ def stage4_classification(self, video_id: str) -> str:
llm = _get_llm_client()
model_override, modality = _get_stage_config(4)
logger.info("Stage 4 using model=%s, modality=%s", model_override or "default", modality)
raw = llm.complete(system_prompt, user_prompt, response_model=ClassificationResult, on_complete=_make_llm_callback(video_id, "stage4_classification"),
modality=modality, model_override=model_override)
hard_limit = get_settings().llm_max_tokens_hard_limit
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage4_classification", hard_limit=hard_limit)
logger.info("Stage 4 using model=%s, modality=%s, max_tokens=%d", model_override or "default", modality, max_tokens)
raw = llm.complete(system_prompt, user_prompt, response_model=ClassificationResult, on_complete=_make_llm_callback(video_id, "stage4_classification", system_prompt=system_prompt, user_prompt=user_prompt),
modality=modality, model_override=model_override, max_tokens=max_tokens)
result = _safe_parse_llm_response(raw, ClassificationResult, llm, system_prompt, user_prompt,
modality=modality, model_override=model_override)
@@ -548,6 +591,44 @@ def _load_classification_data(video_id: str) -> list[dict]:
return json.loads(raw)
def _get_git_commit_sha() -> str:
"""Resolve the git commit SHA used to build this image.
Resolution order:
1. /app/.git-commit file (written during Docker build)
2. git rev-parse --short HEAD (local dev)
3. GIT_COMMIT_SHA env var / config setting
4. "unknown"
"""
# Docker build artifact
git_commit_file = Path("/app/.git-commit")
if git_commit_file.exists():
sha = git_commit_file.read_text(encoding="utf-8").strip()
if sha and sha != "unknown":
return sha
# Local dev — run git
try:
result = subprocess.run(
["git", "rev-parse", "--short", "HEAD"],
capture_output=True, text=True, timeout=5,
)
if result.returncode == 0 and result.stdout.strip():
return result.stdout.strip()
except (FileNotFoundError, subprocess.TimeoutExpired):
pass
# Config / env var fallback
try:
sha = get_settings().git_commit_sha
if sha and sha != "unknown":
return sha
except Exception:
pass
return "unknown"
def _capture_pipeline_metadata() -> dict:
"""Capture current pipeline configuration for version metadata.
@@ -578,6 +659,7 @@ def _capture_pipeline_metadata() -> dict:
prompt_hashes[filename] = ""
return {
"git_commit_sha": _get_git_commit_sha(),
"models": {
"stage2": settings.llm_stage2_model,
"stage3": settings.llm_stage3_model,
@@ -631,6 +713,12 @@ def stage5_synthesis(self, video_id: str) -> str:
.all()
)
# Resolve creator name for the LLM prompt
creator = session.execute(
select(Creator).where(Creator.id == video.creator_id)
).scalar_one_or_none()
creator_name = creator.name if creator else "Unknown"
if not moments:
logger.info("Stage 5: No moments found for video_id=%s, skipping.", video_id)
return video_id
@@ -649,6 +737,7 @@ def stage5_synthesis(self, video_id: str) -> str:
system_prompt = _load_prompt("stage5_synthesis.txt")
llm = _get_llm_client()
model_override, modality = _get_stage_config(5)
hard_limit = get_settings().llm_max_tokens_hard_limit
logger.info("Stage 5 using model=%s, modality=%s", model_override or "default", modality)
pages_created = 0
@@ -671,16 +760,38 @@ def stage5_synthesis(self, video_id: str) -> str:
)
moments_text = "\n\n".join(moments_lines)
user_prompt = f"<moments>\n{moments_text}\n</moments>"
user_prompt = f"<creator>{creator_name}</creator>\n<moments>\n{moments_text}\n</moments>"
raw = llm.complete(system_prompt, user_prompt, response_model=SynthesisResult, on_complete=_make_llm_callback(video_id, "stage5_synthesis"),
modality=modality, model_override=model_override)
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage5_synthesis", hard_limit=hard_limit)
raw = llm.complete(system_prompt, user_prompt, response_model=SynthesisResult, on_complete=_make_llm_callback(video_id, "stage5_synthesis", system_prompt=system_prompt, user_prompt=user_prompt),
modality=modality, model_override=model_override, max_tokens=max_tokens)
result = _safe_parse_llm_response(raw, SynthesisResult, llm, system_prompt, user_prompt,
modality=modality, model_override=model_override)
# Load prior pages from this video (snapshot taken before pipeline reset)
prior_page_ids = _load_prior_pages(video_id)
# Create/update TechniquePage rows
for page_data in result.pages:
# Check if page with this slug already exists
existing = None
# First: check prior pages from this video by creator + category
if prior_page_ids:
existing = session.execute(
select(TechniquePage).where(
TechniquePage.id.in_(prior_page_ids),
TechniquePage.creator_id == video.creator_id,
TechniquePage.topic_category == (page_data.topic_category or category),
)
).scalar_one_or_none()
if existing:
logger.info(
"Stage 5: Matched prior page '%s' (id=%s) by creator+category for video_id=%s",
existing.slug, existing.id, video_id,
)
# Fallback: check by slug (handles cross-video dedup)
if existing is None:
existing = session.execute(
select(TechniquePage).where(TechniquePage.slug == page_data.slug)
).scalar_one_or_none()
@@ -912,6 +1023,58 @@ def stage6_embed_and_index(self, video_id: str) -> str:
session.close()
def _snapshot_prior_pages(video_id: str) -> None:
"""Save existing technique_page_ids linked to this video before pipeline resets them.
When a video is reprocessed, stage 3 deletes and recreates key_moments,
breaking the link to technique pages. This snapshots the page IDs to Redis
so stage 5 can find and update prior pages instead of creating duplicates.
"""
import redis
session = _get_sync_session()
try:
# Find technique pages linked via this video's key moments
rows = session.execute(
select(KeyMoment.technique_page_id)
.where(
KeyMoment.source_video_id == video_id,
KeyMoment.technique_page_id.isnot(None),
)
.distinct()
).scalars().all()
page_ids = [str(pid) for pid in rows]
if page_ids:
settings = get_settings()
r = redis.Redis.from_url(settings.redis_url)
key = f"chrysopedia:prior_pages:{video_id}"
r.set(key, json.dumps(page_ids), ex=86400)
logger.info(
"Snapshot %d prior technique pages for video_id=%s: %s",
len(page_ids), video_id, page_ids,
)
else:
logger.info("No prior technique pages for video_id=%s", video_id)
finally:
session.close()
def _load_prior_pages(video_id: str) -> list[str]:
"""Load prior technique page IDs from Redis."""
import redis
settings = get_settings()
r = redis.Redis.from_url(settings.redis_url)
key = f"chrysopedia:prior_pages:{video_id}"
raw = r.get(key)
if raw is None:
return []
return json.loads(raw)
# ── Orchestrator ─────────────────────────────────────────────────────────────
@celery_app.task
@@ -945,6 +1108,9 @@ def run_pipeline(video_id: str) -> str:
finally:
session.close()
# Snapshot prior technique pages before pipeline resets key_moments
_snapshot_prior_pages(video_id)
# Build the chain based on current status
stages = []
if status in (ProcessingStatus.pending, ProcessingStatus.transcribed):

backend/pytest.ini Normal file

@@ -0,0 +1,3 @@
[pytest]
asyncio_mode = auto
testpaths = tests

backend/redis_client.py Normal file

@@ -0,0 +1,15 @@
"""Async Redis client helper for Chrysopedia."""
import redis.asyncio as aioredis
from config import get_settings
async def get_redis() -> aioredis.Redis:
"""Return an async Redis client from the configured URL.
Callers should close the connection when done, or use it
as a short-lived client within a request handler.
"""
settings = get_settings()
return aioredis.from_url(settings.redis_url, decode_responses=True)

backend/routers/__init__.py Normal file

@@ -0,0 +1 @@
"""Chrysopedia API routers package."""

backend/routers/creators.py Normal file

@@ -0,0 +1,119 @@
"""Creator endpoints for Chrysopedia API.
Enhanced with sort (random default per R014), genre filter, and
technique/video counts for browse pages.
"""
import logging
from typing import Annotated
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from database import get_session
from models import Creator, SourceVideo, TechniquePage
from schemas import CreatorBrowseItem, CreatorDetail, CreatorRead
logger = logging.getLogger("chrysopedia.creators")
router = APIRouter(prefix="/creators", tags=["creators"])
@router.get("")
async def list_creators(
sort: Annotated[str, Query()] = "random",
genre: Annotated[str | None, Query()] = None,
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=100)] = 50,
db: AsyncSession = Depends(get_session),
):
"""List creators with sort, genre filter, and technique/video counts.
- **sort**: ``random`` (default, R014 creator equity), ``alpha``, ``views``
- **genre**: filter by genre (matches against ARRAY column)
"""
# Subqueries for counts
technique_count_sq = (
select(func.count())
.where(TechniquePage.creator_id == Creator.id)
.correlate(Creator)
.scalar_subquery()
)
video_count_sq = (
select(func.count())
.where(SourceVideo.creator_id == Creator.id)
.correlate(Creator)
.scalar_subquery()
)
stmt = select(
Creator,
technique_count_sq.label("technique_count"),
video_count_sq.label("video_count"),
)
# Genre filter
if genre:
stmt = stmt.where(Creator.genres.any(genre))
# Sorting
if sort == "alpha":
stmt = stmt.order_by(Creator.name)
elif sort == "views":
stmt = stmt.order_by(Creator.view_count.desc())
else:
# Default: random (small dataset <100, func.random() is fine)
stmt = stmt.order_by(func.random())
stmt = stmt.offset(offset).limit(limit)
result = await db.execute(stmt)
rows = result.all()
items: list[CreatorBrowseItem] = []
for row in rows:
creator = row[0]
tc = row[1] or 0
vc = row[2] or 0
base = CreatorRead.model_validate(creator)
items.append(
CreatorBrowseItem(**base.model_dump(), technique_count=tc, video_count=vc)
)
# Get total count (without offset/limit)
count_stmt = select(func.count()).select_from(Creator)
if genre:
count_stmt = count_stmt.where(Creator.genres.any(genre))
total = (await db.execute(count_stmt)).scalar() or 0
logger.debug(
"Listed %d creators (sort=%s, genre=%s, offset=%d, limit=%d)",
len(items), sort, genre, offset, limit,
)
return {"items": items, "total": total, "offset": offset, "limit": limit}
@router.get("/{slug}", response_model=CreatorDetail)
async def get_creator(
slug: str,
db: AsyncSession = Depends(get_session),
) -> CreatorDetail:
"""Get a single creator by slug, including video count."""
stmt = select(Creator).where(Creator.slug == slug)
result = await db.execute(stmt)
creator = result.scalar_one_or_none()
if creator is None:
raise HTTPException(status_code=404, detail=f"Creator '{slug}' not found")
# Count videos for this creator
count_stmt = (
select(func.count())
.select_from(SourceVideo)
.where(SourceVideo.creator_id == creator.id)
)
count_result = await db.execute(count_stmt)
video_count = count_result.scalar() or 0
creator_data = CreatorRead.model_validate(creator)
return CreatorDetail(**creator_data.model_dump(), video_count=video_count)

backend/routers/health.py Normal file

@@ -0,0 +1,34 @@
"""Health check endpoints for Chrysopedia API."""
import logging
from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession
from database import get_session
from schemas import HealthResponse
logger = logging.getLogger("chrysopedia.health")
router = APIRouter(tags=["health"])
@router.get("/health", response_model=HealthResponse)
async def health_check(db: AsyncSession = Depends(get_session)) -> HealthResponse:
"""Root health check — verifies API is running and DB is reachable."""
db_status = "unknown"
try:
result = await db.execute(text("SELECT 1"))
result.scalar()
db_status = "connected"
except Exception:
logger.warning("Database health check failed", exc_info=True)
db_status = "unreachable"
return HealthResponse(
status="ok",
service="chrysopedia-api",
version="0.1.0",
database=db_status,
)

backend/routers/ingest.py Normal file

@@ -0,0 +1,284 @@
"""Transcript ingestion endpoint for the Chrysopedia API.
Accepts a Whisper-format transcript JSON via multipart file upload, finds or
creates a Creator, upserts a SourceVideo, bulk-inserts TranscriptSegments,
persists the raw JSON to disk, and returns a structured response.
"""
import hashlib
import json
import logging
import os
import re
import uuid
from fastapi import APIRouter, Depends, HTTPException, UploadFile
from sqlalchemy import delete, select
from sqlalchemy.ext.asyncio import AsyncSession
from config import get_settings
from database import get_session
from models import ContentType, Creator, ProcessingStatus, SourceVideo, TranscriptSegment
from schemas import TranscriptIngestResponse
logger = logging.getLogger("chrysopedia.ingest")
router = APIRouter(prefix="/ingest", tags=["ingest"])
REQUIRED_KEYS = {"source_file", "creator_folder", "duration_seconds", "segments"}
def slugify(value: str) -> str:
"""Lowercase, replace non-alphanumeric chars with hyphens, collapse/strip."""
value = value.lower()
value = re.sub(r"[^a-z0-9]+", "-", value)
value = value.strip("-")
value = re.sub(r"-{2,}", "-", value)
return value
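A quick standalone check of the slug normalization (the function body is reproduced verbatim from above; the creator names are made-up examples):

```python
import re


def slugify(value: str) -> str:
    """Lowercase, replace non-alphanumeric chars with hyphens, collapse/strip."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9]+", "-", value)
    value = value.strip("-")
    value = re.sub(r"-{2,}", "-", value)
    return value


print(slugify("Adam Neely"))      # adam-neely
print(slugify("  Jazz_Duets! "))  # jazz-duets
```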
def compute_content_hash(segments: list[dict]) -> str:
"""Compute a stable SHA-256 hash from transcript segment text.
Hashes only the segment text content in order, ignoring metadata like
filenames, timestamps, or dates. Two transcripts of the same audio will
produce identical hashes even if ingested with different filenames.
"""
h = hashlib.sha256()
for seg in segments:
h.update(str(seg.get("text", "")).encode("utf-8"))
return h.hexdigest()
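Because only the segment text feeds the hash, two ingests of the same audio under different filenames or timestamps collide as intended. A standalone check (the function body is reproduced from above; the segment data is made up):

```python
import hashlib


def compute_content_hash(segments: list[dict]) -> str:
    """SHA-256 over segment text only, in order; metadata is ignored."""
    h = hashlib.sha256()
    for seg in segments:
        h.update(str(seg.get("text", "")).encode("utf-8"))
    return h.hexdigest()


a = [{"start": 0.0, "end": 2.5, "text": "hello"}, {"text": "world"}]
b = [{"start": 9.9, "end": 12.0, "text": "hello"}, {"text": "world", "extra": 1}]
assert compute_content_hash(a) == compute_content_hash(b)  # timestamps/extras ignored
```

Note that segment boundaries are not salted into the hash, so only the concatenated text matters; re-segmenting the same transcript still yields the same hash.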
@router.post("", response_model=TranscriptIngestResponse)
async def ingest_transcript(
file: UploadFile,
db: AsyncSession = Depends(get_session),
) -> TranscriptIngestResponse:
"""Ingest a Whisper transcript JSON file.
Workflow:
1. Parse and validate the uploaded JSON.
2. Find-or-create a Creator by folder_name.
3. Upsert a SourceVideo by (creator_id, filename).
4. Bulk-insert TranscriptSegment rows.
5. Save raw JSON to transcript_storage_path.
6. Return structured response.
"""
settings = get_settings()
# ── 1. Read & parse JSON ─────────────────────────────────────────────
try:
raw_bytes = await file.read()
raw_text = raw_bytes.decode("utf-8")
except Exception as exc:
raise HTTPException(status_code=400, detail=f"Invalid file: {exc}") from exc
try:
data = json.loads(raw_text)
except json.JSONDecodeError as exc:
raise HTTPException(
status_code=422, detail=f"JSON parse error: {exc}"
) from exc
if not isinstance(data, dict):
raise HTTPException(status_code=422, detail="Expected a JSON object at the top level")
missing = REQUIRED_KEYS - data.keys()
if missing:
raise HTTPException(
status_code=422,
detail=f"Missing required keys: {', '.join(sorted(missing))}",
)
source_file: str = data["source_file"]
creator_folder: str = data["creator_folder"]
duration_seconds: int | None = data.get("duration_seconds")
segments_data: list = data["segments"]
if not isinstance(segments_data, list):
raise HTTPException(status_code=422, detail="'segments' must be an array")
content_hash = compute_content_hash(segments_data)
logger.info("Content hash for %s: %s", source_file, content_hash)
# ── 2. Find-or-create Creator ────────────────────────────────────────
stmt = select(Creator).where(Creator.folder_name == creator_folder)
result = await db.execute(stmt)
creator = result.scalar_one_or_none()
if creator is None:
creator = Creator(
name=creator_folder,
slug=slugify(creator_folder),
folder_name=creator_folder,
)
db.add(creator)
await db.flush() # assign id
# ── 3. Upsert SourceVideo ────────────────────────────────────────────
# First check for exact filename match (original behavior)
stmt = select(SourceVideo).where(
SourceVideo.creator_id == creator.id,
SourceVideo.filename == source_file,
)
result = await db.execute(stmt)
existing_video = result.scalar_one_or_none()
# Tier 2: content hash match (same audio, different filename/metadata)
matched_video = None
match_reason = None
if existing_video is None:
stmt = select(SourceVideo).where(
SourceVideo.content_hash == content_hash,
)
result = await db.execute(stmt)
matched_video = result.scalar_one_or_none()
if matched_video:
match_reason = "content_hash"
# Tier 3: filename + duration match (same yt-dlp download, re-encoded)
if existing_video is None and matched_video is None and duration_seconds is not None:
# Strip common prefixes like dates (e.g. "2023-07-19 ") and extensions
# to get a normalized base name for fuzzy matching
base_name = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", source_file)
base_name = re.sub(r"\s*\(\d+p\).*$", "", base_name) # strip resolution suffix
base_name = os.path.splitext(base_name)[0].strip()
stmt = select(SourceVideo).where(
SourceVideo.creator_id == creator.id,
SourceVideo.duration_seconds == duration_seconds,
)
result = await db.execute(stmt)
candidates = result.scalars().all()
for candidate in candidates:
cand_name = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", candidate.filename)
cand_name = re.sub(r"\s*\(\d+p\).*$", "", cand_name)
cand_name = os.path.splitext(cand_name)[0].strip()
if cand_name == base_name:
matched_video = candidate
match_reason = "filename+duration"
break
is_reupload = existing_video is not None
is_duplicate_content = matched_video is not None
if is_duplicate_content:
logger.info(
"Duplicate detected via %s: '%s' matches existing video '%s' (%s)",
match_reason, source_file, matched_video.filename, matched_video.id,
)
if is_reupload:
video = existing_video
# Delete old segments for idempotent re-upload
await db.execute(
delete(TranscriptSegment).where(
TranscriptSegment.source_video_id == video.id
)
)
video.duration_seconds = duration_seconds
video.content_hash = content_hash
video.processing_status = ProcessingStatus.transcribed
elif is_duplicate_content:
# Same content, different filename — update the existing record
video = matched_video
await db.execute(
delete(TranscriptSegment).where(
TranscriptSegment.source_video_id == video.id
)
)
video.filename = source_file
video.file_path = f"{creator_folder}/{source_file}"
video.duration_seconds = duration_seconds
video.content_hash = content_hash
video.processing_status = ProcessingStatus.transcribed
is_reupload = True # Treat as reupload for response
else:
video = SourceVideo(
creator_id=creator.id,
filename=source_file,
file_path=f"{creator_folder}/{source_file}",
duration_seconds=duration_seconds,
content_type=ContentType.tutorial,
content_hash=content_hash,
processing_status=ProcessingStatus.transcribed,
)
db.add(video)
await db.flush() # assign id
# ── 4. Bulk-insert TranscriptSegments ────────────────────────────────
segment_objs = [
TranscriptSegment(
source_video_id=video.id,
start_time=float(seg["start"]),
end_time=float(seg["end"]),
text=str(seg["text"]),
segment_index=idx,
)
for idx, seg in enumerate(segments_data)
]
db.add_all(segment_objs)
# ── 5. Save raw JSON to disk ─────────────────────────────────────────
transcript_dir = os.path.join(
settings.transcript_storage_path, creator_folder
)
transcript_path = os.path.join(transcript_dir, f"{source_file}.json")
try:
os.makedirs(transcript_dir, exist_ok=True)
with open(transcript_path, "w", encoding="utf-8") as f:
f.write(raw_text)
except OSError as exc:
raise HTTPException(
status_code=500, detail=f"Failed to save transcript: {exc}"
) from exc
video.transcript_path = transcript_path
# ── 6. Commit & respond ──────────────────────────────────────────────
try:
await db.commit()
except Exception as exc:
await db.rollback()
logger.error("Database commit failed during ingest: %s", exc)
raise HTTPException(
status_code=500, detail="Database error during ingest"
) from exc
await db.refresh(video)
await db.refresh(creator)
# ── 7. Dispatch LLM pipeline (best-effort) ──────────────────────────
try:
from pipeline.stages import run_pipeline
run_pipeline.delay(str(video.id))
logger.info("Pipeline dispatched for video_id=%s", video.id)
except Exception as exc:
logger.warning(
"Pipeline dispatch failed for video_id=%s (ingest still succeeds): %s",
video.id,
exc,
)
logger.info(
"Ingested transcript: creator=%s, file=%s, segments=%d, reupload=%s",
creator.name,
source_file,
len(segment_objs),
is_reupload,
)
return TranscriptIngestResponse(
video_id=video.id,
creator_id=creator.id,
creator_name=creator.name,
filename=source_file,
segments_stored=len(segment_objs),
processing_status=video.processing_status.value,
is_reupload=is_reupload,
content_hash=content_hash,
)
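The tier-3 fuzzy match above strips a leading upload date, a `(NNNp)` resolution suffix, and the file extension before comparing names. A standalone sketch of that normalization using the same regexes (the filenames are made-up examples):

```python
import os
import re


def normalize_name(filename: str) -> str:
    """Strip a leading YYYY-MM-DD date, a (NNNp) resolution suffix, and the extension."""
    base = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", filename)
    base = re.sub(r"\s*\(\d+p\).*$", "", base)
    return os.path.splitext(base)[0].strip()


print(normalize_name("2023-07-19 Bass Lesson (1080p).webm"))  # Bass Lesson
print(normalize_name("Bass Lesson.mp4"))                      # Bass Lesson
```

Two videos whose normalized names and durations both match are treated as the same yt-dlp download re-encoded under a new name, which is why the tier-3 query also filters on `duration_seconds`.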

backend/routers/pipeline.py Normal file

@@ -0,0 +1,375 @@
"""Pipeline management endpoints — public trigger + admin dashboard.
Public:
POST /pipeline/trigger/{video_id} Trigger pipeline for a video
Admin:
GET /admin/pipeline/videos Video list with status + event counts
POST /admin/pipeline/trigger/{video_id} Retrigger (same as public but under admin prefix)
POST /admin/pipeline/revoke/{video_id} Revoke/cancel active tasks for a video
GET /admin/pipeline/events/{video_id} Event log for a video (paginated)
GET /admin/pipeline/worker-status Active/reserved tasks from Celery inspect
"""
import logging
import uuid
from typing import Annotated
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy import func, select, case
from sqlalchemy.ext.asyncio import AsyncSession
from config import get_settings
from database import get_session
from models import PipelineEvent, SourceVideo, Creator
from redis_client import get_redis
from schemas import DebugModeResponse, DebugModeUpdate, TokenStageSummary, TokenSummaryResponse
logger = logging.getLogger("chrysopedia.pipeline")
router = APIRouter(tags=["pipeline"])
REDIS_DEBUG_MODE_KEY = "chrysopedia:debug_mode"
# ── Public trigger ───────────────────────────────────────────────────────────
@router.post("/pipeline/trigger/{video_id}")
async def trigger_pipeline(
video_id: str,
db: AsyncSession = Depends(get_session),
):
"""Manually trigger (or re-trigger) the LLM extraction pipeline for a video."""
stmt = select(SourceVideo).where(SourceVideo.id == video_id)
result = await db.execute(stmt)
video = result.scalar_one_or_none()
if video is None:
raise HTTPException(status_code=404, detail=f"Video not found: {video_id}")
from pipeline.stages import run_pipeline
try:
run_pipeline.delay(str(video.id))
logger.info("Pipeline manually triggered for video_id=%s", video_id)
except Exception as exc:
logger.warning("Failed to dispatch pipeline for video_id=%s: %s", video_id, exc)
raise HTTPException(
status_code=503,
detail="Pipeline dispatch failed — Celery/Redis may be unavailable",
) from exc
return {
"status": "triggered",
"video_id": str(video.id),
"current_processing_status": video.processing_status.value,
}
# ── Admin: Video list ────────────────────────────────────────────────────────
@router.get("/admin/pipeline/videos")
async def list_pipeline_videos(
db: AsyncSession = Depends(get_session),
):
"""List all videos with processing status and pipeline event counts."""
# Subquery for event counts per video
event_counts = (
select(
PipelineEvent.video_id,
func.count().label("event_count"),
func.sum(case(
(PipelineEvent.event_type == "llm_call", PipelineEvent.total_tokens),
else_=0
)).label("total_tokens_used"),
func.max(PipelineEvent.created_at).label("last_event_at"),
)
.group_by(PipelineEvent.video_id)
.subquery()
)
stmt = (
select(
SourceVideo.id,
SourceVideo.filename,
SourceVideo.processing_status,
SourceVideo.content_hash,
SourceVideo.created_at,
SourceVideo.updated_at,
Creator.name.label("creator_name"),
event_counts.c.event_count,
event_counts.c.total_tokens_used,
event_counts.c.last_event_at,
)
.join(Creator, SourceVideo.creator_id == Creator.id)
.outerjoin(event_counts, SourceVideo.id == event_counts.c.video_id)
.order_by(SourceVideo.updated_at.desc())
)
result = await db.execute(stmt)
rows = result.all()
return {
"items": [
{
"id": str(r.id),
"filename": r.filename,
"processing_status": r.processing_status.value if hasattr(r.processing_status, 'value') else str(r.processing_status),
"content_hash": r.content_hash,
"creator_name": r.creator_name,
"created_at": r.created_at.isoformat() if r.created_at else None,
"updated_at": r.updated_at.isoformat() if r.updated_at else None,
"event_count": r.event_count or 0,
"total_tokens_used": r.total_tokens_used or 0,
"last_event_at": r.last_event_at.isoformat() if r.last_event_at else None,
}
for r in rows
],
"total": len(rows),
}
# ── Admin: Retrigger ─────────────────────────────────────────────────────────
@router.post("/admin/pipeline/trigger/{video_id}")
async def admin_trigger_pipeline(
video_id: str,
db: AsyncSession = Depends(get_session),
):
"""Admin retrigger — same as public trigger."""
return await trigger_pipeline(video_id, db)
# ── Admin: Revoke ────────────────────────────────────────────────────────────
@router.post("/admin/pipeline/revoke/{video_id}")
async def revoke_pipeline(video_id: str):
"""Revoke/cancel active Celery tasks for a video.
Uses Celery's revoke with terminate=True to kill running tasks.
This is best-effort; the task may have already completed.
"""
from worker import celery_app
try:
# Get active tasks and revoke any matching this video_id
inspector = celery_app.control.inspect()
active = inspector.active() or {}
revoked_count = 0
for _worker, tasks in active.items():
for task in tasks:
task_args = task.get("args", [])
if task_args and str(task_args[0]) == video_id:
celery_app.control.revoke(task["id"], terminate=True)
revoked_count += 1
logger.info("Revoked task %s for video_id=%s", task["id"], video_id)
return {
"status": "revoked" if revoked_count > 0 else "no_active_tasks",
"video_id": video_id,
"tasks_revoked": revoked_count,
}
except Exception as exc:
logger.warning("Failed to revoke tasks for video_id=%s: %s", video_id, exc)
raise HTTPException(
status_code=503,
detail="Failed to communicate with Celery worker",
) from exc
# ── Admin: Event log ─────────────────────────────────────────────────────────
@router.get("/admin/pipeline/events/{video_id}")
async def list_pipeline_events(
video_id: str,
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=200)] = 100,
stage: Annotated[str | None, Query(description="Filter by stage name")] = None,
event_type: Annotated[str | None, Query(description="Filter by event type")] = None,
order: Annotated[str, Query(description="Sort order: asc or desc")] = "desc",
db: AsyncSession = Depends(get_session),
):
"""Get pipeline events for a video. Default: newest first (desc)."""
stmt = select(PipelineEvent).where(PipelineEvent.video_id == video_id)
if stage:
stmt = stmt.where(PipelineEvent.stage == stage)
if event_type:
stmt = stmt.where(PipelineEvent.event_type == event_type)
# Validate order param
if order not in ("asc", "desc"):
raise HTTPException(status_code=400, detail="order must be 'asc' or 'desc'")
# Count
count_stmt = select(func.count()).select_from(stmt.subquery())
total = (await db.execute(count_stmt)).scalar() or 0
# Fetch
order_clause = PipelineEvent.created_at.asc() if order == "asc" else PipelineEvent.created_at.desc()
stmt = stmt.order_by(order_clause)
stmt = stmt.offset(offset).limit(limit)
result = await db.execute(stmt)
events = result.scalars().all()
return {
"items": [
{
"id": str(e.id),
"video_id": str(e.video_id),
"stage": e.stage,
"event_type": e.event_type,
"prompt_tokens": e.prompt_tokens,
"completion_tokens": e.completion_tokens,
"total_tokens": e.total_tokens,
"model": e.model,
"duration_ms": e.duration_ms,
"payload": e.payload,
"created_at": e.created_at.isoformat() if e.created_at else None,
"system_prompt_text": e.system_prompt_text,
"user_prompt_text": e.user_prompt_text,
"response_text": e.response_text,
}
for e in events
],
"total": total,
"offset": offset,
"limit": limit,
}
# ── Admin: Debug mode ─────────────────────────────────────────────────────────
@router.get("/admin/pipeline/debug-mode", response_model=DebugModeResponse)
async def get_debug_mode() -> DebugModeResponse:
"""Get the current pipeline debug mode (on/off)."""
settings = get_settings()
try:
redis = await get_redis()
try:
value = await redis.get(REDIS_DEBUG_MODE_KEY)
if value is not None:
return DebugModeResponse(debug_mode=value.lower() == "true")
finally:
await redis.aclose()
except Exception as exc:
logger.warning("Redis unavailable for debug mode read, using config default: %s", exc)
return DebugModeResponse(debug_mode=settings.debug_mode)
@router.put("/admin/pipeline/debug-mode", response_model=DebugModeResponse)
async def set_debug_mode(body: DebugModeUpdate) -> DebugModeResponse:
"""Set the pipeline debug mode (on/off)."""
try:
redis = await get_redis()
try:
await redis.set(REDIS_DEBUG_MODE_KEY, str(body.debug_mode))
finally:
await redis.aclose()
except Exception as exc:
logger.error("Failed to set debug mode in Redis: %s", exc)
raise HTTPException(
status_code=503,
detail=f"Redis unavailable: {exc}",
)
logger.info("Pipeline debug mode set to %s", body.debug_mode)
return DebugModeResponse(debug_mode=body.debug_mode)
# ── Admin: Token summary ─────────────────────────────────────────────────────
@router.get("/admin/pipeline/token-summary/{video_id}", response_model=TokenSummaryResponse)
async def get_token_summary(
video_id: str,
db: AsyncSession = Depends(get_session),
) -> TokenSummaryResponse:
"""Get per-stage token usage summary for a video."""
stmt = (
select(
PipelineEvent.stage,
func.count().label("call_count"),
func.coalesce(func.sum(PipelineEvent.prompt_tokens), 0).label("total_prompt_tokens"),
func.coalesce(func.sum(PipelineEvent.completion_tokens), 0).label("total_completion_tokens"),
func.coalesce(func.sum(PipelineEvent.total_tokens), 0).label("total_tokens"),
)
.where(PipelineEvent.video_id == video_id)
.where(PipelineEvent.event_type == "llm_call")
.group_by(PipelineEvent.stage)
.order_by(PipelineEvent.stage)
)
result = await db.execute(stmt)
rows = result.all()
stages = [
TokenStageSummary(
stage=r.stage,
call_count=r.call_count,
total_prompt_tokens=r.total_prompt_tokens,
total_completion_tokens=r.total_completion_tokens,
total_tokens=r.total_tokens,
)
for r in rows
]
grand_total = sum(s.total_tokens for s in stages)
return TokenSummaryResponse(
video_id=video_id,
stages=stages,
grand_total_tokens=grand_total,
)
# ── Admin: Worker status ─────────────────────────────────────────────────────
@router.get("/admin/pipeline/worker-status")
async def worker_status():
"""Get current Celery worker status — active, reserved, and stats."""
from worker import celery_app
try:
inspector = celery_app.control.inspect()
active = inspector.active() or {}
reserved = inspector.reserved() or {}
stats = inspector.stats() or {}
workers = []
for worker_name in set(list(active.keys()) + list(reserved.keys()) + list(stats.keys())):
worker_active = active.get(worker_name, [])
worker_reserved = reserved.get(worker_name, [])
worker_stats = stats.get(worker_name, {})
workers.append({
"name": worker_name,
"active_tasks": [
{
"id": t.get("id"),
"name": t.get("name"),
"args": t.get("args", []),
"time_start": t.get("time_start"),
}
for t in worker_active
],
"reserved_tasks": len(worker_reserved),
"total_completed": worker_stats.get("total", {}).get("tasks.pipeline.stages.stage2_segmentation", 0)
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage3_extraction", 0)
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage4_classification", 0)
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage5_synthesis", 0),
"uptime": worker_stats.get("clock", None),
"pool_size": worker_stats.get("pool", {}).get("max-concurrency") if isinstance(worker_stats.get("pool"), dict) else None,
})
return {
"online": len(workers) > 0,
"workers": workers,
}
except Exception as exc:
logger.warning("Failed to inspect Celery workers: %s", exc)
return {
"online": False,
"workers": [],
"error": str(exc),
}

backend/routers/reports.py Normal file

@@ -0,0 +1,147 @@
"""Content reports router — public submission + admin management.
Public:
POST /reports Submit a content issue report
Admin:
GET /admin/reports List reports (filterable by status, content_type)
GET /admin/reports/{id} Get single report detail
PATCH /admin/reports/{id} Update status / add admin notes
"""
import logging
import uuid
from datetime import datetime, timezone
from typing import Annotated
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from database import get_session
from models import ContentReport, ReportStatus
from schemas import (
ContentReportCreate,
ContentReportListResponse,
ContentReportRead,
ContentReportUpdate,
)
logger = logging.getLogger("chrysopedia.reports")
router = APIRouter(tags=["reports"])
# ── Public ───────────────────────────────────────────────────────────────────
@router.post("/reports", response_model=ContentReportRead, status_code=201)
async def submit_report(
body: ContentReportCreate,
db: AsyncSession = Depends(get_session),
):
"""Submit a content issue report (public, no auth)."""
report = ContentReport(
content_type=body.content_type,
content_id=body.content_id,
content_title=body.content_title,
report_type=body.report_type,
description=body.description,
page_url=body.page_url,
)
db.add(report)
await db.commit()
await db.refresh(report)
logger.info(
"New content report: id=%s type=%s content=%s/%s",
report.id, report.report_type, report.content_type, report.content_id,
)
return report
# ── Admin ────────────────────────────────────────────────────────────────────
@router.get("/admin/reports", response_model=ContentReportListResponse)
async def list_reports(
status: Annotated[str | None, Query(description="Filter by status")] = None,
content_type: Annotated[str | None, Query(description="Filter by content type")] = None,
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=100)] = 50,
db: AsyncSession = Depends(get_session),
):
"""List content reports with optional filters."""
stmt = select(ContentReport)
if status:
stmt = stmt.where(ContentReport.status == status)
if content_type:
stmt = stmt.where(ContentReport.content_type == content_type)
# Count
count_stmt = select(func.count()).select_from(stmt.subquery())
total = (await db.execute(count_stmt)).scalar() or 0
# Fetch page
stmt = stmt.order_by(ContentReport.created_at.desc())
stmt = stmt.offset(offset).limit(limit)
result = await db.execute(stmt)
items = result.scalars().all()
return {"items": items, "total": total, "offset": offset, "limit": limit}
@router.get("/admin/reports/{report_id}", response_model=ContentReportRead)
async def get_report(
report_id: uuid.UUID,
db: AsyncSession = Depends(get_session),
):
"""Get a single content report by ID."""
result = await db.execute(
select(ContentReport).where(ContentReport.id == report_id)
)
report = result.scalar_one_or_none()
if not report:
raise HTTPException(status_code=404, detail="Report not found")
return report
@router.patch("/admin/reports/{report_id}", response_model=ContentReportRead)
async def update_report(
report_id: uuid.UUID,
body: ContentReportUpdate,
db: AsyncSession = Depends(get_session),
):
"""Update report status and/or admin notes."""
result = await db.execute(
select(ContentReport).where(ContentReport.id == report_id)
)
report = result.scalar_one_or_none()
if not report:
raise HTTPException(status_code=404, detail="Report not found")
if body.status is not None:
# Validate status value
try:
ReportStatus(body.status)
except ValueError:
raise HTTPException(
status_code=422,
detail=f"Invalid status: {body.status}. Must be one of: open, acknowledged, resolved, dismissed",
)
report.status = body.status
if body.status in ("resolved", "dismissed"):
report.resolved_at = datetime.now(timezone.utc).replace(tzinfo=None)
elif body.status == "open":
report.resolved_at = None
if body.admin_notes is not None:
report.admin_notes = body.admin_notes
await db.commit()
await db.refresh(report)
logger.info(
"Report updated: id=%s status=%s",
report.id, report.status,
)
return report

backend/routers/review.py (new file)
@@ -0,0 +1,375 @@
"""Review queue endpoints for Chrysopedia API.
Provides admin review workflow: list queue, stats, approve, reject,
edit, split, merge key moments, and toggle review/auto mode via Redis.
"""
import logging
import uuid
from typing import Annotated
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy import case, func, select
from sqlalchemy.ext.asyncio import AsyncSession
from config import get_settings
from database import get_session
from models import Creator, KeyMoment, KeyMomentContentType, ReviewStatus, SourceVideo
from redis_client import get_redis
from schemas import (
KeyMomentRead,
MomentEditRequest,
MomentMergeRequest,
MomentSplitRequest,
ReviewModeResponse,
ReviewModeUpdate,
ReviewQueueItem,
ReviewQueueResponse,
ReviewStatsResponse,
)
logger = logging.getLogger("chrysopedia.review")
router = APIRouter(prefix="/review", tags=["review"])
REDIS_MODE_KEY = "chrysopedia:review_mode"
VALID_STATUSES = {"pending", "approved", "edited", "rejected", "all"}
# ── Helpers ──────────────────────────────────────────────────────────────────
def _moment_to_queue_item(
moment: KeyMoment, video_filename: str, creator_name: str
) -> ReviewQueueItem:
"""Convert a KeyMoment ORM instance + joined fields to a ReviewQueueItem."""
data = KeyMomentRead.model_validate(moment).model_dump()
data["video_filename"] = video_filename
data["creator_name"] = creator_name
return ReviewQueueItem(**data)
# ── Endpoints ────────────────────────────────────────────────────────────────
@router.get("/queue", response_model=ReviewQueueResponse)
async def list_queue(
status: Annotated[str, Query()] = "pending",
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=1000)] = 50,
db: AsyncSession = Depends(get_session),
) -> ReviewQueueResponse:
"""List key moments in the review queue, filtered by status."""
if status not in VALID_STATUSES:
raise HTTPException(
status_code=400,
detail=f"Invalid status filter '{status}'. Must be one of: {', '.join(sorted(VALID_STATUSES))}",
)
# Base query joining KeyMoment → SourceVideo → Creator
base = (
select(
KeyMoment,
SourceVideo.filename.label("video_filename"),
Creator.name.label("creator_name"),
)
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
.join(Creator, SourceVideo.creator_id == Creator.id)
)
if status != "all":
base = base.where(KeyMoment.review_status == ReviewStatus(status))
# Count total matching rows
count_stmt = select(func.count()).select_from(base.subquery())
total = (await db.execute(count_stmt)).scalar_one()
# Fetch paginated results
stmt = base.order_by(KeyMoment.created_at.desc()).offset(offset).limit(limit)
rows = (await db.execute(stmt)).all()
items = [
_moment_to_queue_item(row.KeyMoment, row.video_filename, row.creator_name)
for row in rows
]
return ReviewQueueResponse(items=items, total=total, offset=offset, limit=limit)
@router.get("/stats", response_model=ReviewStatsResponse)
async def get_stats(
db: AsyncSession = Depends(get_session),
) -> ReviewStatsResponse:
"""Return counts of key moments grouped by review status."""
stmt = (
select(
KeyMoment.review_status,
func.count().label("cnt"),
)
.group_by(KeyMoment.review_status)
)
result = await db.execute(stmt)
counts = {row.review_status.value: row.cnt for row in result.all()}
return ReviewStatsResponse(
pending=counts.get("pending", 0),
approved=counts.get("approved", 0),
edited=counts.get("edited", 0),
rejected=counts.get("rejected", 0),
)
@router.post("/moments/{moment_id}/approve", response_model=KeyMomentRead)
async def approve_moment(
moment_id: uuid.UUID,
db: AsyncSession = Depends(get_session),
) -> KeyMomentRead:
"""Approve a key moment for publishing."""
moment = await db.get(KeyMoment, moment_id)
if moment is None:
raise HTTPException(
status_code=404,
detail=f"Key moment {moment_id} not found",
)
moment.review_status = ReviewStatus.approved
await db.commit()
await db.refresh(moment)
logger.info("Approved key moment %s", moment_id)
return KeyMomentRead.model_validate(moment)
@router.post("/moments/{moment_id}/reject", response_model=KeyMomentRead)
async def reject_moment(
moment_id: uuid.UUID,
db: AsyncSession = Depends(get_session),
) -> KeyMomentRead:
"""Reject a key moment."""
moment = await db.get(KeyMoment, moment_id)
if moment is None:
raise HTTPException(
status_code=404,
detail=f"Key moment {moment_id} not found",
)
moment.review_status = ReviewStatus.rejected
await db.commit()
await db.refresh(moment)
logger.info("Rejected key moment %s", moment_id)
return KeyMomentRead.model_validate(moment)
@router.put("/moments/{moment_id}", response_model=KeyMomentRead)
async def edit_moment(
moment_id: uuid.UUID,
body: MomentEditRequest,
db: AsyncSession = Depends(get_session),
) -> KeyMomentRead:
"""Update editable fields of a key moment and set status to edited."""
moment = await db.get(KeyMoment, moment_id)
if moment is None:
raise HTTPException(
status_code=404,
detail=f"Key moment {moment_id} not found",
)
update_data = body.model_dump(exclude_unset=True)
# Convert content_type string to enum if provided
if "content_type" in update_data and update_data["content_type"] is not None:
try:
update_data["content_type"] = KeyMomentContentType(update_data["content_type"])
except ValueError:
raise HTTPException(
status_code=400,
detail=f"Invalid content_type '{update_data['content_type']}'",
)
for field, value in update_data.items():
setattr(moment, field, value)
moment.review_status = ReviewStatus.edited
await db.commit()
await db.refresh(moment)
logger.info("Edited key moment %s (fields: %s)", moment_id, list(update_data.keys()))
return KeyMomentRead.model_validate(moment)
@router.post("/moments/{moment_id}/split", response_model=list[KeyMomentRead])
async def split_moment(
moment_id: uuid.UUID,
body: MomentSplitRequest,
db: AsyncSession = Depends(get_session),
) -> list[KeyMomentRead]:
"""Split a key moment into two at the given timestamp."""
moment = await db.get(KeyMoment, moment_id)
if moment is None:
raise HTTPException(
status_code=404,
detail=f"Key moment {moment_id} not found",
)
# Validate split_time is strictly between start_time and end_time
if body.split_time <= moment.start_time or body.split_time >= moment.end_time:
raise HTTPException(
status_code=400,
detail=(
f"split_time ({body.split_time}) must be strictly between "
f"start_time ({moment.start_time}) and end_time ({moment.end_time})"
),
)
# Update original moment to [start_time, split_time)
original_end = moment.end_time
moment.end_time = body.split_time
moment.review_status = ReviewStatus.pending
# Create new moment for [split_time, end_time]
new_moment = KeyMoment(
source_video_id=moment.source_video_id,
technique_page_id=moment.technique_page_id,
title=f"{moment.title} (split)",
summary=moment.summary,
start_time=body.split_time,
end_time=original_end,
content_type=moment.content_type,
plugins=moment.plugins,
review_status=ReviewStatus.pending,
raw_transcript=moment.raw_transcript,
)
db.add(new_moment)
await db.commit()
await db.refresh(moment)
await db.refresh(new_moment)
logger.info(
"Split key moment %s at %.2f → original [%.2f, %.2f), new [%.2f, %.2f]",
moment_id, body.split_time,
moment.start_time, moment.end_time,
new_moment.start_time, new_moment.end_time,
)
return [
KeyMomentRead.model_validate(moment),
KeyMomentRead.model_validate(new_moment),
]
@router.post("/moments/{moment_id}/merge", response_model=KeyMomentRead)
async def merge_moments(
moment_id: uuid.UUID,
body: MomentMergeRequest,
db: AsyncSession = Depends(get_session),
) -> KeyMomentRead:
"""Merge two key moments into one."""
if moment_id == body.target_moment_id:
raise HTTPException(
status_code=400,
detail="Cannot merge a moment with itself",
)
source = await db.get(KeyMoment, moment_id)
if source is None:
raise HTTPException(
status_code=404,
detail=f"Key moment {moment_id} not found",
)
target = await db.get(KeyMoment, body.target_moment_id)
if target is None:
raise HTTPException(
status_code=404,
detail=f"Target key moment {body.target_moment_id} not found",
)
# Both must belong to the same source video
if source.source_video_id != target.source_video_id:
raise HTTPException(
status_code=400,
detail="Cannot merge moments from different source videos",
)
# Merge: combined summary, min start, max end
source.summary = f"{source.summary}\n\n{target.summary}"
source.start_time = min(source.start_time, target.start_time)
source.end_time = max(source.end_time, target.end_time)
source.review_status = ReviewStatus.pending
# Delete target
await db.delete(target)
await db.commit()
await db.refresh(source)
logger.info(
"Merged key moment %s with %s → [%.2f, %.2f]",
moment_id, body.target_moment_id,
source.start_time, source.end_time,
)
return KeyMomentRead.model_validate(source)
@router.get("/moments/{moment_id}", response_model=ReviewQueueItem)
async def get_moment(
moment_id: uuid.UUID,
db: AsyncSession = Depends(get_session),
) -> ReviewQueueItem:
"""Get a single key moment by ID with video and creator info."""
stmt = (
select(KeyMoment, SourceVideo.filename, Creator.name)
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
.join(Creator, SourceVideo.creator_id == Creator.id)
.where(KeyMoment.id == moment_id)
)
result = await db.execute(stmt)
row = result.one_or_none()
if row is None:
raise HTTPException(status_code=404, detail=f"Moment {moment_id} not found")
moment, video_filename, creator_name = row
return _moment_to_queue_item(moment, video_filename or "", creator_name)
@router.get("/mode", response_model=ReviewModeResponse)
async def get_mode() -> ReviewModeResponse:
"""Get the current review mode (review vs auto)."""
settings = get_settings()
try:
redis = await get_redis()
try:
value = await redis.get(REDIS_MODE_KEY)
if value is not None:
return ReviewModeResponse(review_mode=value.lower() == "true")
finally:
await redis.aclose()
except Exception as exc:
# Redis unavailable — fall back to config default
logger.warning("Redis unavailable for mode read, using config default: %s", exc)
return ReviewModeResponse(review_mode=settings.review_mode)
@router.put("/mode", response_model=ReviewModeResponse)
async def set_mode(
body: ReviewModeUpdate,
) -> ReviewModeResponse:
"""Set the review mode (review vs auto)."""
try:
redis = await get_redis()
try:
await redis.set(REDIS_MODE_KEY, str(body.review_mode))
finally:
await redis.aclose()
except Exception as exc:
logger.error("Failed to set review mode in Redis: %s", exc)
raise HTTPException(
status_code=503,
detail=f"Redis unavailable: {exc}",
)
logger.info("Review mode set to %s", body.review_mode)
return ReviewModeResponse(review_mode=body.review_mode)
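The split and merge endpoints above both reduce to simple interval arithmetic once the ORM plumbing is stripped away. A minimal sketch of just those invariants, as pure functions over (start, end) pairs rather than KeyMoment rows:

```python
def split_interval(
    start: float, end: float, split_time: float
) -> tuple[tuple[float, float], tuple[float, float]]:
    """Split [start, end] at split_time, which must be strictly inside the interval."""
    if not (start < split_time < end):
        raise ValueError(
            f"split_time ({split_time}) must be strictly between {start} and {end}"
        )
    # Original moment keeps [start, split_time); new moment gets [split_time, end]
    return (start, split_time), (split_time, end)


def merge_intervals(a: tuple[float, float], b: tuple[float, float]) -> tuple[float, float]:
    """Merge two intervals into one spanning both: min start, max end."""
    return min(a[0], b[0]), max(a[1], b[1])
```

The strict inequality in `split_interval` is the same check the endpoint enforces before mutating rows; a split at either boundary would produce a zero-length moment.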

backend/routers/search.py (new file)
@@ -0,0 +1,46 @@
"""Search endpoint for semantic + keyword search with graceful fallback."""
from __future__ import annotations
import logging
from typing import Annotated
from fastapi import APIRouter, Depends, Query
from sqlalchemy.ext.asyncio import AsyncSession
from config import get_settings
from database import get_session
from schemas import SearchResponse, SearchResultItem
from search_service import SearchService
logger = logging.getLogger("chrysopedia.search.router")
router = APIRouter(prefix="/search", tags=["search"])
def _get_search_service() -> SearchService:
"""Build a SearchService from current settings."""
return SearchService(get_settings())
@router.get("", response_model=SearchResponse)
async def search(
q: Annotated[str, Query(max_length=500)] = "",
scope: Annotated[str, Query()] = "all",
limit: Annotated[int, Query(ge=1, le=100)] = 20,
db: AsyncSession = Depends(get_session),
) -> SearchResponse:
"""Semantic search with keyword fallback.
- **q**: Search query (max 500 chars). An empty query returns empty results.
- **scope**: ``all`` | ``topics`` | ``creators``. Invalid values default to ``all``.
- **limit**: Max results (1-100, default 20).
"""
svc = _get_search_service()
result = await svc.search(query=q, scope=scope, limit=limit, db=db)
return SearchResponse(
items=[SearchResultItem(**item) for item in result["items"]],
total=result["total"],
query=result["query"],
fallback_used=result["fallback_used"],
)

@@ -0,0 +1,217 @@
"""Technique page endpoints — list and detail with eager-loaded relations."""
from __future__ import annotations
import logging
from typing import Annotated
from fastapi import APIRouter, Depends, HTTPException, Query
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
from database import get_session
from models import Creator, KeyMoment, RelatedTechniqueLink, SourceVideo, TechniquePage, TechniquePageVersion
from schemas import (
CreatorInfo,
KeyMomentSummary,
PaginatedResponse,
RelatedLinkItem,
TechniquePageDetail,
TechniquePageRead,
TechniquePageVersionDetail,
TechniquePageVersionListResponse,
TechniquePageVersionSummary,
)
logger = logging.getLogger("chrysopedia.techniques")
router = APIRouter(prefix="/techniques", tags=["techniques"])
@router.get("", response_model=PaginatedResponse)
async def list_techniques(
category: Annotated[str | None, Query()] = None,
creator_slug: Annotated[str | None, Query()] = None,
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=100)] = 50,
db: AsyncSession = Depends(get_session),
) -> PaginatedResponse:
"""List technique pages with optional category/creator filtering."""
stmt = select(TechniquePage)
if category:
stmt = stmt.where(TechniquePage.topic_category == category)
if creator_slug:
# Join to Creator to filter by slug
stmt = stmt.join(Creator, TechniquePage.creator_id == Creator.id).where(
Creator.slug == creator_slug
)
# Count total before pagination (func is already imported at module level)
count_stmt = select(func.count()).select_from(stmt.subquery())
count_result = await db.execute(count_stmt)
total = count_result.scalar() or 0
stmt = (
stmt.options(selectinload(TechniquePage.creator))
.order_by(TechniquePage.created_at.desc())
.offset(offset)
.limit(limit)
)
result = await db.execute(stmt)
pages = result.scalars().all()
items = []
for p in pages:
item = TechniquePageRead.model_validate(p)
if p.creator:
item.creator_name = p.creator.name
item.creator_slug = p.creator.slug
items.append(item)
return PaginatedResponse(
items=items,
total=total,
offset=offset,
limit=limit,
)
@router.get("/{slug}", response_model=TechniquePageDetail)
async def get_technique(
slug: str,
db: AsyncSession = Depends(get_session),
) -> TechniquePageDetail:
"""Get full technique page detail with key moments, creator, and related links."""
stmt = (
select(TechniquePage)
.where(TechniquePage.slug == slug)
.options(
selectinload(TechniquePage.key_moments).selectinload(KeyMoment.source_video),
selectinload(TechniquePage.creator),
selectinload(TechniquePage.outgoing_links).selectinload(
RelatedTechniqueLink.target_page
),
selectinload(TechniquePage.incoming_links).selectinload(
RelatedTechniqueLink.source_page
),
)
)
result = await db.execute(stmt)
page = result.scalar_one_or_none()
if page is None:
raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")
# Build key moments (ordered by start_time)
key_moments = sorted(page.key_moments, key=lambda km: km.start_time)
key_moment_items = []
for km in key_moments:
item = KeyMomentSummary.model_validate(km)
item.video_filename = km.source_video.filename if km.source_video else ""
key_moment_items.append(item)
# Build creator info
creator_info = None
if page.creator:
creator_info = CreatorInfo(
name=page.creator.name,
slug=page.creator.slug,
genres=page.creator.genres,
)
# Build related links (outgoing + incoming)
related_links: list[RelatedLinkItem] = []
for link in page.outgoing_links:
if link.target_page:
related_links.append(
RelatedLinkItem(
target_title=link.target_page.title,
target_slug=link.target_page.slug,
relationship=link.relationship.value if hasattr(link.relationship, 'value') else str(link.relationship),
)
)
for link in page.incoming_links:
if link.source_page:
related_links.append(
RelatedLinkItem(
target_title=link.source_page.title,
target_slug=link.source_page.slug,
relationship=link.relationship.value if hasattr(link.relationship, 'value') else str(link.relationship),
)
)
base = TechniquePageRead.model_validate(page)
# Count versions for this page
version_count_stmt = select(func.count()).where(
TechniquePageVersion.technique_page_id == page.id
)
version_count_result = await db.execute(version_count_stmt)
version_count = version_count_result.scalar() or 0
return TechniquePageDetail(
**base.model_dump(),
key_moments=key_moment_items,
creator_info=creator_info,
related_links=related_links,
version_count=version_count,
)
@router.get("/{slug}/versions", response_model=TechniquePageVersionListResponse)
async def list_technique_versions(
slug: str,
db: AsyncSession = Depends(get_session),
) -> TechniquePageVersionListResponse:
"""List all version snapshots for a technique page, newest first."""
# Resolve the technique page
page_stmt = select(TechniquePage).where(TechniquePage.slug == slug)
page_result = await db.execute(page_stmt)
page = page_result.scalar_one_or_none()
if page is None:
raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")
# Fetch versions ordered by version_number DESC
versions_stmt = (
select(TechniquePageVersion)
.where(TechniquePageVersion.technique_page_id == page.id)
.order_by(TechniquePageVersion.version_number.desc())
)
versions_result = await db.execute(versions_stmt)
versions = versions_result.scalars().all()
items = [TechniquePageVersionSummary.model_validate(v) for v in versions]
return TechniquePageVersionListResponse(items=items, total=len(items))
@router.get("/{slug}/versions/{version_number}", response_model=TechniquePageVersionDetail)
async def get_technique_version(
slug: str,
version_number: int,
db: AsyncSession = Depends(get_session),
) -> TechniquePageVersionDetail:
"""Get a specific version snapshot by version number."""
# Resolve the technique page
page_stmt = select(TechniquePage).where(TechniquePage.slug == slug)
page_result = await db.execute(page_stmt)
page = page_result.scalar_one_or_none()
if page is None:
raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")
# Fetch the specific version
version_stmt = (
select(TechniquePageVersion)
.where(
TechniquePageVersion.technique_page_id == page.id,
TechniquePageVersion.version_number == version_number,
)
)
version_result = await db.execute(version_stmt)
version = version_result.scalar_one_or_none()
if version is None:
raise HTTPException(
status_code=404,
detail=f"Version {version_number} not found for technique '{slug}'",
)
return TechniquePageVersionDetail.model_validate(version)
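get_technique above normalizes `link.relationship` whether it arrives as an enum member or a plain string, using the same `hasattr` guard in both the outgoing and incoming loops. The pattern in isolation; the `Relationship` values here are hypothetical, for illustration only:

```python
from enum import Enum


class Relationship(Enum):  # hypothetical values, for illustration only
    builds_on = "builds_on"
    prerequisite = "prerequisite"


def rel_value(rel) -> str:
    """Return the string form of an enum-or-string relationship field."""
    return rel.value if hasattr(rel, "value") else str(rel)
```

This tolerates both ORM rows that map the column to a Python Enum and raw rows where the driver returns the string unchanged.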

backend/routers/topics.py (new file)
@@ -0,0 +1,144 @@
"""Topics endpoint — two-level category hierarchy with aggregated counts."""
from __future__ import annotations
import logging
import os
from typing import Annotated, Any
import yaml
from fastapi import APIRouter, Depends, Query
from sqlalchemy import func, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import selectinload
from database import get_session
from models import Creator, TechniquePage
from schemas import (
PaginatedResponse,
TechniquePageRead,
TopicCategory,
TopicSubTopic,
)
logger = logging.getLogger("chrysopedia.topics")
router = APIRouter(prefix="/topics", tags=["topics"])
# Path to canonical_tags.yaml relative to the backend directory
_TAGS_PATH = os.path.join(os.path.dirname(__file__), "..", "..", "config", "canonical_tags.yaml")
def _load_canonical_tags() -> list[dict[str, Any]]:
"""Load the canonical tag categories from YAML."""
path = os.path.normpath(_TAGS_PATH)
try:
with open(path) as f:
data = yaml.safe_load(f)
return data.get("categories", [])
except FileNotFoundError:
logger.warning("canonical_tags.yaml not found at %s", path)
return []
@router.get("", response_model=list[TopicCategory])
async def list_topics(
db: AsyncSession = Depends(get_session),
) -> list[TopicCategory]:
"""Return the two-level topic hierarchy with technique/creator counts per sub-topic.
Categories come from ``canonical_tags.yaml``. Counts are computed
from live DB data by matching ``topic_tags`` array contents.
"""
categories = _load_canonical_tags()
# Pre-fetch all technique pages with their tags and creator_ids for counting
tp_stmt = select(
TechniquePage.topic_category,
TechniquePage.topic_tags,
TechniquePage.creator_id,
)
tp_result = await db.execute(tp_stmt)
tp_rows = tp_result.all()
# Build per-sub-topic counts
result: list[TopicCategory] = []
for cat in categories:
cat_name = cat.get("name", "")
cat_desc = cat.get("description", "")
sub_topic_names: list[str] = cat.get("sub_topics", [])
sub_topics: list[TopicSubTopic] = []
for st_name in sub_topic_names:
technique_count = 0
creator_ids: set[str] = set()
for tp_cat, tp_tags, tp_creator_id in tp_rows:
tags = tp_tags or []
# Count the technique if the sub-topic name appears in its topic_tags
# (case-insensitive); topic_category itself is not consulted here
if st_name.lower() in [t.lower() for t in tags]:
technique_count += 1
creator_ids.add(str(tp_creator_id))
sub_topics.append(
TopicSubTopic(
name=st_name,
technique_count=technique_count,
creator_count=len(creator_ids),
)
)
result.append(
TopicCategory(
name=cat_name,
description=cat_desc,
sub_topics=sub_topics,
)
)
return result
@router.get("/{category_slug}", response_model=PaginatedResponse)
async def get_topic_techniques(
category_slug: str,
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=100)] = 50,
db: AsyncSession = Depends(get_session),
) -> PaginatedResponse:
"""Return technique pages filtered by topic_category.
The ``category_slug`` is matched case-insensitively against
``technique_pages.topic_category`` (e.g. 'sound-design' matches 'Sound design').
"""
# Normalize slug to category name: replace hyphens with spaces, title-case
category_name = category_slug.replace("-", " ").title()
# No wildcards, so ILIKE acts as a case-insensitive equality match
stmt = select(TechniquePage).where(
TechniquePage.topic_category.ilike(category_name)
)
count_stmt = select(func.count()).select_from(stmt.subquery())
count_result = await db.execute(count_stmt)
total = count_result.scalar() or 0
stmt = (
stmt.options(selectinload(TechniquePage.creator))
.order_by(TechniquePage.title)
.offset(offset)
.limit(limit)
)
result = await db.execute(stmt)
pages = result.scalars().all()
items = []
for p in pages:
item = TechniquePageRead.model_validate(p)
if p.creator:
item.creator_name = p.creator.name
item.creator_slug = p.creator.slug
items.append(item)
return PaginatedResponse(
items=items,
total=total,
offset=offset,
limit=limit,
)
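The slug-to-category normalization in get_topic_techniques is worth pinning down: `.title()` capitalizes every word, so `sound-design` becomes `Sound Design`, and the lookup only matches a stored value like `Sound design` because the comparison goes through case-insensitive ILIKE. The normalization step on its own:

```python
def slug_to_category(slug: str) -> str:
    """Convert a URL slug to the title-cased form used for the ILIKE match."""
    return slug.replace("-", " ").title()
```

Since ILIKE ignores case anyway, the `.title()` call mainly serves readability in logs and error messages; any casing would match.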

backend/routers/videos.py (new file)
@@ -0,0 +1,36 @@
"""Source video endpoints for Chrysopedia API."""
import logging
from typing import Annotated
from fastapi import APIRouter, Depends, Query
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from database import get_session
from models import SourceVideo
from schemas import SourceVideoRead
logger = logging.getLogger("chrysopedia.videos")
router = APIRouter(prefix="/videos", tags=["videos"])
@router.get("", response_model=list[SourceVideoRead])
async def list_videos(
offset: Annotated[int, Query(ge=0)] = 0,
limit: Annotated[int, Query(ge=1, le=100)] = 50,
creator_id: str | None = None,
db: AsyncSession = Depends(get_session),
) -> list[SourceVideoRead]:
"""List source videos with optional filtering by creator."""
stmt = select(SourceVideo).order_by(SourceVideo.created_at.desc())
if creator_id:
stmt = stmt.where(SourceVideo.creator_id == creator_id)
stmt = stmt.offset(offset).limit(limit)
result = await db.execute(stmt)
videos = result.scalars().all()
logger.debug("Listed %d videos (offset=%d, limit=%d)", len(videos), offset, limit)
return [SourceVideoRead.model_validate(v) for v in videos]
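Every list endpoint in these routers pages the same way: a zero-based offset plus a capped limit, with bounds enforced declaratively by `Query(ge=..., le=...)`. The same semantics expressed over a plain list (the real endpoints push this into the SQL query instead):

```python
def paginate(rows: list, offset: int = 0, limit: int = 50) -> list:
    """Return one page of rows; out-of-range offsets yield an empty page."""
    return rows[offset:offset + limit]
```

Python slicing never raises on out-of-range indices, which mirrors how OFFSET past the end of a result set simply returns zero rows.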

backend/schemas.py (new file)
@@ -0,0 +1,459 @@
"""Pydantic schemas for the Chrysopedia API.
Read-only schemas for list/detail endpoints and input schemas for creation.
Each schema mirrors the corresponding SQLAlchemy model in models.py.
"""
from __future__ import annotations
import uuid
from datetime import datetime
from pydantic import BaseModel, ConfigDict, Field
# ── Health ───────────────────────────────────────────────────────────────────
class HealthResponse(BaseModel):
status: str = "ok"
service: str = "chrysopedia-api"
version: str = "0.1.0"
database: str = "unknown"
# ── Creator ──────────────────────────────────────────────────────────────────
class CreatorBase(BaseModel):
name: str
slug: str
genres: list[str] | None = None
folder_name: str
class CreatorCreate(CreatorBase):
pass
class CreatorRead(CreatorBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
view_count: int = 0
created_at: datetime
updated_at: datetime
class CreatorDetail(CreatorRead):
"""Creator with nested video count."""
video_count: int = 0
# ── SourceVideo ──────────────────────────────────────────────────────────────
class SourceVideoBase(BaseModel):
filename: str
file_path: str
duration_seconds: int | None = None
content_type: str
transcript_path: str | None = None
class SourceVideoCreate(SourceVideoBase):
creator_id: uuid.UUID
class SourceVideoRead(SourceVideoBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
creator_id: uuid.UUID
content_hash: str | None = None
processing_status: str = "pending"
created_at: datetime
updated_at: datetime
# ── TranscriptSegment ────────────────────────────────────────────────────────
class TranscriptSegmentBase(BaseModel):
start_time: float
end_time: float
text: str
segment_index: int
topic_label: str | None = None
class TranscriptSegmentCreate(TranscriptSegmentBase):
source_video_id: uuid.UUID
class TranscriptSegmentRead(TranscriptSegmentBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
source_video_id: uuid.UUID
# ── KeyMoment ────────────────────────────────────────────────────────────────
class KeyMomentBase(BaseModel):
title: str
summary: str
start_time: float
end_time: float
content_type: str
plugins: list[str] | None = None
raw_transcript: str | None = None
class KeyMomentCreate(KeyMomentBase):
source_video_id: uuid.UUID
technique_page_id: uuid.UUID | None = None
class KeyMomentRead(KeyMomentBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
source_video_id: uuid.UUID
technique_page_id: uuid.UUID | None = None
review_status: str = "pending"
created_at: datetime
updated_at: datetime
# ── TechniquePage ────────────────────────────────────────────────────────────
class TechniquePageBase(BaseModel):
title: str
slug: str
topic_category: str
topic_tags: list[str] | None = None
summary: str | None = None
body_sections: dict | None = None
signal_chains: list | None = None
plugins: list[str] | None = None
class TechniquePageCreate(TechniquePageBase):
creator_id: uuid.UUID
source_quality: str | None = None
class TechniquePageRead(TechniquePageBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
creator_id: uuid.UUID
creator_name: str = ""
creator_slug: str = ""
source_quality: str | None = None
view_count: int = 0
review_status: str = "draft"
created_at: datetime
updated_at: datetime
# ── RelatedTechniqueLink ─────────────────────────────────────────────────────
class RelatedTechniqueLinkBase(BaseModel):
source_page_id: uuid.UUID
target_page_id: uuid.UUID
relationship: str
class RelatedTechniqueLinkCreate(RelatedTechniqueLinkBase):
pass
class RelatedTechniqueLinkRead(RelatedTechniqueLinkBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
# ── Tag ──────────────────────────────────────────────────────────────────────
class TagBase(BaseModel):
name: str
category: str
aliases: list[str] | None = None
class TagCreate(TagBase):
pass
class TagRead(TagBase):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
# ── Transcript Ingestion ─────────────────────────────────────────────────────
class TranscriptIngestResponse(BaseModel):
"""Response returned after successfully ingesting a transcript."""
video_id: uuid.UUID
creator_id: uuid.UUID
creator_name: str
filename: str
segments_stored: int
processing_status: str
is_reupload: bool
content_hash: str
# ── Pagination wrapper ───────────────────────────────────────────────────────
class PaginatedResponse(BaseModel):
"""Generic paginated list response."""
items: list = Field(default_factory=list)
total: int = 0
offset: int = 0
limit: int = 50
# ── Review Queue ─────────────────────────────────────────────────────────────
class ReviewQueueItem(KeyMomentRead):
"""Key moment enriched with source video and creator info for review UI."""
video_filename: str
creator_name: str
class ReviewQueueResponse(BaseModel):
"""Paginated response for the review queue."""
items: list[ReviewQueueItem] = Field(default_factory=list)
total: int = 0
offset: int = 0
limit: int = 50
class ReviewStatsResponse(BaseModel):
"""Counts of key moments grouped by review status."""
pending: int = 0
approved: int = 0
edited: int = 0
rejected: int = 0
class MomentEditRequest(BaseModel):
"""Editable fields for a key moment."""
title: str | None = None
summary: str | None = None
start_time: float | None = None
end_time: float | None = None
content_type: str | None = None
plugins: list[str] | None = None
class MomentSplitRequest(BaseModel):
"""Request to split a moment at a given timestamp."""
split_time: float
class MomentMergeRequest(BaseModel):
"""Request to merge two moments."""
target_moment_id: uuid.UUID
class ReviewModeResponse(BaseModel):
"""Current review mode state."""
review_mode: bool
class ReviewModeUpdate(BaseModel):
"""Request to update the review mode."""
review_mode: bool
# ── Search ───────────────────────────────────────────────────────────────────
class SearchResultItem(BaseModel):
"""A single search result."""
title: str
slug: str = ""
type: str = ""
score: float = 0.0
summary: str = ""
creator_name: str = ""
creator_slug: str = ""
topic_category: str = ""
topic_tags: list[str] = Field(default_factory=list)
class SearchResponse(BaseModel):
"""Top-level search response with metadata."""
items: list[SearchResultItem] = Field(default_factory=list)
total: int = 0
query: str = ""
fallback_used: bool = False
# ── Technique Page Detail ────────────────────────────────────────────────────
class KeyMomentSummary(BaseModel):
"""Lightweight key moment for technique page detail."""
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
title: str
summary: str
start_time: float
end_time: float
content_type: str
plugins: list[str] | None = None
source_video_id: uuid.UUID | None = None
video_filename: str = ""
class RelatedLinkItem(BaseModel):
"""A related technique link with target info."""
model_config = ConfigDict(from_attributes=True)
target_title: str = ""
target_slug: str = ""
relationship: str = ""
class CreatorInfo(BaseModel):
"""Minimal creator info embedded in technique detail."""
model_config = ConfigDict(from_attributes=True)
name: str
slug: str
genres: list[str] | None = None
class TechniquePageDetail(TechniquePageRead):
"""Technique page with nested key moments, creator, and related links."""
key_moments: list[KeyMomentSummary] = Field(default_factory=list)
creator_info: CreatorInfo | None = None
related_links: list[RelatedLinkItem] = Field(default_factory=list)
version_count: int = 0
# ── Technique Page Versions ──────────────────────────────────────────────────
class TechniquePageVersionSummary(BaseModel):
"""Lightweight version entry for list responses."""
model_config = ConfigDict(from_attributes=True)
version_number: int
created_at: datetime
pipeline_metadata: dict | None = None
class TechniquePageVersionDetail(BaseModel):
"""Full version snapshot for detail responses."""
model_config = ConfigDict(from_attributes=True)
version_number: int
content_snapshot: dict
pipeline_metadata: dict | None = None
created_at: datetime
class TechniquePageVersionListResponse(BaseModel):
"""Response for version list endpoint."""
items: list[TechniquePageVersionSummary] = Field(default_factory=list)
total: int = 0
# ── Topics ───────────────────────────────────────────────────────────────────
class TopicSubTopic(BaseModel):
"""A sub-topic with aggregated counts."""
name: str
technique_count: int = 0
creator_count: int = 0
class TopicCategory(BaseModel):
"""A top-level topic category with sub-topics."""
name: str
description: str = ""
sub_topics: list[TopicSubTopic] = Field(default_factory=list)
# ── Creator Browse ───────────────────────────────────────────────────────────
class CreatorBrowseItem(CreatorRead):
"""Creator with technique and video counts for browse pages."""
technique_count: int = 0
video_count: int = 0
# ── Content Reports ──────────────────────────────────────────────────────────
class ContentReportCreate(BaseModel):
"""Public submission: report a content issue."""
content_type: str = Field(
..., description="Entity type: technique_page, key_moment, creator, general"
)
content_id: uuid.UUID | None = Field(
None, description="ID of the reported entity (null for general reports)"
)
content_title: str | None = Field(
None, description="Title of the reported content (for display context)"
)
report_type: str = Field(
..., description="inaccurate, missing_info, wrong_attribution, formatting, other"
)
description: str = Field(
..., min_length=10, max_length=2000,
description="Description of the issue"
)
page_url: str | None = Field(
None, description="URL the user was on when reporting"
)
class ContentReportRead(BaseModel):
"""Full report for admin views."""
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
content_type: str
content_id: uuid.UUID | None = None
content_title: str | None = None
report_type: str
description: str
status: str = "open"
admin_notes: str | None = None
page_url: str | None = None
created_at: datetime
resolved_at: datetime | None = None
class ContentReportUpdate(BaseModel):
"""Admin update: change status and/or add notes."""
status: str | None = Field(
None, description="open, acknowledged, resolved, dismissed"
)
admin_notes: str | None = Field(
None, max_length=2000, description="Admin notes about resolution"
)
class ContentReportListResponse(BaseModel):
"""Paginated list of content reports."""
items: list[ContentReportRead] = Field(default_factory=list)
total: int = 0
offset: int = 0
limit: int = 50
# ── Pipeline Debug Mode ─────────────────────────────────────────────────────
class DebugModeResponse(BaseModel):
"""Current debug mode status."""
debug_mode: bool
class DebugModeUpdate(BaseModel):
"""Toggle debug mode on/off."""
debug_mode: bool
class TokenStageSummary(BaseModel):
"""Per-stage token usage aggregation."""
stage: str
call_count: int
total_prompt_tokens: int
total_completion_tokens: int
total_tokens: int
class TokenSummaryResponse(BaseModel):
"""Token usage summary for a video, broken down by stage."""
video_id: str
stages: list[TokenStageSummary] = Field(default_factory=list)
grand_total_tokens: int
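The `PaginatedResponse` wrapper above maps directly onto offset/limit slicing. A minimal stdlib-only sketch of how a router might populate it — `paginate` is a hypothetical helper using plain dicts instead of Pydantic:

```python
def paginate(items: list, offset: int = 0, limit: int = 50) -> dict:
    """Slice a full result list into the PaginatedResponse shape."""
    return {
        "items": items[offset:offset + limit],  # the requested page window
        "total": len(items),                    # count before slicing
        "offset": offset,
        "limit": limit,
    }

page = paginate(list(range(120)), offset=100, limit=50)
# page["items"] holds the final 20 elements; "total" stays 120
```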

backend/search_service.py Normal file
@@ -0,0 +1,362 @@
"""Async search service for the public search endpoint.
Orchestrates semantic search (embedding + Qdrant) with keyword fallback.
All external calls have timeouts and degrade gracefully: if the embedding
API or Qdrant fails, the service falls back to keyword-only (ILIKE) search.
"""
from __future__ import annotations
import asyncio
import logging
import time
from typing import Any
import openai
from qdrant_client import AsyncQdrantClient
from qdrant_client.http import exceptions as qdrant_exceptions
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sqlalchemy import or_, select
from sqlalchemy.ext.asyncio import AsyncSession
from config import Settings
from models import Creator, KeyMoment, SourceVideo, TechniquePage
logger = logging.getLogger("chrysopedia.search")
# Timeout for external calls (embedding API, Qdrant) in seconds
_EXTERNAL_TIMEOUT = 0.3 # 300ms per plan
class SearchService:
"""Async search service with semantic + keyword fallback.
Parameters
----------
settings:
Application settings containing embedding and Qdrant config.
"""
def __init__(self, settings: Settings) -> None:
self.settings = settings
self._openai = openai.AsyncOpenAI(
base_url=settings.embedding_api_url,
api_key=settings.llm_api_key,
)
self._qdrant = AsyncQdrantClient(url=settings.qdrant_url)
self._collection = settings.qdrant_collection
# ── Embedding ────────────────────────────────────────────────────────
async def embed_query(self, text: str) -> list[float] | None:
"""Embed a query string into a vector.
Returns None on any failure (timeout, connection, malformed response)
so the caller can fall back to keyword search.
"""
try:
response = await asyncio.wait_for(
self._openai.embeddings.create(
model=self.settings.embedding_model,
input=text,
),
timeout=_EXTERNAL_TIMEOUT,
)
except asyncio.TimeoutError:
logger.warning("Embedding API timeout (%.0fms limit) for query: %.50s", _EXTERNAL_TIMEOUT * 1000, text)
return None
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
logger.warning("Embedding API connection error (%s: %s)", type(exc).__name__, exc)
return None
except openai.APIError as exc:
logger.warning("Embedding API error (%s: %s)", type(exc).__name__, exc)
return None
if not response.data:
logger.warning("Embedding API returned empty data for query: %.50s", text)
return None
vector = response.data[0].embedding
if len(vector) != self.settings.embedding_dimensions:
logger.warning(
"Embedding dimension mismatch: expected %d, got %d",
self.settings.embedding_dimensions,
len(vector),
)
return None
return vector
# ── Qdrant vector search ─────────────────────────────────────────────
async def search_qdrant(
self,
vector: list[float],
limit: int = 20,
type_filter: str | None = None,
) -> list[dict[str, Any]]:
"""Search Qdrant for nearest neighbours.
Returns a list of dicts with 'score' and 'payload' keys.
Returns empty list on failure.
"""
query_filter = None
if type_filter:
query_filter = Filter(
must=[FieldCondition(key="type", match=MatchValue(value=type_filter))]
)
try:
results = await asyncio.wait_for(
self._qdrant.query_points(
collection_name=self._collection,
query=vector,
query_filter=query_filter,
limit=limit,
with_payload=True,
),
timeout=_EXTERNAL_TIMEOUT,
)
except asyncio.TimeoutError:
logger.warning("Qdrant search timeout (%.0fms limit)", _EXTERNAL_TIMEOUT * 1000)
return []
except qdrant_exceptions.UnexpectedResponse as exc:
logger.warning("Qdrant search error: %s", exc)
return []
except Exception as exc:
logger.warning("Qdrant connection error (%s: %s)", type(exc).__name__, exc)
return []
return [
{"score": point.score, "payload": point.payload}
for point in results.points
]
# ── Keyword fallback ─────────────────────────────────────────────────
async def keyword_search(
self,
query: str,
scope: str,
limit: int,
db: AsyncSession,
) -> list[dict[str, Any]]:
"""ILIKE keyword search across technique pages, key moments, and creators.
Searches title/name columns. Returns a unified list of result dicts.
"""
results: list[dict[str, Any]] = []
pattern = f"%{query}%"
if scope in ("all", "topics"):
stmt = (
select(TechniquePage)
.where(
or_(
TechniquePage.title.ilike(pattern),
TechniquePage.summary.ilike(pattern),
)
)
.limit(limit)
)
rows = await db.execute(stmt)
for tp in rows.scalars().all():
results.append({
"type": "technique_page",
"title": tp.title,
"slug": tp.slug,
"summary": tp.summary or "",
"topic_category": tp.topic_category,
"topic_tags": tp.topic_tags or [],
"creator_id": str(tp.creator_id),
"score": 0.0,
})
if scope == "all":
km_stmt = (
select(KeyMoment, SourceVideo, Creator)
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
.join(Creator, SourceVideo.creator_id == Creator.id)
.where(KeyMoment.title.ilike(pattern))
.limit(limit)
)
km_rows = await db.execute(km_stmt)
for km, sv, cr in km_rows.all():
results.append({
"type": "key_moment",
"title": km.title,
"slug": "",
"summary": km.summary or "",
"topic_category": "",
"topic_tags": [],
"creator_id": str(cr.id),
"creator_name": cr.name,
"creator_slug": cr.slug,
"score": 0.0,
})
if scope in ("all", "creators"):
cr_stmt = (
select(Creator)
.where(Creator.name.ilike(pattern))
.limit(limit)
)
cr_rows = await db.execute(cr_stmt)
for cr in cr_rows.scalars().all():
results.append({
"type": "creator",
"title": cr.name,
"slug": cr.slug,
"summary": "",
"topic_category": "",
"topic_tags": cr.genres or [],
"creator_id": str(cr.id),
"score": 0.0,
})
# Enrich keyword results with creator names
kw_creator_ids = {r["creator_id"] for r in results if r.get("creator_id")}
kw_creator_map: dict[str, dict[str, str]] = {}
if kw_creator_ids:
import uuid as _uuid_mod
valid = []
for cid in kw_creator_ids:
try:
valid.append(_uuid_mod.UUID(cid))
except (ValueError, AttributeError):
pass
if valid:
cr_stmt = select(Creator).where(Creator.id.in_(valid))
cr_result = await db.execute(cr_stmt)
for c in cr_result.scalars().all():
kw_creator_map[str(c.id)] = {"name": c.name, "slug": c.slug}
for r in results:
info = kw_creator_map.get(r.get("creator_id", ""), {"name": "", "slug": ""})
r["creator_name"] = info["name"]
r["creator_slug"] = info["slug"]
return results[:limit]
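The ILIKE fallback above is effectively a case-insensitive substring match. A stdlib approximation of the `%query%` pattern over hypothetical in-memory rows:

```python
def ilike(column_value: str, query: str) -> bool:
    """Approximate SQL `col ILIKE '%query%'` in Python."""
    return query.lower() in column_value.lower()

rows = ["Gain Staging in Mixing", "Sound Design Basics"]
hits = [r for r in rows if ilike(r, "gain")]
# hits contains only the gain-staging row
```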
# ── Orchestrator ─────────────────────────────────────────────────────
async def search(
self,
query: str,
scope: str,
limit: int,
db: AsyncSession,
) -> dict[str, Any]:
"""Run semantic search with keyword fallback.
Returns a dict matching the SearchResponse schema shape.
"""
start = time.monotonic()
# Validate / sanitize inputs
if not query or not query.strip():
return {"items": [], "total": 0, "query": query, "fallback_used": False}
# Truncate long queries
query = query.strip()[:500]
# Normalize scope
if scope not in ("all", "topics", "creators"):
scope = "all"
# Map scope to Qdrant type filter
type_filter_map = {
"all": None,
"topics": "technique_page",
"creators": None, # creators aren't in Qdrant
}
qdrant_type_filter = type_filter_map.get(scope)
fallback_used = False
items: list[dict[str, Any]] = []
# Try semantic search
vector = await self.embed_query(query)
if vector is not None:
qdrant_results = await self.search_qdrant(vector, limit=limit, type_filter=qdrant_type_filter)
if qdrant_results:
# Enrich Qdrant results with DB metadata
items = await self._enrich_results(qdrant_results, db)
# Fallback to keyword search if semantic failed or returned nothing
if not items:
items = await self.keyword_search(query, scope, limit, db)
fallback_used = True
elapsed_ms = (time.monotonic() - start) * 1000
logger.info(
"Search query=%r scope=%s results=%d fallback=%s latency_ms=%.1f",
query,
scope,
len(items),
fallback_used,
elapsed_ms,
)
return {
"items": items,
"total": len(items),
"query": query,
"fallback_used": fallback_used,
}
# ── Result enrichment ────────────────────────────────────────────────
async def _enrich_results(
self,
qdrant_results: list[dict[str, Any]],
db: AsyncSession,
) -> list[dict[str, Any]]:
"""Enrich Qdrant results with creator names and slugs from DB."""
enriched: list[dict[str, Any]] = []
# Collect creator_ids to batch-fetch
creator_ids = set()
for r in qdrant_results:
payload = r.get("payload", {})
cid = payload.get("creator_id")
if cid:
creator_ids.add(cid)
# Batch fetch creators
creator_map: dict[str, dict[str, str]] = {}
if creator_ids:
import uuid as uuid_mod
valid_ids = []
for cid in creator_ids:
try:
valid_ids.append(uuid_mod.UUID(cid))
except (ValueError, AttributeError):
pass
if valid_ids:
stmt = select(Creator).where(Creator.id.in_(valid_ids))
result = await db.execute(stmt)
for c in result.scalars().all():
creator_map[str(c.id)] = {"name": c.name, "slug": c.slug}
for r in qdrant_results:
payload = r.get("payload", {})
cid = payload.get("creator_id", "")
creator_info = creator_map.get(cid, {"name": "", "slug": ""})
enriched.append({
"type": payload.get("type", ""),
"title": payload.get("title", ""),
"slug": payload.get("slug", payload.get("title", "").lower().replace(" ", "-")),
"summary": payload.get("summary", ""),
"topic_category": payload.get("topic_category", ""),
"topic_tags": payload.get("topic_tags", []),
"creator_id": cid,
"creator_name": creator_info["name"],
"creator_slug": creator_info["slug"],
"score": r.get("score", 0.0),
})
return enriched

backend/tests/conftest.py Normal file
@@ -0,0 +1,192 @@
"""Shared fixtures for Chrysopedia integration tests.
Provides:
- Async SQLAlchemy engine/session against a real PostgreSQL test database
- Sync SQLAlchemy engine/session for pipeline stage tests (Celery stages are sync)
- httpx.AsyncClient wired to the FastAPI app with dependency overrides
- Pre-ingest fixture for pipeline tests
- Sample transcript fixture path and temporary storage directory
Key design choice: function-scoped engine with NullPool avoids asyncpg
"another operation in progress" errors caused by session-scoped connection
reuse between the ASGI test client and verification queries.
"""
import json
import os
import pathlib
import uuid
import pytest
import pytest_asyncio
from httpx import ASGITransport, AsyncClient
from sqlalchemy import create_engine
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
from sqlalchemy.orm import Session, sessionmaker
from sqlalchemy.pool import NullPool
# Ensure backend/ is on sys.path so "from models import ..." works
import sys
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parent.parent))
from database import Base, get_session # noqa: E402
from main import app # noqa: E402
from models import ( # noqa: E402
ContentType,
Creator,
ProcessingStatus,
SourceVideo,
TranscriptSegment,
)
TEST_DATABASE_URL = os.getenv(
"TEST_DATABASE_URL",
"postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia_test",
)
TEST_DATABASE_URL_SYNC = TEST_DATABASE_URL.replace(
"postgresql+asyncpg://", "postgresql+psycopg2://"
)
@pytest_asyncio.fixture()
async def db_engine():
"""Create a per-test async engine (NullPool) and create/drop all tables."""
engine = create_async_engine(TEST_DATABASE_URL, echo=False, poolclass=NullPool)
# Create all tables fresh for each test
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.drop_all)
await conn.run_sync(Base.metadata.create_all)
yield engine
# Drop all tables after test
async with engine.begin() as conn:
await conn.run_sync(Base.metadata.drop_all)
await engine.dispose()
@pytest_asyncio.fixture()
async def client(db_engine, tmp_path):
"""Async HTTP test client wired to FastAPI with dependency overrides."""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async def _override_get_session():
async with session_factory() as session:
yield session
# Override DB session dependency
app.dependency_overrides[get_session] = _override_get_session
# Override transcript_storage_path via environment variable
os.environ["TRANSCRIPT_STORAGE_PATH"] = str(tmp_path)
# Clear the lru_cache so Settings picks up the new env var
from config import get_settings
get_settings.cache_clear()
transport = ASGITransport(app=app)
async with AsyncClient(transport=transport, base_url="http://testserver") as ac:
yield ac
# Teardown: clean overrides and restore settings cache
app.dependency_overrides.clear()
os.environ.pop("TRANSCRIPT_STORAGE_PATH", None)
get_settings.cache_clear()
@pytest.fixture()
def sample_transcript_path() -> pathlib.Path:
"""Path to the sample 5-segment transcript JSON fixture."""
return pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
@pytest.fixture()
def tmp_transcript_dir(tmp_path) -> pathlib.Path:
"""Temporary directory for transcript storage during tests."""
return tmp_path
# ── Sync engine/session for pipeline stages ──────────────────────────────────
@pytest.fixture()
def sync_engine(db_engine):
"""Create a sync SQLAlchemy engine pointing at the test database.
Tables are already created/dropped by the async ``db_engine`` fixture,
so this fixture just wraps a sync engine around the same DB URL.
"""
engine = create_engine(TEST_DATABASE_URL_SYNC, echo=False, poolclass=NullPool)
yield engine
engine.dispose()
@pytest.fixture()
def sync_session(sync_engine) -> Session:
"""Create a sync SQLAlchemy session for pipeline stage tests."""
factory = sessionmaker(bind=sync_engine)
session = factory()
yield session
session.close()
# ── Pre-ingest fixture for pipeline tests ────────────────────────────────────
@pytest.fixture()
def pre_ingested_video(sync_engine):
"""Ingest the sample transcript directly into the test DB via sync ORM.
Returns a dict with ``video_id``, ``creator_id``, and ``segment_count``.
"""
factory = sessionmaker(bind=sync_engine)
session = factory()
try:
# Create creator
creator = Creator(
name="Skope",
slug="skope",
folder_name="Skope",
)
session.add(creator)
session.flush()
# Create video
video = SourceVideo(
creator_id=creator.id,
filename="mixing-basics-ep1.mp4",
file_path="Skope/mixing-basics-ep1.mp4",
duration_seconds=1234,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.transcribed,
)
session.add(video)
session.flush()
# Create transcript segments
sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
data = json.loads(sample.read_text())
for idx, seg in enumerate(data["segments"]):
session.add(TranscriptSegment(
source_video_id=video.id,
start_time=float(seg["start"]),
end_time=float(seg["end"]),
text=str(seg["text"]),
segment_index=idx,
))
session.commit()
result = {
"video_id": str(video.id),
"creator_id": str(creator.id),
"segment_count": len(data["segments"]),
}
finally:
session.close()
return result
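FastAPI's `dependency_overrides` (used in the `client` fixture above) keys overrides by the original callable. A FastAPI-free sketch of that lookup — `resolve` and the dict are hypothetical stand-ins, stdlib only:

```python
def get_session() -> str:
    return "real-session"

overrides: dict = {}

def resolve(dep):
    """Call the registered override if present, else the dependency itself."""
    return overrides.get(dep, dep)()

overrides[get_session] = lambda: "test-session"
overridden = resolve(get_session)   # override wins while registered
overrides.clear()                   # mirrors app.dependency_overrides.clear()
restored = resolve(get_session)     # back to the real dependency
```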

backend/tests/fixtures/mock_llm_responses.py Normal file
@@ -0,0 +1,111 @@
"""Mock LLM and embedding responses for pipeline integration tests.
Each response is a JSON string matching the Pydantic schema for that stage.
The sample transcript has 5 segments about gain staging, so mock responses
reflect that content.
"""
import json
import random
# ── Stage 2: Segmentation ───────────────────────────────────────────────────
STAGE2_SEGMENTATION_RESPONSE = json.dumps({
"segments": [
{
"start_index": 0,
"end_index": 1,
"topic_label": "Introduction",
"summary": "Introduces the episode about mixing basics and gain staging.",
},
{
"start_index": 2,
"end_index": 4,
"topic_label": "Gain Staging Technique",
"summary": "Covers practical steps for gain staging including setting levels and avoiding clipping.",
},
]
})
# ── Stage 3: Extraction ─────────────────────────────────────────────────────
STAGE3_EXTRACTION_RESPONSE = json.dumps({
"moments": [
{
"title": "Setting Levels for Gain Staging",
"summary": "Demonstrates the process of setting proper gain levels across the signal chain to maintain headroom.",
"start_time": 12.8,
"end_time": 28.5,
"content_type": "technique",
"plugins": ["Pro-Q 3"],
"raw_transcript": "First thing you want to do is set your levels. Make sure nothing is clipping on the master bus.",
},
{
"title": "Master Bus Clipping Prevention",
"summary": "Explains how to monitor and prevent clipping on the master bus during a mix session.",
"start_time": 20.1,
"end_time": 35.0,
"content_type": "settings",
"plugins": [],
"raw_transcript": "Make sure nothing is clipping on the master bus. That wraps up this quick overview.",
},
]
})
# ── Stage 4: Classification ─────────────────────────────────────────────────
STAGE4_CLASSIFICATION_RESPONSE = json.dumps({
"classifications": [
{
"moment_index": 0,
"topic_category": "Mixing",
"topic_tags": ["gain staging", "eq"],
"content_type_override": None,
},
{
"moment_index": 1,
"topic_category": "Mixing",
"topic_tags": ["gain staging", "bus processing"],
"content_type_override": None,
},
]
})
# ── Stage 5: Synthesis ───────────────────────────────────────────────────────
STAGE5_SYNTHESIS_RESPONSE = json.dumps({
"pages": [
{
"title": "Gain Staging in Mixing",
"slug": "gain-staging-in-mixing",
"topic_category": "Mixing",
"topic_tags": ["gain staging"],
"summary": "A comprehensive guide to gain staging in a mixing context, covering level setting and master bus management.",
"body_sections": {
"Overview": "Gain staging ensures each stage of the signal chain operates at optimal levels.",
"Steps": "1. Set input levels. 2. Check bus levels. 3. Monitor master output.",
},
"signal_chains": [
{"chain": "Input -> Channel Strip -> Bus -> Master", "notes": "Keep headroom at each stage."}
],
"plugins": ["Pro-Q 3"],
"source_quality": "structured",
}
]
})
# ── Embedding response ───────────────────────────────────────────────────────
def make_mock_embedding(dim: int = 768) -> list[float]:
"""Generate a deterministic-seeded mock embedding vector."""
rng = random.Random(42)
return [rng.uniform(-1, 1) for _ in range(dim)]
def make_mock_embeddings(n: int, dim: int = 768) -> list[list[float]]:
"""Generate n distinct mock embedding vectors."""
return [
[random.Random(42 + i).uniform(-1, 1) for _ in range(dim)]
for i in range(n)
]
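Because `make_mock_embedding` seeds `random.Random(42)` on every call, the vector is reproducible across calls. A quick stdlib check of that property, with the helper redeclared so the sketch is self-contained:

```python
import random

def make_mock_embedding(dim: int = 768) -> list[float]:
    """Deterministic mock vector: a fresh fixed-seed RNG on each call."""
    rng = random.Random(42)
    return [rng.uniform(-1, 1) for _ in range(dim)]

a = make_mock_embedding(8)
b = make_mock_embedding(8)
# a == b: the same seed yields the same sequence every time
```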

backend/tests/fixtures/sample_transcript.json Normal file
@@ -0,0 +1,12 @@
{
"source_file": "mixing-basics-ep1.mp4",
"creator_folder": "Skope",
"duration_seconds": 1234,
"segments": [
{"start": 0.0, "end": 5.2, "text": "Welcome to mixing basics episode one."},
{"start": 5.2, "end": 12.8, "text": "Today we are going to talk about gain staging."},
{"start": 12.8, "end": 20.1, "text": "First thing you want to do is set your levels."},
{"start": 20.1, "end": 28.5, "text": "Make sure nothing is clipping on the master bus."},
{"start": 28.5, "end": 35.0, "text": "That wraps up this quick overview of gain staging."}
]
}

@@ -0,0 +1,179 @@
"""Integration tests for the transcript ingest endpoint.
Tests run against a real PostgreSQL database via httpx.AsyncClient
on the FastAPI ASGI app. Each test gets a clean database state via
TRUNCATE in the client fixture (conftest.py).
"""
import json
import pathlib
import pytest
from httpx import AsyncClient
from sqlalchemy import func, select, text
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
from models import Creator, SourceVideo, TranscriptSegment
# ── Helpers ──────────────────────────────────────────────────────────────────
INGEST_URL = "/api/v1/ingest"
def _upload_file(path: pathlib.Path):
"""Return a dict suitable for httpx multipart file upload."""
return {"file": (path.name, path.read_bytes(), "application/json")}
async def _query_db(db_engine, stmt):
"""Run a read query in its own session to avoid connection contention."""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
result = await session.execute(stmt)
return result
async def _count_rows(db_engine, model):
"""Count rows in a table via a fresh session."""
result = await _query_db(db_engine, select(func.count(model.id)))
return result.scalar_one()
# ── Happy-path tests ────────────────────────────────────────────────────────
async def test_ingest_creates_creator_and_video(client, sample_transcript_path, db_engine):
"""POST a valid transcript → 200 with creator, video, and 5 segments created."""
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
assert resp.status_code == 200, f"Expected 200, got {resp.status_code}: {resp.text}"
data = resp.json()
assert "video_id" in data
assert "creator_id" in data
assert data["segments_stored"] == 5
assert data["creator_name"] == "Skope"
assert data["is_reupload"] is False
# Verify DB state via a fresh session
session_factory = async_sessionmaker(db_engine, class_=AsyncSession, expire_on_commit=False)
async with session_factory() as session:
# Creator exists with correct folder_name and slug
result = await session.execute(
select(Creator).where(Creator.folder_name == "Skope")
)
creator = result.scalar_one()
assert creator.slug == "skope"
assert creator.name == "Skope"
# SourceVideo exists with correct status
result = await session.execute(
select(SourceVideo).where(SourceVideo.creator_id == creator.id)
)
video = result.scalar_one()
assert video.processing_status.value == "transcribed"
assert video.filename == "mixing-basics-ep1.mp4"
# 5 TranscriptSegment rows with sequential indices
result = await session.execute(
select(TranscriptSegment)
.where(TranscriptSegment.source_video_id == video.id)
.order_by(TranscriptSegment.segment_index)
)
segments = result.scalars().all()
assert len(segments) == 5
assert [s.segment_index for s in segments] == [0, 1, 2, 3, 4]
async def test_ingest_reuses_existing_creator(client, sample_transcript_path, db_engine):
"""If a Creator with the same folder_name already exists, reuse it."""
session_factory = async_sessionmaker(db_engine, class_=AsyncSession, expire_on_commit=False)
# Pre-create a Creator with folder_name='Skope' in a separate session
async with session_factory() as session:
existing = Creator(name="Skope", slug="skope", folder_name="Skope")
session.add(existing)
await session.commit()
await session.refresh(existing)
existing_id = existing.id
# POST transcript — should reuse the creator
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
assert resp.status_code == 200
data = resp.json()
assert data["creator_id"] == str(existing_id)
# Verify only 1 Creator row in DB
count = await _count_rows(db_engine, Creator)
assert count == 1, f"Expected 1 creator, got {count}"
async def test_ingest_idempotent_reupload(client, sample_transcript_path, db_engine):
"""Uploading the same transcript twice is idempotent: same video, no duplicate segments."""
# First upload
resp1 = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
assert resp1.status_code == 200
data1 = resp1.json()
assert data1["is_reupload"] is False
video_id = data1["video_id"]
# Second upload (same file)
resp2 = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
assert resp2.status_code == 200
data2 = resp2.json()
assert data2["is_reupload"] is True
assert data2["video_id"] == video_id
# Verify DB: still only 1 SourceVideo and 5 segments (not 10)
video_count = await _count_rows(db_engine, SourceVideo)
assert video_count == 1, f"Expected 1 video, got {video_count}"
seg_count = await _count_rows(db_engine, TranscriptSegment)
assert seg_count == 5, f"Expected 5 segments, got {seg_count}"
async def test_ingest_saves_json_to_disk(client, sample_transcript_path, tmp_path):
"""Ingested transcript raw JSON is persisted to the filesystem."""
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
assert resp.status_code == 200
# The ingest endpoint saves to {transcript_storage_path}/{creator_folder}/{source_file}.json
expected_path = tmp_path / "Skope" / "mixing-basics-ep1.mp4.json"
assert expected_path.exists(), f"Expected file at {expected_path}"
# Verify the saved JSON is valid and matches the source
saved = json.loads(expected_path.read_text())
source = json.loads(sample_transcript_path.read_text())
assert saved == source
# ── Error tests ──────────────────────────────────────────────────────────────
async def test_ingest_rejects_invalid_json(client, tmp_path):
"""Uploading a non-JSON file returns 422."""
bad_file = tmp_path / "bad.json"
bad_file.write_text("this is not valid json {{{")
resp = await client.post(
INGEST_URL,
files={"file": ("bad.json", bad_file.read_bytes(), "application/json")},
)
assert resp.status_code == 422, f"Expected 422, got {resp.status_code}: {resp.text}"
assert "JSON parse error" in resp.json()["detail"]
async def test_ingest_rejects_missing_fields(client, tmp_path):
"""Uploading JSON without required fields returns 422."""
incomplete = tmp_path / "incomplete.json"
# Missing creator_folder and segments
incomplete.write_text(json.dumps({"source_file": "test.mp4", "duration_seconds": 100}))
resp = await client.post(
INGEST_URL,
files={"file": ("incomplete.json", incomplete.read_bytes(), "application/json")},
)
assert resp.status_code == 422, f"Expected 422, got {resp.status_code}: {resp.text}"
assert "Missing required keys" in resp.json()["detail"]
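The reupload tests above exercise `is_reupload`, which presumably keys on a hash of the raw upload bytes (`content_hash` in the ingest response). A stdlib sketch of that idea — the names and in-memory registry are hypothetical; the actual ingest logic may differ:

```python
import hashlib

def content_hash(raw: bytes) -> str:
    """SHA-256 of the raw upload bytes."""
    return hashlib.sha256(raw).hexdigest()

_seen: dict[str, str] = {}  # content_hash -> video_id

def ingest(raw: bytes, new_video_id: str) -> tuple[str, bool]:
    """Return (video_id, is_reupload) for an upload."""
    h = content_hash(raw)
    if h in _seen:
        return _seen[h], True   # same bytes: reuse the existing video
    _seen[h] = new_video_id
    return new_video_id, False

first = ingest(b'{"segments": []}', "v1")
second = ingest(b'{"segments": []}', "v2")
# second resolves to the original video with is_reupload=True
```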

@@ -0,0 +1,773 @@
"""Integration tests for the LLM extraction pipeline.
Tests run against a real PostgreSQL test database with mocked LLM and Qdrant
clients. Pipeline stages are sync (Celery tasks), so tests call stage
functions directly with sync SQLAlchemy sessions.
Tests (a) through (f) call pipeline stages directly. Tests (g) through (i) use
the async HTTP client. Test (j) verifies LLM fallback logic.
"""
from __future__ import annotations
import json
import os
import pathlib
import uuid
from unittest.mock import MagicMock, patch, PropertyMock
import openai
import pytest
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session, sessionmaker
from sqlalchemy.pool import NullPool
from models import (
Creator,
KeyMoment,
KeyMomentContentType,
ProcessingStatus,
SourceVideo,
TechniquePage,
TranscriptSegment,
)
from pipeline.schemas import (
ClassificationResult,
ExtractionResult,
SegmentationResult,
SynthesisResult,
)
from tests.fixtures.mock_llm_responses import (
STAGE2_SEGMENTATION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE4_CLASSIFICATION_RESPONSE,
STAGE5_SYNTHESIS_RESPONSE,
make_mock_embeddings,
)
# ── Test database URL ────────────────────────────────────────────────────────
TEST_DATABASE_URL_SYNC = os.getenv(
"TEST_DATABASE_URL",
"postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia_test",
).replace("postgresql+asyncpg://", "postgresql+psycopg2://")
# ── Helpers ──────────────────────────────────────────────────────────────────
def _make_mock_openai_response(content: str):
"""Build a mock OpenAI ChatCompletion response object."""
mock_message = MagicMock()
mock_message.content = content
mock_choice = MagicMock()
mock_choice.message = mock_message
mock_response = MagicMock()
mock_response.choices = [mock_choice]
return mock_response
def _make_mock_embedding_response(vectors: list[list[float]]):
"""Build a mock OpenAI Embedding response object."""
mock_items = []
for i, vec in enumerate(vectors):
item = MagicMock()
item.embedding = vec
item.index = i
mock_items.append(item)
mock_response = MagicMock()
mock_response.data = mock_items
return mock_response
def _patch_pipeline_engine(sync_engine):
"""Patch the pipeline.stages module to use the test sync engine/session."""
return [
patch("pipeline.stages._engine", sync_engine),
patch(
"pipeline.stages._SessionLocal",
sessionmaker(bind=sync_engine),
),
]
def _patch_llm_completions(side_effect_fn):
"""Patch openai.OpenAI so all instances share a mocked chat.completions.create."""
mock_client = MagicMock()
mock_client.chat.completions.create.side_effect = side_effect_fn
return patch("openai.OpenAI", return_value=mock_client)
def _create_canonical_tags_file(tmp_path: pathlib.Path) -> pathlib.Path:
"""Write a minimal canonical_tags.yaml for stage4 to load."""
config_dir = tmp_path / "config"
config_dir.mkdir(exist_ok=True)
tags_path = config_dir / "canonical_tags.yaml"
tags_path.write_text(
"categories:\n"
" - name: Mixing\n"
" description: Balancing and processing elements\n"
" sub_topics: [eq, compression, gain staging, bus processing]\n"
" - name: Sound design\n"
" description: Creating sounds\n"
" sub_topics: [bass, drums]\n"
)
return tags_path
# ── (a) Stage 2: Segmentation ───────────────────────────────────────────────
def test_stage2_segmentation_updates_topic_labels(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""Stage 2 should update topic_label on each TranscriptSegment."""
video_id = pre_ingested_video["video_id"]
# Create prompts directory
prompts_dir = tmp_path / "prompts"
prompts_dir.mkdir()
(prompts_dir / "stage2_segmentation.txt").write_text("You are a segmentation assistant.")
# Build the mock LLM that returns the segmentation response
def llm_side_effect(**kwargs):
return _make_mock_openai_response(STAGE2_SEGMENTATION_RESPONSE)
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
with _patch_llm_completions(llm_side_effect), \
patch("pipeline.stages.get_settings") as mock_settings:
s = MagicMock()
s.prompts_path = str(prompts_dir)
s.llm_api_url = "http://mock:11434/v1"
s.llm_api_key = "sk-test"
s.llm_model = "test-model"
s.llm_fallback_url = "http://mock:11434/v1"
s.llm_fallback_model = "test-model"
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
mock_settings.return_value = s
# Import and call stage directly (not via Celery)
from pipeline.stages import stage2_segmentation
result = stage2_segmentation(video_id)
assert result == video_id
for p in patches:
p.stop()
# Verify: check topic_label on segments
factory = sessionmaker(bind=sync_engine)
session = factory()
try:
segments = (
session.execute(
select(TranscriptSegment)
.where(TranscriptSegment.source_video_id == video_id)
.order_by(TranscriptSegment.segment_index)
)
.scalars()
.all()
)
# Segments 0,1 should have "Introduction", segments 2,3,4 should have "Gain Staging Technique"
assert segments[0].topic_label == "Introduction"
assert segments[1].topic_label == "Introduction"
assert segments[2].topic_label == "Gain Staging Technique"
assert segments[3].topic_label == "Gain Staging Technique"
assert segments[4].topic_label == "Gain Staging Technique"
finally:
session.close()
# ── (b) Stage 3: Extraction ─────────────────────────────────────────────────
def test_stage3_extraction_creates_key_moments(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""Stages 2+3 should create KeyMoment rows and set processing_status=extracted."""
video_id = pre_ingested_video["video_id"]
prompts_dir = tmp_path / "prompts"
prompts_dir.mkdir()
(prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
(prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
call_count = {"n": 0}
    responses = [
        STAGE2_SEGMENTATION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
    ]
def llm_side_effect(**kwargs):
idx = min(call_count["n"], len(responses) - 1)
resp = responses[idx]
call_count["n"] += 1
return _make_mock_openai_response(resp)
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
with _patch_llm_completions(llm_side_effect), \
patch("pipeline.stages.get_settings") as mock_settings:
s = MagicMock()
s.prompts_path = str(prompts_dir)
s.llm_api_url = "http://mock:11434/v1"
s.llm_api_key = "sk-test"
s.llm_model = "test-model"
s.llm_fallback_url = "http://mock:11434/v1"
s.llm_fallback_model = "test-model"
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
mock_settings.return_value = s
from pipeline.stages import stage2_segmentation, stage3_extraction
stage2_segmentation(video_id)
stage3_extraction(video_id)
for p in patches:
p.stop()
# Verify key moments created
factory = sessionmaker(bind=sync_engine)
session = factory()
try:
moments = (
session.execute(
select(KeyMoment)
.where(KeyMoment.source_video_id == video_id)
.order_by(KeyMoment.start_time)
)
.scalars()
.all()
)
# Two topic groups → extraction called twice → up to 4 moments
# (2 per group from the mock response)
assert len(moments) >= 2
assert moments[0].title == "Setting Levels for Gain Staging"
assert moments[0].content_type == KeyMomentContentType.technique
# Verify processing_status
video = session.execute(
select(SourceVideo).where(SourceVideo.id == video_id)
).scalar_one()
assert video.processing_status == ProcessingStatus.extracted
finally:
session.close()
# ── (c) Stage 4: Classification ─────────────────────────────────────────────
def test_stage4_classification_assigns_tags(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""Stages 2+3+4 should store classification data in Redis."""
video_id = pre_ingested_video["video_id"]
prompts_dir = tmp_path / "prompts"
prompts_dir.mkdir()
(prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
(prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
(prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")
_create_canonical_tags_file(tmp_path)
call_count = {"n": 0}
responses = [
STAGE2_SEGMENTATION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE4_CLASSIFICATION_RESPONSE,
]
def llm_side_effect(**kwargs):
idx = min(call_count["n"], len(responses) - 1)
resp = responses[idx]
call_count["n"] += 1
return _make_mock_openai_response(resp)
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
stored_cls_data = {}
def mock_store_classification(vid, data):
stored_cls_data[vid] = data
with _patch_llm_completions(llm_side_effect), \
patch("pipeline.stages.get_settings") as mock_settings, \
patch("pipeline.stages._load_canonical_tags") as mock_tags, \
patch("pipeline.stages._store_classification_data", side_effect=mock_store_classification):
s = MagicMock()
s.prompts_path = str(prompts_dir)
s.llm_api_url = "http://mock:11434/v1"
s.llm_api_key = "sk-test"
s.llm_model = "test-model"
s.llm_fallback_url = "http://mock:11434/v1"
s.llm_fallback_model = "test-model"
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
s.review_mode = True
mock_settings.return_value = s
mock_tags.return_value = {
"categories": [
{"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging", "eq"]},
]
}
from pipeline.stages import stage2_segmentation, stage3_extraction, stage4_classification
stage2_segmentation(video_id)
stage3_extraction(video_id)
stage4_classification(video_id)
for p in patches:
p.stop()
# Verify classification data was stored
assert video_id in stored_cls_data
cls_data = stored_cls_data[video_id]
assert len(cls_data) >= 1
assert cls_data[0]["topic_category"] == "Mixing"
assert "gain staging" in cls_data[0]["topic_tags"]
# ── (d) Stage 5: Synthesis ──────────────────────────────────────────────────
def test_stage5_synthesis_creates_technique_pages(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""Full pipeline stages 2-5 should create TechniquePage rows linked to KeyMoments."""
video_id = pre_ingested_video["video_id"]
prompts_dir = tmp_path / "prompts"
prompts_dir.mkdir()
(prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
(prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
(prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")
(prompts_dir / "stage5_synthesis.txt").write_text("Synthesis assistant.")
call_count = {"n": 0}
responses = [
STAGE2_SEGMENTATION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE4_CLASSIFICATION_RESPONSE,
STAGE5_SYNTHESIS_RESPONSE,
]
def llm_side_effect(**kwargs):
idx = min(call_count["n"], len(responses) - 1)
resp = responses[idx]
call_count["n"] += 1
return _make_mock_openai_response(resp)
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
# Mock classification data in Redis (simulate stage 4 having stored it)
mock_cls_data = [
{"moment_id": "will-be-replaced", "topic_category": "Mixing", "topic_tags": ["gain staging"]},
]
with _patch_llm_completions(llm_side_effect), \
patch("pipeline.stages.get_settings") as mock_settings, \
patch("pipeline.stages._load_canonical_tags") as mock_tags, \
patch("pipeline.stages._store_classification_data"), \
patch("pipeline.stages._load_classification_data") as mock_load_cls:
s = MagicMock()
s.prompts_path = str(prompts_dir)
s.llm_api_url = "http://mock:11434/v1"
s.llm_api_key = "sk-test"
s.llm_model = "test-model"
s.llm_fallback_url = "http://mock:11434/v1"
s.llm_fallback_model = "test-model"
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
s.review_mode = True
mock_settings.return_value = s
mock_tags.return_value = {
"categories": [
{"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging"]},
]
}
from pipeline.stages import (
stage2_segmentation,
stage3_extraction,
stage4_classification,
stage5_synthesis,
)
stage2_segmentation(video_id)
stage3_extraction(video_id)
stage4_classification(video_id)
# Now set up mock_load_cls to return data with real moment IDs
factory = sessionmaker(bind=sync_engine)
sess = factory()
real_moments = (
sess.execute(
select(KeyMoment).where(KeyMoment.source_video_id == video_id)
)
.scalars()
.all()
)
real_cls = [
{"moment_id": str(m.id), "topic_category": "Mixing", "topic_tags": ["gain staging"]}
for m in real_moments
]
sess.close()
mock_load_cls.return_value = real_cls
stage5_synthesis(video_id)
for p in patches:
p.stop()
# Verify TechniquePages created
factory = sessionmaker(bind=sync_engine)
session = factory()
try:
pages = session.execute(select(TechniquePage)).scalars().all()
assert len(pages) >= 1
page = pages[0]
assert page.title == "Gain Staging in Mixing"
assert page.body_sections is not None
assert "Overview" in page.body_sections
assert page.signal_chains is not None
assert len(page.signal_chains) >= 1
assert page.summary is not None
# Verify KeyMoments are linked to the TechniquePage
moments = (
session.execute(
select(KeyMoment).where(KeyMoment.technique_page_id == page.id)
)
.scalars()
.all()
)
assert len(moments) >= 1
# Verify processing_status updated
video = session.execute(
select(SourceVideo).where(SourceVideo.id == video_id)
).scalar_one()
assert video.processing_status == ProcessingStatus.reviewed
finally:
session.close()
# ── (e) Stage 6: Embed & Index ──────────────────────────────────────────────
def test_stage6_embeds_and_upserts_to_qdrant(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""Full pipeline through stage 6 should call EmbeddingClient and QdrantManager."""
video_id = pre_ingested_video["video_id"]
prompts_dir = tmp_path / "prompts"
prompts_dir.mkdir()
(prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
(prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
(prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")
(prompts_dir / "stage5_synthesis.txt").write_text("Synthesis assistant.")
call_count = {"n": 0}
responses = [
STAGE2_SEGMENTATION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE3_EXTRACTION_RESPONSE,
STAGE4_CLASSIFICATION_RESPONSE,
STAGE5_SYNTHESIS_RESPONSE,
]
def llm_side_effect(**kwargs):
idx = min(call_count["n"], len(responses) - 1)
resp = responses[idx]
call_count["n"] += 1
return _make_mock_openai_response(resp)
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
mock_embed_client = MagicMock()
mock_embed_client.embed.side_effect = lambda texts: make_mock_embeddings(len(texts))
mock_qdrant_mgr = MagicMock()
with _patch_llm_completions(llm_side_effect), \
patch("pipeline.stages.get_settings") as mock_settings, \
patch("pipeline.stages._load_canonical_tags") as mock_tags, \
patch("pipeline.stages._store_classification_data"), \
patch("pipeline.stages._load_classification_data") as mock_load_cls, \
patch("pipeline.stages.EmbeddingClient", return_value=mock_embed_client), \
patch("pipeline.stages.QdrantManager", return_value=mock_qdrant_mgr):
s = MagicMock()
s.prompts_path = str(prompts_dir)
s.llm_api_url = "http://mock:11434/v1"
s.llm_api_key = "sk-test"
s.llm_model = "test-model"
s.llm_fallback_url = "http://mock:11434/v1"
s.llm_fallback_model = "test-model"
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
s.review_mode = True
s.embedding_api_url = "http://mock:11434/v1"
s.embedding_model = "test-embed"
s.embedding_dimensions = 768
s.qdrant_url = "http://mock:6333"
s.qdrant_collection = "test_collection"
mock_settings.return_value = s
mock_tags.return_value = {
"categories": [
{"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging"]},
]
}
from pipeline.stages import (
stage2_segmentation,
stage3_extraction,
stage4_classification,
stage5_synthesis,
stage6_embed_and_index,
)
stage2_segmentation(video_id)
stage3_extraction(video_id)
stage4_classification(video_id)
# Load real moment IDs for classification data mock
factory = sessionmaker(bind=sync_engine)
sess = factory()
real_moments = (
sess.execute(
select(KeyMoment).where(KeyMoment.source_video_id == video_id)
)
.scalars()
.all()
)
real_cls = [
{"moment_id": str(m.id), "topic_category": "Mixing", "topic_tags": ["gain staging"]}
for m in real_moments
]
sess.close()
mock_load_cls.return_value = real_cls
stage5_synthesis(video_id)
stage6_embed_and_index(video_id)
for p in patches:
p.stop()
# Verify EmbeddingClient.embed was called
assert mock_embed_client.embed.called
# Verify QdrantManager methods called
mock_qdrant_mgr.ensure_collection.assert_called_once()
assert (
mock_qdrant_mgr.upsert_technique_pages.called
or mock_qdrant_mgr.upsert_key_moments.called
), "Expected at least one upsert call to QdrantManager"
# ── (f) Resumability ────────────────────────────────────────────────────────
def test_run_pipeline_resumes_from_extracted(
db_engine, sync_engine, pre_ingested_video, tmp_path
):
"""When status=extracted, run_pipeline should skip stages 2+3 and run 4+5+6."""
video_id = pre_ingested_video["video_id"]
# Set video status to "extracted" directly
factory = sessionmaker(bind=sync_engine)
session = factory()
video = session.execute(
select(SourceVideo).where(SourceVideo.id == video_id)
).scalar_one()
video.processing_status = ProcessingStatus.extracted
session.commit()
session.close()
patches = _patch_pipeline_engine(sync_engine)
for p in patches:
p.start()
with patch("pipeline.stages.get_settings") as mock_settings, \
patch("pipeline.stages.stage2_segmentation") as mock_s2, \
patch("pipeline.stages.stage3_extraction") as mock_s3, \
patch("pipeline.stages.stage4_classification") as mock_s4, \
patch("pipeline.stages.stage5_synthesis") as mock_s5, \
patch("pipeline.stages.stage6_embed_and_index") as mock_s6, \
patch("pipeline.stages.celery_chain") as mock_chain:
s = MagicMock()
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
mock_settings.return_value = s
# Mock chain to inspect what stages it gets
mock_pipeline = MagicMock()
mock_chain.return_value = mock_pipeline
# Mock the .s() method on each task
mock_s2.s = MagicMock(return_value="s2_sig")
mock_s3.s = MagicMock(return_value="s3_sig")
mock_s4.s = MagicMock(return_value="s4_sig")
mock_s5.s = MagicMock(return_value="s5_sig")
mock_s6.s = MagicMock(return_value="s6_sig")
from pipeline.stages import run_pipeline
run_pipeline(video_id)
# Verify: stages 2 and 3 should NOT have .s() called with video_id
mock_s2.s.assert_not_called()
mock_s3.s.assert_not_called()
# Stages 4, 5, 6 should have .s() called
mock_s4.s.assert_called_once_with(video_id)
mock_s5.s.assert_called_once()
mock_s6.s.assert_called_once()
for p in patches:
p.stop()
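The resume behavior exercised above can be thought of as a pure status-to-stage mapping. This is a hypothetical sketch only, not the real `run_pipeline` (which builds a Celery chain from `.s()` signatures); the `pending` and `reviewed` resume points are assumptions for illustration:

```python
# Hypothetical sketch of resume-point selection; the real run_pipeline
# chains Celery signatures for the selected stages.
ALL_STAGES = [
    "stage2_segmentation",
    "stage3_extraction",
    "stage4_classification",
    "stage5_synthesis",
    "stage6_embed_and_index",
]

# processing_status -> index of the first stage still to run
# ("pending" and "reviewed" entries are illustrative assumptions)
RESUME_POINT = {
    "pending": 0,
    "extracted": 2,   # stages 2+3 done, resume at classification
    "reviewed": 4,    # re-run only embed & index
}

def stages_to_run(status: str) -> list[str]:
    return ALL_STAGES[RESUME_POINT.get(status, 0):]
```

For `status="extracted"` this skips segmentation and extraction and runs stages 4, 5, and 6, matching the assertions in the test above.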
# ── (g) Pipeline trigger endpoint ───────────────────────────────────────────
@pytest.mark.asyncio
async def test_pipeline_trigger_endpoint(client, db_engine):
"""POST /api/v1/pipeline/trigger/{video_id} with valid video returns 200."""
# Ingest a transcript first to create a video
sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
with patch("routers.ingest.run_pipeline", create=True) as mock_rp:
mock_rp.delay = MagicMock()
resp = await client.post(
"/api/v1/ingest",
files={"file": (sample.name, sample.read_bytes(), "application/json")},
)
assert resp.status_code == 200
video_id = resp.json()["video_id"]
# Trigger the pipeline
with patch("pipeline.stages.run_pipeline") as mock_rp:
mock_rp.delay = MagicMock()
resp = await client.post(f"/api/v1/pipeline/trigger/{video_id}")
assert resp.status_code == 200
data = resp.json()
assert data["status"] == "triggered"
assert data["video_id"] == video_id
# ── (h) Pipeline trigger 404 ────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_pipeline_trigger_404_for_missing_video(client):
"""POST /api/v1/pipeline/trigger/{nonexistent} returns 404."""
fake_id = str(uuid.uuid4())
resp = await client.post(f"/api/v1/pipeline/trigger/{fake_id}")
assert resp.status_code == 404
assert "not found" in resp.json()["detail"].lower()
# ── (i) Ingest dispatches pipeline ──────────────────────────────────────────
@pytest.mark.asyncio
async def test_ingest_dispatches_pipeline(client, db_engine):
"""Ingesting a transcript should call run_pipeline.delay with the video_id."""
sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
with patch("pipeline.stages.run_pipeline") as mock_rp:
mock_rp.delay = MagicMock()
resp = await client.post(
"/api/v1/ingest",
files={"file": (sample.name, sample.read_bytes(), "application/json")},
)
assert resp.status_code == 200
video_id = resp.json()["video_id"]
mock_rp.delay.assert_called_once_with(video_id)
# ── (j) LLM fallback on primary failure ─────────────────────────────────────
def test_llm_fallback_on_primary_failure():
"""LLMClient should fall back to secondary endpoint when primary raises APIConnectionError."""
from pipeline.llm_client import LLMClient
settings = MagicMock()
settings.llm_api_url = "http://primary:11434/v1"
settings.llm_api_key = "sk-test"
settings.llm_fallback_url = "http://fallback:11434/v1"
settings.llm_fallback_model = "fallback-model"
settings.llm_model = "primary-model"
with patch("openai.OpenAI") as MockOpenAI:
primary_client = MagicMock()
fallback_client = MagicMock()
# First call → primary, second call → fallback
MockOpenAI.side_effect = [primary_client, fallback_client]
client = LLMClient(settings)
# Primary raises APIConnectionError
primary_client.chat.completions.create.side_effect = openai.APIConnectionError(
request=MagicMock()
)
# Fallback succeeds
fallback_response = _make_mock_openai_response('{"result": "ok"}')
fallback_client.chat.completions.create.return_value = fallback_response
result = client.complete("system", "user")
assert result == '{"result": "ok"}'
primary_client.chat.completions.create.assert_called_once()
fallback_client.chat.completions.create.assert_called_once()
# ── Think-tag stripping ─────────────────────────────────────────────────────
def test_strip_think_tags():
"""strip_think_tags should handle all edge cases correctly."""
from pipeline.llm_client import strip_think_tags
# Single block with JSON after
assert strip_think_tags('<think>reasoning here</think>{"a": 1}') == '{"a": 1}'
# Multiline think block
assert strip_think_tags(
'<think>\nI need to analyze this.\nLet me think step by step.\n</think>\n{"result": "ok"}'
) == '{"result": "ok"}'
# Multiple think blocks
result = strip_think_tags('<think>first</think>hello<think>second</think> world')
assert result == "hello world"
# No think tags — passthrough
assert strip_think_tags('{"clean": true}') == '{"clean": true}'
# Empty string
assert strip_think_tags("") == ""
# Think block with special characters
assert strip_think_tags(
'<think>analyzing "complex" <data> & stuff</think>{"done": true}'
) == '{"done": true}'
# Only a think block, no actual content
assert strip_think_tags("<think>just thinking</think>") == ""


@ -0,0 +1,526 @@
"""Integration tests for the public S05 API endpoints:
techniques, topics, and enhanced creators.
Tests run against a real PostgreSQL test database via httpx.AsyncClient.
"""
from __future__ import annotations
import uuid
import pytest
import pytest_asyncio
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
from models import (
ContentType,
Creator,
KeyMoment,
KeyMomentContentType,
ProcessingStatus,
RelatedTechniqueLink,
RelationshipType,
SourceVideo,
TechniquePage,
)
TECHNIQUES_URL = "/api/v1/techniques"
TOPICS_URL = "/api/v1/topics"
CREATORS_URL = "/api/v1/creators"
# ── Seed helpers ─────────────────────────────────────────────────────────────
async def _seed_full_data(db_engine) -> dict:
"""Seed 2 creators, 2 videos, 3 technique pages, key moments, and a related link.
Returns a dict of IDs and metadata for assertions.
"""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
# Creators
creator1 = Creator(
name="Alpha Creator",
slug="alpha-creator",
genres=["Bass music", "Dubstep"],
folder_name="AlphaCreator",
)
creator2 = Creator(
name="Beta Producer",
slug="beta-producer",
genres=["House", "Techno"],
folder_name="BetaProducer",
)
session.add_all([creator1, creator2])
await session.flush()
# Videos
video1 = SourceVideo(
creator_id=creator1.id,
filename="bass-tutorial.mp4",
file_path="AlphaCreator/bass-tutorial.mp4",
duration_seconds=600,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
video2 = SourceVideo(
creator_id=creator2.id,
filename="mixing-masterclass.mp4",
file_path="BetaProducer/mixing-masterclass.mp4",
duration_seconds=1200,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
session.add_all([video1, video2])
await session.flush()
# Technique pages
tp1 = TechniquePage(
creator_id=creator1.id,
title="Reese Bass Design",
slug="reese-bass-design",
topic_category="Sound design",
topic_tags=["bass", "textures"],
summary="Classic reese bass creation",
body_sections={"intro": "Getting started with reese bass"},
)
tp2 = TechniquePage(
creator_id=creator2.id,
title="Granular Pad Textures",
slug="granular-pad-textures",
topic_category="Synthesis",
topic_tags=["granular", "pads"],
summary="Creating evolving pad textures",
)
tp3 = TechniquePage(
creator_id=creator1.id,
title="FM Bass Layering",
slug="fm-bass-layering",
topic_category="Synthesis",
topic_tags=["fm", "bass"],
summary="FM synthesis for bass layers",
)
session.add_all([tp1, tp2, tp3])
await session.flush()
# Key moments
km1 = KeyMoment(
source_video_id=video1.id,
technique_page_id=tp1.id,
title="Oscillator setup",
summary="Setting up the initial oscillator",
start_time=10.0,
end_time=60.0,
content_type=KeyMomentContentType.technique,
)
km2 = KeyMoment(
source_video_id=video1.id,
technique_page_id=tp1.id,
title="Distortion chain",
summary="Adding distortion to the reese",
start_time=60.0,
end_time=120.0,
content_type=KeyMomentContentType.technique,
)
km3 = KeyMoment(
source_video_id=video2.id,
technique_page_id=tp2.id,
title="Granular engine parameters",
summary="Configuring the granular engine",
start_time=20.0,
end_time=80.0,
content_type=KeyMomentContentType.settings,
)
session.add_all([km1, km2, km3])
await session.flush()
# Related technique link: tp1 → tp3 (same_creator_adjacent)
link = RelatedTechniqueLink(
source_page_id=tp1.id,
target_page_id=tp3.id,
relationship=RelationshipType.same_creator_adjacent,
)
session.add(link)
await session.commit()
return {
"creator1_id": str(creator1.id),
"creator1_name": creator1.name,
"creator1_slug": creator1.slug,
"creator2_id": str(creator2.id),
"creator2_name": creator2.name,
"creator2_slug": creator2.slug,
"video1_id": str(video1.id),
"video2_id": str(video2.id),
"tp1_slug": tp1.slug,
"tp1_title": tp1.title,
"tp2_slug": tp2.slug,
"tp3_slug": tp3.slug,
"tp3_title": tp3.title,
}
# ── Technique Tests ──────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_list_techniques(client, db_engine):
"""GET /techniques returns a paginated list of technique pages."""
seed = await _seed_full_data(db_engine)
resp = await client.get(TECHNIQUES_URL)
assert resp.status_code == 200
data = resp.json()
assert data["total"] == 3
assert len(data["items"]) == 3
# Each item has required fields
slugs = {item["slug"] for item in data["items"]}
assert seed["tp1_slug"] in slugs
assert seed["tp2_slug"] in slugs
assert seed["tp3_slug"] in slugs
@pytest.mark.asyncio
async def test_list_techniques_with_category_filter(client, db_engine):
"""GET /techniques?category=Synthesis returns only Synthesis technique pages."""
await _seed_full_data(db_engine)
resp = await client.get(TECHNIQUES_URL, params={"category": "Synthesis"})
assert resp.status_code == 200
data = resp.json()
assert data["total"] == 2
for item in data["items"]:
assert item["topic_category"] == "Synthesis"
@pytest.mark.asyncio
async def test_get_technique_detail(client, db_engine):
"""GET /techniques/{slug} returns full detail with key_moments, creator_info, and related_links."""
seed = await _seed_full_data(db_engine)
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
assert resp.status_code == 200
data = resp.json()
assert data["title"] == seed["tp1_title"]
assert data["slug"] == seed["tp1_slug"]
assert data["topic_category"] == "Sound design"
# Key moments: tp1 has 2 key moments
assert len(data["key_moments"]) == 2
km_titles = {km["title"] for km in data["key_moments"]}
assert "Oscillator setup" in km_titles
assert "Distortion chain" in km_titles
# Creator info
assert data["creator_info"] is not None
assert data["creator_info"]["name"] == seed["creator1_name"]
assert data["creator_info"]["slug"] == seed["creator1_slug"]
# Related links: tp1 → tp3 (same_creator_adjacent)
assert len(data["related_links"]) >= 1
related_slugs = {link["target_slug"] for link in data["related_links"]}
assert seed["tp3_slug"] in related_slugs
@pytest.mark.asyncio
async def test_get_technique_invalid_slug_returns_404(client, db_engine):
"""GET /techniques/{invalid-slug} returns 404."""
await _seed_full_data(db_engine)
resp = await client.get(f"{TECHNIQUES_URL}/nonexistent-slug-xyz")
assert resp.status_code == 404
assert "not found" in resp.json()["detail"].lower()
# ── Topics Tests ─────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_list_topics_hierarchy(client, db_engine):
"""GET /topics returns category hierarchy with counts matching seeded data."""
await _seed_full_data(db_engine)
resp = await client.get(TOPICS_URL)
assert resp.status_code == 200
data = resp.json()
# Should have the 6 categories from canonical_tags.yaml
assert len(data) == 6
category_names = {cat["name"] for cat in data}
assert "Sound design" in category_names
assert "Synthesis" in category_names
assert "Mixing" in category_names
# Check Sound design category — should have "bass" sub-topic with count
sound_design = next(c for c in data if c["name"] == "Sound design")
bass_sub = next(
(st for st in sound_design["sub_topics"] if st["name"] == "bass"), None
)
assert bass_sub is not None
# tp1 (tags: ["bass", "textures"]) and tp3 (tags: ["fm", "bass"]) both have "bass"
assert bass_sub["technique_count"] == 2
# Both from creator1
assert bass_sub["creator_count"] == 1
# Check Synthesis category — "granular" sub-topic
synthesis = next(c for c in data if c["name"] == "Synthesis")
granular_sub = next(
(st for st in synthesis["sub_topics"] if st["name"] == "granular"), None
)
assert granular_sub is not None
assert granular_sub["technique_count"] == 1
assert granular_sub["creator_count"] == 1
@pytest.mark.asyncio
async def test_topics_with_no_technique_pages(client, db_engine):
"""GET /topics with no seeded data returns categories with zero counts."""
# No data seeded — just use the clean DB
resp = await client.get(TOPICS_URL)
assert resp.status_code == 200
data = resp.json()
assert len(data) == 6
# All sub-topic counts should be zero
for category in data:
for st in category["sub_topics"]:
assert st["technique_count"] == 0
assert st["creator_count"] == 0
# ── Creator Tests ────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_list_creators_random_sort(client, db_engine):
"""GET /creators?sort=random returns all creators (order may vary)."""
seed = await _seed_full_data(db_engine)
resp = await client.get(CREATORS_URL, params={"sort": "random"})
assert resp.status_code == 200
data = resp.json()
assert len(data) == 2
names = {item["name"] for item in data}
assert seed["creator1_name"] in names
assert seed["creator2_name"] in names
# Each item has technique_count and video_count
for item in data:
assert "technique_count" in item
assert "video_count" in item
@pytest.mark.asyncio
async def test_list_creators_alpha_sort(client, db_engine):
"""GET /creators?sort=alpha returns creators in alphabetical order."""
seed = await _seed_full_data(db_engine)
resp = await client.get(CREATORS_URL, params={"sort": "alpha"})
assert resp.status_code == 200
data = resp.json()
assert len(data) == 2
# "Alpha Creator" < "Beta Producer" alphabetically
assert data[0]["name"] == "Alpha Creator"
assert data[1]["name"] == "Beta Producer"
@pytest.mark.asyncio
async def test_list_creators_genre_filter(client, db_engine):
"""GET /creators?genre=Bass+music returns only matching creators."""
seed = await _seed_full_data(db_engine)
resp = await client.get(CREATORS_URL, params={"genre": "Bass music"})
assert resp.status_code == 200
data = resp.json()
assert len(data) == 1
assert data[0]["name"] == seed["creator1_name"]
assert data[0]["slug"] == seed["creator1_slug"]
@pytest.mark.asyncio
async def test_get_creator_detail(client, db_engine):
"""GET /creators/{slug} returns detail with video_count."""
seed = await _seed_full_data(db_engine)
resp = await client.get(f"{CREATORS_URL}/{seed['creator1_slug']}")
assert resp.status_code == 200
data = resp.json()
assert data["name"] == seed["creator1_name"]
assert data["slug"] == seed["creator1_slug"]
assert data["video_count"] == 1 # creator1 has 1 video
@pytest.mark.asyncio
async def test_get_creator_invalid_slug_returns_404(client, db_engine):
"""GET /creators/{invalid-slug} returns 404."""
await _seed_full_data(db_engine)
resp = await client.get(f"{CREATORS_URL}/nonexistent-creator-xyz")
assert resp.status_code == 404
@pytest.mark.asyncio
async def test_creators_with_counts(client, db_engine):
"""GET /creators returns correct technique_count and video_count."""
seed = await _seed_full_data(db_engine)
resp = await client.get(CREATORS_URL, params={"sort": "alpha"})
assert resp.status_code == 200
data = resp.json()
# Alpha Creator: 2 technique pages, 1 video
alpha = data[0]
assert alpha["name"] == "Alpha Creator"
assert alpha["technique_count"] == 2
assert alpha["video_count"] == 1
# Beta Producer: 1 technique page, 1 video
beta = data[1]
assert beta["name"] == "Beta Producer"
assert beta["technique_count"] == 1
assert beta["video_count"] == 1
@pytest.mark.asyncio
async def test_creators_empty_list(client, db_engine):
"""GET /creators with no creators returns empty list."""
# No data seeded
resp = await client.get(CREATORS_URL)
assert resp.status_code == 200
data = resp.json()
assert data == []
# ── Version Tests ────────────────────────────────────────────────────────────
async def _insert_version(db_engine, technique_page_id: str, version_number: int, content_snapshot: dict, pipeline_metadata: dict | None = None):
"""Insert a TechniquePageVersion row directly for testing."""
from models import TechniquePageVersion
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
v = TechniquePageVersion(
technique_page_id=uuid.UUID(technique_page_id) if isinstance(technique_page_id, str) else technique_page_id,
version_number=version_number,
content_snapshot=content_snapshot,
pipeline_metadata=pipeline_metadata,
)
session.add(v)
await session.commit()
@pytest.mark.asyncio
async def test_version_list_empty(client, db_engine):
"""GET /techniques/{slug}/versions returns empty list when page has no versions."""
seed = await _seed_full_data(db_engine)
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions")
assert resp.status_code == 200
data = resp.json()
assert data["items"] == []
assert data["total"] == 0
@pytest.mark.asyncio
async def test_version_list_with_versions(client, db_engine):
"""GET /techniques/{slug}/versions returns versions after inserting them."""
seed = await _seed_full_data(db_engine)
# Get the technique page ID by fetching the detail
detail_resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
page_id = detail_resp.json()["id"]
# Insert two versions
snapshot1 = {"title": "Old Reese Bass v1", "summary": "First draft"}
snapshot2 = {"title": "Old Reese Bass v2", "summary": "Second draft"}
await _insert_version(db_engine, page_id, 1, snapshot1, {"model": "gpt-4o"})
await _insert_version(db_engine, page_id, 2, snapshot2, {"model": "gpt-4o-mini"})
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions")
assert resp.status_code == 200
data = resp.json()
assert data["total"] == 2
assert len(data["items"]) == 2
# Ordered by version_number DESC
assert data["items"][0]["version_number"] == 2
assert data["items"][1]["version_number"] == 1
assert data["items"][0]["pipeline_metadata"]["model"] == "gpt-4o-mini"
assert data["items"][1]["pipeline_metadata"]["model"] == "gpt-4o"
@pytest.mark.asyncio
async def test_version_detail_returns_content_snapshot(client, db_engine):
"""GET /techniques/{slug}/versions/{version_number} returns full snapshot."""
seed = await _seed_full_data(db_engine)
detail_resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
page_id = detail_resp.json()["id"]
snapshot = {"title": "Old Title", "summary": "Old summary", "body_sections": {"intro": "Old intro"}}
metadata = {"model": "gpt-4o", "prompt_hash": "abc123"}
await _insert_version(db_engine, page_id, 1, snapshot, metadata)
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions/1")
assert resp.status_code == 200
data = resp.json()
assert data["version_number"] == 1
assert data["content_snapshot"] == snapshot
assert data["pipeline_metadata"] == metadata
assert "created_at" in data
@pytest.mark.asyncio
async def test_version_detail_404_for_nonexistent_version(client, db_engine):
"""GET /techniques/{slug}/versions/999 returns 404."""
seed = await _seed_full_data(db_engine)
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions/999")
assert resp.status_code == 404
assert "not found" in resp.json()["detail"].lower()
@pytest.mark.asyncio
async def test_versions_404_for_nonexistent_slug(client, db_engine):
"""GET /techniques/nonexistent-slug/versions returns 404."""
await _seed_full_data(db_engine)
resp = await client.get(f"{TECHNIQUES_URL}/nonexistent-slug-xyz/versions")
assert resp.status_code == 404
assert "not found" in resp.json()["detail"].lower()
@pytest.mark.asyncio
async def test_technique_detail_includes_version_count(client, db_engine):
"""GET /techniques/{slug} includes version_count field."""
seed = await _seed_full_data(db_engine)
# Initially version_count should be 0
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
assert resp.status_code == 200
data = resp.json()
assert data["version_count"] == 0
# Insert a version and check again
page_id = data["id"]
await _insert_version(db_engine, page_id, 1, {"title": "Snapshot"})
resp2 = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
assert resp2.status_code == 200
assert resp2.json()["version_count"] == 1


@@ -0,0 +1,495 @@
"""Integration tests for the review queue endpoints.
Tests run against a real PostgreSQL test database via httpx.AsyncClient.
Redis is mocked for mode toggle tests.
"""
import uuid
from unittest.mock import AsyncMock, patch
import pytest
import pytest_asyncio
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
from models import (
ContentType,
Creator,
KeyMoment,
KeyMomentContentType,
ProcessingStatus,
ReviewStatus,
SourceVideo,
)
# ── Helpers ──────────────────────────────────────────────────────────────────
QUEUE_URL = "/api/v1/review/queue"
STATS_URL = "/api/v1/review/stats"
MODE_URL = "/api/v1/review/mode"
def _moment_url(moment_id: str, action: str = "") -> str:
"""Build a moment action URL."""
base = f"/api/v1/review/moments/{moment_id}"
return f"{base}/{action}" if action else base
async def _seed_creator_and_video(db_engine) -> dict:
"""Seed a creator and source video, return their IDs."""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
creator = Creator(
name="TestCreator",
slug="test-creator",
folder_name="TestCreator",
)
session.add(creator)
await session.flush()
video = SourceVideo(
creator_id=creator.id,
filename="test-video.mp4",
file_path="TestCreator/test-video.mp4",
duration_seconds=600,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
session.add(video)
await session.flush()
result = {
"creator_id": creator.id,
"creator_name": creator.name,
"video_id": video.id,
"video_filename": video.filename,
}
await session.commit()
return result
async def _seed_moment(
db_engine,
video_id: uuid.UUID,
title: str = "Test Moment",
summary: str = "A test key moment",
start_time: float = 10.0,
end_time: float = 30.0,
review_status: ReviewStatus = ReviewStatus.pending,
) -> uuid.UUID:
"""Seed a single key moment and return its ID."""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
moment = KeyMoment(
source_video_id=video_id,
title=title,
summary=summary,
start_time=start_time,
end_time=end_time,
content_type=KeyMomentContentType.technique,
review_status=review_status,
)
session.add(moment)
await session.commit()
return moment.id
async def _seed_second_video(db_engine, creator_id: uuid.UUID) -> uuid.UUID:
"""Seed a second video for cross-video merge tests."""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
video = SourceVideo(
creator_id=creator_id,
filename="other-video.mp4",
file_path="TestCreator/other-video.mp4",
duration_seconds=300,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
session.add(video)
await session.commit()
return video.id
# ── Queue listing tests ─────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_list_queue_empty(client: AsyncClient):
"""Queue returns empty list when no moments exist."""
resp = await client.get(QUEUE_URL)
assert resp.status_code == 200
data = resp.json()
assert data["items"] == []
assert data["total"] == 0
@pytest.mark.asyncio
async def test_list_queue_with_moments(client: AsyncClient, db_engine):
"""Queue returns moments enriched with video filename and creator name."""
seed = await _seed_creator_and_video(db_engine)
await _seed_moment(db_engine, seed["video_id"], title="EQ Basics")
resp = await client.get(QUEUE_URL)
assert resp.status_code == 200
data = resp.json()
assert data["total"] == 1
item = data["items"][0]
assert item["title"] == "EQ Basics"
assert item["video_filename"] == seed["video_filename"]
assert item["creator_name"] == seed["creator_name"]
assert item["review_status"] == "pending"
@pytest.mark.asyncio
async def test_list_queue_filter_by_status(client: AsyncClient, db_engine):
"""Queue filters correctly by status query parameter."""
seed = await _seed_creator_and_video(db_engine)
await _seed_moment(db_engine, seed["video_id"], title="Pending One")
await _seed_moment(
db_engine, seed["video_id"], title="Approved One",
review_status=ReviewStatus.approved,
)
await _seed_moment(
db_engine, seed["video_id"], title="Rejected One",
review_status=ReviewStatus.rejected,
)
# Default filter: pending
resp = await client.get(QUEUE_URL)
assert resp.json()["total"] == 1
assert resp.json()["items"][0]["title"] == "Pending One"
# Approved
resp = await client.get(QUEUE_URL, params={"status": "approved"})
assert resp.json()["total"] == 1
assert resp.json()["items"][0]["title"] == "Approved One"
# All
resp = await client.get(QUEUE_URL, params={"status": "all"})
assert resp.json()["total"] == 3
# ── Stats tests ──────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_stats_counts(client: AsyncClient, db_engine):
"""Stats returns correct counts per review status."""
seed = await _seed_creator_and_video(db_engine)
await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.pending)
await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.pending)
await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.approved)
await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.rejected)
resp = await client.get(STATS_URL)
assert resp.status_code == 200
data = resp.json()
assert data["pending"] == 2
assert data["approved"] == 1
assert data["edited"] == 0
assert data["rejected"] == 1
# ── Approve tests ────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_approve_moment(client: AsyncClient, db_engine):
"""Approve sets review_status to approved."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(db_engine, seed["video_id"])
resp = await client.post(_moment_url(str(moment_id), "approve"))
assert resp.status_code == 200
assert resp.json()["review_status"] == "approved"
@pytest.mark.asyncio
async def test_approve_nonexistent_moment(client: AsyncClient):
"""Approve returns 404 for nonexistent moment."""
fake_id = str(uuid.uuid4())
resp = await client.post(_moment_url(fake_id, "approve"))
assert resp.status_code == 404
# ── Reject tests ─────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_reject_moment(client: AsyncClient, db_engine):
"""Reject sets review_status to rejected."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(db_engine, seed["video_id"])
resp = await client.post(_moment_url(str(moment_id), "reject"))
assert resp.status_code == 200
assert resp.json()["review_status"] == "rejected"
@pytest.mark.asyncio
async def test_reject_nonexistent_moment(client: AsyncClient):
"""Reject returns 404 for nonexistent moment."""
fake_id = str(uuid.uuid4())
resp = await client.post(_moment_url(fake_id, "reject"))
assert resp.status_code == 404
# ── Edit tests ───────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_edit_moment(client: AsyncClient, db_engine):
"""Edit updates fields and sets review_status to edited."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(db_engine, seed["video_id"], title="Original Title")
resp = await client.put(
_moment_url(str(moment_id)),
json={"title": "Updated Title", "summary": "New summary"},
)
assert resp.status_code == 200
data = resp.json()
assert data["title"] == "Updated Title"
assert data["summary"] == "New summary"
assert data["review_status"] == "edited"
@pytest.mark.asyncio
async def test_edit_nonexistent_moment(client: AsyncClient):
"""Edit returns 404 for nonexistent moment."""
fake_id = str(uuid.uuid4())
resp = await client.put(
_moment_url(fake_id),
json={"title": "Won't Work"},
)
assert resp.status_code == 404
# ── Split tests ──────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_split_moment(client: AsyncClient, db_engine):
"""Split creates two moments with correct timestamps."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(
db_engine, seed["video_id"],
title="Full Moment", start_time=10.0, end_time=30.0,
)
resp = await client.post(
_moment_url(str(moment_id), "split"),
json={"split_time": 20.0},
)
assert resp.status_code == 200
data = resp.json()
assert len(data) == 2
# First (original): [10.0, 20.0)
assert data[0]["start_time"] == 10.0
assert data[0]["end_time"] == 20.0
# Second (new): [20.0, 30.0]
assert data[1]["start_time"] == 20.0
assert data[1]["end_time"] == 30.0
assert "(split)" in data[1]["title"]
@pytest.mark.asyncio
async def test_split_invalid_time_below_start(client: AsyncClient, db_engine):
"""Split returns 400 when split_time is at or below start_time."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(
db_engine, seed["video_id"], start_time=10.0, end_time=30.0,
)
resp = await client.post(
_moment_url(str(moment_id), "split"),
json={"split_time": 10.0},
)
assert resp.status_code == 400
@pytest.mark.asyncio
async def test_split_invalid_time_above_end(client: AsyncClient, db_engine):
"""Split returns 400 when split_time is at or above end_time."""
seed = await _seed_creator_and_video(db_engine)
moment_id = await _seed_moment(
db_engine, seed["video_id"], start_time=10.0, end_time=30.0,
)
resp = await client.post(
_moment_url(str(moment_id), "split"),
json={"split_time": 30.0},
)
assert resp.status_code == 400
@pytest.mark.asyncio
async def test_split_nonexistent_moment(client: AsyncClient):
"""Split returns 404 for nonexistent moment."""
fake_id = str(uuid.uuid4())
resp = await client.post(
_moment_url(fake_id, "split"),
json={"split_time": 20.0},
)
assert resp.status_code == 404
# ── Merge tests ──────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_merge_moments(client: AsyncClient, db_engine):
"""Merge combines two moments: combined summary, min start, max end, target deleted."""
seed = await _seed_creator_and_video(db_engine)
m1_id = await _seed_moment(
db_engine, seed["video_id"],
title="First", summary="Summary A",
start_time=10.0, end_time=20.0,
)
m2_id = await _seed_moment(
db_engine, seed["video_id"],
title="Second", summary="Summary B",
start_time=25.0, end_time=35.0,
)
resp = await client.post(
_moment_url(str(m1_id), "merge"),
json={"target_moment_id": str(m2_id)},
)
assert resp.status_code == 200
data = resp.json()
assert data["start_time"] == 10.0
assert data["end_time"] == 35.0
assert "Summary A" in data["summary"]
assert "Summary B" in data["summary"]
# Target should be deleted — reject should 404
resp2 = await client.post(_moment_url(str(m2_id), "reject"))
assert resp2.status_code == 404
@pytest.mark.asyncio
async def test_merge_different_videos(client: AsyncClient, db_engine):
"""Merge returns 400 when moments are from different source videos."""
seed = await _seed_creator_and_video(db_engine)
m1_id = await _seed_moment(db_engine, seed["video_id"], title="Video 1 moment")
other_video_id = await _seed_second_video(db_engine, seed["creator_id"])
m2_id = await _seed_moment(db_engine, other_video_id, title="Video 2 moment")
resp = await client.post(
_moment_url(str(m1_id), "merge"),
json={"target_moment_id": str(m2_id)},
)
assert resp.status_code == 400
assert "different source videos" in resp.json()["detail"]
@pytest.mark.asyncio
async def test_merge_with_self(client: AsyncClient, db_engine):
"""Merge returns 400 when trying to merge a moment with itself."""
seed = await _seed_creator_and_video(db_engine)
m_id = await _seed_moment(db_engine, seed["video_id"])
resp = await client.post(
_moment_url(str(m_id), "merge"),
json={"target_moment_id": str(m_id)},
)
assert resp.status_code == 400
assert "itself" in resp.json()["detail"]
@pytest.mark.asyncio
async def test_merge_nonexistent_target(client: AsyncClient, db_engine):
"""Merge returns 404 when target moment does not exist."""
seed = await _seed_creator_and_video(db_engine)
m_id = await _seed_moment(db_engine, seed["video_id"])
resp = await client.post(
_moment_url(str(m_id), "merge"),
json={"target_moment_id": str(uuid.uuid4())},
)
assert resp.status_code == 404
@pytest.mark.asyncio
async def test_merge_nonexistent_source(client: AsyncClient):
"""Merge returns 404 when source moment does not exist."""
fake_id = str(uuid.uuid4())
resp = await client.post(
_moment_url(fake_id, "merge"),
json={"target_moment_id": str(uuid.uuid4())},
)
assert resp.status_code == 404
# ── Mode toggle tests ───────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_get_mode_default(client: AsyncClient):
"""Get mode returns config default when Redis has no value."""
mock_redis = AsyncMock()
mock_redis.get = AsyncMock(return_value=None)
mock_redis.aclose = AsyncMock()
with patch("routers.review.get_redis", return_value=mock_redis):
resp = await client.get(MODE_URL)
assert resp.status_code == 200
# Default from config is True
assert resp.json()["review_mode"] is True
@pytest.mark.asyncio
async def test_set_mode(client: AsyncClient):
"""Set mode writes to Redis and returns the new value."""
mock_redis = AsyncMock()
mock_redis.set = AsyncMock()
mock_redis.aclose = AsyncMock()
with patch("routers.review.get_redis", return_value=mock_redis):
resp = await client.put(MODE_URL, json={"review_mode": False})
assert resp.status_code == 200
assert resp.json()["review_mode"] is False
mock_redis.set.assert_called_once_with("chrysopedia:review_mode", "False")
@pytest.mark.asyncio
async def test_get_mode_from_redis(client: AsyncClient):
"""Get mode reads the value stored in Redis."""
mock_redis = AsyncMock()
mock_redis.get = AsyncMock(return_value="False")
mock_redis.aclose = AsyncMock()
with patch("routers.review.get_redis", return_value=mock_redis):
resp = await client.get(MODE_URL)
assert resp.status_code == 200
assert resp.json()["review_mode"] is False
@pytest.mark.asyncio
async def test_get_mode_redis_error_fallback(client: AsyncClient):
"""Get mode falls back to config default when Redis is unavailable."""
with patch("routers.review.get_redis", side_effect=ConnectionError("Redis down")):
resp = await client.get(MODE_URL)
assert resp.status_code == 200
# Falls back to config default (True)
assert resp.json()["review_mode"] is True
@pytest.mark.asyncio
async def test_set_mode_redis_error(client: AsyncClient):
"""Set mode returns 503 when Redis is unavailable."""
with patch("routers.review.get_redis", side_effect=ConnectionError("Redis down")):
resp = await client.put(MODE_URL, json={"review_mode": False})
assert resp.status_code == 503


@@ -0,0 +1,341 @@
"""Integration tests for the /api/v1/search endpoint.
Tests run against a real PostgreSQL test database via httpx.AsyncClient.
SearchService is mocked at the router dependency level so we can test
endpoint behavior without requiring external embedding API or Qdrant.
"""
from __future__ import annotations
import uuid
from unittest.mock import AsyncMock, MagicMock, patch
import pytest
import pytest_asyncio
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
from models import (
ContentType,
Creator,
KeyMoment,
KeyMomentContentType,
ProcessingStatus,
SourceVideo,
TechniquePage,
)
SEARCH_URL = "/api/v1/search"
# ── Seed helpers ─────────────────────────────────────────────────────────────
async def _seed_search_data(db_engine) -> dict:
"""Seed 2 creators, 3 technique pages, and 5 key moments for search tests.
Returns a dict with creator/technique IDs and metadata for assertions.
"""
session_factory = async_sessionmaker(
db_engine, class_=AsyncSession, expire_on_commit=False
)
async with session_factory() as session:
# Creators
creator1 = Creator(
name="Mr. Bill",
slug="mr-bill",
genres=["Bass music", "Glitch"],
folder_name="MrBill",
)
creator2 = Creator(
name="KOAN Sound",
slug="koan-sound",
genres=["Drum & bass", "Neuro"],
folder_name="KOANSound",
)
session.add_all([creator1, creator2])
await session.flush()
# Videos (needed for key moments FK)
video1 = SourceVideo(
creator_id=creator1.id,
filename="bass-design-101.mp4",
file_path="MrBill/bass-design-101.mp4",
duration_seconds=600,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
video2 = SourceVideo(
creator_id=creator2.id,
filename="reese-bass-deep-dive.mp4",
file_path="KOANSound/reese-bass-deep-dive.mp4",
duration_seconds=900,
content_type=ContentType.tutorial,
processing_status=ProcessingStatus.extracted,
)
session.add_all([video1, video2])
await session.flush()
# Technique pages
tp1 = TechniquePage(
creator_id=creator1.id,
title="Reese Bass Design",
slug="reese-bass-design",
topic_category="Sound design",
topic_tags=["bass", "textures"],
summary="How to create a classic reese bass",
)
tp2 = TechniquePage(
creator_id=creator2.id,
title="Granular Pad Textures",
slug="granular-pad-textures",
topic_category="Synthesis",
topic_tags=["granular", "pads"],
summary="Creating pad textures with granular synthesis",
)
tp3 = TechniquePage(
creator_id=creator1.id,
title="FM Bass Layering",
slug="fm-bass-layering",
topic_category="Synthesis",
topic_tags=["fm", "bass"],
summary="FM synthesis techniques for bass layering",
)
session.add_all([tp1, tp2, tp3])
await session.flush()
# Key moments
km1 = KeyMoment(
source_video_id=video1.id,
technique_page_id=tp1.id,
title="Setting up the Reese oscillator",
summary="Initial oscillator setup for reese bass",
start_time=10.0,
end_time=60.0,
content_type=KeyMomentContentType.technique,
)
km2 = KeyMoment(
source_video_id=video1.id,
technique_page_id=tp1.id,
title="Adding distortion to the Reese",
summary="Distortion processing chain for reese bass",
start_time=60.0,
end_time=120.0,
content_type=KeyMomentContentType.technique,
)
km3 = KeyMoment(
source_video_id=video2.id,
technique_page_id=tp2.id,
title="Granular engine settings",
summary="Dialing in granular engine parameters",
start_time=20.0,
end_time=80.0,
content_type=KeyMomentContentType.settings,
)
km4 = KeyMoment(
source_video_id=video1.id,
technique_page_id=tp3.id,
title="FM ratio selection",
summary="Choosing FM ratios for bass tones",
start_time=5.0,
end_time=45.0,
content_type=KeyMomentContentType.technique,
)
km5 = KeyMoment(
source_video_id=video2.id,
title="Outro and credits",
summary="End of the video",
start_time=800.0,
end_time=900.0,
content_type=KeyMomentContentType.workflow,
)
session.add_all([km1, km2, km3, km4, km5])
await session.commit()
return {
"creator1_id": str(creator1.id),
"creator1_name": creator1.name,
"creator1_slug": creator1.slug,
"creator2_id": str(creator2.id),
"creator2_name": creator2.name,
"tp1_slug": tp1.slug,
"tp1_title": tp1.title,
"tp2_slug": tp2.slug,
"tp3_slug": tp3.slug,
}
# ── Tests ────────────────────────────────────────────────────────────────────
@pytest.mark.asyncio
async def test_search_happy_path_with_mocked_service(client, db_engine):
"""Search endpoint returns mocked results with correct response shape."""
seed = await _seed_search_data(db_engine)
# Mock the SearchService.search method to return canned results
mock_result = {
"items": [
{
"type": "technique_page",
"title": "Reese Bass Design",
"slug": "reese-bass-design",
"summary": "How to create a classic reese bass",
"topic_category": "Sound design",
"topic_tags": ["bass", "textures"],
"creator_name": "Mr. Bill",
"creator_slug": "mr-bill",
"score": 0.95,
}
],
"total": 1,
"query": "reese bass",
"fallback_used": False,
}
with patch("routers.search.SearchService") as MockSvc:
instance = MockSvc.return_value
instance.search = AsyncMock(return_value=mock_result)
resp = await client.get(SEARCH_URL, params={"q": "reese bass"})
assert resp.status_code == 200
data = resp.json()
assert data["query"] == "reese bass"
assert data["total"] == 1
assert data["fallback_used"] is False
assert len(data["items"]) == 1
item = data["items"][0]
assert item["title"] == "Reese Bass Design"
assert item["slug"] == "reese-bass-design"
assert "score" in item
@pytest.mark.asyncio
async def test_search_empty_query_returns_empty(client, db_engine):
"""Empty search query returns empty results without hitting SearchService."""
await _seed_search_data(db_engine)
# With empty query, the search service returns empty results directly
mock_result = {
"items": [],
"total": 0,
"query": "",
"fallback_used": False,
}
with patch("routers.search.SearchService") as MockSvc:
instance = MockSvc.return_value
instance.search = AsyncMock(return_value=mock_result)
resp = await client.get(SEARCH_URL, params={"q": ""})
assert resp.status_code == 200
data = resp.json()
assert data["items"] == []
assert data["total"] == 0
assert data["query"] == ""
assert data["fallback_used"] is False
@pytest.mark.asyncio
async def test_search_keyword_fallback(client, db_engine):
"""When embedding fails, search uses keyword fallback and sets fallback_used=true."""
seed = await _seed_search_data(db_engine)
mock_result = {
"items": [
{
"type": "technique_page",
"title": "Reese Bass Design",
"slug": "reese-bass-design",
"summary": "How to create a classic reese bass",
"topic_category": "Sound design",
"topic_tags": ["bass", "textures"],
"creator_name": "",
"creator_slug": "",
"score": 0.0,
}
],
"total": 1,
"query": "reese",
"fallback_used": True,
}
with patch("routers.search.SearchService") as MockSvc:
instance = MockSvc.return_value
instance.search = AsyncMock(return_value=mock_result)
resp = await client.get(SEARCH_URL, params={"q": "reese"})
assert resp.status_code == 200
data = resp.json()
assert data["fallback_used"] is True
assert data["total"] >= 1
assert data["items"][0]["title"] == "Reese Bass Design"
@pytest.mark.asyncio
async def test_search_scope_filter(client, db_engine):
"""Search with scope=topics returns only technique_page type results."""
await _seed_search_data(db_engine)
mock_result = {
"items": [
{
"type": "technique_page",
"title": "FM Bass Layering",
"slug": "fm-bass-layering",
"summary": "FM synthesis techniques for bass layering",
"topic_category": "Synthesis",
"topic_tags": ["fm", "bass"],
"creator_name": "Mr. Bill",
"creator_slug": "mr-bill",
"score": 0.88,
}
],
"total": 1,
"query": "bass",
"fallback_used": False,
}
with patch("routers.search.SearchService") as MockSvc:
instance = MockSvc.return_value
instance.search = AsyncMock(return_value=mock_result)
resp = await client.get(SEARCH_URL, params={"q": "bass", "scope": "topics"})
assert resp.status_code == 200
data = resp.json()
# All items should be technique_page type when scope=topics
for item in data["items"]:
assert item["type"] == "technique_page"
# Verify the service was called with scope=topics
call_kwargs = instance.search.call_args
assert call_kwargs.kwargs.get("scope") == "topics"
@pytest.mark.asyncio
async def test_search_no_matching_results(client, db_engine):
"""Search with no matching results returns empty items list."""
await _seed_search_data(db_engine)
mock_result = {
"items": [],
"total": 0,
"query": "zzzznonexistent",
"fallback_used": True,
}
with patch("routers.search.SearchService") as MockSvc:
instance = MockSvc.return_value
instance.search = AsyncMock(return_value=mock_result)
resp = await client.get(SEARCH_URL, params={"q": "zzzznonexistent"})
assert resp.status_code == 200
data = resp.json()
assert data["items"] == []
assert data["total"] == 0

backend/worker.py Normal file

@@ -0,0 +1,32 @@
"""Celery application instance for the Chrysopedia pipeline.
Usage:
celery -A worker worker --loglevel=info
"""
from celery import Celery
from config import get_settings
settings = get_settings()
celery_app = Celery(
"chrysopedia",
broker=settings.redis_url,
backend=settings.redis_url,
)
celery_app.conf.update(
task_serializer="json",
result_serializer="json",
accept_content=["json"],
timezone="UTC",
enable_utc=True,
task_track_started=True,
task_acks_late=True,
worker_prefetch_multiplier=1,
)
# Import pipeline.stages so that @celery_app.task decorators register tasks.
# This import must come after celery_app is defined.
import pipeline.stages # noqa: E402, F401

chrysopedia-spec.md Normal file

@@ -0,0 +1,713 @@
# Chrysopedia — Project Specification
> **Etymology:** From *chrysopoeia* (the alchemical transmutation of base material into gold) + *encyclopedia* (an organized body of knowledge). Chrysopedia transmutes raw video content into refined, searchable production knowledge.
---
## 1. Project overview
### 1.1 Problem statement
Hundreds of hours of educational video content from electronic music producers sit on local storage — tutorials, livestreams, track breakdowns, and deep dives covering techniques in sound design, mixing, arrangement, synthesis, and more. This content is extremely valuable but nearly impossible to retrieve: videos are unsearchable, unchaptered, and undocumented. A 4-hour livestream may contain 6 minutes of actionable gold buried among tangents and chat interaction. The current retrieval method is "scrub through from memory and hope" — or more commonly, the knowledge is simply lost.
### 1.2 Solution
Chrysopedia is a self-hosted knowledge extraction and retrieval system that:
1. **Transcribes** video content using local Whisper inference
2. **Extracts** key moments, techniques, and insights using LLM analysis
3. **Classifies** content by topic, creator, plugins, and production stage
4. **Synthesizes** knowledge across multiple sources into coherent technique pages
5. **Serves** a fast, search-first web UI for mid-session retrieval
The system transforms raw video files into a browsable, searchable knowledge base with direct timestamp links back to source material.
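This flow can be sketched as plain Python functions passing a document dict from stage to stage. Function names and payload shapes here are illustrative assumptions, not the actual `pipeline.stages` API; in the real system these steps run as Celery tasks registered by `backend/worker.py`:

```python
# Sketch of the transcribe → extract → classify flow.
# Names and payload shapes are assumptions for illustration only.

def transcribe(video_path: str) -> dict:
    """Stage 1 (sketch): Whisper would emit timestamped segments here."""
    return {
        "video": video_path,
        "segments": [{"start": 0.0, "end": 4.0, "text": "stub segment"}],
    }

def extract(doc: dict) -> dict:
    """Stages 2-3 (sketch): segment the transcript and pull key moments."""
    doc["moments"] = [
        {"title": "stub moment", "start_time": s["start"], "end_time": s["end"]}
        for s in doc["segments"]
    ]
    return doc

def classify(doc: dict) -> dict:
    """Stage 4 (sketch): tag each extracted moment with a topic category."""
    for moment in doc["moments"]:
        moment["topic_category"] = "Sound design"
    return doc

def run_pipeline(video_path: str) -> dict:
    # Each stage receives the previous stage's output, as a Celery chain would.
    return classify(extract(transcribe(video_path)))
```

Synthesis (stage 5) and serving would consume the resulting document; they are omitted here for brevity.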
### 1.3 Design principles
- **Search-first.** The primary interaction is typing a query and getting results in seconds. Browse is secondary, for exploration.
- **Surgical retrieval.** A producer mid-session should be able to Alt+Tab, find the technique they need, absorb the key insight, and get back to their DAW in under 2 minutes.
- **Creator equity.** No artist is privileged in the UI. All creators get equal visual weight. Default sort is randomized.
- **Dual-axis navigation.** Content is accessible by Topic (technique/production stage) and by Creator (artist), with both paths being first-class citizens.
- **Incremental, not one-time.** The system must handle ongoing content additions, not just an initial batch.
- **Self-hosted and portable.** Packaged as a Docker Compose project, deployable on existing infrastructure.
### 1.4 Name and identity
- **Project name:** Chrysopedia
- **Suggested subdomain:** `chrysopedia.xpltd.co`
- **Docker project name:** `chrysopedia`
---
## 2. Content inventory and source material
### 2.1 Current state
- **Volume:** 100–500 video files
- **Creators:** 50+ distinct artists/producers
- **Formats:** Primarily MP4/MKV, mixed quality and naming conventions
- **Organization:** Folders per artist, filenames loosely descriptive
- **Location:** Local desktop storage (not yet on the hypervisor/NAS)
- **Content types:**
- Full-length tutorials (30 min–4 hrs, structured walkthroughs)
- Livestream recordings (long, unstructured, conversational)
- Track breakdowns / start-to-finish productions
### 2.2 Content characteristics
The audio track carries the vast majority of the value. Visual demonstrations (screen recordings of DAW work) are useful context but are not the primary extraction target. The transcript is the primary ore.
**Structured content** (tutorials, breakdowns) tends to have natural topic boundaries — the producer announces what they're about to cover, then demonstrates. These are easier to segment.
**Unstructured content** (livestreams) is chaotic: tangents, chat interaction, rambling, with gems appearing without warning. The extraction pipeline must handle both structured and unstructured content using semantic understanding, not just topic detection from speaker announcements.
---
## 3. Terminology
| Term | Definition |
|------|-----------|
| **Creator** | An artist, producer, or educator whose video content is in the system. Formerly "artist" — renamed for flexibility. |
| **Technique page** | The primary knowledge unit: a structured page covering one technique or concept from one creator, compiled from one or more source videos. |
| **Key moment** | A discrete, timestamped insight extracted from a video — a specific technique, setting, or piece of reasoning worth capturing. |
| **Topic** | A production domain or concept category (e.g., "sound design," "mixing," "snare design"). Organized hierarchically. |
| **Genre** | A broad musical style tag (e.g., "dubstep," "drum & bass," "halftime"). Stored as metadata on Creators, not on techniques. Used as a filter across all views. |
| **Source video** | An original video file that has been processed by the pipeline. |
| **Transcript** | The timestamped text output of Whisper processing a source video's audio. |
---
## 4. User experience
### 4.1 UX philosophy
The system is accessed via Alt+Tab from a DAW on the same desktop machine. Every design decision optimizes for speed of retrieval and minimal cognitive load. The interface should feel like a tool, not a destination.
**Primary access method:** Same machine, Alt+Tab to browser.
### 4.2 Landing page (Launchpad)
The landing page is a decision point, not a dashboard. Minimal, focused, fast.
**Layout (top to bottom):**
1. **Search bar** — prominent, full-width, with live typeahead (results appear after 2-3 characters). This is the primary interaction for most visits. Scope toggle tabs below the search input: `All | Topics | Creators`
2. **Two navigation cards** — side-by-side:
- **Topics** — "Browse by technique, production stage, or concept" with count of total techniques and categories
- **Creators** — "Browse by artist, filterable by genre" with count of total creators and genres
3. **Recently added** — a short list of the most recently processed/published technique pages with creator name, topic tag, and relative timestamp
**Future feature (not v1):** Trending / popular section alongside recently added, driven by view counts and cross-reference frequency.
### 4.3 Live search (typeahead)
The search bar is the primary interface. Behavior:
- Results begin appearing after 2-3 characters typed
- Scope toggle: `All | Topics | Creators` — filters what types of results appear
- **"All" scope** groups results by type:
- **Topics** — technique pages matching the query, showing title, creator name(s), parent topic tag
- **Key moments** — individual timestamped insights matching the query, showing moment title, creator, source file, and timestamp. Clicking jumps to the technique page (or eventually direct to the video moment)
- **Creators** — creator names matching the query
- **"Topics" scope** — shows only technique pages
- **"Creators" scope** — shows only creator matches
- Genre filter is accessible on Creators scope and cross-filters Topics scope (using creator-level genre metadata)
- Search is semantic where possible (powered by Qdrant vector search), with keyword fallback
### 4.4 Technique page (A+C hybrid format)
The core content unit. Each technique page covers one technique or concept from one creator. The format adapts by content type but follows a consistent structure.
**Layout (top to bottom):**
1. **Header:**
- Topic tags (e.g., "sound design," "drums," "snare")
- Technique title (e.g., "Snare design")
- Creator name
- Meta line: "Compiled from N sources · M key moments · Last updated [date]"
- Source quality warning (amber banner) if content came from an unstructured livestream
2. **Study guide prose (Section A):**
- Organized by sub-aspects of the technique (e.g., "Layer construction," "Saturation & character," "Mix context")
- Rich prose capturing:
- The specific technique/method described (highest priority)
- Exact settings, plugins, and parameters when the creator was *teaching* the setting (not incidental use)
- The reasoning/philosophy behind choices when the creator explains *why*
- Signal chain blocks rendered in monospace when a creator walks through a routing chain
- Direct quotes of creator opinions/warnings when they add value (e.g., "He says it 'smears the transient into mush'")
3. **Key moments index (Section C):**
- Compact list of individual timestamped insights
- Each row: moment title, source video filename, clickable timestamp
- Sorted chronologically within each source video
4. **Related techniques:**
- Links to related technique pages — same technique by other creators, adjacent techniques by the same creator, general/cross-creator technique pages
- Renders as clickable pill-shaped tags
5. **Plugins referenced:**
- List of all plugins/tools mentioned in the technique page
- Each is a clickable tag that could lead to "all techniques referencing this plugin" (future: dedicated plugin pages)
**Content type adaptation:**
- **Technique-heavy content** (sound design, specific methods): Full A+C treatment with signal chains, plugin details, parameter specifics
- **Philosophy/workflow content** (mixdown approach, creative process): More prose-heavy, fewer signal chain blocks, but same overall structure. These pages are still browsable but also serve as rich context for future RAG/chat retrieval
- **Livestream-sourced content:** Amber warning banner noting source quality. Timestamps may land in messy context with tangents nearby
### 4.5 Creators browse page
Accessed from the landing page "Creators" card.
**Layout:**
- Page title: "Creators" with total count
- Filter input: type-to-narrow the list
- Genre filter pills: `All genres | Bass music | Drum & bass | Dubstep | Halftime | House | IDM | Neuro | Techno | ...` — clicking a genre filters the list to creators tagged with that genre
- Sort options: Randomized (default, re-shuffled on every page load), Alphabetical, View count
- Creator list: flat, equal-weight rows. Each row shows:
- Creator name
- Genre tags (multiple allowed)
- Technique count
- Video count
- View count (sum of activity across all content derived from this creator)
- Clicking a row navigates to that creator's detail page (list of all their technique pages)
**Default sort is randomized on every page load** to prevent discovery bias. Users can toggle to alphabetical or sort by view count.
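A minimal sketch of this ordering logic (illustrative only — the function name and dict keys are assumptions, not a fixed API):

```python
import random

def order_creators(creators, sort="random"):
    """Return creators in the requested display order.

    "random" re-shuffles on every call (i.e., every page load), so no
    creator is privileged by default. "alpha" and "views" are the two
    opt-in sorts described above.
    """
    if sort == "random":
        # random.sample returns a new shuffled list without mutating input
        return random.sample(creators, k=len(creators))
    if sort == "alpha":
        return sorted(creators, key=lambda c: c["name"].lower())
    if sort == "views":
        return sorted(creators, key=lambda c: c["view_count"], reverse=True)
    raise ValueError(f"unknown sort: {sort}")
```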
### 4.6 Topics browse page
Accessed from the landing page "Topics" card.
**Layout:**
- Page title: "Topics" with total technique count
- Filter input: type-to-narrow
- Genre filter pills (uses creator-level genre metadata to filter): show only techniques from creators tagged with the selected genre
- **Two-level hierarchy displayed:**
- **Top-level categories:** Sound design, Mixing, Synthesis, Arrangement, Workflow, Mastering
- **Sub-topics within each:** clicking a top-level category expands or navigates to show sub-topics (e.g., Sound Design → Bass, Drums, Pads, Leads, FX, Foley; Drums → Kick, Snare, Hi-hat, Percussion)
- Each sub-topic shows: technique count, number of creators covering it
- Clicking a sub-topic shows all technique pages in that category, filterable by creator and genre
### 4.7 Search results page
For complex queries that go beyond typeahead (e.g., hitting Enter after typing a full query).
**Layout:**
- Search bar at top (retains query)
- Scope tabs: `All results (N) | Techniques (N) | Key moments (N) | Creators (N)`
- Results split into two tiers:
- **Technique pages** — first-class results with title, creator, summary snippet, tags, moment count, plugin list
- **Also mentioned in** — cross-references where the search term appears inside other technique pages (e.g., searching "snare" surfaces "drum bus processing" because it mentions snare bus techniques)
---
## 5. Taxonomy and topic hierarchy
### 5.1 Top-level categories
These are broad production stages/domains. They should cover the full scope of music production education:
| Category | Description | Example sub-topics |
|----------|-------------|-------------------|
| Sound design | Creating and shaping sounds from scratch or samples | Bass, drums (kick, snare, hi-hat, percussion), pads, leads, FX, foley, vocals, textures |
| Mixing | Balancing, processing, and spatializing elements in a session | EQ, compression, bus processing, reverb/delay, stereo imaging, gain staging, automation |
| Synthesis | Methods of generating sound | FM, wavetable, granular, additive, subtractive, modular, physical modeling |
| Arrangement | Structuring a track from intro to outro | Song structure, transitions, tension/release, energy flow, breakdowns, drops |
| Workflow | Creative process, session management, productivity | DAW setup, templates, creative process, collaboration, file management, resampling |
| Mastering | Final stage processing for release | Limiting, stereo width, loudness, format delivery, referencing |
### 5.2 Sub-topic management
Sub-topics are not rigidly pre-defined. The extraction pipeline proposes sub-topic tags during classification, and the taxonomy grows organically as content is processed. However, the system maintains a **canonical tag list** that the LLM references during classification to ensure consistency (e.g., always "snare" not sometimes "snare drum" and sometimes "snare design").
The canonical tag list is editable by the administrator and should be stored as a configuration file that the pipeline references. New tags can be proposed by the pipeline and queued for admin approval, or auto-added if they fit within an existing top-level category.
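The normalization step the LLM classification relies on can be enforced in code as well. A sketch of alias resolution against the canonical list (field names follow the Tag entity in section 6.1; the function itself is illustrative):

```python
def normalize_tag(raw, canonical_tags):
    """Map an LLM-proposed tag onto the canonical tag list.

    canonical_tags: list of dicts shaped like the Tag entity, e.g.
    {"name": "snare", "aliases": ["snare drum", "snare design"]}.
    Returns the canonical name, or None if no tag matches — a None
    result would queue the proposal for admin approval.
    """
    needle = raw.strip().lower()
    for tag in canonical_tags:
        if needle == tag["name"].lower():
            return tag["name"]
        if needle in (alias.lower() for alias in tag.get("aliases", [])):
            return tag["name"]
    return None
```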
### 5.3 Genre taxonomy
Genres are broad, general-level tags. Sub-genre classification is explicitly out of scope to avoid complexity.
**Initial genre set (expandable):**
Bass music, Drum & bass, Dubstep, Halftime, House, Techno, IDM, Glitch, Downtempo, Neuro, Ambient, Experimental, Cinematic
**Rules:**
- Genres are metadata on Creators, not on techniques
- A Creator can have multiple genre tags
- Genre is available as a filter on both the Creators browse page and the Topics browse page (filtering Topics by genre shows techniques from creators tagged with that genre)
- Genre tags are assigned during initial creator setup (manually or LLM-suggested based on content analysis) and can be edited by the administrator
---
## 6. Data model
### 6.1 Core entities
**Creator**
```
id UUID
name string (display name, e.g., "KOAN Sound")
slug string (URL-safe, e.g., "koan-sound")
genres string[] (e.g., ["glitch hop", "neuro", "bass music"])
folder_name string (matches the folder name on disk for source mapping)
view_count integer (aggregated from child technique page views)
created_at timestamp
updated_at timestamp
```
**Source Video**
```
id UUID
creator_id FK → Creator
filename string (original filename)
file_path string (path on disk)
duration_seconds integer
content_type enum: tutorial | livestream | breakdown | short_form
transcript_path string (path to transcript JSON)
processing_status enum: pending | transcribed | extracted | reviewed | published
created_at timestamp
updated_at timestamp
```
**Transcript Segment**
```
id UUID
source_video_id FK → Source Video
start_time float (seconds)
end_time float (seconds)
text text
segment_index integer (order within video)
topic_label string (LLM-assigned topic label for this segment)
```
**Key Moment**
```
id UUID
source_video_id FK → Source Video
technique_page_id FK → Technique Page (nullable until assigned)
title string (e.g., "Three-layer snare construction")
summary text (1-3 sentence description)
start_time float (seconds)
end_time float (seconds)
content_type enum: technique | settings | reasoning | workflow
plugins string[] (plugin names detected)
review_status enum: pending | approved | edited | rejected
raw_transcript text (the original transcript text for this segment)
created_at timestamp
updated_at timestamp
```
**Technique Page**
```
id UUID
creator_id FK → Creator
title string (e.g., "Snare design")
slug string (URL-safe)
topic_category string (top-level: "sound design")
topic_tags string[] (sub-topics: ["drums", "snare", "layering", "saturation"])
summary text (synthesized overview paragraph)
body_sections JSONB (structured prose sections with headings)
signal_chains JSONB[] (structured signal chain representations)
plugins string[] (all plugins referenced across all moments)
source_quality enum: structured | mixed | unstructured (derived from source video types)
view_count integer
review_status enum: draft | reviewed | published
created_at timestamp
updated_at timestamp
```
**Related Technique Link**
```
id UUID
source_page_id FK → Technique Page
target_page_id FK → Technique Page
relationship enum: same_technique_other_creator | same_creator_adjacent | general_cross_reference
```
**Tag (canonical)**
```
id UUID
name string (e.g., "snare")
category string (parent top-level category: "sound design")
aliases string[] (alternative phrasings the LLM should normalize: ["snare drum", "snare design"])
```
### 6.2 Storage layer
| Store | Purpose | Technology |
|-------|---------|------------|
| Relational DB | All structured data (creators, videos, moments, technique pages, tags) | PostgreSQL (preferred) or SQLite for initial simplicity |
| Vector DB | Semantic search embeddings for transcripts, key moments, and technique page content | Qdrant (already running on hypervisor) |
| File store | Raw transcript JSON files, source video reference metadata | Local filesystem on hypervisor, organized by creator slug |
### 6.3 Vector embeddings
The following content gets embedded in Qdrant for semantic search:
- Key moment summaries (with metadata: creator, topic, timestamp, source video)
- Technique page summaries and body sections
- Transcript segments (for future RAG/chat retrieval)
Embedding model: configurable. Can use a local model via Ollama (e.g., `nomic-embed-text`) or an API-based model. The embedding endpoint should be a configurable URL, same pattern as the LLM endpoint.
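As a sketch of how an embedded key moment might be shaped for Qdrant upsert (payload field names mirror the Key Moment entity in section 6.1; the exact point schema is an implementation detail, not mandated here):

```python
import uuid

def moment_to_qdrant_point(moment, vector):
    """Shape a key moment as a Qdrant point: an id, the embedding
    vector, and a metadata payload used for filtering search results
    by creator, topic, and source."""
    return {
        "id": str(uuid.uuid4()),
        "vector": vector,
        "payload": {
            "kind": "key_moment",
            "creator": moment["creator"],
            "topic_tags": moment["topic_tags"],
            "source_video": moment["source_video"],
            "start_time": moment["start_time"],
            "text": moment["summary"],
        },
    }
```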
---
## 7. Pipeline architecture
### 7.1 Infrastructure topology
```
Desktop (RTX 4090) Hypervisor (Docker host)
┌─────────────────────┐ ┌─────────────────────────────────┐
│ Video files (local) │ │ Chrysopedia Docker Compose │
│ Whisper (local GPU) │──2.5GbE──────▶│ ├─ API / pipeline service │
│ Output: transcript │ (text only) │ ├─ Web UI │
│ JSON files │ │ ├─ PostgreSQL │
└─────────────────────┘ │ ├─ Qdrant (existing) │
│ └─ File store │
└────────────┬────────────────────┘
│ API calls (text)
┌─────────────▼────────────────────┐
│ Friend's DGX Sparks │
│ Qwen via Open WebUI API │
│ (2Gb fiber, high uptime) │
└──────────────────────────────────┘
```
**Bandwidth analysis:** Transcript JSON files are 200-500KB each. At 50Mbit upload, the entire library's transcripts could transfer in under a minute. The bandwidth constraint is irrelevant for this workload. The only large files (videos) stay on the desktop.
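The "under a minute" claim checks out on a back-of-envelope basis (assuming ~300 transcripts at the 500KB upper bound):

```python
# Worst case: every transcript at the upper size bound, full library at once
transcripts = 300
size_bits = 500 * 1024 * 8       # 500 KB per transcript, in bits
uplink_bps = 50 * 1_000_000      # 50 Mbit/s upload
seconds = transcripts * size_bits / uplink_bps   # ~25 seconds
```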
**Future centralization:** The Docker Compose project should be structured so that when all hardware is co-located, the only change is config (moving Whisper into the compose stack and pointing file paths to local storage). No architectural rewrite.
### 7.2 Processing stages
#### Stage 1: Audio extraction and transcription (Desktop)
**Tool:** Whisper large-v3 running locally on RTX 4090
**Input:** Video file (MP4/MKV)
**Process:**
1. Extract audio track from video (ffmpeg → WAV or direct pipe)
2. Run Whisper with word-level or segment-level timestamps
3. Output: JSON file with timestamped transcript
**Output format:**
```json
{
"source_file": "Skope — Sound Design Masterclass pt2.mp4",
"creator_folder": "Skope",
"duration_seconds": 7243,
"segments": [
{
"start": 0.0,
"end": 4.52,
"text": "Hey everyone welcome back to part two...",
"words": [
{"word": "Hey", "start": 0.0, "end": 0.28},
{"word": "everyone", "start": 0.32, "end": 0.74}
]
}
]
}
```
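Step 1 of the process above can be sketched as an ffmpeg invocation (standard ffmpeg flags; the wrapper function is illustrative):

```python
def ffmpeg_audio_cmd(video_path, wav_path):
    """Build the ffmpeg command for Stage 1 audio extraction.

    Outputs 16 kHz mono WAV, which is Whisper's native input rate,
    so no resampling happens at transcription time.
    """
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vn",            # drop the video stream entirely
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # 16 kHz sample rate
        wav_path,
    ]
```

The returned list can be passed to `subprocess.run` on the desktop, keeping the heavy video data local as described.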
**Performance estimate:** Whisper large-v3 on a 4090 processes audio at roughly 10-20x real-time. A 2-hour video takes ~6-12 minutes to transcribe. For 300 videos averaging 1.5 hours each, the initial transcription pass is roughly 15-40 hours of GPU time.
#### Stage 2: Transcript segmentation (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks, or local Ollama as fallback)
**Input:** Full timestamped transcript JSON
**Process:** The LLM analyzes the transcript to identify topic boundaries — points where the creator shifts from one subject to another. Output is a segmented transcript with topic labels per segment.
**This stage can use a lighter model** if needed (segmentation is more mechanical than extraction). However, for simplicity in v1, use the same model endpoint as stages 3-5.
#### Stage 3: Key moment extraction (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Individual transcript segments from Stage 2
**Process:** The LLM reads each segment and identifies actionable insights. The extraction prompt should distinguish between:
- **Instructional content** (the creator is *teaching* something) → extract as a key moment
- **Incidental content** (the creator is *using* a tool without explaining it) → skip
- **Philosophical/reasoning content** (the creator explains *why* they make a choice) → extract with `content_type: reasoning`
- **Settings/parameters** (specific plugin settings, values, configurations being demonstrated) → extract with `content_type: settings`
**Extraction rule for plugin detail:** Capture plugin names and settings when the creator is *teaching* the setting — spending time explaining why they chose it, what it does, how to configure it. Skip incidental plugin usage (a plugin is visible but not discussed).
#### Stage 4: Classification and tagging (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Extracted key moments from Stage 3
**Process:** Each moment is classified with:
- Top-level topic category
- Sub-topic tags (referencing the canonical tag list)
- Plugin names (normalized to canonical names)
- Content type classification
The LLM is provided the canonical tag list as context and instructed to use existing tags where possible, proposing new tags only when no existing tag fits.
#### Stage 5: Synthesis (Hypervisor → LLM)
**Tool:** LLM (Qwen on DGX Sparks)
**Input:** All approved/published key moments for a given creator + topic combination
**Process:** When multiple key moments from the same creator cover overlapping or related topics, the synthesis stage merges them into a coherent technique page. This includes:
- Writing the overview summary paragraph
- Organizing body sections by sub-aspect
- Generating signal chain blocks where applicable
- Identifying related technique pages for cross-linking
- Compiling the plugin reference list
This stage runs whenever new key moments are approved for a creator+topic combination that already has a technique page (updating it), or when enough moments accumulate to warrant a new page.
### 7.3 LLM endpoint configuration
The pipeline talks to an **OpenAI-compatible API endpoint** (which both Ollama and Open WebUI expose). The LLM is not hardcoded — it's configured via environment variables:
```
LLM_API_URL=https://friend-openwebui.example.com/api
LLM_API_KEY=sk-...
LLM_MODEL=qwen2.5-72b
LLM_FALLBACK_URL=http://localhost:11434/v1 # local Ollama
LLM_FALLBACK_MODEL=qwen2.5:14b-q8_0
```
The pipeline should attempt the primary endpoint first and fall back to the local model if the primary is unavailable.
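The retry policy is simple enough to state as code (a sketch — `primary` and `fallback` stand for client callables wrapping the two configured endpoints):

```python
def call_llm(prompt, primary, fallback):
    """Attempt the primary endpoint first; on any failure (timeout,
    connection error, HTTP error raised by the client), retry the
    same prompt against the local fallback model."""
    try:
        return primary(prompt)
    except Exception:
        return fallback(prompt)
```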
### 7.4 Embedding endpoint configuration
Same configurable pattern:
```
EMBEDDING_API_URL=http://localhost:11434/v1
EMBEDDING_MODEL=nomic-embed-text
```
### 7.5 Processing estimates for initial seeding
| Stage | Per video | 300 videos total |
|-------|----------|-----------------|
| Transcription (Whisper, 4090) | 6-12 min | 30-60 hours |
| Segmentation (LLM) | ~1 min | ~5 hours |
| Extraction (LLM) | ~2 min | ~10 hours |
| Classification (LLM) | ~30 sec | ~2.5 hours |
| Synthesis (LLM) | ~2 min per technique page | Varies by page count |
**Recommendation:** Tell the DGX Sparks friend to expect a weekend of sustained processing for the initial seed. The pipeline must be **resumable** — if it drops, it picks up from the last successfully processed video/stage, not from the beginning.
---
## 8. Review and approval workflow
### 8.1 Modes
The system supports two modes:
- **Review mode (initial calibration):** All extracted key moments enter a review queue. The administrator reviews, edits, approves, or rejects each moment before it's published.
- **Auto mode (post-calibration):** Extracted moments are published automatically. The review queue still exists but functions as an audit log rather than a gate.
The mode is a system-level toggle. The transition from review to auto mode happens when the administrator is satisfied with extraction quality — typically after reviewing the first several videos and tuning prompts.
### 8.2 Review queue interface
The review UI is part of the Chrysopedia web application (an admin section, not a separate tool).
**Queue view:**
- Counts: pending, approved, edited, rejected
- Filter tabs: Pending | Approved | Edited | Rejected
- Items organized by source video (review all moments from one video in sequence for context)
**Individual moment review:**
- Extracted moment: title, timestamp range, summary, tags, plugins detected
- Raw transcript segment displayed alongside for comparison
- Five actions:
- **Approve** — publish as-is
- **Edit & approve** — modify summary, tags, timestamp, or plugins, then publish
- **Split** — the moment actually contains two distinct insights; split into two separate moments
- **Merge with adjacent** — the system over-segmented; combine with the next or previous moment
- **Reject** — not a key moment; discard
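The "Merge with adjacent" action can be sketched as follows (field names follow the Key Moment entity in section 6.1; the merge policy shown — concatenate summaries, union plugins, span both timestamp ranges — is one reasonable choice, not a mandate):

```python
def merge_moments(a, b):
    """Combine two over-segmented adjacent moments into one.

    Both moments must come from the same source video. The merged
    moment keeps the earlier moment's title, spans both timestamp
    ranges, and is marked "edited" for the audit trail.
    """
    assert a["source_video_id"] == b["source_video_id"]
    first, second = sorted((a, b), key=lambda m: m["start_time"])
    return {
        "source_video_id": first["source_video_id"],
        "title": first["title"],
        "summary": first["summary"] + " " + second["summary"],
        "start_time": first["start_time"],
        "end_time": max(first["end_time"], second["end_time"]),
        "plugins": sorted(set(first["plugins"]) | set(second["plugins"])),
        "review_status": "edited",
    }
```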
### 8.3 Prompt tuning
The extraction prompts (stages 2-5) should be stored as editable configuration, not hardcoded. If review reveals systematic issues (e.g., the LLM consistently misclassifies mixing techniques as sound design), the administrator should be able to:
1. Edit the prompt templates
2. Re-run extraction on specific videos or all videos
3. Review the new output
This is the "calibration loop" — run pipeline, review output, tune prompts, re-run, repeat until quality is sufficient for auto mode.
---
## 9. New content ingestion workflow
### 9.1 Adding new videos
The ongoing workflow for adding new content after initial seeding:
1. **Drop file:** Place new video file(s) in the appropriate creator folder on the desktop (or create a new folder for a new creator)
2. **Trigger transcription:** Run the Whisper transcription stage on the new file(s). This could be a manual CLI command, a watched-folder daemon, or an n8n workflow trigger.
3. **Ship transcript:** Transfer the transcript JSON to the hypervisor (automated via the pipeline)
4. **Process:** Stages 2-5 run automatically on the new transcript
5. **Review or auto-publish:** Depending on mode, moments enter the review queue or publish directly
6. **Synthesis update:** If the new content covers a topic that already has a technique page for this creator, the synthesis stage updates the existing page. If it's a new topic, a new technique page is created.
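The resumability requirement from section 7.5 applies here too: each video advances through the `processing_status` enum (section 6.1) one stage at a time. One way to keep that safe is a small transition table (a sketch; the table itself is an assumption layered on the spec's enum):

```python
# Legal processing_status transitions. A resumed pipeline reads the
# current status and only runs the stage that produces the next one,
# so a crash mid-batch never re-runs or skips a stage.
TRANSITIONS = {
    "pending": {"transcribed"},
    "transcribed": {"extracted"},
    "extracted": {"reviewed", "published"},  # straight to published in auto mode
    "reviewed": {"published"},
    "published": set(),                      # terminal
}

def advance(status, new_status):
    if new_status not in TRANSITIONS[status]:
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status
```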
### 9.2 Adding new creators
When a new creator's content is added:
1. Create a new folder on the desktop with the creator's name
2. Add video files
3. The pipeline detects the new folder name and creates a Creator record
4. Genre tags can be auto-suggested by the LLM based on content analysis, or manually assigned by the administrator
5. Process videos as normal
### 9.3 Watched folder (optional, future)
For maximum automation, a filesystem watcher on the desktop could detect new video files and automatically trigger the transcription pipeline. This is a nice-to-have for v2, not a v1 requirement. In v1, transcription is triggered manually.
---
## 10. Deployment and infrastructure
### 10.1 Docker Compose project
The entire Chrysopedia stack (excluding Whisper, which runs on the desktop GPU) is packaged as a single `docker-compose.yml`:
```yaml
# Indicative structure — not final
services:
chrysopedia-api:
# FastAPI or similar — handles pipeline orchestration, API endpoints
chrysopedia-web:
# Web UI — React, Svelte, or similar SPA
chrysopedia-db:
# PostgreSQL
chrysopedia-qdrant:
# Only if not using the existing Qdrant instance
chrysopedia-worker:
# Background job processor for pipeline stages 2-5
```
### 10.2 Existing infrastructure integration
**IMPORTANT:** The implementing agent should reference **XPLTD Lore** when making deployment decisions. This includes:
- Existing Docker conventions, naming patterns, and network configuration
- The hypervisor's current resource allocation and available capacity (~60 containers already running)
- Existing Qdrant instance (may be shared or a new collection created)
- Existing n8n instance (potential for workflow triggers)
- Storage paths and volume mount conventions
- Any reverse proxy or DNS configuration patterns
Do not assume infrastructure details — consult XPLTD Lore for how applications are typically deployed in this environment.
### 10.3 Whisper on desktop
Whisper runs separately on the desktop with the RTX 4090. It is NOT part of the Docker Compose stack (for now). It should be packaged as a simple Python script or lightweight container that:
1. Accepts a video file path (or watches a directory)
2. Extracts audio via ffmpeg
3. Runs Whisper large-v3
4. Outputs transcript JSON
5. Ships the JSON to the hypervisor (SCP, rsync, or API upload to the Chrysopedia API)
**Future centralization:** When all hardware is co-located, Whisper can be added to the Docker Compose stack with GPU passthrough, and the video files can be mounted directly. The pipeline should be designed so this migration is a config change, not a rewrite.
### 10.4 Network considerations
- Desktop ↔ Hypervisor: 2.5GbE (ample for transcript JSON transfer)
- Hypervisor ↔ DGX Sparks: Internet (50Mbit up from Chrysopedia side, 2Gb fiber on the DGX side). Transcript text payloads are tiny; this is not a bottleneck.
- Web UI: Served from hypervisor, accessed via local network (same machine Alt+Tab) or from other devices on the network. Eventually shareable with external users.
---
## 11. Technology recommendations
These are recommendations, not mandates. The implementing agent should evaluate alternatives based on current best practices and XPLTD Lore.
| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Transcription | Whisper large-v3 (local, 4090) | Best accuracy, local processing keeps media files on-network |
| LLM inference | Qwen via Open WebUI API (DGX Sparks) | Free, powerful, high uptime. Ollama on 4090 as fallback |
| Embedding | nomic-embed-text via Ollama (local) | Good quality, runs easily alongside other local models |
| Vector DB | Qdrant | Already running on hypervisor |
| Relational DB | PostgreSQL | Robust, good JSONB support for flexible schema fields |
| API framework | FastAPI (Python) | Strong async support, good for pipeline orchestration |
| Web UI | React or Svelte SPA | Fast, component-based, good for search-heavy UIs |
| Background jobs | Celery with Redis, or a simpler task queue | Pipeline stages 2-5 run as background jobs |
| Audio extraction | ffmpeg | Universal, reliable |
---
## 12. Open questions and future considerations
These items are explicitly out of scope for v1 but should be considered in architectural decisions:
### 12.1 Chat / RAG retrieval
Not required for v1, but the system should be **architected to support it easily.** The Qdrant embeddings and structured knowledge base provide the foundation. A future chat interface could use the Qwen instance (or any compatible LLM) with RAG over the Chrysopedia knowledge base to answer natural language questions like "How does Skope approach snare design differently from Au5?"
### 12.2 Direct video playback
v1 provides file paths and timestamps ("Skope — Sound Design Masterclass pt2.mp4 @ 1:42:30"). Future versions could embed video playback directly in the web UI, jumping to the exact timestamp. This requires the video files to be network-accessible from the web UI, which depends on centralizing storage.
### 12.3 Access control
Not needed for v1. The system is initially for personal/local use. Future versions may add authentication for sharing with friends or external users. The architecture should not preclude this (e.g., don't hardcode single-user assumptions into the data model).
### 12.4 Multi-user features
Eventually: user-specific bookmarks, personal notes on technique pages, view history, and personalized "trending" based on individual usage patterns.
### 12.5 Content types beyond video
The extraction pipeline is fundamentally transcript-based. It could be extended to process podcast episodes, audio-only recordings, or even written tutorials/blog posts with minimal architectural changes.
### 12.6 Plugin knowledge base
Plugins referenced across all technique pages could be promoted to a first-class entity with their own browse page: "All techniques that reference Serum" or "Signal chains using Pro-Q 3." The data model already captures plugin references — this is primarily a UI feature.
---
## 13. Success criteria
The system is successful when:
1. **A producer mid-session can find a specific technique in under 30 seconds** — from Alt+Tab to reading the key insight
2. **The extraction pipeline correctly identifies 80%+ of key moments** without human intervention (post-calibration)
3. **New content can be added and processed within hours**, not days
4. **The knowledge base grows more useful over time** — cross-references and related techniques create a web of connected knowledge that surfaces unexpected insights
5. **The system runs reliably on existing infrastructure** without requiring significant new hardware or ongoing cloud costs
---
## 14. Implementation phases
### Phase 1: Foundation
- Set up Docker Compose project with PostgreSQL, API service, and web UI skeleton
- Implement Whisper transcription script for desktop
- Build transcript ingestion endpoint on the API
- Implement basic Creator and Source Video management
### Phase 2: Extraction pipeline
- Implement stages 2-5 (segmentation, extraction, classification, synthesis)
- Build the review queue UI
- Process a small batch of videos (5-10) for calibration
- Tune extraction prompts based on review feedback
### Phase 3: Knowledge UI
- Build the search-first web UI: landing page, live search, technique pages
- Implement Qdrant integration for semantic search
- Build Creators and Topics browse pages
- Implement related technique cross-linking
### Phase 4: Initial seeding
- Process the full video library through the pipeline
- Review and approve extractions (transitioning toward auto mode)
- Populate the canonical tag list and genre taxonomy
- Build out cross-references and related technique links
### Phase 5: Polish and ongoing
- Transition to auto mode for new content
- Implement view count tracking
- Optimize search ranking and relevance
- Begin sharing with trusted external users
---
*This specification was developed through collaborative ideation between the project owner and Claude. The implementing agent should treat this as a comprehensive guide while exercising judgment on technical implementation details, consulting XPLTD Lore for infrastructure conventions, and adapting to discoveries made during development.*


@ -0,0 +1,48 @@
# Canonical tags — 7 top-level production categories
# Sub-topics grow organically during pipeline extraction
# Order follows the natural production learning arc:
# setup → theory → create sounds → structure → polish → deliver
categories:
- name: Workflow
description: Creative process, session management, productivity
sub_topics: [daw setup, templates, creative process, collaboration, file management, resampling]
- name: Music Theory
description: Harmony, scales, chord progressions, and musical structure
sub_topics: [harmony, chord progressions, scales, rhythm, time signatures, melody, counterpoint, song keys]
- name: Sound Design
description: Creating and shaping sounds from scratch or samples
sub_topics: [bass, drums, kick, snare, hi-hat, percussion, pads, leads, fx, foley, vocals, textures]
- name: Synthesis
description: Methods of generating sound
sub_topics: [fm, wavetable, granular, additive, subtractive, modular, physical modeling]
- name: Arrangement
description: Structuring a track from intro to outro
sub_topics: [song structure, transitions, tension, energy flow, breakdowns, drops]
- name: Mixing
description: Balancing, processing, and spatializing elements
sub_topics: [eq, compression, bus processing, reverb, delay, stereo imaging, gain staging, automation]
- name: Mastering
description: Final stage processing for release
sub_topics: [limiting, stereo width, loudness, format delivery, referencing]
# Genre taxonomy (assigned to Creators, not techniques)
genres:
- Bass music
- Drum & bass
- Dubstep
- Halftime
- House
- Techno
- IDM
- Glitch
- Downtempo
- Neuro
- Ambient
- Experimental
- Cinematic
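A consumer of this file would likely sanity-check it at startup. The sketch below hand-mirrors a subset of the structure above as a Python literal (the real code would load the YAML with a parser) and checks the invariants the taxonomy relies on; `check_taxonomy` is a hypothetical helper, not an existing function in the codebase.

```python
# Hand-mirrored subset of the canonical tags file (illustrative only).
CATEGORIES = [
    {"name": "Workflow", "sub_topics": ["daw setup", "templates"]},
    {"name": "Music Theory", "sub_topics": ["harmony", "scales"]},
    {"name": "Sound Design", "sub_topics": ["bass", "drums"]},
    {"name": "Synthesis", "sub_topics": ["fm", "wavetable"]},
    {"name": "Arrangement", "sub_topics": ["song structure", "drops"]},
    {"name": "Mixing", "sub_topics": ["eq", "compression"]},
    {"name": "Mastering", "sub_topics": ["limiting", "loudness"]},
]

def check_taxonomy(categories):
    """Enforce the invariants the spec relies on: exactly 7 top-level
    categories, unique names, lowercase sub-topics."""
    names = [c["name"] for c in categories]
    assert len(names) == 7, "expected 7 top-level production categories"
    assert len(set(names)) == len(names), "duplicate category names"
    for c in categories:
        assert all(t == t.lower() for t in c["sub_topics"]), c["name"]
    return True
```

Failing fast here keeps stage-4 classification from silently inventing categories outside the canonical seven.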

docker/Dockerfile.api (new file)

@@ -0,0 +1,30 @@
FROM python:3.12-slim
WORKDIR /app
# System deps
RUN apt-get update && apt-get install -y --no-install-recommends \
gcc libpq-dev curl \
&& rm -rf /var/lib/apt/lists/*
# Python deps (cached layer)
COPY backend/requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt
# Git commit SHA for version tracking
ARG GIT_COMMIT_SHA=unknown
# Application code
COPY backend/ /app/
RUN echo "${GIT_COMMIT_SHA}" > /app/.git-commit
COPY prompts/ /prompts/
COPY config/ /config/
COPY alembic.ini /app/alembic.ini
COPY alembic/ /app/alembic/
EXPOSE 8000
HEALTHCHECK --interval=15s --timeout=5s --retries=3 --start-period=10s \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

docker/Dockerfile.web (new file)

@@ -0,0 +1,20 @@
FROM node:22-alpine AS build
WORKDIR /app
COPY frontend/package*.json ./
RUN npm ci --ignore-scripts
COPY frontend/ .
ARG VITE_GIT_COMMIT=dev
ENV VITE_GIT_COMMIT=$VITE_GIT_COMMIT
RUN npm run build
FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html
COPY docker/nginx.conf /etc/nginx/conf.d/default.conf
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]

docker/nginx.conf (new file)

@@ -0,0 +1,35 @@
server {
listen 80;
server_name _;
root /usr/share/nginx/html;
index index.html;
# Use Docker's embedded DNS with 30s TTL so upstream IPs refresh
# after container recreates
resolver 127.0.0.11 valid=30s ipv6=off;
# Allow large transcript uploads (up to 50MB)
client_max_body_size 50m;
# SPA fallback
location / {
try_files $uri $uri/ /index.html;
}
# API proxy variable forces nginx to re-resolve on each request
location /api/ {
set $backend http://chrysopedia-api:8000;
proxy_pass $backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_read_timeout 300s;
proxy_send_timeout 300s;
}
location /health {
set $backend http://chrysopedia-api:8000;
proxy_pass $backend;
}
}

@@ -70,8 +70,8 @@
/* Pills / special badges */
--color-pill-bg: #22222e;
--color-pill-text: #e2e2ea;
--color-pill-plugin-bg: #2e1065;
--color-pill-plugin-text: #c4b5fd;
--color-pill-plugin-bg: #3b1f06;
--color-pill-plugin-text: #f6ad55;
--color-badge-category-bg: #1e1b4b;
--color-badge-category-text: #93c5fd;
--color-badge-type-technique-bg: #1e1b4b;
@@ -88,8 +88,8 @@
/* Per-category badge colors */
--color-badge-cat-sound-design-bg: #0d3b3b;
--color-badge-cat-sound-design-text: #5eead4;
--color-badge-cat-mixing-bg: #2e1065;
--color-badge-cat-mixing-text: #c4b5fd;
--color-badge-cat-mixing-bg: #0f2942;
--color-badge-cat-mixing-text: #7dd3fc;
--color-badge-cat-synthesis-bg: #0c2461;
--color-badge-cat-synthesis-text: #93c5fd;
--color-badge-cat-arrangement-bg: #422006;
@@ -198,9 +198,13 @@ body {
.app-main {
flex: 1;
width: 100%;
max-width: 72rem;
margin: 1.5rem auto;
padding: 0 1.5rem;
box-sizing: border-box;
overflow-wrap: break-word;
overflow-x: hidden;
}
/* ── App footer ───────────────────────────────────────────────────────────── */
@@ -930,7 +934,7 @@ a.app-footer__repo:hover {
.home-hero {
text-align: center;
padding: 3rem 1rem 2rem;
padding: 0.5rem 1rem 1.5rem;
}
.home-hero__title {
@@ -1336,6 +1340,9 @@ a.app-footer__repo:hover {
.technique-page {
max-width: 64rem;
width: 100%;
overflow-wrap: break-word;
word-wrap: break-word;
}
.technique-columns {
@@ -1347,6 +1354,8 @@ a.app-footer__repo:hover {
.technique-columns__main {
min-width: 0; /* prevent grid blowout */
overflow-wrap: break-word;
word-wrap: break-word;
}
.technique-columns__sidebar {
@@ -1392,15 +1401,26 @@ a.app-footer__repo:hover {
color: var(--color-banner-amber-text);
}
.technique-header {
margin-bottom: 1.5rem;
}
.technique-header__title-row {
display: flex;
align-items: flex-start;
justify-content: space-between;
gap: 1rem;
margin-bottom: 0.5rem;
}
.technique-header__title-row .badge--category {
flex-shrink: 0;
margin-top: 0.35rem;
}
.technique-header__title {
font-size: 1.75rem;
font-weight: 800;
letter-spacing: -0.02em;
margin-bottom: 0.5rem;
margin-bottom: 0;
line-height: 1.2;
}
@@ -1415,6 +1435,7 @@ a.app-footer__repo:hover {
display: inline-flex;
flex-wrap: wrap;
gap: 0.25rem;
max-width: 100%;
}
.technique-header__creator-genres {
@@ -1432,6 +1453,9 @@ a.app-footer__repo:hover {
}
.technique-header__creator-link {
display: inline-flex;
align-items: center;
gap: 0.35rem;
font-size: 1.125rem;
font-weight: 600;
color: var(--color-link-accent);
@@ -1458,6 +1482,13 @@ a.app-footer__repo:hover {
/* ── Technique prose / sections ───────────────────────────────────────────── */
.technique-main__tags {
display: flex;
flex-wrap: wrap;
gap: 0.25rem;
margin-bottom: 1rem;
}
.technique-summary {
margin-bottom: 1.5rem;
}
@@ -1521,7 +1552,6 @@ a.app-footer__repo:hover {
background: var(--color-bg-surface);
border: 1px solid var(--color-border);
border-radius: 0.5rem;
overflow: hidden;
}
.technique-moment__title {
@@ -1530,7 +1560,6 @@ a.app-footer__repo:hover {
font-size: 0.9375rem;
font-weight: 600;
line-height: 1.3;
word-break: break-word;
}
.technique-moment__meta {
@@ -1539,7 +1568,6 @@ a.app-footer__repo:hover {
gap: 0.5rem;
margin-bottom: 0.25rem;
flex-wrap: wrap;
min-width: 0;
}
.technique-moment__time {
@@ -1552,7 +1580,7 @@ a.app-footer__repo:hover {
font-size: 0.75rem;
color: var(--color-text-muted);
font-family: "SF Mono", "Fira Code", "Fira Mono", "Roboto Mono", monospace;
max-width: 100%;
max-width: 20rem;
overflow: hidden;
text-overflow: ellipsis;
white-space: nowrap;
@@ -1810,6 +1838,9 @@ a.app-footer__repo:hover {
}
.creator-row__name {
display: inline-flex;
align-items: center;
gap: 0.5rem;
font-size: 0.9375rem;
font-weight: 600;
min-width: 10rem;
@@ -1852,6 +1883,9 @@ a.app-footer__repo:hover {
}
.creator-detail__name {
display: flex;
align-items: center;
gap: 0.75rem;
font-size: 1.75rem;
font-weight: 800;
letter-spacing: -0.02em;
@@ -2012,12 +2046,10 @@ a.app-footer__repo:hover {
margin: 0;
}
.topic-card__dot {
display: inline-block;
width: 0.5rem;
height: 0.5rem;
border-radius: 50%;
.topic-card__glyph {
flex-shrink: 0;
line-height: 1;
opacity: 0.7;
}
.topic-card__desc {
@@ -2170,26 +2202,6 @@ a.app-footer__repo:hover {
.topic-subtopic {
padding-left: 1rem;
}
.app-main {
padding: 0 1rem;
}
.technique-header__meta {
gap: 0.375rem;
}
.technique-header__tags {
gap: 0.1875rem;
}
.technique-header__creator-genres {
gap: 0.1875rem;
}
.version-switcher__select {
max-width: 12rem;
}
}
/* ── Report Issue Modal ─────────────────────────────────────────────────── */
@@ -2553,9 +2565,6 @@ a.app-footer__repo:hover {
padding: 0.3rem 0.5rem;
font-size: 0.8rem;
cursor: pointer;
max-width: 100%;
overflow: hidden;
text-overflow: ellipsis;
}
.version-switcher__select:focus {
@@ -3178,3 +3187,126 @@ a.app-footer__repo:hover {
white-space: pre-wrap;
word-break: break-word;
}
/* ── Ghost button ─────────────────────────────────────────────────────── */
.btn--ghost {
background: transparent;
color: var(--color-text-muted);
border-color: transparent;
}
.btn--ghost:hover:not(:disabled) {
color: var(--color-text-secondary);
background: var(--color-bg-surface);
border-color: var(--color-border);
}
/* ── Technique page footer ────────────────────────────────────────────── */
.technique-footer {
margin-top: 2rem;
padding-top: 1rem;
border-top: 1px solid var(--color-border);
display: flex;
justify-content: flex-end;
}
/* ── Creator avatar ───────────────────────────────────────────────────── */
.creator-avatar {
border-radius: 4px;
flex-shrink: 0;
vertical-align: middle;
}
.creator-avatar--img {
object-fit: cover;
}
/* ── Copy link button ─────────────────────────────────────────────────── */
.copy-link-btn {
display: inline-flex;
align-items: center;
justify-content: center;
position: relative;
background: none;
border: none;
color: var(--color-text-muted);
cursor: pointer;
padding: 0.15rem;
border-radius: 4px;
opacity: 0;
transition: opacity 0.15s, color 0.15s, background 0.15s;
vertical-align: middle;
margin-left: 0.25rem;
}
.technique-header__title:hover .copy-link-btn,
.copy-link-btn:focus-visible {
opacity: 1;
}
.copy-link-btn:hover {
opacity: 1;
color: var(--color-accent);
background: var(--color-bg-surface);
}
.copy-link-btn__tooltip {
position: absolute;
top: -1.75rem;
left: 50%;
transform: translateX(-50%);
background: var(--color-bg-surface);
color: var(--color-accent);
font-size: 0.7rem;
padding: 0.15rem 0.5rem;
border-radius: 4px;
border: 1px solid var(--color-border);
white-space: nowrap;
pointer-events: none;
animation: fadeInUp 0.15s ease-out;
}
@keyframes fadeInUp {
from { opacity: 0; transform: translateX(-50%) translateY(4px); }
to { opacity: 1; transform: translateX(-50%) translateY(0); }
}
/* ── Recent card with creator ─────────────────────────────────────────── */
.recent-card__header {
display: flex;
align-items: flex-start;
justify-content: space-between;
gap: 0.5rem;
}
.recent-card__creator {
display: inline-flex;
align-items: center;
gap: 0.3rem;
font-size: 0.8rem;
color: var(--color-text-secondary);
white-space: nowrap;
flex-shrink: 0;
}
/* ── Search result card creator ───────────────────────────────────────── */
.search-result-card__creator {
display: inline-flex;
align-items: center;
gap: 0.3rem;
}
/* ── Technique footer inspect link ────────────────────────────────────── */
.technique-footer__inspect {
display: inline-flex;
align-items: center;
gap: 0.3rem;
text-decoration: none;
}

Some files were not shown because too many files have changed in this diff.