fix: restore complete project tree from ub01 canonical state
Auto-mode commit 7aa33cd accidentally deleted 78 files (14,814 lines) during M005
execution. Subsequent commits rebuilt some frontend files but backend/, alembic/,
tests/, whisper/, docker configs, and prompts were never restored in this repo.
This commit restores the full project tree by syncing from ub01's working directory,
which has all M001-M007 features running in production containers.
Restored: backend/ (config, models, routers, database, redis, search_service, worker),
alembic/ (6 migrations), docker/ (Dockerfiles, nginx, compose), prompts/ (4 stages),
tests/, whisper/, README.md, .env.example, chrysopedia-spec.md
parent f6dcc80dbf
commit 4b0914b12b
120 changed files with 12,812 additions and 163 deletions
52  .env.example (new file)
@@ -0,0 +1,52 @@
# ─── Chrysopedia Environment Variables ───
# Copy to .env and fill in secrets before docker compose up

# PostgreSQL
POSTGRES_USER=chrysopedia
POSTGRES_PASSWORD=changeme
POSTGRES_DB=chrysopedia

# Redis (Celery broker) — container-internal, no secret needed
REDIS_URL=redis://chrysopedia-redis:6379/0

# LLM endpoint (OpenAI-compatible — OpenWebUI on FYN DGX)
LLM_API_URL=https://chat.forgetyour.name/api/v1
LLM_API_KEY=sk-changeme
LLM_MODEL=fyn-llm-agent-chat
LLM_FALLBACK_URL=https://chat.forgetyour.name/api/v1
LLM_FALLBACK_MODEL=fyn-llm-agent-chat

# Per-stage LLM model overrides (optional — defaults to LLM_MODEL)
# Modality: "chat" = standard JSON mode, "thinking" = reasoning model (strips <think> tags)
# Stages 2 (segmentation) and 4 (classification) are mechanical — use fast chat model
# Stages 3 (extraction) and 5 (synthesis) need reasoning — use thinking model
LLM_STAGE2_MODEL=fyn-llm-agent-chat
LLM_STAGE2_MODALITY=chat
LLM_STAGE3_MODEL=fyn-llm-agent-think
LLM_STAGE3_MODALITY=thinking
LLM_STAGE4_MODEL=fyn-llm-agent-chat
LLM_STAGE4_MODALITY=chat
LLM_STAGE5_MODEL=fyn-llm-agent-think
LLM_STAGE5_MODALITY=thinking

# Max tokens for LLM responses (OpenWebUI defaults to 1000 — pipeline needs much more)
LLM_MAX_TOKENS=65536

# Embedding endpoint (Ollama container in the compose stack)
EMBEDDING_API_URL=http://chrysopedia-ollama:11434/v1
EMBEDDING_MODEL=nomic-embed-text

# Qdrant (container-internal)
QDRANT_URL=http://chrysopedia-qdrant:6333
QDRANT_COLLECTION=chrysopedia

# Application
APP_ENV=production
APP_LOG_LEVEL=info

# File storage paths (inside container, bind-mounted to /vmPool/r/services/chrysopedia_data)
TRANSCRIPT_STORAGE_PATH=/data/transcripts
VIDEO_METADATA_PATH=/data/video_meta

# Review mode toggle (true = moments require admin review before publishing)
REVIEW_MODE=true
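The per-stage override scheme above (stage-specific model plus a "chat"/"thinking" modality, falling back to the global LLM_MODEL) can be sketched in Python. The helper names are hypothetical, and the <think>-tag stripping shown is one plausible way a "thinking" response could be normalized before JSON parsing; it is not the project's actual implementation.

```python
import os
import re

# Demo values mirroring .env.example, set here so the sketch is self-contained.
os.environ.setdefault("LLM_MODEL", "fyn-llm-agent-chat")
os.environ.setdefault("LLM_STAGE3_MODEL", "fyn-llm-agent-think")
os.environ.setdefault("LLM_STAGE3_MODALITY", "thinking")

def stage_llm_config(stage: int) -> dict:
    # Per-stage override wins; otherwise fall back to the global LLM_MODEL.
    model = os.environ.get(f"LLM_STAGE{stage}_MODEL") or os.environ["LLM_MODEL"]
    modality = os.environ.get(f"LLM_STAGE{stage}_MODALITY", "chat")
    return {"model": model, "modality": modality}

def clean_response(text: str, modality: str) -> str:
    # Reasoning models emit a <think>...</think> block before the answer;
    # strip it so the remainder can be parsed as JSON.
    if modality == "thinking":
        text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    return text.strip()
```

With these demo values, stage 3 resolves to the thinking model while stage 2 (no override set here) falls back to LLM_MODEL with the default "chat" modality.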
@@ -1,18 +1,18 @@
# GSD State

**Active Milestone:** M007: M007
**Active Slice:** S02: Debug Payload Viewer — Inline View, Copy, and Export in Admin UI
**Phase:** evaluating-gates
**Active Milestone:** M007: M007:
**Active Slice:** None
**Phase:** complete
**Requirements Status:** 0 active · 0 validated · 0 deferred · 0 out of scope

## Milestone Registry
- ✅ **M001:** Chrysopedia Foundation — Infrastructure, Pipeline Core, and Skeleton UI
- ✅ **M002:** M002: Chrysopedia Deployment — GitHub, ub01 Docker Stack, and Production Wiring
- ✅ **M003:** M003: Domain + DNS + Per-Stage LLM Model Routing
- ✅ **M004:** M004: UI Polish, Bug Fixes, Technique Page Redesign, and Article Versioning
- ✅ **M005:** M005: Pipeline Dashboard, Technique Page Redesign, Key Moment Cards
- ✅ **M006:** M006: Admin Nav, Pipeline Log Views, Commit SHA, Tag Polish, Topics Redesign, Footer
- 🔄 **M007:** M007
- ✅ **M002:** M002:
- ✅ **M003:** M003:
- ✅ **M004:** M004:
- ✅ **M005:** M005:
- ✅ **M006:** M006:
- ✅ **M007:** M007:

## Recent Decisions
- None recorded

@@ -21,4 +21,4 @@
- None

## Next Action
Evaluate 3 quality gate(s) for S02 before execution.
All milestones complete.
147  .gsd/activity/105-execute-task-M007-S02-T01.jsonl (new file; diff suppressed: lines too long)
48   .gsd/activity/106-complete-slice-M007-S02.jsonl (new file; diff suppressed: lines too long)
49   .gsd/activity/107-research-slice-M007-S03.jsonl (new file; diff suppressed: lines too long)
30   .gsd/activity/108-plan-slice-M007-S03.jsonl (new file; diff suppressed: lines too long)
49   .gsd/activity/109-execute-task-M007-S03-T01.jsonl (new file; diff suppressed: lines too long)
97   .gsd/activity/110-execute-task-M007-S03-T02.jsonl (new file; diff suppressed: lines too long)
36   .gsd/activity/111-complete-slice-M007-S03.jsonl (new file; diff suppressed: lines too long)
49   .gsd/activity/112-research-slice-M007-S04.jsonl (new file; diff suppressed: lines too long)
25   .gsd/activity/113-plan-slice-M007-S04.jsonl (new file; diff suppressed: lines too long)
70   .gsd/activity/114-execute-task-M007-S04-T01.jsonl (new file; diff suppressed: lines too long)
59   .gsd/activity/115-execute-task-M007-S04-T02.jsonl (new file; diff suppressed: lines too long)
15   .gsd/activity/116-complete-slice-M007-S04.jsonl (new file; diff suppressed: lines too long)
35   .gsd/activity/117-research-slice-M007-S05.jsonl (new file; diff suppressed: lines too long)
10   .gsd/activity/118-plan-slice-M007-S05.jsonl (new file; diff suppressed: lines too long)
38   .gsd/activity/119-execute-task-M007-S05-T01.jsonl (new file; diff suppressed: lines too long)
22   .gsd/activity/120-complete-slice-M007-S05.jsonl (new file; diff suppressed: lines too long)
50   .gsd/activity/121-research-slice-M007-S06.jsonl (new file; diff suppressed: lines too long)
21   .gsd/activity/122-plan-slice-M007-S06.jsonl (new file; diff suppressed: lines too long)
40   .gsd/activity/123-execute-task-M007-S06-T01.jsonl (new file; diff suppressed: lines too long)
13   .gsd/activity/124-complete-slice-M007-S06.jsonl (new file; diff suppressed: lines too long)
35   .gsd/activity/125-validate-milestone-M007.jsonl (new file; diff suppressed: lines too long)
33   .gsd/activity/126-complete-milestone-M007.jsonl (new file; diff suppressed: lines too long)
@@ -1,7 +0,0 @@
{
  "pid": 2052340,
  "startedAt": "2026-03-30T18:59:38.188Z",
  "unitType": "execute-task",
  "unitId": "M007/S02/T01",
  "unitStartedAt": "2026-03-30T18:59:38.188Z"
}
@@ -388,3 +388,114 @@
{"ts":"2026-03-30T18:59:38.123Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":1,"eventType":"iteration-start","data":{"iteration":10}}
{"ts":"2026-03-30T18:59:38.151Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S02/T01"}}
{"ts":"2026-03-30T18:59:38.165Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S02/T01"}}
{"ts":"2026-03-30T19:07:23.525Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S02/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":3}}
{"ts":"2026-03-30T19:07:28.921Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":1,"eventType":"iteration-start","data":{"iteration":11}}
{"ts":"2026-03-30T19:07:28.968Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S02"}}
{"ts":"2026-03-30T19:07:28.982Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S02"}}
{"ts":"2026-03-30T19:10:28.704Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":3}}
{"ts":"2026-03-30T19:10:28.809Z","flowId":"0caba39d-a590-4068-a529-23720a0ea587","seq":5,"eventType":"iteration-end","data":{"iteration":11}}
{"ts":"2026-03-30T19:10:28.810Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":1,"eventType":"iteration-start","data":{"iteration":12}}
{"ts":"2026-03-30T19:10:28.848Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:10:28.866Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:12:58.159Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":3}}
{"ts":"2026-03-30T19:12:58.261Z","flowId":"361fad9f-00c6-4c3c-92a3-8739bccd5079","seq":5,"eventType":"iteration-end","data":{"iteration":12}}
{"ts":"2026-03-30T19:12:58.261Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":1,"eventType":"iteration-start","data":{"iteration":13}}
{"ts":"2026-03-30T19:12:58.281Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:12:58.291Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:15:08.081Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":3}}
{"ts":"2026-03-30T19:15:08.184Z","flowId":"74a18d0c-e084-402d-8476-7cc481491c8f","seq":5,"eventType":"iteration-end","data":{"iteration":13}}
{"ts":"2026-03-30T19:15:08.185Z","flowId":"5fc6dd58-03fd-4861-a7a4-083a1c4964a8","seq":1,"eventType":"iteration-start","data":{"iteration":14}}
{"ts":"2026-03-30T19:15:08.212Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":1,"eventType":"iteration-start","data":{"iteration":15}}
{"ts":"2026-03-30T19:15:08.244Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S03/T01"}}
{"ts":"2026-03-30T19:15:08.259Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S03/T01"}}
{"ts":"2026-03-30T19:17:47.626Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S03/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":3}}
{"ts":"2026-03-30T19:17:47.868Z","flowId":"e2cbd134-9c3c-4c98-a24d-61e10e3f27e7","seq":5,"eventType":"iteration-end","data":{"iteration":15}}
{"ts":"2026-03-30T19:17:47.869Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":1,"eventType":"iteration-start","data":{"iteration":16}}
{"ts":"2026-03-30T19:17:47.902Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S03/T02"}}
{"ts":"2026-03-30T19:17:47.920Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S03/T02"}}
{"ts":"2026-03-30T19:24:39.796Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S03/T02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":3}}
{"ts":"2026-03-30T19:24:39.954Z","flowId":"f34dec93-2f75-4725-80df-7a253fdd2d0f","seq":5,"eventType":"iteration-end","data":{"iteration":16}}
{"ts":"2026-03-30T19:24:39.954Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":1,"eventType":"iteration-start","data":{"iteration":17}}
{"ts":"2026-03-30T19:24:40.081Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:24:40.099Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S03"}}
{"ts":"2026-03-30T19:26:38.422Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S03","status":"completed","artifactVerified":true},"causedBy":{"flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":3}}
{"ts":"2026-03-30T19:26:38.524Z","flowId":"38e7c249-f214-4444-ac55-af355dbb004b","seq":5,"eventType":"iteration-end","data":{"iteration":17}}
{"ts":"2026-03-30T19:26:38.524Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":1,"eventType":"iteration-start","data":{"iteration":18}}
{"ts":"2026-03-30T19:26:38.665Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:26:38.679Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:29:03.963Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":3}}
{"ts":"2026-03-30T19:29:04.064Z","flowId":"ed4e68af-8bda-4f1c-82ec-4ea21e6aa41b","seq":5,"eventType":"iteration-end","data":{"iteration":18}}
{"ts":"2026-03-30T19:29:04.064Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":1,"eventType":"iteration-start","data":{"iteration":19}}
{"ts":"2026-03-30T19:29:04.160Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:29:04.171Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:31:01.891Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":3}}
{"ts":"2026-03-30T19:31:01.994Z","flowId":"56f78437-573e-445a-a630-db5531d6e95b","seq":5,"eventType":"iteration-end","data":{"iteration":19}}
{"ts":"2026-03-30T19:31:01.994Z","flowId":"49a2c337-a403-42ab-b778-5b45bcd525dd","seq":1,"eventType":"iteration-start","data":{"iteration":20}}
{"ts":"2026-03-30T19:31:02.112Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":1,"eventType":"iteration-start","data":{"iteration":21}}
{"ts":"2026-03-30T19:31:02.216Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S04/T01"}}
{"ts":"2026-03-30T19:31:02.226Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S04/T01"}}
{"ts":"2026-03-30T19:34:11.113Z","flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S04/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"97ff47b0-6b73-4b2f-a31e-d1a31838f381","seq":3}}
{"ts":"2026-03-30T19:34:11.315Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":1,"eventType":"iteration-start","data":{"iteration":22}}
{"ts":"2026-03-30T19:34:11.422Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S04/T02"}}
{"ts":"2026-03-30T19:34:11.433Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S04/T02"}}
{"ts":"2026-03-30T19:36:47.725Z","flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S04/T02","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f2129392-f114-4a59-851b-ca9e897f2d99","seq":3}}
{"ts":"2026-03-30T19:36:47.886Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":1,"eventType":"iteration-start","data":{"iteration":23}}
{"ts":"2026-03-30T19:36:47.959Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:36:47.970Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S04"}}
{"ts":"2026-03-30T19:37:54.150Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S04","status":"completed","artifactVerified":true},"causedBy":{"flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":3}}
{"ts":"2026-03-30T19:37:54.252Z","flowId":"c4afe3a6-a4e3-4626-810c-bbe1d52887d2","seq":5,"eventType":"iteration-end","data":{"iteration":23}}
{"ts":"2026-03-30T19:37:54.252Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":1,"eventType":"iteration-start","data":{"iteration":24}}
{"ts":"2026-03-30T19:37:54.362Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:37:54.371Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:39:29.263Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":3}}
{"ts":"2026-03-30T19:39:29.365Z","flowId":"ba9789f1-ead3-4c7c-b121-e2d3acfebd21","seq":5,"eventType":"iteration-end","data":{"iteration":24}}
{"ts":"2026-03-30T19:39:29.365Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":1,"eventType":"iteration-start","data":{"iteration":25}}
{"ts":"2026-03-30T19:39:29.507Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:39:29.525Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:40:07.521Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":3}}
{"ts":"2026-03-30T19:40:07.641Z","flowId":"8477c4dd-8e6a-44c7-8c46-39165872ef2f","seq":5,"eventType":"iteration-end","data":{"iteration":25}}
{"ts":"2026-03-30T19:40:07.641Z","flowId":"bf2e38c7-8617-4669-b7fa-99bf7bcf95e7","seq":1,"eventType":"iteration-start","data":{"iteration":26}}
{"ts":"2026-03-30T19:40:07.723Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":1,"eventType":"iteration-start","data":{"iteration":27}}
{"ts":"2026-03-30T19:40:07.818Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S05/T01"}}
{"ts":"2026-03-30T19:40:07.829Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S05/T01"}}
{"ts":"2026-03-30T19:41:40.986Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S05/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":3}}
{"ts":"2026-03-30T19:41:41.202Z","flowId":"f3f48bc4-ee45-4425-881a-41f68074614f","seq":5,"eventType":"iteration-end","data":{"iteration":27}}
{"ts":"2026-03-30T19:41:41.203Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":1,"eventType":"iteration-start","data":{"iteration":28}}
{"ts":"2026-03-30T19:41:41.340Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:41:41.356Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S05"}}
{"ts":"2026-03-30T19:42:41.642Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S05","status":"completed","artifactVerified":true},"causedBy":{"flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":3}}
{"ts":"2026-03-30T19:42:41.744Z","flowId":"6e6038b4-b4fc-4d82-a220-ca23f1ae91dc","seq":5,"eventType":"iteration-end","data":{"iteration":28}}
{"ts":"2026-03-30T19:42:41.745Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":1,"eventType":"iteration-start","data":{"iteration":29}}
{"ts":"2026-03-30T19:42:41.878Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":2,"eventType":"dispatch-match","rule":"planning (no research, not S01) → research-slice","data":{"unitType":"research-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:42:41.895Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":3,"eventType":"unit-start","data":{"unitType":"research-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:44:50.594Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":4,"eventType":"unit-end","data":{"unitType":"research-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":3}}
{"ts":"2026-03-30T19:44:50.696Z","flowId":"03d87427-23fd-4250-884c-a71c15b73bf8","seq":5,"eventType":"iteration-end","data":{"iteration":29}}
{"ts":"2026-03-30T19:44:50.696Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":1,"eventType":"iteration-start","data":{"iteration":30}}
{"ts":"2026-03-30T19:44:50.771Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":2,"eventType":"dispatch-match","rule":"planning → plan-slice","data":{"unitType":"plan-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:44:50.779Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":3,"eventType":"unit-start","data":{"unitType":"plan-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:46:02.833Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":4,"eventType":"unit-end","data":{"unitType":"plan-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":3}}
{"ts":"2026-03-30T19:46:02.935Z","flowId":"f4353fd5-769c-45bd-8ef4-77d0ae2e445e","seq":5,"eventType":"iteration-end","data":{"iteration":30}}
{"ts":"2026-03-30T19:46:02.935Z","flowId":"070dddd6-32d6-4439-9287-e35a3c12423a","seq":1,"eventType":"iteration-start","data":{"iteration":31}}
{"ts":"2026-03-30T19:46:03.073Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":1,"eventType":"iteration-start","data":{"iteration":32}}
{"ts":"2026-03-30T19:46:03.212Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":2,"eventType":"dispatch-match","rule":"executing → execute-task","data":{"unitType":"execute-task","unitId":"M007/S06/T01"}}
{"ts":"2026-03-30T19:46:03.228Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S06/T01"}}
{"ts":"2026-03-30T19:48:29.975Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S06/T01","status":"completed","artifactVerified":true},"causedBy":{"flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":3}}
{"ts":"2026-03-30T19:48:30.283Z","flowId":"03583653-8ba1-420f-8cd3-5184f2f024a5","seq":5,"eventType":"iteration-end","data":{"iteration":32}}
{"ts":"2026-03-30T19:48:30.283Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":1,"eventType":"iteration-start","data":{"iteration":33}}
{"ts":"2026-03-30T19:48:30.402Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":2,"eventType":"dispatch-match","rule":"summarizing → complete-slice","data":{"unitType":"complete-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:48:30.414Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":3,"eventType":"unit-start","data":{"unitType":"complete-slice","unitId":"M007/S06"}}
{"ts":"2026-03-30T19:49:21.353Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":4,"eventType":"unit-end","data":{"unitType":"complete-slice","unitId":"M007/S06","status":"completed","artifactVerified":true},"causedBy":{"flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":3}}
{"ts":"2026-03-30T19:49:21.455Z","flowId":"85108a2c-6888-4d5b-8ed4-a011ebc859a0","seq":5,"eventType":"iteration-end","data":{"iteration":33}}
{"ts":"2026-03-30T19:49:21.455Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":1,"eventType":"iteration-start","data":{"iteration":34}}
{"ts":"2026-03-30T19:49:21.575Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":2,"eventType":"dispatch-match","rule":"validating-milestone → validate-milestone","data":{"unitType":"validate-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:49:21.589Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":3,"eventType":"unit-start","data":{"unitType":"validate-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:51:17.420Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":4,"eventType":"unit-end","data":{"unitType":"validate-milestone","unitId":"M007","status":"completed","artifactVerified":true},"causedBy":{"flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":3}}
{"ts":"2026-03-30T19:51:17.522Z","flowId":"2cf17eef-c30d-43c6-a6d4-3bb2cea451de","seq":5,"eventType":"iteration-end","data":{"iteration":34}}
{"ts":"2026-03-30T19:51:17.522Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":1,"eventType":"iteration-start","data":{"iteration":35}}
{"ts":"2026-03-30T19:51:17.712Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":2,"eventType":"dispatch-match","rule":"completing-milestone → complete-milestone","data":{"unitType":"complete-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:51:17.729Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":3,"eventType":"unit-start","data":{"unitType":"complete-milestone","unitId":"M007"}}
{"ts":"2026-03-30T19:53:11.667Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":4,"eventType":"unit-end","data":{"unitType":"complete-milestone","unitId":"M007","status":"completed","artifactVerified":true},"causedBy":{"flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":3}}
{"ts":"2026-03-30T19:53:11.849Z","flowId":"5f446b30-16e4-419e-8ea0-252f18c7e0c9","seq":5,"eventType":"iteration-end","data":{"iteration":35}}
{"ts":"2026-03-30T19:53:11.849Z","flowId":"04e80f1b-8d6e-4e44-9dcf-bc7a619cd7f3","seq":1,"eventType":"iteration-start","data":{"iteration":36}}
{"ts":"2026-03-30T19:53:11.949Z","flowId":"2d8e4e33-914e-476e-bdd7-1d19ae05fe36","seq":0,"eventType":"worktree-merge-start","data":{"milestoneId":"M007","mode":"none"}}
{"ts":"2026-03-30T19:53:12.018Z","flowId":"04e80f1b-8d6e-4e44-9dcf-bc7a619cd7f3","seq":2,"eventType":"terminal","data":{"reason":"milestone-complete","milestoneId":"M007"}}
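Each activity file is a stream of events like the ones above: `unit-start` and `unit-end` share a `flowId` and `(unitType, unitId)`, so pairing them yields per-unit wall-clock durations. A minimal sketch of one way to consume the log (the pairing helper is an assumption for illustration, not part of the GSD tooling):

```python
import json
from datetime import datetime

def unit_durations(lines):
    """Pair unit-start / unit-end events and return wall-clock seconds per unit."""
    starts, durations = {}, {}
    for line in lines:
        ev = json.loads(line)
        data = ev.get("data", {})
        key = (data.get("unitType"), data.get("unitId"))
        # Timestamps are ISO 8601 with a trailing Z; normalize for fromisoformat.
        ts = datetime.fromisoformat(ev["ts"].replace("Z", "+00:00"))
        if ev["eventType"] == "unit-start":
            starts[key] = ts
        elif ev["eventType"] == "unit-end" and key in starts:
            durations[key] = (ts - starts.pop(key)).total_seconds()
    return durations

# Two events copied from the log above.
log = [
    '{"ts":"2026-03-30T18:59:38.165Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":3,"eventType":"unit-start","data":{"unitType":"execute-task","unitId":"M007/S02/T01"}}',
    '{"ts":"2026-03-30T19:07:23.525Z","flowId":"e68cf509-8e7a-42ae-ae7f-68d2fe2171c3","seq":4,"eventType":"unit-end","data":{"unitType":"execute-task","unitId":"M007/S02/T01","status":"completed"}}',
]
durations = unit_durations(log)  # M007/S02/T01 ran about 465.36 seconds
```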
@@ -4317,6 +4317,929 @@
"web-quality-audit"
|
||||
],
|
||||
"cacheHitRate": 100
|
||||
},
|
||||
    {
      "type": "execute-task",
      "id": "M007/S02/T01",
      "model": "claude-opus-4-6",
      "startedAt": 1774897178165,
      "finishedAt": 1774897643382,
      "tokens": {
        "input": 75,
        "output": 15023,
        "cacheRead": 5881554,
        "cacheWrite": 56779,
        "total": 5953431
      },
      "cost": 3.6715957500000007,
      "toolCalls": 72,
      "assistantMessages": 70,
      "userMessages": 0,
      "apiRequests": 70,
      "promptCharCount": 11242,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-slice",
      "id": "M007/S02",
      "model": "claude-opus-4-6",
      "startedAt": 1774897648982,
      "finishedAt": 1774897828586,
      "tokens": {
        "input": 23,
        "output": 5166,
        "cacheRead": 1522090,
        "cacheWrite": 27172,
        "total": 1554451
      },
      "cost": 1.0601349999999998,
      "toolCalls": 24,
      "assistantMessages": 21,
      "userMessages": 0,
      "apiRequests": 21,
      "promptCharCount": 34491,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "research-slice",
      "id": "M007/S03",
      "model": "claude-opus-4-6",
      "startedAt": 1774897828866,
      "finishedAt": 1774897978047,
      "tokens": {
        "input": 20,
        "output": 4934,
        "cacheRead": 1307100,
        "cacheWrite": 26644,
        "total": 1338698
      },
      "cost": 0.943525,
      "toolCalls": 28,
      "assistantMessages": 18,
      "userMessages": 0,
      "apiRequests": 18,
      "promptCharCount": 24967,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "plan-slice",
      "id": "M007/S03",
      "model": "claude-opus-4-6",
      "startedAt": 1774897978291,
      "finishedAt": 1774898107975,
      "tokens": {
        "input": 11,
        "output": 6149,
        "cacheRead": 695356,
        "cacheWrite": 22742,
        "total": 724258
      },
      "cost": 0.6435955,
      "toolCalls": 17,
      "assistantMessages": 10,
      "userMessages": 0,
      "apiRequests": 10,
      "promptCharCount": 34934,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S03/T01",
      "model": "claude-opus-4-6",
      "startedAt": 1774898108259,
      "finishedAt": 1774898267492,
      "tokens": {
        "input": 23,
        "output": 7314,
        "cacheRead": 1487428,
        "cacheWrite": 18545,
        "total": 1513310
      },
      "cost": 1.04258525,
      "toolCalls": 24,
      "assistantMessages": 22,
      "userMessages": 0,
      "apiRequests": 22,
      "promptCharCount": 14814,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S03/T02",
      "model": "claude-opus-4-6",
      "startedAt": 1774898267920,
      "finishedAt": 1774898679681,
      "tokens": {
        "input": 42,
        "output": 9726,
        "cacheRead": 3015457,
        "cacheWrite": 24404,
        "total": 3049629
      },
      "cost": 1.9036135,
      "toolCalls": 53,
      "assistantMessages": 41,
      "userMessages": 0,
      "apiRequests": 41,
      "promptCharCount": 14515,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-slice",
      "id": "M007/S03",
      "model": "claude-opus-4-6",
      "startedAt": 1774898680099,
      "finishedAt": 1774898798309,
      "tokens": {
        "input": 21,
        "output": 4972,
        "cacheRead": 995363,
        "cacheWrite": 18063,
        "total": 1018419
      },
      "cost": 0.7349802500000001,
      "toolCalls": 14,
      "assistantMessages": 14,
      "userMessages": 0,
      "apiRequests": 14,
      "promptCharCount": 22108,
      "baselineCharCount": 21533,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "research-slice",
      "id": "M007/S04",
      "model": "claude-opus-4-6",
      "startedAt": 1774898798679,
      "finishedAt": 1774898943839,
      "tokens": {
        "input": 20,
        "output": 5462,
        "cacheRead": 1552827,
        "cacheWrite": 44713,
        "total": 1603022
      },
      "cost": 1.19251975,
      "toolCalls": 28,
      "assistantMessages": 18,
      "userMessages": 0,
      "apiRequests": 18,
      "promptCharCount": 29245,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "plan-slice",
      "id": "M007/S04",
      "model": "claude-opus-4-6",
      "startedAt": 1774898944171,
      "finishedAt": 1774899061785,
      "tokens": {
        "input": 12,
        "output": 4072,
        "cacheRead": 803090,
        "cacheWrite": 23523,
        "total": 830697
      },
      "cost": 0.6504237500000001,
      "toolCalls": 11,
      "assistantMessages": 11,
      "userMessages": 0,
      "apiRequests": 11,
      "promptCharCount": 38843,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S04/T01",
      "model": "claude-opus-4-6",
      "startedAt": 1774899062226,
      "finishedAt": 1774899251015,
      "tokens": {
        "input": 35,
        "output": 6982,
        "cacheRead": 2429704,
        "cacheWrite": 25695,
        "total": 2462416
      },
      "cost": 1.5501707500000002,
      "toolCalls": 34,
      "assistantMessages": 32,
      "userMessages": 0,
      "apiRequests": 32,
      "promptCharCount": 12185,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S04/T02",
      "model": "claude-opus-4-6",
      "startedAt": 1774899251433,
      "finishedAt": 1774899407608,
      "tokens": {
        "input": 31,
        "output": 6247,
        "cacheRead": 2011068,
        "cacheWrite": 19944,
        "total": 2037290
      },
      "cost": 1.2865140000000002,
      "toolCalls": 26,
      "assistantMessages": 28,
      "userMessages": 0,
      "apiRequests": 28,
      "promptCharCount": 13007,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-slice",
      "id": "M007/S04",
      "model": "claude-opus-4-6",
      "startedAt": 1774899407970,
      "finishedAt": 1774899474030,
      "tokens": {
        "input": 8,
        "output": 2726,
        "cacheRead": 399468,
        "cacheWrite": 14096,
        "total": 416298
      },
      "cost": 0.356024,
      "toolCalls": 6,
      "assistantMessages": 6,
      "userMessages": 0,
      "apiRequests": 6,
      "promptCharCount": 34632,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "research-slice",
      "id": "M007/S05",
      "model": "claude-opus-4-6",
      "startedAt": 1774899474371,
      "finishedAt": 1774899569155,
      "tokens": {
        "input": 17,
        "output": 3338,
        "cacheRead": 1002977,
        "cacheWrite": 13503,
        "total": 1019835
      },
      "cost": 0.66941725,
      "toolCalls": 17,
      "assistantMessages": 15,
      "userMessages": 0,
      "apiRequests": 15,
      "promptCharCount": 24953,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "plan-slice",
      "id": "M007/S05",
      "model": "claude-opus-4-6",
      "startedAt": 1774899569525,
      "finishedAt": 1774899607404,
      "tokens": {
        "input": 4,
        "output": 1766,
        "cacheRead": 194606,
        "cacheWrite": 13276,
        "total": 209652
      },
      "cost": 0.22444799999999998,
      "toolCalls": 4,
      "assistantMessages": 3,
      "userMessages": 0,
      "apiRequests": 3,
      "promptCharCount": 31862,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S05/T01",
      "model": "claude-opus-4-6",
      "startedAt": 1774899607829,
      "finishedAt": 1774899700894,
      "tokens": {
        "input": 21,
        "output": 3670,
        "cacheRead": 1146065,
        "cacheWrite": 9763,
        "total": 1159519
      },
      "cost": 0.72590625,
      "toolCalls": 16,
      "assistantMessages": 18,
      "userMessages": 0,
      "apiRequests": 18,
      "promptCharCount": 12517,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-slice",
      "id": "M007/S05",
      "model": "claude-opus-4-6",
      "startedAt": 1774899701356,
      "finishedAt": 1774899761536,
      "tokens": {
        "input": 9,
        "output": 2635,
        "cacheRead": 469361,
        "cacheWrite": 12904,
        "total": 484909
      },
      "cost": 0.3812505,
      "toolCalls": 12,
      "assistantMessages": 7,
      "userMessages": 0,
      "apiRequests": 7,
      "promptCharCount": 34154,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "research-slice",
      "id": "M007/S06",
      "model": "claude-opus-4-6",
      "startedAt": 1774899761895,
      "finishedAt": 1774899890474,
      "tokens": {
        "input": 22,
        "output": 4634,
        "cacheRead": 1471879,
        "cacheWrite": 28131,
        "total": 1504666
      },
      "cost": 1.0277182499999997,
      "toolCalls": 27,
      "assistantMessages": 20,
      "userMessages": 0,
      "apiRequests": 20,
      "promptCharCount": 27474,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "plan-slice",
      "id": "M007/S06",
      "model": "claude-opus-4-6",
      "startedAt": 1774899890779,
      "finishedAt": 1774899962709,
      "tokens": {
        "input": 10,
        "output": 3024,
        "cacheRead": 628271,
        "cacheWrite": 16437,
        "total": 647742
      },
      "cost": 0.49251675,
      "toolCalls": 9,
      "assistantMessages": 9,
      "userMessages": 0,
      "apiRequests": 9,
      "promptCharCount": 35215,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "execute-task",
      "id": "M007/S06/T01",
      "model": "claude-opus-4-6",
      "startedAt": 1774899963228,
      "finishedAt": 1774900109857,
      "tokens": {
        "input": 18,
        "output": 3969,
        "cacheRead": 1096570,
        "cacheWrite": 10791,
        "total": 1111348
      },
      "cost": 0.7150437500000001,
      "toolCalls": 20,
      "assistantMessages": 17,
      "userMessages": 0,
      "apiRequests": 17,
      "promptCharCount": 11551,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-slice",
      "id": "M007/S06",
      "model": "claude-opus-4-6",
      "startedAt": 1774900110414,
      "finishedAt": 1774900161244,
      "tokens": {
        "input": 8,
        "output": 2176,
        "cacheRead": 330857,
        "cacheWrite": 11758,
        "total": 344799
      },
      "cost": 0.293356,
      "toolCalls": 4,
      "assistantMessages": 5,
      "userMessages": 0,
      "apiRequests": 5,
      "promptCharCount": 33871,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "validate-milestone",
      "id": "M007",
      "model": "claude-opus-4-6",
      "startedAt": 1774900161589,
      "finishedAt": 1774900277303,
      "tokens": {
        "input": 15,
        "output": 4391,
        "cacheRead": 930967,
        "cacheWrite": 19894,
        "total": 955267
      },
      "cost": 0.6996709999999999,
      "toolCalls": 19,
      "assistantMessages": 13,
      "userMessages": 0,
      "apiRequests": 13,
      "promptCharCount": 33712,
      "baselineCharCount": 21851,
      "skills": [
        "accessibility",
        "agent-browser",
        "best-practices",
        "code-optimizer",
        "core-web-vitals",
        "create-gsd-extension",
        "create-skill",
        "create-workflow",
        "debug-like-expert",
        "frontend-design",
        "github-workflows",
        "lint",
        "make-interfaces-feel-better",
        "react-best-practices",
        "review",
        "test",
        "userinterface-wiki",
        "web-design-guidelines",
        "web-quality-audit"
      ],
      "cacheHitRate": 100
    },
    {
      "type": "complete-milestone",
      "id": "M007",
      "model": "claude-opus-4-6",
      "startedAt": 1774900277729,
      "finishedAt": 1774900391973,
      "tokens": {
        "input": 15,
        "output": 4910,
        "cacheRead": 927877,
        "cacheWrite": 18700,
        "total": 951502
      },
      "cost": 0.7036385000000001,
      "toolCalls": 17,
      "assistantMessages": 13,
      "userMessages": 0,
      "apiRequests": 13,
      "cacheHitRate": 100
    }
  ]
}
15
.gsd/runtime/units/complete-milestone-M007.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-milestone",
  "unitId": "M007",
  "startedAt": 1774900277729,
  "updatedAt": 1774900277729,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774900277729,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/complete-slice-M007-S02.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-slice",
  "unitId": "M007/S02",
  "startedAt": 1774897648982,
  "updatedAt": 1774897648983,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774897648982,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/complete-slice-M007-S03.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-slice",
  "unitId": "M007/S03",
  "startedAt": 1774898680099,
  "updatedAt": 1774898680100,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774898680099,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/complete-slice-M007-S04.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-slice",
  "unitId": "M007/S04",
  "startedAt": 1774899407970,
  "updatedAt": 1774899407970,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899407970,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/complete-slice-M007-S05.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-slice",
  "unitId": "M007/S05",
  "startedAt": 1774899701356,
  "updatedAt": 1774899701357,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899701356,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/complete-slice-M007-S06.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "complete-slice",
  "unitId": "M007/S06",
  "startedAt": 1774900110414,
  "updatedAt": 1774900110415,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774900110414,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S03-T01.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S03/T01",
  "startedAt": 1774898108259,
  "updatedAt": 1774898108260,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774898108259,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S03-T02.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S03/T02",
  "startedAt": 1774898267920,
  "updatedAt": 1774898267921,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774898267920,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S04-T01.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S04/T01",
  "startedAt": 1774899062226,
  "updatedAt": 1774899062226,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899062226,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S04-T02.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S04/T02",
  "startedAt": 1774899251433,
  "updatedAt": 1774899251433,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899251433,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S05-T01.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S05/T01",
  "startedAt": 1774899607829,
  "updatedAt": 1774899607829,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899607829,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/execute-task-M007-S06-T01.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "execute-task",
  "unitId": "M007/S06/T01",
  "startedAt": 1774899963228,
  "updatedAt": 1774899963228,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899963228,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/plan-slice-M007-S03.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "plan-slice",
  "unitId": "M007/S03",
  "startedAt": 1774897978291,
  "updatedAt": 1774897978291,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774897978291,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/plan-slice-M007-S04.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "plan-slice",
  "unitId": "M007/S04",
  "startedAt": 1774898944171,
  "updatedAt": 1774898944171,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774898944171,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/plan-slice-M007-S05.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "plan-slice",
  "unitId": "M007/S05",
  "startedAt": 1774899569525,
  "updatedAt": 1774899569525,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899569525,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/plan-slice-M007-S06.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "plan-slice",
  "unitId": "M007/S06",
  "startedAt": 1774899890779,
  "updatedAt": 1774899890780,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899890779,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/research-slice-M007-S03.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "research-slice",
  "unitId": "M007/S03",
  "startedAt": 1774897828866,
  "updatedAt": 1774897828867,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774897828866,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/research-slice-M007-S04.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "research-slice",
  "unitId": "M007/S04",
  "startedAt": 1774898798679,
  "updatedAt": 1774898798679,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774898798679,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/research-slice-M007-S05.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "research-slice",
  "unitId": "M007/S05",
  "startedAt": 1774899474371,
  "updatedAt": 1774899474372,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899474371,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/research-slice-M007-S06.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "research-slice",
  "unitId": "M007/S06",
  "startedAt": 1774899761895,
  "updatedAt": 1774899761895,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774899761895,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
15
.gsd/runtime/units/validate-milestone-M007.json
Normal file

@@ -0,0 +1,15 @@
{
  "version": 1,
  "unitType": "validate-milestone",
  "unitId": "M007",
  "startedAt": 1774900161589,
  "updatedAt": 1774900161589,
  "phase": "dispatched",
  "wrapupWarningSent": false,
  "continueHereFired": false,
  "timeoutAt": null,
  "lastProgressAt": 1774900161589,
  "progressCount": 0,
  "lastProgressKind": "dispatch",
  "recoveryAttempts": 0
}
27
CLAUDE.md

@@ -46,30 +46,3 @@ docker logs -f chrysopedia-worker

# View API logs
docker logs -f chrysopedia-api
```

## Remote Host: hal0022 (Whisper Transcription)

- **Host alias:** `hal0022`
- **IP:** 10.0.0.131
- **OS:** Windows (domain-joined to a.xpltd.co)
- **SSH user:** `a\jlightner`
- **SSH key:** `~/.ssh/hal0022_ed25519`
- **Role:** GPU workstation for Whisper transcription of video content

### Connecting

```bash
ssh hal0022
```

SSH config is already set up in `~/.ssh/config` on dev01.

### Content Location on hal0022

Video source files reside at:

```
A:\Education\Artist Streams & Content
```

Note: This is a Windows path. When accessing via SSH, use the appropriate path format for the shell available on hal0022.
320
README.md
Normal file

@@ -0,0 +1,320 @@
# Chrysopedia

> From *chrysopoeia* (alchemical transmutation of base material into gold) + *encyclopedia*.
> Chrysopedia transmutes raw video content into refined, searchable production knowledge.

A self-hosted knowledge extraction system for electronic music production content. Video libraries are transcribed with Whisper, analyzed through a multi-stage LLM pipeline, curated via an admin review workflow, and served through a search-first web UI designed for mid-session retrieval.

---

## Information Flow

Content moves through six stages from raw video to searchable knowledge:

```
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 1 · Transcription                              [Desktop / GPU]   │
│                                                                         │
│  Video files → Whisper large-v3 (CUDA) → JSON transcripts               │
│  Output: timestamped segments with speaker text                         │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │  JSON files (manual or folder watcher)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 2 · Ingestion                                  [API + Watcher]   │
│                                                                         │
│  POST /api/v1/ingest  ←  watcher auto-submits from /watch folder        │
│  • Validate JSON structure                                              │
│  • Compute content hash (SHA-256) for deduplication                     │
│  • Find-or-create Creator from folder name                              │
│  • Upsert SourceVideo (exact filename → content hash → fuzzy match)     │
│  • Bulk-insert TranscriptSegment rows                                   │
│  • Dispatch pipeline to Celery worker                                   │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │  Celery task: run_pipeline(video_id)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 3 · LLM Extraction Pipeline                    [Celery Worker]   │
│                                                                         │
│  Four sequential LLM stages, each with its own prompt template:         │
│                                                                         │
│  3a. Segmentation — Split transcript into semantic topic boundaries     │
│      Model: chat (fast)        Prompt: stage2_segmentation.txt          │
│                                                                         │
│  3b. Extraction — Identify key moments (title, summary, timestamps)     │
│      Model: reasoning (think)  Prompt: stage3_extraction.txt            │
│                                                                         │
│  3c. Classification — Assign content types + extract plugin names       │
│      Model: chat (fast)        Prompt: stage4_classification.txt        │
│                                                                         │
│  3d. Synthesis — Compose technique pages from approved moments          │
│      Model: reasoning (think)  Prompt: stage5_synthesis.txt             │
│                                                                         │
│  Each stage emits PipelineEvent rows (tokens, duration, model, errors)  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │  KeyMoment rows (review_status: pending)
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 4 · Review & Curation                          [Admin UI]        │
│                                                                         │
│  Admin reviews extracted KeyMoments before they become technique pages: │
│  • Approve — moment proceeds to synthesis                               │
│  • Edit — correct title, summary, content type, plugins, then approve   │
│  • Reject — moment is excluded from knowledge base                      │
│  (When REVIEW_MODE=false, moments auto-approve and skip this stage)     │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │  Approved moments → Stage 3d synthesis
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 5 · Knowledge Base                             [Web UI]          │
│                                                                         │
│  TechniquePages — the primary output:                                   │
│  • Structured body sections, signal chains, plugin lists                │
│  • Linked to source KeyMoments with video timestamps                    │
│  • Cross-referenced via RelatedTechniqueLinks                           │
│  • Versioned (snapshots before each re-synthesis)                       │
│  • Organized by topic taxonomy (6 categories from canonical_tags.yaml)  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  STAGE 6 · Search & Retrieval                         [Web UI]          │
│                                                                         │
│  • Semantic search: query → embedding → Qdrant vector similarity        │
│  • Keyword fallback: ILIKE search on title/summary (300ms timeout)      │
│  • Browse by topic hierarchy, creator, or content type                  │
│  • Typeahead search from home page (debounced, top 5 results)           │
└─────────────────────────────────────────────────────────────────────────┘
```
---

## Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│ Desktop (GPU workstation — hal0022)                                      │
│   whisper/transcribe.py → JSON transcripts → copy to /watch folder       │
└────────────────────────────┬─────────────────────────────────────────────┘
                             │
                             ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Docker Compose: xpltd_chrysopedia (ub01)                                 │
│ Network: chrysopedia (172.32.0.0/24)                                     │
│                                                                          │
│  ┌────────────┐  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐    │
│  │ PostgreSQL │  │ Redis       │  │ Qdrant        │  │ Ollama       │    │
│  │ :5433      │  │ broker +    │  │ vector DB     │  │ embeddings   │    │
│  │ 7 entities │  │ cache       │  │ semantic      │  │ nomic-embed  │    │
│  └─────┬──────┘  └──────┬──────┘  └───────┬───────┘  └──────┬───────┘    │
│        │                │                 │                 │            │
│  ┌─────┴────────────────┴─────────────────┴─────────────────┴────────┐   │
│  │                          FastAPI (API)                            │   │
│  │   Ingest · Pipeline control · Review · Search · CRUD · Reports    │   │
│  └──────────────────────────────┬────────────────────────────────────┘   │
│                                 │                                        │
│  ┌──────────────┐  ┌────────────┴───┐  ┌──────────────────────────┐      │
│  │ Watcher      │  │ Celery Worker  │  │ Web UI (React)           │      │
│  │ /watch →     │  │ LLM pipeline   │  │ nginx → :8096            │      │
│  │ auto-ingest  │  │ stages 2-5     │  │ search-first interface   │      │
│  └──────────────┘  └────────────────┘  └──────────────────────────┘      │
└──────────────────────────────────────────────────────────────────────────┘
```

### Services

| Service | Image | Port | Purpose |
|---------|-------|------|---------|
| `chrysopedia-db` | `postgres:16-alpine` | `5433 → 5432` | Primary data store |
| `chrysopedia-redis` | `redis:7-alpine` | — | Celery broker + feature flag cache |
| `chrysopedia-qdrant` | `qdrant/qdrant:v1.13.2` | — | Vector DB for semantic search |
| `chrysopedia-ollama` | `ollama/ollama` | — | Embedding model server (nomic-embed-text) |
| `chrysopedia-api` | `Dockerfile.api` | `8000` | FastAPI REST API |
| `chrysopedia-worker` | `Dockerfile.api` | — | Celery worker (LLM pipeline) |
| `chrysopedia-watcher` | `Dockerfile.api` | — | Folder monitor → auto-ingest |
| `chrysopedia-web` | `Dockerfile.web` | `8096 → 80` | React frontend (nginx) |

### Data Model

| Entity | Purpose |
|--------|---------|
| **Creator** | Artists/producers whose content is indexed |
| **SourceVideo** | Video files processed by the pipeline (with content hash dedup) |
| **TranscriptSegment** | Timestamped text segments from Whisper |
| **KeyMoment** | Discrete insights extracted by LLM analysis |
| **TechniquePage** | Synthesized knowledge pages — the primary output |
| **TechniquePageVersion** | Snapshots before re-synthesis overwrites |
| **RelatedTechniqueLink** | Cross-references between technique pages |
| **Tag** | Hierarchical topic taxonomy |
| **ContentReport** | User-submitted content issues |
| **PipelineEvent** | Structured pipeline execution logs (tokens, timing, errors) |
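The TechniquePageVersion row ("snapshots before re-synthesis overwrites") implies a monotonically increasing version number per page. A sketch of that bookkeeping, with illustrative names rather than the actual ORM API:

```python
def next_version_number(versions: list[dict], page_id: str) -> int:
    """Version number to assign when snapshotting a page before re-synthesis.

    Scans existing snapshots for the page and returns max + 1,
    starting at 1 for a page with no snapshots yet.
    """
    existing = [
        v["version_number"]
        for v in versions
        if v["technique_page_id"] == page_id
    ]
    return max(existing, default=0) + 1
```

The unique index on `(technique_page_id, version_number)` in migration 002 would reject any accidental duplicate produced by a race here.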

---

## Quick Start

### Prerequisites

- Docker ≥ 24.0 and Docker Compose ≥ 2.20
- Python 3.10+ with NVIDIA GPU + CUDA (for Whisper transcription)

### Setup

```bash
# Clone and configure
git clone git@github.com:xpltdco/chrysopedia.git
cd chrysopedia
cp .env.example .env   # edit with real values

# Start the stack
docker compose up -d

# Run database migrations
docker exec chrysopedia-api alembic upgrade head

# Pull the embedding model (first time only)
docker exec chrysopedia-ollama ollama pull nomic-embed-text

# Verify
curl http://localhost:8096/health
```

### Transcribe videos

```bash
cd whisper && pip install -r requirements.txt

# Single file
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

# Batch
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

See [`whisper/README.md`](whisper/README.md) for full transcription docs.

---

## Environment Variables

Copy `.env.example` to `.env`. Key groups:

| Group | Variables | Notes |
|-------|-----------|-------|
| **Database** | `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` | Default user: `chrysopedia` |
| **LLM** | `LLM_API_URL`, `LLM_API_KEY`, `LLM_MODEL` | OpenAI-compatible endpoint |
| **LLM Fallback** | `LLM_FALLBACK_URL`, `LLM_FALLBACK_MODEL` | Automatic failover |
| **Per-Stage Models** | `LLM_STAGE{2-5}_MODEL`, `LLM_STAGE{2-5}_MODALITY` | `chat` for fast stages, `thinking` for reasoning |
| **Embedding** | `EMBEDDING_API_URL`, `EMBEDDING_MODEL` | Ollama nomic-embed-text |
| **Vector DB** | `QDRANT_URL`, `QDRANT_COLLECTION` | Container-internal |
| **Features** | `REVIEW_MODE`, `DEBUG_MODE` | Review gate + LLM I/O capture |
| **Storage** | `TRANSCRIPT_STORAGE_PATH`, `VIDEO_METADATA_PATH` | Container bind mounts |
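The per-stage override convention above can be resolved with a small helper. This is a sketch under the assumption stated in `.env.example`: an unset `LLM_STAGE{n}_MODEL` falls back to the global `LLM_MODEL`, and modality defaults to `chat`.

```python
import os


def stage_model(stage: int) -> tuple[str, str]:
    """Return (model, modality) for a pipeline stage (2-5).

    Falls back to the global LLM_MODEL and "chat" modality when the
    per-stage variables are unset.
    """
    model = os.getenv(f"LLM_STAGE{stage}_MODEL") or os.getenv("LLM_MODEL", "")
    modality = os.getenv(f"LLM_STAGE{stage}_MODALITY", "chat")
    return model, modality
```

With the example `.env`, stage 3 resolves to the thinking model while stage 2 inherits the global chat model.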

---

## API Endpoints

### Public

| Method | Path | Description |
|--------|------|-------------|
| GET | `/health` | Health check (DB connectivity) |
| GET | `/api/v1/search?q=&scope=&limit=` | Semantic + keyword search |
| GET | `/api/v1/techniques` | List technique pages |
| GET | `/api/v1/techniques/{slug}` | Technique detail + key moments |
| GET | `/api/v1/techniques/{slug}/versions` | Version history |
| GET | `/api/v1/creators` | List creators (sort, genre filter) |
| GET | `/api/v1/creators/{slug}` | Creator detail |
| GET | `/api/v1/topics` | Topic hierarchy with counts |
| GET | `/api/v1/videos` | List source videos |
| POST | `/api/v1/reports` | Submit content report |
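The `/api/v1/search` behavior (semantic search first, ILIKE keyword fallback) can be sketched roughly like this. Where exactly the 300 ms budget applies is an assumption here, and both search callables are hypothetical stand-ins for the real `search_service` functions:

```python
import asyncio


async def search(q: str, semantic, keyword, limit: int = 10) -> list:
    """Try vector search first; fall back to keyword search on timeout or error."""
    try:
        # Assumed: the 300 ms budget bounds the semantic (Qdrant) path.
        return await asyncio.wait_for(semantic(q, limit), timeout=0.3)
    except (asyncio.TimeoutError, ConnectionError):
        # Keyword fallback: ILIKE over title/summary in Postgres.
        return await keyword(q, limit)
```

The fallback keeps the endpoint responsive when the embedding or vector services are slow or down, at the cost of weaker relevance ranking.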

### Admin

| Method | Path | Description |
|--------|------|-------------|
| GET | `/api/v1/review/queue` | Review queue (status filter) |
| POST | `/api/v1/review/moments/{id}/approve` | Approve key moment |
| POST | `/api/v1/review/moments/{id}/reject` | Reject key moment |
| PUT | `/api/v1/review/moments/{id}` | Edit key moment |
| POST | `/api/v1/admin/pipeline/trigger/{video_id}` | Trigger/retrigger pipeline |
| GET | `/api/v1/admin/pipeline/events/{video_id}` | Pipeline event log |
| GET | `/api/v1/admin/pipeline/token-summary/{video_id}` | Token usage by stage |
| GET | `/api/v1/admin/pipeline/worker-status` | Celery worker status |
| PUT | `/api/v1/admin/pipeline/debug-mode` | Toggle debug mode |

### Ingest

| Method | Path | Description |
|--------|------|-------------|
| POST | `/api/v1/ingest` | Upload Whisper JSON transcript |
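A minimal client call to the ingest endpoint might look like the following. The payload shape (a Whisper JSON transcript posted as the request body) is inferred from the ingestion description, not from a published schema, so treat this as a sketch:

```python
import json
import urllib.request


def build_ingest_request(
    transcript: dict, api_base: str = "http://ub01:8096"
) -> urllib.request.Request:
    """Build (but do not send) a POST of a Whisper JSON transcript."""
    body = json.dumps(transcript).encode("utf-8")
    return urllib.request.Request(
        f"{api_base}/api/v1/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it: urllib.request.urlopen(build_ingest_request(payload))
```

In normal operation the watcher container performs this submission automatically for files dropped into `/watch`.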

---

## Development

```bash
# Local backend (with Docker services)
python -m venv .venv && source .venv/bin/activate
pip install -r backend/requirements.txt
docker compose up -d chrysopedia-db chrysopedia-redis
alembic upgrade head
cd backend && uvicorn main:app --reload --host 0.0.0.0 --port 8000

# Database migrations
alembic revision --autogenerate -m "describe_change"
alembic upgrade head
```

### Project Structure

```
chrysopedia/
├── backend/                  # FastAPI application
│   ├── main.py               # Entry point, middleware, router mounting
│   ├── config.py             # Pydantic Settings (all env vars)
│   ├── models.py             # SQLAlchemy ORM models
│   ├── schemas.py            # Pydantic request/response schemas
│   ├── worker.py             # Celery app configuration
│   ├── watcher.py            # Transcript folder watcher service
│   ├── search_service.py     # Semantic search + keyword fallback
│   ├── routers/              # API endpoint handlers
│   ├── pipeline/             # LLM pipeline stages + clients
│   │   ├── stages.py         # Stages 2-5 (Celery tasks)
│   │   ├── llm_client.py     # OpenAI-compatible LLM client
│   │   ├── embedding_client.py
│   │   └── qdrant_client.py
│   └── tests/
├── frontend/                 # React + TypeScript + Vite
│   └── src/
│       ├── pages/            # Home, Search, Technique, Creator, Topic, Admin
│       ├── components/       # Shared UI components
│       └── api/              # Typed API clients
├── whisper/                  # Desktop transcription (Whisper large-v3)
├── docker/                   # Dockerfiles + nginx config
├── alembic/                  # Database migrations
├── config/                   # canonical_tags.yaml (topic taxonomy)
├── prompts/                  # LLM prompt templates (editable at runtime)
├── docker-compose.yml
└── .env.example
```

---

## Deployment (ub01)

```bash
ssh ub01
cd /vmPool/r/repos/xpltdco/chrysopedia
git pull && docker compose build && docker compose up -d
```

| Resource | Location |
|----------|----------|
| Web UI | `http://ub01:8096` |
| API | `http://ub01:8096/health` |
| PostgreSQL | `ub01:5433` |
| Compose config | `/vmPool/r/compose/xpltd_chrysopedia/docker-compose.yml` |
| Persistent data | `/vmPool/r/services/chrysopedia_*` |

XPLTD conventions: `xpltd_chrysopedia` project name, dedicated bridge network (`172.32.0.0/24`), bind mounts under `/vmPool/r/services/`, PostgreSQL on port `5433`.
37
alembic.ini
Normal file

@@ -0,0 +1,37 @@
# Chrysopedia — Alembic configuration
[alembic]
script_location = alembic
sqlalchemy.url = postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia

[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
72
alembic/env.py
Normal file

@@ -0,0 +1,72 @@
"""Alembic env.py — async migration runner for Chrysopedia."""

import asyncio
import os
import sys
from logging.config import fileConfig

from alembic import context
from sqlalchemy import pool
from sqlalchemy.ext.asyncio import async_engine_from_config

# Ensure the backend package is importable
# When running locally: alembic/ sits beside backend/, so ../backend works
# When running in Docker: alembic/ is inside /app/ alongside the backend modules
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "..", "backend"))
sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from database import Base  # noqa: E402
import models  # noqa: E402, F401 — registers all tables on Base.metadata

config = context.config

if config.config_file_name is not None:
    fileConfig(config.config_file_name)

target_metadata = Base.metadata

# Allow DATABASE_URL env var to override alembic.ini
url_override = os.getenv("DATABASE_URL")
if url_override:
    config.set_main_option("sqlalchemy.url", url_override)


def run_migrations_offline() -> None:
    """Run migrations in 'offline' mode — emit SQL to stdout."""
    url = config.get_main_option("sqlalchemy.url")
    context.configure(
        url=url,
        target_metadata=target_metadata,
        literal_binds=True,
        dialect_opts={"paramstyle": "named"},
    )
    with context.begin_transaction():
        context.run_migrations()


def do_run_migrations(connection):
    context.configure(connection=connection, target_metadata=target_metadata)
    with context.begin_transaction():
        context.run_migrations()


async def run_async_migrations() -> None:
    """Run migrations in 'online' mode with an async engine."""
    connectable = async_engine_from_config(
        config.get_section(config.config_ini_section, {}),
        prefix="sqlalchemy.",
        poolclass=pool.NullPool,
    )
    async with connectable.connect() as connection:
        await connection.run_sync(do_run_migrations)
    await connectable.dispose()


def run_migrations_online() -> None:
    asyncio.run(run_async_migrations())


if context.is_offline_mode():
    run_migrations_offline()
else:
    run_migrations_online()
25
alembic/script.py.mako
Normal file

@@ -0,0 +1,25 @@
"""${message}

Revision ID: ${up_revision}
Revises: ${down_revision | comma,n}
Create Date: ${create_date}
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
${imports if imports else ""}

# revision identifiers, used by Alembic.
revision: str = ${repr(up_revision)}
down_revision: Union[str, None] = ${repr(down_revision)}
branch_labels: Union[str, Sequence[str], None] = ${repr(branch_labels)}
depends_on: Union[str, Sequence[str], None] = ${repr(depends_on)}


def upgrade() -> None:
    ${upgrades if upgrades else "pass"}


def downgrade() -> None:
    ${downgrades if downgrades else "pass"}
171
alembic/versions/001_initial.py
Normal file

@@ -0,0 +1,171 @@
"""initial schema — 7 core entities

Revision ID: 001_initial
Revises:
Create Date: 2026-03-29
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID

# revision identifiers, used by Alembic.
revision: str = "001_initial"
down_revision: Union[str, None] = None
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # ── Enum types ───────────────────────────────────────────────────────
    content_type = sa.Enum(
        "tutorial", "livestream", "breakdown", "short_form",
        name="content_type",
    )
    processing_status = sa.Enum(
        "pending", "transcribed", "extracted", "reviewed", "published",
        name="processing_status",
    )
    key_moment_content_type = sa.Enum(
        "technique", "settings", "reasoning", "workflow",
        name="key_moment_content_type",
    )
    review_status = sa.Enum(
        "pending", "approved", "edited", "rejected",
        name="review_status",
    )
    source_quality = sa.Enum(
        "structured", "mixed", "unstructured",
        name="source_quality",
    )
    page_review_status = sa.Enum(
        "draft", "reviewed", "published",
        name="page_review_status",
    )
    relationship_type = sa.Enum(
        "same_technique_other_creator", "same_creator_adjacent", "general_cross_reference",
        name="relationship_type",
    )

    # ── creators ─────────────────────────────────────────────────────────
    op.create_table(
        "creators",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("name", sa.String(255), nullable=False),
        sa.Column("slug", sa.String(255), nullable=False, unique=True),
        sa.Column("genres", ARRAY(sa.String), nullable=True),
        sa.Column("folder_name", sa.String(255), nullable=False),
        sa.Column("view_count", sa.Integer, nullable=False, server_default="0"),
        sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
        sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
    )

    # ── source_videos ────────────────────────────────────────────────────
    op.create_table(
        "source_videos",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("creator_id", UUID(as_uuid=True), sa.ForeignKey("creators.id", ondelete="CASCADE"), nullable=False),
        sa.Column("filename", sa.String(500), nullable=False),
        sa.Column("file_path", sa.String(1000), nullable=False),
        sa.Column("duration_seconds", sa.Integer, nullable=True),
        sa.Column("content_type", content_type, nullable=False),
        sa.Column("transcript_path", sa.String(1000), nullable=True),
        sa.Column("processing_status", processing_status, nullable=False, server_default="pending"),
        sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
        sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
    )
    op.create_index("ix_source_videos_creator_id", "source_videos", ["creator_id"])

    # ── transcript_segments ──────────────────────────────────────────────
    op.create_table(
        "transcript_segments",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("source_video_id", UUID(as_uuid=True), sa.ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False),
        sa.Column("start_time", sa.Float, nullable=False),
        sa.Column("end_time", sa.Float, nullable=False),
        sa.Column("text", sa.Text, nullable=False),
        sa.Column("segment_index", sa.Integer, nullable=False),
        sa.Column("topic_label", sa.String(255), nullable=True),
    )
    op.create_index("ix_transcript_segments_video_id", "transcript_segments", ["source_video_id"])

    # ── technique_pages (must come before key_moments due to FK) ─────────
    op.create_table(
        "technique_pages",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("creator_id", UUID(as_uuid=True), sa.ForeignKey("creators.id", ondelete="CASCADE"), nullable=False),
        sa.Column("title", sa.String(500), nullable=False),
        sa.Column("slug", sa.String(500), nullable=False, unique=True),
        sa.Column("topic_category", sa.String(255), nullable=False),
        sa.Column("topic_tags", ARRAY(sa.String), nullable=True),
        sa.Column("summary", sa.Text, nullable=True),
        sa.Column("body_sections", JSONB, nullable=True),
        sa.Column("signal_chains", JSONB, nullable=True),
        sa.Column("plugins", ARRAY(sa.String), nullable=True),
        sa.Column("source_quality", source_quality, nullable=True),
        sa.Column("view_count", sa.Integer, nullable=False, server_default="0"),
        sa.Column("review_status", page_review_status, nullable=False, server_default="draft"),
        sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
        sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
    )
    op.create_index("ix_technique_pages_creator_id", "technique_pages", ["creator_id"])
    op.create_index("ix_technique_pages_topic_category", "technique_pages", ["topic_category"])

    # ── key_moments ──────────────────────────────────────────────────────
    op.create_table(
        "key_moments",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("source_video_id", UUID(as_uuid=True), sa.ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False),
        sa.Column("technique_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="SET NULL"), nullable=True),
        sa.Column("title", sa.String(500), nullable=False),
        sa.Column("summary", sa.Text, nullable=False),
        sa.Column("start_time", sa.Float, nullable=False),
        sa.Column("end_time", sa.Float, nullable=False),
        sa.Column("content_type", key_moment_content_type, nullable=False),
        sa.Column("plugins", ARRAY(sa.String), nullable=True),
        sa.Column("review_status", review_status, nullable=False, server_default="pending"),
        sa.Column("raw_transcript", sa.Text, nullable=True),
        sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
        sa.Column("updated_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
    )
    op.create_index("ix_key_moments_source_video_id", "key_moments", ["source_video_id"])
    op.create_index("ix_key_moments_technique_page_id", "key_moments", ["technique_page_id"])

    # ── related_technique_links ──────────────────────────────────────────
    op.create_table(
        "related_technique_links",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("source_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
        sa.Column("target_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
        sa.Column("relationship", relationship_type, nullable=False),
        sa.UniqueConstraint("source_page_id", "target_page_id", "relationship", name="uq_technique_link"),
    )

    # ── tags ─────────────────────────────────────────────────────────────
    op.create_table(
        "tags",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("name", sa.String(255), nullable=False, unique=True),
        sa.Column("category", sa.String(255), nullable=False),
        sa.Column("aliases", ARRAY(sa.String), nullable=True),
    )
    op.create_index("ix_tags_category", "tags", ["category"])


def downgrade() -> None:
    op.drop_table("tags")
    op.drop_table("related_technique_links")
    op.drop_table("key_moments")
    op.drop_table("technique_pages")
    op.drop_table("transcript_segments")
    op.drop_table("source_videos")
    op.drop_table("creators")

    # Drop enum types
    for name in [
        "relationship_type", "page_review_status", "source_quality",
        "review_status", "key_moment_content_type", "processing_status",
        "content_type",
    ]:
        sa.Enum(name=name).drop(op.get_bind(), checkfirst=True)
39
alembic/versions/002_technique_page_versions.py
Normal file
@@ -0,0 +1,39 @@
"""technique_page_versions table for article versioning

Revision ID: 002_technique_page_versions
Revises: 001_initial
Create Date: 2026-03-30
"""
from typing import Sequence, Union

from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import JSONB, UUID

# revision identifiers, used by Alembic.
revision: str = "002_technique_page_versions"
down_revision: Union[str, None] = "001_initial"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    op.create_table(
        "technique_page_versions",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.text("gen_random_uuid()")),
        sa.Column("technique_page_id", UUID(as_uuid=True), sa.ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False),
        sa.Column("version_number", sa.Integer, nullable=False),
        sa.Column("content_snapshot", JSONB, nullable=False),
        sa.Column("pipeline_metadata", JSONB, nullable=True),
        sa.Column("created_at", sa.DateTime(), nullable=False, server_default=sa.func.now()),
    )
    op.create_index(
        "ix_technique_page_versions_page_version",
        "technique_page_versions",
        ["technique_page_id", "version_number"],
        unique=True,
    )


def downgrade() -> None:
    op.drop_table("technique_page_versions")
47
alembic/versions/003_content_reports.py
Normal file
@@ -0,0 +1,47 @@
"""Create content_reports table.

Revision ID: 003_content_reports
Revises: 002_technique_page_versions
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID

revision = "003_content_reports"
down_revision = "002_technique_page_versions"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.create_table(
        "content_reports",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.func.gen_random_uuid()),
        sa.Column("content_type", sa.String(50), nullable=False),
        sa.Column("content_id", UUID(as_uuid=True), nullable=True),
        sa.Column("content_title", sa.String(500), nullable=True),
        sa.Column("report_type", sa.Enum(
            "inaccurate", "missing_info", "wrong_attribution", "formatting", "other",
            name="report_type", create_constraint=True,
        ), nullable=False),
        sa.Column("description", sa.Text(), nullable=False),
        sa.Column("status", sa.Enum(
            "open", "acknowledged", "resolved", "dismissed",
            name="report_status", create_constraint=True,
        ), nullable=False, server_default="open"),
        sa.Column("admin_notes", sa.Text(), nullable=True),
        sa.Column("page_url", sa.String(1000), nullable=True),
        sa.Column("created_at", sa.DateTime(), server_default=sa.func.now(), nullable=False),
        sa.Column("resolved_at", sa.DateTime(), nullable=True),
    )

    op.create_index("ix_content_reports_status_created", "content_reports", ["status", "created_at"])
    op.create_index("ix_content_reports_content", "content_reports", ["content_type", "content_id"])


def downgrade() -> None:
    op.drop_index("ix_content_reports_content")
    op.drop_index("ix_content_reports_status_created")
    op.drop_table("content_reports")
    sa.Enum(name="report_status").drop(op.get_bind(), checkfirst=True)
    sa.Enum(name="report_type").drop(op.get_bind(), checkfirst=True)
37
alembic/versions/004_pipeline_events.py
Normal file
@@ -0,0 +1,37 @@
"""Create pipeline_events table.

Revision ID: 004_pipeline_events
Revises: 003_content_reports
"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import UUID, JSONB

revision = "004_pipeline_events"
down_revision = "003_content_reports"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.create_table(
        "pipeline_events",
        sa.Column("id", UUID(as_uuid=True), primary_key=True, server_default=sa.func.gen_random_uuid()),
        sa.Column("video_id", UUID(as_uuid=True), nullable=False, index=True),
        sa.Column("stage", sa.String(50), nullable=False),
        sa.Column("event_type", sa.String(30), nullable=False),
        sa.Column("prompt_tokens", sa.Integer(), nullable=True),
        sa.Column("completion_tokens", sa.Integer(), nullable=True),
        sa.Column("total_tokens", sa.Integer(), nullable=True),
        sa.Column("model", sa.String(100), nullable=True),
        sa.Column("duration_ms", sa.Integer(), nullable=True),
        sa.Column("payload", JSONB(), nullable=True),
        sa.Column("created_at", sa.DateTime(), server_default=sa.func.now(), nullable=False),
    )
    # Composite index for event log queries (video + newest first)
    op.create_index("ix_pipeline_events_video_created", "pipeline_events", ["video_id", "created_at"])


def downgrade() -> None:
    op.drop_index("ix_pipeline_events_video_created")
    op.drop_table("pipeline_events")
29
alembic/versions/005_content_hash.py
Normal file
@@ -0,0 +1,29 @@
"""Add content_hash to source_videos for duplicate detection.

Revision ID: 005_content_hash
Revises: 004_pipeline_events
"""
from alembic import op
import sqlalchemy as sa

revision = "005_content_hash"
down_revision = "004_pipeline_events"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.add_column(
        "source_videos",
        sa.Column("content_hash", sa.String(64), nullable=True),
    )
    op.create_index(
        "ix_source_videos_content_hash",
        "source_videos",
        ["content_hash"],
    )


def downgrade() -> None:
    op.drop_index("ix_source_videos_content_hash")
    op.drop_column("source_videos", "content_hash")
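The 64-character column length suggests a SHA-256 hex digest. A minimal sketch of how a duplicate-detection hash could be computed for a video file; the helper name and chunked-read approach are illustrative, not taken from the restored tree:

```python
import hashlib


def compute_content_hash(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file in 1 MiB chunks and return its SHA-256 hex digest (64 chars)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Streaming keeps memory flat even for multi-gigabyte videos, and the hex digest fits the `String(64)` column exactly.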
33
alembic/versions/006_debug_columns.py
Normal file
@@ -0,0 +1,33 @@
"""Add debug LLM I/O capture columns to pipeline_events.

Revision ID: 006_debug_columns
Revises: 005_content_hash
"""
from alembic import op
import sqlalchemy as sa

revision = "006_debug_columns"
down_revision = "005_content_hash"
branch_labels = None
depends_on = None


def upgrade() -> None:
    op.add_column(
        "pipeline_events",
        sa.Column("system_prompt_text", sa.Text(), nullable=True),
    )
    op.add_column(
        "pipeline_events",
        sa.Column("user_prompt_text", sa.Text(), nullable=True),
    )
    op.add_column(
        "pipeline_events",
        sa.Column("response_text", sa.Text(), nullable=True),
    )


def downgrade() -> None:
    op.drop_column("pipeline_events", "response_text")
    op.drop_column("pipeline_events", "user_prompt_text")
    op.drop_column("pipeline_events", "system_prompt_text")
85
backend/config.py
Normal file
@@ -0,0 +1,85 @@
"""Application configuration loaded from environment variables."""

from functools import lru_cache

from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    """Chrysopedia API settings.

    Values are loaded from environment variables (or .env file via
    pydantic-settings' dotenv support).
    """

    # Database
    database_url: str = "postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia"

    # Redis
    redis_url: str = "redis://localhost:6379/0"

    # Application
    app_env: str = "development"
    app_log_level: str = "info"
    app_secret_key: str = "changeme-generate-a-real-secret"

    # CORS
    cors_origins: list[str] = ["*"]

    # LLM endpoint (OpenAI-compatible)
    llm_api_url: str = "http://localhost:11434/v1"
    llm_api_key: str = "sk-placeholder"
    llm_model: str = "fyn-llm-agent-chat"
    llm_fallback_url: str = "http://localhost:11434/v1"
    llm_fallback_model: str = "fyn-llm-agent-chat"

    # Per-stage model overrides (optional — falls back to llm_model / "chat")
    llm_stage2_model: str | None = "fyn-llm-agent-chat"  # segmentation — mechanical, fast chat
    llm_stage2_modality: str = "chat"
    llm_stage3_model: str | None = "fyn-llm-agent-think"  # extraction — reasoning
    llm_stage3_modality: str = "thinking"
    llm_stage4_model: str | None = "fyn-llm-agent-chat"  # classification — mechanical, fast chat
    llm_stage4_modality: str = "chat"
    llm_stage5_model: str | None = "fyn-llm-agent-think"  # synthesis — reasoning
    llm_stage5_modality: str = "thinking"

    # Dynamic token estimation — each stage calculates max_tokens from input size
    llm_max_tokens_hard_limit: int = 32768  # Hard ceiling for dynamic estimator
    llm_max_tokens: int = 65536  # Fallback when no estimate is provided

    # Embedding endpoint
    embedding_api_url: str = "http://localhost:11434/v1"
    embedding_model: str = "nomic-embed-text"
    embedding_dimensions: int = 768

    # Qdrant
    qdrant_url: str = "http://localhost:6333"
    qdrant_collection: str = "chrysopedia"

    # Prompt templates
    prompts_path: str = "./prompts"

    # Review mode — when True, extracted moments go to review queue before publishing
    review_mode: bool = True

    # Debug mode — when True, pipeline captures full LLM prompts and responses
    debug_mode: bool = False

    # File storage
    transcript_storage_path: str = "/data/transcripts"
    video_metadata_path: str = "/data/video_meta"

    # Git commit SHA (set at Docker build time or via env var)
    git_commit_sha: str = "unknown"

    model_config = {
        "env_file": ".env",
        "env_file_encoding": "utf-8",
        "case_sensitive": False,
    }


@lru_cache
def get_settings() -> Settings:
    """Return cached application settings (singleton)."""
    return Settings()
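The per-stage fields above follow a `llm_stageN_model` / `llm_stageN_modality` naming pattern, so stage dispatch can be done generically. A hypothetical helper sketching that lookup; this function is not part of the restored tree, only an illustration of the fallback semantics described in the comment ("falls back to llm_model / 'chat'"):

```python
def resolve_stage_model(settings, stage: int) -> tuple[str, str]:
    """Return (model, modality) for a pipeline stage.

    Falls back to the global llm_model and "chat" modality when no
    per-stage override is configured.
    """
    model = getattr(settings, f"llm_stage{stage}_model", None) or settings.llm_model
    modality = getattr(settings, f"llm_stage{stage}_modality", "chat")
    return model, modality
```

With the defaults shown in config.py, stage 3 would resolve to the thinking model while an unconfigured stage falls back to `llm_model`.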
26
backend/database.py
Normal file
@@ -0,0 +1,26 @@
"""Database engine, session factory, and declarative base for Chrysopedia."""

import os

from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
from sqlalchemy.orm import DeclarativeBase

DATABASE_URL = os.getenv(
    "DATABASE_URL",
    "postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia",
)

engine = create_async_engine(DATABASE_URL, echo=False, pool_pre_ping=True)

async_session = async_sessionmaker(engine, class_=AsyncSession, expire_on_commit=False)


class Base(DeclarativeBase):
    """Declarative base for all ORM models."""
    pass


async def get_session() -> AsyncSession:  # type: ignore[misc]
    """FastAPI dependency that yields an async DB session."""
    async with async_session() as session:
        yield session
95
backend/main.py
Normal file
@@ -0,0 +1,95 @@
"""Chrysopedia API — Knowledge extraction and retrieval system.

Entry point for the FastAPI application. Configures middleware,
structured logging, and mounts versioned API routers.
"""

import logging
import sys
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from config import get_settings
from routers import creators, health, ingest, pipeline, reports, review, search, techniques, topics, videos


def _setup_logging() -> None:
    """Configure structured logging to stdout."""
    settings = get_settings()
    level = getattr(logging, settings.app_log_level.upper(), logging.INFO)

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
            datefmt="%Y-%m-%dT%H:%M:%S",
        )
    )

    root = logging.getLogger()
    root.setLevel(level)
    # Avoid duplicate handlers on reload
    root.handlers.clear()
    root.addHandler(handler)

    # Quiet noisy libraries
    logging.getLogger("uvicorn.access").setLevel(logging.WARNING)
    logging.getLogger("sqlalchemy.engine").setLevel(logging.WARNING)


@asynccontextmanager
async def lifespan(app: FastAPI):  # noqa: ARG001
    """Application lifespan: setup on startup, teardown on shutdown."""
    _setup_logging()
    logger = logging.getLogger("chrysopedia")
    settings = get_settings()
    logger.info(
        "Chrysopedia API starting (env=%s, log_level=%s)",
        settings.app_env,
        settings.app_log_level,
    )
    yield
    logger.info("Chrysopedia API shutting down")


app = FastAPI(
    title="Chrysopedia API",
    description="Knowledge extraction and retrieval for music production content",
    version="0.1.0",
    lifespan=lifespan,
)

# ── Middleware ────────────────────────────────────────────────────────────────

settings = get_settings()
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.cors_origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# ── Routers ──────────────────────────────────────────────────────────────────

# Root-level health (no prefix)
app.include_router(health.router)

# Versioned API
app.include_router(creators.router, prefix="/api/v1")
app.include_router(ingest.router, prefix="/api/v1")
app.include_router(pipeline.router, prefix="/api/v1")
app.include_router(review.router, prefix="/api/v1")
app.include_router(reports.router, prefix="/api/v1")
app.include_router(search.router, prefix="/api/v1")
app.include_router(techniques.router, prefix="/api/v1")
app.include_router(topics.router, prefix="/api/v1")
app.include_router(videos.router, prefix="/api/v1")


@app.get("/api/v1/health")
async def api_health():
    """Lightweight version-prefixed health endpoint (no DB check)."""
    return {"status": "ok", "version": "0.1.0"}
419
backend/models.py
Normal file
@@ -0,0 +1,419 @@
"""SQLAlchemy ORM models for the Chrysopedia knowledge base.

Seven entities matching chrysopedia-spec.md §6.1:
    Creator, SourceVideo, TranscriptSegment, KeyMoment,
    TechniquePage, RelatedTechniqueLink, Tag
"""

from __future__ import annotations

import enum
import uuid
from datetime import datetime, timezone

from sqlalchemy import (
    Enum,
    Float,
    ForeignKey,
    Integer,
    String,
    Text,
    UniqueConstraint,
    func,
)
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import Mapped, mapped_column
from sqlalchemy.orm import relationship as sa_relationship

from database import Base


# ── Enums ────────────────────────────────────────────────────────────────────

class ContentType(str, enum.Enum):
    """Source video content type."""
    tutorial = "tutorial"
    livestream = "livestream"
    breakdown = "breakdown"
    short_form = "short_form"


class ProcessingStatus(str, enum.Enum):
    """Pipeline processing status for a source video."""
    pending = "pending"
    transcribed = "transcribed"
    extracted = "extracted"
    reviewed = "reviewed"
    published = "published"


class KeyMomentContentType(str, enum.Enum):
    """Content classification for a key moment."""
    technique = "technique"
    settings = "settings"
    reasoning = "reasoning"
    workflow = "workflow"


class ReviewStatus(str, enum.Enum):
    """Human review status for key moments."""
    pending = "pending"
    approved = "approved"
    edited = "edited"
    rejected = "rejected"


class SourceQuality(str, enum.Enum):
    """Derived source quality for technique pages."""
    structured = "structured"
    mixed = "mixed"
    unstructured = "unstructured"


class PageReviewStatus(str, enum.Enum):
    """Review lifecycle for technique pages."""
    draft = "draft"
    reviewed = "reviewed"
    published = "published"


class RelationshipType(str, enum.Enum):
    """Types of links between technique pages."""
    same_technique_other_creator = "same_technique_other_creator"
    same_creator_adjacent = "same_creator_adjacent"
    general_cross_reference = "general_cross_reference"


# ── Helpers ──────────────────────────────────────────────────────────────────

def _uuid_pk() -> Mapped[uuid.UUID]:
    return mapped_column(
        UUID(as_uuid=True),
        primary_key=True,
        default=uuid.uuid4,
        server_default=func.gen_random_uuid(),
    )


def _now() -> datetime:
    """Return current UTC time as a naive datetime (no tzinfo).

    PostgreSQL TIMESTAMP WITHOUT TIME ZONE columns require naive datetimes.
    asyncpg rejects timezone-aware datetimes for such columns.
    """
    return datetime.now(timezone.utc).replace(tzinfo=None)

# ── Models ───────────────────────────────────────────────────────────────────

class Creator(Base):
    __tablename__ = "creators"

    id: Mapped[uuid.UUID] = _uuid_pk()
    name: Mapped[str] = mapped_column(String(255), nullable=False)
    slug: Mapped[str] = mapped_column(String(255), unique=True, nullable=False)
    genres: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
    folder_name: Mapped[str] = mapped_column(String(255), nullable=False)
    view_count: Mapped[int] = mapped_column(Integer, default=0, server_default="0")
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now(), onupdate=_now
    )

    # relationships
    videos: Mapped[list[SourceVideo]] = sa_relationship(back_populates="creator")
    technique_pages: Mapped[list[TechniquePage]] = sa_relationship(back_populates="creator")


class SourceVideo(Base):
    __tablename__ = "source_videos"

    id: Mapped[uuid.UUID] = _uuid_pk()
    creator_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("creators.id", ondelete="CASCADE"), nullable=False
    )
    filename: Mapped[str] = mapped_column(String(500), nullable=False)
    file_path: Mapped[str] = mapped_column(String(1000), nullable=False)
    duration_seconds: Mapped[int] = mapped_column(Integer, nullable=True)
    content_type: Mapped[ContentType] = mapped_column(
        Enum(ContentType, name="content_type", create_constraint=True),
        nullable=False,
    )
    transcript_path: Mapped[str | None] = mapped_column(String(1000), nullable=True)
    content_hash: Mapped[str | None] = mapped_column(String(64), nullable=True, index=True)
    processing_status: Mapped[ProcessingStatus] = mapped_column(
        Enum(ProcessingStatus, name="processing_status", create_constraint=True),
        default=ProcessingStatus.pending,
        server_default="pending",
    )
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now(), onupdate=_now
    )

    # relationships
    creator: Mapped[Creator] = sa_relationship(back_populates="videos")
    segments: Mapped[list[TranscriptSegment]] = sa_relationship(back_populates="source_video")
    key_moments: Mapped[list[KeyMoment]] = sa_relationship(back_populates="source_video")


class TranscriptSegment(Base):
    __tablename__ = "transcript_segments"

    id: Mapped[uuid.UUID] = _uuid_pk()
    source_video_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False
    )
    start_time: Mapped[float] = mapped_column(Float, nullable=False)
    end_time: Mapped[float] = mapped_column(Float, nullable=False)
    text: Mapped[str] = mapped_column(Text, nullable=False)
    segment_index: Mapped[int] = mapped_column(Integer, nullable=False)
    topic_label: Mapped[str | None] = mapped_column(String(255), nullable=True)

    # relationships
    source_video: Mapped[SourceVideo] = sa_relationship(back_populates="segments")


class KeyMoment(Base):
    __tablename__ = "key_moments"

    id: Mapped[uuid.UUID] = _uuid_pk()
    source_video_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("source_videos.id", ondelete="CASCADE"), nullable=False
    )
    technique_page_id: Mapped[uuid.UUID | None] = mapped_column(
        ForeignKey("technique_pages.id", ondelete="SET NULL"), nullable=True
    )
    title: Mapped[str] = mapped_column(String(500), nullable=False)
    summary: Mapped[str] = mapped_column(Text, nullable=False)
    start_time: Mapped[float] = mapped_column(Float, nullable=False)
    end_time: Mapped[float] = mapped_column(Float, nullable=False)
    content_type: Mapped[KeyMomentContentType] = mapped_column(
        Enum(KeyMomentContentType, name="key_moment_content_type", create_constraint=True),
        nullable=False,
    )
    plugins: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
    review_status: Mapped[ReviewStatus] = mapped_column(
        Enum(ReviewStatus, name="review_status", create_constraint=True),
        default=ReviewStatus.pending,
        server_default="pending",
    )
    raw_transcript: Mapped[str | None] = mapped_column(Text, nullable=True)
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now(), onupdate=_now
    )

    # relationships
    source_video: Mapped[SourceVideo] = sa_relationship(back_populates="key_moments")
    technique_page: Mapped[TechniquePage | None] = sa_relationship(
        back_populates="key_moments", foreign_keys=[technique_page_id]
    )


class TechniquePage(Base):
    __tablename__ = "technique_pages"

    id: Mapped[uuid.UUID] = _uuid_pk()
    creator_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("creators.id", ondelete="CASCADE"), nullable=False
    )
    title: Mapped[str] = mapped_column(String(500), nullable=False)
    slug: Mapped[str] = mapped_column(String(500), unique=True, nullable=False)
    topic_category: Mapped[str] = mapped_column(String(255), nullable=False)
    topic_tags: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
    summary: Mapped[str | None] = mapped_column(Text, nullable=True)
    body_sections: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    signal_chains: Mapped[list | None] = mapped_column(JSONB, nullable=True)
    plugins: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)
    source_quality: Mapped[SourceQuality | None] = mapped_column(
        Enum(SourceQuality, name="source_quality", create_constraint=True),
        nullable=True,
    )
    view_count: Mapped[int] = mapped_column(Integer, default=0, server_default="0")
    review_status: Mapped[PageReviewStatus] = mapped_column(
        Enum(PageReviewStatus, name="page_review_status", create_constraint=True),
        default=PageReviewStatus.draft,
        server_default="draft",
    )
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )
    updated_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now(), onupdate=_now
    )

    # relationships
    creator: Mapped[Creator] = sa_relationship(back_populates="technique_pages")
    key_moments: Mapped[list[KeyMoment]] = sa_relationship(
        back_populates="technique_page", foreign_keys=[KeyMoment.technique_page_id]
    )
    versions: Mapped[list[TechniquePageVersion]] = sa_relationship(
        back_populates="technique_page", order_by="TechniquePageVersion.version_number"
    )
    outgoing_links: Mapped[list[RelatedTechniqueLink]] = sa_relationship(
        foreign_keys="RelatedTechniqueLink.source_page_id", back_populates="source_page"
    )
    incoming_links: Mapped[list[RelatedTechniqueLink]] = sa_relationship(
        foreign_keys="RelatedTechniqueLink.target_page_id", back_populates="target_page"
    )

class RelatedTechniqueLink(Base):
    __tablename__ = "related_technique_links"
    __table_args__ = (
        UniqueConstraint("source_page_id", "target_page_id", "relationship", name="uq_technique_link"),
    )

    id: Mapped[uuid.UUID] = _uuid_pk()
    source_page_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
    )
    target_page_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
    )
    relationship: Mapped[RelationshipType] = mapped_column(
        Enum(RelationshipType, name="relationship_type", create_constraint=True),
        nullable=False,
    )

    # relationships
    source_page: Mapped[TechniquePage] = sa_relationship(
        foreign_keys=[source_page_id], back_populates="outgoing_links"
    )
    target_page: Mapped[TechniquePage] = sa_relationship(
        foreign_keys=[target_page_id], back_populates="incoming_links"
    )


class TechniquePageVersion(Base):
    """Snapshot of a TechniquePage before a pipeline re-synthesis overwrites it."""
    __tablename__ = "technique_page_versions"

    id: Mapped[uuid.UUID] = _uuid_pk()
    technique_page_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("technique_pages.id", ondelete="CASCADE"), nullable=False
    )
    version_number: Mapped[int] = mapped_column(Integer, nullable=False)
    content_snapshot: Mapped[dict] = mapped_column(JSONB, nullable=False)
    pipeline_metadata: Mapped[dict | None] = mapped_column(JSONB, nullable=True)
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )

    # relationships
    technique_page: Mapped[TechniquePage] = sa_relationship(
        back_populates="versions"
    )


class Tag(Base):
    __tablename__ = "tags"

    id: Mapped[uuid.UUID] = _uuid_pk()
    name: Mapped[str] = mapped_column(String(255), unique=True, nullable=False)
    category: Mapped[str] = mapped_column(String(255), nullable=False)
    aliases: Mapped[list[str] | None] = mapped_column(ARRAY(String), nullable=True)


# ── Content Report Enums ─────────────────────────────────────────────────────

class ReportType(str, enum.Enum):
    """Classification of user-submitted content reports."""
    inaccurate = "inaccurate"
    missing_info = "missing_info"
    wrong_attribution = "wrong_attribution"
    formatting = "formatting"
    other = "other"


class ReportStatus(str, enum.Enum):
    """Triage status for content reports."""
    open = "open"
    acknowledged = "acknowledged"
    resolved = "resolved"
    dismissed = "dismissed"


# ── Content Report ───────────────────────────────────────────────────────────

class ContentReport(Base):
    """User-submitted report about a content issue.

    Generic: content_type + content_id can reference any entity
    (technique_page, key_moment, creator, or general).
    """
    __tablename__ = "content_reports"

    id: Mapped[uuid.UUID] = _uuid_pk()
    content_type: Mapped[str] = mapped_column(
        String(50), nullable=False, doc="Entity type: technique_page, key_moment, creator, general"
    )
    content_id: Mapped[uuid.UUID | None] = mapped_column(
        UUID(as_uuid=True), nullable=True, doc="FK to the reported entity (null for general reports)"
    )
    content_title: Mapped[str | None] = mapped_column(
        String(500), nullable=True, doc="Snapshot of entity title at report time"
    )
    report_type: Mapped[ReportType] = mapped_column(
        Enum(ReportType, name="report_type", create_constraint=True),
        nullable=False,
    )
    description: Mapped[str] = mapped_column(Text, nullable=False)
    status: Mapped[ReportStatus] = mapped_column(
        Enum(ReportStatus, name="report_status", create_constraint=True),
        default=ReportStatus.open,
        server_default="open",
    )
    admin_notes: Mapped[str | None] = mapped_column(Text, nullable=True)
    page_url: Mapped[str | None] = mapped_column(
        String(1000), nullable=True, doc="URL the user was on when reporting"
    )
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )
    resolved_at: Mapped[datetime | None] = mapped_column(nullable=True)


# ── Pipeline Event ───────────────────────────────────────────────────────────

class PipelineEvent(Base):
    """Structured log entry for pipeline execution.

    Captures per-stage start/complete/error/llm_call events with
    token usage and optional response payloads for debugging.
    """
    __tablename__ = "pipeline_events"

    id: Mapped[uuid.UUID] = _uuid_pk()
    video_id: Mapped[uuid.UUID] = mapped_column(
        UUID(as_uuid=True), nullable=False, index=True,
    )
    stage: Mapped[str] = mapped_column(
        String(50), nullable=False, doc="stage2_segmentation, stage3_extraction, etc."
    )
    event_type: Mapped[str] = mapped_column(
        String(30), nullable=False, doc="start, complete, error, llm_call"
    )
    prompt_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
    completion_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
    total_tokens: Mapped[int | None] = mapped_column(Integer, nullable=True)
    model: Mapped[str | None] = mapped_column(String(100), nullable=True)
    duration_ms: Mapped[int | None] = mapped_column(Integer, nullable=True)
    payload: Mapped[dict | None] = mapped_column(
        JSONB, nullable=True, doc="LLM response content, error details, stage metadata"
    )
    created_at: Mapped[datetime] = mapped_column(
        default=_now, server_default=func.now()
    )

    # Debug mode — full LLM I/O capture columns
    system_prompt_text: Mapped[str | None] = mapped_column(Text, nullable=True)
    user_prompt_text: Mapped[str | None] = mapped_column(Text, nullable=True)
    response_text: Mapped[str | None] = mapped_column(Text, nullable=True)
0	backend/pipeline/__init__.py	Normal file
88	backend/pipeline/embedding_client.py	Normal file
@@ -0,0 +1,88 @@
"""Synchronous embedding client using the OpenAI-compatible /v1/embeddings API.

Uses ``openai.OpenAI`` (sync) since Celery tasks run synchronously.
Handles connection failures gracefully — embedding is non-blocking for the pipeline.
"""

from __future__ import annotations

import logging

import openai

from config import Settings

logger = logging.getLogger(__name__)


class EmbeddingClient:
    """Sync embedding client backed by an OpenAI-compatible /v1/embeddings endpoint."""

    def __init__(self, settings: Settings) -> None:
        self.settings = settings
        self._client = openai.OpenAI(
            base_url=settings.embedding_api_url,
            api_key=settings.llm_api_key,
        )

    def embed(self, texts: list[str]) -> list[list[float]]:
        """Generate embedding vectors for a batch of texts.

        Parameters
        ----------
        texts:
            List of strings to embed.

        Returns
        -------
        list[list[float]]
            Embedding vectors. Returns empty list on connection/timeout errors
            so the pipeline can continue without embeddings.
        """
        if not texts:
            return []

        try:
            response = self._client.embeddings.create(
                model=self.settings.embedding_model,
                input=texts,
            )
        except (openai.APIConnectionError, openai.APITimeoutError) as exc:
            logger.warning(
                "Embedding API unavailable (%s: %s). Skipping %d texts.",
                type(exc).__name__,
                exc,
                len(texts),
            )
            return []
        except openai.APIError as exc:
            logger.warning(
                "Embedding API error (%s: %s). Skipping %d texts.",
                type(exc).__name__,
                exc,
                len(texts),
            )
            return []

        vectors = [item.embedding for item in response.data]

        # Validate dimensions
        expected_dim = self.settings.embedding_dimensions
        for i, vec in enumerate(vectors):
            if len(vec) != expected_dim:
                logger.warning(
                    "Embedding dimension mismatch at index %d: expected %d, got %d. "
                    "Returning empty list.",
                    i,
                    expected_dim,
                    len(vec),
                )
                return []

        logger.info(
            "Generated %d embeddings (dim=%d) using model=%s",
            len(vectors),
            expected_dim,
            self.settings.embedding_model,
        )
        return vectors
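Callers typically batch transcript texts before invoking `embed`, since embedding endpoints cap batch sizes. A minimal sketch of such a batching helper — the helper name and the batch size are illustrative, not part of the module above:

```python
def batch(items: list, size: int) -> list[list]:
    """Split items into successive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# Five texts with a batch size of 2 → batches of 2, 2, and 1.
texts = ["intro", "verse", "chorus", "bridge", "outro"]
print(batch(texts, 2))  # → [['intro', 'verse'], ['chorus', 'bridge'], ['outro']]
```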
328	backend/pipeline/llm_client.py	Normal file
@@ -0,0 +1,328 @@
"""Synchronous LLM client with primary/fallback endpoint logic.

Uses the OpenAI-compatible API (works with Ollama, vLLM, OpenWebUI, etc.).
Celery tasks run synchronously, so this uses ``openai.OpenAI`` (not Async).

Supports two modalities:
- **chat**: Standard JSON mode with ``response_format: {"type": "json_object"}``
- **thinking**: For reasoning models that emit ``<think>...</think>`` blocks
  before their answer. Skips ``response_format``, appends JSON instructions to
  the system prompt, and strips think tags from the response.
"""

from __future__ import annotations

import logging
import re
from typing import TYPE_CHECKING, TypeVar

if TYPE_CHECKING:
    from collections.abc import Callable

import openai
from pydantic import BaseModel

from config import Settings

logger = logging.getLogger(__name__)

T = TypeVar("T", bound=BaseModel)

# ── Think-tag stripping ──────────────────────────────────────────────────────

_THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think_tags(text: str) -> str:
    """Remove ``<think>...</think>`` blocks from LLM output.

    Thinking/reasoning models often prefix their JSON with a reasoning trace
    wrapped in ``<think>`` tags. This strips all such blocks (including
    multiline and multiple occurrences) and returns the cleaned text.

    Handles:
    - Single ``<think>...</think>`` block
    - Multiple blocks in one response
    - Multiline content inside think tags
    - Responses with no think tags (passthrough)
    - Empty input (passthrough)
    """
    if not text:
        return text
    cleaned = _THINK_PATTERN.sub("", text)
    return cleaned.strip()
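The behaviors listed in the docstring can be exercised standalone; the regex and function below are copied from this module:

```python
import re

_THINK_PATTERN = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_think_tags(text: str) -> str:
    if not text:
        return text
    return _THINK_PATTERN.sub("", text).strip()

# Multiple blocks, multiline content, surrounding whitespace — all removed.
raw = '<think>First I will\nplan the JSON.</think>{"segments": []}\n<think>done</think>'
print(strip_think_tags(raw))        # → {"segments": []}
print(strip_think_tags("no tags"))  # passthrough → no tags
```

The non-greedy `.*?` with `re.DOTALL` is what makes multiple multiline blocks each match separately instead of one match swallowing everything between the first `<think>` and the last `</think>`.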
# ── Token estimation ─────────────────────────────────────────────────────────

# Stage-specific output multipliers: estimated output tokens as a ratio of input tokens.
# These are empirically tuned based on observed pipeline behavior.
_STAGE_OUTPUT_RATIOS: dict[str, float] = {
    "stage2_segmentation": 0.3,  # Compact topic groups — much smaller than input
    "stage3_extraction": 1.2,  # Detailed moments with summaries — can exceed input
    "stage4_classification": 0.15,  # Index + category + tags per moment — very compact
    "stage5_synthesis": 1.5,  # Full prose technique pages — heaviest output
}

# Minimum floor so we never send a trivially small max_tokens
_MIN_MAX_TOKENS = 2048


def estimate_tokens(text: str) -> int:
    """Estimate token count from text using a chars-per-token heuristic.

    Uses 3.5 chars/token which is conservative for English + JSON markup.
    """
    if not text:
        return 0
    return max(1, int(len(text) / 3.5))


def estimate_max_tokens(
    system_prompt: str,
    user_prompt: str,
    stage: str | None = None,
    hard_limit: int = 32768,
) -> int:
    """Estimate the max_tokens parameter for an LLM call.

    Calculates expected output size based on input size and stage-specific
    multipliers. The result is clamped between _MIN_MAX_TOKENS and hard_limit.

    Parameters
    ----------
    system_prompt:
        The system prompt text.
    user_prompt:
        The user prompt text (transcript, moments, etc.).
    stage:
        Pipeline stage name (e.g. "stage3_extraction"). If None or unknown,
        uses a default 1.0x multiplier.
    hard_limit:
        Absolute ceiling — never exceed this value.

    Returns
    -------
    int
        Estimated max_tokens value to pass to the LLM API.
    """
    input_tokens = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
    ratio = _STAGE_OUTPUT_RATIOS.get(stage or "", 1.0)
    estimated_output = int(input_tokens * ratio)

    # Add a 20% buffer for JSON overhead and variability
    estimated_output = int(estimated_output * 1.2)

    # Clamp to [_MIN_MAX_TOKENS, hard_limit]
    result = max(_MIN_MAX_TOKENS, min(estimated_output, hard_limit))

    logger.info(
        "Token estimate: input≈%d, stage=%s, ratio=%.2f, estimated_output=%d, max_tokens=%d (hard_limit=%d)",
        input_tokens, stage or "default", ratio, estimated_output, result, hard_limit,
    )
    return result
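To make the arithmetic concrete, here is the same heuristic applied to a hypothetical stage-3 call. The constants and formulas are copied from this module; the input sizes are invented:

```python
_STAGE_OUTPUT_RATIOS = {"stage3_extraction": 1.2}
_MIN_MAX_TOKENS = 2048

def estimate_tokens(text: str) -> int:
    return max(1, int(len(text) / 3.5)) if text else 0

def estimate_max_tokens(system_prompt, user_prompt, stage=None, hard_limit=32768):
    input_tokens = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
    estimated_output = int(input_tokens * _STAGE_OUTPUT_RATIOS.get(stage or "", 1.0))
    estimated_output = int(estimated_output * 1.2)  # 20% buffer
    return max(_MIN_MAX_TOKENS, min(estimated_output, hard_limit))

# 3,500-char prompt + 31,500-char transcript ≈ 10,000 input tokens;
# stage-3 ratio 1.2 → 12,000; +20% buffer → 14,400, inside [2048, 32768].
print(estimate_max_tokens("s" * 3500, "u" * 31500, stage="stage3_extraction"))  # → 14400

# Tiny inputs clamp to the floor.
print(estimate_max_tokens("hi", "there"))  # → 2048
```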
class LLMClient:
    """Sync LLM client that tries a primary endpoint and falls back on failure."""

    def __init__(self, settings: Settings) -> None:
        self.settings = settings
        self._primary = openai.OpenAI(
            base_url=settings.llm_api_url,
            api_key=settings.llm_api_key,
        )
        self._fallback = openai.OpenAI(
            base_url=settings.llm_fallback_url,
            api_key=settings.llm_api_key,
        )

    # ── Core completion ──────────────────────────────────────────────────

    def complete(
        self,
        system_prompt: str,
        user_prompt: str,
        response_model: type[BaseModel] | None = None,
        modality: str = "chat",
        model_override: str | None = None,
        on_complete: "Callable | None" = None,
        max_tokens: int | None = None,
    ) -> str:
        """Send a chat completion request, falling back on connection/timeout errors.

        Parameters
        ----------
        system_prompt:
            System message content.
        user_prompt:
            User message content.
        response_model:
            If provided and modality is "chat", ``response_format`` is set to
            ``{"type": "json_object"}``. For "thinking" modality, JSON
            instructions are appended to the system prompt instead.
        modality:
            Either "chat" (default) or "thinking". Thinking modality skips
            response_format and strips ``<think>`` tags from output.
        model_override:
            Model name to use instead of the default. If None, uses the
            configured default for the endpoint.
        max_tokens:
            Override for max_tokens on this call. If None, falls back to
            the configured ``llm_max_tokens`` from settings.

        Returns
        -------
        str
            Raw completion text from the model (think tags stripped if thinking).
        """
        kwargs: dict = {}
        effective_system = system_prompt

        if modality == "thinking":
            # Thinking models often don't support response_format: json_object.
            # Instead, append explicit JSON instructions to the system prompt.
            if response_model is not None:
                json_schema_hint = (
                    "\n\nYou MUST respond with ONLY valid JSON. "
                    "No markdown code fences, no explanation, no preamble — "
                    "just the raw JSON object."
                )
                effective_system = system_prompt + json_schema_hint
        else:
            # Chat modality — use standard JSON mode
            if response_model is not None:
                kwargs["response_format"] = {"type": "json_object"}

        messages = [
            {"role": "system", "content": effective_system},
            {"role": "user", "content": user_prompt},
        ]

        primary_model = model_override or self.settings.llm_model
        fallback_model = self.settings.llm_fallback_model
        effective_max_tokens = max_tokens if max_tokens is not None else self.settings.llm_max_tokens

        logger.info(
            "LLM request: model=%s, modality=%s, response_model=%s, max_tokens=%d",
            primary_model,
            modality,
            response_model.__name__ if response_model else None,
            effective_max_tokens,
        )

        # --- Try primary endpoint ---
        try:
            response = self._primary.chat.completions.create(
                model=primary_model,
                messages=messages,
                max_tokens=effective_max_tokens,
                **kwargs,
            )
            raw = response.choices[0].message.content or ""
            usage = getattr(response, "usage", None)
            if usage:
                logger.info(
                    "LLM response: prompt_tokens=%s, completion_tokens=%s, total=%s, content_len=%d, finish=%s",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
                    len(raw), response.choices[0].finish_reason,
                )
            if modality == "thinking":
                raw = strip_think_tags(raw)
            if on_complete is not None:
                try:
                    on_complete(
                        model=primary_model,
                        prompt_tokens=usage.prompt_tokens if usage else None,
                        completion_tokens=usage.completion_tokens if usage else None,
                        total_tokens=usage.total_tokens if usage else None,
                        content=raw,
                        finish_reason=response.choices[0].finish_reason if response.choices else None,
                    )
                except Exception as cb_exc:
                    logger.warning("on_complete callback failed: %s", cb_exc)
            return raw

        except (openai.APIConnectionError, openai.APITimeoutError) as exc:
            logger.warning(
                "Primary LLM endpoint failed (%s: %s), trying fallback at %s",
                type(exc).__name__,
                exc,
                self.settings.llm_fallback_url,
            )

        # --- Try fallback endpoint ---
        try:
            response = self._fallback.chat.completions.create(
                model=fallback_model,
                messages=messages,
                max_tokens=effective_max_tokens,
                **kwargs,
            )
            raw = response.choices[0].message.content or ""
            usage = getattr(response, "usage", None)
            if usage:
                logger.info(
                    "LLM response (fallback): prompt_tokens=%s, completion_tokens=%s, total=%s, content_len=%d, finish=%s",
                    usage.prompt_tokens, usage.completion_tokens, usage.total_tokens,
                    len(raw), response.choices[0].finish_reason,
                )
            if modality == "thinking":
                raw = strip_think_tags(raw)
            if on_complete is not None:
                try:
                    on_complete(
                        model=fallback_model,
                        prompt_tokens=usage.prompt_tokens if usage else None,
                        completion_tokens=usage.completion_tokens if usage else None,
                        total_tokens=usage.total_tokens if usage else None,
                        content=raw,
                        finish_reason=response.choices[0].finish_reason if response.choices else None,
                        is_fallback=True,
                    )
                except Exception as cb_exc:
                    logger.warning("on_complete callback failed: %s", cb_exc)
            return raw

        except (openai.APIConnectionError, openai.APITimeoutError, openai.APIError) as exc:
            logger.error(
                "Fallback LLM endpoint also failed (%s: %s). Giving up.",
                type(exc).__name__,
                exc,
            )
            raise

    # ── Response parsing ─────────────────────────────────────────────────

    def parse_response(self, text: str, model: type[T]) -> T:
        """Parse raw LLM output as JSON and validate against a Pydantic model.

        Parameters
        ----------
        text:
            Raw JSON string from the LLM.
        model:
            Pydantic model class to validate against.

        Returns
        -------
        T
            Validated Pydantic model instance.

        Raises
        ------
        pydantic.ValidationError
            If the JSON doesn't match the schema.
        ValueError
            If the text is not valid JSON.
        """
        try:
            return model.model_validate_json(text)
        except Exception:
            logger.error(
                "Failed to parse LLM response as %s. Response text: %.500s",
                model.__name__,
                text,
            )
            raise
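The primary/fallback control flow in `complete` boils down to a generic pattern, sketched here with stdlib exception types standing in for `openai.APIConnectionError` / `openai.APITimeoutError` (the function and callable names are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

# Stand-ins for the retriable openai error classes caught by the real client.
RETRIABLE = (ConnectionError, TimeoutError)

def complete_with_fallback(primary, fallback):
    """Call primary(); on a retriable failure, log it and call fallback() instead."""
    try:
        return primary()
    except RETRIABLE as exc:
        logger.warning("Primary failed (%s: %s), trying fallback", type(exc).__name__, exc)
    # Errors raised here propagate to the caller, mirroring the final `raise` above.
    return fallback()

def flaky_primary():
    raise ConnectionError("primary endpoint unreachable")

print(complete_with_fallback(flaky_primary, lambda: "fallback response"))  # → fallback response
```

Note that only connection and timeout errors trigger the fallback on the primary call; an `openai.APIError` from the primary (e.g. a 4xx) propagates immediately, since retrying a malformed request elsewhere would not help.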
184	backend/pipeline/qdrant_client.py	Normal file
@@ -0,0 +1,184 @@
"""Qdrant vector database manager for collection lifecycle and point upserts.

Handles collection creation (idempotent) and batch upserts for technique pages
and key moments. Connection failures are non-blocking — the pipeline continues
without search indexing.
"""

from __future__ import annotations

import logging
import uuid

from qdrant_client import QdrantClient
from qdrant_client.http import exceptions as qdrant_exceptions
from qdrant_client.models import Distance, PointStruct, VectorParams

from config import Settings

logger = logging.getLogger(__name__)


class QdrantManager:
    """Manages a Qdrant collection for Chrysopedia technique-page and key-moment vectors."""

    def __init__(self, settings: Settings) -> None:
        self.settings = settings
        self._client = QdrantClient(url=settings.qdrant_url)
        self._collection = settings.qdrant_collection

    # ── Collection management ────────────────────────────────────────────

    def ensure_collection(self) -> None:
        """Create the collection if it does not already exist.

        Uses cosine distance and the configured embedding dimensions.
        """
        try:
            if self._client.collection_exists(self._collection):
                logger.info("Qdrant collection '%s' already exists.", self._collection)
                return

            self._client.create_collection(
                collection_name=self._collection,
                vectors_config=VectorParams(
                    size=self.settings.embedding_dimensions,
                    distance=Distance.COSINE,
                ),
            )
            logger.info(
                "Created Qdrant collection '%s' (dim=%d, cosine).",
                self._collection,
                self.settings.embedding_dimensions,
            )
        except qdrant_exceptions.UnexpectedResponse as exc:
            logger.warning(
                "Qdrant error during ensure_collection (%s). Skipping.",
                exc,
            )
        except Exception as exc:
            logger.warning(
                "Qdrant connection failed during ensure_collection (%s: %s). Skipping.",
                type(exc).__name__,
                exc,
            )

    # ── Low-level upsert ─────────────────────────────────────────────────

    def upsert_points(self, points: list[PointStruct]) -> None:
        """Upsert a batch of pre-built PointStruct objects."""
        if not points:
            return
        try:
            self._client.upsert(
                collection_name=self._collection,
                points=points,
            )
            logger.info(
                "Upserted %d points to Qdrant collection '%s'.",
                len(points),
                self._collection,
            )
        except qdrant_exceptions.UnexpectedResponse as exc:
            logger.warning(
                "Qdrant upsert failed (%s). %d points skipped.",
                exc,
                len(points),
            )
        except Exception as exc:
            logger.warning(
                "Qdrant upsert connection error (%s: %s). %d points skipped.",
                type(exc).__name__,
                exc,
                len(points),
            )

    # ── High-level upserts ───────────────────────────────────────────────

    def upsert_technique_pages(
        self,
        pages: list[dict],
        vectors: list[list[float]],
    ) -> None:
        """Build and upsert PointStructs for technique pages.

        Each page dict must contain:
            page_id, creator_id, title, topic_category, topic_tags, summary

        Parameters
        ----------
        pages:
            Metadata dicts, one per technique page.
        vectors:
            Corresponding embedding vectors (same order as pages).
        """
        if len(pages) != len(vectors):
            logger.warning(
                "Technique-page count (%d) != vector count (%d). Skipping upsert.",
                len(pages),
                len(vectors),
            )
            return

        points = []
        for page, vector in zip(pages, vectors):
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload={
                    "type": "technique_page",
                    "page_id": page["page_id"],
                    "creator_id": page["creator_id"],
                    "title": page["title"],
                    "topic_category": page["topic_category"],
                    "topic_tags": page.get("topic_tags") or [],
                    "summary": page.get("summary") or "",
                },
            )
            points.append(point)

        self.upsert_points(points)

    def upsert_key_moments(
        self,
        moments: list[dict],
        vectors: list[list[float]],
    ) -> None:
        """Build and upsert PointStructs for key moments.

        Each moment dict must contain:
            moment_id, source_video_id, title, start_time, end_time, content_type

        Parameters
        ----------
        moments:
            Metadata dicts, one per key moment.
        vectors:
            Corresponding embedding vectors (same order as moments).
        """
        if len(moments) != len(vectors):
            logger.warning(
                "Key-moment count (%d) != vector count (%d). Skipping upsert.",
                len(moments),
                len(vectors),
            )
            return

        points = []
        for moment, vector in zip(moments, vectors):
            point = PointStruct(
                id=str(uuid.uuid4()),
                vector=vector,
                payload={
                    "type": "key_moment",
                    "moment_id": moment["moment_id"],
                    "source_video_id": moment["source_video_id"],
                    "title": moment["title"],
                    "start_time": moment["start_time"],
                    "end_time": moment["end_time"],
                    "content_type": moment["content_type"],
                },
            )
            points.append(point)

        self.upsert_points(points)
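Both high-level upserts use the same guard: if the metadata and vector lists disagree in length, the whole batch is skipped rather than silently mispairing payloads. A small sketch of that pairing logic (the helper name and fields are illustrative; on Python 3.10+ the same invariant could be enforced with `zip(..., strict=True)`):

```python
def pair_page_points(pages: list[dict], vectors: list[list[float]]) -> list[dict]:
    """Pair page metadata with its embedding; a length mismatch skips the whole batch."""
    if len(pages) != len(vectors):
        return []  # mirrors the warning-and-return in upsert_technique_pages
    return [{"payload": page, "vector": vec} for page, vec in zip(pages, vectors)]

pages = [{"page_id": "p1", "title": "Parallel compression"}]
print(pair_page_points(pages, [[0.1, 0.2]]))
print(pair_page_points(pages, []))  # mismatch → []
```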
99	backend/pipeline/schemas.py	Normal file
@@ -0,0 +1,99 @@
"""Pydantic schemas for pipeline stage inputs and outputs.

Stage 2 — Segmentation: groups transcript segments by topic.
Stage 3 — Extraction: extracts key moments from segments.
Stage 4 — Classification: classifies moments by category/tags.
Stage 5 — Synthesis: generates technique pages from classified moments.
"""

from __future__ import annotations

from pydantic import BaseModel, Field


# ── Stage 2: Segmentation ───────────────────────────────────────────────────

class TopicSegment(BaseModel):
    """A contiguous group of transcript segments sharing a topic."""

    start_index: int = Field(description="First transcript segment index in this group")
    end_index: int = Field(description="Last transcript segment index in this group (inclusive)")
    topic_label: str = Field(description="Short label describing the topic")
    summary: str = Field(description="Brief summary of what is discussed")


class SegmentationResult(BaseModel):
    """Full output of stage 2 (segmentation)."""

    segments: list[TopicSegment]


# ── Stage 3: Extraction ─────────────────────────────────────────────────────

class ExtractedMoment(BaseModel):
    """A single key moment extracted from a topic segment group."""

    title: str = Field(description="Concise title for the moment")
    summary: str = Field(description="Detailed summary of the technique/concept")
    start_time: float = Field(description="Start time in seconds")
    end_time: float = Field(description="End time in seconds")
    content_type: str = Field(description="One of: technique, settings, reasoning, workflow")
    plugins: list[str] = Field(default_factory=list, description="Plugins/tools mentioned")
    raw_transcript: str = Field(default="", description="Raw transcript text for this moment")


class ExtractionResult(BaseModel):
    """Full output of stage 3 (extraction)."""

    moments: list[ExtractedMoment]


# ── Stage 4: Classification ─────────────────────────────────────────────────

class ClassifiedMoment(BaseModel):
    """Classification metadata for a single extracted moment."""

    moment_index: int = Field(description="Index into ExtractionResult.moments")
    topic_category: str = Field(description="High-level topic category")
    topic_tags: list[str] = Field(default_factory=list, description="Specific topic tags")
    content_type_override: str | None = Field(
        default=None,
        description="Override for content_type if classification disagrees with extraction",
    )


class ClassificationResult(BaseModel):
    """Full output of stage 4 (classification)."""

    classifications: list[ClassifiedMoment]


# ── Stage 5: Synthesis ───────────────────────────────────────────────────────

class SynthesizedPage(BaseModel):
    """A technique page synthesized from classified moments."""

    title: str = Field(description="Page title")
    slug: str = Field(description="URL-safe slug")
    topic_category: str = Field(description="Primary topic category")
    topic_tags: list[str] = Field(default_factory=list, description="Associated tags")
    summary: str = Field(description="Page summary / overview paragraph")
    body_sections: dict = Field(
        default_factory=dict,
        description="Structured body content as section_name -> content mapping",
    )
    signal_chains: list[dict] = Field(
        default_factory=list,
        description="Signal chain descriptions (for audio/music production contexts)",
    )
    plugins: list[str] = Field(default_factory=list, description="Plugins/tools referenced")
    source_quality: str = Field(
        default="mixed",
        description="One of: structured, mixed, unstructured",
    )


class SynthesisResult(BaseModel):
    """Full output of stage 5 (synthesis)."""

    pages: list[SynthesizedPage]
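In the pipeline these payloads are validated with pydantic's `model_validate_json`; the stage-2 shape can be illustrated with plain stdlib JSON. The sample response below is invented, not real pipeline output:

```python
import json

raw = """{"segments": [
    {"start_index": 0, "end_index": 4,
     "topic_label": "Sidechain compression",
     "summary": "Setting up sidechain routing on the bass bus."}
]}"""

result = json.loads(raw)
# Each entry carries the four TopicSegment fields.
first = result["segments"][0]
print(first["topic_label"], first["start_index"], first["end_index"])  # → Sidechain compression 0 4
```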
@ -12,6 +12,7 @@ from __future__ import annotations
|
|||
import hashlib
|
||||
import json
|
||||
import logging
|
||||
import subprocess
|
||||
import time
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
|
@ -24,6 +25,7 @@ from sqlalchemy.orm import Session, sessionmaker
|
|||
|
||||
from config import get_settings
|
||||
from models import (
|
||||
Creator,
|
||||
KeyMoment,
|
||||
KeyMomentContentType,
|
||||
PipelineEvent,
|
||||
|
|
@ -34,7 +36,7 @@ from models import (
|
|||
TranscriptSegment,
|
||||
)
|
||||
from pipeline.embedding_client import EmbeddingClient
|
||||
from pipeline.llm_client import LLMClient
|
||||
from pipeline.llm_client import LLMClient, estimate_max_tokens
|
||||
from pipeline.qdrant_client import QdrantManager
|
||||
from pipeline.schemas import (
|
||||
ClassificationResult,
|
||||
|
|
@ -60,6 +62,9 @@ def _emit_event(
|
|||
model: str | None = None,
|
||||
duration_ms: int | None = None,
|
||||
payload: dict | None = None,
|
||||
system_prompt_text: str | None = None,
|
||||
user_prompt_text: str | None = None,
|
||||
response_text: str | None = None,
|
||||
) -> None:
|
||||
"""Persist a pipeline event to the DB. Best-effort -- failures logged, not raised."""
|
||||
try:
|
||||
|
|
@ -75,6 +80,9 @@ def _emit_event(
|
|||
model=model,
|
||||
duration_ms=duration_ms,
|
||||
payload=payload,
|
||||
system_prompt_text=system_prompt_text,
|
||||
user_prompt_text=user_prompt_text,
|
||||
response_text=response_text,
|
||||
)
|
||||
session.add(event)
|
||||
session.commit()
|
||||
|
|
@ -84,8 +92,34 @@ def _emit_event(
|
|||
logger.warning("Failed to emit pipeline event: %s", exc)
|
||||
|
||||
|
||||
def _make_llm_callback(video_id: str, stage: str):
|
||||
"""Create an on_complete callback for LLMClient that emits llm_call events."""
|
||||
def _is_debug_mode() -> bool:
|
||||
"""Check if debug mode is enabled via Redis. Falls back to config setting."""
|
||||
try:
|
||||
import redis
|
||||
settings = get_settings()
|
||||
r = redis.from_url(settings.redis_url)
|
||||
val = r.get("chrysopedia:debug_mode")
|
||||
r.close()
|
||||
if val is not None:
|
||||
return val.decode().lower() == "true"
|
||||
except Exception:
|
||||
pass
|
||||
return getattr(get_settings(), "debug_mode", False)
|
||||
|
||||
|
||||
def _make_llm_callback(
|
||||
video_id: str,
|
||||
stage: str,
|
||||
system_prompt: str | None = None,
|
||||
user_prompt: str | None = None,
|
||||
):
|
||||
"""Create an on_complete callback for LLMClient that emits llm_call events.
|
||||
|
||||
When debug mode is enabled, captures full system prompt, user prompt,
|
||||
and response text on each llm_call event.
|
||||
"""
|
||||
debug = _is_debug_mode()
|
||||
|
||||
def callback(*, model=None, prompt_tokens=None, completion_tokens=None,
|
||||
total_tokens=None, content=None, finish_reason=None,
|
||||
is_fallback=False, **_kwargs):
|
||||
|
|
@ -105,6 +139,9 @@ def _make_llm_callback(video_id: str, stage: str):
|
|||
"finish_reason": finish_reason,
|
||||
"is_fallback": is_fallback,
|
||||
},
|
||||
system_prompt_text=system_prompt if debug else None,
|
||||
user_prompt_text=user_prompt if debug else None,
|
||||
response_text=content if debug else None,
|
||||
)
|
||||
return callback
|
||||
|
||||
|
|
@ -271,9 +308,11 @@ def stage2_segmentation(self, video_id: str) -> str:
|
|||
|
||||
llm = _get_llm_client()
|
||||
model_override, modality = _get_stage_config(2)
|
||||
logger.info("Stage 2 using model=%s, modality=%s", model_override or "default", modality)
|
||||
raw = llm.complete(system_prompt, user_prompt, response_model=SegmentationResult, on_complete=_make_llm_callback(video_id, "stage2_segmentation"),
|
||||
modality=modality, model_override=model_override)
|
||||
hard_limit = get_settings().llm_max_tokens_hard_limit
|
||||
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage2_segmentation", hard_limit=hard_limit)
|
||||
logger.info("Stage 2 using model=%s, modality=%s, max_tokens=%d", model_override or "default", modality, max_tokens)
|
||||
raw = llm.complete(system_prompt, user_prompt, response_model=SegmentationResult, on_complete=_make_llm_callback(video_id, "stage2_segmentation", system_prompt=system_prompt, user_prompt=user_prompt),
|
||||
modality=modality, model_override=model_override, max_tokens=max_tokens)
|
||||
result = _safe_parse_llm_response(raw, SegmentationResult, llm, system_prompt, user_prompt,
|
||||
modality=modality, model_override=model_override)
|
||||
|
||||
|
|
@ -345,6 +384,7 @@ def stage3_extraction(self, video_id: str) -> str:
|
|||
system_prompt = _load_prompt("stage3_extraction.txt")
|
||||
llm = _get_llm_client()
|
||||
model_override, modality = _get_stage_config(3)
|
||||
hard_limit = get_settings().llm_max_tokens_hard_limit
|
||||
logger.info("Stage 3 using model=%s, modality=%s", model_override or "default", modality)
|
||||
total_moments = 0
|
||||
|
||||
|
|
@ -362,8 +402,9 @@ def stage3_extraction(self, video_id: str) -> str:
|
|||
f"<segment>\n{segment_text}\n</segment>"
|
||||
)
|
||||
|
||||
raw = llm.complete(system_prompt, user_prompt, response_model=ExtractionResult, on_complete=_make_llm_callback(video_id, "stage3_extraction"),
|
||||
modality=modality, model_override=model_override)
|
||||
max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage3_extraction", hard_limit=hard_limit)
|
||||
raw = llm.complete(system_prompt, user_prompt, response_model=ExtractionResult, on_complete=_make_llm_callback(video_id, "stage3_extraction", system_prompt=system_prompt, user_prompt=user_prompt),
|
||||
modality=modality, model_override=model_override, max_tokens=max_tokens)
|
||||
result = _safe_parse_llm_response(raw, ExtractionResult, llm, system_prompt, user_prompt,
|
||||
modality=modality, model_override=model_override)
|
||||
|
||||
|
|
@@ -474,9 +515,11 @@ def stage4_classification(self, video_id: str) -> str:

     llm = _get_llm_client()
     model_override, modality = _get_stage_config(4)
-    logger.info("Stage 4 using model=%s, modality=%s", model_override or "default", modality)
-    raw = llm.complete(system_prompt, user_prompt, response_model=ClassificationResult, on_complete=_make_llm_callback(video_id, "stage4_classification"),
-                       modality=modality, model_override=model_override)
+    hard_limit = get_settings().llm_max_tokens_hard_limit
+    max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage4_classification", hard_limit=hard_limit)
+    logger.info("Stage 4 using model=%s, modality=%s, max_tokens=%d", model_override or "default", modality, max_tokens)
+    raw = llm.complete(system_prompt, user_prompt, response_model=ClassificationResult, on_complete=_make_llm_callback(video_id, "stage4_classification", system_prompt=system_prompt, user_prompt=user_prompt),
+                       modality=modality, model_override=model_override, max_tokens=max_tokens)
     result = _safe_parse_llm_response(raw, ClassificationResult, llm, system_prompt, user_prompt,
                                       modality=modality, model_override=model_override)
@@ -548,6 +591,44 @@ def _load_classification_data(video_id: str) -> list[dict]:
     return json.loads(raw)


+def _get_git_commit_sha() -> str:
+    """Resolve the git commit SHA used to build this image.
+
+    Resolution order:
+    1. /app/.git-commit file (written during Docker build)
+    2. git rev-parse --short HEAD (local dev)
+    3. GIT_COMMIT_SHA env var / config setting
+    4. "unknown"
+    """
+    # Docker build artifact
+    git_commit_file = Path("/app/.git-commit")
+    if git_commit_file.exists():
+        sha = git_commit_file.read_text(encoding="utf-8").strip()
+        if sha and sha != "unknown":
+            return sha
+
+    # Local dev — run git
+    try:
+        result = subprocess.run(
+            ["git", "rev-parse", "--short", "HEAD"],
+            capture_output=True, text=True, timeout=5,
+        )
+        if result.returncode == 0 and result.stdout.strip():
+            return result.stdout.strip()
+    except (FileNotFoundError, subprocess.TimeoutExpired):
+        pass
+
+    # Config / env var fallback
+    try:
+        sha = get_settings().git_commit_sha
+        if sha and sha != "unknown":
+            return sha
+    except Exception:
+        pass
+
+    return "unknown"
+
+
 def _capture_pipeline_metadata() -> dict:
     """Capture current pipeline configuration for version metadata.
@@ -578,6 +659,7 @@ def _capture_pipeline_metadata() -> dict:
             prompt_hashes[filename] = ""

     return {
+        "git_commit_sha": _get_git_commit_sha(),
         "models": {
             "stage2": settings.llm_stage2_model,
             "stage3": settings.llm_stage3_model,
@@ -631,6 +713,12 @@ def stage5_synthesis(self, video_id: str) -> str:
         .all()
     )

+    # Resolve creator name for the LLM prompt
+    creator = session.execute(
+        select(Creator).where(Creator.id == video.creator_id)
+    ).scalar_one_or_none()
+    creator_name = creator.name if creator else "Unknown"
+
     if not moments:
         logger.info("Stage 5: No moments found for video_id=%s, skipping.", video_id)
         return video_id
@@ -649,6 +737,7 @@ def stage5_synthesis(self, video_id: str) -> str:
     system_prompt = _load_prompt("stage5_synthesis.txt")
     llm = _get_llm_client()
     model_override, modality = _get_stage_config(5)
+    hard_limit = get_settings().llm_max_tokens_hard_limit
     logger.info("Stage 5 using model=%s, modality=%s", model_override or "default", modality)
     pages_created = 0
@@ -671,19 +760,41 @@ def stage5_synthesis(self, video_id: str) -> str:
         )
         moments_text = "\n\n".join(moments_lines)

-        user_prompt = f"<moments>\n{moments_text}\n</moments>"
+        user_prompt = f"<creator>{creator_name}</creator>\n<moments>\n{moments_text}\n</moments>"

-        raw = llm.complete(system_prompt, user_prompt, response_model=SynthesisResult, on_complete=_make_llm_callback(video_id, "stage5_synthesis"),
-                           modality=modality, model_override=model_override)
+        max_tokens = estimate_max_tokens(system_prompt, user_prompt, stage="stage5_synthesis", hard_limit=hard_limit)
+        raw = llm.complete(system_prompt, user_prompt, response_model=SynthesisResult, on_complete=_make_llm_callback(video_id, "stage5_synthesis", system_prompt=system_prompt, user_prompt=user_prompt),
+                           modality=modality, model_override=model_override, max_tokens=max_tokens)
         result = _safe_parse_llm_response(raw, SynthesisResult, llm, system_prompt, user_prompt,
                                           modality=modality, model_override=model_override)

+        # Load prior pages from this video (snapshot taken before pipeline reset)
+        prior_page_ids = _load_prior_pages(video_id)
+
         # Create/update TechniquePage rows
         for page_data in result.pages:
-            # Check if page with this slug already exists
-            existing = session.execute(
-                select(TechniquePage).where(TechniquePage.slug == page_data.slug)
-            ).scalar_one_or_none()
+            existing = None
+
+            # First: check prior pages from this video by creator + category
+            if prior_page_ids:
+                existing = session.execute(
+                    select(TechniquePage).where(
+                        TechniquePage.id.in_(prior_page_ids),
+                        TechniquePage.creator_id == video.creator_id,
+                        TechniquePage.topic_category == (page_data.topic_category or category),
+                    )
+                ).scalar_one_or_none()
+                if existing:
+                    logger.info(
+                        "Stage 5: Matched prior page '%s' (id=%s) by creator+category for video_id=%s",
+                        existing.slug, existing.id, video_id,
+                    )
+
+            # Fallback: check by slug (handles cross-video dedup)
+            if existing is None:
+                existing = session.execute(
+                    select(TechniquePage).where(TechniquePage.slug == page_data.slug)
+                ).scalar_one_or_none()

             if existing:
                 # Snapshot existing content before overwriting
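The page-matching logic above tries prior pages from this video (by creator and category) before falling back to a global slug match. A minimal dict-based sketch of that two-tier lookup (hypothetical `match_page` helper; pages modeled as plain dicts rather than ORM rows):

```python
def match_page(prior_pages, all_pages, creator_id, category, slug):
    # Tier 1: pages previously produced from this video, matched by creator + category
    for p in prior_pages:
        if p["creator_id"] == creator_id and p["topic_category"] == category:
            return p
    # Tier 2: any existing page with the same slug (cross-video dedup)
    for p in all_pages:
        if p["slug"] == slug:
            return p
    return None

pages = [{"creator_id": 1, "topic_category": "mixing", "slug": "mixing-basics"}]
hit = match_page(pages, pages, creator_id=1, category="mixing", slug="other")
```

Tier 1 wins even when the slug differs, which is what lets a reprocessed video update its own earlier page instead of creating a near-duplicate.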
@@ -912,6 +1023,58 @@ def stage6_embed_and_index(self, video_id: str) -> str:
         session.close()


+def _snapshot_prior_pages(video_id: str) -> None:
+    """Save existing technique_page_ids linked to this video before pipeline resets them.
+
+    When a video is reprocessed, stage 3 deletes and recreates key_moments,
+    breaking the link to technique pages. This snapshots the page IDs to Redis
+    so stage 5 can find and update prior pages instead of creating duplicates.
+    """
+    import redis
+
+    session = _get_sync_session()
+    try:
+        # Find technique pages linked via this video's key moments
+        rows = session.execute(
+            select(KeyMoment.technique_page_id)
+            .where(
+                KeyMoment.source_video_id == video_id,
+                KeyMoment.technique_page_id.isnot(None),
+            )
+            .distinct()
+        ).scalars().all()
+
+        page_ids = [str(pid) for pid in rows]
+
+        if page_ids:
+            settings = get_settings()
+            r = redis.Redis.from_url(settings.redis_url)
+            key = f"chrysopedia:prior_pages:{video_id}"
+            r.set(key, json.dumps(page_ids), ex=86400)
+            logger.info(
+                "Snapshot %d prior technique pages for video_id=%s: %s",
+                len(page_ids), video_id, page_ids,
+            )
+        else:
+            logger.info("No prior technique pages for video_id=%s", video_id)
+    finally:
+        session.close()
+
+
+def _load_prior_pages(video_id: str) -> list[str]:
+    """Load prior technique page IDs from Redis."""
+    import redis
+
+    settings = get_settings()
+    r = redis.Redis.from_url(settings.redis_url)
+    key = f"chrysopedia:prior_pages:{video_id}"
+    raw = r.get(key)
+    if raw is None:
+        return []
+    return json.loads(raw)
+
+
 # ── Orchestrator ─────────────────────────────────────────────────────────────

 @celery_app.task
@@ -945,6 +1108,9 @@ def run_pipeline(video_id: str) -> str:
     finally:
         session.close()

+    # Snapshot prior technique pages before pipeline resets key_moments
+    _snapshot_prior_pages(video_id)
+
     # Build the chain based on current status
     stages = []
     if status in (ProcessingStatus.pending, ProcessingStatus.transcribed):
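The snapshot/load pair introduced above is a JSON round-trip through a keyed store with a TTL. Sketched here against an in-memory dict standing in for Redis (the key format matches the code; the dict and function names are test doubles, and the TTL is omitted):

```python
import json

store: dict[str, str] = {}  # stand-in for Redis

def snapshot_prior_pages(video_id: str, page_ids: list[str]) -> None:
    # Mirrors r.set(key, json.dumps(page_ids), ex=86400)
    store[f"chrysopedia:prior_pages:{video_id}"] = json.dumps(page_ids)

def load_prior_pages(video_id: str) -> list[str]:
    # Mirrors the None-check in _load_prior_pages: missing key -> empty list
    raw = store.get(f"chrysopedia:prior_pages:{video_id}")
    return json.loads(raw) if raw is not None else []

snapshot_prior_pages("vid-1", ["p1", "p2"])
```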
3 backend/pytest.ini Normal file

@@ -0,0 +1,3 @@
+[pytest]
+asyncio_mode = auto
+testpaths = tests
15 backend/redis_client.py Normal file

@@ -0,0 +1,15 @@
+"""Async Redis client helper for Chrysopedia."""
+
+import redis.asyncio as aioredis
+
+from config import get_settings
+
+
+async def get_redis() -> aioredis.Redis:
+    """Return an async Redis client from the configured URL.
+
+    Callers should close the connection when done, or use it
+    as a short-lived client within a request handler.
+    """
+    settings = get_settings()
+    return aioredis.from_url(settings.redis_url, decode_responses=True)
1 backend/routers/__init__.py Normal file

@@ -0,0 +1 @@
+"""Chrysopedia API routers package."""
119 backend/routers/creators.py Normal file

@@ -0,0 +1,119 @@
+"""Creator endpoints for Chrysopedia API.
+
+Enhanced with sort (random default per R014), genre filter, and
+technique/video counts for browse pages.
+"""
+
+import logging
+from typing import Annotated
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from sqlalchemy import func, select
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from database import get_session
+from models import Creator, SourceVideo, TechniquePage
+from schemas import CreatorBrowseItem, CreatorDetail, CreatorRead
+
+logger = logging.getLogger("chrysopedia.creators")
+
+router = APIRouter(prefix="/creators", tags=["creators"])
+
+
+@router.get("")
+async def list_creators(
+    sort: Annotated[str, Query()] = "random",
+    genre: Annotated[str | None, Query()] = None,
+    offset: Annotated[int, Query(ge=0)] = 0,
+    limit: Annotated[int, Query(ge=1, le=100)] = 50,
+    db: AsyncSession = Depends(get_session),
+):
+    """List creators with sort, genre filter, and technique/video counts.
+
+    - **sort**: ``random`` (default, R014 creator equity), ``alpha``, ``views``
+    - **genre**: filter by genre (matches against ARRAY column)
+    """
+    # Subqueries for counts
+    technique_count_sq = (
+        select(func.count())
+        .where(TechniquePage.creator_id == Creator.id)
+        .correlate(Creator)
+        .scalar_subquery()
+    )
+    video_count_sq = (
+        select(func.count())
+        .where(SourceVideo.creator_id == Creator.id)
+        .correlate(Creator)
+        .scalar_subquery()
+    )
+
+    stmt = select(
+        Creator,
+        technique_count_sq.label("technique_count"),
+        video_count_sq.label("video_count"),
+    )
+
+    # Genre filter
+    if genre:
+        stmt = stmt.where(Creator.genres.any(genre))
+
+    # Sorting
+    if sort == "alpha":
+        stmt = stmt.order_by(Creator.name)
+    elif sort == "views":
+        stmt = stmt.order_by(Creator.view_count.desc())
+    else:
+        # Default: random (small dataset <100, func.random() is fine)
+        stmt = stmt.order_by(func.random())
+
+    stmt = stmt.offset(offset).limit(limit)
+    result = await db.execute(stmt)
+    rows = result.all()
+
+    items: list[CreatorBrowseItem] = []
+    for row in rows:
+        creator = row[0]
+        tc = row[1] or 0
+        vc = row[2] or 0
+        base = CreatorRead.model_validate(creator)
+        items.append(
+            CreatorBrowseItem(**base.model_dump(), technique_count=tc, video_count=vc)
+        )
+
+    # Get total count (without offset/limit)
+    count_stmt = select(func.count()).select_from(Creator)
+    if genre:
+        count_stmt = count_stmt.where(Creator.genres.any(genre))
+    total = (await db.execute(count_stmt)).scalar() or 0
+
+    logger.debug(
+        "Listed %d creators (sort=%s, genre=%s, offset=%d, limit=%d)",
+        len(items), sort, genre, offset, limit,
+    )
+    return {"items": items, "total": total, "offset": offset, "limit": limit}
+
+
+@router.get("/{slug}", response_model=CreatorDetail)
+async def get_creator(
+    slug: str,
+    db: AsyncSession = Depends(get_session),
+) -> CreatorDetail:
+    """Get a single creator by slug, including video count."""
+    stmt = select(Creator).where(Creator.slug == slug)
+    result = await db.execute(stmt)
+    creator = result.scalar_one_or_none()
+
+    if creator is None:
+        raise HTTPException(status_code=404, detail=f"Creator '{slug}' not found")
+
+    # Count videos for this creator
+    count_stmt = (
+        select(func.count())
+        .select_from(SourceVideo)
+        .where(SourceVideo.creator_id == creator.id)
+    )
+    count_result = await db.execute(count_stmt)
+    video_count = count_result.scalar() or 0
+
+    creator_data = CreatorRead.model_validate(creator)
+    return CreatorDetail(**creator_data.model_dump(), video_count=video_count)
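The correlated scalar subqueries in `list_creators` compile to a per-row COUNT. The equivalent SQL can be sanity-checked against SQLite (table and column names simplified from the models):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE creator (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE technique_page (id INTEGER PRIMARY KEY, creator_id INTEGER);
    INSERT INTO creator VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO technique_page VALUES (1, 1), (2, 1);
""")
# Correlated scalar subquery: one COUNT evaluated per creator row
rows = conn.execute("""
    SELECT c.name,
           (SELECT COUNT(*) FROM technique_page t
            WHERE t.creator_id = c.id) AS technique_count
    FROM creator c
    ORDER BY c.name
""").fetchall()
# rows == [('alice', 2), ('bob', 0)]
```

Creators with no pages still appear with a count of 0, which is why the route uses a scalar subquery rather than an inner join plus GROUP BY.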
34 backend/routers/health.py Normal file

@@ -0,0 +1,34 @@
+"""Health check endpoints for Chrysopedia API."""
+
+import logging
+
+from fastapi import APIRouter, Depends
+from sqlalchemy import text
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from database import get_session
+from schemas import HealthResponse
+
+logger = logging.getLogger("chrysopedia.health")
+
+router = APIRouter(tags=["health"])
+
+
+@router.get("/health", response_model=HealthResponse)
+async def health_check(db: AsyncSession = Depends(get_session)) -> HealthResponse:
+    """Root health check — verifies API is running and DB is reachable."""
+    db_status = "unknown"
+    try:
+        result = await db.execute(text("SELECT 1"))
+        result.scalar()
+        db_status = "connected"
+    except Exception:
+        logger.warning("Database health check failed", exc_info=True)
+        db_status = "unreachable"
+
+    return HealthResponse(
+        status="ok",
+        service="chrysopedia-api",
+        version="0.1.0",
+        database=db_status,
+    )
284 backend/routers/ingest.py Normal file

@@ -0,0 +1,284 @@
+"""Transcript ingestion endpoint for the Chrysopedia API.
+
+Accepts a Whisper-format transcript JSON via multipart file upload, finds or
+creates a Creator, upserts a SourceVideo, bulk-inserts TranscriptSegments,
+persists the raw JSON to disk, and returns a structured response.
+"""
+
+import hashlib
+import json
+import logging
+import os
+import re
+import uuid
+
+from fastapi import APIRouter, Depends, HTTPException, UploadFile
+from sqlalchemy import delete, select
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from config import get_settings
+from database import get_session
+from models import ContentType, Creator, ProcessingStatus, SourceVideo, TranscriptSegment
+from schemas import TranscriptIngestResponse
+
+logger = logging.getLogger("chrysopedia.ingest")
+
+router = APIRouter(prefix="/ingest", tags=["ingest"])
+
+REQUIRED_KEYS = {"source_file", "creator_folder", "duration_seconds", "segments"}
+
+
+def slugify(value: str) -> str:
+    """Lowercase, replace non-alphanumeric chars with hyphens, collapse/strip."""
+    value = value.lower()
+    value = re.sub(r"[^a-z0-9]+", "-", value)
+    value = value.strip("-")
+    value = re.sub(r"-{2,}", "-", value)
+    return value
+
+
+def compute_content_hash(segments: list[dict]) -> str:
+    """Compute a stable SHA-256 hash from transcript segment text.
+
+    Hashes only the segment text content in order, ignoring metadata like
+    filenames, timestamps, or dates. Two transcripts of the same audio will
+    produce identical hashes even if ingested with different filenames.
+    """
+    h = hashlib.sha256()
+    for seg in segments:
+        h.update(str(seg.get("text", "")).encode("utf-8"))
+    return h.hexdigest()
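The docstring's claim is worth seeing concretely: `compute_content_hash` folds only segment text into the digest, so timestamps and other metadata cannot change the result. A self-contained demo (same function body as the diff above):

```python
import hashlib

def compute_content_hash(segments: list[dict]) -> str:
    # Hash only the "text" field of each segment, in order
    h = hashlib.sha256()
    for seg in segments:
        h.update(str(seg.get("text", "")).encode("utf-8"))
    return h.hexdigest()

a = compute_content_hash([{"text": "hello", "start": 0.0}, {"text": "world", "start": 2.5}])
b = compute_content_hash([{"text": "hello", "start": 9.9}, {"text": "world", "end": 1.0}])
assert a == b  # differing timestamps do not affect the hash
```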
+
+
+@router.post("", response_model=TranscriptIngestResponse)
+async def ingest_transcript(
+    file: UploadFile,
+    db: AsyncSession = Depends(get_session),
+) -> TranscriptIngestResponse:
+    """Ingest a Whisper transcript JSON file.
+
+    Workflow:
+    1. Parse and validate the uploaded JSON.
+    2. Find-or-create a Creator by folder_name.
+    3. Upsert a SourceVideo by (creator_id, filename).
+    4. Bulk-insert TranscriptSegment rows.
+    5. Save raw JSON to transcript_storage_path.
+    6. Return structured response.
+    """
+    settings = get_settings()
+
+    # ── 1. Read & parse JSON ─────────────────────────────────────────────
+    try:
+        raw_bytes = await file.read()
+        raw_text = raw_bytes.decode("utf-8")
+    except Exception as exc:
+        raise HTTPException(status_code=400, detail=f"Invalid file: {exc}") from exc
+
+    try:
+        data = json.loads(raw_text)
+    except json.JSONDecodeError as exc:
+        raise HTTPException(
+            status_code=422, detail=f"JSON parse error: {exc}"
+        ) from exc
+
+    if not isinstance(data, dict):
+        raise HTTPException(status_code=422, detail="Expected a JSON object at the top level")
+
+    missing = REQUIRED_KEYS - data.keys()
+    if missing:
+        raise HTTPException(
+            status_code=422,
+            detail=f"Missing required keys: {', '.join(sorted(missing))}",
+        )
+
+    source_file: str = data["source_file"]
+    creator_folder: str = data["creator_folder"]
+    duration_seconds: int | None = data.get("duration_seconds")
+    segments_data: list = data["segments"]
+
+    if not isinstance(segments_data, list):
+        raise HTTPException(status_code=422, detail="'segments' must be an array")
+
+    content_hash = compute_content_hash(segments_data)
+    logger.info("Content hash for %s: %s", source_file, content_hash)
+
+    # ── 2. Find-or-create Creator ────────────────────────────────────────
+    stmt = select(Creator).where(Creator.folder_name == creator_folder)
+    result = await db.execute(stmt)
+    creator = result.scalar_one_or_none()
+
+    if creator is None:
+        creator = Creator(
+            name=creator_folder,
+            slug=slugify(creator_folder),
+            folder_name=creator_folder,
+        )
+        db.add(creator)
+        await db.flush()  # assign id
+
+    # ── 3. Upsert SourceVideo ────────────────────────────────────────────
+    # First check for exact filename match (original behavior)
+    stmt = select(SourceVideo).where(
+        SourceVideo.creator_id == creator.id,
+        SourceVideo.filename == source_file,
+    )
+    result = await db.execute(stmt)
+    existing_video = result.scalar_one_or_none()
+
+    # Tier 2: content hash match (same audio, different filename/metadata)
+    matched_video = None
+    match_reason = None
+    if existing_video is None:
+        stmt = select(SourceVideo).where(
+            SourceVideo.content_hash == content_hash,
+        )
+        result = await db.execute(stmt)
+        matched_video = result.scalar_one_or_none()
+        if matched_video:
+            match_reason = "content_hash"
+
+    # Tier 3: filename + duration match (same yt-dlp download, re-encoded)
+    if existing_video is None and matched_video is None and duration_seconds is not None:
+        # Strip common prefixes like dates (e.g. "2023-07-19 ") and extensions
+        # to get a normalized base name for fuzzy matching
+        base_name = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", source_file)
+        base_name = re.sub(r"\s*\(\d+p\).*$", "", base_name)  # strip resolution suffix
+        base_name = os.path.splitext(base_name)[0].strip()
+
+        stmt = select(SourceVideo).where(
+            SourceVideo.creator_id == creator.id,
+            SourceVideo.duration_seconds == duration_seconds,
+        )
+        result = await db.execute(stmt)
+        candidates = result.scalars().all()
+        for candidate in candidates:
+            cand_name = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", candidate.filename)
+            cand_name = re.sub(r"\s*\(\d+p\).*$", "", cand_name)
+            cand_name = os.path.splitext(cand_name)[0].strip()
+            if cand_name == base_name:
+                matched_video = candidate
+                match_reason = "filename+duration"
+                break
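The tier-3 normalization above strips a leading date, a resolution suffix, and the extension before comparing names. Pulled out as a standalone helper (the `normalize` name is ours, but the regexes are exactly those used in the endpoint):

```python
import os
import re

def normalize(name: str) -> str:
    # Strip leading "YYYY-MM-DD " date prefix
    name = re.sub(r"^\d{4}-\d{2}-\d{2}\s+", "", name)
    # Strip " (1080p)..." resolution suffix and everything after it
    name = re.sub(r"\s*\(\d+p\).*$", "", name)
    # Strip the file extension
    return os.path.splitext(name)[0].strip()

print(normalize("2023-07-19 My Tutorial (1080p).mp4"))  # -> "My Tutorial"
```

Two yt-dlp downloads of the same upload at different resolutions or dates therefore normalize to the same base name, and the duration check guards against false positives.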
+
+    is_reupload = existing_video is not None
+    is_duplicate_content = matched_video is not None
+
+    if is_duplicate_content:
+        logger.info(
+            "Duplicate detected via %s: '%s' matches existing video '%s' (%s)",
+            match_reason, source_file, matched_video.filename, matched_video.id,
+        )
+
+    if is_reupload:
+        video = existing_video
+        # Delete old segments for idempotent re-upload
+        await db.execute(
+            delete(TranscriptSegment).where(
+                TranscriptSegment.source_video_id == video.id
+            )
+        )
+        video.duration_seconds = duration_seconds
+        video.content_hash = content_hash
+        video.processing_status = ProcessingStatus.transcribed
+    elif is_duplicate_content:
+        # Same content, different filename — update the existing record
+        video = matched_video
+        await db.execute(
+            delete(TranscriptSegment).where(
+                TranscriptSegment.source_video_id == video.id
+            )
+        )
+        video.filename = source_file
+        video.file_path = f"{creator_folder}/{source_file}"
+        video.duration_seconds = duration_seconds
+        video.content_hash = content_hash
+        video.processing_status = ProcessingStatus.transcribed
+        is_reupload = True  # Treat as reupload for response
+    else:
+        video = SourceVideo(
+            creator_id=creator.id,
+            filename=source_file,
+            file_path=f"{creator_folder}/{source_file}",
+            duration_seconds=duration_seconds,
+            content_type=ContentType.tutorial,
+            content_hash=content_hash,
+            processing_status=ProcessingStatus.transcribed,
+        )
+        db.add(video)
+        await db.flush()  # assign id
+
+    # ── 4. Bulk-insert TranscriptSegments ────────────────────────────────
+    segment_objs = [
+        TranscriptSegment(
+            source_video_id=video.id,
+            start_time=float(seg["start"]),
+            end_time=float(seg["end"]),
+            text=str(seg["text"]),
+            segment_index=idx,
+        )
+        for idx, seg in enumerate(segments_data)
+    ]
+    db.add_all(segment_objs)
+
+    # ── 5. Save raw JSON to disk ─────────────────────────────────────────
+    transcript_dir = os.path.join(
+        settings.transcript_storage_path, creator_folder
+    )
+    transcript_path = os.path.join(transcript_dir, f"{source_file}.json")
+
+    try:
+        os.makedirs(transcript_dir, exist_ok=True)
+        with open(transcript_path, "w", encoding="utf-8") as f:
+            f.write(raw_text)
+    except OSError as exc:
+        raise HTTPException(
+            status_code=500, detail=f"Failed to save transcript: {exc}"
+        ) from exc
+
+    video.transcript_path = transcript_path
+
+    # ── 6. Commit & respond ──────────────────────────────────────────────
+    try:
+        await db.commit()
+    except Exception as exc:
+        await db.rollback()
+        logger.error("Database commit failed during ingest: %s", exc)
+        raise HTTPException(
+            status_code=500, detail="Database error during ingest"
+        ) from exc
+
+    await db.refresh(video)
+    await db.refresh(creator)
+
+    # ── 7. Dispatch LLM pipeline (best-effort) ──────────────────────────
+    try:
+        from pipeline.stages import run_pipeline
+
+        run_pipeline.delay(str(video.id))
+        logger.info("Pipeline dispatched for video_id=%s", video.id)
+    except Exception as exc:
+        logger.warning(
+            "Pipeline dispatch failed for video_id=%s (ingest still succeeds): %s",
+            video.id,
+            exc,
+        )
+
+    logger.info(
+        "Ingested transcript: creator=%s, file=%s, segments=%d, reupload=%s",
+        creator.name,
+        source_file,
+        len(segment_objs),
+        is_reupload,
+    )
+
+    return TranscriptIngestResponse(
+        video_id=video.id,
+        creator_id=creator.id,
+        creator_name=creator.name,
+        filename=source_file,
+        segments_stored=len(segment_objs),
+        processing_status=video.processing_status.value,
+        is_reupload=is_reupload,
+        content_hash=content_hash,
+    )
375 backend/routers/pipeline.py Normal file

@@ -0,0 +1,375 @@
+"""Pipeline management endpoints — public trigger + admin dashboard.
+
+Public:
+    POST /pipeline/trigger/{video_id}        Trigger pipeline for a video
+
+Admin:
+    GET  /admin/pipeline/videos              Video list with status + event counts
+    POST /admin/pipeline/trigger/{video_id}  Retrigger (same as public but under admin prefix)
+    POST /admin/pipeline/revoke/{video_id}   Revoke/cancel active tasks for a video
+    GET  /admin/pipeline/events/{video_id}   Event log for a video (paginated)
+    GET  /admin/pipeline/worker-status       Active/reserved tasks from Celery inspect
+"""
+
+import logging
+import uuid
+from typing import Annotated
+
+from fastapi import APIRouter, Depends, HTTPException, Query
+from sqlalchemy import func, select, case
+from sqlalchemy.ext.asyncio import AsyncSession
+
+from config import get_settings
+from database import get_session
+from models import PipelineEvent, SourceVideo, Creator
+from redis_client import get_redis
+from schemas import DebugModeResponse, DebugModeUpdate, TokenStageSummary, TokenSummaryResponse
+
+logger = logging.getLogger("chrysopedia.pipeline")
+
+router = APIRouter(tags=["pipeline"])
+
+REDIS_DEBUG_MODE_KEY = "chrysopedia:debug_mode"
+
+
+# ── Public trigger ───────────────────────────────────────────────────────────
+
+@router.post("/pipeline/trigger/{video_id}")
+async def trigger_pipeline(
+    video_id: str,
+    db: AsyncSession = Depends(get_session),
+):
+    """Manually trigger (or re-trigger) the LLM extraction pipeline for a video."""
+    stmt = select(SourceVideo).where(SourceVideo.id == video_id)
+    result = await db.execute(stmt)
+    video = result.scalar_one_or_none()
+
+    if video is None:
+        raise HTTPException(status_code=404, detail=f"Video not found: {video_id}")
+
+    from pipeline.stages import run_pipeline
+
+    try:
+        run_pipeline.delay(str(video.id))
+        logger.info("Pipeline manually triggered for video_id=%s", video_id)
+    except Exception as exc:
+        logger.warning("Failed to dispatch pipeline for video_id=%s: %s", video_id, exc)
+        raise HTTPException(
+            status_code=503,
+            detail="Pipeline dispatch failed — Celery/Redis may be unavailable",
+        ) from exc
+
+    return {
+        "status": "triggered",
+        "video_id": str(video.id),
+        "current_processing_status": video.processing_status.value,
+    }
+
+
+# ── Admin: Video list ────────────────────────────────────────────────────────
+
+@router.get("/admin/pipeline/videos")
+async def list_pipeline_videos(
+    db: AsyncSession = Depends(get_session),
+):
+    """List all videos with processing status and pipeline event counts."""
+    # Subquery for event counts per video
+    event_counts = (
+        select(
+            PipelineEvent.video_id,
+            func.count().label("event_count"),
+            func.sum(case(
+                (PipelineEvent.event_type == "llm_call", PipelineEvent.total_tokens),
+                else_=0
+            )).label("total_tokens_used"),
+            func.max(PipelineEvent.created_at).label("last_event_at"),
+        )
+        .group_by(PipelineEvent.video_id)
+        .subquery()
+    )
+
+    stmt = (
+        select(
+            SourceVideo.id,
+            SourceVideo.filename,
+            SourceVideo.processing_status,
+            SourceVideo.content_hash,
+            SourceVideo.created_at,
+            SourceVideo.updated_at,
+            Creator.name.label("creator_name"),
+            event_counts.c.event_count,
+            event_counts.c.total_tokens_used,
+            event_counts.c.last_event_at,
+        )
+        .join(Creator, SourceVideo.creator_id == Creator.id)
+        .outerjoin(event_counts, SourceVideo.id == event_counts.c.video_id)
+        .order_by(SourceVideo.updated_at.desc())
+    )
+
+    result = await db.execute(stmt)
+    rows = result.all()
+
+    return {
+        "items": [
+            {
+                "id": str(r.id),
+                "filename": r.filename,
+                "processing_status": r.processing_status.value if hasattr(r.processing_status, 'value') else str(r.processing_status),
+                "content_hash": r.content_hash,
+                "creator_name": r.creator_name,
+                "created_at": r.created_at.isoformat() if r.created_at else None,
+                "updated_at": r.updated_at.isoformat() if r.updated_at else None,
+                "event_count": r.event_count or 0,
+                "total_tokens_used": r.total_tokens_used or 0,
+                "last_event_at": r.last_event_at.isoformat() if r.last_event_at else None,
+            }
+            for r in rows
+        ],
+        "total": len(rows),
+    }
+
+
+# ── Admin: Retrigger ─────────────────────────────────────────────────────────
+
+@router.post("/admin/pipeline/trigger/{video_id}")
+async def admin_trigger_pipeline(
+    video_id: str,
+    db: AsyncSession = Depends(get_session),
+):
+    """Admin retrigger — same as public trigger."""
+    return await trigger_pipeline(video_id, db)
+
+
+# ── Admin: Revoke ────────────────────────────────────────────────────────────
+
+@router.post("/admin/pipeline/revoke/{video_id}")
+async def revoke_pipeline(video_id: str):
+    """Revoke/cancel active Celery tasks for a video.
+
+    Uses Celery's revoke with terminate=True to kill running tasks.
+    This is best-effort — the task may have already completed.
+    """
+    from worker import celery_app
+
+    try:
+        # Get active tasks and revoke any matching this video_id
+        inspector = celery_app.control.inspect()
+        active = inspector.active() or {}
+        revoked_count = 0
+
+        for _worker, tasks in active.items():
+            for task in tasks:
+                task_args = task.get("args", [])
+                if task_args and str(task_args[0]) == video_id:
+                    celery_app.control.revoke(task["id"], terminate=True)
+                    revoked_count += 1
+                    logger.info("Revoked task %s for video_id=%s", task["id"], video_id)
+
+        return {
+            "status": "revoked" if revoked_count > 0 else "no_active_tasks",
+            "video_id": video_id,
+            "tasks_revoked": revoked_count,
+        }
+    except Exception as exc:
+        logger.warning("Failed to revoke tasks for video_id=%s: %s", video_id, exc)
+        raise HTTPException(
+            status_code=503,
+            detail="Failed to communicate with Celery worker",
+        ) from exc
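The revoke endpoint filters the inspect output by matching each task's first positional argument against the video id. That filtering step, isolated from Celery itself (hypothetical `find_tasks_for_video` helper; `active` mirrors the `{worker: [task_dict, ...]}` shape that `inspect().active()` returns):

```python
def find_tasks_for_video(active: dict, video_id: str) -> list[str]:
    """Collect ids of tasks whose first argument matches video_id."""
    ids = []
    for _worker, tasks in active.items():
        for task in tasks:
            args = task.get("args", [])
            if args and str(args[0]) == video_id:
                ids.append(task["id"])
    return ids

active = {"worker1": [
    {"id": "t1", "args": ["vid-9"]},
    {"id": "t2", "args": ["other"]},
]}
print(find_tasks_for_video(active, "vid-9"))  # -> ["t1"]
```

Each id found this way would then be passed to `celery_app.control.revoke(task_id, terminate=True)`, as the endpoint does inline.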
+
+
+# ── Admin: Event log ─────────────────────────────────────────────────────────
+
+@router.get("/admin/pipeline/events/{video_id}")
+async def list_pipeline_events(
+    video_id: str,
+    offset: Annotated[int, Query(ge=0)] = 0,
+    limit: Annotated[int, Query(ge=1, le=200)] = 100,
+    stage: Annotated[str | None, Query(description="Filter by stage name")] = None,
+    event_type: Annotated[str | None, Query(description="Filter by event type")] = None,
+    order: Annotated[str, Query(description="Sort order: asc or desc")] = "desc",
+    db: AsyncSession = Depends(get_session),
+):
+    """Get pipeline events for a video. Default: newest first (desc)."""
+    stmt = select(PipelineEvent).where(PipelineEvent.video_id == video_id)
+
+    if stage:
+        stmt = stmt.where(PipelineEvent.stage == stage)
+    if event_type:
+        stmt = stmt.where(PipelineEvent.event_type == event_type)
+
+    # Validate order param
+    if order not in ("asc", "desc"):
+        raise HTTPException(status_code=400, detail="order must be 'asc' or 'desc'")
+
+    # Count
+    count_stmt = select(func.count()).select_from(stmt.subquery())
+    total = (await db.execute(count_stmt)).scalar() or 0
+
+    # Fetch
+    order_clause = PipelineEvent.created_at.asc() if order == "asc" else PipelineEvent.created_at.desc()
+    stmt = stmt.order_by(order_clause)
+    stmt = stmt.offset(offset).limit(limit)
+    result = await db.execute(stmt)
+    events = result.scalars().all()
+
+    return {
+        "items": [
+            {
+                "id": str(e.id),
+                "video_id": str(e.video_id),
+                "stage": e.stage,
+                "event_type": e.event_type,
+                "prompt_tokens": e.prompt_tokens,
+                "completion_tokens": e.completion_tokens,
+                "total_tokens": e.total_tokens,
+                "model": e.model,
+                "duration_ms": e.duration_ms,
+                "payload": e.payload,
+                "created_at": e.created_at.isoformat() if e.created_at else None,
+                "system_prompt_text": e.system_prompt_text,
+                "user_prompt_text": e.user_prompt_text,
+                "response_text": e.response_text,
+            }
+            for e in events
+        ],
+        "total": total,
+        "offset": offset,
+        "limit": limit,
+    }
+
+
+# ── Admin: Debug mode ─────────────────────────────────────────────────────────
+
+@router.get("/admin/pipeline/debug-mode", response_model=DebugModeResponse)
+async def get_debug_mode() -> DebugModeResponse:
|
||||
"""Get the current pipeline debug mode (on/off)."""
|
||||
settings = get_settings()
|
||||
try:
|
||||
redis = await get_redis()
|
||||
try:
|
||||
value = await redis.get(REDIS_DEBUG_MODE_KEY)
|
||||
if value is not None:
|
||||
return DebugModeResponse(debug_mode=value.lower() == "true")
|
||||
finally:
|
||||
await redis.aclose()
|
||||
except Exception as exc:
|
||||
logger.warning("Redis unavailable for debug mode read, using config default: %s", exc)
|
||||
|
||||
return DebugModeResponse(debug_mode=settings.debug_mode)
|
||||
|
||||
|
||||
@router.put("/admin/pipeline/debug-mode", response_model=DebugModeResponse)
|
||||
async def set_debug_mode(body: DebugModeUpdate) -> DebugModeResponse:
|
||||
"""Set the pipeline debug mode (on/off)."""
|
||||
try:
|
||||
redis = await get_redis()
|
||||
try:
|
||||
await redis.set(REDIS_DEBUG_MODE_KEY, str(body.debug_mode))
|
||||
finally:
|
||||
await redis.aclose()
|
||||
except Exception as exc:
|
||||
logger.error("Failed to set debug mode in Redis: %s", exc)
|
||||
raise HTTPException(
|
||||
status_code=503,
|
||||
detail=f"Redis unavailable: {exc}",
|
||||
)
|
||||
|
||||
logger.info("Pipeline debug mode set to %s", body.debug_mode)
|
||||
return DebugModeResponse(debug_mode=body.debug_mode)
|
||||
|
||||
|
||||
# ── Admin: Token summary ─────────────────────────────────────────────────────
|
||||
|
||||
@router.get("/admin/pipeline/token-summary/{video_id}", response_model=TokenSummaryResponse)
|
||||
async def get_token_summary(
|
||||
video_id: str,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> TokenSummaryResponse:
|
||||
"""Get per-stage token usage summary for a video."""
|
||||
stmt = (
|
||||
select(
|
||||
PipelineEvent.stage,
|
||||
func.count().label("call_count"),
|
||||
func.coalesce(func.sum(PipelineEvent.prompt_tokens), 0).label("total_prompt_tokens"),
|
||||
func.coalesce(func.sum(PipelineEvent.completion_tokens), 0).label("total_completion_tokens"),
|
||||
func.coalesce(func.sum(PipelineEvent.total_tokens), 0).label("total_tokens"),
|
||||
)
|
||||
.where(PipelineEvent.video_id == video_id)
|
||||
.where(PipelineEvent.event_type == "llm_call")
|
||||
.group_by(PipelineEvent.stage)
|
||||
.order_by(PipelineEvent.stage)
|
||||
)
|
||||
|
||||
result = await db.execute(stmt)
|
||||
rows = result.all()
|
||||
|
||||
stages = [
|
||||
TokenStageSummary(
|
||||
stage=r.stage,
|
||||
call_count=r.call_count,
|
||||
total_prompt_tokens=r.total_prompt_tokens,
|
||||
total_completion_tokens=r.total_completion_tokens,
|
||||
total_tokens=r.total_tokens,
|
||||
)
|
||||
for r in rows
|
||||
]
|
||||
grand_total = sum(s.total_tokens for s in stages)
|
||||
|
||||
return TokenSummaryResponse(
|
||||
video_id=video_id,
|
||||
stages=stages,
|
||||
grand_total_tokens=grand_total,
|
||||
)
|
||||
|
||||
|
||||
# ── Admin: Worker status ─────────────────────────────────────────────────────
|
||||
|
||||
@router.get("/admin/pipeline/worker-status")
|
||||
async def worker_status():
|
||||
"""Get current Celery worker status — active, reserved, and stats."""
|
||||
from worker import celery_app
|
||||
|
||||
try:
|
||||
inspector = celery_app.control.inspect()
|
||||
active = inspector.active() or {}
|
||||
reserved = inspector.reserved() or {}
|
||||
stats = inspector.stats() or {}
|
||||
|
||||
workers = []
|
||||
for worker_name in set(list(active.keys()) + list(reserved.keys()) + list(stats.keys())):
|
||||
worker_active = active.get(worker_name, [])
|
||||
worker_reserved = reserved.get(worker_name, [])
|
||||
worker_stats = stats.get(worker_name, {})
|
||||
|
||||
workers.append({
|
||||
"name": worker_name,
|
||||
"active_tasks": [
|
||||
{
|
||||
"id": t.get("id"),
|
||||
"name": t.get("name"),
|
||||
"args": t.get("args", []),
|
||||
"time_start": t.get("time_start"),
|
||||
}
|
||||
for t in worker_active
|
||||
],
|
||||
"reserved_tasks": len(worker_reserved),
|
||||
"total_completed": worker_stats.get("total", {}).get("tasks.pipeline.stages.stage2_segmentation", 0)
|
||||
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage3_extraction", 0)
|
||||
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage4_classification", 0)
|
||||
+ worker_stats.get("total", {}).get("tasks.pipeline.stages.stage5_synthesis", 0),
|
||||
"uptime": worker_stats.get("clock", None),
|
||||
"pool_size": worker_stats.get("pool", {}).get("max-concurrency") if isinstance(worker_stats.get("pool"), dict) else None,
|
||||
})
|
||||
|
||||
return {
|
||||
"online": len(workers) > 0,
|
||||
"workers": workers,
|
||||
}
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to inspect Celery workers: %s", exc)
|
||||
return {
|
||||
"online": False,
|
||||
"workers": [],
|
||||
"error": str(exc),
|
||||
}
|
||||
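The per-stage aggregation that `get_token_summary` pushes into SQL can be sketched in plain Python — a minimal sketch using hypothetical event dicts in place of `PipelineEvent` rows:

```python
from collections import defaultdict

def summarize_tokens(events):
    """Group llm_call events by stage and sum token counts,
    mirroring the GROUP BY in get_token_summary."""
    stages = defaultdict(lambda: {"call_count": 0, "total_tokens": 0})
    for e in events:
        if e["event_type"] != "llm_call":
            # Only LLM calls carry billable tokens
            continue
        s = stages[e["stage"]]
        s["call_count"] += 1
        s["total_tokens"] += e["total_tokens"] or 0  # coalesce None → 0
    return dict(stages)

events = [
    {"stage": "stage2", "event_type": "llm_call", "total_tokens": 100},
    {"stage": "stage2", "event_type": "llm_call", "total_tokens": 50},
    {"stage": "stage3", "event_type": "stage_done", "total_tokens": None},
    {"stage": "stage3", "event_type": "llm_call", "total_tokens": 200},
]
summary = summarize_tokens(events)
grand_total = sum(s["total_tokens"] for s in summary.values())  # → 350
```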
147  backend/routers/reports.py  Normal file
@@ -0,0 +1,147 @@
"""Content reports router — public submission + admin management.
|
||||
|
||||
Public:
|
||||
POST /reports Submit a content issue report
|
||||
|
||||
Admin:
|
||||
GET /admin/reports List reports (filterable by status, content_type)
|
||||
GET /admin/reports/{id} Get single report detail
|
||||
PATCH /admin/reports/{id} Update status / add admin notes
|
||||
"""
|
||||
|
||||
import logging
|
||||
import uuid
|
||||
from datetime import datetime, timezone
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||
from sqlalchemy import func, select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from database import get_session
|
||||
from models import ContentReport, ReportStatus
|
||||
from schemas import (
|
||||
ContentReportCreate,
|
||||
ContentReportListResponse,
|
||||
ContentReportRead,
|
||||
ContentReportUpdate,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("chrysopedia.reports")
|
||||
|
||||
router = APIRouter(tags=["reports"])
|
||||
|
||||
|
||||
# ── Public ───────────────────────────────────────────────────────────────────
|
||||
|
||||
@router.post("/reports", response_model=ContentReportRead, status_code=201)
|
||||
async def submit_report(
|
||||
body: ContentReportCreate,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
):
|
||||
"""Submit a content issue report (public, no auth)."""
|
||||
report = ContentReport(
|
||||
content_type=body.content_type,
|
||||
content_id=body.content_id,
|
||||
content_title=body.content_title,
|
||||
report_type=body.report_type,
|
||||
description=body.description,
|
||||
page_url=body.page_url,
|
||||
)
|
||||
db.add(report)
|
||||
await db.commit()
|
||||
await db.refresh(report)
|
||||
|
||||
logger.info(
|
||||
"New content report: id=%s type=%s content=%s/%s",
|
||||
report.id, report.report_type, report.content_type, report.content_id,
|
||||
)
|
||||
return report
|
||||
|
||||
|
||||
# ── Admin ────────────────────────────────────────────────────────────────────
|
||||
|
||||
@router.get("/admin/reports", response_model=ContentReportListResponse)
|
||||
async def list_reports(
|
||||
status: Annotated[str | None, Query(description="Filter by status")] = None,
|
||||
content_type: Annotated[str | None, Query(description="Filter by content type")] = None,
|
||||
offset: Annotated[int, Query(ge=0)] = 0,
|
||||
limit: Annotated[int, Query(ge=1, le=100)] = 50,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
):
|
||||
"""List content reports with optional filters."""
|
||||
stmt = select(ContentReport)
|
||||
|
||||
if status:
|
||||
stmt = stmt.where(ContentReport.status == status)
|
||||
if content_type:
|
||||
stmt = stmt.where(ContentReport.content_type == content_type)
|
||||
|
||||
# Count
|
||||
count_stmt = select(func.count()).select_from(stmt.subquery())
|
||||
total = (await db.execute(count_stmt)).scalar() or 0
|
||||
|
||||
# Fetch page
|
||||
stmt = stmt.order_by(ContentReport.created_at.desc())
|
||||
stmt = stmt.offset(offset).limit(limit)
|
||||
result = await db.execute(stmt)
|
||||
items = result.scalars().all()
|
||||
|
||||
return {"items": items, "total": total, "offset": offset, "limit": limit}
|
||||
|
||||
|
||||
@router.get("/admin/reports/{report_id}", response_model=ContentReportRead)
|
||||
async def get_report(
|
||||
report_id: uuid.UUID,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
):
|
||||
"""Get a single content report by ID."""
|
||||
result = await db.execute(
|
||||
select(ContentReport).where(ContentReport.id == report_id)
|
||||
)
|
||||
report = result.scalar_one_or_none()
|
||||
if not report:
|
||||
raise HTTPException(status_code=404, detail="Report not found")
|
||||
return report
|
||||
|
||||
|
||||
@router.patch("/admin/reports/{report_id}", response_model=ContentReportRead)
|
||||
async def update_report(
|
||||
report_id: uuid.UUID,
|
||||
body: ContentReportUpdate,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
):
|
||||
"""Update report status and/or admin notes."""
|
||||
result = await db.execute(
|
||||
select(ContentReport).where(ContentReport.id == report_id)
|
||||
)
|
||||
report = result.scalar_one_or_none()
|
||||
if not report:
|
||||
raise HTTPException(status_code=404, detail="Report not found")
|
||||
|
||||
if body.status is not None:
|
||||
# Validate status value
|
||||
try:
|
||||
ReportStatus(body.status)
|
||||
except ValueError:
|
||||
raise HTTPException(
|
||||
status_code=422,
|
||||
detail=f"Invalid status: {body.status}. Must be one of: open, acknowledged, resolved, dismissed",
|
||||
)
|
||||
report.status = body.status
|
||||
if body.status in ("resolved", "dismissed"):
|
||||
report.resolved_at = datetime.now(timezone.utc).replace(tzinfo=None)
|
||||
elif body.status == "open":
|
||||
report.resolved_at = None
|
||||
|
||||
if body.admin_notes is not None:
|
||||
report.admin_notes = body.admin_notes
|
||||
|
||||
await db.commit()
|
||||
await db.refresh(report)
|
||||
|
||||
logger.info(
|
||||
"Report updated: id=%s status=%s",
|
||||
report.id, report.status,
|
||||
)
|
||||
return report
|
||||
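The status/`resolved_at` bookkeeping in `update_report` can be exercised as a pure function — a sketch over a plain dict rather than the `ContentReport` ORM model:

```python
from datetime import datetime, timezone

VALID_REPORT_STATUSES = {"open", "acknowledged", "resolved", "dismissed"}

def apply_status(report: dict, new_status: str) -> dict:
    """Mirror update_report: set status, stamp resolved_at on
    resolved/dismissed, clear it when the report is reopened."""
    if new_status not in VALID_REPORT_STATUSES:
        raise ValueError(f"Invalid status: {new_status}")
    report["status"] = new_status
    if new_status in ("resolved", "dismissed"):
        report["resolved_at"] = datetime.now(timezone.utc).replace(tzinfo=None)
    elif new_status == "open":
        report["resolved_at"] = None
    return report

resolved = apply_status({"status": "open", "resolved_at": None}, "resolved")
reopened = apply_status(dict(resolved), "open")
```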
375  backend/routers/review.py  Normal file
@@ -0,0 +1,375 @@
"""Review queue endpoints for Chrysopedia API.
|
||||
|
||||
Provides admin review workflow: list queue, stats, approve, reject,
|
||||
edit, split, merge key moments, and toggle review/auto mode via Redis.
|
||||
"""
|
||||
|
||||
import logging
|
||||
import uuid
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||
from sqlalchemy import case, func, select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from config import get_settings
|
||||
from database import get_session
|
||||
from models import Creator, KeyMoment, KeyMomentContentType, ReviewStatus, SourceVideo
|
||||
from redis_client import get_redis
|
||||
from schemas import (
|
||||
KeyMomentRead,
|
||||
MomentEditRequest,
|
||||
MomentMergeRequest,
|
||||
MomentSplitRequest,
|
||||
ReviewModeResponse,
|
||||
ReviewModeUpdate,
|
||||
ReviewQueueItem,
|
||||
ReviewQueueResponse,
|
||||
ReviewStatsResponse,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("chrysopedia.review")
|
||||
|
||||
router = APIRouter(prefix="/review", tags=["review"])
|
||||
|
||||
REDIS_MODE_KEY = "chrysopedia:review_mode"
|
||||
|
||||
VALID_STATUSES = {"pending", "approved", "edited", "rejected", "all"}
|
||||
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _moment_to_queue_item(
|
||||
moment: KeyMoment, video_filename: str, creator_name: str
|
||||
) -> ReviewQueueItem:
|
||||
"""Convert a KeyMoment ORM instance + joined fields to a ReviewQueueItem."""
|
||||
data = KeyMomentRead.model_validate(moment).model_dump()
|
||||
data["video_filename"] = video_filename
|
||||
data["creator_name"] = creator_name
|
||||
return ReviewQueueItem(**data)
|
||||
|
||||
|
||||
# ── Endpoints ────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
@router.get("/queue", response_model=ReviewQueueResponse)
|
||||
async def list_queue(
|
||||
status: Annotated[str, Query()] = "pending",
|
||||
offset: Annotated[int, Query(ge=0)] = 0,
|
||||
limit: Annotated[int, Query(ge=1, le=1000)] = 50,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> ReviewQueueResponse:
|
||||
"""List key moments in the review queue, filtered by status."""
|
||||
if status not in VALID_STATUSES:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Invalid status filter '{status}'. Must be one of: {', '.join(sorted(VALID_STATUSES))}",
|
||||
)
|
||||
|
||||
# Base query joining KeyMoment → SourceVideo → Creator
|
||||
base = (
|
||||
select(
|
||||
KeyMoment,
|
||||
SourceVideo.filename.label("video_filename"),
|
||||
Creator.name.label("creator_name"),
|
||||
)
|
||||
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
|
||||
.join(Creator, SourceVideo.creator_id == Creator.id)
|
||||
)
|
||||
|
||||
if status != "all":
|
||||
base = base.where(KeyMoment.review_status == ReviewStatus(status))
|
||||
|
||||
# Count total matching rows
|
||||
count_stmt = select(func.count()).select_from(base.subquery())
|
||||
total = (await db.execute(count_stmt)).scalar_one()
|
||||
|
||||
# Fetch paginated results
|
||||
stmt = base.order_by(KeyMoment.created_at.desc()).offset(offset).limit(limit)
|
||||
rows = (await db.execute(stmt)).all()
|
||||
|
||||
items = [
|
||||
_moment_to_queue_item(row.KeyMoment, row.video_filename, row.creator_name)
|
||||
for row in rows
|
||||
]
|
||||
|
||||
return ReviewQueueResponse(items=items, total=total, offset=offset, limit=limit)
|
||||
|
||||
|
||||
@router.get("/stats", response_model=ReviewStatsResponse)
|
||||
async def get_stats(
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> ReviewStatsResponse:
|
||||
"""Return counts of key moments grouped by review status."""
|
||||
stmt = (
|
||||
select(
|
||||
KeyMoment.review_status,
|
||||
func.count().label("cnt"),
|
||||
)
|
||||
.group_by(KeyMoment.review_status)
|
||||
)
|
||||
result = await db.execute(stmt)
|
||||
counts = {row.review_status.value: row.cnt for row in result.all()}
|
||||
|
||||
return ReviewStatsResponse(
|
||||
pending=counts.get("pending", 0),
|
||||
approved=counts.get("approved", 0),
|
||||
edited=counts.get("edited", 0),
|
||||
rejected=counts.get("rejected", 0),
|
||||
)
|
||||
|
||||
|
||||
@router.post("/moments/{moment_id}/approve", response_model=KeyMomentRead)
|
||||
async def approve_moment(
|
||||
moment_id: uuid.UUID,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> KeyMomentRead:
|
||||
"""Approve a key moment for publishing."""
|
||||
moment = await db.get(KeyMoment, moment_id)
|
||||
if moment is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Key moment {moment_id} not found",
|
||||
)
|
||||
|
||||
moment.review_status = ReviewStatus.approved
|
||||
await db.commit()
|
||||
await db.refresh(moment)
|
||||
|
||||
logger.info("Approved key moment %s", moment_id)
|
||||
return KeyMomentRead.model_validate(moment)
|
||||
|
||||
|
||||
@router.post("/moments/{moment_id}/reject", response_model=KeyMomentRead)
|
||||
async def reject_moment(
|
||||
moment_id: uuid.UUID,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> KeyMomentRead:
|
||||
"""Reject a key moment."""
|
||||
moment = await db.get(KeyMoment, moment_id)
|
||||
if moment is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Key moment {moment_id} not found",
|
||||
)
|
||||
|
||||
moment.review_status = ReviewStatus.rejected
|
||||
await db.commit()
|
||||
await db.refresh(moment)
|
||||
|
||||
logger.info("Rejected key moment %s", moment_id)
|
||||
return KeyMomentRead.model_validate(moment)
|
||||
|
||||
|
||||
@router.put("/moments/{moment_id}", response_model=KeyMomentRead)
|
||||
async def edit_moment(
|
||||
moment_id: uuid.UUID,
|
||||
body: MomentEditRequest,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> KeyMomentRead:
|
||||
"""Update editable fields of a key moment and set status to edited."""
|
||||
moment = await db.get(KeyMoment, moment_id)
|
||||
if moment is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Key moment {moment_id} not found",
|
||||
)
|
||||
|
||||
update_data = body.model_dump(exclude_unset=True)
|
||||
# Convert content_type string to enum if provided
|
||||
if "content_type" in update_data and update_data["content_type"] is not None:
|
||||
try:
|
||||
update_data["content_type"] = KeyMomentContentType(update_data["content_type"])
|
||||
except ValueError:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=f"Invalid content_type '{update_data['content_type']}'",
|
||||
)
|
||||
|
||||
for field, value in update_data.items():
|
||||
setattr(moment, field, value)
|
||||
|
||||
moment.review_status = ReviewStatus.edited
|
||||
await db.commit()
|
||||
await db.refresh(moment)
|
||||
|
||||
logger.info("Edited key moment %s (fields: %s)", moment_id, list(update_data.keys()))
|
||||
return KeyMomentRead.model_validate(moment)
|
||||
|
||||
|
||||
@router.post("/moments/{moment_id}/split", response_model=list[KeyMomentRead])
|
||||
async def split_moment(
|
||||
moment_id: uuid.UUID,
|
||||
body: MomentSplitRequest,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> list[KeyMomentRead]:
|
||||
"""Split a key moment into two at the given timestamp."""
|
||||
moment = await db.get(KeyMoment, moment_id)
|
||||
if moment is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Key moment {moment_id} not found",
|
||||
)
|
||||
|
||||
# Validate split_time is strictly between start_time and end_time
|
||||
if body.split_time <= moment.start_time or body.split_time >= moment.end_time:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail=(
|
||||
f"split_time ({body.split_time}) must be strictly between "
|
||||
f"start_time ({moment.start_time}) and end_time ({moment.end_time})"
|
||||
),
|
||||
)
|
||||
|
||||
# Update original moment to [start_time, split_time)
|
||||
original_end = moment.end_time
|
||||
moment.end_time = body.split_time
|
||||
moment.review_status = ReviewStatus.pending
|
||||
|
||||
# Create new moment for [split_time, end_time]
|
||||
new_moment = KeyMoment(
|
||||
source_video_id=moment.source_video_id,
|
||||
technique_page_id=moment.technique_page_id,
|
||||
title=f"{moment.title} (split)",
|
||||
summary=moment.summary,
|
||||
start_time=body.split_time,
|
||||
end_time=original_end,
|
||||
content_type=moment.content_type,
|
||||
plugins=moment.plugins,
|
||||
review_status=ReviewStatus.pending,
|
||||
raw_transcript=moment.raw_transcript,
|
||||
)
|
||||
db.add(new_moment)
|
||||
|
||||
await db.commit()
|
||||
await db.refresh(moment)
|
||||
await db.refresh(new_moment)
|
||||
|
||||
logger.info(
|
||||
"Split key moment %s at %.2f → original [%.2f, %.2f), new [%.2f, %.2f]",
|
||||
moment_id, body.split_time,
|
||||
moment.start_time, moment.end_time,
|
||||
new_moment.start_time, new_moment.end_time,
|
||||
)
|
||||
|
||||
return [
|
||||
KeyMomentRead.model_validate(moment),
|
||||
KeyMomentRead.model_validate(new_moment),
|
||||
]
|
||||
|
||||
|
||||
@router.post("/moments/{moment_id}/merge", response_model=KeyMomentRead)
|
||||
async def merge_moments(
|
||||
moment_id: uuid.UUID,
|
||||
body: MomentMergeRequest,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> KeyMomentRead:
|
||||
"""Merge two key moments into one."""
|
||||
if moment_id == body.target_moment_id:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail="Cannot merge a moment with itself",
|
||||
)
|
||||
|
||||
source = await db.get(KeyMoment, moment_id)
|
||||
if source is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Key moment {moment_id} not found",
|
||||
)
|
||||
|
||||
target = await db.get(KeyMoment, body.target_moment_id)
|
||||
if target is None:
|
||||
raise HTTPException(
|
||||
status_code=404,
|
||||
detail=f"Target key moment {body.target_moment_id} not found",
|
||||
)
|
||||
|
||||
# Both must belong to the same source video
|
||||
if source.source_video_id != target.source_video_id:
|
||||
raise HTTPException(
|
||||
status_code=400,
|
||||
detail="Cannot merge moments from different source videos",
|
||||
)
|
||||
|
||||
# Merge: combined summary, min start, max end
|
||||
source.summary = f"{source.summary}\n\n{target.summary}"
|
||||
source.start_time = min(source.start_time, target.start_time)
|
||||
source.end_time = max(source.end_time, target.end_time)
|
||||
source.review_status = ReviewStatus.pending
|
||||
|
||||
# Delete target
|
||||
await db.delete(target)
|
||||
await db.commit()
|
||||
await db.refresh(source)
|
||||
|
||||
logger.info(
|
||||
"Merged key moment %s with %s → [%.2f, %.2f]",
|
||||
moment_id, body.target_moment_id,
|
||||
source.start_time, source.end_time,
|
||||
)
|
||||
|
||||
return KeyMomentRead.model_validate(source)
|
||||
|
||||
|
||||
|
||||
|
||||
@router.get("/moments/{moment_id}", response_model=ReviewQueueItem)
|
||||
async def get_moment(
|
||||
moment_id: uuid.UUID,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> ReviewQueueItem:
|
||||
"""Get a single key moment by ID with video and creator info."""
|
||||
stmt = (
|
||||
select(KeyMoment, SourceVideo.file_path, Creator.name)
|
||||
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
|
||||
.join(Creator, SourceVideo.creator_id == Creator.id)
|
||||
.where(KeyMoment.id == moment_id)
|
||||
)
|
||||
result = await db.execute(stmt)
|
||||
row = result.one_or_none()
|
||||
if row is None:
|
||||
raise HTTPException(status_code=404, detail=f"Moment {moment_id} not found")
|
||||
moment, file_path, creator_name = row
|
||||
return _moment_to_queue_item(moment, file_path or "", creator_name)
|
||||
|
||||
@router.get("/mode", response_model=ReviewModeResponse)
|
||||
async def get_mode() -> ReviewModeResponse:
|
||||
"""Get the current review mode (review vs auto)."""
|
||||
settings = get_settings()
|
||||
try:
|
||||
redis = await get_redis()
|
||||
try:
|
||||
value = await redis.get(REDIS_MODE_KEY)
|
||||
if value is not None:
|
||||
return ReviewModeResponse(review_mode=value.lower() == "true")
|
||||
finally:
|
||||
await redis.aclose()
|
||||
except Exception as exc:
|
||||
# Redis unavailable — fall back to config default
|
||||
logger.warning("Redis unavailable for mode read, using config default: %s", exc)
|
||||
|
||||
return ReviewModeResponse(review_mode=settings.review_mode)
|
||||
|
||||
|
||||
@router.put("/mode", response_model=ReviewModeResponse)
|
||||
async def set_mode(
|
||||
body: ReviewModeUpdate,
|
||||
) -> ReviewModeResponse:
|
||||
"""Set the review mode (review vs auto)."""
|
||||
try:
|
||||
redis = await get_redis()
|
||||
try:
|
||||
await redis.set(REDIS_MODE_KEY, str(body.review_mode))
|
||||
finally:
|
||||
await redis.aclose()
|
||||
except Exception as exc:
|
||||
logger.error("Failed to set review mode in Redis: %s", exc)
|
||||
raise HTTPException(
|
||||
status_code=503,
|
||||
detail=f"Redis unavailable: {exc}",
|
||||
)
|
||||
|
||||
logger.info("Review mode set to %s", body.review_mode)
|
||||
return ReviewModeResponse(review_mode=body.review_mode)
|
||||
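The boundary check in `split_moment` is easiest to see as a pure function over the time interval — a sketch of the half-open/closed split the endpoint performs, detached from the ORM:

```python
def split_interval(start: float, end: float, split_time: float):
    """Validate and perform the timestamp split used by split_moment:
    the original moment becomes [start, split_time) and the new
    moment covers [split_time, end]."""
    if split_time <= start or split_time >= end:
        raise ValueError(
            f"split_time ({split_time}) must be strictly between "
            f"start_time ({start}) and end_time ({end})"
        )
    return (start, split_time), (split_time, end)

first, second = split_interval(10.0, 30.0, 18.5)
```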
46  backend/routers/search.py  Normal file
@@ -0,0 +1,46 @@
"""Search endpoint for semantic + keyword search with graceful fallback."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends, Query
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from config import get_settings
|
||||
from database import get_session
|
||||
from schemas import SearchResponse, SearchResultItem
|
||||
from search_service import SearchService
|
||||
|
||||
logger = logging.getLogger("chrysopedia.search.router")
|
||||
|
||||
router = APIRouter(prefix="/search", tags=["search"])
|
||||
|
||||
|
||||
def _get_search_service() -> SearchService:
|
||||
"""Build a SearchService from current settings."""
|
||||
return SearchService(get_settings())
|
||||
|
||||
|
||||
@router.get("", response_model=SearchResponse)
|
||||
async def search(
|
||||
q: Annotated[str, Query(max_length=500)] = "",
|
||||
scope: Annotated[str, Query()] = "all",
|
||||
limit: Annotated[int, Query(ge=1, le=100)] = 20,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> SearchResponse:
|
||||
"""Semantic search with keyword fallback.
|
||||
|
||||
- **q**: Search query (max 500 chars). Empty → empty results.
|
||||
- **scope**: ``all`` | ``topics`` | ``creators``. Invalid → defaults to ``all``.
|
||||
- **limit**: Max results (1–100, default 20).
|
||||
"""
|
||||
svc = _get_search_service()
|
||||
result = await svc.search(query=q, scope=scope, limit=limit, db=db)
|
||||
return SearchResponse(
|
||||
items=[SearchResultItem(**item) for item in result["items"]],
|
||||
total=result["total"],
|
||||
query=result["query"],
|
||||
fallback_used=result["fallback_used"],
|
||||
)
|
||||
217  backend/routers/techniques.py  Normal file
@@ -0,0 +1,217 @@
"""Technique page endpoints — list and detail with eager-loaded relations."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends, HTTPException, Query
|
||||
from sqlalchemy import func, select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
from sqlalchemy.orm import selectinload
|
||||
|
||||
from database import get_session
|
||||
from models import Creator, KeyMoment, RelatedTechniqueLink, SourceVideo, TechniquePage, TechniquePageVersion
|
||||
from schemas import (
|
||||
CreatorInfo,
|
||||
KeyMomentSummary,
|
||||
PaginatedResponse,
|
||||
RelatedLinkItem,
|
||||
TechniquePageDetail,
|
||||
TechniquePageRead,
|
||||
TechniquePageVersionDetail,
|
||||
TechniquePageVersionListResponse,
|
||||
TechniquePageVersionSummary,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("chrysopedia.techniques")
|
||||
|
||||
router = APIRouter(prefix="/techniques", tags=["techniques"])
|
||||
|
||||
|
||||
@router.get("", response_model=PaginatedResponse)
|
||||
async def list_techniques(
|
||||
category: Annotated[str | None, Query()] = None,
|
||||
creator_slug: Annotated[str | None, Query()] = None,
|
||||
offset: Annotated[int, Query(ge=0)] = 0,
|
||||
limit: Annotated[int, Query(ge=1, le=100)] = 50,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> PaginatedResponse:
|
||||
"""List technique pages with optional category/creator filtering."""
|
||||
stmt = select(TechniquePage)
|
||||
|
||||
if category:
|
||||
stmt = stmt.where(TechniquePage.topic_category == category)
|
||||
|
||||
if creator_slug:
|
||||
# Join to Creator to filter by slug
|
||||
stmt = stmt.join(Creator, TechniquePage.creator_id == Creator.id).where(
|
||||
Creator.slug == creator_slug
|
||||
)
|
||||
|
||||
# Count total before pagination
|
||||
from sqlalchemy import func
|
||||
|
||||
count_stmt = select(func.count()).select_from(stmt.subquery())
|
||||
count_result = await db.execute(count_stmt)
|
||||
total = count_result.scalar() or 0
|
||||
|
||||
stmt = stmt.options(selectinload(TechniquePage.creator)).order_by(TechniquePage.created_at.desc()).offset(offset).limit(limit)
|
||||
result = await db.execute(stmt)
|
||||
pages = result.scalars().all()
|
||||
|
||||
items = []
|
||||
for p in pages:
|
||||
item = TechniquePageRead.model_validate(p)
|
||||
if p.creator:
|
||||
item.creator_name = p.creator.name
|
||||
item.creator_slug = p.creator.slug
|
||||
items.append(item)
|
||||
|
||||
return PaginatedResponse(
|
||||
items=items,
|
||||
total=total,
|
||||
offset=offset,
|
||||
        limit=limit,
    )


@router.get("/{slug}", response_model=TechniquePageDetail)
async def get_technique(
    slug: str,
    db: AsyncSession = Depends(get_session),
) -> TechniquePageDetail:
    """Get full technique page detail with key moments, creator, and related links."""
    stmt = (
        select(TechniquePage)
        .where(TechniquePage.slug == slug)
        .options(
            selectinload(TechniquePage.key_moments).selectinload(KeyMoment.source_video),
            selectinload(TechniquePage.creator),
            selectinload(TechniquePage.outgoing_links).selectinload(
                RelatedTechniqueLink.target_page
            ),
            selectinload(TechniquePage.incoming_links).selectinload(
                RelatedTechniqueLink.source_page
            ),
        )
    )
    result = await db.execute(stmt)
    page = result.scalar_one_or_none()

    if page is None:
        raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")

    # Build key moments (ordered by start_time)
    key_moments = sorted(page.key_moments, key=lambda km: km.start_time)
    key_moment_items = []
    for km in key_moments:
        item = KeyMomentSummary.model_validate(km)
        item.video_filename = km.source_video.filename if km.source_video else ""
        key_moment_items.append(item)

    # Build creator info
    creator_info = None
    if page.creator:
        creator_info = CreatorInfo(
            name=page.creator.name,
            slug=page.creator.slug,
            genres=page.creator.genres,
        )

    # Build related links (outgoing + incoming)
    related_links: list[RelatedLinkItem] = []
    for link in page.outgoing_links:
        if link.target_page:
            related_links.append(
                RelatedLinkItem(
                    target_title=link.target_page.title,
                    target_slug=link.target_page.slug,
                    relationship=link.relationship.value if hasattr(link.relationship, 'value') else str(link.relationship),
                )
            )
    for link in page.incoming_links:
        if link.source_page:
            related_links.append(
                RelatedLinkItem(
                    target_title=link.source_page.title,
                    target_slug=link.source_page.slug,
                    relationship=link.relationship.value if hasattr(link.relationship, 'value') else str(link.relationship),
                )
            )

    base = TechniquePageRead.model_validate(page)

    # Count versions for this page
    version_count_stmt = select(func.count()).where(
        TechniquePageVersion.technique_page_id == page.id
    )
    version_count_result = await db.execute(version_count_stmt)
    version_count = version_count_result.scalar() or 0

    return TechniquePageDetail(
        **base.model_dump(),
        key_moments=key_moment_items,
        creator_info=creator_info,
        related_links=related_links,
        version_count=version_count,
    )


@router.get("/{slug}/versions", response_model=TechniquePageVersionListResponse)
async def list_technique_versions(
    slug: str,
    db: AsyncSession = Depends(get_session),
) -> TechniquePageVersionListResponse:
    """List all version snapshots for a technique page, newest first."""
    # Resolve the technique page
    page_stmt = select(TechniquePage).where(TechniquePage.slug == slug)
    page_result = await db.execute(page_stmt)
    page = page_result.scalar_one_or_none()
    if page is None:
        raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")

    # Fetch versions ordered by version_number DESC
    versions_stmt = (
        select(TechniquePageVersion)
        .where(TechniquePageVersion.technique_page_id == page.id)
        .order_by(TechniquePageVersion.version_number.desc())
    )
    versions_result = await db.execute(versions_stmt)
    versions = versions_result.scalars().all()

    items = [TechniquePageVersionSummary.model_validate(v) for v in versions]
    return TechniquePageVersionListResponse(items=items, total=len(items))


@router.get("/{slug}/versions/{version_number}", response_model=TechniquePageVersionDetail)
async def get_technique_version(
    slug: str,
    version_number: int,
    db: AsyncSession = Depends(get_session),
) -> TechniquePageVersionDetail:
    """Get a specific version snapshot by version number."""
    # Resolve the technique page
    page_stmt = select(TechniquePage).where(TechniquePage.slug == slug)
    page_result = await db.execute(page_stmt)
    page = page_result.scalar_one_or_none()
    if page is None:
        raise HTTPException(status_code=404, detail=f"Technique '{slug}' not found")

    # Fetch the specific version
    version_stmt = (
        select(TechniquePageVersion)
        .where(
            TechniquePageVersion.technique_page_id == page.id,
            TechniquePageVersion.version_number == version_number,
        )
    )
    version_result = await db.execute(version_stmt)
    version = version_result.scalar_one_or_none()
    if version is None:
        raise HTTPException(
            status_code=404,
            detail=f"Version {version_number} not found for technique '{slug}'",
        )

    return TechniquePageVersionDetail.model_validate(version)
144
backend/routers/topics.py
Normal file

@@ -0,0 +1,144 @@
"""Topics endpoint — two-level category hierarchy with aggregated counts."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
from typing import Annotated, Any
|
||||
|
||||
import yaml
|
||||
from fastapi import APIRouter, Depends, Query
|
||||
from sqlalchemy import func, select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
from sqlalchemy.orm import selectinload
|
||||
|
||||
from database import get_session
|
||||
from models import Creator, TechniquePage
|
||||
from schemas import (
|
||||
PaginatedResponse,
|
||||
TechniquePageRead,
|
||||
TopicCategory,
|
||||
TopicSubTopic,
|
||||
)
|
||||
|
||||
logger = logging.getLogger("chrysopedia.topics")
|
||||
|
||||
router = APIRouter(prefix="/topics", tags=["topics"])
|
||||
|
||||
# Path to canonical_tags.yaml relative to the backend directory
|
||||
_TAGS_PATH = os.path.join(os.path.dirname(__file__), "..", "..", "config", "canonical_tags.yaml")
|
||||
|
||||
|
||||
def _load_canonical_tags() -> list[dict[str, Any]]:
|
||||
"""Load the canonical tag categories from YAML."""
|
||||
path = os.path.normpath(_TAGS_PATH)
|
||||
try:
|
||||
with open(path) as f:
|
||||
data = yaml.safe_load(f)
|
||||
return data.get("categories", [])
|
||||
except FileNotFoundError:
|
||||
logger.warning("canonical_tags.yaml not found at %s", path)
|
||||
return []
|
||||
|
||||
|
||||
@router.get("", response_model=list[TopicCategory])
|
||||
async def list_topics(
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> list[TopicCategory]:
|
||||
"""Return the two-level topic hierarchy with technique/creator counts per sub-topic.
|
||||
|
||||
Categories come from ``canonical_tags.yaml``. Counts are computed
|
||||
from live DB data by matching ``topic_tags`` array contents.
|
||||
"""
|
||||
categories = _load_canonical_tags()
|
||||
|
||||
# Pre-fetch all technique pages with their tags and creator_ids for counting
|
||||
tp_stmt = select(
|
||||
TechniquePage.topic_category,
|
||||
TechniquePage.topic_tags,
|
||||
TechniquePage.creator_id,
|
||||
)
|
||||
tp_result = await db.execute(tp_stmt)
|
||||
tp_rows = tp_result.all()
|
||||
|
||||
# Build per-sub-topic counts
|
||||
result: list[TopicCategory] = []
|
||||
for cat in categories:
|
||||
cat_name = cat.get("name", "")
|
||||
cat_desc = cat.get("description", "")
|
||||
sub_topic_names: list[str] = cat.get("sub_topics", [])
|
||||
|
||||
sub_topics: list[TopicSubTopic] = []
|
||||
for st_name in sub_topic_names:
|
||||
technique_count = 0
|
||||
creator_ids: set[str] = set()
|
||||
|
||||
for tp_cat, tp_tags, tp_creator_id in tp_rows:
|
||||
tags = tp_tags or []
|
||||
# Match if the sub-topic name appears in the technique's tags
|
||||
# or if the category matches and tag is in sub-topics
|
||||
if st_name.lower() in [t.lower() for t in tags]:
|
||||
technique_count += 1
|
||||
creator_ids.add(str(tp_creator_id))
|
||||
|
||||
sub_topics.append(
|
||||
TopicSubTopic(
|
||||
name=st_name,
|
||||
technique_count=technique_count,
|
||||
creator_count=len(creator_ids),
|
||||
)
|
||||
)
|
||||
|
||||
result.append(
|
||||
TopicCategory(
|
||||
name=cat_name,
|
||||
description=cat_desc,
|
||||
sub_topics=sub_topics,
|
||||
)
|
||||
)
|
||||
|
||||
return result
|
||||
|
||||
|
||||
@router.get("/{category_slug}", response_model=PaginatedResponse)
|
||||
async def get_topic_techniques(
|
||||
category_slug: str,
|
||||
offset: Annotated[int, Query(ge=0)] = 0,
|
||||
limit: Annotated[int, Query(ge=1, le=100)] = 50,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> PaginatedResponse:
|
||||
"""Return technique pages filtered by topic_category.
|
||||
|
||||
The ``category_slug`` is matched case-insensitively against
|
||||
``technique_pages.topic_category`` (e.g. 'sound-design' matches 'Sound design').
|
||||
"""
|
||||
# Normalize slug to category name: replace hyphens with spaces, title-case
|
||||
category_name = category_slug.replace("-", " ").title()
|
||||
|
||||
# Also try exact match on the slug form
|
||||
stmt = select(TechniquePage).where(
|
||||
TechniquePage.topic_category.ilike(category_name)
|
||||
)
|
||||
|
||||
count_stmt = select(func.count()).select_from(stmt.subquery())
|
||||
count_result = await db.execute(count_stmt)
|
||||
total = count_result.scalar() or 0
|
||||
|
||||
stmt = stmt.options(selectinload(TechniquePage.creator)).order_by(TechniquePage.title).offset(offset).limit(limit)
|
||||
result = await db.execute(stmt)
|
||||
pages = result.scalars().all()
|
||||
|
||||
items = []
|
||||
for p in pages:
|
||||
item = TechniquePageRead.model_validate(p)
|
||||
if p.creator:
|
||||
item.creator_name = p.creator.name
|
||||
item.creator_slug = p.creator.slug
|
||||
items.append(item)
|
||||
|
||||
return PaginatedResponse(
|
||||
items=items,
|
||||
total=total,
|
||||
offset=offset,
|
||||
limit=limit,
|
||||
)
|
||||
36
backend/routers/videos.py
Normal file

@@ -0,0 +1,36 @@
"""Source video endpoints for Chrysopedia API."""
|
||||
|
||||
import logging
|
||||
from typing import Annotated
|
||||
|
||||
from fastapi import APIRouter, Depends, Query
|
||||
from sqlalchemy import select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from database import get_session
|
||||
from models import SourceVideo
|
||||
from schemas import SourceVideoRead
|
||||
|
||||
logger = logging.getLogger("chrysopedia.videos")
|
||||
|
||||
router = APIRouter(prefix="/videos", tags=["videos"])
|
||||
|
||||
|
||||
@router.get("", response_model=list[SourceVideoRead])
|
||||
async def list_videos(
|
||||
offset: Annotated[int, Query(ge=0)] = 0,
|
||||
limit: Annotated[int, Query(ge=1, le=100)] = 50,
|
||||
creator_id: str | None = None,
|
||||
db: AsyncSession = Depends(get_session),
|
||||
) -> list[SourceVideoRead]:
|
||||
"""List source videos with optional filtering by creator."""
|
||||
stmt = select(SourceVideo).order_by(SourceVideo.created_at.desc())
|
||||
|
||||
if creator_id:
|
||||
stmt = stmt.where(SourceVideo.creator_id == creator_id)
|
||||
|
||||
stmt = stmt.offset(offset).limit(limit)
|
||||
result = await db.execute(stmt)
|
||||
videos = result.scalars().all()
|
||||
logger.debug("Listed %d videos (offset=%d, limit=%d)", len(videos), offset, limit)
|
||||
return [SourceVideoRead.model_validate(v) for v in videos]
|
||||
459
backend/schemas.py
Normal file

@@ -0,0 +1,459 @@
"""Pydantic schemas for the Chrysopedia API.
|
||||
|
||||
Read-only schemas for list/detail endpoints and input schemas for creation.
|
||||
Each schema mirrors the corresponding SQLAlchemy model in models.py.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import uuid
|
||||
from datetime import datetime
|
||||
|
||||
from pydantic import BaseModel, ConfigDict, Field
|
||||
|
||||
|
||||
# ── Health ───────────────────────────────────────────────────────────────────
|
||||
|
||||
class HealthResponse(BaseModel):
|
||||
status: str = "ok"
|
||||
service: str = "chrysopedia-api"
|
||||
version: str = "0.1.0"
|
||||
database: str = "unknown"
|
||||
|
||||
|
||||
# ── Creator ──────────────────────────────────────────────────────────────────
|
||||
|
||||
class CreatorBase(BaseModel):
|
||||
name: str
|
||||
slug: str
|
||||
genres: list[str] | None = None
|
||||
folder_name: str
|
||||
|
||||
class CreatorCreate(CreatorBase):
|
||||
pass
|
||||
|
||||
class CreatorRead(CreatorBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
view_count: int = 0
|
||||
created_at: datetime
|
||||
updated_at: datetime
|
||||
|
||||
|
||||
class CreatorDetail(CreatorRead):
|
||||
"""Creator with nested video count."""
|
||||
video_count: int = 0
|
||||
|
||||
|
||||
# ── SourceVideo ──────────────────────────────────────────────────────────────
|
||||
|
||||
class SourceVideoBase(BaseModel):
|
||||
filename: str
|
||||
file_path: str
|
||||
duration_seconds: int | None = None
|
||||
content_type: str
|
||||
transcript_path: str | None = None
|
||||
|
||||
class SourceVideoCreate(SourceVideoBase):
|
||||
creator_id: uuid.UUID
|
||||
|
||||
class SourceVideoRead(SourceVideoBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
creator_id: uuid.UUID
|
||||
content_hash: str | None = None
|
||||
processing_status: str = "pending"
|
||||
created_at: datetime
|
||||
updated_at: datetime
|
||||
|
||||
|
||||
# ── TranscriptSegment ────────────────────────────────────────────────────────
|
||||
|
||||
class TranscriptSegmentBase(BaseModel):
|
||||
start_time: float
|
||||
end_time: float
|
||||
text: str
|
||||
segment_index: int
|
||||
topic_label: str | None = None
|
||||
|
||||
class TranscriptSegmentCreate(TranscriptSegmentBase):
|
||||
source_video_id: uuid.UUID
|
||||
|
||||
class TranscriptSegmentRead(TranscriptSegmentBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
source_video_id: uuid.UUID
|
||||
|
||||
|
||||
# ── KeyMoment ────────────────────────────────────────────────────────────────
|
||||
|
||||
class KeyMomentBase(BaseModel):
|
||||
title: str
|
||||
summary: str
|
||||
start_time: float
|
||||
end_time: float
|
||||
content_type: str
|
||||
plugins: list[str] | None = None
|
||||
raw_transcript: str | None = None
|
||||
|
||||
class KeyMomentCreate(KeyMomentBase):
|
||||
source_video_id: uuid.UUID
|
||||
technique_page_id: uuid.UUID | None = None
|
||||
|
||||
class KeyMomentRead(KeyMomentBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
source_video_id: uuid.UUID
|
||||
technique_page_id: uuid.UUID | None = None
|
||||
review_status: str = "pending"
|
||||
created_at: datetime
|
||||
updated_at: datetime
|
||||
|
||||
|
||||
# ── TechniquePage ────────────────────────────────────────────────────────────
|
||||
|
||||
class TechniquePageBase(BaseModel):
|
||||
title: str
|
||||
slug: str
|
||||
topic_category: str
|
||||
topic_tags: list[str] | None = None
|
||||
summary: str | None = None
|
||||
body_sections: dict | None = None
|
||||
signal_chains: list | None = None
|
||||
plugins: list[str] | None = None
|
||||
|
||||
class TechniquePageCreate(TechniquePageBase):
|
||||
creator_id: uuid.UUID
|
||||
source_quality: str | None = None
|
||||
|
||||
class TechniquePageRead(TechniquePageBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
creator_id: uuid.UUID
|
||||
creator_name: str = ""
|
||||
creator_slug: str = ""
|
||||
source_quality: str | None = None
|
||||
view_count: int = 0
|
||||
review_status: str = "draft"
|
||||
created_at: datetime
|
||||
updated_at: datetime
|
||||
|
||||
|
||||
# ── RelatedTechniqueLink ─────────────────────────────────────────────────────
|
||||
|
||||
class RelatedTechniqueLinkBase(BaseModel):
|
||||
source_page_id: uuid.UUID
|
||||
target_page_id: uuid.UUID
|
||||
relationship: str
|
||||
|
||||
class RelatedTechniqueLinkCreate(RelatedTechniqueLinkBase):
|
||||
pass
|
||||
|
||||
class RelatedTechniqueLinkRead(RelatedTechniqueLinkBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
|
||||
|
||||
# ── Tag ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
class TagBase(BaseModel):
|
||||
name: str
|
||||
category: str
|
||||
aliases: list[str] | None = None
|
||||
|
||||
class TagCreate(TagBase):
|
||||
pass
|
||||
|
||||
class TagRead(TagBase):
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
|
||||
|
||||
# ── Transcript Ingestion ─────────────────────────────────────────────────────
|
||||
|
||||
class TranscriptIngestResponse(BaseModel):
|
||||
"""Response returned after successfully ingesting a transcript."""
|
||||
video_id: uuid.UUID
|
||||
creator_id: uuid.UUID
|
||||
creator_name: str
|
||||
filename: str
|
||||
segments_stored: int
|
||||
processing_status: str
|
||||
is_reupload: bool
|
||||
content_hash: str
|
||||
|
||||
|
||||
# ── Pagination wrapper ───────────────────────────────────────────────────────
|
||||
|
||||
class PaginatedResponse(BaseModel):
|
||||
"""Generic paginated list response."""
|
||||
items: list = Field(default_factory=list)
|
||||
total: int = 0
|
||||
offset: int = 0
|
||||
limit: int = 50
|
||||
|
||||
|
||||
# ── Review Queue ─────────────────────────────────────────────────────────────
|
||||
|
||||
class ReviewQueueItem(KeyMomentRead):
|
||||
"""Key moment enriched with source video and creator info for review UI."""
|
||||
video_filename: str
|
||||
creator_name: str
|
||||
|
||||
|
||||
class ReviewQueueResponse(BaseModel):
|
||||
"""Paginated response for the review queue."""
|
||||
items: list[ReviewQueueItem] = Field(default_factory=list)
|
||||
total: int = 0
|
||||
offset: int = 0
|
||||
limit: int = 50
|
||||
|
||||
|
||||
class ReviewStatsResponse(BaseModel):
|
||||
"""Counts of key moments grouped by review status."""
|
||||
pending: int = 0
|
||||
approved: int = 0
|
||||
edited: int = 0
|
||||
rejected: int = 0
|
||||
|
||||
|
||||
class MomentEditRequest(BaseModel):
|
||||
"""Editable fields for a key moment."""
|
||||
title: str | None = None
|
||||
summary: str | None = None
|
||||
start_time: float | None = None
|
||||
end_time: float | None = None
|
||||
content_type: str | None = None
|
||||
plugins: list[str] | None = None
|
||||
|
||||
|
||||
class MomentSplitRequest(BaseModel):
|
||||
"""Request to split a moment at a given timestamp."""
|
||||
split_time: float
|
||||
|
||||
|
||||
class MomentMergeRequest(BaseModel):
|
||||
"""Request to merge two moments."""
|
||||
target_moment_id: uuid.UUID
|
||||
|
||||
|
||||
class ReviewModeResponse(BaseModel):
|
||||
"""Current review mode state."""
|
||||
review_mode: bool
|
||||
|
||||
|
||||
class ReviewModeUpdate(BaseModel):
|
||||
"""Request to update the review mode."""
|
||||
review_mode: bool
|
||||
|
||||
|
||||
# ── Search ───────────────────────────────────────────────────────────────────
|
||||
|
||||
class SearchResultItem(BaseModel):
|
||||
"""A single search result."""
|
||||
title: str
|
||||
slug: str = ""
|
||||
type: str = ""
|
||||
score: float = 0.0
|
||||
summary: str = ""
|
||||
creator_name: str = ""
|
||||
creator_slug: str = ""
|
||||
topic_category: str = ""
|
||||
topic_tags: list[str] = Field(default_factory=list)
|
||||
|
||||
|
||||
class SearchResponse(BaseModel):
|
||||
"""Top-level search response with metadata."""
|
||||
items: list[SearchResultItem] = Field(default_factory=list)
|
||||
total: int = 0
|
||||
query: str = ""
|
||||
fallback_used: bool = False
|
||||
|
||||
|
||||
# ── Technique Page Detail ────────────────────────────────────────────────────
|
||||
|
||||
class KeyMomentSummary(BaseModel):
|
||||
"""Lightweight key moment for technique page detail."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
title: str
|
||||
summary: str
|
||||
start_time: float
|
||||
end_time: float
|
||||
content_type: str
|
||||
plugins: list[str] | None = None
|
||||
source_video_id: uuid.UUID | None = None
|
||||
video_filename: str = ""
|
||||
|
||||
|
||||
class RelatedLinkItem(BaseModel):
|
||||
"""A related technique link with target info."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
target_title: str = ""
|
||||
target_slug: str = ""
|
||||
relationship: str = ""
|
||||
|
||||
|
||||
class CreatorInfo(BaseModel):
|
||||
"""Minimal creator info embedded in technique detail."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
name: str
|
||||
slug: str
|
||||
genres: list[str] | None = None
|
||||
|
||||
|
||||
class TechniquePageDetail(TechniquePageRead):
|
||||
"""Technique page with nested key moments, creator, and related links."""
|
||||
key_moments: list[KeyMomentSummary] = Field(default_factory=list)
|
||||
creator_info: CreatorInfo | None = None
|
||||
related_links: list[RelatedLinkItem] = Field(default_factory=list)
|
||||
version_count: int = 0
|
||||
|
||||
|
||||
# ── Technique Page Versions ──────────────────────────────────────────────────
|
||||
|
||||
class TechniquePageVersionSummary(BaseModel):
|
||||
"""Lightweight version entry for list responses."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
version_number: int
|
||||
created_at: datetime
|
||||
pipeline_metadata: dict | None = None
|
||||
|
||||
|
||||
class TechniquePageVersionDetail(BaseModel):
|
||||
"""Full version snapshot for detail responses."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
version_number: int
|
||||
content_snapshot: dict
|
||||
pipeline_metadata: dict | None = None
|
||||
created_at: datetime
|
||||
|
||||
|
||||
class TechniquePageVersionListResponse(BaseModel):
|
||||
"""Response for version list endpoint."""
|
||||
items: list[TechniquePageVersionSummary] = Field(default_factory=list)
|
||||
total: int = 0
|
||||
|
||||
|
||||
# ── Topics ───────────────────────────────────────────────────────────────────
|
||||
|
||||
class TopicSubTopic(BaseModel):
|
||||
"""A sub-topic with aggregated counts."""
|
||||
name: str
|
||||
technique_count: int = 0
|
||||
creator_count: int = 0
|
||||
|
||||
|
||||
class TopicCategory(BaseModel):
|
||||
"""A top-level topic category with sub-topics."""
|
||||
name: str
|
||||
description: str = ""
|
||||
sub_topics: list[TopicSubTopic] = Field(default_factory=list)
|
||||
|
||||
|
||||
# ── Creator Browse ───────────────────────────────────────────────────────────
|
||||
|
||||
class CreatorBrowseItem(CreatorRead):
|
||||
"""Creator with technique and video counts for browse pages."""
|
||||
technique_count: int = 0
|
||||
video_count: int = 0
|
||||
|
||||
|
||||
# ── Content Reports ──────────────────────────────────────────────────────────
|
||||
|
||||
class ContentReportCreate(BaseModel):
|
||||
"""Public submission: report a content issue."""
|
||||
content_type: str = Field(
|
||||
..., description="Entity type: technique_page, key_moment, creator, general"
|
||||
)
|
||||
content_id: uuid.UUID | None = Field(
|
||||
None, description="ID of the reported entity (null for general reports)"
|
||||
)
|
||||
content_title: str | None = Field(
|
||||
None, description="Title of the reported content (for display context)"
|
||||
)
|
||||
report_type: str = Field(
|
||||
..., description="inaccurate, missing_info, wrong_attribution, formatting, other"
|
||||
)
|
||||
description: str = Field(
|
||||
..., min_length=10, max_length=2000,
|
||||
description="Description of the issue"
|
||||
)
|
||||
page_url: str | None = Field(
|
||||
None, description="URL the user was on when reporting"
|
||||
)
|
||||
|
||||
|
||||
class ContentReportRead(BaseModel):
|
||||
"""Full report for admin views."""
|
||||
model_config = ConfigDict(from_attributes=True)
|
||||
|
||||
id: uuid.UUID
|
||||
content_type: str
|
||||
content_id: uuid.UUID | None = None
|
||||
content_title: str | None = None
|
||||
report_type: str
|
||||
description: str
|
||||
status: str = "open"
|
||||
admin_notes: str | None = None
|
||||
page_url: str | None = None
|
||||
created_at: datetime
|
||||
resolved_at: datetime | None = None
|
||||
|
||||
|
||||
class ContentReportUpdate(BaseModel):
|
||||
"""Admin update: change status and/or add notes."""
|
||||
status: str | None = Field(
|
||||
None, description="open, acknowledged, resolved, dismissed"
|
||||
)
|
||||
admin_notes: str | None = Field(
|
||||
None, max_length=2000, description="Admin notes about resolution"
|
||||
)
|
||||
|
||||
|
||||
class ContentReportListResponse(BaseModel):
|
||||
"""Paginated list of content reports."""
|
||||
items: list[ContentReportRead] = Field(default_factory=list)
|
||||
total: int = 0
|
||||
offset: int = 0
|
||||
limit: int = 50
|
||||
|
||||
|
||||
# ── Pipeline Debug Mode ─────────────────────────────────────────────────────
|
||||
|
||||
class DebugModeResponse(BaseModel):
|
||||
"""Current debug mode status."""
|
||||
debug_mode: bool
|
||||
|
||||
|
||||
class DebugModeUpdate(BaseModel):
|
||||
"""Toggle debug mode on/off."""
|
||||
debug_mode: bool
|
||||
|
||||
|
||||
class TokenStageSummary(BaseModel):
|
||||
"""Per-stage token usage aggregation."""
|
||||
stage: str
|
||||
call_count: int
|
||||
total_prompt_tokens: int
|
||||
total_completion_tokens: int
|
||||
total_tokens: int
|
||||
|
||||
|
||||
class TokenSummaryResponse(BaseModel):
|
||||
"""Token usage summary for a video, broken down by stage."""
|
||||
video_id: str
|
||||
stages: list[TokenStageSummary] = Field(default_factory=list)
|
||||
grand_total_tokens: int
|
||||
362
backend/search_service.py
Normal file

@@ -0,0 +1,362 @@
"""Async search service for the public search endpoint.
|
||||
|
||||
Orchestrates semantic search (embedding + Qdrant) with keyword fallback.
|
||||
All external calls have timeouts and graceful degradation — if embedding
|
||||
or Qdrant fail, the service falls back to keyword-only (ILIKE) search.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import asyncio
|
||||
import logging
|
||||
import time
|
||||
from typing import Any
|
||||
|
||||
import openai
|
||||
from qdrant_client import AsyncQdrantClient
|
||||
from qdrant_client.http import exceptions as qdrant_exceptions
|
||||
from qdrant_client.models import FieldCondition, Filter, MatchValue
|
||||
from sqlalchemy import or_, select
|
||||
from sqlalchemy.ext.asyncio import AsyncSession
|
||||
|
||||
from config import Settings
|
||||
from models import Creator, KeyMoment, SourceVideo, TechniquePage
|
||||
|
||||
logger = logging.getLogger("chrysopedia.search")
|
||||
|
||||
# Timeout for external calls (embedding API, Qdrant) in seconds
|
||||
_EXTERNAL_TIMEOUT = 0.3 # 300ms per plan
|
||||
|
||||
|
||||
class SearchService:
|
||||
"""Async search service with semantic + keyword fallback.
|
||||
|
||||
Parameters
|
||||
----------
|
||||
settings:
|
||||
Application settings containing embedding and Qdrant config.
|
||||
"""
|
||||
|
||||
def __init__(self, settings: Settings) -> None:
|
||||
self.settings = settings
|
||||
self._openai = openai.AsyncOpenAI(
|
||||
base_url=settings.embedding_api_url,
|
||||
api_key=settings.llm_api_key,
|
||||
)
|
||||
self._qdrant = AsyncQdrantClient(url=settings.qdrant_url)
|
||||
self._collection = settings.qdrant_collection
|
||||
|
||||
# ── Embedding ────────────────────────────────────────────────────────
|
||||
|
||||
async def embed_query(self, text: str) -> list[float] | None:
|
||||
"""Embed a query string into a vector.
|
||||
|
||||
Returns None on any failure (timeout, connection, malformed response)
|
||||
so the caller can fall back to keyword search.
|
||||
"""
|
||||
try:
|
||||
response = await asyncio.wait_for(
|
||||
self._openai.embeddings.create(
|
||||
model=self.settings.embedding_model,
|
||||
input=text,
|
||||
),
|
||||
timeout=_EXTERNAL_TIMEOUT,
|
||||
)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning("Embedding API timeout (%.0fms limit) for query: %.50s…", _EXTERNAL_TIMEOUT * 1000, text)
|
||||
return None
|
||||
except (openai.APIConnectionError, openai.APITimeoutError) as exc:
|
||||
logger.warning("Embedding API connection error (%s: %s)", type(exc).__name__, exc)
|
||||
return None
|
||||
except openai.APIError as exc:
|
||||
logger.warning("Embedding API error (%s: %s)", type(exc).__name__, exc)
|
||||
return None
|
||||
|
||||
if not response.data:
|
||||
logger.warning("Embedding API returned empty data for query: %.50s…", text)
|
||||
return None
|
||||
|
||||
vector = response.data[0].embedding
|
||||
if len(vector) != self.settings.embedding_dimensions:
|
||||
logger.warning(
|
||||
"Embedding dimension mismatch: expected %d, got %d",
|
||||
self.settings.embedding_dimensions,
|
||||
len(vector),
|
||||
)
|
||||
return None
|
||||
|
||||
return vector
|
||||
|
||||
# ── Qdrant vector search ─────────────────────────────────────────────
|
||||
|
||||
async def search_qdrant(
|
||||
self,
|
||||
vector: list[float],
|
||||
limit: int = 20,
|
||||
type_filter: str | None = None,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""Search Qdrant for nearest neighbours.
|
||||
|
||||
Returns a list of dicts with 'score' and 'payload' keys.
|
||||
Returns empty list on failure.
|
||||
"""
|
||||
query_filter = None
|
||||
if type_filter:
|
||||
query_filter = Filter(
|
||||
must=[FieldCondition(key="type", match=MatchValue(value=type_filter))]
|
||||
)
|
||||
|
||||
try:
|
||||
results = await asyncio.wait_for(
|
||||
self._qdrant.query_points(
|
||||
collection_name=self._collection,
|
||||
query=vector,
|
||||
query_filter=query_filter,
|
||||
limit=limit,
|
||||
with_payload=True,
|
||||
),
|
||||
timeout=_EXTERNAL_TIMEOUT,
|
||||
)
|
||||
except asyncio.TimeoutError:
|
||||
logger.warning("Qdrant search timeout (%.0fms limit)", _EXTERNAL_TIMEOUT * 1000)
|
||||
return []
|
||||
except qdrant_exceptions.UnexpectedResponse as exc:
|
||||
logger.warning("Qdrant search error: %s", exc)
|
||||
return []
|
||||
except Exception as exc:
|
||||
logger.warning("Qdrant connection error (%s: %s)", type(exc).__name__, exc)
|
||||
return []
|
||||
|
||||
return [
|
||||
{"score": point.score, "payload": point.payload}
|
||||
for point in results.points
|
||||
]
|
||||
|
||||
# ── Keyword fallback ─────────────────────────────────────────────────
|
||||
|
||||
async def keyword_search(
|
||||
self,
|
||||
query: str,
|
||||
scope: str,
|
||||
limit: int,
|
||||
db: AsyncSession,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""ILIKE keyword search across technique pages, key moments, and creators.
|
||||
|
||||
Searches title/name columns. Returns a unified list of result dicts.
|
||||
"""
|
||||
results: list[dict[str, Any]] = []
|
||||
pattern = f"%{query}%"
|
||||
|
||||
if scope in ("all", "topics"):
|
||||
stmt = (
|
||||
select(TechniquePage)
|
||||
.where(
|
||||
or_(
|
||||
TechniquePage.title.ilike(pattern),
|
||||
TechniquePage.summary.ilike(pattern),
|
||||
)
|
||||
)
|
||||
.limit(limit)
|
||||
)
|
||||
rows = await db.execute(stmt)
|
||||
for tp in rows.scalars().all():
|
||||
results.append({
|
||||
"type": "technique_page",
|
||||
"title": tp.title,
|
||||
"slug": tp.slug,
|
||||
"summary": tp.summary or "",
|
||||
"topic_category": tp.topic_category,
|
||||
"topic_tags": tp.topic_tags or [],
|
||||
"creator_id": str(tp.creator_id),
|
||||
"score": 0.0,
|
||||
})
|
||||
|
||||
if scope in ("all",):
|
||||
km_stmt = (
|
||||
select(KeyMoment, SourceVideo, Creator)
|
||||
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
|
||||
.join(Creator, SourceVideo.creator_id == Creator.id)
|
||||
.where(KeyMoment.title.ilike(pattern))
|
||||
.limit(limit)
|
||||
)
|
||||
km_rows = await db.execute(km_stmt)
|
||||
for km, sv, cr in km_rows.all():
|
||||
results.append({
|
||||
"type": "key_moment",
|
||||
"title": km.title,
|
||||
"slug": "",
|
||||
"summary": km.summary or "",
|
||||
"topic_category": "",
|
||||
"topic_tags": [],
|
||||
"creator_id": str(cr.id),
|
||||
"creator_name": cr.name,
|
||||
"creator_slug": cr.slug,
|
||||
"score": 0.0,
|
||||
})
|
||||
|
||||
if scope in ("all", "creators"):
|
||||
cr_stmt = (
|
||||
select(Creator)
|
||||
.where(Creator.name.ilike(pattern))
|
||||
.limit(limit)
|
||||
)
|
||||
cr_rows = await db.execute(cr_stmt)
|
||||
for cr in cr_rows.scalars().all():
|
||||
results.append({
|
||||
"type": "creator",
|
||||
"title": cr.name,
|
||||
"slug": cr.slug,
|
||||
"summary": "",
|
||||
"topic_category": "",
|
||||
"topic_tags": cr.genres or [],
|
||||
"creator_id": str(cr.id),
|
||||
"score": 0.0,
|
||||
})
|
||||
|
||||
# Enrich keyword results with creator names
|
||||
kw_creator_ids = {r["creator_id"] for r in results if r.get("creator_id")}
|
||||
kw_creator_map: dict[str, dict[str, str]] = {}
|
||||
if kw_creator_ids:
|
||||
import uuid as _uuid_mod
|
||||
valid = []
|
||||
for cid in kw_creator_ids:
|
||||
try:
|
||||
valid.append(_uuid_mod.UUID(cid))
|
||||
except (ValueError, AttributeError):
|
||||
pass
|
||||
if valid:
|
||||
cr_stmt = select(Creator).where(Creator.id.in_(valid))
|
||||
cr_result = await db.execute(cr_stmt)
|
||||
for c in cr_result.scalars().all():
|
||||
kw_creator_map[str(c.id)] = {"name": c.name, "slug": c.slug}
|
||||
for r in results:
|
||||
info = kw_creator_map.get(r.get("creator_id", ""), {"name": "", "slug": ""})
|
||||
r["creator_name"] = info["name"]
|
||||
r["creator_slug"] = info["slug"]
|
||||
|
||||
return results[:limit]
|
||||
|
||||
# ── Orchestrator ─────────────────────────────────────────────────────
|
||||
|
||||
async def search(
|
||||
self,
|
||||
query: str,
|
||||
scope: str,
|
||||
limit: int,
|
||||
db: AsyncSession,
|
||||
) -> dict[str, Any]:
|
||||
"""Run semantic search with keyword fallback.
|
||||
|
||||
Returns a dict matching the SearchResponse schema shape.
|
||||
"""
|
||||
start = time.monotonic()
|
||||
|
||||
# Validate / sanitize inputs
|
||||
if not query or not query.strip():
|
||||
return {"items": [], "total": 0, "query": query, "fallback_used": False}
|
||||
|
||||
# Truncate long queries
|
||||
query = query.strip()[:500]
|
||||
|
||||
# Normalize scope
|
||||
if scope not in ("all", "topics", "creators"):
|
||||
scope = "all"
|
||||
|
||||
# Map scope to Qdrant type filter
|
||||
type_filter_map = {
|
||||
"all": None,
|
||||
"topics": "technique_page",
|
||||
"creators": None, # creators aren't in Qdrant
|
||||
}
|
||||
qdrant_type_filter = type_filter_map.get(scope)
|
||||
|
||||
fallback_used = False
|
||||
items: list[dict[str, Any]] = []
|
||||
|
||||
# Try semantic search
|
||||
vector = await self.embed_query(query)
|
||||
if vector is not None:
|
||||
qdrant_results = await self.search_qdrant(vector, limit=limit, type_filter=qdrant_type_filter)
|
||||
if qdrant_results:
|
||||
# Enrich Qdrant results with DB metadata
|
||||
items = await self._enrich_results(qdrant_results, db)
|
||||
|
||||
# Fallback to keyword search if semantic failed or returned nothing
|
||||
if not items:
|
||||
items = await self.keyword_search(query, scope, limit, db)
|
||||
fallback_used = True
|
||||
|
||||
elapsed_ms = (time.monotonic() - start) * 1000
|
||||
|
||||
logger.info(
|
||||
"Search query=%r scope=%s results=%d fallback=%s latency_ms=%.1f",
|
||||
query,
|
||||
scope,
|
||||
len(items),
|
||||
fallback_used,
|
||||
elapsed_ms,
|
||||
)
|
||||
|
||||
return {
|
||||
"items": items,
|
||||
"total": len(items),
|
||||
"query": query,
|
||||
"fallback_used": fallback_used,
|
||||
}
|
||||
|
||||
# ── Result enrichment ────────────────────────────────────────────────
|
||||
|
||||
async def _enrich_results(
|
||||
self,
|
||||
qdrant_results: list[dict[str, Any]],
|
||||
db: AsyncSession,
|
||||
) -> list[dict[str, Any]]:
|
||||
"""Enrich Qdrant results with creator names and slugs from DB."""
|
||||
enriched: list[dict[str, Any]] = []
|
||||
|
||||
# Collect creator_ids to batch-fetch
|
||||
creator_ids = set()
|
||||
for r in qdrant_results:
|
||||
payload = r.get("payload", {})
|
||||
cid = payload.get("creator_id")
|
||||
if cid:
|
||||
creator_ids.add(cid)
|
||||
|
||||
# Batch fetch creators
|
||||
creator_map: dict[str, dict[str, str]] = {}
|
||||
if creator_ids:
|
||||
from sqlalchemy.dialects.postgresql import UUID as PgUUID
|
||||
import uuid as uuid_mod
|
||||
valid_ids = []
|
||||
for cid in creator_ids:
|
||||
try:
|
||||
valid_ids.append(uuid_mod.UUID(cid))
|
||||
except (ValueError, AttributeError):
|
||||
pass
|
||||
|
||||
if valid_ids:
|
||||
stmt = select(Creator).where(Creator.id.in_(valid_ids))
|
||||
result = await db.execute(stmt)
|
||||
for c in result.scalars().all():
|
||||
creator_map[str(c.id)] = {"name": c.name, "slug": c.slug}
|
||||
|
||||
for r in qdrant_results:
|
||||
payload = r.get("payload", {})
|
||||
cid = payload.get("creator_id", "")
|
||||
creator_info = creator_map.get(cid, {"name": "", "slug": ""})
|
||||
|
||||
enriched.append({
|
||||
"type": payload.get("type", ""),
|
||||
"title": payload.get("title", ""),
|
||||
"slug": payload.get("slug", payload.get("title", "").lower().replace(" ", "-")),
|
||||
"summary": payload.get("summary", ""),
|
||||
"topic_category": payload.get("topic_category", ""),
|
||||
"topic_tags": payload.get("topic_tags", []),
|
||||
"creator_id": cid,
|
||||
"creator_name": creator_info["name"],
|
||||
"creator_slug": creator_info["slug"],
|
||||
"score": r.get("score", 0.0),
|
||||
})
|
||||
|
||||
return enriched
|
||||
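The orchestration in `search` above follows a semantic-first, keyword-fallback pattern: an empty hit-set and an embedding failure both route to the keyword path. A minimal standalone sketch of that control flow (the callables `embed`, `vector_search`, and `keyword_search` are illustrative stand-ins, not the real `SearchService` API):

```python
from typing import Any, Callable, Optional


def search_with_fallback(
    query: str,
    embed: Callable[[str], Optional[list[float]]],
    vector_search: Callable[[list[float]], list[dict[str, Any]]],
    keyword_search: Callable[[str], list[dict[str, Any]]],
) -> dict[str, Any]:
    """Semantic-first search; keywords cover both failure and empty results."""
    query = query.strip()[:500]
    if not query:
        return {"items": [], "fallback_used": False}

    items: list[dict[str, Any]] = []
    vector = embed(query)              # None signals embedding failure
    if vector is not None:
        items = vector_search(vector)  # may legitimately return []

    fallback_used = False
    if not items:                      # failure OR empty hit-set triggers fallback
        items = keyword_search(query)
        fallback_used = True
    return {"items": items, "fallback_used": fallback_used}


# Embedding outage → keyword path is taken
out = search_with_fallback(
    "gain staging",
    embed=lambda q: None,
    vector_search=lambda v: [],
    keyword_search=lambda q: [{"title": "Gain Staging"}],
)
assert out["fallback_used"] is True
assert out["items"][0]["title"] == "Gain Staging"
```

One consequence of this design worth noting: a semantically empty (but successful) vector search still falls back, so `fallback_used` alone does not distinguish an embedding outage from a sparse index.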
0	backend/tests/__init__.py	Normal file
192	backend/tests/conftest.py	Normal file

@@ -0,0 +1,192 @@
"""Shared fixtures for Chrysopedia integration tests.
|
||||
|
||||
Provides:
|
||||
- Async SQLAlchemy engine/session against a real PostgreSQL test database
|
||||
- Sync SQLAlchemy engine/session for pipeline stage tests (Celery stages are sync)
|
||||
- httpx.AsyncClient wired to the FastAPI app with dependency overrides
|
||||
- Pre-ingest fixture for pipeline tests
|
||||
- Sample transcript fixture path and temporary storage directory
|
||||
|
||||
Key design choice: function-scoped engine with NullPool avoids asyncpg
|
||||
"another operation in progress" errors caused by session-scoped connection
|
||||
reuse between the ASGI test client and verification queries.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import pathlib
|
||||
import uuid
|
||||
|
||||
import pytest
|
||||
import pytest_asyncio
|
||||
from httpx import ASGITransport, AsyncClient
|
||||
from sqlalchemy import create_engine
|
||||
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine
|
||||
from sqlalchemy.orm import Session, sessionmaker
|
||||
from sqlalchemy.pool import NullPool
|
||||
|
||||
# Ensure backend/ is on sys.path so "from models import ..." works
|
||||
import sys
|
||||
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parent.parent))
|
||||
|
||||
from database import Base, get_session # noqa: E402
|
||||
from main import app # noqa: E402
|
||||
from models import ( # noqa: E402
|
||||
ContentType,
|
||||
Creator,
|
||||
ProcessingStatus,
|
||||
SourceVideo,
|
||||
TranscriptSegment,
|
||||
)
|
||||
|
||||
TEST_DATABASE_URL = os.getenv(
|
||||
"TEST_DATABASE_URL",
|
||||
"postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia_test",
|
||||
)
|
||||
|
||||
TEST_DATABASE_URL_SYNC = TEST_DATABASE_URL.replace(
|
||||
"postgresql+asyncpg://", "postgresql+psycopg2://"
|
||||
)
|
||||
|
||||
|
||||
@pytest_asyncio.fixture()
|
||||
async def db_engine():
|
||||
"""Create a per-test async engine (NullPool) and create/drop all tables."""
|
||||
engine = create_async_engine(TEST_DATABASE_URL, echo=False, poolclass=NullPool)
|
||||
|
||||
# Create all tables fresh for each test
|
||||
async with engine.begin() as conn:
|
||||
await conn.run_sync(Base.metadata.drop_all)
|
||||
await conn.run_sync(Base.metadata.create_all)
|
||||
|
||||
yield engine
|
||||
|
||||
# Drop all tables after test
|
||||
async with engine.begin() as conn:
|
||||
await conn.run_sync(Base.metadata.drop_all)
|
||||
|
||||
await engine.dispose()
|
||||
|
||||
|
||||
@pytest_asyncio.fixture()
|
||||
async def client(db_engine, tmp_path):
|
||||
"""Async HTTP test client wired to FastAPI with dependency overrides."""
|
||||
session_factory = async_sessionmaker(
|
||||
db_engine, class_=AsyncSession, expire_on_commit=False
|
||||
)
|
||||
|
||||
async def _override_get_session():
|
||||
async with session_factory() as session:
|
||||
yield session
|
||||
|
||||
# Override DB session dependency
|
||||
app.dependency_overrides[get_session] = _override_get_session
|
||||
|
||||
# Override transcript_storage_path via environment variable
|
||||
os.environ["TRANSCRIPT_STORAGE_PATH"] = str(tmp_path)
|
||||
# Clear the lru_cache so Settings picks up the new env var
|
||||
from config import get_settings
|
||||
get_settings.cache_clear()
|
||||
|
||||
transport = ASGITransport(app=app)
|
||||
async with AsyncClient(transport=transport, base_url="http://testserver") as ac:
|
||||
yield ac
|
||||
|
||||
# Teardown: clean overrides and restore settings cache
|
||||
app.dependency_overrides.clear()
|
||||
os.environ.pop("TRANSCRIPT_STORAGE_PATH", None)
|
||||
get_settings.cache_clear()
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def sample_transcript_path() -> pathlib.Path:
|
||||
"""Path to the sample 5-segment transcript JSON fixture."""
|
||||
return pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def tmp_transcript_dir(tmp_path) -> pathlib.Path:
|
||||
"""Temporary directory for transcript storage during tests."""
|
||||
return tmp_path
|
||||
|
||||
|
||||
# ── Sync engine/session for pipeline stages ──────────────────────────────────
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def sync_engine(db_engine):
|
||||
"""Create a sync SQLAlchemy engine pointing at the test database.
|
||||
|
||||
Tables are already created/dropped by the async ``db_engine`` fixture,
|
||||
so this fixture just wraps a sync engine around the same DB URL.
|
||||
"""
|
||||
engine = create_engine(TEST_DATABASE_URL_SYNC, echo=False, poolclass=NullPool)
|
||||
yield engine
|
||||
engine.dispose()
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def sync_session(sync_engine) -> Session:
|
||||
"""Create a sync SQLAlchemy session for pipeline stage tests."""
|
||||
factory = sessionmaker(bind=sync_engine)
|
||||
session = factory()
|
||||
yield session
|
||||
session.close()
|
||||
|
||||
|
||||
# ── Pre-ingest fixture for pipeline tests ────────────────────────────────────
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def pre_ingested_video(sync_engine):
|
||||
"""Ingest the sample transcript directly into the test DB via sync ORM.
|
||||
|
||||
Returns a dict with ``video_id``, ``creator_id``, and ``segment_count``.
|
||||
"""
|
||||
factory = sessionmaker(bind=sync_engine)
|
||||
session = factory()
|
||||
try:
|
||||
# Create creator
|
||||
creator = Creator(
|
||||
name="Skope",
|
||||
slug="skope",
|
||||
folder_name="Skope",
|
||||
)
|
||||
session.add(creator)
|
||||
session.flush()
|
||||
|
||||
# Create video
|
||||
video = SourceVideo(
|
||||
creator_id=creator.id,
|
||||
filename="mixing-basics-ep1.mp4",
|
||||
file_path="Skope/mixing-basics-ep1.mp4",
|
||||
duration_seconds=1234,
|
||||
content_type=ContentType.tutorial,
|
||||
processing_status=ProcessingStatus.transcribed,
|
||||
)
|
||||
session.add(video)
|
||||
session.flush()
|
||||
|
||||
# Create transcript segments
|
||||
sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"
|
||||
data = json.loads(sample.read_text())
|
||||
for idx, seg in enumerate(data["segments"]):
|
||||
session.add(TranscriptSegment(
|
||||
source_video_id=video.id,
|
||||
start_time=float(seg["start"]),
|
||||
end_time=float(seg["end"]),
|
||||
text=str(seg["text"]),
|
||||
segment_index=idx,
|
||||
))
|
||||
|
||||
session.commit()
|
||||
|
||||
result = {
|
||||
"video_id": str(video.id),
|
||||
"creator_id": str(creator.id),
|
||||
"segment_count": len(data["segments"]),
|
||||
}
|
||||
finally:
|
||||
session.close()
|
||||
|
||||
return result
|
||||
111	backend/tests/fixtures/mock_llm_responses.py	vendored	Normal file

@@ -0,0 +1,111 @@
"""Mock LLM and embedding responses for pipeline integration tests.
|
||||
|
||||
Each response is a JSON string matching the Pydantic schema for that stage.
|
||||
The sample transcript has 5 segments about gain staging, so mock responses
|
||||
reflect that content.
|
||||
"""
|
||||
|
||||
import json
|
||||
import random
|
||||
|
||||
# ── Stage 2: Segmentation ───────────────────────────────────────────────────
|
||||
|
||||
STAGE2_SEGMENTATION_RESPONSE = json.dumps({
|
||||
"segments": [
|
||||
{
|
||||
"start_index": 0,
|
||||
"end_index": 1,
|
||||
"topic_label": "Introduction",
|
||||
"summary": "Introduces the episode about mixing basics and gain staging.",
|
||||
},
|
||||
{
|
||||
"start_index": 2,
|
||||
"end_index": 4,
|
||||
"topic_label": "Gain Staging Technique",
|
||||
"summary": "Covers practical steps for gain staging including setting levels and avoiding clipping.",
|
||||
},
|
||||
]
|
||||
})
|
||||
|
||||
# ── Stage 3: Extraction ─────────────────────────────────────────────────────
|
||||
|
||||
STAGE3_EXTRACTION_RESPONSE = json.dumps({
|
||||
"moments": [
|
||||
{
|
||||
"title": "Setting Levels for Gain Staging",
|
||||
"summary": "Demonstrates the process of setting proper gain levels across the signal chain to maintain headroom.",
|
||||
"start_time": 12.8,
|
||||
"end_time": 28.5,
|
||||
"content_type": "technique",
|
||||
"plugins": ["Pro-Q 3"],
|
||||
"raw_transcript": "First thing you want to do is set your levels. Make sure nothing is clipping on the master bus.",
|
||||
},
|
||||
{
|
||||
"title": "Master Bus Clipping Prevention",
|
||||
"summary": "Explains how to monitor and prevent clipping on the master bus during a mix session.",
|
||||
"start_time": 20.1,
|
||||
"end_time": 35.0,
|
||||
"content_type": "settings",
|
||||
"plugins": [],
|
||||
"raw_transcript": "Make sure nothing is clipping on the master bus. That wraps up this quick overview.",
|
||||
},
|
||||
]
|
||||
})
|
||||
|
||||
# ── Stage 4: Classification ─────────────────────────────────────────────────
|
||||
|
||||
STAGE4_CLASSIFICATION_RESPONSE = json.dumps({
|
||||
"classifications": [
|
||||
{
|
||||
"moment_index": 0,
|
||||
"topic_category": "Mixing",
|
||||
"topic_tags": ["gain staging", "eq"],
|
||||
"content_type_override": None,
|
||||
},
|
||||
{
|
||||
"moment_index": 1,
|
||||
"topic_category": "Mixing",
|
||||
"topic_tags": ["gain staging", "bus processing"],
|
||||
"content_type_override": None,
|
||||
},
|
||||
]
|
||||
})
|
||||
|
||||
# ── Stage 5: Synthesis ───────────────────────────────────────────────────────
|
||||
|
||||
STAGE5_SYNTHESIS_RESPONSE = json.dumps({
|
||||
"pages": [
|
||||
{
|
||||
"title": "Gain Staging in Mixing",
|
||||
"slug": "gain-staging-in-mixing",
|
||||
"topic_category": "Mixing",
|
||||
"topic_tags": ["gain staging"],
|
||||
"summary": "A comprehensive guide to gain staging in a mixing context, covering level setting and master bus management.",
|
||||
"body_sections": {
|
||||
"Overview": "Gain staging ensures each stage of the signal chain operates at optimal levels.",
|
||||
"Steps": "1. Set input levels. 2. Check bus levels. 3. Monitor master output.",
|
||||
},
|
||||
"signal_chains": [
|
||||
{"chain": "Input -> Channel Strip -> Bus -> Master", "notes": "Keep headroom at each stage."}
|
||||
],
|
||||
"plugins": ["Pro-Q 3"],
|
||||
"source_quality": "structured",
|
||||
}
|
||||
]
|
||||
})
|
||||
|
||||
# ── Embedding response ───────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def make_mock_embedding(dim: int = 768) -> list[float]:
|
||||
"""Generate a deterministic-seeded mock embedding vector."""
|
||||
rng = random.Random(42)
|
||||
return [rng.uniform(-1, 1) for _ in range(dim)]
|
||||
|
||||
|
||||
def make_mock_embeddings(n: int, dim: int = 768) -> list[list[float]]:
|
||||
"""Generate n distinct mock embedding vectors."""
|
||||
return [
|
||||
[random.Random(42 + i).uniform(-1, 1) for _ in range(dim)]
|
||||
for i in range(n)
|
||||
]
|
||||
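The embedding helpers above rely on seeding a fresh `random.Random` per call: the same seed always reproduces the same vector, and varying the seed per row yields distinct vectors. A quick standalone check of that property (re-implemented here with a small `dim` so it runs without the fixtures module):

```python
import random


def mock_embedding(dim: int = 8, seed: int = 42) -> list[float]:
    # Fresh RNG per call → same seed yields the identical vector every time
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(dim)]


a = mock_embedding()
b = mock_embedding()
c = mock_embedding(seed=43)  # different seed, as make_mock_embeddings does per row

assert a == b                                   # deterministic across calls
assert a != c                                   # per-seed variation
assert all(-1.0 <= x <= 1.0 for x in a)         # values stay in the uniform range
```

Determinism matters here because pipeline tests may compare vectors written to Qdrant mocks across separate calls; a module-level shared RNG would break that.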
12	backend/tests/fixtures/sample_transcript.json	vendored	Normal file

@@ -0,0 +1,12 @@
{
    "source_file": "mixing-basics-ep1.mp4",
    "creator_folder": "Skope",
    "duration_seconds": 1234,
    "segments": [
        {"start": 0.0, "end": 5.2, "text": "Welcome to mixing basics episode one."},
        {"start": 5.2, "end": 12.8, "text": "Today we are going to talk about gain staging."},
        {"start": 12.8, "end": 20.1, "text": "First thing you want to do is set your levels."},
        {"start": 20.1, "end": 28.5, "text": "Make sure nothing is clipping on the master bus."},
        {"start": 28.5, "end": 35.0, "text": "That wraps up this quick overview of gain staging."}
    ]
}
179	backend/tests/test_ingest.py	Normal file

@@ -0,0 +1,179 @@
"""Integration tests for the transcript ingest endpoint.
|
||||
|
||||
Tests run against a real PostgreSQL database via httpx.AsyncClient
|
||||
on the FastAPI ASGI app. Each test gets a clean database state via
|
||||
TRUNCATE in the client fixture (conftest.py).
|
||||
"""
|
||||
|
||||
import json
|
||||
import pathlib
|
||||
|
||||
import pytest
|
||||
from httpx import AsyncClient
|
||||
from sqlalchemy import func, select, text
|
||||
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
|
||||
|
||||
from models import Creator, SourceVideo, TranscriptSegment
|
||||
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
INGEST_URL = "/api/v1/ingest"
|
||||
|
||||
|
||||
def _upload_file(path: pathlib.Path):
|
||||
"""Return a dict suitable for httpx multipart file upload."""
|
||||
return {"file": (path.name, path.read_bytes(), "application/json")}
|
||||
|
||||
|
||||
async def _query_db(db_engine, stmt):
|
||||
"""Run a read query in its own session to avoid connection contention."""
|
||||
session_factory = async_sessionmaker(
|
||||
db_engine, class_=AsyncSession, expire_on_commit=False
|
||||
)
|
||||
async with session_factory() as session:
|
||||
result = await session.execute(stmt)
|
||||
return result
|
||||
|
||||
|
||||
async def _count_rows(db_engine, model):
|
||||
"""Count rows in a table via a fresh session."""
|
||||
result = await _query_db(db_engine, select(func.count(model.id)))
|
||||
return result.scalar_one()
|
||||
|
||||
|
||||
# ── Happy-path tests ────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
async def test_ingest_creates_creator_and_video(client, sample_transcript_path, db_engine):
|
||||
"""POST a valid transcript → 200 with creator, video, and 5 segments created."""
|
||||
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
|
||||
assert resp.status_code == 200, f"Expected 200, got {resp.status_code}: {resp.text}"
|
||||
|
||||
data = resp.json()
|
||||
assert "video_id" in data
|
||||
assert "creator_id" in data
|
||||
assert data["segments_stored"] == 5
|
||||
assert data["creator_name"] == "Skope"
|
||||
assert data["is_reupload"] is False
|
||||
|
||||
# Verify DB state via a fresh session
|
||||
session_factory = async_sessionmaker(db_engine, class_=AsyncSession, expire_on_commit=False)
|
||||
async with session_factory() as session:
|
||||
# Creator exists with correct folder_name and slug
|
||||
result = await session.execute(
|
||||
select(Creator).where(Creator.folder_name == "Skope")
|
||||
)
|
||||
creator = result.scalar_one()
|
||||
assert creator.slug == "skope"
|
||||
assert creator.name == "Skope"
|
||||
|
||||
# SourceVideo exists with correct status
|
||||
result = await session.execute(
|
||||
select(SourceVideo).where(SourceVideo.creator_id == creator.id)
|
||||
)
|
||||
video = result.scalar_one()
|
||||
assert video.processing_status.value == "transcribed"
|
||||
assert video.filename == "mixing-basics-ep1.mp4"
|
||||
|
||||
# 5 TranscriptSegment rows with sequential indices
|
||||
result = await session.execute(
|
||||
select(TranscriptSegment)
|
||||
.where(TranscriptSegment.source_video_id == video.id)
|
||||
.order_by(TranscriptSegment.segment_index)
|
||||
)
|
||||
segments = result.scalars().all()
|
||||
assert len(segments) == 5
|
||||
assert [s.segment_index for s in segments] == [0, 1, 2, 3, 4]
|
||||
|
||||
|
||||
async def test_ingest_reuses_existing_creator(client, sample_transcript_path, db_engine):
|
||||
"""If a Creator with the same folder_name already exists, reuse it."""
|
||||
session_factory = async_sessionmaker(db_engine, class_=AsyncSession, expire_on_commit=False)
|
||||
|
||||
# Pre-create a Creator with folder_name='Skope' in a separate session
|
||||
async with session_factory() as session:
|
||||
existing = Creator(name="Skope", slug="skope", folder_name="Skope")
|
||||
session.add(existing)
|
||||
await session.commit()
|
||||
await session.refresh(existing)
|
||||
existing_id = existing.id
|
||||
|
||||
# POST transcript — should reuse the creator
|
||||
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
|
||||
assert resp.status_code == 200
|
||||
data = resp.json()
|
||||
assert data["creator_id"] == str(existing_id)
|
||||
|
||||
# Verify only 1 Creator row in DB
|
||||
count = await _count_rows(db_engine, Creator)
|
||||
assert count == 1, f"Expected 1 creator, got {count}"
|
||||
|
||||
|
||||
async def test_ingest_idempotent_reupload(client, sample_transcript_path, db_engine):
|
||||
"""Uploading the same transcript twice is idempotent: same video, no duplicate segments."""
|
||||
# First upload
|
||||
resp1 = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
|
||||
assert resp1.status_code == 200
|
||||
data1 = resp1.json()
|
||||
assert data1["is_reupload"] is False
|
||||
video_id = data1["video_id"]
|
||||
|
||||
# Second upload (same file)
|
||||
resp2 = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
|
||||
assert resp2.status_code == 200
|
||||
data2 = resp2.json()
|
||||
assert data2["is_reupload"] is True
|
||||
assert data2["video_id"] == video_id
|
||||
|
||||
# Verify DB: still only 1 SourceVideo and 5 segments (not 10)
|
||||
video_count = await _count_rows(db_engine, SourceVideo)
|
||||
assert video_count == 1, f"Expected 1 video, got {video_count}"
|
||||
|
||||
seg_count = await _count_rows(db_engine, TranscriptSegment)
|
||||
assert seg_count == 5, f"Expected 5 segments, got {seg_count}"
|
||||
|
||||
|
||||
async def test_ingest_saves_json_to_disk(client, sample_transcript_path, tmp_path):
|
||||
"""Ingested transcript raw JSON is persisted to the filesystem."""
|
||||
resp = await client.post(INGEST_URL, files=_upload_file(sample_transcript_path))
|
||||
assert resp.status_code == 200
|
||||
|
||||
# The ingest endpoint saves to {transcript_storage_path}/{creator_folder}/{source_file}.json
|
||||
expected_path = tmp_path / "Skope" / "mixing-basics-ep1.mp4.json"
|
||||
assert expected_path.exists(), f"Expected file at {expected_path}"
|
||||
|
||||
# Verify the saved JSON is valid and matches the source
|
||||
saved = json.loads(expected_path.read_text())
|
||||
source = json.loads(sample_transcript_path.read_text())
|
||||
assert saved == source
|
||||
|
||||
|
||||
# ── Error tests ──────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
async def test_ingest_rejects_invalid_json(client, tmp_path):
|
||||
"""Uploading a non-JSON file returns 422."""
|
||||
bad_file = tmp_path / "bad.json"
|
||||
bad_file.write_text("this is not valid json {{{")
|
||||
|
||||
resp = await client.post(
|
||||
INGEST_URL,
|
||||
files={"file": ("bad.json", bad_file.read_bytes(), "application/json")},
|
||||
)
|
||||
assert resp.status_code == 422, f"Expected 422, got {resp.status_code}: {resp.text}"
|
||||
assert "JSON parse error" in resp.json()["detail"]
|
||||
|
||||
|
||||
async def test_ingest_rejects_missing_fields(client, tmp_path):
|
||||
"""Uploading JSON without required fields returns 422."""
|
||||
incomplete = tmp_path / "incomplete.json"
|
||||
# Missing creator_folder and segments
|
||||
incomplete.write_text(json.dumps({"source_file": "test.mp4", "duration_seconds": 100}))
|
||||
|
||||
resp = await client.post(
|
||||
INGEST_URL,
|
||||
files={"file": ("incomplete.json", incomplete.read_bytes(), "application/json")},
|
||||
)
|
||||
assert resp.status_code == 422, f"Expected 422, got {resp.status_code}: {resp.text}"
|
||||
assert "Missing required keys" in resp.json()["detail"]
|
||||
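The reupload test above pins down an upsert contract: ingest keyed on the (creator folder, source filename) pair must return the same video and leave segment counts unchanged. A minimal in-memory sketch of that contract (the `ingest` helper and its store shape are hypothetical, not the real endpoint):

```python
def ingest(store: dict, creator: str, filename: str, segments: list[str]) -> dict:
    """Idempotent ingest sketch: (creator, filename) is the natural key."""
    key = (creator, filename)
    if key in store:
        # Re-upload: report the existing video, do not duplicate segments
        return {"video_id": store[key]["video_id"], "is_reupload": True}
    store[key] = {"video_id": f"vid-{len(store) + 1}", "segments": list(segments)}
    return {"video_id": store[key]["video_id"], "is_reupload": False}


db: dict = {}
first = ingest(db, "Skope", "mixing-basics-ep1.mp4", ["s1", "s2"])
second = ingest(db, "Skope", "mixing-basics-ep1.mp4", ["s1", "s2"])

assert first["is_reupload"] is False
assert second["is_reupload"] is True
assert second["video_id"] == first["video_id"]
assert len(db[("Skope", "mixing-basics-ep1.mp4")]["segments"]) == 2  # no duplicates
```

The real endpoint additionally persists the raw JSON to disk and reuses existing `Creator` rows, but the keying invariant the tests assert is the same.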
773	backend/tests/test_pipeline.py	Normal file

@@ -0,0 +1,773 @@
"""Integration tests for the LLM extraction pipeline.
|
||||
|
||||
Tests run against a real PostgreSQL test database with mocked LLM and Qdrant
|
||||
clients. Pipeline stages are sync (Celery tasks), so tests call stage
|
||||
functions directly with sync SQLAlchemy sessions.
|
||||
|
||||
Tests (a)–(f) call pipeline stages directly. Tests (g)–(i) use the async
|
||||
HTTP client. Test (j) verifies LLM fallback logic.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
import pathlib
|
||||
import uuid
|
||||
from unittest.mock import MagicMock, patch, PropertyMock
|
||||
|
||||
import openai
|
||||
import pytest
|
||||
from sqlalchemy import create_engine, select
|
||||
from sqlalchemy.orm import Session, sessionmaker
|
||||
from sqlalchemy.pool import NullPool
|
||||
|
||||
from models import (
|
||||
Creator,
|
||||
KeyMoment,
|
||||
KeyMomentContentType,
|
||||
ProcessingStatus,
|
||||
SourceVideo,
|
||||
TechniquePage,
|
||||
TranscriptSegment,
|
||||
)
|
||||
from pipeline.schemas import (
|
||||
ClassificationResult,
|
||||
ExtractionResult,
|
||||
SegmentationResult,
|
||||
SynthesisResult,
|
||||
)
|
||||
|
||||
from tests.fixtures.mock_llm_responses import (
|
||||
STAGE2_SEGMENTATION_RESPONSE,
|
||||
STAGE3_EXTRACTION_RESPONSE,
|
||||
STAGE4_CLASSIFICATION_RESPONSE,
|
||||
STAGE5_SYNTHESIS_RESPONSE,
|
||||
make_mock_embeddings,
|
||||
)
|
||||
|
||||
# ── Test database URL ────────────────────────────────────────────────────────
|
||||
|
||||
TEST_DATABASE_URL_SYNC = os.getenv(
|
||||
"TEST_DATABASE_URL",
|
||||
"postgresql+asyncpg://chrysopedia:changeme@localhost:5433/chrysopedia_test",
|
||||
).replace("postgresql+asyncpg://", "postgresql+psycopg2://")
|
||||
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _make_mock_openai_response(content: str):
|
||||
"""Build a mock OpenAI ChatCompletion response object."""
|
||||
mock_message = MagicMock()
|
||||
mock_message.content = content
|
||||
|
||||
mock_choice = MagicMock()
|
||||
mock_choice.message = mock_message
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.choices = [mock_choice]
|
||||
return mock_response
|
||||
|
||||
|
||||
def _make_mock_embedding_response(vectors: list[list[float]]):
|
||||
"""Build a mock OpenAI Embedding response object."""
|
||||
mock_items = []
|
||||
for i, vec in enumerate(vectors):
|
||||
item = MagicMock()
|
||||
item.embedding = vec
|
||||
item.index = i
|
||||
mock_items.append(item)
|
||||
|
||||
mock_response = MagicMock()
|
||||
mock_response.data = mock_items
|
||||
return mock_response
|
||||
|
||||
|
||||
def _patch_pipeline_engine(sync_engine):
|
||||
"""Patch the pipeline.stages module to use the test sync engine/session."""
|
||||
return [
|
||||
patch("pipeline.stages._engine", sync_engine),
|
||||
patch(
|
||||
"pipeline.stages._SessionLocal",
|
||||
sessionmaker(bind=sync_engine),
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
def _patch_llm_completions(side_effect_fn):
|
||||
"""Patch openai.OpenAI so all instances share a mocked chat.completions.create."""
|
||||
mock_client = MagicMock()
|
||||
mock_client.chat.completions.create.side_effect = side_effect_fn
|
||||
return patch("openai.OpenAI", return_value=mock_client)
|
||||
|
||||
|
||||
def _create_canonical_tags_file(tmp_path: pathlib.Path) -> pathlib.Path:
|
||||
"""Write a minimal canonical_tags.yaml for stage4 to load."""
|
||||
config_dir = tmp_path / "config"
|
||||
config_dir.mkdir(exist_ok=True)
|
||||
tags_path = config_dir / "canonical_tags.yaml"
|
||||
tags_path.write_text(
|
||||
"categories:\n"
|
||||
" - name: Mixing\n"
|
||||
" description: Balancing and processing elements\n"
|
||||
" sub_topics: [eq, compression, gain staging, bus processing]\n"
|
||||
" - name: Sound design\n"
|
||||
" description: Creating sounds\n"
|
||||
" sub_topics: [bass, drums]\n"
|
||||
)
|
||||
return tags_path
|
||||
|
||||
|
||||
# ── (a) Stage 2: Segmentation ───────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_stage2_segmentation_updates_topic_labels(
|
||||
db_engine, sync_engine, pre_ingested_video, tmp_path
|
||||
):
|
||||
"""Stage 2 should update topic_label on each TranscriptSegment."""
|
||||
video_id = pre_ingested_video["video_id"]
|
||||
|
||||
# Create prompts directory
|
||||
prompts_dir = tmp_path / "prompts"
|
||||
prompts_dir.mkdir()
|
||||
(prompts_dir / "stage2_segmentation.txt").write_text("You are a segmentation assistant.")
|
||||
|
||||
# Build the mock LLM that returns the segmentation response
|
||||
def llm_side_effect(**kwargs):
|
||||
return _make_mock_openai_response(STAGE2_SEGMENTATION_RESPONSE)
|
||||
|
||||
patches = _patch_pipeline_engine(sync_engine)
|
||||
for p in patches:
|
||||
p.start()
|
||||
|
||||
with _patch_llm_completions(llm_side_effect), \
|
||||
patch("pipeline.stages.get_settings") as mock_settings:
|
||||
s = MagicMock()
|
||||
s.prompts_path = str(prompts_dir)
|
||||
s.llm_api_url = "http://mock:11434/v1"
|
||||
s.llm_api_key = "sk-test"
|
||||
s.llm_model = "test-model"
|
||||
s.llm_fallback_url = "http://mock:11434/v1"
|
||||
s.llm_fallback_model = "test-model"
|
||||
s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
|
||||
mock_settings.return_value = s
|
||||
|
||||
# Import and call stage directly (not via Celery)
|
||||
from pipeline.stages import stage2_segmentation
|
||||
|
||||
result = stage2_segmentation(video_id)
|
||||
assert result == video_id
|
||||
|
||||
for p in patches:
|
||||
p.stop()
|
||||
|
||||
# Verify: check topic_label on segments
|
||||
factory = sessionmaker(bind=sync_engine)
|
||||
session = factory()
|
||||
try:
|
||||
segments = (
|
||||
session.execute(
|
||||
select(TranscriptSegment)
|
||||
.where(TranscriptSegment.source_video_id == video_id)
|
||||
.order_by(TranscriptSegment.segment_index)
|
||||
)
|
||||
.scalars()
|
||||
.all()
|
||||
)
|
||||
# Segments 0,1 should have "Introduction", segments 2,3,4 should have "Gain Staging Technique"
|
||||
assert segments[0].topic_label == "Introduction"
|
||||
assert segments[1].topic_label == "Introduction"
|
||||
assert segments[2].topic_label == "Gain Staging Technique"
|
||||
assert segments[3].topic_label == "Gain Staging Technique"
|
||||
assert segments[4].topic_label == "Gain Staging Technique"
|
||||
finally:
|
||||
session.close()
|
||||
|
||||
|
||||
# ── (b) Stage 3: Extraction ─────────────────────────────────────────────────


def test_stage3_extraction_creates_key_moments(
    db_engine, sync_engine, pre_ingested_video, tmp_path
):
    """Stages 2+3 should create KeyMoment rows and set processing_status=extracted."""
    video_id = pre_ingested_video["video_id"]

    prompts_dir = tmp_path / "prompts"
    prompts_dir.mkdir()
    (prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
    (prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")

    call_count = {"n": 0}
    responses = [STAGE2_SEGMENTATION_RESPONSE, STAGE3_EXTRACTION_RESPONSE, STAGE3_EXTRACTION_RESPONSE]

    def llm_side_effect(**kwargs):
        idx = min(call_count["n"], len(responses) - 1)
        resp = responses[idx]
        call_count["n"] += 1
        return _make_mock_openai_response(resp)

    patches = _patch_pipeline_engine(sync_engine)
    for p in patches:
        p.start()

    with _patch_llm_completions(llm_side_effect), \
            patch("pipeline.stages.get_settings") as mock_settings:
        s = MagicMock()
        s.prompts_path = str(prompts_dir)
        s.llm_api_url = "http://mock:11434/v1"
        s.llm_api_key = "sk-test"
        s.llm_model = "test-model"
        s.llm_fallback_url = "http://mock:11434/v1"
        s.llm_fallback_model = "test-model"
        s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
        mock_settings.return_value = s

        from pipeline.stages import stage2_segmentation, stage3_extraction

        stage2_segmentation(video_id)
        stage3_extraction(video_id)

    for p in patches:
        p.stop()

    # Verify key moments created
    factory = sessionmaker(bind=sync_engine)
    session = factory()
    try:
        moments = (
            session.execute(
                select(KeyMoment)
                .where(KeyMoment.source_video_id == video_id)
                .order_by(KeyMoment.start_time)
            )
            .scalars()
            .all()
        )
        # Two topic groups → extraction called twice → up to 4 moments
        # (2 per group from the mock response)
        assert len(moments) >= 2
        assert moments[0].title == "Setting Levels for Gain Staging"
        assert moments[0].content_type == KeyMomentContentType.technique

        # Verify processing_status
        video = session.execute(
            select(SourceVideo).where(SourceVideo.id == video_id)
        ).scalar_one()
        assert video.processing_status == ProcessingStatus.extracted
    finally:
        session.close()


# ── (c) Stage 4: Classification ─────────────────────────────────────────────


def test_stage4_classification_assigns_tags(
    db_engine, sync_engine, pre_ingested_video, tmp_path
):
    """Stages 2+3+4 should store classification data in Redis."""
    video_id = pre_ingested_video["video_id"]

    prompts_dir = tmp_path / "prompts"
    prompts_dir.mkdir()
    (prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
    (prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
    (prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")

    _create_canonical_tags_file(tmp_path)

    call_count = {"n": 0}
    responses = [
        STAGE2_SEGMENTATION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE4_CLASSIFICATION_RESPONSE,
    ]

    def llm_side_effect(**kwargs):
        idx = min(call_count["n"], len(responses) - 1)
        resp = responses[idx]
        call_count["n"] += 1
        return _make_mock_openai_response(resp)

    patches = _patch_pipeline_engine(sync_engine)
    for p in patches:
        p.start()

    stored_cls_data = {}

    def mock_store_classification(vid, data):
        stored_cls_data[vid] = data

    with _patch_llm_completions(llm_side_effect), \
            patch("pipeline.stages.get_settings") as mock_settings, \
            patch("pipeline.stages._load_canonical_tags") as mock_tags, \
            patch("pipeline.stages._store_classification_data", side_effect=mock_store_classification):
        s = MagicMock()
        s.prompts_path = str(prompts_dir)
        s.llm_api_url = "http://mock:11434/v1"
        s.llm_api_key = "sk-test"
        s.llm_model = "test-model"
        s.llm_fallback_url = "http://mock:11434/v1"
        s.llm_fallback_model = "test-model"
        s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
        s.review_mode = True
        mock_settings.return_value = s

        mock_tags.return_value = {
            "categories": [
                {"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging", "eq"]},
            ]
        }

        from pipeline.stages import stage2_segmentation, stage3_extraction, stage4_classification

        stage2_segmentation(video_id)
        stage3_extraction(video_id)
        stage4_classification(video_id)

    for p in patches:
        p.stop()

    # Verify classification data was stored
    assert video_id in stored_cls_data
    cls_data = stored_cls_data[video_id]
    assert len(cls_data) >= 1
    assert cls_data[0]["topic_category"] == "Mixing"
    assert "gain staging" in cls_data[0]["topic_tags"]


# ── (d) Stage 5: Synthesis ──────────────────────────────────────────────────


def test_stage5_synthesis_creates_technique_pages(
    db_engine, sync_engine, pre_ingested_video, tmp_path
):
    """Full pipeline stages 2-5 should create TechniquePage rows linked to KeyMoments."""
    video_id = pre_ingested_video["video_id"]

    prompts_dir = tmp_path / "prompts"
    prompts_dir.mkdir()
    (prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
    (prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
    (prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")
    (prompts_dir / "stage5_synthesis.txt").write_text("Synthesis assistant.")

    call_count = {"n": 0}
    responses = [
        STAGE2_SEGMENTATION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE4_CLASSIFICATION_RESPONSE,
        STAGE5_SYNTHESIS_RESPONSE,
    ]

    def llm_side_effect(**kwargs):
        idx = min(call_count["n"], len(responses) - 1)
        resp = responses[idx]
        call_count["n"] += 1
        return _make_mock_openai_response(resp)

    patches = _patch_pipeline_engine(sync_engine)
    for p in patches:
        p.start()

    # Mock classification data in Redis (simulate stage 4 having stored it)
    mock_cls_data = [
        {"moment_id": "will-be-replaced", "topic_category": "Mixing", "topic_tags": ["gain staging"]},
    ]

    with _patch_llm_completions(llm_side_effect), \
            patch("pipeline.stages.get_settings") as mock_settings, \
            patch("pipeline.stages._load_canonical_tags") as mock_tags, \
            patch("pipeline.stages._store_classification_data"), \
            patch("pipeline.stages._load_classification_data") as mock_load_cls:
        s = MagicMock()
        s.prompts_path = str(prompts_dir)
        s.llm_api_url = "http://mock:11434/v1"
        s.llm_api_key = "sk-test"
        s.llm_model = "test-model"
        s.llm_fallback_url = "http://mock:11434/v1"
        s.llm_fallback_model = "test-model"
        s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
        s.review_mode = True
        mock_settings.return_value = s

        mock_tags.return_value = {
            "categories": [
                {"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging"]},
            ]
        }

        from pipeline.stages import (
            stage2_segmentation,
            stage3_extraction,
            stage4_classification,
            stage5_synthesis,
        )

        stage2_segmentation(video_id)
        stage3_extraction(video_id)
        stage4_classification(video_id)

        # Now set up mock_load_cls to return data with real moment IDs
        factory = sessionmaker(bind=sync_engine)
        sess = factory()
        real_moments = (
            sess.execute(
                select(KeyMoment).where(KeyMoment.source_video_id == video_id)
            )
            .scalars()
            .all()
        )
        real_cls = [
            {"moment_id": str(m.id), "topic_category": "Mixing", "topic_tags": ["gain staging"]}
            for m in real_moments
        ]
        sess.close()
        mock_load_cls.return_value = real_cls

        stage5_synthesis(video_id)

    for p in patches:
        p.stop()

    # Verify TechniquePages created
    factory = sessionmaker(bind=sync_engine)
    session = factory()
    try:
        pages = session.execute(select(TechniquePage)).scalars().all()
        assert len(pages) >= 1
        page = pages[0]
        assert page.title == "Gain Staging in Mixing"
        assert page.body_sections is not None
        assert "Overview" in page.body_sections
        assert page.signal_chains is not None
        assert len(page.signal_chains) >= 1
        assert page.summary is not None

        # Verify KeyMoments are linked to the TechniquePage
        moments = (
            session.execute(
                select(KeyMoment).where(KeyMoment.technique_page_id == page.id)
            )
            .scalars()
            .all()
        )
        assert len(moments) >= 1

        # Verify processing_status updated
        video = session.execute(
            select(SourceVideo).where(SourceVideo.id == video_id)
        ).scalar_one()
        assert video.processing_status == ProcessingStatus.reviewed
    finally:
        session.close()


# ── (e) Stage 6: Embed & Index ──────────────────────────────────────────────


def test_stage6_embeds_and_upserts_to_qdrant(
    db_engine, sync_engine, pre_ingested_video, tmp_path
):
    """Full pipeline through stage 6 should call EmbeddingClient and QdrantManager."""
    video_id = pre_ingested_video["video_id"]

    prompts_dir = tmp_path / "prompts"
    prompts_dir.mkdir()
    (prompts_dir / "stage2_segmentation.txt").write_text("Segment assistant.")
    (prompts_dir / "stage3_extraction.txt").write_text("Extraction assistant.")
    (prompts_dir / "stage4_classification.txt").write_text("Classification assistant.")
    (prompts_dir / "stage5_synthesis.txt").write_text("Synthesis assistant.")

    call_count = {"n": 0}
    responses = [
        STAGE2_SEGMENTATION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE3_EXTRACTION_RESPONSE,
        STAGE4_CLASSIFICATION_RESPONSE,
        STAGE5_SYNTHESIS_RESPONSE,
    ]

    def llm_side_effect(**kwargs):
        idx = min(call_count["n"], len(responses) - 1)
        resp = responses[idx]
        call_count["n"] += 1
        return _make_mock_openai_response(resp)

    patches = _patch_pipeline_engine(sync_engine)
    for p in patches:
        p.start()

    mock_embed_client = MagicMock()
    mock_embed_client.embed.side_effect = lambda texts: make_mock_embeddings(len(texts))

    mock_qdrant_mgr = MagicMock()

    with _patch_llm_completions(llm_side_effect), \
            patch("pipeline.stages.get_settings") as mock_settings, \
            patch("pipeline.stages._load_canonical_tags") as mock_tags, \
            patch("pipeline.stages._store_classification_data"), \
            patch("pipeline.stages._load_classification_data") as mock_load_cls, \
            patch("pipeline.stages.EmbeddingClient", return_value=mock_embed_client), \
            patch("pipeline.stages.QdrantManager", return_value=mock_qdrant_mgr):
        s = MagicMock()
        s.prompts_path = str(prompts_dir)
        s.llm_api_url = "http://mock:11434/v1"
        s.llm_api_key = "sk-test"
        s.llm_model = "test-model"
        s.llm_fallback_url = "http://mock:11434/v1"
        s.llm_fallback_model = "test-model"
        s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
        s.review_mode = True
        s.embedding_api_url = "http://mock:11434/v1"
        s.embedding_model = "test-embed"
        s.embedding_dimensions = 768
        s.qdrant_url = "http://mock:6333"
        s.qdrant_collection = "test_collection"
        mock_settings.return_value = s

        mock_tags.return_value = {
            "categories": [
                {"name": "Mixing", "description": "Balancing", "sub_topics": ["gain staging"]},
            ]
        }

        from pipeline.stages import (
            stage2_segmentation,
            stage3_extraction,
            stage4_classification,
            stage5_synthesis,
            stage6_embed_and_index,
        )

        stage2_segmentation(video_id)
        stage3_extraction(video_id)
        stage4_classification(video_id)

        # Load real moment IDs for classification data mock
        factory = sessionmaker(bind=sync_engine)
        sess = factory()
        real_moments = (
            sess.execute(
                select(KeyMoment).where(KeyMoment.source_video_id == video_id)
            )
            .scalars()
            .all()
        )
        real_cls = [
            {"moment_id": str(m.id), "topic_category": "Mixing", "topic_tags": ["gain staging"]}
            for m in real_moments
        ]
        sess.close()
        mock_load_cls.return_value = real_cls

        stage5_synthesis(video_id)
        stage6_embed_and_index(video_id)

    for p in patches:
        p.stop()

    # Verify EmbeddingClient.embed was called
    assert mock_embed_client.embed.called
    # Verify QdrantManager methods called
    mock_qdrant_mgr.ensure_collection.assert_called_once()
    assert (
        mock_qdrant_mgr.upsert_technique_pages.called
        or mock_qdrant_mgr.upsert_key_moments.called
    ), "Expected at least one upsert call to QdrantManager"


# ── (f) Resumability ────────────────────────────────────────────────────────


def test_run_pipeline_resumes_from_extracted(
    db_engine, sync_engine, pre_ingested_video, tmp_path
):
    """When status=extracted, run_pipeline should skip stages 2+3 and run 4+5+6."""
    video_id = pre_ingested_video["video_id"]

    # Set video status to "extracted" directly
    factory = sessionmaker(bind=sync_engine)
    session = factory()
    video = session.execute(
        select(SourceVideo).where(SourceVideo.id == video_id)
    ).scalar_one()
    video.processing_status = ProcessingStatus.extracted
    session.commit()
    session.close()

    patches = _patch_pipeline_engine(sync_engine)
    for p in patches:
        p.start()

    with patch("pipeline.stages.get_settings") as mock_settings, \
            patch("pipeline.stages.stage2_segmentation") as mock_s2, \
            patch("pipeline.stages.stage3_extraction") as mock_s3, \
            patch("pipeline.stages.stage4_classification") as mock_s4, \
            patch("pipeline.stages.stage5_synthesis") as mock_s5, \
            patch("pipeline.stages.stage6_embed_and_index") as mock_s6, \
            patch("pipeline.stages.celery_chain") as mock_chain:
        s = MagicMock()
        s.database_url = TEST_DATABASE_URL_SYNC.replace("psycopg2", "asyncpg")
        mock_settings.return_value = s

        # Mock chain to inspect what stages it gets
        mock_pipeline = MagicMock()
        mock_chain.return_value = mock_pipeline

        # Mock the .s() method on each task
        mock_s2.s = MagicMock(return_value="s2_sig")
        mock_s3.s = MagicMock(return_value="s3_sig")
        mock_s4.s = MagicMock(return_value="s4_sig")
        mock_s5.s = MagicMock(return_value="s5_sig")
        mock_s6.s = MagicMock(return_value="s6_sig")

        from pipeline.stages import run_pipeline

        run_pipeline(video_id)

        # Verify: stages 2 and 3 should NOT have .s() called with video_id
        mock_s2.s.assert_not_called()
        mock_s3.s.assert_not_called()

        # Stages 4, 5, 6 should have .s() called
        mock_s4.s.assert_called_once_with(video_id)
        mock_s5.s.assert_called_once()
        mock_s6.s.assert_called_once()

    for p in patches:
        p.stop()


# ── (g) Pipeline trigger endpoint ───────────────────────────────────────────


async def test_pipeline_trigger_endpoint(client, db_engine):
    """POST /api/v1/pipeline/trigger/{video_id} with valid video returns 200."""
    # Ingest a transcript first to create a video
    sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"

    with patch("routers.ingest.run_pipeline", create=True) as mock_rp:
        mock_rp.delay = MagicMock()
        resp = await client.post(
            "/api/v1/ingest",
            files={"file": (sample.name, sample.read_bytes(), "application/json")},
        )
        assert resp.status_code == 200
        video_id = resp.json()["video_id"]

    # Trigger the pipeline
    with patch("pipeline.stages.run_pipeline") as mock_rp:
        mock_rp.delay = MagicMock()
        resp = await client.post(f"/api/v1/pipeline/trigger/{video_id}")

    assert resp.status_code == 200
    data = resp.json()
    assert data["status"] == "triggered"
    assert data["video_id"] == video_id


# ── (h) Pipeline trigger 404 ────────────────────────────────────────────────


async def test_pipeline_trigger_404_for_missing_video(client):
    """POST /api/v1/pipeline/trigger/{nonexistent} returns 404."""
    fake_id = str(uuid.uuid4())
    resp = await client.post(f"/api/v1/pipeline/trigger/{fake_id}")
    assert resp.status_code == 404
    assert "not found" in resp.json()["detail"].lower()


# ── (i) Ingest dispatches pipeline ──────────────────────────────────────────


async def test_ingest_dispatches_pipeline(client, db_engine):
    """Ingesting a transcript should call run_pipeline.delay with the video_id."""
    sample = pathlib.Path(__file__).parent / "fixtures" / "sample_transcript.json"

    with patch("pipeline.stages.run_pipeline") as mock_rp:
        mock_rp.delay = MagicMock()
        resp = await client.post(
            "/api/v1/ingest",
            files={"file": (sample.name, sample.read_bytes(), "application/json")},
        )

    assert resp.status_code == 200
    video_id = resp.json()["video_id"]
    mock_rp.delay.assert_called_once_with(video_id)


# ── (j) LLM fallback on primary failure ─────────────────────────────────────


def test_llm_fallback_on_primary_failure():
    """LLMClient should fall back to secondary endpoint when primary raises APIConnectionError."""
    from pipeline.llm_client import LLMClient

    settings = MagicMock()
    settings.llm_api_url = "http://primary:11434/v1"
    settings.llm_api_key = "sk-test"
    settings.llm_fallback_url = "http://fallback:11434/v1"
    settings.llm_fallback_model = "fallback-model"
    settings.llm_model = "primary-model"

    with patch("openai.OpenAI") as MockOpenAI:
        primary_client = MagicMock()
        fallback_client = MagicMock()

        # First call → primary, second call → fallback
        MockOpenAI.side_effect = [primary_client, fallback_client]

        client = LLMClient(settings)

        # Primary raises APIConnectionError
        primary_client.chat.completions.create.side_effect = openai.APIConnectionError(
            request=MagicMock()
        )

        # Fallback succeeds
        fallback_response = _make_mock_openai_response('{"result": "ok"}')
        fallback_client.chat.completions.create.return_value = fallback_response

        result = client.complete("system", "user")

        assert result == '{"result": "ok"}'
        primary_client.chat.completions.create.assert_called_once()
        fallback_client.chat.completions.create.assert_called_once()


# ── Think-tag stripping ─────────────────────────────────────────────────────


def test_strip_think_tags():
    """strip_think_tags should handle all edge cases correctly."""
    from pipeline.llm_client import strip_think_tags

    # Single block with JSON after
    assert strip_think_tags('<think>reasoning here</think>{"a": 1}') == '{"a": 1}'

    # Multiline think block
    assert strip_think_tags(
        '<think>\nI need to analyze this.\nLet me think step by step.\n</think>\n{"result": "ok"}'
    ) == '{"result": "ok"}'

    # Multiple think blocks
    result = strip_think_tags('<think>first</think>hello<think>second</think> world')
    assert result == "hello world"

    # No think tags — passthrough
    assert strip_think_tags('{"clean": true}') == '{"clean": true}'

    # Empty string
    assert strip_think_tags("") == ""

    # Think block with special characters
    assert strip_think_tags(
        '<think>analyzing "complex" <data> & stuff</think>{"done": true}'
    ) == '{"done": true}'

    # Only a think block, no actual content
    assert strip_think_tags("<think>just thinking</think>") == ""
526
backend/tests/test_public_api.py
Normal file
@@ -0,0 +1,526 @@
"""Integration tests for the public S05 API endpoints:
techniques, topics, and enhanced creators.

Tests run against a real PostgreSQL test database via httpx.AsyncClient.
"""

from __future__ import annotations

import uuid

import pytest
import pytest_asyncio
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

from models import (
    ContentType,
    Creator,
    KeyMoment,
    KeyMomentContentType,
    ProcessingStatus,
    RelatedTechniqueLink,
    RelationshipType,
    SourceVideo,
    TechniquePage,
)

TECHNIQUES_URL = "/api/v1/techniques"
TOPICS_URL = "/api/v1/topics"
CREATORS_URL = "/api/v1/creators"


# ── Seed helpers ─────────────────────────────────────────────────────────────


async def _seed_full_data(db_engine) -> dict:
    """Seed 2 creators, 2 videos, 3 technique pages, key moments, and a related link.

    Returns a dict of IDs and metadata for assertions.
    """
    session_factory = async_sessionmaker(
        db_engine, class_=AsyncSession, expire_on_commit=False
    )
    async with session_factory() as session:
        # Creators
        creator1 = Creator(
            name="Alpha Creator",
            slug="alpha-creator",
            genres=["Bass music", "Dubstep"],
            folder_name="AlphaCreator",
        )
        creator2 = Creator(
            name="Beta Producer",
            slug="beta-producer",
            genres=["House", "Techno"],
            folder_name="BetaProducer",
        )
        session.add_all([creator1, creator2])
        await session.flush()

        # Videos
        video1 = SourceVideo(
            creator_id=creator1.id,
            filename="bass-tutorial.mp4",
            file_path="AlphaCreator/bass-tutorial.mp4",
            duration_seconds=600,
            content_type=ContentType.tutorial,
            processing_status=ProcessingStatus.extracted,
        )
        video2 = SourceVideo(
            creator_id=creator2.id,
            filename="mixing-masterclass.mp4",
            file_path="BetaProducer/mixing-masterclass.mp4",
            duration_seconds=1200,
            content_type=ContentType.tutorial,
            processing_status=ProcessingStatus.extracted,
        )
        session.add_all([video1, video2])
        await session.flush()

        # Technique pages
        tp1 = TechniquePage(
            creator_id=creator1.id,
            title="Reese Bass Design",
            slug="reese-bass-design",
            topic_category="Sound design",
            topic_tags=["bass", "textures"],
            summary="Classic reese bass creation",
            body_sections={"intro": "Getting started with reese bass"},
        )
        tp2 = TechniquePage(
            creator_id=creator2.id,
            title="Granular Pad Textures",
            slug="granular-pad-textures",
            topic_category="Synthesis",
            topic_tags=["granular", "pads"],
            summary="Creating evolving pad textures",
        )
        tp3 = TechniquePage(
            creator_id=creator1.id,
            title="FM Bass Layering",
            slug="fm-bass-layering",
            topic_category="Synthesis",
            topic_tags=["fm", "bass"],
            summary="FM synthesis for bass layers",
        )
        session.add_all([tp1, tp2, tp3])
        await session.flush()

        # Key moments
        km1 = KeyMoment(
            source_video_id=video1.id,
            technique_page_id=tp1.id,
            title="Oscillator setup",
            summary="Setting up the initial oscillator",
            start_time=10.0,
            end_time=60.0,
            content_type=KeyMomentContentType.technique,
        )
        km2 = KeyMoment(
            source_video_id=video1.id,
            technique_page_id=tp1.id,
            title="Distortion chain",
            summary="Adding distortion to the reese",
            start_time=60.0,
            end_time=120.0,
            content_type=KeyMomentContentType.technique,
        )
        km3 = KeyMoment(
            source_video_id=video2.id,
            technique_page_id=tp2.id,
            title="Granular engine parameters",
            summary="Configuring the granular engine",
            start_time=20.0,
            end_time=80.0,
            content_type=KeyMomentContentType.settings,
        )
        session.add_all([km1, km2, km3])
        await session.flush()

        # Related technique link: tp1 → tp3 (same_creator_adjacent)
        link = RelatedTechniqueLink(
            source_page_id=tp1.id,
            target_page_id=tp3.id,
            relationship=RelationshipType.same_creator_adjacent,
        )
        session.add(link)
        await session.commit()

        return {
            "creator1_id": str(creator1.id),
            "creator1_name": creator1.name,
            "creator1_slug": creator1.slug,
            "creator2_id": str(creator2.id),
            "creator2_name": creator2.name,
            "creator2_slug": creator2.slug,
            "video1_id": str(video1.id),
            "video2_id": str(video2.id),
            "tp1_slug": tp1.slug,
            "tp1_title": tp1.title,
            "tp2_slug": tp2.slug,
            "tp3_slug": tp3.slug,
            "tp3_title": tp3.title,
        }


# ── Technique Tests ──────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_list_techniques(client, db_engine):
    """GET /techniques returns a paginated list of technique pages."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(TECHNIQUES_URL)
    assert resp.status_code == 200

    data = resp.json()
    assert data["total"] == 3
    assert len(data["items"]) == 3
    # Each item has required fields
    slugs = {item["slug"] for item in data["items"]}
    assert seed["tp1_slug"] in slugs
    assert seed["tp2_slug"] in slugs
    assert seed["tp3_slug"] in slugs


@pytest.mark.asyncio
async def test_list_techniques_with_category_filter(client, db_engine):
    """GET /techniques?category=Synthesis returns only Synthesis technique pages."""
    await _seed_full_data(db_engine)

    resp = await client.get(TECHNIQUES_URL, params={"category": "Synthesis"})
    assert resp.status_code == 200

    data = resp.json()
    assert data["total"] == 2
    for item in data["items"]:
        assert item["topic_category"] == "Synthesis"


@pytest.mark.asyncio
async def test_get_technique_detail(client, db_engine):
    """GET /techniques/{slug} returns full detail with key_moments, creator_info, and related_links."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
    assert resp.status_code == 200

    data = resp.json()
    assert data["title"] == seed["tp1_title"]
    assert data["slug"] == seed["tp1_slug"]
    assert data["topic_category"] == "Sound design"

    # Key moments: tp1 has 2 key moments
    assert len(data["key_moments"]) == 2
    km_titles = {km["title"] for km in data["key_moments"]}
    assert "Oscillator setup" in km_titles
    assert "Distortion chain" in km_titles

    # Creator info
    assert data["creator_info"] is not None
    assert data["creator_info"]["name"] == seed["creator1_name"]
    assert data["creator_info"]["slug"] == seed["creator1_slug"]

    # Related links: tp1 → tp3 (same_creator_adjacent)
    assert len(data["related_links"]) >= 1
    related_slugs = {link["target_slug"] for link in data["related_links"]}
    assert seed["tp3_slug"] in related_slugs


@pytest.mark.asyncio
async def test_get_technique_invalid_slug_returns_404(client, db_engine):
    """GET /techniques/{invalid-slug} returns 404."""
    await _seed_full_data(db_engine)

    resp = await client.get(f"{TECHNIQUES_URL}/nonexistent-slug-xyz")
    assert resp.status_code == 404
    assert "not found" in resp.json()["detail"].lower()


# ── Topics Tests ─────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_list_topics_hierarchy(client, db_engine):
    """GET /topics returns category hierarchy with counts matching seeded data."""
    await _seed_full_data(db_engine)

    resp = await client.get(TOPICS_URL)
    assert resp.status_code == 200

    data = resp.json()
    # Should have the 6 categories from canonical_tags.yaml
    assert len(data) == 6
    category_names = {cat["name"] for cat in data}
    assert "Sound design" in category_names
    assert "Synthesis" in category_names
    assert "Mixing" in category_names

    # Check Sound design category — should have "bass" sub-topic with count
    sound_design = next(c for c in data if c["name"] == "Sound design")
    bass_sub = next(
        (st for st in sound_design["sub_topics"] if st["name"] == "bass"), None
    )
    assert bass_sub is not None
    # tp1 (tags: ["bass", "textures"]) and tp3 (tags: ["fm", "bass"]) both have "bass"
    assert bass_sub["technique_count"] == 2
    # Both from creator1
    assert bass_sub["creator_count"] == 1

    # Check Synthesis category — "granular" sub-topic
    synthesis = next(c for c in data if c["name"] == "Synthesis")
    granular_sub = next(
        (st for st in synthesis["sub_topics"] if st["name"] == "granular"), None
    )
    assert granular_sub is not None
    assert granular_sub["technique_count"] == 1
    assert granular_sub["creator_count"] == 1


@pytest.mark.asyncio
async def test_topics_with_no_technique_pages(client, db_engine):
    """GET /topics with no seeded data returns categories with zero counts."""
    # No data seeded — just use the clean DB
    resp = await client.get(TOPICS_URL)
    assert resp.status_code == 200

    data = resp.json()
    assert len(data) == 6
    # All sub-topic counts should be zero
    for category in data:
        for st in category["sub_topics"]:
            assert st["technique_count"] == 0
            assert st["creator_count"] == 0


# ── Creator Tests ────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_list_creators_random_sort(client, db_engine):
    """GET /creators?sort=random returns all creators (order may vary)."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(CREATORS_URL, params={"sort": "random"})
    assert resp.status_code == 200

    data = resp.json()
    assert len(data) == 2
    names = {item["name"] for item in data}
    assert seed["creator1_name"] in names
    assert seed["creator2_name"] in names

    # Each item has technique_count and video_count
    for item in data:
        assert "technique_count" in item
        assert "video_count" in item


@pytest.mark.asyncio
async def test_list_creators_alpha_sort(client, db_engine):
    """GET /creators?sort=alpha returns creators in alphabetical order."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(CREATORS_URL, params={"sort": "alpha"})
    assert resp.status_code == 200

    data = resp.json()
    assert len(data) == 2
    # "Alpha Creator" < "Beta Producer" alphabetically
    assert data[0]["name"] == "Alpha Creator"
    assert data[1]["name"] == "Beta Producer"


@pytest.mark.asyncio
async def test_list_creators_genre_filter(client, db_engine):
    """GET /creators?genre=Bass+music returns only matching creators."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(CREATORS_URL, params={"genre": "Bass music"})
    assert resp.status_code == 200

    data = resp.json()
    assert len(data) == 1
    assert data[0]["name"] == seed["creator1_name"]
    assert data[0]["slug"] == seed["creator1_slug"]


@pytest.mark.asyncio
async def test_get_creator_detail(client, db_engine):
    """GET /creators/{slug} returns detail with video_count."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(f"{CREATORS_URL}/{seed['creator1_slug']}")
    assert resp.status_code == 200

    data = resp.json()
    assert data["name"] == seed["creator1_name"]
    assert data["slug"] == seed["creator1_slug"]
    assert data["video_count"] == 1  # creator1 has 1 video


@pytest.mark.asyncio
async def test_get_creator_invalid_slug_returns_404(client, db_engine):
    """GET /creators/{invalid-slug} returns 404."""
    await _seed_full_data(db_engine)

    resp = await client.get(f"{CREATORS_URL}/nonexistent-creator-xyz")
    assert resp.status_code == 404


@pytest.mark.asyncio
async def test_creators_with_counts(client, db_engine):
    """GET /creators returns correct technique_count and video_count."""
    seed = await _seed_full_data(db_engine)

    resp = await client.get(CREATORS_URL, params={"sort": "alpha"})
    assert resp.status_code == 200

    data = resp.json()
    # Alpha Creator: 2 technique pages, 1 video
    alpha = data[0]
    assert alpha["name"] == "Alpha Creator"
    assert alpha["technique_count"] == 2
    assert alpha["video_count"] == 1

    # Beta Producer: 1 technique page, 1 video
    beta = data[1]
    assert beta["name"] == "Beta Producer"
    assert beta["technique_count"] == 1
    assert beta["video_count"] == 1


@pytest.mark.asyncio
async def test_creators_empty_list(client, db_engine):
    """GET /creators with no creators returns empty list."""
    # No data seeded
    resp = await client.get(CREATORS_URL)
    assert resp.status_code == 200

    data = resp.json()
    assert data == []
||||
|
||||
# ── Version Tests ────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
async def _insert_version(db_engine, technique_page_id: str, version_number: int, content_snapshot: dict, pipeline_metadata: dict | None = None):
|
||||
"""Insert a TechniquePageVersion row directly for testing."""
|
||||
from models import TechniquePageVersion
|
||||
session_factory = async_sessionmaker(
|
||||
db_engine, class_=AsyncSession, expire_on_commit=False
|
||||
)
|
||||
async with session_factory() as session:
|
||||
v = TechniquePageVersion(
|
||||
technique_page_id=uuid.UUID(technique_page_id) if isinstance(technique_page_id, str) else technique_page_id,
|
||||
version_number=version_number,
|
||||
content_snapshot=content_snapshot,
|
||||
pipeline_metadata=pipeline_metadata,
|
||||
)
|
||||
session.add(v)
|
||||
await session.commit()
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_version_list_empty(client, db_engine):
|
||||
"""GET /techniques/{slug}/versions returns empty list when page has no versions."""
|
||||
seed = await _seed_full_data(db_engine)
|
||||
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions")
|
||||
assert resp.status_code == 200
|
||||
|
||||
data = resp.json()
|
||||
assert data["items"] == []
|
||||
assert data["total"] == 0
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_version_list_with_versions(client, db_engine):
|
||||
"""GET /techniques/{slug}/versions returns versions after inserting them."""
|
||||
seed = await _seed_full_data(db_engine)
|
||||
|
||||
# Get the technique page ID by fetching the detail
|
||||
detail_resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
|
||||
page_id = detail_resp.json()["id"]
|
||||
|
||||
# Insert two versions
|
||||
snapshot1 = {"title": "Old Reese Bass v1", "summary": "First draft"}
|
||||
snapshot2 = {"title": "Old Reese Bass v2", "summary": "Second draft"}
|
||||
await _insert_version(db_engine, page_id, 1, snapshot1, {"model": "gpt-4o"})
|
||||
await _insert_version(db_engine, page_id, 2, snapshot2, {"model": "gpt-4o-mini"})
|
||||
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions")
|
||||
assert resp.status_code == 200
|
||||
|
||||
data = resp.json()
|
||||
assert data["total"] == 2
|
||||
assert len(data["items"]) == 2
|
||||
# Ordered by version_number DESC
|
||||
assert data["items"][0]["version_number"] == 2
|
||||
assert data["items"][1]["version_number"] == 1
|
||||
assert data["items"][0]["pipeline_metadata"]["model"] == "gpt-4o-mini"
|
||||
assert data["items"][1]["pipeline_metadata"]["model"] == "gpt-4o"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_version_detail_returns_content_snapshot(client, db_engine):
|
||||
"""GET /techniques/{slug}/versions/{version_number} returns full snapshot."""
|
||||
seed = await _seed_full_data(db_engine)
|
||||
|
||||
detail_resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
|
||||
page_id = detail_resp.json()["id"]
|
||||
|
||||
snapshot = {"title": "Old Title", "summary": "Old summary", "body_sections": {"intro": "Old intro"}}
|
||||
metadata = {"model": "gpt-4o", "prompt_hash": "abc123"}
|
||||
await _insert_version(db_engine, page_id, 1, snapshot, metadata)
|
||||
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions/1")
|
||||
assert resp.status_code == 200
|
||||
|
||||
data = resp.json()
|
||||
assert data["version_number"] == 1
|
||||
assert data["content_snapshot"] == snapshot
|
||||
assert data["pipeline_metadata"] == metadata
|
||||
assert "created_at" in data
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_version_detail_404_for_nonexistent_version(client, db_engine):
|
||||
"""GET /techniques/{slug}/versions/999 returns 404."""
|
||||
seed = await _seed_full_data(db_engine)
|
||||
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}/versions/999")
|
||||
assert resp.status_code == 404
|
||||
assert "not found" in resp.json()["detail"].lower()
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_versions_404_for_nonexistent_slug(client, db_engine):
|
||||
"""GET /techniques/nonexistent-slug/versions returns 404."""
|
||||
await _seed_full_data(db_engine)
|
||||
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/nonexistent-slug-xyz/versions")
|
||||
assert resp.status_code == 404
|
||||
assert "not found" in resp.json()["detail"].lower()
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_technique_detail_includes_version_count(client, db_engine):
|
||||
"""GET /techniques/{slug} includes version_count field."""
|
||||
seed = await _seed_full_data(db_engine)
|
||||
|
||||
# Initially version_count should be 0
|
||||
resp = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
|
||||
assert resp.status_code == 200
|
||||
data = resp.json()
|
||||
assert data["version_count"] == 0
|
||||
|
||||
# Insert a version and check again
|
||||
page_id = data["id"]
|
||||
await _insert_version(db_engine, page_id, 1, {"title": "Snapshot"})
|
||||
|
||||
resp2 = await client.get(f"{TECHNIQUES_URL}/{seed['tp1_slug']}")
|
||||
assert resp2.status_code == 200
|
||||
assert resp2.json()["version_count"] == 1
|
||||
495	backend/tests/test_review.py	Normal file
@@ -0,0 +1,495 @@
"""Integration tests for the review queue endpoints.

Tests run against a real PostgreSQL test database via httpx.AsyncClient.
Redis is mocked for mode toggle tests.
"""

import uuid
from unittest.mock import AsyncMock, patch

import pytest
import pytest_asyncio
from httpx import AsyncClient
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker

from models import (
    ContentType,
    Creator,
    KeyMoment,
    KeyMomentContentType,
    ProcessingStatus,
    ReviewStatus,
    SourceVideo,
)


# ── Helpers ──────────────────────────────────────────────────────────────────

QUEUE_URL = "/api/v1/review/queue"
STATS_URL = "/api/v1/review/stats"
MODE_URL = "/api/v1/review/mode"


def _moment_url(moment_id: str, action: str = "") -> str:
    """Build a moment action URL."""
    base = f"/api/v1/review/moments/{moment_id}"
    return f"{base}/{action}" if action else base


async def _seed_creator_and_video(db_engine) -> dict:
    """Seed a creator and source video, return their IDs."""
    session_factory = async_sessionmaker(
        db_engine, class_=AsyncSession, expire_on_commit=False
    )
    async with session_factory() as session:
        creator = Creator(
            name="TestCreator",
            slug="test-creator",
            folder_name="TestCreator",
        )
        session.add(creator)
        await session.flush()

        video = SourceVideo(
            creator_id=creator.id,
            filename="test-video.mp4",
            file_path="TestCreator/test-video.mp4",
            duration_seconds=600,
            content_type=ContentType.tutorial,
            processing_status=ProcessingStatus.extracted,
        )
        session.add(video)
        await session.flush()

        result = {
            "creator_id": creator.id,
            "creator_name": creator.name,
            "video_id": video.id,
            "video_filename": video.filename,
        }
        await session.commit()
        return result


async def _seed_moment(
    db_engine,
    video_id: uuid.UUID,
    title: str = "Test Moment",
    summary: str = "A test key moment",
    start_time: float = 10.0,
    end_time: float = 30.0,
    review_status: ReviewStatus = ReviewStatus.pending,
) -> uuid.UUID:
    """Seed a single key moment and return its ID."""
    session_factory = async_sessionmaker(
        db_engine, class_=AsyncSession, expire_on_commit=False
    )
    async with session_factory() as session:
        moment = KeyMoment(
            source_video_id=video_id,
            title=title,
            summary=summary,
            start_time=start_time,
            end_time=end_time,
            content_type=KeyMomentContentType.technique,
            review_status=review_status,
        )
        session.add(moment)
        await session.commit()
        return moment.id


async def _seed_second_video(db_engine, creator_id: uuid.UUID) -> uuid.UUID:
    """Seed a second video for cross-video merge tests."""
    session_factory = async_sessionmaker(
        db_engine, class_=AsyncSession, expire_on_commit=False
    )
    async with session_factory() as session:
        video = SourceVideo(
            creator_id=creator_id,
            filename="other-video.mp4",
            file_path="TestCreator/other-video.mp4",
            duration_seconds=300,
            content_type=ContentType.tutorial,
            processing_status=ProcessingStatus.extracted,
        )
        session.add(video)
        await session.commit()
        return video.id


# ── Queue listing tests ─────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_list_queue_empty(client: AsyncClient):
    """Queue returns empty list when no moments exist."""
    resp = await client.get(QUEUE_URL)
    assert resp.status_code == 200
    data = resp.json()
    assert data["items"] == []
    assert data["total"] == 0


@pytest.mark.asyncio
async def test_list_queue_with_moments(client: AsyncClient, db_engine):
    """Queue returns moments enriched with video filename and creator name."""
    seed = await _seed_creator_and_video(db_engine)
    await _seed_moment(db_engine, seed["video_id"], title="EQ Basics")

    resp = await client.get(QUEUE_URL)
    assert resp.status_code == 200
    data = resp.json()
    assert data["total"] == 1
    item = data["items"][0]
    assert item["title"] == "EQ Basics"
    assert item["video_filename"] == seed["video_filename"]
    assert item["creator_name"] == seed["creator_name"]
    assert item["review_status"] == "pending"


@pytest.mark.asyncio
async def test_list_queue_filter_by_status(client: AsyncClient, db_engine):
    """Queue filters correctly by status query parameter."""
    seed = await _seed_creator_and_video(db_engine)
    await _seed_moment(db_engine, seed["video_id"], title="Pending One")
    await _seed_moment(
        db_engine, seed["video_id"], title="Approved One",
        review_status=ReviewStatus.approved,
    )
    await _seed_moment(
        db_engine, seed["video_id"], title="Rejected One",
        review_status=ReviewStatus.rejected,
    )

    # Default filter: pending
    resp = await client.get(QUEUE_URL)
    assert resp.json()["total"] == 1
    assert resp.json()["items"][0]["title"] == "Pending One"

    # Approved
    resp = await client.get(QUEUE_URL, params={"status": "approved"})
    assert resp.json()["total"] == 1
    assert resp.json()["items"][0]["title"] == "Approved One"

    # All
    resp = await client.get(QUEUE_URL, params={"status": "all"})
    assert resp.json()["total"] == 3


# ── Stats tests ──────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_stats_counts(client: AsyncClient, db_engine):
    """Stats returns correct counts per review status."""
    seed = await _seed_creator_and_video(db_engine)
    await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.pending)
    await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.pending)
    await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.approved)
    await _seed_moment(db_engine, seed["video_id"], review_status=ReviewStatus.rejected)

    resp = await client.get(STATS_URL)
    assert resp.status_code == 200
    data = resp.json()
    assert data["pending"] == 2
    assert data["approved"] == 1
    assert data["edited"] == 0
    assert data["rejected"] == 1


# ── Approve tests ────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_approve_moment(client: AsyncClient, db_engine):
    """Approve sets review_status to approved."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(db_engine, seed["video_id"])

    resp = await client.post(_moment_url(str(moment_id), "approve"))
    assert resp.status_code == 200
    assert resp.json()["review_status"] == "approved"


@pytest.mark.asyncio
async def test_approve_nonexistent_moment(client: AsyncClient):
    """Approve returns 404 for nonexistent moment."""
    fake_id = str(uuid.uuid4())
    resp = await client.post(_moment_url(fake_id, "approve"))
    assert resp.status_code == 404


# ── Reject tests ─────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_reject_moment(client: AsyncClient, db_engine):
    """Reject sets review_status to rejected."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(db_engine, seed["video_id"])

    resp = await client.post(_moment_url(str(moment_id), "reject"))
    assert resp.status_code == 200
    assert resp.json()["review_status"] == "rejected"


@pytest.mark.asyncio
async def test_reject_nonexistent_moment(client: AsyncClient):
    """Reject returns 404 for nonexistent moment."""
    fake_id = str(uuid.uuid4())
    resp = await client.post(_moment_url(fake_id, "reject"))
    assert resp.status_code == 404


# ── Edit tests ───────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_edit_moment(client: AsyncClient, db_engine):
    """Edit updates fields and sets review_status to edited."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(db_engine, seed["video_id"], title="Original Title")

    resp = await client.put(
        _moment_url(str(moment_id)),
        json={"title": "Updated Title", "summary": "New summary"},
    )
    assert resp.status_code == 200
    data = resp.json()
    assert data["title"] == "Updated Title"
    assert data["summary"] == "New summary"
    assert data["review_status"] == "edited"


@pytest.mark.asyncio
async def test_edit_nonexistent_moment(client: AsyncClient):
    """Edit returns 404 for nonexistent moment."""
    fake_id = str(uuid.uuid4())
    resp = await client.put(
        _moment_url(fake_id),
        json={"title": "Won't Work"},
    )
    assert resp.status_code == 404


# ── Split tests ──────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_split_moment(client: AsyncClient, db_engine):
    """Split creates two moments with correct timestamps."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(
        db_engine, seed["video_id"],
        title="Full Moment", start_time=10.0, end_time=30.0,
    )

    resp = await client.post(
        _moment_url(str(moment_id), "split"),
        json={"split_time": 20.0},
    )
    assert resp.status_code == 200
    data = resp.json()
    assert len(data) == 2

    # First (original): [10.0, 20.0)
    assert data[0]["start_time"] == 10.0
    assert data[0]["end_time"] == 20.0

    # Second (new): [20.0, 30.0]
    assert data[1]["start_time"] == 20.0
    assert data[1]["end_time"] == 30.0
    assert "(split)" in data[1]["title"]


@pytest.mark.asyncio
async def test_split_invalid_time_below_start(client: AsyncClient, db_engine):
    """Split returns 400 when split_time is at or below start_time."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(
        db_engine, seed["video_id"], start_time=10.0, end_time=30.0,
    )

    resp = await client.post(
        _moment_url(str(moment_id), "split"),
        json={"split_time": 10.0},
    )
    assert resp.status_code == 400


@pytest.mark.asyncio
async def test_split_invalid_time_above_end(client: AsyncClient, db_engine):
    """Split returns 400 when split_time is at or above end_time."""
    seed = await _seed_creator_and_video(db_engine)
    moment_id = await _seed_moment(
        db_engine, seed["video_id"], start_time=10.0, end_time=30.0,
    )

    resp = await client.post(
        _moment_url(str(moment_id), "split"),
        json={"split_time": 30.0},
    )
    assert resp.status_code == 400


@pytest.mark.asyncio
async def test_split_nonexistent_moment(client: AsyncClient):
    """Split returns 404 for nonexistent moment."""
    fake_id = str(uuid.uuid4())
    resp = await client.post(
        _moment_url(fake_id, "split"),
        json={"split_time": 20.0},
    )
    assert resp.status_code == 404


# ── Merge tests ──────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_merge_moments(client: AsyncClient, db_engine):
    """Merge combines two moments: combined summary, min start, max end, target deleted."""
    seed = await _seed_creator_and_video(db_engine)
    m1_id = await _seed_moment(
        db_engine, seed["video_id"],
        title="First", summary="Summary A",
        start_time=10.0, end_time=20.0,
    )
    m2_id = await _seed_moment(
        db_engine, seed["video_id"],
        title="Second", summary="Summary B",
        start_time=25.0, end_time=35.0,
    )

    resp = await client.post(
        _moment_url(str(m1_id), "merge"),
        json={"target_moment_id": str(m2_id)},
    )
    assert resp.status_code == 200
    data = resp.json()
    assert data["start_time"] == 10.0
    assert data["end_time"] == 35.0
    assert "Summary A" in data["summary"]
    assert "Summary B" in data["summary"]

    # Target should be deleted — reject should 404
    resp2 = await client.post(_moment_url(str(m2_id), "reject"))
    assert resp2.status_code == 404


@pytest.mark.asyncio
async def test_merge_different_videos(client: AsyncClient, db_engine):
    """Merge returns 400 when moments are from different source videos."""
    seed = await _seed_creator_and_video(db_engine)
    m1_id = await _seed_moment(db_engine, seed["video_id"], title="Video 1 moment")

    other_video_id = await _seed_second_video(db_engine, seed["creator_id"])
    m2_id = await _seed_moment(db_engine, other_video_id, title="Video 2 moment")

    resp = await client.post(
        _moment_url(str(m1_id), "merge"),
        json={"target_moment_id": str(m2_id)},
    )
    assert resp.status_code == 400
    assert "different source videos" in resp.json()["detail"]


@pytest.mark.asyncio
async def test_merge_with_self(client: AsyncClient, db_engine):
    """Merge returns 400 when trying to merge a moment with itself."""
    seed = await _seed_creator_and_video(db_engine)
    m_id = await _seed_moment(db_engine, seed["video_id"])

    resp = await client.post(
        _moment_url(str(m_id), "merge"),
        json={"target_moment_id": str(m_id)},
    )
    assert resp.status_code == 400
    assert "itself" in resp.json()["detail"]


@pytest.mark.asyncio
async def test_merge_nonexistent_target(client: AsyncClient, db_engine):
    """Merge returns 404 when target moment does not exist."""
    seed = await _seed_creator_and_video(db_engine)
    m_id = await _seed_moment(db_engine, seed["video_id"])

    resp = await client.post(
        _moment_url(str(m_id), "merge"),
        json={"target_moment_id": str(uuid.uuid4())},
    )
    assert resp.status_code == 404


@pytest.mark.asyncio
async def test_merge_nonexistent_source(client: AsyncClient):
    """Merge returns 404 when source moment does not exist."""
    fake_id = str(uuid.uuid4())
    resp = await client.post(
        _moment_url(fake_id, "merge"),
        json={"target_moment_id": str(uuid.uuid4())},
    )
    assert resp.status_code == 404


# ── Mode toggle tests ───────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_get_mode_default(client: AsyncClient):
    """Get mode returns config default when Redis has no value."""
    mock_redis = AsyncMock()
    mock_redis.get = AsyncMock(return_value=None)
    mock_redis.aclose = AsyncMock()

    with patch("routers.review.get_redis", return_value=mock_redis):
        resp = await client.get(MODE_URL)
        assert resp.status_code == 200
        # Default from config is True
        assert resp.json()["review_mode"] is True


@pytest.mark.asyncio
async def test_set_mode(client: AsyncClient):
    """Set mode writes to Redis and returns the new value."""
    mock_redis = AsyncMock()
    mock_redis.set = AsyncMock()
    mock_redis.aclose = AsyncMock()

    with patch("routers.review.get_redis", return_value=mock_redis):
        resp = await client.put(MODE_URL, json={"review_mode": False})
        assert resp.status_code == 200
        assert resp.json()["review_mode"] is False
        mock_redis.set.assert_called_once_with("chrysopedia:review_mode", "False")


@pytest.mark.asyncio
async def test_get_mode_from_redis(client: AsyncClient):
    """Get mode reads the value stored in Redis."""
    mock_redis = AsyncMock()
    mock_redis.get = AsyncMock(return_value="False")
    mock_redis.aclose = AsyncMock()

    with patch("routers.review.get_redis", return_value=mock_redis):
        resp = await client.get(MODE_URL)
        assert resp.status_code == 200
        assert resp.json()["review_mode"] is False


@pytest.mark.asyncio
async def test_get_mode_redis_error_fallback(client: AsyncClient):
    """Get mode falls back to config default when Redis is unavailable."""
    with patch("routers.review.get_redis", side_effect=ConnectionError("Redis down")):
        resp = await client.get(MODE_URL)
        assert resp.status_code == 200
        # Falls back to config default (True)
        assert resp.json()["review_mode"] is True


@pytest.mark.asyncio
async def test_set_mode_redis_error(client: AsyncClient):
    """Set mode returns 503 when Redis is unavailable."""
    with patch("routers.review.get_redis", side_effect=ConnectionError("Redis down")):
        resp = await client.put(MODE_URL, json={"review_mode": False})
        assert resp.status_code == 503
341	backend/tests/test_search.py	Normal file
@@ -0,0 +1,341 @@
"""Integration tests for the /api/v1/search endpoint.
|
||||
|
||||
Tests run against a real PostgreSQL test database via httpx.AsyncClient.
|
||||
SearchService is mocked at the router dependency level so we can test
|
||||
endpoint behavior without requiring external embedding API or Qdrant.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import uuid
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
import pytest
|
||||
import pytest_asyncio
|
||||
from httpx import AsyncClient
|
||||
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker
|
||||
|
||||
from models import (
|
||||
ContentType,
|
||||
Creator,
|
||||
KeyMoment,
|
||||
KeyMomentContentType,
|
||||
ProcessingStatus,
|
||||
SourceVideo,
|
||||
TechniquePage,
|
||||
)
|
||||
|
||||
SEARCH_URL = "/api/v1/search"
|
||||
|
||||
|
||||
# ── Seed helpers ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
async def _seed_search_data(db_engine) -> dict:
|
||||
"""Seed 2 creators, 3 technique pages, and 5 key moments for search tests.
|
||||
|
||||
Returns a dict with creator/technique IDs and metadata for assertions.
|
||||
"""
|
||||
session_factory = async_sessionmaker(
|
||||
db_engine, class_=AsyncSession, expire_on_commit=False
|
||||
)
|
||||
async with session_factory() as session:
|
||||
# Creators
|
||||
creator1 = Creator(
|
||||
name="Mr. Bill",
|
||||
slug="mr-bill",
|
||||
genres=["Bass music", "Glitch"],
|
||||
folder_name="MrBill",
|
||||
)
|
||||
creator2 = Creator(
|
||||
name="KOAN Sound",
|
||||
slug="koan-sound",
|
||||
genres=["Drum & bass", "Neuro"],
|
||||
folder_name="KOANSound",
|
||||
)
|
||||
session.add_all([creator1, creator2])
|
||||
await session.flush()
|
||||
|
||||
# Videos (needed for key moments FK)
|
||||
video1 = SourceVideo(
|
||||
creator_id=creator1.id,
|
||||
filename="bass-design-101.mp4",
|
||||
file_path="MrBill/bass-design-101.mp4",
|
||||
duration_seconds=600,
|
||||
content_type=ContentType.tutorial,
|
||||
processing_status=ProcessingStatus.extracted,
|
||||
)
|
||||
video2 = SourceVideo(
|
||||
creator_id=creator2.id,
|
||||
filename="reese-bass-deep-dive.mp4",
|
||||
file_path="KOANSound/reese-bass-deep-dive.mp4",
|
||||
duration_seconds=900,
|
||||
content_type=ContentType.tutorial,
|
||||
processing_status=ProcessingStatus.extracted,
|
||||
)
|
||||
session.add_all([video1, video2])
|
||||
await session.flush()
|
||||
|
||||
# Technique pages
|
||||
tp1 = TechniquePage(
|
||||
creator_id=creator1.id,
|
||||
title="Reese Bass Design",
|
||||
slug="reese-bass-design",
|
||||
topic_category="Sound design",
|
||||
topic_tags=["bass", "textures"],
|
||||
summary="How to create a classic reese bass",
|
||||
)
|
||||
tp2 = TechniquePage(
|
||||
creator_id=creator2.id,
|
||||
title="Granular Pad Textures",
|
||||
slug="granular-pad-textures",
|
||||
topic_category="Synthesis",
|
||||
topic_tags=["granular", "pads"],
|
||||
summary="Creating pad textures with granular synthesis",
|
||||
)
|
||||
tp3 = TechniquePage(
|
||||
creator_id=creator1.id,
|
||||
title="FM Bass Layering",
|
||||
slug="fm-bass-layering",
|
||||
topic_category="Synthesis",
|
||||
topic_tags=["fm", "bass"],
|
||||
summary="FM synthesis techniques for bass layering",
|
||||
)
|
||||
session.add_all([tp1, tp2, tp3])
|
||||
await session.flush()
|
||||
|
||||
# Key moments
|
||||
km1 = KeyMoment(
|
||||
source_video_id=video1.id,
|
||||
technique_page_id=tp1.id,
|
||||
title="Setting up the Reese oscillator",
|
||||
summary="Initial oscillator setup for reese bass",
|
||||
start_time=10.0,
|
||||
end_time=60.0,
|
||||
content_type=KeyMomentContentType.technique,
|
||||
)
|
||||
km2 = KeyMoment(
|
||||
source_video_id=video1.id,
|
||||
technique_page_id=tp1.id,
|
||||
title="Adding distortion to the Reese",
|
||||
summary="Distortion processing chain for reese bass",
|
||||
start_time=60.0,
|
||||
end_time=120.0,
|
||||
content_type=KeyMomentContentType.technique,
|
||||
)
|
||||
km3 = KeyMoment(
|
||||
source_video_id=video2.id,
|
||||
technique_page_id=tp2.id,
|
||||
title="Granular engine settings",
|
||||
summary="Dialing in granular engine parameters",
|
||||
start_time=20.0,
|
||||
end_time=80.0,
|
||||
content_type=KeyMomentContentType.settings,
|
||||
)
|
||||
km4 = KeyMoment(
|
||||
source_video_id=video1.id,
|
||||
technique_page_id=tp3.id,
|
||||
title="FM ratio selection",
|
||||
summary="Choosing FM ratios for bass tones",
|
||||
start_time=5.0,
|
||||
end_time=45.0,
|
||||
content_type=KeyMomentContentType.technique,
|
||||
)
|
||||
km5 = KeyMoment(
|
||||
source_video_id=video2.id,
|
||||
title="Outro and credits",
|
||||
summary="End of the video",
|
||||
start_time=800.0,
|
||||
end_time=900.0,
|
||||
content_type=KeyMomentContentType.workflow,
|
||||
)
|
||||
session.add_all([km1, km2, km3, km4, km5])
|
||||
await session.commit()
|
||||
|
||||
return {
|
||||
"creator1_id": str(creator1.id),
|
||||
"creator1_name": creator1.name,
|
||||
"creator1_slug": creator1.slug,
|
||||
"creator2_id": str(creator2.id),
|
||||
"creator2_name": creator2.name,
|
||||
"tp1_slug": tp1.slug,
|
||||
"tp1_title": tp1.title,
|
||||
"tp2_slug": tp2.slug,
|
||||
"tp3_slug": tp3.slug,
|
||||
}
|
||||
|
||||
|
||||
# ── Tests ────────────────────────────────────────────────────────────────────


@pytest.mark.asyncio
async def test_search_happy_path_with_mocked_service(client, db_engine):
    """Search endpoint returns mocked results with correct response shape."""
    await _seed_search_data(db_engine)

    # Mock the SearchService.search method to return canned results
    mock_result = {
        "items": [
            {
                "type": "technique_page",
                "title": "Reese Bass Design",
                "slug": "reese-bass-design",
                "summary": "How to create a classic reese bass",
                "topic_category": "Sound design",
                "topic_tags": ["bass", "textures"],
                "creator_name": "Mr. Bill",
                "creator_slug": "mr-bill",
                "score": 0.95,
            }
        ],
        "total": 1,
        "query": "reese bass",
        "fallback_used": False,
    }

    with patch("routers.search.SearchService") as MockSvc:
        instance = MockSvc.return_value
        instance.search = AsyncMock(return_value=mock_result)

        resp = await client.get(SEARCH_URL, params={"q": "reese bass"})

        assert resp.status_code == 200
        data = resp.json()
        assert data["query"] == "reese bass"
        assert data["total"] == 1
        assert data["fallback_used"] is False
        assert len(data["items"]) == 1

        item = data["items"][0]
        assert item["title"] == "Reese Bass Design"
        assert item["slug"] == "reese-bass-design"
        assert "score" in item


@pytest.mark.asyncio
async def test_search_empty_query_returns_empty(client, db_engine):
    """Empty search query returns empty results without hitting SearchService."""
    await _seed_search_data(db_engine)

    # With empty query, the search service returns empty results directly
    mock_result = {
        "items": [],
        "total": 0,
        "query": "",
        "fallback_used": False,
    }

    with patch("routers.search.SearchService") as MockSvc:
        instance = MockSvc.return_value
        instance.search = AsyncMock(return_value=mock_result)

        resp = await client.get(SEARCH_URL, params={"q": ""})

        assert resp.status_code == 200
        data = resp.json()
        assert data["items"] == []
        assert data["total"] == 0
        assert data["query"] == ""
        assert data["fallback_used"] is False


@pytest.mark.asyncio
async def test_search_keyword_fallback(client, db_engine):
    """When embedding fails, search uses keyword fallback and sets fallback_used=true."""
    await _seed_search_data(db_engine)

    mock_result = {
        "items": [
            {
                "type": "technique_page",
                "title": "Reese Bass Design",
                "slug": "reese-bass-design",
                "summary": "How to create a classic reese bass",
                "topic_category": "Sound design",
                "topic_tags": ["bass", "textures"],
                "creator_name": "",
                "creator_slug": "",
                "score": 0.0,
            }
        ],
        "total": 1,
        "query": "reese",
        "fallback_used": True,
    }

    with patch("routers.search.SearchService") as MockSvc:
        instance = MockSvc.return_value
        instance.search = AsyncMock(return_value=mock_result)

        resp = await client.get(SEARCH_URL, params={"q": "reese"})

        assert resp.status_code == 200
        data = resp.json()
        assert data["fallback_used"] is True
        assert data["total"] >= 1
        assert data["items"][0]["title"] == "Reese Bass Design"


@pytest.mark.asyncio
async def test_search_scope_filter(client, db_engine):
    """Search with scope=topics returns only technique_page type results."""
    await _seed_search_data(db_engine)

    mock_result = {
        "items": [
            {
                "type": "technique_page",
                "title": "FM Bass Layering",
                "slug": "fm-bass-layering",
                "summary": "FM synthesis techniques for bass layering",
                "topic_category": "Synthesis",
                "topic_tags": ["fm", "bass"],
                "creator_name": "Mr. Bill",
                "creator_slug": "mr-bill",
                "score": 0.88,
            }
        ],
        "total": 1,
        "query": "bass",
        "fallback_used": False,
    }

    with patch("routers.search.SearchService") as MockSvc:
        instance = MockSvc.return_value
        instance.search = AsyncMock(return_value=mock_result)

        resp = await client.get(SEARCH_URL, params={"q": "bass", "scope": "topics"})

        assert resp.status_code == 200
        data = resp.json()
        # All items should be technique_page type when scope=topics
        for item in data["items"]:
            assert item["type"] == "technique_page"

        # Verify the service was called with scope=topics
        # (call_args.kwargs and call_args[1] are the same mapping, so one check suffices)
        assert instance.search.call_args.kwargs.get("scope") == "topics"


@pytest.mark.asyncio
async def test_search_no_matching_results(client, db_engine):
    """Search with no matching results returns empty items list."""
    await _seed_search_data(db_engine)

    mock_result = {
        "items": [],
        "total": 0,
        "query": "zzzznonexistent",
        "fallback_used": True,
    }

    with patch("routers.search.SearchService") as MockSvc:
        instance = MockSvc.return_value
        instance.search = AsyncMock(return_value=mock_result)

        resp = await client.get(SEARCH_URL, params={"q": "zzzznonexistent"})

        assert resp.status_code == 200
        data = resp.json()
        assert data["items"] == []
        assert data["total"] == 0
32
backend/worker.py
Normal file
@ -0,0 +1,32 @@
"""Celery application instance for the Chrysopedia pipeline.

Usage:
    celery -A worker worker --loglevel=info
"""

from celery import Celery

from config import get_settings

settings = get_settings()

celery_app = Celery(
    "chrysopedia",
    broker=settings.redis_url,
    backend=settings.redis_url,
)

celery_app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    timezone="UTC",
    enable_utc=True,
    task_track_started=True,
    task_acks_late=True,
    worker_prefetch_multiplier=1,
)

# Import pipeline.stages so that @celery_app.task decorators register tasks.
# This import must come after celery_app is defined.
import pipeline.stages  # noqa: E402, F401
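The registration mechanism that the closing comment relies on can be illustrated without Celery itself: a decorator records each function in the app's registry at import time, which is why merely importing `pipeline.stages` is enough. A stdlib-only sketch (the class and task names are illustrative, not the repo's actual API):

```python
class MiniApp:
    """Toy stand-in for a Celery app: tasks register themselves at import time."""

    def __init__(self):
        self.tasks = {}

    def task(self, name=None):
        # The decorator records the function in the registry, then returns it
        # unchanged. Running the module's top level is what performs registration.
        def register(fn):
            self.tasks[name or fn.__name__] = fn
            return fn
        return register


app = MiniApp()


@app.task(name="pipeline.segment_transcript")
def segment_transcript(video_id):
    return f"segmented {video_id}"


assert "pipeline.segment_transcript" in app.tasks
assert app.tasks["pipeline.segment_transcript"]("v1") == "segmented v1"
```

This is why the real `import pipeline.stages` must come after `celery_app` exists: the decorators in that module need the app object at import time.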
713
chrysopedia-spec.md
Normal file
@ -0,0 +1,713 @@
# Chrysopedia — Project Specification

> **Etymology:** From *chrysopoeia* (the alchemical transmutation of base material into gold) + *encyclopedia* (an organized body of knowledge). Chrysopedia transmutes raw video content into refined, searchable production knowledge.

---

## 1. Project overview

### 1.1 Problem statement

Hundreds of hours of educational video content from electronic music producers sit on local storage — tutorials, livestreams, track breakdowns, and deep dives covering techniques in sound design, mixing, arrangement, synthesis, and more. This content is extremely valuable but nearly impossible to retrieve: videos are unsearchable, unchaptered, and undocumented. A 4-hour livestream may contain 6 minutes of actionable gold buried among tangents and chat interaction. The current retrieval method is "scrub through from memory and hope" — or more commonly, the knowledge is simply lost.

### 1.2 Solution

Chrysopedia is a self-hosted knowledge extraction and retrieval system that:

1. **Transcribes** video content using local Whisper inference
2. **Extracts** key moments, techniques, and insights using LLM analysis
3. **Classifies** content by topic, creator, plugins, and production stage
4. **Synthesizes** knowledge across multiple sources into coherent technique pages
5. **Serves** a fast, search-first web UI for mid-session retrieval

The system transforms raw video files into a browsable, searchable knowledge base with direct timestamp links back to source material.

### 1.3 Design principles

- **Search-first.** The primary interaction is typing a query and getting results in seconds. Browse is secondary, for exploration.
- **Surgical retrieval.** A producer mid-session should be able to Alt+Tab, find the technique they need, absorb the key insight, and get back to their DAW in under 2 minutes.
- **Creator equity.** No artist is privileged in the UI. All creators get equal visual weight. Default sort is randomized.
- **Dual-axis navigation.** Content is accessible by Topic (technique/production stage) and by Creator (artist), with both paths being first-class citizens.
- **Incremental, not one-time.** The system must handle ongoing content additions, not just an initial batch.
- **Self-hosted and portable.** Packaged as a Docker Compose project, deployable on existing infrastructure.

### 1.4 Name and identity

- **Project name:** Chrysopedia
- **Suggested subdomain:** `chrysopedia.xpltd.co`
- **Docker project name:** `chrysopedia`

---

## 2. Content inventory and source material

### 2.1 Current state

- **Volume:** 100–500 video files
- **Creators:** 50+ distinct artists/producers
- **Formats:** Primarily MP4/MKV, mixed quality and naming conventions
- **Organization:** Folders per artist, filenames loosely descriptive
- **Location:** Local desktop storage (not yet on the hypervisor/NAS)
- **Content types:**
  - Full-length tutorials (30min–4hrs, structured walkthroughs)
  - Livestream recordings (long, unstructured, conversational)
  - Track breakdowns / start-to-finish productions

### 2.2 Content characteristics

The audio track carries the vast majority of the value. Visual demonstrations (screen recordings of DAW work) are useful context but are not the primary extraction target. The transcript is the primary ore.

**Structured content** (tutorials, breakdowns) tends to have natural topic boundaries — the producer announces what they're about to cover, then demonstrates. These are easier to segment.

**Unstructured content** (livestreams) is chaotic: tangents, chat interaction, rambling, with gems appearing without warning. The extraction pipeline must handle both structured and unstructured content using semantic understanding, not just topic detection from speaker announcements.

---

## 3. Terminology

| Term | Definition |
|------|-----------|
| **Creator** | An artist, producer, or educator whose video content is in the system. Formerly "artist" — renamed for flexibility. |
| **Technique page** | The primary knowledge unit: a structured page covering one technique or concept from one creator, compiled from one or more source videos. |
| **Key moment** | A discrete, timestamped insight extracted from a video — a specific technique, setting, or piece of reasoning worth capturing. |
| **Topic** | A production domain or concept category (e.g., "sound design," "mixing," "snare design"). Organized hierarchically. |
| **Genre** | A broad musical style tag (e.g., "dubstep," "drum & bass," "halftime"). Stored as metadata on Creators, not on techniques. Used as a filter across all views. |
| **Source video** | An original video file that has been processed by the pipeline. |
| **Transcript** | The timestamped text output of Whisper processing a source video's audio. |

---

## 4. User experience

### 4.1 UX philosophy

The system is accessed via Alt+Tab from a DAW on the same desktop machine. Every design decision optimizes for speed of retrieval and minimal cognitive load. The interface should feel like a tool, not a destination.

**Primary access method:** Same machine, Alt+Tab to browser.

### 4.2 Landing page (Launchpad)

The landing page is a decision point, not a dashboard. Minimal, focused, fast.

**Layout (top to bottom):**

1. **Search bar** — prominent, full-width, with live typeahead (results appear after 2–3 characters). This is the primary interaction for most visits. Scope toggle tabs below the search input: `All | Topics | Creators`
2. **Two navigation cards** — side-by-side:
   - **Topics** — "Browse by technique, production stage, or concept" with count of total techniques and categories
   - **Creators** — "Browse by artist, filterable by genre" with count of total creators and genres
3. **Recently added** — a short list of the most recently processed/published technique pages with creator name, topic tag, and relative timestamp

**Future feature (not v1):** Trending / popular section alongside recently added, driven by view counts and cross-reference frequency.

### 4.3 Live search (typeahead)

The search bar is the primary interface. Behavior:

- Results begin appearing after 2–3 characters typed
- Scope toggle: `All | Topics | Creators` — filters what types of results appear
- **"All" scope** groups results by type:
  - **Topics** — technique pages matching the query, showing title, creator name(s), parent topic tag
  - **Key moments** — individual timestamped insights matching the query, showing moment title, creator, source file, and timestamp. Clicking jumps to the technique page (or eventually direct to the video moment)
  - **Creators** — creator names matching the query
- **"Topics" scope** — shows only technique pages
- **"Creators" scope** — shows only creator matches
- Genre filter is accessible on Creators scope and cross-filters Topics scope (using creator-level genre metadata)
- Search is semantic where possible (powered by Qdrant vector search), with keyword fallback

### 4.4 Technique page (A+C hybrid format)

The core content unit. Each technique page covers one technique or concept from one creator. The format adapts by content type but follows a consistent structure.

**Layout (top to bottom):**

1. **Header:**
   - Topic tags (e.g., "sound design," "drums," "snare")
   - Technique title (e.g., "Snare design")
   - Creator name
   - Meta line: "Compiled from N sources · M key moments · Last updated [date]"
   - Source quality warning (amber banner) if content came from an unstructured livestream

2. **Study guide prose (Section A):**
   - Organized by sub-aspects of the technique (e.g., "Layer construction," "Saturation & character," "Mix context")
   - Rich prose capturing:
     - The specific technique/method described (highest priority)
     - Exact settings, plugins, and parameters when the creator was *teaching* the setting (not incidental use)
     - The reasoning/philosophy behind choices when the creator explains *why*
   - Signal chain blocks rendered in monospace when a creator walks through a routing chain
   - Direct quotes of creator opinions/warnings when they add value (e.g., "He says it 'smears the transient into mush'")

3. **Key moments index (Section C):**
   - Compact list of individual timestamped insights
   - Each row: moment title, source video filename, clickable timestamp
   - Sorted chronologically within each source video

4. **Related techniques:**
   - Links to related technique pages — same technique by other creators, adjacent techniques by the same creator, general/cross-creator technique pages
   - Renders as clickable pill-shaped tags

5. **Plugins referenced:**
   - List of all plugins/tools mentioned in the technique page
   - Each is a clickable tag that could lead to "all techniques referencing this plugin" (future: dedicated plugin pages)

**Content type adaptation:**
- **Technique-heavy content** (sound design, specific methods): Full A+C treatment with signal chains, plugin details, parameter specifics
- **Philosophy/workflow content** (mixdown approach, creative process): More prose-heavy, fewer signal chain blocks, but same overall structure. These pages are still browsable but also serve as rich context for future RAG/chat retrieval
- **Livestream-sourced content:** Amber warning banner noting source quality. Timestamps may land in messy context with tangents nearby

### 4.5 Creators browse page

Accessed from the landing page "Creators" card.

**Layout:**
- Page title: "Creators" with total count
- Filter input: type-to-narrow the list
- Genre filter pills: `All genres | Bass music | Drum & bass | Dubstep | Halftime | House | IDM | Neuro | Techno | ...` — clicking a genre filters the list to creators tagged with that genre
- Sort options: Randomized (default, re-shuffled on every page load), Alphabetical, View count
- Creator list: flat, equal-weight rows. Each row shows:
  - Creator name
  - Genre tags (multiple allowed)
  - Technique count
  - Video count
  - View count (sum of activity across all content derived from this creator)
- Clicking a row navigates to that creator's detail page (list of all their technique pages)

**Default sort is randomized on every page load** to prevent discovery bias. Users can toggle to alphabetical or sort by view count.

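The three sort modes are a few lines of backend logic; a minimal Python sketch (the function name and row shape are illustrative, not the repo's actual API):

```python
import random


def order_creators(creators, sort="random"):
    """Return creators in the requested order; default is a fresh shuffle."""
    if sort == "random":
        # Re-shuffled on every call, which prevents discovery bias.
        return random.sample(creators, k=len(creators))
    if sort == "alphabetical":
        return sorted(creators, key=lambda c: c["name"].lower())
    if sort == "views":
        return sorted(creators, key=lambda c: c["view_count"], reverse=True)
    raise ValueError(f"unknown sort: {sort}")


rows = [
    {"name": "KOAN Sound", "view_count": 40},
    {"name": "Mr. Bill", "view_count": 90},
]
# Shuffling preserves the set of rows; view-count sort puts the most viewed first.
assert {c["name"] for c in order_creators(rows)} == {"KOAN Sound", "Mr. Bill"}
assert order_creators(rows, "views")[0]["name"] == "Mr. Bill"
```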
### 4.6 Topics browse page

Accessed from the landing page "Topics" card.

**Layout:**
- Page title: "Topics" with total technique count
- Filter input: type-to-narrow
- Genre filter pills (uses creator-level genre metadata to filter): show only techniques from creators tagged with the selected genre
- **Two-level hierarchy displayed:**
  - **Top-level categories:** Sound design, Mixing, Synthesis, Arrangement, Workflow, Mastering
  - **Sub-topics within each:** clicking a top-level category expands or navigates to show sub-topics (e.g., Sound Design → Bass, Drums, Pads, Leads, FX, Foley; Drums → Kick, Snare, Hi-hat, Percussion)
- Each sub-topic shows: technique count, number of creators covering it
- Clicking a sub-topic shows all technique pages in that category, filterable by creator and genre

### 4.7 Search results page

For complex queries that go beyond typeahead (e.g., hitting Enter after typing a full query).

**Layout:**
- Search bar at top (retains query)
- Scope tabs: `All results (N) | Techniques (N) | Key moments (N) | Creators (N)`
- Results split into two tiers:
  - **Technique pages** — first-class results with title, creator, summary snippet, tags, moment count, plugin list
  - **Also mentioned in** — cross-references where the search term appears inside other technique pages (e.g., searching "snare" surfaces "drum bus processing" because it mentions snare bus techniques)

---

## 5. Taxonomy and topic hierarchy

### 5.1 Top-level categories

These are broad production stages/domains. They should cover the full scope of music production education:

| Category | Description | Example sub-topics |
|----------|-------------|-------------------|
| Sound design | Creating and shaping sounds from scratch or samples | Bass, drums (kick, snare, hi-hat, percussion), pads, leads, FX, foley, vocals, textures |
| Mixing | Balancing, processing, and spatializing elements in a session | EQ, compression, bus processing, reverb/delay, stereo imaging, gain staging, automation |
| Synthesis | Methods of generating sound | FM, wavetable, granular, additive, subtractive, modular, physical modeling |
| Arrangement | Structuring a track from intro to outro | Song structure, transitions, tension/release, energy flow, breakdowns, drops |
| Workflow | Creative process, session management, productivity | DAW setup, templates, creative process, collaboration, file management, resampling |
| Mastering | Final stage processing for release | Limiting, stereo width, loudness, format delivery, referencing |

### 5.2 Sub-topic management

Sub-topics are not rigidly pre-defined. The extraction pipeline proposes sub-topic tags during classification, and the taxonomy grows organically as content is processed. However, the system maintains a **canonical tag list** that the LLM references during classification to ensure consistency (e.g., always "snare" not sometimes "snare drum" and sometimes "snare design").

The canonical tag list is editable by the administrator and should be stored as a configuration file that the pipeline references. New tags can be proposed by the pipeline and queued for admin approval, or auto-added if they fit within an existing top-level category.

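One way such a configuration file could look (a sketch only — the actual file format is not specified here; the field names mirror the Tag entity in §6.1):

```json
{
  "tags": [
    {"name": "snare", "category": "sound design", "aliases": ["snare drum", "snare design"]},
    {"name": "fm", "category": "synthesis", "aliases": ["fm synthesis", "frequency modulation"]}
  ]
}
```
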
### 5.3 Genre taxonomy

Genres are broad, general-level tags. Sub-genre classification is explicitly out of scope to avoid complexity.

**Initial genre set (expandable):**
Bass music, Drum & bass, Dubstep, Halftime, House, Techno, IDM, Glitch, Downtempo, Neuro, Ambient, Experimental, Cinematic

**Rules:**
- Genres are metadata on Creators, not on techniques
- A Creator can have multiple genre tags
- Genre is available as a filter on both the Creators browse page and the Topics browse page (filtering Topics by genre shows techniques from creators tagged with that genre)
- Genre tags are assigned during initial creator setup (manually or LLM-suggested based on content analysis) and can be edited by the administrator

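Because genre lives on the Creator, filtering Topics by genre is a join through the creator record; a minimal Python sketch (the function name and record shapes are illustrative):

```python
def techniques_for_genre(technique_pages, creators_by_id, genre):
    """Keep only pages whose creator carries the selected genre tag."""
    return [
        page for page in technique_pages
        if genre in creators_by_id[page["creator_id"]]["genres"]
    ]


creators = {
    "c1": {"genres": ["neuro", "bass music"]},
    "c2": {"genres": ["house"]},
}
pages = [
    {"title": "Reese bass", "creator_id": "c1"},
    {"title": "Groove writing", "creator_id": "c2"},
]
# Filtering Topics by "neuro" keeps only the neuro creator's pages.
assert [p["title"] for p in techniques_for_genre(pages, creators, "neuro")] == ["Reese bass"]
```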
---

## 6. Data model

### 6.1 Core entities

**Creator**
```
id           UUID
name         string (display name, e.g., "KOAN Sound")
slug         string (URL-safe, e.g., "koan-sound")
genres       string[] (e.g., ["glitch hop", "neuro", "bass music"])
folder_name  string (matches the folder name on disk for source mapping)
view_count   integer (aggregated from child technique page views)
created_at   timestamp
updated_at   timestamp
```

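The `slug` field can be derived mechanically from `name`; a stdlib sketch (the helper name is an assumption, not the repo's actual function):

```python
import re


def slugify(name: str) -> str:
    # Lowercase, replace each run of non-alphanumeric characters with a
    # single hyphen, then trim hyphens from the ends.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower())
    return slug.strip("-")


assert slugify("KOAN Sound") == "koan-sound"
assert slugify("Mr. Bill") == "mr-bill"
```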
**Source Video**
```
id                 UUID
creator_id         FK → Creator
filename           string (original filename)
file_path          string (path on disk)
duration_seconds   integer
content_type       enum: tutorial | livestream | breakdown | short_form
transcript_path    string (path to transcript JSON)
processing_status  enum: pending | transcribed | extracted | reviewed | published
created_at         timestamp
updated_at         timestamp
```

**Transcript Segment**
```
id               UUID
source_video_id  FK → Source Video
start_time       float (seconds)
end_time         float (seconds)
text             text
segment_index    integer (order within video)
topic_label      string (LLM-assigned topic label for this segment)
```

**Key Moment**
```
id                 UUID
source_video_id    FK → Source Video
technique_page_id  FK → Technique Page (nullable until assigned)
title              string (e.g., "Three-layer snare construction")
summary            text (1-3 sentence description)
start_time         float (seconds)
end_time           float (seconds)
content_type       enum: technique | settings | reasoning | workflow
plugins            string[] (plugin names detected)
review_status      enum: pending | approved | edited | rejected
raw_transcript     text (the original transcript text for this segment)
created_at         timestamp
updated_at         timestamp
```

**Technique Page**
```
id              UUID
creator_id      FK → Creator
title           string (e.g., "Snare design")
slug            string (URL-safe)
topic_category  string (top-level: "sound design")
topic_tags      string[] (sub-topics: ["drums", "snare", "layering", "saturation"])
summary         text (synthesized overview paragraph)
body_sections   JSONB (structured prose sections with headings)
signal_chains   JSONB[] (structured signal chain representations)
plugins         string[] (all plugins referenced across all moments)
source_quality  enum: structured | mixed | unstructured (derived from source video types)
view_count      integer
review_status   enum: draft | reviewed | published
created_at      timestamp
updated_at      timestamp
```

**Related Technique Link**
```
id              UUID
source_page_id  FK → Technique Page
target_page_id  FK → Technique Page
relationship    enum: same_technique_other_creator | same_creator_adjacent | general_cross_reference
```

**Tag (canonical)**
```
id        UUID
name      string (e.g., "snare")
category  string (parent top-level category: "sound design")
aliases   string[] (alternative phrasings the LLM should normalize: ["snare drum", "snare design"])
```

### 6.2 Storage layer

| Store | Purpose | Technology |
|-------|---------|------------|
| Relational DB | All structured data (creators, videos, moments, technique pages, tags) | PostgreSQL (preferred) or SQLite for initial simplicity |
| Vector DB | Semantic search embeddings for transcripts, key moments, and technique page content | Qdrant (already running on hypervisor) |
| File store | Raw transcript JSON files, source video reference metadata | Local filesystem on hypervisor, organized by creator slug |

### 6.3 Vector embeddings

The following content gets embedded in Qdrant for semantic search:

- Key moment summaries (with metadata: creator, topic, timestamp, source video)
- Technique page summaries and body sections
- Transcript segments (for future RAG/chat retrieval)

Embedding model: configurable. Can use a local model via Ollama (e.g., `nomic-embed-text`) or an API-based model. The embedding endpoint should be a configurable URL, same pattern as the LLM endpoint.

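Semantic retrieval over these embeddings reduces to nearest-neighbour scoring; a stdlib sketch of the cosine-similarity step that a vector store performs internally (the toy 2-d vectors and names are illustrative):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Pretend embeddings for two key-moment summaries.
moments = {
    "three-layer snare": [0.9, 0.1],
    "fm bass ratios": [0.1, 0.9],
}
query_vec = [0.85, 0.2]  # pretend embedding of the query "snare layering"

# The best match is the moment whose vector points the same way as the query.
best = max(moments, key=lambda m: cosine(query_vec, moments[m]))
assert best == "three-layer snare"
```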
---

## 7. Pipeline architecture

### 7.1 Infrastructure topology

```
Desktop (RTX 4090)                    Hypervisor (Docker host)
┌─────────────────────┐               ┌─────────────────────────────────┐
│ Video files (local) │               │  Chrysopedia Docker Compose     │
│ Whisper (local GPU) │──2.5GbE──────▶│  ├─ API / pipeline service      │
│ Output: transcript  │  (text only)  │  ├─ Web UI                      │
│ JSON files          │               │  ├─ PostgreSQL                  │
└─────────────────────┘               │  ├─ Qdrant (existing)           │
                                      │  └─ File store                  │
                                      └────────────┬────────────────────┘
                                                   │ API calls (text)
                                     ┌─────────────▼────────────────────┐
                                     │  Friend's DGX Sparks             │
                                     │  Qwen via Open WebUI API         │
                                     │  (2Gb fiber, high uptime)        │
                                     └──────────────────────────────────┘
```

**Bandwidth analysis:** Transcript JSON files are 200–500KB each. At 50Mbit upload, the entire library's transcripts could transfer in under a minute. The bandwidth constraint is irrelevant for this workload. The only large files (videos) stay on the desktop.

**Future centralization:** The Docker Compose project should be structured so that when all hardware is co-located, the only change is config (moving Whisper into the compose stack and pointing file paths to local storage). No architectural rewrite.

### 7.2 Processing stages

#### Stage 1: Audio extraction and transcription (Desktop)

**Tool:** Whisper large-v3 running locally on RTX 4090
**Input:** Video file (MP4/MKV)
**Process:**
1. Extract audio track from video (ffmpeg → WAV or direct pipe)
2. Run Whisper with word-level or segment-level timestamps
3. Output: JSON file with timestamped transcript

**Output format:**
```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        {"word": "Hey", "start": 0.0, "end": 0.28},
        {"word": "everyone", "start": 0.32, "end": 0.74}
      ]
    }
  ]
}
```

**Performance estimate:** Whisper large-v3 on a 4090 processes audio at roughly 10-20x real-time. A 2-hour video takes ~6-12 minutes to transcribe. For 300 videos averaging 1.5 hours each (450 hours of audio), the initial transcription pass is roughly 22-45 hours of GPU time.

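The extraction step in the process list above can be sketched as a command builder. The ffmpeg flags are standard (`-vn` drops the video stream, `-ac 1` downmixes to mono, `-ar 16000` sets the 16 kHz sample rate Whisper expects); the helper itself is illustrative, not the repo's actual code:

```python
def build_audio_extract_cmd(video_path: str, wav_path: str) -> list[str]:
    """Assemble the ffmpeg invocation for Stage 1's audio extraction."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",            # drop the video stream
        "-ac", "1",       # downmix to mono
        "-ar", "16000",   # 16 kHz sample rate for Whisper
        wav_path,
    ]


cmd = build_audio_extract_cmd("input.mp4", "out.wav")
assert cmd[0] == "ffmpeg" and "-ar" in cmd and cmd[-1] == "out.wav"
```

In practice this list would be handed to `subprocess.run(cmd, check=True)` before invoking Whisper on the resulting WAV.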
#### Stage 2: Transcript segmentation (Hypervisor → LLM)

**Tool:** LLM (Qwen on DGX Sparks, or local Ollama as fallback)
**Input:** Full timestamped transcript JSON
**Process:** The LLM analyzes the transcript to identify topic boundaries — points where the creator shifts from one subject to another. Output is a segmented transcript with topic labels per segment.

**This stage can use a lighter model** if needed (segmentation is more mechanical than extraction). However, for simplicity in v1, use the same model endpoint as stages 3-5.

#### Stage 3: Key moment extraction (Hypervisor → LLM)

**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Individual transcript segments from Stage 2
**Process:** The LLM reads each segment and identifies actionable insights. The extraction prompt should distinguish between:

- **Instructional content** (the creator is *teaching* something) → extract as a key moment
- **Incidental content** (the creator is *using* a tool without explaining it) → skip
- **Philosophical/reasoning content** (the creator explains *why* they make a choice) → extract with `content_type: reasoning`
- **Settings/parameters** (specific plugin settings, values, configurations being demonstrated) → extract with `content_type: settings`

**Extraction rule for plugin detail:** Capture plugin names and settings when the creator is *teaching* the setting — spending time explaining why they chose it, what it does, how to configure it. Skip incidental plugin usage (a plugin is visible but not discussed).

#### Stage 4: Classification and tagging (Hypervisor → LLM)

**Tool:** LLM (Qwen on DGX Sparks)
**Input:** Extracted key moments from Stage 3
**Process:** Each moment is classified with:
- Top-level topic category
- Sub-topic tags (referencing the canonical tag list)
- Plugin names (normalized to canonical names)
- Content type classification

The LLM is provided the canonical tag list as context and instructed to use existing tags where possible, proposing new tags only when no existing tag fits.

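Normalizing LLM-proposed tags against the canonical list and its aliases is a dictionary lookup; a minimal sketch (the tag data and helper name are illustrative):

```python
# Canonical tag → known alias phrasings, mirroring the Tag entity's aliases field.
CANONICAL = {
    "snare": ["snare drum", "snare design"],
    "fm": ["fm synthesis", "frequency modulation"],
}
# Invert once for O(1) alias lookups.
ALIAS_TO_TAG = {alias: tag for tag, aliases in CANONICAL.items() for alias in aliases}


def normalize_tag(raw: str) -> str:
    """Map an LLM-proposed tag onto its canonical name where one exists."""
    raw = raw.strip().lower()
    if raw in CANONICAL:
        return raw
    # Unknown tags pass through unchanged so they can be queued for admin review.
    return ALIAS_TO_TAG.get(raw, raw)


assert normalize_tag("Snare drum") == "snare"
assert normalize_tag("fm") == "fm"
assert normalize_tag("granular") == "granular"
```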
#### Stage 5: Synthesis (Hypervisor → LLM)

**Tool:** LLM (Qwen on DGX Sparks)
**Input:** All approved/published key moments for a given creator + topic combination
**Process:** When multiple key moments from the same creator cover overlapping or related topics, the synthesis stage merges them into a coherent technique page. This includes:
- Writing the overview summary paragraph
- Organizing body sections by sub-aspect
- Generating signal chain blocks where applicable
- Identifying related technique pages for cross-linking
- Compiling the plugin reference list

This stage runs whenever new key moments are approved for a creator+topic combination that already has a technique page (updating it), or when enough moments accumulate to warrant a new page.

### 7.3 LLM endpoint configuration
|
||||
|
||||
The pipeline talks to an **OpenAI-compatible API endpoint** (which both Ollama and Open WebUI expose). The LLM is not hardcoded — it's configured via environment variables:
|
||||
|
||||
```
|
||||
LLM_API_URL=https://friend-openwebui.example.com/api
|
||||
LLM_API_KEY=sk-...
|
||||
LLM_MODEL=qwen2.5-72b
|
||||
LLM_FALLBACK_URL=http://localhost:11434/v1 # local Ollama
|
||||
LLM_FALLBACK_MODEL=qwen2.5:14b-q8_0
|
||||
```
|
||||
|
||||
The pipeline should attempt the primary endpoint first and fall back to the local model if the primary is unavailable.
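One way to sketch that fallback logic (illustrative only; the injected `transport` callable stands in for whatever HTTP client the implementation uses, e.g. the `openai` package pointed at `base_url=url`):

```python
import os

class LLMClient:
    """Primary/fallback endpoint selection for pipeline LLM calls.

    `transport` is an injected callable (url, model, messages) -> str that
    performs the actual HTTP request. Keeping it a parameter means the
    selection logic is testable without a network.
    """

    def __init__(self, transport):
        self.transport = transport
        self.primary_url = os.environ.get("LLM_API_URL", "")
        self.primary_model = os.environ.get("LLM_MODEL", "")
        self.fallback_url = os.environ.get("LLM_FALLBACK_URL", "http://localhost:11434/v1")
        self.fallback_model = os.environ.get("LLM_FALLBACK_MODEL", "")

    def chat(self, messages: list[dict]) -> str:
        """Try the primary endpoint; on any transport error, retry once
        against the local fallback model."""
        try:
            return self.transport(self.primary_url, self.primary_model, messages)
        except Exception:
            return self.transport(self.fallback_url, self.fallback_model, messages)
```

A production version would likely narrow the caught exception types and add a timeout, but the shape is the same: one config-driven primary, one config-driven fallback.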

### 7.4 Embedding endpoint configuration

Same configurable pattern:

```
EMBEDDING_API_URL=http://localhost:11434/v1
EMBEDDING_MODEL=nomic-embed-text
```

### 7.5 Processing estimates for initial seeding

| Stage | Per video | 300 videos total |
|-------|----------|-----------------|
| Transcription (Whisper, 4090) | 6–12 min | 30–60 hours |
| Segmentation (LLM) | ~1 min | ~5 hours |
| Extraction (LLM) | ~2 min | ~10 hours |
| Classification (LLM) | ~30 sec | ~2.5 hours |
| Synthesis (LLM) | ~2 min per technique page | Varies by page count |

**Recommendation:** Tell the DGX Sparks friend to expect a weekend of sustained processing for the initial seed. The pipeline must be **resumable** — if it drops, it picks up from the last successfully processed video/stage, not from the beginning.

---

## 8. Review and approval workflow

### 8.1 Modes

The system supports two modes:

- **Review mode (initial calibration):** All extracted key moments enter a review queue. The administrator reviews, edits, approves, or rejects each moment before it's published.
- **Auto mode (post-calibration):** Extracted moments are published automatically. The review queue still exists but functions as an audit log rather than a gate.

The mode is a system-level toggle. The transition from review to auto mode happens when the administrator is satisfied with extraction quality — typically after reviewing the first several videos and tuning prompts.

### 8.2 Review queue interface

The review UI is part of the Chrysopedia web application (an admin section, not a separate tool).

**Queue view:**

- Counts: pending, approved, edited, rejected
- Filter tabs: Pending | Approved | Edited | Rejected
- Items organized by source video (review all moments from one video in sequence for context)

**Individual moment review:**

- Extracted moment: title, timestamp range, summary, tags, plugins detected
- Raw transcript segment displayed alongside for comparison
- Five actions:
  - **Approve** — publish as-is
  - **Edit & approve** — modify summary, tags, timestamp, or plugins, then publish
  - **Split** — the moment actually contains two distinct insights; split into two separate moments
  - **Merge with adjacent** — the system over-segmented; combine with the next or previous moment
  - **Reject** — not a key moment; discard

### 8.3 Prompt tuning

The extraction prompts (stages 2-5) should be stored as editable configuration, not hardcoded. If review reveals systematic issues (e.g., the LLM consistently misclassifies mixing techniques as sound design), the administrator should be able to:

1. Edit the prompt templates
2. Re-run extraction on specific videos or all videos
3. Review the new output

This is the "calibration loop" — run pipeline, review output, tune prompts, re-run, repeat until quality is sufficient for auto mode.

---

## 9. New content ingestion workflow

### 9.1 Adding new videos

The ongoing workflow for adding new content after initial seeding:

1. **Drop file:** Place new video file(s) in the appropriate creator folder on the desktop (or create a new folder for a new creator)
2. **Trigger transcription:** Run the Whisper transcription stage on the new file(s). This could be a manual CLI command, a watched-folder daemon, or an n8n workflow trigger.
3. **Ship transcript:** Transfer the transcript JSON to the hypervisor (automated via the pipeline)
4. **Process:** Stages 2-5 run automatically on the new transcript
5. **Review or auto-publish:** Depending on mode, moments enter the review queue or publish directly
6. **Synthesis update:** If the new content covers a topic that already has a technique page for this creator, the synthesis stage updates the existing page. If it's a new topic, a new technique page is created.

### 9.2 Adding new creators

When a new creator's content is added:

1. Create a new folder on the desktop with the creator's name
2. Add video files
3. The pipeline detects the new folder name and creates a Creator record
4. Genre tags can be auto-suggested by the LLM based on content analysis, or manually assigned by the administrator
5. Process videos as normal

### 9.3 Watched folder (optional, future)

For maximum automation, a filesystem watcher on the desktop could detect new video files and automatically trigger the transcription pipeline. This is a nice-to-have for v2, not a v1 requirement. In v1, transcription is triggered manually.
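Even in v1, the manual trigger could reuse the same discovery logic a future watcher would need. A stdlib-only polling sketch (the extension list and the persistence of `seen` between runs are assumptions):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".webm"}

def find_new_videos(root: Path, seen: set[str]) -> list[Path]:
    """One poll over the creator folders: return video files not yet seen.

    A manual CLI run, cron job, or watcher loop would call this and enqueue
    transcription for each result; the parent folder name doubles as the
    creator name. `seen` would be persisted between runs.
    """
    new = []
    for p in sorted(root.rglob("*")):
        if p.suffix.lower() in VIDEO_EXTS and str(p) not in seen:
            seen.add(str(p))
            new.append(p)
    return new
```

A v2 watcher (e.g. the `watchdog` library) would only replace the polling loop; the discovery and dedup logic stays the same.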

---

## 10. Deployment and infrastructure

### 10.1 Docker Compose project

The entire Chrysopedia stack (excluding Whisper, which runs on the desktop GPU) is packaged as a single `docker-compose.yml`:

```yaml
# Indicative structure — not final
services:
  chrysopedia-api:
    # FastAPI or similar — handles pipeline orchestration, API endpoints
  chrysopedia-web:
    # Web UI — React, Svelte, or similar SPA
  chrysopedia-db:
    # PostgreSQL
  chrysopedia-qdrant:
    # Only if not using the existing Qdrant instance
  chrysopedia-worker:
    # Background job processor for pipeline stages 2-5
```

### 10.2 Existing infrastructure integration

**IMPORTANT:** The implementing agent should reference **XPLTD Lore** when making deployment decisions. This includes:

- Existing Docker conventions, naming patterns, and network configuration
- The hypervisor's current resource allocation and available capacity (~60 containers already running)
- Existing Qdrant instance (may be shared or a new collection created)
- Existing n8n instance (potential for workflow triggers)
- Storage paths and volume mount conventions
- Any reverse proxy or DNS configuration patterns

Do not assume infrastructure details — consult XPLTD Lore for how applications are typically deployed in this environment.

### 10.3 Whisper on desktop

Whisper runs separately on the desktop with the RTX 4090. It is NOT part of the Docker Compose stack (for now). It should be packaged as a simple Python script or lightweight container that:

1. Accepts a video file path (or watches a directory)
2. Extracts audio via ffmpeg
3. Runs Whisper large-v3
4. Outputs transcript JSON
5. Ships the JSON to the hypervisor (SCP, rsync, or API upload to the Chrysopedia API)
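Steps 2 and 4 of that script might look like the following (the ffmpeg flags and payload field names are illustrative assumptions; the ingestion endpoint's actual schema is up to the implementing agent):

```python
import json
import subprocess
from pathlib import Path

def extract_audio(video: Path, wav: Path) -> None:
    """Step 2: pull mono 16 kHz audio out of the video with ffmpeg,
    the sample rate Whisper expects."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-vn", "-ac", "1",
         "-ar", "16000", str(wav)],
        check=True,
    )

def transcript_payload(video: Path, creator: str, segments: list[dict]) -> dict:
    """Step 4: shape Whisper's timestamped segments into the JSON the
    ingestion endpoint expects."""
    return {
        "creator": creator,
        "source_file": video.name,
        "segments": [
            {"start": s["start"], "end": s["end"], "text": s["text"].strip()}
            for s in segments
        ],
    }

def write_transcript(payload: dict, out: Path) -> None:
    """One JSON file per video, ready to SCP/rsync/upload (step 5)."""
    out.write_text(json.dumps(payload, indent=2))
```

Keeping the payload shaping separate from the Whisper call means the same JSON contract survives the future migration into the Compose stack.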

**Future centralization:** When all hardware is co-located, Whisper can be added to the Docker Compose stack with GPU passthrough, and the video files can be mounted directly. The pipeline should be designed so this migration is a config change, not a rewrite.

### 10.4 Network considerations

- Desktop ↔ Hypervisor: 2.5GbE (ample for transcript JSON transfer)
- Hypervisor ↔ DGX Sparks: Internet (50Mbit up from the Chrysopedia side, 2Gb fiber on the DGX side). Transcript text payloads are tiny; this is not a bottleneck.
- Web UI: Served from the hypervisor, accessed via the local network (same machine Alt+Tab) or from other devices on the network. Eventually shareable with external users.

---

## 11. Technology recommendations

These are recommendations, not mandates. The implementing agent should evaluate alternatives based on current best practices and XPLTD Lore.

| Component | Recommendation | Rationale |
|-----------|---------------|-----------|
| Transcription | Whisper large-v3 (local, 4090) | Best accuracy; local processing keeps media files on-network |
| LLM inference | Qwen via Open WebUI API (DGX Sparks) | Free, powerful, high uptime. Ollama on 4090 as fallback |
| Embedding | nomic-embed-text via Ollama (local) | Good quality, runs easily alongside other local models |
| Vector DB | Qdrant | Already running on hypervisor |
| Relational DB | PostgreSQL | Robust, good JSONB support for flexible schema fields |
| API framework | FastAPI (Python) | Strong async support, good for pipeline orchestration |
| Web UI | React or Svelte SPA | Fast, component-based, good for search-heavy UIs |
| Background jobs | Celery with Redis, or a simpler task queue | Pipeline stages 2-5 run as background jobs |
| Audio extraction | ffmpeg | Universal, reliable |

---

## 12. Open questions and future considerations

These items are explicitly out of scope for v1 but should be considered in architectural decisions:

### 12.1 Chat / RAG retrieval

Not required for v1, but the system should be **architected to support it easily.** The Qdrant embeddings and structured knowledge base provide the foundation. A future chat interface could use the Qwen instance (or any compatible LLM) with RAG over the Chrysopedia knowledge base to answer natural language questions like "How does Skope approach snare design differently from Au5?"

### 12.2 Direct video playback

v1 provides file paths and timestamps ("Skope — Sound Design Masterclass pt2.mp4 @ 1:42:30"). Future versions could embed video playback directly in the web UI, jumping to the exact timestamp. This requires the video files to be network-accessible from the web UI, which depends on centralizing storage.

### 12.3 Access control

Not needed for v1. The system is initially for personal/local use. Future versions may add authentication for sharing with friends or external users. The architecture should not preclude this (e.g., don't hardcode single-user assumptions into the data model).

### 12.4 Multi-user features

Eventually: user-specific bookmarks, personal notes on technique pages, view history, and personalized "trending" based on individual usage patterns.

### 12.5 Content types beyond video

The extraction pipeline is fundamentally transcript-based. It could be extended to process podcast episodes, audio-only recordings, or even written tutorials/blog posts with minimal architectural changes.

### 12.6 Plugin knowledge base

Plugins referenced across all technique pages could be promoted to a first-class entity with their own browse page: "All techniques that reference Serum" or "Signal chains using Pro-Q 3." The data model already captures plugin references — this is primarily a UI feature.

---

## 13. Success criteria

The system is successful when:

1. **A producer mid-session can find a specific technique in under 30 seconds** — from Alt+Tab to reading the key insight
2. **The extraction pipeline correctly identifies 80%+ of key moments** without human intervention (post-calibration)
3. **New content can be added and processed within hours**, not days
4. **The knowledge base grows more useful over time** — cross-references and related techniques create a web of connected knowledge that surfaces unexpected insights
5. **The system runs reliably on existing infrastructure** without requiring significant new hardware or ongoing cloud costs

---

## 14. Implementation phases

### Phase 1: Foundation

- Set up Docker Compose project with PostgreSQL, API service, and web UI skeleton
- Implement Whisper transcription script for desktop
- Build transcript ingestion endpoint on the API
- Implement basic Creator and Source Video management

### Phase 2: Extraction pipeline

- Implement stages 2-5 (segmentation, extraction, classification, synthesis)
- Build the review queue UI
- Process a small batch of videos (5-10) for calibration
- Tune extraction prompts based on review feedback

### Phase 3: Knowledge UI

- Build the search-first web UI: landing page, live search, technique pages
- Implement Qdrant integration for semantic search
- Build Creators and Topics browse pages
- Implement related technique cross-linking

### Phase 4: Initial seeding

- Process the full video library through the pipeline
- Review and approve extractions (transitioning toward auto mode)
- Populate the canonical tag list and genre taxonomy
- Build out cross-references and related technique links

### Phase 5: Polish and ongoing

- Transition to auto mode for new content
- Implement view count tracking
- Optimize search ranking and relevance
- Begin sharing with trusted external users

---

*This specification was developed through collaborative ideation between the project owner and Claude. The implementing agent should treat this as a comprehensive guide while exercising judgment on technical implementation details, consulting XPLTD Lore for infrastructure conventions, and adapting to discoveries made during development.*
---

**`config/canonical_tags.yaml`** (new file, 48 lines):

```yaml
# Canonical tags — 7 top-level production categories
# Sub-topics grow organically during pipeline extraction
# Order follows the natural production learning arc:
#   setup → theory → create sounds → structure → polish → deliver
categories:
  - name: Workflow
    description: Creative process, session management, productivity
    sub_topics: [daw setup, templates, creative process, collaboration, file management, resampling]

  - name: Music Theory
    description: Harmony, scales, chord progressions, and musical structure
    sub_topics: [harmony, chord progressions, scales, rhythm, time signatures, melody, counterpoint, song keys]

  - name: Sound Design
    description: Creating and shaping sounds from scratch or samples
    sub_topics: [bass, drums, kick, snare, hi-hat, percussion, pads, leads, fx, foley, vocals, textures]

  - name: Synthesis
    description: Methods of generating sound
    sub_topics: [fm, wavetable, granular, additive, subtractive, modular, physical modeling]

  - name: Arrangement
    description: Structuring a track from intro to outro
    sub_topics: [song structure, transitions, tension, energy flow, breakdowns, drops]

  - name: Mixing
    description: Balancing, processing, and spatializing elements
    sub_topics: [eq, compression, bus processing, reverb, delay, stereo imaging, gain staging, automation]

  - name: Mastering
    description: Final stage processing for release
    sub_topics: [limiting, stereo width, loudness, format delivery, referencing]

# Genre taxonomy (assigned to Creators, not techniques)
genres:
  - Bass music
  - Drum & bass
  - Dubstep
  - Halftime
  - House
  - Techno
  - IDM
  - Glitch
  - Downtempo
  - Neuro
  - Ambient
  - Experimental
  - Cinematic
```
**`docker/Dockerfile.api`** (new file, 30 lines):

```dockerfile
FROM python:3.12-slim

WORKDIR /app

# System deps
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc libpq-dev curl \
    && rm -rf /var/lib/apt/lists/*

# Python deps (cached layer)
COPY backend/requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Git commit SHA for version tracking
ARG GIT_COMMIT_SHA=unknown

# Application code
COPY backend/ /app/
RUN echo "${GIT_COMMIT_SHA}" > /app/.git-commit
COPY prompts/ /prompts/
COPY config/ /config/
COPY alembic.ini /app/alembic.ini
COPY alembic/ /app/alembic/

EXPOSE 8000

HEALTHCHECK --interval=15s --timeout=5s --retries=3 --start-period=10s \
    CMD curl -f http://localhost:8000/health || exit 1

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
**`docker/Dockerfile.web`** (new file, 20 lines):

```dockerfile
FROM node:22-alpine AS build

WORKDIR /app
COPY frontend/package*.json ./
RUN npm ci --ignore-scripts
COPY frontend/ .

ARG VITE_GIT_COMMIT=dev
ENV VITE_GIT_COMMIT=$VITE_GIT_COMMIT

RUN npm run build

FROM nginx:1.27-alpine

COPY --from=build /app/dist /usr/share/nginx/html
COPY docker/nginx.conf /etc/nginx/conf.d/default.conf

EXPOSE 80

CMD ["nginx", "-g", "daemon off;"]
```
**`docker/nginx.conf`** (new file, 35 lines):

```nginx
server {
    listen 80;
    server_name _;
    root /usr/share/nginx/html;
    index index.html;

    # Use Docker's embedded DNS with 30s TTL so upstream IPs refresh
    # after container recreates
    resolver 127.0.0.11 valid=30s ipv6=off;

    # Allow large transcript uploads (up to 50MB)
    client_max_body_size 50m;

    # SPA fallback
    location / {
        try_files $uri $uri/ /index.html;
    }

    # API proxy — variable forces nginx to re-resolve on each request
    location /api/ {
        set $backend http://chrysopedia-api:8000;
        proxy_pass $backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }

    location /health {
        set $backend http://chrysopedia-api:8000;
        proxy_pass $backend;
    }
}
```
**Frontend stylesheet** (modified):

```diff
@@ -70,8 +70,8 @@
 /* Pills / special badges */
 --color-pill-bg: #22222e;
 --color-pill-text: #e2e2ea;
---color-pill-plugin-bg: #2e1065;
---color-pill-plugin-text: #c4b5fd;
+--color-pill-plugin-bg: #3b1f06;
+--color-pill-plugin-text: #f6ad55;
 --color-badge-category-bg: #1e1b4b;
 --color-badge-category-text: #93c5fd;
 --color-badge-type-technique-bg: #1e1b4b;
@@ -88,8 +88,8 @@
 /* Per-category badge colors */
 --color-badge-cat-sound-design-bg: #0d3b3b;
 --color-badge-cat-sound-design-text: #5eead4;
---color-badge-cat-mixing-bg: #2e1065;
---color-badge-cat-mixing-text: #c4b5fd;
+--color-badge-cat-mixing-bg: #0f2942;
+--color-badge-cat-mixing-text: #7dd3fc;
 --color-badge-cat-synthesis-bg: #0c2461;
 --color-badge-cat-synthesis-text: #93c5fd;
 --color-badge-cat-arrangement-bg: #422006;
@@ -198,9 +198,13 @@ body {
 
 .app-main {
   flex: 1;
+  width: 100%;
   max-width: 72rem;
   margin: 1.5rem auto;
   padding: 0 1.5rem;
+  box-sizing: border-box;
+  overflow-wrap: break-word;
+  overflow-x: hidden;
 }
 
 /* ── App footer ───────────────────────────────────────────────────────────── */
@@ -930,7 +934,7 @@ a.app-footer__repo:hover {
 
 .home-hero {
   text-align: center;
-  padding: 3rem 1rem 2rem;
+  padding: 0.5rem 1rem 1.5rem;
 }
 
 .home-hero__title {
@@ -1336,6 +1340,9 @@ a.app-footer__repo:hover {
 
 .technique-page {
   max-width: 64rem;
+  width: 100%;
+  overflow-wrap: break-word;
+  word-wrap: break-word;
 }
 
 .technique-columns {
@@ -1347,6 +1354,8 @@ a.app-footer__repo:hover {
 
 .technique-columns__main {
   min-width: 0; /* prevent grid blowout */
+  overflow-wrap: break-word;
+  word-wrap: break-word;
 }
 
 .technique-columns__sidebar {
@@ -1392,15 +1401,26 @@ a.app-footer__repo:hover {
   color: var(--color-banner-amber-text);
 }
 
 .technique-header {
   margin-bottom: 1.5rem;
+}
+
+.technique-header__title-row {
+  display: flex;
+  align-items: flex-start;
+  justify-content: space-between;
+  gap: 1rem;
+  margin-bottom: 0.5rem;
+}
+
+.technique-header__title-row .badge--category {
+  flex-shrink: 0;
+  margin-top: 0.35rem;
 }
 
 .technique-header__title {
   font-size: 1.75rem;
   font-weight: 800;
   letter-spacing: -0.02em;
-  margin-bottom: 0.5rem;
+  margin-bottom: 0;
   line-height: 1.2;
 }
@@ -1415,6 +1435,7 @@ a.app-footer__repo:hover {
   display: inline-flex;
   flex-wrap: wrap;
   gap: 0.25rem;
+  max-width: 100%;
 }
 
 .technique-header__creator-genres {
@@ -1432,6 +1453,9 @@ a.app-footer__repo:hover {
 }
 
 .technique-header__creator-link {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.35rem;
   font-size: 1.125rem;
   font-weight: 600;
   color: var(--color-link-accent);
@@ -1458,6 +1482,13 @@ a.app-footer__repo:hover {
 
 /* ── Technique prose / sections ───────────────────────────────────────────── */
 
+.technique-main__tags {
+  display: flex;
+  flex-wrap: wrap;
+  gap: 0.25rem;
+  margin-bottom: 1rem;
+}
+
 .technique-summary {
   margin-bottom: 1.5rem;
 }
@@ -1521,7 +1552,6 @@ a.app-footer__repo:hover {
   background: var(--color-bg-surface);
   border: 1px solid var(--color-border);
   border-radius: 0.5rem;
-  overflow: hidden;
 }
 
 .technique-moment__title {
@@ -1530,7 +1560,6 @@ a.app-footer__repo:hover {
   font-size: 0.9375rem;
   font-weight: 600;
   line-height: 1.3;
-  word-break: break-word;
 }
 
 .technique-moment__meta {
@@ -1539,7 +1568,6 @@ a.app-footer__repo:hover {
   gap: 0.5rem;
   margin-bottom: 0.25rem;
   flex-wrap: wrap;
-  min-width: 0;
 }
 
 .technique-moment__time {
@@ -1552,7 +1580,7 @@ a.app-footer__repo:hover {
   font-size: 0.75rem;
   color: var(--color-text-muted);
   font-family: "SF Mono", "Fira Code", "Fira Mono", "Roboto Mono", monospace;
-  max-width: 100%;
+  max-width: 20rem;
   overflow: hidden;
   text-overflow: ellipsis;
   white-space: nowrap;
@@ -1810,6 +1838,9 @@ a.app-footer__repo:hover {
 }
 
 .creator-row__name {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.5rem;
   font-size: 0.9375rem;
   font-weight: 600;
   min-width: 10rem;
@@ -1852,6 +1883,9 @@ a.app-footer__repo:hover {
 }
 
 .creator-detail__name {
+  display: flex;
+  align-items: center;
+  gap: 0.75rem;
   font-size: 1.75rem;
   font-weight: 800;
   letter-spacing: -0.02em;
@@ -2012,12 +2046,10 @@ a.app-footer__repo:hover {
   margin: 0;
 }
 
-.topic-card__dot {
-  display: inline-block;
-  width: 0.5rem;
-  height: 0.5rem;
-  border-radius: 50%;
+.topic-card__glyph {
+  flex-shrink: 0;
+  line-height: 1;
+  opacity: 0.7;
 }
 
 .topic-card__desc {
@@ -2170,26 +2202,6 @@ a.app-footer__repo:hover {
   .topic-subtopic {
     padding-left: 1rem;
   }
 
-  .app-main {
-    padding: 0 1rem;
-  }
-
-  .technique-header__meta {
-    gap: 0.375rem;
-  }
-
-  .technique-header__tags {
-    gap: 0.1875rem;
-  }
-
-  .technique-header__creator-genres {
-    gap: 0.1875rem;
-  }
-
-  .version-switcher__select {
-    max-width: 12rem;
-  }
 }
 
 /* ── Report Issue Modal ─────────────────────────────────────────────────── */
@@ -2553,9 +2565,6 @@ a.app-footer__repo:hover {
   padding: 0.3rem 0.5rem;
   font-size: 0.8rem;
   cursor: pointer;
-  max-width: 100%;
-  overflow: hidden;
-  text-overflow: ellipsis;
 }
 
 .version-switcher__select:focus {
@@ -3178,3 +3187,126 @@ a.app-footer__repo:hover {
   white-space: pre-wrap;
   word-break: break-word;
 }
+
+/* ── Ghost button ─────────────────────────────────────────────────────── */
+
+.btn--ghost {
+  background: transparent;
+  color: var(--color-text-muted);
+  border-color: transparent;
+}
+
+.btn--ghost:hover:not(:disabled) {
+  color: var(--color-text-secondary);
+  background: var(--color-bg-surface);
+  border-color: var(--color-border);
+}
+
+/* ── Technique page footer ────────────────────────────────────────────── */
+
+.technique-footer {
+  margin-top: 2rem;
+  padding-top: 1rem;
+  border-top: 1px solid var(--color-border);
+  display: flex;
+  justify-content: flex-end;
+}
+
+/* ── Creator avatar ───────────────────────────────────────────────────── */
+
+.creator-avatar {
+  border-radius: 4px;
+  flex-shrink: 0;
+  vertical-align: middle;
+}
+
+.creator-avatar--img {
+  object-fit: cover;
+}
+
+/* ── Copy link button ─────────────────────────────────────────────────── */
+
+.copy-link-btn {
+  display: inline-flex;
+  align-items: center;
+  justify-content: center;
+  position: relative;
+  background: none;
+  border: none;
+  color: var(--color-text-muted);
+  cursor: pointer;
+  padding: 0.15rem;
+  border-radius: 4px;
+  opacity: 0;
+  transition: opacity 0.15s, color 0.15s, background 0.15s;
+  vertical-align: middle;
+  margin-left: 0.25rem;
+}
+
+.technique-header__title:hover .copy-link-btn,
+.copy-link-btn:focus-visible {
+  opacity: 1;
+}
+
+.copy-link-btn:hover {
+  opacity: 1;
+  color: var(--color-accent);
+  background: var(--color-bg-surface);
+}
+
+.copy-link-btn__tooltip {
+  position: absolute;
+  top: -1.75rem;
+  left: 50%;
+  transform: translateX(-50%);
+  background: var(--color-bg-surface);
+  color: var(--color-accent);
+  font-size: 0.7rem;
+  padding: 0.15rem 0.5rem;
+  border-radius: 4px;
+  border: 1px solid var(--color-border);
+  white-space: nowrap;
+  pointer-events: none;
+  animation: fadeInUp 0.15s ease-out;
+}
+
+@keyframes fadeInUp {
+  from { opacity: 0; transform: translateX(-50%) translateY(4px); }
+  to { opacity: 1; transform: translateX(-50%) translateY(0); }
+}
+
+/* ── Recent card with creator ─────────────────────────────────────────── */
+
+.recent-card__header {
+  display: flex;
+  align-items: flex-start;
+  justify-content: space-between;
+  gap: 0.5rem;
+}
+
+.recent-card__creator {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.3rem;
+  font-size: 0.8rem;
+  color: var(--color-text-secondary);
+  white-space: nowrap;
+  flex-shrink: 0;
+}
+
+/* ── Search result card creator ───────────────────────────────────────── */
+
+.search-result-card__creator {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.3rem;
+}
+
+/* ── Technique footer inspect link ────────────────────────────────────── */
+
+.technique-footer__inspect {
+  display: inline-flex;
+  align-items: center;
+  gap: 0.3rem;
+  text-decoration: none;
+}
```