feat: Added personality extraction pipeline: prompt template, 3-tier tr…

- "prompts/personality_extraction.txt"
- "backend/pipeline/stages.py"
- "backend/schemas.py"
- "backend/routers/admin.py"

GSD-Task: S06/T02
jlightner 2026-04-04 08:28:18 +00:00
parent 10cd175333
commit 2d9076ae92
7 changed files with 491 additions and 1 deletion


@@ -46,7 +46,7 @@ Add the `personality_profile` JSONB column to the Creator model, create the Alem
- Estimate: 30m
- Files: backend/models.py, backend/schemas.py, backend/routers/creators.py, alembic/versions/023_add_personality_profile.py
- Verify: cd backend && python -c "from models import Creator; assert hasattr(Creator, 'personality_profile'); print('model OK')" && python -c "from schemas import CreatorDetail; assert 'personality_profile' in CreatorDetail.model_fields; print('schema OK')" && test -f ../alembic/versions/023_add_personality_profile.py && echo 'migration exists'
- [x] **T02: Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint** — ## Description
Build the core extraction pipeline: a prompt template that analyzes creator transcripts for distinctive personality markers, a Celery task that aggregates and samples transcripts then calls the LLM, and an admin endpoint to trigger extraction. Follows existing stage patterns in `pipeline/stages.py`.


@@ -0,0 +1,30 @@
{
"schemaVersion": 1,
"taskId": "T01",
"unitId": "M022/S06/T01",
"timestamp": 1775291084955,
"passed": false,
"discoverySource": "task-plan",
"checks": [
{
"command": "cd backend",
"exitCode": 0,
"durationMs": 8,
"verdict": "pass"
},
{
"command": "test -f ../alembic/versions/023_add_personality_profile.py",
"exitCode": 1,
"durationMs": 7,
"verdict": "fail"
},
{
"command": "echo 'migration exists'",
"exitCode": 0,
"durationMs": 6,
"verdict": "pass"
}
],
"retryAttempt": 1,
"maxRetries": 2
}


@@ -0,0 +1,87 @@
---
id: T02
parent: S06
milestone: M022
provides: []
requires: []
affects: []
key_files: ["prompts/personality_extraction.txt", "backend/pipeline/stages.py", "backend/schemas.py", "backend/routers/admin.py"]
key_decisions: ["Used response_model=object to trigger JSON mode with manual parse + Pydantic validation for clearer error handling on nested schema"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "All 8 verification checks pass: prompt file exists, task importable, validator importable, endpoint wired, model has attribute, schema has field, migration file exists, router references personality_profile."
completed_at: 2026-04-04T08:28:14.600Z
blocker_discovered: false
---
# T02: Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint
> Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint
## What Happened
Created the personality extraction prompt at prompts/personality_extraction.txt instructing the LLM to focus on distinctive traits and return structured JSON. Added _sample_creator_transcripts() with three tiers: small uses all text, medium takes 300-char excerpts, large does topic-diverse random sampling via Redis with deterministic seed. The extract_personality_profile Celery task loads creator's key moments via SourceVideo join, samples transcripts, calls LLM, validates response with PersonalityProfile Pydantic model, attaches metadata, and stores on Creator.personality_profile. Handles zero-transcript creators (early return), invalid JSON (retry), and validation failures (retry). Added PersonalityProfile with nested sub-models in schemas.py. Added POST /admin/creators/{slug}/extract-profile endpoint in admin.py.
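The large-tier sampling described above hinges on seeding the RNG with the creator_id, so repeated extractions of the same creator see the same excerpts. A minimal sketch of that idea (simplified: plain strings instead of KeyMoment rows, and no Redis topic-diversity pass):

```python
import random

def sample_excerpts(transcripts: list[str], creator_id: str,
                    max_chars: int = 40_000, excerpt_len: int = 600) -> str:
    # Seeding with creator_id makes the shuffle deterministic per creator,
    # so re-running extraction produces the same sample.
    rng = random.Random(creator_id)
    pool = list(transcripts)
    rng.shuffle(pool)
    excerpts, budget = [], max_chars
    for t in pool:
        chunk = t[:excerpt_len]
        if len(chunk) > budget:
            break  # character budget exhausted
        excerpts.append(chunk)
        budget -= len(chunk)
    return "\n\n---\n\n".join(excerpts)

a = sample_excerpts(["intro take", "bass tutorial", "mix walkthrough"], "creator-123")
b = sample_excerpts(["intro take", "bass tutorial", "mix walkthrough"], "creator-123")
assert a == b  # same creator, same sample
```

Determinism here matters for debugging: if a profile looks off, re-running the task reproduces the exact transcript sample the LLM saw.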
## Verification
All 8 verification checks pass: prompt file exists, task importable, validator importable, endpoint wired, model has attribute, schema has field, migration file exists, router references personality_profile.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `test -f prompts/personality_extraction.txt` | 0 | ✅ pass | 50ms |
| 2 | `cd backend && python -c "from pipeline.stages import extract_personality_profile; print('task OK')"` | 0 | ✅ pass | 1000ms |
| 3 | `cd backend && python -c "from schemas import PersonalityProfile; print('validator OK')"` | 0 | ✅ pass | 500ms |
| 4 | `grep -q 'extract-profile' backend/routers/admin.py` | 0 | ✅ pass | 50ms |
| 5 | `cd backend && python -c "from models import Creator; assert hasattr(Creator, 'personality_profile')"` | 0 | ✅ pass | 500ms |
| 6 | `cd backend && python -c "from schemas import CreatorDetail; assert 'personality_profile' in CreatorDetail.model_fields"` | 0 | ✅ pass | 500ms |
| 7 | `test -f alembic/versions/023_add_personality_profile.py` | 0 | ✅ pass | 50ms |
| 8 | `grep -q 'personality_profile' backend/routers/creators.py` | 0 | ✅ pass | 50ms |
## Deviations
None.
## Known Issues
None.
## Files Created/Modified
- `prompts/personality_extraction.txt`
- `backend/pipeline/stages.py`
- `backend/schemas.py`
- `backend/routers/admin.py`


@@ -2592,3 +2592,271 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
raise self.retry(exc=exc)
finally:
session.close()
# ── Personality profile extraction ───────────────────────────────────────────
def _sample_creator_transcripts(
moments: list,
creator_id: str,
max_chars: int = 40000,
) -> tuple[str, int]:
"""Sample transcripts from a creator's key moments, respecting size tiers.
- Small (<20K chars total): use all text.
- Medium (20K-60K): first 300 chars from each moment, up to budget.
- Large (>60K): random sample seeded by creator_id, attempts topic diversity
via Redis classification data.
Returns (sampled_text, total_char_count).
"""
import random
transcripts = [
(m.source_video_id, m.raw_transcript)
for m in moments
if m.raw_transcript and m.raw_transcript.strip()
]
if not transcripts:
return ("", 0)
total_chars = sum(len(t) for _, t in transcripts)
# Small: use everything
if total_chars <= 20_000:
text = "\n\n---\n\n".join(t for _, t in transcripts)
return (text, total_chars)
# Medium: first 300 chars from each moment
if total_chars <= 60_000:
excerpts = []
budget = max_chars
for _, t in transcripts:
chunk = t[:300]
if budget - len(chunk) < 0:
break
excerpts.append(chunk)
budget -= len(chunk)
text = "\n\n---\n\n".join(excerpts)
return (text, total_chars)
# Large: random sample with optional topic diversity from Redis
topic_map: dict[str, list[tuple[str, str]]] = {}
try:
import redis as _redis
settings = get_settings()
r = _redis.from_url(settings.redis_url)
video_ids = {str(vid) for vid, _ in transcripts}
for vid in video_ids:
raw = r.get(f"chrysopedia:classification:{vid}")
if raw:
classification = json.loads(raw)
if isinstance(classification, list):
for item in classification:
cat = item.get("topic_category", "unknown")
moment_id = item.get("moment_id")
if moment_id:
topic_map.setdefault(cat, []).append(moment_id)
r.close()
except Exception:
# Fall back to random sampling without topic diversity
pass
rng = random.Random(creator_id)
if topic_map:
# Interleave from different categories for diversity
ordered = []
cat_lists = list(topic_map.values())
rng.shuffle(cat_lists)
idx = 0
while any(cat_lists):
for cat in cat_lists:
if cat:
ordered.append(cat.pop(0))
cat_lists = [c for c in cat_lists if c]
# Map moment IDs back to transcripts
moment_lookup = {str(m.id): m.raw_transcript for m in moments if m.raw_transcript}
diverse_transcripts = [
moment_lookup[mid] for mid in ordered if mid in moment_lookup
]
if diverse_transcripts:
transcripts_list = diverse_transcripts
else:
transcripts_list = [t for _, t in transcripts]
else:
transcripts_list = [t for _, t in transcripts]
rng.shuffle(transcripts_list)
excerpts = []
budget = max_chars
for t in transcripts_list:
chunk = t[:600]
if budget - len(chunk) < 0:
break
excerpts.append(chunk)
budget -= len(chunk)
text = "\n\n---\n\n".join(excerpts)
return (text, total_chars)
@celery_app.task(bind=True, max_retries=2, default_retry_delay=60)
def extract_personality_profile(self, creator_id: str) -> str:
"""Extract a personality profile from a creator's transcripts via LLM.
Aggregates and samples transcripts from all of the creator's key moments,
sends them to the LLM with the personality_extraction prompt, validates
the response, and stores the profile as JSONB on Creator.personality_profile.
Returns the creator_id for chain compatibility.
"""
from datetime import datetime, timezone
start = time.monotonic()
logger.info("Personality extraction starting for creator_id=%s", creator_id)
_emit_event(creator_id, "personality_extraction", "start")
session = _get_sync_session()
try:
# Load creator
creator = session.execute(
select(Creator).where(Creator.id == creator_id)
).scalar_one_or_none()
if not creator:
logger.error("Creator not found: %s", creator_id)
_emit_event(
creator_id, "personality_extraction", "error",
payload={"error": "creator_not_found"},
)
return creator_id
# Load all key moments with transcripts for this creator
moments = (
session.execute(
select(KeyMoment)
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
.where(SourceVideo.creator_id == creator.id)
.where(KeyMoment.raw_transcript.isnot(None))
)
.scalars()
.all()
)
if not moments:
logger.warning(
"No transcripts found for creator_id=%s (%s), skipping extraction",
creator_id, creator.name,
)
_emit_event(
creator_id, "personality_extraction", "complete",
payload={"skipped": True, "reason": "no_transcripts"},
)
return creator_id
# Sample transcripts
sampled_text, total_chars = _sample_creator_transcripts(
moments, creator_id,
)
if not sampled_text.strip():
logger.warning(
"Empty transcript sample for creator_id=%s, skipping", creator_id,
)
_emit_event(
creator_id, "personality_extraction", "complete",
payload={"skipped": True, "reason": "empty_sample"},
)
return creator_id
# Load prompt and call LLM
system_prompt = _load_prompt("personality_extraction.txt")
user_prompt = (
f"Creator: {creator.name}\n\n"
f"Transcript excerpts ({len(moments)} moments, {total_chars} total chars, "
f"sample below):\n\n{sampled_text}"
)
llm = _get_llm_client()
callback = _make_llm_callback(
creator_id, "personality_extraction",
system_prompt=system_prompt,
user_prompt=user_prompt,
)
response = llm.complete(
system_prompt=system_prompt,
user_prompt=user_prompt,
response_model=object, # triggers JSON mode
on_complete=callback,
)
# Parse and validate
from schemas import PersonalityProfile as ProfileValidator
try:
raw_profile = json.loads(str(response))
except json.JSONDecodeError as jde:
logger.warning(
"LLM returned invalid JSON for creator_id=%s, retrying: %s",
creator_id, jde,
)
raise self.retry(exc=jde)
try:
validated = ProfileValidator.model_validate(raw_profile)
except ValidationError as ve:
logger.warning(
"LLM profile failed validation for creator_id=%s, retrying: %s",
creator_id, ve,
)
raise self.retry(exc=ve)
# Build final profile dict with metadata
profile_dict = validated.model_dump()
profile_dict["_metadata"] = {
"extracted_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
"transcript_sample_size": total_chars,
"moments_count": len(moments),
"model_used": getattr(response, "finish_reason", None) or "unknown",
}
# Low sample size note
if total_chars < 500:
profile_dict["_metadata"]["low_sample_size"] = True
# Store on creator
creator.personality_profile = profile_dict
session.commit()
elapsed = time.monotonic() - start
_emit_event(
creator_id, "personality_extraction", "complete",
duration_ms=int(elapsed * 1000),
payload={
"moments_count": len(moments),
"transcript_chars": total_chars,
"sample_chars": len(sampled_text),
},
)
logger.info(
"Personality extraction completed for creator_id=%s (%s) in %.1fs — "
"%d moments, %d chars sampled",
creator_id, creator.name, elapsed, len(moments), len(sampled_text),
)
return creator_id
except Exception as exc:
from celery.exceptions import Retry
# Re-raise Celery's control-flow exceptions: Retry (raised by the
# self.retry calls above) and MaxRetriesExceededError must propagate,
# not be logged as errors and retried a second time.
if isinstance(exc, (Retry, self.MaxRetriesExceededError)):
raise
session.rollback()
_emit_event(
creator_id, "personality_extraction", "error",
payload={"error": str(exc)[:500]},
)
logger.error(
"Personality extraction failed for creator_id=%s: %s", creator_id, exc,
)
raise self.retry(exc=exc)
finally:
session.close()


@@ -236,3 +236,29 @@ async def get_impersonation_log(
)
for log, admin_name, target_name in rows
]
@router.post("/creators/{slug}/extract-profile")
async def extract_creator_profile(
slug: str,
_admin: Annotated[User, Depends(_require_admin)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""Queue personality profile extraction for a creator. Admin only."""
from models import Creator
result = await session.execute(
select(Creator).where(Creator.slug == slug)
)
creator = result.scalar_one_or_none()
if creator is None:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Creator not found: {slug}",
)
from pipeline.stages import extract_personality_profile
extract_personality_profile.delay(str(creator.id))
logger.info("Queued personality extraction for creator=%s (%s)", slug, creator.id)
return {"status": "queued", "creator_id": str(creator.id)}


@@ -732,3 +732,40 @@ class FollowedCreatorItem(BaseModel):
creator_name: str
creator_slug: str
followed_at: datetime
# ── Personality Profile (LLM output validation) ─────────────────────────────
class VocabularyProfile(BaseModel):
signature_phrases: list[str] = []
jargon_level: str = "mixed"
filler_words: list[str] = []
distinctive_terms: list[str] = []
sound_descriptions: list[str] = []
class ToneProfile(BaseModel):
formality: str = "conversational"
energy: str = "moderate"
humor: str = "none"
teaching_style: str = ""
descriptors: list[str] = []
class StyleMarkersProfile(BaseModel):
explanation_approach: str = "step-by-step"
uses_analogies: bool = False
analogy_examples: list[str] = []
sound_words: list[str] = []
self_references: str = ""
audience_engagement: str = ""
pacing: str = "moderate"
class PersonalityProfile(BaseModel):
"""Validates LLM-generated personality profile before storage."""
vocabulary: VocabularyProfile = Field(default_factory=VocabularyProfile)
tone: ToneProfile = Field(default_factory=ToneProfile)
style_markers: StyleMarkersProfile = Field(default_factory=StyleMarkersProfile)
summary: str = ""
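Because every field in these models carries a default, a partial or loosely-structured LLM payload validates instead of failing the task, and omitted fields fall back to sensible values. A trimmed-down sketch of the pattern (illustrative subset of the models above):

```python
from pydantic import BaseModel, Field

class ToneProfile(BaseModel):
    formality: str = "conversational"
    energy: str = "moderate"
    humor: str = "none"

class PersonalityProfile(BaseModel):
    # default_factory ensures a missing "tone" key yields a fully
    # defaulted sub-model rather than a validation error.
    tone: ToneProfile = Field(default_factory=ToneProfile)
    summary: str = ""

# A partial LLM payload still validates; only supplied keys override defaults.
profile = PersonalityProfile.model_validate({"tone": {"humor": "frequent"}})
assert profile.tone.humor == "frequent"
assert profile.tone.energy == "moderate"  # default preserved
assert profile.summary == ""
```

This tolerance is a deliberate trade-off: a retry is only triggered when the payload is structurally invalid, not merely incomplete.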


@@ -0,0 +1,42 @@
You are a music production educator analyst. You will receive transcript excerpts from a single creator's tutorials. Your task is to identify what makes this creator's communication style DISTINCTIVE — not universal traits shared by all educators.
Analyze the transcripts for:
1. **Vocabulary patterns**: Signature phrases they repeat, jargon level (beginner-friendly vs advanced), filler words or verbal tics, distinctive terminology or invented words, how they name sounds or techniques.
2. **Tone**: Formality level, energy (calm/methodical vs enthusiastic/hype), humor style (dry, self-deprecating, none), teaching warmth, use of encouragement or critique.
3. **Style markers**: How they explain concepts (step-by-step vs intuitive/exploratory), use of analogies or metaphors, onomatopoeia or sound words, self-references and personal anecdotes, how they address the audience, pacing and rhythm of explanation.
Focus on what makes THIS creator stand out. Ignore generic traits like "knowledgeable about music production" or "explains things clearly" — those apply to everyone.
You MUST respond with ONLY valid JSON matching this exact structure:
{
"vocabulary": {
"signature_phrases": ["phrase1", "phrase2"],
"jargon_level": "beginner-friendly | intermediate | advanced | mixed",
"filler_words": ["um", "like"],
"distinctive_terms": ["term1", "term2"],
"sound_descriptions": ["how they describe sounds"]
},
"tone": {
"formality": "casual | conversational | professional | academic",
"energy": "calm | moderate | high | variable",
"humor": "none | occasional | frequent | core-style",
"teaching_style": "one short descriptor, e.g. 'encouraging coach' or 'no-nonsense mentor'",
"descriptors": ["adjective1", "adjective2", "adjective3"]
},
"style_markers": {
"explanation_approach": "step-by-step | exploratory | demo-first | theory-then-practice",
"uses_analogies": true,
"analogy_examples": ["example1"],
"sound_words": ["onomatopoeia they use"],
"self_references": "how they reference themselves or their experience",
"audience_engagement": "how they address/involve the viewer",
"pacing": "fast | moderate | slow | variable"
},
"summary": "One paragraph (3-5 sentences) capturing what makes this creator's voice distinctive. Be specific — reference actual phrases or patterns from the transcripts."
}
No markdown code fences, no explanation, no preamble — just the raw JSON object.