feat: Added personality extraction pipeline: prompt template, 3-tier tr…

- "prompts/personality_extraction.txt"
- "backend/pipeline/stages.py"
- "backend/schemas.py"
- "backend/routers/admin.py"

GSD-Task: S06/T02
jlightner 2026-04-04 08:28:18 +00:00
parent 10cd175333
commit 2d9076ae92
7 changed files with 491 additions and 1 deletion


@@ -46,7 +46,7 @@ Add the `personality_profile` JSONB column to the Creator model, create the Alem
- Estimate: 30m
- Files: backend/models.py, backend/schemas.py, backend/routers/creators.py, alembic/versions/023_add_personality_profile.py
- Verify: cd backend && python -c "from models import Creator; assert hasattr(Creator, 'personality_profile'); print('model OK')" && python -c "from schemas import CreatorDetail; assert 'personality_profile' in CreatorDetail.model_fields; print('schema OK')" && test -f ../alembic/versions/023_add_personality_profile.py && echo 'migration exists'
- [x] **T02: Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint** — ## Description
Build the core extraction pipeline: a prompt template that analyzes creator transcripts for distinctive personality markers, a Celery task that aggregates and samples transcripts then calls the LLM, and an admin endpoint to trigger extraction. Follows existing stage patterns in `pipeline/stages.py`.


@@ -0,0 +1,30 @@
{
"schemaVersion": 1,
"taskId": "T01",
"unitId": "M022/S06/T01",
"timestamp": 1775291084955,
"passed": false,
"discoverySource": "task-plan",
"checks": [
{
"command": "cd backend",
"exitCode": 0,
"durationMs": 8,
"verdict": "pass"
},
{
"command": "test -f ../alembic/versions/023_add_personality_profile.py",
"exitCode": 1,
"durationMs": 7,
"verdict": "fail"
},
{
"command": "echo 'migration exists'",
"exitCode": 0,
"durationMs": 6,
"verdict": "pass"
}
],
"retryAttempt": 1,
"maxRetries": 2
}


@@ -0,0 +1,87 @@
---
id: T02
parent: S06
milestone: M022
provides: []
requires: []
affects: []
key_files: ["prompts/personality_extraction.txt", "backend/pipeline/stages.py", "backend/schemas.py", "backend/routers/admin.py"]
key_decisions: ["Used response_model=object to trigger JSON mode with manual parse + Pydantic validation for clearer error handling on nested schema"]
patterns_established: []
drill_down_paths: []
observability_surfaces: []
duration: ""
verification_result: "All 8 verification checks pass: prompt file exists, task importable, validator importable, endpoint wired, model has attribute, schema has field, migration file exists, router references personality_profile."
completed_at: 2026-04-04T08:28:14.600Z
blocker_discovered: false
---
# T02: Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint
> Added personality extraction pipeline: prompt template, 3-tier transcript sampling, Celery task with retry/validation, and admin trigger endpoint
## What Happened
Created the personality extraction prompt at prompts/personality_extraction.txt instructing the LLM to focus on distinctive traits and return structured JSON. Added _sample_creator_transcripts() with three tiers: small uses all text, medium takes 300-char excerpts, large does topic-diverse random sampling via Redis with deterministic seed. The extract_personality_profile Celery task loads creator's key moments via SourceVideo join, samples transcripts, calls LLM, validates response with PersonalityProfile Pydantic model, attaches metadata, and stores on Creator.personality_profile. Handles zero-transcript creators (early return), invalid JSON (retry), and validation failures (retry). Added PersonalityProfile with nested sub-models in schemas.py. Added POST /admin/creators/{slug}/extract-profile endpoint in admin.py.
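The large-tier sampling described above hinges on seeding the RNG with the creator_id, so repeated extractions of the same creator see the same excerpts. A minimal sketch of that idea (simplified: plain strings instead of KeyMoment rows, and no Redis topic-diversity pass):

```python
import random

def sample_excerpts(transcripts: list[str], creator_id: str,
                    max_chars: int = 40_000, excerpt_len: int = 600) -> str:
    # Seeding with creator_id makes the shuffle deterministic per creator,
    # so re-running extraction produces the same sample.
    rng = random.Random(creator_id)
    pool = list(transcripts)
    rng.shuffle(pool)
    excerpts, budget = [], max_chars
    for t in pool:
        chunk = t[:excerpt_len]
        if len(chunk) > budget:
            break  # character budget exhausted
        excerpts.append(chunk)
        budget -= len(chunk)
    return "\n\n---\n\n".join(excerpts)

a = sample_excerpts(["intro take", "bass tutorial", "mix walkthrough"], "creator-123")
b = sample_excerpts(["intro take", "bass tutorial", "mix walkthrough"], "creator-123")
assert a == b  # same creator, same sample
```

Determinism here matters for debugging: if a profile looks off, re-running the task reproduces the exact transcript sample the LLM saw.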
## Verification
All 8 verification checks pass: prompt file exists, task importable, validator importable, endpoint wired, model has attribute, schema has field, migration file exists, router references personality_profile.
## Verification Evidence
| # | Command | Exit Code | Verdict | Duration |
|---|---------|-----------|---------|----------|
| 1 | `test -f prompts/personality_extraction.txt` | 0 | ✅ pass | 50ms |
| 2 | `cd backend && python -c "from pipeline.stages import extract_personality_profile; print('task OK')"` | 0 | ✅ pass | 1000ms |
| 3 | `cd backend && python -c "from schemas import PersonalityProfile; print('validator OK')"` | 0 | ✅ pass | 500ms |
| 4 | `grep -q 'extract-profile' backend/routers/admin.py` | 0 | ✅ pass | 50ms |
| 5 | `cd backend && python -c "from models import Creator; assert hasattr(Creator, 'personality_profile')"` | 0 | ✅ pass | 500ms |
| 6 | `cd backend && python -c "from schemas import CreatorDetail; assert 'personality_profile' in CreatorDetail.model_fields"` | 0 | ✅ pass | 500ms |
| 7 | `test -f alembic/versions/023_add_personality_profile.py` | 0 | ✅ pass | 50ms |
| 8 | `grep -q 'personality_profile' backend/routers/creators.py` | 0 | ✅ pass | 50ms |
## Deviations
None.
## Known Issues
None.
## Files Created/Modified
- `prompts/personality_extraction.txt`
- `backend/pipeline/stages.py`
- `backend/schemas.py`
- `backend/routers/admin.py`


@@ -2592,3 +2592,271 @@ def stage_highlight_detection(self, video_id: str, run_id: str | None = None) ->
raise self.retry(exc=exc)
finally:
session.close()
# ── Personality profile extraction ───────────────────────────────────────────
def _sample_creator_transcripts(
moments: list,
creator_id: str,
max_chars: int = 40000,
) -> tuple[str, int]:
"""Sample transcripts from a creator's key moments, respecting size tiers.
- Small (<20K chars total): use all text.
- Medium (20K-60K): first 300 chars from each moment, up to budget.
- Large (>60K): random sample seeded by creator_id, attempts topic diversity
via Redis classification data.
Returns (sampled_text, total_char_count).
"""
import random
transcripts = [
(m.source_video_id, m.raw_transcript)
for m in moments
if m.raw_transcript and m.raw_transcript.strip()
]
if not transcripts:
return ("", 0)
total_chars = sum(len(t) for _, t in transcripts)
# Small: use everything
if total_chars <= 20_000:
text = "\n\n---\n\n".join(t for _, t in transcripts)
return (text, total_chars)
# Medium: first 300 chars from each moment
if total_chars <= 60_000:
excerpts = []
budget = max_chars
for _, t in transcripts:
chunk = t[:300]
if budget - len(chunk) < 0:
break
excerpts.append(chunk)
budget -= len(chunk)
text = "\n\n---\n\n".join(excerpts)
return (text, total_chars)
# Large: random sample with optional topic diversity from Redis
topic_map: dict[str, list[tuple[str, str]]] = {}
try:
import redis as _redis
settings = get_settings()
r = _redis.from_url(settings.redis_url)
video_ids = {str(vid) for vid, _ in transcripts}
for vid in video_ids:
raw = r.get(f"chrysopedia:classification:{vid}")
if raw:
classification = json.loads(raw)
if isinstance(classification, list):
for item in classification:
cat = item.get("topic_category", "unknown")
moment_id = item.get("moment_id")
if moment_id:
topic_map.setdefault(cat, []).append(moment_id)
r.close()
except Exception:
# Fall back to random sampling without topic diversity
pass
rng = random.Random(creator_id)
if topic_map:
# Interleave from different categories for diversity
ordered = []
cat_lists = list(topic_map.values())
rng.shuffle(cat_lists)
idx = 0
while any(cat_lists):
for cat in cat_lists:
if cat:
ordered.append(cat.pop(0))
cat_lists = [c for c in cat_lists if c]
# Map moment IDs back to transcripts
moment_lookup = {str(m.id): m.raw_transcript for m in moments if m.raw_transcript}
diverse_transcripts = [
moment_lookup[mid] for mid in ordered if mid in moment_lookup
]
if diverse_transcripts:
transcripts_list = diverse_transcripts
else:
transcripts_list = [t for _, t in transcripts]
else:
transcripts_list = [t for _, t in transcripts]
rng.shuffle(transcripts_list)
excerpts = []
budget = max_chars
for t in transcripts_list:
chunk = t[:600]
if budget - len(chunk) < 0:
break
excerpts.append(chunk)
budget -= len(chunk)
text = "\n\n---\n\n".join(excerpts)
return (text, total_chars)
@celery_app.task(bind=True, max_retries=2, default_retry_delay=60)
def extract_personality_profile(self, creator_id: str) -> str:
"""Extract a personality profile from a creator's transcripts via LLM.
Aggregates and samples transcripts from all of the creator's key moments,
sends them to the LLM with the personality_extraction prompt, validates
the response, and stores the profile as JSONB on Creator.personality_profile.
Returns the creator_id for chain compatibility.
"""
from datetime import datetime, timezone
start = time.monotonic()
logger.info("Personality extraction starting for creator_id=%s", creator_id)
_emit_event(creator_id, "personality_extraction", "start")
session = _get_sync_session()
try:
# Load creator
creator = session.execute(
select(Creator).where(Creator.id == creator_id)
).scalar_one_or_none()
if not creator:
logger.error("Creator not found: %s", creator_id)
_emit_event(
creator_id, "personality_extraction", "error",
payload={"error": "creator_not_found"},
)
return creator_id
# Load all key moments with transcripts for this creator
moments = (
session.execute(
select(KeyMoment)
.join(SourceVideo, KeyMoment.source_video_id == SourceVideo.id)
.where(SourceVideo.creator_id == creator.id)
.where(KeyMoment.raw_transcript.isnot(None))
)
.scalars()
.all()
)
if not moments:
logger.warning(
"No transcripts found for creator_id=%s (%s), skipping extraction",
creator_id, creator.name,
)
_emit_event(
creator_id, "personality_extraction", "complete",
payload={"skipped": True, "reason": "no_transcripts"},
)
return creator_id
# Sample transcripts
sampled_text, total_chars = _sample_creator_transcripts(
moments, creator_id,
)
if not sampled_text.strip():
logger.warning(
"Empty transcript sample for creator_id=%s, skipping", creator_id,
)
_emit_event(
creator_id, "personality_extraction", "complete",
payload={"skipped": True, "reason": "empty_sample"},
)
return creator_id
# Load prompt and call LLM
system_prompt = _load_prompt("personality_extraction.txt")
user_prompt = (
f"Creator: {creator.name}\n\n"
f"Transcript excerpts ({len(moments)} moments, {total_chars} total chars, "
f"sample below):\n\n{sampled_text}"
)
llm = _get_llm_client()
callback = _make_llm_callback(
creator_id, "personality_extraction",
system_prompt=system_prompt,
user_prompt=user_prompt,
)
response = llm.complete(
system_prompt=system_prompt,
user_prompt=user_prompt,
response_model=object, # triggers JSON mode
on_complete=callback,
)
# Parse and validate
from schemas import PersonalityProfile as ProfileValidator
try:
raw_profile = json.loads(str(response))
except json.JSONDecodeError as jde:
logger.warning(
"LLM returned invalid JSON for creator_id=%s, retrying: %s",
creator_id, jde,
)
raise self.retry(exc=jde)
try:
validated = ProfileValidator.model_validate(raw_profile)
except ValidationError as ve:
logger.warning(
"LLM profile failed validation for creator_id=%s, retrying: %s",
creator_id, ve,
)
raise self.retry(exc=ve)
# Build final profile dict with metadata
profile_dict = validated.model_dump()
profile_dict["_metadata"] = {
"extracted_at": datetime.now(timezone.utc).replace(tzinfo=None).isoformat(),
"transcript_sample_size": total_chars,
"moments_count": len(moments),
"model_used": getattr(response, "finish_reason", None) or "unknown",
}
# Low sample size note
if total_chars < 500:
profile_dict["_metadata"]["low_sample_size"] = True
# Store on creator
creator.personality_profile = profile_dict
session.commit()
elapsed = time.monotonic() - start
_emit_event(
creator_id, "personality_extraction", "complete",
duration_ms=int(elapsed * 1000),
payload={
"moments_count": len(moments),
"transcript_chars": total_chars,
"sample_chars": len(sampled_text),
},
)
logger.info(
"Personality extraction completed for creator_id=%s (%s) in %.1fs — "
"%d moments, %d chars sampled",
creator_id, creator.name, elapsed, len(moments), len(sampled_text),
)
return creator_id
except Exception as exc:
from celery.exceptions import Retry
# Re-raise Celery's control-flow exceptions: Retry (raised by the
# self.retry calls above) and MaxRetriesExceededError must propagate,
# not be logged as errors and retried a second time.
if isinstance(exc, (Retry, self.MaxRetriesExceededError)):
raise
session.rollback()
_emit_event(
creator_id, "personality_extraction", "error",
payload={"error": str(exc)[:500]},
)
logger.error(
"Personality extraction failed for creator_id=%s: %s", creator_id, exc,
)
raise self.retry(exc=exc)
finally:
session.close()


@@ -236,3 +236,29 @@ async def get_impersonation_log(
)
for log, admin_name, target_name in rows
]
@router.post("/creators/{slug}/extract-profile")
async def extract_creator_profile(
slug: str,
_admin: Annotated[User, Depends(_require_admin)],
session: Annotated[AsyncSession, Depends(get_session)],
):
"""Queue personality profile extraction for a creator. Admin only."""
from models import Creator
result = await session.execute(
select(Creator).where(Creator.slug == slug)
)
creator = result.scalar_one_or_none()
if creator is None:
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Creator not found: {slug}",
)
from pipeline.stages import extract_personality_profile
extract_personality_profile.delay(str(creator.id))
logger.info("Queued personality extraction for creator=%s (%s)", slug, creator.id)
return {"status": "queued", "creator_id": str(creator.id)}


@@ -732,3 +732,40 @@ class FollowedCreatorItem(BaseModel):
creator_name: str
creator_slug: str
followed_at: datetime
# ── Personality Profile (LLM output validation) ─────────────────────────────
class VocabularyProfile(BaseModel):
signature_phrases: list[str] = []
jargon_level: str = "mixed"
filler_words: list[str] = []
distinctive_terms: list[str] = []
sound_descriptions: list[str] = []
class ToneProfile(BaseModel):
formality: str = "conversational"
energy: str = "moderate"
humor: str = "none"
teaching_style: str = ""
descriptors: list[str] = []
class StyleMarkersProfile(BaseModel):
explanation_approach: str = "step-by-step"
uses_analogies: bool = False
analogy_examples: list[str] = []
sound_words: list[str] = []
self_references: str = ""
audience_engagement: str = ""
pacing: str = "moderate"
class PersonalityProfile(BaseModel):
"""Validates LLM-generated personality profile before storage."""
vocabulary: VocabularyProfile = Field(default_factory=VocabularyProfile)
tone: ToneProfile = Field(default_factory=ToneProfile)
style_markers: StyleMarkersProfile = Field(default_factory=StyleMarkersProfile)
summary: str = ""
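Because every field in these models carries a default, a partial or loosely-structured LLM payload validates instead of failing the task, and omitted fields fall back to sensible values. A trimmed-down sketch of the pattern (illustrative subset of the models above):

```python
from pydantic import BaseModel, Field

class ToneProfile(BaseModel):
    formality: str = "conversational"
    energy: str = "moderate"
    humor: str = "none"

class PersonalityProfile(BaseModel):
    # default_factory ensures a missing "tone" key yields a fully
    # defaulted sub-model rather than a validation error.
    tone: ToneProfile = Field(default_factory=ToneProfile)
    summary: str = ""

# A partial LLM payload still validates; only supplied keys override defaults.
profile = PersonalityProfile.model_validate({"tone": {"humor": "frequent"}})
assert profile.tone.humor == "frequent"
assert profile.tone.energy == "moderate"  # default preserved
assert profile.summary == ""
```

This tolerance is a deliberate trade-off: a retry is only triggered when the payload is structurally invalid, not merely incomplete.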


@@ -0,0 +1,42 @@
You are a music production educator analyst. You will receive transcript excerpts from a single creator's tutorials. Your task is to identify what makes this creator's communication style DISTINCTIVE — not universal traits shared by all educators.
Analyze the transcripts for:
1. **Vocabulary patterns**: Signature phrases they repeat, jargon level (beginner-friendly vs advanced), filler words or verbal tics, distinctive terminology or invented words, how they name sounds or techniques.
2. **Tone**: Formality level, energy (calm/methodical vs enthusiastic/hype), humor style (dry, self-deprecating, none), teaching warmth, use of encouragement or critique.
3. **Style markers**: How they explain concepts (step-by-step vs intuitive/exploratory), use of analogies or metaphors, onomatopoeia or sound words, self-references and personal anecdotes, how they address the audience, pacing and rhythm of explanation.
Focus on what makes THIS creator stand out. Ignore generic traits like "knowledgeable about music production" or "explains things clearly" — those apply to everyone.
You MUST respond with ONLY valid JSON matching this exact structure:
{
"vocabulary": {
"signature_phrases": ["phrase1", "phrase2"],
"jargon_level": "beginner-friendly | intermediate | advanced | mixed",
"filler_words": ["um", "like"],
"distinctive_terms": ["term1", "term2"],
"sound_descriptions": ["how they describe sounds"]
},
"tone": {
"formality": "casual | conversational | professional | academic",
"energy": "calm | moderate | high | variable",
"humor": "none | occasional | frequent | core-style",
"teaching_style": "one short descriptor, e.g. 'encouraging coach' or 'no-nonsense mentor'",
"descriptors": ["adjective1", "adjective2", "adjective3"]
},
"style_markers": {
"explanation_approach": "step-by-step | exploratory | demo-first | theory-then-practice",
"uses_analogies": true,
"analogy_examples": ["example1"],
"sound_words": ["onomatopoeia they use"],
"self_references": "how they reference themselves or their experience",
"audience_engagement": "how they address/involve the viewer",
"pacing": "fast | moderate | slow | variable"
},
"summary": "One paragraph (3-5 sentences) capturing what makes this creator's voice distinctive. Be specific — reference actual phrases or patterns from the transcripts."
}
No markdown code fences, no explanation, no preamble — just the raw JSON object.