jlightner edited this page 2026-04-04 03:44:23 -05:00

Personality Profiles

LLM-powered extraction of creator teaching personality from transcript analysis. Added in M022/S06.

Overview

Personality profiles capture each creator's distinctive teaching style — vocabulary patterns, tonal qualities, and stylistic markers — by analyzing their transcript corpus with a structured LLM extraction pipeline. Profiles are stored as JSONB on the Creator model and displayed on creator detail pages.

Extraction Pipeline

Transcript Sampling

Three-tier sampling strategy based on total transcript size:

Tier Condition Strategy
Small < 20K chars Use all transcript text
Medium 20K–60K chars 300-character excerpts per key moment
Large > 60K chars Topic-diverse random sampling via Redis classification data

Large-tier sampling uses deterministic seeding and pulls from across topic categories to ensure the profile reflects the creator's full range, not just their most common topic.
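The tier logic above can be sketched as follows. The size thresholds come from the table; the function name, excerpt count, and the choice of creator ID as the deterministic seed are illustrative assumptions — the real helper is _sample_creator_transcripts() in backend/pipeline/stages.py.

```python
import random

SMALL_MAX = 20_000    # below this: use everything
LARGE_MIN = 60_000    # above this: topic-diverse random sampling
EXCERPT_LEN = 300     # per-moment excerpt length (medium/large tiers)

def sample_transcripts(transcripts: list[str], creator_id: int) -> str:
    """Pick transcript text for LLM extraction based on corpus size."""
    total = sum(len(t) for t in transcripts)
    if total < SMALL_MAX:
        # Small tier: the whole corpus fits the token budget.
        return "\n\n".join(transcripts)
    if total <= LARGE_MIN:
        # Medium tier: a fixed-length excerpt per key moment.
        return "\n\n".join(t[:EXCERPT_LEN] for t in transcripts)
    # Large tier: deterministic seed so repeated runs sample the same
    # excerpts; the real task additionally spreads picks across topic
    # categories using Redis classification data.
    rng = random.Random(creator_id)
    picked = rng.sample(transcripts, k=min(len(transcripts), 20))
    return "\n\n".join(t[:EXCERPT_LEN] for t in picked)
```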

LLM Extraction

The prompt template at prompts/personality_extraction.txt instructs the LLM to analyze transcript excerpts and produce structured JSON. The LLM response is parsed and validated with a Pydantic model before storage.

Celery task: extract_personality_profile in backend/pipeline/stages.py

  • Joins KeyMoment → SourceVideo to load transcripts
  • Samples transcripts per the tier strategy
  • Calls LLM with response_model=object for JSON mode
  • Validates response with PersonalityProfile Pydantic model
  • Stores result as JSONB on Creator row
  • Emits pipeline_events for observability

Error Handling

  • Zero-transcript creators: early return, no profile
  • Invalid JSON from LLM: retry
  • Pydantic validation failure: retry
  • Pipeline events track start/complete/error

PersonalityProfile Schema

Stored as Creator.personality_profile JSONB column. Nested structure:

VocabularyProfile

Field Type Description
signature_phrases list[str] Characteristic phrases the creator uses repeatedly
jargon_level str How technical their language is (e.g., "high", "moderate")
filler_words list[str] Common filler words/phrases
distinctive_terms list[str] Unique terminology or coined phrases

ToneProfile

Field Type Description
formality str Formal to casual spectrum
energy str Energy level descriptor
humor str Humor style/frequency
teaching_style str Overall teaching approach

StyleMarkersProfile

Field Type Description
explanation_approach str How they explain concepts
analogies bool Whether they use analogies frequently
sound_words bool Whether they use onomatopoeia / sound words
audience_engagement str How they address / engage viewers

Metadata

Each profile includes extraction metadata:

Field Description
extracted_at ISO timestamp of extraction
transcript_sample_size Number of characters sampled
model_used LLM model identifier
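Putting the tables above together, the nested schema might look like the following Pydantic sketch. Field names follow the tables; exact types, optionality, and the nesting key names in backend/schemas.py may differ.

```python
from datetime import datetime

from pydantic import BaseModel, ConfigDict

class VocabularyProfile(BaseModel):
    signature_phrases: list[str]
    jargon_level: str
    filler_words: list[str]
    distinctive_terms: list[str]

class ToneProfile(BaseModel):
    formality: str
    energy: str
    humor: str
    teaching_style: str

class StyleMarkersProfile(BaseModel):
    explanation_approach: str
    analogies: bool
    sound_words: bool
    audience_engagement: str

class PersonalityProfile(BaseModel):
    # Allow the "model_used" field despite Pydantic's "model_" namespace.
    model_config = ConfigDict(protected_namespaces=())

    vocabulary: VocabularyProfile
    tone: ToneProfile
    style_markers: StyleMarkersProfile
    extracted_at: datetime
    transcript_sample_size: int
    model_used: str
```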

API

Admin Trigger

Method Path Purpose
POST /api/v1/admin/creators/{slug}/extract-profile Queue personality extraction task

Returns immediately — extraction runs asynchronously via Celery. Check pipeline_events for status.
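A minimal client sketch for queuing extraction; the endpoint path matches the table above, but the base URL and bearer-token auth scheme are placeholder assumptions about your deployment:

```python
def queue_extraction(base_url: str, slug: str, token: str,
                     session=None) -> int:
    """POST the admin trigger and return the HTTP status code.

    A 2xx response means "task queued", not "profile ready" —
    poll pipeline_events for extraction progress.
    """
    if session is None:
        import requests  # third-party; any HTTP client works
        session = requests.Session()
    resp = session.post(
        f"{base_url}/api/v1/admin/creators/{slug}/extract-profile",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    return resp.status_code
```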

Creator Detail

GET /api/v1/creators/{slug} includes personality_profile field (null if not yet extracted).

Frontend Component

PersonalityProfile.tsx — collapsible section on creator detail pages.

Layout

  • Collapsible header with chevron toggle (CSS grid-template-rows: 0fr/1fr animation)
  • Three sub-cards:
    • Teaching Style — formality, energy, humor, teaching_style, explanation_approach, audience_engagement
    • Vocabulary — jargon_level summary, signature_phrases pills, filler_words pills, distinctive_terms pills
    • Style — analogies (checkmark/cross), sound_words (checkmark/cross), summary paragraph
  • Metadata footer — extraction date, sample size

Handles null profiles gracefully (renders nothing).

Key Files

  • prompts/personality_extraction.txt — LLM prompt template
  • backend/pipeline/stages.py — extract_personality_profile Celery task, _sample_creator_transcripts() helper
  • backend/schemas.py — PersonalityProfile, VocabularyProfile, ToneProfile, StyleMarkersProfile Pydantic models
  • backend/models.py — Creator.personality_profile JSONB column
  • backend/routers/admin.py — POST /admin/creators/{slug}/extract-profile endpoint
  • backend/routers/creators.py — Passthrough in GET /creators/{slug}
  • alembic/versions/023_add_personality_profile.py — Migration
  • frontend/src/components/PersonalityProfile.tsx — Collapsible profile component
  • frontend/src/api/creators.ts — TypeScript interfaces for profile sub-objects

Design Decisions

  • 3-tier transcript sampling — Balances coverage vs. token cost. Topic-diverse random sampling for large creators prevents profile skew toward dominant topic.
  • Admin trigger endpoint — On-demand extraction rather than automatic on ingest. Profiles are expensive (large LLM call) and only needed once per creator.
  • JSONB storage — Profile schema may evolve; JSONB avoids migration for every field change.

See also: Data-Model, API-Surface, Frontend, Pipeline