
# Chrysopedia — Whisper Transcription

Desktop transcription tool for extracting timestamped text from video files using OpenAI's Whisper model (`large-v3`). Designed to run on a machine with an NVIDIA GPU (e.g., RTX 4090).

## Prerequisites

- Python 3.10+
- `ffmpeg` installed and on `PATH`
- NVIDIA GPU with CUDA support (recommended; CPU fallback available)

### Install ffmpeg

```sh
# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (via Chocolatey or manual install)
choco install ffmpeg
```

### Install Python dependencies

```sh
pip install -r requirements.txt

# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
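After installing, you can sanity-check which device Whisper will actually use (a minimal sketch; it falls back to CPU when torch is missing or CUDA is unavailable):

```python
# Report the compute device available to Whisper.
# Falls back to "cpu" if torch is not installed or CUDA is unavailable.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # torch not installed yet
print(f"Whisper will run on: {device}")
```

If this prints `cpu` on the GPU machine, the CUDA wheels above were likely not picked up.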

## Usage

### Single file

```sh
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```

### Batch mode (all videos in a directory)

```sh
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

### Mass batch mode (recursive, multi-creator)

For large content libraries with nested subdirectories per creator:

```sh
python batch_transcribe.py \
  --content-root "A:\Education\Artist Streams & Content" \
  --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
  --python C:\Users\jlightner\.conda\envs\transcribe\python.exe

# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```

`batch_transcribe.py` recursively walks all subdirectories, discovers video files, and calls `transcribe.py` for each directory. The `creator_folder` field in the output JSON is set to the top-level subdirectory name (the artist/creator). The output directory structure mirrors the source hierarchy.
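A minimal sketch of that discovery logic (illustrative only; `VIDEO_EXTS` and the function name are assumptions, not the actual script's internals):

```python
from pathlib import Path

# Extensions treated as video files (assumed set; extend as needed).
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}

def discover_videos(content_root):
    """Yield (creator_folder, video_path) pairs for every video under content_root.

    The creator is the top-level subdirectory name, no matter how deeply
    the video is nested beneath it.
    """
    root = Path(content_root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            creator = path.relative_to(root).parts[0]
            yield creator, path
```

For `Skope/Youtube/clip.mp4` under the content root, this yields creator `Skope`, matching the top-level-subdirectory rule above.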

A `batch_manifest.json` is written to the output root on completion with timing, per-creator results, and error details.
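The exact manifest schema isn't documented here; purely as an illustration, it might look something like the following (every field name below is an assumption):

```json
{
  "started_at": "2026-03-30T18:00:00Z",
  "finished_at": "2026-03-31T02:10:00Z",
  "creators": {
    "Skope": { "transcribed": 12, "skipped": 3, "failed": 0 }
  },
  "errors": []
}
```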

## Options (transcribe.py)

| Flag | Default | Description |
|------|---------|-------------|
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |

## Options (batch_transcribe.py)

| Flag | Default | Description |
|------|---------|-------------|
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to `transcribe.py` (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |

## Output Format

Each video produces a JSON file matching the Chrysopedia pipeline spec:

```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```

This format is consumed directly by Chrysopedia pipeline stage 2 (transcript segmentation) via the `POST /api/v1/ingest` endpoint.
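As an illustration, a transcript file could be posted to that endpoint with the standard library alone. The base URL here is a placeholder and the request shape is an assumption; only the `POST /api/v1/ingest` path comes from the spec above:

```python
import json
import urllib.request

def build_ingest_request(transcript_path, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for one transcript JSON file."""
    with open(transcript_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early on malformed JSON before hitting the API
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it:
#   urllib.request.urlopen(build_ingest_request("transcripts/Skope/clip.json"))
```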

## Resumability

Both scripts automatically skip videos whose output JSON already exists. To re-transcribe a file, delete its output JSON first.
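The skip logic amounts to an existence check on the expected output path (a sketch; the real scripts may derive the output path differently):

```python
from pathlib import Path

def needs_transcription(video_path, output_dir):
    """Return True unless <output_dir>/<video stem>.json already exists."""
    out = Path(output_dir) / (Path(video_path).stem + ".json")
    return not out.exists()
```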

## Current Transcription Environment

Machine: HAL0022 (10.0.0.131)

- GPU: NVIDIA GeForce RTX 4090 (24 GB VRAM)
- OS: Windows 11
- Python: Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- CUDA: PyTorch with cu126 wheels

## Content Source

```
A:\Education\Artist Streams & Content\
├── au5/           (334 videos)
├── Keota/         (193 videos)
├── DJ Shortee/    (83 videos)
├── KOAN Sound/    (68 videos)
├── Teddy Killerz/ (62 videos)
└── ... (42 creators, 1197 videos total across 146 directories)
```

## Transcript Output Location

```
C:\Users\jlightner\chrysopedia\transcripts\
```

The directory structure mirrors the source hierarchy. Each video produces a `<filename>.json` transcript file.

**Transfer to ub01:** Transcripts need to be copied to `/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`) or via `scp`/`rsync` from a Linux machine with access to both.

## Running the Batch Job

The batch transcription runs as a Windows Scheduled Task to survive SSH disconnections:

```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"

# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"

# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30

# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```

## Scripts on HAL0022

```
C:\Users\jlightner\chrysopedia\
├── transcribe.py           # Single-file/directory transcription
├── batch_transcribe.py     # Recursive multi-creator batch runner
├── run_transcription.bat   # Batch file invoked by the scheduled task
├── launch_transcription.py # Alternative launcher (subprocess)
├── transcription.log       # Current batch run log output
└── transcripts/            # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```

## Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 10-20× real-time, so a 2-hour video takes ~6-12 minutes. For the full 1,197-video library, expect roughly 20-60 hours of GPU time, depending on average video length.

## Directory Convention

The script infers the `creator_folder` field from the parent directory of each video file (or the top-level creator folder in batch mode). Organize videos like:

```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```

Override with `--creator` when processing files outside this structure.
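In single-file mode the inference is just the parent directory's name, with `--creator` taking precedence (a sketch of that rule; the function name is made up for illustration):

```python
from pathlib import Path

def infer_creator(video_path, override=None):
    """creator_folder for one video: the --creator override if given,
    otherwise the immediate parent directory's name."""
    if override:
        return override
    return Path(video_path).parent.name
```

So `content-root/Skope/video.mp4` infers `Skope`, while a loose file with no meaningful parent should be run with an explicit `--creator`.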