chrysopedia/whisper
jlightner 56adf2f2ef test: Created desktop Whisper transcription script with single-file/bat…
- "whisper/transcribe.py"
- "whisper/requirements.txt"
- "whisper/README.md"

GSD-Task: S01/T04
2026-03-29 21:57:42 +00:00
..
README.md test: Created desktop Whisper transcription script with single-file/bat… 2026-03-29 21:57:42 +00:00
requirements.txt test: Created desktop Whisper transcription script with single-file/bat… 2026-03-29 21:57:42 +00:00
transcribe.py test: Created desktop Whisper transcription script with single-file/bat… 2026-03-29 21:57:42 +00:00

Chrysopedia — Whisper Transcription

Desktop transcription tool for extracting timestamped text from video files using OpenAI's Whisper model (large-v3). Designed to run on a machine with an NVIDIA GPU (e.g., RTX 4090).

Prerequisites

  • Python 3.10+
  • ffmpeg installed and on PATH
  • NVIDIA GPU with CUDA support (recommended; CPU fallback available)

Install ffmpeg

# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg

Install Python dependencies

pip install -r requirements.txt

Usage

Single file

python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts

Batch mode (all videos in a directory)

python transcribe.py --input ./videos/ --output-dir ./transcripts

Options

Flag Default Description
--input (required) Path to a video file or directory of videos
--output-dir (required) Directory to write transcript JSON files
--model large-v3 Whisper model name (tiny, base, small, medium, large-v3)
--device cuda Compute device (cuda or cpu)
--creator (inferred) Override creator folder name in output JSON
-v, --verbose off Enable debug logging

Output Format

Each video produces a JSON file matching the Chrysopedia spec:

{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}

Resumability

The script automatically skips videos whose output JSON already exists. To re-transcribe a file, delete its output JSON first.

Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 1020× real-time. A 2-hour video takes ~612 minutes. For 300 videos averaging 1.5 hours each, the initial transcription pass takes roughly 1540 hours of GPU time.

Directory Convention

The script infers the creator_folder field from the parent directory of each video file. Organize videos like:

videos/
├── Skope/
│   ├── Sound Design Masterclass pt1.mp4
│   └── Sound Design Masterclass pt2.mp4
├── Mr Bill/
│   └── Glitch Techniques.mp4

Override with --creator when processing files outside this structure.