chrysopedia/whisper/README.md

# Chrysopedia — Whisper Transcription

Desktop transcription tool for extracting timestamped text from video files
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
an NVIDIA GPU (e.g., RTX 4090).

## Prerequisites

- **Python 3.10+**
- **ffmpeg** installed and on PATH
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)

### Install ffmpeg

```bash
# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg
```

### Install Python dependencies

```bash
pip install -r requirements.txt
```

## Usage

### Single file

```bash
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```

### Batch mode (all videos in a directory)

```bash
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

### Options

| Flag            | Default     | Description                                     |
| --------------- | ----------- | ----------------------------------------------- |
| `--input`       | (required)  | Path to a video file or directory of videos      |
| `--output-dir`  | (required)  | Directory to write transcript JSON files         |
| `--model`       | `large-v3`  | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device`      | `cuda`      | Compute device (`cuda` or `cpu`)                 |
| `--creator`     | (inferred)  | Override creator folder name in output JSON      |
| `-v, --verbose` | off         | Enable debug logging                             |

## Output Format

Each video produces a JSON file matching the Chrysopedia spec:

```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```

## Resumability

The script automatically skips videos whose output JSON already exists. To
re-transcribe a file, delete its output JSON first.

## Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
A 2-hour video takes ~6–12 minutes. For 300 videos averaging 1.5 hours each,
the initial transcription pass takes roughly 15–40 hours of GPU time.

## Directory Convention

The script infers the `creator_folder` field from the parent directory of each
video file. Organize videos like:

```
videos/
├── Skope/
│   ├── Sound Design Masterclass pt1.mp4
│   └── Sound Design Masterclass pt2.mp4
├── Mr Bill/
│   └── Glitch Techniques.mp4
```

Override with `--creator` when processing files outside this structure.