# Chrysopedia — Whisper Transcription
Desktop transcription tool for extracting timestamped text from video files
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
an NVIDIA GPU (e.g., RTX 4090).
## Prerequisites
- **Python 3.10+**
- **ffmpeg** installed and on PATH
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)
### Install ffmpeg
```bash
# Debian/Ubuntu
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows (via chocolatey or manual install)
choco install ffmpeg
```
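After installing, a quick way to confirm `ffmpeg` is actually discoverable from Python (a minimal sanity check, not part of the tool itself):

```python
import shutil

# Report where ffmpeg resolved on PATH, or flag that it is missing.
ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path or "ffmpeg not found on PATH")
```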
### Install Python dependencies
```bash
pip install -r requirements.txt
# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
## Usage
### Single file
```bash
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```
### Batch mode (all videos in a directory)
```bash
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
### Mass batch mode (recursive, multi-creator)
For large content libraries with nested subdirectories per creator:
```bash
python batch_transcribe.py \
--content-root "A:\Education\Artist Streams & Content" \
--output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
--python C:\Users\jlightner\.conda\envs\transcribe\python.exe
# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```
`batch_transcribe.py` recursively walks all subdirectories, discovers video
files, and calls `transcribe.py` for each directory. The `creator_folder`
field in the output JSON is set to the top-level subdirectory name (the
artist/creator). Output directory structure mirrors the source hierarchy.
A `batch_manifest.json` is written to the output root on completion with
timing, per-creator results, and error details.
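The discovery and attribution logic can be sketched roughly like this (a minimal illustration using `pathlib`; the extension set and the function name are assumptions, not the script's actual internals):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".webm"}  # assumed extension set

def discover_videos(content_root):
    """Yield (video_path, creator) pairs; creator is the top-level folder."""
    root = Path(content_root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            # The first component below content_root is the creator folder.
            creator = path.relative_to(root).parts[0]
            yield path, creator
```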
### Options (transcribe.py)
| Flag | Default | Description |
| --------------- | ----------- | ----------------------------------------------- |
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |
### Options (batch_transcribe.py)
| Flag | Default | Description |
| ----------------- | ------------ | ------------------------------------------------ |
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to transcribe.py (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |
## Output Format
Each video produces a JSON file matching the Chrysopedia pipeline spec:
```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```
This format is consumed directly by the Chrysopedia pipeline stage 2
(transcript segmentation) via the `POST /api/v1/ingest` endpoint.
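For reference, a downstream consumer could read a transcript like this (a sketch assuming only the fields shown above; `transcript_text` is a hypothetical helper, not part of the pipeline):

```python
import json

def transcript_text(path):
    """Join all segment texts from a transcript JSON into one string."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return " ".join(seg["text"].strip() for seg in doc["segments"])
```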
## Resumability
Both scripts automatically skip videos whose output JSON already exists. To
re-transcribe a file, delete its output JSON first.
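The skip check amounts to testing for the output file (a sketch; naming the transcript after the video's stem plus `.json` is an assumption based on the output description in this README):

```python
from pathlib import Path

def needs_transcription(video_path, output_dir):
    """True if no transcript JSON exists yet for this video.

    Assumes the transcript is named <video stem>.json inside output_dir.
    """
    out = Path(output_dir) / (Path(video_path).stem + ".json")
    return not out.exists()
```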
## Current Transcription Environment
### Machine: HAL0022 (10.0.0.131)
- **GPU:** NVIDIA GeForce RTX 4090 (24GB VRAM)
- **OS:** Windows 11
- **Python:** Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- **CUDA:** PyTorch with cu126 wheels
### Content Source
```
A:\Education\Artist Streams & Content\
├── au5/ (334 videos)
├── Keota/ (193 videos)
├── DJ Shortee/ (83 videos)
├── KOAN Sound/ (68 videos)
├── Teddy Killerz/ (62 videos)
└── ... (42 creators, 1197 videos total across 146 directories)
```
### Transcript Output Location
```
C:\Users\jlightner\chrysopedia\transcripts\
```
Directory structure mirrors the source hierarchy. Each video produces a
`<filename>.json` transcript file.
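Mirroring can be expressed as a relative-path remap (an illustrative sketch; `mirrored_output_path` is a hypothetical helper, and swapping the video extension for `.json` is an assumption):

```python
from pathlib import Path

def mirrored_output_path(video_path, content_root, output_root):
    """Map a source video to its transcript path under the output root."""
    rel = Path(video_path).relative_to(content_root)
    return Path(output_root) / rel.with_suffix(".json")
```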
**Transfer to ub01:** Transcripts need to be copied to
`/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline
ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`)
or via `scp`/`rsync` from a Linux machine with access to both.
### Running the Batch Job
The batch transcription runs as a Windows Scheduled Task to survive SSH
disconnections:
```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"
# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"
# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30
# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```
### Scripts on HAL0022
```
C:\Users\jlightner\chrysopedia\
├── transcribe.py # Single-file/directory transcription
├── batch_transcribe.py # Recursive multi-creator batch runner
├── run_transcription.bat # Batch file invoked by scheduled task
├── launch_transcription.py # Alternative launcher (subprocess)
├── transcription.log # Current batch run log output
└── transcripts/ # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```
## Performance
Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
A 2-hour video takes ~6–12 minutes. For the full 1,197-video library, expect
roughly 20–60 hours of GPU time depending on average video length.
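The back-of-the-envelope math behind estimates like these (both the real-time factor and the average video length below are illustrative guesses, not measured figures):

```python
# Assumed 10-20x real-time throughput for Whisper large-v3 on an RTX 4090.
speedup_low, speedup_high = 10, 20
two_hour_video_min = 120
print(two_hour_video_min / speedup_high, "to",
      two_hour_video_min / speedup_low, "minutes per 2-hour video")

# Library-wide estimate, assuming ~30 minutes average video length.
library_videos = 1197
avg_video_hours = 0.5
total_video_hours = library_videos * avg_video_hours
print(f"~{total_video_hours / speedup_high:.0f}-"
      f"{total_video_hours / speedup_low:.0f} GPU hours")
```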
## Directory Convention
The script infers the `creator_folder` field from the parent directory of each
video file (or the top-level creator folder in batch mode). Organize videos like:
```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```
Override with `--creator` when processing files outside this structure.