# Chrysopedia — Whisper Transcription
Desktop transcription tool for extracting timestamped text from video files
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
an NVIDIA GPU (e.g., RTX 4090).
## Prerequisites
- **Python 3.10+**
- **ffmpeg** installed and on PATH
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)
### Install ffmpeg
```bash
# Debian/Ubuntu
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows (via chocolatey or manual install)
choco install ffmpeg
```
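After installing, a quick way to confirm `ffmpeg` is actually discoverable from Python (a minimal sanity check, not part of the tool itself):

```python
import shutil

# Report where ffmpeg resolved on PATH, or flag that it is missing.
ffmpeg_path = shutil.which("ffmpeg")
print(ffmpeg_path or "ffmpeg not found on PATH")
```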
### Install Python dependencies
```bash
pip install -r requirements.txt
# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
## Usage
### Single file
```bash
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```
### Batch mode (all videos in a directory)
```bash
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
### Mass batch mode (recursive, multi-creator)
For large content libraries with nested subdirectories per creator:
```bash
python batch_transcribe.py \
--content-root "A:\Education\Artist Streams & Content" \
--output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
--python C:\Users\jlightner\.conda\envs\transcribe\python.exe
# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```
`batch_transcribe.py` recursively walks all subdirectories, discovers video
files, and calls `transcribe.py` for each directory. The `creator_folder`
field in the output JSON is set to the top-level subdirectory name (the
artist/creator). Output directory structure mirrors the source hierarchy.
A `batch_manifest.json` is written to the output root on completion with
timing, per-creator results, and error details.
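The discovery and attribution logic can be sketched roughly like this (a minimal illustration using `pathlib`; the extension set and the function name are assumptions, not the script's actual internals):

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".webm"}  # assumed extension set

def discover_videos(content_root):
    """Yield (video_path, creator) pairs; creator is the top-level folder."""
    root = Path(content_root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            # The first component below content_root is the creator folder.
            creator = path.relative_to(root).parts[0]
            yield path, creator
```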
### Options (transcribe.py)
| Flag | Default | Description |
| --------------- | ----------- | ----------------------------------------------- |
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |
### Options (batch_transcribe.py)
| Flag | Default | Description |
| ----------------- | ------------ | ------------------------------------------------ |
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to transcribe.py (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |
## Output Format
Each video produces a JSON file matching the Chrysopedia pipeline spec:
```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```
This format is consumed directly by the Chrysopedia pipeline stage 2
(transcript segmentation) via the `POST /api/v1/ingest` endpoint.
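For reference, a downstream consumer could read a transcript like this (a sketch assuming only the fields shown above; `transcript_text` is a hypothetical helper, not part of the pipeline):

```python
import json

def transcript_text(path):
    """Join all segment texts from a transcript JSON into one string."""
    with open(path, encoding="utf-8") as f:
        doc = json.load(f)
    return " ".join(seg["text"].strip() for seg in doc["segments"])
```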
## Resumability
Both scripts automatically skip videos whose output JSON already exists. To
re-transcribe a file, delete its output JSON first.
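The skip check amounts to testing for the output file (a sketch; naming the transcript after the video's stem plus `.json` is an assumption based on the output description in this README):

```python
from pathlib import Path

def needs_transcription(video_path, output_dir):
    """True if no transcript JSON exists yet for this video.

    Assumes the transcript is named <video stem>.json inside output_dir.
    """
    out = Path(output_dir) / (Path(video_path).stem + ".json")
    return not out.exists()
```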
## Current Transcription Environment
### Machine: HAL0022 (10.0.0.131)
- **GPU:** NVIDIA GeForce RTX 4090 (24GB VRAM)
- **OS:** Windows 11
- **Python:** Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- **CUDA:** PyTorch with cu126 wheels
### Content Source
```
A:\Education\Artist Streams & Content\
├── au5/ (334 videos)
├── Keota/ (193 videos)
├── DJ Shortee/ (83 videos)
├── KOAN Sound/ (68 videos)
├── Teddy Killerz/ (62 videos)
└── ... (42 creators, 1197 videos total across 146 directories)
```
### Transcript Output Location
```
C:\Users\jlightner\chrysopedia\transcripts\
```
Directory structure mirrors the source hierarchy. Each video produces a
`<filename>.json` transcript file.
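Mirroring can be expressed as a relative-path remap (an illustrative sketch; `mirrored_output_path` is a hypothetical helper, and swapping the video extension for `.json` is an assumption):

```python
from pathlib import Path

def mirrored_output_path(video_path, content_root, output_root):
    """Map a source video to its transcript path under the output root."""
    rel = Path(video_path).relative_to(content_root)
    return Path(output_root) / rel.with_suffix(".json")
```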
**Transfer to ub01:** Transcripts need to be copied to
`/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline
ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`)
or via `scp`/`rsync` from a Linux machine with access to both.
### Running the Batch Job
The batch transcription runs as a Windows Scheduled Task to survive SSH
disconnections:
```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"
# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"
# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30
# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```
### Scripts on HAL0022
```
C:\Users\jlightner\chrysopedia\
├── transcribe.py # Single-file/directory transcription
├── batch_transcribe.py # Recursive multi-creator batch runner
├── run_transcription.bat # Batch file invoked by scheduled task
├── launch_transcription.py # Alternative launcher (subprocess)
├── transcription.log # Current batch run log output
└── transcripts/ # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```
## Performance
Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
A 2-hour video takes ~6–12 minutes. For the full 1,197-video library, expect
roughly 20–60 hours of GPU time depending on average video length.
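The back-of-the-envelope math behind estimates like these (both the real-time factor and the average video length below are illustrative guesses, not measured figures):

```python
# Assumed 10-20x real-time throughput for Whisper large-v3 on an RTX 4090.
speedup_low, speedup_high = 10, 20
two_hour_video_min = 120
print(two_hour_video_min / speedup_high, "to",
      two_hour_video_min / speedup_low, "minutes per 2-hour video")

# Library-wide estimate, assuming ~30 minutes average video length.
library_videos = 1197
avg_video_hours = 0.5
total_video_hours = library_videos * avg_video_hours
print(f"~{total_video_hours / speedup_high:.0f}-"
      f"{total_video_hours / speedup_low:.0f} GPU hours")
```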
## Directory Convention
The script infers the `creator_folder` field from the parent directory of each
video file (or the top-level creator folder in batch mode). Organize videos like:
```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```
Override with `--creator` when processing files outside this structure.