- "whisper/transcribe.py" - "whisper/requirements.txt" - "whisper/README.md" GSD-Task: S01/T04
102 lines
2.7 KiB
Markdown
102 lines
2.7 KiB
Markdown
# Chrysopedia — Whisper Transcription
|
||
|
||
Desktop transcription tool for extracting timestamped text from video files
|
||
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
|
||
an NVIDIA GPU (e.g., RTX 4090).
|
||
|
||
## Prerequisites
|
||
|
||
- **Python 3.10+**
|
||
- **ffmpeg** installed and on PATH
|
||
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)
|
||
|
||
### Install ffmpeg
|
||
|
||
```bash
|
||
# Debian/Ubuntu
|
||
sudo apt install ffmpeg
|
||
|
||
# macOS
|
||
brew install ffmpeg
|
||
```
|
||
|
||
### Install Python dependencies
|
||
|
||
```bash
|
||
pip install -r requirements.txt
|
||
```
|
||
|
||
## Usage
|
||
|
||
### Single file
|
||
|
||
```bash
|
||
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
|
||
```
|
||
|
||
### Batch mode (all videos in a directory)
|
||
|
||
```bash
|
||
python transcribe.py --input ./videos/ --output-dir ./transcripts
|
||
```
|
||
|
||
### Options
|
||
|
||
| Flag | Default | Description |
|
||
| --------------- | ----------- | ----------------------------------------------- |
|
||
| `--input` | (required) | Path to a video file or directory of videos |
|
||
| `--output-dir` | (required) | Directory to write transcript JSON files |
|
||
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
|
||
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
|
||
| `--creator` | (inferred) | Override creator folder name in output JSON |
|
||
| `-v, --verbose` | off | Enable debug logging |
|
||
|
||
## Output Format
|
||
|
||
Each video produces a JSON file matching the Chrysopedia spec:
|
||
|
||
```json
|
||
{
|
||
"source_file": "Skope — Sound Design Masterclass pt2.mp4",
|
||
"creator_folder": "Skope",
|
||
"duration_seconds": 7243,
|
||
"segments": [
|
||
{
|
||
"start": 0.0,
|
||
"end": 4.52,
|
||
"text": "Hey everyone welcome back to part two...",
|
||
"words": [
|
||
{ "word": "Hey", "start": 0.0, "end": 0.28 },
|
||
{ "word": "everyone", "start": 0.32, "end": 0.74 }
|
||
]
|
||
}
|
||
]
|
||
}
|
||
```
|
||
|
||
## Resumability
|
||
|
||
The script automatically skips videos whose output JSON already exists. To
|
||
re-transcribe a file, delete its output JSON first.
|
||
|
||
## Performance
|
||
|
||
Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
|
||
A 2-hour video takes ~6–12 minutes. For 300 videos averaging 1.5 hours each,
|
||
the initial transcription pass takes roughly 15–40 hours of GPU time.
|
||
|
||
## Directory Convention
|
||
|
||
The script infers the `creator_folder` field from the parent directory of each
|
||
video file. Organize videos like:
|
||
|
||
```
|
||
videos/
|
||
├── Skope/
|
||
│ ├── Sound Design Masterclass pt1.mp4
|
||
│ └── Sound Design Masterclass pt2.mp4
|
||
├── Mr Bill/
|
||
│ └── Glitch Techniques.mp4
|
||
```
|
||
|
||
Override with `--creator` when processing files outside this structure.
|