
# Chrysopedia — Whisper Transcription
Desktop transcription tool for extracting timestamped text from video files
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
an NVIDIA GPU (e.g., RTX 4090).
## Prerequisites
- **Python 3.10+**
- **ffmpeg** installed and on PATH
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)
### Install ffmpeg
```bash
# Debian/Ubuntu
sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows (via chocolatey or manual install)
choco install ffmpeg
```
### Install Python dependencies
```bash
pip install -r requirements.txt
# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
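The scripts fall back to CPU when CUDA is unavailable; the device-selection logic can be sketched roughly like this (`pick_device` is illustrative, not a function in the shipped scripts):

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return "cuda" if PyTorch sees a GPU, else fall back to "cpu"."""
    try:
        import torch
        if preferred == "cuda" and torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass  # torch not installed yet; CPU fallback
    return "cpu"
```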
## Usage
### Single file
```bash
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```
### Batch mode (all videos in a directory)
```bash
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
### Mass batch mode (recursive, multi-creator)
For large content libraries with nested subdirectories per creator:
```bash
python batch_transcribe.py \
  --content-root "A:\Education\Artist Streams & Content" \
  --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
  --python C:\Users\jlightner\.conda\envs\transcribe\python.exe

# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```
`batch_transcribe.py` recursively walks all subdirectories, discovers video
files, and calls `transcribe.py` for each directory. The `creator_folder`
field in the output JSON is set to the top-level subdirectory name (the
artist/creator). Output directory structure mirrors the source hierarchy.
A `batch_manifest.json` is written to the output root on completion with
timing, per-creator results, and error details.
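The walk-and-group step described above can be sketched as follows (illustrative only: the video-extension set is an assumption, and the real `batch_transcribe.py` invokes `transcribe.py` per directory rather than returning a dict):

```python
from pathlib import Path

# Assumed extension set; the actual script's list may differ.
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".webm", ".avi"}

def discover_videos(content_root: str) -> dict[str, list[Path]]:
    """Group video files under content_root by top-level creator folder.

    Assumes every video sits inside a creator subdirectory, per the
    directory convention at the end of this README.
    """
    root = Path(content_root)
    by_creator: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            creator = path.relative_to(root).parts[0]
            by_creator.setdefault(creator, []).append(path)
    return by_creator
```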
### Options (transcribe.py)
| Flag | Default | Description |
| --------------- | ----------- | ----------------------------------------------- |
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |
### Options (batch_transcribe.py)
| Flag | Default | Description |
| ----------------- | ------------ | ------------------------------------------------ |
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to transcribe.py (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |
## Output Format
Each video produces a JSON file matching the Chrysopedia pipeline spec:
```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```
This format is consumed directly by the Chrysopedia pipeline stage 2
(transcript segmentation) via the `POST /api/v1/ingest` endpoint.
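A lightweight check of the required fields before ingestion can catch truncated or partial JSON files (a sketch against the format shown above; the pipeline's own validation may be stricter):

```python
def validate_transcript(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the doc matches the spec."""
    errors = []
    for field in ("source_file", "creator_folder", "duration_seconds", "segments"):
        if field not in doc:
            errors.append(f"missing field: {field}")
    for i, seg in enumerate(doc.get("segments", [])):
        for field in ("start", "end", "text"):
            if field not in seg:
                errors.append(f"segment {i} missing: {field}")
        if "start" in seg and "end" in seg and seg["end"] < seg["start"]:
            errors.append(f"segment {i}: end before start")
    return errors
```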
## Resumability
Both scripts automatically skip videos whose output JSON already exists. To
re-transcribe a file, delete its output JSON first.
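The skip logic amounts to an existence check on the mirrored output path, roughly (function names here are illustrative):

```python
from pathlib import Path

def output_path_for(video: Path, content_root: Path, output_dir: Path) -> Path:
    """Mirror the source hierarchy: videos/Skope/a.mp4 -> transcripts/Skope/a.json."""
    return output_dir / video.relative_to(content_root).with_suffix(".json")

def should_skip(video: Path, content_root: Path, output_dir: Path) -> bool:
    """Skip any video whose transcript JSON already exists (resumability)."""
    return output_path_for(video, content_root, output_dir).exists()
```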
## Current Transcription Environment
### Machine: HAL0022 (10.0.0.131)
- **GPU:** NVIDIA GeForce RTX 4090 (24GB VRAM)
- **OS:** Windows 11
- **Python:** Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- **CUDA:** PyTorch with cu126 wheels
### Content Source
```
A:\Education\Artist Streams & Content\
├── au5/ (334 videos)
├── Keota/ (193 videos)
├── DJ Shortee/ (83 videos)
├── KOAN Sound/ (68 videos)
├── Teddy Killerz/ (62 videos)
└── ... (42 creators, 1197 videos total across 146 directories)
```
### Transcript Output Location
```
C:\Users\jlightner\chrysopedia\transcripts\
```
Directory structure mirrors the source hierarchy. Each video produces a
`<filename>.json` transcript file.
**Transfer to ub01:** Transcripts need to be copied to
`/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline
ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`)
or via `scp`/`rsync` from a Linux machine with access to both.
### Running the Batch Job
The batch transcription runs as a Windows Scheduled Task to survive SSH
disconnections:
```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"
# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"
# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30
# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```
### Scripts on HAL0022
```
C:\Users\jlightner\chrysopedia\
├── transcribe.py            # Single-file/directory transcription
├── batch_transcribe.py      # Recursive multi-creator batch runner
├── run_transcription.bat    # Batch file invoked by scheduled task
├── launch_transcription.py  # Alternative launcher (subprocess)
├── transcription.log        # Current batch run log output
└── transcripts/             # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```
## Performance
Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
A 2-hour video takes ~6–12 minutes. For the full 1,197-video library, expect
roughly 20–60 hours of GPU time depending on average video length.
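The estimate is simple arithmetic over the library size, assuming the rough speedup figures above:

```python
def estimate_gpu_hours(total_videos: int, avg_minutes: float, speedup: float) -> float:
    """GPU hours = total audio hours / real-time speedup factor."""
    return total_videos * avg_minutes / 60 / speedup

# e.g. 1,197 videos averaging ~60 min at 20x real-time -> ~60 GPU hours;
# shorter average lengths or lower speedups shift the estimate accordingly.
```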
## Directory Convention
The script infers the `creator_folder` field from the parent directory of each
video file (or the top-level creator folder in batch mode). Organize videos like:
```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```
Override with `--creator` when processing files outside this structure.