# Chrysopedia — Whisper Transcription
Desktop transcription tool for extracting timestamped text from video files using OpenAI's Whisper model (large-v3). Designed to run on a machine with an NVIDIA GPU (e.g., RTX 4090).
## Prerequisites
- Python 3.10+
- ffmpeg installed and on PATH
- NVIDIA GPU with CUDA support (recommended; CPU fallback available)
### Install ffmpeg

```sh
# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (via Chocolatey or manual install)
choco install ffmpeg
```
### Install Python dependencies

```sh
pip install -r requirements.txt

# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
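Since CPU fallback is supported, the script has to decide which device to pass to Whisper at startup. A minimal device-selection guard might look like the following; this is an illustrative sketch, not code from `transcribe.py`:

```python
def pick_device(preferred: str = "cuda") -> str:
    """Return the preferred device if it is actually usable, else fall back to CPU."""
    try:
        import torch  # deferred import so the helper works even without torch installed
        if preferred == "cuda" and torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"
```

A guard like this keeps a single `--device cuda` default safe on machines without a GPU.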
## Usage

### Single file

```sh
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```

### Batch mode (all videos in a directory)

```sh
python transcribe.py --input ./videos/ --output-dir ./transcripts
```
### Mass batch mode (recursive, multi-creator)

For large content libraries with nested subdirectories per creator:

```sh
python batch_transcribe.py \
    --content-root "A:\Education\Artist Streams & Content" \
    --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
    --python C:\Users\jlightner\.conda\envs\transcribe\python.exe

# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```
`batch_transcribe.py` recursively walks all subdirectories, discovers video files, and calls `transcribe.py` for each directory. The `creator_folder` field in the output JSON is set to the top-level subdirectory name (the artist/creator), and the output directory structure mirrors the source hierarchy. On completion, a `batch_manifest.json` is written to the output root with timing, per-creator results, and error details.
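The discovery step described above can be sketched as a recursive walk that groups video files by their top-level folder. This is an illustrative outline, not the actual `batch_transcribe.py` implementation, and the extension list is an assumption:

```python
from collections import defaultdict
from pathlib import Path

# Assumed set of recognized video extensions.
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}

def discover_videos(content_root: str) -> dict:
    """Walk content_root recursively and group video files by their
    top-level subdirectory (the creator folder)."""
    root = Path(content_root)
    by_creator = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            creator = path.relative_to(root).parts[0]
            by_creator[creator].append(path)
    return dict(by_creator)
```

Grouping by the first path component under the root is what makes `creator_folder` the top-level directory name regardless of how deeply a video is nested.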
## Options (`transcribe.py`)

| Flag | Default | Description |
|---|---|---|
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |
## Options (`batch_transcribe.py`)

| Flag | Default | Description |
|---|---|---|
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to `transcribe.py` (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |
## Output Format

Each video produces a JSON file matching the Chrysopedia pipeline spec:

```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```
This format is consumed directly by the Chrysopedia pipeline stage 2 (transcript segmentation) via the `POST /api/v1/ingest` endpoint.
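As a sanity check before ingestion, a transcript's top-level shape can be validated with a short helper. The key lists below are taken from the example output above, not from the actual pipeline schema, so treat this as an illustrative sketch:

```python
REQUIRED_KEYS = {"source_file", "creator_folder", "duration_seconds", "segments"}
SEGMENT_KEYS = {"start", "end", "text"}

def validate_transcript(doc: dict) -> list:
    """Return a list of problems found; an empty list means the transcript
    matches the expected top-level shape."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - doc.keys())]
    for i, seg in enumerate(doc.get("segments", [])):
        for k in sorted(SEGMENT_KEYS - seg.keys()):
            problems.append(f"segments[{i}] missing key: {k}")
    return problems
```

Running a check like this before copying files to ub01 catches truncated or partial JSON early, instead of at ingest time.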
## Resumability

Both scripts automatically skip videos whose output JSON already exists. To re-transcribe a file, delete its output JSON first.
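The skip logic amounts to an existence check on the expected output path. A minimal sketch, assuming the transcript is named after the video's stem (`<filename>.json`, as described under Transcript Output Location):

```python
from pathlib import Path

def needs_transcription(video: Path, output_dir: Path) -> bool:
    """Return True unless the video's transcript JSON already exists."""
    transcript = output_dir / f"{video.stem}.json"
    return not transcript.exists()
```

Because the check keys on the output file only, deleting a transcript JSON is all it takes to queue that video for re-transcription on the next run.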
## Current Transcription Environment

Machine: HAL0022 (10.0.0.131)

- GPU: NVIDIA GeForce RTX 4090 (24 GB VRAM)
- OS: Windows 11
- Python: Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- CUDA: PyTorch with cu126 wheels
## Content Source

```
A:\Education\Artist Streams & Content\
├── au5/            (334 videos)
├── Keota/          (193 videos)
├── DJ Shortee/     (83 videos)
├── KOAN Sound/     (68 videos)
├── Teddy Killerz/  (62 videos)
└── ...             (42 creators, 1,197 videos total across 146 directories)
```
## Transcript Output Location

```
C:\Users\jlightner\chrysopedia\transcripts\
```

Directory structure mirrors the source hierarchy. Each video produces a `<filename>.json` transcript file.

**Transfer to ub01:** Transcripts need to be copied to `/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`) or via `scp`/`rsync` from a Linux machine with access to both.
## Running the Batch Job

The batch transcription runs as a Windows Scheduled Task to survive SSH disconnections:

```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"

# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"

# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30

# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```
## Scripts on HAL0022

```
C:\Users\jlightner\chrysopedia\
├── transcribe.py            # Single-file/directory transcription
├── batch_transcribe.py      # Recursive multi-creator batch runner
├── run_transcription.bat    # Batch file invoked by scheduled task
├── launch_transcription.py  # Alternative launcher (subprocess)
├── transcription.log        # Current batch run log output
└── transcripts/             # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```
## Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time, so a 2-hour video takes about 6–12 minutes. For the full 1,197-video library, expect roughly 20–60 hours of GPU time depending on average video length.
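The estimate above follows from dividing total audio duration by the transcription speed-up. A quick helper for re-running the arithmetic under different assumptions about average video length:

```python
def estimated_gpu_hours(n_videos: int, avg_minutes: float, speedup: float) -> float:
    """GPU hours needed to transcribe n_videos of avg_minutes each,
    processing audio at `speedup` times real-time."""
    total_audio_hours = n_videos * avg_minutes / 60
    return total_audio_hours / speedup

# e.g. 1,197 videos averaging 60 minutes at 20x real-time:
# estimated_gpu_hours(1197, 60, 20) -> 59.85 hours
```

At the pessimistic end (10× real-time), the same library at a 60-minute average would need roughly twice that, which is why the README quotes a wide range.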
## Directory Convention

The script infers the `creator_folder` field from the parent directory of each video file (or the top-level creator folder in batch mode). Organize videos like:

```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```

Override with `--creator` when processing files outside this structure.
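The inference rule above can be sketched in a few lines. This is illustrative only, with batch mode assumed to take the first path component under the content root:

```python
from pathlib import Path
from typing import Optional

def infer_creator(video: Path, content_root: Optional[Path] = None) -> str:
    """Infer creator_folder: the top-level folder under content_root in
    batch mode, otherwise the video's immediate parent directory name."""
    if content_root is not None:
        return video.relative_to(content_root).parts[0]
    return video.parent.name
```

So `content-root/Skope/Youtube/clip.mp4` yields `Skope` in batch mode, and a loose file's parent directory name otherwise; `--creator` exists precisely for files where neither rule gives the right answer.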