Auto-mode commit 7aa33cd accidentally deleted 78 files (14,814 lines) during M005
execution. Subsequent commits rebuilt some frontend files but backend/, alembic/,
tests/, whisper/, docker configs, and prompts were never restored in this repo.
This commit restores the full project tree by syncing from ub01's working directory,
which has all M001-M007 features running in production containers.
Restored: backend/ (config, models, routers, database, redis, search_service, worker),
alembic/ (6 migrations), docker/ (Dockerfiles, nginx, compose), prompts/ (4 stages),
tests/, whisper/, README.md, .env.example, chrysopedia-spec.md
# Chrysopedia — Whisper Transcription

Desktop transcription tool for extracting timestamped text from video files
using OpenAI's Whisper model (large-v3). Designed to run on a machine with
an NVIDIA GPU (e.g., RTX 4090).

## Prerequisites

- **Python 3.10+**
- **ffmpeg** installed and on PATH
- **NVIDIA GPU** with CUDA support (recommended; CPU fallback available)

### Install ffmpeg

```bash
# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (via chocolatey or manual install)
choco install ffmpeg
```

### Install Python dependencies

```bash
pip install -r requirements.txt

# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```

## Usage

### Single file

```bash
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```

### Batch mode (all videos in a directory)

```bash
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

### Mass batch mode (recursive, multi-creator)

For large content libraries with nested subdirectories per creator:

```bash
python batch_transcribe.py \
  --content-root "A:\Education\Artist Streams & Content" \
  --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
  --python C:\Users\jlightner\.conda\envs\transcribe\python.exe

# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```

`batch_transcribe.py` recursively walks all subdirectories, discovers video
files, and calls `transcribe.py` for each directory. The `creator_folder`
field in the output JSON is set to the top-level subdirectory name (the
artist/creator). Output directory structure mirrors the source hierarchy.

A `batch_manifest.json` is written to the output root on completion with
timing, per-creator results, and error details.
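
The discovery-and-dispatch loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual `batch_transcribe.py` (which also writes the manifest and handles errors); the extension set and function names here are assumptions:

```python
from pathlib import Path
import subprocess

VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".webm"}  # assumed extension set

def plan_batch(content_root: Path) -> dict[Path, list[Path]]:
    """Group discovered video files by the directory that contains them."""
    plan: dict[Path, list[Path]] = {}
    for path in sorted(content_root.rglob("*")):
        if path.suffix.lower() in VIDEO_EXTS:
            plan.setdefault(path.parent, []).append(path)
    return plan

def run_batch(content_root: Path, output_dir: Path, dry_run: bool = True) -> None:
    for directory, videos in plan_batch(content_root).items():
        # Creator = top-level subdirectory under the content root.
        creator = directory.relative_to(content_root).parts[0]
        out = output_dir / directory.relative_to(content_root)
        if dry_run:
            print(f"{creator}: {len(videos)} video(s) -> {out}")
            continue
        subprocess.run(
            ["python", "transcribe.py",
             "--input", str(directory),
             "--output-dir", str(out),
             "--creator", creator],
            check=True,
        )
```

Calling `transcribe.py` as a subprocess per directory (rather than importing it) keeps each Whisper run in a fresh process, so one failed video directory does not take down the whole batch.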

### Options (transcribe.py)

| Flag            | Default    | Description                                                         |
| --------------- | ---------- | ------------------------------------------------------------------- |
| `--input`       | (required) | Path to a video file or directory of videos                         |
| `--output-dir`  | (required) | Directory to write transcript JSON files                            |
| `--model`       | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`)  |
| `--device`      | `cuda`     | Compute device (`cuda` or `cpu`)                                    |
| `--creator`     | (inferred) | Override creator folder name in output JSON                         |
| `-v, --verbose` | off        | Enable debug logging                                                 |

### Options (batch_transcribe.py)

| Flag             | Default    | Description                                      |
| ---------------- | ---------- | ------------------------------------------------ |
| `--content-root` | (required) | Root directory with creator subdirectories       |
| `--output-dir`   | (required) | Root output directory for transcript JSONs       |
| `--script`       | (auto)     | Path to transcribe.py (default: same directory)  |
| `--python`       | (auto)     | Python interpreter to use                        |
| `--model`        | `large-v3` | Whisper model name                               |
| `--device`       | `cuda`     | Compute device                                   |
| `--dry-run`      | off        | Preview work plan without transcribing           |
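
For reference, the transcribe.py flag set above corresponds to an argparse interface along these lines (a sketch of the CLI shape, not the actual implementation):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """CLI skeleton matching the transcribe.py options table."""
    p = argparse.ArgumentParser(description="Whisper transcription")
    p.add_argument("--input", required=True,
                   help="Video file or directory of videos")
    p.add_argument("--output-dir", required=True,
                   help="Directory to write transcript JSON files")
    p.add_argument("--model", default="large-v3",
                   choices=["tiny", "base", "small", "medium", "large-v3"])
    p.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    p.add_argument("--creator", default=None,
                   help="Override creator folder name in output JSON")
    p.add_argument("-v", "--verbose", action="store_true")
    return p
```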

## Output Format

Each video produces a JSON file matching the Chrysopedia pipeline spec:

```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```

This format is consumed directly by the Chrysopedia pipeline stage 2
(transcript segmentation) via the `POST /api/v1/ingest` endpoint.
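
A quick sanity check before ingestion is to verify each transcript has the fields shown above. The required-field sets in this sketch are inferred from the example, not taken from the pipeline spec itself:

```python
import json
from pathlib import Path

REQUIRED_TOP = {"source_file", "creator_folder", "duration_seconds", "segments"}
REQUIRED_SEG = {"start", "end", "text"}

def validate_transcript(path: Path) -> dict:
    """Load a transcript JSON and check it against the expected shape."""
    data = json.loads(path.read_text(encoding="utf-8"))
    missing = REQUIRED_TOP - data.keys()
    if missing:
        raise ValueError(f"{path.name}: missing top-level keys {missing}")
    for i, seg in enumerate(data["segments"]):
        if not REQUIRED_SEG <= seg.keys():
            raise ValueError(f"{path.name}: segment {i} missing keys")
        if seg["end"] < seg["start"]:
            raise ValueError(f"{path.name}: segment {i} has end < start")
    return data
```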

## Resumability

Both scripts automatically skip videos whose output JSON already exists. To
re-transcribe a file, delete its output JSON first.
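
The skip logic amounts to an existence check on the expected output path before each transcription. A sketch, assuming the output is named after the video's stem:

```python
from pathlib import Path

def needs_transcription(video: Path, output_dir: Path) -> bool:
    """Skip videos whose transcript JSON already exists."""
    return not (output_dir / f"{video.stem}.json").exists()
```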

## Current Transcription Environment

### Machine: HAL0022 (10.0.0.131)

- **GPU:** NVIDIA GeForce RTX 4090 (24GB VRAM)
- **OS:** Windows 11
- **Python:** Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- **CUDA:** PyTorch with cu126 wheels

### Content Source

```
A:\Education\Artist Streams & Content\
├── au5/            (334 videos)
├── Keota/          (193 videos)
├── DJ Shortee/     (83 videos)
├── KOAN Sound/     (68 videos)
├── Teddy Killerz/  (62 videos)
├── ... (42 creators, 1197 videos total across 146 directories)
```

### Transcript Output Location

```
C:\Users\jlightner\chrysopedia\transcripts\
```

Directory structure mirrors the source hierarchy. Each video produces a
`<filename>.json` transcript file.

**Transfer to ub01:** Transcripts need to be copied to
`/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline
ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`)
or via `scp`/`rsync` from a Linux machine with access to both.

### Running the Batch Job

The batch transcription runs as a Windows Scheduled Task to survive SSH
disconnections:

```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"

# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"

# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30

# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```

### Scripts on HAL0022

```
C:\Users\jlightner\chrysopedia\
├── transcribe.py            # Single-file/directory transcription
├── batch_transcribe.py      # Recursive multi-creator batch runner
├── run_transcription.bat    # Batch file invoked by scheduled task
├── launch_transcription.py  # Alternative launcher (subprocess)
├── transcription.log        # Current batch run log output
└── transcripts/             # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```

## Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
A 2-hour video takes ~6–12 minutes. For the full 1,197-video library, expect
roughly 20–60 hours of GPU time depending on average video length.
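
The estimate follows directly from the library size and the real-time factor, given an assumed average video length of about 30 minutes (an assumption; actual lengths vary widely):

```python
videos = 1197
avg_minutes = 30                     # assumed average video length
speedup_low, speedup_high = 10, 20   # real-time factors for large-v3 on a 4090

total_audio_hours = videos * avg_minutes / 60   # ~598.5 hours of audio
best = total_audio_hours / speedup_high         # fastest case
worst = total_audio_hours / speedup_low         # slowest case
print(f"{total_audio_hours:.0f} h of audio -> "
      f"{best:.0f}-{worst:.0f} h of GPU time")
```

A longer true average pushes the total proportionally: at a 60-minute average the same math gives roughly 60 to 120 hours.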

## Directory Convention

The script infers the `creator_folder` field from the parent directory of each
video file (or the top-level creator folder in batch mode). Organize videos like:

```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
├── Mr Bill/
│   └── Youtube/
│       └── Glitch Techniques.mp4
```

Override with `--creator` when processing files outside this structure.