
# Chrysopedia — Whisper Transcription

Desktop transcription tool for extracting timestamped text from video files using OpenAI's Whisper model (`large-v3`). Designed to run on a machine with an NVIDIA GPU (e.g., RTX 4090).

## Prerequisites

- Python 3.10+
- `ffmpeg` installed and on `PATH`
- NVIDIA GPU with CUDA support (recommended; CPU fallback available)

### Install ffmpeg

```sh
# Debian/Ubuntu
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows (via Chocolatey or manual install)
choco install ffmpeg
```

### Install Python dependencies

```sh
pip install -r requirements.txt

# For CUDA support, install torch with CUDA wheels:
pip install torch --index-url https://download.pytorch.org/whl/cu126
```
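After installing, you can sanity-check which device Whisper will actually use (a minimal sketch; it falls back to CPU when torch is missing or CUDA is unavailable):

```python
# Report the compute device available to Whisper.
# Falls back to "cpu" if torch is not installed or CUDA is unavailable.
try:
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
except ImportError:
    device = "cpu"  # torch not installed yet
print(f"Whisper will run on: {device}")
```

If this prints `cpu` on the GPU machine, the CUDA wheels above were likely not picked up.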

## Usage

### Single file

```sh
python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
```

### Batch mode (all videos in a directory)

```sh
python transcribe.py --input ./videos/ --output-dir ./transcripts
```

### Mass batch mode (recursive, multi-creator)

For large content libraries with nested subdirectories per creator:

```sh
python batch_transcribe.py \
  --content-root "A:\Education\Artist Streams & Content" \
  --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
  --python C:\Users\jlightner\.conda\envs\transcribe\python.exe

# Dry run to preview without transcribing:
python batch_transcribe.py --content-root ... --output-dir ... --dry-run
```

`batch_transcribe.py` recursively walks all subdirectories, discovers video files, and calls `transcribe.py` for each directory. The `creator_folder` field in the output JSON is set to the top-level subdirectory name (the artist/creator). The output directory structure mirrors the source hierarchy.
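A minimal sketch of that discovery logic (illustrative only; `VIDEO_EXTS` and the function name are assumptions, not the actual script's internals):

```python
from pathlib import Path

# Extensions treated as video files (assumed set; extend as needed).
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}

def discover_videos(content_root):
    """Yield (creator_folder, video_path) pairs for every video under content_root.

    The creator is the top-level subdirectory name, no matter how deeply
    the video is nested beneath it.
    """
    root = Path(content_root)
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in VIDEO_EXTS:
            creator = path.relative_to(root).parts[0]
            yield creator, path
```

For `Skope/Youtube/clip.mp4` under the content root, this yields creator `Skope`, matching the top-level-subdirectory rule above.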

A `batch_manifest.json` is written to the output root on completion with timing, per-creator results, and error details.
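The exact manifest schema isn't documented here; purely as an illustration, it might look something like the following (every field name below is an assumption):

```json
{
  "started_at": "2026-03-30T18:00:00Z",
  "finished_at": "2026-03-31T02:10:00Z",
  "creators": {
    "Skope": { "transcribed": 12, "skipped": 3, "failed": 0 }
  },
  "errors": []
}
```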

## Options (transcribe.py)

| Flag | Default | Description |
|------|---------|-------------|
| `--input` | (required) | Path to a video file or directory of videos |
| `--output-dir` | (required) | Directory to write transcript JSON files |
| `--model` | `large-v3` | Whisper model name (`tiny`, `base`, `small`, `medium`, `large-v3`) |
| `--device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `--creator` | (inferred) | Override creator folder name in output JSON |
| `-v, --verbose` | off | Enable debug logging |

## Options (batch_transcribe.py)

| Flag | Default | Description |
|------|---------|-------------|
| `--content-root` | (required) | Root directory with creator subdirectories |
| `--output-dir` | (required) | Root output directory for transcript JSONs |
| `--script` | (auto) | Path to `transcribe.py` (default: same directory) |
| `--python` | (auto) | Python interpreter to use |
| `--model` | `large-v3` | Whisper model name |
| `--device` | `cuda` | Compute device |
| `--dry-run` | off | Preview work plan without transcribing |

## Output Format

Each video produces a JSON file matching the Chrysopedia pipeline spec:

```json
{
  "source_file": "Skope — Sound Design Masterclass pt2.mp4",
  "creator_folder": "Skope",
  "duration_seconds": 7243,
  "segments": [
    {
      "start": 0.0,
      "end": 4.52,
      "text": "Hey everyone welcome back to part two...",
      "words": [
        { "word": "Hey", "start": 0.0, "end": 0.28 },
        { "word": "everyone", "start": 0.32, "end": 0.74 }
      ]
    }
  ]
}
```

This format is consumed directly by Chrysopedia pipeline stage 2 (transcript segmentation) via the `POST /api/v1/ingest` endpoint.
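As an illustration, a transcript file could be posted to that endpoint with the standard library alone. The base URL here is a placeholder and the request shape is an assumption; only the `POST /api/v1/ingest` path comes from the spec above:

```python
import json
import urllib.request

def build_ingest_request(transcript_path, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for one transcript JSON file."""
    with open(transcript_path, "rb") as f:
        body = f.read()
    json.loads(body)  # fail early on malformed JSON before hitting the API
    return urllib.request.Request(
        url=f"{base_url}/api/v1/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it:
#   urllib.request.urlopen(build_ingest_request("transcripts/Skope/clip.json"))
```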

## Resumability

Both scripts automatically skip videos whose output JSON already exists. To re-transcribe a file, delete its output JSON first.
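The skip logic amounts to an existence check on the expected output path (a sketch; the real scripts may derive the output path differently):

```python
from pathlib import Path

def needs_transcription(video_path, output_dir):
    """Return True unless <output_dir>/<video stem>.json already exists."""
    out = Path(output_dir) / (Path(video_path).stem + ".json")
    return not out.exists()
```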

## Current Transcription Environment

Machine: HAL0022 (10.0.0.131)

- GPU: NVIDIA GeForce RTX 4090 (24 GB VRAM)
- OS: Windows 11
- Python: Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
- CUDA: PyTorch with cu126 wheels

## Content Source

```
A:\Education\Artist Streams & Content\
├── au5/           (334 videos)
├── Keota/         (193 videos)
├── DJ Shortee/    (83 videos)
├── KOAN Sound/    (68 videos)
├── Teddy Killerz/ (62 videos)
└── ... (42 creators, 1197 videos total across 146 directories)
```

## Transcript Output Location

```
C:\Users\jlightner\chrysopedia\transcripts\
```

The directory structure mirrors the source hierarchy. Each video produces a `<filename>.json` transcript file.

**Transfer to ub01:** Transcripts need to be copied to `/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`) or via `scp`/`rsync` from a Linux machine with access to both.

## Running the Batch Job

The batch transcription runs as a Windows Scheduled Task to survive SSH disconnections:

```powershell
# The task is already created. To re-run:
schtasks /run /tn "ChrysopediaTranscribe"

# Check status:
schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"

# Monitor log:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30

# Or follow live:
Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
```

## Scripts on HAL0022

```
C:\Users\jlightner\chrysopedia\
├── transcribe.py           # Single-file/directory transcription
├── batch_transcribe.py     # Recursive multi-creator batch runner
├── run_transcription.bat   # Batch file invoked by the scheduled task
├── launch_transcription.py # Alternative launcher (subprocess)
├── transcription.log       # Current batch run log output
└── transcripts/            # Output directory
    ├── batch_manifest.json
    ├── au5/
    ├── Break/
    └── ...
```

## Performance

Whisper large-v3 on an RTX 4090 processes audio at roughly 10-20× real-time, so a 2-hour video takes ~6-12 minutes. For the full 1,197-video library, expect roughly 20-60 hours of GPU time, depending on average video length.

## Directory Convention

The script infers the `creator_folder` field from the parent directory of each video file (or the top-level creator folder in batch mode). Organize videos like:

```
content-root/
├── Skope/
│   ├── Youtube/
│   │   ├── Sound Design Masterclass pt1.mp4
│   │   └── Sound Design Masterclass pt2.mp4
│   └── Patreon/
│       └── Advanced Wavetables.mp4
└── Mr Bill/
    └── Youtube/
        └── Glitch Techniques.mp4
```

Override with `--creator` when processing files outside this structure.
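In single-file mode the inference is just the parent directory's name, with `--creator` taking precedence (a sketch of that rule; the function name is made up for illustration):

```python
from pathlib import Path

def infer_creator(video_path, override=None):
    """creator_folder for one video: the --creator override if given,
    otherwise the immediate parent directory's name."""
    if override:
        return override
    return Path(video_path).parent.name
```

So `content-root/Skope/video.mp4` infers `Skope`, while a loose file with no meaningful parent should be run with an explicit `--creator`.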