diff --git a/whisper/README.md b/whisper/README.md
index 03a7179..40e36e3 100644
--- a/whisper/README.md
+++ b/whisper/README.md
@@ -18,12 +18,18 @@
 sudo apt install ffmpeg
 
 # macOS
 brew install ffmpeg
+
+# Windows (via chocolatey or manual install)
+choco install ffmpeg
 ```
 
 ### Install Python dependencies
 
 ```bash
 pip install -r requirements.txt
+
+# For CUDA support, install torch with CUDA wheels:
+pip install torch --index-url https://download.pytorch.org/whl/cu126
 ```
 
 ## Usage
@@ -40,7 +46,29 @@
 python transcribe.py --input "path/to/video.mp4" --output-dir ./transcripts
 python transcribe.py --input ./videos/ --output-dir ./transcripts
 ```
 
-### Options
+### Mass batch mode (recursive, multi-creator)
+
+For large content libraries with nested subdirectories per creator:
+
+```bash
+python batch_transcribe.py \
+    --content-root "A:\Education\Artist Streams & Content" \
+    --output-dir "C:\Users\jlightner\chrysopedia\transcripts" \
+    --python C:\Users\jlightner\.conda\envs\transcribe\python.exe
+
+# Dry run to preview without transcribing:
+python batch_transcribe.py --content-root ... --output-dir ... --dry-run
+```
+
+`batch_transcribe.py` recursively walks all subdirectories, discovers video
+files, and calls `transcribe.py` for each directory. The `creator_folder`
+field in the output JSON is set to the top-level subdirectory name (the
+artist/creator). Output directory structure mirrors the source hierarchy.
+
+A `batch_manifest.json` is written to the output root on completion with
+timing, per-creator results, and error details.
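+A sketch of the manifest's shape may help reviewers; the field names below are
+the ones `batch_transcribe.py` writes, but every value is illustrative, not
+real output:

```json
{
  "started_at": "2025-01-01T00:00:00+00:00",
  "content_root": "A:\\Education\\Artist Streams & Content",
  "output_root": "C:\\Users\\jlightner\\chrysopedia\\transcripts",
  "model": "large-v3",
  "device": "cuda",
  "results": [
    {
      "creator": "au5",
      "rel_path": "au5\\Youtube",
      "videos": 42,
      "new_transcripts": 40,
      "total_transcripts": 42,
      "exit_code": 0,
      "elapsed_seconds": 5123.4
    }
  ],
  "completed_at": "2025-01-02T00:00:00+00:00",
  "elapsed_seconds": 90000.0,
  "summary": { "processed": 980, "skipped": 210, "failed_dirs": 1 }
}
```

Entries that failed carry an `"error"` key (or a non-zero `"exit_code"`)
instead of transcript counts.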
+
+### Options (transcribe.py)
 
 | Flag            | Default     | Description                                     |
 | --------------- | ----------- | ----------------------------------------------- |
@@ -51,9 +79,21 @@ python transcribe.py --input ./videos/ --output-dir ./transcripts
 | `--creator`     | (inferred)  | Override creator folder name in output JSON     |
 | `-v, --verbose` | off         | Enable debug logging                            |
 
+### Options (batch_transcribe.py)
+
+| Flag              | Default      | Description                                      |
+| ----------------- | ------------ | ------------------------------------------------ |
+| `--content-root`  | (required)   | Root directory with creator subdirectories       |
+| `--output-dir`    | (required)   | Root output directory for transcript JSONs       |
+| `--script`        | (auto)       | Path to transcribe.py (default: same directory)  |
+| `--python`        | (auto)       | Python interpreter to use                        |
+| `--model`         | `large-v3`   | Whisper model name                               |
+| `--device`        | `cuda`       | Compute device                                   |
+| `--dry-run`       | off          | Preview work plan without transcribing           |
+
 ## Output Format
 
-Each video produces a JSON file matching the Chrysopedia spec:
+Each video produces a JSON file matching the Chrysopedia pipeline spec:
 
 ```json
 {
@@ -74,29 +114,106 @@ Each video produces a JSON file matching the Chrysopedia spec:
 }
 ```
 
+This format is consumed directly by the Chrysopedia pipeline stage 2
+(transcript segmentation) via the `POST /api/v1/ingest` endpoint.
+
 ## Resumability
 
-The script automatically skips videos whose output JSON already exists. To
+Both scripts automatically skip videos whose output JSON already exists. To
 re-transcribe a file, delete its output JSON first.
+## Current Transcription Environment
+
+### Machine: HAL0022 (10.0.0.131)
+
+- **GPU:** NVIDIA GeForce RTX 4090 (24GB VRAM)
+- **OS:** Windows 11
+- **Python:** Conda env `transcribe` at `C:\Users\jlightner\.conda\envs\transcribe\python.exe`
+- **CUDA:** PyTorch with cu126 wheels
+
+### Content Source
+
+```
+A:\Education\Artist Streams & Content\
+├── au5/            (334 videos)
+├── Keota/          (193 videos)
+├── DJ Shortee/     (83 videos)
+├── KOAN Sound/     (68 videos)
+├── Teddy Killerz/  (62 videos)
+└── ...             (42 creators, 1197 videos total across 146 directories)
+```
+
+### Transcript Output Location
+
+```
+C:\Users\jlightner\chrysopedia\transcripts\
+```
+
+Directory structure mirrors the source hierarchy. Each video produces a
+`.json` transcript file.
+
+**Transfer to ub01:** Transcripts need to be copied to
+`/vmPool/r/services/chrysopedia_data/transcripts/` on ub01 for pipeline
+ingestion. This can be done via SMB (`\\ub01\vmPool\services\chrysopedia_data\transcripts`)
+or via `scp`/`rsync` from a Linux machine with access to both.
+
+### Running the Batch Job
+
+The batch transcription runs as a Windows Scheduled Task to survive SSH
+disconnections:
+
+```powershell
+# The task is already created.
+# To re-run:
+schtasks /run /tn "ChrysopediaTranscribe"
+
+# Check status:
+schtasks /query /tn "ChrysopediaTranscribe" /v /fo list | findstr /i "status result"
+
+# Monitor log:
+Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 30
+
+# Or follow live:
+Get-Content 'C:\Users\jlightner\chrysopedia\transcription.log' -Tail 20 -Wait
+```
+
+### Scripts on HAL0022
+
+```
+C:\Users\jlightner\chrysopedia\
+├── transcribe.py            # Single-file/directory transcription
+├── batch_transcribe.py      # Recursive multi-creator batch runner
+├── run_transcription.bat    # Batch file invoked by scheduled task
+├── launch_transcription.py  # Alternative launcher (subprocess)
+├── transcription.log        # Current batch run log output
+└── transcripts/             # Output directory
+    ├── batch_manifest.json
+    ├── au5/
+    ├── Break/
+    └── ...
+```
+
 ## Performance
 
 Whisper large-v3 on an RTX 4090 processes audio at roughly 10–20× real-time.
-A 2-hour video takes ~6–12 minutes. For 300 videos averaging 1.5 hours each,
-the initial transcription pass takes roughly 15–40 hours of GPU time.
+A 2-hour video takes ~6–12 minutes. For the full 1,197-video library, expect
+roughly 20–60 hours of GPU time depending on average video length.
 
 ## Directory Convention
 
 The script infers the `creator_folder` field from the parent directory of each
-video file. Organize videos like:
+video file (or the top-level creator folder in batch mode). Organize videos like:
 
 ```
-videos/
+content-root/
 ├── Skope/
-│   ├── Sound Design Masterclass pt1.mp4
-│   └── Sound Design Masterclass pt2.mp4
+│   ├── Youtube/
+│   │   ├── Sound Design Masterclass pt1.mp4
+│   │   └── Sound Design Masterclass pt2.mp4
+│   └── Patreon/
+│       └── Advanced Wavetables.mp4
 ├── Mr Bill/
-│   └── Glitch Techniques.mp4
+│   └── Youtube/
+│       └── Glitch Techniques.mp4
 ```
 
 Override with `--creator` when processing files outside this structure.
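+The inference rule can be sketched in a few lines; `infer_creator` is an
+illustrative helper mirroring the rule in `batch_transcribe.py`, not a function
+from either script:

```python
from pathlib import Path


def infer_creator(video: Path, content_root: Path) -> str:
    """Creator is the first path component under the content root,
    regardless of nesting depth (e.g. Skope/Youtube/clip.mp4 -> 'Skope').
    """
    rel = video.relative_to(content_root)
    # A file sitting directly in the root has no creator folder;
    # fall back to the root folder's own name for illustration.
    return rel.parts[0] if len(rel.parts) > 1 else content_root.name
```

+This is why `--creator` only needs overriding for files that live outside the
+`content-root/<creator>/...` layout.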
diff --git a/whisper/batch_transcribe.py b/whisper/batch_transcribe.py
new file mode 100644
index 0000000..6e536fa
--- /dev/null
+++ b/whisper/batch_transcribe.py
@@ -0,0 +1,319 @@
+#!/usr/bin/env python3
+"""
+Chrysopedia — Batch Transcription Runner
+
+Recursively iterates all creator subdirectories under a content root and runs
+transcribe.py against each leaf directory containing videos. Outputs are
+written to a parallel directory structure under --output-dir that mirrors
+the source hierarchy:
+
+    output-dir/
+    ├── au5/
+    │   ├── Patreon/
+    │   │   ├── video1.mp4.json
+    │   │   └── video2.mp4.json
+    │   └── Youtube/
+    │       └── ...
+    ├── Skope/
+    │   └── ...
+
+The --creator flag passed to transcribe.py is always the top-level folder
+name (the artist), regardless of subdirectory nesting depth.
+
+Resumable: transcribe.py already skips videos whose output JSON exists.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import logging
+import os
+import subprocess
+import sys
+import time
+from datetime import datetime, timezone
+from pathlib import Path
+
+LOG_FORMAT = "%(asctime)s [%(levelname)s] %(message)s"
+logging.basicConfig(format=LOG_FORMAT, level=logging.INFO)
+logger = logging.getLogger("chrysopedia.batch_transcribe")
+
+SUPPORTED_EXTENSIONS = {".mp4", ".mkv", ".avi", ".mov", ".webm", ".flv", ".wmv"}
+
+
+def find_video_dirs(content_root: Path) -> list[dict]:
+    """
+    Recursively find all directories containing video files.
+    Returns list of dicts with folder path, creator name, and video count.
+ """ + results = [] + for dirpath, dirnames, filenames in os.walk(content_root): + dirpath = Path(dirpath) + videos = [ + f for f in filenames + if Path(f).suffix.lower() in SUPPORTED_EXTENSIONS + ] + if not videos: + continue + + # Creator is always the top-level subdirectory under content_root + rel = dirpath.relative_to(content_root) + creator = rel.parts[0] + + results.append({ + "folder": dirpath, + "creator": creator, + "rel_path": rel, + "video_count": len(videos), + }) + + return sorted(results, key=lambda x: (x["creator"], str(x["rel_path"]))) + + +def count_existing_transcripts(output_dir: Path) -> int: + """Count existing transcript JSONs in an output folder.""" + if not output_dir.exists(): + return 0 + return sum(1 for p in output_dir.iterdir() if p.suffix == ".json") + + +def main() -> int: + parser = argparse.ArgumentParser( + description="Batch transcription runner for Chrysopedia" + ) + parser.add_argument( + "--content-root", + required=True, + help="Root directory containing creator subdirectories with videos", + ) + parser.add_argument( + "--output-dir", + required=True, + help="Root output directory for transcript JSONs", + ) + parser.add_argument( + "--script", + default=None, + help="Path to transcribe.py (default: same directory as this script)", + ) + parser.add_argument( + "--python", + default=sys.executable, + help="Python interpreter to use (default: current interpreter)", + ) + parser.add_argument( + "--model", + default="large-v3", + help="Whisper model name (default: large-v3)", + ) + parser.add_argument( + "--device", + default="cuda", + help="Compute device (default: cuda)", + ) + parser.add_argument( + "--dry-run", + action="store_true", + help="List what would be transcribed without running", + ) + args = parser.parse_args() + + content_root = Path(args.content_root) + output_root = Path(args.output_dir) + script_path = Path(args.script) if args.script else Path(__file__).parent / "transcribe.py" + + if not 
+        logger.error("Content root does not exist: %s", content_root)
+        return 1
+
+    if not script_path.is_file():
+        logger.error("transcribe.py not found at: %s", script_path)
+        return 1
+
+    # Discover all directories with videos (recursive)
+    video_dirs = find_video_dirs(content_root)
+
+    if not video_dirs:
+        logger.warning("No video files found anywhere under %s", content_root)
+        return 0
+
+    # Build work plan with existing transcript counts
+    total_videos = 0
+    total_existing = 0
+    work_plan = []
+
+    for item in video_dirs:
+        # Output mirrors the relative source structure
+        out_dir = output_root / item["rel_path"]
+        n_existing = count_existing_transcripts(out_dir)
+        n_remaining = item["video_count"] - n_existing
+
+        total_videos += item["video_count"]
+        total_existing += n_existing
+
+        work_plan.append({
+            **item,
+            "output": out_dir,
+            "existing": n_existing,
+            "remaining": n_remaining,
+        })
+
+    # Aggregate by creator for summary
+    creator_stats = {}
+    for item in work_plan:
+        c = item["creator"]
+        if c not in creator_stats:
+            creator_stats[c] = {"videos": 0, "existing": 0, "remaining": 0, "dirs": 0}
+        creator_stats[c]["videos"] += item["video_count"]
+        creator_stats[c]["existing"] += item["existing"]
+        creator_stats[c]["remaining"] += item["remaining"]
+        creator_stats[c]["dirs"] += 1
+
+    logger.info("=" * 70)
+    logger.info("BATCH TRANSCRIPTION PLAN")
+    logger.info("=" * 70)
+    logger.info("Content root: %s", content_root)
+    logger.info("Output root: %s", output_root)
+    logger.info("Creators: %d", len(creator_stats))
+    logger.info("Directories: %d", len(work_plan))
+    logger.info("Total videos: %d", total_videos)
+    logger.info("Already done: %d", total_existing)
+    logger.info("Remaining: %d", total_videos - total_existing)
+    logger.info("=" * 70)
+
+    for name in sorted(creator_stats.keys()):
+        s = creator_stats[name]
+        status = "DONE" if s["remaining"] == 0 else f"{s['remaining']} to do"
+        logger.info(
+            " %-35s %4d videos (%d dirs), %4d done [%s]",
[%s]", + name, s["videos"], s["dirs"], s["existing"], status, + ) + + logger.info("=" * 70) + + if args.dry_run: + logger.info("DRY RUN — exiting without transcribing.") + return 0 + + # Execute + manifest = { + "started_at": datetime.now(timezone.utc).isoformat(), + "content_root": str(content_root), + "output_root": str(output_root), + "model": args.model, + "device": args.device, + "results": [], + } + + total_processed = 0 + total_skipped = 0 + total_failed_dirs = 0 + batch_start = time.time() + + for i, item in enumerate(work_plan, 1): + if item["remaining"] == 0: + logger.info("[%d/%d] SKIP %s (all %d videos already transcribed)", + i, len(work_plan), item["rel_path"], item["video_count"]) + total_skipped += item["video_count"] + continue + + logger.info("=" * 70) + logger.info("[%d/%d] TRANSCRIBING: %s (creator: %s)", + i, len(work_plan), item["rel_path"], item["creator"]) + logger.info(" %d videos, %d remaining", + item["video_count"], item["remaining"]) + logger.info("=" * 70) + + dir_start = time.time() + + cmd = [ + args.python, + str(script_path), + "--input", str(item["folder"]), + "--output-dir", str(item["output"]), + "--model", args.model, + "--device", args.device, + "--creator", item["creator"], + ] + + try: + result = subprocess.run( + cmd, + capture_output=False, + text=True, + timeout=14400, # 4-hour timeout per directory + ) + dir_elapsed = time.time() - dir_start + + n_after = count_existing_transcripts(item["output"]) + n_new = n_after - item["existing"] + + manifest["results"].append({ + "creator": item["creator"], + "rel_path": str(item["rel_path"]), + "videos": item["video_count"], + "new_transcripts": n_new, + "total_transcripts": n_after, + "exit_code": result.returncode, + "elapsed_seconds": round(dir_elapsed, 1), + }) + + if result.returncode == 0: + total_processed += n_new + total_skipped += item["existing"] + logger.info("Completed %s: %d new transcripts in %.1f s", + item["rel_path"], n_new, dir_elapsed) + else: + 
+                total_failed_dirs += 1
+                logger.error("FAILED %s (exit code %d) after %.1f s",
+                             item["rel_path"], result.returncode, dir_elapsed)
+
+        except subprocess.TimeoutExpired:
+            total_failed_dirs += 1
+            logger.error("TIMEOUT: %s exceeded 4-hour limit", item["rel_path"])
+            manifest["results"].append({
+                "creator": item["creator"],
+                "rel_path": str(item["rel_path"]),
+                "videos": item["video_count"],
+                "error": "timeout",
+            })
+        except Exception as exc:
+            total_failed_dirs += 1
+            logger.exception("ERROR processing %s: %s", item["rel_path"], exc)
+            manifest["results"].append({
+                "creator": item["creator"],
+                "rel_path": str(item["rel_path"]),
+                "videos": item["video_count"],
+                "error": str(exc),
+            })
+
+    batch_elapsed = time.time() - batch_start
+    manifest["completed_at"] = datetime.now(timezone.utc).isoformat()
+    manifest["elapsed_seconds"] = round(batch_elapsed, 1)
+    manifest["summary"] = {
+        "processed": total_processed,
+        "skipped": total_skipped,
+        "failed_dirs": total_failed_dirs,
+    }
+
+    # Write manifest
+    manifest_path = output_root / "batch_manifest.json"
+    output_root.mkdir(parents=True, exist_ok=True)
+    with open(manifest_path, "w", encoding="utf-8") as f:
+        json.dump(manifest, f, indent=2, ensure_ascii=False)
+
+    logger.info("=" * 70)
+    logger.info("BATCH COMPLETE")
+    logger.info("  Processed:   %d new transcripts", total_processed)
+    logger.info("  Skipped:     %d (already existed)", total_skipped)
+    logger.info("  Failed dirs: %d", total_failed_dirs)
+    logger.info("  Elapsed:     %.1f s (%.1f hours)", batch_elapsed, batch_elapsed / 3600)
+    logger.info("  Manifest:    %s", manifest_path)
+    logger.info("=" * 70)
+
+    return 1 if total_failed_dirs > 0 else 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())