diff --git a/.planning/phases/01-core-pipeline/01-RESEARCH.md b/.planning/phases/01-core-pipeline/01-RESEARCH.md new file mode 100644 index 0000000..400c065 --- /dev/null +++ b/.planning/phases/01-core-pipeline/01-RESEARCH.md @@ -0,0 +1,374 @@ +# Phase 1: Core Pipeline - Research + +**Researched:** 2026-04-11 +**Domain:** ACE-Step cover mode CLI wrapper, audio I/O, Python scripting +**Confidence:** HIGH + +## Summary + +Phase 1 wraps the existing ACE-Step 1.5 XL-SFT cover mode into a single `hum2inst.py` script. The user's prior experimentation has already validated the entire generation pipeline end-to-end: raw humming WAV goes into ACE-Step cover mode, a caption describes the target instrument, and the output preserves melody contour and rhythm. The technical challenge is purely integration and UX: reading the input WAV duration, constructing the right `GenerationParams`, initializing the handlers, calling `generate_music()`, and saving the output with a meaningful filename. + +The ACE-Step codebase exposes a clean Python API via `acestep.inference.generate_music()`, `acestep.handler.AceStepHandler`, and `acestep.llm_inference.LLMHandler`. Cover mode does NOT require the LLM handler (it's in the `skip_lm_tasks` set), which simplifies initialization. The script needs only the DiT handler with the XL-SFT model. + +**Primary recommendation:** Import ACE-Step's Python API directly (not subprocess). Initialize `AceStepHandler` + dummy `LLMHandler`, configure `GenerationParams` with `task_type="cover"`, and call `generate_music()`. Use `torchaudio` (already in the ACE-Step venv) to read input WAV duration. + + +## User Constraints (from CONTEXT.md) + +### Locked Decisions +- Invoked as `python hum2inst.py input.wav --instrument piano` +- Python script directly, no installation step +- `--instrument` as a named CLI flag (not positional) +- `--output` flag optional, defaults to `./output/` directory +- Use Python argparse for argument parsing (gives --help for free) +- Default cover_strength in the high fidelity range (0.8-1.0) +- `--strength` flag exposed in Phase 1 so users can experiment immediately +- Duration matches input WAV length by default; `--duration` flag to override +- Caption auto-built from instrument name (e.g., "piano cover of a melody") -- no custom prompt flag in Phase 1 +- Output filename includes instrument and timestamp (exact format at Claude's discretion) +- On generation failure or silence: print clear error message, exit with non-zero code +- No auto-play -- just save and print the output path +- No silent failures +- Single `hum2inst.py` script -- no module splitting in Phase 1 +- Assume CUDA GPU is available; fail with clear message if no GPU detected +- Move existing experimental scripts (midi_to_audio.py, musicgen_melody.py) to an `/archive` folder + +### Claude's Discretion +- ACE-Step invocation method (import Python API vs subprocess call -- choose based on what ACE-Step exposes) +- Progress/feedback during generation (print statements, progress bar, or similar -- pick what's appropriate) +- Exact output filename format (instrument + timestamp pattern) +- Exact cover_strength default value within the 0.8-1.0 range + +### Deferred Ideas (OUT OF SCOPE) +None -- discussion stayed within phase scope + + + +## Phase Requirements + +| ID | Description | Research Support | +|----|-------------|-----------------| +| MEL-01 | Output audio audibly follows the pitch contour of the hummed input melody | ACE-Step cover mode with audio_cover_strength 0.8-1.0 preserves pitch contour from source. Validated in prior testing (File 10 comparison). | +| MEL-02 | Output audio preserves the rhythmic timing and phrasing of the hummed input | Cover mode encodes source into VAE latents, preserving temporal structure. Duration matching ensures timing alignment. | +| MEL-04 | Output is musically coherent -- sounds like a real instrument performance | XL-SFT model (50 inference steps) produces coherent instrument audio. Caption guides timbre. | +| INST-01 | User can specify target instrument via text prompt | `--instrument` flag maps to caption string (e.g., "solo acoustic piano, gentle melody"). | +| INP-01 | Pipeline accepts raw humming WAV audio as input with no manual preprocessing | Cover mode takes raw audio directly via `src_audio` parameter. No MIDI extraction needed. | +| OUT-02 | Output is saved as WAV file to a user-specified or default output directory | `generate_music()` saves to `save_dir`. Script renames output file with instrument+timestamp. | +| PIPE-01 | Single CLI command or script invocation to go from humming WAV to instrument output | Single `python hum2inst.py input.wav --instrument piano` command. | + + +## Standard Stack + +### Core +| Library | Version | Purpose | Why Standard | +|---------|---------|---------|--------------| +| ACE-Step (acestep) | 1.5 XL-SFT | Music generation via cover mode | Already installed, validated for this exact use case | +| torch | 2.7.1+cu128 | GPU compute, model inference | Already in ace-step/.venv | +| torchaudio | 2.7.1+cu128 | WAV file reading (duration detection) | Already in ace-step/.venv, native torch integration | +| argparse | stdlib | CLI argument parsing | User-specified, gives --help free | + +### Supporting +| Library | Version | Purpose | When to Use | +|---------|---------|---------|-------------| +| datetime | stdlib | Timestamp generation for output filenames | Always (output naming) | +| pathlib/os | stdlib | Path manipulation, directory creation | Always (file I/O) | +| sys | stdlib | Exit codes, stderr output | Error handling | + +### Alternatives Considered +| Instead of | Could Use | Tradeoff | +|------------|-----------|----------| +| Direct Python API import | subprocess calling `cli.py` | Subprocess adds overhead, harder error handling, but avoids import complexity. **Recommendation: use direct import** -- ACE-Step exposes clean Python API via `acestep.inference.generate_music()` | +| torchaudio for duration | wave stdlib module | wave is simpler but torchaudio is already loaded and handles more formats | +| argparse | click/typer | User specifically chose argparse for simplicity and --help | + +**Installation:** +No new packages needed. Everything runs inside the existing `ace-step/.venv` environment. + +## Architecture Patterns + +### Recommended Project Structure +``` +AiMusicPipeline/ +├── hum2inst.py # Single CLI entry point (Phase 1) +├── ace-step/ # ACE-Step installation (existing) +│ ├── acestep/ # ACE-Step Python package +│ ├── checkpoints/ # Model weights (XL-SFT, VAE) +│ └── .venv/ # Python environment +├── input/ # User's humming WAV files +├── output/ # Generated instrument audio +└── archive/ # Moved experimental scripts + ├── midi_to_audio.py + └── musicgen_melody.py +``` + +### Pattern 1: ACE-Step Python API Direct Import +**What:** Import `AceStepHandler`, `LLMHandler`, `GenerationParams`, `GenerationConfig`, and `generate_music` directly from the `acestep` package. +**When to use:** Always -- this is the recommended invocation method. +**Key finding:** Cover mode is in the `skip_lm_tasks` set, meaning the LLM handler is instantiated but NOT loaded with a model. This simplifies initialization significantly. + +**Initialization sequence (from cli.py analysis):** +```python +import sys +import os + +# Add ace-step to path so acestep package is importable +sys.path.insert(0, os.path.join(os.path.dirname(__file__), "ace-step")) + +from acestep.handler import AceStepHandler +from acestep.llm_inference import LLMHandler +from acestep.inference import GenerationParams, GenerationConfig, generate_music +from acestep.gpu_config import get_gpu_config, set_global_gpu_config + +# GPU setup +gpu_config = get_gpu_config() +set_global_gpu_config(gpu_config) + +# Initialize handlers +dit_handler = AceStepHandler() +llm_handler = LLMHandler() # Not loaded for cover mode -- just instantiated + +# Initialize DiT with XL-SFT model +dit_handler.initialize_service( + project_root=os.path.join(os.path.dirname(__file__), "ace-step"), + config_path="acestep-v15-xl-sft", + device="cuda", + use_flash_attention=dit_handler.is_flash_attention_available("cuda"), +) +``` + +### Pattern 2: Cover Mode Generation Parameters +**What:** Configure `GenerationParams` for cover mode with raw humming input. +**Validated configuration (from prior testing):** +```python +params = GenerationParams( + task_type="cover", + src_audio="path/to/humming.wav", + caption="solo acoustic piano, gentle melody, warm tone", + lyrics="", + instrumental=True, + duration=10, # Match input WAV length + bpm=120, # Default; not critical for cover mode + audio_cover_strength=0.9, # High fidelity to source melody + inference_steps=50, # XL-SFT model uses 50 steps + guidance_scale=5.0, + thinking=False, # No LLM needed for cover +) + +config = GenerationConfig( + batch_size=1, + use_random_seed=True, + audio_format="wav", +) + +result = generate_music( + dit_handler, llm_handler, params, config, + save_dir="./output" +) +``` + +### Pattern 3: Input WAV Duration Detection +**What:** Read input WAV to determine output duration automatically. +```python +import torchaudio + +def get_wav_duration(wav_path: str) -> float: + info = torchaudio.info(wav_path) + return info.num_frames / info.sample_rate +``` + +### Pattern 4: Caption Construction from Instrument Name +**What:** Build the ACE-Step caption from a simple instrument name. +```python +def build_caption(instrument: str) -> str: + return f"solo {instrument}, clear and expressive melody, warm tone" +``` +**Note:** The caption strongly influences output timbre. Keep it focused on the instrument. Adding genre/mood modifiers is deferred to later phases. + +### Anti-Patterns to Avoid +- **Subprocess invocation of cli.py:** cli.py has interactive wizard prompts that will hang in non-interactive mode. Use the Python API directly. +- **Loading LLM model for cover mode:** Cover and repaint tasks are in `skip_lm_tasks`. The LLM handler must be instantiated (generate_music expects it) but should NOT have its model loaded -- this saves ~5GB VRAM and ~10s startup time. +- **Setting `thinking=True` for cover mode:** This triggers LLM inference which is skipped anyway for cover, but may cause errors if no LLM model is loaded. +- **Hardcoding `inference_steps=8`:** That's the turbo model default. XL-SFT needs 50 steps for quality output. + +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| Audio file saving | Custom WAV writer | `generate_music()` with `save_dir` parameter | ACE-Step handles format, sample rate, normalization internally | +| Model loading/initialization | Manual weight loading | `AceStepHandler.initialize_service()` | Handles config paths, checkpoints, device placement, flash attention | +| Audio duration detection | Manual WAV header parsing | `torchaudio.info()` | Handles all formats, already in dependencies | +| GPU detection | Custom CUDA checks | `acestep.gpu_config.get_gpu_config()` | Already handles CUDA/MPS/CPU detection with memory tier logic | + +**Key insight:** ACE-Step's existing Python API handles all the complex audio/ML plumbing. The wrapper script's job is purely argument parsing, caption construction, and output file management. + +## Common Pitfalls + +### Pitfall 1: sys.path and Package Import Order +**What goes wrong:** `acestep` package is not on Python's path since hum2inst.py lives outside the ace-step directory. +**Why it happens:** ACE-Step is installed as editable in its own venv, but hum2inst.py is at the project root. +**How to avoid:** Add `sys.path.insert(0, "ace-step")` before importing from `acestep`. Alternatively, ensure the script is run from within the ace-step venv which has acestep installed as editable package. +**Warning signs:** `ModuleNotFoundError: No module named 'acestep'` + +### Pitfall 2: Output File Renaming Race +**What goes wrong:** `generate_music()` saves output with a UUID-based filename. The script needs to rename it to include instrument+timestamp. +**Why it happens:** The `generate_music` API uses `generate_uuid_from_params()` for filenames, not user-friendly names. +**How to avoid:** After `generate_music()` returns, read `result.audios[0]["path"]` to get the saved path, then rename/copy to the desired filename. Or pass a custom `save_dir` per-run and rename afterward. +**Warning signs:** Output files named like `a3f2b8c1...wav` instead of `piano_20260411_143022.wav` + +### Pitfall 3: Duration Mismatch +**What goes wrong:** Output audio is much longer or shorter than the input humming, causing melody to stretch or truncate. +**Why it happens:** Not setting `duration` parameter, or setting it to -1 (auto), lets the model choose its own duration. +**How to avoid:** Always measure input WAV duration with `torchaudio.info()` and pass it as the `duration` parameter. Round to nearest integer (ACE-Step duration is in seconds). +**Warning signs:** 10-second humming produces 30-second output, or vice versa. + +### Pitfall 4: Silence Detection +**What goes wrong:** Model occasionally generates near-silence, especially with very short inputs or unusual timbres. +**Why it happens:** Cover mode's noise injection at high audio_cover_strength sometimes produces degenerate outputs. +**How to avoid:** After generation, check if the output audio tensor (from `result.audios[0]["tensor"]`) has RMS energy above a threshold. If too quiet, report error. +**Warning signs:** Output WAV file exists but plays as silence or very faint noise. + +### Pitfall 5: CUDA Not Available +**What goes wrong:** Script crashes with CUDA errors on a machine without GPU. +**Why it happens:** User decision is to assume CUDA and fail clearly. +**How to avoid:** Check `torch.cuda.is_available()` early and exit with a clear message before any model loading. +**Warning signs:** RuntimeError about CUDA device. + +## Code Examples + +### Complete Invocation Flow (verified from cli.py source) + +```python +# 1. Check CUDA +import torch +if not torch.cuda.is_available(): + print("ERROR: CUDA GPU required. No CUDA-capable GPU detected.", file=sys.stderr) + sys.exit(1) + +# 2. Get input duration +import torchaudio +info = torchaudio.info(input_wav_path) +duration = info.num_frames / info.sample_rate + +# 3. Initialize handlers (from cli.py lines 1318-1319, 1411-1419) +from acestep.gpu_config import get_gpu_config, set_global_gpu_config +gpu_config = get_gpu_config() +set_global_gpu_config(gpu_config) + +dit_handler = AceStepHandler() +llm_handler = LLMHandler() + +dit_handler.initialize_service( + project_root="./ace-step", + config_path="acestep-v15-xl-sft", + device="cuda", + use_flash_attention=dit_handler.is_flash_attention_available("cuda"), +) + +# 4. Configure generation (from prior testing + cli.py lines 1629-1676) +params = GenerationParams( + task_type="cover", + src_audio=str(input_wav_path), + caption=f"solo {instrument}, clear and expressive melody", + lyrics="", + instrumental=True, + duration=round(duration), + audio_cover_strength=0.9, + inference_steps=50, + guidance_scale=5.0, + thinking=False, +) + +config = GenerationConfig( + batch_size=1, + use_random_seed=True, + audio_format="wav", +) + +# 5. Generate +result = generate_music(dit_handler, llm_handler, params, config, save_dir=output_dir) + +# 6. Check result +if not result.success or not result.audios: + print(f"ERROR: Generation failed: {result.error or 'no audio produced'}", file=sys.stderr) + sys.exit(1) + +output_path = result.audios[0]["path"] +``` + +### Caption Construction Examples +```python +# Simple instrument mapping -- keep captions focused on instrument +CAPTION_TEMPLATES = { + "piano": "solo acoustic piano, gentle melody, warm tone, clear and expressive", + "guitar": "solo acoustic guitar, fingerpicked melody, warm and intimate", + "saxophone": "solo saxophone, smooth jazz melody, soulful and expressive", + "violin": "solo violin, classical melody, rich and emotional", + "flute": "solo flute, gentle melody, airy and delicate", +} + +def build_caption(instrument: str) -> str: + instrument_lower = instrument.lower().strip() + if instrument_lower in CAPTION_TEMPLATES: + return CAPTION_TEMPLATES[instrument_lower] + # Generic fallback for any instrument name + return f"solo {instrument_lower}, clear and expressive melody, warm tone" +``` + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| MusicGen melody conditioning | ACE-Step XL-SFT cover mode | 2026-04-11 testing | MusicGen chromagram conditioning too lossy; ACE-Step preserves melody+rhythm | +| MIDI extraction pipeline | Direct raw audio input | 2026-04-11 testing | No intermediate steps needed; raw humming goes directly into cover mode | +| Turbo model (8 steps) | XL-SFT model (50 steps) | 2026-04-11 testing | Better quality for cover mode; ~3s generation on RTX 4090 | + +**Deprecated/outdated:** +- `musicgen_melody.py`: MusicGen approach abandoned. Moving to `/archive`. +- `midi_to_audio.py`: MIDI synthesis no longer needed. Moving to `/archive`. + +## Open Questions + +1. **Optimal audio_cover_strength default** + - What we know: 0.8 worked well in testing. Range 0.8-1.0 specified by user. + - What's unclear: Whether 0.85 or 0.9 might be more universally good across different instruments. + - Recommendation: Default to 0.9 (high fidelity bias per user preference). Expose `--strength` flag so users can tune. + +2. **Caption quality impact on different instruments** + - What we know: Piano captions worked well in testing. + - What's unclear: How well generic captions work for uncommon instruments (e.g., "solo theremin"). + - Recommendation: Include a small set of curated caption templates for common instruments, with a generic fallback for anything else. + +3. **Output silence detection threshold** + - What we know: User wants clear error on silence/failure. + - What's unclear: What RMS threshold constitutes "silence" vs "very quiet music." + - Recommendation: Use a conservative threshold (e.g., RMS < -60 dBFS). Log a warning rather than hard-failing for borderline cases. + +4. **Project root path resolution** + - What we know: `hum2inst.py` lives at project root, `ace-step/` is a subdirectory. + - What's unclear: Whether `initialize_service(project_root=...)` needs an absolute path. + - Recommendation: Use `os.path.abspath()` to resolve the ace-step directory path relative to the script location. + +## Sources + +### Primary (HIGH confidence) +- `ace-step/cli.py` -- Full CLI implementation showing exact initialization sequence, parameter defaults, and cover mode handling +- `ace-step/acestep/inference.py` -- `GenerationParams`, `GenerationConfig`, `GenerationResult` dataclasses and `generate_music()` function +- `ace-step/acestep/handler.py` -- `AceStepHandler` class with `initialize_service()` method +- `ace-step/acestep/constants.py` -- `TASK_INSTRUCTIONS` dictionary showing cover mode instruction +- User's prior testing notes (`pipeline_hum_to_instrument.md`) -- Validated configuration and results + +### Secondary (MEDIUM confidence) +- `ace-step/requirements.txt` -- Dependency versions for the project environment + +### Tertiary (LOW confidence) +- None -- all findings verified against actual source code + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH -- all libraries already installed and validated in prior testing +- Architecture: HIGH -- direct API import path fully traced through cli.py source code +- Pitfalls: HIGH -- identified from actual code analysis (not speculation) + +**Research date:** 2026-04-11 +**Valid until:** 2026-05-11 (stable -- ACE-Step codebase is local and pinned)