docs(core-pipeline): research phase domain

2026-04-11 02:03:30 -05:00 · 2026-04-11 02:03:30 -05:00 · b9932b08b4
commit b9932b08b4
parent 5a04fc3498
1 changed files with 374 additions and 0 deletions
--- a/.planning/phases/01-core-pipeline/01-RESEARCH.md
+++ b/.planning/phases/01-core-pipeline/01-RESEARCH.md
@ -0,0 +1,374 @@
 # Phase 1: Core Pipeline - Research
 **Researched:** 2026-04-11
 **Domain:** ACE-Step cover mode CLI wrapper, audio I/O, Python scripting
 **Confidence:** HIGH
 ## Summary
 Phase 1 wraps the existing ACE-Step 1.5 XL-SFT cover mode into a single `hum2inst.py` script. The user's prior experimentation has already validated the entire generation pipeline end-to-end: raw humming WAV goes into ACE-Step cover mode, a caption describes the target instrument, and the output preserves melody contour and rhythm. The technical challenge is purely integration and UX: reading the input WAV duration, constructing the right `GenerationParams`, initializing the handlers, calling `generate_music()`, and saving the output with a meaningful filename.
 The ACE-Step codebase exposes a clean Python API via `acestep.inference.generate_music()`, `acestep.handler.AceStepHandler`, and `acestep.llm_inference.LLMHandler`. Cover mode does NOT require the LLM handler (it's in the `skip_lm_tasks` set), which simplifies initialization. The script needs only the DiT handler with the XL-SFT model.
 **Primary recommendation:** Import ACE-Step's Python API directly (not subprocess). Initialize `AceStepHandler` + dummy `LLMHandler`, configure `GenerationParams` with `task_type="cover"`, and call `generate_music()`. Use `torchaudio` (already in the ACE-Step venv) to read input WAV duration.
 <user_constraints>
 ## User Constraints (from CONTEXT.md)
 ### Locked Decisions
 - Invoked as `python hum2inst.py input.wav --instrument piano`
 - Python script directly, no installation step
 - `--instrument` as a named CLI flag (not positional)
 - `--output` flag optional, defaults to `./output/` directory
 - Use Python argparse for argument parsing (gives --help for free)
 - Default cover_strength in the high fidelity range (0.8-1.0)
 - `--strength` flag exposed in Phase 1 so users can experiment immediately
 - Duration matches input WAV length by default; `--duration` flag to override
 - Caption auto-built from instrument name (e.g., "piano cover of a melody") -- no custom prompt flag in Phase 1
 - Output filename includes instrument and timestamp (exact format at Claude's discretion)
 - On generation failure or silence: print clear error message, exit with non-zero code
 - No auto-play -- just save and print the output path
 - No silent failures
 - Single `hum2inst.py` script -- no module splitting in Phase 1
 - Assume CUDA GPU is available; fail with clear message if no GPU detected
 - Move existing experimental scripts (midi_to_audio.py, musicgen_melody.py) to an `/archive` folder
 ### Claude's Discretion
 - ACE-Step invocation method (import Python API vs subprocess call -- choose based on what ACE-Step exposes)
 - Progress/feedback during generation (print statements, progress bar, or similar -- pick what's appropriate)
 - Exact output filename format (instrument + timestamp pattern)
 - Exact cover_strength default value within the 0.8-1.0 range
 ### Deferred Ideas (OUT OF SCOPE)
 None -- discussion stayed within phase scope
 </user_constraints>
 <phase_requirements>
 ## Phase Requirements
 | ID | Description | Research Support |
 |----|-------------|-----------------|
 | MEL-01 | Output audio audibly follows the pitch contour of the hummed input melody | ACE-Step cover mode with audio_cover_strength 0.8-1.0 preserves pitch contour from source. Validated in prior testing (File 10 comparison). |
 | MEL-02 | Output audio preserves the rhythmic timing and phrasing of the hummed input | Cover mode encodes source into VAE latents, preserving temporal structure. Duration matching ensures timing alignment. |
 | MEL-04 | Output is musically coherent -- sounds like a real instrument performance | XL-SFT model (50 inference steps) produces coherent instrument audio. Caption guides timbre. |
 | INST-01 | User can specify target instrument via text prompt | `--instrument` flag maps to caption string (e.g., "solo acoustic piano, gentle melody"). |
 | INP-01 | Pipeline accepts raw humming WAV audio as input with no manual preprocessing | Cover mode takes raw audio directly via `src_audio` parameter. No MIDI extraction needed. |
 | OUT-02 | Output is saved as WAV file to a user-specified or default output directory | `generate_music()` saves to `save_dir`. Script renames output file with instrument+timestamp. |
 | PIPE-01 | Single CLI command or script invocation to go from humming WAV to instrument output | Single `python hum2inst.py input.wav --instrument piano` command. |
 </phase_requirements>
 ## Standard Stack
 ### Core
 | Library | Version | Purpose | Why Standard |
 |---------|---------|---------|--------------|
 | ACE-Step (acestep) | 1.5 XL-SFT | Music generation via cover mode | Already installed, validated for this exact use case |
 | torch | 2.7.1+cu128 | GPU compute, model inference | Already in ace-step/.venv |
 | torchaudio | 2.7.1+cu128 | WAV file reading (duration detection) | Already in ace-step/.venv, native torch integration |
 | argparse | stdlib | CLI argument parsing | User-specified, gives --help free |
 ### Supporting
 | Library | Version | Purpose | When to Use |
 |---------|---------|---------|-------------|
 | datetime | stdlib | Timestamp generation for output filenames | Always (output naming) |
 | pathlib/os | stdlib | Path manipulation, directory creation | Always (file I/O) |
 | sys | stdlib | Exit codes, stderr output | Error handling |
 ### Alternatives Considered
 | Instead of | Could Use | Tradeoff |
 |------------|-----------|----------|
 | Direct Python API import | subprocess calling `cli.py` | Subprocess adds overhead, harder error handling, but avoids import complexity. **Recommendation: use direct import** -- ACE-Step exposes clean Python API via `acestep.inference.generate_music()` |
 | torchaudio for duration | wave stdlib module | wave is simpler but torchaudio is already loaded and handles more formats |
 | argparse | click/typer | User specifically chose argparse for simplicity and --help |
 **Installation:**
 No new packages needed. Everything runs inside the existing `ace-step/.venv` environment.
 ## Architecture Patterns
 ### Recommended Project Structure
 ```
 AiMusicPipeline/
 ├── hum2inst.py          # Single CLI entry point (Phase 1)
 ├── ace-step/            # ACE-Step installation (existing)
 │   ├── acestep/         # ACE-Step Python package
 │   ├── checkpoints/     # Model weights (XL-SFT, VAE)
 │   └── .venv/           # Python environment
 ├── input/               # User's humming WAV files
 ├── output/              # Generated instrument audio
 └── archive/             # Moved experimental scripts
    ├── midi_to_audio.py
    └── musicgen_melody.py
 ```
 ### Pattern 1: ACE-Step Python API Direct Import
 **What:** Import `AceStepHandler`, `LLMHandler`, `GenerationParams`, `GenerationConfig`, and `generate_music` directly from the `acestep` package.
 **When to use:** Always -- this is the recommended invocation method.
 **Key finding:** Cover mode is in the `skip_lm_tasks` set, meaning the LLM handler is instantiated but NOT loaded with a model. This simplifies initialization significantly.
 **Initialization sequence (from cli.py analysis):**
 ```python
 import sys
 import os
 # Add ace-step to path so acestep package is importable
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), "ace-step"))
 from acestep.handler import AceStepHandler
 from acestep.llm_inference import LLMHandler
 from acestep.inference import GenerationParams, GenerationConfig, generate_music
 from acestep.gpu_config import get_gpu_config, set_global_gpu_config
 # GPU setup
 gpu_config = get_gpu_config()
 set_global_gpu_config(gpu_config)
 # Initialize handlers
 dit_handler = AceStepHandler()
 llm_handler = LLMHandler()  # Not loaded for cover mode -- just instantiated
 # Initialize DiT with XL-SFT model
 dit_handler.initialize_service(
    project_root=os.path.join(os.path.dirname(__file__), "ace-step"),
    config_path="acestep-v15-xl-sft",
    device="cuda",
    use_flash_attention=dit_handler.is_flash_attention_available("cuda"),
 )
 ```
 ### Pattern 2: Cover Mode Generation Parameters
 **What:** Configure `GenerationParams` for cover mode with raw humming input.
 **Validated configuration (from prior testing):**
 ```python
 params = GenerationParams(
    task_type="cover",
    src_audio="path/to/humming.wav",
    caption="solo acoustic piano, gentle melody, warm tone",
    lyrics="",
    instrumental=True,
    duration=10,        # Match input WAV length
    bpm=120,            # Default; not critical for cover mode
    audio_cover_strength=0.9,  # High fidelity to source melody
    inference_steps=50,        # XL-SFT model uses 50 steps
    guidance_scale=5.0,
    thinking=False,            # No LLM needed for cover
 )
 config = GenerationConfig(
    batch_size=1,
    use_random_seed=True,
    audio_format="wav",
 )
 result = generate_music(
    dit_handler, llm_handler, params, config,
    save_dir="./output"
 )
 ```
 ### Pattern 3: Input WAV Duration Detection
 **What:** Read input WAV to determine output duration automatically.
 ```python
 import torchaudio
 def get_wav_duration(wav_path: str) -> float:
    info = torchaudio.info(wav_path)
    return info.num_frames / info.sample_rate
 ```
 ### Pattern 4: Caption Construction from Instrument Name
 **What:** Build the ACE-Step caption from a simple instrument name.
 ```python
 def build_caption(instrument: str) -> str:
    return f"solo {instrument}, clear and expressive melody, warm tone"
 ```
 **Note:** The caption strongly influences output timbre. Keep it focused on the instrument. Adding genre/mood modifiers is deferred to later phases.
 ### Anti-Patterns to Avoid
 - **Subprocess invocation of cli.py:** cli.py has interactive wizard prompts that will hang in non-interactive mode. Use the Python API directly.
 - **Loading LLM model for cover mode:** Cover and repaint tasks are in `skip_lm_tasks`. The LLM handler must be instantiated (generate_music expects it) but should NOT have its model loaded -- this saves ~5GB VRAM and ~10s startup time.
 - **Setting `thinking=True` for cover mode:** This triggers LLM inference which is skipped anyway for cover, but may cause errors if no LLM model is loaded.
 - **Hardcoding `inference_steps=8`:** That's the turbo model default. XL-SFT needs 50 steps for quality output.
 ## Don't Hand-Roll
 | Problem | Don't Build | Use Instead | Why |
 |---------|-------------|-------------|-----|
 | Audio file saving | Custom WAV writer | `generate_music()` with `save_dir` parameter | ACE-Step handles format, sample rate, normalization internally |
 | Model loading/initialization | Manual weight loading | `AceStepHandler.initialize_service()` | Handles config paths, checkpoints, device placement, flash attention |
 | Audio duration detection | Manual WAV header parsing | `torchaudio.info()` | Handles all formats, already in dependencies |
 | GPU detection | Custom CUDA checks | `acestep.gpu_config.get_gpu_config()` | Already handles CUDA/MPS/CPU detection with memory tier logic |
 **Key insight:** ACE-Step's existing Python API handles all the complex audio/ML plumbing. The wrapper script's job is purely argument parsing, caption construction, and output file management.
 ## Common Pitfalls
 ### Pitfall 1: sys.path and Package Import Order
 **What goes wrong:** `acestep` package is not on Python's path since hum2inst.py lives outside the ace-step directory.
 **Why it happens:** ACE-Step is installed as editable in its own venv, but hum2inst.py is at the project root.
 **How to avoid:** Add `sys.path.insert(0, "ace-step")` before importing from `acestep`. Alternatively, ensure the script is run from within the ace-step venv which has acestep installed as editable package.
 **Warning signs:** `ModuleNotFoundError: No module named 'acestep'`
 ### Pitfall 2: Output File Renaming Race
 **What goes wrong:** `generate_music()` saves output with a UUID-based filename. The script needs to rename it to include instrument+timestamp.
 **Why it happens:** The `generate_music` API uses `generate_uuid_from_params()` for filenames, not user-friendly names.
 **How to avoid:** After `generate_music()` returns, read `result.audios[0]["path"]` to get the saved path, then rename/copy to the desired filename. Or pass a custom `save_dir` per-run and rename afterward.
 **Warning signs:** Output files named like `a3f2b8c1...wav` instead of `piano_20260411_143022.wav`
 ### Pitfall 3: Duration Mismatch
 **What goes wrong:** Output audio is much longer or shorter than the input humming, causing melody to stretch or truncate.
 **Why it happens:** Not setting `duration` parameter, or setting it to -1 (auto), lets the model choose its own duration.
 **How to avoid:** Always measure input WAV duration with `torchaudio.info()` and pass it as the `duration` parameter. Round to nearest integer (ACE-Step duration is in seconds).
 **Warning signs:** 10-second humming produces 30-second output, or vice versa.
 ### Pitfall 4: Silence Detection
 **What goes wrong:** Model occasionally generates near-silence, especially with very short inputs or unusual timbres.
 **Why it happens:** Cover mode's noise injection at high audio_cover_strength sometimes produces degenerate outputs.
 **How to avoid:** After generation, check if the output audio tensor (from `result.audios[0]["tensor"]`) has RMS energy above a threshold. If too quiet, report error.
 **Warning signs:** Output WAV file exists but plays as silence or very faint noise.
 ### Pitfall 5: CUDA Not Available
 **What goes wrong:** Script crashes with CUDA errors on a machine without GPU.
 **Why it happens:** User decision is to assume CUDA and fail clearly.
 **How to avoid:** Check `torch.cuda.is_available()` early and exit with a clear message before any model loading.
 **Warning signs:** RuntimeError about CUDA device.
 ## Code Examples
 ### Complete Invocation Flow (verified from cli.py source)
 ```python
 # 1. Check CUDA
 import torch
 if not torch.cuda.is_available():
    print("ERROR: CUDA GPU required. No CUDA-capable GPU detected.", file=sys.stderr)
    sys.exit(1)
 # 2. Get input duration
 import torchaudio
 info = torchaudio.info(input_wav_path)
 duration = info.num_frames / info.sample_rate
 # 3. Initialize handlers (from cli.py lines 1318-1319, 1411-1419)
 from acestep.gpu_config import get_gpu_config, set_global_gpu_config
 gpu_config = get_gpu_config()
 set_global_gpu_config(gpu_config)
 dit_handler = AceStepHandler()
 llm_handler = LLMHandler()
 dit_handler.initialize_service(
    project_root="./ace-step",
    config_path="acestep-v15-xl-sft",
    device="cuda",
    use_flash_attention=dit_handler.is_flash_attention_available("cuda"),
 )
 # 4. Configure generation (from prior testing + cli.py lines 1629-1676)
 params = GenerationParams(
    task_type="cover",
    src_audio=str(input_wav_path),
    caption=f"solo {instrument}, clear and expressive melody",
    lyrics="",
    instrumental=True,
    duration=round(duration),
    audio_cover_strength=0.9,
    inference_steps=50,
    guidance_scale=5.0,
    thinking=False,
 )
 config = GenerationConfig(
    batch_size=1,
    use_random_seed=True,
    audio_format="wav",
 )
 # 5. Generate
 result = generate_music(dit_handler, llm_handler, params, config, save_dir=output_dir)
 # 6. Check result
 if not result.success or not result.audios:
    print(f"ERROR: Generation failed: {result.error or 'no audio produced'}", file=sys.stderr)
    sys.exit(1)
 output_path = result.audios[0]["path"]
 ```
 ### Caption Construction Examples
 ```python
 # Simple instrument mapping -- keep captions focused on instrument
 CAPTION_TEMPLATES = {
    "piano": "solo acoustic piano, gentle melody, warm tone, clear and expressive",
    "guitar": "solo acoustic guitar, fingerpicked melody, warm and intimate",
    "saxophone": "solo saxophone, smooth jazz melody, soulful and expressive",
    "violin": "solo violin, classical melody, rich and emotional",
    "flute": "solo flute, gentle melody, airy and delicate",
 }
 def build_caption(instrument: str) -> str:
    instrument_lower = instrument.lower().strip()
    if instrument_lower in CAPTION_TEMPLATES:
        return CAPTION_TEMPLATES[instrument_lower]
    # Generic fallback for any instrument name
    return f"solo {instrument_lower}, clear and expressive melody, warm tone"
 ```
 ## State of the Art
 | Old Approach | Current Approach | When Changed | Impact |
 |--------------|------------------|--------------|--------|
 | MusicGen melody conditioning | ACE-Step XL-SFT cover mode | 2026-04-11 testing | MusicGen chromagram conditioning too lossy; ACE-Step preserves melody+rhythm |
 | MIDI extraction pipeline | Direct raw audio input | 2026-04-11 testing | No intermediate steps needed; raw humming goes directly into cover mode |
 | Turbo model (8 steps) | XL-SFT model (50 steps) | 2026-04-11 testing | Better quality for cover mode; ~3s generation on RTX 4090 |
 **Deprecated/outdated:**
 - `musicgen_melody.py`: MusicGen approach abandoned. Moving to `/archive`.
 - `midi_to_audio.py`: MIDI synthesis no longer needed. Moving to `/archive`.
 ## Open Questions
 1. **Optimal audio_cover_strength default**
   - What we know: 0.8 worked well in testing. Range 0.8-1.0 specified by user.
   - What's unclear: Whether 0.85 or 0.9 might be more universally good across different instruments.
   - Recommendation: Default to 0.9 (high fidelity bias per user preference). Expose `--strength` flag so users can tune.
 2. **Caption quality impact on different instruments**
   - What we know: Piano captions worked well in testing.
   - What's unclear: How well generic captions work for uncommon instruments (e.g., "solo theremin").
   - Recommendation: Include a small set of curated caption templates for common instruments, with a generic fallback for anything else.
 3. **Output silence detection threshold**
   - What we know: User wants clear error on silence/failure.
   - What's unclear: What RMS threshold constitutes "silence" vs "very quiet music."
   - Recommendation: Use a conservative threshold (e.g., RMS < -60 dBFS). Log a warning rather than hard-failing for borderline cases.
 4. **Project root path resolution**
   - What we know: `hum2inst.py` lives at project root, `ace-step/` is a subdirectory.
   - What's unclear: Whether `initialize_service(project_root=...)` needs an absolute path.
   - Recommendation: Use `os.path.abspath()` to resolve the ace-step directory path relative to the script location.
 ## Sources
 ### Primary (HIGH confidence)
 - `ace-step/cli.py` -- Full CLI implementation showing exact initialization sequence, parameter defaults, and cover mode handling
 - `ace-step/acestep/inference.py` -- `GenerationParams`, `GenerationConfig`, `GenerationResult` dataclasses and `generate_music()` function
 - `ace-step/acestep/handler.py` -- `AceStepHandler` class with `initialize_service()` method
 - `ace-step/acestep/constants.py` -- `TASK_INSTRUCTIONS` dictionary showing cover mode instruction
 - User's prior testing notes (`pipeline_hum_to_instrument.md`) -- Validated configuration and results
 ### Secondary (MEDIUM confidence)
 - `ace-step/requirements.txt` -- Dependency versions for the project environment
 ### Tertiary (LOW confidence)
 - None -- all findings verified against actual source code
 ## Metadata
 **Confidence breakdown:**
 - Standard stack: HIGH -- all libraries already installed and validated in prior testing
 - Architecture: HIGH -- direct API import path fully traced through cli.py source code
 - Pitfalls: HIGH -- identified from actual code analysis (not speculation)
 **Research date:** 2026-04-11
 **Valid until:** 2026-05-11 (stable -- ACE-Step codebase is local and pinned)