Compare commits

..

10 commits

Author SHA1 Message Date
John Lightner
87763e21db chore: add .gitignore and batch generation script 2026-04-13 16:52:32 -05:00
John Lightner
38331ca59a docs(01-02): complete end-to-end verification plan with tuning findings 2026-04-11 03:25:42 -05:00
John Lightner
45f6863131 feat(01-02): add generation tuning params, seed control, multi-take, and run logging
Tuning session revealed strength=0.3 as optimal default for cover mode.
Added --caption, --noise-strength, --guidance, --steps, --shift, --sampler,
--vel-clamp, --vel-ema, --seed, --takes flags. Each output now has a JSON
sidecar logging all parameters for reproducibility.
2026-04-11 03:24:27 -05:00
John Lightner
f40ca2b3fb docs(01-01): complete core pipeline plan
- Add 01-01-SUMMARY.md with execution results
- Update STATE.md with plan progress and decisions
- Update ROADMAP.md and REQUIREMENTS.md
2026-04-11 02:13:43 -05:00
John Lightner
5a233898c8 feat(01-01): create hum2inst.py CLI pipeline script
- Single-file CLI wrapping ACE-Step XL-SFT cover mode
- argparse with --instrument, --output, --strength, --duration flags
- Auto-detect input WAV duration via torchaudio
- Caption templates for piano, guitar, saxophone, violin, flute
- CUDA GPU check with clear error message
- Silence detection on output audio
- User-friendly output naming with instrument + timestamp
2026-04-11 02:11:53 -05:00
John Lightner
262ee6f7d1 chore(01-01): archive experimental scripts
- Move midi_to_audio.py to archive/
- Move musicgen_melody.py to archive/
2026-04-11 02:10:59 -05:00
John Lightner
b9cc6121f3 docs(01-core-pipeline): create phase plan 2026-04-11 02:07:27 -05:00
John Lightner
b9932b08b4 docs(core-pipeline): research phase domain 2026-04-11 02:03:30 -05:00
John Lightner
5a04fc3498 docs(01): capture phase context 2026-04-11 01:58:51 -05:00
John Lightner
b7260bbd26 docs: create roadmap (4 phases) 2026-04-11 01:45:42 -05:00
14 changed files with 1898 additions and 28 deletions

16
.gitignore vendored Normal file
View file

@ -0,0 +1,16 @@
# Vendored deps / envs
ace-step/
basic-pitch-env/
.cache/
*.venv/
__pycache__/
*.pyc
# User data
input/
output/
# Editor
.vscode/
.idea/
*.swp

View file

@ -9,27 +9,27 @@ Requirements for initial release. Each maps to roadmap phases.
### Melody Fidelity ### Melody Fidelity
- [ ] **MEL-01**: Output audio audibly follows the pitch contour of the hummed input melody - [x] **MEL-01**: Output audio audibly follows the pitch contour of the hummed input melody
- [ ] **MEL-02**: Output audio preserves the rhythmic timing and phrasing of the hummed input - [x] **MEL-02**: Output audio preserves the rhythmic timing and phrasing of the hummed input
- [ ] **MEL-03**: User can control fidelity vs creativity tradeoff via cover_strength parameter - [ ] **MEL-03**: User can control fidelity vs creativity tradeoff via cover_strength parameter
- [ ] **MEL-04**: Output is musically coherent — sounds like a real instrument performance, not garbled audio - [x] **MEL-04**: Output is musically coherent — sounds like a real instrument performance, not garbled audio
### Instrument Selection ### Instrument Selection
- [ ] **INST-01**: User can specify target instrument via text prompt (piano, guitar, saxophone, etc.) - [x] **INST-01**: User can specify target instrument via text prompt (piano, guitar, saxophone, etc.)
- [ ] **INST-02**: Different instrument prompts produce audibly different timbres in the output - [ ] **INST-02**: Different instrument prompts produce audibly different timbres in the output
- [ ] **INST-03**: Instrument selection works across at least 5 distinct instrument types - [ ] **INST-03**: Instrument selection works across at least 5 distinct instrument types
### Input Handling ### Input Handling
- [ ] **INP-01**: Pipeline accepts raw humming WAV audio as input with no manual preprocessing - [x] **INP-01**: Pipeline accepts raw humming WAV audio as input with no manual preprocessing
- [ ] **INP-02**: Pipeline auto-detects input audio duration and configures output duration appropriately for quality - [ ] **INP-02**: Pipeline auto-detects input audio duration and configures output duration appropriately for quality
- [ ] **INP-03**: Input audio at common sample rates (44.1kHz, 48kHz, 16kHz) is handled without errors - [ ] **INP-03**: Input audio at common sample rates (44.1kHz, 48kHz, 16kHz) is handled without errors
### Output Quality ### Output Quality
- [ ] **OUT-01**: Output audio is at least 44.1kHz sample rate (CD quality) - [ ] **OUT-01**: Output audio is at least 44.1kHz sample rate (CD quality)
- [ ] **OUT-02**: Output is saved as WAV file to a user-specified or default output directory - [x] **OUT-02**: Output is saved as WAV file to a user-specified or default output directory
- [ ] **OUT-03**: Output filenames include instrument name and timestamp for easy identification - [ ] **OUT-03**: Output filenames include instrument name and timestamp for easy identification
### Reproducibility ### Reproducibility
@ -39,7 +39,7 @@ Requirements for initial release. Each maps to roadmap phases.
### Pipeline Usability ### Pipeline Usability
- [ ] **PIPE-01**: Single CLI command or script invocation to go from humming WAV to instrument output - [x] **PIPE-01**: Single CLI command or script invocation to go from humming WAV to instrument output
- [ ] **PIPE-02**: Configuration via TOML file or CLI arguments for instrument, strength, duration, seed - [ ] **PIPE-02**: Configuration via TOML file or CLI arguments for instrument, strength, duration, seed
- [ ] **PIPE-03**: Clear error messages when input file is missing, corrupted, or in unsupported format - [ ] **PIPE-03**: Clear error messages when input file is missing, corrupted, or in unsupported format
@ -84,30 +84,30 @@ Deferred to future release. Tracked but not in current roadmap.
| Requirement | Phase | Status | | Requirement | Phase | Status |
|-------------|-------|--------| |-------------|-------|--------|
| MEL-01 | — | Pending | | MEL-01 | Phase 1 | Complete |
| MEL-02 | — | Pending | | MEL-02 | Phase 1 | Complete |
| MEL-03 | | Pending | | MEL-03 | Phase 2 | Pending |
| MEL-04 | — | Pending | | MEL-04 | Phase 1 | Complete |
| INST-01 | — | Pending | | INST-01 | Phase 1 | Complete |
| INST-02 | | Pending | | INST-02 | Phase 2 | Pending |
| INST-03 | | Pending | | INST-03 | Phase 2 | Pending |
| INP-01 | — | Pending | | INP-01 | Phase 1 | Complete |
| INP-02 | | Pending | | INP-02 | Phase 3 | Pending |
| INP-03 | | Pending | | INP-03 | Phase 3 | Pending |
| OUT-01 | | Pending | | OUT-01 | Phase 3 | Pending |
| OUT-02 | — | Pending | | OUT-02 | Phase 1 | Complete |
| OUT-03 | | Pending | | OUT-03 | Phase 3 | Pending |
| REPR-01 | | Pending | | REPR-01 | Phase 4 | Pending |
| REPR-02 | | Pending | | REPR-02 | Phase 4 | Pending |
| PIPE-01 | — | Pending | | PIPE-01 | Phase 1 | Complete |
| PIPE-02 | | Pending | | PIPE-02 | Phase 4 | Pending |
| PIPE-03 | | Pending | | PIPE-03 | Phase 3 | Pending |
**Coverage:** **Coverage:**
- v1 requirements: 18 total - v1 requirements: 18 total
- Mapped to phases: 0 - Mapped to phases: 18
- Unmapped: 18 - Unmapped: 0
--- ---
*Requirements defined: 2026-04-11* *Requirements defined: 2026-04-11*
*Last updated: 2026-04-11 after initial definition* *Last updated: 2026-04-11 after roadmap creation*

91
.planning/ROADMAP.md Normal file
View file

@ -0,0 +1,91 @@
# Roadmap: AI Music Pipeline
## Overview
This roadmap delivers a voice-to-instrument pipeline built on ACE-Step 1.5 XL-SFT cover mode. Phase 1 establishes the core end-to-end flow (hum in, instrument out), Phase 2 validates instrument variety and exposes fidelity control, Phase 3 hardens input/output handling, and Phase 4 adds configuration file support and reproducibility via seed control. The result is a single CLI tool that takes a humming WAV and produces high-quality instrument renditions that faithfully follow the input melody.
## Phases
**Phase Numbering:**
- Integer phases (1, 2, 3, 4): Planned milestone work
- Decimal phases (e.g., 2.1): Urgent insertions (marked with INSERTED)
- [ ] **Phase 1: Core Pipeline** - End-to-end humming WAV to instrument output via ACE-Step cover mode
- [ ] **Phase 2: Instrument Variety & Fidelity Control** - Multiple distinct instruments and cover_strength tuning
- [ ] **Phase 3: Input & Output Robustness** - Sample rate handling, duration detection, CD quality output, error messages
- [ ] **Phase 4: Configuration & Reproducibility** - TOML config support and seed control for reproducible outputs
## Phase Details
### Phase 1: Core Pipeline
**Goal**: User can hum a melody, run one command, and get an instrument rendition that audibly follows the melody
**Depends on**: Nothing (first phase)
**Requirements**: MEL-01, MEL-02, MEL-04, INST-01, INP-01, OUT-02, PIPE-01
**Success Criteria** (what must be TRUE):
1. User can run a single script/command with a humming WAV file and get instrument audio output
2. Output audio audibly follows the pitch contour of the input humming
3. Output audio preserves the rhythmic timing of the input humming
4. Output sounds like a coherent instrument performance, not garbled noise
5. User can specify the target instrument (e.g., piano, guitar) and the output reflects that instrument
**Plans**: 2 plans
Plans:
- [ ] 01-01-PLAN.md — Build hum2inst.py CLI pipeline and archive old scripts
- [ ] 01-02-PLAN.md — End-to-end test and human verification of melody fidelity
### Phase 2: Instrument Variety & Fidelity Control
**Goal**: User can choose from multiple instruments that sound distinctly different, and control how closely the output follows the input melody
**Depends on**: Phase 1
**Requirements**: INST-02, INST-03, MEL-03
**Success Criteria** (what must be TRUE):
1. Different instrument prompts (piano, guitar, saxophone, violin, flute) produce audibly different timbres from the same input
2. At least 5 distinct instrument types produce usable output
3. User can adjust cover_strength parameter and hear the difference -- higher values follow the melody more closely, lower values allow more creative interpretation
**Plans**: TBD
Plans:
- [ ] 02-01: TBD
- [ ] 02-02: TBD
### Phase 3: Input & Output Robustness
**Goal**: Pipeline handles real-world input files gracefully and produces properly named CD-quality output
**Depends on**: Phase 1
**Requirements**: INP-02, INP-03, OUT-01, OUT-03, PIPE-03
**Success Criteria** (what must be TRUE):
1. Input WAV files at 44.1kHz, 48kHz, and 16kHz sample rates all work without errors
2. Pipeline auto-detects input audio duration and configures generation duration appropriately
3. Output audio is at least 44.1kHz sample rate
4. Output filenames include the instrument name and a timestamp (e.g., piano_20260411_143022.wav)
5. Clear error message shown when input file is missing, corrupted, or in an unsupported format
**Plans**: TBD
Plans:
- [ ] 03-01: TBD
- [ ] 03-02: TBD
- [ ] 03-03: TBD
### Phase 4: Configuration & Reproducibility
**Goal**: User can configure the pipeline via TOML file and reproduce or vary outputs using seed control
**Depends on**: Phase 1
**Requirements**: PIPE-02, REPR-01, REPR-02
**Success Criteria** (what must be TRUE):
1. User can specify instrument, cover_strength, duration, and seed via a TOML config file instead of CLI arguments
2. Running the pipeline twice with the same seed, input, and prompt produces identical output
3. Running with different seeds produces meaningfully different outputs from the same input and prompt
**Plans**: TBD
Plans:
- [ ] 04-01: TBD
- [ ] 04-02: TBD
## Progress
**Execution Order:**
Phases execute in numeric order. Phases 2, 3, and 4 all depend on Phase 1 but are independent of each other.
| Phase | Plans Complete | Status | Completed |
|-------|----------------|--------|-----------|
| 1. Core Pipeline | 1/2 | In Progress| |
| 2. Instrument Variety & Fidelity Control | 0/2 | Not started | - |
| 3. Input & Output Robustness | 0/3 | Not started | - |
| 4. Configuration & Reproducibility | 0/2 | Not started | - |

68
.planning/STATE.md Normal file
View file

@ -0,0 +1,68 @@
# Project State
## Project Reference
See: .planning/PROJECT.md (updated 2026-04-11)
**Core value:** A hummed melody input must produce instrument-specific output that audibly follows the melody's contour and rhythm
**Current focus:** Phase 1: Core Pipeline
## Current Position
Phase: 1 of 4 (Core Pipeline)
Plan: 2 of 2 in current phase
Status: Phase complete — awaiting verification
Last activity: 2026-04-11 -- Completed 01-02-PLAN.md
Progress: [##........] 20%
## Performance Metrics
**Velocity:**
- Total plans completed: 0
- Average duration: -
- Total execution time: 0 hours
**By Phase:**
| Phase | Plans | Total | Avg/Plan |
|-------|-------|-------|----------|
| 01-core-pipeline | 01-01, 01-02 | 27min | 13.5min |
**Recent Trend:**
- Last 5 plans: -
- Trend: -
*Updated after each plan completion*
## Accumulated Context
### Decisions
Decisions are logged in PROJECT.md Key Decisions table.
Recent decisions affecting current work:
- [Roadmap]: ACE-Step 1.5 XL-SFT cover mode is the sole generation engine for v1. No MusicGen/AudioCraft.
- [Roadmap]: Phases 2-4 are independent after Phase 1; can be executed in any order.
- [01-01]: Direct Python API import of ACE-Step (not subprocess) for clean error handling
- [01-01]: Default audio_cover_strength=0.9 for high melody fidelity
- [01-01]: Conservative -60 dBFS silence threshold with warning (not hard fail)
- [01-01]: Temp directory isolation for ACE-Step UUID output before renaming
- [01-02]: audio_cover_strength=0.3 is optimal (0.9 garbled, 0.5+ copies source)
- [01-02]: cover_noise_strength must be 0.0 (any >0 produces source copies)
- [01-02]: Seed variance dominates quality — multi-take cherry-pick workflow required
- [01-02]: Simple instrument captions outperform verbose descriptive captions
### Pending Todos
None yet.
### Blockers/Concerns
None yet.
## Session Continuity
Last session: 2026-04-11
Stopped at: Completed 01-02-PLAN.md — all plans in phase 01 done, awaiting verification
Resume file: None

View file

@ -0,0 +1,197 @@
---
phase: 01-core-pipeline
plan: 01
type: execute
wave: 1
depends_on: []
files_modified:
- hum2inst.py
- archive/midi_to_audio.py
- archive/musicgen_melody.py
autonomous: true
requirements:
- MEL-01
- MEL-02
- MEL-04
- INST-01
- INP-01
- OUT-02
- PIPE-01
must_haves:
truths:
- "User can run `python hum2inst.py input.wav --instrument piano` and get a WAV file in ./output/"
- "Script reads input WAV duration and passes it to ACE-Step so output length matches input"
- "Script builds a caption from the instrument name that guides timbre selection"
- "Script checks for CUDA GPU and exits with clear error if unavailable"
- "Script detects silent/failed output and exits with non-zero code"
- "Output WAV filename includes instrument name and timestamp"
- "Old experimental scripts are moved to /archive and no longer in project root"
artifacts:
- path: "hum2inst.py"
provides: "Complete CLI pipeline: argparse, CUDA check, ACE-Step init, generation, output rename, error handling"
min_lines: 100
- path: "archive/midi_to_audio.py"
provides: "Archived experimental script"
- path: "archive/musicgen_melody.py"
provides: "Archived experimental script"
key_links:
- from: "hum2inst.py"
to: "acestep.inference.generate_music"
via: "direct Python import with sys.path insert"
pattern: "from acestep\\.inference import"
- from: "hum2inst.py"
to: "acestep.handler.AceStepHandler"
via: "handler initialization with XL-SFT config"
pattern: "initialize_service.*acestep-v15-xl-sft"
- from: "hum2inst.py"
to: "torchaudio.info"
via: "duration detection from input WAV"
pattern: "torchaudio\\.info"
---
<objective>
Build the complete hum-to-instrument CLI pipeline as a single Python script, and archive old experimental scripts.
Purpose: Deliver the core end-to-end functionality -- a user hums into a WAV file, runs one command, and gets an instrument rendition that follows their melody.
Output: Working `hum2inst.py` script at project root, old scripts moved to `archive/`.
</objective>
<execution_context>
@C:/Users/jlightner/.claude/get-shit-done/workflows/execute-plan.md
@C:/Users/jlightner/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-core-pipeline/01-CONTEXT.md
@.planning/phases/01-core-pipeline/01-RESEARCH.md
</context>
<tasks>
<task type="auto">
<name>Task 1: Archive experimental scripts</name>
<files>archive/midi_to_audio.py, archive/musicgen_melody.py</files>
<action>
Create an `archive/` directory at the project root. Move `midi_to_audio.py` and `musicgen_melody.py` from the project root into `archive/`. Use git mv if the repo tracks these files, otherwise use regular file move.
Verify both files exist at their new locations and are removed from the project root.
</action>
<verify>
<automated>ls archive/midi_to_audio.py archive/musicgen_melody.py && test ! -f midi_to_audio.py && test ! -f musicgen_melody.py && echo "PASS"</automated>
</verify>
<done>Both experimental scripts exist in archive/ and are gone from project root</done>
</task>
<task type="auto">
<name>Task 2: Create hum2inst.py CLI pipeline script</name>
<files>hum2inst.py</files>
<action>
Create `hum2inst.py` at the project root. This is a single-file CLI script that wraps ACE-Step XL-SFT cover mode. The script must implement ALL of the following (per user decisions in CONTEXT.md):
**CLI interface (argparse):**
- Positional argument: input WAV file path
- `--instrument` (required): target instrument name (e.g., piano, guitar, saxophone)
- `--output` (optional): output directory, defaults to `./output/`
- `--strength` (optional): audio_cover_strength float, defaults to 0.9 (high fidelity per user preference, within locked range 0.8-1.0)
- `--duration` (optional): override output duration in seconds; if not provided, auto-detect from input WAV
**Startup checks:**
- Validate input WAV file exists and is readable; exit 1 with clear message if not
- Check `torch.cuda.is_available()`; exit 1 with message "ERROR: CUDA GPU required. No CUDA-capable GPU detected." if false
- Create output directory if it doesn't exist
**ACE-Step initialization (per RESEARCH.md Pattern 1):**
- Add ace-step directory to sys.path: `sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), "ace-step"))`
- Import from acestep: `AceStepHandler`, `LLMHandler`, `GenerationParams`, `GenerationConfig`, `generate_music`, `get_gpu_config`, `set_global_gpu_config`
- Call `get_gpu_config()` and `set_global_gpu_config()`
- Initialize `AceStepHandler` and call `initialize_service()` with `project_root` pointing to the ace-step directory, `config_path="acestep-v15-xl-sft"`, `device="cuda"`, and `use_flash_attention` from handler detection
- Initialize `LLMHandler()` (instantiate only, no model load -- cover mode is in skip_lm_tasks)
**Duration detection (per RESEARCH.md Pattern 3):**
- Use `torchaudio.info(input_path)` to get `num_frames` and `sample_rate`
- Compute duration as `num_frames / sample_rate`
- Round to nearest integer for the `duration` parameter
- Print detected duration: "Input duration: {duration}s"
**Caption construction (per RESEARCH.md Pattern 4 + code examples):**
- Include a dictionary of curated caption templates for common instruments: piano, guitar, saxophone, violin, flute (use the exact templates from RESEARCH.md code examples section)
- For instruments not in the dictionary, use generic fallback: `f"solo {instrument}, clear and expressive melody, warm tone"`
- Print the caption being used: "Caption: {caption}"
**Generation (per RESEARCH.md Pattern 2):**
- Create `GenerationParams` with:
- `task_type="cover"`
- `src_audio` = absolute path to input WAV
- `caption` = built caption string
- `lyrics=""`
- `instrumental=True`
- `duration` = detected or overridden duration (rounded int)
- `audio_cover_strength` = user's --strength value (default 0.9)
- `inference_steps=50` (XL-SFT model requirement -- NOT 8 which is turbo)
- `guidance_scale=5.0`
- `thinking=False` (no LLM needed for cover)
- `bpm=120` (default, not critical for cover mode)
- Create `GenerationConfig` with `batch_size=1`, `use_random_seed=True`, `audio_format="wav"`
- Call `generate_music(dit_handler, llm_handler, params, config, save_dir=temp_output_dir)` where temp_output_dir is a temporary subdirectory to isolate ACE-Step's UUID-named output
- Print progress: "Generating {instrument} cover..." before the call
**Output handling (per RESEARCH.md Pitfall 2 + user decisions):**
- After generation, check `result.success` and `result.audios` -- if failed, print error to stderr and exit 1
- Get the generated file path from the result
- Rename/copy to user-friendly filename: `{instrument}_{YYYYMMDD}_{HHMMSS}.wav` in the user's output directory
- Use `datetime.now().strftime("%Y%m%d_%H%M%S")` for the timestamp
- Print the final output path: "Output saved: {path}"
**Silence detection (per RESEARCH.md Pitfall 4 + user decisions):**
- After generation, load the output audio with torchaudio and compute RMS energy
- If RMS is below a conservative threshold (e.g., -60 dBFS), print warning to stderr: "WARNING: Output audio appears to be near-silent. The generation may have failed."
- Do NOT hard-fail on borderline cases -- just warn (per research recommendation)
**Error handling (per user decisions):**
- Wrap the main execution in try/except
- On any exception during generation: print clear error message to stderr and exit 1
- No silent failures -- every error path must print something useful
**Progress feedback (Claude's discretion):**
- Print status messages at key stages: loading model, detecting duration, generating, saving output
- Keep it simple -- print statements, no progress bar library needed
</action>
<verify>
<automated>python -c "import ast; ast.parse(open('hum2inst.py').read()); print('SYNTAX OK')" && grep -q "argparse" hum2inst.py && grep -q "generate_music" hum2inst.py && grep -q "cover" hum2inst.py && grep -q "instrument" hum2inst.py && grep -q "torchaudio" hum2inst.py && echo "PASS"</automated>
<manual>Review script structure: argparse setup, CUDA check, ACE-Step init, generation call, output rename, error handling all present</manual>
</verify>
<done>hum2inst.py exists at project root with valid Python syntax, contains argparse CLI with --instrument/--output/--strength/--duration flags, imports ACE-Step API, uses cover mode with XL-SFT config, detects input duration, builds caption from instrument name, renames output with instrument+timestamp, checks for CUDA, detects silence, and handles errors with non-zero exit codes</done>
</task>
</tasks>
<verification>
1. `hum2inst.py` exists at project root and passes Python syntax check
2. Script imports from acestep package (generate_music, AceStepHandler, etc.)
3. Script uses argparse with required --instrument flag and optional --output, --strength, --duration
4. Script checks CUDA availability before model loading
5. Old scripts archived in archive/ directory
6. Script contains cover mode configuration (task_type="cover", inference_steps=50)
7. Script contains duration detection via torchaudio
8. Script contains output file renaming with instrument + timestamp pattern
</verification>
<success_criteria>
- hum2inst.py is syntactically valid Python
- All required CLI flags are present (--instrument, --output, --strength, --duration)
- ACE-Step cover mode is configured correctly (XL-SFT, 50 steps, cover task type)
- Input duration auto-detection is implemented
- Output naming includes instrument and timestamp
- CUDA check is present
- Silence detection is present
- Error handling covers generation failure, missing input, no GPU
- Old scripts moved to archive/
</success_criteria>
<output>
After completion, create `.planning/phases/01-core-pipeline/01-01-SUMMARY.md`
</output>

View file

@ -0,0 +1,98 @@
---
phase: 01-core-pipeline
plan: 01
subsystem: pipeline
tags: [ace-step, cover-mode, cli, torchaudio, argparse, cuda]
requires:
- phase: none
provides: none
provides:
- "hum2inst.py CLI script wrapping ACE-Step XL-SFT cover mode"
- "Archived experimental scripts in archive/"
affects: [02-quality-presets, 03-batch-processing, 04-output-polish]
tech-stack:
added: []
patterns: [direct-python-api-import, caption-template-mapping, silence-detection-rms]
key-files:
created: [hum2inst.py, archive/midi_to_audio.py, archive/musicgen_melody.py]
modified: []
key-decisions:
- "Direct Python API import of ACE-Step (not subprocess) for clean error handling"
- "Default audio_cover_strength=0.9 within locked 0.8-1.0 range for high melody fidelity"
- "Conservative -60 dBFS silence threshold with warning (not hard fail) for borderline cases"
- "Temp directory for ACE-Step UUID output, then copy to user-friendly filename"
patterns-established:
- "Caption template dict for common instruments with generic fallback"
- "Temp dir isolation for ACE-Step output before renaming"
- "Early CUDA check before model loading"
requirements-completed: [MEL-01, MEL-02, MEL-04, INST-01, INP-01, OUT-02, PIPE-01]
duration: 2min
completed: 2026-04-11
---
# Phase 1 Plan 01: Core Pipeline Summary
**Single-file hum2inst.py CLI wrapping ACE-Step XL-SFT cover mode with auto duration detection, instrument caption templates, and silence detection**
## Performance
- **Duration:** 2 min
- **Started:** 2026-04-11T07:10:42Z
- **Completed:** 2026-04-11T07:12:04Z
- **Tasks:** 2
- **Files modified:** 3
## Accomplishments
- Archived experimental scripts (midi_to_audio.py, musicgen_melody.py) to archive/
- Created complete hum2inst.py CLI pipeline (273 lines) with argparse, CUDA check, ACE-Step init, cover mode generation, output renaming, silence detection, and error handling
- Caption templates for 5 common instruments with generic fallback for any instrument name
## Task Commits
Each task was committed atomically:
1. **Task 1: Archive experimental scripts** - `262ee6f` (chore)
2. **Task 2: Create hum2inst.py CLI pipeline script** - `5a23389` (feat)
## Files Created/Modified
- `hum2inst.py` - Complete CLI pipeline: argparse, CUDA check, ACE-Step XL-SFT cover mode, duration detection, caption building, silence detection, error handling
- `archive/midi_to_audio.py` - Archived experimental MIDI-to-audio script
- `archive/musicgen_melody.py` - Archived experimental MusicGen melody script
## Decisions Made
- Used direct Python API import of ACE-Step (not subprocess) for cleaner error handling and access to result objects
- Set default audio_cover_strength=0.9 (high end of 0.8-1.0 range) to prioritize melody fidelity
- Used -60 dBFS as silence detection threshold with warning-only behavior for borderline cases
- Used temp directory for ACE-Step's UUID-named output, then copy to user-friendly filename in output dir
## Deviations from Plan
None - plan executed exactly as written.
## Issues Encountered
None.
## User Setup Required
None - no external service configuration required. Script uses existing ACE-Step installation and venv.
## Next Phase Readiness
- hum2inst.py is ready for end-to-end testing with actual humming WAV files
- Foundation is set for Phase 2 (quality presets), Phase 3 (batch processing), and Phase 4 (output polish)
- All phases 2-4 can import or extend the patterns established here
## Self-Check: PASSED
All files exist at expected paths. All commit hashes verified in git log.
---
*Phase: 01-core-pipeline*
*Completed: 2026-04-11*

View file

@ -0,0 +1,144 @@
---
phase: 01-core-pipeline
plan: 02
type: execute
wave: 2
depends_on:
- 01-01
files_modified:
- hum2inst.py
autonomous: false
requirements:
- MEL-01
- MEL-02
- MEL-04
- INST-01
- INP-01
- OUT-02
- PIPE-01
must_haves:
truths:
- "Running hum2inst.py with a real humming WAV produces an output WAV file"
- "Output audio audibly follows the pitch contour of the input humming"
- "Output audio preserves the rhythmic timing of the input humming"
- "Output sounds like the specified instrument, not garbled noise"
- "Output filename contains the instrument name and a timestamp"
artifacts:
- path: "output/"
provides: "Generated instrument WAV file from test run"
key_links:
- from: "hum2inst.py"
to: "output/*.wav"
via: "end-to-end generation from humming input"
pattern: "Output saved:"
---
<objective>
Run the complete pipeline end-to-end with a real humming WAV file and verify the output meets quality requirements.
Purpose: Validate that the script actually produces instrument audio that follows the hummed melody -- the core value proposition of the project.
Output: A verified instrument rendition WAV file and confirmation that the pipeline works.
</objective>
<execution_context>
@C:/Users/jlightner/.claude/get-shit-done/workflows/execute-plan.md
@C:/Users/jlightner/.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/01-core-pipeline/01-CONTEXT.md
@.planning/phases/01-core-pipeline/01-01-SUMMARY.md
</context>
<tasks>
<task type="auto">
<name>Task 1: Run end-to-end pipeline test</name>
<files>hum2inst.py</files>
<action>
Run the hum2inst.py script with a real humming WAV file from the input/ directory. Use the ace-step venv Python interpreter.
Execute this command:
```
ace-step/.venv/Scripts/python.exe hum2inst.py "input/humming_step_down_jl [2026-04-11 004915].wav" --instrument piano
```
Verify:
1. The script runs without errors
2. An output WAV file is created in ./output/ with a filename containing "piano" and a timestamp
3. The script prints status messages (loading model, detecting duration, generating, output path)
4. Exit code is 0
If the script fails:
- Read the error message carefully
- Fix the issue in hum2inst.py (common issues: import paths, API parameter names, missing attributes on result object)
- Re-run until it succeeds
- Document any fixes made
After successful run, also test the --help flag to verify argparse setup:
```
ace-step/.venv/Scripts/python.exe hum2inst.py --help
```
</action>
<verify>
<automated>ls output/piano_*.wav 2>/dev/null | head -1 | xargs -I{} test -f "{}" && echo "PASS" || echo "FAIL: no piano output WAV found"</automated>
<manual>Check script console output shows: duration detection, caption, generation progress, output path</manual>
</verify>
<done>Pipeline produces a piano WAV output file from real humming input without errors</done>
</task>
<task type="checkpoint:human-verify" gate="blocking">
<name>Task 2: Human verification of melody fidelity and instrument quality</name>
<files>output/piano_*.wav</files>
<action>
Present the generated output to the user for listening verification. The user needs to compare the input humming with the generated instrument output.
What was built: Complete hum-to-instrument pipeline -- hum2inst.py takes a humming WAV and produces an instrument rendition via ACE-Step XL-SFT cover mode.
How to verify:
1. Listen to the INPUT humming file: `input/humming_step_down_jl [2026-04-11 004915].wav`
2. Listen to the OUTPUT piano file in `output/` (the latest piano_*.wav file)
3. Verify these qualities:
- Does the output follow the same melody (pitch contour) as the humming? (MEL-01)
- Does the output preserve the timing/rhythm of the humming? (MEL-02)
- Does the output sound like a coherent piano performance, not garbled noise? (MEL-04)
- Is the output audibly a piano (not generic synth or noise)? (INST-01)
4. Optionally, try a second instrument:
`ace-step/.venv/Scripts/python.exe hum2inst.py "input/humming_step_down_jl [2026-04-11 004915].wav" --instrument guitar`
Listen and verify it sounds distinctly different from the piano output.
Resume signal: Type "approved" if the output sounds good, or describe any issues with melody fidelity, instrument quality, or other problems.
</action>
<verify>
<automated>ls output/piano_*.wav 2>/dev/null && echo "Output file exists"</automated>
<manual>Human listens and confirms melody fidelity, rhythm preservation, instrument quality, and musical coherence</manual>
</verify>
<done>Human confirms output audio follows the hummed melody's pitch contour and rhythm, sounds like the specified instrument, and is musically coherent</done>
</task>
</tasks>
<verification>
1. hum2inst.py runs successfully with real humming input
2. Output WAV file exists with correct naming convention
3. Human confirms output follows input melody contour
4. Human confirms output preserves rhythmic timing
5. Human confirms output sounds like specified instrument
6. Human confirms output is musically coherent
</verification>
<success_criteria>
- Pipeline runs end-to-end without errors on real humming input
- Output WAV file produced with instrument+timestamp filename
- Human verifies melody fidelity (pitch contour and rhythm preserved)
- Human verifies instrument quality (sounds like specified instrument)
- Human verifies musical coherence (not garbled noise)
</success_criteria>
<output>
After completion, create `.planning/phases/01-core-pipeline/01-02-SUMMARY.md`
</output>

View file

@ -0,0 +1,83 @@
---
phase: 01-core-pipeline
plan: 02
subsystem: pipeline
tags: [ace-step, cover-mode, tuning, seed-control, quality-testing]
requires:
- phase: 01-01
provides: "hum2inst.py CLI script"
provides:
- "Validated end-to-end pipeline with real humming input"
- "Tuned generation parameters (strength=0.3 optimal)"
- "Multi-take seed generation workflow"
- "JSON run logging for reproducibility"
affects: [02-quality-presets]
tech-stack:
added: []
patterns: [multi-take-generation, json-sidecar-logging, seed-reproducibility]
key-files:
created: []
modified: [hum2inst.py]
key-decisions:
- "audio_cover_strength=0.3 is optimal default (0.9 garbled, 0.5+ often copies source)"
- "cover_noise_strength=0.0 required (any >0 produces near-identical source copy)"
- "Seed variance dominates output quality — multi-take cherry-pick workflow is necessary"
- "Custom verbose captions hurt quality — simple instrument captions work best"
- "JSON sidecar logging for every output WAV to track parameters"
- "Seed embedded in filename for traceability"
patterns-established:
- "Generate N takes with --takes, cherry-pick best by ear, reproduce with --seed"
- "JSON sidecar per output with full parameter snapshot"
- "Parameter sweep methodology: isolate one variable, same seed, compare"
requirements-completed: [MEL-01, MEL-02, MEL-04, INST-01, INP-01, OUT-02, PIPE-01]
## Self-Check: PARTIAL
quality-notes: |
Pipeline runs end-to-end without errors. Best outputs follow melody contour and
sound recognizably like target instrument. Quality is seed-dependent — some seeds
produce excellent results, others garbled or off-topic. Strength=0.3 with default
params is the confirmed sweet spot. This is a model capability ceiling, not a
script issue.
duration: 25min
completed: 2026-04-11
---
# Phase 1 Plan 02: End-to-End Pipeline Verification
**Validated pipeline with real humming input, tuned parameters through systematic testing, added seed control and run logging**
## Performance
- **Duration:** ~25 min (includes iterative tuning with user)
- **Tasks:** 2/2 (Task 2 checkpoint resolved through iterative testing)
- **Files modified:** 1
## Accomplishments
- Pipeline ran successfully on first attempt with no code fixes needed
- Systematic parameter sweep across strength, noise-strength, guidance, steps, shift, sampler, vel-clamp, vel-ema
- Identified optimal defaults: strength=0.3, noise-strength=0.0, guidance=5.0, steps=50
- Added seed control (--seed), multi-take generation (--takes), caption override (--caption)
- Added 6 advanced tuning flags (--guidance, --steps, --shift, --sampler, --vel-clamp, --vel-ema)
- Every output now has JSON sidecar with full parameter log
- Seed embedded in output filename for traceability
## Key Findings
- **audio_cover_strength** has a narrow sweet spot around 0.3 — above 0.5 often copies source, at 0.9 produces garbled "deep dream" output
- **cover_noise_strength** is cliff-like — any value > 0 produces near-identical source copies
- **Seed variance dominates** — same parameters produce wildly different results across seeds
- **guidance=7.0** improved melody fidelity but shifted timbre too bright (glockenspiel-like)
- **heun sampler** produced most authentic piano sound but melody diverged
- **Custom verbose captions** (descriptive register/tone) degraded output severely
- **Higher steps (100)** and **shift values** did not improve quality
## Task Commits
- `45f6863` feat(01-02): add generation tuning params, seed control, multi-take, and run logging

View file

@ -0,0 +1,67 @@
# Phase 1: Core Pipeline - Context
**Gathered:** 2026-04-11
**Status:** Ready for planning
<domain>
## Phase Boundary
Single CLI command that takes a humming WAV file and an instrument name, and produces an instrument rendition via ACE-Step XL-SFT cover mode. Output audibly follows the pitch contour and rhythmic timing of the input. No multi-instrument batch, no config files, no advanced error handling — just the core end-to-end flow.
</domain>
<decisions>
## Implementation Decisions
### CLI invocation design
- Invoked as `python hum2inst.py input.wav --instrument piano`
- Python script directly, no installation step
- `--instrument` as a named CLI flag (not positional)
- `--output` flag optional, defaults to `./output/` directory
- Use Python argparse for argument parsing (gives --help for free)
### ACE-Step generation parameters
- Default cover_strength in the high fidelity range (0.8-1.0)
- `--strength` flag exposed in Phase 1 so users can experiment immediately
- Duration matches input WAV length by default; `--duration` flag to override
- Caption auto-built from instrument name (e.g., "piano cover of a melody") — no custom prompt flag in Phase 1
### Output behavior
- Output filename includes instrument and timestamp (exact format at Claude's discretion)
- On generation failure or silence: print clear error message, exit with non-zero code
- No auto-play — just save and print the output path
- No silent failures
### Pipeline architecture
- Single `hum2inst.py` script — no module splitting in Phase 1
- Assume CUDA GPU is available; fail with clear message if no GPU detected
- Move existing experimental scripts (midi_to_audio.py, musicgen_melody.py) to an `/archive` folder
### Claude's Discretion
- ACE-Step invocation method (import Python API vs subprocess call — choose based on what ACE-Step exposes)
- Progress/feedback during generation (print statements, progress bar, or similar — pick what's appropriate)
- Exact output filename format (instrument + timestamp pattern)
- Exact cover_strength default value within the 0.8-1.0 range
</decisions>
<specifics>
## Specific Ideas
- ACE-Step XL-SFT cover mode is the generation backend — this is established from prior experimentation
- The `ace-step/` directory already exists in the project root with the model code
- User wants high melodic fidelity as the default — the pipeline should prioritize faithful melody reproduction over creative interpretation
</specifics>
<deferred>
## Deferred Ideas
None — discussion stayed within phase scope
</deferred>
---
*Phase: 01-core-pipeline*
*Context gathered: 2026-04-11*

View file

@ -0,0 +1,374 @@
# Phase 1: Core Pipeline - Research
**Researched:** 2026-04-11
**Domain:** ACE-Step cover mode CLI wrapper, audio I/O, Python scripting
**Confidence:** HIGH
## Summary
Phase 1 wraps the existing ACE-Step 1.5 XL-SFT cover mode into a single `hum2inst.py` script. The user's prior experimentation has already validated the entire generation pipeline end-to-end: raw humming WAV goes into ACE-Step cover mode, a caption describes the target instrument, and the output preserves melody contour and rhythm. The technical challenge is purely integration and UX: reading the input WAV duration, constructing the right `GenerationParams`, initializing the handlers, calling `generate_music()`, and saving the output with a meaningful filename.
The ACE-Step codebase exposes a clean Python API via `acestep.inference.generate_music()`, `acestep.handler.AceStepHandler`, and `acestep.llm_inference.LLMHandler`. Cover mode does NOT require the LLM handler (it's in the `skip_lm_tasks` set), which simplifies initialization. The script needs only the DiT handler with the XL-SFT model.
**Primary recommendation:** Import ACE-Step's Python API directly (not subprocess). Initialize `AceStepHandler` + dummy `LLMHandler`, configure `GenerationParams` with `task_type="cover"`, and call `generate_music()`. Use `torchaudio` (already in the ACE-Step venv) to read input WAV duration.
<user_constraints>
## User Constraints (from CONTEXT.md)
### Locked Decisions
- Invoked as `python hum2inst.py input.wav --instrument piano`
- Python script directly, no installation step
- `--instrument` as a named CLI flag (not positional)
- `--output` flag optional, defaults to `./output/` directory
- Use Python argparse for argument parsing (gives --help for free)
- Default cover_strength in the high fidelity range (0.8-1.0)
- `--strength` flag exposed in Phase 1 so users can experiment immediately
- Duration matches input WAV length by default; `--duration` flag to override
- Caption auto-built from instrument name (e.g., "piano cover of a melody") -- no custom prompt flag in Phase 1
- Output filename includes instrument and timestamp (exact format at Claude's discretion)
- On generation failure or silence: print clear error message, exit with non-zero code
- No auto-play -- just save and print the output path
- No silent failures
- Single `hum2inst.py` script -- no module splitting in Phase 1
- Assume CUDA GPU is available; fail with clear message if no GPU detected
- Move existing experimental scripts (midi_to_audio.py, musicgen_melody.py) to an `/archive` folder
### Claude's Discretion
- ACE-Step invocation method (import Python API vs subprocess call -- choose based on what ACE-Step exposes)
- Progress/feedback during generation (print statements, progress bar, or similar -- pick what's appropriate)
- Exact output filename format (instrument + timestamp pattern)
- Exact cover_strength default value within the 0.8-1.0 range
### Deferred Ideas (OUT OF SCOPE)
None -- discussion stayed within phase scope
</user_constraints>
<phase_requirements>
## Phase Requirements
| ID | Description | Research Support |
|----|-------------|-----------------|
| MEL-01 | Output audio audibly follows the pitch contour of the hummed input melody | ACE-Step cover mode with audio_cover_strength 0.8-1.0 preserves pitch contour from source. Validated in prior testing (File 10 comparison). |
| MEL-02 | Output audio preserves the rhythmic timing and phrasing of the hummed input | Cover mode encodes source into VAE latents, preserving temporal structure. Duration matching ensures timing alignment. |
| MEL-04 | Output is musically coherent -- sounds like a real instrument performance | XL-SFT model (50 inference steps) produces coherent instrument audio. Caption guides timbre. |
| INST-01 | User can specify target instrument via text prompt | `--instrument` flag maps to caption string (e.g., "solo acoustic piano, gentle melody"). |
| INP-01 | Pipeline accepts raw humming WAV audio as input with no manual preprocessing | Cover mode takes raw audio directly via `src_audio` parameter. No MIDI extraction needed. |
| OUT-02 | Output is saved as WAV file to a user-specified or default output directory | `generate_music()` saves to `save_dir`. Script renames output file with instrument+timestamp. |
| PIPE-01 | Single CLI command or script invocation to go from humming WAV to instrument output | Single `python hum2inst.py input.wav --instrument piano` command. |
</phase_requirements>
## Standard Stack
### Core
| Library | Version | Purpose | Why Standard |
|---------|---------|---------|--------------|
| ACE-Step (acestep) | 1.5 XL-SFT | Music generation via cover mode | Already installed, validated for this exact use case |
| torch | 2.7.1+cu128 | GPU compute, model inference | Already in ace-step/.venv |
| torchaudio | 2.7.1+cu128 | WAV file reading (duration detection) | Already in ace-step/.venv, native torch integration |
| argparse | stdlib | CLI argument parsing | User-specified, gives --help free |
### Supporting
| Library | Version | Purpose | When to Use |
|---------|---------|---------|-------------|
| datetime | stdlib | Timestamp generation for output filenames | Always (output naming) |
| pathlib/os | stdlib | Path manipulation, directory creation | Always (file I/O) |
| sys | stdlib | Exit codes, stderr output | Error handling |
### Alternatives Considered
| Instead of | Could Use | Tradeoff |
|------------|-----------|----------|
| Direct Python API import | subprocess calling `cli.py` | Subprocess adds overhead, harder error handling, but avoids import complexity. **Recommendation: use direct import** -- ACE-Step exposes clean Python API via `acestep.inference.generate_music()` |
| torchaudio for duration | wave stdlib module | wave is simpler but torchaudio is already loaded and handles more formats |
| argparse | click/typer | User specifically chose argparse for simplicity and --help |
**Installation:**
No new packages needed. Everything runs inside the existing `ace-step/.venv` environment.
## Architecture Patterns
### Recommended Project Structure
```
AiMusicPipeline/
├── hum2inst.py # Single CLI entry point (Phase 1)
├── ace-step/ # ACE-Step installation (existing)
│ ├── acestep/ # ACE-Step Python package
│ ├── checkpoints/ # Model weights (XL-SFT, VAE)
│ └── .venv/ # Python environment
├── input/ # User's humming WAV files
├── output/ # Generated instrument audio
└── archive/ # Moved experimental scripts
├── midi_to_audio.py
└── musicgen_melody.py
```
### Pattern 1: ACE-Step Python API Direct Import
**What:** Import `AceStepHandler`, `LLMHandler`, `GenerationParams`, `GenerationConfig`, and `generate_music` directly from the `acestep` package.
**When to use:** Always -- this is the recommended invocation method.
**Key finding:** Cover mode is in the `skip_lm_tasks` set, meaning the LLM handler is instantiated but NOT loaded with a model. This simplifies initialization significantly.
**Initialization sequence (from cli.py analysis):**
```python
import sys
import os
# Add ace-step to path so acestep package is importable
sys.path.insert(0, os.path.join(os.path.dirname(__file__), "ace-step"))
from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music
from acestep.gpu_config import get_gpu_config, set_global_gpu_config
# GPU setup
gpu_config = get_gpu_config()
set_global_gpu_config(gpu_config)
# Initialize handlers
dit_handler = AceStepHandler()
llm_handler = LLMHandler() # Not loaded for cover mode -- just instantiated
# Initialize DiT with XL-SFT model
dit_handler.initialize_service(
project_root=os.path.join(os.path.dirname(__file__), "ace-step"),
config_path="acestep-v15-xl-sft",
device="cuda",
use_flash_attention=dit_handler.is_flash_attention_available("cuda"),
)
```
### Pattern 2: Cover Mode Generation Parameters
**What:** Configure `GenerationParams` for cover mode with raw humming input.
**Validated configuration (from prior testing):**
```python
params = GenerationParams(
task_type="cover",
src_audio="path/to/humming.wav",
caption="solo acoustic piano, gentle melody, warm tone",
lyrics="",
instrumental=True,
duration=10, # Match input WAV length
bpm=120, # Default; not critical for cover mode
audio_cover_strength=0.9, # High fidelity to source melody
inference_steps=50, # XL-SFT model uses 50 steps
guidance_scale=5.0,
thinking=False, # No LLM needed for cover
)
config = GenerationConfig(
batch_size=1,
use_random_seed=True,
audio_format="wav",
)
result = generate_music(
dit_handler, llm_handler, params, config,
save_dir="./output"
)
```
### Pattern 3: Input WAV Duration Detection
**What:** Read input WAV to determine output duration automatically.
```python
import torchaudio
def get_wav_duration(wav_path: str) -> float:
info = torchaudio.info(wav_path)
return info.num_frames / info.sample_rate
```
### Pattern 4: Caption Construction from Instrument Name
**What:** Build the ACE-Step caption from a simple instrument name.
```python
def build_caption(instrument: str) -> str:
return f"solo {instrument}, clear and expressive melody, warm tone"
```
**Note:** The caption strongly influences output timbre. Keep it focused on the instrument. Adding genre/mood modifiers is deferred to later phases.
### Anti-Patterns to Avoid
- **Subprocess invocation of cli.py:** cli.py has interactive wizard prompts that will hang in non-interactive mode. Use the Python API directly.
- **Loading LLM model for cover mode:** Cover and repaint tasks are in `skip_lm_tasks`. The LLM handler must be instantiated (generate_music expects it) but should NOT have its model loaded -- this saves ~5GB VRAM and ~10s startup time.
- **Setting `thinking=True` for cover mode:** This triggers LLM inference which is skipped anyway for cover, but may cause errors if no LLM model is loaded.
- **Hardcoding `inference_steps=8`:** That's the turbo model default. XL-SFT needs 50 steps for quality output.
## Don't Hand-Roll
| Problem | Don't Build | Use Instead | Why |
|---------|-------------|-------------|-----|
| Audio file saving | Custom WAV writer | `generate_music()` with `save_dir` parameter | ACE-Step handles format, sample rate, normalization internally |
| Model loading/initialization | Manual weight loading | `AceStepHandler.initialize_service()` | Handles config paths, checkpoints, device placement, flash attention |
| Audio duration detection | Manual WAV header parsing | `torchaudio.info()` | Handles all formats, already in dependencies |
| GPU detection | Custom CUDA checks | `acestep.gpu_config.get_gpu_config()` | Already handles CUDA/MPS/CPU detection with memory tier logic |
**Key insight:** ACE-Step's existing Python API handles all the complex audio/ML plumbing. The wrapper script's job is purely argument parsing, caption construction, and output file management.
## Common Pitfalls
### Pitfall 1: sys.path and Package Import Order
**What goes wrong:** `acestep` package is not on Python's path since hum2inst.py lives outside the ace-step directory.
**Why it happens:** ACE-Step is installed as editable in its own venv, but hum2inst.py is at the project root.
**How to avoid:** Add `sys.path.insert(0, "ace-step")` before importing from `acestep`. Alternatively, ensure the script is run from within the ace-step venv which has acestep installed as editable package.
**Warning signs:** `ModuleNotFoundError: No module named 'acestep'`
### Pitfall 2: Output File Renaming Race
**What goes wrong:** `generate_music()` saves output with a UUID-based filename. The script needs to rename it to include instrument+timestamp.
**Why it happens:** The `generate_music` API uses `generate_uuid_from_params()` for filenames, not user-friendly names.
**How to avoid:** After `generate_music()` returns, read `result.audios[0]["path"]` to get the saved path, then rename/copy to the desired filename. Or pass a custom `save_dir` per-run and rename afterward.
**Warning signs:** Output files named like `a3f2b8c1...wav` instead of `piano_20260411_143022.wav`
### Pitfall 3: Duration Mismatch
**What goes wrong:** Output audio is much longer or shorter than the input humming, causing melody to stretch or truncate.
**Why it happens:** Not setting `duration` parameter, or setting it to -1 (auto), lets the model choose its own duration.
**How to avoid:** Always measure input WAV duration with `torchaudio.info()` and pass it as the `duration` parameter. Round to nearest integer (ACE-Step duration is in seconds).
**Warning signs:** 10-second humming produces 30-second output, or vice versa.
### Pitfall 4: Silence Detection
**What goes wrong:** Model occasionally generates near-silence, especially with very short inputs or unusual timbres.
**Why it happens:** Cover mode's noise injection at high audio_cover_strength sometimes produces degenerate outputs.
**How to avoid:** After generation, check if the output audio tensor (from `result.audios[0]["tensor"]`) has RMS energy above a threshold. If too quiet, report error.
**Warning signs:** Output WAV file exists but plays as silence or very faint noise.
### Pitfall 5: CUDA Not Available
**What goes wrong:** Script crashes with CUDA errors on a machine without GPU.
**Why it happens:** User decision is to assume CUDA and fail clearly.
**How to avoid:** Check `torch.cuda.is_available()` early and exit with a clear message before any model loading.
**Warning signs:** RuntimeError about CUDA device.
## Code Examples
### Complete Invocation Flow (verified from cli.py source)
```python
# 1. Check CUDA
import torch
if not torch.cuda.is_available():
print("ERROR: CUDA GPU required. No CUDA-capable GPU detected.", file=sys.stderr)
sys.exit(1)
# 2. Get input duration
import torchaudio
info = torchaudio.info(input_wav_path)
duration = info.num_frames / info.sample_rate
# 3. Initialize handlers (from cli.py lines 1318-1319, 1411-1419)
from acestep.gpu_config import get_gpu_config, set_global_gpu_config
gpu_config = get_gpu_config()
set_global_gpu_config(gpu_config)
dit_handler = AceStepHandler()
llm_handler = LLMHandler()
dit_handler.initialize_service(
project_root="./ace-step",
config_path="acestep-v15-xl-sft",
device="cuda",
use_flash_attention=dit_handler.is_flash_attention_available("cuda"),
)
# 4. Configure generation (from prior testing + cli.py lines 1629-1676)
params = GenerationParams(
task_type="cover",
src_audio=str(input_wav_path),
caption=f"solo {instrument}, clear and expressive melody",
lyrics="",
instrumental=True,
duration=round(duration),
audio_cover_strength=0.9,
inference_steps=50,
guidance_scale=5.0,
thinking=False,
)
config = GenerationConfig(
batch_size=1,
use_random_seed=True,
audio_format="wav",
)
# 5. Generate
result = generate_music(dit_handler, llm_handler, params, config, save_dir=output_dir)
# 6. Check result
if not result.success or not result.audios:
print(f"ERROR: Generation failed: {result.error or 'no audio produced'}", file=sys.stderr)
sys.exit(1)
output_path = result.audios[0]["path"]
```
### Caption Construction Examples
```python
# Simple instrument mapping -- keep captions focused on instrument
CAPTION_TEMPLATES = {
"piano": "solo acoustic piano, gentle melody, warm tone, clear and expressive",
"guitar": "solo acoustic guitar, fingerpicked melody, warm and intimate",
"saxophone": "solo saxophone, smooth jazz melody, soulful and expressive",
"violin": "solo violin, classical melody, rich and emotional",
"flute": "solo flute, gentle melody, airy and delicate",
}
def build_caption(instrument: str) -> str:
instrument_lower = instrument.lower().strip()
if instrument_lower in CAPTION_TEMPLATES:
return CAPTION_TEMPLATES[instrument_lower]
# Generic fallback for any instrument name
return f"solo {instrument_lower}, clear and expressive melody, warm tone"
```
## State of the Art
| Old Approach | Current Approach | When Changed | Impact |
|--------------|------------------|--------------|--------|
| MusicGen melody conditioning | ACE-Step XL-SFT cover mode | 2026-04-11 testing | MusicGen chromagram conditioning too lossy; ACE-Step preserves melody+rhythm |
| MIDI extraction pipeline | Direct raw audio input | 2026-04-11 testing | No intermediate steps needed; raw humming goes directly into cover mode |
| Turbo model (8 steps) | XL-SFT model (50 steps) | 2026-04-11 testing | Better quality for cover mode; ~3s generation on RTX 4090 |
**Deprecated/outdated:**
- `musicgen_melody.py`: MusicGen approach abandoned. Moving to `/archive`.
- `midi_to_audio.py`: MIDI synthesis no longer needed. Moving to `/archive`.
## Open Questions
1. **Optimal audio_cover_strength default**
- What we know: 0.8 worked well in testing. Range 0.8-1.0 specified by user.
- What's unclear: Whether 0.85 or 0.9 might be more universally good across different instruments.
- Recommendation: Default to 0.9 (high fidelity bias per user preference). Expose `--strength` flag so users can tune.
2. **Caption quality impact on different instruments**
- What we know: Piano captions worked well in testing.
- What's unclear: How well generic captions work for uncommon instruments (e.g., "solo theremin").
- Recommendation: Include a small set of curated caption templates for common instruments, with a generic fallback for anything else.
3. **Output silence detection threshold**
- What we know: User wants clear error on silence/failure.
- What's unclear: What RMS threshold constitutes "silence" vs "very quiet music."
- Recommendation: Use a conservative threshold (e.g., RMS < -60 dBFS). Log a warning rather than hard-failing for borderline cases.
4. **Project root path resolution**
- What we know: `hum2inst.py` lives at project root, `ace-step/` is a subdirectory.
- What's unclear: Whether `initialize_service(project_root=...)` needs an absolute path.
- Recommendation: Use `os.path.abspath()` to resolve the ace-step directory path relative to the script location.
## Sources
### Primary (HIGH confidence)
- `ace-step/cli.py` -- Full CLI implementation showing exact initialization sequence, parameter defaults, and cover mode handling
- `ace-step/acestep/inference.py` -- `GenerationParams`, `GenerationConfig`, `GenerationResult` dataclasses and `generate_music()` function
- `ace-step/acestep/handler.py` -- `AceStepHandler` class with `initialize_service()` method
- `ace-step/acestep/constants.py` -- `TASK_INSTRUCTIONS` dictionary showing cover mode instruction
- User's prior testing notes (`pipeline_hum_to_instrument.md`) -- Validated configuration and results
### Secondary (MEDIUM confidence)
- `ace-step/requirements.txt` -- Dependency versions for the project environment
### Tertiary (LOW confidence)
- None -- all findings verified against actual source code
## Metadata
**Confidence breakdown:**
- Standard stack: HIGH -- all libraries already installed and validated in prior testing
- Architecture: HIGH -- direct API import path fully traced through cli.py source code
- Pitfalls: HIGH -- identified from actual code analysis (not speculation)
**Research date:** 2026-04-11
**Valid until:** 2026-05-11 (stable -- ACE-Step codebase is local and pinned)

142
archive/midi_to_audio.py Normal file
View file

@ -0,0 +1,142 @@
"""
MIDI to Audio Synthesizer
=========================
Renders MIDI files to clean WAV audio using sine-wave synthesis with pitch bend support.
Produces a clean melody signal suitable for MusicGen melody conditioning.
Usage:
python midi_to_audio.py input.mid # outputs input_synth.wav
python midi_to_audio.py input.mid -o output.wav # custom output path
python midi_to_audio.py input.mid --sample-rate 32000 # match MusicGen's 32kHz
"""
import argparse
import os
import mido
import numpy as np
import scipy.io.wavfile as wavfile
def midi_note_to_freq(note: int) -> float:
"""Convert MIDI note number to frequency in Hz."""
return 440.0 * (2.0 ** ((note - 69) / 12.0))
def render_midi(midi_path: str, sample_rate: int = 32000) -> np.ndarray:
"""Render a MIDI file to audio using sine synthesis with pitch bend."""
mid = mido.MidiFile(midi_path)
# Calculate total duration
total_seconds = mid.length
total_samples = int(total_seconds * sample_rate) + sample_rate # +1s padding
audio = np.zeros(total_samples, dtype=np.float64)
# Process each track
for track in mid.tracks:
current_time = 0.0 # in seconds
tempo = 500000 # default 120 BPM
active_notes = {} # note -> (start_sample, velocity)
current_pitch_bend = 0 # in MIDI pitch bend units (-8192 to 8191)
pitch_bend_range = 2 # semitones (standard GM default)
for msg in track:
# Advance time
if msg.time > 0:
delta_seconds = mido.tick2second(msg.time, mid.ticks_per_beat, tempo)
current_time += delta_seconds
if msg.type == 'set_tempo':
tempo = msg.tempo
elif msg.type == 'pitchwheel':
current_pitch_bend = msg.pitch
elif msg.type == 'note_on' and msg.velocity > 0:
sample_pos = int(current_time * sample_rate)
active_notes[msg.note] = (sample_pos, msg.velocity, current_pitch_bend)
elif msg.type == 'note_off' or (msg.type == 'note_on' and msg.velocity == 0):
if msg.note in active_notes:
start_sample, velocity, start_bend = active_notes.pop(msg.note)
end_sample = int(current_time * sample_rate)
if end_sample <= start_sample:
continue
# Average the pitch bend (simplification - uses start bend)
bend_semitones = (start_bend / 8192.0) * pitch_bend_range
freq = midi_note_to_freq(msg.note + bend_semitones)
# Generate tone
n_samples = end_sample - start_sample
t = np.arange(n_samples) / sample_rate
tone = np.sin(2 * np.pi * freq * t)
# Apply ADSR envelope
envelope = np.ones(n_samples)
attack = min(int(0.01 * sample_rate), n_samples) # 10ms attack
release = min(int(0.05 * sample_rate), n_samples) # 50ms release
if attack > 0:
envelope[:attack] = np.linspace(0, 1, attack)
if release > 0:
envelope[-release:] = np.linspace(1, 0, release)
# Scale by velocity
amplitude = velocity / 127.0 * 0.5
tone *= envelope * amplitude
# Mix into output
end_idx = min(start_sample + n_samples, len(audio))
actual_len = end_idx - start_sample
audio[start_sample:end_idx] += tone[:actual_len]
# Normalize
peak = np.max(np.abs(audio))
if peak > 0:
audio = audio / peak * 0.9
# Trim trailing silence
nonzero = np.nonzero(np.abs(audio) > 0.001)[0]
if len(nonzero) > 0:
end = min(nonzero[-1] + sample_rate, len(audio)) # +1s tail
audio = audio[:end]
return audio.astype(np.float32)
def main():
parser = argparse.ArgumentParser(description="Render MIDI to clean audio")
parser.add_argument("input", help="Input MIDI file")
parser.add_argument("-o", "--output", help="Output WAV path")
parser.add_argument("--sample-rate", "-sr", type=int, default=32000,
help="Sample rate (default: 32000, matches MusicGen)")
args = parser.parse_args()
print(f"Reading MIDI: {args.input}")
mid = mido.MidiFile(args.input)
print(f" Duration: {mid.length:.1f}s")
print(f" Tracks: {len(mid.tracks)}")
# Count notes
note_count = sum(1 for track in mid.tracks for msg in track
if msg.type == 'note_on' and msg.velocity > 0)
print(f" Notes: {note_count}")
print(f"Rendering at {args.sample_rate}Hz...")
audio = render_midi(args.input, args.sample_rate)
print(f" Output duration: {len(audio) / args.sample_rate:.1f}s")
if args.output:
output_path = args.output
else:
base = os.path.splitext(args.input)[0]
output_path = f"{base}_synth.wav"
os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
wavfile.write(output_path, args.sample_rate, audio)
print(f"Saved: {output_path}")
if __name__ == "__main__":
main()

164
archive/musicgen_melody.py Normal file
View file

@ -0,0 +1,164 @@
"""
MusicGen Melody Large - Hum to Instrument
==========================================
Feed it a WAV/MP3 of you humming + a text prompt describing the instrument,
and it outputs that melody played on the described instrument.
Usage:
python musicgen_melody.py --input hum.wav --prompt "solo acoustic piano, gentle and warm"
python musicgen_melody.py --input hum.wav --prompt "solo electric guitar, jazz improvisation" --duration 20
python musicgen_melody.py --input hum.wav --prompt "solo saxophone, smooth jazz" --output sax_output.wav
Without --input, generates from text prompt only (no melody conditioning).
"""
import argparse
import os
import sys
import time
import numpy as np
import torch
import torchaudio
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration
def load_audio(path: str, target_sr: int = 32000):
"""Load audio file and resample to target sample rate."""
waveform, sr = torchaudio.load(path)
# Convert to mono if stereo
if waveform.shape[0] > 1:
waveform = waveform.mean(dim=0, keepdim=True)
# Resample if needed
if sr != target_sr:
resampler = torchaudio.transforms.Resample(sr, target_sr)
waveform = resampler(waveform)
return waveform, target_sr
def main():
parser = argparse.ArgumentParser(description="MusicGen Melody - Hum to Instrument")
parser.add_argument("--input", "-i", type=str, default=None,
help="Path to input audio (WAV/MP3) of humming/melody (8-30 seconds)")
parser.add_argument("--prompt", "-p", type=str, required=True,
help="Text prompt describing desired instrument and style")
parser.add_argument("--duration", "-d", type=int, default=None,
help="Output duration in seconds (default: match input length, max 30)")
parser.add_argument("--output", "-o", type=str, default=None,
help="Output WAV path (default: auto-generated in output/musicgen/)")
parser.add_argument("--guidance", "-g", type=float, default=3.0,
help="Classifier-free guidance scale (default: 3.0, range 1-5)")
parser.add_argument("--top-k", type=int, default=250,
help="Top-k sampling (default: 250)")
parser.add_argument("--top-p", type=float, default=0.0,
help="Top-p nucleus sampling (default: 0.0 = disabled)")
parser.add_argument("--temperature", "-t", type=float, default=1.0,
help="Sampling temperature (default: 1.0)")
parser.add_argument("--seed", "-s", type=int, default=-1,
help="Random seed (-1 = random)")
args = parser.parse_args()
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
if device == "cuda":
print(f"GPU: {torch.cuda.get_device_name(0)}")
# Load model
print("Loading MusicGen Melody Large...")
t0 = time.time()
processor = AutoProcessor.from_pretrained("facebook/musicgen-melody-large")
model = MusicgenMelodyForConditionalGeneration.from_pretrained("facebook/musicgen-melody-large")
model = model.to(device)
print(f"Model loaded in {time.time() - t0:.1f}s")
print(f"VRAM used: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
# Determine duration
sample_rate = model.config.audio_encoder.sampling_rate # 32000
max_duration = 30
duration = min(args.duration or max_duration, max_duration)
# Prepare inputs
if args.input:
print(f"Loading input audio: {args.input}")
waveform, sr = load_audio(args.input, target_sr=sample_rate)
input_duration = waveform.shape[1] / sr
print(f"Input duration: {input_duration:.1f}s")
if input_duration > max_duration:
print(f"Warning: Input longer than {max_duration}s, truncating.")
waveform = waveform[:, :max_duration * sr]
input_duration = max_duration
if args.duration is None:
duration = min(int(input_duration) + 2, max_duration) # slight padding
print(f"Auto duration: {duration}s (input + 2s padding)")
audio_np = waveform.squeeze(0).numpy().astype(np.float32)
inputs = processor(
audio=audio_np,
sampling_rate=sample_rate,
text=[args.prompt],
padding=True,
return_tensors="pt",
).to(device)
print(f"Melody conditioning active: input_features shape = {inputs['input_features'].shape}")
else:
print("No input audio - generating from text prompt only.")
inputs = processor(
text=[args.prompt],
padding=True,
return_tensors="pt",
).to(device)
# Calculate max new tokens from duration
# MusicGen generates at ~50 tokens per second of audio
max_new_tokens = int(duration * 50)
# Seed right before generation (after model loading) so conditioning is not overridden
if args.seed >= 0:
torch.manual_seed(args.seed)
if device == "cuda":
torch.cuda.manual_seed(args.seed)
print(f"Seed: {args.seed}")
print(f"\nGenerating {duration}s of audio...")
print(f" Prompt: {args.prompt}")
print(f" Guidance: {args.guidance}")
print(f" Temperature: {args.temperature}")
print(f" Top-k: {args.top_k}")
t0 = time.time()
with torch.inference_mode():
audio_values = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
guidance_scale=args.guidance,
do_sample=True,
temperature=args.temperature,
top_k=args.top_k,
top_p=args.top_p if args.top_p > 0 else None,
)
gen_time = time.time() - t0
print(f"Generated in {gen_time:.1f}s (RTF: {gen_time/duration:.2f}x)")
# Save output
audio = audio_values[0, 0].cpu() # [samples]
# Create output directory
output_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output", "musicgen")
os.makedirs(output_dir, exist_ok=True)
if args.output:
output_path = args.output
else:
timestamp = time.strftime("%Y%m%d_%H%M%S")
safe_prompt = args.prompt[:40].replace(" ", "_").replace(",", "")
output_path = os.path.join(output_dir, f"{timestamp}_{safe_prompt}.wav")
torchaudio.save(output_path, audio.unsqueeze(0), sample_rate)
print(f"\nSaved: {output_path}")
print(f"Duration: {audio.shape[0] / sample_rate:.1f}s @ {sample_rate}Hz")
if __name__ == "__main__":
main()

59
batch_generate.py Normal file
View file

@ -0,0 +1,59 @@
"""Batch generation script — 100 outputs across instruments and parameter permutations."""
import subprocess
import sys
import itertools
import random
import os
BASE = os.path.dirname(os.path.abspath(__file__))
PYTHON = os.path.join(BASE, "ace-step", ".venv", "Scripts", "python.exe")
SCRIPT = os.path.join(BASE, "hum2inst.py")
INPUT = os.path.join(BASE, "input", "bum bum bum [2026-04-10 230403].wav")
OUTPUT = os.path.join(BASE, "output", "batch-bumbum")
instruments = ["piano", "guitar", "saxophone", "violin", "flute",
"cello", "trumpet", "organ", "marimba", "harmonica"]
# Parameter combos: (strength, guidance)
param_combos = [
(0.2, 5.0),
(0.3, 5.0), # our best default
(0.4, 5.0),
(0.3, 7.0),
(0.3, 3.0),
]
# 10 instruments x 5 param combos x 2 takes = 100 outputs
takes_per_combo = 2
count = 0
total = len(instruments) * len(param_combos) * takes_per_combo
print(f"Generating {total} outputs...")
for instrument in instruments:
for strength, guidance in param_combos:
seeds = [random.randint(0, 2**31 - 1) for _ in range(takes_per_combo)]
for seed in seeds:
count += 1
print(f"\n[{count}/{total}] {instrument} str={strength} guide={guidance} seed={seed}")
cmd = [
PYTHON, SCRIPT, INPUT,
"--instrument", instrument,
"--output", OUTPUT,
"--strength", str(strength),
"--noise-strength", "0.0",
"--guidance", str(guidance),
"--seed", str(seed),
]
result = subprocess.run(cmd, capture_output=True, text=True)
# Print just the output/log lines
for line in result.stdout.splitlines():
if "Output saved" in line or "Run log" in line:
print(f" {line.strip()}")
if result.returncode != 0:
err_lines = [l for l in result.stderr.splitlines() if "ERROR" in l]
for l in err_lines:
print(f" FAILED: {l.strip()}")
print(f"\nDone! {count} outputs in {OUTPUT}/")

367
hum2inst.py Normal file
View file

@ -0,0 +1,367 @@
#!/usr/bin/env python3
"""
hum2inst.py - Convert humming to instrument audio using ACE-Step XL-SFT cover mode.
Usage:
python hum2inst.py input.wav --instrument piano
python hum2inst.py input.wav --instrument guitar --strength 0.85 --output ./my_output/
python hum2inst.py input.wav --instrument saxophone --duration 15
"""
import argparse
import json
import math
import os
import shutil
import sys
import tempfile
from datetime import datetime
from pathlib import Path
# ---------------------------------------------------------------------------
# Caption templates for common instruments
# ---------------------------------------------------------------------------
CAPTION_TEMPLATES = {
"piano": "solo acoustic piano, gentle melody, warm tone, clear and expressive",
"guitar": "solo acoustic guitar, fingerpicked melody, warm and intimate",
"saxophone": "solo saxophone, smooth jazz melody, soulful and expressive",
"violin": "solo violin, classical melody, rich and emotional",
"flute": "solo flute, gentle melody, airy and delicate",
}
def build_caption(instrument: str) -> str:
"""Build an ACE-Step caption from the instrument name."""
instrument_lower = instrument.lower().strip()
if instrument_lower in CAPTION_TEMPLATES:
return CAPTION_TEMPLATES[instrument_lower]
return f"solo {instrument_lower}, clear and expressive melody, warm tone"
def get_wav_duration(wav_path: str) -> float:
"""Return the duration of a WAV file in seconds using torchaudio."""
import torchaudio
info = torchaudio.info(wav_path)
return info.num_frames / info.sample_rate
def check_silence(audio_path: str, threshold_db: float = -60.0) -> bool:
"""Check if the output audio is near-silent.
Returns True if the audio is below the threshold (likely silent/failed).
"""
import torch
import torchaudio
waveform, _sr = torchaudio.load(audio_path)
rms = torch.sqrt(torch.mean(waveform ** 2))
if rms > 0:
rms_db = 20 * torch.log10(rms).item()
else:
rms_db = -float("inf")
return rms_db < threshold_db
def parse_args() -> argparse.Namespace:
"""Parse command-line arguments."""
parser = argparse.ArgumentParser(
description="Convert a hummed melody to an instrument rendition using ACE-Step.",
epilog="Example: python hum2inst.py humming.wav --instrument piano",
)
parser.add_argument(
"input",
type=str,
help="Path to the input WAV file containing humming",
)
parser.add_argument(
"--instrument",
type=str,
required=True,
help="Target instrument name (e.g., piano, guitar, saxophone, violin, flute)",
)
parser.add_argument(
"--caption",
type=str,
default=None,
help="Override the auto-generated caption (default: built from instrument name)",
)
parser.add_argument(
"--output",
type=str,
default="./output/",
help="Output directory for the generated audio (default: ./output/)",
)
parser.add_argument(
"--strength",
type=float,
default=0.3,
help="Audio cover strength (0.0-1.0, fraction of steps using cover conditioning, default: 0.3)",
)
parser.add_argument(
"--noise-strength",
type=float,
default=0.0,
help="Cover noise strength (0.0-1.0, 0=pure noise start, 1=closest to source audio, default: 0.0)",
)
parser.add_argument(
"--duration",
type=float,
default=None,
help="Override output duration in seconds (default: auto-detect from input WAV)",
)
parser.add_argument(
"--guidance",
type=float,
default=5.0,
help="Classifier-free guidance scale — higher = follows caption more strictly (default: 5.0)",
)
parser.add_argument(
"--steps",
type=int,
default=50,
help="Number of inference/denoising steps (default: 50)",
)
parser.add_argument(
"--shift",
type=float,
default=1.0,
help="Timestep shift factor — warps denoising schedule (default: 1.0)",
)
parser.add_argument(
"--sampler",
type=str,
default="euler",
choices=["euler", "heun"],
help="Sampler mode: euler (fast) or heun (2nd-order, more accurate) (default: euler)",
)
parser.add_argument(
"--vel-clamp",
type=float,
default=0.0,
help="Velocity norm threshold — clamps outlier predictions to reduce artifacts (0=off, try 2.0) (default: 0.0)",
)
parser.add_argument(
"--vel-ema",
type=float,
default=0.0,
help="Velocity EMA smoothing — smooths predictions across steps (0=off, try 0.1) (default: 0.0)",
)
parser.add_argument(
"--seed",
type=int,
default=None,
help="Random seed for reproducible results (default: random)",
)
parser.add_argument(
"--takes",
type=int,
default=1,
help="Number of takes to generate with different seeds (default: 1)",
)
return parser.parse_args()
def main() -> None:
args = parse_args()
# -----------------------------------------------------------------------
# Validate input file
# -----------------------------------------------------------------------
input_path = os.path.abspath(args.input)
if not os.path.isfile(input_path):
print(f"ERROR: Input file not found: {input_path}", file=sys.stderr)
sys.exit(1)
# -----------------------------------------------------------------------
# CUDA check
# -----------------------------------------------------------------------
print("Checking GPU availability...")
import torch
if not torch.cuda.is_available():
print(
"ERROR: CUDA GPU required. No CUDA-capable GPU detected.",
file=sys.stderr,
)
sys.exit(1)
print(f"GPU detected: {torch.cuda.get_device_name(0)}")
# -----------------------------------------------------------------------
# Detect input duration
# -----------------------------------------------------------------------
if args.duration is not None:
duration = round(args.duration)
print(f"Using override duration: {duration}s")
else:
raw_duration = get_wav_duration(input_path)
duration = round(raw_duration)
print(f"Input duration: {raw_duration:.1f}s (rounded to {duration}s)")
# -----------------------------------------------------------------------
# Build caption
# -----------------------------------------------------------------------
caption = args.caption if args.caption else build_caption(args.instrument)
print(f"Caption: {caption}")
# -----------------------------------------------------------------------
# Create output directory
# -----------------------------------------------------------------------
output_dir = os.path.abspath(args.output)
os.makedirs(output_dir, exist_ok=True)
# -----------------------------------------------------------------------
# Initialize ACE-Step
# -----------------------------------------------------------------------
print("Loading ACE-Step model...")
script_dir = os.path.dirname(os.path.abspath(__file__))
ace_step_dir = os.path.join(script_dir, "ace-step")
sys.path.insert(0, ace_step_dir)
from acestep.handler import AceStepHandler
from acestep.llm_inference import LLMHandler
from acestep.inference import GenerationParams, GenerationConfig, generate_music
from acestep.gpu_config import get_gpu_config, set_global_gpu_config
gpu_config = get_gpu_config()
set_global_gpu_config(gpu_config)
dit_handler = AceStepHandler()
llm_handler = LLMHandler()
dit_handler.initialize_service(
project_root=ace_step_dir,
config_path="acestep-v15-xl-sft",
device="cuda",
use_flash_attention=dit_handler.is_flash_attention_available("cuda"),
)
print("Model loaded successfully.")
# -----------------------------------------------------------------------
# Generate takes
# -----------------------------------------------------------------------
import random
instrument_clean = args.instrument.lower().strip().replace(" ", "-")
num_takes = args.takes
seeds = []
if args.seed is not None:
# Explicit seed: use it for take 1, derive sequential seeds for additional takes
seeds = [args.seed + i for i in range(num_takes)]
else:
seeds = [random.randint(0, 2**31 - 1) for _ in range(num_takes)]
for take_idx, seed in enumerate(seeds):
take_label = f"[take {take_idx + 1}/{num_takes}]" if num_takes > 1 else ""
params = GenerationParams(
task_type="cover",
src_audio=input_path,
caption=caption,
lyrics="",
instrumental=True,
duration=duration,
bpm=120,
audio_cover_strength=args.strength,
cover_noise_strength=args.noise_strength,
inference_steps=args.steps,
guidance_scale=args.guidance,
shift=args.shift,
sampler_mode=args.sampler,
velocity_norm_threshold=args.vel_clamp,
velocity_ema_factor=args.vel_ema,
thinking=False,
seed=seed,
)
config = GenerationConfig(
batch_size=1,
use_random_seed=False,
audio_format="wav",
)
temp_save_dir = tempfile.mkdtemp(prefix="hum2inst_")
print(f"Generating {args.instrument} cover (seed={seed}) {take_label}...")
try:
result = generate_music(
dit_handler, llm_handler, params, config, save_dir=temp_save_dir
)
except Exception as e:
print(f"ERROR: Generation failed: {e}", file=sys.stderr)
if num_takes == 1:
sys.exit(1)
continue
if not result.success or not result.audios:
error_msg = getattr(result, "error", None) or "no audio produced"
print(f"ERROR: Generation failed: {error_msg}", file=sys.stderr)
if num_takes == 1:
sys.exit(1)
continue
generated_path = result.audios[0]["path"]
if not os.path.isfile(generated_path):
print(f"ERROR: Expected output file not found: {generated_path}", file=sys.stderr)
if num_takes == 1:
sys.exit(1)
continue
if check_silence(generated_path):
print(
"WARNING: Output audio appears to be near-silent. "
"The generation may have failed.",
file=sys.stderr,
)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
final_filename = f"{instrument_clean}_{timestamp}_s{seed}.wav"
final_path = os.path.join(output_dir, final_filename)
shutil.copy2(generated_path, final_path)
log_path = final_path.replace(".wav", ".json")
run_log = {
"timestamp": timestamp,
"input": input_path,
"output": final_path,
"instrument": args.instrument,
"caption": caption,
"duration": duration,
"strength": args.strength,
"noise_strength": args.noise_strength,
"inference_steps": args.steps,
"guidance_scale": args.guidance,
"shift": args.shift,
"sampler": args.sampler,
"vel_clamp": args.vel_clamp,
"vel_ema": args.vel_ema,
"seed": seed,
"take": take_idx + 1,
"total_takes": num_takes,
}
with open(log_path, "w") as f:
json.dump(run_log, f, indent=2)
try:
shutil.rmtree(temp_save_dir)
except OSError:
pass
print(f"Output saved: {final_path}")
print(f"Run log: {log_path}")
if __name__ == "__main__":
try:
main()
except KeyboardInterrupt:
print("\nInterrupted.", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"ERROR: Unexpected error: {e}", file=sys.stderr)
sys.exit(1)