docs: complete project research for AI Music Pipeline

Research covers technology stack (ACE-Step vs MusicGen), feature
landscape, MusicGen melody architecture analysis, and domain pitfalls.
Key finding: ACE-Step 1.5 cover mode is the proven primary engine;
MusicGen via HF transformers has fundamental conditioning flaws.
John Lightner 2026-04-11 01:31:32 -05:00
parent 9d613bdcad
commit b51cf7cd6b
5 changed files with 1045 additions and 0 deletions

@ -0,0 +1,306 @@
# Architecture Patterns: MusicGen Melody Conditioning Pipeline
**Domain:** Melody-conditioned music generation (voice-to-instrument conversion)
**Researched:** 2026-04-11
**Overall confidence:** HIGH (verified against source code + original audiocraft + paper)
## System Architecture Overview
MusicGen Melody uses a **prefix-concatenation conditioning architecture** -- fundamentally different from the cross-attention conditioning used in standard MusicGen for text. Understanding why this matters is key to diagnosing conditioning failures.
```
Raw Audio Input
|
v
[Feature Extraction] -- STFT (n_fft=16384, hop=4096) --> Spectrogram
| |
| chroma_filter_bank (12 bins)
| |
v v
Waveform (32kHz) Raw Chromagram
|
normalize(inf-norm)
|
argmax --> one-hot binarization
|
input_features [B, T_chroma, 12]
|
Text Prompt |
| |
[T5 Encoder] [audio_enc_to_dec_proj]
| nn.Linear(12, 1536)
v |
text_hidden [B, T_text, 1536] v
| audio_hidden [B, 235, 1536]
| |
+------- [Concatenate dim=1] -------------------+
|
v
encoder_hidden_states [B, 235 + T_text, 1536]
|
[Concatenate with decoder input_embeds as PREFIX]
|
v
[Decoder Self-Attention (48 layers, causal)]
|
v
Audio token logits
|
[CFG weighting]
|
[Sampling (top-k=250)]
|
[EnCodec Decoder]
|
v
Output waveform (32kHz)
```
## Component Boundaries
| Component | Responsibility | Communicates With | Key Dimensions |
|-----------|---------------|-------------------|----------------|
| **MusicgenMelodyFeatureExtractor** | Raw audio --> binarized chromagram | Processor (wraps it) | Input: [samples] at 32kHz, Output: [B, T_chroma, 12] |
| **T5 Text Encoder** | Text prompt --> hidden states | Forward/generate method | Output: [B, T_text, hidden_size_t5] |
| **enc_to_dec_proj** | Project T5 output to decoder dim | Forward/generate method | nn.Linear(T5_hidden, 1536) |
| **audio_enc_to_dec_proj** | Project 12-dim chroma to decoder dim | Forward/generate method | nn.Linear(12, 1536) |
| **Chroma padding/truncation** | Ensure chroma is exactly 235 frames | Forward/generate method | Repeat or truncate to chroma_length=235 |
| **ConditionFuser (implicit)** | Concatenate audio + text as prefix | Decoder input construction | [B, 235+T_text, 1536] prepended to decoder embeddings |
| **MusicgenMelodyDecoder** | Autoregressive token generation | Gets prefix + decoder tokens | 48 layers, self-attention only (NO cross-attention) |
| **CFG Logits Processor** | Blend conditional + unconditional logits | Generation loop | `logits = uncond + scale * (cond - uncond)` |
| **EnCodec Decoder** | Audio tokens --> waveform | Post-generation | 4 codebooks at 50Hz, 32kHz output |
## Critical Data Flow: The Conditioning Signal Path
### Stage 1: Chromagram Extraction (Feature Extractor)
**What happens:**
1. Audio padded/truncated to 30s (960,000 samples at 32kHz)
2. Power spectrogram computed: n_fft=16384, hop_length=4096, power=2, normalized=True
3. Chroma filter bank applied via einsum: `raw_chroma = einsum("cf, ...ft -> ...ct", filters, spec)`
4. Inf-norm normalization per time frame
5. **Binarization**: argmax across 12 chroma bins, then one-hot encode -- only the dominant pitch class survives
**Output shape:** [B, T_chroma, 12] where T_chroma = ceil(960000 / 4096) = 235 frames (960000 / 4096 ≈ 234.4)
**Signal loss risk: MEDIUM.** The binarization is an intentional information bottleneck designed to prevent overfitting during training. It preserves pitch class (which of 12 semitones is loudest) but destroys amplitude, harmonics, and polyphonic information. For monophonic humming input, this should preserve the melody adequately -- the concern is whether downstream processing respects this signal.
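The normalize, argmax, one-hot sequence from steps 4 and 5 can be illustrated with a minimal sketch. It uses plain Python lists rather than the feature extractor's tensors, and `binarize_chroma` is a hypothetical helper name, not a function from the library.

```python
def binarize_chroma(frames):
    """Keep only the dominant pitch class in each 12-bin chroma frame."""
    one_hot = []
    for frame in frames:
        peak = max(abs(v) for v in frame)  # inf-norm of this time frame
        scaled = [v / peak for v in frame] if peak > 0 else list(frame)
        # argmax; an all-zero frame falls back to bin 0 (a choice of this sketch)
        winner = max(range(len(scaled)), key=scaled.__getitem__)
        one_hot.append([1 if i == winner else 0 for i in range(len(scaled))])
    return one_hot


# Two frames whose dominant bin is index 2 (D), with very different
# loudness and different secondary pitch classes.
loud = [0.0, 0.0, 9.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
quiet = [0.0, 0.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0]
binarized = binarize_chroma([loud, quiet])
print(binarized[0] == binarized[1])  # both frames collapse to the same one-hot vector
```

Amplitude and the secondary pitch class are gone after this step, which is the information bottleneck described above.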
### Stage 2: Projection (audio_enc_to_dec_proj)
**What happens:**
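- nn.Linear(12, 1536) projects each 12-dim one-hot chroma vector to the decoder's hidden dimension (recorded here as 1536 for melody-large; the width is checkpoint-dependent, so confirm it via `model.config.decoder.hidden_size`)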
- This is a learned projection -- the weights encode how each pitch class maps to the decoder's representation space
**Signal loss risk: LOW if weights are properly trained, HIGH if weights are near-zero.** This is a critical checkpoint to inspect: if these weights are near-zero or have very small magnitude, the chroma signal will be negligible compared to text conditioning and decoder embeddings.
### Stage 3: Padding/Truncation to chroma_length=235
**What happens:**
- If chroma sequence < 235 frames: repeat (tile) to fill, then truncate to exactly 235
- If chroma sequence > 235 frames: truncate with a warning
**Signal loss risk: LOW for standard 30s audio.** 30s at 32kHz with hop_length=4096 yields about 235 frames (960000 / 4096 ≈ 234.4), so it naturally fits. For shorter audio (e.g., 8s humming, about 63 frames), the chroma is tiled four times and then truncated to 235 frames, so the melody loops roughly 3.7 times in the prefix -- this could confuse the model.
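The frame arithmetic and the repeat-then-truncate behavior can be checked with a short sketch. The frame count uses the ceil(samples / hop) approximation from Stage 1, and `chroma_frames` and `pad_or_truncate` are hypothetical helper names, not library functions.

```python
import math

SAMPLE_RATE = 32_000
HOP_LENGTH = 4_096
CHROMA_LENGTH = 235


def chroma_frames(duration_s):
    """Approximate chroma frame count for a clip: ceil(samples / hop)."""
    return math.ceil(duration_s * SAMPLE_RATE / HOP_LENGTH)


def pad_or_truncate(frames, target=CHROMA_LENGTH):
    """Tile a sequence until it covers `target` frames, then cut to length."""
    if not frames:
        raise ValueError("need at least one chroma frame")
    repeats = math.ceil(target / len(frames))
    return (list(frames) * repeats)[:target]


n_short = chroma_frames(8)  # an 8 s hum
padded = pad_or_truncate(list(range(n_short)))  # frame indices stand in for chroma vectors
print(n_short, math.ceil(CHROMA_LENGTH / n_short), len(padded))
```

The frame index restarts at position 63 of `padded`, which is where the decoder would see the melody begin again.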
### Stage 4: Prefix Concatenation
**What happens:**
```python
# In decoder forward():
encoder_hidden_states = torch.cat([audio_hidden_states, encoder_hidden_states], dim=1)
# audio_hidden [B, 235, 1536] + text_hidden [B, T_text, 1536] --> [B, 235+T_text, 1536]
# Then prepended to decoder input:
inputs_embeds = torch.cat([encoder_hidden_states, inputs_embeds], dim=1)
# [B, 235+T_text+T_decoder, 1536]
```
**This is the architectural crux.** The conditioning signal exists as the first ~249 tokens (235 chroma + ~14 text tokens) of the sequence. The decoder uses **causal self-attention only** -- no cross-attention layers. Each generated token can attend to:
1. All 235 chroma prefix tokens
2. All text prefix tokens
3. All previously generated audio tokens
**Signal loss risk: MODERATE-HIGH during long generation.** As the sequence grows, the prefix represents a shrinking fraction of the attention context. With 1500 tokens of generated audio (30s at 50Hz), the 235 chroma tokens are only about 13% of the context (235 of roughly 1,750 positions). Self-attention may learn to weight recent audio tokens more heavily than the distant prefix, causing melody drift.
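A quick calculation shows how the prefix share shrinks as generation proceeds. It assumes the ~14 text tokens mentioned above and counts sequence positions only; the share of positions is not the same thing as the share of attention weight, which would have to be measured on the model.

```python
TOKENS_PER_SECOND = 50  # EnCodec frame rate from the component table


def prefix_share(n_generated, n_chroma=235, n_text=14):
    """Fraction of positions held by the chroma tokens and by the whole prefix."""
    total = n_chroma + n_text + n_generated
    return n_chroma / total, (n_chroma + n_text) / total


for seconds in (5, 15, 30):
    chroma, prefix = prefix_share(seconds * TOKENS_PER_SECOND)
    print(f"{seconds:>2}s generated: chroma {chroma:.1%}, full prefix {prefix:.1%}")
```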
### Stage 5: Classifier-Free Guidance (CFG)
**What happens in the HuggingFace implementation:**
```python
# In _prepare_encoder_hidden_states_kwargs_for_generation():
# Text: conditional doubled with zeros for unconditional
encoder_hidden_states = torch.concatenate(
    [encoder_hidden_states, torch.zeros_like(encoder_hidden_states)], dim=0
)
# Audio: conditional doubled with null_audio (zeros except [:,:,0]=1)
audio_hidden_states = torch.concatenate(
    [audio_hidden_states, null_audio_hidden_states], dim=0
)
```
The batch is doubled: first half is conditional (real chroma + real text), second half is unconditional (null chroma + zeroed text). The CFG formula then applies:
```
logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
```
**Signal loss risk: THIS IS THE MOST LIKELY FAILURE POINT.**
The null audio conditioning is NOT all-zeros. It is:
```python
null_audio_hidden_states = torch.zeros(...)
null_audio_hidden_states[:, :, 0] = 1 # First chroma bin (C) set to 1
```
This means the "unconditional" path still has a chromagram -- a constant C-note signal. After projection through audio_enc_to_dec_proj, this produces a non-trivial hidden state. If the audio_enc_to_dec_proj projects chroma bin 0 to large values, the "unconditional" path may be very similar to the conditional path, causing:
```
cond_logits - uncond_logits ≈ 0
```
In other words: **CFG amplifies the DIFFERENCE between the conditional and unconditional paths. If the null chroma signal produces decoder behavior similar to the real chroma signal, the conditioning effect is canceled out.**
Compare this to the original audiocraft implementation where the null conditioning comes from processing a 1-sample silent waveform through the full ChromaStemConditioner, producing genuinely minimal/empty chroma features.
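The cancellation argument can be made concrete with the CFG formula applied to toy logits. The numbers below are illustrative values, not measurements from the model, and `apply_cfg` is a hypothetical helper that mirrors the formula quoted earlier.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """logits = uncond + scale * (cond - uncond), applied element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.5, -1.0]
distinct_uncond = [1.0, 0.5, 0.0]  # unconditional path behaves differently
similar_uncond = [2.0, 0.5, -1.0]  # unconditional path mimics the conditional one

print(apply_cfg(cond, distinct_uncond, 3.0))  # difference is amplified
print(apply_cfg(cond, similar_uncond, 3.0))   # nothing to amplify
```

When the two paths agree, the guidance scale has no effect at all, so raising it cannot rescue a conditioning signal that the null path already reproduces.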
## Key Architectural Differences: HuggingFace vs Original Audiocraft
| Aspect | Original Audiocraft | HuggingFace Transformers |
|--------|-------------------|-------------------------|
| **Chroma extraction** | Demucs stem separation (removes drums/bass) THEN chroma | Direct chroma from raw audio (no Demucs) |
| **Null conditioning** | 1-sample silent wav through full pipeline | Hardcoded zeros-except-bin-0 tensor |
| **Fusion mechanism** | ConditionFuser class with explicit prepend tracking | Inline code in forward/generate methods |
| **Generation loop** | Streaming API with explicit prefix caching | Standard HF generate with KV cache |
| **CFG null text** | Zero tensor | Zero tensor (same) |
**The Demucs difference is significant.** The original model was trained with chroma extracted from Demucs-separated stems (vocals + other, excluding drums and bass). The HuggingFace implementation skips Demucs entirely and computes chroma directly from raw audio. This means:
1. For polyphonic music input: drums/bass frequencies contaminate the chroma, producing different features than what the model was trained on
2. For humming input: this is actually less of a problem since humming is essentially a clean melodic source with no drums/bass to remove
3. The model was trained to expect Demucs-filtered chroma patterns, and raw-audio chroma will have different statistical properties
## Patterns to Follow
### Pattern 1: Diagnostic Signal Tracing
**What:** Instrument each stage of the conditioning pipeline to verify signal magnitude and shape
**When:** Before making any architectural changes
**How:**
```python
# At each stage, log:
# 1. Tensor shape
# 2. Min/max/mean/std of values
# 3. Number of non-zero elements
# 4. Cosine similarity between conditional and unconditional paths
# Key checkpoints:
# - input_features after processor (verify chroma is not all-zeros or constant)
# - audio_enc_to_dec_proj output (verify weights produce meaningful projection)
# - encoder_hidden_states after concatenation (verify audio and text are both present)
# - cond_logits vs uncond_logits in CFG (verify they are actually different)
```
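A minimal version of the logging helpers might look like the sketch below. It works on flat lists of floats so that it runs without PyTorch; with real tensors one would first flatten them (for example via `tensor.detach().float().flatten().tolist()`) or use the equivalent tensor reductions. The names `summarize` and `cosine_similarity` are hypothetical.

```python
import math
import statistics


def summarize(name, values):
    """Basic statistics for one checkpoint in the conditioning path."""
    return {
        "name": name,
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "nonzero": sum(1 for v in values if v != 0),
    }


def cosine_similarity(a, b):
    """Cosine similarity between two equally long flat vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


stats = summarize("input_features", [0.0, 1.0, 0.0, 3.0])
print(stats)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

A cosine similarity close to 1.0 between the conditional and unconditional logits would support the CFG cancellation hypothesis.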
### Pattern 2: Ablation Testing
**What:** Systematically disable/modify one component at a time to isolate the failure
**When:** After signal tracing identifies suspicious components
**How:**
- Test 1: Bypass CFG entirely (guidance_scale=1.0) -- does conditioning work without CFG?
- Test 2: Replace null chroma with true zeros (not bin-0=1) -- does CFG delta increase?
- Test 3: Amplify audio_enc_to_dec_proj output by 10x -- does melody become audible?
- Test 4: Feed known synthetic chromagram (chromatic scale) -- is it reflected in output?
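For Test 4, a synthetic chromagram is easy to build by hand. The sketch below produces a one-hot chromatic scale with the [T_chroma, 12] layout described earlier; `chromatic_scale_chroma` and the choice of 10 frames per note (about 1.28s at a 128ms hop) are assumptions for illustration. Wrapping the result in a batch dimension and a tensor before passing it as `input_features` is left out here.

```python
NUM_CHROMA = 12
CHROMA_LENGTH = 235


def chromatic_scale_chroma(num_frames=CHROMA_LENGTH, frames_per_note=10):
    """One-hot chroma frames stepping C, C#, D, ... and wrapping after B."""
    frames = []
    for t in range(num_frames):
        pitch_class = (t // frames_per_note) % NUM_CHROMA
        frames.append([1.0 if i == pitch_class else 0.0 for i in range(NUM_CHROMA)])
    return frames


scale = chromatic_scale_chroma()
print(len(scale), scale[0].index(1.0), scale[10].index(1.0), scale[120].index(1.0))
```

Because the expected pitch-class sequence is known exactly, any pitch tracking on the generated audio can be compared against it without ambiguity.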
### Pattern 3: Weight Inspection
**What:** Examine the learned weights of the audio_enc_to_dec_proj layer
**When:** Early in investigation
**How:**
```python
proj = model.audio_enc_to_dec_proj
print(f"Weight shape: {proj.weight.shape}")
print(f"Weight stats: min={proj.weight.min():.6f}, max={proj.weight.max():.6f}, "
      f"mean={proj.weight.mean():.6f}, std={proj.weight.std():.6f}")
print(f"Bias stats: min={proj.bias.min():.6f}, max={proj.bias.max():.6f}")
# Compare to enc_to_dec_proj (text) for scale reference
```
## Anti-Patterns to Avoid
### Anti-Pattern 1: Treating Melody Conditioning Like Cross-Attention
**What:** Assuming the melody signal is maintained throughout generation like cross-attention would
**Why bad:** Prefix concatenation means the signal is ONLY in the first 235 positions and must be "remembered" through causal self-attention across potentially 1500+ generated tokens. Cross-attention re-injects the signal at every layer and every time step.
**Instead:** Understand that melody influence naturally decays over sequence length. This is by design in the original architecture but may not be strong enough for faithful melody following.
### Anti-Pattern 2: Assuming HuggingFace Port is Identical to Audiocraft
**What:** Debugging against the original paper's architecture description
**Why bad:** The HuggingFace port has substantive differences: no Demucs preprocessing, different null conditioning, different generation loop
**Instead:** Debug against the actual HuggingFace source code in site-packages
### Anti-Pattern 3: Only Changing Generation Parameters
**What:** Tweaking temperature/top_k/guidance_scale without understanding the conditioning pipeline
**Why bad:** If the conditioning signal is zero or near-zero entering the decoder, no amount of sampling parameter adjustment will recover it
**Instead:** First verify the signal is non-trivial at each stage, THEN tune generation parameters
## Theoretical Analysis: Can Prefix Concatenation Maintain Melody Fidelity?
### The Case FOR Prefix Concatenation Working
1. **Training objective**: The model was trained with this architecture. If conditioning was ineffective during training, the training loss would not have benefited from conditioning, and the model would have learned to ignore the prefix. The paper reports melody conditioning improves over baselines, suggesting the architecture functions.
2. **Self-attention is powerful**: Transformer self-attention can attend to any position in the context. A well-trained model can learn to use prefix positions as persistent conditioning. GPT-style models routinely use system prompts (prefix) to condition behavior over thousands of tokens.
3. **Positional encoding**: The decoder uses sinusoidal (fixed, not learned) position embeddings. The chroma occupies positions 0-234 and the text roughly 235-248, so the model can still learn "positions 0-234 contain melody information" as a structural prior.
### The Case AGAINST Faithful Melody Following
1. **Information bottleneck is severe**: 12-bin binarized chroma captures pitch class but not octave, timing precision, or dynamics. A hummed melody [C4, D4, E4, D4] and [C3, D5, E2, D4] produce identical chromagrams.
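2. **Temporal alignment is loose**: With hop_length=4096 at 32kHz, chroma frames advance in 128ms steps, and each frame is computed over a 512ms window (n_fft=16384). Melody timing finer than ~130ms is lost, and fast notes are smeared across neighboring frames. Combined with repeat-padding for short audio, temporal alignment is approximate at best.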
3. **Prefix attention may be too weak**: Cross-attention gives conditioning a dedicated attention pathway in every layer. With prefix conditioning, the chroma tokens share a single softmax with every generated audio token in each of the 48 layers, so their attention share can shrink as the sequence grows. Nothing in the architecture forces the decoder to keep attending to the prefix.
4. **CFG may actively suppress conditioning**: As analyzed above, the null chroma (bin-0=1) may produce sufficiently similar outputs to real chroma that CFG cancels the conditioning effect. This is the most actionable hypothesis.
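The octave collapse in point 1 and the frame timing in point 2 can both be verified with a few lines. `pitch_class` is a hypothetical helper that maps note names to the 12 chroma bins and ignores the octave digit.

```python
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def pitch_class(note):
    """Map a note name such as 'C#4' or 'Bb3' to a pitch class 0-11 (octave ignored)."""
    value = NOTE_OFFSETS[note[0].upper()]
    for ch in note[1:]:
        if ch == "#":
            value += 1
        elif ch == "b":
            value -= 1
    return value % 12


melody_a = [pitch_class(n) for n in ["C4", "D4", "E4", "D4"]]
melody_b = [pitch_class(n) for n in ["C3", "D5", "E2", "D4"]]
print(melody_a, melody_b, melody_a == melody_b)

hop_ms = 4096 * 1000 / 32000      # time step between chroma frames
window_ms = 16384 * 1000 / 32000  # span of audio behind each frame
print(hop_ms, window_ms)
```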
### Verdict
Prefix concatenation CAN work for melody conditioning -- the model was trained to use it, and the paper reports positive results. The question is whether the HuggingFace implementation's specific differences (no Demucs, hardcoded null chroma) break the conditioning pathway. **The most likely failure point is CFG interaction with the null chroma signal.**
## Suggested Investigation Order
The investigation should follow signal flow, starting from where issues are cheapest to detect and fix:
```
1. Weight Inspection (5 min)
|-- Are audio_enc_to_dec_proj weights near-zero? If yes: model weights issue
|-- Compare magnitude to enc_to_dec_proj (text)
|
2. Signal Tracing (15 min)
|-- Log tensor stats at every stage from input_features through to CFG
|-- Key question: is cond_logits meaningfully different from uncond_logits?
|
3. CFG Ablation (10 min)
|-- Generate with guidance_scale=1.0 (no CFG)
|-- If conditioning works without CFG: null chroma is the problem
|
4. Null Chroma Fix (10 min)
|-- Replace null_audio with true all-zeros (no bin-0=1)
|-- Or use the audiocraft approach: minimal silent wav through extractor
|
5. Demucs Integration (30-60 min, only if needed)
|-- Install Demucs, add stem separation before chroma extraction
|-- Only needed for polyphonic input; humming may not need this
|
6. Alternative: Original Audiocraft (60 min, fallback)
|-- If HF port is fundamentally broken, use audiocraft directly
|-- generate_with_chroma() is the authoritative implementation
```
**Dependency chain:** Steps 1 and 2 can run in parallel. Step 3 depends on 2 (a baseline is needed first). Steps 4-6 are alternative fixes chosen based on the findings from 1-3.
## Sources
- HuggingFace Transformers source: `transformers/models/musicgen_melody/modeling_musicgen_melody.py` (verified against installed v4.x) -- HIGH confidence
- HuggingFace Transformers source: `transformers/models/musicgen_melody/feature_extraction_musicgen_melody.py` -- HIGH confidence
- Original audiocraft ConditionFuser: [audiocraft/modules/conditioners.py](https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/modules/conditioners.py) -- HIGH confidence
- Original audiocraft LM: [audiocraft/models/lm.py](https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/models/lm.py) -- HIGH confidence
- MusicGen paper: [Simple and Controllable Music Generation](https://arxiv.org/abs/2306.05284) -- HIGH confidence (architecture description)
- AudioCraft conditioning docs: [CONDITIONING.md](https://github.com/facebookresearch/audiocraft/blob/main/docs/CONDITIONING.md) -- HIGH confidence
- [AI-Bites architecture explanation](https://www.ai-bites.net/musicgen-from-meta-ai-model-architecture-vector-quantization-and-model-conditining-explained/) -- MEDIUM confidence (secondary source)
- [MusicGen documentation](https://huggingface.co/docs/transformers/en/model_doc/musicgen_melody) -- HIGH confidence

@ -0,0 +1,86 @@
# Feature Landscape
**Domain:** Melody-conditioned music generation (voice-to-instrument conversion)
**Researched:** 2026-04-11
## Table Stakes
Features users expect. Missing = product feels incomplete.
| Feature | Why Expected | Complexity | Notes |
|---------|--------------|------------|-------|
| Melody fidelity - pitch contour preservation | The entire point. If output doesn't follow the hummed melody's pitch contour, the tool is useless. Users hum *that specific melody* because they want *that melody* played back on an instrument. | High | ACE-Step cover mode (strength 0.8) is currently the only approach that demonstrably preserves contour. MusicGen melody-large's chromagram conditioning is too lossy -- strips rhythm and reduces pitch to 12 chroma bins, losing octave and timing information. Fidelity measurement: compare input/output F0 contours via CREPE or PYIN pitch tracking. |
| Instrument/timbre selection via text prompt | Users need to say "piano" or "saxophone" and get that instrument. Without this, it's just a generic audio resynthesis tool. | Low | Both ACE-Step and MusicGen support text-prompt instrument selection. ACE-Step captions like "solo acoustic piano, gentle jazz melody" work well. This is a solved problem at the prompt level. |
| Humming/voice input acceptance | The pipeline's core input modality. Users should be able to record a raw hum and feed it in without manual preprocessing. | Medium | ACE-Step cover mode accepts raw humming WAV directly -- no MIDI extraction needed. This is a major advantage over MusicGen melody-large, which requires chromagram extraction that loses critical information. Basic Pitch can extract MIDI from humming for alternative paths but adds a lossy intermediate step. |
| Output audio quality >= 32kHz, musical coherence | Output must sound like a real instrument, not robotic or garbled. Must be listenable without cringing. | Medium | ACE-Step outputs at 44.1kHz (CD quality). MusicGen outputs at 32kHz. Both produce coherent musical audio. ACE-Step XL-SFT at 50 diffusion steps produces noticeably better quality than turbo mode (8 steps). |
| Rhythm preservation | Timing matters as much as pitch. If the user hums a syncopated melody, the output should be syncopated, not metronomic. | High | ACE-Step cover mode preserves rhythm structure via semantic code encoding. MusicGen chromagram discards rhythm entirely (only encodes chroma over time, not note onsets/durations). This is a key differentiator for ACE-Step. |
| Reproducibility / deterministic output | Given the same input, users should be able to get the same output (or intentionally vary it). Needed for iterating on a melody. | Low | Both models support seed control. ACE-Step has `use_random_seed` toggle. MusicGen supports `torch.manual_seed()` (after model loading -- seed before loading was a bug that caused identical outputs regardless of input). |
## Differentiators
Features that set product apart. Not expected, but valued.
| Feature | Value Proposition | Complexity | Notes |
|---------|-------------------|------------|-------|
| Multi-instrument rendering from single hum | Hum once, get piano + guitar + sax versions. Saves time for composers sketching arrangements. | Low | Already achievable by running the same input through multiple captions. No new model work needed -- just pipeline orchestration (loop over instrument list, batch configs). |
| Melody fidelity scoring / confidence report | Show users how closely the output matches their input contour. Builds trust and helps them decide whether to re-record their hum or adjust parameters. | Medium | Implement via pitch extraction (CREPE/PYIN) on both input and output, compute correlation or DTW distance on F0 contours. No ML training needed, pure signal processing. |
| Interactive parameter tuning | Let users adjust cover_strength, guidance_scale, BPM to iterate toward desired output. | Medium | ACE-Step already exposes these parameters. Building a simple UI (or CLI wizard) that presents meaningful presets ("faithful", "creative", "loose interpretation") would be more user-friendly than raw numbers. |
| Whistling and played-instrument input | Accept not just humming but also whistled melodies or played instrument recordings as input. Broader input flexibility. | Low | ACE-Step cover mode is input-agnostic -- it processes any audio through its VAE. Whistling, singing, played instruments all work as source audio. No additional work needed, just documentation/testing. |
| MIDI input path | Accept MIDI files as input for users who prefer precise notation over audio recording. | Medium | Requires MIDI-to-audio synthesis step (FluidSynth + SoundFont) before feeding into ACE-Step. Basic Pitch already extracts MIDI; the reverse path (MIDI -> synth -> cover) is the gap. |
| Style transfer beyond instrument | Not just "play this on piano" but "play this as a jazz piano ballad" or "play this as aggressive rock guitar". Richer creative control. | Low | Already supported by ACE-Step caption system. Captions like "solo jazz saxophone, smooth and soulful, slow tempo" vs "distorted electric guitar, aggressive rock riff" produce meaningfully different outputs. |
| Batch processing / queue | Process multiple humming recordings in sequence without manual intervention. | Low | Pure engineering -- TOML config generation + subprocess management. No model changes needed. |
| Duration matching | Auto-detect input audio length and set output duration to match, rather than requiring manual duration specification. | Low | Read input WAV duration via torchaudio/soundfile, pass to ACE-Step config. Trivial to implement. |
| BPM detection from input | Auto-detect tempo of humming input and set BPM parameter accordingly. | Medium | Use librosa.beat.beat_track() or similar on input audio. Humming BPM detection is less reliable than instrument audio but usually gets ballpark right. Fallback to user-specified BPM. |
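The fidelity scoring row mentions a DTW distance on F0 contours. A minimal sketch of that computation follows, assuming the F0 tracks (in Hz, with 0 for unvoiced frames) have already been extracted by a pitch tracker such as CREPE or PYIN. The function names are hypothetical, and this is the textbook dynamic-programming form of DTW with no windowing or normalization.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=440.0):
    """Convert voiced F0 values to semitones relative to ref_hz, dropping unvoiced frames."""
    return [12.0 * math.log2(f / ref_hz) for f in f0_hz if f > 0]


def dtw_distance(a, b):
    """Dynamic time warping cost between two contours, using absolute difference."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]


hum = hz_to_semitones([440.0, 0.0, 880.0])
print(hum)
print(dtw_distance([0, 2, 4], [0, 0, 2, 2, 4]))  # same contour, different timing
print(dtw_distance([0, 2, 4], [1, 3, 5]))        # contour shifted by one semitone
```

If the output instrument plays in a different octave than the hum, subtracting each contour's median before calling `dtw_distance` would be one way to make the score transposition-tolerant.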
## Anti-Features
Features to explicitly NOT build.
| Anti-Feature | Why Avoid | What to Do Instead |
|--------------|-----------|-------------------|
| Real-time streaming generation | Adds enormous complexity (model loading latency, streaming audio, GPU memory management) for minimal value in a creative workflow where 3-27 seconds of batch processing is perfectly acceptable. | Keep batch inference. ACE-Step generates 10s of audio in ~3s on RTX 4090. |
| Model training or fine-tuning | Out of scope, requires massive compute and data. Pretrained models are sufficient for the voice-to-instrument use case. | Use pretrained ACE-Step XL-SFT and MusicGen checkpoints as-is. |
| Lyric generation / singing synthesis | Different problem domain entirely (TTS/singing synthesis). The pipeline is about instrumental rendition of melodies, not generating vocals. | Keep `instrumental = true` in ACE-Step config. If users want vocals, that's a separate tool. |
| Web/mobile deployment | Adds deployment complexity (GPU serving, auth, bandwidth) without improving the core creative workflow. User has an RTX 4090 locally. | Local CLI/script execution. Consider Gradio UI later if polish is desired. |
| Multi-track arrangement generation | Generating full arrangements (drums + bass + melody + harmony) from a single hum is a much harder problem and dilutes focus from the core value: faithful melody-to-instrument conversion. | Generate single-instrument tracks. Users can layer them manually in a DAW. |
| Polyphonic input handling | Handling chords or harmonized humming adds complexity for a rare use case. Most users hum single-note melodies. | Assume monophonic input. Document this limitation. If polyphonic input is provided, the model will do its best but no guarantees. |
| Note-perfect transcription accuracy | Trying to make the output match every single note perfectly (like a MIDI playback engine) misses the point. Some creative interpretation by the model is desirable and unavoidable. | Set expectations: the tool preserves melodic contour and rhythm, not individual note precision. Use cover_strength to control fidelity vs. creativity tradeoff. |
## Feature Dependencies
```
Humming input acceptance --> Melody fidelity (can't assess fidelity without input)
Melody fidelity --> Instrument selection (fidelity must work before instrument choice matters)
Instrument selection --> Multi-instrument rendering (need single-instrument to work first)
Melody fidelity --> Fidelity scoring (need fidelity to exist before measuring it)
MIDI input path --> FluidSynth integration (MIDI needs audio synthesis before ACE-Step)
BPM detection --> Duration matching (both are input-analysis features, independent but related)
Output quality --> All downstream features (if audio quality is bad, nothing else matters)
```
## MVP Recommendation
Prioritize (in order):
1. **Melody fidelity via ACE-Step cover mode** -- This is validated and working (cover_strength=0.8). The foundation.
2. **Instrument selection via text caption** -- Already working. Just needs good default caption templates per instrument.
3. **Raw humming input** -- Already working with ACE-Step. No preprocessing needed.
4. **Duration matching** -- Trivial to implement, improves UX significantly.
5. **Multi-instrument batch rendering** -- Low complexity, high value. Loop over instrument captions.
Defer:
- **MIDI input path**: Requires FluidSynth integration. Nice to have but raw audio input already works.
- **BPM detection**: Humming BPM detection is unreliable. Manual BPM specification (or default 120) is fine for MVP.
- **Fidelity scoring**: Useful for iteration but not essential for first working pipeline.
- **Interactive parameter tuning UI**: CLI with good defaults is sufficient for MVP. Gradio UI is a polish item.
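Duration matching and multi-instrument batch rendering can be sketched together. The example reads the input length with Python's standard `wave` module (PCM WAV only) and builds one job description per instrument. The captions are the ones quoted in these notes; the dictionary keys and the `build_jobs` helper are hypothetical and do not reflect ACE-Step's actual config schema.

```python
import os
import tempfile
import wave

# Captions quoted elsewhere in these notes; the instrument keys are illustrative.
CAPTIONS = {
    "piano": "solo acoustic piano, gentle jazz melody",
    "saxophone": "solo jazz saxophone, smooth and soulful, slow tempo",
    "guitar": "distorted electric guitar, aggressive rock riff",
}


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def build_jobs(input_wav, instruments, cover_strength=0.8):
    """One job per instrument; keys are placeholders, not a real config schema."""
    duration = wav_duration_seconds(input_wav)
    return [
        {
            "source_audio": input_wav,
            "caption": CAPTIONS[name],
            "duration_s": duration,
            "cover_strength": cover_strength,
        }
        for name in instruments
    ]


# Demo: write two seconds of 16-bit mono silence, then plan three renders.
demo_path = os.path.join(tempfile.mkdtemp(), "hum.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(32_000)
    wav.writeframes(b"\x00\x00" * 64_000)

jobs = build_jobs(demo_path, ["piano", "saxophone", "guitar"])
for job in jobs:
    print(job["caption"], job["duration_s"])
```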
## Sources
- ACE-Step 1.5 GitHub and documentation: https://github.com/ace-step/ACE-Step-1.5 (HIGH confidence - official docs)
- ACE-Step Tutorial: https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/Tutorial.md (HIGH confidence)
- ACE-Step cover mode discussions: https://github.com/ace-step/ACE-Step-1.5/discussions/398 (MEDIUM confidence)
- MusicGen documentation: https://huggingface.co/docs/transformers/en/model_doc/musicgen_melody (HIGH confidence)
- MusicGen paper: https://arxiv.org/pdf/2306.05284 (HIGH confidence)
- Melody-Guided Music Generation (MG2): https://arxiv.org/abs/2409.20196 (MEDIUM confidence - research paper, not production tool)
- SingSong (Google): https://arxiv.org/abs/2301.12662 (MEDIUM confidence - research, no public weights)
- Basic Pitch (Spotify): https://github.com/spotify/basic-pitch (HIGH confidence)
- Project memory files documenting empirical testing results (HIGH confidence - firsthand testing)

@ -0,0 +1,255 @@
# Domain Pitfalls
**Domain:** Melody-conditioned music generation (voice-to-instrument conversion)
**Researched:** 2026-04-11
## Critical Pitfalls
Mistakes that cause complete failure of melody conditioning or require fundamental approach changes.
### Pitfall 1: Missing Demucs Stem Separation Before Chromagram Extraction
**What goes wrong:** The original audiocraft implementation runs Demucs stem separation on ALL input audio at inference time -- extracting vocals + "other" stems (indices 3, 2) while removing drums and bass -- before computing the chromagram. The HuggingFace transformers port does NOT run Demucs internally. It accepts either raw audio or pre-separated Demucs output (detected as 4D tensor). When you pass raw humming audio as a numpy array, the feature extractor skips stem separation entirely and computes the chromagram directly on the raw waveform.
For professional music input (which is what the model was trained on), this means the chromagram captures drums and bass energy that the model was never trained to see. For humming input, the raw voice waveform has a very different spectral profile than what the model expects (which is the stem-separated output of actual music).
**Why it happens:** The HuggingFace docs show Demucs as an optional preprocessing step, and the API works without it -- it just silently produces worse conditioning. Most users skip the Demucs step because the code runs without error.
**Consequences:** Melody conditioning appears weak or non-functional. The chromagram from raw audio (especially voice) does not match the distribution the model was trained on. This is likely the primary cause of the "melody conditioning has zero effect" observation in the project.
**Prevention:**
- For music input: Install Demucs, run stem separation, pass the 4D tensor to the processor
- For humming input: Either (a) skip Demucs entirely since humming IS the melody stem, but be aware the spectral shape differs from instrument recordings, or (b) convert humming to a clean instrument sound first (sine wave, piano) before feeding to MusicGen
**Detection:** Compare chromagram from raw humming vs chromagram from a clean piano playing the same notes. If they differ substantially (especially in which chroma bins are activated), the input preprocessing is the problem.
**Confidence:** HIGH -- verified by reading the actual HuggingFace feature extraction source code (lines 251-262 of feature_extraction_musicgen_melody.py) and the audiocraft GitHub issue #478 confirming Demucs runs at inference time in the original implementation.
**Phase:** Should be addressed first -- this is likely the root cause of the current "zero effect" problem.
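The detection step can be turned into a single number: the fraction of frames where two chromagrams pick the same dominant bin. The sketch below assumes both chromagrams are already time-aligned lists of 12-bin frames; `chroma_agreement` and the toy sequences are hypothetical.

```python
def dominant_bins(chroma):
    """Index of the strongest chroma bin in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in chroma]


def chroma_agreement(chroma_a, chroma_b):
    """Fraction of frames (over the shorter sequence) with the same dominant bin."""
    bins_a, bins_b = dominant_bins(chroma_a), dominant_bins(chroma_b)
    n = min(len(bins_a), len(bins_b))
    if n == 0:
        raise ValueError("both chromagrams need at least one frame")
    return sum(1 for x, y in zip(bins_a, bins_b) if x == y) / n


def one_hot(index, size=12):
    return [1.0 if i == index else 0.0 for i in range(size)]


hummed = [one_hot(i) for i in (0, 0, 2, 2, 4, 4, 7, 7)]
piano = [one_hot(i) for i in (0, 0, 2, 2, 4, 4, 4, 7)]
score = chroma_agreement(hummed, piano)
print(score)
```

A low score on real recordings would point at input preprocessing; a high score would shift suspicion to later stages of the pipeline.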
---
### Pitfall 2: Chromagram Binarization Destroys Dynamics and Polyphonic Information
**What goes wrong:** The HuggingFace feature extractor binarizes the chromagram: for each time step, it finds the argmax chroma bin, sets it to 1, and sets ALL others to 0. This means:
- Only one pitch class is active per time frame (no chords, no harmonics)
- The transition between notes is hard binary, not smooth
- If the dominant energy in a humming voice is in a harmonic rather than the fundamental, the wrong chroma bin gets selected
- All amplitude/dynamics information is lost
**Why it happens:** The original audiocraft training used binarized chromagrams for conditioning. This is by design -- chromagram conditioning is intentionally coarse, capturing only which pitch class is dominant at each moment, not precise pitch, octave, or amplitude.
**Consequences:** Users expect "melody conditioning" to mean "follow my exact melody" but it actually means "loosely stay in the same pitch class neighborhood." The model captures harmonic movement (e.g., C -> G -> Am progression) but NOT precise pitch contour, octave, rhythm, or dynamics. This is a fundamental limitation of the approach, not a bug.
**Prevention:**
- Understand that chromagram conditioning preserves pitch CLASS (C, D, E...) but not octave, timing precision, or dynamics
- Do not expect exact melody reproduction -- expect harmonic/tonal guidance
- If exact melody reproduction is needed, MusicGen melody is the wrong tool. Use ACE-Step cover mode instead (which operates on full VAE latents)
**Detection:** Print/visualize the binarized chromagram output. If it looks like a reasonable representation of your melody's pitch classes, the chromagram is fine -- the model just cannot do more precise conditioning than this.
**Confidence:** HIGH -- verified by reading the actual binarization code in `_torch_extract_fbank_features` (lines 142-145 of feature_extraction_musicgen_melody.py).
**Phase:** Understanding this shapes expectations for the entire project. If the goal is faithful melody reproduction, the project should pivot to ACE-Step or a different approach rather than debugging MusicGen further.
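To make the information loss concrete, here is a small sketch of argmax binarization as described above. It is an illustration written for this document, not the transformers implementation, and the frame values are invented.

```python
def binarize_chroma(chroma):
    """Keep only the strongest bin per frame: that bin becomes 1, all others 0."""
    binarized = []
    for frame in chroma:
        top = max(range(len(frame)), key=frame.__getitem__)
        binarized.append([1 if i == top else 0 for i in range(len(frame))])
    return binarized


# Hypothetical C major chord: energy in C (bin 0), E (bin 4) and G (bin 7).
loud_chord = [0.9, 0, 0, 0, 0.8, 0, 0, 0.7, 0, 0, 0, 0]
# The same chord played ten times quieter.
quiet_chord = [value / 10 for value in loud_chord]

binarized = binarize_chroma([loud_chord, quiet_chord])
# Both frames collapse to a single active bin (C): the chord tones E and G
# and the loudness difference are no longer visible to the model.
```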
---
### Pitfall 3: CFG Unconditional Path Zeros Out Audio Conditioning
**What goes wrong:** During classifier-free guidance (CFG), the model generates two predictions: one conditioned (on text + audio) and one unconditional. The unconditional prediction uses zeroed-out text encoder states AND zeroed-out audio features (a null chromagram where only bin 0 is set to 1). The final output is:
`output = unconditional + guidance_scale * (conditional - unconditional)`
With guidance_scale=3.0, this means the model amplifies the DIFFERENCE between conditioned and unconditioned outputs. If the audio conditioning signal is weak (e.g., because the chromagram from humming doesn't strongly match training distribution), the difference between conditional and unconditional is small, and CFG amplifies mostly the text conditioning difference.
The HuggingFace implementation does NOT support `cfg_coef_beta` (double CFG), which the original audiocraft provides. Double CFG uses a second coefficient to weight text and audio conditioning separately; audiocraft documents it as a way to push the text condition more than the audio condition. Without it, the balance between the two conditions cannot be adjusted at all: both are scaled by the same `guidance_scale`.
**Why it happens:** The HuggingFace port implements single CFG only. The `cfg_coef_beta` parameter from audiocraft's `set_generation_params` has no equivalent in the transformers `generate()` API.
**Consequences:** With single CFG at guidance_scale=3.0, text and audio conditioning are amplified together. You cannot independently boost melody following. If the text prompt is strong and the chromagram signal is weak, the text dominates.
**Prevention:**
- Use the original audiocraft library (`facebookresearch/audiocraft`) instead of HuggingFace transformers for melody conditioning -- it exposes `cfg_coef_beta`, which sets the text-versus-audio balance separately (it is documented as pushing text relative to audio, so whether it can also strengthen melody following needs to be tested)
- Alternatively, try a lower guidance_scale (1.5-2.0), which reduces CFG amplification overall and may leave the audio prefix with more relative influence when text dominates the conditional/unconditional difference
- Or try a higher guidance_scale (4.0-5.0), which amplifies both conditions and may improve overall conditioning adherence at the cost of audio quality
**Detection:** Generate the same prompt with and without audio conditioning at various guidance scales. If the outputs converge at high guidance_scale, CFG is drowning out the audio signal.
**Confidence:** HIGH for the missing feature. MEDIUM for the practical impact -- the exact interaction between single CFG and melody conditioning quality is empirical.
**Phase:** If staying with HuggingFace transformers, this limits what's achievable. The decision to switch to audiocraft should be made early.
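The single-CFG formula above can be traced with toy numbers. This sketch assumes, purely for illustration, that the text and audio conditions each add an independent shift to the unconditional logits; real logits do not decompose this cleanly, and all values are invented.

```python
def cfg_combine(conditional, unconditional, guidance_scale):
    """Single CFG: unconditional + guidance_scale * (conditional - unconditional)."""
    return [u + guidance_scale * (c - u) for c, u in zip(conditional, unconditional)]


# Hypothetical logits over three audio tokens.
unconditional = [0.0, 0.0, 0.0]
text_shift = [1.0, 0.0, 0.0]   # strong push from the text prompt
audio_shift = [0.0, 0.1, 0.0]  # weak push from the chromagram

conditional = [u + t + a for u, t, a in zip(unconditional, text_shift, audio_shift)]
guided = cfg_combine(conditional, unconditional, guidance_scale=3.0)

# Both shifts are multiplied by the same factor, so the 10:1 ratio between
# the text push and the melody push is unchanged at any guidance_scale.
text_to_audio_ratio = guided[0] / guided[1]
```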
---
### Pitfall 4: Voice/Humming Has Wrong Spectral Profile for Chromagram Extraction
**What goes wrong:** Human voice produces a fundamental frequency plus strong harmonics. A chromagram computed from voice will have energy spread across multiple chroma bins due to harmonics, and the binarization step may pick the wrong bin (a harmonic rather than the fundamental) for some time frames. This is especially true for:
- Low male voice: strong harmonics, fundamental may be weaker than upper partials
- Breathy humming: noisy spectrum with energy across many bins
- Vibrato: rapid pitch oscillation causes the dominant bin to flip rapidly
The model was trained on chromagrams from stem-separated instrumental music, not from raw voice input.
**Why it happens:** Chromagrams are designed for pitched instrument signals with clear harmonic structure. Human voice, especially humming, has a different spectral envelope -- more noise, more formant emphasis, different harmonic relationships.
**Consequences:** The binarized chromagram from humming may be noisy, inaccurate, or contain spurious pitch class detections that confuse the model.
**Prevention:**
- Convert humming to MIDI first (using Basic Pitch), then render MIDI as a clean sine wave or piano tone
- Feed the clean rendered audio to MusicGen instead of raw humming
- This gives the chromagram extractor a clean signal with unambiguous pitch classes
**Detection:** Extract and visualize the chromagram from your humming file. Compare it to the chromagram from a clean instrument playing the same melody. If they diverge significantly, pre-processing the humming is needed.
**Confidence:** MEDIUM -- based on well-established acoustics of chromagram extraction from voice, but specific impact on MusicGen not empirically verified.
**Phase:** This is a preprocessing step that should be attempted as part of diagnosing the current "zero effect" problem.
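The wrong-bin failure can be reproduced on paper. The sketch below folds a list of partials into 12 chroma bins and picks the strongest one; the partial energies are hypothetical and chosen so that the third harmonic outweighs the fundamental, as described for low or breathy voices.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to its pitch class index (0 = C) in equal temperament."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_from_partials(partials):
    """Sum the energy of (frequency, energy) partials into 12 chroma bins."""
    chroma = [0.0] * 12
    for freq_hz, energy in partials:
        chroma[pitch_class(freq_hz)] += energy
    return chroma


# Hypothetical hummed A2 (110 Hz) with a dominant third harmonic at 330 Hz.
hummed_a2 = [(110.0, 0.3), (220.0, 0.2), (330.0, 0.6)]

chroma = chroma_from_partials(hummed_a2)
selected = NOTE_NAMES[max(range(12), key=chroma.__getitem__)]
# The note sung is A, but the strongest chroma bin is E (a fifth above).
```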
---
## Moderate Pitfalls
### Pitfall 5: Large Window Size Prevents Fine Temporal Resolution
**What goes wrong:** The chromagram extraction uses n_fft=16384 and hop_length=4096 at 32kHz sampling rate. This means each chromagram frame covers 16384/32000 = 0.512 seconds with hops of 4096/32000 = 0.128 seconds. For a 10-second input, this produces only ~78 chromagram frames. Fast melodic passages (notes shorter than ~128ms) are blurred together.
The official documentation states: "Using a large window prevents the model from recovering fine temporal details."
**Prevention:** This is by design and cannot be changed without retraining. Accept that melody conditioning is temporally coarse. Do not expect the model to follow rapid melodic passages note-by-note.
**Detection:** If your input melody has notes faster than ~8 notes per second, those details will be lost in the chromagram.
**Confidence:** HIGH -- directly from feature extraction code and official documentation.
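The arithmetic above, written out as a sketch. The frame count ignores any padding or centering the real extractor may apply, so treat it as an approximation.

```python
SAMPLE_RATE = 32_000  # Hz
N_FFT = 16_384        # STFT window length in samples
HOP_LENGTH = 4_096    # samples between consecutive frames

window_seconds = N_FFT / SAMPLE_RATE    # 0.512 s of audio per frame
hop_seconds = HOP_LENGTH / SAMPLE_RATE  # 0.128 s between frames
frames_per_second = 1 / hop_seconds     # about 7.8 frames per second


def approx_chroma_frames(duration_seconds):
    """Approximate number of chromagram frames for an input of this length."""
    return int(duration_seconds * SAMPLE_RATE // HOP_LENGTH)


frames_for_10s = approx_chroma_frames(10)  # 78
```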
---
### Pitfall 6: Input Duration Mismatch Causes Chromagram Repetition or Truncation
**What goes wrong:** The model conditions on a chromagram of `chroma_length` frames (235 by default, set in the model config; about 30 seconds at the 0.128 s hop). If the input chromagram is shorter than this, it gets REPEATED (tiled) to fill the length. If longer, it gets truncated. This means:
- Short input (e.g., 3 seconds of humming) gets looped multiple times, causing the model to see a repeating melody pattern
- The chromagram length does not correspond 1:1 to the output audio length
**Prevention:**
- Use input audio that's at least 15-20 seconds to provide enough non-repeating chromagram frames
- Match input duration roughly to desired output duration
- Be aware that the relationship between input chromagram position and output audio position is approximate, not exact
**Detection:** If your output seems to have a repeating melodic pattern that matches your input duration, chromagram repetition is the cause.
**Confidence:** HIGH -- verified in the generate code (lines 1943-1946 of modeling_musicgen_melody.py).
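A sketch of the repeat-or-truncate behavior described above. It mirrors the description in this section rather than the transformers code itself, and `fit_to_length` is a hypothetical helper name.

```python
import math


def fit_to_length(frames, target_length=235):
    """Tile a short chromagram, or cut a long one, to exactly target_length frames."""
    if not frames:
        raise ValueError("chromagram must contain at least one frame")
    if len(frames) >= target_length:
        return frames[:target_length]
    repeats = math.ceil(target_length / len(frames))
    return (frames * repeats)[:target_length]


# A 3-second hum yields roughly 23 frames (3 s / 0.128 s per hop).
short_input = list(range(23))  # frame indices stand in for chroma frames
fitted = fit_to_length(short_input)
loops = len(fitted) / len(short_input)  # the melody is seen about 10 times
```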
---
### Pitfall 7: Sampling Temperature and Top-k Too High Drown Out Conditioning
**What goes wrong:** With temperature=1.0 and top_k=250, the sampling introduces substantial randomness. The model's logits contain the conditioning signal, but high temperature flattens the probability distribution and high top_k allows selection from many tokens. The conditioning signal has to shift the logits enough to survive this sampling noise.
**Prevention:**
- Try lower temperature (0.5-0.8) to make the model more confident in its top choices
- Try lower top_k (50-100) to restrict the sample space
- These reduce output variety but increase conditioning adherence
- The tradeoff: lower temp/top_k = more repetitive but more melody-following; higher = more creative but less melody-following
**Detection:** Generate multiple outputs with the same seed but different temperature/top_k. If outputs at low temperature follow the melody better, sampling parameters are the issue.
**Confidence:** MEDIUM -- standard behavior of autoregressive sampling, but the exact interaction with MusicGen's conditioning is empirical.
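The effect of both parameters on the sampling distribution can be sketched in a few lines. The logits are invented and the function is a generic temperature/top-k softmax, not MusicGen's sampler.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=None):
    """Softmax over temperature-scaled logits, optionally keeping only the top_k."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Tokens below the k-th largest logit get zero probability.
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits where conditioning favours token 0.
logits = [2.0, 1.0, 0.5, 0.0]

p_default = sampling_distribution(logits, temperature=1.0)
p_sharp = sampling_distribution(logits, temperature=0.5, top_k=2)
# Lower temperature and smaller top_k concentrate probability on the token
# the conditioning prefers, at the cost of output variety.
```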
---
### Pitfall 8: Prefix Concatenation Signal Decays Over Generation Length
**What goes wrong:** MusicGen Melody uses prefix concatenation (not cross-attention) for conditioning. The text and chromagram embeddings are concatenated to the beginning of the decoder's input sequence. As the model generates more tokens, the self-attention window grows, and the conditioning prefix becomes a smaller fraction of the total context. For long generations (20-30 seconds = 1000-1500 tokens), the model may "forget" the conditioning prefix.
**Prevention:**
- Keep generated audio short (8-15 seconds) for strongest conditioning
- For longer outputs, generate in segments and concatenate
- The first few seconds of output are most likely to follow the melody; later portions may drift
**Detection:** If the beginning of your output follows the melody but the end drifts to generic music, prefix decay is the cause.
**Confidence:** MEDIUM -- architecturally plausible (prefix attention is a known limitation) but not empirically measured for MusicGen specifically.
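A rough way to see the dilution is to compute what share of the decoder context the prefix occupies as generation proceeds. This sketch uses the 50 tokens per second implied by the figures above and a hypothetical 20-token text prompt; context share is only a proxy, since actual attention weights are learned and were not measured.

```python
CHROMA_LENGTH = 235     # chromagram frames in the prefix
TOKENS_PER_SECOND = 50  # implied by 20-30 seconds = 1000-1500 tokens


def prefix_share(seconds_generated, text_tokens=20):
    """Fraction of the decoder context occupied by the conditioning prefix."""
    prefix = CHROMA_LENGTH + text_tokens
    generated = int(seconds_generated * TOKENS_PER_SECOND)
    return prefix / (prefix + generated)


share_at_8s = prefix_share(8)    # about 0.39
share_at_30s = prefix_share(30)  # about 0.15
```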
---
### Pitfall 9: Using musicgen-melody-large When the Smaller musicgen-melody May Condition Better
**What goes wrong:** Users assume larger model = better melody conditioning. But the melody conditioning mechanism is identical across model sizes -- only the decoder capacity changes. A larger decoder may have more capacity to "ignore" weak conditioning signals and generate plausible music from the text prompt alone.
**Prevention:**
- Test with both `facebook/musicgen-melody` (1.5B parameters) and `facebook/musicgen-melody-large` (3.3B parameters)
- Compare melody adherence, not just audio quality
- The smaller model may follow the melody MORE closely because it has less capacity to improvise away from it
**Detection:** Same input, same parameters, different model sizes. If the smaller model follows the melody better, use it for this use case despite lower audio quality.
**Confidence:** LOW -- this is a reasonable hypothesis but not verified with evidence. Worth testing.
---
## Minor Pitfalls
### Pitfall 10: Seeding Before Model Load Instead of Before Generate
**What goes wrong:** Calling `torch.manual_seed()` before model loading lets model initialization consume random state before generation starts; in this project that produced the same RNG state at generation time regardless of the seed value. This was already discovered and fixed in the project.
**Prevention:** Always set seed AFTER model loading and BEFORE `model.generate()`. The current code correctly does this.
**Confidence:** HIGH -- already verified in this project.
---
### Pitfall 11: Numpy Float64 to Float32 Conversion for Processor Input
**What goes wrong:** If input audio is numpy float64, the processor may silently convert it but some paths may not handle the dtype correctly, leading to subtle numerical differences in the chromagram.
**Prevention:** Explicitly cast to float32 before passing to the processor: `audio_np = waveform.numpy().astype(np.float32)`. The current code already does this.
**Confidence:** HIGH -- already handled in the current codebase.
---
### Pitfall 12: Confusing HuggingFace MusicGen (Cross-Attention) with MusicGen Melody (Prefix Concatenation)
**What goes wrong:** HuggingFace has two separate model classes: `MusicgenForConditionalGeneration` (uses cross-attention, supports audio continuation, NOT melody conditioning) and `MusicgenMelodyForConditionalGeneration` (uses prefix concatenation, supports melody/chromagram conditioning). Using the wrong class means audio input is treated as continuation context, not melody.
**Prevention:** Always use `MusicgenMelodyForConditionalGeneration` with a melody checkpoint (`facebook/musicgen-melody` or `facebook/musicgen-melody-large`) for melody conditioning. The current code is correct on this point.
**Confidence:** HIGH -- documented in HuggingFace official docs.
---
## Phase-Specific Warnings
| Phase Topic | Likely Pitfall | Mitigation |
|-------------|---------------|------------|
| Diagnosing "zero effect" | Missing Demucs preprocessing (#1), wrong spectral profile (#4) | Test with Demucs-separated input OR clean sine wave input first to isolate the problem |
| Improving melody adherence | CFG balance (#3), sampling params (#7) | Try audiocraft library with cfg_coef_beta, lower temperature/top_k |
| Setting expectations | Chromagram is coarse by design (#2, #5) | If precise melody reproduction is needed, switch to ACE-Step cover mode |
| Voice input preprocessing | Voice harmonics confuse chromagram (#4) | Convert voice -> MIDI -> clean audio before feeding to MusicGen |
| HuggingFace vs audiocraft | Missing double CFG (#3), no Demucs integration (#1) | Consider switching to original audiocraft library for melody tasks |
| Long generation | Prefix attention decay (#8) | Keep generations under 15 seconds |
| Output duration | Chromagram repetition (#6) | Match input length to output length |
## Overall Assessment
The most likely root cause of the "melody conditioning has zero effect" problem is a combination of:
1. **No Demucs stem separation** (Pitfall #1) -- the HuggingFace port doesn't do this automatically
2. **Voice spectral mismatch** (Pitfall #4) -- humming produces different chromagram than instruments
3. **Single CFG limiting audio conditioning influence** (Pitfall #3) -- no way to independently boost melody following
The recommended diagnostic sequence is:
1. Feed a well-known instrumental melody (e.g., a piano recording) through the pipeline to verify melody conditioning works at all with clean input
2. If that works, the problem is voice preprocessing. If it doesn't work either, the problem is deeper (CFG balance, model weights, or a fundamental HuggingFace port issue)
3. Consider switching to the original `facebookresearch/audiocraft` library which includes Demucs and double CFG
4. If MusicGen melody conditioning remains weak even with ideal input, accept that chromagram conditioning is inherently coarse and pivot to ACE-Step cover mode for faithful melody reproduction
## Sources
- [HuggingFace MusicGen Melody docs](https://huggingface.co/docs/transformers/en/model_doc/musicgen_melody) -- official HuggingFace documentation (HIGH confidence)
- [Audiocraft CONDITIONING.md](https://github.com/facebookresearch/audiocraft/blob/main/docs/CONDITIONING.md) -- ChromaStemConditioner details (HIGH confidence)
- [Audiocraft issue #478](https://github.com/facebookresearch/audiocraft/issues/478) -- confirms Demucs runs at inference (HIGH confidence)
- [Audiocraft MusicGen API docs](https://facebookresearch.github.io/audiocraft/api_docs/audiocraft/models/musicgen.html) -- cfg_coef_beta, two_step_cfg parameters (HIGH confidence)
- [AudioCipher MusicGen guide](https://www.audiocipher.com/post/meta-musicgen) -- practical tips for melody input quality (MEDIUM confidence)
- [MusicGen paper](https://arxiv.org/pdf/2306.05284) -- chromagram window/hop sizes, training details (HIGH confidence)
- HuggingFace transformers source code at `transformers/models/musicgen_melody/` -- feature extraction and generation implementation (HIGH confidence, directly read from installed package)

225
.planning/research/STACK.md Normal file

@ -0,0 +1,225 @@
# Technology Stack
**Project:** AI Music Pipeline - Melody-Conditioned Voice-to-Instrument Generation
**Researched:** 2026-04-11
## Critical Finding: Why MusicGen Melody-Large Conditioning Appears Broken
**Confidence: HIGH** (verified across official AudioCraft docs, HuggingFace transformers source, and API documentation)
The user's existing approach uses MusicGen melody-large via HuggingFace `transformers`. This is likely the root cause of weak/absent melody conditioning. Here is why:
### 1. Missing Double Classifier-Free Guidance (cfg_coef_beta)
The **original AudioCraft library** supports a `cfg_coef_beta` parameter in `set_generation_params()` -- a "beta coefficient in double classifier free guidance" that "should be only used for MusicGen melody if we want to push the text condition more than the audio conditioning" ([AudioCraft API docs](https://facebookresearch.github.io/audiocraft/api_docs/audiocraft/models/musicgen.html)). This parameter is documented in paragraph 4.3 of [arXiv 2407.12563](https://arxiv.org/pdf/2407.12563) ("Audio Conditioning for Music Generation via Discrete Bottleneck Features").
The **HuggingFace transformers implementation does NOT expose `cfg_coef_beta`**. It only has `guidance_scale` (equivalent to `cfg_coef`), which weights the fully conditioned prediction (text + audio) against the unconditional one but does NOT separately control the text vs. audio conditioning balance. Without double CFG, the model cannot rebalance melody conditioning against text conditioning -- the text signal likely dominates or both get washed out.
### 2. Chromagram Is Inherently Lossy
MusicGen melody uses a 12-pitch-class chromagram (one-hot across 12 bins). This representation:
- **Collapses octave information** -- C2 and C5 map to the same bin
- **Strips rhythm/timing nuance** -- only captures which pitch classes are active per frame
- **Loses dynamics** -- no amplitude information
Research (MusiConGen, [arXiv 2407.15060](https://arxiv.org/html/2407.15060v1)) found that this 12-pitch-class chromagram produces "perceptually poor results" and proposed top-4 128-pitch-class CQT as a superior alternative.
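The octave collapse is easy to verify numerically. This sketch maps a frequency to a note name and octave under standard equal temperament (A4 = 440 Hz); the chromagram keeps only the name.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def note_and_octave(freq_hz, a4_hz=440.0):
    """Return (pitch class name, octave number) for a frequency."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi_note % 12], midi_note // 12 - 1


low_c = note_and_octave(65.41)    # C2
high_c = note_and_octave(523.25)  # C5
# Both notes share the pitch class "C", so they land in the same chroma bin
# even though they are three octaves apart.
```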
### 3. Demucs Pre-processing Is Recommended But Often Skipped
The official HuggingFace docs recommend running input audio through Demucs (source separation) to remove drums/bass before chromagram extraction: "The audio prompt should ideally be free of the low-frequency signals usually produced by instruments such as drums and bass." Raw humming may contain harmonics and noise that confuse the chromagram extractor.
### 4. Architecture Limitation: Prefix Concatenation, Not Cross-Attention
As already discovered by the user, MusicGen melody uses prefix concatenation (chromagram features projected and prepended to decoder hidden states). This is weaker than cross-attention -- the conditioning signal's influence attenuates as the autoregressive generation extends further from the prefix.
## Recommended Stack
### Primary Recommendation: ACE-Step 1.5 Cover Mode
**Confidence: HIGH** (user has already validated this works)
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| ACE-Step 1.5 XL-SFT | v1.5 (Jan 2026) | Core melody-conditioned generation | Already proven to preserve melody/rhythm from humming input. VAE latent encoding + diffusion denoising preserves structure far better than chromagram. ~3s generation on RTX 4090 |
| Python | 3.12.x | Runtime | Already in use, compatible with all components |
| PyTorch | 2.7.x+cu128 | ML framework | Already installed, CUDA 12.8 for RTX 4090 |
**Key parameters:**
- `audio_cover_strength=0.8` for faithful melody reproduction (tested)
- `config_path=acestep-v15-xl-sft` for quality (50 steps)
- `caption` controls instrument selection
- `task_type=cover`
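As an illustration of how these parameters combine for batch rendering, the sketch below builds one parameter set per instrument. Only `task_type`, `config_path`, `audio_cover_strength`, and `caption` come from the list above; the `source_audio` key, the caption wording, and the helper itself are hypothetical, and how the dictionaries are handed to ACE-Step depends on its entry point, which is not shown here.

```python
def build_cover_jobs(input_path, instruments, cover_strength=0.8):
    """Create one cover-mode parameter set per target instrument."""
    return [
        {
            "task_type": "cover",
            "config_path": "acestep-v15-xl-sft",
            "audio_cover_strength": cover_strength,
            "caption": f"solo {instrument}",
            "source_audio": input_path,  # hypothetical key name
        }
        for instrument in instruments
    ]


jobs = build_cover_jobs("hum.wav", ["acoustic piano", "tenor saxophone"])
```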
### Secondary Recommendation: AudioCraft (Original Library) for MusicGen Melody
**Confidence: MEDIUM** (strong theoretical basis but untested in user's environment)
If MusicGen melody conditioning is still desired (e.g., for comparison, or for cases where ACE-Step's cover mode produces artifacts), use the **original `audiocraft` library instead of HuggingFace `transformers`**.
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| audiocraft | latest (pip) | MusicGen melody via `generate_with_chroma()` | Exposes `cfg_coef_beta` for double CFG, proper chromagram pipeline, Demucs integration |
| Demucs (htdemucs) | via audiocraft | Source separation pre-processing | Removes drums/bass before chromagram extraction, critical for clean melody signal |
**Usage pattern (AudioCraft native):**
```python
import torchaudio
from audiocraft.data.audio import audio_write
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody-large')
model.set_generation_params(
    duration=10,
    cfg_coef=3.0,
    cfg_coef_beta=2.0,  # THIS IS THE MISSING PIECE
    use_sampling=True,
    top_k=250,
    temperature=1.0,
)

melody, sr = torchaudio.load('input.wav')
wav = model.generate_with_chroma(
    descriptions=['solo acoustic piano, gentle melody'],
    melody_wavs=melody[None],
    melody_sample_rate=sr,
)
audio_write('output', wav[0].cpu(), model.sample_rate)
```
**Why this may fix the issue:** `cfg_coef_beta` enables double classifier-free guidance, which separately weights text conditioning vs. audio (melody) conditioning. Without it (as in the HF transformers port), the single `guidance_scale` cannot properly balance these two conditioning signals.
### Audio Pre-processing Pipeline
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| Basic Pitch (Spotify) | latest | MIDI extraction from humming | Already validated, works well for monophonic voice. Python 3.11 required (separate venv) |
| FluidSynth | 2.4.6 | MIDI-to-audio rendering with SoundFonts | Clean sine/instrument audio from MIDI for MusicGen input. Pre-built Windows x64 binaries available |
| pyfluidsynth | latest | Python bindings for FluidSynth | Programmatic MIDI rendering without subprocess calls |
| torchaudio | (bundled with torch) | Audio I/O, resampling | Already installed, handles format conversion |
| librosa | latest | Audio analysis, pitch detection | Chromagram computation, spectral analysis for debugging |
### Monitoring / Evaluation
| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| CREPE / torchcrepe | latest | Pitch tracking for output evaluation | Measure whether output pitch contour follows input. GPU-accelerated via torchcrepe |
| numpy/scipy | (installed) | Signal analysis | Cross-correlation between input/output pitch contours |
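A minimal sketch of the fidelity score, assuming the two pitch tracks have already been extracted (for example with torchcrepe), aligned frame by frame, and stripped of unvoiced frames. It uses a zero-lag Pearson correlation in semitones rather than a full cross-correlation, and the f0 values are invented.

```python
import math


def hz_to_semitones(f0_hz, reference_hz=440.0):
    """Convert a pitch track in Hz to semitones relative to the reference."""
    return [12 * math.log2(f / reference_hz) for f in f0_hz]


def pearson(x, y):
    """Pearson correlation of two sequences, truncated to the shorter one."""
    n = min(len(x), len(y))
    x, y = x[:n], y[:n]
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical pitch tracks: the output plays the same contour an octave up.
input_f0 = [220.00, 246.94, 261.63, 293.66]
output_f0 = [440.00, 493.88, 523.25, 587.33]

score = pearson(hz_to_semitones(input_f0), hz_to_semitones(output_f0))
```

Working in semitones makes the score insensitive to transposition, which suits this use case because the generated instrument may play in a different octave than the hum.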
## Future-Watch: Models Worth Monitoring
These are not ready for production use today but represent the direction the field is heading.
### JASCO (Meta/AudioCraft)
**Confidence: MEDIUM** (model exists, but melody conditioning requires complex Deepsalience setup)
- Part of AudioCraft ecosystem, supports chords + drums + melody conditioning simultaneously
- Melody conditioning uses salience maps (not chromagram) -- richer representation
- Available as `facebook/jasco-chords-drums-melody-1B`
- **Blocker:** Requires Deepsalience (Python 3.7 environment) for melody extraction. Complex setup. Only generates 10s clips at 32kHz
- **When to consider:** If the project needs fine-grained temporal control (specific chords at specific times)
### MuseControlLite (ICML 2025)
**Confidence: LOW** (paper published May 2025, code released, but Windows compatibility untested)
- Built on top of Stable Audio Open 1.0
- Supports melody + rhythm + dynamics conditioning simultaneously
- State-of-the-art melody accuracy per paper benchmarks
- Uses rotary positional embeddings in cross-attention for better temporal control
- **Blocker:** Built on Stable Audio Open (different ecosystem). Requires HuggingFace token. Windows support undocumented. Research code, not production-ready
- Checkpoint download via `gdown` from Google Drive
- **When to consider:** If ACE-Step and AudioCraft both fail to meet quality requirements
### Stable Audio Control (ICASSP 2025)
**Confidence: LOW** (academic paper only, no public code/checkpoints found)
- Uses top-k CQT (Constant-Q Transform) instead of chromagram -- much richer melody representation
- Built on Stable Audio with ControlNet architecture
- Claims voice input can function "akin to an instrument"
- **Blocker:** No public release found. Academic only
- **When to consider:** If/when code is released
### MG2 (Melody-Guided Music Generation)
**Confidence: LOW** (paper from Sept 2024, research prototype)
- 416M params, trained on only 132 hours -- small and efficient
- Uses contrastive learning to align text with melody
- Retrieval-augmented diffusion
- **Blocker:** Research paper, unclear if weights/code are public
- **When to consider:** If a lightweight, fast model is needed
## Alternatives Considered and Rejected
| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| Melody-conditioned generation | ACE-Step 1.5 cover mode | MusicGen melody via HF transformers | Missing double CFG (cfg_coef_beta), chromagram is lossy, prefix concatenation washes out over time |
| Melody-conditioned generation | ACE-Step 1.5 cover mode | SingSong (Google) | Generates accompaniment, not solo instrument rendition. Not available as open weights for local inference |
| MusicGen implementation | audiocraft (original) | transformers (HuggingFace) | HF port missing cfg_coef_beta, no Demucs integration, simpler CFG implementation |
| Pitch extraction | Basic Pitch | CREPE | Basic Pitch already validated and working. CREPE is monophonic only (same as Basic Pitch for this use case) |
| MIDI rendering | FluidSynth + SoundFonts | Custom sine synth | FluidSynth provides realistic instrument timbres from SoundFonts vs. pure sine waves |
| Melody conditioning model | ACE-Step 1.5 | JASCO | JASCO requires complex Deepsalience setup (Python 3.7), only 10s generation, more complex pipeline |
## DO NOT Use
| Technology | Why Not |
|------------|---------|
| MusicGen melody via HuggingFace `transformers` | Missing `cfg_coef_beta` for double CFG. This is the most likely cause of the user's melody conditioning failure. Use original `audiocraft` library instead if MusicGen is desired |
| MusicGen text-only (musicgen-large) | No melody conditioning at all. Good audio quality but cannot follow an input melody |
| Noise2Music | Google internal, not publicly available |
| SingSong | Accompaniment generation (adds harmony around voice), not voice-to-instrument conversion |
## Installation
```bash
# ACE-Step (already installed)
# Located at W:/programming/Projects/AiMusicPipeline/ace-step/
# Venv: ace-step/.venv/

# AudioCraft (for MusicGen melody with proper double CFG)
# Install in ace-step/.venv or create new venv
pip install audiocraft

# FluidSynth (Windows)
# Download fluidsynth-2.4.6-win10-x64.zip from GitHub releases
# Extract, add bin/ to PATH
pip install pyfluidsynth

# SoundFont (for MIDI rendering)
# Download a GM SoundFont like FluidR3_GM.sf2 or MuseScore_General.sf3

# Pitch evaluation
pip install torchcrepe

# Basic Pitch (separate Python 3.11 venv - already set up)
# Located at W:/programming/Projects/AiMusicPipeline/basic-pitch-env/
```
## Recommended Investigation Order
1. **First:** Try AudioCraft native `generate_with_chroma()` with `cfg_coef_beta=2.0` on the same humming input that failed with HF transformers. This directly tests the hypothesis that missing double CFG is the root cause.
2. **If AudioCraft melody works:** Great -- use it for cases where you want lighter-weight generation or different output characteristics than ACE-Step.
3. **If AudioCraft melody still fails:** The chromagram representation itself is too lossy for humming. Stick with ACE-Step cover mode as primary, and consider JASCO or MuseControlLite as future alternatives.
4. **Regardless:** ACE-Step cover mode is the proven path. Invest in tuning `audio_cover_strength` and caption engineering for different instruments.
## Sources
- [AudioCraft MusicGen API docs (generate_with_chroma, cfg_coef_beta)](https://facebookresearch.github.io/audiocraft/api_docs/audiocraft/models/musicgen.html) - HIGH confidence
- [AudioCraft MUSICGEN.md](https://github.com/facebookresearch/audiocraft/blob/main/docs/MUSICGEN.md) - HIGH confidence
- [HuggingFace MusicGen Melody docs](https://huggingface.co/docs/transformers/en/model_doc/musicgen_melody) - HIGH confidence
- [arXiv 2407.12563 - Double CFG for audio conditioning](https://arxiv.org/pdf/2407.12563) - HIGH confidence
- [MusiConGen - chromagram limitations (arXiv 2407.15060)](https://arxiv.org/html/2407.15060v1) - MEDIUM confidence
- [ACE-Step 1.5 Tutorial](https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/Tutorial.md) - HIGH confidence
- [ACE-Step cover mode discussion](https://github.com/ace-step/ACE-Step-1.5/discussions/398) - MEDIUM confidence
- [JASCO docs](https://github.com/facebookresearch/audiocraft/blob/main/docs/JASCO.md) - MEDIUM confidence
- [MuseControlLite (ICML 2025)](https://github.com/fundwotsai2001/MuseControlLite) - LOW confidence (untested on Windows)
- [Stable Audio Control (ICASSP 2025)](https://stable-audio-control.github.io/web/) - LOW confidence (no public code)
- [FluidSynth 2.4.6](https://www.fluidsynth.org/) - HIGH confidence
- [pyfluidsynth](https://pypi.org/project/pyfluidsynth/) - HIGH confidence
- [Basic Pitch (Spotify)](https://basicpitch.spotify.com/) - HIGH confidence
- [CREPE pitch tracker](https://github.com/marl/crepe) - HIGH confidence


@ -0,0 +1,173 @@
# Project Research Summary
**Project:** AI Music Pipeline - Melody-Conditioned Voice-to-Instrument Generation
**Domain:** Melody-conditioned music generation (voice-to-instrument conversion)
**Researched:** 2026-04-11
**Confidence:** HIGH
## Executive Summary
This project builds a pipeline that converts hummed melodies into instrument renditions (piano, guitar, saxophone, etc.). The field offers two viable approaches: MusicGen melody conditioning (chromagram-based, prefix concatenation) and ACE-Step cover mode (VAE latent encoding + diffusion). Research strongly indicates that ACE-Step 1.5 XL-SFT cover mode is the proven path -- it accepts raw humming directly, preserves both pitch contour and rhythm, and generates CD-quality (44.1kHz) audio in ~3 seconds on an RTX 4090. The user has already validated this works with audio_cover_strength=0.8.
MusicGen melody-large via HuggingFace transformers has been diagnosed as fundamentally flawed for this use case. The HF port is missing double classifier-free guidance (cfg_coef_beta), skips Demucs stem separation, and uses a suspicious null chromagram (bin-0=1 instead of true silence) that likely cancels out the conditioning signal during CFG. The chromagram representation itself is inherently lossy -- it collapses octave information, strips rhythm, and binarizes to one pitch class per frame. If MusicGen melody is still desired for comparison, switching to the original facebookresearch/audiocraft library (which exposes cfg_coef_beta and includes Demucs) is the correct move, not further debugging the HF port.
The key risk is over-investing in MusicGen melody debugging when a working solution (ACE-Step) already exists. The recommended approach is to build the pipeline around ACE-Step cover mode as the primary engine, add AudioCraft native as an optional secondary engine for experimentation, and focus engineering effort on pipeline orchestration (multi-instrument batch rendering, duration matching, fidelity scoring) rather than model-level debugging.
## Key Findings
### Recommended Stack
ACE-Step 1.5 XL-SFT is the primary generation engine, already installed and validated. The original AudioCraft library (not HuggingFace transformers) is recommended as a secondary option if MusicGen melody is desired. Basic Pitch handles MIDI extraction, FluidSynth + SoundFonts enable MIDI-to-audio rendering, and torchcrepe/CREPE provides pitch tracking for output evaluation.
**Core technologies:**
- **ACE-Step 1.5 XL-SFT**: melody-conditioned generation via cover mode -- proven to preserve pitch contour and rhythm from raw humming input
- **AudioCraft (original library)**: MusicGen melody with proper double CFG (cfg_coef_beta) and Demucs integration -- the correct implementation if MusicGen is needed
- **Basic Pitch (Spotify)**: MIDI extraction from monophonic voice -- already validated, requires Python 3.11 venv
- **FluidSynth + pyfluidsynth**: MIDI-to-audio rendering with SoundFonts -- needed for MIDI input path
- **torchcrepe**: GPU-accelerated pitch tracking for fidelity measurement between input and output
**Do NOT use:** MusicGen melody via HuggingFace transformers -- missing cfg_coef_beta, no Demucs, broken null conditioning. This is the root cause of the zero-effect problem.
### Expected Features
**Must have (table stakes):**
- Melody fidelity (pitch contour preservation) -- the entire value proposition
- Rhythm preservation -- timing matters as much as pitch
- Instrument/timbre selection via text prompt -- piano, saxophone, etc.
- Raw humming input acceptance -- no manual preprocessing required
- Output sample rate >= 32kHz, with musically coherent results
- Reproducibility via seed control
**Should have (differentiators):**
- Multi-instrument rendering from single hum -- loop over instrument captions, low complexity
- Duration matching -- auto-detect input length, trivial to implement
- Style transfer via caption engineering -- "jazz piano ballad" vs "aggressive rock guitar"
- Whistling and played-instrument input -- ACE-Step is input-agnostic, just needs testing/docs
- Melody fidelity scoring -- CREPE pitch tracking + correlation, pure signal processing
**Defer (v2+):**
- MIDI input path (requires FluidSynth integration)
- BPM auto-detection (unreliable for humming)
- Interactive parameter tuning UI (Gradio)
- Real-time streaming (unnecessary -- ~3-second batch generation is fine)
- Multi-track arrangement generation (different problem domain)
### Architecture Approach
The architecture research focused heavily on MusicGen melody internals, revealing why the HF implementation fails. MusicGen melody uses prefix concatenation (chromagram projected to decoder dimension, prepended to the sequence) with causal self-attention only -- no cross-attention re-injects the melody signal at each layer. The conditioning influence decays as generation extends beyond the 235-token prefix. The CFG implementation compounds this: the null chromagram (bin-0=1) produces non-trivial hidden states, making cond_logits - uncond_logits near zero, effectively canceling melody conditioning.
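The cancellation effect can be illustrated numerically. The following is a minimal sketch of standard single-coefficient classifier-free guidance, not the HF or AudioCraft implementation. The logit values and the guidance scale of 3.0 are made-up numbers chosen for illustration, and `cfg_combine` and `max_abs_delta` are hypothetical helper names.

```python
def cfg_combine(cond_logits, uncond_logits, guidance_scale):
    """Standard CFG: uncond + scale * (cond - uncond), element-wise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def max_abs_delta(cond_logits, uncond_logits):
    """Largest per-entry gap between the conditional and unconditional passes."""
    return max(abs(c - u) for c, u in zip(cond_logits, uncond_logits))


# Illustrative values only (not measured from any model).
uncond = [1.0, 0.5, -0.2]
cond_distinct = [2.0, 0.1, -0.2]    # conditioning clearly shifts the logits
cond_cancelled = [1.01, 0.5, -0.2]  # "null" pass behaves almost like the real one

guided_distinct = cfg_combine(cond_distinct, uncond, guidance_scale=3.0)
guided_cancelled = cfg_combine(cond_cancelled, uncond, guidance_scale=3.0)

print("delta when passes differ:  ", max_abs_delta(cond_distinct, uncond))
print("delta when passes coincide:", max_abs_delta(cond_cancelled, uncond))
print("guided logits (cancelled): ", guided_cancelled)
```

When the two passes nearly coincide, the guided logits stay close to the unconditional logits regardless of the guidance scale, which is the behavior described above for the bin-0=1 null chromagram.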
**Major components for the pipeline:**
1. **Input Handler** -- accepts raw audio (humming, whistling, instruments), validates format, reads duration
2. **ACE-Step Cover Engine** -- core generation via cover mode with configurable strength, caption, BPM, duration
3. **AudioCraft Engine (optional)** -- secondary generation via generate_with_chroma() with cfg_coef_beta for double CFG
4. **Pipeline Orchestrator** -- manages multi-instrument batch runs, config generation, output organization
5. **Fidelity Evaluator** -- pitch extraction (CREPE) on input/output, correlation scoring
### Critical Pitfalls
1. **Using HF transformers for MusicGen melody** -- missing cfg_coef_beta, no Demucs, broken null conditioning. Switch to original AudioCraft library or skip MusicGen entirely in favor of ACE-Step.
2. **Chromagram binarization destroys information** -- only captures pitch class (not octave, rhythm, dynamics). This is a fundamental limitation of MusicGen melody, not a bug. If faithful melody reproduction is the goal, use ACE-Step.
3. **CFG unconditional path cancels audio conditioning** -- the HF null chromagram (bin-0=1) produces similar decoder behavior to real chroma, making the CFG delta near zero. Double CFG (cfg_coef_beta) in AudioCraft is the fix.
4. **Voice spectral mismatch for chromagram** -- humming harmonics confuse the chromagram extractor. Prevention: convert hum to MIDI (Basic Pitch) then render as clean instrument before feeding to MusicGen.
5. **Prefix attention decay over long sequences** -- melody influence weakens as generation extends. Keep generations under 15 seconds or use ACE-Step, which does not have this limitation.
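Pitfall 2 can be made concrete with a toy example. This sketch is a simplification, assuming an ideal pitch detector that maps each frame to a MIDI note number; the real extractor works from an STFT and a chroma filter bank before the argmax step. `one_hot_chroma` is a hypothetical helper name.

```python
def one_hot_chroma(midi_note):
    """Return a 12-bin one-hot pitch-class vector for a MIDI note number."""
    frame = [0] * 12
    frame[midi_note % 12] = 1  # octave is discarded by the modulo
    return frame


C4, C5, D4 = 60, 72, 62

# An octave leap (C4 -> C5) and a repeated note (C4 -> C4) become identical.
octave_leap = [one_hot_chroma(C4), one_hot_chroma(C5)]
repeated_note = [one_hot_chroma(C4), one_hot_chroma(C4)]

print(octave_leap == repeated_note)                # True: octave information is lost
print(one_hot_chroma(C4) == one_hot_chroma(D4))    # False: pitch class still differs
```

Two melodically different phrases therefore reach the decoder as the same conditioning signal, which is why this is a representational limit rather than a bug.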
## Implications for Roadmap
Based on research, suggested phase structure:
### Phase 1: Core Pipeline with ACE-Step Cover Mode
**Rationale:** ACE-Step cover mode is already validated. This phase wraps the proven functionality into a clean, scriptable pipeline.
**Delivers:** End-to-end hum-to-instrument CLI tool using ACE-Step
**Addresses:** Melody fidelity, rhythm preservation, humming input acceptance, instrument selection, output quality, reproducibility, duration matching
**Avoids:** Over-engineering MusicGen debugging (Pitfalls #1-4); keeps scope tight
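Duration matching in this phase can be as simple as reading the input file header. A minimal sketch, assuming the input is an uncompressed PCM WAV file (other formats would need a decoder); `wav_duration_seconds` and `write_silent_wav` are hypothetical helper names, and how the result is passed to ACE-Step is not shown here.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed as frame count / sample rate."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_silent_wav(path, seconds, sample_rate=32000):
    """Write a mono 16-bit silent WAV, used here only as demo input."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "hum.wav")
    write_silent_wav(demo_path, seconds=2.5)
    detected = wav_duration_seconds(demo_path)

print(detected)
```

The detected value would then be used as the target generation length so the output matches the hum.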
### Phase 2: Multi-Instrument Batch Rendering
**Rationale:** Low complexity, high value. Depends on Phase 1 single-instrument pipeline. Pure orchestration -- no model work needed.
**Delivers:** Hum once, get piano + guitar + sax + etc. versions automatically
**Addresses:** Multi-instrument rendering, batch processing, style transfer via captions
**Avoids:** No new pitfalls -- this is pipeline engineering
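The orchestration for this phase is essentially a loop over captions. The sketch below only builds job descriptions and does not call ACE-Step. `CoverJob`, `build_jobs`, and every field name except `audio_cover_strength` are hypothetical, and the captions are examples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoverJob:
    """One generation request: same hum, different instrument caption."""
    input_path: str
    caption: str
    output_path: str
    audio_cover_strength: float
    seed: int


def build_jobs(input_path, captions, output_dir, audio_cover_strength=0.8, seed=0):
    """Create one job per instrument, sharing strength and seed for comparability."""
    return [
        CoverJob(
            input_path=input_path,
            caption=caption,
            output_path=f"{output_dir}/{name}.wav",
            audio_cover_strength=audio_cover_strength,
            seed=seed,
        )
        for name, caption in captions.items()
    ]


captions = {
    "piano": "solo piano",
    "guitar": "acoustic guitar",
    "sax": "tenor saxophone",
}
jobs = build_jobs("hum.wav", captions, output_dir="renders")

for job in jobs:
    print(job.output_path, "<-", job.caption)
```

Keeping the seed and cover strength fixed across jobs means differences between outputs come from the caption alone, which also supports the reproducibility requirement.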
### Phase 3: Fidelity Evaluation and Quality Metrics
**Rationale:** Now that generation works, add measurement. Builds trust and enables parameter tuning.
**Delivers:** Pitch contour comparison scores, per-output fidelity reports
**Uses:** torchcrepe for pitch tracking, numpy/scipy for correlation
**Addresses:** Melody fidelity scoring; also supplies the data needed for parameter-tuning decisions
**Avoids:** Setting unrealistic expectations about note-perfect reproduction
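One possible scoring function is a Pearson correlation between pitch contours. This is a minimal sketch under several assumptions: both f0 tracks are already extracted (for example by a CREPE-style tracker, which is not called here), they share the same hop size and are time-aligned, and unvoiced frames are marked as 0.0. `hz_to_semitones` and `contour_correlation` are hypothetical helper names.

```python
import math


def hz_to_semitones(f0_hz):
    """Convert frequency in Hz to a MIDI-style semitone value (A4 = 69)."""
    return 69.0 + 12.0 * math.log2(f0_hz / 440.0)


def contour_correlation(input_f0, output_f0):
    """Pearson correlation over frames that are voiced in both tracks."""
    pairs = [
        (hz_to_semitones(a), hz_to_semitones(b))
        for a, b in zip(input_f0, output_f0)
        if a > 0 and b > 0
    ]
    if len(pairs) < 2:
        return None  # not enough voiced overlap to score
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a flat contour has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Toy contours in Hz; 0.0 marks an unvoiced frame.
hummed = [220.0, 246.94, 0.0, 261.63, 293.66]
octave_up = [2.0 * f for f in hummed]  # same melody rendered one octave higher
score = contour_correlation(hummed, octave_up)
print(score)
```

Working in semitones makes the score insensitive to transposition, so an instrument rendering in a different octave is not penalized. It measures contour shape only and says nothing about note-level accuracy.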
### Phase 4: AudioCraft Native Integration (Optional/Experimental)
**Rationale:** Provides a secondary engine for comparison and cases where ACE-Step produces artifacts. Validates the hypothesis that cfg_coef_beta fixes MusicGen melody conditioning.
**Delivers:** MusicGen melody generation via original AudioCraft with double CFG and optional Demucs preprocessing
**Uses:** audiocraft library, Demucs
**Addresses:** Alternative generation path, diagnostic validation of the MusicGen conditioning hypothesis
**Avoids:** CFG cancellation (Pitfall #3) by using proper double CFG; voice spectral mismatch (Pitfall #4) via Demucs
### Phase 5: MIDI Input Path and Advanced Preprocessing
**Rationale:** Extends input flexibility. Requires FluidSynth integration which is a separate dependency.
**Delivers:** Accept MIDI files as input (render via FluidSynth, feed to ACE-Step), BPM detection
**Uses:** FluidSynth, pyfluidsynth, SoundFonts, librosa
**Addresses:** MIDI input path, BPM detection
**Avoids:** Not critical path -- raw audio input already works
### Phase Ordering Rationale
- Phase 1 first because ACE-Step cover mode is proven and everything else depends on a working pipeline
- Phase 2 before Phase 3 because batch rendering is trivial (loop over captions) and immediately useful, while evaluation requires more thought
- Phase 3 before Phase 4 because measurement infrastructure helps evaluate whether AudioCraft integration is even worth pursuing
- Phase 4 is optional/experimental -- the research strongly suggests ACE-Step is the better approach, so AudioCraft integration is for validation and edge cases only
- Phase 5 last because MIDI input is a nice-to-have and raw audio input covers the primary use case
### Research Flags
Phases likely needing deeper research during planning:
- **Phase 4 (AudioCraft Native):** Complex integration -- AudioCraft API, Demucs setup, cfg_coef_beta tuning, and potential Windows compatibility issues need investigation
- **Phase 5 (MIDI Input):** FluidSynth Windows setup, SoundFont selection, and MIDI rendering quality need testing
Phases with standard patterns (skip research-phase):
- **Phase 1 (Core Pipeline):** Well-documented -- ACE-Step TOML config and CLI are established and just need wrapping
- **Phase 2 (Batch Rendering):** Pure engineering -- loop over configs, no unknowns
- **Phase 3 (Fidelity Evaluation):** Standard signal processing -- CREPE pitch extraction and correlation are well-documented
## Confidence Assessment
| Area | Confidence | Notes |
|------|------------|-------|
| Stack | HIGH | ACE-Step validated by user testing; AudioCraft docs verified against source code; all tools have official documentation |
| Features | HIGH | Feature priorities derived from the user's actual workflow and empirical testing; clear dependency chain |
| Architecture | HIGH | MusicGen internals verified against HF transformers source code and original AudioCraft source; failure modes well-understood |
| Pitfalls | HIGH | Root causes verified against actual source code; diagnostic sequence is concrete and actionable |
**Overall confidence:** HIGH
### Gaps to Address
- **AudioCraft Windows compatibility:** The original AudioCraft library's Demucs integration has not been tested on Windows. May require WSL or workarounds.
- **ACE-Step audio_cover_strength tuning per instrument:** 0.8 works generally, but optimal values may differ by instrument timbre. Needs empirical testing in Phase 1.
- **Long-form generation:** Neither ACE-Step nor MusicGen handles melodies longer than ~30 seconds gracefully. Segment-and-concatenate approach needs design if long melodies are a requirement.
- **Future model landscape:** JASCO, MuseControlLite, and Stable Audio Control are emerging alternatives. None are production-ready today, but the field is moving fast. Revisit in 6 months.
## Sources
### Primary (HIGH confidence)
- AudioCraft MusicGen API docs -- cfg_coef_beta, generate_with_chroma
- AudioCraft GitHub -- source code for ConditionFuser, ChromaStemConditioner
- HuggingFace MusicGen Melody docs -- API, feature extraction
- ACE-Step 1.5 GitHub and Tutorial -- cover mode, configuration
- MusicGen paper (arXiv 2306.05284) -- architecture, training details
- Basic Pitch (Spotify) -- MIDI extraction
- FluidSynth -- MIDI rendering
- HuggingFace transformers source code (installed package) -- verified feature extraction and generation internals
### Secondary (MEDIUM confidence)
- MusiConGen (arXiv 2407.15060) -- chromagram limitations
- AudioCraft issue #478 -- Demucs at inference
- ACE-Step cover mode discussions -- community usage patterns
- Project memory files -- firsthand empirical testing results
### Tertiary (LOW confidence)
- MuseControlLite (ICML 2025) -- promising but untested on Windows
- Stable Audio Control (ICASSP 2025) -- no public code
- MG2 (arXiv 2409.20196) -- research prototype, uncertain availability
---
*Research completed: 2026-04-11*
*Ready for roadmap: yes*