docs: initialize project
commit 6fb0181b08
1 changed file with 86 additions and 0 deletions

.planning/PROJECT.md
@@ -0,0 +1,86 @@
# MusicGen Melody Conditioning Debug & Voice-to-Instrument Pipeline
## What This Is
A systematic investigation into why MusicGen melody-large's melody conditioning appears non-functional, followed by either fixing it or identifying and integrating the best alternative model. The end goal is a working voice-to-instrument pipeline: hum a melody, specify an instrument via text prompt, and get high-quality audio output that faithfully follows the input melody's pitch contour and rhythm.
## Core Value
A hummed melody input must produce instrument-specific output that audibly follows the melody's contour — if the conditioning doesn't shape the output, nothing else matters.
## Requirements
### Validated
(None yet — ship to validate)
### Active
- [ ] Diagnose why MusicGen melody-large conditioning produces near-identical output regardless of input melody
- [ ] Determine whether MusicGen melody-large can faithfully condition on melody input as documented
- [ ] If MusicGen works: produce a working demo with voice input → instrument output
- [ ] If MusicGen doesn't work: research and identify the best alternative melody-conditioned generation model
- [ ] If alternative chosen: integrate it into a working voice → instrument demo
- [ ] Demo must support instrument selection via text prompt (guitar, piano, saxophone, etc.)
- [ ] Output must audibly follow input melody's pitch contour and rhythm
- [ ] Output quality should be high — not toy/lo-fi
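
The "audibly follows" requirement could be backed by a rough numeric check alongside listening tests. The sketch below is one possible proxy, not part of the existing scripts: it assumes both the hummed input and the generated output have already been reduced to frame-major 12-bin chromagrams, and the names `dominant_pitch_classes`, `contour_agreement`, and `one_hot` are hypothetical. It ignores octave, timing offsets, and transposition, so it complements listening rather than replacing it.

```python
from typing import Sequence


def dominant_pitch_classes(chroma: Sequence[Sequence[float]]) -> list[int]:
    """Return the index (0-11) of the strongest chroma bin for each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in chroma]


def contour_agreement(
    ref_chroma: Sequence[Sequence[float]],
    out_chroma: Sequence[Sequence[float]],
) -> float:
    """Fraction of frames whose dominant pitch class matches.

    The longer chromagram is truncated so the comparison is frame-aligned.
    """
    ref = dominant_pitch_classes(ref_chroma)
    out = dominant_pitch_classes(out_chroma)
    n = min(len(ref), len(out))
    if n == 0:
        raise ValueError("need at least one frame in each chromagram")
    matches = sum(1 for r, o in zip(ref[:n], out[:n]) if r == o)
    return matches / n


def one_hot(pitch_class: int) -> list[float]:
    """Toy 12-bin chroma frame with all energy in one pitch class."""
    return [1.0 if i == pitch_class else 0.0 for i in range(12)]


# Toy example: the output matches the reference in 3 of 4 aligned frames.
reference = [one_hot(0), one_hot(2), one_hot(4), one_hot(5)]
generated = [one_hot(0), one_hot(2), one_hot(7), one_hot(5), one_hot(9)]
score = contour_agreement(reference, generated)
```

A score near chance level (about 1/12 for uniformly random pitch classes) on real outputs would be consistent with the melody being ignored; a high score would still need confirmation by ear.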
### Out of Scope
- Full end-to-end pipeline integration with ACE-Step and Basic Pitch — that's a separate milestone
- Training or fine-tuning models — using pretrained checkpoints only
- Real-time / streaming generation — batch inference is fine
- Mobile or web deployment — local CLI/script execution
## Context
### What We Know So Far
1. **Seed bug found and fixed**: Calling `torch.manual_seed()` before model loading left the RNG in an identical state at generation time. Moving the seed call to after model load fixed this. Outputs are now numerically different but still perceptually near-identical.
2. **Text conditioning works weakly**: Different prompts ("piano" vs "drums") produce measurably different outputs (max_diff ~0.48), but differences are subtle — not the dramatic instrument change expected.
3. **Melody conditioning appears to have zero effect**: With vs without `input_features`, outputs are perceptually the same. Chromagram data looks correct (86 timesteps, one-hot-ish across 12 bins).
4. **Architecture**: No cross-attention in decoder (`layer.encoder_attn` is None). Conditioning via prefix concatenation: `inputs_embeds = torch.cat([encoder_hidden_states, inputs_embeds], dim=1)`. Chromagram projected from 12 dims to hidden_size via `audio_enc_to_dec_proj`, padded/repeated to `chroma_length=235`, concatenated with text encoder output.
5. **Generation params**: `guidance_scale=3.0, do_sample=True, temperature=1.0, top_k=250, top_p=None`
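
The prefix layout described in item 4 can be sketched with plain lists to make the sequence-length arithmetic concrete. This is an illustration under assumptions, not the transformers implementation: it assumes a short chromagram is repeated end-to-end and truncated to `chroma_length`, and that melody slots come before text slots in the prefix. `tile_to_length` and `build_prefix` are hypothetical names, and the text length of 12 is arbitrary.

```python
import math

CHROMA_LENGTH = 235  # chroma_length from the notes above
CHROMA_FRAMES = 86   # timesteps observed in the extracted chromagram


def tile_to_length(frames: list, target_len: int) -> list:
    """Repeat `frames` end-to-end, then truncate to exactly `target_len`."""
    if not frames:
        raise ValueError("cannot tile an empty sequence")
    n_repeat = math.ceil(target_len / len(frames))
    return (frames * n_repeat)[:target_len]


def build_prefix(melody_frames: list, text_frames: list,
                 target_len: int = CHROMA_LENGTH) -> list:
    """Concatenate tiled melody slots and text slots (order is an assumption)."""
    return tile_to_length(melody_frames, target_len) + text_frames


# Label each slot so the layout can be inspected by index.
melody = [f"m{i}" for i in range(CHROMA_FRAMES)]
text = [f"t{i}" for i in range(12)]
prefix = build_prefix(melody, text)

# How many times the melody is (partially) repeated inside the prefix.
repeats = CHROMA_LENGTH / CHROMA_FRAMES
```

Under these assumptions an 86-frame chromagram appears roughly 2.7 times in the prefix, with a seam at every 86th slot. Whether the real code repeats, zero-pads, or does both is worth confirming directly, since the notes say "padded/repeated".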
### Open Hypotheses
- CFG math may be canceling out conditioning (unconditional path uses null audio + zeroed text)
- Chromagram extraction from processor may not be working correctly
- `audio_enc_to_dec_proj` weights may be near-zero or broken
- Prefix signal may wash out as self-attention context grows during generation
- Temperature/top_k may be too high, letting sampling noise drown out the conditioning signal
- HuggingFace transformers port may differ from original Meta implementation
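
The CFG hypothesis can be reasoned about on paper before touching the model. The sketch below assumes the common classifier-free guidance combination, `uncond + scale * (cond - uncond)`; whether the transformers port uses exactly this form is itself something to confirm. `apply_cfg` is a hypothetical helper operating on plain lists of logits, and the numbers are made up for illustration.

```python
def apply_cfg(cond_logits: list[float], uncond_logits: list[float],
              guidance_scale: float) -> list[float]:
    """Combine conditional and unconditional logits, CFG style."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


# Case 1: the conditional path differs from the unconditional one.
# A scale of 3.0 widens the gap instead of cancelling it.
guided_when_different = apply_cfg([2.0, 0.0], [1.0, 0.0], 3.0)

# Case 2: the conditional path is identical to the unconditional one,
# e.g. because the melody prefix has no effect on the logits.
# No guidance scale can then recover a melody signal.
guided_when_identical = apply_cfg([1.0, 2.0], [1.0, 2.0], 3.0)
```

Under this formula a guidance scale above 1 amplifies any gap between the two paths, so the combination step alone would not erase melody information. If the output still ignores the melody, two things seem worth checking: whether the conditional logits change at all between two different melodies, and whether the unconditional half of the batch really receives a null melody.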
### Previous Session Concern
The prior debugging session may have operated within too narrow a frame — focusing on whether the plumbing was connected rather than whether the approach fundamentally works or whether there are preprocessing/configuration steps being missed.
### Existing Code
- `musicgen_melody.py` — generation script
- `midi_to_audio.py` — MIDI to sine-wave synth
- `output/comparison/` — test outputs from prior session
- Venv: `ace-step/.venv/` (transformers, torch, torchaudio, mido)
- Test audio: `input/humming_step_down_jl [2026-04-11 004915].wav`
## Constraints
- **Environment**: Windows 11, native Python via existing venv at `ace-step/.venv/`
- **Hardware**: Local GPU inference (whatever the user's machine has)
- **Models**: Pretrained only — no training/fine-tuning budget
- **Quality**: Highest quality output preferred — this feeds into a creative music workflow
## Key Decisions
| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Investigate before replacing | MusicGen may work correctly with the right configuration; the previous session may have missed something | Pending |
| Model-agnostic if MusicGen fails | Quality and melody fidelity matter more than staying in one ecosystem | Pending |
| Debug + working demo scope | Full pipeline integration is a separate milestone | Pending |

---

*Last updated: 2026-04-11 after initialization*