docs: add AudioX vs SelVA evaluation

Architecture comparison, capability matrix, integration cost estimate, LoRA
training difficulty analysis, and license implications.

Verdict: SelVA remains preferred for V2A + LoRA fine-tuning; AudioX adds value
for music generation, inpainting, and text-to-audio tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -0,0 +1,182 @@
# AudioX vs SelVA: Evaluation

AudioX (arXiv:2503.10522, ICLR 2026) is a unified multimodal audio generation model from HKUST.
This document compares it against SelVA/MMAudio and assesses the cost of adding it to PrismAudio.

---
## Quick Decision Guide

| Situation | Use |
|---|---|
| Video → realistic sound effects | **SelVA**: faster, purpose-built, MIT license |
| Music generation from video or text | **AudioX**: SelVA cannot do this |
| Audio inpainting / music continuation | **AudioX**: SelVA cannot do this |
| LoRA fine-tuning on a custom sound | **SelVA**: full training infrastructure already exists |
| Variable output duration | **AudioX**: SelVA is fixed at 8 s |
| Inference speed matters | **SelVA**: 25 sampler steps vs 250 (10× fewer steps) |
| Non-commercial research | Either |
| Any commercial use | **SelVA only**: AudioX is CC-BY-NC-4.0 |

---
## Architecture

| Dimension | SelVA (MMAudio) | AudioX-MAF |
|---|---|---|
| Core paradigm | Flow matching | Diffusion (k-diffusion / DPM++) |
| Inference steps | 25 ODE steps (Euler) | 250 diffusion steps (DPM++ 3M SDE) |
| Sample rate | 44.1 kHz (large) / 16 kHz (small) | 48 kHz (fixed) |
| Generator | MM-DiT, velocity prediction | ContinuousMMDiTTransformer |
| Video encoder | Synchformer | Synchformer (AudioX custom re-implementation, same concept) |
| VAE / codec | DAC (descript-audio-codec) | DAC + AudioCraft options |
| Text encoder | T5-large | T5 (configurable, small → XXL) |
| Video-audio fusion | Cross-attention in MM-DiT | MAF: dual projection (feature-dimension alignment + sequence-length alignment) |
| Output duration | Fixed 8 s | Configurable via `sample_size` (default ~44 s at 48 kHz) |
| Training data | ~2 M samples (MMAudio paper) | 7 M samples (IF-caps dataset, curated) |
| License | MIT | CC-BY-NC-4.0 |

**MAF (Multimodal Adaptive Fusion):** AudioX's key architectural contribution. Instead of directly
concatenating multimodal tokens into the DiT's cross-attention, MAF projects each modality to
match the latent's sequence length via a dedicated linear + transposed-conv stack, then applies
`MMDitSingleBlock` layers for cross-modal fusion. The paper reports that this improves cross-modal
alignment, particularly for video-to-audio tasks.
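
The two alignment steps described for MAF can be illustrated with a small NumPy sketch. This is not AudioX's code: the tensor sizes, the depthwise kernel, and the helper names `linear` and `conv_transpose1d` are assumptions chosen for illustration, and the `MMDitSingleBlock` fusion layers are not shown. It only demonstrates the idea of first matching the feature dimension and then upsampling to the latent's sequence length.

```python
import numpy as np

def linear(x, weight, bias):
    """Dimension alignment: (seq, d_in) -> (seq, d_out)."""
    return x @ weight + bias

def conv_transpose1d(x, kernel, stride):
    """Naive depthwise transposed 1-D convolution.

    x: (seq, dim), kernel: (k,). Output length is (seq - 1) * stride + k.
    """
    seq, dim = x.shape
    k = kernel.shape[0]
    out = np.zeros(((seq - 1) * stride + k, dim))
    for i in range(seq):
        # Each input step is spread over k consecutive output steps.
        out[i * stride : i * stride + k] += x[i][None, :] * kernel[:, None]
    return out

rng = np.random.default_rng(0)

# Hypothetical sizes: 8 video tokens of width 16, latent of 32 steps x 32 channels.
video_tokens = rng.normal(size=(8, 16))
latent = rng.normal(size=(32, 32))

weight = rng.normal(size=(16, 32)) * 0.1
bias = np.zeros(32)
kernel = np.ones(4) / 4.0

projected = linear(video_tokens, weight, bias)           # (8, 32): feature dims match
aligned = conv_transpose1d(projected, kernel, stride=4)  # (32, 32): sequence length matches
```

After both steps `aligned` has the same shape as `latent`, so the two can be combined position by position before any fusion layers run.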

**Flow matching vs diffusion:** Flow matching (SelVA) trains a single velocity field that moves
samples from noise to data along a near-straight trajectory, which is why 25 Euler steps suffice.
Standard diffusion (AudioX) follows a longer, curved stochastic path and is run here with 250
sampler steps for quality output. This is not a quality difference per se; flow matching simply
needs fewer steps at inference.
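
Why straight trajectories need few steps can be seen in a toy Euler sampler. This is a minimal sketch and not SelVA's sampler: `toy_velocity` is an assumed closed-form stand-in for the learned network, and `TARGET` is made-up data. With an exactly straight path, the Euler update is exact, so the sample lands on the target regardless of the step count; a learned field is only approximately straight, which is where a budget such as 25 steps comes in.

```python
import random

def euler_sample(velocity, x_noise, steps=25):
    """Integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (data)."""
    x = list(x_noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy stand-in for the learned field (an assumption for illustration):
# the ideal straight-line velocity toward a fixed target.
TARGET = [0.5, -0.25, 1.0]

def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]

sample_25 = euler_sample(toy_velocity, noise, steps=25)
sample_1 = euler_sample(toy_velocity, noise, steps=1)
```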

---

## Capabilities

| Task | SelVA | AudioX |
|---|---|---|
| Video → sound effects | ✓ (primary use case) | ✓ |
| Text → sound effects | Partial (T5 text conditioning affects quality, but text-only generation is not the primary use case) | ✓ (strong benchmark scores) |
| Video → music | ✗ | ✓ |
| Text → music | ✗ | ✓ |
| Audio inpainting | ✗ | ✓ (`mask_args` parameter) |
| Music continuation | ✗ | ✓ (`init_audio` parameter) |
| Variable output duration | ✗ (fixed 8 s) | ✓ |
| Multiple input modalities simultaneously | Partial | ✓ (text + video + audio at once) |

The AudioX paper claims superior results over prior models on text-to-audio (AudioCaps) and
text-to-music (MusicCaps). A direct video-to-audio comparison against MMAudio is not prominently
featured in the paper, so SelVA may remain competitive there, but this has not been verified here.

---

## Integration Cost

Adding AudioX inference-only nodes to PrismAudio would require:

### New nodes (3 files)

```
nodes/
  audiox_model_loader.py       AUDIOX_MODEL loader: get_pretrained_model("HKUSTAudio/AudioX-MAF")
  audiox_sampler.py            wraps generate_diffusion_cond(); inputs: model + text + video + audio
  audiox_feature_extractor.py  optional: pre-extract Synchformer sync features (caching)
```

### Installation

```bash
pip install git+https://github.com/ZeyueT/AudioX.git
```

New dependencies not currently in PrismAudio:

- `pytorch-lightning==2.4.0`
- `k-diffusion==0.1.1`
- `v-diffusion-pytorch==0.0.2`
- `descript-audio-codec==1.0.0` (already used by SelVA, same package, no conflict)
- `gradio==4.44.1` (optional, only for the upstream Gradio UI)
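
Before wiring up the nodes, it may help to check whether the current environment already satisfies these pins. The following is a minimal sketch using only the standard library; the `check_pins` helper is hypothetical, and the pins are copied from the list above (the optional `gradio` entry is left out).

```python
from importlib import metadata

# Pins copied from the dependency list above.
REQUIRED = {
    "pytorch-lightning": "2.4.0",
    "k-diffusion": "0.1.1",
    "v-diffusion-pytorch": "0.0.2",
    "descript-audio-codec": "1.0.0",
}

def check_pins(required):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    problems = {}
    for name, expected in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            problems[name] = (expected, installed)
    return problems

if __name__ == "__main__":
    for name, (expected, installed) in check_pins(REQUIRED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is present at the listed version; anything else is worth resolving before installing AudioX into a shared environment.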

Model weights: `HKUSTAudio/AudioX-MAF` on Hugging Face (several GB).

### Inference API surface

```python
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond

# Load the pretrained model and its config.
model, config = get_pretrained_model("HKUSTAudio/AudioX-MAF")

# frames_tensor (video frames) and sync_feat (Synchformer features) are
# placeholders that must be prepared upstream before this call.
output = generate_diffusion_cond(
    model,
    steps=250,
    cfg_scale=6.0,
    conditioning={
        "text_prompt": "a dog barking",
        "video_prompt": {"video": frames_tensor, "sync_features": sync_feat},
        "seconds_total": 8.0,
    },
    sample_size=384000,  # 8 s × 48,000 Hz
    sample_rate=48000,
    device="cuda",
)
# output: torch.Tensor (batch, channels, num_samples), float32 in [-1, 1]
```
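
The `sample_size` value is just duration times sample rate, and the returned tensor still has to be converted before it can be saved. The sketch below shows both steps with NumPy and the standard-library `wave` module. The helpers `sample_size_for` and `write_wav` are hypothetical, and the input is assumed to be one batch item already moved to NumPy (for a torch tensor, something like `output[0].cpu().numpy()`).

```python
import io
import wave

import numpy as np

def sample_size_for(seconds, sample_rate=48000):
    """Number of samples for a clip of the given length."""
    return int(round(seconds * sample_rate))

def write_wav(dest, audio, sample_rate=48000):
    """Write float audio shaped (channels, num_samples) in [-1, 1] as 16-bit PCM.

    dest may be a file path or a writable binary file object.
    """
    audio = np.clip(np.asarray(audio, dtype=np.float32), -1.0, 1.0)
    pcm = (audio * 32767.0).astype("<i2")
    # WAV frames are interleaved by channel: (num_samples, channels) in C order.
    frames = np.ascontiguousarray(pcm.T).tobytes()
    with wave.open(dest, "wb") as wf:
        wf.setnchannels(audio.shape[0])
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(frames)

n_samples = sample_size_for(8.0)  # 8 s at 48 kHz

# Stand-in for one generated item: 10 ms of stereo silence.
buf = io.BytesIO()
write_wav(buf, np.zeros((2, sample_size_for(0.01))))
```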

---

## LoRA Training

Adding AudioX LoRA training to PrismAudio would be **significantly harder** than working with the
existing SelVA trainer:

| Aspect | SelVA LoRA | AudioX LoRA |
|---|---|---|
| Loss function | Single MSE velocity loss | Diffusion denoising loss at a sampled noise level (250 steps is the inference setting) |
| Training steps needed | ~2000 steps in practice | Unknown, likely many more |
| Step cost | Fast (1 velocity prediction) | Also one denoiser prediction per training step; per-step cost not yet measured |
| Existing infrastructure | Full trainer + scheduler + experiments | None; would need to be built from scratch |
| Noise schedule | Trivial (linear interpolation) | Cosine alpha-sigma schedule |
| Prior art for LoRA | LoRA on flow matching is well studied | Less explored; closer to Stable Diffusion LoRA |

**Conclusion:** AudioX LoRA training is feasible (it would follow Stable Diffusion-style LoRA,
using the model's diffusion noise schedule; DPM++ is only the sampler) but would be a substantial
new project. It is not worth building until the inference nodes are stable and there is a clear
use case that SelVA cannot serve.
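
To make the loss and noise-schedule rows concrete, here is a minimal NumPy sketch of how one training example could be built under each scheme. It assumes the common conventions (linear interpolation with a velocity target for flow matching, and a cosine alpha-sigma schedule with a v-prediction target for diffusion); neither function is taken from the SelVA or AudioX code. In both cases a training step draws a single time value `t` and compares one network prediction against the target with MSE.

```python
import math

import numpy as np

def flow_matching_pair(x_data, noise, t):
    """Linear interpolation path: t=0 is noise, t=1 is data."""
    x_t = (1.0 - t) * noise + t * x_data
    v_target = x_data - noise  # constant along the straight path
    return x_t, v_target

def cosine_v_pair(x_data, noise, t):
    """Cosine alpha-sigma schedule with a v target: t=0 is data, t=1 is noise."""
    alpha = math.cos(t * math.pi / 2)
    sigma = math.sin(t * math.pi / 2)
    x_t = alpha * x_data + sigma * noise
    v_target = alpha * noise - sigma * x_data
    return x_t, v_target

def mse(pred, target):
    """Mean squared error between a prediction and its target."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
x_data = rng.normal(size=16)  # stand-in for an audio latent
noise = rng.normal(size=16)
t = 0.3                       # one sampled time value per training step

fm_input, fm_target = flow_matching_pair(x_data, noise, t)
df_input, df_target = cosine_v_pair(x_data, noise, t)

# A perfect model would reach zero loss on either target.
fm_loss = mse(fm_target, fm_target)
```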

---

## License

AudioX weights are released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0).

- Free for personal use, research, and non-commercial projects
- **Cannot be used in commercial products or services** without a separate agreement
- Attribution required
- SelVA/MMAudio: MIT (no commercial-use restrictions)

If PrismAudio is ever distributed as part of a commercial tool, AudioX nodes must be clearly
opt-in with a license warning, or excluded entirely.
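
One possible shape for such an opt-in gate is sketched below. Everything here is hypothetical: PrismAudio does not define the `PRISMAUDIO_ENABLE_AUDIOX` variable or these helpers today, and an actual implementation might prefer a config file or a UI toggle. The idea is only that AudioX nodes refuse to load unless the user has explicitly enabled them, and that enabling them always surfaces the license notice.

```python
import os
import warnings

# Hypothetical variable name; PrismAudio does not define this today.
OPT_IN_ENV = "PRISMAUDIO_ENABLE_AUDIOX"

LICENSE_NOTICE = (
    "AudioX weights are licensed CC-BY-NC-4.0: non-commercial use only, "
    "attribution required."
)

def audiox_enabled(environ=None):
    """Return True only when the user has explicitly opted in."""
    env = os.environ if environ is None else environ
    return env.get(OPT_IN_ENV, "").strip().lower() in {"1", "true", "yes"}

def require_audiox_opt_in(environ=None):
    """Raise unless opted in; otherwise emit the license warning."""
    if not audiox_enabled(environ):
        raise RuntimeError(
            f"AudioX nodes are disabled. Set {OPT_IN_ENV}=1 to enable. "
            + LICENSE_NOTICE
        )
    warnings.warn(LICENSE_NOTICE, UserWarning, stacklevel=2)
```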

---

## Recommendation

**Short term:** AudioX is not a replacement for SelVA for the current use case (video → custom
sound effects with LoRA fine-tuning). SelVA is faster, has full training infrastructure, and
is MIT licensed.

**When AudioX becomes worth integrating:**

- If you need to generate background music synchronized to video
- If you need audio inpainting (filling a gap in an existing audio track)
- If you need text-to-audio generation without a video input
- In every case, only after verifying that the CC-BY-NC-4.0 license is acceptable for your use

**Estimated integration effort for inference nodes only:** 2–3 days of work (3 new node files,
dependency management, testing). No changes to existing SelVA nodes are required; they would
coexist in the same package.

---

## References

- Paper: arXiv:2503.10522, *AudioX: Diffusion Transformer for Anything-to-Audio Generation*
- GitHub: https://github.com/ZeyueT/AudioX
- Model weights: https://huggingface.co/HKUSTAudio/AudioX-MAF
- Demo: https://huggingface.co/spaces/Zeyue7/AudioX
- Project page: https://zeyuet.github.io/AudioX/