
# AudioX vs SelVA — Evaluation
AudioX (arXiv:2503.10522, ICLR 2026) is a unified multimodal audio generation model from HKUST.
This document compares it against SelVA/MMAudio and assesses the cost of adding it to PrismAudio.
---
## Quick Decision Guide
| Situation | Use |
|---|---|
| Video → realistic sound effects | **SelVA** — faster, purpose-built, MIT license |
| Music generation from video or text | **AudioX** — SelVA cannot do this |
| Audio inpainting / music continuation | **AudioX** — SelVA cannot do this |
| LoRA fine-tuning on a custom sound | **SelVA** — full training infrastructure already exists |
| Variable output duration | **AudioX** — SelVA is fixed at 8 s |
| Inference speed matters | **SelVA** — 25 sampling steps vs 250 (10× fewer) |
| Non-commercial research | Either |
| Any commercial use | **SelVA only** — AudioX is CC-BY-NC-4.0 |
---
## Architecture
| Dimension | SelVA (MMAudio) | AudioX-MAF |
|---|---|---|
| Core paradigm | Flow matching | Diffusion (k-diffusion / DPM++) |
| Inference steps | 25 ODE steps (Euler) | 250 diffusion steps (DPM++ 3M SDE) |
| Sample rate | 44.1 kHz (large) / 16 kHz (small) | 48 kHz (fixed) |
| Generator | MM-DiT, velocity prediction | ContinuousMMDiTTransformer |
| Video encoder | Synchformer | Synchformer (AudioX custom re-impl, same concept) |
| VAE / codec | DAC (descript-audio-codec) | DAC + AudioCraft options |
| Text encoder | T5-large | T5 (configurable small → XXL) |
| Video-audio fusion | Cross-attention in MM-DiT | MAF: dual-projection (dim alignment + seq length alignment) |
| Output duration | Fixed 8 s | Configurable via `sample_size` (default ~44 s at 48 kHz) |
| Training data | ~2 M samples (MMAudio paper) | 7 M samples (IF-caps dataset, curated) |
| License | MIT | CC-BY-NC-4.0 |
**MAF (Multimodal Adaptive Fusion):** AudioX's key architectural contribution. Instead of directly
concatenating multimodal tokens into the DiT's cross-attention, MAF projects each modality to
match the latent's sequence length via a dedicated linear + transposed-conv stack, then applies
`MMDitSingleBlock` layers for cross-modal fusion. The paper reports this improves cross-modal
alignment particularly for video-to-audio tasks.
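
The sequence-length alignment step can be illustrated with shape arithmetic alone. The sketch below is not AudioX's code: the token counts, kernel size, stride, and padding are hypothetical values chosen so that each transposed-conv layer doubles the sequence length. The output-length formula is the standard one for a 1-D transposed convolution.

```python
def conv_transpose1d_out_len(length, kernel_size, stride, padding=0,
                             output_padding=0, dilation=1):
    """Output length of a 1-D transposed convolution (standard formula)."""
    return ((length - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def layers_to_match(cond_len, latent_len, kernel_size=4, stride=2, padding=1):
    """Stack upsampling layers until cond_len reaches latent_len.

    Returns the list of intermediate lengths. Raises if the target cannot
    be hit exactly with this (hypothetical) layer configuration.
    """
    lengths = [cond_len]
    while lengths[-1] < latent_len:
        lengths.append(
            conv_transpose1d_out_len(lengths[-1], kernel_size, stride, padding)
        )
    if lengths[-1] != latent_len:
        raise ValueError(f"cannot reach {latent_len} from {cond_len}")
    return lengths


# Hypothetical sizes: 32 video tokens upsampled to a 256-step audio latent.
print(layers_to_match(32, 256))  # [32, 64, 128, 256]
```

With kernel size 4, stride 2, and padding 1, each layer exactly doubles the length, so three layers take 32 conditioning tokens to 256. The real stack may use different sizes.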
**Flow matching vs diffusion:** Flow matching (SelVA) trains a single velocity field that moves
samples from noise to data along a near-straight trajectory, which is why 25 steps suffice.
Standard diffusion (AudioX) follows a longer stochastic path and needs 250 steps for comparable
output. The step count reflects sampling efficiency, not a difference in output quality.
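
The step-count argument can be seen in a toy 1-D example. This is a minimal sketch, not SelVA's sampler: the velocity function below is an analytic stand-in for the trained network, and the convention of noise at `t = 0` and data at `t = 1` is an assumption.

```python
import random


def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = x0, 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(x, t)
    return x


def make_straight_velocity(target):
    """Toy velocity field for the linear path x_t = (1 - t) * noise + t * target.

    Along that path the velocity is (target - x_t) / (1 - t); it stands in
    for the trained network here.
    """
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


random.seed(0)
noise = random.gauss(0.0, 1.0)   # starting point at t = 0
target = 0.75                    # hypothetical "data" sample at t = 1
velocity = make_straight_velocity(target)

for steps in (1, 5, 25):
    result = euler_sample(velocity, noise, steps)
    print(steps, abs(result - target))  # error stays at floating-point level
```

On an exactly straight path, Euler integration reaches the target regardless of step count. A trained velocity field is only approximately straight, so practical samplers still use a handful of steps (25 for SelVA).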
---
## Capabilities
| Task | SelVA | AudioX |
|---|---|---|
| Video → sound effects | ✓ (primary use case) | ✓ |
| Text → sound effects | Partial (T5 text conditioning is supported, but text-only generation is not the primary use case) | ✓ (strong benchmark scores) |
| Video → music | ✗ | ✓ |
| Text → music | ✗ | ✓ |
| Audio inpainting | ✗ | ✓ (mask_args parameter) |
| Music continuation | ✗ | ✓ (init_audio parameter) |
| Variable output duration | ✗ (fixed 8 s) | ✓ |
| Multiple input modalities simultaneously | Partial | ✓ (text + video + audio at once) |
The AudioX paper reports superior results on text-to-audio (AudioCaps) and text-to-music
(MusicCaps) against prior models. A video-to-audio comparison against MMAudio specifically is
not prominently featured in the paper. Perceptual evaluation is consistent with that omission:
AudioX does not sound noticeably better than SelVA on video-to-audio tasks. AudioX's advantage
is **breadth** (music, inpainting, variable duration), not raw video-to-audio quality.
---
## Integration Cost
Adding AudioX inference-only nodes to PrismAudio would require:
### New nodes (3 files)
```
nodes/
  audiox_model_loader.py       AUDIOX_MODEL loader — get_pretrained_model("HKUSTAudio/AudioX-MAF")
  audiox_sampler.py            wraps generate_diffusion_cond(), inputs: model + text + video + audio
  audiox_feature_extractor.py  optional — pre-extract Synchformer sync features (caching)
```
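
As a rough idea of what `audiox_model_loader.py` could look like, here is a minimal sketch that follows the usual ComfyUI custom-node conventions (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `NODE_CLASS_MAPPINGS`). The class name, category, input name, and the dict used to bundle the model are assumptions for illustration; only the `get_pretrained_model` import mirrors the API surface described in this document.

```python
class AudioXModelLoader:
    """Hypothetical ComfyUI node that loads AudioX and emits an AUDIOX_MODEL."""

    CATEGORY = "audio/AudioX"          # assumed menu category
    RETURN_TYPES = ("AUDIOX_MODEL",)
    RETURN_NAMES = ("model",)
    FUNCTION = "load"

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "repo_id": ("STRING", {"default": "HKUSTAudio/AudioX-MAF"}),
            }
        }

    def load(self, repo_id):
        # Imported lazily so the rest of the node pack still loads when
        # AudioX is not installed.
        from audiox import get_pretrained_model

        model, config = get_pretrained_model(repo_id)
        # ComfyUI expects a tuple whose entries match RETURN_TYPES.
        return ({"model": model, "config": config},)


NODE_CLASS_MAPPINGS = {"AudioXModelLoader": AudioXModelLoader}
NODE_DISPLAY_NAME_MAPPINGS = {"AudioXModelLoader": "AudioX Model Loader"}
```

The lazy import keeps AudioX an optional dependency: existing SelVA nodes are unaffected if the package is missing, and the failure only surfaces when the loader node actually runs.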
### Installation
```bash
pip install git+https://github.com/ZeyueT/AudioX.git
```
New dependencies not currently in PrismAudio:
- `pytorch-lightning==2.4.0`
- `k-diffusion==0.1.1`
- `v-diffusion-pytorch==0.0.2`
- `descript-audio-codec==1.0.0` (already used by SelVA — no conflict, same package)
- `gradio==4.44.1` (optional — only for the upstream Gradio UI)
Model weights: `HKUSTAudio/AudioX-MAF` on Hugging Face (several GB).
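
Before installing, it may help to see which of these pins differ from the current environment. The following is a small sketch using only the standard library; the helper name `check_pins` is hypothetical, and the optional `gradio` pin is left out.

```python
from importlib import metadata

# Pins copied from the dependency list above.
REQUIRED = {
    "pytorch-lightning": "2.4.0",
    "k-diffusion": "0.1.1",
    "v-diffusion-pytorch": "0.0.2",
    "descript-audio-codec": "1.0.0",
}


def check_pins(required):
    """Return {package: (installed_or_None, pinned)} for every mismatch."""
    problems = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            problems[name] = (installed, pinned)
    return problems


for name, (installed, pinned) in check_pins(REQUIRED).items():
    print(f"{name}: installed={installed}, expected={pinned}")
```

An empty result means every pinned package is already present at the expected version.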
### Inference API surface
```python
from audiox import get_pretrained_model
from audiox.inference.generation import generate_diffusion_cond

model, config = get_pretrained_model("HKUSTAudio/AudioX-MAF")

# frames_tensor and sync_feat are placeholders for the decoded video frames
# and the pre-extracted Synchformer sync features.
output = generate_diffusion_cond(
    model,
    steps=250,
    cfg_scale=6.0,
    conditioning={
        "text_prompt": "a dog barking",
        "video_prompt": {"video": frames_tensor, "sync_features": sync_feat},
        "seconds_total": 8.0,
    },
    sample_size=384000,  # 8 s at 48 kHz
    sample_rate=48000,
    device="cuda",
)
# output: torch.Tensor (batch, channels, num_samples) float32 [-1, 1]
```
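
The `sample_size=384000` value is just duration times sample rate. A tiny sketch of that arithmetic follows; the helper names are hypothetical, and whether `sample_size` and `seconds_total` must always agree is an assumption, not something confirmed here.

```python
def samples_for_duration(seconds, sample_rate):
    """Number of audio samples needed for `seconds` of audio."""
    return round(seconds * sample_rate)


def duration_for_samples(num_samples, sample_rate):
    """Duration in seconds covered by `num_samples`."""
    return num_samples / sample_rate


print(samples_for_duration(8.0, 48000))     # 384000
print(duration_for_samples(384000, 48000))  # 8.0
```

A sampler node could derive `sample_size` from a user-facing duration input this way instead of exposing a raw sample count.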
---
## LoRA Training
Adding AudioX LoRA training to PrismAudio is **significantly harder** than the SelVA trainer:
| Aspect | SelVA LoRA | AudioX LoRA |
|---|---|---|
| Loss function | Single MSE velocity loss | Diffusion loss over 250-step schedule |
| Training steps needed | ~2000 steps practical | Unknown — likely much more |
| Step cost | Fast (1 velocity prediction) | Slow (full diffusion forward pass per step) |
| Existing infrastructure | Full trainer + scheduler + experiments | Nothing — would need to build from scratch |
| Noise schedule | Trivial (linear interpolation) | Cosine alpha-sigma schedule |
| Prior art for LoRA | LoRA on flow matching well-studied | Less explored; closer to Stable Diffusion LoRA |
**Conclusion:** AudioX LoRA training is feasible (it would follow SD-style LoRA with the cosine
alpha-sigma noise schedule) but would be a substantial new project. It is not worth building
until the inference nodes are stable and there is a clear use case that SelVA cannot serve.
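
To make the "Noise schedule" row concrete, the sketch below builds one training pair under each scheme for a single scalar sample. The flow-matching pair uses linear interpolation with a velocity target. The diffusion pair uses the cosine alpha-sigma schedule with a v-prediction target, which is the usual pairing for that schedule; that AudioX trains with exactly this target is an assumption. In both cases the loss would be an MSE between the network output and the target.

```python
import math


def flow_matching_pair(data, noise, t):
    """Linear interpolation: noise at t=0, data at t=1; target is the velocity."""
    x_t = (1.0 - t) * noise + t * data
    target = data - noise
    return x_t, target


def v_diffusion_pair(data, noise, t):
    """Cosine alpha-sigma schedule: data at t=0, noise at t=1; target is v."""
    alpha = math.cos(t * math.pi / 2)
    sigma = math.sin(t * math.pi / 2)
    x_t = alpha * data + sigma * noise
    target = alpha * noise - sigma * data
    return x_t, target


data, noise, t = 0.5, -1.2, 0.3   # arbitrary illustrative values
for name, make_pair in (("flow matching", flow_matching_pair),
                        ("v-diffusion", v_diffusion_pair)):
    x_t, target = make_pair(data, noise, t)
    print(name, x_t, target)
```

Note that the two conventions run time in opposite directions, so a LoRA trainer written for one cannot be reused for the other by swapping the loss alone.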
---
## License
AudioX weights are released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0).
- Free for personal use, research, and non-commercial projects
- **Cannot be used in commercial products or services** without a separate agreement
- Attribution required
- SelVA/MMAudio: MIT (permissive; commercial use allowed)
If PrismAudio is ever distributed as part of a commercial tool, AudioX nodes must be clearly
opt-in with a license warning, or excluded entirely.
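
One way to make the nodes opt-in is to gate their registration behind an explicit setting and emit the license warning when it is enabled. This is a minimal sketch only: the environment variable name `PRISMAUDIO_ENABLE_AUDIOX` and the function name are hypothetical.

```python
import os
import warnings

LICENSE_WARNING = (
    "AudioX weights are CC-BY-NC-4.0: non-commercial use only, attribution required."
)


def audiox_enabled(env=None):
    """Return True only if the user explicitly opted in to AudioX nodes.

    PRISMAUDIO_ENABLE_AUDIOX is a hypothetical variable name for this sketch.
    """
    env = os.environ if env is None else env
    value = env.get("PRISMAUDIO_ENABLE_AUDIOX", "").strip().lower()
    opted_in = value in {"1", "true", "yes"}
    if opted_in:
        warnings.warn(LICENSE_WARNING, UserWarning, stacklevel=2)
    return opted_in


# Register AudioX nodes only after an explicit opt-in.
if audiox_enabled():
    print("AudioX nodes would be registered here")
```

Defaulting to disabled means a commercial deployment never loads AudioX by accident; excluding the node files entirely remains the stricter option.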
---
## Recommendation
**Short term:** AudioX is not a replacement for SelVA for the current use case (video → custom
sound effects with LoRA fine-tuning). SelVA is faster, has full training infrastructure, and
is MIT licensed.
**When AudioX becomes worth integrating:**
- If you need to generate background music synchronized to video
- If you need audio inpainting (fill a gap in an existing audio track)
- If you need text-to-audio generation without a video input
- After verifying the CC-BY-NC-4.0 license is acceptable for your use
**Estimated integration effort for inference nodes only:** 2 to 3 days of work (3 new node files,
dependency management, testing). No changes to existing SelVA nodes are required; they would
coexist in the same package.
---
## References
- Paper: arXiv:2503.10522 — *AudioX: Diffusion Transformer for Anything-to-Audio Generation*
- GitHub: https://github.com/ZeyueT/AudioX
- Model weights: https://huggingface.co/HKUSTAudio/AudioX-MAF
- Demo: https://huggingface.co/spaces/Zeyue7/AudioX
- Project page: https://zeyuet.github.io/AudioX/