# Audio Dataset Pipeline for Generative Model Training

Research notes on audio cleaning, augmentation, and quality metrics for LoRA fine-tuning of MMAudio/SelVA.
Based on papers and tooling survey (April 2026).

---

## Core Principle

Augmentation for generative models ≠ augmentation for classifiers. The goal is **not invariance**; it is to expand the training manifold so the model learns the distribution of a sound rather than memorizing a fixed set of waveforms.

With 10 clips, velocity field collapse (arXiv:2410.23594) is mathematically expected: the flow-matching model memorizes the training trajectories instead of generalizing. More diverse data is the only real fix.

---

## Recommended Pipeline

### Step 1: Quality Screening

```python
import numpy as np
import pyloudnorm as pyln  # pip install pyloudnorm

# Assumes `audio` is a float NumPy array scaled to [-1, 1] and `sr` is its sample rate.

# Clipping check: flag the clip if clip_ratio > 0.001 (0.1% of samples)
clip_ratio = np.sum(np.abs(audio) >= 0.99) / len(audio)

# DC offset check + removal
dc = np.mean(audio)
audio = audio - dc

# LUFS normalization to -14 LUFS (essential for training consistency)
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(audio)
audio = pyln.normalize.loudness(audio, loudness, -14.0)
# Or via ffmpeg: ffmpeg -af loudnorm=I=-14:LRA=7:TP=-1

# DNSMOS quality gate (discard if OVRL < 3.5 for training; < 2.5 is unusable)
# from Microsoft DNS-Challenge repo
```

### Step 2: Cleaning

| Tool | Install | Use |
|---|---|---|
| **AudioSep** | `pip install audiosep` | Isolate target sound from background (most impactful tool) |
| **noisereduce** | `pip install noisereduce` | Light stationary/non-stationary denoising, preserves character |
| **librosa** | `pip install librosa` | Silence trimming: `librosa.effects.trim(audio, top_db=30)` |
| **torchaudio.transforms.Fade** | (torchaudio) | Prevent click artifacts at clip edges |
| **DeepFilterNet** | `pip install deepfilternet` | Heavy denoising; good for speech, may alter tonal sounds |

**AudioSep usage:**

```python
from audiosep import AudioSep

model = AudioSep.from_pretrained("audio-agi/audiosep")  # ~1.5 GB checkpoint, ~4 GB VRAM
model.inference(audio_path, "a dog barking loudly", output_path)
```

### Step 3: Waveform Augmentation (10 clips → 50–100)

Apply stochastically per clip:

| Transform | Params | Notes |
|---|---|---|
| **PitchShift** | ±1–3 semitones | 3 variants per clip. Limit to ±1 st for tonal/pitched sounds |
| **ApplyImpulseResponse** | 5 different RIRs | 5 variants per clip; EchoThief (~150 free IRs) or pyroomacoustics |
| **LoudnessNormalization** | ±2 LU around -14 LUFS, random | Subtle level variation |
| **SevenBandParametricEQ** | ±3 dB | Gentle spectral variation |
| **TimeStretch** | 0.9–1.1× only | Do NOT use 2× to pad short clips; it breaks video sync |

```python
# pip install audiomentations pedalboard pyroomacoustics
import audiomentations as A

# Assumes `audio` is a float32 NumPy array and `sr` is its sample rate.
augment = A.Compose([
    A.PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    # The parameter is `ir_path` (singular); it accepts a folder or a list of files
    A.ApplyImpulseResponse(ir_path="path/to/irs/", p=0.5),
    A.SevenBandParametricEQ(min_gain_db=-3, max_gain_db=3, p=0.3),
    A.LoudnessNormalization(min_lufs=-16, max_lufs=-12, p=0.5),
    A.TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
])
audio_aug = augment(samples=audio, sample_rate=sr)
```

**RIR sources:**
- EchoThief: ~150 free real-world IRs (churches, caves, parking garages)
- pyroomacoustics: synthetic room simulation, fully controllable

### Step 4: Latent Augmentation (at training time)

After VAE encoding:

**Latent mixup** between same-category pairs:

```python
import torch

# z1, z2: VAE latents of two clips with identical shape.
# Mix latents BEFORE flow-matching noise is added.
# Only mix clips from the same sound category; cross-category mixing produces garbage.
lam = torch.distributions.Beta(0.4, 0.4).sample()
z_mix = lam * z1 + (1 - lam) * z2
```

With 10 clips: C(10,2) = 45 possible pairs → significant expansion without new recordings.

**Small Gaussian noise:**

```python
# Jitter scaled to 2% of the latent's own standard deviation
z_noised = z + torch.randn_like(z) * 0.02 * z.std()
```

Prevents trivial memorization of exact latent coordinates.

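
The two latent augmentations above can be combined into one helper. The following is a minimal sketch, not the trainer's actual code: it uses NumPy arrays in place of torch tensors so it runs without a GPU stack, and the function names, the `dog_bark` label, and the `(8, 16)` latent shape are all hypothetical. Applying the Gaussian jitter on top of the mixed latent is also an assumption; the notes list the two techniques separately.

```python
import itertools
import numpy as np

def same_category_pairs(labels):
    """Return index pairs (i, j) with i < j whose clips share a category label."""
    return [
        (i, j)
        for i, j in itertools.combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    ]

def augment_latent(z1, z2, rng, alpha=0.4, noise_scale=0.02):
    """Mix two same-category latents, then add small Gaussian jitter.

    Intended to run BEFORE flow-matching noise is added.
    """
    lam = rng.beta(alpha, alpha)                  # mixing weight in (0, 1)
    z_mix = lam * z1 + (1.0 - lam) * z2           # convex combination of latents
    jitter = rng.standard_normal(z_mix.shape) * noise_scale * z_mix.std()
    return z_mix + jitter

rng = np.random.default_rng(0)
labels = ["dog_bark"] * 10                        # hypothetical: 10 clips, one category
latents = rng.standard_normal((10, 8, 16))        # hypothetical latent shape per clip
pairs = same_category_pairs(labels)               # 45 pairs for 10 same-category clips
i, j = pairs[0]
z_aug = augment_latent(latents[i], latents[j], rng)
```

Filtering pairs by label up front keeps cross-category mixes out of the batch by construction, rather than relying on the sampler to avoid them.
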
MusicLDM (arXiv:2308.01546) shows latent mixup > waveform mixup for generative quality.

---

## Transforms to AVOID for Generative Training

| Transform | Why |
|---|---|
| ClippingDistortion, BitCrush, TanhDistortion, Mp3Compression | Model learns the artifact |
| Reverse | Breaks temporal structure for video-to-audio task |
| TimeMask (creating silence gaps) | Unnatural; model learns to produce silence |
| TimeStretch > 1.3× | Phase vocoder artifacts become part of the target distribution |
| Heavy background noise (< 15 dB SNR) | Model learns to reproduce the noise |

---

## Quality Metrics

| Metric | Tool | Threshold |
|---|---|---|
| DNSMOS P.835 (SIG/BAK/OVRL) | Microsoft DNS-Challenge | OVRL > 3.5 for training |
| LUFS | pyloudnorm | Normalize all clips to -14 LUFS |
| WADA-SNR | (standalone) | No-reference SNR estimate |
| Clipping ratio | NumPy | Flag if > 0.1% of samples at or beyond ±0.99 |

---

## Tool Reference

| Tool | Install | Purpose |
|---|---|---|
| audiomentations | `pip install audiomentations` | Primary augmentation library |
| pedalboard | `pip install pedalboard` | Higher quality pitch shift, IR convolution |
| AudioSep | `pip install audiosep` | Source separation / isolation |
| noisereduce | `pip install noisereduce` | Non-stationary denoising |
| DeepFilterNet | `pip install deepfilternet` | Heavy denoising (speech-optimized) |
| pyloudnorm | `pip install pyloudnorm` | LUFS normalization |
| Silero VAD | `pip install silero-vad` | Voice/silence detection |
| pyroomacoustics | `pip install pyroomacoustics` | Synthetic RIR generation |

---

## Integration with PrismAudio / SelVA

No established ComfyUI audio preprocessing ecosystem as of early 2026. Build thin wrapper nodes around the tools above. PrismAudio already has all required patterns (subprocess isolation, AUDIO type transport).

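
As a rough illustration of the logic such a thin wrapper might hold, the sketch below covers only the NumPy checks from Step 1 (clipping ratio and DC offset removal). The function name `screen_clip` and its return layout are hypothetical, and the ComfyUI node class, LUFS normalization, and DNSMOS gate are deliberately left out.

```python
import numpy as np

def screen_clip(audio, clip_level=0.99, max_clip_ratio=0.001):
    """Remove DC offset and report whether a clip passes the clipping check.

    audio: 1-D array of samples scaled to [-1, 1].
    Returns (cleaned_audio, report_dict).
    """
    audio = np.asarray(audio, dtype=np.float64)
    # Fraction of samples at or beyond the clipping level, measured on the raw signal
    clip_ratio = float(np.mean(np.abs(audio) >= clip_level))
    # DC offset is the mean sample value; subtract it to re-center the waveform
    dc = float(np.mean(audio))
    cleaned = audio - dc
    report = {
        "clip_ratio": clip_ratio,
        "dc_offset": dc,
        "passed": clip_ratio <= max_clip_ratio,   # 0.1% threshold from Step 1
    }
    return cleaned, report

# Usage: one second of a 440 Hz tone at 16 kHz with a 0.1 DC offset
t = np.arange(16000) / 16000.0
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t) + 0.1
cleaned, report = screen_clip(tone)
```

Returning a report instead of raising lets the caller decide whether a flagged clip is discarded or only logged.
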

**Target node set:**
- `SelVA Dataset Cleaner`: wraps noisereduce + LUFS normalization + trim + DNSMOS gate
- `SelVA Dataset Augmenter`: wraps audiomentations Compose pipeline

Steps 1–3 are preprocessing (run once before feature extraction). Step 4 (latent mixup) is a training loop modification; integrate it into `selva_lora_trainer.py`.

---

## Key Papers

| Paper | ArXiv | Finding |
|---|---|---|
| MusicLDM | 2308.01546 | Latent mixup > waveform mixup for generative quality |
| EDMSound | 2311.08667 | Memorization documented; same failure mode as 10-clip training |
| Synthio | 2410.02056 | Synthetic audio as augmentation data (ICLR 2025) |
| HunyuanVideo-Foley | 2508.16930 | V2A data pipeline at scale (100K hrs) |
| FM memorization | 2410.23594 | Velocity field collapse theory; proves early overfitting on small datasets |