Ethanfel 21ed93d3ee docs: add audio dataset pipeline reference doc
Full research notes on cleaning, augmentation, and quality metrics for
generative model training. Covers LUFS normalization, AudioSep, waveform
augmentation (pitch shift, RIR, EQ), latent mixup, DNSMOS gating, tool
install commands, and key paper references.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:37:48 +02:00


Audio Dataset Pipeline for Generative Model Training

Research notes on audio cleaning, augmentation, and quality metrics for LoRA fine-tuning of MMAudio/SelVA. Based on papers and tooling survey (April 2026).


Core Principle

Augmentation for generative models ≠ augmentation for classifiers. The goal is not invariance — it is expanding the training manifold so the model learns the distribution of a sound rather than memorizing a fixed set of waveforms.

With 10 clips, velocity field collapse (arXiv:2410.23594) is mathematically expected: the flow-matching model memorizes the training trajectories instead of generalizing. More diverse data is the only real fix.


Step 1 — Quality Screening

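import numpy as np
import pyloudnorm as pyln  # pip install pyloudnorm

# `audio` is a float NumPy array in [-1, 1]; `sr` is its sample rate in Hz.

# Clipping check: fraction of samples at or above 0.99 in magnitude
clip_ratio = np.sum(np.abs(audio) >= 0.99) / len(audio)  # flag if > 0.001 (0.1%)

# DC offset check + removal (non in-place, so integer input does not raise)
dc = np.mean(audio)
audio = audio - dc

# LUFS normalization to -14 LUFS (essential for training consistency)
# integrated_loudness needs audio longer than the meter's 0.4 s gating block
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(audio)
audio = pyln.normalize.loudness(audio, loudness, -14.0)
# Or via ffmpeg (placeholder file names):
#   ffmpeg -i in.wav -af loudnorm=I=-14:LRA=7:TP=-1 out.wav

# DNSMOS quality gate (discard if OVRL < 3.5 for training; < 2.5 is unusable)
# from Microsoft DNS-Challenge repo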
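
The clipping and DC checks above can be wrapped into one reusable function. This is a minimal sketch, assuming mono float audio in [-1, 1]. The function name `screen_clip` and its return layout are illustrative, not part of any library; the thresholds are the ones quoted in this section.

```python
import numpy as np

def screen_clip(audio, clip_level=0.99, max_clip_ratio=0.001):
    """Return the DC-corrected clip plus simple quality flags (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    # Fraction of samples at or above the clipping level, measured before DC removal.
    clip_ratio = float(np.mean(np.abs(audio) >= clip_level))
    # Mean of the waveform is the DC offset; subtracting it recentres the signal.
    dc = float(np.mean(audio))
    cleaned = audio - dc
    report = {
        "clip_ratio": clip_ratio,
        "dc_offset": dc,
        "clipped": clip_ratio > max_clip_ratio,  # flag if more than 0.1% of samples clip
    }
    return cleaned, report
```

A flagged clip is not necessarily discarded; the report is meant to feed whatever review or gating step follows.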

Step 2 — Cleaning

| Tool | Install | Use |
|---|---|---|
| AudioSep | `pip install audiosep` | Isolates the target sound from background; the most impactful tool |
| noisereduce | `pip install noisereduce` | Light stationary/non-stationary denoising, preserves character |
| librosa | `pip install librosa` | Silence trimming: `librosa.effects.trim(audio, top_db=30)` |
| torchaudio.transforms.Fade | (torchaudio) | Prevents click artifacts at clip edges |
| DeepFilterNet | `pip install deepfilternet` | Heavy denoising; good for speech, may alter tonal sounds |

AudioSep usage:

from audiosep import AudioSep
model = AudioSep.from_pretrained("audio-agi/audiosep")
# ~1.5 GB checkpoint, ~4 GB VRAM
model.inference(audio_path, "a dog barking loudly", output_path)
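
The table lists `torchaudio.transforms.Fade` for preventing clicks at clip edges. The same idea can be sketched in plain NumPy for pipelines that work on arrays rather than tensors. This is an illustrative stand-in, not the torchaudio implementation: the name `apply_edge_fades` and the 10 ms default are assumptions, and it expects mono 1-D audio.

```python
import numpy as np

def apply_edge_fades(audio, sr, fade_ms=10.0):
    """Apply a linear fade-in and fade-out to a mono clip (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32).copy()
    # Fade length in samples, capped so the two fades never overlap.
    n = min(int(sr * fade_ms / 1000.0), len(audio) // 2)
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        audio[:n] *= ramp         # ramp up from silence
        audio[-n:] *= ramp[::-1]  # ramp down to silence
    return audio
```

Applying the fade after silence trimming matters, since trimming can leave a clip starting or ending mid-waveform.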

Step 3 — Waveform Augmentation (10 clips → 50–100)

Apply stochastically per clip:

| Transform | Params | Notes |
|---|---|---|
| PitchShift | ±1–3 semitones | 3 variants per clip. Limit to ±1 st for tonal/pitched sounds |
| ApplyImpulseResponse | 5 different RIRs | 5 variants per clip, using EchoThief (~150 free IRs) or pyroomacoustics |
| LoudnessNormalization | ±2 dB random | Subtle level variation |
| SevenBandParametricEQ | ±3 dB | Gentle spectral variation |
| TimeStretch | 0.9–1.1× only | Do NOT use 2× to pad short clips; it breaks video sync |

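# pip install audiomentations pedalboard pyroomacoustics
import audiomentations as A

# Each transform fires independently with probability p on every call,
# so repeated calls on the same clip yield different variants.
augment = A.Compose([
    A.PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
    # The keyword is ir_path (singular); it accepts a folder or a list of files.
    A.ApplyImpulseResponse(ir_path="path/to/irs/", p=0.5),
    A.SevenBandParametricEQ(min_gain_db=-3, max_gain_db=3, p=0.3),
    A.LoudnessNormalization(min_lufs=-16, max_lufs=-12, p=0.5),
    A.TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
])
# `audio` should be a float32 NumPy array
audio_aug = augment(samples=audio, sample_rate=sr)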

RIR sources:

  • EchoThief: ~150 free real-world IRs (churches, caves, parking garages)
  • pyroomacoustics: synthetic room simulation, fully controllable
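
The per-clip counts in the table (3 pitch variants, 5 RIR variants) are enough to check the 10 → 50–100 target by simple arithmetic. The sketch below only plans the augmentation jobs; it does not process any audio. `plan_variants` and the file names are hypothetical.

```python
PITCH_VARIANTS = 3  # from the table: 3 pitch-shifted variants per clip
RIR_VARIANTS = 5    # from the table: 5 impulse responses per clip

def plan_variants(clip_names, n_pitch=PITCH_VARIANTS, n_rir=RIR_VARIANTS):
    """List (clip, transform, variant_index) jobs for deterministic variants."""
    jobs = []
    for clip in clip_names:
        jobs += [(clip, "pitch", i) for i in range(n_pitch)]
        jobs += [(clip, "rir", i) for i in range(n_rir)]
    return jobs

clips = [f"clip_{i:02d}.wav" for i in range(10)]  # hypothetical file names
jobs = plan_variants(clips)
print(len(jobs))  # 10 clips x (3 + 5) variants = 80
```

Eighty derived clips from ten originals lands inside the 50–100 range; the stochastic EQ, loudness, and time-stretch transforms then add variation on top without changing the count.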

Step 4 — Latent Augmentation (at training time)

After VAE encoding:

Latent mixup between same-category pairs:

import torch

# Mix latents BEFORE flow-matching noise is added
# Only mix clips from the same sound category; cross-category mixing produces garbage
# z1 and z2 are VAE latents of identical shape
lam = torch.distributions.Beta(0.4, 0.4).sample()  # scalar mixing weight in [0, 1]
z_mix = lam * z1 + (1 - lam) * z2

With 10 clips: C(10,2) = 45 possible pairs → significant expansion without new recordings.
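
The pair count can be reproduced by enumerating same-category pairs directly. This is a standard-library sketch: the clip identifiers and the helper names `same_category_pairs` and `sample_mix_weight` are illustrative, and `random.betavariate` stands in for the torch Beta sampler used in the snippet above.

```python
import random
from itertools import combinations

def same_category_pairs(clips_by_category):
    """Enumerate unordered clip pairs, never crossing category boundaries."""
    pairs = []
    for clips in clips_by_category.values():
        pairs.extend(combinations(clips, 2))
    return pairs

def sample_mix_weight(alpha=0.4, rng=random):
    """Draw lam ~ Beta(alpha, alpha); alpha < 1 favours values near 0 or 1."""
    return rng.betavariate(alpha, alpha)

dataset = {"dog_bark": [f"bark_{i}" for i in range(10)]}  # hypothetical clip ids
pairs = same_category_pairs(dataset)
print(len(pairs))  # C(10, 2) = 45
```

Because Beta(0.4, 0.4) concentrates mass near 0 and 1, most mixed latents stay close to one of the two source clips rather than sitting at an even blend.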

Small Gaussian noise:

z_noised = z + torch.randn_like(z) * 0.02 * z.std()

Prevents trivial memorization of exact latent coordinates.

MusicLDM (arXiv:2308.01546) shows latent mixup > waveform mixup for generative quality.


Transforms to AVOID for Generative Training

| Transform | Why |
|---|---|
| ClippingDistortion, BitCrush, TanhDistortion, Mp3Compression | Model learns the artifact |
| Reverse | Breaks temporal structure for the video-to-audio task |
| TimeMask (creating silence gaps) | Unnatural; model learns to produce silence |
| TimeStretch > 1.3× | Phase vocoder artifacts become part of the target distribution |
| Heavy background noise (< 15 dB SNR) | Model learns to reproduce the noise |
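
The 15 dB limit in the last row can be enforced numerically if background noise is mixed in at all. The sketch below assumes SNR is defined as the ratio of mean signal power to mean noise power, and that both inputs are non-silent sample sequences. The function names and the `MIN_SNR_DB` constant are illustrative.

```python
import math

MIN_SNR_DB = 15.0  # floor taken from the table above

def snr_db(signal, noise):
    """SNR in dB from mean power of two sample sequences."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

def noise_gain_for_snr(signal, noise, target_snr_db):
    """Linear gain to apply to `noise` so the mix reaches the target SNR."""
    if target_snr_db < MIN_SNR_DB:
        raise ValueError(f"target SNR {target_snr_db} dB is below {MIN_SNR_DB} dB")
    # Scaling noise by g lowers the SNR by 20*log10(g) dB.
    return 10.0 ** ((snr_db(signal, noise) - target_snr_db) / 20.0)
```

For example, a noise bed sitting 6 dB below the signal needs a gain of about 0.2 to reach 20 dB SNR, and a request for 10 dB is rejected outright.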

Quality Metrics

| Metric | Tool | Threshold |
|---|---|---|
| DNSMOS P.835 (SIG/BAK/OVRL) | Microsoft DNS-Challenge | OVRL > 3.5 for training |
| LUFS | pyloudnorm | Normalize all clips to -14 LUFS |
| WADA-SNR | (standalone) | No-reference SNR estimate |
| Clipping ratio | NumPy | Flag if > 0.1% of samples at ±0.99 |
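
Once the metrics are computed, the thresholds in the table reduce to a small gating function. This sketch assumes the DNSMOS OVRL score and clipping ratio were already measured elsewhere; `ClipMetrics` and `gate` are hypothetical names. It treats a score of exactly 3.5 as passing, which is one reading of the threshold.

```python
from dataclasses import dataclass

@dataclass
class ClipMetrics:
    ovrl: float        # DNSMOS P.835 overall score, computed elsewhere
    clip_ratio: float  # fraction of samples at or above 0.99 in magnitude

def gate(metrics, min_ovrl=3.5, max_clip_ratio=0.001):
    """Return (keep, reasons); reasons lists every threshold the clip failed."""
    reasons = []
    if metrics.ovrl < min_ovrl:
        reasons.append(f"OVRL {metrics.ovrl:.2f} below {min_ovrl}")
    if metrics.clip_ratio > max_clip_ratio:
        reasons.append(f"clip ratio {metrics.clip_ratio:.4f} above {max_clip_ratio}")
    return (not reasons), reasons
```

Returning the reasons alongside the decision makes it easier to see whether a dataset is failing on noise, on clipping, or on both.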

Tool Reference

| Tool | Install | Purpose |
|---|---|---|
| audiomentations | `pip install audiomentations` | Primary augmentation library |
| pedalboard | `pip install pedalboard` | Higher quality pitch shift, IR convolution |
| AudioSep | `pip install audiosep` | Source separation / isolation |
| noisereduce | `pip install noisereduce` | Non-stationary denoising |
| DeepFilterNet | `pip install deepfilternet` | Heavy denoising (speech-optimized) |
| pyloudnorm | `pip install pyloudnorm` | LUFS normalization |
| Silero VAD | `pip install silero-vad` | Voice/silence detection |
| pyroomacoustics | `pip install pyroomacoustics` | Synthetic RIR generation |

Integration with PrismAudio / SelVA

No established ComfyUI audio preprocessing ecosystem as of early 2026. Build thin wrapper nodes around the tools above. PrismAudio already has all required patterns (subprocess isolation, AUDIO type transport).

Target node set:

  • SelVA Dataset Cleaner — wraps noisereduce + LUFS normalization + trim + DNSMOS gate
  • SelVA Dataset Augmenter — wraps audiomentations Compose pipeline

Steps 1–3 are preprocessing (run once before feature extraction). Step 4 (latent mixup) is a training loop modification, to be integrated into selva_lora_trainer.py.


Key Papers

| Paper | arXiv | Finding |
|---|---|---|
| MusicLDM | 2308.01546 | Latent mixup > waveform mixup for generative quality |
| EDMSound | 2311.08667 | Memorization documented; same failure mode as 10-clip training |
| Synthio | 2410.02056 | Synthetic audio as augmentation data (ICLR 2025) |
| HunyuanVideo-Foley | 2508.16930 | V2A data pipeline at scale (100K hrs) |
| FM memorization | 2410.23594 | Velocity field collapse theory; proves early overfitting on small datasets |