diff --git a/docs/audio_dataset_pipeline.md b/docs/audio_dataset_pipeline.md
new file mode 100644
index 0000000..570bc4f
--- /dev/null
+++ b/docs/audio_dataset_pipeline.md
@@ -0,0 +1,174 @@
+# Audio Dataset Pipeline for Generative Model Training
+
+Research notes on audio cleaning, augmentation, and quality metrics for LoRA fine-tuning of MMAudio/SelVA. Based on a survey of papers and tooling (April 2026).
+
+---
+
+## Core Principle
+
+Augmentation for generative models ≠ augmentation for classifiers.
+The goal is **not invariance**; it is to expand the training manifold so that the model learns the distribution of a sound rather than memorizing a fixed set of waveforms.
+
+With 10 clips, velocity field collapse (arXiv:2410.23594) is mathematically expected: the flow-matching model memorizes the training trajectories instead of generalizing. More diverse data is the only real fix.
+
+---
+
+## Recommended Pipeline
+
+### Step 1: Quality Screening
+
+```python
+# pip install pyloudnorm
+import numpy as np
+import pyloudnorm as pyln
+
+# Clipping check (audio: float waveform in [-1, 1]; sr: its sample rate)
+clip_ratio = np.sum(np.abs(audio) >= 0.99) / audio.size  # flag if > 0.001 (0.1% of samples)
+
+# DC offset check + removal
+dc = np.mean(audio)
+audio -= dc
+
+# LUFS normalization to -14 LUFS (essential for training consistency)
+meter = pyln.Meter(sr)
+loudness = meter.integrated_loudness(audio)
+audio = pyln.normalize.loudness(audio, loudness, -14.0)
+# Or via the ffmpeg audio filter: -af loudnorm=I=-14:LRA=7:TP=-1
+
+# DNSMOS quality gate (discard if OVRL < 3.5 for training; < 2.5 is unusable)
+# from Microsoft DNS-Challenge repo
+```
+
+### Step 2: Cleaning
+
+| Tool | Install | Use |
+|---|---|---|
+| **AudioSep** | `pip install audiosep` | Isolate target sound from background; the most impactful tool |
+| **noisereduce** | `pip install noisereduce` | Light stationary/non-stationary denoising, preserves character |
+| **librosa** | `pip install librosa` | Silence trimming: `librosa.effects.trim(audio, top_db=30)` |
+| **torchaudio.transforms.Fade** | (torchaudio) | Prevent click artifacts at clip edges |
+| **DeepFilterNet** | `pip install deepfilternet` | Heavy denoising; good for speech, may alter tonal sounds |
+
+**AudioSep usage:**
+```python
+# Assumption: a pip-installable wrapper exposing this interface; verify against the AudioSep repo before use
+from audiosep import AudioSep
+model = AudioSep.from_pretrained("audio-agi/audiosep")
+# ~1.5 GB checkpoint, ~4 GB VRAM
+model.inference(audio_path, "a dog barking loudly", output_path)
+```
+
+### Step 3: Waveform Augmentation (10 clips → 50–100)
+
+Apply stochastically per clip:
+
+| Transform | Params | Notes |
+|---|---|---|
+| **PitchShift** | ±1–3 semitones | 3 variants per clip. Limit to ±1 st for tonal/pitched sounds |
+| **ApplyImpulseResponse** | 5 different RIRs | 5 variants per clip, from EchoThief (~150 free IRs) or pyroomacoustics |
+| **LoudnessNormalization** | -14 LUFS ± 2 dB, random | Subtle level variation |
+| **SevenBandParametricEQ** | ±3 dB | Gentle spectral variation |
+| **TimeStretch** | 0.9–1.1× only | Do NOT use 2× to pad short clips: it breaks video sync |
+
+```python
+# pip install audiomentations pedalboard pyroomacoustics
+import audiomentations as A
+
+augment = A.Compose([
+    A.PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
+    A.ApplyImpulseResponse(ir_path="path/to/irs/", p=0.5),
+    A.SevenBandParametricEQ(min_gain_db=-3, max_gain_db=3, p=0.3),
+    A.LoudnessNormalization(min_lufs=-16, max_lufs=-12, p=0.5),
+    A.TimeStretch(min_rate=0.9, max_rate=1.1, p=0.3),
+])
+audio_aug = augment(samples=audio, sample_rate=sr)
+```
+
+**RIR sources:**
+- EchoThief: ~150 free real-world IRs (churches, caves, parking garages)
+- pyroomacoustics: synthetic room simulation, fully controllable
+
+### Step 4: Latent Augmentation (at training time)
+
+After VAE encoding:
+
+**Latent mixup** between same-category pairs:
+```python
+import torch
+# Mix latents BEFORE flow-matching noise is added
+# Only mix clips from the same sound category; cross-category mixing produces garbage
+lam = torch.distributions.Beta(0.4, 0.4).sample()
+z_mix = lam * z1 + (1 - lam) * z2
+```
+With 10 clips: C(10,2) = 45 possible pairs → significant expansion without new recordings.
+
+**Small Gaussian noise:**
+```python
+z_noised = z + torch.randn_like(z) * 0.02 * z.std()
+```
+Prevents trivial memorization of exact latent coordinates.
+
+MusicLDM (arXiv:2308.01546) reports that latent mixup beats waveform mixup for generative quality.
+
+---
+
+## Transforms to AVOID for Generative Training
+
+| Transform | Why |
+|---|---|
+| ClippingDistortion, BitCrush, TanhDistortion, Mp3Compression | Model learns the artifact |
+| Reverse | Breaks temporal structure for video-to-audio task |
+| TimeMask (creating silence gaps) | Unnatural; the model learns to produce silence |
+| TimeStretch > 1.3× | Phase vocoder artifacts become part of the target distribution |
+| Heavy background noise (< 15 dB SNR) | Model learns to reproduce the noise |
+
+---
+
+## Quality Metrics
+
+| Metric | Tool | Threshold |
+|---|---|---|
+| DNSMOS P.835 (SIG/BAK/OVRL) | Microsoft DNS-Challenge | OVRL > 3.5 for training |
+| LUFS | pyloudnorm | Normalize all clips to -14 LUFS |
+| WADA-SNR | (standalone) | No threshold set; no-reference SNR estimate |
+| Clipping ratio | NumPy | Flag if > 0.1% of samples at or beyond ±0.99 |
+
+---
+
+## Tool Reference
+
+| Tool | Install | Purpose |
+|---|---|---|
+| audiomentations | `pip install audiomentations` | Primary augmentation library |
+| pedalboard | `pip install pedalboard` | Higher quality pitch shift, IR convolution |
+| AudioSep | `pip install audiosep` | Source separation / isolation |
+| noisereduce | `pip install noisereduce` | Stationary / non-stationary denoising |
+| DeepFilterNet | `pip install deepfilternet` | Heavy denoising (speech-optimized) |
+| pyloudnorm | `pip install pyloudnorm` | LUFS normalization |
+| Silero VAD | `pip install silero-vad` | Voice/silence detection |
+| pyroomacoustics | `pip install pyroomacoustics` | Synthetic RIR generation |
+
+---
+
+## Integration with PrismAudio / SelVA
+
+There is no established ComfyUI audio preprocessing ecosystem as of early 2026. Build thin wrapper nodes around the tools above. PrismAudio already has all required patterns (subprocess isolation, AUDIO type transport).
+
+**Target node set:**
+- `SelVA Dataset Cleaner`: wraps noisereduce + LUFS normalization + trim + DNSMOS gate
+- `SelVA Dataset Augmenter`: wraps audiomentations Compose pipeline
+
+Steps 1–3 are preprocessing (run once before feature extraction).
+Step 4 (latent mixup) is a training loop modification; integrate it into `selva_lora_trainer.py`.
+
+---
+
+## Key Papers
+
+| Paper | ArXiv | Finding |
+|---|---|---|
+| MusicLDM | 2308.01546 | Latent mixup > waveform mixup for generative quality |
+| EDMSound | 2311.08667 | Memorization documented; same failure mode as 10-clip training |
+| Synthio | 2410.02056 | Synthetic audio as augmentation data (ICLR 2025) |
+| HunyuanVideo-Foley | 2508.16930 | V2A data pipeline at scale (100K hrs) |
+| FM memorization | 2410.23594 | Velocity field collapse theory: proves early overfitting on small datasets |
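
Note on the Step 4 pair count: the figure of 45 mixup pairs holds only if all ten clips belong to one sound category, since the notes restrict mixing to same-category pairs. The sketch below enumerates the eligible pairs from a list of per-clip labels. The function name `same_category_pairs` and the labels `dog_bark` and `door_slam` are hypothetical, chosen for illustration; nothing here comes from the trainer code.

```python
from collections import defaultdict
from itertools import combinations


def same_category_pairs(categories):
    """Return every (i, j) index pair, i < j, whose clips share a category label."""
    by_category = defaultdict(list)
    for index, label in enumerate(categories):
        by_category[label].append(index)
    pairs = []
    for indices in by_category.values():
        # Only pairs drawn from within one category are eligible for latent mixup.
        pairs.extend(combinations(indices, 2))
    return pairs


# Hypothetical labels: ten clips of a single sound category.
single_category = ["dog_bark"] * 10
# Hypothetical labels: the same ten clips split 6 / 4 across two categories.
two_categories = ["dog_bark"] * 6 + ["door_slam"] * 4

print(len(same_category_pairs(single_category)))  # C(10,2) = 45
print(len(same_category_pairs(two_categories)))   # C(6,2) + C(4,2) = 15 + 6 = 21
```

With a mixed dataset the usable pair count drops quickly, so the expansion estimate is best computed per category rather than from the total clip count.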