ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	c86306bde8	fix(bigvgan-trainer): clone vocoder parameters to strip inference tensor flag. The vocoder is loaded inside ComfyUI's torch.inference_mode(), making all its parameters inference tensors. Autograd cannot save inference tensors for backward even with requires_grad=True. Clone all parameters inside torch.inference_mode(False) before training to get normal tensors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:55:16 +02:00
Ethanfel	f04d59fe63	fix(bigvgan-trainer): clone mel outputs to strip inference tensor flag from buffers. mel_converter buffers (mel_basis, hann_window) are inference tensors because the model was loaded inside ComfyUI's torch.inference_mode(). Operations on them propagate the flag to outputs. Clone both target_mel and pred_mel to get normal autograd-compatible tensors. .clone() is differentiable so the grad graph to vocoder parameters is preserved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:51:28 +02:00
Ethanfel	daa36a5f7b	fix(bigvgan-trainer): clone target tensor to exit inference mode before backward. Clips loaded outside torch.inference_mode(False) are inference tensors. Autograd cannot save them for backward. .clone() creates a normal tensor, same fix pattern as selva_lora_trainer's dist.mode().clone(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:47:47 +02:00
Ethanfel	16e20b30ce	fix(bigvgan-trainer): cast audio to model dtype to match bf16 mel_converter buffers. Model loaded in bf16 causes mel_basis buffer to be bf16. Audio loaded from disk is float32, causing matmul dtype mismatch. Cast all audio tensors to model["dtype"] before passing to mel_converter/vocoder. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:46:01 +02:00
Ethanfel	ea7dfed27a	fix(bigvgan-trainer): fallback to soundfile when torchaudio ffmpeg backend fails. torchcodec/libavutil soname mismatch causes torchaudio to fail on every file load, silently emptying clips. Add _load_wav() that tries torchaudio first then falls back to soundfile (handles wav/flac without ffmpeg). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:41:59 +02:00
Ethanfel	81ff0d46c9	fix(bigvgan-trainer): resolve device mismatch in _save_sample after offload. After the finally block, offload_to_cpu moves the vocoder to CPU while ref_mel stays on GPU. Fix: detect vocoder's current device via next(vocoder.parameters()).device and move ref_mel there before vocoding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:35:07 +02:00
Ethanfel	9fdeb65182	feat(bigvgan-trainer): add eval samples at checkpoints and end. Saves baseline.wav (ground truth roundtrip before training), stepN.wav at each save_every checkpoint, and final.wav after training completes. All use the same fixed reference segment (clip 0, position 0) for direct comparison across checkpoints. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:30:34 +02:00
Ethanfel	790a53e3df	fix(bigvgan): add 44k/BigVGANv2 support to trainer and loader. 44k variants use BigVGANv2 directly as the vocoder (no wrapper, no @inference_mode decorator), accessible at feature_utils.tod.vocoder. 16k wraps BigVGANVocoder inside BigVGAN, accessed at .vocoder.vocoder. Both trainer and loader now branch on model["mode"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:28:32 +02:00
Ethanfel	9c784b4bdb	feat: add BigVGAN vocoder fine-tuner and loader nodes. Spectral-loss-only fine-tuning of the BigVGAN vocoder (mel→waveform) on BJ audio clips. DiT and VAE are completely frozen. Losses: mel L1 reconstruction + multi-resolution STFT magnitude L1 (same three resolutions as the BigVGAN discriminator config). Saves in {'generator': state_dict} format compatible with the original BigVGAN checkpoint. Loader replaces vocoder weights in the loaded SELVA_MODEL in-place so no full model reload is needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:26:12 +02:00
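
Several of the fixes above (c86306bde8, f04d59fe63, daa36a5f7b) rely on the same pattern: a tensor created under torch.inference_mode() cannot be saved for backward, and .clone() outside that mode yields a normal tensor. The following is a minimal sketch of that pattern, not the trainer's actual code; the names target, w, and safe_target are hypothetical, and a plain weight vector stands in for the vocoder.

```python
import torch

# Stand-in for ComfyUI's outer context: tensors created here are inference tensors.
with torch.inference_mode():
    target = torch.randn(8)

# Hypothetical trainable weight, created outside inference mode.
w = torch.ones(8, requires_grad=True)

# Multiplication must save `target` to compute w's gradient, which autograd refuses.
try:
    (w * target).sum().backward()
    raised = False
except RuntimeError:
    raised = True

# clone() outside inference mode returns a normal tensor with the same values.
safe_target = target.clone()
loss = (w * safe_target).sum()
loss.backward()  # succeeds; the gradient of sum(w * t) with respect to w is t
```

Per the commit messages, the trainer applies the same idea to clips, mel outputs, and vocoder parameters inside torch.inference_mode(False).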

9 Commits
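
The loader fallback described in ea7dfed27a can be sketched roughly as below. This is an assumption about the shape of such a helper, not the repository's _load_wav(); the function names and the loaders parameter are hypothetical. Backends are imported lazily so that a broken torchaudio backend fails inside the try block, and the helper raises when every backend fails instead of silently dropping the clip.

```python
def _load_with_torchaudio(path):
    import torchaudio  # lazy import: a broken backend fails here, inside the fallback loop
    return torchaudio.load(path)  # (waveform [channels, frames], sample_rate)


def _load_with_soundfile(path):
    import soundfile as sf
    import torch
    # soundfile returns [frames, channels]; transpose to match torchaudio's layout.
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)
    return torch.from_numpy(data.T.copy()), sample_rate


def load_wav(path, loaders=(_load_with_torchaudio, _load_with_soundfile)):
    """Try each loader in order and return the first successful result."""
    errors = []
    for loader in loaders:
        try:
            return loader(path)
        except Exception as exc:  # includes ImportError and OSError from missing libraries
            errors.append(f"{loader.__name__}: {exc}")
    # Surface every failure rather than returning an empty clip list.
    raise RuntimeError(f"could not load {path}: " + "; ".join(errors))
```

Passing the loaders as a parameter keeps the fallback order explicit and makes the helper easy to exercise with stub loaders.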