ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	bee518a855	fix: cast all STFT inputs to float32 to prevent cuFFT bfloat16 crash cuFFT does not support bfloat16 tensors. When the model is loaded in bfloat16, all torch.stft calls (mel_converter, discriminator spectrogram, multi-resolution STFT loss) crash. Add .float() at every STFT boundary. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 23:53:36 +02:00
Ethanfel	48b72c0be0	feat: add LoRA mel pre-generation to BigVGAN vocoder trainer When a lora_adapter path is provided, the trainer pre-generates LoRA-distorted mels for each training clip (full ODE generation + VAE decode) and trains the vocoder to produce clean audio from them. This teaches the vocoder to compensate for LoRA latent distribution shift without requiring perfectly aligned training pairs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-09 23:26:36 +02:00
Ethanfel	8ccc2438e4	fix: remove FlashSR (audiosr incompatible with Python 3.12), add training loss CSV. Drop SelvaFlashSR node: audiosr pins numpy<=1.23.5, which cannot build on Python 3.12 (pkgutil.ImpImporter removed); use Saganaki22/ComfyUI-AudioSR instead. BigVGAN trainer now writes <output_stem>_training_log.csv alongside the checkpoint (step, total, fm, mel, stft, phase, l2sp columns), line-buffered so loss can be tailed live during training. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:18:34 +02:00
Ethanfel	8371466e44	fix: guarantee length preservation in _ActivationWithGAFilter Activation1d's anti-alias Kaiser sinc resampling (asymmetric pad_left / pad_right) can produce ±1-2 sample rounding in edge cases, causing the BigVGAN AMPBlock residual addition (xt + x) to fail with a size mismatch. Trim or pad the output to exactly match the input length so the resblock skip connection always has matching dimensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:39:03 +02:00
Ethanfel	45fced55bc	fix: exclude GAFilter params from L2-SP regularization L2-SP anchors trainable params to their pretrained values. GAFilter is a newly initialized module (identity FIR filter) with no pretrained values — anchoring it to identity initialization would resist learning. Exclude gafilter params from the L2-SP loss so they train freely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:19:52 +02:00
Ethanfel	db112394e8	feat: add AF-Vocoder GAFilter to BigVGAN trainer and loader. Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN residual blocks. Initialized as identity so training starts from pretrained behaviour. inject_gafilters() walks resblocks.*.activations and wraps each Activation1d with _ActivationWithGAFilter, so weights appear in vocoder.state_dict() automatically; trained alongside Snake alphas in snake_alpha_only mode; checkpoint saves has_gafilter + gafilter_kernel_size metadata; loader detects metadata and injects before load_state_dict so weights populate correctly; controlled by use_gafilter (default True) and gafilter_kernel_size (default 9). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:15:14 +02:00
Ethanfel	c53ea5517c	feat: add FA-GAN phase-aware STFT loss to BigVGAN trainer Adds L1 loss on real, imaginary, and magnitude STFT components across three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing directly — magnitude-only losses cannot distinguish correct spectrum with wrong phase from a smeared spectrum. Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top of both the discriminator FM path and the fallback mel+STFT path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:09:31 +02:00
Ethanfel	b9f95cfd7e	fix: detect silent discriminator load failure and fall back explicitly If no matching key was found for MPD or MRD in the checkpoint, the for-loops completed silently and randomly-initialized discriminators were used as frozen feature extractors — producing meaningless feature matching loss while appearing to work. Now raises RuntimeError (caught by outer except) which triggers the existing fallback to mel+STFT losses with a clear warning. Also prints available checkpoint keys to help diagnose format mismatches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:39:55 +02:00
Ethanfel	2b10205657	fix: raise segment_seconds max from 4s to 30s Hardcoded max of 4.0 prevented using full 8s clips. Raised to 30s. Also bumped default from 1.0 to 2.0 as a more sensible starting point. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:49:50 +02:00
Ethanfel	8166c56552	perf: gradient checkpointing on vocoder forward to reduce activation memory BigVGAN's 512x upsampling stack stores huge intermediate activations for backward even in snake_alpha_only mode (only 5K trainable params, but activation graph runs through the full network after each snake op). Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations during backward instead of storing them — ~2x compute cost, large reduction in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:45:24 +02:00
Ethanfel	eece79ccae	fix: correct MRD channel width to 128 and unload models before training Two bugs: 1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator checkpoint has channels=128. All convs in _DiscriminatorR now use 128, matching the checkpoint architecture so state_dict loads without error. 2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in VRAM during training (~90 GiB used). Add unload_all_models() + cache flush before the training loop to reclaim VRAM headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:40:01 +02:00
Ethanfel	211494a91c	fix: DITTO gradient never reached x0, remove unused imports and dead code DITTO critical bug: x was reassigned on every ODE step, so by the time loss.backward() ran, x pointed to the final output tensor (grad_fn, not a leaf) and x.grad was always None. The manual gradient transfer never fired — x0 was never updated. The optimization was a no-op. Fix: use a straight-through estimator after the no-grad prefix: x = x + (x0 - x0.detach()) This adds zero value but creates a grad_fn back to x0, so backward() propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to x0.grad. Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I. Also remove unused imports (SelvaSampler, _inject_tokens, random) that caused cascade ImportError risk, and remove dead trainable_count variable in BigVGAN trainer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:10:02 +02:00
Ethanfel	1e9551152e	feat: add DITTO optimizer, upgrade BigVGAN trainer, document all nodes. BigVGAN trainer (selva_bigvgan_trainer.py): add snake_alpha_only train mode, which tunes only ~27K per-channel α params (0.024% of 112M) and physically cannot cause harmonic smearing; add lambda_l2sp, an L2-SP anchor regularization toward pretrained weights; add optional discriminator_path, where frozen MPD+MRD feature matching loss replaces mel L1 when a BigVGAN discriminator checkpoint is provided; inline MPD + MRD discriminator implementations (no extra dependencies). DITTO optimizer (selva_ditto_optimizer.py): new node for inference-time noise optimization (arXiv:2401.12179); optimizes x₀ via mel Gram matrix style loss against BJ reference clips; all model weights frozen, so zero quality degradation risk; truncated BPTT through last n_grad_steps of the ODE (configurable); gradient checkpointing on each differentiated step. Docs: README documents all 20 nodes (was 3) and adds workflow diagrams; STYLE_TRANSFER.md is a new guide covering DITTO, vocoder fine-tuning tiers, why LoRA/TI fail, combined approach, dataset prep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:04:05 +02:00
Ethanfel	f17f6f0863	feat: save ground truth spectrogram once for direct comparison Writes _gt_spec.png from ref_mel before training starts so each step's _spec.png can be compared against the unmodified vocoder roundtrip target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:05:47 +02:00
Ethanfel	304d9d01bf	feat: save mel spectrogram PNG alongside each eval sample Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample call now writes both a .wav and a _spec.png so training progress is visible without listening. Colour map is blue→green→yellow (viridis-ish), low frequencies at the bottom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:03:28 +02:00
Ethanfel	0128a81cc2	fix: use full first clip for eval samples instead of 1s segment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:01:52 +02:00
Ethanfel	710261f5be	fix: add soundfile fallback for torchaudio.save in sample writing Same environment has no compatible ffmpeg/torchcodec for saving. Mirror the _load_wav pattern: try torchaudio, fall back to soundfile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:58:07 +02:00
Ethanfel	5df2abd6dd	fix: handle all three inference-tensor sources in vocoder sanitization remove_parametrizations() stores weight as a plain __dict__ tensor (not nn.Parameter), making it invisible to _parameters iteration. Also, buffers (Activation1d anti-aliasing filters) are inference tensors that break the backward graph mid-network. Fix all three categories: 1. _parameters: clone().detach(), wrap as Parameter 2. plain __dict__ tensors: clone(), register_parameter (also makes trainable) 3. _buffers: clone() to strip inference flag without parametrizing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:54:41 +02:00
Ethanfel	b243908873	debug: inspect conv_pre parametrizations and _parameters keys Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:46:16 +02:00
Ethanfel	9df855ee0e	debug: print is_inference() status before failing conv_pre call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:41:51 +02:00
Ethanfel	78f8aa98ad	fix: clone inference tensors at thread entry to strip the inference flag torch.inference_mode is thread-local, but the inference flag lives on the tensor object. Operations on inference tensors always propagate it, even in a clean thread. The only escape is .clone() called outside inference_mode. At thread entry (inference_mode disabled): clone clips and mel_converter buffers to get clean normal tensors before any training computation. Vocoder parameter clone() also now works correctly in this thread context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:35:48 +02:00
Ethanfel	e870446b0f	fix: run BigVGAN training in a fresh thread to escape inference_mode torch.inference_mode is thread-local. ComfyUI sets it on the node-execution thread; inference_mode(False) alone is insufficient to escape it in some environments (e.g. async wrappers, lora-manager hook). A new thread always starts clean. Moved all training logic into _do_train() called via threading.Thread so every tensor is a normal autograd tensor by default. Simplified parameter cloning: clone().detach().requires_grad_(True). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:30:53 +02:00
Ethanfel	df63b147e9	fix: sanitize all submodule buffers of mel_converter + guarantee target_mel output Previous fix only iterated mel_converter._buffers (direct buffers). Submodules (e.g. Spectrogram.window) still held inference tensors. Switch to .modules() to cover all nested buffers, matching the vocoder parameter sanitization. Also add a zeros+copy_ safety net on target_mel output so conv can save it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:14:12 +02:00
Ethanfel	51ac099073	fix: sanitize target_flat — clips are inference tensors from outer inference_mode The clips list is built inside ComfyUI's inference_mode context, so every element is an inference tensor. torch.stack().clone() propagates the flag. Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor, so mel_converter(target_flat) inside no_grad produces a saveable input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:09:26 +02:00
Ethanfel	b7565ec458	fix: sanitize inference tensors in BigVGAN trainer via zeros+copy_ pattern param.data.clone() and tensor.detach() on inference tensors both produce inference tensors — the flag propagates through all operations on them. Inside inference_mode(False), torch.zeros() creates genuine normal tensors. Use zeros+copy_ to sanitize both vocoder parameters and mel_converter buffers once before training, so autograd can save inputs for backward. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:05:36 +02:00
Ethanfel	0fcb6d3106	fix(bigvgan-trainer): replace parameter objects to fully strip inference tensor flag param.data = clone() only replaces storage — the nn.Parameter object itself retains the inference tensor flag set when the model was loaded. Replace each parameter with a fresh nn.Parameter(data.clone()) created inside inference_mode(False) so both the object and its data are normal tensors. Move optimizer creation to after re-creation so it references the new objects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:58:57 +02:00
Ethanfel	c86306bde8	fix(bigvgan-trainer): clone vocoder parameters to strip inference tensor flag The vocoder is loaded inside ComfyUI's torch.inference_mode(), making all its parameters inference tensors. Autograd cannot save inference tensors for backward even with requires_grad=True. Clone all parameters inside torch.inference_mode(False) before training to get normal tensors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:55:16 +02:00
Ethanfel	f04d59fe63	fix(bigvgan-trainer): clone mel outputs to strip inference tensor flag from buffers mel_converter buffers (mel_basis, hann_window) are inference tensors because the model was loaded inside ComfyUI's torch.inference_mode(). Operations on them propagate the flag to outputs. Clone both target_mel and pred_mel to get normal autograd-compatible tensors. .clone() is differentiable so the grad graph to vocoder parameters is preserved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:51:28 +02:00
Ethanfel	daa36a5f7b	fix(bigvgan-trainer): clone target tensor to exit inference mode before backward Clips loaded outside torch.inference_mode(False) are inference tensors. Autograd cannot save them for backward. .clone() creates a normal tensor, same fix pattern as selva_lora_trainer's dist.mode().clone(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:47:47 +02:00
Ethanfel	16e20b30ce	fix(bigvgan-trainer): cast audio to model dtype to match bf16 mel_converter buffers Model loaded in bf16 causes mel_basis buffer to be bf16. Audio loaded from disk is float32, causing matmul dtype mismatch. Cast all audio tensors to model["dtype"] before passing to mel_converter/vocoder. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:46:01 +02:00
Ethanfel	ea7dfed27a	fix(bigvgan-trainer): fallback to soundfile when torchaudio ffmpeg backend fails torchcodec/libavutil soname mismatch causes torchaudio to fail on every file load, silently emptying clips. Add _load_wav() that tries torchaudio first then falls back to soundfile (handles wav/flac without ffmpeg). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:41:59 +02:00
Ethanfel	81ff0d46c9	fix(bigvgan-trainer): resolve device mismatch in _save_sample after offload After the finally block, offload_to_cpu moves the vocoder to CPU while ref_mel stays on GPU. Fix: detect vocoder's current device via next(vocoder.parameters()).device and move ref_mel there before vocoding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:35:07 +02:00
Ethanfel	9fdeb65182	feat(bigvgan-trainer): add eval samples at checkpoints and end Saves baseline.wav (ground truth roundtrip before training), stepN.wav at each save_every checkpoint, and final.wav after training completes. All use the same fixed reference segment (clip 0, position 0) for direct comparison across checkpoints. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:30:34 +02:00
Ethanfel	790a53e3df	fix(bigvgan): add 44k/BigVGANv2 support to trainer and loader 44k variants use BigVGANv2 directly as the vocoder (no wrapper, no @inference_mode decorator), accessible at feature_utils.tod.vocoder. 16k wraps BigVGANVocoder inside BigVGAN, accessed at .vocoder.vocoder. Both trainer and loader now branch on model["mode"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:28:32 +02:00
Ethanfel	9c784b4bdb	feat: add BigVGAN vocoder fine-tuner and loader nodes Spectral-loss-only fine-tuning of the BigVGAN vocoder (mel→waveform) on BJ audio clips. DiT and VAE are completely frozen. Losses: mel L1 reconstruction + multi-resolution STFT magnitude L1 (same three resolutions as the BigVGAN discriminator config). Saves in {'generator': state_dict} format compatible with the original BigVGAN checkpoint. Loader replaces vocoder weights in the loaded SELVA_MODEL in-place so no full model reload is needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:26:12 +02:00
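
Commit 8ccc2438e4 mentions a line-buffered training log that can be tailed while training runs. The sketch below shows one way such a log could be written using only the Python standard library. The column names come from the commit message; the function names open_training_log and log_step, the six-decimal formatting, and the zero default for missing losses are assumptions made for illustration and are not taken from the repository.

```python
import csv

# Column names as listed in the commit message.
COLUMNS = ["step", "total", "fm", "mel", "stft", "phase", "l2sp"]


def open_training_log(path):
    """Open a line-buffered CSV log and write the header row.

    Hypothetical helper: returns the file handle (to close later) and a
    csv.DictWriter bound to it.
    """
    # buffering=1 selects line buffering in text mode, so every row is
    # flushed as soon as its line terminator is written.
    handle = open(path, "w", newline="", buffering=1)
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    return handle, writer


def log_step(writer, step, losses):
    """Append one row; loss terms missing from `losses` are logged as 0."""
    row = {"step": step}
    for name in COLUMNS[1:]:
        row[name] = f"{losses.get(name, 0.0):.6f}"
    writer.writerow(row)
```

Because each row is flushed on write, a second process (for example a tail-style viewer) can read partial results without waiting for the training loop to close the file.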

35 Commits
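
Commit b9f95cfd7e describes turning a silent "no matching key" case into an explicit RuntimeError that triggers the mel+STFT fallback. A minimal sketch of that pattern follows, using plain dictionaries in place of a real checkpoint. The helper names and the candidate key names ("mpd", "discriminator_mpd", "mrd", "discriminator_mrd") are hypothetical; the actual keys probed by the trainer are not shown in the log.

```python
def select_state_dict(checkpoint, candidate_keys, label):
    """Return the entry under the first matching key, or raise.

    Hypothetical helper illustrating the explicit-failure pattern.
    """
    for key in candidate_keys:
        if key in checkpoint:
            return checkpoint[key]
    # Without this raise, the loop would end silently and the caller would
    # keep randomly initialized weights while appearing to work.
    raise RuntimeError(
        f"no {label} weights found; tried {list(candidate_keys)}, "
        f"checkpoint has keys {sorted(checkpoint)}"
    )


def load_discriminators(checkpoint):
    """Return (mpd_state, mrd_state), or None to signal the fallback path."""
    try:
        # Key names below are assumptions for illustration only.
        mpd = select_state_dict(checkpoint, ("mpd", "discriminator_mpd"), "MPD")
        mrd = select_state_dict(checkpoint, ("mrd", "discriminator_mrd"), "MRD")
    except RuntimeError as err:
        print(f"[warning] {err}; falling back to mel+STFT losses")
        return None
    return mpd, mrd
```

Printing the available keys in the error message mirrors the commit's stated goal of helping diagnose checkpoint format mismatches.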