ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	286681edff	fix: cast mel to model dtype before VAE encode in DITTO reference loading mel_converter outputs float32 (cuFFT requirement), but VAE encoder weights are bfloat16. Cast mel to dtype before encode to avoid type mismatch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:18:41 +02:00
Ethanfel	056a7b973d	fix: enable VAE encoder in model loader — required for DITTO reference encoding need_vae_encoder=False was deleting the encoder to save a small amount of VRAM. DITTO now needs it to encode reference clips to latent space for style loss. The spectrogram VAE encoder is small enough that the overhead is negligible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:15:27 +02:00
Ethanfel	633fe36fbb	fix: compute DITTO style loss in latent space to eliminate VAE decoder noise Root cause of white noise: backpropagating through vae.decode produces unstable gradients — the VAE decoder was designed for inference only. Fix: encode reference clips to VAE latent space once (no grad), compute mean + Gram matrix statistics there, and compute style loss directly on net_generator.unnormalize(x) — a single differentiable linear operation. The gradient path is now: loss → x (unnormalized) → ODE → x0, with no decoder in the backward pass. Also adds VAE encoder availability check (fails cleanly if encoder was deleted to save VRAM). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:12:31 +02:00
Ethanfel	8862089fd0	fix: remove 32-clip cap on DITTO reference loading — use all available clips Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:10:10 +02:00
Ethanfel	608e7df04b	feat: add gram_weight param to DITTO, reduce default style_weight to 0.1 White noise on output was caused by the Gram matrix loss pushing the latent into incoherent regions. Now gram_weight defaults to 0 (mean spectrum only) and style_weight defaults to 0.1 instead of 1.0. Users can enable Gram gradually once mean-only optimization converges cleanly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 18:03:32 +02:00
Ethanfel	101b1bdb41	fix: _do_optimize returns dict not tuple — prevent double-wrapping AUDIO output optimize() does return (_result[0],) to wrap for ComfyUI. _do_optimize was returning (dict,) instead of dict, causing double-wrapping: ((dict,),). ComfyUI then received a tuple as audio and failed on audio["waveform"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:56:59 +02:00
Ethanfel	732df151b0	fix: cast ref_mean/ref_gram to model dtype before loss computation ref_mean and ref_gram are float32 (mel computed via cuFFT which requires float32). mel_gen is bfloat16. F.l1_loss(bfloat16, float32) promotes to float32, producing a float32 loss. loss.backward() then pushes float32 gradients through bfloat16 ops → 'Found dtype Float but expected BFloat16'. Fix: clone().detach().to(dtype) at the start of _do_optimize. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:48:41 +02:00
Ethanfel	817b75df49	fix: bypass @torch.inference_mode() on decode to preserve gradient chain feature_utils.decode and autoencoder.decode are both decorated with @torch.inference_mode(), which unconditionally destroys grad_fn on all outputs — making loss.backward() fail with 'does not require grad'. Fix: call feature_utils.tod.vae.decode() directly, which has no decorator and is fully differentiable. Transpose matches the original wrapper signature. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:44:35 +02:00
Ethanfel	1f02d73a3e	fix: remove checkpoint wrapper on decode — direct call preserves grad chain _unnorm_decode was wrapped in checkpoint(use_reentrant=False) to avoid saving inference-mode weight tensors during backward. Since _strip_inference() now cleans all params/buffers before any forward pass, the checkpoint is no longer needed and was silently breaking the gradient chain from mel_gen back to x0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:40:00 +02:00
Ethanfel	fb255edaf0	fix: strip inference-mode tensor flags in DITTO before conditions computation Root cause: net_generator/feature_utils/mel_converter parameters were loaded in ComfyUI's inference_mode; operations on inference tensors propagate the flag, so conditions computed from tainted weights were also tainted. checkpoint() with use_reentrant=False then failed trying to save inference tensors during the backward recompute pass. Fix: _strip_inference() clones all params/buffers of all three models before any forward pass, and _clone_nested() cleans any residual inference flags in the conditions/empty_conditions output tensors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:35:15 +02:00
Ethanfel	8ccc2438e4	fix: remove FlashSR (audiosr incompatible with Python 3.12), add training loss CSV - Drop SelvaFlashSR node — audiosr pins numpy<=1.23.5 which cannot build on Python 3.12 (pkgutil.ImpImporter removed); use Saganaki22/ComfyUI-AudioSR instead - BigVGAN trainer now writes <output_stem>_training_log.csv alongside the checkpoint: step, total, fm, mel, stft, phase, l2sp columns, line-buffered so loss can be tailed live during training Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 17:18:34 +02:00
Ethanfel	8371466e44	fix: guarantee length preservation in _ActivationWithGAFilter Activation1d's anti-alias Kaiser sinc resampling (asymmetric pad_left / pad_right) can produce ±1-2 sample rounding in edge cases, causing the BigVGAN AMPBlock residual addition (xt + x) to fail with a size mismatch. Trim or pad the output to exactly match the input length so the resblock skip connection always has matching dimensions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:39:03 +02:00
Ethanfel	ba0499b77c	fix: FlashSR device handling and remove unused tmp_out Use device="auto" for audiosr.build_model — safer than passing a device string that may not be accepted in all audiosr versions. Remove unused tmp_out temp file that was created but never written to. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:32:02 +02:00
Ethanfel	ce62bccc1f	feat: add post-generation audio enhancement nodes Three new nodes for post-generation quality improvement: - SelvaHarmonicExciter: multi-band exciter (HPF → tanh saturation → mix) restores harmonic richness lost in BigVGAN HF reconstruction - SelvaFlashSR: audio super-resolution via FlashSR basic model (haoheliu/versatile_audio_super_resolution, requires pip install audiosr) predicts missing HF content above vocoder reconstruction ceiling - SelvaOutputNormalizer: BS.1770-4 LUFS normalization + true peak limiting for consistent loudness on generated outputs (pyloudnorm) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:27:39 +02:00
Ethanfel	45fced55bc	fix: exclude GAFilter params from L2-SP regularization L2-SP anchors trainable params to their pretrained values. GAFilter is a newly initialized module (identity FIR filter) with no pretrained values — anchoring it to identity initialization would resist learning. Exclude gafilter params from the L2-SP loss so they train freely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:19:52 +02:00
Ethanfel	db112394e8	feat: add AF-Vocoder GAFilter to BigVGAN trainer and loader Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN residual blocks. Initialized as identity so training starts from pretrained behaviour. - inject_gafilters() walks resblocks.*.activations and wraps each Activation1d with _ActivationWithGAFilter — weights appear in vocoder.state_dict() automatically - Trained alongside Snake alphas in snake_alpha_only mode - Checkpoint saves has_gafilter + gafilter_kernel_size metadata - Loader detects metadata and injects before load_state_dict so weights populate correctly - Controlled by use_gafilter (default True) and gafilter_kernel_size (default 9) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:15:14 +02:00
Ethanfel	c53ea5517c	feat: add FA-GAN phase-aware STFT loss to BigVGAN trainer Adds L1 loss on real, imaginary, and magnitude STFT components across three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing directly — magnitude-only losses cannot distinguish correct spectrum with wrong phase from a smeared spectrum. Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top of both the discriminator FM path and the fallback mel+STFT path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 16:09:31 +02:00
Ethanfel	82e449681c	fix: cast mel_converter and wav to float32 before cuFFT in DITTO cuFFT does not support bfloat16. mel_converter was being moved to device without an explicit dtype, inheriting bfloat16 from the model context. Force float32 for both mel_converter.to() and wav.to() so the STFT inside the mel converter runs in a supported dtype. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:59:55 +02:00
Ethanfel	15fc5f0793	feat: add SelvaDatasetCompressor node for parallel compression Mild 2:1-3:1 parallel compression via pedalboard.Compressor to reduce within-clip loudness variance after LUFS normalization. Blend ratio keeps transients intact while tightening dynamics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:36:27 +02:00
Ethanfel	48493a3f0d	feat: add SelvaDatasetSaver node with NPZ sidecar copy Saves all clips in an AUDIO_DATASET to FLAC. When npz_source_dir is provided, copies the matching .npz for each clip so FLAC/NPZ pairs stay in sync after the inspector filters out bad clips. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:27:48 +02:00
Ethanfel	becb38c27e	fix: use soundfile for WAV/FLAC/OGG to bypass torchcodec/FFmpeg dependency torchaudio was defaulting to the torchcodec backend which requires FFmpeg shared libraries not present in the ComfyUI venv, silently skipping every clip and producing an empty dataset. Also add experiments/vocoder_finetune.json for the BJ vocoder LoRA run (lr=3e-4, rank=128, 10k steps). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 15:16:22 +02:00
Ethanfel	b9f95cfd7e	fix: detect silent discriminator load failure and fall back explicitly If no matching key was found for MPD or MRD in the checkpoint, the for-loops completed silently and randomly-initialized discriminators were used as frozen feature extractors — producing meaningless feature matching loss while appearing to work. Now raises RuntimeError (caught by outer except) which triggers the existing fallback to mel+STFT losses with a clear warning. Also prints available checkpoint keys to help diagnose format mismatches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:39:55 +02:00
Ethanfel	f50afa9796	fix: guard _estimate_snr against short clips, fix freqs device in _check_hf_shelf Bug 1: mono.unfold(0, 2048, 512) returns an empty tensor for clips shorter than 2048 samples (~46ms). torch.quantile on an empty tensor crashes with "quantile() input tensor must be non-empty". Guard: return 60.0 (assume clean) for clips too short to frame — the pipeline has no minimum-length filter so any short file in the dataset folder would crash the Inspector. Bug 2: torch.linspace(...) in _check_hf_shelf created a CPU tensor, making band_lo/band_hi CPU boolean masks. Indexing a GPU mag_sq tensor with CPU masks crashes. Pass device=mono.device so freqs lands on the same device as the audio. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:28:36 +02:00
Ethanfel	8a85819f97	feat: register audio dataset pipeline nodes in __init__.py	2026-04-09 14:25:57 +02:00
Ethanfel	f1c4654bab	feat: add SelvaDatasetItemExtractor node	2026-04-09 14:24:58 +02:00
Ethanfel	2d06cb2f52	fix: pass device to hann_window in _check_hf_shelf to avoid GPU mismatch	2026-04-09 14:22:13 +02:00
Ethanfel	0731addea9	feat: add SelvaDatasetInspector node (codec artifacts, SNR, clipping) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:20:03 +02:00
Ethanfel	7eb9bd5745	feat: add SelvaDatasetLUFSNormalizer node (pyloudnorm BS.1770-4) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:17:44 +02:00
Ethanfel	057bfb813d	feat: add SelvaDatasetResampler node (soxr VHQ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:13:45 +02:00
Ethanfel	2c71d4c184	feat: add SelvaDatasetLoader node Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:09:43 +02:00
Ethanfel	d25df10aa5	feat: add audio dataset pipeline skeleton	2026-04-09 14:05:31 +02:00
Ethanfel	d70a4d2123	docs: add audio dataset pipeline implementation plan	2026-04-09 14:02:46 +02:00
Ethanfel	2b10205657	fix: raise segment_seconds max from 4s to 30s Hardcoded max of 4.0 prevented using full 8s clips. Raised to 30s. Also bumped default from 1.0 to 2.0 as a more sensible starting point. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:49:50 +02:00
Ethanfel	8166c56552	perf: gradient checkpointing on vocoder forward to reduce activation memory BigVGAN's 512x upsampling stack stores huge intermediate activations for backward even in snake_alpha_only mode (only 5K trainable params, but activation graph runs through the full network after each snake op). Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations during backward instead of storing them — ~2x compute cost, large reduction in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:45:24 +02:00
Ethanfel	eece79ccae	fix: correct MRD channel width to 128 and unload models before training Two bugs: 1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator checkpoint has channels=128. All convs in _DiscriminatorR now use 128, matching the checkpoint architecture so state_dict loads without error. 2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in VRAM during training (~90 GiB used). Add unload_all_models() + cache flush before the training loop to reclaim VRAM headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:40:01 +02:00
Ethanfel	357b875e5e	fix: strip inference tensor flags in DITTO optimizer Two crash paths under "RuntimeError: Inference tensors cannot be saved for backward": 1. clip_f / sync_f loaded from main-thread inference_mode carry the inference flag. Clone them on entry to the worker thread so the conditions built from them are clean non-inference tensors. Also clone x after Phase 1 before the STE reconnection — Phase 1 runs under no_grad and produces outputs that may still carry the flag through the conditions path. 2. net_generator.unnormalize + feature_utils.decode called outside any checkpoint wrapper with requires_grad=True input. Backward tried to save inference-flagged model weights. Wrapped both calls in checkpoint(use_reentrant=False) so they recompute on backward instead of storing activations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:18:20 +02:00
Ethanfel	211494a91c	fix: DITTO gradient never reached x0, remove unused imports and dead code DITTO critical bug: x was reassigned on every ODE step, so by the time loss.backward() ran, x pointed to the final output tensor (grad_fn, not a leaf) and x.grad was always None. The manual gradient transfer never fired — x0 was never updated. The optimization was a no-op. Fix: use a straight-through estimator after the no-grad prefix: x = x + (x0 - x0.detach()) This adds zero value but creates a grad_fn back to x0, so backward() propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to x0.grad. Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I. Also remove unused imports (SelvaSampler, _inject_tokens, random) that caused cascade ImportError risk, and remove dead trainable_count variable in BigVGAN trainer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:10:02 +02:00
Ethanfel	1e9551152e	feat: add DITTO optimizer, upgrade BigVGAN trainer, document all nodes BigVGAN trainer (selva_bigvgan_trainer.py): - Add snake_alpha_only train mode: tunes only ~27K per-channel α params (0.024% of 112M) — physically cannot cause harmonic smearing - Add lambda_l2sp: L2-SP anchor regularization toward pretrained weights - Add optional discriminator_path: frozen MPD+MRD feature matching loss replaces mel L1 when a BigVGAN discriminator checkpoint is provided - Inline MPD + MRD discriminator implementations (no extra dependencies) DITTO optimizer (selva_ditto_optimizer.py): - New node: inference-time noise optimization (arXiv:2401.12179) - Optimizes x₀ via mel Gram matrix style loss against BJ reference clips - All model weights frozen — zero quality degradation risk - Truncated BPTT through last n_grad_steps of the ODE (configurable) - Gradient checkpointing on each differentiated step Docs: - README: document all 20 nodes (was 3), add workflow diagrams - STYLE_TRANSFER.md: new guide — DITTO, vocoder fine-tuning tiers, why LoRA/TI fail, combined approach, dataset prep Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:04:05 +02:00
Ethanfel	f17f6f0863	feat: save ground truth spectrogram once for direct comparison Writes _gt_spec.png from ref_mel before training starts so each step's _spec.png can be compared against the unmodified vocoder roundtrip target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:05:47 +02:00
Ethanfel	304d9d01bf	feat: save mel spectrogram PNG alongside each eval sample Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample call now writes both a .wav and a _spec.png so training progress is visible without listening. Colour map is blue→green→yellow (viridis-ish), low frequencies at the bottom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:03:28 +02:00
Ethanfel	0128a81cc2	fix: use full first clip for eval samples instead of 1s segment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:01:52 +02:00
Ethanfel	710261f5be	fix: add soundfile fallback for torchaudio.save in sample writing Same environment has no compatible ffmpeg/torchcodec for saving. Mirror the _load_wav pattern: try torchaudio, fall back to soundfile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:58:07 +02:00
Ethanfel	5df2abd6dd	fix: handle all three inference-tensor sources in vocoder sanitization remove_parametrizations() stores weight as a plain __dict__ tensor (not nn.Parameter), making it invisible to _parameters iteration. Also, buffers (Activation1d anti-aliasing filters) are inference tensors that break the backward graph mid-network. Fix all three categories: 1. _parameters: clone().detach(), wrap as Parameter 2. plain __dict__ tensors: clone(), register_parameter (also makes trainable) 3. _buffers: clone() to strip inference flag without parametrizing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:54:41 +02:00
Ethanfel	b243908873	debug: inspect conv_pre parametrizations and _parameters keys Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:46:16 +02:00
Ethanfel	9df855ee0e	debug: print is_inference() status before failing conv_pre call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:41:51 +02:00
Ethanfel	78f8aa98ad	fix: clone inference tensors at thread entry to strip the inference flag torch.inference_mode is thread-local, but the inference flag lives on the tensor object. Operations on inference tensors always propagate it, even in a clean thread. The only escape is .clone() called outside inference_mode. At thread entry (inference_mode disabled): clone clips and mel_converter buffers to get clean normal tensors before any training computation. Vocoder parameter clone() also now works correctly in this thread context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:35:48 +02:00
Ethanfel	e870446b0f	fix: run BigVGAN training in a fresh thread to escape inference_mode torch.inference_mode is thread-local. ComfyUI sets it on the node-execution thread; inference_mode(False) alone is insufficient to escape it in some environments (e.g. async wrappers, lora-manager hook). A new thread always starts clean. Moved all training logic into _do_train() called via threading.Thread so every tensor is a normal autograd tensor by default. Simplified parameter cloning: clone().detach().requires_grad_(True). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:30:53 +02:00
Ethanfel	df63b147e9	fix: sanitize all submodule buffers of mel_converter + guarantee target_mel output Previous fix only iterated mel_converter._buffers (direct buffers). Submodules (e.g. Spectrogram.window) still held inference tensors. Switch to .modules() to cover all nested buffers, matching the vocoder parameter sanitization. Also add a zeros+copy_ safety net on target_mel output so conv can save it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:14:12 +02:00
Ethanfel	51ac099073	fix: sanitize target_flat — clips are inference tensors from outer inference_mode The clips list is built inside ComfyUI's inference_mode context, so every element is an inference tensor. torch.stack().clone() propagates the flag. Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor, so mel_converter(target_flat) inside no_grad produces a saveable input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:09:26 +02:00
Ethanfel	b7565ec458	fix: sanitize inference tensors in BigVGAN trainer via zeros+copy_ pattern param.data.clone() and tensor.detach() on inference tensors both produce inference tensors — the flag propagates through all operations on them. Inside inference_mode(False), torch.zeros() creates genuine normal tensors. Use zeros+copy_ to sanitize both vocoder parameters and mel_converter buffers once before training, so autograd can save inputs for backward. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:05:36 +02:00

246 Commits