mel_converter outputs float32 (cuFFT requirement), but VAE encoder weights
are bfloat16. Cast mel to the encoder dtype before encode to avoid a dtype
mismatch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
need_vae_encoder=False was deleting the encoder to save a small amount of VRAM.
DITTO now needs it to encode reference clips to latent space for style loss.
The spectrogram VAE encoder is small enough that the overhead is negligible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of white noise: backpropagating through vae.decode produces
unstable gradients — the VAE decoder was designed for inference only.
Fix: encode reference clips to VAE latent space once (no grad), compute
mean + Gram matrix statistics there, and compute style loss directly on
net_generator.unnormalize(x) — a single differentiable linear operation.
The gradient path is now: loss → x (unnormalized) → ODE → x0, with no
decoder in the backward pass.
Also adds VAE encoder availability check (fails cleanly if encoder was
deleted to save VRAM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
White noise on output was caused by the Gram matrix loss pushing the latent
into incoherent regions. Now gram_weight defaults to 0 (mean spectrum only)
and style_weight defaults to 0.1 instead of 1.0. Users can enable Gram
gradually once mean-only optimization converges cleanly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
optimize() already returns (_result[0],) to wrap for ComfyUI. _do_optimize was
returning (dict,) instead of dict, causing double-wrapping: ((dict,),).
ComfyUI then received a tuple as audio and failed on audio["waveform"].
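
A minimal sketch of the double-wrapping, using plain Python stand-ins. Only
optimize, _do_optimize, and _result come from the message above; the dict
contents, the worker argument, and the _buggy/_fixed suffixes are
assumptions made for illustration.

```python
def _do_optimize_buggy():
    # Hypothetical payload standing in for the ComfyUI audio dict.
    audio = {"waveform": [0.0, 0.1], "sample_rate": 44100}
    return (audio,)  # already wrapped here: this is the bug


def _do_optimize_fixed():
    audio = {"waveform": [0.0, 0.1], "sample_rate": 44100}
    return audio  # plain dict; optimize() does the only wrapping


def optimize(worker):
    _result = [worker()]   # assumed: a result slot filled by the worker
    return (_result[0],)   # the single wrap ComfyUI expects


good = optimize(_do_optimize_fixed)  # (dict,)
bad = optimize(_do_optimize_buggy)   # ((dict,),)
```

With the fixed worker, good[0]["waveform"] works; with the buggy one, bad[0]
is a tuple and the same lookup raises TypeError.
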
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ref_mean and ref_gram are float32 (mel computed via cuFFT which requires
float32). mel_gen is bfloat16. F.l1_loss(bfloat16, float32) promotes to
float32, producing a float32 loss. loss.backward() then pushes float32
gradients through bfloat16 ops → 'Found dtype Float but expected BFloat16'.
Fix: cast ref_mean and ref_gram with clone().detach().to(dtype) at the start
of _do_optimize.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feature_utils.decode and autoencoder.decode are both decorated with
@torch.inference_mode(), which unconditionally destroys grad_fn on all
outputs — making loss.backward() fail with 'does not require grad'.
Fix: call feature_utils.tod.vae.decode() directly, which has no decorator
and is fully differentiable. Transpose matches the original wrapper signature.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_unnorm_decode was wrapped in checkpoint(use_reentrant=False) to avoid saving
inference-mode weight tensors during backward. Since _strip_inference() now
cleans all params/buffers before any forward pass, the checkpoint is no longer
needed and was silently breaking the gradient chain from mel_gen back to x0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: net_generator/feature_utils/mel_converter parameters were loaded
in ComfyUI's inference_mode; operations on inference tensors propagate the flag,
so conditions computed from tainted weights were also tainted. checkpoint()
with use_reentrant=False then failed trying to save inference tensors during
the backward recompute pass.
Fix: _strip_inference() clones all params/buffers of all three models before
any forward pass, and _clone_nested() cleans any residual inference flags in
the conditions/empty_conditions output tensors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Drop SelvaFlashSR node — audiosr pins numpy<=1.23.5 which cannot build
on Python 3.12 (pkgutil.ImpImporter removed); use Saganaki22/ComfyUI-AudioSR instead
- BigVGAN trainer now writes <output_stem>_training_log.csv alongside the
checkpoint: step, total, fm, mel, stft, phase, l2sp columns, line-buffered
so loss can be tailed live during training
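
A sketch of how such a line-buffered CSV log could be opened, assuming the
standard csv module. The column names follow the list above; the helper name
open_training_log, the checkpoint filename, and the sample values are
hypothetical.

```python
import csv
import os
import tempfile

COLUMNS = ["step", "total", "fm", "mel", "stft", "phase", "l2sp"]


def open_training_log(checkpoint_path):
    """Open <output_stem>_training_log.csv next to the checkpoint."""
    stem, _ext = os.path.splitext(checkpoint_path)
    log_path = stem + "_training_log.csv"
    # buffering=1 selects line buffering in text mode, so every finished
    # row reaches the file without waiting for close().
    handle = open(log_path, "w", buffering=1, newline="")
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    return log_path, handle, writer


tmp_dir = tempfile.mkdtemp()
log_path, handle, writer = open_training_log(os.path.join(tmp_dir, "vocoder.pt"))
writer.writerow([1, 2.5, 1.0, 0.8, 0.4, 0.2, 0.1])

# Read the file back while it is still open for writing, as a live
# "tail" would.
with open(log_path, newline="") as check:
    rows_seen_live = list(csv.reader(check))
handle.close()
```
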
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Activation1d's anti-alias Kaiser sinc resampling (asymmetric pad_left /
pad_right) can leave the output 1-2 samples longer or shorter than the input
in edge cases, causing the BigVGAN AMPBlock residual addition (xt + x) to
fail with a size mismatch.
Trim or pad the output to exactly match the input length so the resblock
skip connection always has matching dimensions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use device="auto" for audiosr.build_model — safer than passing a device
string that may not be accepted in all audiosr versions.
Remove unused tmp_out temp file that was created but never written to.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
L2-SP anchors trainable params to their pretrained values. GAFilter is a
newly initialized module (identity FIR filter) with no pretrained values —
anchoring it to identity initialization would resist learning. Exclude
gafilter params from the L2-SP loss so they train freely.
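
The exclusion can be sketched with plain floats standing in for tensors.
This assumes parameters are selected by a "gafilter" substring in their
name; the function name, the parameter names, and the values are
hypothetical.

```python
def l2sp_loss(params, pretrained, exclude_substring="gafilter"):
    """Sum of squared distances between params and their pretrained values.

    params / pretrained: dict mapping name -> list of floats.
    """
    total = 0.0
    for name, values in params.items():
        if exclude_substring in name:
            continue  # newly initialized module: nothing to anchor to
        anchor = pretrained[name]
        total += sum((v - a) ** 2 for v, a in zip(values, anchor))
    return total


pretrained = {"resblocks.0.alpha": [1.0, 1.0]}
params = {
    "resblocks.0.alpha": [1.5, 1.0],
    "resblocks.0.gafilter.weight": [0.0, 3.0, 0.0],
}
loss = l2sp_loss(params, pretrained)  # only the alpha drift is penalized
```
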
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel
depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN
residual blocks. Initialized as identity so training starts from pretrained
behaviour.
- inject_gafilters() walks resblocks.*.activations and wraps each Activation1d
with _ActivationWithGAFilter — weights appear in vocoder.state_dict() automatically
- Trained alongside Snake alphas in snake_alpha_only mode
- Checkpoint saves has_gafilter + gafilter_kernel_size metadata
- Loader detects metadata and injects before load_state_dict so weights populate correctly
- Controlled by use_gafilter (default True) and gafilter_kernel_size (default 9)
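
Why identity initialization preserves pretrained behaviour can be checked
with a small pure-Python sketch. It assumes an odd kernel with a single 1.0
at the centre tap and zero-padded "same" filtering; the helper names are
hypothetical and this is not the actual module code.

```python
def identity_kernel(kernel_size=9):
    """FIR taps that pass the signal through unchanged."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd to have a centre tap")
    taps = [0.0] * kernel_size
    taps[kernel_size // 2] = 1.0
    return taps


def fir_same(signal, taps):
    """Zero-padded filtering that keeps the input length."""
    half = len(taps) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(tap * padded[i + j] for j, tap in enumerate(taps))
        for i in range(len(signal))
    ]


channel = [0.5, -1.0, 2.0, 0.25, 3.0]
filtered = fir_same(channel, identity_kernel(9))
```

Here filtered equals channel, so a vocoder with freshly injected filters
produces the same output as before injection until the taps are trained.
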
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds L1 loss on real, imaginary, and magnitude STFT components across
three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing
directly — magnitude-only losses cannot distinguish correct spectrum
with wrong phase from a smeared spectrum.
Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top
of both the discriminator FM path and the fallback mel+STFT path.
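
The point about magnitude-only losses can be illustrated on a single frame.
This sketch uses a naive DFT instead of a multi-resolution STFT and sums the
three L1 terms with equal weight; both choices are simplifying assumptions,
and the function names are hypothetical.

```python
import cmath
import math


def dft(frame):
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def magnitude_loss(pred, target):
    spec_p, spec_t = dft(pred), dft(target)
    return l1([abs(c) for c in spec_p], [abs(c) for c in spec_t])


def phase_aware_loss(pred, target):
    spec_p, spec_t = dft(pred), dft(target)
    real = l1([c.real for c in spec_p], [c.real for c in spec_t])
    imag = l1([c.imag for c in spec_p], [c.imag for c in spec_t])
    mag = l1([abs(c) for c in spec_p], [abs(c) for c in spec_t])
    return real + imag + mag


target = [math.sin(2 * math.pi * t / 8) for t in range(8)]
# A circular shift keeps every bin magnitude but rotates every phase.
shifted = target[2:] + target[:2]

mag_only = magnitude_loss(shifted, target)    # ~0: blind to the phase error
with_phase = phase_aware_loss(shifted, target)  # clearly non-zero
```
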
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cuFFT does not support bfloat16. mel_converter was being moved to device
without an explicit dtype, inheriting bfloat16 from the model context.
Force float32 for both mel_converter.to() and wav.to() so the STFT
inside the mel converter runs in a supported dtype.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mild 2:1-3:1 parallel compression via pedalboard.Compressor to reduce
within-clip loudness variance after LUFS normalization. Blend ratio
keeps transients intact while tightening dynamics.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves all clips in an AUDIO_DATASET to FLAC. When npz_source_dir is
provided, copies the matching .npz for each clip so FLAC/NPZ pairs
stay in sync after the inspector filters out bad clips.
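
A sketch of the pairing step, assuming a clip and its .npz are matched by
file stem (the message does not say how matching is done). The function
name and the file names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def copy_matching_npz(flac_paths, npz_source_dir, output_dir):
    """Copy <stem>.npz for each saved <stem>.flac; return the stems copied."""
    copied = []
    for flac in flac_paths:
        src = Path(npz_source_dir) / (Path(flac).stem + ".npz")
        if src.is_file():
            shutil.copy2(src, Path(output_dir) / src.name)
            copied.append(src.stem)
    return copied


source = Path(tempfile.mkdtemp())
output = Path(tempfile.mkdtemp())
for stem in ("clip_a", "clip_b"):
    (source / (stem + ".npz")).write_bytes(b"features")

# Only clip_a survived filtering, so only clip_a.npz should be copied.
kept = [output / "clip_a.flac"]
copied = copy_matching_npz(kept, source, output)
```
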
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torchaudio was defaulting to the torchcodec backend which requires FFmpeg
shared libraries not present in the ComfyUI venv, silently skipping every
clip and producing an empty dataset.
Also add experiments/vocoder_finetune.json for the BJ vocoder LoRA run
(lr=3e-4, rank=128, 10k steps).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If no matching key was found for MPD or MRD in the checkpoint, the for-loops
completed silently and randomly-initialized discriminators were used as frozen
feature extractors — producing meaningless feature matching loss while
appearing to work. Now raises RuntimeError (caught by outer except) which
triggers the existing fallback to mel+STFT losses with a clear warning.
Also prints available checkpoint keys to help diagnose format mismatches.
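
The fail-loudly-then-fall-back flow can be sketched as follows. The
candidate key names and both function names are hypothetical; only the
RuntimeError, the outer except, and the mel+STFT fallback come from the
message above.

```python
def pick_state_dict(checkpoint, candidate_keys, label):
    """Return the first matching entry, or raise instead of passing silently."""
    for key in candidate_keys:
        if key in checkpoint:
            return checkpoint[key]
    raise RuntimeError(
        f"no {label} weights in checkpoint; available keys: {sorted(checkpoint)}"
    )


def choose_loss_mode(checkpoint):
    try:
        pick_state_dict(checkpoint, ("mpd", "discriminator_mpd"), "MPD")
        pick_state_dict(checkpoint, ("mrd", "discriminator_mrd"), "MRD")
        return "feature_matching"
    except RuntimeError as err:
        print(f"WARNING: {err}; falling back to mel+STFT losses")
        return "mel_stft"


ok_mode = choose_loss_mode({"mpd": {}, "mrd": {}})
fallback_mode = choose_loss_mode({"generator": {}, "mpd": {}})
```
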
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1: mono.unfold(0, 2048, 512) returns an empty tensor for clips shorter
than 2048 samples (~46ms). torch.quantile on an empty tensor crashes with
"quantile() input tensor must be non-empty". Guard: return 60.0 (assume
clean) for clips too short to frame — the pipeline has no minimum-length
filter so any short file in the dataset folder would crash the Inspector.
Bug 2: torch.linspace(...) in _check_hf_shelf created a CPU tensor, making
band_lo/band_hi CPU boolean masks. Indexing a GPU mag_sq tensor with CPU
masks crashes. Pass device=mono.device so freqs lands on the same device
as the audio.
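
The Bug 1 guard amounts to checking the frame count before any statistics
are taken. A pure-Python sketch, assuming the 2048/512 framing above; the
names frame_count, inspect_clip, and measure are hypothetical.

```python
FRAME_SIZE = 2048
HOP = 512
ASSUME_CLEAN = 60.0  # value returned for clips too short to frame


def frame_count(num_samples, size=FRAME_SIZE, hop=HOP):
    """Number of full frames a clip of num_samples yields."""
    if num_samples < size:
        return 0
    return (num_samples - size) // hop + 1


def inspect_clip(samples, measure):
    """Run measure() only when at least one full frame exists."""
    if frame_count(len(samples)) == 0:
        return ASSUME_CLEAN
    return measure(samples)


short_score = inspect_clip([0.0] * 1000, measure=lambda s: 12.0)
long_score = inspect_clip([0.0] * 4096, measure=lambda s: 12.0)
```
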
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoded max of 4.0 prevented using full 8s clips. Raised to 30s.
Also bumped default from 1.0 to 2.0 as a more sensible starting point.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigVGAN's 512x upsampling stack stores huge intermediate activations for
backward even in snake_alpha_only mode (only 5K trainable params, but
activation graph runs through the full network after each snake op).
Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations
during backward instead of storing them — ~2x compute cost, large reduction
in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator
checkpoint has channels=128. All convs in _DiscriminatorR now use 128,
matching the checkpoint architecture so state_dict loads without error.
2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in
VRAM during training (~90 GiB used). Add unload_all_models() + cache
flush before the training loop to reclaim VRAM headroom.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two crash paths under "RuntimeError: Inference tensors cannot be
saved for backward":
1. clip_f / sync_f loaded from main-thread inference_mode carry the
inference flag. Clone them on entry to the worker thread so the
conditions built from them are clean non-inference tensors.
Also clone x after Phase 1 before the STE reconnection — Phase 1
runs under no_grad and produces outputs that may still carry the
flag through the conditions path.
2. net_generator.unnormalize + feature_utils.decode called outside
any checkpoint wrapper with requires_grad=True input. Backward
tried to save inference-flagged model weights. Wrapped both calls
in checkpoint(use_reentrant=False) so they recompute on backward
instead of storing activations.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DITTO critical bug: x was reassigned on every ODE step, so by the time
loss.backward() ran, x pointed to the final output tensor (a non-leaf
tensor with a grad_fn) and x.grad was always None. The manual gradient
transfer never fired, so x0 was never updated and the optimization was a
no-op.
Fix: use a straight-through estimator after the no-grad prefix:
x = x + (x0 - x0.detach())
This adds zero value but creates a grad_fn back to x0, so backward()
propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to x0.grad.
Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I.
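
A small PyTorch sketch of the straight-through reconnection described
above. The affine map and the squared-sum loss are stand-ins for the real
ODE steps and style loss, chosen only so the example is self-contained.

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(4, requires_grad=True)

# Phase 1: stand-in for the no-grad ODE prefix.
with torch.no_grad():
    x = x0
    for _ in range(3):
        x = 0.9 * x + 0.1

# Straight-through reconnection: adds exactly zero to the value but gives
# x a grad_fn that leads back to x0.
x = x + (x0 - x0.detach())

# Phase 2 and loss: stand-in for the remaining steps and the style loss.
loss = (x ** 2).sum()
loss.backward()
```

After backward(), x0.grad equals the gradient of the loss with respect to x
at the reconnection point (here 2 * x), which is what treating the prefix
Jacobian as the identity implies.
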
Also remove unused imports (SelvaSampler, _inject_tokens, random) that
caused cascade ImportError risk, and remove dead trainable_count variable
in BigVGAN trainer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes _gt_spec.png from ref_mel before training starts so each step's
_spec.png can be compared against the unmodified vocoder roundtrip target.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample
call now writes both a .wav and a _spec.png so training progress is visible
without listening. Colour map is blue→green→yellow (viridis-ish), low
frequencies at the bottom.
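
One way to build such a colour map without matplotlib is piecewise-linear
interpolation. The exact colour stops and the function names below are
assumptions; the resulting rows could be written into a PIL image, which is
left out to keep the sketch dependency-free.

```python
def colour(value):
    """Map a value in [0, 1] to RGB along blue -> green -> yellow."""
    v = min(1.0, max(0.0, value))
    if v < 0.5:
        t = v / 0.5
        return (0, round(255 * t), round(255 * (1.0 - t)))
    t = (v - 0.5) / 0.5
    return (round(255 * t), 255, 0)


def to_pixel_rows(spec):
    """spec[bin][frame] in [0, 1], bin 0 = lowest frequency.

    Image row 0 is the top, so the bins are reversed to put low
    frequencies at the bottom.
    """
    return [[colour(v) for v in row] for row in reversed(spec)]


spec = [[0.0, 1.0], [0.5, 0.5]]
rows = to_pixel_rows(spec)
```
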
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The same environment has no compatible ffmpeg/torchcodec for saving either.
Mirror the _load_wav pattern: try torchaudio, fall back to soundfile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
remove_parametrizations() stores weight as a plain __dict__ tensor (not
nn.Parameter), making it invisible to _parameters iteration. Also, buffers
(Activation1d anti-aliasing filters) are inference tensors that break the
backward graph mid-network. Fix all three categories:
1. _parameters: clone().detach(), wrap as Parameter
2. plain __dict__ tensors: clone(), register_parameter (also makes trainable)
3. _buffers: clone() to strip inference flag without parametrizing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.inference_mode is thread-local, but the inference flag lives on the
tensor object. Operations on inference tensors always propagate it, even in
a clean thread. The only escape is .clone() called outside inference_mode.
At thread entry (inference_mode disabled): clone clips and mel_converter
buffers to get clean normal tensors before any training computation.
Vocoder parameter clone() also now works correctly in this thread context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.inference_mode is thread-local. ComfyUI sets it on the node-execution
thread; inference_mode(False) alone is insufficient to escape it in some
environments (e.g. async wrappers, lora-manager hook). A new thread always
starts clean. Moved all training logic into _do_train() called via
threading.Thread so every tensor is a normal autograd tensor by default.
Simplified parameter cloning: clone().detach().requires_grad_(True).
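
The thread-local behaviour can be illustrated by analogy with
threading.local, without involving torch. Here _mode.restricted is a
hypothetical stand-in for the inference-mode state, and train is a
hypothetical wrapper; only _do_train and the use of threading.Thread come
from the message above.

```python
import threading

_mode = threading.local()


def in_restricted_mode():
    # Attributes set on a threading.local are visible only in the thread
    # that set them; other threads see the default.
    return getattr(_mode, "restricted", False)


def _do_train(result):
    result.append(in_restricted_mode())


def train():
    result = []
    worker = threading.Thread(target=_do_train, args=(result,))
    worker.start()
    worker.join()
    return result[0]


_mode.restricted = True   # stand-in for the host enabling the mode here
seen_in_caller = in_restricted_mode()
seen_in_worker = train()
```

The caller sees the flag set, while the worker thread starts without it.
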
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix only iterated mel_converter._buffers (direct buffers). Submodules
(e.g. Spectrogram.window) still held inference tensors. Switch to .modules()
to cover all nested buffers, matching the vocoder parameter sanitization.
Also add a zeros+copy_ safety net on target_mel output so the conv layer can
save it for backward.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The clips list is built inside ComfyUI's inference_mode context, so every
element is an inference tensor. torch.stack().clone() propagates the flag.
Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor,
so mel_converter(target_flat) inside no_grad produces a saveable input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
param.data.clone() and tensor.detach() on inference tensors both produce
inference tensors — the flag propagates through all operations on them.
Inside inference_mode(False), torch.zeros() creates genuine normal tensors.
Use zeros+copy_ to sanitize both vocoder parameters and mel_converter
buffers once before training, so autograd can save inputs for backward.
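
A minimal sketch of the zeros+copy_ pattern, assuming a PyTorch version that
provides Tensor.is_inference(). The helper name sanitize is hypothetical.

```python
import torch


def sanitize(tensor):
    """Copy an inference tensor into a freshly allocated normal tensor."""
    with torch.inference_mode(False):
        clean = torch.zeros(tensor.shape, dtype=tensor.dtype, device=tensor.device)
        clean.copy_(tensor)
    return clean


with torch.inference_mode():
    weight = torch.ones(3)      # allocated in inference mode
    clean = sanitize(weight)
    flags = (weight.is_inference(), clean.is_inference())

# The sanitized copy can take part in autograd as usual.
clean.requires_grad_(True)
(clean * 2.0).sum().backward()
```
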
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>