torch.stft cannot run on bfloat16 input, so the waveform is cast with
.float(), but the cast was not reversed before the spectrogram hit the
bfloat16 Conv2d weights. Save the original dtype and cast back after abs().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The frozen discriminators are loaded in model dtype (bfloat16) but vocoder
waveform outputs are float32, causing a Conv2d dtype mismatch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Runs a series of BigVGAN fine-tuning experiments from a JSON sweep file.
Audio clips are loaded once, the vocoder is deep-copied per experiment, and
results are collected in experiment_summary.json with comparison loss curves.
Resume-aware: completed experiments are skipped on re-run.
Includes overnight sweep config (8 experiments): snake alpha steps,
GAFilter ablation, phase loss weight, discriminator FM, all_params.
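A minimal sketch of the resume-aware loop, assuming the summary file is a
JSON object keyed by experiment name. The run_sweep and run_one names and
the "name" field are hypothetical, not taken from the scheduler itself.

```python
import json
from pathlib import Path


def run_sweep(experiments, summary_path, run_one):
    """Run each experiment once, skipping names already in the summary file."""
    summary_path = Path(summary_path)
    summary = json.loads(summary_path.read_text()) if summary_path.exists() else {}
    for exp in experiments:
        name = exp["name"]  # hypothetical field identifying the experiment
        if name in summary:
            continue  # finished in an earlier run
        summary[name] = run_one(exp)
        # Write after every experiment so an interrupted sweep can resume.
        summary_path.write_text(json.dumps(summary, indent=2))
    return summary
```

Writing the summary after each experiment, rather than once at the end, is
what lets an interrupted overnight run pick up where it stopped.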
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Training confirmed working: the VRAM usage is normal backward-pass
activation memory, not a leak. Removed all debug _vram_log and _vram
calls. Kept the video_enc offload and torch.cuda.empty_cache fixes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
torch.cuda.memory_allocated only tracks memory held by the PyTorch caching
allocator. Added torch.cuda.mem_get_info to see the device memory usage
reported by the CUDA driver.
Also offload video_enc (TextSynch) which was missed in the original
offload — stays on GPU when strategy != offload_to_cpu.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PyTorch's caching allocator keeps the GPU memory it reserved during
pre-generation (~90 GiB for generator + tod) and doesn't return it to
CUDA/OS. soft_empty_cache may not call torch.cuda.empty_cache(). Force a
full cache release after CLIP encoding and after LoRA mel pre-generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs VRAM at: after target_mel, after vocoder forward, before loss,
after loss computation, and after backward. Only logs for step 0 to
avoid spam. Will identify which operation causes the 94 GiB spike.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Logs torch.cuda.memory_allocated/reserved at each step: before unload,
after unload_all_models, after feature_utils.to(cpu), after generator
to(cpu), after cache clear, after mel_converter to(device), and before
training loop. This will identify what's holding VRAM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
_save_sample("baseline") was called before the vocoder's inference
tensors were sanitized, causing "Inference tensors do not track version
counter". Moved it after the clone/detach loop and vocoder.to(device).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CLIP weights are inference tensors from ComfyUI loading. inference_mode
is thread-local, so the worker thread can't use CLIP even with a context
manager. Pre-compute all text embeddings in the main thread (where
inference_mode IS active), clone+detach to normal tensors, and pass them
to the worker via text_clip_cache dict. CLIP no longer needs to be on
GPU during pre-generation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CLIP weights are inference tensors from ComfyUI loading. The worker
thread runs without inference_mode, so PyTorch rejects inference tensors
in multi_head_attention_forward (version counter tracking). Wrap the
encode_text_clip call in torch.inference_mode() since text encoding
doesn't need gradients.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous offload ran inside the worker thread, but by then ComfyUI
had already loaded the full model to GPU. Now feature_utils.to('cpu')
and generator.to('cpu') run in the main thread right after
unload_all_models(), before the worker starts. vocoder.to(device, dtype)
is called explicitly after inference flag stripping in _do_train to
bring only the vocoder back to GPU.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ref_mel is float32 (from mel_converter) but vocoder weights are bfloat16
before inference flag stripping. Cast mel to vocoder's dtype to prevent
input/bias type mismatch during baseline sample save.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feature_utils.to(device) was loading CLIP ViT-H, synchformer, T5, VAE,
and vocoder (~90 GiB) to GPU for the entire training run. Now only
mel_converter (tiny) is moved to GPU. Pre-generation manages its own
device placement: temporarily moves CLIP and tod to GPU, then moves them
back when done. This frees ~90 GiB for the backward pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only the vocoder and mel_converter are needed during BigVGAN training.
The rest of the SelVA pipeline (CLIP ViT-H, synchformer, T5, generator,
VAE) was staying on GPU and consuming ~90 GiB, leaving no room for
backward pass activations. Now offloaded individually to CPU before
the training loop starts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LoRA mel pre-generation runs a full ODE+CFG for every clip, which is slow.
Cache results to a .pt file next to the output, keyed by a SHA-256 hash
of the LoRA adapter content + generation parameters (seed, steps, CFG,
duration, sample rate, npz file list). Automatically reused on subsequent
runs when parameters haven't changed.
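A minimal sketch of such a cache key, assuming the adapter is hashed as raw
bytes and the parameters as a canonical JSON dump. The cache_key name and
the example values are hypothetical.

```python
import hashlib
import json


def cache_key(adapter_bytes, params):
    """SHA-256 over the adapter content plus a canonical dump of the parameters."""
    h = hashlib.sha256()
    h.update(adapter_bytes)
    # sort_keys makes the digest independent of dict insertion order.
    h.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    return h.hexdigest()


# Hypothetical generation parameters, for illustration only.
params = {"seed": 0, "steps": 25, "cfg": 4.5, "duration": 8.0,
          "sample_rate": 44100, "npz_files": ["a.npz", "b.npz"]}
key = cache_key(b"adapter-bytes", params)
```

Any change to the adapter bytes or to a single parameter yields a different
key, so a stale cache file is never matched.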
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Discriminators are constructed as float32 but receive bfloat16 tensors
from the vocoder. Cast to model dtype on load to prevent conv dtype
mismatch in feature matching loss.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GAFilter conv weights are created as float32 but the rest of the vocoder
is bfloat16. vocoder.to(device) missed the dtype cast, causing conv1d
dtype mismatch when Snake bfloat16 output flows into GAFilter.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mel_converter outputs float32 (cuFFT requirement) but vocoder weights are
bfloat16 from model loading. Cast input_mel back to model dtype before
feeding the vocoder to avoid conv1d dtype mismatch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pre-generated mels were using a bare forward pass with no classifier-free
guidance, producing mels that don't match what the vocoder sees at inference
(where cfg_strength=4.5 is the default). Now uses ode_wrapper with
preprocess_conditions/get_empty_conditions, same as the sampler node.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mel_basis and hann_window buffers inherit bfloat16 from model loading.
Since all mel_converter inputs are cast to float32 for cuFFT, the
internal buffers must also be float32 to avoid matmul dtype mismatch.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cuFFT does not support bfloat16 tensors. When the model is loaded in
bfloat16, all torch.stft calls (mel_converter, discriminator spectrogram,
multi-resolution STFT loss) crash. Add .float() at every STFT boundary.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a lora_adapter path is provided, the trainer pre-generates
LoRA-distorted mels for each training clip (full ODE generation +
VAE decode) and trains the vocoder to produce clean audio from them.
This teaches the vocoder to compensate for LoRA latent distribution
shift without requiring perfectly aligned training pairs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thread init_mode and use_rslora through the scheduler's config parsing,
experiment record, and _train_inner call. Default alpha changed to 2*rank
to match trainer. Add pissa_sweep.json with 7 experiments ablating PiSSA
init vs standard, rsLoRA scaling, and learning rate variations at rank 128.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
LoRA quality improvements addressing the intruder dimension problem:
1. PiSSA initialization (arXiv:2404.02948): init A,B from top-r SVD of
pretrained weight. Starts on-manifold, eliminates intruder dimensions
at init. Base weight stores residual W_res = W - B@A*scale.
2. rsLoRA scaling (arXiv:2312.03732): alpha/sqrt(rank) instead of
alpha/rank. Prevents gradient collapse at high ranks (128+).
3. Post-training Spectral Surgery (arXiv:2603.03995): SVD of trained
LoRA update, gradient-sensitivity reweighting to suppress remaining
intruder dimensions. Runs automatically after training completes.
4. alpha default changed to 2*rank (was 1*rank). Produces fewer intruder
dimensions per arXiv:2410.21228.
5. weight_decay reduced from 1e-2 to 0.0 (standard for LoRA, prevents
erasing learned style weights).
6. random.choices replaced with random.sample when batch_size <= dataset
size (eliminates duplicate samples per batch).
PiSSA checkpoints include base weights (residual). Loader/evaluator
updated to handle both standard and PiSSA checkpoint formats.
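Items 2 and 6 can be sketched in a few lines of plain Python. The function
names lora_scale and pick_batch are hypothetical; only the formulas and the
random.sample / random.choices switch come from the description above.

```python
import math
import random


def lora_scale(alpha, rank, use_rslora):
    """Standard LoRA scales by alpha/rank, rsLoRA by alpha/sqrt(rank)."""
    return alpha / math.sqrt(rank) if use_rslora else alpha / rank


def pick_batch(dataset, batch_size, rng=random):
    """Sample without replacement when the dataset is large enough."""
    if batch_size <= len(dataset):
        return rng.sample(dataset, batch_size)  # no duplicates in the batch
    return rng.choices(dataset, k=batch_size)  # duplicates are unavoidable here
```

With alpha = 2*rank, the standard scale is always 2.0 while the rsLoRA scale
grows as 2*sqrt(rank), so the two diverge more at rank 128 than at low ranks.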
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
inject_gafilters creates Conv1d modules on CPU. load_state_dict
preserves existing param devices but GAFilter params stay on CPU,
causing device mismatch during vocode. Save target device before
injection, then move entire vocoder after loading.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
x1_pred is an inference tensor (computed from inference-mode weights
loaded by ComfyUI). generator.unnormalize() uses in-place mul_/add_
which fails on inference tensors. Clone strips the flag.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous check (next(feature_utils_orig.parameters()).device) only
inspected the first parameter (from CLIP), missing CPU-stranded vocoder
weights when the module was in a mixed-device state.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
References were stored in normalized flow-matching space
(net_generator.normalize(z_sample)) but the style loss compares against
unnormalize(x) which is in VAE latent space. The optimizer was minimizing
L1 between tensors at different scales, pushing the ODE endpoint out of
distribution and producing noise.
Fix: store reference latents in VAE space (z_sample directly) so both
ref_mean/ref_gram and x_un are in the same coordinate system.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The std clamp was post-hoc and only addressed magnitude, not direction.
x0 was drifting to mean=-0.55/std=3.1 (ODE expected mean=0/std=1).
Replace with anchor_weight * MSE(x0, x0_init) added directly to the loss.
The optimizer now balances style matching against staying near the initial
N(0,1) noise — gradient-aware, prevents both magnitude and mean drift.
Also logs style/anchor losses and x0_std per step for diagnostics.
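A plain-Python sketch of the combined objective, with lists standing in for
tensors. The total_loss and mse names are hypothetical; the formula is the
one stated above.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(style_loss, x0, x0_init, anchor_weight):
    """Style term plus a penalty for drifting away from the initial noise."""
    return style_loss + anchor_weight * mse(x0, x0_init)
```

Because the anchor term is part of the loss, its gradient pulls x0 back
toward x0_init on every step instead of correcting it after the fact.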
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Optimized x0 was reaching std=2.72 vs expected ~1.0 for flow matching.
An out-of-distribution initial condition maps to white noise in the output.
After each step, rescale x0 back toward unit std if it exceeds 1.5.
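A sketch of the clamp on a plain list, assuming population std and a rescale
all the way to unit std (the commit only says "toward"). The clamp_std name
and the target_std parameter are hypothetical.

```python
import statistics


def clamp_std(x0, max_std=1.5, target_std=1.0):
    """Rescale x0 to target_std when its std exceeds max_std, else leave it."""
    std = statistics.pstdev(x0)
    if std <= max_std:
        return list(x0)
    scale = target_std / std  # uniform scaling changes magnitude only
    return [v * scale for v in x0]
```

A uniform rescale of this kind corrects the spread but leaves any mean
offset in place, scaled by the same factor.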
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mel_converter outputs float32 (cuFFT requirement), but VAE encoder weights
are bfloat16. Cast mel to dtype before encode to avoid type mismatch.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
need_vae_encoder=False was deleting the encoder to save a small amount of VRAM.
DITTO now needs it to encode reference clips to latent space for style loss.
The spectrogram VAE encoder is small enough that the overhead is negligible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause of white noise: backpropagating through vae.decode produces
unstable gradients — the VAE decoder was designed for inference only.
Fix: encode reference clips to VAE latent space once (no grad), compute
mean + Gram matrix statistics there, and compute style loss directly on
net_generator.unnormalize(x) — a single differentiable linear operation.
The gradient path is now: loss → x (unnormalized) → ODE → x0, with no
decoder in the backward pass.
Also adds VAE encoder availability check (fails cleanly if encoder was
deleted to save VRAM).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
White noise on output was caused by the Gram matrix loss pushing the latent
into incoherent regions. Now gram_weight defaults to 0 (mean spectrum only)
and style_weight defaults to 0.1 instead of 1.0. Users can enable Gram
gradually once mean-only optimization converges cleanly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
optimize() already returns (_result[0],) to wrap the result for ComfyUI.
_do_optimize was returning (dict,) instead of dict, causing double-wrapping:
((dict,),). ComfyUI then received a tuple as audio and failed on
audio["waveform"].
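The double-wrapping can be reproduced without ComfyUI. The two worker
functions below are hypothetical stand-ins for _do_optimize before and after
the fix.

```python
def do_optimize_buggy():
    audio = {"waveform": [0.0], "sample_rate": 44100}
    return (audio,)  # already wrapped once


def do_optimize_fixed():
    return {"waveform": [0.0], "sample_rate": 44100}


def optimize(worker):
    result = [worker()]  # stand-in for the _result list
    return (result[0],)  # the single wrap ComfyUI expects
```

With the buggy worker, optimize(...)[0] is a tuple, so indexing it with
"waveform" raises TypeError; with the fixed worker it is the audio dict.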
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ref_mean and ref_gram are float32 (mel computed via cuFFT which requires
float32). mel_gen is bfloat16. F.l1_loss(bfloat16, float32) promotes to
float32, producing a float32 loss. loss.backward() then pushes float32
gradients through bfloat16 ops → 'Found dtype Float but expected BFloat16'.
Fix: clone().detach().to(dtype) at the start of _do_optimize.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feature_utils.decode and autoencoder.decode are both decorated with
@torch.inference_mode(), which unconditionally destroys grad_fn on all
outputs — making loss.backward() fail with 'does not require grad'.
Fix: call feature_utils.tod.vae.decode() directly, which has no decorator
and is fully differentiable. Transpose matches the original wrapper signature.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
_unnorm_decode was wrapped in checkpoint(use_reentrant=False) to avoid saving
inference-mode weight tensors during backward. Since _strip_inference() now
cleans all params/buffers before any forward pass, the checkpoint is no longer
needed and was silently breaking the gradient chain from mel_gen back to x0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: net_generator/feature_utils/mel_converter parameters were loaded
in ComfyUI's inference_mode; operations on inference tensors propagate the flag,
so conditions computed from tainted weights were also tainted. checkpoint()
with use_reentrant=False then failed trying to save inference tensors during
the backward recompute pass.
Fix: _strip_inference() clones all params/buffers of all three models before
any forward pass, and _clone_nested() cleans any residual inference flags in
the conditions/empty_conditions output tensors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Drop SelvaFlashSR node — audiosr pins numpy<=1.23.5 which cannot build
on Python 3.12 (pkgutil.ImpImporter removed); use Saganaki22/ComfyUI-AudioSR instead
- BigVGAN trainer now writes <output_stem>_training_log.csv alongside the
checkpoint: step, total, fm, mel, stft, phase, l2sp columns, line-buffered
so loss can be tailed live during training
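A minimal sketch of a line-buffered CSV log with the columns listed above.
The open_training_log name is hypothetical; buffering=1 is Python's standard
way to request line buffering for a text file.

```python
import csv

COLUMNS = ["step", "total", "fm", "mel", "stft", "phase", "l2sp"]


def open_training_log(path):
    """Open a CSV log that flushes after every row and write the header."""
    # buffering=1 selects line buffering, so each row reaches disk promptly.
    f = open(path, "w", newline="", buffering=1)
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    return f, writer
```

Each writer.writerow(...) call is then visible to tail -f without waiting
for the file to be closed.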
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Activation1d's anti-alias Kaiser sinc resampling (asymmetric pad_left /
pad_right) can leave the output 1-2 samples longer or shorter than the
input in edge cases, causing the BigVGAN AMPBlock residual addition
(xt + x) to fail with a size mismatch.
Trim or pad the output to exactly match the input length so the resblock
skip connection always has matching dimensions.
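The length fix, sketched on a plain list standing in for the time axis of a
tensor. The match_length name is hypothetical, and right-side zero padding
is an assumption; the commit does not say which side or value is used.

```python
def match_length(samples, target_len, pad_value=0.0):
    """Trim or right-pad samples so the result has exactly target_len items."""
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [pad_value] * (target_len - len(samples))
```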
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Use device="auto" for audiosr.build_model — safer than passing a device
string that may not be accepted in all audiosr versions.
Remove unused tmp_out temp file that was created but never written to.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
L2-SP anchors trainable params to their pretrained values. GAFilter is a
newly initialized module (identity FIR filter) with no pretrained values —
anchoring it to identity initialization would resist learning. Exclude
gafilter params from the L2-SP loss so they train freely.
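A plain-Python sketch of the exclusion, with parameters as name-to-list
dicts. The l2sp_loss name and matching on the substring "gafilter" are
assumptions about how the parameters are identified.

```python
def l2sp_loss(params, pretrained, exclude_substring="gafilter"):
    """Sum of squared distances to pretrained values, skipping excluded names."""
    total = 0.0
    for name, value in params.items():
        if exclude_substring in name:
            continue  # newly initialized module: nothing to anchor to
        ref = pretrained[name]
        total += sum((v - r) ** 2 for v, r in zip(value, ref))
    return total
```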
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel
depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN
residual blocks. Initialized as identity so training starts from pretrained
behaviour.
- inject_gafilters() walks resblocks.*.activations and wraps each Activation1d
with _ActivationWithGAFilter — weights appear in vocoder.state_dict() automatically
- Trained alongside Snake alphas in snake_alpha_only mode
- Checkpoint saves has_gafilter + gafilter_kernel_size metadata
- Loader detects metadata and injects before load_state_dict so weights populate correctly
- Controlled by use_gafilter (default True) and gafilter_kernel_size (default 9)
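Why an identity-initialized FIR filter leaves the pretrained behaviour
untouched can be shown on a plain list. This is a sketch for one channel with
zero "same" padding; the function names and the odd-kernel requirement are
assumptions.

```python
def identity_kernel(kernel_size):
    """FIR taps that pass the signal through unchanged: 1 at the center."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd to have a center tap")
    taps = [0.0] * kernel_size
    taps[kernel_size // 2] = 1.0
    return taps


def fir_same(signal, kernel):
    """Cross-correlate with zero padding so output length equals input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(signal))]
```

With identity_kernel(9), fir_same returns its input, so the first training
step starts from the unmodified vocoder output.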
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds L1 loss on real, imaginary, and magnitude STFT components across
three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing
directly — magnitude-only losses cannot distinguish correct spectrum
with wrong phase from a smeared spectrum.
Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top
of both the discriminator FM path and the fallback mel+STFT path.
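A small numeric illustration of the point about magnitude-only losses, using
two complex STFT bins instead of three full resolutions. Equal weighting of
the three terms and the function names are assumptions.

```python
import cmath


def l1(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def magnitude_loss(pred, target):
    return l1([abs(z) for z in pred], [abs(z) for z in target])


def phase_aware_loss(pred, target):
    """L1 on real, imaginary, and magnitude components."""
    real = l1([z.real for z in pred], [z.real for z in target])
    imag = l1([z.imag for z in pred], [z.imag for z in target])
    return real + imag + magnitude_loss(pred, target)


# Same magnitudes, phases shifted by 1 radian.
target = [cmath.rect(1.0, 0.0), cmath.rect(2.0, 0.5)]
shifted = [cmath.rect(1.0, 1.0), cmath.rect(2.0, 1.5)]
```

Here magnitude_loss(shifted, target) is zero up to rounding, while
phase_aware_loss(shifted, target) is clearly positive.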
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>