Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel
depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN
residual blocks. Initialized as identity so training starts from pretrained
behaviour.
- inject_gafilters() walks resblocks.*.activations and wraps each Activation1d
with _ActivationWithGAFilter — weights appear in vocoder.state_dict() automatically
- Trained alongside Snake alphas in snake_alpha_only mode
- Checkpoint saves has_gafilter + gafilter_kernel_size metadata
- Loader detects metadata and injects before load_state_dict so weights populate correctly
- Controlled by use_gafilter (default True) and gafilter_kernel_size (default 9)
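
A minimal sketch of the identity-initialized depthwise filter and the wrapping idea, assuming an odd kernel size and zero padding. The class names DepthwiseFIR and ActivationWithFilter are hypothetical stand-ins, not the actual GAFilter or _ActivationWithGAFilter code, and nn.Tanh stands in for Snake/Activation1d.

```python
import torch
import torch.nn as nn


class DepthwiseFIR(nn.Module):
    """Hypothetical stand-in for GAFilter: one learnable FIR kernel per channel."""

    def __init__(self, channels: int, kernel_size: int = 9):
        super().__init__()
        if kernel_size % 2 == 0:
            raise ValueError("kernel_size must be odd so the impulse can be centred")
        # groups=channels makes the convolution depthwise (no channel mixing).
        self.conv = nn.Conv1d(
            channels, channels, kernel_size,
            padding=kernel_size // 2, groups=channels, bias=False,
        )
        # Identity init: a unit impulse at the centre tap passes the input through.
        with torch.no_grad():
            self.conv.weight.zero_()
            self.conv.weight[:, 0, kernel_size // 2] = 1.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


class ActivationWithFilter(nn.Module):
    """Wraps an existing activation and filters its output."""

    def __init__(self, activation: nn.Module, channels: int, kernel_size: int = 9):
        super().__init__()
        self.activation = activation
        self.gafilter = DepthwiseFIR(channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gafilter(self.activation(x))


wrapped = ActivationWithFilter(nn.Tanh(), channels=4)
x = torch.randn(2, 4, 100)
# At initialization the wrapper behaves like the bare activation.
identity_ok = torch.allclose(wrapped(x), torch.tanh(x), atol=1e-6)
# Because the filter is a submodule, its weights show up in state_dict().
filter_keys = [k for k in wrapped.state_dict() if "gafilter" in k]
```

Because the filter is registered as a submodule of the wrapper, its weights are saved and loaded with the rest of the model, which is why the loader has to inject the wrappers before calling load_state_dict.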
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds L1 loss on the real, imaginary, and magnitude STFT components across
three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing
directly: a magnitude-only loss cannot tell a correct spectrum with wrong
phase apart from a smeared spectrum.
Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top
of both the discriminator FM path and the fallback mel+STFT path.
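
A rough sketch of such a loss, to show why the real and imaginary terms matter. The (n_fft, hop) pairs and the function names are assumptions for illustration; the actual resolutions and weighting in the trainer may differ.

```python
import torch

# Assumed (n_fft, hop) pairs; the commit only says "three resolutions".
RESOLUTIONS = ((512, 128), (1024, 256), (2048, 512))


def _stft(wav: torch.Tensor, n_fft: int, hop: int) -> torch.Tensor:
    window = torch.hann_window(n_fft, device=wav.device, dtype=wav.dtype)
    return torch.stft(wav, n_fft, hop_length=hop, window=window, return_complex=True)


def magnitude_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Magnitude-only L1, blind to phase."""
    total = pred.new_zeros(())
    for n_fft, hop in RESOLUTIONS:
        total = total + (_stft(pred, n_fft, hop).abs() - _stft(target, n_fft, hop).abs()).abs().mean()
    return total / len(RESOLUTIONS)


def complex_stft_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 on real, imaginary and magnitude parts, averaged over resolutions."""
    total = pred.new_zeros(())
    for n_fft, hop in RESOLUTIONS:
        p, t = _stft(pred, n_fft, hop), _stft(target, n_fft, hop)
        total = total + (p.real - t.real).abs().mean()
        total = total + (p.imag - t.imag).abs().mean()
        total = total + (p.abs() - t.abs()).abs().mean()
    return total / len(RESOLUTIONS)


torch.manual_seed(0)
wav = torch.randn(1, 8192)
flipped = -wav  # same magnitude spectrum, phase shifted by pi

mag_only = magnitude_loss(flipped, wav)    # cannot see the phase error
full = complex_stft_loss(flipped, wav)     # penalizes it
```

The polarity-flipped signal is an extreme case chosen only because it keeps the magnitude spectrum identical while changing every phase value.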
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cuFFT does not support bfloat16. mel_converter was being moved to device
without an explicit dtype, inheriting bfloat16 from the model context.
Force float32 for both mel_converter.to() and wav.to() so the STFT
inside the mel converter runs in a supported dtype.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Mild 2:1-3:1 parallel compression via pedalboard.Compressor to reduce
within-clip loudness variance after LUFS normalization. Blend ratio
keeps transients intact while tightening dynamics.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves all clips in an AUDIO_DATASET to FLAC. When npz_source_dir is
provided, copies the matching .npz for each clip so FLAC/NPZ pairs
stay in sync after the inspector filters out bad clips.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torchaudio was defaulting to the torchcodec backend, which requires FFmpeg
shared libraries that are not present in the ComfyUI venv. Every clip load
failed and was silently skipped, producing an empty dataset.
Also add experiments/vocoder_finetune.json for the BJ vocoder LoRA run
(lr=3e-4, rank=128, 10k steps).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If no matching key was found for MPD or MRD in the checkpoint, the for-loops
completed silently and randomly-initialized discriminators were used as frozen
feature extractors — producing meaningless feature matching loss while
appearing to work. Now raises RuntimeError (caught by outer except) which
triggers the existing fallback to mel+STFT losses with a clear warning.
Also prints available checkpoint keys to help diagnose format mismatches.
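
The fail-loudly pattern can be sketched in plain Python. The key names ("mpd", "discriminator_mpd", and so on) and both function names are hypothetical; only the control flow mirrors the description above.

```python
def find_discriminator_weights(checkpoint: dict, candidate_keys: tuple, name: str) -> dict:
    """Return the first matching sub-dict, or raise instead of silently
    leaving the discriminator randomly initialized."""
    for key in candidate_keys:
        if key in checkpoint:
            return checkpoint[key]
    raise RuntimeError(
        f"no {name} weights found; tried {list(candidate_keys)}, "
        f"available checkpoint keys: {sorted(checkpoint)}"
    )


def load_discriminators(checkpoint: dict) -> dict:
    try:
        mpd = find_discriminator_weights(checkpoint, ("mpd", "discriminator_mpd"), "MPD")
        mrd = find_discriminator_weights(checkpoint, ("mrd", "discriminator_mrd"), "MRD")
        return {"mode": "feature_matching", "mpd": mpd, "mrd": mrd}
    except RuntimeError as err:
        # Outer handler: warn clearly and fall back to spectral losses.
        print(f"[warning] {err}; falling back to mel+STFT losses")
        return {"mode": "mel_stft"}


good = load_discriminators({"mpd": {"w": 1}, "mrd": {"w": 2}})
bad = load_discriminators({"generator": {}})
```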
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug 1: mono.unfold(0, 2048, 512) returns an empty tensor for clips shorter
than 2048 samples (~46ms). torch.quantile on an empty tensor crashes with
"quantile() input tensor must be non-empty". Guard: return 60.0 (assume
clean) for clips too short to frame — the pipeline has no minimum-length
filter so any short file in the dataset folder would crash the Inspector.
Bug 2: torch.linspace(...) in _check_hf_shelf created a CPU tensor, making
band_lo/band_hi CPU boolean masks. Indexing a GPU mag_sq tensor with CPU
masks crashes. Pass device=mono.device so freqs lands on the same device
as the audio.
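
Both guards in sketch form. The SNR estimate itself is a hypothetical placeholder (the Inspector's real metric is not shown here); only the length check before framing and the device= argument correspond to the two fixes.

```python
import torch


def estimate_snr_db(mono: torch.Tensor, frame: int = 2048, hop: int = 512) -> float:
    """Hypothetical frame-energy SNR estimate with the short-clip guard."""
    if mono.shape[-1] < frame:
        return 60.0  # too short to frame: assume clean rather than crash
    frames = mono.unfold(0, frame, hop)          # [n_frames, frame]
    energy = frames.pow(2).mean(dim=1)
    quiet = torch.quantile(energy, 0.1)
    loud = torch.quantile(energy, 0.9)
    return float(10.0 * torch.log10((loud + 1e-12) / (quiet + 1e-12)))


def band_mask(mono: torch.Tensor, n_bins: int, sample_rate: int,
              lo_hz: float, hi_hz: float) -> torch.Tensor:
    # device=mono.device keeps the mask on the same device as the audio,
    # so it can index spectra computed from that audio.
    freqs = torch.linspace(0.0, sample_rate / 2, n_bins, device=mono.device)
    return (freqs >= lo_hz) & (freqs < hi_hz)


short_clip = torch.randn(1000)
long_clip = torch.randn(44100)
snr_short = estimate_snr_db(short_clip)
snr_long = estimate_snr_db(long_clip)
mask = band_mask(long_clip, n_bins=1025, sample_rate=44100, lo_hz=8000.0, hi_hz=16000.0)
```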
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Hardcoded max of 4.0 s prevented using full 8 s clips. Raised to 30 s.
Also bumped the default from 1.0 s to 2.0 s as a more sensible starting point.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
BigVGAN's 512x upsampling stack stores huge intermediate activations for
backward even in snake_alpha_only mode (only 5K trainable params, but
activation graph runs through the full network after each snake op).
Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations
during backward instead of storing them — ~2x compute cost, large reduction
in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM.
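
A small illustration of the wrapping, using a toy network in place of the vocoder (the layer sizes are arbitrary). It only checks that gradients are unchanged; the memory saving does not show up at this scale.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Toy stand-in for the vocoder; the real model is far deeper.
net = nn.Sequential(
    nn.Conv1d(8, 8, 3, padding=1), nn.Tanh(),
    nn.Conv1d(8, 1, 3, padding=1),
)
mel = torch.randn(2, 8, 64)


def grads(use_checkpoint: bool):
    net.zero_grad()
    if use_checkpoint:
        # Activations inside net are recomputed during backward, not stored.
        out = checkpoint(net, mel, use_reentrant=False)
    else:
        out = net(mel)
    out.abs().mean().backward()
    return [p.grad.clone() for p in net.parameters()]


plain = grads(use_checkpoint=False)
ckpt = grads(use_checkpoint=True)
grads_match = all(torch.allclose(a, b) for a, b in zip(plain, ckpt))
```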
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs:
1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator
checkpoint has channels=128. All convs in _DiscriminatorR now use 128,
matching the checkpoint architecture so state_dict loads without error.
2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in
VRAM during training (~90 GiB used). Add unload_all_models() + cache
flush before the training loop to reclaim VRAM headroom.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two crash paths under "RuntimeError: Inference tensors cannot be
saved for backward":
1. clip_f / sync_f loaded from main-thread inference_mode carry the
inference flag. Clone them on entry to the worker thread so the
conditions built from them are clean non-inference tensors.
Also clone x after Phase 1 before the STE reconnection — Phase 1
runs under no_grad and produces outputs that may still carry the
flag through the conditions path.
2. net_generator.unnormalize + feature_utils.decode called outside
any checkpoint wrapper with requires_grad=True input. Backward
tried to save inference-flagged model weights. Wrapped both calls
in checkpoint(use_reentrant=False) so they recompute on backward
instead of storing activations.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
DITTO critical bug: x was reassigned on every ODE step, so by the time
loss.backward() ran, x pointed to the final output tensor (which has a
grad_fn and is not a leaf) and x.grad was always None. The manual gradient
transfer never fired, so x0 was never updated and the optimization was a no-op.
Fix: use a straight-through estimator after the no-grad prefix:
x = x + (x0 - x0.detach())
This leaves the value of x unchanged but creates a grad_fn back to x0, so
backward() propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to
x0.grad. Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I.
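
The straight-through trick in isolation, with toy stand-ins for the two phases (the arithmetic inside each phase is made up; only the reconnection line matches the fix):

```python
import torch

torch.manual_seed(0)
x0 = torch.randn(4, requires_grad=True)   # the latent being optimized (a leaf)

with torch.no_grad():
    x = x0 * 2.0 + 1.0                    # stand-in for the no-grad Phase 1 steps
prefix_value = x.clone()

# Straight-through reconnection: value is unchanged, gradient flows to x0.
x = x + (x0 - x0.detach())

loss = (x ** 2).sum()                     # stand-in for Phase 2 plus the loss
loss.backward()

value_unchanged = torch.equal(x.detach(), prefix_value)
# d(loss)/dx = 2x, passed to x0 as if d(x_prefix)/d(x0) were the identity.
grad_matches = torch.allclose(x0.grad, 2.0 * prefix_value)
```

Note that the true Jacobian of this toy prefix is 2·I, so the gradient reaching x0 is deliberately approximate, which is the trade-off the truncated-BPTT remark describes.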
Also remove unused imports (SelvaSampler, _inject_tokens, random) that
risked cascading ImportErrors, and remove the dead trainable_count variable
in the BigVGAN trainer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Writes _gt_spec.png from ref_mel before training starts so each step's
_spec.png can be compared against the unmodified vocoder roundtrip target.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample
call now writes both a .wav and a _spec.png so training progress is visible
without listening. Colour map is blue→green→yellow (viridis-ish), low
frequencies at the bottom.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same environment has no compatible ffmpeg/torchcodec for saving.
Mirror the _load_wav pattern: try torchaudio, fall back to soundfile.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
remove_parametrizations() stores weight as a plain __dict__ tensor (not
nn.Parameter), making it invisible to _parameters iteration. Also, buffers
(Activation1d anti-aliasing filters) are inference tensors that break the
backward graph mid-network. Fix all three categories:
1. _parameters: clone().detach(), wrap as Parameter
2. plain __dict__ tensors: clone(), register_parameter (also makes trainable)
3. _buffers: clone() to strip inference flag without parametrizing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.inference_mode is thread-local, but the inference flag lives on the
tensor object. Operations on inference tensors always propagate it, even in
a clean thread. The only escape is .clone() called outside inference_mode.
At thread entry (inference_mode disabled): clone clips and mel_converter
buffers to get clean normal tensors before any training computation.
Vocoder parameter clone() also now works correctly in this thread context.
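
The flag behaviour described above can be checked directly with Tensor.is_inference(). A minimal sketch with toy tensors in place of the clips:

```python
import torch

with torch.inference_mode():
    clip = torch.randn(4)            # created under inference_mode: inference tensor
    cloned_inside = clip.clone()     # cloning here still yields an inference tensor

cloned_outside = clip.clone()        # cloned with inference_mode off: normal tensor

weight = torch.ones(4, requires_grad=True)
loss = (weight * cloned_outside).sum()   # autograd has to save cloned_outside
loss.backward()

flags = (
    clip.is_inference(),
    cloned_inside.is_inference(),
    cloned_outside.is_inference(),
)
```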
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.inference_mode is thread-local. ComfyUI sets it on the node-execution
thread; inference_mode(False) alone is insufficient to escape it in some
environments (e.g. async wrappers, lora-manager hook). A new thread always
starts clean. Moved all training logic into _do_train() called via
threading.Thread so every tensor is a normal autograd tensor by default.
Simplified parameter cloning: clone().detach().requires_grad_(True).
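
A small check of the thread-local behaviour, assuming nothing beyond stock PyTorch and the standard threading module. The worker function is a placeholder for _do_train():

```python
import threading

import torch

seen = {}


def worker():
    # A new thread starts with inference mode off, whatever the caller set.
    seen["worker"] = torch.is_inference_mode_enabled()
    seen["worker_tensor"] = torch.ones(2).is_inference()


with torch.inference_mode():
    seen["caller"] = torch.is_inference_mode_enabled()
    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
```

Tensors created in the worker are normal tensors, but tensors handed over from the calling thread keep their inference flag and still need to be cloned inside the worker.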
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previous fix only iterated mel_converter._buffers (direct buffers). Submodules
(e.g. Spectrogram.window) still held inference tensors. Switch to .modules()
to cover all nested buffers, matching the vocoder parameter sanitization.
Also add a zeros+copy_ safety net on target_mel output so conv can save it.
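
A sketch of walking nested buffers, assuming it is called with inference mode off. sanitize_buffers and the two toy module classes are hypothetical, and plain clone() is used here rather than the zeros+copy_ variant.

```python
import torch
import torch.nn as nn


def sanitize_buffers(module: nn.Module) -> list:
    """Replace every buffer, including nested ones, with a fresh clone.
    Must run with inference mode disabled so the clones are normal tensors."""
    replaced = []
    for mod_name, sub in module.named_modules():
        for buf_name, buf in list(sub.named_buffers(recurse=False)):
            setattr(sub, buf_name, buf.clone())   # stays registered as a buffer
            replaced.append(f"{mod_name}.{buf_name}".lstrip("."))
    return replaced


class Spectrogram(nn.Module):          # toy submodule holding a window buffer
    def __init__(self, window):
        super().__init__()
        self.register_buffer("window", window)


class MelConverter(nn.Module):         # toy converter with a direct and a nested buffer
    def __init__(self, window, mel_basis):
        super().__init__()
        self.spectrogram = Spectrogram(window)
        self.register_buffer("mel_basis", mel_basis)


with torch.inference_mode():
    window = torch.hann_window(16)
    basis = torch.rand(4, 9)

converter = MelConverter(window, basis)
before = [b.is_inference() for b in converter.buffers()]
names = sanitize_buffers(converter)
after = [b.is_inference() for b in converter.buffers()]
```

Iterating only the top-level buffers would have cleaned mel_basis and missed spectrogram.window, which is the gap this commit closes.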
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The clips list is built inside ComfyUI's inference_mode context, so every
element is an inference tensor. torch.stack().clone() propagates the flag.
Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor,
so mel_converter(target_flat) inside no_grad produces a saveable input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
param.data.clone() and tensor.detach() on inference tensors both produce
inference tensors — the flag propagates through all operations on them.
Inside inference_mode(False), torch.zeros() creates genuine normal tensors.
Use zeros+copy_ to sanitize both vocoder parameters and mel_converter
buffers once before training, so autograd can save inputs for backward.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
param.data = clone() only replaces storage — the nn.Parameter object itself
retains the inference tensor flag set when the model was loaded. Replace each
parameter with a fresh nn.Parameter(data.clone()) created inside
inference_mode(False) so both the object and its data are normal tensors.
Move optimizer creation to after re-creation so it references the new objects.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The vocoder is loaded inside ComfyUI's torch.inference_mode(), making all
its parameters inference tensors. Autograd cannot save inference tensors
for backward even with requires_grad=True. Clone all parameters inside
torch.inference_mode(False) before training to get normal tensors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
mel_converter buffers (mel_basis, hann_window) are inference tensors
because the model was loaded inside ComfyUI's torch.inference_mode().
Operations on them propagate the flag to outputs. Clone both target_mel
and pred_mel to get normal autograd-compatible tensors. .clone() is
differentiable so the grad graph to vocoder parameters is preserved.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clips loaded outside torch.inference_mode(False) are inference tensors.
Autograd cannot save them for backward. .clone() creates a normal tensor,
same fix pattern as selva_lora_trainer's dist.mode().clone().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Model loaded in bf16 causes mel_basis buffer to be bf16. Audio loaded
from disk is float32, causing matmul dtype mismatch. Cast all audio
tensors to model["dtype"] before passing to mel_converter/vocoder.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
A torchcodec/libavutil soname mismatch makes torchaudio fail on every
file load, leaving the clips list silently empty. Add _load_wav(), which
tries torchaudio first and then falls back to soundfile (handles wav/flac
without ffmpeg).
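
The fallback shape, sketched with injected loader callables so it runs without any audio library installed. load_with_fallback and both fake loaders are hypothetical; in the real _load_wav the two entries would presumably wrap torchaudio and soundfile.

```python
def load_with_fallback(path, loaders):
    """Try each (name, loader) pair in order and return the first success.
    Raises if every backend fails, so failures are never silent."""
    errors = []
    for name, loader in loaders:
        try:
            return loader(path)
        except Exception as err:  # a broken backend can fail in many ways
            errors.append(f"{name}: {err}")
    raise RuntimeError(f"could not load {path}: " + "; ".join(errors))


def broken_backend(path):
    # Stand-in for a backend whose shared libraries are missing.
    raise OSError("libavutil not found")


def working_backend(path):
    # Stand-in for a pure wav/flac reader; returns (samples, sample_rate).
    return [0.0, 0.1, -0.1], 44100


samples, rate = load_with_fallback(
    "clip.flac", [("primary", broken_backend), ("fallback", working_backend)]
)

try:
    load_with_fallback("clip.flac", [("primary", broken_backend)])
    all_failed_raises = False
except RuntimeError:
    all_failed_raises = True
```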
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After the finally block, offload_to_cpu moves the vocoder to CPU while
ref_mel stays on GPU. Fix: detect vocoder's current device via
next(vocoder.parameters()).device and move ref_mel there before vocoding.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves baseline.wav (ground truth roundtrip before training), stepN.wav
at each save_every checkpoint, and final.wav after training completes.
All use the same fixed reference segment (clip 0, position 0) for
direct comparison across checkpoints.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
44k variants use BigVGANv2 directly as the vocoder (no wrapper, no
@inference_mode decorator), accessible at feature_utils.tod.vocoder.
16k wraps BigVGANVocoder inside BigVGAN, accessed at .vocoder.vocoder.
Both trainer and loader now branch on model["mode"].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spectral-loss-only fine-tuning of the BigVGAN vocoder (mel→waveform)
on BJ audio clips. DiT and VAE are completely frozen. Losses: mel L1
reconstruction + multi-resolution STFT magnitude L1 (same three
resolutions as the BigVGAN discriminator config). Saves in
{'generator': state_dict} format compatible with the original BigVGAN
checkpoint. Loader replaces vocoder weights in the loaded SELVA_MODEL
in-place so no full model reload is needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two improvements for stronger steering effect:
1. Apply steering only during the conditional predict_flow pass by
monkey-patching predict_flow to set a flag via identity check
(cond is conditions). Hooks skip the unconditional pass, so
steering is amplified by cfg_strength (~4.5x) instead of canceling
out in the CFG guidance term.
2. Restore per-position [seq, hidden] steering vectors instead of
seq-averaged [hidden]. More spatially specific — captures positional
activation patterns rather than a global mean. Seq length mismatch
at inference time handled via linear interpolation.
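
One way the sequence-length mismatch could be handled, as a sketch. resize_steering is a hypothetical helper, and align_corners=True is an assumption chosen so the first and last positions are preserved.

```python
import torch
import torch.nn.functional as F


def resize_steering(vec: torch.Tensor, target_len: int) -> torch.Tensor:
    """Linearly resample a [seq, hidden] steering vector to [target_len, hidden]."""
    if vec.shape[0] == target_len:
        return vec
    # F.interpolate expects [batch, channels, length]; hidden acts as channels.
    stretched = F.interpolate(
        vec.t().unsqueeze(0), size=target_len, mode="linear", align_corners=True
    )
    return stretched.squeeze(0).t()


vec = torch.arange(30.0).reshape(10, 3)    # seq=10, hidden=3
resized = resize_steering(vec, 25)
```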
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements per-block DiT activation steering as an alternative to textual
inversion. Extractor runs frozen generator on dataset with BJ vs empty
conditions, records mean hidden-state delta per block, saves [hidden_dim]
vectors (seq-averaged so they broadcast to any inference duration). Loader
reads the bundle. Sampler registers forward hooks during the ODE that add
strength × vec to each block output, cleaned up in a finally block.
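
The hook lifecycle can be sketched as follows. run_with_steering is a hypothetical helper and nn.Identity blocks stand in for DiT blocks; only register_forward_hook and handle.remove() are real PyTorch API.

```python
import torch
import torch.nn as nn


def run_with_steering(blocks, vectors, strength, forward_fn):
    """Add strength * vec to each block's output while forward_fn runs."""
    handles = []
    try:
        for block, vec in zip(blocks, vectors):
            # vec=vec binds the current vector; returning a value from a
            # forward hook replaces the block's output.
            def hook(_module, _inputs, output, vec=vec):
                return output + strength * vec
            handles.append(block.register_forward_hook(hook))
        return forward_fn()
    finally:
        # Always remove hooks, even if sampling raises.
        for handle in handles:
            handle.remove()


blocks = [nn.Identity(), nn.Identity()]
model = nn.Sequential(*blocks)
x = torch.zeros(1, 5, 4)                          # [batch, seq, hidden]
vectors = [torch.ones(4), 2.0 * torch.ones(4)]    # [hidden], broadcast over seq

steered = run_with_steering(blocks, vectors, 0.5, lambda: model(x))
unsteered = model(x)                              # hooks are gone again
```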
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TI via text conditioning produces buzz because SelVA's text path is
mean-pooled into a global DiT bias — not rich per-token cross-attention
like SD. The optimizer learns a constant spectral artifact rather than
semantic style shift.
ti_strength=1.0 (default) = full injection as before.
ti_strength<1.0 = lerp between original and injected text_clip,
allowing the effect to be dialled back without retraining.
Applies to both text_clip and neg_text_clip symmetrically.
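
The blend is a plain linear interpolation. A sketch with toy tensors (blend_text_clip is a hypothetical name and the shapes are made up):

```python
import torch


def blend_text_clip(original: torch.Tensor, injected: torch.Tensor,
                    ti_strength: float) -> torch.Tensor:
    # 0.0 keeps the original embedding, 1.0 gives the fully injected one.
    return torch.lerp(original, injected, ti_strength)


original = torch.zeros(1, 4, 8)
injected = torch.ones(1, 4, 8)

off = blend_text_clip(original, injected, 0.0)
half = blend_text_clip(original, injected, 0.5)
full = blend_text_clip(original, injected, 1.0)
```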
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diagnosis: learned tokens grew to norm ~3.2 while real CLIP content tokens
sit at ~1.0. Model never trained on embeddings that large — activates buzz
artifact instead of semantic style shift.
Fix: measure mean token norm from content positions (1–20) of dataset CLIP
embeddings at startup, clamp learned_tokens per-token after every optimizer
step to max 1.5× that reference (50% headroom). Token norm is now logged
as current/limit for easy monitoring.
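
A sketch of the per-token clamp, assuming it is applied right after optimizer.step(). clamp_token_norms is a hypothetical name, and the reference norm of 1.0 is hard-coded here instead of measured from the dataset.

```python
import torch
import torch.nn as nn


def clamp_token_norms(tokens: torch.Tensor, limit: float) -> torch.Tensor:
    """Rescale any token whose L2 norm exceeds limit; leave the rest untouched."""
    with torch.no_grad():
        norms = tokens.norm(dim=-1, keepdim=True)
        scale = (limit / norms.clamp_min(1e-8)).clamp(max=1.0)
        tokens.mul_(scale)
        return tokens.norm(dim=-1)


reference_norm = 1.0            # would be measured from content positions
limit = 1.5 * reference_norm    # 50% headroom

learned_tokens = nn.Parameter(torch.tensor([[3.0, 4.0], [0.3, 0.4]]))  # norms 5.0, 0.5
norms_after = clamp_token_norms(learned_tokens, limit)
print(f"token norm {norms_after.max().item():.2f}/{limit:.2f}")
```

Rescaling preserves each token's direction, so the clamp limits magnitude without changing what the token points at.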
ti_sweep_1.json: rebuild around norm_clamp group — n4_clamped (primary
diagnostic), prefix_clamped, n8_prefix_clamped, warm_clamped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
n4_baseline showed token_norm growing linearly without plateau — classic
sign of lr too high relative to parameter count. With only K×1024 params,
gradient signal per param is already high-magnitude; high lr causes
overshoot rather than convergence.
- Default lr: 1e-3 → 2e-4 (matches LoRA working regime)
- Default batch_size: 16 → 4 (more diverse gradients, helps norm saturate)
- ti_sweep_1.json: add lr_batch group (lr_low_b4, lr_mid_b8,
lr_low_b4_prefix, lr_2e3), restructure with clearer groups,
annotate n4_baseline as completed with findings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the comparison PNG was only written at the very end of the sweep,
so an interrupted run produced no image at all. Now _save_comparison() is
called right after _write_summary() for every successful experiment, keeping
loss_comparison.png current throughout the sweep.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>