Commit Graph

52 Commits

Author SHA1 Message Date
Ethanfel 4297715a08 debug: add driver-level VRAM reporting + offload video_enc
torch.cuda.memory_allocated only tracks PyTorch allocator. Added
torch.cuda.mem_get_info to see actual CUDA driver memory usage.
Also offload video_enc (TextSynch) which was missed in the original
offload — stays on GPU when strategy != offload_to_cpu.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:48:04 +02:00
Ethanfel 9af4bbdd91 fix: force torch.cuda.empty_cache() after pre-generation and CLIP encoding
PyTorch's caching allocator reserves GPU memory from pre-generation
(~90 GiB for generator + tod) and doesn't return it to CUDA/OS.
soft_empty_cache may not call torch.cuda.empty_cache(). Force a full
cache release after CLIP encoding and after LoRA mel pre-generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:42:45 +02:00
Ethanfel 89d6fccd28 debug: add per-operation VRAM logging in first training step
Logs VRAM at: after target_mel, after vocoder forward, before loss,
after loss computation, and after backward. Only logs for step 0 to
avoid spam. Will identify which operation causes the 94 GiB spike.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:35:54 +02:00
Ethanfel bd84242fa1 debug: add VRAM logging at offload and training checkpoints
Logs torch.cuda.memory_allocated/reserved at each step: before unload,
after unload_all_models, after feature_utils.to(cpu), after generator
to(cpu), after cache clear, after mel_converter to(device), and before
training loop. This will identify what's holding VRAM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:28:31 +02:00
Ethanfel 5a2c003fb2 fix: move baseline sample after inference flag stripping
_save_sample("baseline") was called before the vocoder's inference
tensors were sanitized, causing "Inference tensors do not track version
counter". Moved it after the clone/detach loop and vocoder.to(device).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:26:11 +02:00
Ethanfel f8d4d77b0d fix: pre-compute text CLIP embeddings in main thread to avoid inference tensor crash
CLIP weights are inference tensors from ComfyUI loading. inference_mode
is thread-local, so the worker thread can't use CLIP even with a context
manager. Pre-compute all text embeddings in the main thread (where
inference_mode IS active), clone+detach to normal tensors, and pass them
to the worker via text_clip_cache dict. CLIP no longer needs to be on
GPU during pre-generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:19:44 +02:00
Ethanfel 32e5344ea2 fix: wrap CLIP encoding in inference_mode during pre-generation
CLIP weights are inference tensors from ComfyUI loading. The worker
thread runs without inference_mode, so PyTorch rejects inference tensors
in multi_head_attention_forward (version counter tracking). Wrap the
encode_text_clip call in torch.inference_mode() since text encoding
doesn't need gradients.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:10:58 +02:00
Ethanfel 10a71b0c4f fix: offload entire model to CPU in main thread before worker starts
The previous offload ran inside the worker thread, but by then ComfyUI
had already loaded the full model to GPU. Now feature_utils.to('cpu')
and generator.to('cpu') run in the main thread right after
unload_all_models(), before the worker starts. vocoder.to(device, dtype)
is called explicitly after inference flag stripping in _do_train to
bring only the vocoder back to GPU.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:56:13 +02:00
Ethanfel 37a27160aa fix: match mel dtype to vocoder in baseline sample generation
ref_mel is float32 (from mel_converter) but vocoder weights are bfloat16
before inference flag stripping. Cast mel to vocoder's dtype to prevent
input/bias type mismatch during baseline sample save.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:45:31 +02:00
Ethanfel cb9a1eef01 fix: stop loading full feature_utils to GPU before training
feature_utils.to(device) was loading CLIP ViT-H, synchformer, T5, VAE,
and vocoder (~90 GiB) to GPU for the entire training run. Now only
mel_converter (tiny) is moved to GPU. Pre-generation manages its own
device placement: temporarily moves CLIP and tod to GPU, then moves them
back when done. This frees ~90 GiB for the backward pass.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:44:38 +02:00
Ethanfel d70c611bf7 fix: offload CLIP, synchformer, T5, generator, VAE to CPU before training
Only the vocoder and mel_converter are needed during BigVGAN training.
The rest of the SelVA pipeline (CLIP ViT-H, synchformer, T5, generator,
VAE) was staying on GPU and consuming ~90 GiB, leaving no room for
backward pass activations. Now offloaded individually to CPU before
the training loop starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:33:07 +02:00
Ethanfel 4e6cc4d519 feat: cache pre-generated LoRA mels to disk for reuse
LoRA mel pre-generation runs a full ODE+CFG for every clip, which is slow.
Cache results to a .pt file next to the output, keyed by a SHA-256 hash
of the LoRA adapter content + generation parameters (seed, steps, CFG,
duration, sample rate, npz file list). Automatically reused on subsequent
runs when parameters haven't changed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:30:20 +02:00
Ethanfel 0854bd2638 fix: cast discriminators to model dtype to match vocoder output
Discriminators are constructed as float32 but receive bfloat16 tensors
from the vocoder. Cast to model dtype on load to prevent conv dtype
mismatch in feature matching loss.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:25:04 +02:00
Ethanfel 187b2e3169 fix: cast GAFilter to model dtype after injection
GAFilter conv weights are created as float32 but the rest of the vocoder
is bfloat16. vocoder.to(device) missed the dtype cast, causing conv1d
dtype mismatch when Snake bfloat16 output flows into GAFilter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:24:11 +02:00
Ethanfel 608746ce7b fix: cast input mel to model dtype before vocoder forward pass
mel_converter outputs float32 (cuFFT requirement) but vocoder weights are
bfloat16 from model loading. Cast input_mel back to model dtype before
feeding the vocoder to avoid conv1d dtype mismatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:18:05 +02:00
Ethanfel bba5aec7a5 fix: add CFG to LoRA mel pre-generation to match inference conditions
Pre-generated mels were using a bare forward pass with no classifier-free
guidance, producing mels that don't match what the vocoder sees at inference
(where cfg_strength=4.5 is the default). Now uses ode_wrapper with
preprocess_conditions/get_empty_conditions, same as the sampler node.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:17:16 +02:00
Ethanfel d06936802b fix: cast mel_converter buffers to float32 to match STFT input dtype
mel_basis and hann_window buffers inherit bfloat16 from model loading.
Since all mel_converter inputs are cast to float32 for cuFFT, the
internal buffers must also be float32 to avoid matmul dtype mismatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:10:52 +02:00
Ethanfel bee518a855 fix: cast all STFT inputs to float32 to prevent cuFFT bfloat16 crash
cuFFT does not support bfloat16 tensors. When the model is loaded in
bfloat16, all torch.stft calls (mel_converter, discriminator spectrogram,
multi-resolution STFT loss) crash. Add .float() at every STFT boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 23:53:36 +02:00
Ethanfel 48b72c0be0 feat: add LoRA mel pre-generation to BigVGAN vocoder trainer
When a lora_adapter path is provided, the trainer pre-generates
LoRA-distorted mels for each training clip (full ODE generation +
VAE decode) and trains the vocoder to produce clean audio from them.
This teaches the vocoder to compensate for LoRA latent distribution
shift without requiring perfectly aligned training pairs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-09 23:26:36 +02:00
Ethanfel 8ccc2438e4 fix: remove FlashSR (audiosr incompatible with Python 3.12), add training loss CSV
- Drop SelvaFlashSR node — audiosr pins numpy<=1.23.5 which cannot build
  on Python 3.12 (pkgutil.ImpImporter removed); use Saganaki22/ComfyUI-AudioSR instead
- BigVGAN trainer now writes <output_stem>_training_log.csv alongside the
  checkpoint: step, total, fm, mel, stft, phase, l2sp columns, line-buffered
  so loss can be tailed live during training

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 17:18:34 +02:00
Ethanfel 8371466e44 fix: guarantee length preservation in _ActivationWithGAFilter
Activation1d's anti-alias Kaiser sinc resampling (asymmetric pad_left /
pad_right) can produce ±1-2 sample rounding in edge cases, causing the
BigVGAN AMPBlock residual addition (xt + x) to fail with a size mismatch.

Trim or pad the output to exactly match the input length so the resblock
skip connection always has matching dimensions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:39:03 +02:00
Ethanfel 45fced55bc fix: exclude GAFilter params from L2-SP regularization
L2-SP anchors trainable params to their pretrained values. GAFilter is a
newly initialized module (identity FIR filter) with no pretrained values —
anchoring it to identity initialization would resist learning. Exclude
gafilter params from the L2-SP loss so they train freely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:19:52 +02:00
Ethanfel db112394e8 feat: add AF-Vocoder GAFilter to BigVGAN trainer and loader
Implements AF-Vocoder GAFilter (Interspeech 2025): learnable per-channel
depthwise FIR filter inserted after each Snake/Activation1d in BigVGAN
residual blocks. Initialized as identity so training starts from pretrained
behaviour.

- inject_gafilters() walks resblocks.*.activations and wraps each Activation1d
  with _ActivationWithGAFilter — weights appear in vocoder.state_dict() automatically
- Trained alongside Snake alphas in snake_alpha_only mode
- Checkpoint saves has_gafilter + gafilter_kernel_size metadata
- Loader detects metadata and injects before load_state_dict so weights populate correctly
- Controlled by use_gafilter (default True) and gafilter_kernel_size (default 9)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:15:14 +02:00
Ethanfel c53ea5517c feat: add FA-GAN phase-aware STFT loss to BigVGAN trainer
Adds L1 loss on real, imaginary, and magnitude STFT components across
three resolutions (FA-GAN, arXiv:2407.04575). Penalizes phase smearing
directly — magnitude-only losses cannot distinguish correct spectrum
with wrong phase from a smeared spectrum.

Controlled by lambda_phase (default 1.0, 0 = disabled). Applied on top
of both the discriminator FM path and the fallback mel+STFT path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 16:09:31 +02:00
Ethanfel b9f95cfd7e fix: detect silent discriminator load failure and fall back explicitly
If no matching key was found for MPD or MRD in the checkpoint, the for-loops
completed silently and randomly-initialized discriminators were used as frozen
feature extractors — producing meaningless feature matching loss while
appearing to work. Now raises RuntimeError (caught by outer except) which
triggers the existing fallback to mel+STFT losses with a clear warning.
Also prints available checkpoint keys to help diagnose format mismatches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 14:39:55 +02:00
Ethanfel 2b10205657 fix: raise segment_seconds max from 4s to 30s
Hardcoded max of 4.0 prevented using full 8s clips. Raised to 30s.
Also bumped default from 1.0 to 2.0 as a more sensible starting point.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:49:50 +02:00
Ethanfel 8166c56552 perf: gradient checkpointing on vocoder forward to reduce activation memory
BigVGAN's 512x upsampling stack stores huge intermediate activations for
backward even in snake_alpha_only mode (only 5K trainable params, but
activation graph runs through the full network after each snake op).

Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations
during backward instead of storing them — ~2x compute cost, large reduction
in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:45:24 +02:00
Ethanfel eece79ccae fix: correct MRD channel width to 128 and unload models before training
Two bugs:

1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator
   checkpoint has channels=128. All convs in _DiscriminatorR now use 128,
   matching the checkpoint architecture so state_dict loads without error.

2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in
   VRAM during training (~90 GiB used). Add unload_all_models() + cache
   flush before the training loop to reclaim VRAM headroom.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 13:40:01 +02:00
Ethanfel 211494a91c fix: DITTO gradient never reached x0, remove unused imports and dead code
DITTO critical bug: x was reassigned on every ODE step, so by the time
loss.backward() ran, x pointed to the final output tensor (grad_fn, not
a leaf) and x.grad was always None. The manual gradient transfer never
fired — x0 was never updated. The optimization was a no-op.

Fix: use a straight-through estimator after the no-grad prefix:
  x = x + (x0 - x0.detach())
This adds zero value but creates a grad_fn back to x0, so backward()
propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to x0.grad.
Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I.

Also remove unused imports (SelvaSampler, _inject_tokens, random) that
caused cascade ImportError risk, and remove dead trainable_count variable
in BigVGAN trainer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:10:02 +02:00
Ethanfel 1e9551152e feat: add DITTO optimizer, upgrade BigVGAN trainer, document all nodes
BigVGAN trainer (selva_bigvgan_trainer.py):
- Add snake_alpha_only train mode: tunes only ~27K per-channel α params
  (0.024% of 112M) — physically cannot cause harmonic smearing
- Add lambda_l2sp: L2-SP anchor regularization toward pretrained weights
- Add optional discriminator_path: frozen MPD+MRD feature matching loss
  replaces mel L1 when a BigVGAN discriminator checkpoint is provided
- Inline MPD + MRD discriminator implementations (no extra dependencies)

DITTO optimizer (selva_ditto_optimizer.py):
- New node: inference-time noise optimization (arXiv:2401.12179)
- Optimizes x₀ via mel Gram matrix style loss against BJ reference clips
- All model weights frozen — zero quality degradation risk
- Truncated BPTT through last n_grad_steps of the ODE (configurable)
- Gradient checkpointing on each differentiated step

Docs:
- README: document all 20 nodes (was 3), add workflow diagrams
- STYLE_TRANSFER.md: new guide — DITTO, vocoder fine-tuning tiers,
  why LoRA/TI fail, combined approach, dataset prep

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 12:04:05 +02:00
Ethanfel f17f6f0863 feat: save ground truth spectrogram once for direct comparison
Writes _gt_spec.png from ref_mel before training starts so each step's
_spec.png can be compared against the unmodified vocoder roundtrip target.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 03:05:47 +02:00
Ethanfel 304d9d01bf feat: save mel spectrogram PNG alongside each eval sample
Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample
call now writes both a .wav and a _spec.png so training progress is visible
without listening. Colour map is blue→green→yellow (viridis-ish), low
frequencies at the bottom.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 03:03:28 +02:00
Ethanfel 0128a81cc2 fix: use full first clip for eval samples instead of 1s segment
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 03:01:52 +02:00
Ethanfel 710261f5be fix: add soundfile fallback for torchaudio.save in sample writing
Same environment has no compatible ffmpeg/torchcodec for saving.
Mirror the _load_wav pattern: try torchaudio, fall back to soundfile.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:58:07 +02:00
Ethanfel 5df2abd6dd fix: handle all three inference-tensor sources in vocoder sanitization
remove_parametrizations() stores weight as a plain __dict__ tensor (not
nn.Parameter), making it invisible to _parameters iteration. Also, buffers
(Activation1d anti-aliasing filters) are inference tensors that break the
backward graph mid-network. Fix all three categories:
1. _parameters: clone().detach(), wrap as Parameter
2. plain __dict__ tensors: clone(), register_parameter (also makes trainable)
3. _buffers: clone() to strip inference flag without parametrizing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:54:41 +02:00
Ethanfel b243908873 debug: inspect conv_pre parametrizations and _parameters keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:46:16 +02:00
Ethanfel 9df855ee0e debug: print is_inference() status before failing conv_pre call
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:41:51 +02:00
Ethanfel 78f8aa98ad fix: clone inference tensors at thread entry to strip the inference flag
torch.inference_mode is thread-local, but the inference flag lives on the
tensor object. Operations on inference tensors always propagate it, even in
a clean thread. The only escape is .clone() called outside inference_mode.
At thread entry (inference_mode disabled): clone clips and mel_converter
buffers to get clean normal tensors before any training computation.
Vocoder parameter clone() also now works correctly in this thread context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:35:48 +02:00
Ethanfel e870446b0f fix: run BigVGAN training in a fresh thread to escape inference_mode
torch.inference_mode is thread-local. ComfyUI sets it on the node-execution
thread; inference_mode(False) alone is insufficient to escape it in some
environments (e.g. async wrappers, lora-manager hook). A new thread always
starts clean. Moved all training logic into _do_train() called via
threading.Thread so every tensor is a normal autograd tensor by default.
Simplified parameter cloning: clone().detach().requires_grad_(True).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:30:53 +02:00
Ethanfel df63b147e9 fix: sanitize all submodule buffers of mel_converter + guarantee target_mel output
Previous fix only iterated mel_converter._buffers (direct buffers). Submodules
(e.g. Spectrogram.window) still held inference tensors. Switch to .modules()
to cover all nested buffers, matching the vocoder parameter sanitization.
Also add a zeros+copy_ safety net on target_mel output so conv can save it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:14:12 +02:00
Ethanfel 51ac099073 fix: sanitize target_flat — clips are inference tensors from outer inference_mode
The clips list is built inside ComfyUI's inference_mode context, so every
element is an inference tensor. torch.stack().clone() propagates the flag.
Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor,
so mel_converter(target_flat) inside no_grad produces a saveable input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:09:26 +02:00
Ethanfel b7565ec458 fix: sanitize inference tensors in BigVGAN trainer via zeros+copy_ pattern
param.data.clone() and tensor.detach() on inference tensors both produce
inference tensors — the flag propagates through all operations on them.
Inside inference_mode(False), torch.zeros() creates genuine normal tensors.
Use zeros+copy_ to sanitize both vocoder parameters and mel_converter
buffers once before training, so autograd can save inputs for backward.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 02:05:36 +02:00
Ethanfel 0fcb6d3106 fix(bigvgan-trainer): replace parameter objects to fully strip inference tensor flag
param.data = clone() only replaces storage — the nn.Parameter object itself
retains the inference tensor flag set when the model was loaded. Replace each
parameter with a fresh nn.Parameter(data.clone()) created inside
inference_mode(False) so both the object and its data are normal tensors.
Move optimizer creation to after re-creation so it references the new objects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:58:57 +02:00
Ethanfel c86306bde8 fix(bigvgan-trainer): clone vocoder parameters to strip inference tensor flag
The vocoder is loaded inside ComfyUI's torch.inference_mode(), making all
its parameters inference tensors. Autograd cannot save inference tensors
for backward even with requires_grad=True. Clone all parameters inside
torch.inference_mode(False) before training to get normal tensors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:55:16 +02:00
Ethanfel f04d59fe63 fix(bigvgan-trainer): clone mel outputs to strip inference tensor flag from buffers
mel_converter buffers (mel_basis, hann_window) are inference tensors
because the model was loaded inside ComfyUI's torch.inference_mode().
Operations on them propagate the flag to outputs. Clone both target_mel
and pred_mel to get normal autograd-compatible tensors. .clone() is
differentiable so the grad graph to vocoder parameters is preserved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:51:28 +02:00
Ethanfel daa36a5f7b fix(bigvgan-trainer): clone target tensor to exit inference mode before backward
Clips loaded outside torch.inference_mode(False) are inference tensors.
Autograd cannot save them for backward. .clone() creates a normal tensor,
same fix pattern as selva_lora_trainer's dist.mode().clone().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:47:47 +02:00
Ethanfel 16e20b30ce fix(bigvgan-trainer): cast audio to model dtype to match bf16 mel_converter buffers
Model loaded in bf16 causes mel_basis buffer to be bf16. Audio loaded
from disk is float32, causing matmul dtype mismatch. Cast all audio
tensors to model["dtype"] before passing to mel_converter/vocoder.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:46:01 +02:00
Ethanfel ea7dfed27a fix(bigvgan-trainer): fallback to soundfile when torchaudio ffmpeg backend fails
torchcodec/libavutil soname mismatch causes torchaudio to fail on every
file load, silently emptying clips. Add _load_wav() that tries torchaudio
first then falls back to soundfile (handles wav/flac without ffmpeg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:41:59 +02:00
Ethanfel 81ff0d46c9 fix(bigvgan-trainer): resolve device mismatch in _save_sample after offload
After the finally block, offload_to_cpu moves the vocoder to CPU while
ref_mel stays on GPU. Fix: detect vocoder's current device via
next(vocoder.parameters()).device and move ref_mel there before vocoding.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:35:07 +02:00
Ethanfel 9fdeb65182 feat(bigvgan-trainer): add eval samples at checkpoints and end
Saves baseline.wav (ground truth roundtrip before training), stepN.wav
at each save_every checkpoint, and final.wav after training completes.
All use the same fixed reference segment (clip 0, position 0) for
direct comparison across checkpoints.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-09 01:30:34 +02:00