fix: offload CLIP, synchformer, T5, generator, VAE to CPU before training

Only the vocoder and mel_converter are needed during BigVGAN training.
The rest of the SelVA pipeline (CLIP ViT-H, synchformer, T5, generator,
VAE) stayed on the GPU and consumed ~90 GiB, leaving no room for
backward-pass activations. Each of these components is now moved to
CPU before the training loop starts.
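
The per-attribute offload described above can be sketched in isolation
as follows. This is a minimal illustration, not the code in this commit:
`offload_to_cpu`, `_StubModule` and `_StubFeatureUtils` are hypothetical
names, and the stub only mimics the PyTorch-style `.to(device)` call so
the sketch runs without PyTorch or a GPU.

```python
from typing import Any, Iterable, List


def offload_to_cpu(owner: Any, attrs: Iterable[str]) -> List[str]:
    """Move each named submodule of `owner` to CPU, skipping missing ones.

    Works with any object exposing a PyTorch-style `.to(device)` method.
    Returns the names that were actually moved.
    """
    moved = []
    for name in attrs:
        sub = getattr(owner, name, None)
        if sub is not None:
            sub.to("cpu")
            moved.append(name)
    return moved


class _StubModule:
    """Hypothetical stand-in for a module so the sketch needs no PyTorch."""

    def __init__(self, device: str = "cuda") -> None:
        self.device = device

    def to(self, device: str) -> "_StubModule":
        self.device = device
        return self


class _StubFeatureUtils:
    """Hypothetical container; one component is deliberately not loaded."""

    def __init__(self) -> None:
        self.clip_model = _StubModule()
        self.synchformer = _StubModule()
        self.text_encoder_t5 = None  # component not loaded


feature_utils = _StubFeatureUtils()
moved = offload_to_cpu(
    feature_utils,
    ("clip_model", "synchformer", "text_encoder_t5", "missing"),
)
```

Attributes that are absent or set to None are skipped, so the same call
is safe when only part of the pipeline has been loaded.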

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 00:33:07 +02:00
parent 4e6cc4d519
commit d70c611bf7
+16
@@ -820,6 +820,22 @@ class SelvaBigvganTrainer:
"files with matching audio files."
)
# Offload heavy SelVA components to CPU — only vocoder + mel_converter
# are needed for training. CLIP, synchformer, T5, generator sit on
# GPU doing nothing and eat tens of GiB otherwise.
for attr in ("clip_model", "synchformer", "text_encoder_t5"):
    sub = getattr(feature_utils, attr, None)
    if sub is not None:
        sub.to("cpu")
if "generator" in model:
    model["generator"].to("cpu")
# tod contains VAE + vocoder; VAE not needed but vocoder is a
# submodule we're about to train — move just the VAE part.
tod = feature_utils.tod
if hasattr(tod, "vae"):
    tod.vae.to("cpu")
soft_empty_cache()
_result[0] = _do_train(
vocoder, mel_converter, clips,
device, dtype, strategy, feature_utils,