fix: offload CLIP, synchformer, T5, generator, VAE to CPU before training
Only the vocoder and mel_converter are needed during BigVGAN training. The rest of the SelVA pipeline (CLIP ViT-H, synchformer, T5, generator, VAE) stayed on the GPU and consumed ~90 GiB, leaving no room for backward-pass activations. These components are now moved to the CPU individually before the training loop starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -820,6 +820,22 @@ class SelvaBigvganTrainer:
             "files with matching audio files."
         )

+        # Offload heavy SelVA components to CPU — only vocoder + mel_converter
+        # are needed for training. CLIP, synchformer, T5, generator sit on
+        # GPU doing nothing and eat tens of GiB otherwise.
+        for attr in ("clip_model", "synchformer", "text_encoder_t5"):
+            sub = getattr(feature_utils, attr, None)
+            if sub is not None:
+                sub.to("cpu")
+        if "generator" in model:
+            model["generator"].to("cpu")
+        # tod contains VAE + vocoder; VAE not needed but vocoder is a
+        # submodule we're about to train — move just the VAE part.
+        tod = feature_utils.tod
+        if hasattr(tod, "vae"):
+            tod.vae.to("cpu")
+        soft_empty_cache()
+
         _result[0] = _do_train(
             vocoder, mel_converter, clips,
             device, dtype, strategy, feature_utils,
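
The offload pattern in this hunk (look up each optional submodule by name, skip it if absent, otherwise call `.to("cpu")`) can be sketched in isolation as below. This is a minimal illustration under stated assumptions, not the code from this commit: `offload_attrs`, `FakeModule`, and `FakeFeatureUtils` are hypothetical names, and `FakeModule` only mimics the in-place `.to(device)` behaviour of `torch.nn.Module` so the sketch runs without PyTorch or a GPU.

```python
from typing import Any, Iterable, List


def offload_attrs(owner: Any, attrs: Iterable[str], device: str = "cpu") -> List[str]:
    """Move each named attribute of `owner` that exists to `device`.

    Hypothetical helper, not part of the SelVA code base. Returns the
    names that were actually moved so the caller can log them.
    """
    moved = []
    for name in attrs:
        sub = getattr(owner, name, None)
        # Skip attributes that are missing, unset, or have no .to() method.
        if sub is not None and hasattr(sub, "to"):
            sub.to(device)
            moved.append(name)
    return moved


class FakeModule:
    """Stand-in for torch.nn.Module that only records its device."""

    def __init__(self, device: str = "cuda") -> None:
        self.device = device

    def to(self, device: str) -> "FakeModule":
        # Like nn.Module.to, this changes the object in place and returns it.
        self.device = device
        return self


class FakeFeatureUtils:
    """Assumed shape of the feature_utils object, for illustration only."""

    def __init__(self) -> None:
        self.clip_model = FakeModule()
        self.synchformer = FakeModule()
        self.text_encoder_t5 = None  # e.g. not loaded in this configuration
        self.vocoder = FakeModule()  # needed for training, must stay put


if __name__ == "__main__":
    fu = FakeFeatureUtils()
    print(offload_attrs(fu, ("clip_model", "synchformer", "text_encoder_t5")))
    print(fu.vocoder.device)
```

Listing the attributes to move, rather than moving the whole container, is what keeps the vocoder on the GPU. The hunk does the same for `tod`, where only `tod.vae` is moved, and then calls `soft_empty_cache()` after the moves.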