fix: transpose VAE latent from [B,C,T] to [B,T,C] before generator

The VAE encoder returns channels-first [B, latent_dim, T], but the
generator expects time-first [B, T, latent_dim] (the same convention as
decode, which already applies .transpose(1, 2)). Fixes the size mismatch
in normalize().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:08:00 +02:00
parent 6b9adf0816
commit 43f732f904
2 changed files with 4 additions and 2 deletions
+2 -1
@@ -338,7 +338,8 @@ class SelvaLoraTrainer:
# encode_audio is @inference_mode — .clone() exits inference mode
audio_b = audio.unsqueeze(0).to(device)
dist = vae_utils.encode_audio(audio_b)
-x1 = dist.mode().clone().cpu()
+# VAE outputs [B, latent_dim, T]; generator expects [B, T, latent_dim]
+x1 = dist.mode().clone().transpose(1, 2).cpu()
# Text → CLIP features (reuse already-loaded CLIP from inference model)
text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
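
Review note: the layout problem this hunk fixes can be reproduced without the model. The sketch below is illustrative only. It uses plain nested lists in place of tensors, and `shape`, `transpose_1_2`, and this `normalize` are hypothetical stand-ins, not the project's functions. It assumes that normalization is per channel along the last axis, which is what would produce a size mismatch on a channels-first latent.

```python
# Hypothetical stand-ins: nested lists replace tensors, and normalize()
# below is NOT the project's implementation, only a shape-checking sketch.

def shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)


def transpose_1_2(x):
    """Swap axes 1 and 2: [B][C][T] -> [B][T][C] (like tensor.transpose(1, 2))."""
    return [[list(frame) for frame in zip(*sample)] for sample in x]


def normalize(x, mean, std):
    """Per-channel normalization that expects time-first input [B][T][C]."""
    if shape(x)[-1] != len(mean):
        raise ValueError(
            f"last axis is {shape(x)[-1]}, expected latent_dim={len(mean)}"
        )
    return [
        [[(v - m) / s for v, m, s in zip(frame, mean, std)] for frame in sample]
        for sample in x
    ]


B, LATENT_DIM, T = 1, 3, 5

# Channels-first latent [B, latent_dim, T]; channel c holds the value c.
latent_bct = [
    [[float(c) for _ in range(T)] for c in range(LATENT_DIM)] for _ in range(B)
]
mean = [0.0, 1.0, 2.0]
std = [1.0, 1.0, 1.0]

# Passing the channels-first latent straight through fails the size check.
try:
    normalize(latent_bct, mean, std)
    mismatch = False
except ValueError:
    mismatch = True

# After swapping axes 1 and 2 the last axis is latent_dim and the call succeeds.
latent_btc = transpose_1_2(latent_bct)
normalized = normalize(latent_btc, mean, std)
```

With real tensors, the equivalent step is the single `.transpose(1, 2)` call added in this hunk. A shape assertion next to it would make a future layout regression fail early instead of surfacing inside normalize().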