d83632e754
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames, 8s → 64 frames), so mixed-length datasets would cause torch.stack() to fail during batching.

Normalize the CLIP and sync sequences to seq_cfg.clip_seq_len / sync_seq_len at load time, the same way latents are already normalized to latent_seq_len.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
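
The length normalization described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the helper name `normalize_seq_len`, the feature width of 512, and the choice to pad by repeating the last frame are all assumptions, since the message does not say how short sequences are extended.

```python
import torch


def normalize_seq_len(features: torch.Tensor, target_len: int) -> torch.Tensor:
    """Truncate or pad a (T, D) feature tensor to exactly target_len frames.

    Hypothetical helper. Padding repeats the last frame, which is an
    assumption of this sketch; zero-padding would be equally plausible.
    """
    num_frames = features.shape[0]
    if num_frames == 0:
        raise ValueError("cannot normalize an empty sequence")
    if num_frames >= target_len:
        # Longer (or equal) sequences are cut to the target length.
        return features[:target_len]
    # Shorter sequences are extended by repeating the final frame.
    pad = features[-1:].expand(target_len - num_frames, *features.shape[1:])
    return torch.cat([features, pad], dim=0)


# Illustrative shapes only: a 2 s clip (16 frames) and an 8 s clip (64 frames).
short_clip = torch.randn(16, 512)
long_clip = torch.randn(64, 512)
clip_seq_len = 64  # stand-in for seq_cfg.clip_seq_len

# Without normalization, torch.stack([short_clip, long_clip]) raises a
# RuntimeError because the tensors differ in their first dimension.
batch = torch.stack(
    [normalize_seq_len(x, clip_seq_len) for x in (short_clip, long_clip)]
)
```

Doing this at load time means every sample leaves the dataset with the same shape, so the default collate path can stack them without a custom collate function.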