Commit Graph

13 Commits

Author SHA1 Message Date
Ethanfel 57fae4a8ce chore: default timestep_mode back to uniform
logit_normal reaches lower loss but perceptual improvement over uniform
is dataset-dependent. Keeping uniform as default to match original MMAudio
training behavior; logit_normal remains available as an option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:21:08 +02:00
Ethanfel d83632e754 fix: pad/trim clip and sync features to fixed seq_len at dataset load time
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
same as latents are already normalized to latent_seq_len.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:54:05 +02:00
Ethanfel a5014e49eb feat: add logit-normal timestep sampling to reduce white noise artifacts
Uniform timestep sampling undertrained t>0.8 (the final denoising steps),
leaving residual noise that CFG amplifies at inference. Logit-normal sampling
concentrates training near t=0.5 while still covering the full range, improving
high-t coverage and reducing noise floor in generated audio.

Default changed from uniform to logit_normal (sigma=1.0). Previous behavior
available with timestep_mode=uniform.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:35:42 +02:00
Ethanfel 8ae0ba3c7d fix: increment adapter_final filename on resume to avoid overwriting previous final
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:15:31 +02:00
Ethanfel 09b3b94ddd feat: add batch_size parameter to training (default 4)
Replaces single-sample steps with batched sampling via random.choices().
Tensors are stacked to [B, T, C] before the forward pass; t is now [B].
Default grad_accum lowered to 1 since real batching gives stable gradients.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 23:36:12 +02:00
Ethanfel 4806daa4ca chore: lower default warmup_steps from 500 to 100
500 warmup steps is 25% of a 2000-step run — too long. 100 steps lets
the full lr kick in much earlier without sacrificing stability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:51:27 +02:00
Ethanfel ad57432803 fix: pad/trim latent to exact latent_seq_len after VAE encoding
STFT hop-size rounding produces ±1 latent frame vs the expected seq length.
Clamp to seq_cfg.latent_seq_len after transpose so generator.forward assertion passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:12:20 +02:00
Ethanfel 43f732f904 fix: transpose VAE latent from [B,C,T] to [B,T,C] before generator
VAE encoder returns channels-first [B, latent_dim, T]; the generator
expects time-first [B, T, latent_dim] (same convention as decode which
already does .transpose(1,2)). Fixes normalize() size mismatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:08:00 +02:00
Ethanfel 2f4641247a feat: add resume support to train_lora.py
Step checkpoints now save optimizer state, scheduler state, and step
number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt
to continue training from that checkpoint. --steps always means total steps,
so resuming from 1000 with --steps 2000 trains 1000 more steps.

adapter_final.pt format is unchanged (state_dict + meta only) so
SelvaLoraLoader remains compatible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:59:30 +02:00
Ethanfel c88e27742c fix: sanitize name field and remove double load_npz call
- _resolve_named_path: replace / \ and null in name to prevent path
  traversal outside cache_dir (would cause a confusing FileNotFoundError
  at np.savez time instead of at path resolution).
- train_lora: load_npz was called twice per clip when prompt was in
  prompts.txt; consolidate to a single call before prompt resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:30:25 +02:00
Ethanfel 1eb82d8050 refactor: train_lora accepts .npz + audio pairs instead of raw video
- Input is now pre-extracted .npz files (from SelvaFeatureExtractor) paired
  with clean audio files (same stem). Visual features no longer re-extracted
  during training.
- FeaturesUtils loaded with enable_conditions=False (VAE only) — Synchformer
  and T5 are no longer loaded, saving ~3-4 GB VRAM.
- CLIP text encoder loaded separately via patch_clip so text prompt can differ
  from the one used during feature extraction.
- Prompt priority: prompts.txt override > embedded in .npz > directory name.
- Removed: torchvision video loading, frame sampling/resizing, net_video_enc,
  synchformer path check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:14:26 +02:00
Ethanfel cde280049b fix: correct LoRALinear dtype and remove unused import
- LoRALinear now creates lora_A/lora_B with dtype matching the base
  linear's weight, preventing a float32/bf16 mismatch at forward time
  when the generator is loaded in bf16 or fp16.
- Remove unused `import math` from train_lora.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:57:09 +02:00
Ethanfel 437c62b28f feat: LoRA fine-tuning for SelVA generator
Teaches the model new/partial sound classes from custom video+audio pairs.
Only ~10 MB of adapter weights are trained vs ~4.4 GB for the full model.

selva_core/model/lora.py
  LoRALinear: wraps nn.Linear with frozen base + trainable A/B matrices.
  B initialised to zero → zero adapter contribution at init.
  apply_lora(): walks named_modules, replaces matching nn.Linear in-place.
  Default target: "attn.qkv" (all 21 SelfAttention QKV projections in
  large_44k). Add "linear1" to also wrap post-attention output projections.
  get_lora_state_dict() / load_lora() for ~10 MB save/load.

train_lora.py (standalone script, no ComfyUI dependency)
  Data format: directory of video files + optional prompts.txt
  ("filename: description"). Falls back to directory name as prompt.
  Pre-extracts features for all clips into RAM, then trains from those.
  Training loop: encode audio→latent (need_vae_encoder=True), flow
  matching MSE loss on velocity prediction, backward on LoRA params only.
  Saves adapter_stepNNNNN.pt checkpoints + adapter_final.pt with metadata.
  Key verified interfaces used:
    encode_audio() → DiagonalGaussianDistribution; .mode().clone() required
    normalize() is in-place
    forward(latent, clip_f, sync_f, text_f, t) takes raw tensors

nodes/selva_lora_loader.py (SelVA LoRA Loader ComfyUI node)
  Loads .pt adapter, deep-copies the generator, applies LoRA, loads weights.
  strength param scales lora_B to adjust adapter contribution at inference.
  Reads rank/alpha/target from embedded metadata if present.
  Returns a patched SELVA_MODEL bundle for use with the existing Sampler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:38:46 +02:00