Commit Graph

280 Commits

Author SHA1 Message Date
Ethanfel 57fae4a8ce chore: default timestep_mode back to uniform
logit_normal reaches lower loss but perceptual improvement over uniform
is dataset-dependent. Keeping uniform as default to match original MMAudio
training behavior; logit_normal remains available as an option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:21:08 +02:00
Ethanfel 8e919c0459 fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder
On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:14:04 +02:00
Ethanfel fec8eaac95 fix: save adapter and loss curves on cancel, not only on normal completion
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:06:44 +02:00
Ethanfel d83632e754 fix: pad/trim clip and sync features to fixed seq_len at dataset load time
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
same as latents are already normalized to latent_seq_len.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:54:05 +02:00
Ethanfel a5014e49eb feat: add logit-normal timestep sampling to reduce white noise artifacts
Uniform timestep sampling undertrained t>0.8 (the final denoising steps),
leaving residual noise that CFG amplifies at inference. Logit-normal sampling
concentrates training near t=0.5 while still covering the full range, improving
high-t coverage and reducing noise floor in generated audio.

Default changed from uniform to logit_normal (sigma=1.0). Previous behavior
available with timestep_mode=uniform.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:35:42 +02:00
Ethanfel 8ae0ba3c7d fix: increment adapter_final filename on resume to avoid overwriting previous final
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:15:31 +02:00
Ethanfel 2b2b438307 fix: set OUTPUT_NODE=True on SelVA Feature Extractor so it runs without connected outputs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:11:16 +02:00
Ethanfel 39984f73c2 docs: add observed batching results to training guide
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:05:16 +02:00
Ethanfel 1f8cd6f930 docs: rewrite LORA_TRAINING.md with real-world findings
- Added batch_size VRAM table and updated step recommendations for batched training
- Added adapter strength section with practical guidance (0.6-0.7 for noise)
- Added ComfyUI node as Option A for training (not just CLI)
- Noted .mp3 as not recommended, soundfile fallback implied
- Added output files section with sample_*.wav and loss curve PNGs
- Added "LoRA has no effect" troubleshooting (wrong node wired)
- Updated loss convergence targets based on observed training runs
- Clarified linear1 target: 150+ clips recommended

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:00:36 +02:00
Ethanfel 20f8138146 chore: show batch_size in training step log
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 23:45:43 +02:00
Ethanfel 09b3b94ddd feat: add batch_size parameter to training (default 4)
Replaces single-sample steps with batched sampling via random.choices().
Tensors are stacked to [B, T, C] before the forward pass; t is now [B].
Default grad_accum lowered to 1 since real batching gives stable gradients.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 23:36:12 +02:00
Ethanfel 3f67de694c feat: save loss_raw.png and loss_smoothed.png to output_dir
Raw curve shown in light blue, EMA-smoothed (beta=0.9) overlay in darker
blue. Both saved as PNG at end of training. The node IMAGE output now
returns the smoothed version. Live preview also uses the smoothed overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 23:15:48 +02:00
Ethanfel 423e174b88 debug: print lora_A norm after loading to confirm adapter applied
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 23:05:23 +02:00
Ethanfel 4806daa4ca chore: lower default warmup_steps from 500 to 100
500 warmup steps is 25% of a 2000-step run — too long. 100 steps lets
the full lr kick in much earlier without sacrificing stability.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:51:27 +02:00
Ethanfel 16b3eb11cc fix: pass max_size=800 to progress bar preview (was 85px wide)
The third element in ComfyUI's preview tuple is max_size in pixels, not
JPEG quality. Passing 85 was capping the live loss curve at 85×40px.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:48:56 +02:00
Ethanfel 004ea63f62 fix: fall back to soundfile for torchaudio.save when torchcodec unavailable
Same torchcodec/FFmpeg issue as the load path, now on the eval sample save.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:44:04 +02:00
Ethanfel afb3242eca fix: disable inference_mode entirely for training via inference_mode(False)
torch.enable_grad() alone is insufficient: operations on inference tensors
(created inside ComfyUI's outer inference_mode context) produce inference
tensors even inside enable_grad, breaking autograd. inference_mode(False)
exits the inference context so the deepcopy, apply_lora, and training loop
run with a fully clean autograd context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:40:50 +02:00
Ethanfel 849f31e2a6 fix: create LoRA params inside torch.enable_grad() to escape inference_mode
torch.enable_grad() re-enables grad tracking but nn.Parameters created while
torch.inference_mode() is active are inference tensors that can't enter autograd
regardless. Splitting into _train_inner() and calling it inside enable_grad()
ensures the deepcopy, apply_lora, and the training loop all run with a clean
autograd context.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:36:28 +02:00
Ethanfel 505d445eb3 fix: wrap training loop in torch.enable_grad()
ComfyUI executes all nodes inside torch.no_grad(), which prevents gradient
tracking and makes loss.backward() fail. torch.enable_grad() overrides it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:32:00 +02:00
Ethanfel 8fade1b0e3 fix: initialize LoRA params on same device as wrapped linear
apply_lora() is called after generator.to(device), so lora_A/lora_B were
being created on CPU while the rest of the model was on CUDA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:17:29 +02:00
Ethanfel ad57432803 fix: pad/trim latent to exact latent_seq_len after VAE encoding
STFT hop-size rounding produces ±1 latent frame vs the expected seq length.
Clamp to seq_cfg.latent_seq_len after transpose so generator.forward assertion passes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:12:20 +02:00
Ethanfel 43f732f904 fix: transpose VAE latent from [B,C,T] to [B,T,C] before generator
VAE encoder returns channels-first [B, latent_dim, T]; the generator
expects time-first [B, T, latent_dim] (same convention as decode which
already does .transpose(1,2)). Fixes normalize() size mismatch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:08:00 +02:00
Ethanfel 6b9adf0816 fix: fall back to soundfile when torchcodec FFmpeg libs are missing
Recent torchaudio defaults to torchcodec as the audio backend, which requires
FFmpeg shared libraries. Falls back to soundfile for envs where torchcodec
can't load (e.g. containerised ComfyUI without system FFmpeg).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 22:03:57 +02:00
Ethanfel 52434a053a fix: keep VAE in float32 for mel/stft; print full traceback on clip load failure
torch.stft requires float32 input — casting vae_utils to bf16 caused silent
failures during dataset pre-loading. Also adds traceback.print_exc() so future
clip-load errors are visible in the ComfyUI log.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:57:20 +02:00
Ethanfel 56c8d5d6b4 feat: save eval audio sample alongside each checkpoint
At every save_every steps, run a quick 8-step no-CFG inference pass on
a random training clip and save the decoded waveform as
sample_stepXXXXX.wav next to the checkpoint. Uses the existing
generator.unnormalize + feature_utils.decode + vocode pipeline from
the sampler. Failure is non-fatal (logged and skipped).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 21:47:02 +02:00
Ethanfel b430953602 feat: live loss curve preview during training
- Send updated loss curve to ComfyUI frontend every 50 steps via
  pbar_train.update_absolute() with a JPEG preview tuple — same
  mechanism as KSampler's denoising previews.
- Fix x-axis step labels for resumed runs (previously always started
  at 0; now correctly shows start_step + offset).
- Split _draw_loss_curve (returns PIL Image) from _pil_to_tensor
  (converts for ComfyUI IMAGE output).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:11:38 +02:00
Ethanfel 57cd3dd4b4 fix: use load_lora for resume and remove redundant inference_mode wrapper
- Resume now calls load_lora() instead of load_state_dict() directly,
  giving proper warnings for missing/unexpected LoRA keys.
- Remove redundant `with torch.inference_mode():` around encode_audio
  (already @inference_mode decorated); dist.mode().clone() pattern
  is now clearer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:09:35 +02:00
Ethanfel f206a1b38c feat: add SelVA LoRA Trainer ComfyUI node
Runs the full training loop inside ComfyUI. Reuses the already-loaded
CLIP model from the inference model for text encoding; loads only a
minimal VAE encoder separately (freed after dataset pre-loading).

Outputs:
- SELVA_MODEL with LoRA applied (ready to connect directly to Sampler)
- adapter_path STRING (for SelVA LoRA Loader in future sessions)
- loss_curve IMAGE (PIL-rendered line chart of training loss per 50 steps)

Progress is shown via ComfyUI ProgressBar (two phases: dataset loading,
then training steps). Resume is supported via resume_path input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 17:07:38 +02:00
Ethanfel 2f4641247a feat: add resume support to train_lora.py
Step checkpoints now save optimizer state, scheduler state, and step
number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt
to continue training from that checkpoint. --steps always means total steps,
so resuming from 1000 with --steps 2000 trains 1000 more steps.

adapter_final.pt format is unchanged (state_dict + meta only) so
SelvaLoraLoader remains compatible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:59:30 +02:00
Ethanfel 8e9114b92c docs: add clip length and scalable dataset size recommendations
- Clip length section: fixed 8s duration, padding/trim behavior, per-sound-type
  strategies (continuous, short events, repeating, onset placement).
- Dataset size table: 5-10 / 15-30 / 30-60 / 60-150 / 150-300 / 300+ clips
  with scenario and expected result for each tier.
- Note on diversity vs quantity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:34:50 +02:00
Ethanfel 63b4391573 fix: named .npz files always start at _001
dog_bark_001.npz, dog_bark_002.npz instead of dog_bark.npz, dog_bark_001.npz.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:44:26 +02:00
Ethanfel 89af5a468c docs: add LoRA training guide
Covers dataset preparation (ComfyUI feature extraction + clean audio),
training CLI reference, tuning guide (rank/steps/lr), adapter loading
in ComfyUI, and troubleshooting.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:43:09 +02:00
Ethanfel c88e27742c fix: sanitize name field and remove double load_npz call
- _resolve_named_path: replace / \ and null in name to prevent path
  traversal outside cache_dir (would cause a confusing FileNotFoundError
  at np.savez time instead of at path resolution).
- train_lora: load_npz was called twice per clip when prompt was in
  prompts.txt; consolidate to a single call before prompt resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:30:25 +02:00
Ethanfel cbcd154c96 feat: add name field with auto-increment to SelvaFeatureExtractor
When name is provided, features are saved as name.npz (or name_001.npz,
name_002.npz etc. if the file already exists) instead of a content hash —
useful for building a named training dataset. Hash-based caching is
unchanged when name is left empty.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:16:51 +02:00
Ethanfel 1eb82d8050 refactor: train_lora accepts .npz + audio pairs instead of raw video
- Input is now pre-extracted .npz files (from SelvaFeatureExtractor) paired
  with clean audio files (same stem). Visual features no longer re-extracted
  during training.
- FeaturesUtils loaded with enable_conditions=False (VAE only) — Synchformer
  and T5 are no longer loaded, saving ~3-4 GB VRAM.
- CLIP text encoder loaded separately via patch_clip so text prompt can differ
  from the one used during feature extraction.
- Prompt priority: prompts.txt override > embedded in .npz > directory name.
- Removed: torchvision video loading, frame sampling/resizing, net_video_enc,
  synchformer path check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 15:14:26 +02:00
Ethanfel cde280049b fix: correct LoRALinear dtype and remove unused import
- LoRALinear now creates lora_A/lora_B with dtype matching the base
  linear's weight, preventing a float32/bf16 mismatch at forward time
  when the generator is loaded in bf16 or fp16.
- Remove unused `import math` from train_lora.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:57:09 +02:00
Ethanfel 437c62b28f feat: LoRA fine-tuning for SelVA generator
Teaches the model new/partial sound classes from custom video+audio pairs.
Only ~10 MB of adapter weights are trained vs ~4.4 GB for the full model.

selva_core/model/lora.py
  LoRALinear: wraps nn.Linear with frozen base + trainable A/B matrices.
  B initialised to zero → zero adapter contribution at init.
  apply_lora(): walks named_modules, replaces matching nn.Linear in-place.
  Default target: "attn.qkv" (all 21 SelfAttention QKV projections in
  large_44k). Add "linear1" to also wrap post-attention output projections.
  get_lora_state_dict() / load_lora() for ~10 MB save/load.

train_lora.py (standalone script, no ComfyUI dependency)
  Data format: directory of video files + optional prompts.txt
  ("filename: description"). Falls back to directory name as prompt.
  Pre-extracts features for all clips into RAM, then trains from those.
  Training loop: encode audio→latent (need_vae_encoder=True), flow
  matching MSE loss on velocity prediction, backward on LoRA params only.
  Saves adapter_stepNNNNN.pt checkpoints + adapter_final.pt with metadata.
  Key verified interfaces used:
    encode_audio() → DiagonalGaussianDistribution; .mode().clone() required
    normalize() is in-place
    forward(latent, clip_f, sync_f, text_f, t) takes raw tensors

nodes/selva_lora_loader.py (SelVA LoRA Loader ComfyUI node)
  Loads .pt adapter, deep-copies the generator, applies LoRA, loads weights.
  strength param scales lora_B to adjust adapter contribution at inference.
  Reads rank/alpha/target from embedded metadata if present.
  Returns a patched SELVA_MODEL bundle for use with the existing Sampler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 14:38:46 +02:00
Ethanfel b519b042e2 docs: document mask inputs and normalize toggle in README
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:43:42 +02:00
Ethanfel f28759f1e3 feat: improve mask support with neutral fill, mask_strength, and per-path toggles
- Replace zero-fill with neutral gray (0.5) fill so masked background
  pixels stay in-distribution: 0.5 maps to ~0 in CLIP normalized space
  and exactly 0 after sync's [-1,1] normalization
- Add mask_strength float (0–1) for partial background suppression
- Add mask_clip / mask_sync booleans to toggle masking independently
  on the CLIP (384px) and TextSynchformer (224px) encoding paths
- Fix temporal mask sampling: use fps-accurate index formula (same as
  _sample_frames) instead of proportional int(i*M/N)
- Include mask_strength, mask_clip, mask_sync in cache hash when mask
  is connected, so changing any param correctly busts the cache
- Log lines now report masked/skipped state and strength per path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:43:01 +02:00
Ethanfel 3dd6badfd9 fix: guarantee offload cleanup on exception with try/finally
Both nodes moved models to GPU before work then back to CPU after.
Any exception (OOM, cancellation, bad input) would skip the cleanup,
leaving models on GPU permanently until ComfyUI restarts.

Wrap the entire work block in try/finally so offload_to_cpu cleanup
always runs regardless of how the node exits. Also removes the unused
`mode` variable in SelvaSampler.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:40:39 +02:00
Ethanfel 8bb2fb7015 fix: extend OOM catch to decode/vocode, add (masked) to sync log line
- selva_sampler: wrap decode+vocode in their own OOM catch — previously
  OOM during mel decode or vocoding gave a raw CUDA traceback instead
  of the actionable hint
- selva_feature_extractor: sync frames log line now shows (masked) when
  a mask is active, matching the CLIP log line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:38:59 +02:00
Ethanfel f4a7292cde feat: add optional MASK input to SelVA Feature Extractor
Allows per-frame or static segmentation masks to be applied before CLIP
and sync encoding, zeroing background pixels. Useful when multiple objects
compete for the same sound and text prompting alone is insufficient.

- _apply_mask(): resizes mask spatially (nearest-exact), samples temporally
  to match sampled frame count, multiplies into frames
- _hash_inputs(): includes mask bytes in cache key (begin/mid/end sampling)
- INPUT_TYPES: mask added to optional inputs with tooltip
- extract_features(): mask=None parameter, applied after _resize_frames for
  both CLIP (384px) and sync (224px) paths, before normalization
- Log line notes when masking is active

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:34:13 +02:00
Ethanfel bd53744e2d feat: comprehensive node improvements
Model Loader:
- bf16 support check — auto-falls back to fp16 on unsupported GPUs
- DESCRIPTION and OUTPUT_TOOLTIPS

Feature Extractor:
- Store variant in features dict and .npz cache
- Progress bar (3 steps: CLIP encode, T5 encode, sync encode)
- Expand cache hash to 32 hex chars
- DESCRIPTION and OUTPUT_TOOLTIPS

Sampler:
- Variant mismatch validation against extracted features
- Cancellation support via throw_exception_if_processing_interrupted()
- OOM catch with actionable error message
- normalize toggle (optional BOOLEAN, default true) for peak normalization
- Remove empty optional: {} block
- DESCRIPTION and OUTPUT_TOOLTIPS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:16:03 +02:00
Ethanfel 429810db5b docs: improve tooltips on all three SelVA nodes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:10:05 +02:00
Ethanfel 57f56c04e2 feat: update demo workflow with VHS_VideoCombine output
- Replace PreviewAudio with VHS_VideoCombine — outputs video+audio together
- Wire fps from FeatureExtractor to VideoCombine frame_rate
- Wire audio from Sampler into VideoCombine
- Clear hardcoded video filename
- Set filename_prefix to SelVA, save_output=true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:07:56 +02:00
Ethanfel ff26d0b87d fix: bug sweep and improvements
- nodes/__init__.py: fix [PrismAudio] leftover label in error print
- selva_feature_extractor: hash beginning, middle and end of video tensor
  instead of just first 1MB, avoiding collisions on videos with same opening frames
- selva_sampler: derive SequenceConfig from model template via dataclasses.replace
  instead of hardcoding sampling_rate/spectrogram_frame_rate per mode

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:04:35 +02:00
Ethanfel 83b1da9520 chore: remove all PrismAudio code from main branch
- Delete prismaudio_core/, data_utils/, scripts/, docs/plans/
- Delete PrismAudio nodes (feature_extractor, feature_loader, model_loader, sampler, text_only)
- Delete PrismAudio workflows (video_to_audio, text_to_audio)
- Clean nodes/utils.py: rename PRISMAUDIO_CATEGORY → SELVA_CATEGORY, remove unused helpers
- Strip PrismAudio-only deps from requirements.txt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:58:31 +02:00
Ethanfel 679a607a85 feat: wire prompt output from feature extractor to sampler in demo workflow
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:13:23 +02:00
Ethanfel d495939367 docs: rewrite README for SelVA
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:12:28 +02:00
Ethanfel 982d66e078 chore: remove PrismAudio nodes from selva-integration branch
This branch registers only the three SelVA nodes. PrismAudio nodes stay
on master/feature/lora-trainer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:01:21 +02:00