On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.
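A minimal sketch of the redirect rule, assuming a hypothetical helper named `redirect_to_output` and an example output root. `PureWindowsPath` is used so the Windows parsing rules apply on any platform; the node's real helper may differ.

```python
from pathlib import PureWindowsPath


def redirect_to_output(user_path: str, output_root: str) -> PureWindowsPath:
    """Keep fully qualified paths; place everything else under output_root.

    Hypothetical helper for illustration only.
    """
    p = PureWindowsPath(user_path)
    if p.drive and p.root:
        # Fully qualified (drive letter and root present): respect it as-is.
        return p
    # Drop a leading "\" or "C:" anchor and keep the remaining components.
    parts = p.parts[1:] if p.anchor else p.parts
    return PureWindowsPath(output_root, *parts)
```

With this rule, `/folder` and `lora_output` both end up below the output root, while `D:/runs/x` is left alone.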
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.
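The control flow can be sketched as follows. `TrainingCancelled`, `train_step` and `save_adapter` are hypothetical stand-ins, the five-digit zero padding is an assumption based on the XXXXX placeholder, and the loss PNG writing is omitted.

```python
class TrainingCancelled(Exception):
    """Stand-in for whatever interrupt is raised on cancellation (hypothetical)."""


def run_training(total_steps, train_step, save_adapter):
    """Run the loop and always save an adapter, even if the loop is interrupted."""
    step = 0
    completed = False
    try:
        while step < total_steps:
            train_step(step)
            step += 1
        completed = True
    finally:
        # Runs on normal exit and on any exception, including cancellation.
        if completed:
            name = "adapter_final.pt"
        else:
            name = f"adapter_cancelled_step{step:05d}.pt"
        save_adapter(name)
    return name
```

The exception still propagates to the caller after the `finally` block has saved the partial adapter.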
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
the same way latents are already normalized to latent_seq_len.
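A sketch of length normalization on a plain list of frames. The commit does not say how short sequences are filled, so repeating the last frame is an assumption here, and `fit_to_length` is a hypothetical name. The same trim-or-pad logic applies along the time axis of a tensor.

```python
def fit_to_length(frames: list, target_len: int) -> list:
    """Trim or pad a per-clip sequence so every clip has target_len entries."""
    if not frames:
        raise ValueError("cannot pad an empty sequence")
    if len(frames) >= target_len:
        # Longer clips are truncated.
        return frames[:target_len]
    # Shorter clips are padded by repeating the last frame (assumed strategy).
    return frames + [frames[-1]] * (target_len - len(frames))
```

Once every clip has the same length, stacking clips into a batch no longer fails on shape mismatch.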
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added batch_size VRAM table and updated step recommendations for batched training
- Added adapter strength section with practical guidance (0.6-0.7 for noise)
- Added ComfyUI node as Option A for training (not just CLI)
- Noted .mp3 as not recommended, soundfile fallback implied
- Added output files section with sample_*.wav and loss curve PNGs
- Added "LoRA has no effect" troubleshooting (wrong node wired)
- Updated loss convergence targets based on observed training runs
- Clarified linear1 target: 150+ clips recommended
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces single-sample steps with batched sampling via random.choices().
Tensors are stacked to [B, T, C] before the forward pass; t is now [B].
Default grad_accum lowered to 1 since real batching gives stable gradients.
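A small sketch of the sampling step, with a hypothetical `sample_batch` helper. `random.choices()` draws with replacement, so the batch size can exceed the number of clips and the same clip may appear twice in one batch.

```python
import random


def sample_batch(dataset: list, batch_size: int, rng: random.Random) -> list:
    """Draw batch_size items with replacement, one independent pick per slot."""
    return rng.choices(dataset, k=batch_size)
```

Each returned item would then be stacked along a new leading dimension to form the [B, T, C] tensors described above.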
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Raw curve shown in light blue, EMA-smoothed (beta=0.9) overlay in darker
blue. Both saved as PNG at end of training. The node IMAGE output now
returns the smoothed version. Live preview also uses the smoothed overlay.
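For reference, an EMA with beta=0.9 can be computed as below. Seeding the running value with the first loss (rather than zero with bias correction) is an assumption; the plotting code may initialise it differently.

```python
def ema_smooth(values, beta=0.9):
    """Exponential moving average: s_t = beta * s_(t-1) + (1 - beta) * x_t."""
    smoothed = []
    running = None
    for v in values:
        # The first value seeds the average (assumed initialisation).
        running = v if running is None else beta * running + (1.0 - beta) * v
        smoothed.append(running)
    return smoothed
```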
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
500 warmup steps is 25% of a 2000-step run — too long. 100 steps lets
the full lr kick in much earlier without sacrificing stability.
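To make the proportions concrete: 500 of 2000 steps is 25% of the run, 100 of 2000 is 5%. The sketch below assumes a linear ramp, which the commit does not state; `warmup_factor` is a hypothetical name for the multiplier applied to the base lr.

```python
def warmup_factor(step: int, warmup_steps: int) -> float:
    """Multiplier on the base learning rate under a linear warmup (assumed shape)."""
    if warmup_steps <= 0:
        return 1.0
    # Ramps from 1/warmup_steps up to 1.0, then stays at 1.0.
    return min(1.0, (step + 1) / warmup_steps)
```

With 100 warmup steps the multiplier reaches 1.0 at step 99; with 500 it would still be at 0.2 at that point.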
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The third element in ComfyUI's preview tuple is max_size in pixels, not
JPEG quality. Passing 85 was capping the live loss curve at 85×40px.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.enable_grad() alone is insufficient: operations on inference tensors
(created inside ComfyUI's outer inference_mode context) produce inference
tensors even inside enable_grad, breaking autograd. inference_mode(False)
exits the inference context so the deepcopy, apply_lora, and training loop
run with a fully clean autograd context.
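A minimal reproduction of the pattern, assuming the caller is in inference mode as described above. `tiny_training_step` is a hypothetical stand-in for the real training entry point; adding `enable_grad()` next to `inference_mode(False)` is a belt-and-braces choice in this sketch, not necessarily what the node does.

```python
import torch


def tiny_training_step():
    """Run one backward pass in a context that is not inference mode."""
    with torch.inference_mode(False), torch.enable_grad():
        # Allocated outside inference mode, so this is a normal autograd tensor.
        weight = torch.zeros(3, requires_grad=True)
        loss = ((weight - 1.0) ** 2).sum()
        loss.backward()
        return weight.grad.tolist(), weight.is_inference()


# Simulate the caller: an outer inference_mode context wraps the node.
with torch.inference_mode():
    outer = torch.ones(3)  # allocated in inference mode: an inference tensor
    grads, inner_is_inference = tiny_training_step()
```

Anything that must take part in autograd (the copied model, the LoRA parameters) has to be created inside the inner context, because tensors allocated in the outer one stay inference tensors.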
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.enable_grad() re-enables grad tracking but nn.Parameters created while
torch.inference_mode() is active are inference tensors that can't enter autograd
regardless. Splitting into _train_inner() and calling it inside enable_grad()
ensures the deepcopy, apply_lora, and the training loop all run with a clean
autograd context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI executes all nodes inside torch.no_grad(), which prevents gradient
tracking and makes loss.backward() fail. torch.enable_grad() overrides it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apply_lora() is called after generator.to(device), so lora_A/lora_B were
being created on CPU while the rest of the model was on CUDA.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STFT hop-size rounding produces ±1 latent frame vs the expected seq length.
Clamp to seq_cfg.latent_seq_len after transpose so generator.forward assertion passes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Recent torchaudio defaults to torchcodec as the audio backend, which requires
FFmpeg shared libraries. Falls back to soundfile for envs where torchcodec
can't load (e.g. containerised ComfyUI without system FFmpeg).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.stft requires float32 input — casting vae_utils to bf16 caused silent
failures during dataset pre-loading. Also adds traceback.print_exc() so future
clip-load errors are visible in the ComfyUI log.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
At every save_every steps, run a quick 8-step no-CFG inference pass on
a random training clip and save the decoded waveform as
sample_stepXXXXX.wav next to the checkpoint. Uses the existing
generator.unnormalize + feature_utils.decode + vocode pipeline from
the sampler. Failure is non-fatal (logged and skipped).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Send updated loss curve to ComfyUI frontend every 50 steps via
pbar_train.update_absolute() with a JPEG preview tuple — same
mechanism as KSampler's denoising previews.
- Fix x-axis step labels for resumed runs (previously always started
at 0; now correctly shows start_step + offset).
- Split _draw_loss_curve (returns PIL Image) from _pil_to_tensor
(converts for ComfyUI IMAGE output).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs the full training loop inside ComfyUI. Reuses the already-loaded
CLIP model from the inference model for text encoding; loads only a
minimal VAE encoder separately (freed after dataset pre-loading).
Outputs:
- SELVA_MODEL with LoRA applied (ready to connect directly to Sampler)
- adapter_path STRING (for SelVA LoRA Loader in future sessions)
- loss_curve IMAGE (PIL-rendered line chart of training loss per 50 steps)
Progress is shown via ComfyUI ProgressBar (two phases: dataset loading,
then training steps). Resume is supported via resume_path input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Step checkpoints now save optimizer state, scheduler state, and step
number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt
to continue training from that checkpoint. --steps always means total steps,
so resuming from 1000 with --steps 2000 trains 1000 more steps.
adapter_final.pt format is unchanged (state_dict + meta only) so
SelvaLoraLoader remains compatible.
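The two formats and the step arithmetic can be sketched as below. Only `state_dict` and `meta` are named in the commit; the `optimizer`, `scheduler` and `step` keys and both function names are assumptions for illustration.

```python
def make_step_checkpoint(lora_state, meta, optimizer_state, scheduler_state, step):
    """Resumable checkpoint: LoRA weights plus everything needed to continue."""
    return {
        "state_dict": lora_state,
        "meta": meta,
        "optimizer": optimizer_state,   # hypothetical key name
        "scheduler": scheduler_state,   # hypothetical key name
        "step": step,                   # hypothetical key name
    }


def remaining_steps(total_steps: int, resume_step: int) -> int:
    """--steps is the total, so only the difference is still to be trained."""
    return max(0, total_steps - resume_step)
```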
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _resolve_named_path: replace "/", "\" and NUL characters in name to prevent
path traversal outside cache_dir (which would otherwise surface as a confusing
FileNotFoundError at np.savez time rather than at path resolution).
- train_lora: load_npz was called twice per clip when prompt was in
prompts.txt; consolidate to a single call before prompt resolution.
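A sketch of the name sanitisation in the first item. The replacement character is an assumption (the commit only says the characters are replaced), and `sanitize_name` is a hypothetical name.

```python
def sanitize_name(name: str, replacement: str = "_") -> str:
    """Replace path separators and NUL so the name cannot leave its directory."""
    for ch in ("/", "\\", "\0"):
        name = name.replace(ch, replacement)
    return name
```

After this step the name contains no separators, so joining it onto cache_dir cannot produce a path in another directory.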
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When name is provided, features are saved as name.npz (or name_001.npz,
name_002.npz etc. if the file already exists) instead of a content hash —
useful for building a named training dataset. Hash-based caching is
unchanged when name is left empty.
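The naming scheme can be sketched as follows; `next_named_path` is a hypothetical helper and the real node may structure this differently.

```python
from pathlib import Path


def next_named_path(cache_dir: Path, name: str) -> Path:
    """Return name.npz, or the first free name_NNN.npz if that already exists."""
    candidate = cache_dir / f"{name}.npz"
    counter = 1
    while candidate.exists():
        candidate = cache_dir / f"{name}_{counter:03d}.npz"
        counter += 1
    return candidate
```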
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Input is now pre-extracted .npz files (from SelvaFeatureExtractor) paired
with clean audio files (same stem). Visual features no longer re-extracted
during training.
- FeaturesUtils loaded with enable_conditions=False (VAE only) — Synchformer
and T5 are no longer loaded, saving ~3-4 GB VRAM.
- CLIP text encoder loaded separately via patch_clip so text prompt can differ
from the one used during feature extraction.
- Prompt priority: prompts.txt override > embedded in .npz > directory name.
- Removed: torchvision video loading, frame sampling/resizing, net_video_enc,
synchformer path check.
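The prompt priority from the list above, as a small sketch. `resolve_prompt` and its argument names are hypothetical; `overrides` stands for the parsed prompts.txt entries keyed by file stem.

```python
from typing import Mapping, Optional


def resolve_prompt(
    stem: str,
    overrides: Mapping[str, str],
    embedded: Optional[str],
    dir_name: str,
) -> str:
    """prompts.txt override > prompt embedded in the .npz > directory name."""
    if stem in overrides:
        return overrides[stem]
    if embedded:
        return embedded
    return dir_name
```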
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- LoRALinear now creates lora_A/lora_B with dtype matching the base
linear's weight, preventing a float32/bf16 mismatch at forward time
when the generator is loaded in bf16 or fp16.
- Remove unused `import math` from train_lora.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Teaches the model new/partial sound classes from custom video+audio pairs.
Only ~10 MB of adapter weights are trained vs ~4.4 GB for the full model.
selva_core/model/lora.py
  LoRALinear: wraps nn.Linear with frozen base + trainable A/B matrices.
  B initialised to zero → zero adapter contribution at init.
  apply_lora(): walks named_modules, replaces matching nn.Linear in-place.
  Default target: "attn.qkv" (all 21 SelfAttention QKV projections in
  large_44k). Add "linear1" to also wrap post-attention output projections.
  get_lora_state_dict() / load_lora() for ~10 MB save/load.
train_lora.py (standalone script, no ComfyUI dependency)
  Data format: directory of video files + optional prompts.txt
  ("filename: description"). Falls back to directory name as prompt.
  Pre-extracts features for all clips into RAM, then trains from those.
  Training loop: encode audio→latent (need_vae_encoder=True), flow
  matching MSE loss on velocity prediction, backward on LoRA params only.
  Saves adapter_stepNNNNN.pt checkpoints + adapter_final.pt with metadata.
  Key verified interfaces used:
    encode_audio() → DiagonalGaussianDistribution; .mode().clone() required
    normalize() is in-place
    forward(latent, clip_f, sync_f, text_f, t) takes raw tensors
nodes/selva_lora_loader.py (SelVA LoRA Loader ComfyUI node)
  Loads .pt adapter, deep-copies the generator, applies LoRA, loads weights.
  strength param scales lora_B to adjust adapter contribution at inference.
  Reads rank/alpha/target from embedded metadata if present.
  Returns a patched SELVA_MODEL bundle for use with the existing Sampler.
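A minimal sketch of the LoRALinear idea, not the code in lora.py. The class name `LoRALinearSketch`, the small random init of A and the alpha/rank scaling follow the common LoRA convention and are assumptions here. Creating A and B on the base weight's device and dtype avoids CPU/CUDA and float32/bf16 mismatches at forward time.

```python
import torch
from torch import nn


class LoRALinearSketch(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update (illustrative only)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        w = base.weight
        # A gets a small random init (assumed); B is zero so the adapter starts as a no-op.
        self.lora_A = nn.Parameter(
            torch.randn(rank, base.in_features, device=w.device, dtype=w.dtype) * 0.01
        )
        self.lora_B = nn.Parameter(
            torch.zeros(base.out_features, rank, device=w.device, dtype=w.dtype)
        )
        self.scale = alpha / rank  # common LoRA scaling, assumed here

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```

Because B starts at zero, the wrapped layer reproduces the base layer's output until training updates it, and only `lora_A` and `lora_B` receive gradients.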
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace zero-fill with neutral gray (0.5) fill so masked background
pixels stay in-distribution: 0.5 maps to ~0 in CLIP normalized space
and exactly 0 after sync's [-1,1] normalization
- Add mask_strength float (0–1) for partial background suppression
- Add mask_clip / mask_sync booleans to toggle masking independently
on the CLIP (384px) and TextSynchformer (224px) encoding paths
- Fix temporal mask sampling: use fps-accurate index formula (same as
_sample_frames) instead of proportional int(i*M/N)
- Include mask_strength, mask_clip, mask_sync in cache hash when mask
is connected, so changing any param correctly busts the cache
- Log lines now report masked/skipped state and strength per path
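The gray-fill arithmetic from the first item can be checked per pixel. The CLIP mean/std below are the standard OpenAI CLIP preprocessing constants, and using them here is an assumption about this encoder; the blend formula for mask_strength and the names `blend_pixel` and `normalize` are also illustrative assumptions.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)  # assumed: standard CLIP constants
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
GRAY = 0.5


def blend_pixel(pixel: float, mask: float, strength: float, fill: float = GRAY) -> float:
    """Blend a background pixel toward the fill value (assumed formula)."""
    # mask=1 keeps the pixel; mask=0 replaces it in proportion to strength.
    keep = 1.0 - strength * (1.0 - mask)
    return pixel * keep + fill * (1.0 - keep)


def normalize(value: float, mean: float, std: float) -> float:
    return (value - mean) / std


# Gray lands near zero after CLIP normalization, per channel.
clip_gray = [normalize(GRAY, m, s) for m, s in zip(CLIP_MEAN, CLIP_STD)]
# Gray lands exactly on zero after [-1, 1] normalization (mean 0.5, std 0.5).
sync_gray = normalize(GRAY, 0.5, 0.5)
# Zero-fill, by contrast, sits well below zero in CLIP space.
clip_black_r = normalize(0.0, CLIP_MEAN[0], CLIP_STD[0])
```

Under these constants the gray values stay within roughly a third of a standard deviation of zero, while a black pixel lands well over one standard deviation below it.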
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nodes moved models to GPU before work then back to CPU after.
Any exception (OOM, cancellation, bad input) would skip the cleanup,
leaving models on GPU permanently until ComfyUI restarts.
Wrap the entire work block in try/finally so offload_to_cpu cleanup
always runs regardless of how the node exits. Also removes the unused
`mode` variable in SelvaSampler.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- selva_sampler: wrap decode+vocode in their own OOM catch — previously
OOM during mel decode or vocoding gave a raw CUDA traceback instead
of the actionable hint
- selva_feature_extractor: sync frames log line now shows (masked) when
a mask is active, matching the CLIP log line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows per-frame or static segmentation masks to be applied before CLIP
and sync encoding, zeroing background pixels. Useful when multiple objects
compete for the same sound and text prompting alone is insufficient.
- _apply_mask(): resizes mask spatially (nearest-exact), samples temporally
to match sampled frame count, multiplies into frames
- _hash_inputs(): includes mask bytes in cache key (begin/mid/end sampling)
- INPUT_TYPES: mask added to optional inputs with tooltip
- extract_features(): mask=None parameter, applied after _resize_frames for
both CLIP (384px) and sync (224px) paths, before normalization
- Log line notes when masking is active
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Model Loader:
- bf16 support check — auto-falls back to fp16 on unsupported GPUs
- DESCRIPTION and OUTPUT_TOOLTIPS
Feature Extractor:
- Store variant in features dict and .npz cache
- Progress bar (3 steps: CLIP encode, T5 encode, sync encode)
- Expand cache hash to 32 hex chars
- DESCRIPTION and OUTPUT_TOOLTIPS
Sampler:
- Variant mismatch validation against extracted features
- Cancellation support via throw_exception_if_processing_interrupted()
- OOM catch with actionable error message
- normalize toggle (optional BOOLEAN, default true) for peak normalization
- Remove empty optional: {} block
- DESCRIPTION and OUTPUT_TOOLTIPS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace PreviewAudio with VHS_VideoCombine — outputs video+audio together
- Wire fps from FeatureExtractor to VideoCombine frame_rate
- Wire audio from Sampler into VideoCombine
- Clear hardcoded video filename
- Set filename_prefix to SelVA, save_output=true
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nodes/__init__.py: fix [PrismAudio] leftover label in error print
- selva_feature_extractor: hash beginning, middle and end of video tensor
instead of just first 1MB, avoiding collisions on videos with same opening frames
- selva_sampler: derive SequenceConfig from model template via dataclasses.replace
instead of hardcoding sampling_rate/spectrogram_frame_rate per mode
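A sketch of the begin/middle/end hashing idea on raw bytes. SHA-256, the length prefix and the name `hash_begin_mid_end` are assumptions; the real extractor operates on the video tensor and may pick chunks differently.

```python
import hashlib


def hash_begin_mid_end(data: bytes, chunk: int = 1 << 20, hex_chars: int = 32) -> str:
    """Hash three chunks of a large buffer instead of only its first chunk."""
    h = hashlib.sha256()
    n = len(data)
    # Include the length so buffers of different size never share a key by accident.
    h.update(n.to_bytes(8, "little"))
    if n <= 3 * chunk:
        h.update(data)  # small input: hash everything
    else:
        mid = (n - chunk) // 2
        h.update(data[:chunk])
        h.update(data[mid:mid + chunk])
        h.update(data[-chunk:])
    return h.hexdigest()[:hex_chars]
```

Two videos that share the same opening frames but diverge later now get different keys, at the cost of hashing three chunks instead of one.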
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This branch registers only the three SelVA nodes. PrismAudio nodes stay
on master/feature/lora-trainer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Newer huggingface_hub releases stopped passing
proxies/resume_download/local_files_only/token to _from_pretrained(). Give
these parameters defaults so the call doesn't fail when the kwargs are omitted.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actual filenames in jnwnlee/SelVA: generator_*_44khz_sup_5.pth.
download_utils.py had the wrong names so those MD5s are unverified — set to
None to skip MD5 check for 44k generators. All other files verified/unchanged.
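The skip-on-None behaviour can be sketched as below. `md5_matches` is a hypothetical helper, not the function in download_utils.py.

```python
import hashlib
from pathlib import Path
from typing import Optional


def md5_matches(path: Path, expected_md5: Optional[str]) -> bool:
    """Return True when no checksum is recorded, else compare the file's MD5."""
    if expected_md5 is None:
        return True  # nothing to verify against
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB blocks so large checkpoints are not loaded into RAM at once.
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest() == expected_md5
```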
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>