ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	82fb7a0009	docs: note AudioX shows no perceptual quality gain on V2A vs SelVA Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-07 09:12:00 +02:00
Ethanfel	af4777d2d7	docs: add AudioX vs SelVA evaluation Architecture comparison, capability matrix, integration cost estimate, LoRA training difficulty analysis, and license implications. Verdict: SelVA remains preferred for V2A + LoRA fine-tuning; AudioX adds value for music generation, inpainting, and text-to-audio tasks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-07 09:11:09 +02:00
Ethanfel	ed8abf7a5b	docs: add video format recommendations to dataset preparation section New section 1.1 covers aspect ratio (16:9 landscape preferred), resolution (≥480p), frame rate (any, use VHS_VIDEOINFO), and portrait handling (center-crop to square). Based on CLIP 384px and Synchformer 224px internals. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:44:14 +02:00
Ethanfel	21ed93d3ee	docs: add audio dataset pipeline reference doc Full research notes on cleaning, augmentation, and quality metrics for generative model training. Covers LUFS normalization, AudioSep, waveform augmentation (pitch shift, RIR, EQ), latent mixup, DNSMOS gating, tool install commands, and key paper references. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:37:48 +02:00
Ethanfel	f1e2bbd55b	feat: add first experiment sweep file for Tier 1 ablation 6 experiments: baseline, LoRA+ (ratio=16), dropout 0.05, dropout 0.1, curriculum sampling, and all three combined. bf16 batch 16, 2000 steps, seed 42. data_dir placeholder needs to be updated before running. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:15:06 +02:00
Ethanfel	3d9221c248	fix: three bugs in scheduler and trainer - trainer: raise ValueError early when remaining steps < log_interval (50) instead of UnboundLocalError on smoothed_img/final_path at return - trainer: use None in grad_norm_history instead of silent 0.0 when grad_accum > log_interval and no optimizer step fired in the interval - trainer: include start_step in _train_inner return dict - scheduler: use start_step from result dict for min_loss_step and loss_at_steps (fixes wrong step labels on resumed experiments) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:11:25 +02:00
Ethanfel	2d200395af	feat: add grad norm logging and richer experiment summary output trainer: - Track gradient norm before clipping at each optimizer step - Log avg grad_norm per log_interval alongside loss in console output - Include grad_norm_history in _train_inner return dict scheduler: - Add system block to summary (GPU name, VRAM, torch/CUDA version) - Include full loss_history and grad_norm_history arrays in each experiment result (50-step resolution, not just save_every checkpoints) - Add loss_std_last_quarter stability metric (std dev of raw loss over last 25% of steps — high value indicates unstable training) - Add log_interval field so consumers know the x-axis resolution Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:06:39 +02:00
Ethanfel	3ec380a27e	feat: add SelVA LoRA Scheduler node for automated experiment sweeps - Extract _prepare_dataset() from SelvaLoraTrainer.train() as a module-level function so the dataset can be encoded once and reused across experiments - Change _train_inner() return value from tuple to dict (adds loss_history, meta, completed; train() unpacks for ComfyUI — no change to node outputs) - New SelvaLoraScheduler node: reads a JSON sweep file, runs N experiments sequentially, writes experiment_summary.json (updated after each run) and loss_comparison.png with all smoothed curves overlaid on the same axes - Register SelvaLoraScheduler in nodes/__init__.py Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 13:03:21 +02:00
Ethanfel	9bc2568543	docs: document LoRA dropout, LoRA+, and curriculum timestep sampling Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 12:45:53 +02:00
Ethanfel	eb63c1ead7	feat: add LoRA dropout, LoRA+ asymmetric LR, and curriculum timestep sampling - LoRA dropout: applied to the LoRA path only (not frozen base weights), 0.05–0.1 helps regularize on small datasets (arXiv:2404.09610) - LoRA+: separate optimizer param groups for lora_A and lora_B with configurable LR ratio; ratio=16 enables LoRA+ (arXiv:2402.12354) - Curriculum mode: logit_normal for first N% of steps then uniform, directly addresses early convergence + fine-detail degradation at boundaries (arXiv:2603.12517) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 12:43:18 +02:00
Ethanfel	5baa070e61	docs: add observations section with fp32/batch/precision findings Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 02:34:53 +02:00
Ethanfel	9fc739fe9e	docs: add prompt guide and masking note to dataset preparation section Poor prompts and missing masks are a common source of white noise in LoRA training — imprecise sync features force the adapter to compensate with noise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:43:28 +02:00
Ethanfel	57fae4a8ce	chore: default timestep_mode back to uniform logit_normal reaches lower loss but perceptual improvement over uniform is dataset-dependent. Keeping uniform as default to match original MMAudio training behavior; logit_normal remains available as an option. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:21:08 +02:00
Ethanfel	8e919c0459	fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder On Windows, /folder is drive-relative (no drive letter) rather than a real absolute path. Redirect these to ComfyUI's output directory so files don't land at C:\folder. Also redirects plain relative paths (e.g. lora_output) to output/ instead of the process working directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:14:04 +02:00
Ethanfel	fec8eaac95	fix: save adapter and loss curves on cancel, not only on normal completion Wraps training loop in try/finally so adapter_final.pt and loss PNGs are always written. On cancellation the adapter is named adapter_cancelled_stepXXXXX.pt so it can be used with --resume. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:06:44 +02:00
Ethanfel	d83632e754	fix: pad/trim clip and sync features to fixed seq_len at dataset load time Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames, 8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load, same as latents are already normalized to latent_seq_len. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:54:05 +02:00
Ethanfel	a5014e49eb	feat: add logit-normal timestep sampling to reduce white noise artifacts Uniform timestep sampling undertrained t>0.8 (the final denoising steps), leaving residual noise that CFG amplifies at inference. Logit-normal sampling concentrates training near t=0.5 while still covering the full range, improving high-t coverage and reducing noise floor in generated audio. Default changed from uniform to logit_normal (sigma=1.0). Previous behavior available with timestep_mode=uniform. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:35:42 +02:00
Ethanfel	8ae0ba3c7d	fix: increment adapter_final filename on resume to avoid overwriting previous final Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:15:31 +02:00
Ethanfel	2b2b438307	fix: set OUTPUT_NODE=True on SelVA Feature Extractor so it runs without connected outputs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:11:16 +02:00
Ethanfel	39984f73c2	docs: add observed batching results to training guide Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:05:16 +02:00
Ethanfel	1f8cd6f930	docs: rewrite LORA_TRAINING.md with real-world findings - Added batch_size VRAM table and updated step recommendations for batched training - Added adapter strength section with practical guidance (0.6-0.7 for noise) - Added ComfyUI node as Option A for training (not just CLI) - Noted .mp3 as not recommended, soundfile fallback implied - Added output files section with sample_*.wav and loss curve PNGs - Added "LoRA has no effect" troubleshooting (wrong node wired) - Updated loss convergence targets based on observed training runs - Clarified linear1 target: 150+ clips recommended Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:00:36 +02:00
Ethanfel	20f8138146	chore: show batch_size in training step log Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 23:45:43 +02:00
Ethanfel	09b3b94ddd	feat: add batch_size parameter to training (default 4) Replaces single-sample steps with batched sampling via random.choices(). Tensors are stacked to [B, T, C] before the forward pass; t is now [B]. Default grad_accum lowered to 1 since real batching gives stable gradients. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 23:36:12 +02:00
Ethanfel	3f67de694c	feat: save loss_raw.png and loss_smoothed.png to output_dir Raw curve shown in light blue, EMA-smoothed (beta=0.9) overlay in darker blue. Both saved as PNG at end of training. The node IMAGE output now returns the smoothed version. Live preview also uses the smoothed overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 23:15:48 +02:00
Ethanfel	423e174b88	debug: print lora_A norm after loading to confirm adapter applied Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 23:05:23 +02:00
Ethanfel	4806daa4ca	chore: lower default warmup_steps from 500 to 100 500 warmup steps is 25% of a 2000-step run — too long. 100 steps lets the full lr kick in much earlier without sacrificing stability. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:51:27 +02:00
Ethanfel	16b3eb11cc	fix: pass max_size=800 to progress bar preview (was 85px wide) The third element in ComfyUI's preview tuple is max_size in pixels, not JPEG quality. Passing 85 was capping the live loss curve at 85×40px. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:48:56 +02:00
Ethanfel	004ea63f62	fix: fall back to soundfile for torchaudio.save when torchcodec unavailable Same torchcodec/FFmpeg issue as the load path, now on the eval sample save. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:44:04 +02:00
Ethanfel	afb3242eca	fix: disable inference_mode entirely for training via inference_mode(False) torch.enable_grad() alone is insufficient: operations on inference tensors (created inside ComfyUI's outer inference_mode context) produce inference tensors even inside enable_grad, breaking autograd. inference_mode(False) exits the inference context so the deepcopy, apply_lora, and training loop run with a fully clean autograd context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:40:50 +02:00
Ethanfel	849f31e2a6	fix: create LoRA params inside torch.enable_grad() to escape inference_mode torch.enable_grad() re-enables grad tracking but nn.Parameters created while torch.inference_mode() is active are inference tensors that can't enter autograd regardless. Splitting into _train_inner() and calling it inside enable_grad() ensures the deepcopy, apply_lora, and the training loop all run with a clean autograd context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:36:28 +02:00
Ethanfel	505d445eb3	fix: wrap training loop in torch.enable_grad() ComfyUI executes all nodes inside torch.no_grad(), which prevents gradient tracking and makes loss.backward() fail. torch.enable_grad() overrides it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:32:00 +02:00
Ethanfel	8fade1b0e3	fix: initialize LoRA params on same device as wrapped linear apply_lora() is called after generator.to(device), so lora_A/lora_B were being created on CPU while the rest of the model was on CUDA. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:17:29 +02:00
Ethanfel	ad57432803	fix: pad/trim latent to exact latent_seq_len after VAE encoding STFT hop-size rounding produces ±1 latent frame vs the expected seq length. Clamp to seq_cfg.latent_seq_len after transpose so generator.forward assertion passes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:12:20 +02:00
Ethanfel	43f732f904	fix: transpose VAE latent from [B,C,T] to [B,T,C] before generator VAE encoder returns channels-first [B, latent_dim, T]; the generator expects time-first [B, T, latent_dim] (same convention as decode which already does .transpose(1,2)). Fixes normalize() size mismatch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:08:00 +02:00
Ethanfel	6b9adf0816	fix: fall back to soundfile when torchcodec FFmpeg libs are missing Recent torchaudio defaults to torchcodec as the audio backend, which requires FFmpeg shared libraries. Falls back to soundfile for envs where torchcodec can't load (e.g. containerised ComfyUI without system FFmpeg). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 22:03:57 +02:00
Ethanfel	52434a053a	fix: keep VAE in float32 for mel/stft; print full traceback on clip load failure torch.stft requires float32 input — casting vae_utils to bf16 caused silent failures during dataset pre-loading. Also adds traceback.print_exc() so future clip-load errors are visible in the ComfyUI log. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 21:57:20 +02:00
Ethanfel	56c8d5d6b4	feat: save eval audio sample alongside each checkpoint At every save_every steps, run a quick 8-step no-CFG inference pass on a random training clip and save the decoded waveform as sample_stepXXXXX.wav next to the checkpoint. Uses the existing generator.unnormalize + feature_utils.decode + vocode pipeline from the sampler. Failure is non-fatal (logged and skipped). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 21:47:02 +02:00
Ethanfel	b430953602	feat: live loss curve preview during training - Send updated loss curve to ComfyUI frontend every 50 steps via pbar_train.update_absolute() with a JPEG preview tuple — same mechanism as KSampler's denoising previews. - Fix x-axis step labels for resumed runs (previously always started at 0; now correctly shows start_step + offset). - Split _draw_loss_curve (returns PIL Image) from _pil_to_tensor (converts for ComfyUI IMAGE output). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 17:11:38 +02:00
Ethanfel	57cd3dd4b4	fix: use load_lora for resume and remove redundant inference_mode wrapper - Resume now calls load_lora() instead of load_state_dict() directly, giving proper warnings for missing/unexpected LoRA keys. - Remove redundant `with torch.inference_mode():` around encode_audio (already @inference_mode decorated); dist.mode().clone() pattern is now clearer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 17:09:35 +02:00
Ethanfel	f206a1b38c	feat: add SelVA LoRA Trainer ComfyUI node Runs the full training loop inside ComfyUI. Reuses the already-loaded CLIP model from the inference model for text encoding; loads only a minimal VAE encoder separately (freed after dataset pre-loading). Outputs: - SELVA_MODEL with LoRA applied (ready to connect directly to Sampler) - adapter_path STRING (for SelVA LoRA Loader in future sessions) - loss_curve IMAGE (PIL-rendered line chart of training loss per 50 steps) Progress is shown via ComfyUI ProgressBar (two phases: dataset loading, then training steps). Resume is supported via resume_path input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 17:07:38 +02:00
Ethanfel	2f4641247a	feat: add resume support to train_lora.py Step checkpoints now save optimizer state, scheduler state, and step number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt to continue training from that checkpoint. --steps always means total steps, so resuming from 1000 with --steps 2000 trains 1000 more steps. adapter_final.pt format is unchanged (state_dict + meta only) so SelvaLoraLoader remains compatible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 16:59:30 +02:00
Ethanfel	8e9114b92c	docs: add clip length and scalable dataset size recommendations - Clip length section: fixed 8s duration, padding/trim behavior, per-sound-type strategies (continuous, short events, repeating, onset placement). - Dataset size table: 5-10 / 15-30 / 30-60 / 60-150 / 150-300 / 300+ clips with scenario and expected result for each tier. - Note on diversity vs quantity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 16:34:50 +02:00
Ethanfel	63b4391573	fix: named .npz files always start at _001 dog_bark_001.npz, dog_bark_002.npz instead of dog_bark.npz, dog_bark_001.npz. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 15:44:26 +02:00
Ethanfel	89af5a468c	docs: add LoRA training guide Covers dataset preparation (ComfyUI feature extraction + clean audio), training CLI reference, tuning guide (rank/steps/lr), adapter loading in ComfyUI, and troubleshooting. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 15:43:09 +02:00
Ethanfel	c88e27742c	fix: sanitize name field and remove double load_npz call - _resolve_named_path: replace / \ and null in name to prevent path traversal outside cache_dir (would cause a confusing FileNotFoundError at np.savez time instead of at path resolution). - train_lora: load_npz was called twice per clip when prompt was in prompts.txt; consolidate to a single call before prompt resolution. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 15:30:25 +02:00
Ethanfel	cbcd154c96	feat: add name field with auto-increment to SelvaFeatureExtractor When name is provided, features are saved as name.npz (or name_001.npz, name_002.npz etc. if the file already exists) instead of a content hash — useful for building a named training dataset. Hash-based caching is unchanged when name is left empty. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 15:16:51 +02:00
Ethanfel	1eb82d8050	refactor: train_lora accepts .npz + audio pairs instead of raw video - Input is now pre-extracted .npz files (from SelvaFeatureExtractor) paired with clean audio files (same stem). Visual features no longer re-extracted during training. - FeaturesUtils loaded with enable_conditions=False (VAE only) — Synchformer and T5 are no longer loaded, saving ~3-4 GB VRAM. - CLIP text encoder loaded separately via patch_clip so text prompt can differ from the one used during feature extraction. - Prompt priority: prompts.txt override > embedded in .npz > directory name. - Removed: torchvision video loading, frame sampling/resizing, net_video_enc, synchformer path check. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 15:14:26 +02:00
Ethanfel	cde280049b	fix: correct LoRALinear dtype and remove unused import - LoRALinear now creates lora_A/lora_B with dtype matching the base linear's weight, preventing a float32/bf16 mismatch at forward time when the generator is loaded in bf16 or fp16. - Remove unused `import math` from train_lora.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 14:57:09 +02:00
Ethanfel	437c62b28f	feat: LoRA fine-tuning for SelVA generator Teaches the model new/partial sound classes from custom video+audio pairs. Only ~10 MB of adapter weights are trained vs ~4.4 GB for the full model. selva_core/model/lora.py LoRALinear: wraps nn.Linear with frozen base + trainable A/B matrices. B initialised to zero → zero adapter contribution at init. apply_lora(): walks named_modules, replaces matching nn.Linear in-place. Default target: "attn.qkv" (all 21 SelfAttention QKV projections in large_44k). Add "linear1" to also wrap post-attention output projections. get_lora_state_dict() / load_lora() for ~10 MB save/load. train_lora.py (standalone script, no ComfyUI dependency) Data format: directory of video files + optional prompts.txt ("filename: description"). Falls back to directory name as prompt. Pre-extracts features for all clips into RAM, then trains from those. Training loop: encode audio→latent (need_vae_encoder=True), flow matching MSE loss on velocity prediction, backward on LoRA params only. Saves adapter_stepNNNNN.pt checkpoints + adapter_final.pt with metadata. Key verified interfaces used: encode_audio() → DiagonalGaussianDistribution; .mode().clone() required normalize() is in-place forward(latent, clip_f, sync_f, text_f, t) takes raw tensors nodes/selva_lora_loader.py (SelVA LoRA Loader ComfyUI node) Loads .pt adapter, deep-copies the generator, applies LoRA, loads weights. strength param scales lora_B to adjust adapter contribution at inference. Reads rank/alpha/target from embedded metadata if present. Returns a patched SELVA_MODEL bundle for use with the existing Sampler. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 14:38:46 +02:00
Ethanfel	b519b042e2	docs: document mask inputs and normalize toggle in README Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-05 10:43:42 +02:00

192 Commits