ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	429810db5b	docs: improve tooltips on all three SelVA nodes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 18:10:05 +02:00
Ethanfel	57f56c04e2	feat: update demo workflow with VHS_VideoCombine output - Replace PreviewAudio with VHS_VideoCombine — outputs video+audio together - Wire fps from FeatureExtractor to VideoCombine frame_rate - Wire audio from Sampler into VideoCombine - Clear hardcoded video filename - Set filename_prefix to SelVA, save_output=true Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 18:07:56 +02:00
Ethanfel	ff26d0b87d	fix: bug sweep and improvements - nodes/__init__.py: fix [PrismAudio] leftover label in error print - selva_feature_extractor: hash beginning, middle and end of video tensor instead of just first 1MB, avoiding collisions on videos with same opening frames - selva_sampler: derive SequenceConfig from model template via dataclasses.replace instead of hardcoding sampling_rate/spectrogram_frame_rate per mode Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 18:04:35 +02:00
Ethanfel	83b1da9520	chore: remove all PrismAudio code from main branch - Delete prismaudio_core/, data_utils/, scripts/, docs/plans/ - Delete PrismAudio nodes (feature_extractor, feature_loader, model_loader, sampler, text_only) - Delete PrismAudio workflows (video_to_audio, text_to_audio) - Clean nodes/utils.py: rename PRISMAUDIO_CATEGORY → SELVA_CATEGORY, remove unused helpers - Strip PrismAudio-only deps from requirements.txt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 17:58:31 +02:00
Ethanfel	679a607a85	feat: wire prompt output from feature extractor to sampler in demo workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 17:13:23 +02:00
Ethanfel	d495939367	docs: rewrite README for SelVA Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 17:12:28 +02:00
Ethanfel	982d66e078	chore: remove PrismAudio nodes from selva-integration branch This branch registers only the three SelVA nodes. PrismAudio nodes stay on master/feature/lora-trainer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 17:01:21 +02:00
Ethanfel	b4124f58b3	fix: BigVGANv2._from_pretrained() compat with newer huggingface_hub Newer hf_hub stopped passing proxies/resume_download/local_files_only/token to _from_pretrained(). Give them defaults so the call doesn't fail when these kwargs are omitted. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:51:48 +02:00
Ethanfel	2c9d521565	fix: 44k generator HF paths use 44khz suffix (not 44k) Actual filenames in jnwnlee/SelVA: generator_*_44khz_sup_5.pth. download_utils.py had the wrong names so those MD5s are unverified — set to None to skip MD5 check for 44k generators. All other files verified/unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:46:20 +02:00
Ethanfel	28229d62ce	fix: MD5 validation on existing files — re-download if corrupt Previously _ensure() trusted any existing file. Files downloaded by the broken requests-based code (HTML error pages) would be silently reused. Now checks MD5 on every load; deletes and re-downloads on mismatch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:42:38 +02:00
Ethanfel	92593189f0	fix: use huggingface_hub for downloads instead of raw requests download_utils.py used requests without auth — jnwnlee/SelVA returned an HTML error page which torch then failed to unpickle ('E' / opcode 69). huggingface_hub.hf_hub_download() handles HF_TOKEN auth automatically, validates downloads, and retries. Files are still copied to models/selva/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:41:29 +02:00
Ethanfel	614a2e02aa	fix: weights_only=False for SelVA checkpoints (PyTorch 2.6 compat) PyTorch 2.6 changed the default to weights_only=True. SelVA checkpoints contain non-tensor types (numpy scalars etc.) that fail strict unpickling. All weights come from trusted sources (jnwnlee/selva HF repo). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:38:31 +02:00
Ethanfel	40388ba6de	fix: negative_prompt inline (multiline:false) + VAE filename v1-44.pth not v1-44k.pth - SelvaSampler: multiline:false puts negative_prompt inline above sliders - SelvaModelLoader: VAE filenames in download_utils are v1-16.pth/v1-44.pth, not v1-{mode}.pth (mode includes the 'k' suffix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:35:17 +02:00
Ethanfel	789e09535d	fix: SelvaSampler — negative_prompt above settings Move negative_prompt to required inputs, right after prompt, so it appears above duration/steps/cfg/seed in the ComfyUI node layout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:31:53 +02:00
Ethanfel	4da4858e4a	fix: inline prune helpers when removed from both transformers locations find_pruneable_heads_and_indices and prune_linear_layer were removed from both pytorch_utils and modeling_utils in some transformers builds. Provide minimal inline implementations as final fallback — prune_heads() is never called at inference time so correctness is only needed for completeness. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:30:58 +02:00
Ethanfel	ab8e1e5b7b	feat: SelvaFeatureExtractor outputs prompt as STRING Users can now wire the prompt output directly to SelvaSampler's prompt input, making the data flow explicit instead of relying on the implicit features fallback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:27:49 +02:00
Ethanfel	e3a3384727	fix: SelvaSampler input order — prompt required, negative_prompt optional ComfyUI renders required inputs above optional ones. Moving negative_prompt to optional puts prompt first (natural order) and negative_prompt at the bottom where it belongs as a power-user input. Also guards against negative_prompt=None when not connected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:27:07 +02:00
Ethanfel	9a985499e7	feat: auto-download SelVA weights on first use Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all weights). Models download to models/selva/ on first load. Synchformer reuses models/prismaudio/synchformer_state_dict.pth if already present (no duplicate download for PrismAudio users), otherwise downloads to models/selva/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:25:36 +02:00
Ethanfel	27b4424e1a	feat: prompt entered once in SelvaFeatureExtractor, reused by SelvaSampler SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the returned dict and the .npz cache). SelvaSampler's prompt is now optional — when left empty it falls back to the prompt stored in features. A non-empty override can still be passed when CLIP text should differ from the sync text. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:22:59 +02:00
Ethanfel	0e417f4078	fix: transformers compat — find_pruneable_heads_and_indices import Some transformers builds removed these from pytorch_utils. Fall back to modeling_utils which exposes them in all known versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:21:26 +02:00
Ethanfel	6474e2816c	fix: two bugs in SelVA nodes - selva_feature_extractor: cache hash now includes resolved duration; same video + different duration override no longer returns stale features - selva_sampler: MPS-safe noise generation (torch.Generator on CPU then move to device, same pattern as PrismAudioSampler) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:39:57 +02:00
Ethanfel	c23d210ab2	feat: SelVA video-to-audio example workflow LoadVideo → SelvaFeatureExtractor → SelvaSampler → PreviewAudio. Defaults: medium_44k, bf16, 25 steps, cfg=4.5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:31:53 +02:00
Ethanfel	b59b657b6f	feat: SelvaSampler — flow matching ODE with CFG and negative prompts Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to avoid rounding assertion mismatches. Progress bar tracks each Euler step. Supports negative prompts for steering, normalizes output to [-1,1]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:31:18 +02:00
Ethanfel	578b501d38	feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction CLIP frames at 8fps→384px (normalize inside FeaturesUtils). Sync frames at 25fps→224px, normalized to [-1,1] externally. T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned sync features extracted via TextSynch.encode_video_with_sync(). Results cached as .npz keyed by hash(frames[:1MB] + prompt + fps + variant). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:23:40 +02:00
Ethanfel	fe94438356	feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from models/prismaudio/ (no duplicate download). Supports four variants: small_16k / small_44k / medium_44k / large_44k. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:21:03 +02:00
Ethanfel	6bc3fd6443	chore: vendor selva_core from jnwnlee/selva@d7d40a9 Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes. Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy version incompatibility in some ComfyUI environments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:18:09 +02:00
Ethanfel	762b19fd3a	fix: return fps from non-cache extraction path The fps output was only returned on cache hits. Fresh extractions returned only features, leaving fps null. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:26:15 +01:00
Ethanfel	807a2e51fb	docs: fix README references — PrismAudio not ThinkSound Point links to huggingface.co/FunAudioLLM/PrismAudio and use public GitHub URL for install instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:16:31 +01:00
Ethanfel	67be94c45c	chore: add updated V2A example workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:13:06 +01:00
Ethanfel	681d230b0c	chore: update T2A workflow to match V2A style and current defaults Steps=100, cfg=7.0, randomize seed, consistent node format with aux_id/ver/ue_properties. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:11:20 +01:00
Ethanfel	62a3c5d0dc	docs: rewrite README to reflect current node design Update node descriptions, inputs/outputs, workflows, and environment setup to match current implementation (managed_env dropdown, VHS video_info, auto-duration, fps output, synchformer auto-resolve). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:10:07 +01:00
Ethanfel	30631c0cb4	fix: change fps output type from INT to FLOAT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:35 +01:00
Ethanfel	d0c9a72782	feat: add fps INT output to PrismAudioFeatureExtractor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:03 +01:00
Ethanfel	5b62be0447	chore: update default steps=100 and cfg_scale=7.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:03:48 +01:00
Ethanfel	abd315092b	feat: auto-use video duration from features when duration=0 Setting duration to 0 in PrismAudioSampler now reads the duration stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor). Default changed from 10.0 to 0.0 so V2A workflows are wired up automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:00:47 +01:00
Ethanfel	972d379369	refactor: simplify feature extractor inputs - Remove synchformer_ckpt input — always resolved from models/prismaudio/ (errors early with clear message if missing) - Replace python_env string input with dropdown: managed_env (isolated auto-created venv, default) or comfyui_env (current Python, with warning) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:55:08 +01:00
Ethanfel	8969d407f6	feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor When the VHS LoadVideo video_info output is connected, loaded_fps is used automatically instead of the manual fps input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:52:51 +01:00
Ethanfel	707ccb463e	perf: replace MP4 encode/decode with lossless .npy frame transfer Saves frames as uint8 .npy instead of H.264 MP4, eliminating the lossy codec roundtrip. extract_features.py loads .npy directly and skips decord when given a numpy file. Passes --source_fps for correct temporal sampling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:50:35 +01:00
Ethanfel	c38df8c6fa	chore: remove debug options and diagnostic logging Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler, DIT velocity diagnostics, conditioner stats logging, and feature stats prints from both sampler.py and text_only.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:47:00 +01:00
Ethanfel	2f626d8a96	fix: use videoprism_lvt_public_v1_large with joint video-text forward The wrong model (videoprism_public_v1_large, vision-only) was used, causing V2A audio distortion. Switch to the LvT variant which has a text tower, pass CoT captions for joint encoding, and extract per-frame features from outputs['frame_embeddings'] (L2-normalized, [T, 1024]) instead of manually averaging spatial patches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:37:02 +01:00
Ethanfel	1d8b9b59e0	debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:57:03 +01:00
Ethanfel	8bf4a0c3fc	debug: log conditioner output stats and T2A text feature stats Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after _substitute_empty_features) to both sampler and text_only nodes. Also add raw T5 text feature stats in T2A before conditioning. This lets us directly compare: - T2A vs V2A conditioning outputs to find which path differs - T2A vs npz text feature ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:39:44 +01:00
Ethanfel	477fe0f08f	debug: add latent and audio stats logging to V2A sampler Match the diagnostic output already in text_only.py to compare V2A vs T2A latent distributions and diagnose conditioning issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:28:08 +01:00
Ethanfel	c0b7ccbcee	fix: substitute empty_clip_feat for video features when no video present Zero features through bias-free Cond_MLP produce near-zero activations, not the learned null signal the model was trained with. Use empty_clip_feat (the learned null video embedding) just like empty_sync_feat for sync. Also improve text_prompt tooltip to encourage detailed CoT descriptions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:13:22 +01:00
Ethanfel	45633788a4	debug: add latent and audio stats logging to T2A node Print fakes latent stats (mean/std/min/max) and audio pre-norm stats to diagnose whether diffusion output is numerically reasonable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:06:39 +01:00
Ethanfel	11457fc27a	debug: fix VAE load_state_dict diagnostic — load into .model directly AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys. Load into pretransform.model (AudioAutoencoder) to get the return value and see actual missing/unexpected key counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:56:06 +01:00
Ethanfel	f2705b3063	debug: log weight load stats for diffusion and VAE checkpoints Print key counts, missing/unexpected keys, and sample key names to diagnose whether weights are actually loading correctly (strict=False silently hides mismatches that would cause garbage audio output). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:53:25 +01:00
Ethanfel	83a7f2787b	feat: add debug_zero_video/sync toggles and feature stats logging to sampler Allows isolating which feature set causes quality issues: - debug_zero_video: zero video_features → text+sync only - debug_zero_sync: zero sync_features → text+video only Also logs mean/std/shape for all three feature tensors on every run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:40:34 +01:00
Ethanfel	140cc5ee9a	feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16) Replace placeholder single-linear with proper architecture reverse-engineered from synchformer_state_dict.pth: - _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768] - _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp) - _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame - 12 blocks, dim=768, 8 frames/segment - Loads from vfeat_extractor.* prefix, skips 3D patch embed Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:28:20 +01:00
Ethanfel	f99d2666e8	fix: interpolate sync_cond to match audio sequence length in transformer Sync_MLP interpolates sync features based on video duration, but audio latent length depends on the user-set audio duration. When video != audio duration, the sequences diverge. Resample sync_cond to x's length before the gated addition so any video/audio duration combo works. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:21:39 +01:00
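
The download fixes in commits 28229d62ce and 2c9d521565 describe re-checking the MD5 of files already on disk, deleting and re-downloading on mismatch, and skipping the check when the expected MD5 is set to None. A minimal sketch of that behavior follows. It is not the repository's `_ensure()` code: `ensure_file`, `md5_of`, and the `download` callback are hypothetical names introduced here for illustration, and the real code delegates the download to huggingface_hub.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ensure_file(
    path: Path,
    expected_md5: Optional[str],
    download: Callable[[Path], None],
) -> Path:
    """Hypothetical helper: make sure `path` exists and matches `expected_md5`.

    An existing file is no longer trusted blindly: if its MD5 does not match,
    it is deleted and fetched again. Passing expected_md5=None skips the check
    (assumed here to mirror the unverified 44k generator entries).
    """
    if path.exists() and expected_md5 is not None:
        if md5_of(path) != expected_md5:
            path.unlink()  # corrupt or stale file, e.g. a saved HTML error page

    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)  # caller-supplied; assumed to write the file to `path`
        if expected_md5 is not None and md5_of(path) != expected_md5:
            path.unlink()
            raise RuntimeError(f"MD5 mismatch after download: {path}")

    return path
```

Hashing on every load costs one full read of each weight file, which is the trade-off the commit message accepts in exchange for never reusing a corrupt download.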

87 Commits