ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	e3a3384727	fix: SelvaSampler input order — prompt required, negative_prompt optional ComfyUI renders required inputs above optional ones. Moving negative_prompt to optional puts prompt first (natural order) and negative_prompt at the bottom where it belongs as a power-user input. Also guards against negative_prompt=None when not connected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:27:07 +02:00
Ethanfel	9a985499e7	feat: auto-download SelVA weights on first use Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all weights). Models download to models/selva/ on first load. Synchformer reuses models/prismaudio/synchformer_state_dict.pth if already present (no duplicate download for PrismAudio users), otherwise downloads to models/selva/. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:25:36 +02:00
Ethanfel	27b4424e1a	feat: prompt entered once in SelvaFeatureExtractor, reused by SelvaSampler SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the returned dict and the .npz cache). SelvaSampler's prompt is now optional — when left empty it falls back to the prompt stored in features. A non-empty override can still be passed when CLIP text should differ from the sync text. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:22:59 +02:00
Ethanfel	0e417f4078	fix: transformers compat — find_pruneable_heads_and_indices import Some transformers builds removed these from pytorch_utils. Fall back to modeling_utils which exposes them in all known versions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 16:21:26 +02:00
Ethanfel	6474e2816c	fix: two bugs in SelVA nodes - selva_feature_extractor: cache hash now includes resolved duration; same video + different duration override no longer returns stale features - selva_sampler: MPS-safe noise generation (torch.Generator on CPU then move to device, same pattern as PrismAudioSampler) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:39:57 +02:00
Ethanfel	c23d210ab2	feat: SelVA video-to-audio example workflow LoadVideo → SelvaFeatureExtractor → SelvaSampler → PreviewAudio. Defaults: medium_44k, bf16, 25 steps, cfg=4.5. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:31:53 +02:00
Ethanfel	b59b657b6f	feat: SelvaSampler — flow matching ODE with CFG and negative prompts Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to avoid rounding assertion mismatches. Progress bar tracks each Euler step. Supports negative prompts for steering, normalizes output to [-1,1]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:31:18 +02:00
Ethanfel	578b501d38	feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction CLIP frames at 8fps→384px (normalize inside FeaturesUtils). Sync frames at 25fps→224px, normalized to [-1,1] externally. T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned sync features extracted via TextSynch.encode_video_with_sync(). Results cached as .npz keyed by hash(frames[:1MB] + prompt + fps + variant). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:23:40 +02:00
Ethanfel	fe94438356	feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from models/prismaudio/ (no duplicate download). Supports four variants: small_16k / small_44k / medium_44k / large_44k. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:21:03 +02:00
Ethanfel	6bc3fd6443	chore: vendor selva_core from jnwnlee/selva@d7d40a9 Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes. Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy version incompatibility in some ComfyUI environments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:18:09 +02:00
Ethanfel	762b19fd3a	fix: return fps from non-cache extraction path The fps output was only returned on cache hits. Fresh extractions returned only features, leaving fps null. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:26:15 +01:00
Ethanfel	807a2e51fb	docs: fix README references — PrismAudio not ThinkSound Point links to huggingface.co/FunAudioLLM/PrismAudio and use public GitHub URL for install instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:16:31 +01:00
Ethanfel	67be94c45c	chore: add updated V2A example workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:13:06 +01:00
Ethanfel	681d230b0c	chore: update T2A workflow to match V2A style and current defaults Steps=100, cfg=7.0, randomize seed, consistent node format with aux_id/ver/ue_properties. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:11:20 +01:00
Ethanfel	62a3c5d0dc	docs: rewrite README to reflect current node design Update node descriptions, inputs/outputs, workflows, and environment setup to match current implementation (managed_env dropdown, VHS video_info, auto-duration, fps output, synchformer auto-resolve). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:10:07 +01:00
Ethanfel	30631c0cb4	fix: change fps output type from INT to FLOAT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:35 +01:00
Ethanfel	d0c9a72782	feat: add fps INT output to PrismAudioFeatureExtractor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:03 +01:00
Ethanfel	5b62be0447	chore: update default steps=100 and cfg_scale=7.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:03:48 +01:00
Ethanfel	abd315092b	feat: auto-use video duration from features when duration=0 Setting duration to 0 in PrismAudioSampler now reads the duration stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor). Default changed from 10.0 to 0.0 so V2A workflows are wired up automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:00:47 +01:00
Ethanfel	972d379369	refactor: simplify feature extractor inputs - Remove synchformer_ckpt input — always resolved from models/prismaudio/ (errors early with clear message if missing) - Replace python_env string input with dropdown: managed_env (isolated auto-created venv, default) or comfyui_env (current Python, with warning) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:55:08 +01:00
Ethanfel	8969d407f6	feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor When the VHS LoadVideo video_info output is connected, loaded_fps is used automatically instead of the manual fps input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:52:51 +01:00
Ethanfel	707ccb463e	perf: replace MP4 encode/decode with lossless .npy frame transfer Saves frames as uint8 .npy instead of H.264 MP4, eliminating the lossy codec roundtrip. extract_features.py loads .npy directly and skips decord when given a numpy file. Passes --source_fps for correct temporal sampling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:50:35 +01:00
Ethanfel	c38df8c6fa	chore: remove debug options and diagnostic logging Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler, DIT velocity diagnostics, conditioner stats logging, and feature stats prints from both sampler.py and text_only.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:47:00 +01:00
Ethanfel	2f626d8a96	fix: use videoprism_lvt_public_v1_large with joint video-text forward The wrong model (videoprism_public_v1_large, vision-only) was used, causing V2A audio distortion. Switch to the LvT variant which has a text tower, pass CoT captions for joint encoding, and extract per-frame features from outputs['frame_embeddings'] (L2-normalized, [T, 1024]) instead of manually averaging spatial patches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:37:02 +01:00
Ethanfel	1d8b9b59e0	debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:57:03 +01:00
Ethanfel	8bf4a0c3fc	debug: log conditioner output stats and T2A text feature stats Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after _substitute_empty_features) to both sampler and text_only nodes. Also add raw T5 text feature stats in T2A before conditioning. This lets us directly compare: - T2A vs V2A conditioning outputs to find which path differs - T2A vs npz text feature ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:39:44 +01:00
Ethanfel	477fe0f08f	debug: add latent and audio stats logging to V2A sampler Match the diagnostic output already in text_only.py to compare V2A vs T2A latent distributions and diagnose conditioning issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:28:08 +01:00
Ethanfel	c0b7ccbcee	fix: substitute empty_clip_feat for video features when no video present Zero features through bias-free Cond_MLP produce near-zero activations, not the learned null signal the model was trained with. Use empty_clip_feat (the learned null video embedding) just like empty_sync_feat for sync. Also improve text_prompt tooltip to encourage detailed CoT descriptions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:13:22 +01:00
Ethanfel	45633788a4	debug: add latent and audio stats logging to T2A node Print fakes latent stats (mean/std/min/max) and audio pre-norm stats to diagnose whether diffusion output is numerically reasonable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:06:39 +01:00
Ethanfel	11457fc27a	debug: fix VAE load_state_dict diagnostic — load into .model directly AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys. Load into pretransform.model (AudioAutoencoder) to get the return value and see actual missing/unexpected key counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:56:06 +01:00
Ethanfel	f2705b3063	debug: log weight load stats for diffusion and VAE checkpoints Print key counts, missing/unexpected keys, and sample key names to diagnose whether weights are actually loading correctly (strict=False silently hides mismatches that would cause garbage audio output). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:53:25 +01:00
Ethanfel	83a7f2787b	feat: add debug_zero_video/sync toggles and feature stats logging to sampler Allows isolating which feature set causes quality issues: - debug_zero_video: zero video_features → text+sync only - debug_zero_sync: zero sync_features → text+video only Also logs mean/std/shape for all three feature tensors on every run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:40:34 +01:00
Ethanfel	140cc5ee9a	feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16) Replace placeholder single-linear with proper architecture reverse-engineered from synchformer_state_dict.pth: - _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768] - _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp) - _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame - 12 blocks, dim=768, 8 frames/segment - Loads from vfeat_extractor.* prefix, skips 3D patch embed Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:28:20 +01:00
Ethanfel	f99d2666e8	fix: interpolate sync_cond to match audio sequence length in transformer Sync_MLP interpolates sync features based on video duration, but audio latent length depends on the user-set audio duration. When video != audio duration, the sequences diverge. Resample sync_cond to x's length before the gated addition so any video/audio duration combo works. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:21:39 +01:00
Ethanfel	934a401633	perf: replace PIL+PNG frame files with direct ffmpeg stdin pipe Stream raw RGB bytes from tensor directly to ffmpeg stdin. Eliminates all intermediate PNG file I/O — much faster for large frame counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:20:00 +01:00
Ethanfel	b3ac9ab22f	feat: log MP4 conversion time before subprocess spawn Shows how long PIL+ffmpeg video export takes so we can see if that's contributing to the gap before [extract] output appears. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:19:26 +01:00
Ethanfel	ca87c41a2e	feat: add per-step timing to feature extraction logs Each step now prints elapsed seconds on completion. Total time printed at the end to identify bottlenecks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:13:42 +01:00
Ethanfel	63bd999dfa	fix: switch to VideoPrism large (1024-dim) and fix Synchformer output shape prismaudio.json conditioner config requires: - video_features: dim=1024 → switch videoprism_public_v1_base → large (ViT-L) - sync_features: dim=768, length divisible by 8 → expand [num_seg,768] to [num_seg*8,768] (per-frame) so Sync_MLP can reshape by groups of 8 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:07:17 +01:00
Ethanfel	20fb766ad2	fix: cast tensors to float32 before numpy() in feature save T5-Gemma outputs BFloat16 which numpy does not support. Cast all feature tensors with .float() before .numpy(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:56:52 +01:00
Ethanfel	93120eb6b9	feat: auto-resolve synchformer checkpoint from prismaudio models dir When synchformer_ckpt input is empty, look for synchformer_state_dict.pth in the ComfyUI prismaudio models directory automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:49:56 +01:00
Ethanfel	b1a2ee594e	fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep videoprism/__init__.py is empty — API lives in videoprism.models. Fix: from videoprism import models as vp (not import videoprism as vp). Also add flax to managed venv packages (required by videoprism Flax model). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:38:00 +01:00
Ethanfel	0f46e8359d	feat: switch managed venv to jax[cuda13] for GPU feature extraction RTX 6000 Pro (Blackwell SM 10.0) fully supports CUDA 13. Switch from jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses pip-managed CUDA libraries. Delete _extract_env to force a rebuild. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:33:45 +01:00
Ethanfel	06f8dbbab4	feat: add hf_token input and HF_TOKEN env forwarding to feature extractor google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth. Add optional hf_token input on the node; forward it (plus the legacy HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to HF_TOKEN from the host environment. Warn clearly when neither is set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:27:33 +01:00
Ethanfel	a6d584bd34	fix: treat empty python_env as auto-managed venv trigger Empty string from clearing the node field caused subprocess to execute '' which raises PermissionError. Now any blank or 'python' value uses the auto-installed venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:21:16 +01:00
Ethanfel	829f398ed0	feat: verbose step-by-step logging in feature extraction - extract_features.py: 6 numbered steps with shapes, fps, frame counts - feature_extractor.py: stream subprocess output live (capture_output=False) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:19:38 +01:00
Ethanfel	878025450a	feat: add data_utils package with FeaturesUtils implementation Creates data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils: - T5-Gemma text encoding via transformers - VideoPrism video encoding via JAX videoprism package - Synchformer visual encoder loading from checkpoint Also fixes extract_features.py to add plugin root to sys.path so data_utils is importable in the subprocess venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:14:34 +01:00
Ethanfel	f32456a142	feat: add fps input to PrismAudioFeatureExtractor Exposes the video frame rate as an optional input (default 30). Correct FPS ensures accurate temporal frame sampling in VideoPrism and Synchformer feature extraction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:08:10 +01:00
Ethanfel	c416045ace	fix: replace torchvision.io.write_video with PIL+ffmpeg write_video requires the optional 'av' (PyAV) package. Use PIL to save frames as PNGs then combine with ffmpeg, which is always present in ComfyUI Docker images. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:03:39 +01:00
Ethanfel	824550bed3	feat: verbose per-package progress during venv auto-install Installs each package individually with [n/total] counters and pip progress bars, so failures pinpoint the exact failing package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:00:04 +01:00
Ethanfel	8f2e204146	fix: show pip output, handle incomplete venv, fix TF version for Python 3.12 - tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0 - capture_output=False so pip errors are visible in ComfyUI logs - clean up incomplete venv dir before retrying install Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:55:55 +01:00
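
Note on the feature cache key (commits 578b501d38 and 6474e2816c): the log describes a .npz cache keyed by a hash of the first 1 MB of frame data plus prompt, fps and variant, later extended to include the resolved duration. A minimal sketch of that idea follows. The function name, the choice of SHA-256, the field separator and the 16-character truncation are all assumptions made for illustration; the log does not state which hash or encoding the node actually uses.

```python
import hashlib


def feature_cache_key(frame_bytes: bytes, prompt: str, fps: float,
                      variant: str, duration: float) -> str:
    """Hypothetical helper: derive a cache filename stem from the extraction inputs."""
    h = hashlib.sha256()                  # assumption: the log does not name the hash
    h.update(frame_bytes[:1024 * 1024])   # only the first 1 MB of frame data is hashed
    # Fields are joined with a separator so ("ab", "c") and ("a", "bc") hash differently.
    meta = "\x1f".join([prompt, repr(float(fps)), variant, repr(float(duration))])
    h.update(meta.encode("utf-8"))
    return h.hexdigest()[:16]


# Same video and prompt, different duration override: the keys must differ,
# which is the stale-cache case that 6474e2816c describes fixing.
frames = bytes(range(256)) * 8
key_a = feature_cache_key(frames, "dog barking", 25.0, "medium_44k", 8.0)
key_b = feature_cache_key(frames, "dog barking", 25.0, "medium_44k", 4.0)
```

Because the duration is part of the hashed metadata, changing only the duration override yields a new key and forces a fresh extraction instead of returning stale features.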


221 Commits
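
Note on prompt handling in SelvaSampler (commits 27b4424e1a and e3a3384727): the log says an empty prompt falls back to the prompt stored in SELVA_FEATURES, a non-empty prompt overrides it, and an unconnected negative_prompt may arrive as None. The sketch below shows one way those rules could be expressed. The helper name and the "prompt" dictionary key are hypothetical, chosen only for illustration.

```python
from typing import Optional, Tuple


def resolve_prompts(features: dict, prompt: Optional[str] = "",
                    negative_prompt: Optional[str] = None) -> Tuple[str, str]:
    """Hypothetical helper mirroring the fallback rules described in the log."""
    # A non-empty override wins; otherwise reuse the prompt stored with the features.
    text = (prompt or "").strip() or features.get("prompt", "")
    # An unconnected optional input arrives as None; treat it as an empty string.
    negative = negative_prompt or ""
    return text, negative


stored = {"prompt": "rain on a tin roof"}   # assumed key name
fallback = resolve_prompts(stored)                           # uses the stored prompt
override = resolve_prompts(stored, "thunder", "music")       # explicit override
```

Keeping the fallback in one small function means the sampler never passes None to the text encoder, whichever inputs are connected.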