ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	6bc3fd6443	chore: vendor selva_core from jnwnlee/selva@d7d40a9 Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes. Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy version incompatibility in some ComfyUI environments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:18:09 +02:00
Ethanfel	762b19fd3a	fix: return fps from non-cache extraction path The fps output was only returned on cache hits. Fresh extractions returned only features, leaving fps null. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:26:15 +01:00
Ethanfel	807a2e51fb	docs: fix README references — PrismAudio not ThinkSound Point links to huggingface.co/FunAudioLLM/PrismAudio and use public GitHub URL for install instructions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:16:31 +01:00
Ethanfel	67be94c45c	chore: add updated V2A example workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:13:06 +01:00
Ethanfel	681d230b0c	chore: update T2A workflow to match V2A style and current defaults Steps=100, cfg=7.0, randomize seed, consistent node format with aux_id/ver/ue_properties. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:11:20 +01:00
Ethanfel	62a3c5d0dc	docs: rewrite README to reflect current node design Update node descriptions, inputs/outputs, workflows, and environment setup to match current implementation (managed_env dropdown, VHS video_info, auto-duration, fps output, synchformer auto-resolve). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:10:07 +01:00
Ethanfel	30631c0cb4	fix: change fps output type from INT to FLOAT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:35 +01:00
Ethanfel	d0c9a72782	feat: add fps INT output to PrismAudioFeatureExtractor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:03 +01:00
Ethanfel	5b62be0447	chore: update default steps=100 and cfg_scale=7.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:03:48 +01:00
Ethanfel	abd315092b	feat: auto-use video duration from features when duration=0 Setting duration to 0 in PrismAudioSampler now reads the duration stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor). Default changed from 10.0 to 0.0 so V2A workflows are wired up automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:00:47 +01:00
Ethanfel	972d379369	refactor: simplify feature extractor inputs - Remove synchformer_ckpt input — always resolved from models/prismaudio/ (errors early with clear message if missing) - Replace python_env string input with dropdown: managed_env (isolated auto-created venv, default) or comfyui_env (current Python, with warning) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:55:08 +01:00
Ethanfel	8969d407f6	feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor When the VHS LoadVideo video_info output is connected, loaded_fps is used automatically instead of the manual fps input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:52:51 +01:00
Ethanfel	707ccb463e	perf: replace MP4 encode/decode with lossless .npy frame transfer Saves frames as uint8 .npy instead of H.264 MP4, eliminating the lossy codec roundtrip. extract_features.py loads .npy directly and skips decord when given a numpy file. Passes --source_fps for correct temporal sampling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:50:35 +01:00
Ethanfel	c38df8c6fa	chore: remove debug options and diagnostic logging Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler, DIT velocity diagnostics, conditioner stats logging, and feature stats prints from both sampler.py and text_only.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:47:00 +01:00
Ethanfel	2f626d8a96	fix: use videoprism_lvt_public_v1_large with joint video-text forward The wrong model (videoprism_public_v1_large, vision-only) was used, causing V2A audio distortion. Switch to the LvT variant which has a text tower, pass CoT captions for joint encoding, and extract per-frame features from outputs['frame_embeddings'] (L2-normalized, [T, 1024]) instead of manually averaging spatial patches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:37:02 +01:00
Ethanfel	1d8b9b59e0	debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:57:03 +01:00
Ethanfel	8bf4a0c3fc	debug: log conditioner output stats and T2A text feature stats Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after _substitute_empty_features) to both sampler and text_only nodes. Also add raw T5 text feature stats in T2A before conditioning. This lets us directly compare: - T2A vs V2A conditioning outputs to find which path differs - T2A vs npz text feature ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:39:44 +01:00
Ethanfel	477fe0f08f	debug: add latent and audio stats logging to V2A sampler Match the diagnostic output already in text_only.py to compare V2A vs T2A latent distributions and diagnose conditioning issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:28:08 +01:00
Ethanfel	c0b7ccbcee	fix: substitute empty_clip_feat for video features when no video present Zero features through bias-free Cond_MLP produce near-zero activations, not the learned null signal the model was trained with. Use empty_clip_feat (the learned null video embedding) just like empty_sync_feat for sync. Also improve text_prompt tooltip to encourage detailed CoT descriptions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:13:22 +01:00
Ethanfel	45633788a4	debug: add latent and audio stats logging to T2A node Print fakes latent stats (mean/std/min/max) and audio pre-norm stats to diagnose whether diffusion output is numerically reasonable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:06:39 +01:00
Ethanfel	11457fc27a	debug: fix VAE load_state_dict diagnostic — load into .model directly AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys. Load into pretransform.model (AudioAutoencoder) to get the return value and see actual missing/unexpected key counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:56:06 +01:00
Ethanfel	f2705b3063	debug: log weight load stats for diffusion and VAE checkpoints Print key counts, missing/unexpected keys, and sample key names to diagnose whether weights are actually loading correctly (strict=False silently hides mismatches that would cause garbage audio output). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:53:25 +01:00
Ethanfel	83a7f2787b	feat: add debug_zero_video/sync toggles and feature stats logging to sampler Allows isolating which feature set causes quality issues: - debug_zero_video: zero video_features → text+sync only - debug_zero_sync: zero sync_features → text+video only Also logs mean/std/shape for all three feature tensors on every run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:40:34 +01:00
Ethanfel	140cc5ee9a	feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16) Replace placeholder single-linear with proper architecture reverse-engineered from synchformer_state_dict.pth: - _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768] - _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp) - _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame - 12 blocks, dim=768, 8 frames/segment - Loads from vfeat_extractor.* prefix, skips 3D patch embed Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:28:20 +01:00
Ethanfel	f99d2666e8	fix: interpolate sync_cond to match audio sequence length in transformer Sync_MLP interpolates sync features based on video duration, but audio latent length depends on the user-set audio duration. When video != audio duration, the sequences diverge. Resample sync_cond to x's length before the gated addition so any video/audio duration combo works. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:21:39 +01:00
Ethanfel	934a401633	perf: replace PIL+PNG frame files with direct ffmpeg stdin pipe Stream raw RGB bytes from tensor directly to ffmpeg stdin. Eliminates all intermediate PNG file I/O — much faster for large frame counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:20:00 +01:00
Ethanfel	b3ac9ab22f	feat: log MP4 conversion time before subprocess spawn Shows how long PIL+ffmpeg video export takes so we can see if that's contributing to the gap before [extract] output appears. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:19:26 +01:00
Ethanfel	ca87c41a2e	feat: add per-step timing to feature extraction logs Each step now prints elapsed seconds on completion. Total time printed at the end to identify bottlenecks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:13:42 +01:00
Ethanfel	63bd999dfa	fix: switch to VideoPrism large (1024-dim) and fix Synchformer output shape prismaudio.json conditioner config requires: - video_features: dim=1024 → switch videoprism_public_v1_base → large (ViT-L) - sync_features: dim=768, length divisible by 8 → expand [num_seg,768] to [num_seg*8,768] (per-frame) so Sync_MLP can reshape by groups of 8 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:07:17 +01:00
Ethanfel	20fb766ad2	fix: cast tensors to float32 before numpy() in feature save T5-Gemma outputs BFloat16 which numpy does not support. Cast all feature tensors with .float() before .numpy(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:56:52 +01:00
Ethanfel	93120eb6b9	feat: auto-resolve synchformer checkpoint from prismaudio models dir When synchformer_ckpt input is empty, look for synchformer_state_dict.pth in the ComfyUI prismaudio models directory automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:49:56 +01:00
Ethanfel	b1a2ee594e	fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep videoprism/__init__.py is empty — API lives in videoprism.models. Fix: from videoprism import models as vp (not import videoprism as vp). Also add flax to managed venv packages (required by videoprism Flax model). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:38:00 +01:00
Ethanfel	0f46e8359d	feat: switch managed venv to jax[cuda13] for GPU feature extraction RTX 6000 Pro (Blackwell SM 10.0) fully supports CUDA 13. Switch from jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses pip-managed CUDA libraries. Delete _extract_env to force a rebuild. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:33:45 +01:00
Ethanfel	06f8dbbab4	feat: add hf_token input and HF_TOKEN env forwarding to feature extractor google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth. Add optional hf_token input on the node; forward it (plus the legacy HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to HF_TOKEN from the host environment. Warn clearly when neither is set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:27:33 +01:00
Ethanfel	a6d584bd34	fix: treat empty python_env as auto-managed venv trigger Empty string from clearing the node field caused subprocess to execute '' which raises PermissionError. Now any blank or 'python' value uses the auto-installed venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:21:16 +01:00
Ethanfel	829f398ed0	feat: verbose step-by-step logging in feature extraction - extract_features.py: 6 numbered steps with shapes, fps, frame counts - feature_extractor.py: stream subprocess output live (capture_output=False) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:19:38 +01:00
Ethanfel	878025450a	feat: add data_utils package with FeaturesUtils implementation Creates data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils: - T5-Gemma text encoding via transformers - VideoPrism video encoding via JAX videoprism package - Synchformer visual encoder loading from checkpoint Also fixes extract_features.py to add plugin root to sys.path so data_utils is importable in the subprocess venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:14:34 +01:00
Ethanfel	f32456a142	feat: add fps input to PrismAudioFeatureExtractor Exposes the video frame rate as an optional input (default 30). Correct FPS ensures accurate temporal frame sampling in VideoPrism and Synchformer feature extraction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:08:10 +01:00
Ethanfel	c416045ace	fix: replace torchvision.io.write_video with PIL+ffmpeg write_video requires the optional 'av' (PyAV) package. Use PIL to save frames as PNGs then combine with ffmpeg, which is always present in ComfyUI Docker images. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:03:39 +01:00
Ethanfel	824550bed3	feat: verbose per-package progress during venv auto-install Installs each package individually with [n/total] counters and pip progress bars, so failures pinpoint the exact failing package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:00:04 +01:00
Ethanfel	8f2e204146	fix: show pip output, handle incomplete venv, fix TF version for Python 3.12 - tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0 - capture_output=False so pip errors are visible in ComfyUI logs - clean up incomplete venv dir before retrying install Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:55:55 +01:00
Ethanfel	8e3ab999f0	fix: load VAE state dict with strict=False vae.ckpt is a full training checkpoint containing discriminator, STFT loss modules, and EMA wrappers that are absent from the inference AudioAutoencoder. strict=False ignores these training-only keys while still loading all encoder/decoder/bottleneck weights correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:51:51 +01:00
Ethanfel	afc7d5b657	fix: add missing runtime dependencies to requirements.txt einops-exts, vector-quantize-pytorch, scipy were imported by prismaudio_core but not listed in requirements.txt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:48:33 +01:00
Ethanfel	e372cdc488	fix: add plugin root to sys.path so prismaudio_core is importable ComfyUI does not add the custom node directory to sys.path automatically, so prismaudio_core (a package inside the plugin dir) was not found at runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:41:11 +01:00
Ethanfel	7671d296fa	fix: remove spurious caption_cot input entry from video_to_audio workflow Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:39:05 +01:00
Ethanfel	3894fcc9b4	feat: add demo workflows for text-to-audio and video-to-audio Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:32:24 +01:00
Ethanfel	35d0615253	feat: auto-install pip venv for feature extraction on first use PrismAudioFeatureExtractor now creates and populates a managed venv (_extract_env/) automatically when python_env is left as the default 'python'. Also adds scripts/install_extract_env.sh for manual/Docker setup without conda. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:27:27 +01:00
Ethanfel	9b1cb71b2a	fix: remove MMDiTWrapper import and dead code paths from factory.py MMDiTWrapper was removed from diffusion.py during cleanup but the import in factory.py was missed, causing ImportError on every model load. Also stub wavelet and diffusion_prior paths that reference deleted modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:12:40 +01:00
Ethanfel	807f00417f	docs: README with installation and usage instructions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:15:17 +01:00
Ethanfel	618e7de64b	feat: PrismAudioTextOnly node with correct T5-Gemma encoding Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:09:11 +01:00
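
Commits a6d584bd34 and 06f8dbbab4 describe how the feature-extraction subprocess picks its interpreter and receives a Hugging Face token. The following is a minimal sketch of that behavior, not the repository's code: the function names `resolve_python` and `build_subprocess_env`, the `managed_python` parameter, and the warning text are hypothetical. Only the rules come from the commit messages (blank or 'python' means the managed venv; the token is forwarded as HF_TOKEN plus the legacy HUGGING_FACE_HUB_TOKEN alias, with a fallback to the host's HF_TOKEN).

```python
import os
from typing import Dict, Mapping, Optional


def resolve_python(python_env: Optional[str], managed_python: str) -> str:
    """Pick the interpreter for the extraction subprocess.

    Hypothetical helper: a blank value or the bare default 'python'
    selects the auto-managed venv interpreter instead of executing ''.
    """
    value = (python_env or "").strip()
    if value in ("", "python"):
        return managed_python
    return value


def build_subprocess_env(
    hf_token: Optional[str] = None,
    host_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy the host environment and forward a Hugging Face token.

    Hypothetical helper: the node input wins, then the host's HF_TOKEN.
    The token is exported under both HF_TOKEN and the legacy
    HUGGING_FACE_HUB_TOKEN alias. A warning is printed when neither is set.
    """
    env = dict(os.environ if host_env is None else host_env)
    token = (hf_token or "").strip() or env.get("HF_TOKEN", "")
    if token:
        env["HF_TOKEN"] = token
        env["HUGGING_FACE_HUB_TOKEN"] = token
    else:
        print("[warning] no Hugging Face token set; gated model downloads may fail")
    return env
```

The resulting values could then be passed to `subprocess.run([python_path, script, ...], env=env)`, where the script path and arguments depend on the plugin layout.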

62 Commits
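
Commits 8969d407f6 and abd315092b describe two defaults: the frame rate is taken from a connected VHS video_info when available, and a sampler duration of 0 means "use the duration stored in the features dict". The sketch below illustrates those two rules under stated assumptions. The function names and the `"duration"` key are hypothetical; `loaded_fps` is the field named in the commit message.

```python
from typing import Any, Mapping, Optional


def resolve_fps(manual_fps: float,
                video_info: Optional[Mapping[str, Any]] = None) -> float:
    """Prefer loaded_fps from a connected video_info over the manual input."""
    if video_info is not None and "loaded_fps" in video_info:
        return float(video_info["loaded_fps"])
    return float(manual_fps)


def resolve_duration(duration: float, features: Mapping[str, Any]) -> float:
    """Return the requested duration, or the one stored with the features.

    Assumption: the extractor stores the clip length under a "duration"
    key; the real key name in PRISMAUDIO_FEATURES may differ.
    """
    if duration > 0:
        return float(duration)
    stored = features.get("duration")
    if stored is None or stored <= 0:
        raise ValueError("duration is 0 and the features carry no usable duration")
    return float(stored)
```

Failing loudly when no stored duration exists is a design choice of this sketch; the commit message does not say how the node handles that case.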