ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	6474e2816c	fix: two bugs in SelVA nodes - selva_feature_extractor: cache hash now includes resolved duration; same video + different duration override no longer returns stale features - selva_sampler: MPS-safe noise generation (torch.Generator on CPU then move to device, same pattern as PrismAudioSampler) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:39:57 +02:00
Ethanfel	b59b657b6f	feat: SelvaSampler — flow matching ODE with CFG and negative prompts Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to avoid rounding assertion mismatches. Progress bar tracks each Euler step. Supports negative prompts for steering, normalizes output to [-1,1]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:31:18 +02:00
Ethanfel	578b501d38	feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction CLIP frames at 8fps→384px (normalize inside FeaturesUtils). Sync frames at 25fps→224px, normalized to [-1,1] externally. T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned sync features extracted via TextSynch.encode_video_with_sync(). Results cached as .npz keyed by hash(frames[:1MB] + prompt + fps + variant). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:23:40 +02:00
Ethanfel	fe94438356	feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from models/prismaudio/ (no duplicate download). Supports four variants: small_16k / small_44k / medium_44k / large_44k. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-04 15:21:03 +02:00
Ethanfel	762b19fd3a	fix: return fps from non-cache extraction path The fps output was only returned on cache hits. Fresh extractions returned only features, leaving fps null. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:26:15 +01:00
Ethanfel	30631c0cb4	fix: change fps output type from INT to FLOAT Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:35 +01:00
Ethanfel	d0c9a72782	feat: add fps INT output to PrismAudioFeatureExtractor Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:05:03 +01:00
Ethanfel	5b62be0447	chore: update default steps=100 and cfg_scale=7.0 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:03:48 +01:00
Ethanfel	abd315092b	feat: auto-use video duration from features when duration=0 Setting duration to 0 in PrismAudioSampler now reads the duration stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor). Default changed from 10.0 to 0.0 so V2A workflows are wired up automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 11:00:47 +01:00
Ethanfel	972d379369	refactor: simplify feature extractor inputs - Remove synchformer_ckpt input — always resolved from models/prismaudio/ (errors early with clear message if missing) - Replace python_env string input with dropdown: managed_env (isolated auto-created venv, default) or comfyui_env (current Python, with warning) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:55:08 +01:00
Ethanfel	8969d407f6	feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor When the VHS LoadVideo video_info output is connected, loaded_fps is used automatically instead of the manual fps input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:52:51 +01:00
Ethanfel	707ccb463e	perf: replace MP4 encode/decode with lossless .npy frame transfer Saves frames as uint8 .npy instead of H.264 MP4, eliminating the lossy codec roundtrip. extract_features.py loads .npy directly and skips decord when given a numpy file. Passes --source_fps for correct temporal sampling. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:50:35 +01:00
Ethanfel	c38df8c6fa	chore: remove debug options and diagnostic logging Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler, DIT velocity diagnostics, conditioner stats logging, and feature stats prints from both sampler.py and text_only.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:47:00 +01:00
Ethanfel	1d8b9b59e0	debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 23:57:03 +01:00
Ethanfel	8bf4a0c3fc	debug: log conditioner output stats and T2A text feature stats Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after _substitute_empty_features) to both sampler and text_only nodes. Also add raw T5 text feature stats in T2A before conditioning. This lets us directly compare: - T2A vs V2A conditioning outputs to find which path differs - T2A vs npz text feature ranges Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:39:44 +01:00
Ethanfel	477fe0f08f	debug: add latent and audio stats logging to V2A sampler Match the diagnostic output already in text_only.py to compare V2A vs T2A latent distributions and diagnose conditioning issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:28:08 +01:00
Ethanfel	c0b7ccbcee	fix: substitute empty_clip_feat for video features when no video present Zero features through bias-free Cond_MLP produce near-zero activations, not the learned null signal the model was trained with. Use empty_clip_feat (the learned null video embedding) just like empty_sync_feat for sync. Also improve text_prompt tooltip to encourage detailed CoT descriptions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:13:22 +01:00
Ethanfel	45633788a4	debug: add latent and audio stats logging to T2A node Print fakes latent stats (mean/std/min/max) and audio pre-norm stats to diagnose whether diffusion output is numerically reasonable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 22:06:39 +01:00
Ethanfel	11457fc27a	debug: fix VAE load_state_dict diagnostic — load into .model directly AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys. Load into pretransform.model (AudioAutoencoder) to get the return value and see actual missing/unexpected key counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:56:06 +01:00
Ethanfel	f2705b3063	debug: log weight load stats for diffusion and VAE checkpoints Print key counts, missing/unexpected keys, and sample key names to diagnose whether weights are actually loading correctly (strict=False silently hides mismatches that would cause garbage audio output). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:53:25 +01:00
Ethanfel	83a7f2787b	feat: add debug_zero_video/sync toggles and feature stats logging to sampler Allows isolating which feature set causes quality issues: - debug_zero_video: zero video_features → text+sync only - debug_zero_sync: zero sync_features → text+video only Also logs mean/std/shape for all three feature tensors on every run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:40:34 +01:00
Ethanfel	934a401633	perf: replace PIL+PNG frame files with direct ffmpeg stdin pipe Stream raw RGB bytes from tensor directly to ffmpeg stdin. Eliminates all intermediate PNG file I/O — much faster for large frame counts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:20:00 +01:00
Ethanfel	b3ac9ab22f	feat: log MP4 conversion time before subprocess spawn Shows how long PIL+ffmpeg video export takes so we can see if that's contributing to the gap before [extract] output appears. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:19:26 +01:00
Ethanfel	93120eb6b9	feat: auto-resolve synchformer checkpoint from prismaudio models dir When synchformer_ckpt input is empty, look for synchformer_state_dict.pth in the ComfyUI prismaudio models directory automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:49:56 +01:00
Ethanfel	b1a2ee594e	fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep videoprism/__init__.py is empty — API lives in videoprism.models. Fix: from videoprism import models as vp (not import videoprism as vp). Also add flax to managed venv packages (required by videoprism Flax model). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:38:00 +01:00
Ethanfel	0f46e8359d	feat: switch managed venv to jax[cuda13] for GPU feature extraction RTX 6000 Pro (Blackwell SM 10.0) fully supports CUDA 13. Switch from jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses pip-managed CUDA libraries. Delete _extract_env to force a rebuild. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:33:45 +01:00
Ethanfel	06f8dbbab4	feat: add hf_token input and HF_TOKEN env forwarding to feature extractor google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth. Add optional hf_token input on the node; forward it (plus the legacy HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to HF_TOKEN from the host environment. Warn clearly when neither is set. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:27:33 +01:00
Ethanfel	a6d584bd34	fix: treat empty python_env as auto-managed venv trigger Empty string from clearing the node field caused subprocess to execute '' which raises PermissionError. Now any blank or 'python' value uses the auto-installed venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:21:16 +01:00
Ethanfel	829f398ed0	feat: verbose step-by-step logging in feature extraction - extract_features.py: 6 numbered steps with shapes, fps, frame counts - feature_extractor.py: stream subprocess output live (capture_output=False) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:19:38 +01:00
Ethanfel	f32456a142	feat: add fps input to PrismAudioFeatureExtractor Exposes the video frame rate as an optional input (default 30). Correct FPS ensures accurate temporal frame sampling in VideoPrism and Synchformer feature extraction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:08:10 +01:00
Ethanfel	c416045ace	fix: replace torchvision.io.write_video with PIL+ffmpeg write_video requires the optional 'av' (PyAV) package. Use PIL to save frames as PNGs then combine with ffmpeg, which is always present in ComfyUI Docker images. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:03:39 +01:00
Ethanfel	824550bed3	feat: verbose per-package progress during venv auto-install Installs each package individually with [n/total] counters and pip progress bars, so failures pinpoint the exact failing package. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:00:04 +01:00
Ethanfel	8f2e204146	fix: show pip output, handle incomplete venv, fix TF version for Python 3.12 - tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0 - capture_output=False so pip errors are visible in ComfyUI logs - clean up incomplete venv dir before retrying install Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:55:55 +01:00
Ethanfel	8e3ab999f0	fix: load VAE state dict with strict=False vae.ckpt is a full training checkpoint containing discriminator, STFT loss modules, and EMA wrappers that are absent from the inference AudioAutoencoder. strict=False ignores these training-only keys while still loading all encoder/decoder/bottleneck weights correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:51:51 +01:00
Ethanfel	35d0615253	feat: auto-install pip venv for feature extraction on first use PrismAudioFeatureExtractor now creates and populates a managed venv (_extract_env/) automatically when python_env is left as the default 'python'. Also adds scripts/install_extract_env.sh for manual/Docker setup without conda. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 19:27:27 +01:00
Ethanfel	618e7de64b	feat: PrismAudioTextOnly node with correct T5-Gemma encoding Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:09:11 +01:00
Ethanfel	3d62688e8c	feat: PrismAudioSampler node with correct metadata format and peak normalization Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:07:33 +01:00
Ethanfel	7c54ee8482	feat: PrismAudioFeatureExtractor node with subprocess bridge and conda env Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:06:10 +01:00
Ethanfel	3f35aa39f2	feat: PrismAudioFeatureLoader node for pre-computed .npz files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:04:32 +01:00
Ethanfel	1043f4bacb	feat: PrismAudioModelLoader node with auto-download and adaptive VRAM Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 18:02:47 +01:00
Ethanfel	baa80de194	feat: project scaffolding with shared utils and node registration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-27 16:59:21 +01:00

191 Commits
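
Note on the feature cache key: commit 578b501d38 describes the .npz cache as keyed by hash(frames[:1MB] + prompt + fps + variant), and commit 6474e2816c adds the resolved duration to that hash so that the same video with a different duration override no longer returns stale features. The following is a minimal sketch of that idea, not the repository's code. The function name `feature_cache_key`, the choice of SHA-256, and the length-prefixed field encoding are all assumptions made for illustration.

```python
import hashlib

ONE_MB = 1 << 20  # only the first 1 MB of frame bytes is hashed, per the commit message


def feature_cache_key(frame_bytes: bytes, prompt: str, fps: float,
                      variant: str, duration: float) -> str:
    """Hypothetical cache key: hashes every input that changes the extracted features."""
    h = hashlib.sha256()
    h.update(frame_bytes[:ONE_MB])
    # Length-prefix each field so adjacent fields cannot run together ambiguously.
    for field in (prompt, repr(float(fps)), variant, repr(float(duration))):
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


if __name__ == "__main__":
    frames = b"\x01" * (2 * ONE_MB)
    short = feature_cache_key(frames, "rain on a roof", 25.0, "small_16k", 4.0)
    longer = feature_cache_key(frames, "rain on a roof", 25.0, "small_16k", 8.0)
    print(short != longer)  # a different duration produces a different key
```

Leaving duration out of the key would make the two calls above collide, which is the stale-cache behaviour the fix commit describes.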