ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	2f626d8a96	fix: use videoprism_lvt_public_v1_large with joint video-text forward. The wrong model (videoprism_public_v1_large, vision-only) was used, causing V2A audio distortion. Switch to the LvT variant, which has a text tower; pass CoT captions for joint encoding; and extract per-frame features from outputs['frame_embeddings'] (L2-normalized, [T, 1024]) instead of manually averaging spatial patches. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 10:37:02 +01:00
Ethanfel	140cc5ee9a	feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16). Replace placeholder single-linear with proper architecture reverse-engineered from synchformer_state_dict.pth: _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768]; _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp); _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame; 12 blocks, dim=768, 8 frames/segment; loads from vfeat_extractor.* prefix, skips 3D patch embed. Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:28:20 +01:00
Ethanfel	63bd999dfa	fix: switch to VideoPrism large (1024-dim) and fix Synchformer output shape. prismaudio.json conditioner config requires: video_features: dim=1024 → switch videoprism_public_v1_base → large (ViT-L); sync_features: dim=768, length divisible by 8 → expand [num_seg,768] to [num_seg*8,768] (per-frame) so Sync_MLP can reshape by groups of 8. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 21:07:17 +01:00
Ethanfel	b1a2ee594e	fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep. videoprism/__init__.py is empty; the API lives in videoprism.models. Fix: from videoprism import models as vp (not import videoprism as vp). Also add flax to managed venv packages (required by videoprism Flax model). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:38:00 +01:00
Ethanfel	878025450a	feat: add data_utils package with FeaturesUtils implementation. Creates data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils: T5-Gemma text encoding via transformers; VideoPrism video encoding via JAX videoprism package; Synchformer visual encoder loading from checkpoint. Also fixes extract_features.py to add plugin root to sys.path so data_utils is importable in the subprocess venv. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 20:14:34 +01:00
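
Commit 878025450a mentions adding the plugin root to sys.path so that data_utils is importable in the subprocess venv. A minimal sketch of that pattern follows. The helper name `ensure_on_sys_path` and the assumed layout (the script sitting directly in the plugin root, next to data_utils/) are illustrative assumptions, not taken from the repository.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> str:
    """Prepend `root` to sys.path once so packages under it become importable.

    Hypothetical helper: the actual extract_features.py may do this differently.
    """
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Assumed layout: this script lives in the plugin root, next to data_utils/.
# Fall back to the current directory when __file__ is unavailable (e.g. a REPL).
_script = Path(globals().get("__file__", "."))
PLUGIN_ROOT = _script.parent if _script.is_file() else Path.cwd()
ensure_on_sys_path(PLUGIN_ROOT)
```

Checking membership before inserting keeps repeated subprocess entry points from piling duplicate entries onto sys.path.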

5 Commits
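
The shape fix described in commit 63bd999dfa (expanding [num_seg, 768] to [num_seg*8, 768] so the sequence length is divisible by 8) can be sketched in plain Python. Repeating each segment row 8 times is an assumption about how the expansion was done; the commit only states the input and output shapes. The function names are hypothetical.

```python
FRAMES_PER_SEGMENT = 8  # segment length stated in the commit messages


def expand_to_frames(seg_feats, frames_per_segment=FRAMES_PER_SEGMENT):
    """[num_seg, dim] -> [num_seg * frames_per_segment, dim].

    Assumption: each segment vector is simply repeated once per frame.
    """
    return [list(row) for row in seg_feats for _ in range(frames_per_segment)]


def group_frames(frame_feats, frames_per_segment=FRAMES_PER_SEGMENT):
    """Regroup per-frame rows into segments, as a downstream reshape would.

    Raises ValueError when the length is not divisible by frames_per_segment.
    """
    if len(frame_feats) % frames_per_segment != 0:
        raise ValueError("sequence length must be divisible by frames_per_segment")
    return [
        frame_feats[i:i + frames_per_segment]
        for i in range(0, len(frame_feats), frames_per_segment)
    ]


# Two segments of 768-dim features become 16 per-frame rows.
segments = [[0.0] * 768, [1.0] * 768]
frames = expand_to_frames(segments)
groups = group_frames(frames)
print(len(frames), len(frames[0]), len(groups))
```

Note that the later commit 140cc5ee9a replaces the placeholder with an encoder that emits per-frame features directly, so an expansion like this would only apply to segment-level output.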