ComfyUI-SelVA

Ethanfel 2f626d8a96 fix: use videoprism_lvt_public_v1_large with joint video-text forward

The wrong model (videoprism_public_v1_large, vision-only) was used,
causing V2A audio distortion. Switch to the LvT variant which has a
text tower, pass CoT captions for joint encoding, and extract per-frame
features from outputs['frame_embeddings'] (L2-normalized, [T, 1024])
instead of manually averaging spatial patches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-28 10:37:02 +01:00
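
The commit message above contrasts two ways of getting one feature vector per frame: manually averaging spatial patch tokens, versus reading ready-made, L2-normalized per-frame embeddings of shape [T, 1024]. The sketch below illustrates that difference with stand-in data only. It does not call VideoPrism; the helper names (`mean_pool_patches`, `l2_normalize`, `is_unit_norm`), the frame-major token layout, and the patch count of 16 are assumptions made for illustration, not code from this repository.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool_patches(tokens, num_frames, patches_per_frame):
    """Average the patch tokens of each frame.

    `tokens` is a list of T*N vectors, assumed to be in frame-major order
    (all patches of frame 0, then all patches of frame 1, ...).
    Returns a list of T vectors.
    """
    if len(tokens) != num_frames * patches_per_frame:
        raise ValueError("token count does not match T * N")
    dim = len(tokens[0])
    frames = []
    for t in range(num_frames):
        chunk = tokens[t * patches_per_frame:(t + 1) * patches_per_frame]
        frames.append(
            [sum(tok[d] for tok in chunk) / patches_per_frame for d in range(dim)]
        )
    return frames


def is_unit_norm(rows, tol=1e-5):
    """Check that every row has L2 norm 1 within a tolerance."""
    return all(abs(math.sqrt(sum(x * x for x in r)) - 1.0) <= tol for r in rows)


# Stand-in data: T=4 frames, N=16 patches per frame, D=1024 channels.
random.seed(0)
T, N, D = 4, 16, 1024
patch_tokens = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(T * N)]

# Manual route: average the spatial patches of each frame.
pooled = mean_pool_patches(patch_tokens, T, N)

# Averaging alone does not produce unit-norm rows; an explicit
# normalization step is needed to match an L2-normalized [T, 1024] output.
frame_features = [l2_normalize(row) for row in pooled]
```

A consumer that expects L2-normalized [T, 1024] features could run `is_unit_norm(frame_features)` as a cheap sanity check before passing them on.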

v2a_utils

fix: use videoprism_lvt_public_v1_large with joint video-text forward

2026-03-28 10:37:02 +01:00

__init__.py

feat: add data_utils package with FeaturesUtils implementation

2026-03-27 20:14:34 +01:00