2f626d8a96
The wrong model (videoprism_public_v1_large, vision-only) was used, which caused V2A audio distortion. Switch to the LvT variant, which has a text tower; pass CoT captions for joint encoding; and extract per-frame features from outputs['frame_embeddings'] (L2-normalized, [T, 1024]) instead of manually averaging spatial patches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>