The vision-only model (videoprism_public_v1_large) was used by mistake,
which caused V2A audio distortion. Switch to the LvT variant, which has
a text tower, pass the CoT captions for joint encoding, and read
per-frame features from outputs['frame_embeddings'] (L2-normalized,
shape [T, 1024]) instead of manually averaging spatial patches.
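
A minimal sketch of a sanity check for the per-frame features described
above, assuming they have already been converted to a NumPy array. The
helper name check_frame_embeddings and the tolerance are hypothetical
and are not part of videoprism:

```python
import numpy as np


def check_frame_embeddings(frame_embeddings, dim=1024, atol=1e-3):
    """Validate per-frame features: shape [T, dim], unit L2 norm per row.

    Hypothetical helper for illustration; not part of the videoprism API.
    """
    feats = np.asarray(frame_embeddings, dtype=np.float32)
    if feats.ndim != 2 or feats.shape[1] != dim:
        raise ValueError(f"expected shape [T, {dim}], got {feats.shape}")
    # Each row should already be L2-normalized by the model.
    norms = np.linalg.norm(feats, axis=-1)
    if not np.allclose(norms, 1.0, atol=atol):
        raise ValueError("frame embeddings are not L2-normalized")
    return feats
```

A check like this fails fast if spatially averaged or unnormalized
features are fed to the conditioner again.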
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The prismaudio.json conditioner config requires:
- video_features: dim=1024, so switch from videoprism_public_v1_base to
  videoprism_public_v1_large (ViT-L)
- sync_features: dim=768 with a length divisible by 8, so expand
  [num_seg, 768] to [num_seg*8, 768] (per-frame) so that Sync_MLP can
  reshape in groups of 8
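
The sync_features shape change above can be sketched as follows. This
assumes the encoder yields 8 per-frame vectors per segment, i.e. an
array of shape [num_seg, 8, 768]; the function name is hypothetical. If
only pooled [num_seg, 768] features were available, np.repeat(x, 8,
axis=0) would give the same shape but would duplicate each vector.

```python
import numpy as np

FRAMES_PER_SEGMENT = 8  # group size Sync_MLP reshapes by, per the note above


def flatten_sync_features(seg_feats):
    """Flatten [num_seg, 8, dim] per-frame features to [num_seg * 8, dim].

    Hypothetical helper; assumes per-frame vectors exist for each segment.
    """
    feats = np.asarray(seg_feats)
    if feats.ndim != 3 or feats.shape[1] != FRAMES_PER_SEGMENT:
        raise ValueError(
            f"expected [num_seg, {FRAMES_PER_SEGMENT}, dim], got {feats.shape}"
        )
    num_seg, _, dim = feats.shape
    # Row-major reshape keeps the frames of each segment contiguous,
    # so a later reshape into groups of 8 recovers the segments.
    return feats.reshape(num_seg * FRAMES_PER_SEGMENT, dim)
```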
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
videoprism/__init__.py is empty; the API lives in videoprism.models.
Fix: use "from videoprism import models as vp" instead of
"import videoprism as vp".
Also add flax to the managed venv packages (required by the videoprism
Flax model).
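
A minimal sketch of a preflight check for the venv packages mentioned
above, using only the standard library. The helper name
missing_packages is hypothetical:

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import.

    Hypothetical helper; only checks that each package is discoverable,
    not that it imports cleanly.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling missing_packages(["videoprism", "flax"]) before running
"from videoprism import models as vp" would report a missing flax
install as a clear message rather than a failure deep inside model
loading.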
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Create data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils:
- T5-Gemma text encoding via transformers
- VideoPrism video encoding via JAX videoprism package
- Synchformer visual encoder loading from checkpoint
Also fix extract_features.py to add the plugin root to sys.path so that
data_utils is importable in the subprocess venv.
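
A minimal sketch of the sys.path fix described above. The helper name
add_to_sys_path is hypothetical, and where the plugin root sits relative
to extract_features.py is an assumption that the caller must supply:

```python
import sys
from pathlib import Path


def add_to_sys_path(root):
    """Prepend root to sys.path once so packages under it are importable.

    Hypothetical helper; returns the resolved path that was added.
    """
    root_str = str(Path(root).resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str
```

If the script lived directly in the plugin root, the call could be
add_to_sys_path(Path(__file__).resolve().parent), placed before the
first "import data_utils" statement.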
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>