578b501d38
CLIP frames at 8fps→384px (normalize inside FeaturesUtils). Sync frames at 25fps→224px, normalized to [-1,1] externally. T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned sync features extracted via TextSynch.encode_video_with_sync(). Results cached as .npz keyed by hash(frames[:1MB] + prompt + fps + variant). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>