Files
ComfyUI-SelVA/data_utils/v2a_utils
Ethanfel 140cc5ee9a feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16)
Replace placeholder single-linear with proper architecture reverse-engineered
from synchformer_state_dict.pth:
- _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768]
- _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp)
- _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame
- 12 blocks, dim=768, 8 frames/segment
- Loads from vfeat_extractor.* prefix, skips 3D patch embed

Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:28:20 +01:00
..