ComfyUI-SelVA

Files

T

Ethanfel 140cc5ee9a feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16)

Replace placeholder single-linear with proper architecture reverse-engineered
from synchformer_state_dict.pth:
- _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768]
- _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp)
- _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame
- 12 blocks, dim=768, 8 frames/segment
- Loads from vfeat_extractor.* prefix, skips 3D patch embed

Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-27 21:28:20 +01:00

v2a_utils

feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16)

2026-03-27 21:28:20 +01:00

__init__.py

feat: add data_utils package with FeaturesUtils implementation

2026-03-27 20:14:34 +01:00