ComfyUI-SelVA/docs/plans/2026-04-04-selva-integration-design.md
Ethanfel 51f93f9688 docs: SelVA integration design doc
Three new nodes (SelvaModelLoader, SelvaFeatureExtractor, SelvaSampler)
vendoring selva_core from jnwnlee/selva. Pure PyTorch, no subprocess,
zero new pip dependencies. TextSynchformer provides text-conditioned sync
features for improved audio-visual alignment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:00:40 +02:00

# SelVA Integration Design
**Date:** 2026-04-04
**Branch:** feature/selva-integration (new from master)
**Status:** Approved, ready for implementation

---
## Problem
PrismAudio's sync conditioning is text-agnostic: Synchformer extracts features from
all visual motion equally. In multi-source videos (for example, a person walking near a car), the DiT
receives unfocused sync guidance and struggles to match audio events to the correct
visual source.

SelVA (CVPR 2026, arXiv:2512.02650) addresses this with TextSynchformer: text conditioning
is injected inside the Synchformer encoder via cross-attention, so sync features only
encode motion relevant to the requested sound. This is the core architectural improvement
needed for reliable video-to-audio (V2A) sync.
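For orientation, a generic scaled dot-product cross-attention layer has the form below. This is the textbook formulation, not necessarily SelVA's exact layer. Treating the visual tokens $X_v$ as queries and the text tokens $X_t$ as keys and values is an assumption made for illustration, as are the projection names $W_Q$, $W_K$, $W_V$ and the key dimension $d_k$.

$$
\mathrm{CrossAttn}(X_v, X_t) = \mathrm{softmax}\!\left(\frac{(X_v W_Q)\,(X_t W_K)^{\top}}{\sqrt{d_k}}\right) X_t W_V
$$

Under this reading, each visual token is re-weighted by how strongly it matches the prompt tokens, which is one way motion unrelated to the requested sound could be suppressed.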
---
## Architecture
### New directory layout
```
selva_core/                      ← vendored SelVA source (model + ext + utils)
nodes/
  selva_model_loader.py
  selva_feature_extractor.py
  selva_sampler.py
```
### New custom types
- `SELVA_MODEL` → `{generator, video_enc, feature_utils, variant, strategy, dtype}`
- `SELVA_FEATURES` → `{clip_features, sync_features, duration}`
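One way to keep the two payloads consistent across nodes is to describe them as `TypedDict`s. The sketch below is illustrative only: the class names `SelvaModel` and `SelvaFeatures`, the field types, and the `missing_keys` helper are assumptions, and only the key names come from this design.

```python
from typing import Any, TypedDict


class SelvaModel(TypedDict):
    """Payload carried on a SELVA_MODEL link (field types are assumptions)."""
    generator: Any      # audio generator network
    video_enc: Any      # TextSynchformer video encoder
    feature_utils: Any  # VAE / vocoder helpers
    variant: str        # e.g. "medium_44k"
    strategy: str       # offload strategy, e.g. "auto"
    dtype: str          # e.g. "bf16"; a real node would likely store a torch.dtype


class SelvaFeatures(TypedDict):
    """Payload carried on a SELVA_FEATURES link."""
    clip_features: Any  # [B, T, 1024]
    sync_features: Any  # [B, T, 768], text-conditioned
    duration: float     # seconds


def missing_keys(payload: dict, schema: type) -> list[str]:
    """Return the schema keys that are absent from payload, sorted by name."""
    return sorted(set(schema.__annotations__) - set(payload))
```

A sampler could call `missing_keys(features, SelvaFeatures)` on its input and raise a readable error before touching any tensors.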
### No subprocess
SelVA is pure PyTorch. Feature extraction runs inline in ComfyUI — no managed venv,
no JAX/TF, no pip install on first run.
### Dependencies
Zero new pip packages. ComfyUI already ships:
- `open_clip_torch` (CLIP ViT-H-14-384, auto-downloads via `hf-hub:` on first use)
- `transformers` (flan-t5-base, auto-downloads from HuggingFace on first use)
- `torch`, `torchaudio`, `einops`
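A cheap way to verify the zero-new-dependencies assumption at node import time is to probe for the modules without importing them. This is a minimal sketch: the import names in `REQUIRED_MODULES` are assumed from the package list above (for example, `open_clip_torch` is assumed to install the module `open_clip`), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed top-level import names for the packages listed above.
REQUIRED_MODULES = ["open_clip", "transformers", "torch", "torchaudio", "einops"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` at startup and logging a non-empty result would surface an incomplete environment before the first workflow run.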
---
## Nodes
### `SelvaModelLoader` → `SELVA_MODEL`
| Input | Type | Default | Notes |
|---|---|---|---|
| variant | dropdown | medium_44k | small_16k / small_44k / medium_44k / large_44k |
| precision | dropdown | bf16 | bf16 / fp16 / fp32 |
| offload_strategy | dropdown | auto | auto / keep_in_vram / offload_to_cpu |

Resolves weights from `models/selva/`. Raises descriptive errors with download
instructions if files are missing.
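The table above maps onto ComfyUI's node interface roughly as follows. This is a sketch of the declaration only: the `CATEGORY` value and the method name `load` are assumptions, and the body is a placeholder that returns an empty `SELVA_MODEL` payload instead of loading any weights.

```python
class SelvaModelLoader:
    """Sketch of the node interface; actual weight loading is stubbed out."""

    VARIANTS = ["small_16k", "small_44k", "medium_44k", "large_44k"]
    PRECISIONS = ["bf16", "fp16", "fp32"]
    STRATEGIES = ["auto", "keep_in_vram", "offload_to_cpu"]

    @classmethod
    def INPUT_TYPES(cls):
        # A list of strings as the type renders as a dropdown in ComfyUI.
        return {
            "required": {
                "variant": (cls.VARIANTS, {"default": "medium_44k"}),
                "precision": (cls.PRECISIONS, {"default": "bf16"}),
                "offload_strategy": (cls.STRATEGIES, {"default": "auto"}),
            }
        }

    RETURN_TYPES = ("SELVA_MODEL",)
    FUNCTION = "load"
    CATEGORY = "SelVA"  # hypothetical menu category

    def load(self, variant, precision, offload_strategy):
        # Placeholder: a real implementation would build the networks here.
        model = {
            "generator": None,
            "video_enc": None,
            "feature_utils": None,
            "variant": variant,
            "strategy": offload_strategy,
            "dtype": precision,
        }
        return (model,)
```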
### `SelvaFeatureExtractor` → `SELVA_FEATURES`, `FLOAT` (fps)
| Input | Type | Default | Notes |
|---|---|---|---|
| video | IMAGE | — | ComfyUI video tensor [T,H,W,C] |
| prompt | STRING | — | Used by TextSynchformer to select relevant motion |
| video_info | VHS_VIDEOINFO | opt | Auto-sets fps when connected |
| fps | FLOAT | 30.0 | Fallback fps if video_info not connected |
| cache_dir | STRING | "" | Empty = system temp dir |

Feature extraction steps (all inline, no subprocess):
1. Resize frames to 384×384 → CLIP video features `[B, T, 1024]`
2. Resize frames to 224×224 + encode prompt with flan-T5 → TextSynchformer → text-conditioned sync features `[B, T, 768]`
3. Save to `.npz` cache keyed by hash(frames[:1MB] + prompt + fps)
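The cache key in step 3 could be computed along these lines. This is a sketch under stated assumptions: SHA-256 is chosen here for illustration (the design only says "hash"), `feature_cache_key` is a hypothetical helper, and `frame_bytes` stands in for the raw bytes of the video tensor, which a real node would have to obtain from the `IMAGE` input.

```python
import hashlib

ONE_MIB = 1024 * 1024


def feature_cache_key(frame_bytes: bytes, prompt: str, fps: float) -> str:
    """Hash the first 1 MiB of frame data together with the prompt and fps."""
    h = hashlib.sha256()
    h.update(frame_bytes[:ONE_MIB])          # only the leading 1 MiB is hashed
    h.update(b"\x00")                        # separator so fields cannot run together
    h.update(prompt.encode("utf-8"))
    h.update(b"\x00")
    h.update(f"{fps:.6f}".encode("ascii"))   # fixed formatting keeps the key stable
    return h.hexdigest()
```

Because only the first 1 MB of frame data is hashed, two clips with identical opening frames would share a key. Mixing the tensor shape into the hash is one possible mitigation.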
### `SelvaSampler` → `AUDIO`
| Input | Type | Default | Notes |
|---|---|---|---|
| model | SELVA_MODEL | — | |
| features | SELVA_FEATURES | — | |
| prompt | STRING | — | Should match extractor prompt; drives CLIP text guidance |
| negative_prompt | STRING | "" | Steers away from unwanted sounds |
| duration | FLOAT | 0.0 | 0 = auto from features duration |
| steps | INT | 25 | Euler steps (25 is SelVA default, fast) |
| cfg_strength | FLOAT | 4.5 | CFG scale (SelVA default) |
| seed | INT | 0 | |

Generation steps:
1. Encode prompt → CLIP text features (for MMAudio)
2. Encode negative prompt → empty conditions for CFG
3. `net_generator.preprocess_conditions(clip_f, sync_f, text_clip)`
4. Flow matching Euler ODE (`num_steps` iterations) with CFG
5. `feature_utils.decode(latent)` → mel spectrogram
6. `feature_utils.vocode(spec)` → waveform (BigVGAN for 16k, direct for 44k)
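Step 4 can be illustrated with a toy Euler integrator that applies classifier-free guidance at every step. This is a minimal sketch on plain Python lists, not the SelVA sampler: `euler_cfg_sample` and the `velocity(x, t, conditioned)` callable are hypothetical stand-ins for the generator's conditioned and unconditioned forward passes, and a real implementation would operate on latent tensors.

```python
from typing import Callable, List

Vector = List[float]
# velocity(x, t, conditioned) -> predicted flow at time t (hypothetical signature)
VelocityFn = Callable[[Vector, float, bool], Vector]


def euler_cfg_sample(
    x0: Vector,
    velocity: VelocityFn,
    num_steps: int = 25,
    cfg_strength: float = 4.5,
) -> Vector:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v_cond = velocity(x, t, True)     # prediction with prompt conditions
        v_uncond = velocity(x, t, False)  # prediction with empty/negative conditions
        # CFG: move from the unconditional prediction toward the conditional one.
        x = [
            xi + dt * (vu + cfg_strength * (vc - vu))
            for xi, vc, vu in zip(x, v_cond, v_uncond)
        ]
    return x
```

With `cfg_strength = 1.0` the update reduces to the purely conditional flow. Larger values push the sample further from the unconditional prediction.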

**Note on dual prompt:** The extractor prompt is baked into sync_features via
TextSynchformer at extraction time. The sampler prompt drives CLIP text conditioning
at generation time. The two prompts should match; a tooltip on the node explains this.
---
## Data Flow
```
[VHS LoadVideo] ──► [SelvaFeatureExtractor]
                      │ prompt: "dog barking"
                      │ video_info: (fps auto)
                    SELVA_FEATURES
                    {clip_features [B,T,1024],
                     sync_features [B,T,768],   ← text-conditioned
                     duration: 8.2s}
[SelvaModelLoader] ──► [SelvaSampler]
  variant: medium_44k    │ prompt: "dog barking"
  precision: bf16        │ negative: "wind noise"
                         │ cfg_strength: 4.5, steps: 25
                       AUDIO (44.1kHz or 16kHz)
```
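The same wiring can be written down as a ComfyUI API-format workflow, where each link is a `[source_node_id, output_index]` pair. The sketch below is illustrative: the node ids, the file name `clip.mp4`, the `VHS_LoadVideo` input name, and its output indices are assumptions, and `dangling_links` is a hypothetical sanity check.

```python
# Node ids, the VHS_LoadVideo input name, and its output indices are assumptions.
WORKFLOW = {
    "1": {"class_type": "VHS_LoadVideo", "inputs": {"video": "clip.mp4"}},
    "2": {
        "class_type": "SelvaFeatureExtractor",
        "inputs": {
            "video": ["1", 0],       # IMAGE output
            "video_info": ["1", 3],  # VHS_VIDEOINFO output (assumed index)
            "prompt": "dog barking",
            "fps": 30.0,
            "cache_dir": "",
        },
    },
    "3": {
        "class_type": "SelvaModelLoader",
        "inputs": {"variant": "medium_44k", "precision": "bf16", "offload_strategy": "auto"},
    },
    "4": {
        "class_type": "SelvaSampler",
        "inputs": {
            "model": ["3", 0],
            "features": ["2", 0],
            "prompt": "dog barking",  # same prompt as the extractor
            "negative_prompt": "wind noise",
            "duration": 0.0,          # 0 = take duration from the features
            "steps": 25,
            "cfg_strength": 4.5,
            "seed": 0,
        },
    },
}


def dangling_links(workflow):
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    bad = []
    for node_id, node in workflow.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2 and isinstance(value[0], str)
            if is_link and value[0] not in workflow:
                bad.append((node_id, name))
    return bad
```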
---
## Model Weights
Location: `models/selva/`
```
video_enc_sup_5.pth              ← TextSynch, shared across all variants
generator_small_16k_sup_5.pth
generator_small_44k_sup_5.pth
generator_medium_44k_sup_5.pth
generator_large_44k_sup_5.pth
ext/
  v1-16.pth                      ← VAE for 16k variants
  v1-44.pth                      ← VAE for 44k variants
  best_netG.pt                   ← BigVGAN vocoder (16k only)
```
`synchformer_state_dict.pth` is reused from `models/prismaudio/` — no duplicate.
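Resolving the files for a given variant, with a descriptive error when something is missing, might look like the sketch below. It assumes the two VAE files and the vocoder sit inside `ext/`, and `resolve_selva_weights` is a hypothetical helper. The shared Synchformer checkpoint under `models/prismaudio/` is left out for brevity.

```python
from pathlib import Path

VARIANTS = ("small_16k", "small_44k", "medium_44k", "large_44k")


def resolve_selva_weights(selva_dir, variant):
    """Map a variant name to its weight files, raising if any are missing."""
    if variant not in VARIANTS:
        raise ValueError(f"Unknown variant {variant!r}; expected one of {VARIANTS}")
    root = Path(selva_dir)
    is_16k = variant.endswith("16k")
    paths = {
        "video_enc": root / "video_enc_sup_5.pth",
        "generator": root / f"generator_{variant}_sup_5.pth",
        "vae": root / "ext" / ("v1-16.pth" if is_16k else "v1-44.pth"),
    }
    if is_16k:
        # Only the 16k variants need the BigVGAN vocoder.
        paths["vocoder"] = root / "ext" / "best_netG.pt"
    missing = [str(p) for p in paths.values() if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            "Missing SelVA weights:\n  " + "\n  ".join(missing)
            + f"\nDownload them and place them under {root}"
        )
    return paths
```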
---
## selva_core vendoring
Copy from `jnwnlee/selva` (pinned to a specific commit for stability):
- `selva_core/model/` — MMAudio, TextSynch, transformer layers, embeddings, flow matching
- `selva_core/ext/` — autoencoder, BigVGAN, synchformer, rotary embeddings, mel converters
- `selva_core/utils/` — transforms, generate() helper
Rename all internal imports from `selva.*` → `selva_core.*`.
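The import rename can be automated with a small regex pass over each vendored file. This is a sketch, and `rewrite_imports` is a hypothetical helper: it only rewrites statements where `selva` is the first module named after `from` or `import`, so a line such as `import os, selva` would need manual attention.

```python
import re

# Match "from selva..." / "import selva..." at the start of a line, but not
# names that merely begin with "selva" (such as selva_core) or end with it.
_IMPORT_RE = re.compile(r"^([ \t]*(?:from|import)[ \t]+)selva(?=[.,\s]|$)", re.MULTILINE)


def rewrite_imports(source: str) -> str:
    """Rewrite top-level `selva` imports to `selva_core`, leaving other text intact."""
    return _IMPORT_RE.sub(r"\1selva_core", source)
```

Running the function twice gives the same result as running it once, so it is safe to re-apply after pulling a newer upstream commit.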
---
## What stays the same
- All PrismAudio nodes unchanged
- `models/prismaudio/` unchanged
- Synchformer checkpoint shared (not duplicated)
- Branch: new `feature/selva-integration` off master (LoRA work stays separate)