Files
2026-04-05 10:43:42 +02:00

5.9 KiB
Raw Permalink Blame History

ComfyUI-SelVA

Custom nodes for SelVA — video-to-audio generation driven by text prompts. SelVA conditions audio synthesis on both visual content and natural language, letting you describe what sounds to generate rather than just when.

Built on MMAudio with a TextSynchformer encoder that injects text guidance directly into the visual sync stream.


Nodes

SelVA Model Loader

Loads the generator, TextSynchformer encoder, and all feature utilities (CLIP, T5, Synchformer, VAE). Weights are auto-downloaded from HuggingFace on first use.

Input Options Description
variant small_16k / small_44k / medium_44k / large_44k Model size and output sample rate
precision bf16 / fp16 / fp32 Compute dtype
offload_strategy auto / keep_in_vram / offload_to_cpu Memory management

Output: model (SELVA_MODEL)


SelVA Feature Extractor

Extracts CLIP visual features and text-guided sync features from a video. Results are cached on disk — re-running with the same inputs is instant.

Input Description
model From SelVA Model Loader
video IMAGE tensor from any ComfyUI video loader
prompt Text description of the audio to generate
video_info (optional) VHS_VIDEOINFO from VHS LoadVideo — sets fps automatically
fps Source fps — ignored if video_info is connected
duration Override clip duration in seconds. 0 = infer from video length
cache_dir Directory for cached .npz files. Empty = system temp dir
mask (optional) Segmentation mask [T,H,W] float [0,1] — static (1 frame) or per-frame
mask_strength Background suppression strength. 1.0 = full neutral fill, 0.0 = no effect
mask_clip Apply mask to CLIP features (384px path). Disable to let CLIP see the full scene
mask_sync Apply mask to TextSynchformer sync features (224px path)

Outputs: features (SELVA_FEATURES), fps (FLOAT), prompt (STRING)

Connect prompt output to the Sampler's prompt input to avoid entering it twice.

Masking

Connect a segmentation mask (SAM2, Grounding DINO+SAM, or any ComfyUI mask node) to isolate a specific object's motion before encoding. Background pixels are filled with a neutral value (0.5) rather than zeroed — this keeps them in-distribution for CLIP and maps to exactly 0 after sync's [-1,1] normalization, minimising the influence of background motion on the generated audio.

Use mask_sync=true, mask_clip=false if you want sync features focused on the target object while CLIP still sees the full scene for broader context. Changing any mask parameter correctly busts the feature cache.


SelVA Sampler

Generates audio from video features. Runs the rectified flow ODE with classifier-free guidance.

Input Description
model From SelVA Model Loader
features From SelVA Feature Extractor
prompt Text description — leave empty to use the prompt stored in features
negative_prompt What to suppress (e.g. "speech, voice, talking")
duration Audio duration in seconds. 0 = use duration from features
steps Sampling steps (default: 25)
cfg_strength Classifier-free guidance scale (default: 4.5)
seed RNG seed
normalize Peak-normalize output to [-1, 1] (default: true)

Output: AUDIO


Workflow

VHS LoadVideo ──► SelVA Feature Extractor ──────────────────────► SelVA Sampler ──► Save Audio
                      │ (video_info) ─► (fps auto)                      ▲
                      │ (features) ────────────────────────────────────►│
                      │ (prompt) ──────────────────────────────────────►│

Connect the prompt output of Feature Extractor directly to Sampler's prompt to keep them in sync. Leave Sampler's prompt empty and it will use whatever was stored during extraction.


Installation

cd ComfyUI/custom_nodes
git clone https://github.com/Ethanfel/ComfyUI-SelVA.git
pip install -r ComfyUI-SelVA/requirements.txt

Model Weights

Weights are auto-downloaded to ComfyUI/models/selva/ on first load. No manual setup required.

File Size Description
video_enc_sup_5.pth ~300 MB TextSynchformer encoder
generator_small_16k_sup_5.pth ~340 MB Small generator, 16 kHz output
generator_small_44k_sup_5.pth ~340 MB Small generator, 44.1 kHz output
generator_medium_44k_sup_5.pth ~860 MB Medium generator, 44.1 kHz output
generator_large_44k_sup_5.pth ~2.0 GB Large generator, 44.1 kHz output
v1-16.pth ~1.1 GB VAE for 16 kHz
v1-44.pth ~1.1 GB VAE for 44.1 kHz
best_netG.pt ~90 MB BigVGAN vocoder for 16 kHz
synchformer_state_dict.pth ~950 MB Synchformer (shared with PrismAudio if present)

CLIP (DFN5B-ViT-H-14-384) and T5 (flan-t5-base) are downloaded automatically from HuggingFace to ~/.cache/huggingface/.


VRAM Requirements

VRAM Recommended settings
24 GB+ keep_in_vram, any variant
1224 GB offload_to_cpu, medium or smaller
812 GB offload_to_cpu, small variant, fp16

The auto offload strategy picks keep_in_vram if ≥ 16 GB VRAM is available, otherwise offload_to_cpu.


Credits

  • SelVA by Jaehwan Lee et al. — TextSynchformer and SelVA training
  • MMAudio by Feng et al. — MM-DiT audio generator and flow matching framework
  • BigVGAN by NVIDIA — neural vocoder for 16 kHz synthesis