# ComfyUI-SelVA Custom nodes for [SelVA](https://github.com/jnwnlee/selva) — video-to-audio generation driven by text prompts. SelVA conditions audio synthesis on both visual content and natural language, letting you describe *what* sounds to generate rather than just *when*. Built on [MMAudio](https://github.com/hkchengrex/MMAudio) with a TextSynchformer encoder that injects text guidance directly into the visual sync stream. --- ## Nodes ### SelVA Model Loader Loads the generator, TextSynchformer encoder, and all feature utilities (CLIP, T5, Synchformer, VAE). Weights are auto-downloaded from HuggingFace on first use. | Input | Options | Description | |-------|---------|-------------| | `variant` | small_16k / small_44k / medium_44k / large_44k | Model size and output sample rate | | `precision` | bf16 / fp16 / fp32 | Compute dtype | | `offload_strategy` | auto / keep_in_vram / offload_to_cpu | Memory management | **Output:** `model` (SELVA_MODEL) --- ### SelVA Feature Extractor Extracts CLIP visual features and text-guided sync features from a video. Results are cached on disk — re-running with the same inputs is instant. | Input | Description | |-------|-------------| | `model` | From SelVA Model Loader | | `video` | IMAGE tensor from any ComfyUI video loader | | `prompt` | Text description of the audio to generate | | `video_info` | *(optional)* VHS_VIDEOINFO from VHS LoadVideo — sets fps automatically | | `fps` | Source fps — ignored if `video_info` is connected | | `duration` | Override clip duration in seconds. `0` = infer from video length | | `cache_dir` | Directory for cached `.npz` files. Empty = system temp dir | | `mask` | *(optional)* Segmentation mask `[T,H,W]` float [0,1] — static (1 frame) or per-frame | | `mask_strength` | Background suppression strength. `1.0` = full neutral fill, `0.0` = no effect | | `mask_clip` | Apply mask to CLIP features (384px path). Disable to let CLIP see the full scene | | `mask_sync` | Apply mask to TextSynchformer sync features (224px path) | **Outputs:** `features` (SELVA_FEATURES), `fps` (FLOAT), `prompt` (STRING) Connect `prompt` output to the Sampler's `prompt` input to avoid entering it twice. #### Masking Connect a segmentation mask (SAM2, Grounding DINO+SAM, or any ComfyUI mask node) to isolate a specific object's motion before encoding. Background pixels are filled with a neutral value (0.5) rather than zeroed — this keeps them in-distribution for CLIP and maps to exactly 0 after sync's `[-1,1]` normalization, minimising the influence of background motion on the generated audio. Use `mask_sync=true, mask_clip=false` if you want sync features focused on the target object while CLIP still sees the full scene for broader context. Changing any mask parameter correctly busts the feature cache. --- ### SelVA Sampler Generates audio from video features. Runs the rectified flow ODE with classifier-free guidance. | Input | Description | |-------|-------------| | `model` | From SelVA Model Loader | | `features` | From SelVA Feature Extractor | | `prompt` | Text description — leave empty to use the prompt stored in features | | `negative_prompt` | What to suppress (e.g. `"speech, voice, talking"`) | | `duration` | Audio duration in seconds. `0` = use duration from features | | `steps` | Sampling steps (default: 25) | | `cfg_strength` | Classifier-free guidance scale (default: 4.5) | | `seed` | RNG seed | | `normalize` | Peak-normalize output to [-1, 1] (default: true) | **Output:** `AUDIO` --- ## Workflow ``` VHS LoadVideo ──► SelVA Feature Extractor ──────────────────────► SelVA Sampler ──► Save Audio │ (video_info) ─► (fps auto) ▲ │ (features) ────────────────────────────────────►│ │ (prompt) ──────────────────────────────────────►│ ``` Connect the `prompt` output of Feature Extractor directly to Sampler's `prompt` to keep them in sync. Leave Sampler's `prompt` empty and it will use whatever was stored during extraction. --- ## Installation ```bash cd ComfyUI/custom_nodes git clone https://github.com/Ethanfel/ComfyUI-SelVA.git pip install -r ComfyUI-SelVA/requirements.txt ``` --- ## Model Weights Weights are auto-downloaded to `ComfyUI/models/selva/` on first load. No manual setup required. | File | Size | Description | |------|------|-------------| | `video_enc_sup_5.pth` | ~300 MB | TextSynchformer encoder | | `generator_small_16k_sup_5.pth` | ~340 MB | Small generator, 16 kHz output | | `generator_small_44k_sup_5.pth` | ~340 MB | Small generator, 44.1 kHz output | | `generator_medium_44k_sup_5.pth` | ~860 MB | Medium generator, 44.1 kHz output | | `generator_large_44k_sup_5.pth` | ~2.0 GB | Large generator, 44.1 kHz output | | `v1-16.pth` | ~1.1 GB | VAE for 16 kHz | | `v1-44.pth` | ~1.1 GB | VAE for 44.1 kHz | | `best_netG.pt` | ~90 MB | BigVGAN vocoder for 16 kHz | | `synchformer_state_dict.pth` | ~950 MB | Synchformer (shared with PrismAudio if present) | CLIP (DFN5B-ViT-H-14-384) and T5 (flan-t5-base) are downloaded automatically from HuggingFace to `~/.cache/huggingface/`. --- ## VRAM Requirements | VRAM | Recommended settings | |------|----------------------| | 24 GB+ | `keep_in_vram`, any variant | | 12–24 GB | `offload_to_cpu`, medium or smaller | | 8–12 GB | `offload_to_cpu`, small variant, fp16 | The `auto` offload strategy picks `keep_in_vram` if ≥ 16 GB VRAM is available, otherwise `offload_to_cpu`. --- ## Credits - [SelVA](https://github.com/jnwnlee/selva) by Jaehwan Lee et al. — TextSynchformer and SelVA training - [MMAudio](https://github.com/hkchengrex/MMAudio) by Feng et al. — MM-DiT audio generator and flow matching framework - [BigVGAN](https://github.com/NVIDIA/BigVGAN) by NVIDIA — neural vocoder for 16 kHz synthesis