- Delete prismaudio_core/, data_utils/, scripts/, docs/plans/ - Delete PrismAudio nodes (feature_extractor, feature_loader, model_loader, sampler, text_only) - Delete PrismAudio workflows (video_to_audio, text_to_audio) - Clean nodes/utils.py: rename PRISMAUDIO_CATEGORY → SELVA_CATEGORY, remove unused helpers - Strip PrismAudio-only deps from requirements.txt Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI-SelVA
Custom nodes for SelVA — video-to-audio generation driven by text prompts. SelVA conditions audio synthesis on both visual content and natural language, letting you describe what sounds to generate rather than just when.
Built on MMAudio with a TextSynchformer encoder that injects text guidance directly into the visual sync stream.
Nodes
SelVA Model Loader
Loads the generator, TextSynchformer encoder, and all feature utilities (CLIP, T5, Synchformer, VAE). Weights are auto-downloaded from HuggingFace on first use.
| Input | Options | Description |
|---|---|---|
variant |
small_16k / small_44k / medium_44k / large_44k | Model size and output sample rate |
precision |
bf16 / fp16 / fp32 | Compute dtype |
offload_strategy |
auto / keep_in_vram / offload_to_cpu | Memory management |
Output: model (SELVA_MODEL)
SelVA Feature Extractor
Extracts CLIP visual features and text-guided sync features from a video. Results are cached on disk — re-running with the same inputs is instant.
| Input | Description |
|---|---|
model |
From SelVA Model Loader |
video |
IMAGE tensor from any ComfyUI video loader |
prompt |
Text description of the audio to generate |
video_info |
(optional) VHS_VIDEOINFO from VHS LoadVideo — sets fps automatically |
fps |
Source fps — ignored if video_info is connected |
duration |
Override clip duration in seconds. 0 = infer from video length |
cache_dir |
Directory for cached .npz files. Empty = system temp dir |
Outputs: features (SELVA_FEATURES), fps (FLOAT), prompt (STRING)
Connect prompt output to the Sampler's prompt input to avoid entering it twice.
SelVA Sampler
Generates audio from video features. Runs the rectified flow ODE with classifier-free guidance.
| Input | Description |
|---|---|
model |
From SelVA Model Loader |
features |
From SelVA Feature Extractor |
prompt |
Text description — leave empty to use the prompt stored in features |
negative_prompt |
What to suppress (e.g. "speech, voice, talking") |
duration |
Audio duration in seconds. 0 = use duration from features |
steps |
Sampling steps (default: 25) |
cfg_strength |
Classifier-free guidance scale (default: 4.5) |
seed |
RNG seed |
Output: AUDIO
Workflow
VHS LoadVideo ──► SelVA Feature Extractor ──────────────────────► SelVA Sampler ──► Save Audio
│ (video_info) ─► (fps auto) ▲
│ (features) ────────────────────────────────────►│
│ (prompt) ──────────────────────────────────────►│
Connect the prompt output of Feature Extractor directly to Sampler's prompt to keep them in sync. Leave Sampler's prompt empty and it will use whatever was stored during extraction.
Installation
cd ComfyUI/custom_nodes
git clone https://github.com/Ethanfel/ComfyUI-SelVA.git
pip install -r ComfyUI-SelVA/requirements.txt
Model Weights
Weights are auto-downloaded to ComfyUI/models/selva/ on first load. No manual setup required.
| File | Size | Description |
|---|---|---|
video_enc_sup_5.pth |
~300 MB | TextSynchformer encoder |
generator_small_16k_sup_5.pth |
~340 MB | Small generator, 16 kHz output |
generator_small_44k_sup_5.pth |
~340 MB | Small generator, 44.1 kHz output |
generator_medium_44k_sup_5.pth |
~860 MB | Medium generator, 44.1 kHz output |
generator_large_44k_sup_5.pth |
~2.0 GB | Large generator, 44.1 kHz output |
v1-16.pth |
~1.1 GB | VAE for 16 kHz |
v1-44.pth |
~1.1 GB | VAE for 44.1 kHz |
best_netG.pt |
~90 MB | BigVGAN vocoder for 16 kHz |
synchformer_state_dict.pth |
~950 MB | Synchformer (shared with PrismAudio if present) |
CLIP (DFN5B-ViT-H-14-384) and T5 (flan-t5-base) are downloaded automatically from HuggingFace to ~/.cache/huggingface/.
VRAM Requirements
| VRAM | Recommended settings |
|---|---|
| 24 GB+ | keep_in_vram, any variant |
| 12–24 GB | offload_to_cpu, medium or smaller |
| 8–12 GB | offload_to_cpu, small variant, fp16 |
The auto offload strategy picks keep_in_vram if ≥ 16 GB VRAM is available, otherwise offload_to_cpu.