ComfyUI-PrismAudio
Custom nodes for PrismAudio (ICLR 2026) — video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning with a 518M parameter DiT diffusion model and Stable Audio 2.0 VAE.
Installation
Clone into your ComfyUI custom nodes directory:
cd ComfyUI/custom_nodes
git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt
flash-attn is optional. It is detected at runtime, and the nodes fall back to PyTorch SDPA when it is unavailable.
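The runtime detection described above can be pictured roughly as follows. This is a minimal sketch, not the node pack's actual code: the function name `pick_attention_backend` and the returned labels are hypothetical, and only the package name `flash_attn` and the PyTorch SDPA fallback come from this README.

```python
import importlib.util


def pick_attention_backend():
    """Return "flash_attn" if the package can be located, else "sdpa"."""
    # find_spec only locates the package; it does not import it, so a broken
    # flash-attn build would still need a try/except around the real import.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    # Fallback path: torch.nn.functional.scaled_dot_product_attention.
    return "sdpa"


print(pick_attention_backend())
```

Running this inside the ComfyUI Python environment is a quick way to see which path would likely be taken before installing flash-attn.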
For the Feature Extractor node (video feature extraction), a separate conda environment is required — see Feature Extraction Environment below.
Nodes
| Node | Description |
|---|---|
| PrismAudio Model Loader | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: precision (auto/fp32/fp16/bf16), offload_strategy (auto/keep_in_vram/offload_to_cpu). |
| PrismAudio Feature Loader | Loads pre-computed .npz feature files for use with the sampler. |
| PrismAudio Feature Extractor | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
| PrismAudio Sampler | Main generation node. Takes model + features, produces AUDIO. Inputs: duration, steps, cfg_scale, seed. |
| PrismAudio Text Only | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: text_prompt, duration, steps, cfg_scale, seed. |
Workflows
Quality Path (Video-to-Audio)
Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio
Pre-computed Path
PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio
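Before wiring a feature file into the loader, it can help to check what it contains. An .npz file is a zip archive with one .npy member per array, so the standard library is enough to list the array names. The sketch below assumes nothing about which keys PrismAudio expects; the helper name `list_npz_arrays` is hypothetical.

```python
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive without loading the data."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<name>.npy"; strip the suffix to get the key.
        return [name[:-4] for name in archive.namelist() if name.endswith(".npy")]
```

With NumPy installed, `numpy.load(path).files` gives the same list and also lets you read the arrays themselves.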
Text-Only
PrismAudio Text Only → Save Audio
Note: CoT text is a STRING input on the sampler. You can use any existing ComfyUI LLM nodes to generate it.
HuggingFace Authentication
Required for gated models (T5-Gemma, and possibly Stable Audio VAE).
- Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
- Authenticate via one of:
  - Environment variable:
    export HF_TOKEN=hf_...
  - CLI login:
    huggingface-cli login
There is no hf_token widget on the nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token.
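To illustrate the design choice above, a token can be taken from the process environment instead of from a node input, so it never ends up in a workflow file. This is a sketch under that assumption, not the nodes' actual implementation; the helper name `hf_token_from_env` is hypothetical.

```python
import os


def hf_token_from_env(env=None):
    """Return the HF_TOKEN value, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


if hf_token_from_env() is None:
    # No token in the environment: gated downloads then depend on a prior CLI login.
    print("HF_TOKEN is not set; run `huggingface-cli login` or export HF_TOKEN.")
```

Run it from the same shell that launches ComfyUI, since the variable must be set in that process's environment.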
Model Files
Weights are auto-downloaded to ComfyUI/models/prismaudio/:
| File | Size | Description |
|---|---|---|
| prismaudio.ckpt | ~2.7 GB | Diffusion model |
| vae.ckpt | ~2.5 GB | Stable Audio 2.0 VAE |
| synchformer_state_dict.pth | ~950 MB | Synchformer |
T5-Gemma is cached in the standard HuggingFace cache directory (~/.cache/huggingface/).
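If a download was interrupted, a short check like the one below shows which of the files listed above are still missing. It is a sketch only: the file names come from the table, while the helper name `missing_weights` and the relative path in the example call are assumptions about your setup.

```python
from pathlib import Path

# File names taken from the Model Files table.
EXPECTED_FILES = ("prismaudio.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def missing_weights(models_dir):
    """Return the expected weight files that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Adjust the path to wherever your ComfyUI installation lives.
print(missing_weights("ComfyUI/models/prismaudio"))
```

An empty list means all three files are in place. The check does not verify file sizes or integrity.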
VRAM Requirements
| VRAM | Strategy |
|---|---|
| 24 GB+ | Keep all models in VRAM |
| 12–24 GB | Sequential offload |
| 8–12 GB | Aggressive offload + fp16 |
| < 8 GB | May work with aggressive offload |
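The table can be read as a simple threshold lookup. The sketch below encodes it that way; the function name is hypothetical, the returned strings are descriptive labels rather than `offload_strategy` widget values, and treating each boundary as inclusive on the lower end is an assumption, since the table's ranges overlap at 12 and 24 GB.

```python
def suggest_strategy(vram_gb):
    """Map available VRAM (in GB) to the matching row of the table above."""
    if vram_gb >= 24:
        return "keep all models in VRAM"
    if vram_gb >= 12:
        return "sequential offload"
    if vram_gb >= 8:
        return "aggressive offload + fp16"
    return "aggressive offload (may work)"


print(suggest_strategy(16))
```

With PyTorch available, the total VRAM in GB can be obtained from `torch.cuda.get_device_properties(0).total_memory / 1024**3`.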
Feature Extraction Environment
The PrismAudio Feature Extractor node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).
conda env create -f scripts/environment.yml
conda activate prismaudio-extract
Then set the python_env input on the Feature Extractor node to:
/path/to/conda/envs/prismaudio-extract/bin/python
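The subprocess bridge can be pictured as launching that interpreter as a child process and reading its output. The following is a minimal sketch of the pattern, not the node's actual code: `run_in_env` is a hypothetical helper, and the demo uses the current interpreter with a trivial command in place of the real extraction script, whose name and arguments are not given in this README.

```python
import subprocess
import sys


def run_in_env(python_env, args, timeout=600.0):
    """Run `python_env args...` as a child process and return its stdout."""
    result = subprocess.run(
        [python_env, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Demo with the current interpreter; a real call would pass the conda env's
# python path plus the extraction script and its arguments.
print(run_in_env(sys.executable, ["-c", "print('ok')"]))
```

Because the child process uses its own interpreter, the JAX/TF dependencies never need to be importable from the ComfyUI environment.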
Troubleshooting
- Gated model errors: Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
- VRAM errors: Switch offload_strategy to offload_to_cpu, or use fp16 precision.
- flash-attn: Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.
Credits
PrismAudio by FunAudioLLM (ICLR 2026). Paper & code.