
ComfyUI-PrismAudio

Custom nodes for PrismAudio (ICLR 2026) — video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning with a 518M parameter DiT diffusion model and Stable Audio 2.0 VAE.

Installation

Clone into your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes
git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt

flash-attn is optional. It is detected at runtime and falls back to PyTorch SDPA if unavailable.
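The runtime detection described above can be pictured roughly as follows. This is a minimal sketch, not the nodes' actual code: `pick_attention_backend` is a hypothetical helper that only checks whether the `flash_attn` package is importable, and otherwise reports that PyTorch SDPA (`torch.nn.functional.scaled_dot_product_attention`) should be used.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" if the package is importable, else "sdpa".

    Hypothetical helper for illustration only; the real nodes may
    detect flash-attn differently.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    # Fallback path: torch.nn.functional.scaled_dot_product_attention
    return "sdpa"


print(pick_attention_backend())
```

Checking with `find_spec` avoids importing the package just to learn whether it is installed.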

For the Feature Extractor node (video feature extraction), a separate conda environment is required — see Feature Extraction Environment below.

Nodes

| Node | Description |
|------|-------------|
| PrismAudio Model Loader | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: precision (auto/fp32/fp16/bf16), offload_strategy (auto/keep_in_vram/offload_to_cpu). |
| PrismAudio Feature Loader | Loads pre-computed .npz feature files for use with the sampler. |
| PrismAudio Feature Extractor | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
| PrismAudio Sampler | Main generation node. Takes model + features, produces AUDIO. Inputs: duration, steps, cfg_scale, seed. |
| PrismAudio Text Only | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: text_prompt, duration, steps, cfg_scale, seed. |

Workflows

Quality Path (Video-to-Audio)

Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio

Pre-computed Path

PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio
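Before wiring a pre-computed file into the Feature Loader, it can help to look at what the .npz actually contains. The sketch below uses NumPy's standard `np.load`; `summarize_npz` is a hypothetical helper, and the array names inside a real PrismAudio feature file are not documented here, so the function simply reports whatever it finds.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz file."""
    summary = {}
    # np.load returns a lazy NpzFile; arrays are read only when indexed.
    with np.load(path) as data:
        for name in data.files:
            arr = data[name]
            summary[name] = (arr.shape, str(arr.dtype))
    return summary
```

For example, `summarize_npz("my_clip.npz")` (a placeholder filename) returns one entry per stored array. `np.load` keeps `allow_pickle=False` by default, so a file holding object arrays would raise an error instead of being unpickled silently.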

Text-Only

PrismAudio Text Only → Save Audio

Note: CoT text is a STRING input on the sampler. You can use any existing ComfyUI LLM nodes to generate it.

HuggingFace Authentication

Authentication is required for gated models (T5-Gemma, and possibly the Stable Audio VAE).

  1. Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
  2. Authenticate via one of:
    • Environment variable: export HF_TOKEN=hf_...
    • CLI login: huggingface-cli login

There is no hf_token widget on the nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token.
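As a rough illustration of the environment-variable route, the sketch below looks up a token without ever writing it into a workflow. `resolve_hf_token` is a hypothetical helper, not part of these nodes; it checks `HF_TOKEN` first and then `HUGGING_FACE_HUB_TOKEN`, the older variable name that `huggingface_hub` has also recognized. It does not cover a token stored by `huggingface-cli login`, which the Hugging Face libraries read on their own.

```python
import os
import warnings


def resolve_hf_token(env=None):
    """Return a Hugging Face token from the environment, or None.

    Hypothetical helper: checks HF_TOKEN, then the legacy
    HUGGING_FACE_HUB_TOKEN alias, and warns if neither is set.
    """
    env = os.environ if env is None else env
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        token = env.get(name, "").strip()
        if token:
            return token
    warnings.warn(
        "No Hugging Face token found in HF_TOKEN or HUGGING_FACE_HUB_TOKEN; "
        "gated models may fail to download."
    )
    return None
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment.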

Model Files

Weights are auto-downloaded to ComfyUI/models/prismaudio/:

| File | Size | Description |
|------|------|-------------|
| prismaudio.ckpt | ~2.7 GB | Diffusion model |
| vae.ckpt | ~2.5 GB | Stable Audio 2.0 VAE |
| synchformer_state_dict.pth | ~950 MB | Synchformer |

T5-Gemma is cached in the standard HuggingFace cache directory (~/.cache/huggingface/).
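To confirm that the auto-download finished, one option is to check the model directory for the three files listed above. This is a minimal sketch: `missing_weights` is a hypothetical helper, and the directory you pass in is assumed to be your own `ComfyUI/models/prismaudio/` path.

```python
from pathlib import Path

# File names taken from the Model Files table.
EXPECTED_FILES = (
    "prismaudio.ckpt",
    "vae.ckpt",
    "synchformer_state_dict.pth",
)


def missing_weights(models_dir):
    """Return the expected weight files that are absent from models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means all three files are present; it says nothing about whether a download was truncated, so comparing file sizes against the table is still worthwhile.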

VRAM Requirements

| VRAM | Strategy |
|------|----------|
| 24 GB+ | Keep all models in VRAM |
| 12–24 GB | Sequential offload |
| 8–12 GB | Aggressive offload + fp16 |
| < 8 GB | May work with aggressive offload |
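The tiers above (thresholds at 24, 12, and 8 GB) can be expressed as a small lookup. This is only a sketch of the table, not the logic behind the `auto` setting of offload_strategy: `suggest_strategy` is a hypothetical helper, and assigning the boundary values (exactly 12 or 24 GB) to the higher tier is an assumption.

```python
def suggest_strategy(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the strategy named in the table.

    Hypothetical helper; boundary values go to the higher tier by assumption.
    """
    if vram_gb >= 24:
        return "keep all models in VRAM"
    if vram_gb >= 12:
        return "sequential offload"
    if vram_gb >= 8:
        return "aggressive offload + fp16"
    return "aggressive offload (may work)"


for gb in (32, 16, 10, 6):
    print(gb, "GB ->", suggest_strategy(gb))
```

With PyTorch installed, total device memory in GB can be read as `torch.cuda.get_device_properties(0).total_memory / 1024**3` and passed to the function.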

Feature Extraction Environment

The PrismAudio Feature Extractor node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).

conda env create -f scripts/environment.yml
conda activate prismaudio-extract

Then set the python_env input on the Feature Extractor node to:

/path/to/conda/envs/prismaudio-extract/bin/python
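Before pointing the node at that interpreter, it may be worth confirming that the environment can actually see its JAX/TF dependencies. The sketch below runs the target Python in a subprocess and asks it which modules are importable. `probe_env` is a hypothetical helper, and the default module names `jax` and `tensorflow` are assumptions based on the description above; the real extraction script may need more.

```python
import json
import subprocess
import sys


def probe_env(python_path, modules=("jax", "tensorflow")):
    """Return {module: importable?} as seen by the given interpreter."""
    # Runs inside the target interpreter; module names arrive via sys.argv.
    code = (
        "import importlib.util, json, sys; "
        "print(json.dumps({m: importlib.util.find_spec(m) is not None "
        "for m in sys.argv[1:]}))"
    )
    result = subprocess.run(
        [python_path, "-c", code, *modules],
        capture_output=True,
        text=True,
        check=True,  # raise if the interpreter itself fails to run
    )
    return json.loads(result.stdout)


if __name__ == "__main__":
    # Demo against the current interpreter with a stdlib module.
    print(probe_env(sys.executable, modules=("json",)))
```

Calling `probe_env("/path/to/conda/envs/prismaudio-extract/bin/python")` with your real path should report `True` for each module if the environment was created correctly.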

Troubleshooting

  • Gated model errors — Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
  • VRAM errors — Switch offload_strategy to offload_to_cpu, or use fp16 precision.
  • flash-attn — Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.

Credits

PrismAudio by FunAudioLLM (ICLR 2026). Paper & code.
