Ethanfel 06f8dbbab4 feat: add hf_token input and HF_TOKEN env forwarding to feature extractor

google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-27 20:27:33 +01:00

data_utils

feat: add data_utils package with FeaturesUtils implementation

2026-03-27 20:14:34 +01:00

docs/plans

docs: initial design and implementation plan

2026-03-27 16:57:15 +01:00

nodes

feat: add hf_token input and HF_TOKEN env forwarding to feature extractor

2026-03-27 20:27:33 +01:00

prismaudio_core

fix: remove MMDiTWrapper import and dead code paths from factory.py

2026-03-27 19:12:40 +01:00

scripts

feat: verbose step-by-step logging in feature extraction

2026-03-27 20:19:38 +01:00

workflows

feat: add fps input to PrismAudioFeatureExtractor

2026-03-27 20:08:10 +01:00

__init__.py

fix: add plugin root to sys.path so prismaudio_core is importable

2026-03-27 19:41:11 +01:00

.gitignore

feat: auto-install pip venv for feature extraction on first use

2026-03-27 19:27:27 +01:00

README.md

docs: README with installation and usage instructions

2026-03-27 18:15:17 +01:00

requirements.txt

fix: add missing runtime dependencies to requirements.txt

2026-03-27 19:48:33 +01:00

README.md

ComfyUI-PrismAudio

Custom nodes for PrismAudio (ICLR 2026) — video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning with a 518M parameter DiT diffusion model and Stable Audio 2.0 VAE.

Installation

Clone into your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes
git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt

flash-attn is optional. It is detected at runtime and falls back to PyTorch SDPA if unavailable.
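The runtime detection described above can be approximated as follows. This is a minimal sketch, not the plugin's actual code: the function name `pick_attention_backend` and the returned labels are assumptions made for illustration. It only checks whether the `flash_attn` package can be located.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" if the package can be located, else "sdpa".

    Hypothetical helper: the real plugin may detect flash-attn differently.
    """
    # find_spec only locates the module; it does not import it, so a
    # broken install could still fail later at import time.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "sdpa"


if __name__ == "__main__":
    print(pick_attention_backend())
```

Running the script in the ComfyUI Python environment shows which path would be taken on your machine.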

For the Feature Extractor node (video feature extraction), a separate conda environment is required — see Feature Extraction Environment below.

Nodes

| Node | Description |
| --- | --- |
| PrismAudio Model Loader | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: `precision` (auto/fp32/fp16/bf16), `offload_strategy` (auto/keep_in_vram/offload_to_cpu). |
| PrismAudio Feature Loader | Loads pre-computed `.npz` feature files for use with the sampler. |
| PrismAudio Feature Extractor | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
| PrismAudio Sampler | Main generation node. Takes model + features, produces AUDIO. Inputs: `duration`, `steps`, `cfg_scale`, `seed`. |
| PrismAudio Text Only | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: `text_prompt`, `duration`, `steps`, `cfg_scale`, `seed`. |
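Before handing a `.npz` file to the Feature Loader, it can help to check what it contains. An `.npz` file is a zip archive with one `.npy` member per array, so the standard library is enough to list the array names. The sketch below is illustrative only: `list_npz_arrays` is a hypothetical helper, and the array names PrismAudio expects are not documented here.

```python
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive.

    Hypothetical helper: it lists names only and does not load array data.
    """
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as "<name>.npy"; strip the 4-character suffix.
        return [name[:-4] for name in archive.namelist() if name.endswith(".npy")]
```

For example, `list_npz_arrays("features.npz")` returns the keys you would see in `numpy.load("features.npz").files`.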

Workflows

Quality Path (Video-to-Audio)

Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio

Pre-computed Path

PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio

Text-Only

PrismAudio Text Only → Save Audio

Note: CoT text is a STRING input on the sampler, so any existing ComfyUI LLM node can generate it.

HuggingFace Authentication

Required for gated models (T5-Gemma, and possibly Stable Audio VAE).

1. Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
2. Authenticate via one of:
   - Environment variable: export HF_TOKEN=hf_...
   - CLI login: huggingface-cli login

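The Feature Extractor node has an optional hf_token input that is forwarded to the extraction subprocess; when it is left empty, the node falls back to HF_TOKEN from the host environment. Prefer the environment variable: ComfyUI saves all STRING values to workflow JSON, so a token typed into a widget would be exposed in any workflow you share.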
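To confirm that a token is visible to the process that launches ComfyUI, a small check like the one below can be run in the same shell. It is a sketch under stated assumptions: `resolve_hf_token` is a hypothetical helper, and the lookup order (HF_TOKEN first, then the legacy HUGGING_FACE_HUB_TOKEN alias) is an assumption, not the plugin's confirmed behaviour.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(environ: Mapping[str, str]) -> Optional[str]:
    """Return the first non-empty token found in the environment, or None.

    Hypothetical helper; the lookup order is an assumption.
    """
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = environ.get(name, "").strip()
        if value:
            return value
    return None


if __name__ == "__main__":
    token = resolve_hf_token(os.environ)
    # Never print the token itself; only report whether one was found.
    print("token found" if token else "no token set: gated downloads will fail")
```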

Model Files

Weights are auto-downloaded to ComfyUI/models/prismaudio/:

| File | Size | Description |
| --- | --- | --- |
| `prismaudio.ckpt` | ~2.7 GB | Diffusion model |
| `vae.ckpt` | ~2.5 GB | Stable Audio 2.0 VAE |
| `synchformer_state_dict.pth` | ~950 MB | Synchformer |

T5-Gemma is cached in the standard HuggingFace cache directory (~/.cache/huggingface/).
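If a download was interrupted, it is useful to see which weight files are still missing. The sketch below checks the three file names from the table against a directory. `missing_weights` is a hypothetical helper, and it only tests for presence, not for file size or integrity.

```python
from pathlib import Path

# File names taken from the Model Files table.
EXPECTED_FILES = ("prismaudio.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def missing_weights(model_dir):
    """Return the expected weight files that are absent from model_dir."""
    model_dir = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Path is relative to the directory that contains ComfyUI.
    print(missing_weights("ComfyUI/models/prismaudio"))
```

An empty list means all three files are present.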

VRAM Requirements

| VRAM | Strategy |
| --- | --- |
| 24 GB+ | Keep all models in VRAM |
| 12–24 GB | Sequential offload |
| 8–12 GB | Aggressive offload + fp16 |
| < 8 GB | May work with aggressive offload |
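The table can be read as a simple threshold lookup. The sketch below encodes it as a function; `suggest_strategy` is a hypothetical helper, and assigning the shared boundaries (12 GB and 24 GB) to the higher tier is an assumption. It does not describe how the `auto` setting of `offload_strategy` actually decides.

```python
def suggest_strategy(vram_gb: float) -> str:
    """Map available VRAM in GB to the strategy listed in the table.

    Boundary values are assigned to the higher tier (an assumption).
    """
    if vram_gb >= 24:
        return "Keep all models in VRAM"
    if vram_gb >= 12:
        return "Sequential offload"
    if vram_gb >= 8:
        return "Aggressive offload + fp16"
    return "May work with aggressive offload"


if __name__ == "__main__":
    for gb in (6, 8, 16, 24):
        print(gb, "GB:", suggest_strategy(gb))
```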

Feature Extraction Environment

The PrismAudio Feature Extractor node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).

conda env create -f scripts/environment.yml
conda activate prismaudio-extract

Then set the python_env input on the Feature Extractor node to:

/path/to/conda/envs/prismaudio-extract/bin/python
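Before pointing the node at an interpreter, you can verify that the path runs and that the heavy dependencies import. The sketch below does this with a short subprocess call per module. `check_python_env` is a hypothetical helper, and the default module names `jax` and `tensorflow` are assumptions based on the "JAX/TF" description; the actual package set is defined by scripts/environment.yml.

```python
import subprocess
import sys


def check_python_env(python_env, modules=("jax", "tensorflow")):
    """Return {module: True/False} for whether python_env can import it.

    Hypothetical helper; the default module list is an assumption.
    """
    results = {}
    for module in modules:
        try:
            proc = subprocess.run(
                [python_env, "-c", f"import {module}"],
                capture_output=True,
                text=True,
                timeout=120,
            )
            results[module] = proc.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Interpreter missing, not executable, or import hung.
            results[module] = False
    return results


if __name__ == "__main__":
    # Pass the interpreter path as the first argument; defaults to this Python.
    target = sys.argv[1] if len(sys.argv) > 1 else sys.executable
    print(check_python_env(target))
```

Any `False` entry suggests the environment was not created correctly or the path points at the wrong interpreter.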

Troubleshooting

Gated model errors — Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
VRAM errors — Switch offload_strategy to offload_to_cpu, or use fp16 precision.
flash-attn — Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.

Credits

PrismAudio by FunAudioLLM (ICLR 2026). Paper & code.
