Ethanfel 11457fc27a debug: fix VAE load_state_dict diagnostic — load into .model directly

AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-27 21:56:06 +01:00

data_utils

feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16)

2026-03-27 21:28:20 +01:00

docs/plans

docs: initial design and implementation plan

2026-03-27 16:57:15 +01:00

nodes

debug: fix VAE load_state_dict diagnostic — load into .model directly

2026-03-27 21:56:06 +01:00

prismaudio_core

fix: interpolate sync_cond to match audio sequence length in transformer

2026-03-27 21:21:39 +01:00

scripts

feat: add per-step timing to feature extraction logs

2026-03-27 21:13:42 +01:00

workflows

feat: add fps input to PrismAudioFeatureExtractor

2026-03-27 20:08:10 +01:00

__init__.py

fix: add plugin root to sys.path so prismaudio_core is importable

2026-03-27 19:41:11 +01:00

.gitignore

feat: auto-install pip venv for feature extraction on first use

2026-03-27 19:27:27 +01:00

README.md

docs: README with installation and usage instructions

2026-03-27 18:15:17 +01:00

requirements.txt

fix: add missing runtime dependencies to requirements.txt

2026-03-27 19:48:33 +01:00

README.md

ComfyUI-PrismAudio

Custom nodes for PrismAudio (ICLR 2026): video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning, with a 518M-parameter DiT diffusion model and the Stable Audio 2.0 VAE.

Installation

Clone into your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes
git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt

flash-attn is optional. It is detected at runtime and falls back to PyTorch SDPA if unavailable.
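The runtime detection described above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: the helper name `pick_attention_backend` is hypothetical, and it checks nothing more than whether the `flash_attn` package can be located.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" if the package is importable, else "sdpa".

    Hypothetical helper for illustration; the plugin's own check may differ.
    """
    # find_spec returns None when the module cannot be located,
    # and it does not import the module itself.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    # Fallback: PyTorch's built-in scaled dot-product attention (SDPA).
    return "sdpa"


print(pick_attention_backend())
```

Running this inside the ComfyUI Python environment shows which of the two paths would be available there.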

For the Feature Extractor node (video feature extraction), a separate conda environment is required — see Feature Extraction Environment below.

Nodes

| Node | Description |
| --- | --- |
| PrismAudio Model Loader | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: `precision` (auto/fp32/fp16/bf16), `offload_strategy` (auto/keep_in_vram/offload_to_cpu). |
| PrismAudio Feature Loader | Loads pre-computed `.npz` feature files for use with the sampler. |
| PrismAudio Feature Extractor | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
| PrismAudio Sampler | Main generation node. Takes model + features, produces AUDIO. Inputs: `duration`, `steps`, `cfg_scale`, `seed`. |
| PrismAudio Text Only | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: `text_prompt`, `duration`, `steps`, `cfg_scale`, `seed`. |

Workflows

Quality Path (Video-to-Audio)

Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio

Pre-computed Path

PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio
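Before wiring a pre-computed file into the Feature Loader, it can help to see what it contains. An `.npz` file is a zip archive with one `.npy` member per array, so the array names can be listed with the standard library alone. The sketch below assumes nothing about which names PrismAudio expects, since the README does not list them; `list_npz_arrays` is a hypothetical helper.

```python
import zipfile


def list_npz_arrays(path: str) -> list[str]:
    """List the array names stored in an .npz archive without loading them."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as a "<name>.npy" member of the zip archive.
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )
```

Calling `list_npz_arrays("/path/to/features.npz")` returns the stored names in sorted order; to read the arrays themselves, use `numpy.load` on the same file.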

Text-Only

PrismAudio Text Only → Save Audio

Note: CoT text is a STRING input on the sampler. You can use any existing ComfyUI LLM nodes to generate it.

HuggingFace Authentication

Authentication is required for gated models (T5-Gemma, and possibly the Stable Audio VAE).

1. Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
2. Authenticate via one of:
   - Environment variable: `export HF_TOKEN=hf_...`
   - CLI login: `huggingface-cli login`

There is no hf_token widget on the nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token.
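To confirm that one of the two authentication methods is in place before launching ComfyUI, a quick check along these lines may help. It is a sketch under assumptions: `find_hf_token` is a hypothetical helper, and the token file location (`~/.cache/huggingface/token`) is the default used by `huggingface-cli login` only when `HF_HOME` has not been overridden. The function reports where a token was found and never prints the token itself.

```python
import os
from pathlib import Path


def find_hf_token(env=None, token_file=None):
    """Return where a HuggingFace token was found: "env", "file", or None."""
    env = os.environ if env is None else env
    # Method 1: the HF_TOKEN environment variable.
    if env.get("HF_TOKEN"):
        return "env"
    # Method 2 (assumed default path): the file written by
    # `huggingface-cli login` when HF_HOME is not overridden.
    if token_file is None:
        token_file = Path.home() / ".cache" / "huggingface" / "token"
    if Path(token_file).is_file():
        return "file"
    return None


print(find_hf_token())
```

A result of `None` suggests that gated downloads will fail until one of the methods above is set up.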

Model Files

Weights are auto-downloaded to ComfyUI/models/prismaudio/:

| File | Size | Description |
| --- | --- | --- |
| `prismaudio.ckpt` | ~2.7 GB | Diffusion model |
| `vae.ckpt` | ~2.5 GB | Stable Audio 2.0 VAE |
| `synchformer_state_dict.pth` | ~950 MB | Synchformer |

T5-Gemma is cached in the standard HuggingFace cache directory (~/.cache/huggingface/).
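If a download was interrupted or the weights were copied by hand, the following sketch reports which of the three files listed above are absent. The helper name `missing_model_files` is hypothetical, and the check covers file presence only, not size or integrity.

```python
from pathlib import Path

# File names taken from the Model Files table.
EXPECTED_FILES = ("prismaudio.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def missing_model_files(models_dir):
    """Return the expected weight files that are not present in models_dir."""
    folder = Path(models_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Path relative to the directory that contains ComfyUI; adjust as needed.
print(missing_model_files("ComfyUI/models/prismaudio"))
```

An empty list means all three files are in place; T5-Gemma is not covered because it lives in the HuggingFace cache.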

VRAM Requirements

| VRAM | Strategy |
| --- | --- |
| 24 GB+ | Keep all models in VRAM |
| 12–24 GB | Sequential offload |
| 8–12 GB | Aggressive offload + fp16 |
| < 8 GB | May work with aggressive offload |
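The table can be read as a simple threshold lookup, sketched below. `suggest_strategy` is a hypothetical helper, not part of the plugin, and the handling of the boundary values is an assumption: the table's ranges touch at 12 GB, so this sketch assigns exactly 12 GB to the higher tier.

```python
def suggest_strategy(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the strategy named in the table."""
    if vram_gb >= 24:
        return "Keep all models in VRAM"
    # Assumption: a boundary value falls into the higher tier.
    if vram_gb >= 12:
        return "Sequential offload"
    if vram_gb >= 8:
        return "Aggressive offload + fp16"
    return "May work with aggressive offload"


print(suggest_strategy(16))
```

The returned text is the table's wording, not a value for the `offload_strategy` input, which accepts only `auto`, `keep_in_vram`, or `offload_to_cpu`.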

Feature Extraction Environment

The PrismAudio Feature Extractor node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).

conda env create -f scripts/environment.yml
conda activate prismaudio-extract

Then set the python_env input on the Feature Extractor node to:

/path/to/conda/envs/prismaudio-extract/bin/python
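Because the node launches that interpreter as a subprocess, a wrong path only shows up when extraction starts. The sketch below verifies a candidate path ahead of time by asking the interpreter for its version. `check_python_env` is a hypothetical helper and does not reproduce how the node invokes its extraction script.

```python
import subprocess
import sys


def check_python_env(python_env: str) -> str:
    """Run the given interpreter and return its version string."""
    result = subprocess.run(
        [python_env, "-c", "import sys; print(sys.version.split()[0])"],
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError on a non-zero exit code
        timeout=30,
    )
    return result.stdout.strip()


# Usage: the current interpreter stands in for the conda env's python here.
print(check_python_env(sys.executable))
```

A `FileNotFoundError` indicates that the path does not point to an executable, which is the most likely cause of a failed `python_env` setting.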

Troubleshooting

Gated model errors — Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
VRAM errors — Switch offload_strategy to offload_to_cpu, or use fp16 precision.
flash-attn — Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.

Credits

PrismAudio by FunAudioLLM (ICLR 2026). Paper & code.
