Adds sync_strength (0.0–3.0, default 1.0) to PrismAudioSampler.

The scale is applied post-conditioner (after Sync_MLP) to the conditioning tensor before it enters the DiT. Since CFG always uses zeros as the null sync embedding, this cleanly scales the sync guidance signal:

    effective_sync_guidance = cfg_scale * (sync_strength * cond - 0)

Higher values tighten temporal audio-video alignment; 0.0 disables sync guidance entirely (audio conditioned only by video + text features). Not applied in T2A mode, where sync is replaced by the learned empty_sync_feat.

Also logs sync temporal coverage vs audio target duration, with a warning when they differ by more than 0.5s (stale or mismatched features).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
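The scaling rule and the coverage check described in this commit message can be illustrated with a minimal sketch. This is not the sampler's actual code: the function names are hypothetical, and plain Python lists stand in for the conditioning tensor so the example runs without any dependencies.

```python
from typing import List


def effective_sync_guidance(
    cond: List[float], cfg_scale: float, sync_strength: float = 1.0
) -> List[float]:
    """Apply cfg_scale * (sync_strength * cond - null), with an all-zero null embedding."""
    if not 0.0 <= sync_strength <= 3.0:
        raise ValueError("sync_strength must be within 0.0 and 3.0")
    null_sync = [0.0] * len(cond)  # CFG null sync embedding is zeros
    return [cfg_scale * (sync_strength * c - n) for c, n in zip(cond, null_sync)]


def coverage_mismatch(
    sync_seconds: float, audio_seconds: float, tolerance: float = 0.5
) -> bool:
    """Return True when sync coverage and target duration differ by more than tolerance."""
    return abs(sync_seconds - audio_seconds) > tolerance
```

With sync_strength set to 0.0 every element of the guidance term is zero, which matches the statement that sync guidance is then disabled.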
ComfyUI-PrismAudio
Custom nodes for PrismAudio (ICLR 2026) — video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning with a 518M parameter DiT diffusion model and Stable Audio 2.0 VAE.
Installation
Clone into your ComfyUI custom nodes directory:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/Ethanfel/ComfyUI-Prismaudio.git ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt
```
flash-attn is optional — detected at runtime, falls back to PyTorch SDPA if unavailable.
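Runtime detection of an optional package is commonly done by checking whether it can be imported. The sketch below shows one way to do that; the helper name and the returned labels are hypothetical, and the plugin's real detection logic may differ. The only fixed fact it relies on is that the flash-attn pip package is imported as flash_attn.

```python
import importlib.util


def pick_attention_backend(module_name: str = "flash_attn") -> str:
    """Return "flash_attn" if the optional module is importable, else "sdpa"."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module_name) is not None:
        return "flash_attn"
    return "sdpa"
```

Calling pick_attention_backend() with no arguments reports which path would be taken on the current machine.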
Nodes
PrismAudio Model Loader
Loads the DiT diffusion model and VAE. Auto-downloads weights from HuggingFace on first use.
| Input | Options | Description |
|---|---|---|
| precision | auto / fp32 / fp16 / bf16 | DiT and conditioner dtype. VAE is always fp32. |
| offload_strategy | auto / keep_in_vram / offload_to_cpu | Memory management. |
PrismAudio Feature Extractor
Extracts video features (VideoPrism LvT, Synchformer) and text features (T5-Gemma) from a video in a subprocess. Results are cached on disk.
| Input | Description |
|---|---|
| video | IMAGE tensor from any ComfyUI video loader |
| caption_cot | Chain-of-thought description of the audio scene |
| video_info | (optional) VHS_VIDEOINFO from VHS LoadVideo; sets fps automatically |
| fps | Source fps; ignored if video_info is connected |
| python_env | managed_env (auto-created isolated venv, recommended) or comfyui_env (current Python, see warning below) |
| cache_dir | Directory for cached .npz files. Empty = system temp dir. |
| hf_token | HuggingFace token for gated models. Prefer HF_TOKEN env var instead. |
Outputs: features (PRISMAUDIO_FEATURES), fps (FLOAT)
managed_env auto-creates a venv at _extract_env/ inside the plugin directory on first use and installs JAX, TF, VideoPrism, and Synchformer. This takes several minutes the first time.
comfyui_env uses the current ComfyUI Python — JAX/TF/videoprism must already be installed. Installing them into the ComfyUI environment may conflict with existing packages.
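The cache_dir rule from the input table (an empty value means the system temp dir) can be sketched as follows. The helper name is hypothetical and this is only an illustration of the documented behaviour, not the extractor's own code.

```python
import tempfile
from pathlib import Path


def resolve_cache_dir(cache_dir: str) -> Path:
    """Return the directory for cached .npz files, falling back to the system temp dir."""
    path = Path(cache_dir) if cache_dir.strip() else Path(tempfile.gettempdir())
    path.mkdir(parents=True, exist_ok=True)  # create a custom directory on first use
    return path
```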
PrismAudio Feature Loader
Loads a pre-computed .npz feature file. Use this to re-use extracted features without re-running the extractor.
| Input | Description |
|---|---|
| npz_path | Path to a .npz file produced by the Feature Extractor |
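Before wiring a feature file into a workflow it can help to check what it contains. A .npz file is a zip archive with one .npy entry per array, so the array names can be listed with the standard library alone. The helper name below is hypothetical, and which arrays the Feature Extractor writes is not documented here, so the sketch only lists whatever names it finds. To read the arrays themselves, numpy.load(path) is the usual route.

```python
import zipfile
from typing import List


def list_npz_arrays(npz_path: str) -> List[str]:
    """Return the array names stored in a .npz archive without loading the data."""
    with zipfile.ZipFile(npz_path) as archive:
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )
```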
PrismAudio Sampler
Video-to-audio generation. Takes model + features, produces AUDIO.
| Input | Description |
|---|---|
| model | From Model Loader |
| features | From Feature Extractor or Feature Loader |
| duration | Audio duration in seconds. Set to 0 to use the video duration from features automatically. |
| steps | Sampling steps (default: 100) |
| cfg_scale | Classifier-free guidance scale (default: 7.0) |
| seed | RNG seed |
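The duration rule in the table above (0 means "use the video duration from features") amounts to a small fallback. The sketch below is illustrative only: the function name is hypothetical, and how the sampler actually reads the video duration from the features is not specified in this README, so it is passed in as a plain number.

```python
def resolve_duration(duration: float, feature_duration: float) -> float:
    """Return the audio length in seconds, using the feature duration when duration is 0."""
    if duration < 0:
        raise ValueError("duration must be non-negative")
    return feature_duration if duration == 0 else duration
```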
PrismAudio Text Only
Text-to-audio generation without video. Uses the T5-Gemma encoder.
| Input | Description |
|---|---|
| model | From Model Loader |
| text_prompt | Chain-of-thought audio scene description. Longer, more detailed prompts produce better results. |
| duration | Audio duration in seconds |
| steps | Sampling steps (default: 100) |
| cfg_scale | Classifier-free guidance scale (default: 7.0) |
| seed | RNG seed |
Workflows
Video-to-Audio
VHS LoadVideo ──► PrismAudio Feature Extractor ──► PrismAudio Sampler ──► Save Audio
(video_info) ──────────────────► (fps auto)
(features) ────────────────────► (features)
duration=0 ─────────────────────► (auto from features)
Pre-computed Features
PrismAudio Feature Loader (.npz) ──► PrismAudio Sampler ──► Save Audio
Text-to-Audio
PrismAudio Text Only ──► Save Audio
HuggingFace Authentication
Required for T5-Gemma (gated model) and PrismAudio weights.
- Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
- Authenticate via one of:
  - Environment variable: export HF_TOKEN=hf_...
  - CLI login: huggingface-cli login
There is no hf_token widget on the main nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token. The Feature Extractor has an hf_token input as a convenience but using HF_TOKEN env var is preferred.
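The stated preference (environment variable first, widget value only as a convenience) could be expressed as the lookup below. This is a sketch under that assumption; the helper name is hypothetical and the Feature Extractor may resolve the token differently.

```python
import os
from typing import Optional


def resolve_hf_token(widget_value: str = "") -> Optional[str]:
    """Prefer the HF_TOKEN environment variable, then the widget value, else None."""
    token = os.environ.get("HF_TOKEN", "").strip() or widget_value.strip()
    return token or None
```

Returning None when nothing is set leaves room for a cached huggingface-cli login to be used instead.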
Model Files
Weights are auto-downloaded to ComfyUI/models/prismaudio/:
| File | Size | Description |
|---|---|---|
| prismaudio.ckpt | ~2.7 GB | Diffusion model (DiT) |
| vae.ckpt | ~2.5 GB | Stable Audio 2.0 VAE |
| synchformer_state_dict.pth | ~950 MB | Synchformer visual encoder |
T5-Gemma and VideoPrism LvT are cached in ~/.cache/huggingface/.
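To confirm that the auto-download completed, the three file names from the table can be checked against the models directory. The helper below is a hypothetical convenience, not part of the plugin; pass it the path to your own ComfyUI/models/prismaudio/ folder.

```python
from pathlib import Path
from typing import List

# File names taken from the Model Files table.
EXPECTED_WEIGHTS = ("prismaudio.ckpt", "vae.ckpt", "synchformer_state_dict.pth")


def missing_weights(models_dir: str) -> List[str]:
    """Return the expected weight files that are not present in models_dir."""
    root = Path(models_dir)
    return [name for name in EXPECTED_WEIGHTS if not (root / name).is_file()]
```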
VRAM Requirements
| VRAM | Recommended settings |
|---|---|
| 24 GB+ | keep_in_vram, any precision |
| 12–24 GB | offload_to_cpu, bf16/fp16 |
| 8–12 GB | offload_to_cpu, fp16 |
| < 8 GB | May work with offload_to_cpu + fp16 |
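The table above can be read as a simple lookup from available VRAM to loader settings. The sketch below encodes it under two assumptions that the table does not settle: boundary values (12 and 24 GB) fall into the higher tier, and "any precision" and "bf16/fp16" are mapped to auto and bf16 respectively. The function name is hypothetical.

```python
from typing import Tuple


def recommend_settings(vram_gb: float) -> Tuple[str, str]:
    """Return (offload_strategy, precision) suggested by the VRAM table."""
    if vram_gb >= 24:
        return "keep_in_vram", "auto"  # table says any precision works here
    if vram_gb >= 12:
        return "offload_to_cpu", "bf16"  # fp16 is listed as equally acceptable
    # Below 12 GB the table lists fp16 only; under 8 GB it may or may not fit.
    return "offload_to_cpu", "fp16"
```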
Troubleshooting
- Gated model errors: Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
- VRAM errors: Switch offload_strategy to offload_to_cpu and/or use fp16 precision.
- Feature extraction fails: Ensure synchformer_state_dict.pth is in models/prismaudio/. On first run with managed_env, installation takes several minutes.
- flash-attn: Optional. Auto-detected at runtime; falls back to PyTorch SDPA.
Credits
PrismAudio by FunAudioLLM (ICLR 2026). Model & weights.