T

Ethanfel a315093743 feat: sync_strength control and temporal coverage diagnostic in sampler

Adds sync_strength (0.0–3.0, default 1.0) to PrismAudioSampler.
The scale is applied post-conditioner (after Sync_MLP) to the conditioning
tensor before it enters the DiT. Since CFG always uses zeros as the null
sync embedding, this cleanly scales the sync guidance signal:
  effective_sync_guidance = cfg_scale * (sync_strength * cond - 0)
Higher values tighten temporal audio-video alignment; 0.0 disables sync
guidance entirely (audio conditioned only by video + text features).
Not applied in T2A mode where sync is replaced by the learned empty_sync_feat.

Also logs sync temporal coverage vs audio target duration, with a warning
when they differ by more than 0.5s (stale or mismatched features).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-28 16:23:41 +01:00

data_utils

fix: use videoprism_lvt_public_v1_large with joint video-text forward

2026-03-28 10:37:02 +01:00

docs/plans

docs: initial design and implementation plan

2026-03-27 16:57:15 +01:00

nodes

feat: sync_strength control and temporal coverage diagnostic in sampler

2026-03-28 16:23:41 +01:00

prismaudio_core

fix: interpolate sync_cond to match audio sequence length in transformer

2026-03-27 21:21:39 +01:00

scripts

fix: feature extractor CUDA detection, cache correctness, and short-video crash

2026-03-28 16:00:05 +01:00

workflows

chore: add updated V2A example workflow

2026-03-28 11:13:06 +01:00

__init__.py

fix: add plugin root to sys.path so prismaudio_core is importable

2026-03-27 19:41:11 +01:00

.gitignore

feat: auto-install pip venv for feature extraction on first use

2026-03-27 19:27:27 +01:00

README.md

docs: fix README references — PrismAudio not ThinkSound

2026-03-28 11:16:31 +01:00

requirements.txt

feat: LoRA trainer and loader nodes for PrismAudio DiT fine-tuning

2026-03-28 12:18:50 +01:00

README.md

ComfyUI-PrismAudio

Custom nodes for PrismAudio (ICLR 2026) — video-to-audio and text-to-audio generation using decomposed Chain-of-Thought reasoning with a 518M parameter DiT diffusion model and Stable Audio 2.0 VAE.

Installation

Clone into your ComfyUI custom nodes directory:

cd ComfyUI/custom_nodes
git clone https://github.com/Ethanfel/ComfyUI-Prismaudio.git ComfyUI-PrismAudio
pip install -r ComfyUI-PrismAudio/requirements.txt

flash-attn is optional — detected at runtime, falls back to PyTorch SDPA if unavailable.

Nodes

PrismAudio Model Loader

Loads the DiT diffusion model and VAE. Auto-downloads weights from HuggingFace on first use.

Input	Options	Description
`precision`	auto / fp32 / fp16 / bf16	DiT and conditioner dtype. VAE is always fp32.
`offload_strategy`	auto / keep_in_vram / offload_to_cpu	Memory management.

PrismAudio Feature Extractor

Extracts video features (VideoPrism LvT, Synchformer) and text features (T5-Gemma) from a video in a subprocess. Results are cached on disk.

Input	Description
`video`	IMAGE tensor from any ComfyUI video loader
`caption_cot`	Chain-of-thought description of the audio scene
`video_info`	(optional) `VHS_VIDEOINFO` from VHS LoadVideo — sets fps automatically
`fps`	Source fps — ignored if `video_info` is connected
`python_env`	`managed_env` (auto-created isolated venv, recommended) or `comfyui_env` (current Python, see warning below)
`cache_dir`	Directory for cached `.npz` files. Empty = system temp dir.
`hf_token`	HuggingFace token for gated models. Prefer `HF_TOKEN` env var instead.

Outputs: features (PRISMAUDIO_FEATURES), fps (FLOAT)

managed_env auto-creates a venv at _extract_env/ inside the plugin directory on first use and installs JAX, TF, VideoPrism, and Synchformer. This takes several minutes the first time.

comfyui_env uses the current ComfyUI Python — JAX/TF/videoprism must already be installed. Installing them into the ComfyUI environment may conflict with existing packages.

PrismAudio Feature Loader

Loads a pre-computed .npz feature file. Use this to re-use extracted features without re-running the extractor.

Input	Description
`npz_path`	Path to a `.npz` file produced by the Feature Extractor

PrismAudio Sampler

Video-to-audio generation. Takes model + features, produces AUDIO.

Input	Description
`model`	From Model Loader
`features`	From Feature Extractor or Feature Loader
`duration`	Audio duration in seconds. Set to `0` to use the video duration from features automatically.
`steps`	Sampling steps (default: 100)
`cfg_scale`	Classifier-free guidance scale (default: 7.0)
`seed`	RNG seed

PrismAudio Text Only

Text-to-audio generation without video. Uses the T5-Gemma encoder.

Input	Description
`model`	From Model Loader
`text_prompt`	Chain-of-thought audio scene description. Longer, more detailed prompts produce better results.
`duration`	Audio duration in seconds
`steps`	Sampling steps (default: 100)
`cfg_scale`	Classifier-free guidance scale (default: 7.0)
`seed`	RNG seed

Workflows

Video-to-Audio

VHS LoadVideo ──► PrismAudio Feature Extractor ──► PrismAudio Sampler ──► Save Audio
                         (video_info) ──────────────────► (fps auto)
                         (features) ────────────────────► (features)
                         duration=0 ─────────────────────► (auto from features)

Pre-computed Features

PrismAudio Feature Loader (.npz) ──► PrismAudio Sampler ──► Save Audio

Text-to-Audio

PrismAudio Text Only ──► Save Audio

HuggingFace Authentication

Required for T5-Gemma (gated model) and PrismAudio weights.

Visit https://huggingface.co/FunAudioLLM/PrismAudio and accept the license.
Authenticate via one of:
- Environment variable: export HF_TOKEN=hf_...
- CLI login: huggingface-cli login

There is no hf_token widget on the main nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token. The Feature Extractor has an hf_token input as a convenience but using HF_TOKEN env var is preferred.

Model Files

Weights are auto-downloaded to ComfyUI/models/prismaudio/:

File	Size	Description
`prismaudio.ckpt`	~2.7 GB	Diffusion model (DiT)
`vae.ckpt`	~2.5 GB	Stable Audio 2.0 VAE
`synchformer_state_dict.pth`	~950 MB	Synchformer visual encoder

T5-Gemma and VideoPrism LvT are cached in ~/.cache/huggingface/.

VRAM Requirements

VRAM	Recommended settings
24 GB+	`keep_in_vram`, any precision
12–24 GB	`offload_to_cpu`, bf16/fp16
8–12 GB	`offload_to_cpu`, fp16
< 8 GB	May work with `offload_to_cpu` + fp16

Troubleshooting

Gated model errors — Accept the license at https://huggingface.co/FunAudioLLM/PrismAudio and set HF_TOKEN.
VRAM errors — Switch offload_strategy to offload_to_cpu and/or use fp16 precision.
Feature extraction fails — Ensure synchformer_state_dict.pth is in models/prismaudio/. On first run with managed_env, installation takes several minutes.
flash-attn — Optional. Auto-detected at runtime; falls back to PyTorch SDPA.

Credits

PrismAudio by FunAudioLLM (ICLR 2026). Model & weights.

README.md Unescape Escape

ComfyUI-PrismAudio

Installation

Nodes

PrismAudio Model Loader

PrismAudio Feature Extractor

PrismAudio Feature Loader

PrismAudio Sampler

PrismAudio Text Only

Workflows

Video-to-Audio

Pre-computed Features

Text-to-Audio

HuggingFace Authentication

Model Files

VRAM Requirements

Troubleshooting

Credits

README.md