docs: rewrite README to reflect current node design

Update node descriptions, inputs/outputs, workflows, and environment setup to match current implementation (managed_env dropdown, VHS video_info, auto-duration, fps output, synchformer auto-resolve).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -8,56 +8,120 @@ Clone into your ComfyUI custom nodes directory:
 
 ```bash
 cd ComfyUI/custom_nodes
-git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
+git clone git@192.168.1.1:Ethanfel/ComfyUI-Prismaudio.git ComfyUI-PrismAudio
 pip install -r ComfyUI-PrismAudio/requirements.txt
 ```
 
-**flash-attn** is optional. It is detected at runtime and falls back to PyTorch SDPA if unavailable.
+**flash-attn** is optional — detected at runtime, falls back to PyTorch SDPA if unavailable.
 
-For the **Feature Extractor** node (video feature extraction), a separate conda environment is required — see [Feature Extraction Environment](#feature-extraction-environment) below.
-
 ## Nodes
 
-| Node | Description |
-|------|-------------|
-| **PrismAudio Model Loader** | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: `precision` (auto/fp32/fp16/bf16), `offload_strategy` (auto/keep_in_vram/offload_to_cpu). |
-| **PrismAudio Feature Loader** | Loads pre-computed `.npz` feature files for use with the sampler. |
-| **PrismAudio Feature Extractor** | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
-| **PrismAudio Sampler** | Main generation node. Takes model + features, produces AUDIO. Inputs: `duration`, `steps`, `cfg_scale`, `seed`. |
-| **PrismAudio Text Only** | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: `text_prompt`, `duration`, `steps`, `cfg_scale`, `seed`. |
+### PrismAudio Model Loader
+
+Loads the DiT diffusion model and VAE. Auto-downloads weights from HuggingFace on first use.
+
+| Input | Options | Description |
+|-------|---------|-------------|
+| `precision` | auto / fp32 / fp16 / bf16 | DiT and conditioner dtype. VAE is always fp32. |
+| `offload_strategy` | auto / keep_in_vram / offload_to_cpu | Memory management. |
+
+---
+
+### PrismAudio Feature Extractor
+
+Extracts video features (VideoPrism LvT, Synchformer) and text features (T5-Gemma) from a video in a subprocess. Results are cached on disk.
+
+| Input | Description |
+|-------|-------------|
+| `video` | IMAGE tensor from any ComfyUI video loader |
+| `caption_cot` | Chain-of-thought description of the audio scene |
+| `video_info` | *(optional)* `VHS_VIDEOINFO` from VHS LoadVideo — sets fps automatically |
+| `fps` | Source fps — ignored if `video_info` is connected |
+| `python_env` | `managed_env` (auto-created isolated venv, recommended) or `comfyui_env` (current Python, see warning below) |
+| `cache_dir` | Directory for cached `.npz` files. Empty = system temp dir. |
+| `hf_token` | HuggingFace token for gated models. Prefer `HF_TOKEN` env var instead. |
+
+**Outputs:** `features` (PRISMAUDIO_FEATURES), `fps` (FLOAT)
+
+**`managed_env`** auto-creates a venv at `_extract_env/` inside the plugin directory on first use and installs JAX, TF, VideoPrism, and Synchformer. This takes several minutes the first time.
+
+**`comfyui_env`** uses the current ComfyUI Python — JAX/TF/videoprism must already be installed. Installing them into the ComfyUI environment may conflict with existing packages.
+
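The `video_info` / `fps` precedence described in the added Feature Extractor rows can be sketched as follows. This is a minimal illustration under stated assumptions, not the node's actual code: `resolve_fps` is a hypothetical helper, and the `loaded_fps` key is an assumption about what the `VHS_VIDEOINFO` dictionary carries.

```python
from typing import Mapping, Optional


def resolve_fps(fps: float, video_info: Optional[Mapping[str, float]] = None) -> float:
    """Pick the fps used for feature extraction.

    Hypothetical helper: when `video_info` is connected, its fps wins and the
    manual `fps` input is ignored. The "loaded_fps" key is an assumption.
    """
    if video_info is not None and "loaded_fps" in video_info:
        return float(video_info["loaded_fps"])
    return float(fps)


# Manual fps is used when no video_info is connected.
print(resolve_fps(24.0))
# A connected video_info overrides the manual value.
print(resolve_fps(24.0, {"loaded_fps": 30.0}))
```

Returning the resolved value also mirrors the new `fps` (FLOAT) output, which downstream nodes could reuse instead of repeating the number by hand.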
+---
+
+### PrismAudio Feature Loader
+
+Loads a pre-computed `.npz` feature file. Use this to re-use extracted features without re-running the extractor.
+
+| Input | Description |
+|-------|-------------|
+| `npz_path` | Path to a `.npz` file produced by the Feature Extractor |
+
+---
+
+### PrismAudio Sampler
+
+Video-to-audio generation. Takes model + features, produces AUDIO.
+
+| Input | Description |
+|-------|-------------|
+| `model` | From Model Loader |
+| `features` | From Feature Extractor or Feature Loader |
+| `duration` | Audio duration in seconds. Set to `0` to use the video duration from features automatically. |
+| `steps` | Sampling steps (default: 100) |
+| `cfg_scale` | Classifier-free guidance scale (default: 7.0) |
+| `seed` | RNG seed |
+
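The `duration = 0` rule in the Sampler table above could be expressed roughly like this. It is a sketch only: `resolve_duration` is a hypothetical name, and the assumption that the features expose the video length under a `"duration"` key (in seconds) is not confirmed by the README.

```python
from typing import Mapping


def resolve_duration(duration: float, features: Mapping[str, float]) -> float:
    """Return the audio length in seconds.

    Hypothetical helper: 0 means "use the video duration stored with the
    features"; any positive value is taken as an explicit override.
    The "duration" key is an assumption about the feature payload.
    """
    if duration < 0:
        raise ValueError("duration must be >= 0")
    if duration == 0:
        return float(features["duration"])
    return float(duration)


features = {"duration": 8.5}              # assumed metadata, in seconds
print(resolve_duration(0, features))      # falls back to the video duration
print(resolve_duration(4.0, features))    # explicit value is kept
```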
+---
+
+### PrismAudio Text Only
+
+Text-to-audio generation without video. Uses the T5-Gemma encoder.
+
+| Input | Description |
+|-------|-------------|
+| `model` | From Model Loader |
+| `text_prompt` | Chain-of-thought audio scene description. Longer, more detailed prompts produce better results. |
+| `duration` | Audio duration in seconds |
+| `steps` | Sampling steps (default: 100) |
+| `cfg_scale` | Classifier-free guidance scale (default: 7.0) |
+| `seed` | RNG seed |
+
+---
 
 ## Workflows
 
-### Quality Path (Video-to-Audio)
+### Video-to-Audio
 
 ```
-Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio
+VHS LoadVideo ──► PrismAudio Feature Extractor ──► PrismAudio Sampler ──► Save Audio
+(video_info) ──────────────────► (fps auto)
+(features) ────────────────────► (features)
+duration=0 ─────────────────────► (auto from features)
 ```
 
-### Pre-computed Path
+### Pre-computed Features
 
 ```
-PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio
+PrismAudio Feature Loader (.npz) ──► PrismAudio Sampler ──► Save Audio
 ```
 
-### Text-Only
+### Text-to-Audio
 
 ```
-PrismAudio Text Only → Save Audio
+PrismAudio Text Only ──► Save Audio
 ```
 
-> **Note:** CoT text is a STRING input on the sampler. You can use any existing ComfyUI LLM nodes to generate it.
-
 ## HuggingFace Authentication
 
-Required for gated models (T5-Gemma, and possibly Stable Audio VAE).
+Required for T5-Gemma (gated model) and PrismAudio weights.
 
 1. Visit <https://huggingface.co/FunAudioLLM/PrismAudio> and accept the license.
 2. Authenticate via one of:
 - **Environment variable:** `export HF_TOKEN=hf_...`
 - **CLI login:** `huggingface-cli login`
 
-There is no `hf_token` widget on the nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token.
+There is no `hf_token` widget on the main nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token. The Feature Extractor has an `hf_token` input as a convenience but using `HF_TOKEN` env var is preferred.
 
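One way to honor the "prefer `HF_TOKEN`" advice above is to consult the environment first and only then fall back to the widget value. The sketch below assumes that precedence; the README states the preference but not the exact lookup order, and `resolve_hf_token` is a hypothetical helper.

```python
import os
from typing import Optional


def resolve_hf_token(widget_value: str = "") -> Optional[str]:
    """Return a HuggingFace token, or None if none is configured.

    Assumed precedence: the HF_TOKEN environment variable first, then the
    node's `hf_token` input. Whitespace-only values count as unset.
    """
    env_token = os.environ.get("HF_TOKEN", "").strip()
    if env_token:
        return env_token
    widget_value = widget_value.strip()
    return widget_value or None


token = resolve_hf_token()
print("token configured" if token else "no token configured")
```

Keeping the token in the environment means it never ends up in a saved workflow JSON, which is the concern the paragraph above raises.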
 ## Model Files
 
@@ -65,41 +129,27 @@ Weights are auto-downloaded to `ComfyUI/models/prismaudio/`:
 
 | File | Size | Description |
 |------|------|-------------|
-| `prismaudio.ckpt` | ~2.7 GB | Diffusion model |
+| `prismaudio.ckpt` | ~2.7 GB | Diffusion model (DiT) |
 | `vae.ckpt` | ~2.5 GB | Stable Audio 2.0 VAE |
-| `synchformer_state_dict.pth` | ~950 MB | Synchformer |
+| `synchformer_state_dict.pth` | ~950 MB | Synchformer visual encoder |
 
-T5-Gemma is cached in the standard HuggingFace cache directory (`~/.cache/huggingface/`).
+T5-Gemma and VideoPrism LvT are cached in `~/.cache/huggingface/`.
 
 ## VRAM Requirements
 
-| VRAM | Strategy |
-|------|----------|
-| 24 GB+ | Keep all models in VRAM |
-| 12–24 GB | Sequential offload |
-| 8–12 GB | Aggressive offload + fp16 |
-| < 8 GB | May work with aggressive offload |
+| VRAM | Recommended settings |
+|------|----------------------|
+| 24 GB+ | `keep_in_vram`, any precision |
+| 12–24 GB | `offload_to_cpu`, bf16/fp16 |
+| 8–12 GB | `offload_to_cpu`, fp16 |
+| < 8 GB | May work with `offload_to_cpu` + fp16 |
 
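The updated VRAM table can be read as a simple lookup. The sketch below encodes it; `recommend_settings` is a hypothetical helper, and two choices are assumptions where the table leaves room: `auto` stands in for "any precision", and `bf16` is picked from "bf16/fp16" for the 12–24 GB tier.

```python
def recommend_settings(vram_gb: float) -> dict:
    """Map available VRAM (in GB) to the README's recommended node settings.

    Hypothetical helper. Below 8 GB the README only says the settings
    "may work", so that tier is flagged as unverified.
    """
    if vram_gb >= 24:
        return {"offload_strategy": "keep_in_vram", "precision": "auto", "verified": True}
    if vram_gb >= 12:
        return {"offload_strategy": "offload_to_cpu", "precision": "bf16", "verified": True}
    if vram_gb >= 8:
        return {"offload_strategy": "offload_to_cpu", "precision": "fp16", "verified": True}
    return {"offload_strategy": "offload_to_cpu", "precision": "fp16", "verified": False}


for gb in (32, 16, 10, 6):
    print(gb, recommend_settings(gb))
```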
-## Feature Extraction Environment
-
-The **PrismAudio Feature Extractor** node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).
-
-```bash
-conda env create -f scripts/environment.yml
-conda activate prismaudio-extract
-```
-
-Then set the `python_env` input on the Feature Extractor node to:
-
-```
-/path/to/conda/envs/prismaudio-extract/bin/python
-```
-
 ## Troubleshooting
 
 - **Gated model errors** — Accept the license at <https://huggingface.co/FunAudioLLM/PrismAudio> and set `HF_TOKEN`.
-- **VRAM errors** — Switch `offload_strategy` to `offload_to_cpu`, or use `fp16` precision.
-- **flash-attn** — Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.
+- **VRAM errors** — Switch `offload_strategy` to `offload_to_cpu` and/or use `fp16` precision.
+- **Feature extraction fails** — Ensure `synchformer_state_dict.pth` is in `models/prismaudio/`. On first run with `managed_env`, installation takes several minutes.
+- **flash-attn** — Optional. Auto-detected at runtime; falls back to PyTorch SDPA.
 
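The flash-attn bullet above describes runtime detection with a fallback. A generic way to do that kind of check is sketched below; it is not taken from the plugin, `pick_attention_backend` is a hypothetical name, and the plugin may well use a `try`/`except ImportError` instead.

```python
import importlib.util


def pick_attention_backend() -> str:
    """Return "flash_attn" if the package is installed, else "sdpa".

    Hypothetical helper. find_spec only checks that the package can be
    located; it does not prove the compiled kernels load on this GPU.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attn"
    return "sdpa"


print(pick_attention_backend())
```

Because the check happens at call time, installing or removing `flash-attn` would change the result on the next run without any configuration change, which matches the "optional" framing in the README.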
 ## Credits
 