From 62a3c5d0dcc44fd32cf5475ce4ce65c6241afdce Mon Sep 17 00:00:00 2001
From: Ethanfel
Date: Sat, 28 Mar 2026 11:10:07 +0100
Subject: [PATCH] docs: rewrite README to reflect current node design

Update node descriptions, inputs/outputs, workflows, and environment setup
to match current implementation (managed_env dropdown, VHS video_info,
auto-duration, fps output, synchformer auto-resolve).

Co-Authored-By: Claude Sonnet 4.6
---
 README.md | 144 ++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 97 insertions(+), 47 deletions(-)

diff --git a/README.md b/README.md
index 2282d97..c0bd804 100644
--- a/README.md
+++ b/README.md
@@ -8,56 +8,120 @@ Clone into your ComfyUI custom nodes directory:
 
 ```bash
 cd ComfyUI/custom_nodes
-git clone -b prismaudio https://github.com/FunAudioLLM/ThinkSound ComfyUI-PrismAudio
+git clone git@192.168.1.1:Ethanfel/ComfyUI-Prismaudio.git ComfyUI-PrismAudio
 pip install -r ComfyUI-PrismAudio/requirements.txt
 ```
 
-**flash-attn** is optional. It is detected at runtime and falls back to PyTorch SDPA if unavailable.
-
-For the **Feature Extractor** node (video feature extraction), a separate conda environment is required — see [Feature Extraction Environment](#feature-extraction-environment) below.
+**flash-attn** is optional — detected at runtime, falls back to PyTorch SDPA if unavailable.
 
 ## Nodes
 
-| Node | Description |
-|------|-------------|
-| **PrismAudio Model Loader** | Loads the diffusion model and VAE. Auto-downloads weights from HuggingFace. Inputs: `precision` (auto/fp32/fp16/bf16), `offload_strategy` (auto/keep_in_vram/offload_to_cpu). |
-| **PrismAudio Feature Loader** | Loads pre-computed `.npz` feature files for use with the sampler. |
-| **PrismAudio Feature Extractor** | Subprocess bridge that extracts features from video. Requires a separate conda env with JAX/TF. |
-| **PrismAudio Sampler** | Main generation node. Takes model + features, produces AUDIO. Inputs: `duration`, `steps`, `cfg_scale`, `seed`. |
-| **PrismAudio Text Only** | Text-to-audio generation without video. Uses the T5-Gemma text encoder. Inputs: `text_prompt`, `duration`, `steps`, `cfg_scale`, `seed`. |
+### PrismAudio Model Loader
+
+Loads the DiT diffusion model and VAE. Auto-downloads weights from HuggingFace on first use.
+
+| Input | Options | Description |
+|-------|---------|-------------|
+| `precision` | auto / fp32 / fp16 / bf16 | DiT and conditioner dtype. VAE is always fp32. |
+| `offload_strategy` | auto / keep_in_vram / offload_to_cpu | Memory management. |
+
+---
+
+### PrismAudio Feature Extractor
+
+Extracts video features (VideoPrism LvT, Synchformer) and text features (T5-Gemma) from a video in a subprocess. Results are cached on disk.
+
+| Input | Description |
+|-------|-------------|
+| `video` | IMAGE tensor from any ComfyUI video loader |
+| `caption_cot` | Chain-of-thought description of the audio scene |
+| `video_info` | *(optional)* `VHS_VIDEOINFO` from VHS LoadVideo — sets fps automatically |
+| `fps` | Source fps — ignored if `video_info` is connected |
+| `python_env` | `managed_env` (auto-created isolated venv, recommended) or `comfyui_env` (current Python, see warning below) |
+| `cache_dir` | Directory for cached `.npz` files. Empty = system temp dir. |
+| `hf_token` | HuggingFace token for gated models. Prefer `HF_TOKEN` env var instead. |
+
+**Outputs:** `features` (PRISMAUDIO_FEATURES), `fps` (FLOAT)
+
+**`managed_env`** auto-creates a venv at `_extract_env/` inside the plugin directory on first use and installs JAX, TF, VideoPrism, and Synchformer. This takes several minutes the first time.
+
+**`comfyui_env`** uses the current ComfyUI Python — JAX/TF/videoprism must already be installed. Installing them into the ComfyUI environment may conflict with existing packages.
+
+---
+
+### PrismAudio Feature Loader
+
+Loads a pre-computed `.npz` feature file. Use this to re-use extracted features without re-running the extractor.
+
+| Input | Description |
+|-------|-------------|
+| `npz_path` | Path to a `.npz` file produced by the Feature Extractor |
+
+---
+
+### PrismAudio Sampler
+
+Video-to-audio generation. Takes model + features, produces AUDIO.
+
+| Input | Description |
+|-------|-------------|
+| `model` | From Model Loader |
+| `features` | From Feature Extractor or Feature Loader |
+| `duration` | Audio duration in seconds. Set to `0` to use the video duration from features automatically. |
+| `steps` | Sampling steps (default: 100) |
+| `cfg_scale` | Classifier-free guidance scale (default: 7.0) |
+| `seed` | RNG seed |
+
+---
+
+### PrismAudio Text Only
+
+Text-to-audio generation without video. Uses the T5-Gemma encoder.
+
+| Input | Description |
+|-------|-------------|
+| `model` | From Model Loader |
+| `text_prompt` | Chain-of-thought audio scene description. Longer, more detailed prompts produce better results. |
+| `duration` | Audio duration in seconds |
+| `steps` | Sampling steps (default: 100) |
+| `cfg_scale` | Classifier-free guidance scale (default: 7.0) |
+| `seed` | RNG seed |
+
+---
 
 ## Workflows
 
-### Quality Path (Video-to-Audio)
+### Video-to-Audio
 
 ```
-Video → PrismAudio Feature Extractor → PrismAudio Sampler → Save Audio
+VHS LoadVideo ──► PrismAudio Feature Extractor ──► PrismAudio Sampler ──► Save Audio
+ (video_info) ──────────────────► (fps auto)
+ (features) ────────────────────► (features)
+ duration=0 ─────────────────────► (auto from features)
 ```
 
-### Pre-computed Path
+### Pre-computed Features
 
 ```
-PrismAudio Feature Loader (.npz) → PrismAudio Sampler → Save Audio
+PrismAudio Feature Loader (.npz) ──► PrismAudio Sampler ──► Save Audio
 ```
 
-### Text-Only
+### Text-to-Audio
 
 ```
-PrismAudio Text Only → Save Audio
+PrismAudio Text Only ──► Save Audio
 ```
 
-> **Note:** CoT text is a STRING input on the sampler. You can use any existing ComfyUI LLM nodes to generate it.
-
 ## HuggingFace Authentication
 
-Required for gated models (T5-Gemma, and possibly Stable Audio VAE).
+Required for T5-Gemma (gated model) and PrismAudio weights.
 
 1. Visit and accept the license.
 2. Authenticate via one of:
    - **Environment variable:** `export HF_TOKEN=hf_...`
    - **CLI login:** `huggingface-cli login`
 
-There is no `hf_token` widget on the nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token.
+There is no `hf_token` widget on the main nodes by design — ComfyUI saves all STRING values to workflow JSON, which would expose your token. The Feature Extractor has an `hf_token` input as a convenience but using `HF_TOKEN` env var is preferred.
 
 ## Model Files
 
@@ -65,41 +129,27 @@ Weights are auto-downloaded to `ComfyUI/models/prismaudio/`:
 
 | File | Size | Description |
 |------|------|-------------|
-| `prismaudio.ckpt` | ~2.7 GB | Diffusion model |
+| `prismaudio.ckpt` | ~2.7 GB | Diffusion model (DiT) |
 | `vae.ckpt` | ~2.5 GB | Stable Audio 2.0 VAE |
-| `synchformer_state_dict.pth` | ~950 MB | Synchformer |
+| `synchformer_state_dict.pth` | ~950 MB | Synchformer visual encoder |
 
-T5-Gemma is cached in the standard HuggingFace cache directory (`~/.cache/huggingface/`).
+T5-Gemma and VideoPrism LvT are cached in `~/.cache/huggingface/`.
 
 ## VRAM Requirements
 
-| VRAM | Strategy |
-|------|----------|
-| 24 GB+ | Keep all models in VRAM |
-| 12–24 GB | Sequential offload |
-| 8–12 GB | Aggressive offload + fp16 |
-| < 8 GB | May work with aggressive offload |
-
-## Feature Extraction Environment
-
-The **PrismAudio Feature Extractor** node runs extraction in a subprocess using a separate Python environment (JAX/TF dependencies).
-
-```bash
-conda env create -f scripts/environment.yml
-conda activate prismaudio-extract
-```
-
-Then set the `python_env` input on the Feature Extractor node to:
-
-```
-/path/to/conda/envs/prismaudio-extract/bin/python
-```
+| VRAM | Recommended settings |
+|------|----------------------|
+| 24 GB+ | `keep_in_vram`, any precision |
+| 12–24 GB | `offload_to_cpu`, bf16/fp16 |
+| 8–12 GB | `offload_to_cpu`, fp16 |
+| < 8 GB | May work with `offload_to_cpu` + fp16 |
 
 ## Troubleshooting
 
 - **Gated model errors** — Accept the license at and set `HF_TOKEN`.
-- **VRAM errors** — Switch `offload_strategy` to `offload_to_cpu`, or use `fp16` precision.
-- **flash-attn** — Purely optional. Auto-detected at runtime; falls back to PyTorch SDPA.
+- **VRAM errors** — Switch `offload_strategy` to `offload_to_cpu` and/or use `fp16` precision.
+- **Feature extraction fails** — Ensure `synchformer_state_dict.pth` is in `models/prismaudio/`. On first run with `managed_env`, installation takes several minutes.
+- **flash-attn** — Optional. Auto-detected at runtime; falls back to PyTorch SDPA.
 
 ## Credits
 