The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Point links to huggingface.co/FunAudioLLM/PrismAudio and use the public
GitHub URL for install instructions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.
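A minimal sketch of the fallback, assuming the features dict stores the
value under a "duration" key (the key name, the helper name, and the
10-second last-resort default are assumptions, not taken from the code):

```python
def resolve_duration(requested, features, fallback=10.0):
    """Return the audio duration in seconds.

    `requested` is the node's duration input; 0 (or less) means "use the
    duration stored by the feature extractor". The "duration" key and the
    fallback value are assumptions made for this sketch.
    """
    if requested > 0:
        return float(requested)
    stored = features.get("duration")
    if stored is None or stored <= 0:
        # Nothing usable in the features dict: fall back to a fixed default.
        return float(fallback)
    return float(stored)
```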
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove synchformer_ckpt input; it is always resolved from
  models/prismaudio/ (errors early with a clear message if missing)
- Replace python_env string input with a dropdown: managed_env (isolated
  auto-created venv, default) or comfyui_env (current Python, with warning)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The wrong model (videoprism_public_v1_large, vision-only) was used,
causing V2A audio distortion. Switch to the LvT variant which has a
text tower, pass CoT captions for joint encoding, and extract per-frame
features from outputs['frame_embeddings'] (L2-normalized, [T, 1024])
instead of manually averaging spatial patches.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.
This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Zero features passed through the bias-free Cond_MLP produce near-zero
activations, not the learned null signal the model was trained with. Use
empty_clip_feat (the learned null video embedding), just as
empty_sync_feat is used for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Print key counts, missing/unexpected keys, and sample key names to
diagnose whether weights are actually loading correctly (strict=False
silently hides mismatches that would cause garbage audio output).
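For reference, the comparison being logged can be sketched on plain key
lists. torch.nn.Module.load_state_dict(strict=False) reports the same two
groups as missing_keys and unexpected_keys; the helper name and the
example key names below are hypothetical:

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not use
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


def report_key_diff(model_keys, checkpoint_keys, sample=5):
    """Print counts and a few sample names for each group."""
    missing, unexpected = diff_state_dict_keys(model_keys, checkpoint_keys)
    print(f"model keys: {len(set(model_keys))}, "
          f"checkpoint keys: {len(set(checkpoint_keys))}")
    print(f"missing: {len(missing)} {missing[:sample]}")
    print(f"unexpected: {len(unexpected)} {unexpected[:sample]}")
    return missing, unexpected
```

A non-empty missing list is the case that matters here: those weights
stay at their initial values even though the load appears to succeed.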
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows isolating which feature set causes quality issues:
- debug_zero_video: zero video_features → text+sync only
- debug_zero_sync: zero sync_features → text+video only
Also logs mean/std/shape for all three feature tensors on every run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sync_MLP interpolates sync features based on video duration, but audio
latent length depends on the user-set audio duration. When the video and
audio durations differ, the two sequence lengths diverge. Resample
sync_cond to x's length before the gated addition so any video/audio
duration combination works.
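A rough NumPy sketch of the resampling idea, assuming linear
interpolation along the time axis with endpoints aligned. The function
name is hypothetical and the interpolation mode used in the model code
may differ:

```python
import numpy as np


def resample_time(seq, target_len):
    """Linearly resample a [T, C] array to [target_len, C]."""
    src_len, channels = seq.shape
    if src_len == target_len:
        return seq
    # Place source and target frames on a shared 0..1 time axis.
    src_pos = np.linspace(0.0, 1.0, src_len)
    dst_pos = np.linspace(0.0, 1.0, target_len)
    # np.interp works on 1-D data, so interpolate each channel separately.
    return np.stack(
        [np.interp(dst_pos, src_pos, seq[:, c]) for c in range(channels)],
        axis=1,
    )
```

After this step sync_cond and x have the same length, so the gated
addition no longer depends on the video and audio durations matching.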
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stream raw RGB bytes from tensor directly to ffmpeg stdin.
This eliminates all intermediate PNG file I/O and is much faster for
large frame counts.
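A sketch of the streaming approach, assuming each frame has already been
converted to packed uint8 RGB bytes (height * width * 3 bytes per frame).
The helper names are hypothetical and output encoder options are left at
ffmpeg's defaults:

```python
import subprocess


def rawvideo_ffmpeg_cmd(width, height, fps, out_path):
    """Build an ffmpeg command that reads raw RGB24 frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",          # input is headerless raw frames
        "-pix_fmt", "rgb24",       # 3 bytes per pixel, R-G-B order
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",                 # read from stdin
        out_path,
    ]


def stream_frames(frames, width, height, fps, out_path):
    """Write an iterable of per-frame byte strings to ffmpeg's stdin."""
    cmd = rawvideo_ffmpeg_cmd(width, height, fps, out_path)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for frame in frames:
        proc.stdin.write(frame)
    proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError(f"ffmpeg exited with code {proc.returncode}")
```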
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shows how long PIL+ffmpeg video export takes so we can see
if that's contributing to the gap before [extract] output appears.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each step now prints elapsed seconds on completion.
Total time printed at the end to identify bottlenecks.
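The timing pattern can be sketched with a small context manager. The
step labels and the timed() helper are illustrative, not the names used
in extract_features.py:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record and print the elapsed seconds for one step."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = elapsed
        print(f"[extract] {label}: {elapsed:.2f}s")


timings = {}
with timed("load models", timings):
    time.sleep(0.01)  # stand-in for real work
with timed("encode", timings):
    time.sleep(0.01)  # stand-in for real work
print(f"[extract] total: {sum(timings.values()):.2f}s")
```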
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
prismaudio.json conditioner config requires:
- video_features: dim=1024 → switch videoprism_public_v1_base → large (ViT-L)
- sync_features: dim=768, length divisible by 8 → expand [num_seg,768] to
  [num_seg*8,768] (per-frame) so Sync_MLP can reshape by groups of 8
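One way to do the sync_features expansion, assuming each segment vector
is simply repeated 8 times (an assumption; the extractor may instead keep
distinct per-frame vectors). The function name is hypothetical:

```python
import numpy as np


def expand_sync_features(segments, frames_per_segment=8):
    """Expand [num_seg, D] to [num_seg * frames_per_segment, D]."""
    if segments.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {segments.shape}")
    # Repeat each row in place so rows of one segment stay contiguous.
    return np.repeat(segments, frames_per_segment, axis=0)
```

The result always has a length divisible by 8, which is what the
group-of-8 reshape in Sync_MLP needs.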
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
T5-Gemma outputs BFloat16 which numpy does not support.
Cast all feature tensors with .float() before .numpy().
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When synchformer_ckpt input is empty, look for synchformer_state_dict.pth
in the ComfyUI prismaudio models directory automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
videoprism/__init__.py is empty — API lives in videoprism.models.
Fix: from videoprism import models as vp (not import videoprism as vp).
Also add flax to managed venv packages (required by videoprism Flax model).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RTX 6000 Pro (Blackwell SM 12.0) fully supports CUDA 13. Switch from
jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses
pip-managed CUDA libraries. Delete _extract_env to force a rebuild.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.
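A sketch of the token resolution order described above. The function
name and the warning text are illustrative; HF_TOKEN and
HUGGING_FACE_HUB_TOKEN are the variable names from this commit:

```python
import os


def build_subprocess_env(hf_token="", base_env=None):
    """Return an env dict for the extraction subprocess.

    Priority: the node's hf_token input, then HF_TOKEN already present in
    the host environment. The token is exported under both variable names.
    """
    env = dict(os.environ if base_env is None else base_env)
    token = (hf_token or "").strip() or env.get("HF_TOKEN", "")
    if token:
        env["HF_TOKEN"] = token
        env["HUGGING_FACE_HUB_TOKEN"] = token  # legacy alias
    else:
        print("warning: no Hugging Face token set; "
              "gated model downloads will fail")
    return env
```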
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
An empty string from clearing the node field caused subprocess to
execute '', which raises PermissionError. Now any blank or 'python' value
uses the auto-installed venv.
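The rule amounts to the following check (a sketch; the function and
parameter names are hypothetical):

```python
def resolve_python(python_env, managed_python):
    """Pick the interpreter path for the extraction subprocess.

    Blank input (including whitespace or None) and the literal default
    'python' both select the managed venv interpreter.
    """
    value = (python_env or "").strip()
    if value in ("", "python"):
        return managed_python
    return value
```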
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Creates data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils:
- T5-Gemma text encoding via transformers
- VideoPrism video encoding via JAX videoprism package
- Synchformer visual encoder loading from checkpoint
Also fixes extract_features.py to add plugin root to sys.path so
data_utils is importable in the subprocess venv.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exposes the video frame rate as an optional input (default 30).
Correct FPS ensures accurate temporal frame sampling in VideoPrism
and Synchformer feature extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
write_video requires the optional 'av' (PyAV) package. Instead, use PIL
to save frames as PNGs, then combine them with ffmpeg, which is always
present in ComfyUI Docker images.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Installs each package individually with [n/total] counters and
pip progress bars, so failures pinpoint the exact failing package.
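A sketch of the per-package loop. The function name is hypothetical and
the run parameter exists only so the loop can be exercised without
invoking pip:

```python
import subprocess
import sys


def install_packages(packages, python=sys.executable, run=subprocess.run):
    """Install packages one at a time so a failure names the package."""
    total = len(packages)
    for n, pkg in enumerate(packages, start=1):
        print(f"[{n}/{total}] installing {pkg}")
        # Output is not captured, so pip's progress and errors stay visible.
        result = run([python, "-m", "pip", "install", pkg])
        if result.returncode != 0:
            raise RuntimeError(f"[{n}/{total}] pip install failed for {pkg}")
```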
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0
- capture_output=False so pip errors are visible in ComfyUI logs
- clean up incomplete venv dir before retrying install
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vae.ckpt is a full training checkpoint containing discriminator, STFT
loss modules, and EMA wrappers that are absent from the inference
AudioAutoencoder. strict=False ignores these training-only keys while
still loading all encoder/decoder/bottleneck weights correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
einops-exts, vector-quantize-pytorch, and scipy were imported by
prismaudio_core but were not listed in requirements.txt.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI does not add the custom node directory to sys.path automatically,
so prismaudio_core (a package inside the plugin dir) was not found at runtime.
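The usual fix looks roughly like this (a sketch; the helper name is
hypothetical, and in the plugin it would be called with the directory
containing __init__.py):

```python
import os
import sys


def ensure_on_sys_path(directory):
    """Prepend a directory to sys.path once and return its absolute path."""
    directory = os.path.abspath(directory)
    if directory not in sys.path:
        sys.path.insert(0, directory)
    return directory


# In the plugin's __init__.py this would typically be:
# ensure_on_sys_path(os.path.dirname(__file__))
```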
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PrismAudioFeatureExtractor now creates and populates a managed venv
(_extract_env/) automatically when python_env is left as the default
'python'. Also adds scripts/install_extract_env.sh for manual/Docker
setup without conda.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MMDiTWrapper was removed from diffusion.py during cleanup but the import
in factory.py was missed, causing ImportError on every model load.
Also stub wavelet and diffusion_prior paths that reference deleted modules.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>