- Remove synchformer_ckpt input; the checkpoint is now always resolved
from models/prismaudio/ (errors early with a clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
auto-created venv, default) or comfyui_env (current Python, with warning)
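A minimal sketch of the early-failure lookup from the first bullet. The helper name resolve_synchformer_ckpt and the models_dir argument are hypothetical; the filename is assumed to be the synchformer_state_dict.pth used by the earlier auto-detect commit.

```python
from pathlib import Path

# Assumed filename (taken from the earlier auto-detect commit).
SYNCHFORMER_FILENAME = "synchformer_state_dict.pth"


def resolve_synchformer_ckpt(models_dir):
    """Return the checkpoint path, or fail early with an actionable message."""
    ckpt = Path(models_dir) / "prismaudio" / SYNCHFORMER_FILENAME
    if not ckpt.is_file():
        raise FileNotFoundError(
            f"Synchformer checkpoint not found: expected {ckpt}. "
            "Place the file there and re-run the node."
        )
    return ckpt
```

Raising before any subprocess is spawned keeps the error next to the node that caused it, rather than buried in extraction logs.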
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.
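A rough sketch of the precedence rule, assuming video_info arrives as a dict with a loaded_fps key. The helper name resolve_fps is hypothetical and not necessarily what the node uses.

```python
def resolve_fps(manual_fps, video_info=None):
    """Prefer loaded_fps from a connected video_info dict over the manual value."""
    if video_info is not None:
        loaded = video_info.get("loaded_fps")
        if loaded:  # ignore a missing or zero value
            return float(loaded)
    return float(manual_fps)
```

With nothing connected, resolve_fps(30) falls through to the manual value; with {"loaded_fps": 24} connected, the loaded rate wins.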
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
along with the DiT velocity diagnostics, conditioner stats logging, and
feature stats prints in both sampler.py and text_only.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.
This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
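To illustrate why zeros are a poor stand-in, here is a toy bias-free two-layer MLP in plain Python. The weights, the SiLU activation, and the learned_null vector are made up for the example; the real Cond_MLP may be structured differently.

```python
import math


def silu(x):
    # SiLU maps 0 to 0, so it cannot reintroduce a signal on its own.
    return x / (1.0 + math.exp(-x))


def linear_no_bias(weights, vec):
    """y = W @ x with no bias term."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def mlp_no_bias(w1, w2, vec):
    hidden = [silu(h) for h in linear_no_bias(w1, vec)]
    return linear_no_bias(w2, hidden)


w1 = [[0.5, -1.2], [0.3, 0.8], [-0.7, 0.1]]  # 2 -> 3
w2 = [[1.0, -0.5, 0.25], [0.2, 0.4, -0.9]]   # 3 -> 2

zero_out = mlp_no_bias(w1, w2, [0.0, 0.0])
learned_null = [0.12, -0.34]  # stand-in for a learned null embedding
null_out = mlp_no_bias(w1, w2, learned_null)

print("zeros in  ->", zero_out)
print("null in   ->", null_out)
```

Without a bias term, a zero input stays zero through every layer, so the downstream model sees "no signal" rather than the null conditioning it was trained on.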
Also improve text_prompt tooltip to encourage detailed CoT descriptions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Print key counts, missing/unexpected keys, and sample key names to
diagnose whether weights are actually loading correctly (strict=False
silently hides mismatches that would cause garbage audio output).
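A small sketch of the comparison being printed, written against plain key collections so it runs without a checkpoint. The helper name and the sample key names are hypothetical; the missing/unexpected split mirrors what PyTorch's load_state_dict reports.

```python
def diff_state_dict_keys(model_keys, ckpt_keys):
    """Split key mismatches the way load_state_dict(strict=False) reports them."""
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    missing = sorted(model_keys - ckpt_keys)     # model expects, checkpoint lacks
    unexpected = sorted(ckpt_keys - model_keys)  # checkpoint has, model ignores
    return missing, unexpected


# Hypothetical key names, for illustration only.
model = ["encoder.w", "decoder.w", "bottleneck.w"]
ckpt = ["encoder.w", "decoder.w", "discriminator.w"]

missing, unexpected = diff_state_dict_keys(model, ckpt)
print(f"model keys: {len(model)}, checkpoint keys: {len(ckpt)}")
print(f"missing ({len(missing)}): {missing}")
print(f"unexpected ({len(unexpected)}): {unexpected}")
```

A non-empty missing list is the case to worry about: those weights stay at their initial values and can produce garbage output.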
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows isolating which feature set causes quality issues:
- debug_zero_video: zero video_features → text+sync only
- debug_zero_sync: zero sync_features → text+video only
Also logs mean/std/shape for all three feature tensors on every run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stream raw RGB bytes from tensor directly to ffmpeg stdin.
Eliminates all intermediate PNG file I/O — much faster for large frame counts.
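A sketch of the streaming approach, assuming each frame is already available as packed RGB24 bytes. The helper names, the libx264 codec, and the yuv420p output format are assumptions for illustration, not necessarily what the node passes to ffmpeg.

```python
import subprocess


def build_ffmpeg_rawvideo_cmd(width, height, fps, out_path):
    """ffmpeg command that reads packed RGB24 frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo",          # input has no container
        "-pix_fmt", "rgb24",       # 3 bytes per pixel
        "-s", f"{width}x{height}",
        "-r", str(fps),
        "-i", "-",                 # read from stdin
        "-c:v", "libx264",         # assumed codec
        "-pix_fmt", "yuv420p",     # assumed output format (needs even dimensions)
        out_path,
    ]


def stream_frames(frames, width, height, fps, out_path):
    """Write an iterable of RGB24 frame bytes straight to ffmpeg's stdin."""
    expected = width * height * 3
    cmd = build_ffmpeg_rawvideo_cmd(width, height, fps, out_path)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != expected:
                raise ValueError(f"frame has {len(frame)} bytes, expected {expected}")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {returncode}")
```

Because rawvideo carries no header, the frame size and rate must be given before -i or ffmpeg cannot interpret the byte stream.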
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shows how long PIL+ffmpeg video export takes so we can see
if that's contributing to the gap before [extract] output appears.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When synchformer_ckpt input is empty, look for synchformer_state_dict.pth
in the ComfyUI prismaudio models directory automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
videoprism/__init__.py is empty — API lives in videoprism.models.
Fix: from videoprism import models as vp (not import videoprism as vp).
Also add flax to managed venv packages (required by videoprism Flax model).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RTX PRO 6000 (Blackwell, SM 12.0) fully supports CUDA 13. Switch from
jax[cpu]+jaxlib to jax[cuda13], which bundles jaxlib and uses
pip-managed CUDA libraries. Delete _extract_env to force a rebuild.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.
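The token precedence can be sketched as below. The helper name build_subprocess_env and the base_env parameter are hypothetical; only the two environment variable names come from this change.

```python
import os
import warnings


def build_subprocess_env(hf_token="", base_env=None):
    """Copy the environment and forward a HuggingFace token under both names."""
    env = dict(os.environ if base_env is None else base_env)
    # The node input wins; otherwise fall back to HF_TOKEN from the host.
    token = (hf_token or "").strip() or env.get("HF_TOKEN", "").strip()
    if token:
        env["HF_TOKEN"] = token
        env["HUGGING_FACE_HUB_TOKEN"] = token  # legacy alias
    else:
        warnings.warn(
            "No HuggingFace token set: gated model downloads will fail. "
            "Fill in hf_token on the node or export HF_TOKEN."
        )
    return env
```

Copying the environment rather than mutating os.environ keeps the token scoped to the extraction subprocess.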
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Empty string from clearing the node field caused subprocess to execute ''
which raises PermissionError. Now any blank or 'python' value uses the
auto-installed venv.
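A minimal sketch of the fallback rule. The helper name resolve_python_executable and the managed_python argument are hypothetical.

```python
def resolve_python_executable(value, managed_python):
    """Treat blank or the default 'python' as a request for the managed venv."""
    cleaned = (value or "").strip()
    if cleaned in ("", "python"):
        return managed_python
    return cleaned
```

Stripping first means a field containing only whitespace is handled the same way as an empty one.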
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exposes the video frame rate as an optional input (default 30).
Correct FPS ensures accurate temporal frame sampling in VideoPrism
and Synchformer feature extraction.
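To show why the source rate matters, here is a simple nearest-earlier-frame resampler. The function name and the target_fps parameter are hypothetical; the rates the extractors actually sample at are not stated here.

```python
def sample_frame_indices(num_frames, source_fps, target_fps):
    """Pick source frame indices that approximate target_fps over the clip."""
    if num_frames <= 0 or source_fps <= 0 or target_fps <= 0:
        raise ValueError("num_frames and both frame rates must be positive")
    duration = num_frames / source_fps          # seconds of video
    num_out = max(1, int(duration * target_fps))
    step = source_fps / target_fps              # source frames per output frame
    return [min(num_frames - 1, int(i * step)) for i in range(num_out)]
```

For 60 frames, declaring 30 fps yields 20 samples at a hypothetical 10 fps target, while declaring 60 fps yields only 10, so a wrong FPS changes both the clip duration and the sampling stride.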
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
write_video requires the optional 'av' (PyAV) package. Use PIL to save
frames as PNGs then combine with ffmpeg, which is always present in
ComfyUI Docker images.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Installs each package individually with [n/total] counters and
pip progress bars, so failures pinpoint the exact failing package.
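A sketch of the per-package loop. The function name and the injectable run parameter are hypothetical (the latter is there so the loop can be exercised without invoking pip).

```python
import subprocess


def install_packages(python_exe, packages, run=subprocess.run):
    """Install packages one at a time so a failure names the exact package."""
    total = len(packages)
    for n, pkg in enumerate(packages, start=1):
        print(f"[{n}/{total}] Installing {pkg}", flush=True)
        # Output is not captured, so pip's own progress and errors stay visible.
        result = run([python_exe, "-m", "pip", "install", pkg])
        if result.returncode != 0:
            raise RuntimeError(f"[{n}/{total}] pip install failed for {pkg}")
```

Installing one package per pip call is slower than a single batched install, but the counter in the error message identifies the failing package directly.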
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0
- capture_output=False so pip errors are visible in ComfyUI logs
- clean up incomplete venv dir before retrying install
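One way to detect and clear a half-built venv, sketched under the assumption that a marker file is written only after a successful install. The marker name .install_complete and the helper prepare_venv_dir are hypothetical.

```python
import shutil
from pathlib import Path

MARKER = ".install_complete"  # hypothetical: written only after a full install


def prepare_venv_dir(venv_dir):
    """Return True if the venv is complete; otherwise remove leftovers."""
    venv_dir = Path(venv_dir)
    if (venv_dir / MARKER).is_file():
        return True
    if venv_dir.exists():
        # A directory without the marker is a failed or interrupted install.
        shutil.rmtree(venv_dir)
    return False
```

A False return means the caller should create the venv from scratch and write the marker as the final step.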
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vae.ckpt is a full training checkpoint containing discriminator, STFT
loss modules, and EMA wrappers that are absent from the inference
AudioAutoencoder. strict=False ignores these training-only keys while
still loading all encoder/decoder/bottleneck weights correctly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PrismAudioFeatureExtractor now creates and populates a managed venv
(_extract_env/) automatically when python_env is left as the default
'python'. Also adds scripts/install_extract_env.sh for manual/Docker
setup without conda.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>