Both nodes moved their models to the GPU before doing work and back to the
CPU afterwards. Any exception (OOM, cancellation, bad input) skipped that
cleanup, leaving the models on the GPU until ComfyUI was restarted.
Wrap the entire work block in try/finally so offload_to_cpu cleanup
always runs regardless of how the node exits. Also removes the unused
`mode` variable in SelvaSampler.
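
A minimal sketch of the try/finally pattern described above. FakeModel and
run_node are hypothetical stand-ins, not the real node code; only the
offload_to_cpu name comes from this change, and its signature is assumed.

```python
class FakeModel:
    """Hypothetical stand-in for a real model; only tracks its device."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


def offload_to_cpu(*models):
    """Move every model back to the CPU (signature assumed for this sketch)."""
    for model in models:
        model.to("cpu")


def run_node(model, work, device="cuda"):
    """Run work(model) on `device`; the cleanup runs however the call exits."""
    try:
        model.to(device)
        return work(model)
    finally:
        # Reached on normal return, on OOM, on cancellation and on bad input.
        offload_to_cpu(model)
```

Putting the device move inside the try block means a failure halfway through
loading is cleaned up as well.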
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- selva_sampler: wrap decode+vocode in their own OOM catch — previously
OOM during mel decode or vocoding gave a raw CUDA traceback instead
of the actionable hint
- selva_feature_extractor: sync frames log line now shows (masked) when
a mask is active, matching the CLIP log line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows per-frame or static segmentation masks to be applied before CLIP
and sync encoding, zeroing background pixels. Useful when multiple objects
compete for the same sound and text prompting alone is insufficient.
- _apply_mask(): resizes mask spatially (nearest-exact), samples temporally
to match sampled frame count, multiplies into frames
- _hash_inputs(): includes mask bytes in cache key (begin/mid/end sampling)
- INPUT_TYPES: mask added to optional inputs with tooltip
- extract_features(): mask=None parameter, applied after _resize_frames for
both CLIP (384px) and sync (224px) paths, before normalization
- Log line notes when masking is active
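
The begin/mid/end sampling used for the cache key could look roughly like
this. It is a sketch only: the function names, the SHA-256 choice, the 1 MB
chunk size and the "mask" separator are assumptions, not the actual
_hash_inputs() code.

```python
import hashlib
from typing import Optional


def sample_bytes(data: bytes, chunk: int = 1 << 20) -> bytes:
    """Return the first, middle and last `chunk` bytes (or everything if small)."""
    if len(data) <= 3 * chunk:
        return data
    mid = len(data) // 2 - chunk // 2
    return data[:chunk] + data[mid:mid + chunk] + data[-chunk:]


def hash_inputs(video: bytes, prompt: str, mask: Optional[bytes] = None,
                chunk: int = 1 << 20) -> str:
    """Build a 32-hex-char cache key from sampled video bytes, mask and prompt."""
    h = hashlib.sha256()  # assumption: any stable digest would do here
    h.update(sample_bytes(video, chunk))
    if mask is not None:
        h.update(b"mask")  # separator so "no mask" and "empty mask" differ
        h.update(sample_bytes(mask, chunk))
    h.update(prompt.encode("utf-8"))
    return h.hexdigest()[:32]
```

Sampling three regions keeps hashing cheap on long clips while still
separating videos, or masks, that only differ late in the sequence.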
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Model Loader:
- bf16 support check — auto-falls back to fp16 on unsupported GPUs
- DESCRIPTION and OUTPUT_TOOLTIPS
Feature Extractor:
- Store variant in features dict and .npz cache
- Progress bar (3 steps: CLIP encode, T5 encode, sync encode)
- Expand cache hash to 32 hex chars
- DESCRIPTION and OUTPUT_TOOLTIPS
Sampler:
- Variant mismatch validation against extracted features
- Cancellation support via throw_exception_if_processing_interrupted()
- OOM catch with actionable error message
- normalize toggle (optional BOOLEAN, default true) for peak normalization
- Remove empty optional: {} block
- DESCRIPTION and OUTPUT_TOOLTIPS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nodes/__init__.py: fix [PrismAudio] leftover label in error print
- selva_feature_extractor: hash beginning, middle and end of video tensor
instead of just first 1MB, avoiding collisions on videos with same opening frames
- selva_sampler: derive SequenceConfig from model template via dataclasses.replace
instead of hardcoding sampling_rate/spectrogram_frame_rate per mode
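
For illustration, deriving a config from a template with dataclasses.replace
might look like the sketch below. This SequenceConfig is a hypothetical
stand-in with assumed fields and example values; the real class lives in the
model code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SequenceConfig:
    """Hypothetical stand-in; field names other than the two rates are assumed."""
    duration: float
    sampling_rate: int
    spectrogram_frame_rate: int


def config_for_duration(template: SequenceConfig, duration: float) -> SequenceConfig:
    """Copy the model's template, overriding only the duration."""
    return replace(template, duration=duration)
```

Because every other field is copied from the template, the sampler no longer
needs a per-mode table of rates.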
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This branch registers only the three SelVA nodes. PrismAudio nodes stay
on master/feature/lora-trainer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actual filenames in jnwnlee/SelVA: generator_*_44khz_sup_5.pth.
download_utils.py had the wrong names so those MD5s are unverified — set to
None to skip MD5 check for 44k generators. All other files verified/unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _ensure() trusted any existing file. Files downloaded by the
broken requests-based code (HTML error pages) would be silently reused.
Now checks MD5 on every load; deletes and re-downloads on mismatch.
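
A rough sketch of the verify-then-redownload flow. The ensure() and md5_of()
helpers and the download callback are hypothetical and do not mirror the real
_ensure() signature; passing None as the expected hash skips the check.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Hash the file in 1 MB blocks so large checkpoints are not read at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def ensure(path, expected_md5, download):
    """Return `path`, re-downloading it when it is missing or fails the MD5 check."""
    path = Path(path)
    # An existing file is no longer trusted blindly: a mismatch deletes it.
    if path.exists() and expected_md5 is not None and md5_of(path) != expected_md5:
        path.unlink()
    if not path.exists():
        download(path)  # hypothetical callback that writes the file to `path`
        if expected_md5 is not None and md5_of(path) != expected_md5:
            path.unlink()
            raise RuntimeError(f"MD5 mismatch after download: {path.name}")
    return path
```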
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
download_utils.py used requests without auth — jnwnlee/SelVA returned an
HTML error page which torch then failed to unpickle ('E' / opcode 69).
huggingface_hub.hf_hub_download() handles HF_TOKEN auth automatically,
validates downloads, and retries. Files are still copied to models/selva/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch 2.6 changed the torch.load() default to weights_only=True. SelVA
checkpoints contain non-tensor types (numpy scalars etc.) that fail strict
unpickling. All weights come from a trusted source (the jnwnlee/SelVA HF repo).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SelvaSampler: multiline:false puts negative_prompt inline above sliders
- SelvaModelLoader: VAE filenames in download_utils are v1-16.pth/v1-44.pth,
not v1-{mode}.pth (mode includes the 'k' suffix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move negative_prompt to required inputs, right after prompt, so it appears
above duration/steps/cfg/seed in the ComfyUI node layout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Users can now wire the prompt output directly to SelvaSampler's prompt input,
making the data flow explicit instead of relying on the implicit features fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI renders required inputs above optional ones. Moving negative_prompt
to optional puts prompt first (natural order) and negative_prompt at the
bottom where it belongs as a power-user input. Also guards against
negative_prompt=None when not connected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all
weights). Models download to models/selva/ on first load. Synchformer reuses
models/prismaudio/synchformer_state_dict.pth if already present (no duplicate
download for PrismAudio users), otherwise downloads to models/selva/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the
returned dict and the .npz cache). SelvaSampler's prompt is now optional —
when left empty it falls back to the prompt stored in features. A non-empty
override can still be passed when CLIP text should differ from the sync text.
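
The fallback could be sketched as follows. resolve_prompt is a hypothetical
helper, and the "prompt" dict key is an assumption about how SELVA_FEATURES
stores the text.

```python
def resolve_prompt(prompt, features):
    """Prefer a non-empty override, else the prompt stored with the features."""
    if prompt is not None and prompt.strip():
        return prompt
    stored = features.get("prompt")  # assumed key name
    if not stored:
        raise ValueError("No prompt given and none stored in the features")
    return stored
```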
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- selva_feature_extractor: cache hash now includes resolved duration;
same video + different duration override no longer returns stale features
- selva_sampler: MPS-safe noise generation (torch.Generator on CPU then
move to device, same pattern as PrismAudioSampler)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to
avoid rounding assertion mismatches. Progress bar tracks each Euler step.
Supports negative prompts for steering, normalizes output to [-1,1].
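
As a sketch, normalizing to [-1,1] can be done by dividing by the peak
absolute sample. Plain Python lists are used here to keep the example
dependency-free (the node itself works on tensors), and the eps guard is an
assumption.

```python
def peak_normalize(samples, eps=1e-8):
    """Scale samples so the largest absolute value becomes 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak < eps:
        # Silent or empty audio: return it unchanged instead of dividing by ~0.
        return list(samples)
    return [s / peak for s in samples]
```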
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLIP frames at 8fps→384px (normalize inside FeaturesUtils).
Sync frames at 25fps→224px, normalized to [-1,1] externally.
T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned
sync features extracted via TextSynch.encode_video_with_sync(). Results cached
as .npz keyed by hash(frames[:1MB] + prompt + fps + variant).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove synchformer_ckpt input — always resolved from models/prismaudio/
(errors early with clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
auto-created venv, default) or comfyui_env (current Python, with warning)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.
This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Print key counts, missing/unexpected keys, and sample key names to
diagnose whether weights are actually loading correctly (strict=False
silently hides mismatches that would cause garbage audio output).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows isolating which feature set causes quality issues:
- debug_zero_video: zero video_features → text+sync only
- debug_zero_sync: zero sync_features → text+video only
Also logs mean/std/shape for all three feature tensors on every run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Stream raw RGB bytes from tensor directly to ffmpeg stdin.
Eliminates all intermediate PNG file I/O — much faster for large frame counts.
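
One way to stream frames into ffmpeg is sketched below. The helper names are
hypothetical, frames are assumed to arrive as packed RGB24 bytes, and the
libx264/yuv420p output settings are an assumption rather than the exact
command used.

```python
import subprocess


def ffmpeg_rawvideo_cmd(width, height, fps, out_path):
    """Build an ffmpeg command that reads raw RGB24 frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",  # "-" means: read the input from stdin
        "-c:v", "libx264", "-pix_fmt", "yuv420p",  # assumed output settings
        str(out_path),
    ]


def stream_frames(frames, width, height, fps, out_path):
    """Write each frame (width*height*3 bytes) straight to ffmpeg's stdin."""
    proc = subprocess.Popen(
        ffmpeg_rawvideo_cmd(width, height, fps, out_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        for frame in frames:
            if len(frame) != width * height * 3:
                raise ValueError("frame size does not match width*height*3")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg exited with code {returncode}")
```

With a uint8 frame tensor, each frame's bytes would come from something like
the array's tobytes() after moving it to the CPU.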
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Shows how long PIL+ffmpeg video export takes so we can see
if that's contributing to the gap before [extract] output appears.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When synchformer_ckpt input is empty, look for synchformer_state_dict.pth
in the ComfyUI prismaudio models directory automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
videoprism/__init__.py is empty — API lives in videoprism.models.
Fix: from videoprism import models as vp (not import videoprism as vp).
Also add flax to managed venv packages (required by videoprism Flax model).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RTX PRO 6000 (Blackwell, SM 12.0) fully supports CUDA 13. Switch from
jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses
pip-managed CUDA libraries. Delete _extract_env to force a rebuild.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.
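
The token forwarding can be sketched like this. build_subprocess_env is a
hypothetical helper and the warning text is illustrative; the two variable
names are the ones mentioned above.

```python
import os


def build_subprocess_env(hf_token="", base_env=None):
    """Copy the environment and forward the HF token under both variable names."""
    env = dict(os.environ if base_env is None else base_env)
    # The node input wins; otherwise fall back to HF_TOKEN from the host.
    token = (hf_token or "").strip() or env.get("HF_TOKEN", "")
    if token:
        env["HF_TOKEN"] = token
        env["HUGGING_FACE_HUB_TOKEN"] = token  # legacy alias
    else:
        print("[warn] no HF token set; gated model downloads will fail")
    return env
```

The returned dict would then be passed as env= to the subprocess call.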
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Empty string from clearing the node field caused subprocess to execute ''
which raises PermissionError. Now any blank or 'python' value uses the
auto-installed venv.
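
A sketch of the blank-value guard. resolve_python and its venv_python
argument are hypothetical names for illustration.

```python
def resolve_python(python_env, venv_python):
    """Map a blank or bare 'python' setting to the managed venv interpreter."""
    value = (python_env or "").strip()
    if value in ("", "python"):
        return venv_python
    return value
```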
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Exposes the video frame rate as an optional input (default 30).
Correct FPS ensures accurate temporal frame sampling in VideoPrism
and Synchformer feature extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
write_video requires the optional 'av' (PyAV) package. Use PIL to save
frames as PNGs then combine with ffmpeg, which is always present in
ComfyUI Docker images.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Installs each package individually with [n/total] counters and
pip progress bars, so failures pinpoint the exact failing package.
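
Per-package installation with counters might be sketched as below. The helper
names are hypothetical, and the injectable run argument exists only so the
sketch can be exercised without actually calling pip.

```python
import subprocess
import sys


def install_commands(packages, python=sys.executable):
    """Return (label, command) pairs, one pip invocation per package."""
    total = len(packages)
    return [
        (f"[{i}/{total}] {pkg}",
         [python, "-m", "pip", "install", "--progress-bar", "on", pkg])
        for i, pkg in enumerate(packages, start=1)
    ]


def install_packages(packages, python=sys.executable, run=subprocess.run):
    """Install packages one at a time so a failure names the exact package."""
    for label, cmd in install_commands(packages, python):
        print(f"Installing {label}", flush=True)
        result = run(cmd)
        if result.returncode != 0:
            raise RuntimeError(f"pip install failed at {label}")
```

Installation stops at the first failing package, so the last counter printed
identifies it.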
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>