- Replace zero-fill with neutral gray (0.5) fill so masked background
pixels stay in-distribution: 0.5 maps to ~0 in CLIP normalized space
and exactly 0 after sync's [-1,1] normalization
- Add mask_strength float (0–1) for partial background suppression
- Add mask_clip / mask_sync booleans to toggle masking independently
on the CLIP (384px) and TextSynchformer (224px) encoding paths
- Fix temporal mask sampling: use fps-accurate index formula (same as
_sample_frames) instead of proportional int(i*M/N)
- Include mask_strength, mask_clip, mask_sync in cache hash when mask
is connected, so changing any param correctly busts the cache
- Log lines now report masked/skipped state and strength per path
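A minimal sketch of the gray-fill blend described above, applied to one pixel value in [0, 1]. The function names and the exact blend formula are assumptions for illustration, not the node's actual code; only the 0.5 fill value, the 0–1 strength range, and the [-1,1] normalization come from the notes above.

```python
def gray_fill(pixel, mask, strength=1.0, fill=0.5):
    """Blend a pixel toward `fill` wherever the mask is 0 (hypothetical helper)."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # keep is 1.0 inside the mask; outside it drops to (1 - strength).
    keep = 1.0 - strength * (1.0 - mask)
    return pixel * keep + fill * (1.0 - keep)


def sync_normalize(x):
    """Map a value from [0, 1] to [-1, 1]."""
    return x * 2.0 - 1.0
```

With strength 1.0 a fully masked-out pixel becomes 0.5, which the [-1,1] normalization maps to exactly 0; with strength 0.0 the frame is left unchanged.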
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both nodes moved models to the GPU before doing work and back to the
CPU afterwards. Any exception (OOM, cancellation, bad input) skipped the
cleanup, leaving the models on the GPU until ComfyUI was restarted.
Wrap the entire work block in try/finally so offload_to_cpu cleanup
always runs regardless of how the node exits. Also removes the unused
`mode` variable in SelvaSampler.
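The try/finally pattern can be sketched as follows. FakeModel and run_with_offload are hypothetical names used only for illustration; the real nodes call their own offload_to_cpu cleanup, whose signature is not shown here.

```python
class FakeModel:
    """Hypothetical stand-in for a model that remembers which device it is on."""

    def __init__(self):
        self.device = "cpu"

    def to(self, device):
        self.device = device
        return self


def run_with_offload(model, work, device="cuda"):
    """Move the model to `device`, run `work`, and always move it back."""
    model.to(device)
    try:
        return work(model)
    finally:
        # Runs on success, on exceptions, and on cancellation alike.
        model.to("cpu")
```

Because the cleanup sits in finally, an exception raised inside work still propagates to the caller, but only after the model is back on the CPU.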
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- selva_sampler: wrap decode+vocode in their own OOM catch — previously
OOM during mel decode or vocoding gave a raw CUDA traceback instead
of the actionable hint
- selva_feature_extractor: sync frames log line now shows (masked) when
a mask is active, matching the CLIP log line
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows per-frame or static segmentation masks to be applied before CLIP
and sync encoding, zeroing background pixels. Useful when multiple objects
compete for the same sound and text prompting alone is insufficient.
- _apply_mask(): resizes mask spatially (nearest-exact), samples temporally
to match sampled frame count, multiplies into frames
- _hash_inputs(): includes mask bytes in cache key (begin/mid/end sampling)
- INPUT_TYPES: mask added to optional inputs with tooltip
- extract_features(): mask=None parameter, applied after _resize_frames for
both CLIP (384px) and sync (224px) paths, before normalization
- Log line notes when masking is active
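The begin/mid/end sampling used for the cache key can be sketched as below. This is an illustration under assumptions: the helper names, SHA-256, the 1 MiB chunk size, and hashing raw bytes are placeholders, not the actual _hash_inputs() code.

```python
import hashlib


def _sample_chunks(data, chunk):
    """Yield the first, middle, and last `chunk` bytes of `data`."""
    n = len(data)
    mid = max(0, n // 2 - chunk // 2)
    for start in (0, mid, max(0, n - chunk)):
        yield data[start:start + chunk]


def hash_inputs(frame_bytes, prompt, fps, mask_bytes=None, chunk=1 << 20):
    """Build a cache key from sampled frame bytes, optional mask bytes, prompt, fps."""
    h = hashlib.sha256()
    for label, data in ((b"frames", frame_bytes), (b"mask", mask_bytes)):
        if data is None:
            continue  # no mask connected: the key is unaffected
        h.update(label + str(len(data)).encode())
        for piece in _sample_chunks(data, chunk):
            h.update(piece)
    h.update(prompt.encode("utf-8"))
    h.update(repr(float(fps)).encode())
    return h.hexdigest()
```

Sampling three regions keeps hashing cheap on long videos while still changing the key when only the mask or the later frames differ.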
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Model Loader:
- bf16 support check — auto-falls back to fp16 on unsupported GPUs
- DESCRIPTION and OUTPUT_TOOLTIPS
Feature Extractor:
- Store variant in features dict and .npz cache
- Progress bar (3 steps: CLIP encode, T5 encode, sync encode)
- Expand cache hash to 32 hex chars
- DESCRIPTION and OUTPUT_TOOLTIPS
Sampler:
- Variant mismatch validation against extracted features
- Cancellation support via throw_exception_if_processing_interrupted()
- OOM catch with actionable error message
- normalize toggle (optional BOOLEAN, default true) for peak normalization
- Remove empty optional: {} block
- DESCRIPTION and OUTPUT_TOOLTIPS
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace PreviewAudio with VHS_VideoCombine — outputs video+audio together
- Wire fps from FeatureExtractor to VideoCombine frame_rate
- Wire audio from Sampler into VideoCombine
- Clear hardcoded video filename
- Set filename_prefix to SelVA, save_output=true
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- nodes/__init__.py: fix [PrismAudio] leftover label in error print
- selva_feature_extractor: hash beginning, middle and end of video tensor
instead of just first 1MB, avoiding collisions on videos with same opening frames
- selva_sampler: derive SequenceConfig from model template via dataclasses.replace
instead of hardcoding sampling_rate/spectrogram_frame_rate per mode
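A small sketch of deriving a config from a template with dataclasses.replace. The SequenceConfig fields and the template values here are assumptions chosen for illustration; the real class lives in the model code and may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SequenceConfig:
    # Hypothetical fields; only the two rate names appear in the note above.
    duration: float
    sampling_rate: int
    spectrogram_frame_rate: int


# Illustrative template standing in for the one carried by the loaded model.
TEMPLATE_44K = SequenceConfig(duration=8.0, sampling_rate=44100,
                              spectrogram_frame_rate=512)


def config_for_duration(template, duration):
    """Copy the template, overriding only the duration."""
    return replace(template, duration=duration)
```

The rates come along from the template unchanged, so the sampler no longer needs a per-mode table of constants.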
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This branch registers only the three SelVA nodes. PrismAudio nodes stay
on master/feature/lora-trainer.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Newer huggingface_hub releases stopped passing proxies/resume_download/local_files_only/token
to _from_pretrained(). Give them defaults so the call doesn't fail when
these kwargs are omitted.
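A sketch of the fix, using a hypothetical HubModel class rather than the real mixin. The parameter list and the default values are assumptions for illustration; the point is only that keyword-only parameters with defaults tolerate callers that omit them.

```python
class HubModel:
    """Hypothetical stand-in for a class that overrides _from_pretrained()."""

    @classmethod
    def _from_pretrained(cls, *, model_id, revision=None, cache_dir=None,
                         force_download=False, proxies=None,
                         resume_download=None, local_files_only=False,
                         token=None, **model_kwargs):
        # Return the resolved arguments so the defaults are easy to inspect.
        return {
            "model_id": model_id,
            "proxies": proxies,
            "resume_download": resume_download,
            "local_files_only": local_files_only,
            "token": token,
        }
```

Older callers that still pass these kwargs keep working, and newer callers that omit them fall through to the defaults.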
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actual filenames in jnwnlee/SelVA: generator_*_44khz_sup_5.pth.
download_utils.py had the wrong names so those MD5s are unverified — set to
None to skip MD5 check for 44k generators. All other files verified/unchanged.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously _ensure() trusted any existing file. Files downloaded by the
broken requests-based code (HTML error pages) would be silently reused.
Now checks MD5 on every load; deletes and re-downloads on mismatch.
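The verify-then-redownload flow can be sketched like this. The function names and the download callback are hypothetical; passing expected_md5=None skips the check, mirroring how unverified files are handled.

```python
import hashlib
from pathlib import Path


def md5_of(path, block=1 << 20):
    """Stream a file through MD5 and return the hex digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()


def ensure(path, expected_md5, download):
    """Return `path`, re-downloading if it is missing or fails the MD5 check."""
    path = Path(path)
    if path.exists() and expected_md5 is not None and md5_of(path) != expected_md5:
        path.unlink()  # stale or corrupt file, e.g. a saved HTML error page
    if not path.exists():
        download(path)
        if expected_md5 is not None and md5_of(path) != expected_md5:
            path.unlink()
            raise RuntimeError(f"MD5 mismatch after download: {path.name}")
    return path
```

Checking again after the download means a second bad response fails loudly instead of being cached.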
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
download_utils.py used requests without auth — jnwnlee/SelVA returned an
HTML error page which torch then failed to unpickle ('E' / opcode 69).
huggingface_hub.hf_hub_download() handles HF_TOKEN auth automatically,
validates downloads, and retries. Files are still copied to models/selva/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
PyTorch 2.6 changed the torch.load default to weights_only=True. SelVA checkpoints
contain non-tensor types (numpy scalars etc.) that fail strict unpickling.
All weights come from trusted sources (the jnwnlee/SelVA HF repo), so loading
them with weights_only=False is acceptable.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SelvaSampler: multiline:false puts negative_prompt inline above sliders
- SelvaModelLoader: VAE filenames in download_utils are v1-16.pth/v1-44.pth,
not v1-{mode}.pth (mode includes the 'k' suffix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move negative_prompt to required inputs, right after prompt, so it appears
above duration/steps/cfg/seed in the ComfyUI node layout.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
find_pruneable_heads_and_indices and prune_linear_layer were removed from
both pytorch_utils and modeling_utils in some transformers builds. Provide
minimal inline implementations as a final fallback. prune_heads() is never
called at inference time, so they are provided only for completeness.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Users can now wire the prompt output directly to SelvaSampler's prompt input,
making the data flow explicit instead of relying on the implicit features fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI renders required inputs above optional ones. Moving negative_prompt
to optional puts prompt first (natural order) and negative_prompt at the
bottom where it belongs as a power-user input. Also guards against
negative_prompt=None when not connected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all
weights). Models download to models/selva/ on first load. Synchformer reuses
models/prismaudio/synchformer_state_dict.pth if already present (no duplicate
download for PrismAudio users), otherwise downloads to models/selva/.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the
returned dict and the .npz cache). SelvaSampler's prompt is now optional —
when left empty it falls back to the prompt stored in features. A non-empty
override can still be passed when CLIP text should differ from the sync text.
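The fallback rule can be sketched as follows. resolve_prompt is a hypothetical helper; the only detail taken from the note above is that an empty sampler prompt falls back to the prompt stored in the features dict.

```python
def resolve_prompt(prompt, features):
    """Prefer a non-empty override, else the prompt stored with the features."""
    if prompt is not None and prompt.strip():
        return prompt
    stored = features.get("prompt", "")
    if not stored:
        raise ValueError("no prompt given and none stored in features")
    return stored
```

Treating None and whitespace-only strings the same way keeps an unconnected input and a cleared text box equivalent.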
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Some transformers builds removed these from pytorch_utils. Fall back to
modeling_utils which exposes them in all known versions.
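A generic sketch of the fallback lookup. import_first is a hypothetical helper, not the code in this commit; it tries each module in order and returns the first one that exposes the requested name.

```python
import importlib


def import_first(name, module_names):
    """Return attribute `name` from the first module that provides it."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module missing in this environment, try the next one
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name} not found in any of: {', '.join(module_names)}")


# Intended use (not executed here, requires transformers):
# prune_linear_layer = import_first(
#     "prune_linear_layer",
#     ["transformers.pytorch_utils", "transformers.modeling_utils"],
# )
```

Checking hasattr as well as catching ImportError covers builds where the module still exists but the function was removed from it.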
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- selva_feature_extractor: cache hash now includes resolved duration;
same video + different duration override no longer returns stale features
- selva_sampler: MPS-safe noise generation (torch.Generator on CPU then
move to device, same pattern as PrismAudioSampler)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to
avoid rounding assertion mismatches. Progress bar tracks each Euler step.
Supports negative prompts for steering, normalizes output to [-1,1].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
CLIP frames at 8fps→384px (normalize inside FeaturesUtils).
Sync frames at 25fps→224px, normalized to [-1,1] externally.
T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned
sync features extracted via TextSynch.encode_video_with_sync(). Results cached
as .npz keyed by hash(frames[:1MB] + prompt + fps + variant).
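Resampling a source video to the 8fps and 25fps paths means picking source frames by timestamp. A hedged sketch, assuming output frame i maps to floor(i * source_fps / target_fps), clamped to the last frame; the actual sampling code may round differently, and sample_indices is a hypothetical name.

```python
def sample_indices(num_source_frames, source_fps, target_fps, duration):
    """Pick source frame indices for `duration` seconds at `target_fps`."""
    if num_source_frames <= 0:
        raise ValueError("need at least one source frame")
    count = int(duration * target_fps)
    last = num_source_frames - 1
    # Output frame i sits at time i / target_fps; convert that to a source index.
    return [min(last, int(i * source_fps / target_fps)) for i in range(count)]
```

For a 30fps source, sample_indices(n, 30.0, 8.0, d) would feed the CLIP path and sample_indices(n, 30.0, 25.0, d) the sync path; clamping repeats the last frame when the requested duration runs past the end of the video.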
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes.
Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced
librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy
version incompatibility in some ComfyUI environments.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Point links to huggingface.co/FunAudioLLM/PrismAudio and use public
GitHub URL for install instructions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove synchformer_ckpt input — always resolved from models/prismaudio/
(errors early with clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
auto-created venv, default) or comfyui_env (current Python, with warning)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The wrong model (videoprism_public_v1_large, vision-only) was used,
causing V2A audio distortion. Switch to the LvT variant which has a
text tower, pass CoT captions for joint encoding, and extract per-frame
features from outputs['frame_embeddings'] (L2-normalized, [T, 1024])
instead of manually averaging spatial patches.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.
This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.
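A toy illustration of why zeros are the wrong placeholder, on plain Python lists. The function names are hypothetical and the real Cond_MLP has more layers; the sketch only shows that a bias-free linear map sends a zero vector to zero, so zeroed features cannot reproduce a learned "no video" embedding.

```python
def linear_no_bias(weights, x):
    """Compute y = W @ x with no bias term."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def video_condition(clip_features, empty_clip_feat, num_frames):
    """Return real features, or the learned null embedding repeated per frame."""
    if clip_features is not None:
        return clip_features
    return [list(empty_clip_feat) for _ in range(num_frames)]
```

Substituting the learned null embedding gives the downstream layers the same input they saw during training whenever video conditioning was dropped.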
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>