93 Commits

Author SHA1 Message Date
Ethanfel b519b042e2 docs: document mask inputs and normalize toggle in README
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:43:42 +02:00
Ethanfel f28759f1e3 feat: improve mask support with neutral fill, mask_strength, and per-path toggles
- Replace zero-fill with neutral gray (0.5) fill so masked background
  pixels stay in-distribution: 0.5 maps to ~0 in CLIP normalized space
  and exactly 0 after sync's [-1,1] normalization
- Add mask_strength float (0–1) for partial background suppression
- Add mask_clip / mask_sync booleans to toggle masking independently
  on the CLIP (384px) and TextSynchformer (224px) encoding paths
- Fix temporal mask sampling: use fps-accurate index formula (same as
  _sample_frames) instead of proportional int(i*M/N)
- Include mask_strength, mask_clip, mask_sync in cache hash when mask
  is connected, so changing any param correctly busts the cache
- Log lines now report masked/skipped state and strength per path
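
The neutral-fill and mask_strength behaviour listed above can be sketched per pixel as follows. This is a minimal illustration in plain Python, not the node's code: the names `blend_with_neutral` and `to_sync_range`, the exact blend formula, and the `x * 2 - 1` mapping are assumptions. Only the 0.5 fill value, the 0-1 strength range, and the [-1,1] target range come from the message.

```python
NEUTRAL_GRAY = 0.5  # fill value named in the commit message


def blend_with_neutral(pixel: float, mask: float, strength: float) -> float:
    """Blend one pixel toward neutral gray where the mask marks background.

    pixel    -- value in [0, 1]
    mask     -- 1.0 for foreground, 0.0 for background
    strength -- 0.0 leaves the pixel untouched, 1.0 applies the mask fully
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    # Keep-weight (assumed formula): foreground is always kept, background
    # is kept in proportion to (1 - strength).
    keep = 1.0 - strength * (1.0 - mask)
    return pixel * keep + NEUTRAL_GRAY * (1.0 - keep)


def to_sync_range(pixel: float) -> float:
    """Map [0, 1] to [-1, 1] (assumed form of the sync-path normalization)."""
    return pixel * 2.0 - 1.0
```

With full strength, a background pixel lands on 0.5 and therefore on 0 after `to_sync_range`, which is the in-distribution property the first bullet describes.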

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 10:43:01 +02:00
Ethanfel 3dd6badfd9 fix: guarantee offload cleanup on exception with try/finally
Both nodes moved models to GPU before work then back to CPU after.
Any exception (OOM, cancellation, bad input) would skip the cleanup,
leaving models on GPU permanently until ComfyUI restarts.

Wrap the entire work block in try/finally so offload_to_cpu cleanup
always runs regardless of how the node exits. Also removes the unused
`mode` variable in SelvaSampler.
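
A minimal sketch of the try/finally pattern described above, using a stand-in object instead of a real model. `FakeModel` and `run_with_offload` are hypothetical names; the actual nodes may structure the cleanup differently.

```python
class FakeModel:
    """Stand-in for a model that can be moved between devices (hypothetical)."""

    def __init__(self) -> None:
        self.device = "cpu"

    def to(self, device: str) -> "FakeModel":
        self.device = device
        return self


def run_with_offload(model: FakeModel, work, offload_to_cpu: bool = True):
    """Run work(model) on the GPU and always move the model back afterwards."""
    try:
        model.to("cuda")
        return work(model)
    finally:
        # Runs on normal return and on any exception (OOM, cancellation, ...).
        if offload_to_cpu:
            model.to("cpu")
```

The exception still propagates to the caller; the `finally` block only guarantees that the offload step runs first.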

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:40:39 +02:00
Ethanfel 8bb2fb7015 fix: extend OOM catch to decode/vocode, add (masked) to sync log line
- selva_sampler: wrap decode+vocode in their own OOM catch — previously
  OOM during mel decode or vocoding gave a raw CUDA traceback instead
  of the actionable hint
- selva_feature_extractor: sync frames log line now shows (masked) when
  a mask is active, matching the CLIP log line

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:38:59 +02:00
Ethanfel f4a7292cde feat: add optional MASK input to SelVA Feature Extractor
Allows per-frame or static segmentation masks to be applied before CLIP
and sync encoding, zeroing background pixels. Useful when multiple objects
compete for the same sound and text prompting alone is insufficient.

- _apply_mask(): resizes mask spatially (nearest-exact), samples temporally
  to match sampled frame count, multiplies into frames
- _hash_inputs(): includes mask bytes in cache key (begin/mid/end sampling)
- INPUT_TYPES: mask added to optional inputs with tooltip
- extract_features(): mask=None parameter, applied after _resize_frames for
  both CLIP (384px) and sync (224px) paths, before normalization
- Log line notes when masking is active
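
The begin/mid/end sampling mentioned for the cache key can be sketched as below. This is an illustration under assumptions, not the node's `_hash_inputs()`: the function name `hash_sampled`, the use of SHA-256, the chunk size, and the inclusion of the byte length are all hypothetical choices.

```python
import hashlib


def hash_sampled(data: bytes, chunk: int = 1 << 20) -> str:
    """Hash the first, middle and last `chunk` bytes of `data` plus its length."""
    h = hashlib.sha256()
    h.update(len(data).to_bytes(8, "little"))
    if len(data) <= 3 * chunk:
        h.update(data)  # small input: hash everything
    else:
        mid = len(data) // 2 - chunk // 2
        h.update(data[:chunk])
        h.update(data[mid:mid + chunk])
        h.update(data[-chunk:])
    return h.hexdigest()[:32]
```

Sampling three regions keeps hashing cheap on large tensors while still distinguishing inputs that share an opening segment. Inputs that differ only outside the sampled regions would still collide.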

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 08:34:13 +02:00
Ethanfel bd53744e2d feat: comprehensive node improvements
Model Loader:
- bf16 support check — auto-falls back to fp16 on unsupported GPUs
- DESCRIPTION and OUTPUT_TOOLTIPS

Feature Extractor:
- Store variant in features dict and .npz cache
- Progress bar (3 steps: CLIP encode, T5 encode, sync encode)
- Expand cache hash to 32 hex chars
- DESCRIPTION and OUTPUT_TOOLTIPS

Sampler:
- Variant mismatch validation against extracted features
- Cancellation support via throw_exception_if_processing_interrupted()
- OOM catch with actionable error message
- normalize toggle (optional BOOLEAN, default true) for peak normalization
- Remove empty optional: {} block
- DESCRIPTION and OUTPUT_TOOLTIPS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:16:03 +02:00
Ethanfel 429810db5b docs: improve tooltips on all three SelVA nodes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:10:05 +02:00
Ethanfel 57f56c04e2 feat: update demo workflow with VHS_VideoCombine output
- Replace PreviewAudio with VHS_VideoCombine — outputs video+audio together
- Wire fps from FeatureExtractor to VideoCombine frame_rate
- Wire audio from Sampler into VideoCombine
- Clear hardcoded video filename
- Set filename_prefix to SelVA, save_output=true

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:07:56 +02:00
Ethanfel ff26d0b87d fix: bug sweep and improvements
- nodes/__init__.py: fix [PrismAudio] leftover label in error print
- selva_feature_extractor: hash beginning, middle and end of video tensor
  instead of just first 1MB, avoiding collisions on videos with same opening frames
- selva_sampler: derive SequenceConfig from model template via dataclasses.replace
  instead of hardcoding sampling_rate/spectrogram_frame_rate per mode
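
Deriving a config from a template with `dataclasses.replace`, as the last bullet describes, might look like the sketch below. The field names, the template values, and the helper `config_for_duration` are hypothetical; the real SequenceConfig may differ.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SequenceConfig:
    # Hypothetical fields and types, for illustration only.
    duration: float
    sampling_rate: int
    spectrogram_frame_rate: int


# Placeholder values standing in for whatever the loaded model provides.
TEMPLATE = SequenceConfig(duration=8.0, sampling_rate=44100, spectrogram_frame_rate=512)


def config_for_duration(template: SequenceConfig, duration: float) -> SequenceConfig:
    """Copy the template, overriding only the duration."""
    return replace(template, duration=duration)
```

Because every other field is copied from the template, the per-mode rates no longer need to be repeated at the call site.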

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 18:04:35 +02:00
Ethanfel 83b1da9520 chore: remove all PrismAudio code from main branch
- Delete prismaudio_core/, data_utils/, scripts/, docs/plans/
- Delete PrismAudio nodes (feature_extractor, feature_loader, model_loader, sampler, text_only)
- Delete PrismAudio workflows (video_to_audio, text_to_audio)
- Clean nodes/utils.py: rename PRISMAUDIO_CATEGORY → SELVA_CATEGORY, remove unused helpers
- Strip PrismAudio-only deps from requirements.txt

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:58:31 +02:00
Ethanfel 679a607a85 feat: wire prompt output from feature extractor to sampler in demo workflow
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:13:23 +02:00
Ethanfel d495939367 docs: rewrite README for SelVA
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:12:28 +02:00
Ethanfel 982d66e078 chore: remove PrismAudio nodes from selva-integration branch
This branch registers only the three SelVA nodes. PrismAudio nodes stay
on master/feature/lora-trainer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 17:01:21 +02:00
Ethanfel b4124f58b3 fix: BigVGANv2._from_pretrained() compat with newer huggingface_hub
Newer hf_hub stopped passing proxies/resume_download/local_files_only/token
to _from_pretrained(). Give them defaults so the call doesn't fail when
these kwargs are omitted.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:51:48 +02:00
Ethanfel 2c9d521565 fix: 44k generator HF paths use 44khz suffix (not 44k)
Actual filenames in jnwnlee/SelVA: generator_*_44khz_sup_5.pth.
download_utils.py had the wrong names so those MD5s are unverified — set to
None to skip MD5 check for 44k generators. All other files verified/unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:46:20 +02:00
Ethanfel 28229d62ce fix: MD5 validation on existing files — re-download if corrupt
Previously _ensure() trusted any existing file. Files downloaded by the
broken requests-based code (HTML error pages) would be silently reused.
Now checks MD5 on every load; deletes and re-downloads on mismatch.
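
A minimal sketch of the check-then-re-download flow described above. `md5_of`, `ensure_file`, and the `download` callback are hypothetical names, not the signature of `_ensure()`; passing `None` as the expected MD5 is assumed to skip the check.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def md5_of(path: Path, block: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in blocks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()


def ensure_file(
    path: Path,
    expected_md5: Optional[str],
    download: Callable[[Path], None],
) -> Path:
    """Return `path`, re-downloading it if it is missing or fails the MD5 check."""
    if path.exists() and expected_md5 is not None and md5_of(path) != expected_md5:
        path.unlink()  # corrupt file (e.g. a saved HTML error page): discard it
    if not path.exists():
        download(path)
        if expected_md5 is not None and md5_of(path) != expected_md5:
            path.unlink()
            raise RuntimeError(f"MD5 mismatch after download: {path.name}")
    return path
```

Checking after the download as well means a bad response fails loudly instead of being reused on the next load.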

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:42:38 +02:00
Ethanfel 92593189f0 fix: use huggingface_hub for downloads instead of raw requests
download_utils.py used requests without auth — jnwnlee/SelVA returned an
HTML error page which torch then failed to unpickle ('E' / opcode 69).
huggingface_hub.hf_hub_download() handles HF_TOKEN auth automatically,
validates downloads, and retries. Files are still copied to models/selva/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:41:29 +02:00
Ethanfel 614a2e02aa fix: weights_only=False for SelVA checkpoints (PyTorch 2.6 compat)
PyTorch 2.6 changed the default to weights_only=True. SelVA checkpoints
contain non-tensor types (numpy scalars etc.) that fail strict unpickling.
All weights come from trusted sources (jnwnlee/SelVA HF repo).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:38:31 +02:00
Ethanfel 40388ba6de fix: negative_prompt inline (multiline:false) + VAE filename v1-44.pth not v1-44k.pth
- SelvaSampler: multiline:false puts negative_prompt inline above sliders
- SelvaModelLoader: VAE filenames in download_utils are v1-16.pth/v1-44.pth,
  not v1-{mode}.pth (mode includes the 'k' suffix)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:35:17 +02:00
Ethanfel 789e09535d fix: SelvaSampler — negative_prompt above settings
Move negative_prompt to required inputs, right after prompt, so it appears
above duration/steps/cfg/seed in the ComfyUI node layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:31:53 +02:00
Ethanfel 4da4858e4a fix: inline prune helpers when removed from both transformers locations
find_pruneable_heads_and_indices and prune_linear_layer were removed from
both pytorch_utils and modeling_utils in some transformers builds. Provide
minimal inline implementations as final fallback — prune_heads() is never
called at inference time so correctness is only needed for completeness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:30:58 +02:00
Ethanfel ab8e1e5b7b feat: SelvaFeatureExtractor outputs prompt as STRING
Users can now wire the prompt output directly to SelvaSampler's prompt input,
making the data flow explicit instead of relying on the implicit features fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:27:49 +02:00
Ethanfel e3a3384727 fix: SelvaSampler input order — prompt required, negative_prompt optional
ComfyUI renders required inputs above optional ones. Moving negative_prompt
to optional puts prompt first (natural order) and negative_prompt at the
bottom where it belongs as a power-user input. Also guards against
negative_prompt=None when not connected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:27:07 +02:00
Ethanfel 9a985499e7 feat: auto-download SelVA weights on first use
Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all
weights). Models download to models/selva/ on first load. Synchformer reuses
models/prismaudio/synchformer_state_dict.pth if already present (no duplicate
download for PrismAudio users), otherwise downloads to models/selva/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:25:36 +02:00
Ethanfel 27b4424e1a feat: prompt entered once in SelvaFeatureExtractor, reused by SelvaSampler
SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the
returned dict and the .npz cache). SelvaSampler's prompt is now optional —
when left empty it falls back to the prompt stored in features. A non-empty
override can still be passed when CLIP text should differ from the sync text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:22:59 +02:00
Ethanfel 0e417f4078 fix: transformers compat — find_pruneable_heads_and_indices import
Some transformers builds removed these from pytorch_utils. Fall back to
modeling_utils which exposes them in all known versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:21:26 +02:00
Ethanfel 6474e2816c fix: two bugs in SelVA nodes
- selva_feature_extractor: cache hash now includes resolved duration;
  same video + different duration override no longer returns stale features
- selva_sampler: MPS-safe noise generation (torch.Generator on CPU then
  move to device, same pattern as PrismAudioSampler)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:39:57 +02:00
Ethanfel c23d210ab2 feat: SelVA video-to-audio example workflow
LoadVideo → SelvaFeatureExtractor → SelvaSampler → PreviewAudio.
Defaults: medium_44k, bf16, 25 steps, cfg=4.5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:31:53 +02:00
Ethanfel b59b657b6f feat: SelvaSampler — flow matching ODE with CFG and negative prompts
Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to
avoid rounding assertion mismatches. Progress bar tracks each Euler step.
Supports negative prompts for steering, normalizes output to [-1,1].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:31:18 +02:00
Ethanfel 578b501d38 feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction
CLIP frames at 8fps→384px (normalized inside FeaturesUtils).
Sync frames at 25fps→224px, normalized to [-1,1] externally.
T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned
sync features extracted via TextSynch.encode_video_with_sync(). Results cached
as .npz keyed by hash(frames[:1MB] + prompt + fps + variant).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:23:40 +02:00
Ethanfel fe94438356 feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils
Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from
models/prismaudio/ (no duplicate download). Supports four variants:
small_16k / small_44k / medium_44k / large_44k.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:21:03 +02:00
Ethanfel 6bc3fd6443 chore: vendor selva_core from jnwnlee/selva@d7d40a9
Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes.
Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced
librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy
version incompatibility in some ComfyUI environments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:18:09 +02:00
Ethanfel 762b19fd3a fix: return fps from non-cache extraction path
The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:26:15 +01:00
Ethanfel 807a2e51fb docs: fix README references — PrismAudio not ThinkSound
Point links to huggingface.co/FunAudioLLM/PrismAudio and use public
GitHub URL for install instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:16:31 +01:00
Ethanfel 67be94c45c chore: add updated V2A example workflow
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:13:06 +01:00
Ethanfel 681d230b0c chore: update T2A workflow to match V2A style and current defaults
Steps=100, cfg=7.0, randomize seed, consistent node format with
aux_id/ver/ue_properties.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:11:20 +01:00
Ethanfel 62a3c5d0dc docs: rewrite README to reflect current node design
Update node descriptions, inputs/outputs, workflows, and environment
setup to match current implementation (managed_env dropdown, VHS
video_info, auto-duration, fps output, synchformer auto-resolve).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:10:07 +01:00
Ethanfel 30631c0cb4 fix: change fps output type from INT to FLOAT
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:35 +01:00
Ethanfel d0c9a72782 feat: add fps INT output to PrismAudioFeatureExtractor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:03 +01:00
Ethanfel 5b62be0447 chore: update default steps=100 and cfg_scale=7.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:03:48 +01:00
Ethanfel abd315092b feat: auto-use video duration from features when duration=0
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.
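
The duration=0 fallback described above can be sketched as follows. The helper name `resolve_duration` and the `"duration"` key are assumptions; the message only states that the duration is stored in the features dict.

```python
def resolve_duration(duration: float, features: dict) -> float:
    """Return the explicit duration, or the one stored in features when it is 0."""
    if duration > 0:
        return duration
    stored = features.get("duration")  # key name is an assumption
    if stored is None or stored <= 0:
        raise ValueError("duration is 0 and features carry no usable duration")
    return float(stored)
```

A positive value on the sampler still takes precedence, so the 0.0 default only changes behaviour for workflows that leave the input untouched.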

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:00:47 +01:00
Ethanfel 972d379369 refactor: simplify feature extractor inputs
- Remove synchformer_ckpt input — always resolved from models/prismaudio/
  (errors early with clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
  auto-created venv, default) or comfyui_env (current Python, with warning)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:55:08 +01:00
Ethanfel 8969d407f6 feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:52:51 +01:00
Ethanfel 707ccb463e perf: replace MP4 encode/decode with lossless .npy frame transfer
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:50:35 +01:00
Ethanfel c38df8c6fa chore: remove debug options and diagnostic logging
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:47:00 +01:00
Ethanfel 2f626d8a96 fix: use videoprism_lvt_public_v1_large with joint video-text forward
The wrong model (videoprism_public_v1_large, vision-only) was used,
causing V2A audio distortion. Switch to the LvT variant which has a
text tower, pass CoT captions for joint encoding, and extract per-frame
features from outputs['frame_embeddings'] (L2-normalized, [T, 1024])
instead of manually averaging spatial patches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:37:02 +01:00
Ethanfel 1d8b9b59e0 debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:57:03 +01:00
Ethanfel 8bf4a0c3fc debug: log conditioner output stats and T2A text feature stats
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.

This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:39:44 +01:00
Ethanfel 477fe0f08f debug: add latent and audio stats logging to V2A sampler
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:28:08 +01:00
Ethanfel c0b7ccbcee fix: substitute empty_clip_feat for video features when no video present
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:13:22 +01:00