Commit Graph

174 Commits

Author SHA1 Message Date
Ethanfel 789e09535d fix: SelvaSampler — negative_prompt above settings
Move negative_prompt to required inputs, right after prompt, so it appears
above duration/steps/cfg/seed in the ComfyUI node layout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:31:53 +02:00
Ethanfel 4da4858e4a fix: inline prune helpers when removed from both transformers locations
find_pruneable_heads_and_indices and prune_linear_layer were removed from
both pytorch_utils and modeling_utils in some transformers builds. Provide
minimal inline implementations as final fallback — prune_heads() is never
called at inference time so correctness is only needed for completeness.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:30:58 +02:00
Ethanfel ab8e1e5b7b feat: SelvaFeatureExtractor outputs prompt as STRING
Users can now wire the prompt output directly to SelvaSampler's prompt input,
making the data flow explicit instead of relying on the implicit features fallback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:27:49 +02:00
Ethanfel e3a3384727 fix: SelvaSampler input order — prompt required, negative_prompt optional
ComfyUI renders required inputs above optional ones. Moving negative_prompt
to optional puts prompt first (natural order) and negative_prompt at the
bottom where it belongs as a power-user input. Also guards against
negative_prompt=None when not connected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:27:07 +02:00
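
A minimal sketch of the input layout as of e3a3384727, assuming the usual ComfyUI convention of an `INPUT_TYPES` classmethod that returns `required` and `optional` groups. `ExampleSampler` and its `run` method are hypothetical and omit the real sampler's other inputs.

```python
class ExampleSampler:
    """Hypothetical node illustrating required/optional input ordering."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "prompt": ("STRING", {"multiline": True, "default": ""}),
                "steps": ("INT", {"default": 25, "min": 1}),
            },
            "optional": {
                "negative_prompt": ("STRING", {"multiline": True, "default": ""}),
            },
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "run"

    def run(self, prompt, steps, negative_prompt=None):
        # An unconnected optional input may arrive as None; normalize it.
        negative_prompt = (negative_prompt or "").strip()
        return (f"{prompt}|{negative_prompt}|{steps}",)
```
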
Ethanfel 9a985499e7 feat: auto-download SelVA weights on first use
Uses selva_core/utils/download_utils.py (already has URLs + MD5s for all
weights). Models download to models/selva/ on first load. Synchformer reuses
models/prismaudio/synchformer_state_dict.pth if already present (no duplicate
download for PrismAudio users), otherwise downloads to models/selva/.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:25:36 +02:00
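
The reuse-or-download logic from 9a985499e7 might look roughly like this. The sketch does not reproduce `download_utils.py`: `resolve_weight` and the `fetch` callable are hypothetical, and only the MD5 check and the "reuse an existing copy first" behavior come from the commit message.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def resolve_weight(filename, search_dirs, download_dir, expected_md5, fetch):
    """Return a local path for `filename`, downloading only when needed.

    `fetch(dest_path)` is a hypothetical callable that writes the file.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate  # reuse an existing copy, no download
    dest = Path(download_dir) / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    fetch(dest)
    if md5_of(dest) != expected_md5:
        dest.unlink()
        raise RuntimeError(f"MD5 mismatch for {dest}")
    return dest
```

For the Synchformer checkpoint, `search_dirs` would list `models/prismaudio/` before `models/selva/`, so PrismAudio users never download it twice.
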
Ethanfel 27b4424e1a feat: prompt entered once in SelvaFeatureExtractor, reused by SelvaSampler
SelvaFeatureExtractor now stores the prompt in SELVA_FEATURES (both in the
returned dict and the .npz cache). SelvaSampler's prompt is now optional —
when left empty it falls back to the prompt stored in features. A non-empty
override can still be passed when CLIP text should differ from the sync text.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:22:59 +02:00
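
The fallback rule from 27b4424e1a can be sketched in a few lines. The `"prompt"` key and the function name are assumptions; the commit only states that the prompt is stored in SELVA_FEATURES.

```python
def resolve_prompt(prompt, features):
    """Prefer a non-empty override; otherwise use the prompt stored in features."""
    override = (prompt or "").strip()
    if override:
        return override
    return str(features.get("prompt", ""))
```
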
Ethanfel 0e417f4078 fix: transformers compat — find_pruneable_heads_and_indices import
Some transformers builds removed these from pytorch_utils. Fall back to
modeling_utils which exposes them in all known versions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 16:21:26 +02:00
Ethanfel 6474e2816c fix: two bugs in SelVA nodes
- selva_feature_extractor: cache hash now includes resolved duration;
  same video + different duration override no longer returns stale features
- selva_sampler: MPS-safe noise generation (torch.Generator on CPU then
  move to device, same pattern as PrismAudioSampler)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:39:57 +02:00
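
The cache fix in 6474e2816c amounts to hashing every setting that changes the extracted features, including the resolved duration. A minimal sketch follows; the hash algorithm, field order, and function name are assumptions, not the plugin's code.

```python
import hashlib


def feature_cache_key(frame_bytes, prompt, fps, variant, duration, head=1 << 20):
    """Hash the first `head` bytes of the frames plus every setting that changes the features."""
    digest = hashlib.sha256()
    digest.update(frame_bytes[:head])
    for part in (prompt, f"{fps:.6f}", variant, f"{duration:.6f}"):
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
        digest.update(part.encode("utf-8"))
    return digest.hexdigest()
```

With the duration in the key, the same video with a different duration override produces a different cache entry instead of a stale hit.
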
Ethanfel c23d210ab2 feat: SelVA video-to-audio example workflow
LoadVideo → SelvaFeatureExtractor → SelvaSampler → PreviewAudio.
Defaults: medium_44k, bf16, 25 steps, cfg=4.5.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:31:53 +02:00
Ethanfel b59b657b6f feat: SelvaSampler — flow matching ODE with CFG and negative prompts
Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to
avoid rounding assertion mismatches. Progress bar tracks each Euler step.
Supports negative prompts for steering, normalizes output to [-1,1].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:31:18 +02:00
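
For readers unfamiliar with the terms in b59b657b6f, the sketch below shows a fixed-step Euler integration of a velocity field with classifier-free guidance, written on plain lists so it runs without PyTorch. It is an illustration under assumptions: the time direction (0 to 1), the use of peak normalization, and all function names are hypothetical and may differ from SelvaSampler.

```python
def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional (or negative) velocity."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_sample(x0, velocity_fn, num_steps):
    """Integrate dx/dt = velocity_fn(t, x) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(t, x)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def peak_normalize(samples, eps=1e-8):
    """Scale samples so the largest magnitude is 1, keeping them in [-1, 1]."""
    peak = max(abs(s) for s in samples)
    return [s / max(peak, eps) for s in samples]
```

In a real sampler, `velocity_fn` would call the model twice per step (once with the prompt, once with the empty or negative prompt) and combine the two results with `cfg_velocity`.
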
Ethanfel 578b501d38 feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction
CLIP frames at 8fps→384px (normalize inside FeaturesUtils).
Sync frames at 25fps→224px, normalized to [-1,1] externally.
T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned
sync features extracted via TextSynch.encode_video_with_sync(). Results cached
as .npz keyed by hash(frames[:1MB] + prompt + fps + variant).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:23:40 +02:00
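
The two frame streams in 578b501d38 (8 fps for CLIP, 25 fps for sync) both require resampling the source video in time. The sketch below uses nearest-frame index selection and a [0, 1] to [-1, 1] pixel mapping; both are assumptions about how the node does it, and the function names are hypothetical.

```python
def resample_indices(num_frames, source_fps, target_fps):
    """Pick source frame indices that approximate `target_fps` by nearest-frame sampling."""
    duration = num_frames / source_fps
    count = max(1, int(duration * target_fps))
    step = source_fps / target_fps
    return [min(int(round(i * step)), num_frames - 1) for i in range(count)]


def to_signed_unit(pixel):
    """Map a pixel value in [0, 1] to [-1, 1] (mean 0.5, std 0.5)."""
    return (pixel - 0.5) / 0.5
```

For a 2-second clip loaded at 30 fps, `resample_indices(60, 30.0, 8.0)` yields 16 indices for the CLIP stream.
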
Ethanfel fe94438356 feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils
Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from
models/prismaudio/ (no duplicate download). Supports four variants:
small_16k / small_44k / medium_44k / large_44k.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:21:03 +02:00
Ethanfel 6bc3fd6443 chore: vendor selva_core from jnwnlee/selva@d7d40a9
Pure PyTorch SelVA source for SelvaModelLoader/FeatureExtractor/Sampler nodes.
Imports rewritten from selva.* to selva_core.*. mel_converter.py: replaced
librosa.filters.mel with pure-numpy implementation to avoid librosa→numba→NumPy
version incompatibility in some ComfyUI environments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:18:09 +02:00
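
To show what a librosa-free mel filterbank involves, here is a small NumPy sketch. It is not the code vendored in 6bc3fd6443: it uses the HTK mel formula and unnormalized triangles, so its values will not match the defaults of `librosa.filters.mel` (Slaney scale and normalization) number for number.

```python
import numpy as np


def hz_to_mel(hz):
    """HTK mel scale."""
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)


def mel_filterbank(sr, n_fft, n_mels, fmin=0.0, fmax=None):
    """Triangular mel filterbank of shape (n_mels, n_fft // 2 + 1), HTK mel scale."""
    fmax = sr / 2.0 if fmax is None else fmax
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels triangles need n_mels + 2 edge frequencies, evenly spaced in mel.
    edges = mel_to_hz(np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2))
    # Rising and falling slopes of each triangle, evaluated at every FFT bin.
    lower = (fft_freqs[None, :] - edges[:-2, None]) / (edges[1:-1] - edges[:-2])[:, None]
    upper = (edges[2:, None] - fft_freqs[None, :]) / (edges[2:] - edges[1:-1])[:, None]
    return np.maximum(0.0, np.minimum(lower, upper))
```

Multiplying this matrix by a magnitude or power spectrogram of shape `(n_fft // 2 + 1, frames)` gives the mel spectrogram.
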
Ethanfel 762b19fd3a fix: return fps from non-cache extraction path
The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:26:15 +01:00
Ethanfel 807a2e51fb docs: fix README references — PrismAudio not ThinkSound
Point links to huggingface.co/FunAudioLLM/PrismAudio and use public
GitHub URL for install instructions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:16:31 +01:00
Ethanfel 67be94c45c chore: add updated V2A example workflow
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:13:06 +01:00
Ethanfel 681d230b0c chore: update T2A workflow to match V2A style and current defaults
Steps=100, cfg=7.0, randomize seed, consistent node format with
aux_id/ver/ue_properties.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:11:20 +01:00
Ethanfel 62a3c5d0dc docs: rewrite README to reflect current node design
Update node descriptions, inputs/outputs, workflows, and environment
setup to match current implementation (managed_env dropdown, VHS
video_info, auto-duration, fps output, synchformer auto-resolve).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:10:07 +01:00
Ethanfel 30631c0cb4 fix: change fps output type from INT to FLOAT
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:35 +01:00
Ethanfel d0c9a72782 feat: add fps INT output to PrismAudioFeatureExtractor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:03 +01:00
Ethanfel 5b62be0447 chore: update default steps=100 and cfg_scale=7.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:03:48 +01:00
Ethanfel abd315092b feat: auto-use video duration from features when duration=0
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:00:47 +01:00
Ethanfel 972d379369 refactor: simplify feature extractor inputs
- Remove synchformer_ckpt input — always resolved from models/prismaudio/
  (errors early with clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
  auto-created venv, default) or comfyui_env (current Python, with warning)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:55:08 +01:00
Ethanfel 8969d407f6 feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:52:51 +01:00
Ethanfel 707ccb463e perf: replace MP4 encode/decode with lossless .npy frame transfer
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:50:35 +01:00
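
The .npy hand-off from 707ccb463e can be illustrated as follows. The sketch assumes frames arrive as floats in [0, 1] with shape [T, H, W, 3]; the function names are hypothetical, and the real `extract_features.py` also handles video files, which this does not.

```python
import numpy as np


def save_frames_npy(frames, path):
    """Write float frames in [0, 1] with shape [T, H, W, 3] as a uint8 .npy file."""
    as_uint8 = np.clip(np.rint(np.asarray(frames) * 255.0), 0, 255).astype(np.uint8)
    np.save(path, as_uint8)
    return as_uint8


def load_frames(path):
    """Load frames directly when given a .npy file; other formats are not handled here."""
    if not str(path).endswith(".npy"):
        raise ValueError("this sketch only handles .npy input")
    return np.load(path)
```

Apart from the single rounding to 8 bits, the frames read back are identical to the ones written, which is the property an H.264 roundtrip cannot offer.
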
Ethanfel c38df8c6fa chore: remove debug options and diagnostic logging
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:47:00 +01:00
Ethanfel 2f626d8a96 fix: use videoprism_lvt_public_v1_large with joint video-text forward
The wrong model (videoprism_public_v1_large, vision-only) was used,
causing V2A audio distortion. Switch to the LvT variant which has a
text tower, pass CoT captions for joint encoding, and extract per-frame
features from outputs['frame_embeddings'] (L2-normalized, [T, 1024])
instead of manually averaging spatial patches.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:37:02 +01:00
Ethanfel 1d8b9b59e0 debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:57:03 +01:00
Ethanfel 8bf4a0c3fc debug: log conditioner output stats and T2A text feature stats
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.

This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:39:44 +01:00
Ethanfel 477fe0f08f debug: add latent and audio stats logging to V2A sampler
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:28:08 +01:00
Ethanfel c0b7ccbcee fix: substitute empty_clip_feat for video features when no video present
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:13:22 +01:00
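
The reasoning in c0b7ccbcee can be checked with a toy example: a bias-free linear layer maps an all-zero input to all zeros, so zeroed features carry no signal, whereas a learned null embedding does. The helper names and the tiny weight matrix are made up for illustration; `empty_clip_feat` is the only name taken from the commit.

```python
def linear_no_bias(weights, x):
    """y = W x for a bias-free linear layer (weights is a list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def video_features_or_null(video_features, empty_clip_feat, num_frames):
    """Return real features when present, else the learned null embedding repeated per frame."""
    if video_features:
        return video_features
    return [list(empty_clip_feat) for _ in range(num_frames)]


W = [[0.3, -1.2], [0.7, 0.5]]          # toy weights
zeros_out = linear_no_bias(W, [0.0, 0.0])
null_out = linear_no_bias(W, [0.1, -0.2])  # toy stand-in for empty_clip_feat
```
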
Ethanfel 45633788a4 debug: add latent and audio stats logging to T2A node
Print fakes latent stats (mean/std/min/max) and audio pre-norm stats
to diagnose whether diffusion output is numerically reasonable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:06:39 +01:00
Ethanfel 11457fc27a debug: fix VAE load_state_dict diagnostic — load into .model directly
AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:56:06 +01:00
Ethanfel f2705b3063 debug: log weight load stats for diffusion and VAE checkpoints
Print key counts, missing/unexpected keys, and sample key names to
diagnose whether weights are actually loading correctly (strict=False
silently hides mismatches that would cause garbage audio output).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:53:25 +01:00
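
PyTorch's `load_state_dict(strict=False)` returns the missing and unexpected keys rather than raising, which is what the diagnostic in f2705b3063 reports. The same comparison can be sketched with plain sets, without PyTorch; `diff_state_dict_keys` is a hypothetical helper, not the plugin's logging code.

```python
def diff_state_dict_keys(model_keys, checkpoint_keys):
    """Report keys the model expects but the checkpoint lacks, and vice versa."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
        "matched": len(model_keys & checkpoint_keys),
    }
```

A large `missing` list with `matched` near zero usually points to a key-prefix mismatch, the kind of silent failure that leaves a model with randomly initialized weights.
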
Ethanfel 83a7f2787b feat: add debug_zero_video/sync toggles and feature stats logging to sampler
Allows isolating which feature set causes quality issues:
- debug_zero_video: zero video_features → text+sync only
- debug_zero_sync: zero sync_features → text+video only
Also logs mean/std/shape for all three feature tensors on every run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:40:34 +01:00
Ethanfel 140cc5ee9a feat: implement real Synchformer visual encoder (TimeSformer ViT-B/16)
Replace placeholder single-linear with proper architecture reverse-engineered
from synchformer_state_dict.pth:
- _PatchEmbed: Conv2d(3, 768, 16x16) → [B, 196, 768]
- _TimeSformerBlock: factorized spatial + temporal attention (norm1/attn/norm3/timeattn/norm2/mlp)
- _SpatialAttnAgg: TransformerEncoderLayer with CLS token, aggregates 196 patches → 1/frame
- 12 blocks, dim=768, 8 frames/segment
- Loads from vfeat_extractor.* prefix, skips 3D patch embed

Output: [T_aligned, 768] per-frame features for Sync_MLP conditioner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:28:20 +01:00
Ethanfel f99d2666e8 fix: interpolate sync_cond to match audio sequence length in transformer
Sync_MLP interpolates sync features based on video duration, but audio
latent length depends on the user-set audio duration. When video != audio
duration, the sequences diverge. Resample sync_cond to x's length before
the gated addition so any video/audio duration combo works.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:21:39 +01:00
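
Resampling a conditioning sequence to a different length, as f99d2666e8 does for `sync_cond`, can be sketched with linear interpolation on plain lists. The interpolation mode and endpoint alignment are assumptions; the commit does not say which mode the transformer uses.

```python
def resample_linear(sequence, target_len):
    """Linearly resample a list of feature vectors to `target_len` steps (endpoints aligned)."""
    source_len = len(sequence)
    if source_len == 1 or target_len == 1:
        return [list(sequence[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (source_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, source_len - 1)
        frac = pos - lo
        out.append([(1 - frac) * a + frac * b for a, b in zip(sequence[lo], sequence[hi])])
    return out
```
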
Ethanfel 934a401633 perf: replace PIL+PNG frame files with direct ffmpeg stdin pipe
Stream raw RGB bytes from tensor directly to ffmpeg stdin.
Eliminates all intermediate PNG file I/O — much faster for large frame counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:20:00 +01:00
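
Piping raw frames into ffmpeg, as described in 934a401633, relies on ffmpeg's `rawvideo` input format with an explicit pixel format, frame size, and rate. The sketch below shows one way to assemble this; the encoder settings and function names are assumptions, and `yuv420p` output requires even frame dimensions.

```python
import subprocess


def ffmpeg_rawvideo_cmd(width, height, fps, out_path):
    """Build an ffmpeg command that reads raw RGB24 frames from stdin."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",  # read frames from stdin
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        out_path,
    ]


def pipe_frames(frames, width, height, fps, out_path):
    """Stream an iterable of raw RGB byte strings (width*height*3 each) to ffmpeg."""
    proc = subprocess.Popen(ffmpeg_rawvideo_cmd(width, height, fps, out_path),
                            stdin=subprocess.PIPE)
    try:
        for frame in frames:
            if len(frame) != width * height * 3:
                raise ValueError("frame size does not match width*height*3")
            proc.stdin.write(frame)
    finally:
        proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError("ffmpeg exited with a non-zero status")
```

Each frame is the tensor's uint8 RGB bytes in row-major order, so no PNG files touch the disk.
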
Ethanfel b3ac9ab22f feat: log MP4 conversion time before subprocess spawn
Shows how long PIL+ffmpeg video export takes so we can see
if that's contributing to the gap before [extract] output appears.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:19:26 +01:00
Ethanfel ca87c41a2e feat: add per-step timing to feature extraction logs
Each step now prints elapsed seconds on completion.
Total time printed at the end to identify bottlenecks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:13:42 +01:00
Ethanfel 63bd999dfa fix: switch to VideoPrism large (1024-dim) and fix Synchformer output shape
prismaudio.json conditioner config requires:
- video_features: dim=1024 → switch videoprism_public_v1_base → large (ViT-L)
- sync_features: dim=768, length divisible by 8 → expand [num_seg,768] to
  [num_seg*8,768] (per-frame) so Sync_MLP can reshape by groups of 8

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:07:17 +01:00
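
The shape fix in 63bd999dfa turns one vector per segment into eight per segment so that the length is divisible by 8. One plausible way to do that is to repeat each segment vector, sketched below on plain lists; whether the plugin repeats or derives true per-frame features at this commit is not stated, so treat the repetition as an assumption.

```python
def expand_segments(segment_features, frames_per_segment=8):
    """Repeat each segment vector so [num_seg, D] becomes [num_seg * frames_per_segment, D]."""
    return [list(vec) for vec in segment_features for _ in range(frames_per_segment)]
```
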
Ethanfel 20fb766ad2 fix: cast tensors to float32 before numpy() in feature save
T5-Gemma outputs BFloat16 which numpy does not support.
Cast all feature tensors with .float() before .numpy().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:56:52 +01:00
Ethanfel 93120eb6b9 feat: auto-resolve synchformer checkpoint from prismaudio models dir
When synchformer_ckpt input is empty, look for synchformer_state_dict.pth
in the ComfyUI prismaudio models directory automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:49:56 +01:00
Ethanfel b1a2ee594e fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep
videoprism/__init__.py is empty — API lives in videoprism.models.
Fix: from videoprism import models as vp (not import videoprism as vp).
Also add flax to managed venv packages (required by videoprism Flax model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:38:00 +01:00
Ethanfel 0f46e8359d feat: switch managed venv to jax[cuda13] for GPU feature extraction
RTX 6000 Pro (Blackwell SM 10.0) fully supports CUDA 13. Switch from
jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses
pip-managed CUDA libraries. Delete _extract_env to force a rebuild.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:33:45 +01:00
Ethanfel 06f8dbbab4 feat: add hf_token input and HF_TOKEN env forwarding to feature extractor
google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:27:33 +01:00
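
The token forwarding from 06f8dbbab4 boils down to building the subprocess environment with the token under both variable names. A minimal sketch, with a hypothetical function name and warning text:

```python
import os


def build_subprocess_env(hf_token="", base_env=None):
    """Copy the environment and forward a Hugging Face token under both variable names."""
    env = dict(os.environ if base_env is None else base_env)
    # The node input wins; otherwise fall back to HF_TOKEN from the host environment.
    token = (hf_token or "").strip() or env.get("HF_TOKEN", "")
    if token:
        env["HF_TOKEN"] = token
        env["HUGGING_FACE_HUB_TOKEN"] = token  # legacy alias
    else:
        print("warning: no HF token set; gated models will fail to download")
    return env
```

The returned dict would be passed as `env=` to `subprocess.run` when launching the extraction script.
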
Ethanfel a6d584bd34 fix: treat empty python_env as auto-managed venv trigger
Empty string from clearing the node field caused subprocess to execute ''
which raises PermissionError. Now any blank or 'python' value uses the
auto-installed venv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:21:16 +01:00
Ethanfel 829f398ed0 feat: verbose step-by-step logging in feature extraction
- extract_features.py: 6 numbered steps with shapes, fps, frame counts
- feature_extractor.py: stream subprocess output live (capture_output=False)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:19:38 +01:00
Ethanfel 878025450a feat: add data_utils package with FeaturesUtils implementation
Creates data_utils/v2a_utils/feature_utils_288.py with FeaturesUtils:
- T5-Gemma text encoding via transformers
- VideoPrism video encoding via JAX videoprism package
- Synchformer visual encoder loading from checkpoint

Also fixes extract_features.py to add plugin root to sys.path so
data_utils is importable in the subprocess venv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:14:34 +01:00
Ethanfel f32456a142 feat: add fps input to PrismAudioFeatureExtractor
Exposes the video frame rate as an optional input (default 30).
Correct FPS ensures accurate temporal frame sampling in VideoPrism
and Synchformer feature extraction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:08:10 +01:00