Commit Graph

191 Commits

Author SHA1 Message Date
Ethanfel 6474e2816c fix: two bugs in SelVA nodes
- selva_feature_extractor: cache hash now includes resolved duration;
  same video + different duration override no longer returns stale features
- selva_sampler: MPS-safe noise generation (torch.Generator on CPU then
  move to device, same pattern as PrismAudioSampler)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:39:57 +02:00
Ethanfel b59b657b6f feat: SelvaSampler — flow matching ODE with CFG and negative prompts
Calls update_seq_lengths with actual feature dimensions (not seq_cfg) to
avoid rounding assertion mismatches. Progress bar tracks each Euler step.
Supports negative prompts for steering, normalizes output to [-1,1].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:31:18 +02:00
Ethanfel 578b501d38 feat: SelvaFeatureExtractor — inline CLIP + TextSynchformer feature extraction
CLIP frames at 8fps→384px (normalize inside FeaturesUtils).
Sync frames at 25fps→224px, normalized to [-1,1] externally.
T5 text encoded via FeaturesUtils, sup tokens prepended, then text-conditioned
sync features extracted via TextSynch.encode_video_with_sync(). Results cached
as .npz keyed by hash(frames[:1MB] + prompt + fps + variant).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:23:40 +02:00
Ethanfel fe94438356 feat: SelvaModelLoader node — loads TextSynch + MMAudio + FeaturesUtils
Resolves weights from models/selva/. Reuses synchformer_state_dict.pth from
models/prismaudio/ (no duplicate download). Supports four variants:
small_16k / small_44k / medium_44k / large_44k.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 15:21:03 +02:00
Ethanfel 762b19fd3a fix: return fps from non-cache extraction path
The fps output was only returned on cache hits. Fresh extractions
returned only features, leaving fps null.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:26:15 +01:00
Ethanfel 30631c0cb4 fix: change fps output type from INT to FLOAT
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:35 +01:00
Ethanfel d0c9a72782 feat: add fps INT output to PrismAudioFeatureExtractor
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:05:03 +01:00
Ethanfel 5b62be0447 chore: update default steps=100 and cfg_scale=7.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:03:48 +01:00
Ethanfel abd315092b feat: auto-use video duration from features when duration=0
Setting duration to 0 in PrismAudioSampler now reads the duration
stored in the PRISMAUDIO_FEATURES dict (set by the feature extractor).
Default changed from 10.0 to 0.0 so V2A workflows are wired up
automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 11:00:47 +01:00
Ethanfel 972d379369 refactor: simplify feature extractor inputs
- Remove synchformer_ckpt input — always resolved from models/prismaudio/
  (errors early with clear message if missing)
- Replace python_env string input with dropdown: managed_env (isolated
  auto-created venv, default) or comfyui_env (current Python, with warning)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:55:08 +01:00
Ethanfel 8969d407f6 feat: accept VHS_VIDEOINFO to auto-set fps in feature extractor
When the VHS LoadVideo video_info output is connected, loaded_fps is
used automatically instead of the manual fps input.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:52:51 +01:00
Ethanfel 707ccb463e perf: replace MP4 encode/decode with lossless .npy frame transfer
Saves frames as uint8 .npy instead of H.264 MP4, eliminating the
lossy codec roundtrip. extract_features.py loads .npy directly and
skips decord when given a numpy file. Passes --source_fps for
correct temporal sampling.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:50:35 +01:00
Ethanfel c38df8c6fa chore: remove debug options and diagnostic logging
Remove debug_zero_video/debug_zero_sync inputs from PrismAudioSampler,
DIT velocity diagnostics, conditioner stats logging, and feature stats
prints from both sampler.py and text_only.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 10:47:00 +01:00
Ethanfel 1d8b9b59e0 debug: add DIT velocity diagnostic at t=1 to isolate DIT vs VAE quality issue
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 23:57:03 +01:00
Ethanfel 8bf4a0c3fc debug: log conditioner output stats and T2A text feature stats
Add per-key conditioning output stats (after Cond_MLP/Sync_MLP, after
_substitute_empty_features) to both sampler and text_only nodes. Also
add raw T5 text feature stats in T2A before conditioning.

This lets us directly compare:
- T2A vs V2A conditioning outputs to find which path differs
- T2A vs npz text feature ranges

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:39:44 +01:00
Ethanfel 477fe0f08f debug: add latent and audio stats logging to V2A sampler
Match the diagnostic output already in text_only.py to compare
V2A vs T2A latent distributions and diagnose conditioning issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:28:08 +01:00
Ethanfel c0b7ccbcee fix: substitute empty_clip_feat for video features when no video present
Zero features through bias-free Cond_MLP produce near-zero activations,
not the learned null signal the model was trained with. Use empty_clip_feat
(the learned null video embedding) just like empty_sync_feat for sync.
Also improve text_prompt tooltip to encourage detailed CoT descriptions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:13:22 +01:00
Ethanfel 45633788a4 debug: add latent and audio stats logging to T2A node
Print fakes latent stats (mean/std/min/max) and audio pre-norm stats
to diagnose whether diffusion output is numerically reasonable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 22:06:39 +01:00
Ethanfel 11457fc27a debug: fix VAE load_state_dict diagnostic — load into .model directly
AutoencoderPretransform.load_state_dict() doesn't return IncompatibleKeys.
Load into pretransform.model (AudioAutoencoder) to get the return value
and see actual missing/unexpected key counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:56:06 +01:00
Ethanfel f2705b3063 debug: log weight load stats for diffusion and VAE checkpoints
Print key counts, missing/unexpected keys, and sample key names to
diagnose whether weights are actually loading correctly (strict=False
silently hides mismatches that would cause garbage audio output).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:53:25 +01:00
Ethanfel 83a7f2787b feat: add debug_zero_video/sync toggles and feature stats logging to sampler
Allows isolating which feature set causes quality issues:
- debug_zero_video: zero video_features → text+sync only
- debug_zero_sync: zero sync_features → text+video only
Also logs mean/std/shape for all three feature tensors on every run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:40:34 +01:00
Ethanfel 934a401633 perf: replace PIL+PNG frame files with direct ffmpeg stdin pipe
Stream raw RGB bytes from tensor directly to ffmpeg stdin.
Eliminates all intermediate PNG file I/O — much faster for large frame counts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:20:00 +01:00
Ethanfel b3ac9ab22f feat: log MP4 conversion time before subprocess spawn
Shows how long PIL+ffmpeg video export takes so we can see
if that's contributing to the gap before [extract] output appears.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 21:19:26 +01:00
Ethanfel 93120eb6b9 feat: auto-resolve synchformer checkpoint from prismaudio models dir
When synchformer_ckpt input is empty, look for synchformer_state_dict.pth
in the ComfyUI prismaudio models directory automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:49:56 +01:00
Ethanfel b1a2ee594e fix: correct VideoPrism import (videoprism.models, not videoprism); add flax dep
videoprism/__init__.py is empty — API lives in videoprism.models.
Fix: from videoprism import models as vp (not import videoprism as vp).
Also add flax to managed venv packages (required by videoprism Flax model).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:38:00 +01:00
Ethanfel 0f46e8359d feat: switch managed venv to jax[cuda13] for GPU feature extraction
RTX 6000 Pro (Blackwell SM 10.0) fully supports CUDA 13. Switch from
jax[cpu]+jaxlib to jax[cuda13] which bundles jaxlib and uses
pip-managed CUDA libraries. Delete _extract_env to force a rebuild.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:33:45 +01:00
Ethanfel 06f8dbbab4 feat: add hf_token input and HF_TOKEN env forwarding to feature extractor
google/t5gemma-l-l-ul2-it is a gated HuggingFace model requiring auth.
Add optional hf_token input on the node; forward it (plus the legacy
HUGGING_FACE_HUB_TOKEN alias) to the subprocess env. Falls back to
HF_TOKEN from the host environment. Warn clearly when neither is set.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:27:33 +01:00
Ethanfel a6d584bd34 fix: treat empty python_env as auto-managed venv trigger
Empty string from clearing the node field caused subprocess to execute ''
which raises PermissionError. Now any blank or 'python' value uses the
auto-installed venv.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:21:16 +01:00
Ethanfel 829f398ed0 feat: verbose step-by-step logging in feature extraction
- extract_features.py: 6 numbered steps with shapes, fps, frame counts
- feature_extractor.py: stream subprocess output live (capture_output=False)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:19:38 +01:00
Ethanfel f32456a142 feat: add fps input to PrismAudioFeatureExtractor
Exposes the video frame rate as an optional input (default 30).
Correct FPS ensures accurate temporal frame sampling in VideoPrism
and Synchformer feature extraction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:08:10 +01:00
Ethanfel c416045ace fix: replace torchvision.io.write_video with PIL+ffmpeg
write_video requires the optional 'av' (PyAV) package. Use PIL to save
frames as PNGs then combine with ffmpeg, which is always present in
ComfyUI Docker images.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:03:39 +01:00
Ethanfel 824550bed3 feat: verbose per-package progress during venv auto-install
Installs each package individually with [n/total] counters and
pip progress bars, so failures pinpoint the exact failing package.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 20:00:04 +01:00
Ethanfel 8f2e204146 fix: show pip output, handle incomplete venv, fix TF version for Python 3.12
- tensorflow-cpu==2.15.0 only supports Python <=3.11; relax to >=2.16.0
- capture_output=False so pip errors are visible in ComfyUI logs
- clean up incomplete venv dir before retrying install

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:55:55 +01:00
Ethanfel 8e3ab999f0 fix: load VAE state dict with strict=False
vae.ckpt is a full training checkpoint containing discriminator, STFT
loss modules, and EMA wrappers that are absent from the inference
AudioAutoencoder. strict=False ignores these training-only keys while
still loading all encoder/decoder/bottleneck weights correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:51:51 +01:00
Ethanfel 35d0615253 feat: auto-install pip venv for feature extraction on first use
PrismAudioFeatureExtractor now creates and populates a managed venv
(_extract_env/) automatically when python_env is left as the default
'python'. Also adds scripts/install_extract_env.sh for manual/Docker
setup without conda.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 19:27:27 +01:00
Ethanfel 618e7de64b feat: PrismAudioTextOnly node with correct T5-Gemma encoding
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 18:09:11 +01:00
Ethanfel 3d62688e8c feat: PrismAudioSampler node with correct metadata format and peak normalization
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 18:07:33 +01:00
Ethanfel 7c54ee8482 feat: PrismAudioFeatureExtractor node with subprocess bridge and conda env
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 18:06:10 +01:00
Ethanfel 3f35aa39f2 feat: PrismAudioFeatureLoader node for pre-computed .npz files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 18:04:32 +01:00
Ethanfel 1043f4bacb feat: PrismAudioModelLoader node with auto-download and adaptive VRAM
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 18:02:47 +01:00
Ethanfel baa80de194 feat: project scaffolding with shared utils and node registration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-27 16:59:21 +01:00