
Ethanfel e1a2f0ed7d feat: add inject_mode (suffix/prefix) to TI pipeline

Observation: n4_baseline loss barely moved (1.025→0.965 over 3000 steps),
token_norm grew linearly without plateau — generator likely ignores last-K
CLIP positions (EOS/padding zone) where suffix injects.

Fix: add inject_mode parameter throughout the pipeline:
- "suffix": replace last K positions (original behavior, model may ignore)
- "prefix": replace positions 1:1+K right after BOS — highest attention
  weight in CLIP, much stronger gradient signal expected
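
The position logic of the two modes can be sketched in plain Python. This is only an illustration: per the change list, the actual `_inject_tokens()` helper builds the sequence with `torch.cat`, and its signature is not shown in this commit message. The name `inject_tokens`, the argument names, and the use of nested lists instead of tensors (with no batch dimension) are assumptions made for this example.

```python
from typing import List, Sequence

Vector = List[float]


def inject_tokens(seq: Sequence[Vector],
                  learned: Sequence[Vector],
                  inject_mode: str = "suffix") -> List[Vector]:
    """Replace K positions of a token sequence with K learned embeddings.

    Hypothetical stand-in for the repo's helper; position 0 is assumed to be BOS.
    """
    k = len(learned)
    if k == 0:
        return list(seq)
    if k > len(seq) - 1:
        raise ValueError("not enough positions after BOS to inject K tokens")
    if inject_mode == "suffix":
        # Overwrite the last K positions (the EOS/padding zone).
        return list(seq[:-k]) + list(learned)
    if inject_mode == "prefix":
        # Keep BOS at position 0, overwrite positions 1:1+K, keep the rest.
        return list(seq[:1]) + list(learned) + list(seq[1 + k:])
    raise ValueError(f"unknown inject_mode: {inject_mode!r}")


# Toy sequence of six 1-dimensional "embeddings" and K = 2 learned tokens.
tokens = [[float(i)] for i in range(6)]
learned_tokens = [[100.0], [101.0]]
suffix_result = inject_tokens(tokens, learned_tokens, "suffix")
prefix_result = inject_tokens(tokens, learned_tokens, "prefix")
```

Both modes leave the sequence length unchanged, so downstream code that expects a fixed-length CLIP sequence would not need to change.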

Changes:
- selva_textual_inversion_trainer.py: _inject_tokens() helper centralises
  the torch.cat construction for both modes; used in training loop and eval;
  inject_mode stored in checkpoint files
- selva_textual_inversion_loader.py: reads inject_mode from checkpoint,
  includes in TEXTUAL_INVERSION bundle
- selva_sampler.py: uses _inject_tokens() via bundle's inject_mode field
- selva_ti_scheduler.py: inject_mode in _PARAM_DEFAULTS, config, and
  _train_inner call
- ti_sweep_1.json: updated with prefix_inject group (n4, n8, n4+warm);
  n4_baseline marked completed; suffix experiments retained for comparison

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-04-08 23:31:52 +02:00

Name	Last commit	Date
`docs`	docs: note AudioX shows no perceptual quality gain on V2A vs SelVA	2026-04-07 09:12:00 +02:00
`experiments`	feat: add inject_mode (suffix/prefix) to TI pipeline	2026-04-08 23:31:52 +02:00
`nodes`	feat: add inject_mode (suffix/prefix) to TI pipeline	2026-04-08 23:31:52 +02:00
`selva_core`	feat: add LoRA dropout, LoRA+ asymmetric LR, and curriculum timestep sampling	2026-04-06 12:43:18 +02:00
`workflows`	feat: update demo workflow with VHS_VideoCombine output	2026-04-04 18:07:56 +02:00
`__init__.py`	chore: remove all PrismAudio code from main branch	2026-04-04 17:58:31 +02:00
`.gitignore`	feat: auto-install pip venv for feature extraction on first use	2026-03-27 19:27:27 +01:00
`LORA_TRAINING.md`	docs: add video format recommendations to dataset preparation section	2026-04-06 13:44:14 +02:00
`README.md`	docs: document mask inputs and normalize toggle in README	2026-04-05 10:43:42 +02:00
`requirements.txt`	chore: remove all PrismAudio code from main branch	2026-04-04 17:58:31 +02:00
`train_lora.py`	feat: add LoRA dropout, LoRA+ asymmetric LR, and curriculum timestep sampling	2026-04-06 12:43:18 +02:00

README.md

ComfyUI-SelVA

Custom nodes for SelVA — video-to-audio generation driven by text prompts. SelVA conditions audio synthesis on both visual content and natural language, letting you describe what sounds to generate rather than just when.

Built on MMAudio with a TextSynchformer encoder that injects text guidance directly into the visual sync stream.

Nodes

SelVA Model Loader

Loads the generator, TextSynchformer encoder, and all feature utilities (CLIP, T5, Synchformer, VAE). Weights are auto-downloaded from HuggingFace on first use.

Input	Options	Description
`variant`	small_16k / small_44k / medium_44k / large_44k	Model size and output sample rate
`precision`	bf16 / fp16 / fp32	Compute dtype
`offload_strategy`	auto / keep_in_vram / offload_to_cpu	Memory management

Output: model (SELVA_MODEL)

SelVA Feature Extractor

Extracts CLIP visual features and text-guided sync features from a video. Results are cached on disk — re-running with the same inputs is instant.

Input	Description
`model`	From SelVA Model Loader
`video`	IMAGE tensor from any ComfyUI video loader
`prompt`	Text description of the audio to generate
`video_info`	(optional) VHS_VIDEOINFO from VHS LoadVideo — sets fps automatically
`fps`	Source fps — ignored if `video_info` is connected
`duration`	Override clip duration in seconds. `0` = infer from video length
`cache_dir`	Directory for cached `.npz` files. Empty = system temp dir
`mask`	(optional) Segmentation mask `[T,H,W]` float [0,1] — static (1 frame) or per-frame
`mask_strength`	Background suppression strength. `1.0` = full neutral fill, `0.0` = no effect
`mask_clip`	Apply mask to CLIP features (384px path). Disable to let CLIP see the full scene
`mask_sync`	Apply mask to TextSynchformer sync features (224px path)

Outputs: features (SELVA_FEATURES), fps (FLOAT), prompt (STRING)

Connect prompt output to the Sampler's prompt input to avoid entering it twice.

Masking

Connect a segmentation mask (SAM2, Grounding DINO+SAM, or any ComfyUI mask node) to isolate a specific object's motion before encoding. Background pixels are filled with a neutral value (0.5) rather than zeroed — this keeps them in-distribution for CLIP and maps to exactly 0 after sync's [-1,1] normalization, minimising the influence of background motion on the generated audio.
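
The neutral-fill arithmetic can be illustrated per pixel. This is a sketch under assumptions: the linear blend with `mask_strength` and the `x * 2 - 1` mapping to [-1,1] are plausible readings of the description above, not code taken from the nodes, and both function names are hypothetical.

```python
def neutral_fill(pixel: float, mask: float,
                 mask_strength: float = 1.0, neutral: float = 0.5) -> float:
    """Blend a background pixel toward the neutral value.

    mask is 1.0 on the target object and 0.0 on the background (assumed convention).
    """
    suppress = mask_strength * (1.0 - mask)  # 0 = keep the pixel, 1 = fully neutral
    return pixel * (1.0 - suppress) + neutral * suppress


def to_sync_range(value: float) -> float:
    """Map a [0,1] value to [-1,1] (assumed form of the sync normalization)."""
    return value * 2.0 - 1.0


background = neutral_fill(0.9, mask=0.0, mask_strength=1.0)   # fully suppressed
foreground = neutral_fill(0.9, mask=1.0, mask_strength=1.0)   # left untouched
disabled = neutral_fill(0.9, mask=0.0, mask_strength=0.0)     # strength 0 = no effect
background_after_norm = to_sync_range(background)
```

Under these assumptions a fully suppressed background pixel lands at 0.5 and therefore at 0 after normalization, which matches the behaviour described for the sync path.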

Use mask_sync=true, mask_clip=false if you want sync features focused on the target object while CLIP still sees the full scene for broader context. Changing any mask parameter invalidates the feature cache, so features are re-extracted with the new settings.
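
One common way to get this cache behaviour is to hash every input that affects the features into the cache key. The sketch below shows that idea only; the function name, the argument list, and the use of SHA-256 over a JSON payload are assumptions, and the extractor's real key scheme may differ.

```python
import hashlib
import json
from typing import Optional


def feature_cache_key(video_digest: str, prompt: str, fps: float, duration: float,
                      mask_digest: Optional[str], mask_strength: float,
                      mask_clip: bool, mask_sync: bool) -> str:
    """Build a deterministic key from all inputs that influence extraction.

    video_digest and mask_digest are assumed to be content hashes computed elsewhere.
    """
    payload = {
        "video": video_digest,
        "prompt": prompt,
        "fps": fps,
        "duration": duration,
        "mask": mask_digest,
        "mask_strength": mask_strength,
        "mask_clip": mask_clip,
        "mask_sync": mask_sync,
    }
    # sort_keys makes the serialization, and therefore the hash, order-independent.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


key_a = feature_cache_key("abc123", "dog barking", 25.0, 0.0, "m1", 1.0, True, True)
key_b = feature_cache_key("abc123", "dog barking", 25.0, 0.0, "m1", 1.0, True, True)
key_c = feature_cache_key("abc123", "dog barking", 25.0, 0.0, "m1", 1.0, False, True)
```

Identical inputs map to the same key, while flipping a single mask toggle produces a different key and therefore a different cached `.npz` file.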

SelVA Sampler

Generates audio from video features. Runs the rectified flow ODE with classifier-free guidance.

Input	Description
`model`	From SelVA Model Loader
`features`	From SelVA Feature Extractor
`prompt`	Text description — leave empty to use the prompt stored in features
`negative_prompt`	What to suppress (e.g. `"speech, voice, talking"`)
`duration`	Audio duration in seconds. `0` = use duration from features
`steps`	Sampling steps (default: 25)
`cfg_strength`	Classifier-free guidance scale (default: 4.5)
`seed`	RNG seed
`normalize`	Peak-normalize output to [-1, 1] (default: true)

Output: AUDIO
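
How `steps` and `cfg_strength` interact can be shown with a toy sampler. This is a minimal sketch, assuming a fixed-step Euler solver and the standard classifier-free guidance combination; the README does not state which ODE solver the node uses, and `velocity_fn`, `cfg_velocity`, and `sample_euler` are hypothetical names. `None` stands in for the unconditional (or negative-prompt) branch.

```python
from typing import Callable, List, Optional

VelocityFn = Callable[[List[float], float, Optional[str]], List[float]]


def cfg_velocity(velocity_fn: VelocityFn, x: List[float], t: float,
                 prompt: str, cfg_strength: float) -> List[float]:
    """Combine conditional and unconditional velocities with guidance."""
    v_cond = velocity_fn(x, t, prompt)
    v_uncond = velocity_fn(x, t, None)
    return [u + cfg_strength * (c - u) for c, u in zip(v_cond, v_uncond)]


def sample_euler(velocity_fn: VelocityFn, x0: List[float], prompt: str,
                 steps: int = 25, cfg_strength: float = 4.5) -> List[float]:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = cfg_velocity(velocity_fn, x, t, prompt, cfg_strength)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: List[float], t: float, prompt: Optional[str]) -> List[float]:
    # Toy model: constant velocity 1.0 when conditioned, 0.0 otherwise.
    return [1.0 if prompt is not None else 0.0 for _ in x]


plain = sample_euler(toy_velocity, [0.0], "rain", steps=25, cfg_strength=1.0)
guided = sample_euler(toy_velocity, [0.0], "rain", steps=25, cfg_strength=4.5)
```

With this toy velocity, a guidance scale of 1.0 reproduces the conditional trajectory, and larger values push the result further along the direction in which the conditional and unconditional predictions differ.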

Workflow

VHS LoadVideo ──► SelVA Feature Extractor ──────────────────────► SelVA Sampler ──► Save Audio
                      │ (video_info) ─► (fps auto)                      ▲
                      │ (features) ────────────────────────────────────►│
                      │ (prompt) ──────────────────────────────────────►│

Connect the prompt output of Feature Extractor directly to Sampler's prompt to keep them in sync. Alternatively, leave Sampler's prompt empty and it will use the prompt stored in the features during extraction.

Installation

cd ComfyUI/custom_nodes
git clone https://github.com/Ethanfel/ComfyUI-SelVA.git
pip install -r ComfyUI-SelVA/requirements.txt

Model Weights

Weights are auto-downloaded to ComfyUI/models/selva/ on first load. No manual setup required.

File	Size	Description
`video_enc_sup_5.pth`	~300 MB	TextSynchformer encoder
`generator_small_16k_sup_5.pth`	~340 MB	Small generator, 16 kHz output
`generator_small_44k_sup_5.pth`	~340 MB	Small generator, 44.1 kHz output
`generator_medium_44k_sup_5.pth`	~860 MB	Medium generator, 44.1 kHz output
`generator_large_44k_sup_5.pth`	~2.0 GB	Large generator, 44.1 kHz output
`v1-16.pth`	~1.1 GB	VAE for 16 kHz
`v1-44.pth`	~1.1 GB	VAE for 44.1 kHz
`best_netG.pt`	~90 MB	BigVGAN vocoder for 16 kHz
`synchformer_state_dict.pth`	~950 MB	Synchformer (shared with PrismAudio if present)

CLIP (DFN5B-ViT-H-14-384) and T5 (flan-t5-base) are downloaded automatically from HuggingFace to ~/.cache/huggingface/.

VRAM Requirements

VRAM	Recommended settings
24 GB+	`keep_in_vram`, any variant
12–24 GB	`offload_to_cpu`, medium or smaller
8–12 GB	`offload_to_cpu`, small variant, fp16

The auto offload strategy picks keep_in_vram if ≥ 16 GB VRAM is available, otherwise offload_to_cpu.
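
The rule above amounts to a small decision function. The sketch below only restates it; the function name and the idea of passing the available VRAM in as a number are assumptions, since the loader's actual detection code is not shown here.

```python
def resolve_offload_strategy(offload_strategy: str, vram_gb: float) -> str:
    """Resolve 'auto' to a concrete strategy using the 16 GB threshold."""
    if offload_strategy != "auto":
        # Explicit choices are passed through unchanged.
        return offload_strategy
    return "keep_in_vram" if vram_gb >= 16.0 else "offload_to_cpu"


on_24gb = resolve_offload_strategy("auto", 24.0)
on_12gb = resolve_offload_strategy("auto", 12.0)
forced = resolve_offload_strategy("offload_to_cpu", 24.0)
```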

Credits

SelVA by Jaehwan Lee et al. — TextSynchformer and SelVA training
MMAudio by Feng et al. — MM-DiT audio generator and flow matching framework
BigVGAN by NVIDIA — neural vocoder for 16 kHz synthesis
