- Clip length section: fixed 8s duration, padding/trim behavior, per-sound-type strategies (continuous, short events, repeating, onset placement). - Dataset size table: 5-10 / 15-30 / 30-60 / 60-150 / 150-300 / 300+ clips with scenario and expected result for each tier. - Note on diversity vs quantity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
10 KiB
LoRA Training for SelVA
LoRA lets you teach the model new or partially-known sound classes using a small set of video+audio pairs. Only ~10 MB of adapter weights are trained instead of the full 4.4 GB model.
Overview
Training is split into two steps:
- Dataset preparation (in ComfyUI) — extract visual features from your video clips using the
SelVA Feature Extractornode, and collect clean matching audio files. - Training (command line) — run
train_lora.pywith your dataset directory.
The training script only loads the generator and the VAE encoder. CLIP visual features and sync features come pre-computed from the .npz files, so Synchformer and T5 are not loaded during training, saving 3–4 GB of VRAM.
Requirements
Same environment as SelVA inference. Additional Python packages:
torchaudio
Step 1 — Prepare the dataset
1.1 Extract visual features in ComfyUI
For each video clip you want to train on:
- Load the video with a VHS LoadVideo node.
- Connect it to SelVA Feature Extractor.
- Set
cache_dirto a dedicated dataset folder, e.g.dataset/my_sound. - Set
nameto a short descriptive label, e.g.dog_bark. The node will savedog_bark_001.npz, thendog_bark_002.npz, etc. automatically as you process more clips. - Set the
promptto describe the sound (e.g.a dog barking). This prompt is used to condition the sync features — be specific. - Optionally connect a mask to isolate the sound source in frame (recommended when the scene has multiple objects).
Tip: The prompt used for feature extraction conditions the visual sync features. You can use a different, more precise prompt at training time — see Step 2.
1.2 Collect clean audio
For each .npz file, place a matching audio file with the same filename stem in the same directory:
dataset/my_sound/
dog_bark_001.npz ← from SelVA Feature Extractor
dog_bark_001.wav ← clean isolated audio recording
dog_bark_002.npz
dog_bark_002.wav
dog_bark_003.npz
dog_bark_003.wav
Supported audio formats: .wav, .flac, .mp3, .ogg, .aiff, .aif
The audio will be automatically resampled and trimmed/padded to match the model's expected duration. Use clean, isolated recordings — no background noise.
1.3 Optional: prompts.txt
If you want a different prompt at training time than the one embedded in the .npz, create a prompts.txt file in the dataset directory:
# One line per file: filename: prompt text
dog_bark.npz: a large dog barking aggressively
dog_bark_001.npz: a dog barking in the distance
Priority: prompts.txt > prompt embedded in .npz > directory name as fallback.
Step 2 — Run training
python train_lora.py \
--data_dir dataset/my_sound \
--output_dir lora_output/my_sound \
--variant large_44k \
--selva_dir /path/to/ComfyUI/models/selva \
--rank 16 \
--steps 2000 \
--lr 1e-4
The script will:
- Load the VAE, CLIP text encoder, and generator.
- Pre-load all clips (audio encoded to latents, features loaded from
.npz). - Train LoRA adapters for the specified number of steps.
- Save a checkpoint every
--save_everysteps and a finaladapter_final.ptwith embedded metadata.
CLI Reference
| Argument | Default | Description |
|---|---|---|
--data_dir |
required | Directory containing .npz + audio pairs |
--output_dir |
lora_output |
Where to save adapter checkpoints |
--variant |
large_44k |
Model variant: small_16k, small_44k, medium_44k, large_44k |
--selva_dir |
required | Path to SelVA model weights directory |
--rank |
16 |
LoRA rank — higher = more capacity, more VRAM |
--alpha |
rank |
LoRA alpha scaling. Default (= rank) means scale = 1.0 |
--target |
attn.qkv |
Which layers to adapt. Add linear1 for post-attention projections |
--lr |
1e-4 |
Learning rate |
--steps |
2000 |
Total training steps |
--warmup_steps |
500 |
Linear LR warmup steps |
--grad_accum |
4 |
Gradient accumulation steps (effective batch = grad_accum × 1) |
--save_every |
500 |
Save a checkpoint every N steps |
--precision |
bf16 |
Mixed precision: bf16, fp16, fp32 |
--seed |
42 |
Random seed |
Step 3 — Load the adapter in ComfyUI
Connect SelVA LoRA Loader between the model loader and the sampler:
SelVA Model Loader → SelVA LoRA Loader → SelVA Sampler
| Input | Description |
|---|---|
model |
SELVA_MODEL from the model loader |
adapter_path |
Path to adapter_final.pt or any adapter_stepXXXXX.pt |
strength |
0.0 = adapter disabled, 1.0 = full strength, >1.0 = exaggerated |
The loader reads rank, alpha, and target layers from the metadata embedded in the .pt file — no need to set them manually.
The base model is not modified. The loader returns a shallow copy with a deep-copied generator so the original stays intact.
Tuning Guide
Clip length
The model has a fixed input duration of 8 seconds for all variants (both 16k and 44k). This is not a parameter you can change.
- Audio shorter than 8 s is zero-padded (silence appended). The model will learn the sound but may also learn silence as part of the pattern — keep in mind for very short sounds.
- Audio longer than 8 s is trimmed at 8 s. Content beyond that is lost.
- Video shorter than 8 s has its last frame repeated to fill the clip.
Practical recommendations:
| Sound type | Clip strategy |
|---|---|
| Continuous sound (rain, engine, wind) | 8 s recordings, as many positions in the audio as possible |
| Single event < 2 s (click, bark, knock) | Center the event — pad deliberately with silence before/after, or loop the event 2–3 times per clip |
| Repeating event (footsteps, dripping) | Record full 8 s with natural repetition at the intended cadence |
| Sound with a clear onset (explosion, splash) | Put the onset at ~1–2 s from the start, not at 0 s — gives the model context |
Tip: When extracting features in ComfyUI, set
durationto 0 to use the full video length up to 8 s. Clips longer than 8 s are automatically clamped.
How many clips do I need?
The table below gives a rough scaling guide. Quality and diversity of recordings matter more than raw count.
| Dataset size | Scenario | Expected result |
|---|---|---|
| 5–10 clips | Quick test / proof of concept | May work if the model already partially knows the sound; often underfits |
| 15–30 clips | Fine-tuning a sound the model knows but gets wrong | Good starting point — covers the main variations |
| 30–60 clips | Teaching a new but acoustically simple sound class | Reliable convergence with default hyperparameters |
| 60–150 clips | Unusual or complex sounds, strong style shift | Needed for stable generalization across video contexts |
| 150–300 clips | Sounds the model has never encountered | Required to avoid overfitting; increase rank to 32 |
| 300+ | Large-scale domain shift | Consider also targeting linear1 in addition to attn.qkv |
Diversity beats quantity. Ten clips of a dog barking in different environments (indoors, outdoors, distant, close) train better than fifty clips of the same recording. Vary: distance, room acoustics, intensity, speed.
Rank
| Rank | Use case |
|---|---|
8 |
Fine details on a sound the model already knows well |
16 |
Default — good balance of capacity and VRAM |
32 |
Harder sounds or larger style shifts |
Higher rank increases VRAM usage and overfitting risk on small datasets.
Steps
| Dataset size | Recommended steps |
|---|---|
| 10–20 clips | 500–1000 |
| 20–50 clips | 1000–3000 |
| 50+ clips | 2000–5000 |
Monitor the loss — it should decrease steadily in the first few hundred steps. If it plateaus early, try a higher rank or more clips. If it drops very fast and then bounces, lower the learning rate.
Learning rate
1e-4 is a safe default. If training is unstable (loss spikes), try 5e-5. If learning seems slow, try 2e-4.
Target layers
attn.qkv (default) adapts only the self-attention QKV projections — 21 layers in large_44k. This is the recommended starting point.
Add linear1 to also adapt post-attention projections if attn.qkv alone is not enough:
--target attn.qkv linear1
Loss interpretation
A typical loss curve:
- Starts around
0.8–1.2 - Should reach
0.3–0.6after convergence for a clean sound class - Below
0.1on a small dataset usually means overfitting
Precision
Use bf16 on Ampere+ GPUs (RTX 3xxx, A100, etc.). Fall back to fp16 on older GPUs. fp32 is only needed for debugging — 2× more VRAM.
Output files
lora_output/my_sound/
adapter_step00500.pt ← checkpoint at step 500
adapter_step01000.pt ← checkpoint at step 1000
...
adapter_final.pt ← final adapter with embedded metadata
meta.json ← human-readable metadata (rank, alpha, target, steps)
adapter_final.pt format:
{
"state_dict": { "blocks.0.attn.qkv.lora_A": ..., ... },
"meta": {
"variant": "large_44k",
"rank": 16,
"alpha": 16.0,
"target": ["attn.qkv"],
"steps": 2000
}
}
Troubleshooting
No layers matched target=...
The --target suffixes do not match any layer names. The default attn.qkv targets SelfAttention.qkv in all transformer blocks. If you changed --target, verify the layer names with model.named_modules().
No .npz files found in ...
The --data_dir path is wrong or no .npz files were extracted there yet. Run SelVA Feature Extractor in ComfyUI first with the matching cache_dir.
No audio file found for clip.npz
Place an audio file with the exact same stem next to the .npz: clip.wav, clip.flac, etc.
Loss does not decrease
- Try a higher learning rate (
2e-4) or more warmup steps. - Check that the audio files are clean and actually contain the target sound.
- Check that the
.npzfeatures were extracted with a relevant prompt.
Loss explodes or NaN
- Lower the learning rate (
5e-5). - Make sure audio is normalized to
[-1, 1]. PCM files with 16-bit integer encoding may need to be converted first (ffmpeg -i input.wav -ar 44100 output.wav).