Ethanfel 2f4641247a feat: add resume support to train_lora.py
Step checkpoints now save optimizer state, scheduler state, and step
number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt
to continue training from that checkpoint. --steps always means total steps,
so resuming from 1000 with --steps 2000 trains 1000 more steps.

adapter_final.pt format is unchanged (state_dict + meta only) so
SelvaLoraLoader remains compatible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-05 16:59:30 +02:00

# LoRA Training for SelVA
LoRA lets you teach the model new or partially-known sound classes using a small set of video+audio pairs. Only ~10 MB of adapter weights are trained instead of the full 4.4 GB model.
---
## Overview
Training is split into two steps:
1. **Dataset preparation** (in ComfyUI) — extract visual features from your video clips using the `SelVA Feature Extractor` node, and collect clean matching audio files.
2. **Training** (command line) — run `train_lora.py` with your dataset directory.
The training script loads only the generator, the VAE encoder, and the CLIP text encoder. CLIP visual features and sync features come pre-computed from the `.npz` files, so Synchformer and T5 are not loaded during training, which saves roughly 3-4 GB of VRAM.
---
## Requirements
Same environment as SelVA inference. Additional Python packages:
```
torchaudio
```
---
## Step 1 — Prepare the dataset
### 1.1 Extract visual features in ComfyUI
For each video clip you want to train on:
1. Load the video with a VHS LoadVideo node.
2. Connect it to **SelVA Feature Extractor**.
3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
> **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.
### 1.2 Collect clean audio
For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
```
dataset/my_sound/
dog_bark_001.npz ← from SelVA Feature Extractor
dog_bark_001.wav ← clean isolated audio recording
dog_bark_002.npz
dog_bark_002.wav
dog_bark_003.npz
dog_bark_003.wav
```
Supported audio formats: `.wav`, `.flac`, `.mp3`, `.ogg`, `.aiff`, `.aif`
The audio will be automatically resampled and trimmed/padded to match the model's expected duration. Use clean, isolated recordings — no background noise.
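
Before training, it can help to confirm that every `.npz` has a matching audio file. The sketch below is a standalone check and is not part of `train_lora.py`; the function name `find_pairs` is made up for illustration, and it assumes lowercase file extensions.

```python
from pathlib import Path

# Extensions listed as supported in this guide.
AUDIO_EXTS = (".wav", ".flac", ".mp3", ".ogg", ".aiff", ".aif")


def find_pairs(data_dir):
    """Return (pairs, missing): matched (npz, audio) paths and .npz files without audio."""
    pairs, missing = [], []
    for npz in sorted(Path(data_dir).glob("*.npz")):
        # Look for a file with the same stem and any supported extension.
        candidates = (npz.with_suffix(ext) for ext in AUDIO_EXTS)
        audio = next((c for c in candidates if c.exists()), None)
        if audio is None:
            missing.append(npz)
        else:
            pairs.append((npz, audio))
    return pairs, missing
```

Calling `find_pairs("dataset/my_sound")` returns the complete pairs and the list of `.npz` files that still need audio.
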
### 1.3 Optional: prompts.txt
If you want a different prompt at training time than the one embedded in the `.npz`, create a `prompts.txt` file in the dataset directory:
```
# One line per file: filename: prompt text
dog_bark_001.npz: a large dog barking aggressively
dog_bark_002.npz: a dog barking in the distance
```
Priority: `prompts.txt` > prompt embedded in `.npz` > directory name as fallback.
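
The priority order can be sketched as follows. This is an illustration of the rule only, not the parser used by `train_lora.py`; the helper names are hypothetical, and the embedded prompt is passed in as a plain string because the key it is stored under in the `.npz` is not specified here.

```python
from pathlib import Path


def load_prompt_overrides(data_dir):
    """Parse prompts.txt into {filename: prompt}; returns {} if the file is absent."""
    path = Path(data_dir) / "prompts.txt"
    overrides = {}
    if not path.exists():
        return overrides
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        name, sep, prompt = line.partition(":")
        if sep and prompt.strip():
            overrides[name.strip()] = prompt.strip()
    return overrides


def resolve_prompt(npz_path, embedded_prompt, overrides):
    """Apply the priority: prompts.txt > embedded prompt > directory name."""
    npz_path = Path(npz_path)
    if npz_path.name in overrides:
        return overrides[npz_path.name]
    if embedded_prompt:
        return embedded_prompt
    return npz_path.parent.name  # fallback: name of the dataset directory
```
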
---
## Step 2 — Run training
```bash
python train_lora.py \
--data_dir dataset/my_sound \
--output_dir lora_output/my_sound \
--variant large_44k \
--selva_dir /path/to/ComfyUI/models/selva \
--rank 16 \
--steps 2000 \
--lr 1e-4
```
The script will:
1. Load the VAE, CLIP text encoder, and generator.
2. Pre-load all clips (audio encoded to latents, features loaded from `.npz`).
3. Train LoRA adapters for the specified number of steps.
4. Save a checkpoint every `--save_every` steps and a final `adapter_final.pt` with embedded metadata.
---
## CLI Reference
| Argument | Default | Description |
|---|---|---|
| `--data_dir` | required | Directory containing `.npz` + audio pairs |
| `--output_dir` | `lora_output` | Where to save adapter checkpoints |
| `--variant` | `large_44k` | Model variant: `small_16k`, `small_44k`, `medium_44k`, `large_44k` |
| `--selva_dir` | required | Path to SelVA model weights directory |
| `--rank` | `16` | LoRA rank — higher = more capacity, more VRAM |
| `--alpha` | `rank` | LoRA alpha scaling. Default (= rank) means scale = 1.0 |
| `--target` | `attn.qkv` | Which layers to adapt. Add `linear1` for post-attention projections |
| `--lr` | `1e-4` | Learning rate |
| `--steps` | `2000` | Total training steps |
| `--warmup_steps` | `500` | Linear LR warmup steps |
| `--grad_accum` | `4` | Gradient accumulation steps (effective batch = grad_accum × 1) |
| `--save_every` | `500` | Save a checkpoint every N steps |
| `--resume` | `None` | Path to a step checkpoint to resume from (e.g. `lora_output/adapter_step01000.pt`) |
| `--precision` | `bf16` | Mixed precision: `bf16`, `fp16`, `fp32` |
| `--seed` | `42` | Random seed |
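
A few of these arguments interact arithmetically, and the sketch below spells that out. It is a hedged illustration with hypothetical function names, not code from `train_lora.py`. In particular, the learning rate is assumed to stay constant once warmup ends, which the table above does not state. With the defaults, `1e-4 * warmup_factor(249)` gives `5e-5` halfway through warmup, and resuming from step 1000 with `--steps 2000` leaves 1000 steps to train.

```python
def warmup_factor(step, warmup_steps=500):
    """LR multiplier for linear warmup; 1.0 afterwards (constant LR is an assumption)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)


def remaining_steps(total_steps, resumed_step=0):
    """--steps is the total, so a resumed run only trains the difference."""
    return max(0, total_steps - resumed_step)


def effective_batch(grad_accum=4, micro_batch=1):
    """Effective batch size = grad_accum x micro-batch size (1 in this guide)."""
    return grad_accum * micro_batch
```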
---
## Step 3 — Load the adapter in ComfyUI
Connect **SelVA LoRA Loader** between the model loader and the sampler:
```
SelVA Model Loader → SelVA LoRA Loader → SelVA Sampler
```
| Input | Description |
|---|---|
| `model` | SELVA_MODEL from the model loader |
| `adapter_path` | Path to `adapter_final.pt` or any `adapter_stepXXXXX.pt` |
| `strength` | 0.0 = adapter disabled, 1.0 = full strength, >1.0 = exaggerated |
The loader reads rank, alpha, and target layers from the metadata embedded in the `.pt` file — no need to set them manually.
> The base model is not modified. The loader returns a shallow copy with a deep-copied generator so the original stays intact.
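
To make `rank`, `alpha`, and `strength` concrete, here is a minimal plain-Python sketch of a LoRA-adapted linear layer. It follows the standard LoRA formulation (`scale = alpha / rank`), which matches the documented default of scale = 1.0 when alpha equals rank. Multiplying that scale by `strength` is an assumption about how the loader applies it, and all names below are illustrative rather than taken from the SelVA code.

```python
def lora_scale(rank, alpha=None, strength=1.0):
    """Effective multiplier on the low-rank update: strength * alpha / rank."""
    if alpha is None:
        alpha = float(rank)  # documented default: alpha = rank, so scale = 1.0
    return strength * alpha / rank


def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, rank, alpha=None, strength=1.0):
    """y = W x + scale * B (A x), with A of shape rank x in and B of shape out x rank."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = lora_scale(rank, alpha, strength)
    return [b + scale * u for b, u in zip(base, update)]
```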
---
## Tuning Guide
### Clip length
The model has a **fixed input duration of 8 seconds** for all variants (both 16k and 44k). This is not a parameter you can change.
- Audio shorter than 8 s is **zero-padded** (silence appended). The model will learn the sound but may also learn silence as part of the pattern, so keep this in mind for very short sounds.
- Audio longer than 8 s is **trimmed** at 8 s. Content beyond that is lost.
- Video shorter than 8 s has its **last frame repeated** to fill the clip.
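
The three rules above can be sketched on plain Python lists. This is only an illustration of the behavior described here; the actual script works on tensors, and the function names are hypothetical.

```python
CLIP_SECONDS = 8.0  # fixed input duration stated in this guide


def fit_audio(samples, sample_rate, clip_seconds=CLIP_SECONDS):
    """Trim or zero-pad a mono sample list to exactly clip_seconds."""
    target = int(round(clip_seconds * sample_rate))
    if len(samples) >= target:
        return list(samples[:target])  # content past 8 s is dropped
    return list(samples) + [0.0] * (target - len(samples))  # silence appended


def fit_frames(frames, target_count):
    """Trim, or repeat the last frame, so the clip has target_count frames."""
    if not frames:
        raise ValueError("need at least one frame")
    if len(frames) >= target_count:
        return list(frames[:target_count])
    return list(frames) + [frames[-1]] * (target_count - len(frames))
```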
**Practical recommendations:**
| Sound type | Clip strategy |
|---|---|
| Continuous sound (rain, engine, wind) | 8 s recordings, cut from as many different positions in the source audio as possible |
| Single event < 2 s (click, bark, knock) | Center the event: pad deliberately with silence before/after, or loop the event 2-3 times per clip |
| Repeating event (footsteps, dripping) | Record full 8 s with natural repetition at the intended cadence |
| Sound with a clear onset (explosion, splash) | Put the onset at ~1-2 s from the start, not at 0 s, to give the model context |
> **Tip:** When extracting features in ComfyUI, set `duration` to 0 to use the full video length up to 8 s. Clips longer than 8 s are automatically clamped.
### How many clips do I need?
The table below gives a rough scaling guide. Quality and diversity of recordings matter more than raw count.
| Dataset size | Scenario | Expected result |
|---|---|---|
| **5-10 clips** | Quick test / proof of concept | May work if the model already partially knows the sound; often underfits |
| **15-30 clips** | Fine-tuning a sound the model knows but gets wrong | Good starting point; covers the main variations |
| **30-60 clips** | Teaching a new but acoustically simple sound class | Reliable convergence with default hyperparameters |
| **60-150 clips** | Unusual or complex sounds, strong style shift | Needed for stable generalization across video contexts |
| **150-300 clips** | Sounds the model has never encountered | Required to avoid overfitting; increase rank to 32 |
| **300+** | Large-scale domain shift | Consider also targeting `linear1` in addition to `attn.qkv` |
**Diversity beats quantity.** Ten clips of a dog barking in different environments (indoors, outdoors, distant, close) train better than fifty clips of the same recording. Vary: distance, room acoustics, intensity, speed.
### Rank
| Rank | Use case |
|---|---|
| `8` | Fine details on a sound the model already knows well |
| `16` | Default — good balance of capacity and VRAM |
| `32` | Harder sounds or larger style shifts |
Higher rank increases VRAM usage and overfitting risk on small datasets.
### Steps
| Dataset size | Recommended steps |
|---|---|
| 10-20 clips | 500-1000 |
| 20-50 clips | 1000-3000 |
| 50+ clips | 2000-5000 |
Monitor the loss — it should decrease steadily in the first few hundred steps. If it plateaus early, try a higher rank or more clips. If it drops very fast and then bounces, lower the learning rate.
### Learning rate
`1e-4` is a safe default. If training is unstable (loss spikes), try `5e-5`. If learning seems slow, try `2e-4`.
### Target layers
`attn.qkv` (default) adapts only the self-attention QKV projections — 21 layers in `large_44k`. This is the recommended starting point.
Add `linear1` to also adapt post-attention projections if `attn.qkv` alone is not enough:
```bash
--target attn.qkv linear1
```
### Loss interpretation
A typical loss curve:
- Starts around `0.8-1.2`
- Should reach `0.3-0.6` after convergence for a clean sound class
- Below `0.1` on a small dataset usually means overfitting
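
Per-step losses are noisy, so it is easier to judge these ranges on a smoothed curve. The sketch below uses a bias-corrected exponential moving average. It is a generic monitoring aid, not something `train_lora.py` is known to provide, and the function names and the `beta` value are illustrative choices.

```python
def smooth(losses, beta=0.98):
    """Bias-corrected exponential moving average of per-step losses."""
    out, avg = [], 0.0
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        out.append(avg / (1.0 - beta ** i))  # correct for the zero-initialised start
    return out


def overfit_warning(losses, threshold=0.1, beta=0.98):
    """True once the smoothed loss falls below the 0.1 heuristic from this guide."""
    smoothed = smooth(losses, beta)
    return bool(smoothed) and smoothed[-1] < threshold
```

Treat the threshold as a rule of thumb: a low loss only suggests overfitting when the dataset is small.
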
### Precision
Use `bf16` on Ampere+ GPUs (RTX 3xxx, A100, etc.). Fall back to `fp16` on older GPUs. `fp32` is only needed for debugging and uses about 2× the VRAM.
---
## Output files
```
lora_output/my_sound/
adapter_step00500.pt ← checkpoint at step 500
adapter_step01000.pt ← checkpoint at step 1000
...
adapter_final.pt ← final adapter with embedded metadata
meta.json ← human-readable metadata (rank, alpha, target, steps)
```
`adapter_final.pt` format:
```python
{
"state_dict": { "blocks.0.attn.qkv.lora_A": ..., ... },
"meta": {
"variant": "large_44k",
"rank": 16,
"alpha": 16.0,
"target": ["attn.qkv"],
"steps": 2000
}
}
```
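
A quick way to sanity-check an adapter is to validate the layout shown above. The sketch below works on an already-loaded dictionary. In practice that dictionary would come from something like `torch.load("lora_output/my_sound/adapter_final.pt", map_location="cpu")`, which is left out so the example runs without PyTorch. The function name `summarize_adapter` is hypothetical.

```python
REQUIRED_META = ("variant", "rank", "alpha", "target", "steps")


def summarize_adapter(ckpt):
    """Check the documented layout and return a short summary dict."""
    for key in ("state_dict", "meta"):
        if key not in ckpt:
            raise KeyError(f"adapter is missing '{key}'")
    meta = ckpt["meta"]
    missing = [k for k in REQUIRED_META if k not in meta]
    if missing:
        raise KeyError(f"meta is missing {missing}")
    return {
        "variant": meta["variant"],
        "num_tensors": len(ckpt["state_dict"]),
        "scale": meta["alpha"] / meta["rank"],  # alpha = rank gives 1.0
        "target": list(meta["target"]),
    }
```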
---
## Troubleshooting
**`No layers matched target=...`**
The `--target` suffixes do not match any layer names. The default `attn.qkv` targets `SelfAttention.qkv` in all transformer blocks. If you changed `--target`, verify the layer names with `model.named_modules()`.
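
Suffix matching can be tried offline on a list of names. The module names below are invented for illustration and are not the real SelVA layer names. With a loaded model, the list would instead be built as `[name for name, _ in model.named_modules()]`.

```python
# Hypothetical module names, for illustration only.
EXAMPLE_NAMES = [
    "blocks.0.attn.qkv",
    "blocks.0.attn.proj",
    "blocks.0.ffn.linear1",
    "blocks.1.attn.qkv",
]


def match_targets(module_names, targets):
    """Return the module names that end with any of the target suffixes."""
    return [n for n in module_names if any(n.endswith(t) for t in targets)]


# An empty result corresponds to the "No layers matched" error.
matched = match_targets(EXAMPLE_NAMES, ["attn.qkv", "linear1"])
```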
**`No .npz files found in ...`**
The `--data_dir` path is wrong or no `.npz` files were extracted there yet. Run SelVA Feature Extractor in ComfyUI first with the matching `cache_dir`.
**`No audio file found for clip.npz`**
Place an audio file with the exact same stem next to the `.npz`: `clip.wav`, `clip.flac`, etc.
**Loss does not decrease**
- Try a higher learning rate (`2e-4`) or more warmup steps.
- Check that the audio files are clean and actually contain the target sound.
- Check that the `.npz` features were extracted with a relevant prompt.
**Loss explodes or NaN**
- Lower the learning rate (`5e-5`).
- Make sure audio is normalized to `[-1, 1]`. PCM files with 16-bit integer encoding may need to be converted to 32-bit float first (`ffmpeg -i input.wav -ar 44100 -c:a pcm_f32le output.wav`).
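
For reference, normalizing 16-bit PCM means dividing each integer sample by 32768. The sketch below shows that mapping, plus a peak check, on plain Python lists. It is an illustration with hypothetical function names, not code from the training script.

```python
def int16_to_float(samples):
    """Map signed 16-bit PCM integers to floats in [-1.0, 1.0)."""
    return [s / 32768.0 for s in samples]


def peak(samples):
    """Largest absolute sample value; should not exceed 1.0 for normalized audio."""
    return max((abs(s) for s in samples), default=0.0)
```

A peak well above 1.0 on loaded audio suggests the samples are still integer-scaled.
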