Compare commits
5 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 95136b53a0 | |
| | 8f31d00beb | |
| | 3ee1893e10 | |
| | c86258d48f | |
| | 8338560600 | |
+64
-2
@@ -36,11 +36,41 @@ For each video clip you want to train on:
 2. Connect it to **SelVA Feature Extractor**.
 3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
 4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
-5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
-6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
+5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features — be as specific as possible (see prompt guide below).
+6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible — see masking note below).
 
 > **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.
 
+### Prompt guide
+
+The prompt is not just a label — it directly shapes what the Synchformer pays attention to in the video. Imprecise prompts produce unfocused sync features, which the LoRA then has to compensate for, often introducing noise.
+
+**Good prompts are specific about:**
+- The sound source (what object is making the sound)
+- The acoustic character (loud/quiet, sharp/soft, wet/dry)
+- The action producing the sound (if applicable)
+
+| Sound | Weak prompt | Strong prompt |
+|---|---|---|
+| Dog bark | `dog` | `a large dog barking loudly` |
+| Footsteps | `walking` | `heavy boots on a wooden floor` |
+| Water | `water` | `water dripping into a metal bucket` |
+| Explosion | `explosion` | `a large explosion with deep bass rumble` |
+| Door | `door` | `a heavy wooden door slamming shut` |
+
+**Rules of thumb:**
+- Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
+- Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
+- Avoid negations (`no background noise`) — the model does not understand negations in sync feature conditioning.
+
+### Masking note
+
+If the video frame contains multiple moving objects, CLIP and sync features will be diluted by irrelevant motion. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:
+
+- Connect the mask to the **`mask`** input on SelVA Feature Extractor.
+- Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
+- Re-extract features with a mask even if you already have `.npz` files — better features directly reduce training noise.
+
 ### 1.2 Collect clean audio
 
 For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
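A minimal sketch of checking the same-stem rule from section 1.2 before training. The helper name `find_unpaired` and the list of audio extensions are assumptions made for illustration; the README does not specify which audio formats are accepted.

```python
from pathlib import Path

# Assumed audio extensions; adjust to whatever the trainer actually accepts.
AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def find_unpaired(dataset_dir):
    """Return the sorted stems of .npz files that lack a same-stem audio file."""
    root = Path(dataset_dir)
    audio_stems = {
        p.stem
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTS
    }
    return sorted(p.stem for p in root.glob("*.npz") if p.stem not in audio_stems)
```

Running `find_unpaired("dataset/my_sound")` before training would list every clip that still needs its audio file.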
@@ -328,3 +358,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
 
 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
+
+---
+
+## Observations (work in progress)
+
+These are empirical findings from ongoing experiments. They will be promoted to the main guide once more validated.
+
+### Precision and batch size
+
+| Config | Smoothed loss at step 2000 | Notes |
+|---|---|---|
+| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
+| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
+| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
+| fp32 batch 32 | ~0.58 | Matches bf16 batch 16 at step 6000 already at step 2000 |
+
+**Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
+
+**Lower loss ≠ better audio.** Once overfitting begins the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
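The table above compares smoothed loss values. One common way to smooth a noisy loss history is a bias-corrected exponential moving average, sketched below. This is an illustration only and not necessarily what the trainer's `_smooth_losses` does; `ema_smooth` and `beta` are hypothetical names.

```python
def ema_smooth(losses, beta=0.9):
    """Bias-corrected exponential moving average of a list of loss values."""
    smoothed = []
    avg = 0.0
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Divide by (1 - beta**i) so early values are not pulled toward zero.
        smoothed.append(avg / (1.0 - beta ** i))
    return smoothed
```

A smoothed curve makes plateaus easier to spot, but as noted above it should be paired with listening tests on intermediate checkpoints.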
+
+### logit_normal vs uniform
+
+logit_normal consistently reaches a lower loss floor than uniform. However perceptual improvement is dataset-dependent — on 10 clips the difference is marginal. May be more impactful with larger datasets. No conclusion yet.
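For readers unfamiliar with the two options, here is a sketch of how the sampling schemes differ, assuming `logit_normal` refers to the usual logit-normal timestep sampling (a Gaussian sample passed through a sigmoid). The function names and the `mean`/`std` defaults are assumptions; the trainer's actual parameters are not shown in this diff.

```python
import math
import random


def sample_uniform(rng):
    """Draw a timestep t uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by squashing a Gaussian sample through a sigmoid."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def fraction_in_middle(sampler, n=10000, seed=0):
    """Fraction of n samples that land in the middle band 0.25 <= t <= 0.75."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n) if 0.25 <= sampler(rng) <= 0.75)
    return hits / n
```

With these defaults the logit-normal sampler concentrates more draws on mid-range timesteps than the uniform one. That changes which timesteps dominate the averaged loss, so loss values from the two schemes are not directly comparable.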
+
+### White noise
+
+Residual white noise on generated audio is primarily a **dataset** problem, not a training one. Observed with all configs on 10 clips. Likely causes:
+- Too few clips for the model to confidently predict the target sound
+- Imprecise extraction prompts producing unfocused sync features
+- Missing mask when multiple objects are in frame
+
+CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
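A toy illustration of the CFG remark above. It assumes standard classifier-free guidance (`uncond + scale * (cond - uncond)`) and assumes the adapter's noise bias appears only in the conditional prediction. Both are modelling assumptions for this sketch, and `cfg_combine` and `bias_after_cfg` are hypothetical helpers, not part of the node's API.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance mix, element-wise over two lists."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def bias_after_cfg(bias, scale):
    """How far a bias on the conditional branch shifts the guided output."""
    clean = cfg_combine([0.0], [1.0], scale)[0]
    biased = cfg_combine([0.0], [1.0 + bias], scale)[0]
    return biased - clean
```

Under these assumptions the shift grows linearly with the CFG scale, which is consistent with the observation that lowering CFG reduces audible noise.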
+61
-12
@@ -305,7 +305,24 @@ class SelvaLoraTrainer:
         feature_utils_orig = model["feature_utils"]
 
         data_dir = Path(data_dir.strip())
-        output_dir = Path(output_dir.strip())
+
+        _out_str = output_dir.strip()
+        _out_p = Path(_out_str)
+        # On Windows a Unix-style path like "/lora_output" is technically absolute
+        # (drive-relative) but the user almost certainly meant a subfolder of the
+        # ComfyUI output directory. Treat any non-absolute path AND any path whose
+        # only "absolute" anchor is a leading slash (no drive letter) as relative to
+        # the ComfyUI output folder.
+        import sys as _sys
+        _unix_style_on_windows = (
+            _sys.platform == "win32"
+            and _out_p.is_absolute()
+            and not _out_p.drive  # e.g. Path("/foo").drive == "" on Windows
+        )
+        if not _out_p.is_absolute() or _unix_style_on_windows:
+            _out_p = Path(folder_paths.get_output_directory()) / _out_p.relative_to(_out_p.anchor)
+            print(f"[LoRA Trainer] output_dir resolved to: {_out_p}", flush=True)
+        output_dir = _out_p
         output_dir.mkdir(parents=True, exist_ok=True)
 
         alpha_val = float(alpha) if alpha > 0.0 else float(rank)
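The path handling in this hunk can be tried on any OS with `PureWindowsPath`, which applies Windows path rules without touching the filesystem. The sketch below is a simplified stand-in rather than the trainer's code: `resolve_output_dir` and the example base directory are hypothetical. Per the pathlib documentation, a rooted Windows path with no drive (such as `/lora_output`) reports `is_absolute()` as `False`, so in the hunk above such input already appears to be caught by the `not _out_p.is_absolute()` branch.

```python
from pathlib import PureWindowsPath


def resolve_output_dir(raw, base):
    """Resolve a user-supplied path against a base directory using Windows rules."""
    p = PureWindowsPath(raw.strip())
    if p.drive:
        # Has a drive letter or UNC share: keep the path as given.
        return p
    if p.anchor:
        # Rooted but drive-less, e.g. "/lora_output": drop the leading slash.
        p = p.relative_to(p.anchor)
    return PureWindowsPath(base) / p
```

For example, `resolve_output_dir("/lora_output", "C:/ComfyUI/output")` and `resolve_output_dir("lora_output", "C:/ComfyUI/output")` both resolve to a `lora_output` subfolder of the base directory.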
@@ -370,7 +387,23 @@ class SelvaLoraTrainer:
                 # Text → CLIP features (reuse already-loaded CLIP from inference model)
                 text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
 
-                dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+                # Pad/trim clip and sync features to fixed seq lengths — clips from
+                # shorter videos have fewer frames and would cause stack() to fail
+                clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
+                c_tgt = seq_cfg.clip_seq_len
+                if clip_f.shape[1] < c_tgt:
+                    clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
+                elif clip_f.shape[1] > c_tgt:
+                    clip_f = clip_f[:, :c_tgt, :]
+
+                sync_f = bundle["sync_features"]  # [1, N_sync, 768]
+                s_tgt = seq_cfg.sync_seq_len
+                if sync_f.shape[1] < s_tgt:
+                    sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
+                elif sync_f.shape[1] > s_tgt:
+                    sync_f = sync_f[:, :s_tgt, :]
+
+                dataset.append((x1, clip_f, sync_f, text_clip))
             except Exception as e:
                 print(f" [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
                 traceback.print_exc()
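The pad-or-trim step in this hunk can be sketched in isolation. The example uses NumPy instead of PyTorch so it runs without a GPU stack; `fit_seq_len` is a hypothetical helper. In `F.pad(x, (0, 0, 0, k))` the pad tuple starts from the last dimension, so it leaves the feature axis alone and appends `k` zero rows to the sequence axis, which is what the `np.pad` call below does.

```python
import numpy as np


def fit_seq_len(feat, target_len):
    """Zero-pad or truncate a [1, N, D] array along axis 1 to target_len."""
    n = feat.shape[1]
    if n < target_len:
        # Pad only the end of the sequence axis; batch and feature axes unchanged.
        return np.pad(feat, ((0, 0), (0, target_len - n), (0, 0)))
    return feat[:, :target_len, :]
```

After this step every clip has the same shape, so features from short and long videos can be stacked into one batch.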
@@ -473,6 +506,9 @@ class SelvaLoraTrainer:
         print(f"\n[LoRA Trainer] Training {remaining} steps "
               f"(step {start_step + 1} → {steps}, batch_size={batch_size})\n", flush=True)
 
+        last_step = start_step
+        completed = False
+        try:
             for step in range(start_step + 1, steps + 1):
                 batch = random.choices(dataset, k=batch_size)
                 x1_list, clip_list, sync_list, text_list = zip(*batch)
@@ -537,32 +573,45 @@ class SelvaLoraTrainer:
                     sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
                     print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)
 
+                last_step = step
                 pbar_train.update(1)
 
-        # Save inference adapter (state_dict + meta only — SelvaLoraLoader compatible)
-        # Increment filename if a previous final already exists (resume case)
+            completed = True
+
+        finally:
+            # Save adapter and loss curves whether training completed or was cancelled.
+            # Skip if we never completed a single step (nothing useful to save).
+            if loss_history:
+                if completed:
+                    # Normal completion — use adapter_final.pt (increment if exists)
                     final_path = output_dir / "adapter_final.pt"
                     if final_path.exists():
                         i = 1
                         while (output_dir / f"adapter_final_{i:03d}.pt").exists():
                             i += 1
                         final_path = output_dir / f"adapter_final_{i:03d}.pt"
+                    label = "Done"
+                else:
+                    # Cancelled — include the step number so the file is useful for resume
+                    final_path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
+                    label = f"Cancelled at step {last_step}"
 
                 torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
                 (output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
-        print(f"\n[LoRA Trainer] Done. Adapter saved to {final_path}", flush=True)
-
-        # --- Return patched model ---
-        generator.eval()
-        generator.to(next(model["generator"].parameters()).device)
-        patched = {**model, "generator": generator}
+                print(f"\n[LoRA Trainer] {label}. Adapter saved to {final_path}", flush=True)
 
                 smoothed = _smooth_losses(loss_history)
                 raw_img = _draw_loss_curve(loss_history, log_interval, start_step)
-        smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step, smoothed=smoothed)
+                smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step,
+                                                smoothed=smoothed)
                 raw_img.save(str(output_dir / "loss_raw.png"))
                 smoothed_img.save(str(output_dir / "loss_smoothed.png"))
                 print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)
 
-        loss_curve = _pil_to_tensor(smoothed_img)
+        # Reached only on normal completion (exception re-raises past this point)
+        generator.eval()
+        generator.to(next(model["generator"].parameters()).device)
+        patched = {**model, "generator": generator}
+
+        loss_curve = _pil_to_tensor(smoothed_img)
         return (patched, str(final_path), loss_curve)
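The save-on-cancel pattern in this hunk can be sketched on its own. The example is a simplified stand-in, not the trainer's code: `Cancelled`, `next_final_path`, `run`, and the `cancel_at` parameter are hypothetical, and a text file stands in for the adapter checkpoint.

```python
from pathlib import Path


class Cancelled(Exception):
    """Stand-in for whatever exception a user cancel raises."""


def next_final_path(output_dir):
    """Return adapter_final.pt, or the first free adapter_final_NNN.pt."""
    final = output_dir / "adapter_final.pt"
    if not final.exists():
        return final
    i = 1
    while (output_dir / f"adapter_final_{i:03d}.pt").exists():
        i += 1
    return output_dir / f"adapter_final_{i:03d}.pt"


def run(steps, output_dir, cancel_at=None):
    """Run a fake training loop that always saves progress in `finally`."""
    output_dir = Path(output_dir)
    last_step, completed, history, path = 0, False, [], None
    try:
        for step in range(1, steps + 1):
            if step == cancel_at:
                raise Cancelled
            history.append(1.0 / step)  # stand-in for one training step
            last_step = step
        completed = True
    finally:
        if history:  # nothing useful to save if no step finished
            if completed:
                path = next_final_path(output_dir)
            else:
                path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
            path.write_text(f"steps={last_step}")
    return path
```

Because the save happens in `finally`, the file is written before the exception propagates, and the `return` is reached only on normal completion.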
+18
-1
@@ -284,7 +284,24 @@ def main():
             elif x1.shape[1] > tgt:
                 x1 = x1[:, :tgt, :]
             text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
-            dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+
+            # Pad/trim clip and sync features to fixed seq lengths — shorter clips
+            # have fewer frames and would cause stack() to fail during batching
+            clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
+            c_tgt = seq_cfg.clip_seq_len
+            if clip_f.shape[1] < c_tgt:
+                clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
+            elif clip_f.shape[1] > c_tgt:
+                clip_f = clip_f[:, :c_tgt, :]
+
+            sync_f = bundle["sync_features"]  # [1, N_sync, 768]
+            s_tgt = seq_cfg.sync_seq_len
+            if sync_f.shape[1] < s_tgt:
+                sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
+            elif sync_f.shape[1] > s_tgt:
+                sync_f = sync_f[:, :s_tgt, :]
+
+            dataset.append((x1, clip_f, sync_f, text_clip))
         except Exception as e:
             print(f" [LoRA] Warning: failed to process {npz_path.name}: {e}")
 