Compare commits
5 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 95136b53a0 | |
| | 8f31d00beb | |
| | 3ee1893e10 | |
| | c86258d48f | |
| | 8338560600 | |
+64
-2
@@ -36,11 +36,41 @@ For each video clip you want to train on:
2. Connect it to **SelVA Feature Extractor**.
3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features; be specific.
5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features; be as specific as possible (see prompt guide below).
6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible; see masking note below).

> **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time (see Step 2).

|
### Prompt guide

The prompt is not just a label: it directly shapes what the Synchformer pays attention to in the video. Imprecise prompts produce unfocused sync features, which the LoRA then has to compensate for, often introducing noise.

**Good prompts are specific about:**
- The sound source (what object is making the sound)
- The acoustic character (loud/quiet, sharp/soft, wet/dry)
- The action producing the sound (if applicable)

| Sound | Weak prompt | Strong prompt |
|---|---|---|
| Dog bark | `dog` | `a large dog barking loudly` |
| Footsteps | `walking` | `heavy boots on a wooden floor` |
| Water | `water` | `water dripping into a metal bucket` |
| Explosion | `explosion` | `a large explosion with deep bass rumble` |
| Door | `door` | `a heavy wooden door slamming shut` |

**Rules of thumb:**
- Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
- Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
- Avoid negations (`no background noise`); the model does not understand negations in sync feature conditioning.
|
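The consistency rule above can be checked mechanically before training. The following is a minimal sketch, not part of the node pack: it assumes you keep your own record of which prompt was used for each `.npz` file (the `prompts_by_file` mapping is hypothetical) and that the sound class is the filename stem without its trailing `_NNN` counter.

```python
import re
from collections import defaultdict


def find_inconsistent_prompts(prompts_by_file):
    """Return sound classes that were extracted with more than one prompt.

    `prompts_by_file` is a hypothetical mapping of feature filename -> prompt.
    The sound class is the filename stem with its trailing `_NNN` removed.
    """
    prompts_by_class = defaultdict(set)
    for filename, prompt in prompts_by_file.items():
        stem = filename.rsplit(".", 1)[0]
        sound_class = re.sub(r"_\d+$", "", stem)
        prompts_by_class[sound_class].add(prompt.strip())
    # Keep only the classes where at least two distinct prompts were seen.
    return {c: sorted(p) for c, p in prompts_by_class.items() if len(p) > 1}


prompts = {
    "dog_bark_001.npz": "a dog barking",
    "dog_bark_002.npz": "loud barking dog",
    "door_slam_001.npz": "a heavy wooden door slamming shut",
}
conflicts = find_inconsistent_prompts(prompts)
print(conflicts)  # {'dog_bark': ['a dog barking', 'loud barking dog']}
```

Any class reported here would need its clips re-extracted with a single prompt.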
### Masking note

If the video frame contains multiple moving objects, CLIP and sync features will be diluted by irrelevant motion. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:

- Connect the mask to the **`mask`** input on SelVA Feature Extractor.
- Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
- Re-extract features with a mask even if you already have `.npz` files, since better features directly reduce training noise.

### 1.2 Collect clean audio

For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
||||||
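As a quick sanity check on the stem-matching requirement, a small script can list feature files that have no audio partner. This is only a sketch: the set of accepted audio extensions below is an assumption, since the supported formats are not shown in this excerpt.

```python
import tempfile
from pathlib import Path

# Assumption: adjust to whatever audio formats the trainer actually accepts.
AUDIO_EXTS = {".wav", ".flac"}


def find_unpaired_features(data_dir):
    """Return names of .npz files with no same-stem audio file next to them."""
    data_dir = Path(data_dir)
    audio_stems = {
        p.stem for p in data_dir.iterdir() if p.suffix.lower() in AUDIO_EXTS
    }
    return sorted(
        p.name for p in data_dir.glob("*.npz") if p.stem not in audio_stems
    )


# Demo on a throwaway directory with one paired and one unpaired clip.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("dog_bark_001.npz", "dog_bark_001.wav", "dog_bark_002.npz"):
        (Path(tmp) / name).touch()
    unpaired = find_unpaired_features(tmp)

print(unpaired)  # ['dog_bark_002.npz']
```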
@@ -328,3 +358,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
|
|||||||
|
|
||||||
**Loss plateaus early (above 0.7)**
Dataset is the bottleneck. Add more clips; diversity matters more than quantity.

---

## Observations (work in progress)

These are empirical findings from ongoing experiments. They will be promoted to the main guide once they are better validated.

### Precision and batch size

| Config | Smoothed loss at step 2000 | Notes |
|---|---|---|
| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
| fp32 batch 32 | ~0.58 | Already at step 2000, matches bf16 batch 16 at step 6000 |

**Key finding:** fp32 batch 32 reaches in ~2000 steps the perceptual quality point that bf16 batch 16 needs 6000+ steps to reach. However, on small datasets (10 clips) the fp32 batch 32 loss continues descending well past that point and the model eventually overfits. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset**: later checkpoints sound worse despite lower loss.

**Lower loss ≠ better audio.** Once overfitting begins, the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.

### logit_normal vs uniform

`logit_normal` consistently reaches a lower loss floor than `uniform`. However, the perceptual improvement is dataset-dependent: on 10 clips the difference is marginal. It may be more impactful with larger datasets. No conclusion yet.
|
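For readers unfamiliar with the two option names, here is a rough illustration. It assumes that `logit_normal` and `uniform` refer to how the training timestep `t` in (0, 1) is drawn, which this excerpt does not state explicitly; the function names and the `mean`/`std` parameters are hypothetical and not taken from the trainer.

```python
import math
import random


def sample_uniform(rng):
    """Draw t uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t as the sigmoid of a Gaussian sample, so t lies in (0, 1)."""
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def middle_fraction(samples):
    """Fraction of samples that fall in the middle half, [0.25, 0.75]."""
    return sum(0.25 <= t <= 0.75 for t in samples) / len(samples)


rng = random.Random(0)
uniform_ts = [sample_uniform(rng) for _ in range(10_000)]
logit_normal_ts = [sample_logit_normal(rng) for _ in range(10_000)]

# Logit-normal sampling concentrates draws around t = 0.5.
print(middle_fraction(uniform_ts), middle_fraction(logit_normal_ts))
```

Under this assumption, the two configs spend their training steps on different timestep ranges, which is one reason their loss values may not be directly comparable.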
### White noise

Residual white noise on generated audio is primarily a **dataset** problem, not a training one. It was observed with all configs on 10 clips. Likely causes:
- Too few clips for the model to confidently predict the target sound
- Imprecise extraction prompts producing unfocused sync features
- Missing mask when multiple objects are in frame

CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
+61
-12
@@ -305,7 +305,24 @@ class SelvaLoraTrainer:
|
|||||||
feature_utils_orig = model["feature_utils"]
|
feature_utils_orig = model["feature_utils"]
|
||||||
|
|
||||||
data_dir = Path(data_dir.strip())
|
data_dir = Path(data_dir.strip())
|
||||||
output_dir = Path(output_dir.strip())
|
|
||||||
|
_out_str = output_dir.strip()
|
||||||
|
_out_p = Path(_out_str)
|
||||||
|
# On Windows a Unix-style path like "/lora_output" is technically absolute
|
||||||
|
# (drive-relative) but the user almost certainly meant a subfolder of the
|
||||||
|
# ComfyUI output directory. Treat any non-absolute path AND any path whose
|
||||||
|
# only "absolute" anchor is a leading slash (no drive letter) as relative to
|
||||||
|
# the ComfyUI output folder.
|
||||||
|
import sys as _sys
|
||||||
|
_unix_style_on_windows = (
|
||||||
|
_sys.platform == "win32"
|
||||||
|
and _out_p.is_absolute()
|
||||||
|
and not _out_p.drive # e.g. Path("/foo").drive == "" on Windows
|
||||||
|
)
|
||||||
|
if not _out_p.is_absolute() or _unix_style_on_windows:
|
||||||
|
_out_p = Path(folder_paths.get_output_directory()) / _out_p.relative_to(_out_p.anchor)
|
||||||
|
print(f"[LoRA Trainer] output_dir resolved to: {_out_p}", flush=True)
|
||||||
|
output_dir = _out_p
|
||||||
output_dir.mkdir(parents=True, exist_ok=True)
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
alpha_val = float(alpha) if alpha > 0.0 else float(rank)
|
alpha_val = float(alpha) if alpha > 0.0 else float(rank)
|
||||||
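The intent of the `output_dir` change above can be modelled without touching the filesystem. The sketch below is a simplified stand-in, not the trainer's code: `resolve_output_dir` and `BASE` are hypothetical names, and `BASE` replaces the value returned by `folder_paths.get_output_directory()`. It uses pathlib's pure path classes so the Windows behaviour can be exercised on any OS. Note that `PureWindowsPath` reports a leading-slash path without a drive letter as not absolute, so such a path is re-rooted by the plain "not absolute" branch.

```python
from pathlib import PurePosixPath, PureWindowsPath

BASE = "C:/ComfyUI/output"  # hypothetical ComfyUI output directory


def resolve_output_dir(raw, base, path_cls):
    """Re-root `raw` under `base` unless it is a fully qualified absolute path."""
    p = path_cls(raw.strip())
    if p.is_absolute():
        return p
    # Strip a bare leading slash (the anchor) and keep the relative parts.
    return path_cls(base) / p.relative_to(p.anchor)


relative = resolve_output_dir("lora_output", BASE, PureWindowsPath)
unix_style = resolve_output_dir("/lora_output", BASE, PureWindowsPath)
with_drive = resolve_output_dir("D:/runs/lora", BASE, PureWindowsPath)
posix_abs = resolve_output_dir("/lora_output", "/comfy/output", PurePosixPath)

print(relative)    # C:\ComfyUI\output\lora_output
print(unix_style)  # C:\ComfyUI\output\lora_output
print(with_drive)  # D:\runs\lora
print(posix_abs)   # /lora_output
```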
@@ -370,7 +387,23 @@ class SelvaLoraTrainer:
|
|||||||
# Text → CLIP features (reuse already-loaded CLIP from inference model)
|
# Text → CLIP features (reuse already-loaded CLIP from inference model)
|
||||||
text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
|
text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
|
||||||
|
|
||||||
dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
|
# Pad/trim clip and sync features to fixed seq lengths — clips from
|
||||||
|
# shorter videos have fewer frames and would cause stack() to fail
|
||||||
|
clip_f = bundle["clip_features"] # [1, N_clip, 1024]
|
||||||
|
c_tgt = seq_cfg.clip_seq_len
|
||||||
|
if clip_f.shape[1] < c_tgt:
|
||||||
|
clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
|
||||||
|
elif clip_f.shape[1] > c_tgt:
|
||||||
|
clip_f = clip_f[:, :c_tgt, :]
|
||||||
|
|
||||||
|
sync_f = bundle["sync_features"] # [1, N_sync, 768]
|
||||||
|
s_tgt = seq_cfg.sync_seq_len
|
||||||
|
if sync_f.shape[1] < s_tgt:
|
||||||
|
sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
|
||||||
|
elif sync_f.shape[1] > s_tgt:
|
||||||
|
sync_f = sync_f[:, :s_tgt, :]
|
||||||
|
|
||||||
|
dataset.append((x1, clip_f, sync_f, text_clip))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f" [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
|
print(f" [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
|
||||||
traceback.print_exc()
|
traceback.print_exc()
|
||||||
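The pad/trim step in this hunk is repeated for both feature types and could be expressed as one helper. The following sketch shows the idea with NumPy instead of PyTorch so that it runs without a GPU stack: `np.pad` stands in for `F.pad(x, (0, 0, 0, k))`, which likewise pads only the end of the second-to-last axis with zeros. The helper name `pad_or_trim` is hypothetical.

```python
import numpy as np


def pad_or_trim(features, target_len):
    """Force axis 1 of a [1, N, D] array to exactly `target_len` rows."""
    n = features.shape[1]
    if n < target_len:
        # Zero-pad at the end of the sequence axis only.
        pad_width = ((0, 0), (0, target_len - n), (0, 0))
        return np.pad(features, pad_width)
    # Equal length is returned unchanged; longer sequences are truncated.
    return features[:, :target_len, :]


short_clip = np.ones((1, 5, 4), dtype=np.float32)
long_clip = np.ones((1, 12, 4), dtype=np.float32)

# After normalisation both clips can be combined into one batch.
batch = np.concatenate(
    [pad_or_trim(short_clip, 8), pad_or_trim(long_clip, 8)], axis=0
)
print(batch.shape)  # (2, 8, 4)
```

Zero padding is what the diff does; whether padded frames should also be masked during training is a separate question that this sketch does not address.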
@@ -473,6 +506,9 @@ class SelvaLoraTrainer:
|
|||||||
print(f"\n[LoRA Trainer] Training {remaining} steps "
|
print(f"\n[LoRA Trainer] Training {remaining} steps "
|
||||||
f"(step {start_step + 1} → {steps}, batch_size={batch_size})\n", flush=True)
|
f"(step {start_step + 1} → {steps}, batch_size={batch_size})\n", flush=True)
|
||||||
|
|
||||||
|
last_step = start_step
|
||||||
|
completed = False
|
||||||
|
try:
|
||||||
for step in range(start_step + 1, steps + 1):
|
for step in range(start_step + 1, steps + 1):
|
||||||
batch = random.choices(dataset, k=batch_size)
|
batch = random.choices(dataset, k=batch_size)
|
||||||
x1_list, clip_list, sync_list, text_list = zip(*batch)
|
x1_list, clip_list, sync_list, text_list = zip(*batch)
|
||||||
@@ -537,32 +573,45 @@ class SelvaLoraTrainer:
|
|||||||
sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
|
sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
|
||||||
print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)
|
print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)
|
||||||
|
|
||||||
|
last_step = step
|
||||||
pbar_train.update(1)
|
pbar_train.update(1)
|
||||||
|
|
||||||
# Save inference adapter (state_dict + meta only — SelvaLoraLoader compatible)
|
completed = True
|
||||||
# Increment filename if a previous final already exists (resume case)
|
|
||||||
|
finally:
|
||||||
|
# Save adapter and loss curves whether training completed or was cancelled.
|
||||||
|
# Skip if we never completed a single step (nothing useful to save).
|
||||||
|
if loss_history:
|
||||||
|
if completed:
|
||||||
|
# Normal completion — use adapter_final.pt (increment if exists)
|
||||||
final_path = output_dir / "adapter_final.pt"
|
final_path = output_dir / "adapter_final.pt"
|
||||||
if final_path.exists():
|
if final_path.exists():
|
||||||
i = 1
|
i = 1
|
||||||
while (output_dir / f"adapter_final_{i:03d}.pt").exists():
|
while (output_dir / f"adapter_final_{i:03d}.pt").exists():
|
||||||
i += 1
|
i += 1
|
||||||
final_path = output_dir / f"adapter_final_{i:03d}.pt"
|
final_path = output_dir / f"adapter_final_{i:03d}.pt"
|
||||||
|
label = "Done"
|
||||||
|
else:
|
||||||
|
# Cancelled — include the step number so the file is useful for resume
|
||||||
|
final_path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
|
||||||
|
label = f"Cancelled at step {last_step}"
|
||||||
|
|
||||||
torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
|
torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
|
||||||
(output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
|
(output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
|
||||||
print(f"\n[LoRA Trainer] Done. Adapter saved to {final_path}", flush=True)
|
print(f"\n[LoRA Trainer] {label}. Adapter saved to {final_path}", flush=True)
|
||||||
|
|
||||||
# --- Return patched model ---
|
|
||||||
generator.eval()
|
|
||||||
generator.to(next(model["generator"].parameters()).device)
|
|
||||||
patched = {**model, "generator": generator}
|
|
||||||
|
|
||||||
smoothed = _smooth_losses(loss_history)
|
smoothed = _smooth_losses(loss_history)
|
||||||
raw_img = _draw_loss_curve(loss_history, log_interval, start_step)
|
raw_img = _draw_loss_curve(loss_history, log_interval, start_step)
|
||||||
smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step, smoothed=smoothed)
|
smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step,
|
||||||
|
smoothed=smoothed)
|
||||||
raw_img.save(str(output_dir / "loss_raw.png"))
|
raw_img.save(str(output_dir / "loss_raw.png"))
|
||||||
smoothed_img.save(str(output_dir / "loss_smoothed.png"))
|
smoothed_img.save(str(output_dir / "loss_smoothed.png"))
|
||||||
print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)
|
print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)
|
||||||
|
|
||||||
loss_curve = _pil_to_tensor(smoothed_img)
|
# Reached only on normal completion (exception re-raises past this point)
|
||||||
|
generator.eval()
|
||||||
|
generator.to(next(model["generator"].parameters()).device)
|
||||||
|
patched = {**model, "generator": generator}
|
||||||
|
|
||||||
|
loss_curve = _pil_to_tensor(smoothed_img)
|
||||||
return (patched, str(final_path), loss_curve)
|
return (patched, str(final_path), loss_curve)
|
||||||
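The save-on-cancel behaviour introduced in this hunk follows a common `try`/`finally` pattern. The sketch below isolates that pattern with hypothetical `run_step` and `save` callables. It is a simplification: the trainer guards the save with `if loss_history:`, while this version checks whether at least one step finished.

```python
def train(steps, run_step, save):
    """Run `steps` steps and always save, naming the file by outcome."""
    last_step = 0
    completed = False
    try:
        for step in range(1, steps + 1):
            run_step(step)
            last_step = step
        completed = True
    finally:
        # Runs on normal completion and when an exception interrupts the loop.
        if last_step > 0:
            if completed:
                name = "adapter_final.pt"
            else:
                name = f"adapter_cancelled_step{last_step:05d}.pt"
            save(name)
    # Reached only on normal completion; an exception propagates past here.
    return last_step


saved = []


def flaky_step(step):
    # Stand-in for a user cancellation arriving during step 3.
    if step == 3:
        raise RuntimeError("cancelled by user")


try:
    train(5, flaky_step, saved.append)
except RuntimeError:
    pass
train(2, lambda step: None, saved.append)
print(saved)  # ['adapter_cancelled_step00002.pt', 'adapter_final.pt']
```

Because `finally` does not swallow the exception, the caller still sees the cancellation, which matches the comment in the diff that the return statement is reached only on normal completion.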
|
|||||||
+18
-1
@@ -284,7 +284,24 @@ def main():
|
|||||||
elif x1.shape[1] > tgt:
|
elif x1.shape[1] > tgt:
|
||||||
x1 = x1[:, :tgt, :]
|
x1 = x1[:, :tgt, :]
|
||||||
text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
|
text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
|
||||||
dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
|
|
||||||
|
# Pad/trim clip and sync features to fixed seq lengths — shorter clips
|
||||||
|
# have fewer frames and would cause stack() to fail during batching
|
||||||
|
clip_f = bundle["clip_features"] # [1, N_clip, 1024]
|
||||||
|
c_tgt = seq_cfg.clip_seq_len
|
||||||
|
if clip_f.shape[1] < c_tgt:
|
||||||
|
clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
|
||||||
|
elif clip_f.shape[1] > c_tgt:
|
||||||
|
clip_f = clip_f[:, :c_tgt, :]
|
||||||
|
|
||||||
|
sync_f = bundle["sync_features"] # [1, N_sync, 768]
|
||||||
|
s_tgt = seq_cfg.sync_seq_len
|
||||||
|
if sync_f.shape[1] < s_tgt:
|
||||||
|
sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
|
||||||
|
elif sync_f.shape[1] > s_tgt:
|
||||||
|
sync_f = sync_f[:, :s_tgt, :]
|
||||||
|
|
||||||
|
dataset.append((x1, clip_f, sync_f, text_clip))
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
print(f" [LoRA] Warning: failed to process {npz_path.name}: {e}")
|
print(f" [LoRA] Warning: failed to process {npz_path.name}: {e}")
|
||||||
|
|
||||||
|
|||||||