5 Commits

Author SHA1 Message Date
Ethanfel 95136b53a0 docs: add observations section with fp32/batch/precision findings
Work-in-progress empirical notes: fp32 batch 32 reaches same quality as
bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips.
Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 02:34:52 +02:00
Ethanfel 8f31d00beb docs: add prompt guide and masking note to dataset preparation section
Poor prompts and missing masks are a common source of white noise in LoRA
training — imprecise sync features force the adapter to compensate with noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:43:35 +02:00
Ethanfel 3ee1893e10 fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder
On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.
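
A minimal sketch of the redirect rule described above, written against `pathlib`'s pure path classes so it can be run on any OS. The function name `resolve_output_dir` and the `windows` flag are hypothetical and only serve the illustration; the trainer itself obtains the output root from ComfyUI.

```python
from pathlib import PurePosixPath, PureWindowsPath


def resolve_output_dir(raw, output_root, windows):
    """Redirect relative and drive-less rooted paths under output_root."""
    cls = PureWindowsPath if windows else PurePosixPath
    p = cls(raw.strip())
    root = cls(output_root)
    # A fully qualified path (drive + root on Windows, root on POSIX) is kept.
    if p.is_absolute():
        return p
    # p.parts starts with the anchor (e.g. "\\") when one is present;
    # drop it so the remainder can be joined onto the output root.
    parts = p.parts[1:] if p.anchor else p.parts
    return root.joinpath(*parts)


print(resolve_output_dir("/lora_output", "C:/ComfyUI/output", windows=True))
print(resolve_output_dir("lora_output", "C:/ComfyUI/output", windows=True))
print(resolve_output_dir("D:/runs/x", "C:/ComfyUI/output", windows=True))
```

With the Windows flavour, the first two calls land under the output root, while the fully qualified `D:` path is returned unchanged.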

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:13:59 +02:00
Ethanfel c86258d48f fix: save adapter and loss curves on cancel, not only on normal completion
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.
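
The save-on-cancel pattern can be sketched in isolation as follows. This is a toy loop, not the trainer's code: the `save` callback, the `cancel_at` parameter, and the use of `KeyboardInterrupt` as the cancellation signal are all assumptions made for the example.

```python
def train(steps, save, cancel_at=None):
    """Run a dummy loop; always call save() with a name reflecting the outcome."""
    last_step = 0
    completed = False
    losses = []
    try:
        for step in range(1, steps + 1):
            if step == cancel_at:
                # Stand-in for a user interrupt raised mid-training.
                raise KeyboardInterrupt
            losses.append(1.0 / step)  # placeholder loss value
            last_step = step
        completed = True
    finally:
        # Runs on normal exit and also while an exception propagates.
        # Skip saving if no step finished (nothing useful to write).
        if losses:
            name = ("adapter_final.pt" if completed
                    else f"adapter_cancelled_step{last_step:05d}.pt")
            save(name)
    return last_step


saved = []
try:
    train(10, saved.append, cancel_at=5)
except KeyboardInterrupt:
    pass
print(saved)
```

The exception still propagates to the caller after the `finally` block runs, so code placed after the `try`/`finally` is reached only on normal completion.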

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:07:04 +02:00
Ethanfel 8338560600 fix: pad/trim clip and sync features to fixed seq_len at dataset load time
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
same as latents are already normalized to latent_seq_len.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:51:45 +02:00
3 changed files with 208 additions and 80 deletions
+64 -2
@@ -36,11 +36,41 @@ For each video clip you want to train on:
2. Connect it to **SelVA Feature Extractor**.
3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features — be as specific as possible (see prompt guide below).
6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible — see masking note below).
> **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.
### Prompt guide
The prompt is not just a label — it directly shapes what the Synchformer pays attention to in the video. Imprecise prompts produce unfocused sync features, which the LoRA then has to compensate for, often introducing noise.
**Good prompts are specific about:**
- The sound source (what object is making the sound)
- The acoustic character (loud/quiet, sharp/soft, wet/dry)
- The action producing the sound (if applicable)
| Sound | Weak prompt | Strong prompt |
|---|---|---|
| Dog bark | `dog` | `a large dog barking loudly` |
| Footsteps | `walking` | `heavy boots on a wooden floor` |
| Water | `water` | `water dripping into a metal bucket` |
| Explosion | `explosion` | `a large explosion with deep bass rumble` |
| Door | `door` | `a heavy wooden door slamming shut` |
**Rules of thumb:**
- Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
- Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
- Avoid negations (`no background noise`) — the model does not understand negations in sync feature conditioning.
### Masking note
If the video frame contains multiple moving objects, CLIP and sync features will be diluted by irrelevant motion. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:
- Connect the mask to the **`mask`** input on SelVA Feature Extractor.
- Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
- Re-extract features with a mask even if you already have `.npz` files — better features directly reduce training noise.
### 1.2 Collect clean audio
For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
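
A quick way to check the pairing before training is to list every `.npz` file that has no audio file with the same stem next to it. This is a sketch only: the helper name `find_unpaired` is hypothetical, and the set of audio extensions is an assumption that should be adjusted to whatever the trainer actually accepts.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust to what the trainer accepts.
AUDIO_EXTS = {".wav", ".flac", ".mp3"}


def find_unpaired(dataset_dir):
    """Return stems of .npz files with no same-stem audio file beside them."""
    folder = Path(dataset_dir)
    audio_stems = {p.stem for p in folder.iterdir()
                   if p.suffix.lower() in AUDIO_EXTS}
    return sorted(p.stem for p in folder.glob("*.npz")
                  if p.stem not in audio_stems)
```

Calling `find_unpaired("dataset/my_sound")` returns an empty list when every feature file has its audio counterpart.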
@@ -328,3 +358,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
**Loss plateaus early (above 0.7)**
Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
---
## Observations (work in progress)
These are empirical findings from ongoing experiments. They will be promoted to the main guide once they are better validated.
### Precision and batch size
| Config | Smoothed loss at step 2000 | Notes |
|---|---|---|
| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
| fp32 batch 32 | ~0.58 | At step 2000, already matches the loss bf16 batch 16 reaches at step 6000 |
**Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
**Lower loss ≠ better audio.** Once overfitting begins the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
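
The loss values in the table above are smoothed. The trainer's own `_smooth_losses` helper is not shown in this diff, so the following is only one common way to do it, a bias-corrected exponential moving average; the function name `ema_smooth` and the default `beta` are assumptions.

```python
def ema_smooth(losses, beta=0.9):
    """Bias-corrected exponential moving average of a loss sequence."""
    smoothed, avg = [], 0.0
    for i, loss in enumerate(losses, start=1):
        avg = beta * avg + (1.0 - beta) * loss
        # Divide by (1 - beta**i) so early values are not pulled toward 0.
        smoothed.append(avg / (1.0 - beta ** i))
    return smoothed


raw = [0.90, 0.70, 0.80, 0.60, 0.65, 0.55]
print([round(v, 3) for v in ema_smooth(raw)])
```

A smoothed curve makes plateaus easier to spot, but as noted above it says nothing about perceptual quality, so listening to intermediate checkpoints remains the deciding test.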
### logit_normal vs uniform
logit_normal consistently reaches a lower loss floor than uniform. However, the perceptual improvement is dataset-dependent: on 10 clips the difference is marginal. It may matter more with larger datasets. No conclusion yet.
### White noise
Residual white noise on generated audio is primarily a **dataset** problem, not a training one. Observed with all configs on 10 clips. Likely causes:
- Too few clips for the model to confidently predict the target sound
- Imprecise extraction prompts producing unfocused sync features
- Missing mask when multiple objects are in frame
CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
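
To see why the guidance scale matters, consider the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`. Whether the SelVA sampler uses exactly this form is an assumption, and the numbers below are made up: the conditional prediction carries a small constant offset standing in for an adapter noise bias, and the sketch shows that offset growing linearly with the scale.

```python
def cfg_combine(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical predictions; cond differs from uncond by a +0.02 offset
# that stands in for an adapter-induced noise bias.
uncond = [0.10, -0.20, 0.05]
cond = [0.12, -0.18, 0.07]

for scale in (3.0, 4.5, 7.0):
    out = cfg_combine(uncond, cond, scale)
    # Offset of the guided output relative to the unconditional prediction.
    bias = [round(o - u, 3) for o, u in zip(out, uncond)]
    print(scale, bias)
```

Under this simplified model, lowering the scale shrinks the bias proportionally, which is consistent with the observation that a lower CFG value reduces audible noise.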
+61 -12
@@ -305,7 +305,24 @@ class SelvaLoraTrainer:
feature_utils_orig = model["feature_utils"]
data_dir = Path(data_dir.strip())
output_dir = Path(output_dir.strip())
_out_str = output_dir.strip()
_out_p = Path(_out_str)
# On Windows a Unix-style path like "/lora_output" is rooted but has no drive
# letter (drive-relative), so it would resolve against the current drive, e.g.
# C:\lora_output. The user almost certainly meant a subfolder of the ComfyUI
# output directory. Treat any non-absolute path AND any rooted path without a
# drive letter as relative to the ComfyUI output folder.
import sys as _sys
_unix_style_on_windows = (
_sys.platform == "win32"
and _out_p.is_absolute()
and not _out_p.drive # e.g. Path("/foo").drive == "" on Windows
)
if not _out_p.is_absolute() or _unix_style_on_windows:
_out_p = Path(folder_paths.get_output_directory()) / _out_p.relative_to(_out_p.anchor)
print(f"[LoRA Trainer] output_dir resolved to: {_out_p}", flush=True)
output_dir = _out_p
output_dir.mkdir(parents=True, exist_ok=True)
alpha_val = float(alpha) if alpha > 0.0 else float(rank)
@@ -370,7 +387,23 @@ class SelvaLoraTrainer:
# Text → CLIP features (reuse already-loaded CLIP from inference model)
text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
# Pad/trim clip and sync features to fixed seq lengths — clips from
# shorter videos have fewer frames and would cause stack() to fail
clip_f = bundle["clip_features"] # [1, N_clip, 1024]
c_tgt = seq_cfg.clip_seq_len
if clip_f.shape[1] < c_tgt:
clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
elif clip_f.shape[1] > c_tgt:
clip_f = clip_f[:, :c_tgt, :]
sync_f = bundle["sync_features"] # [1, N_sync, 768]
s_tgt = seq_cfg.sync_seq_len
if sync_f.shape[1] < s_tgt:
sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
elif sync_f.shape[1] > s_tgt:
sync_f = sync_f[:, :s_tgt, :]
dataset.append((x1, clip_f, sync_f, text_clip))
except Exception as e:
print(f" [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
traceback.print_exc()
@@ -473,6 +506,9 @@ class SelvaLoraTrainer:
print(f"\n[LoRA Trainer] Training {remaining} steps "
f"(step {start_step + 1}–{steps}, batch_size={batch_size})\n", flush=True)
last_step = start_step
completed = False
try:
for step in range(start_step + 1, steps + 1):
batch = random.choices(dataset, k=batch_size)
x1_list, clip_list, sync_list, text_list = zip(*batch)
@@ -537,32 +573,45 @@ class SelvaLoraTrainer:
sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)
last_step = step
pbar_train.update(1)
# Save inference adapter (state_dict + meta only — SelvaLoraLoader compatible)
# Increment filename if a previous final already exists (resume case)
completed = True
finally:
# Save adapter and loss curves whether training completed or was cancelled.
# Skip if we never completed a single step (nothing useful to save).
if loss_history:
if completed:
# Normal completion — use adapter_final.pt (increment if exists)
final_path = output_dir / "adapter_final.pt"
if final_path.exists():
i = 1
while (output_dir / f"adapter_final_{i:03d}.pt").exists():
i += 1
final_path = output_dir / f"adapter_final_{i:03d}.pt"
label = "Done"
else:
# Cancelled — include the step number so the file is useful for resume
final_path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
label = f"Cancelled at step {last_step}"
torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
(output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
print(f"\n[LoRA Trainer] Done. Adapter saved to {final_path}", flush=True)
# --- Return patched model ---
generator.eval()
generator.to(next(model["generator"].parameters()).device)
patched = {**model, "generator": generator}
print(f"\n[LoRA Trainer] {label}. Adapter saved to {final_path}", flush=True)
smoothed = _smooth_losses(loss_history)
raw_img = _draw_loss_curve(loss_history, log_interval, start_step)
smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step, smoothed=smoothed)
smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step,
smoothed=smoothed)
raw_img.save(str(output_dir / "loss_raw.png"))
smoothed_img.save(str(output_dir / "loss_smoothed.png"))
print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)
loss_curve = _pil_to_tensor(smoothed_img)
# Reached only on normal completion (exception re-raises past this point)
generator.eval()
generator.to(next(model["generator"].parameters()).device)
patched = {**model, "generator": generator}
loss_curve = _pil_to_tensor(smoothed_img)
return (patched, str(final_path), loss_curve)
+18 -1
@@ -284,7 +284,24 @@ def main():
elif x1.shape[1] > tgt:
x1 = x1[:, :tgt, :]
text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
# Pad/trim clip and sync features to fixed seq lengths — shorter clips
# have fewer frames and would cause stack() to fail during batching
clip_f = bundle["clip_features"] # [1, N_clip, 1024]
c_tgt = seq_cfg.clip_seq_len
if clip_f.shape[1] < c_tgt:
clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
elif clip_f.shape[1] > c_tgt:
clip_f = clip_f[:, :c_tgt, :]
sync_f = bundle["sync_features"] # [1, N_sync, 768]
s_tgt = seq_cfg.sync_seq_len
if sync_f.shape[1] < s_tgt:
sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
elif sync_f.shape[1] > s_tgt:
sync_f = sync_f[:, :s_tgt, :]
dataset.append((x1, clip_f, sync_f, text_clip))
except Exception as e:
print(f" [LoRA] Warning: failed to process {npz_path.name}: {e}")