docs: add observations section with fp32/batch/precision findings

Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 changed files with 208 additions and 80 deletions
@@ -36,11 +36,41 @@ For each video clip you want to train on:
 2. Connect it to **SelVA Feature Extractor**.
 3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
 4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
-5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
-6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
+5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features — be as specific as possible (see prompt guide below).
+6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible — see masking note below).

 > **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.

+### Prompt guide
+
+The prompt is not just a label — it directly shapes what the Synchformer pays attention to in the video. Imprecise prompts produce unfocused sync features, which the LoRA then has to compensate for, often introducing noise.
+
+**Good prompts are specific about:**
+- The sound source (what object is making the sound)
+- The acoustic character (loud/quiet, sharp/soft, wet/dry)
+- The action producing the sound (if applicable)
+
+| Sound | Weak prompt | Strong prompt |
+|---|---|---|
+| Dog bark | `dog` | `a large dog barking loudly` |
+| Footsteps | `walking` | `heavy boots on a wooden floor` |
+| Water | `water` | `water dripping into a metal bucket` |
+| Explosion | `explosion` | `a large explosion with deep bass rumble` |
+| Door | `door` | `a heavy wooden door slamming shut` |
+
+**Rules of thumb:**
+- Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
+- Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
+- Avoid negations (`no background noise`) — the model does not understand negations in sync feature conditioning.
+
+### Masking note
+
+If the video frame contains multiple moving objects, CLIP and sync features will be diluted by irrelevant motion. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:
+
+- Connect the mask to the **`mask`** input on SelVA Feature Extractor.
+- Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
+- Re-extract features with a mask even if you already have `.npz` files — better features directly reduce training noise.
+
 ### 1.2 Collect clean audio

 For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
@@ -328,3 +358,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th

 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
+
+---
+
+## Observations (work in progress)
+
+These are empirical findings from ongoing experiments. They will be promoted to the main guide once they are better validated.
+
+### Precision and batch size
+
+| Config | Smoothed loss at step 2000 | Notes |
+|---|---|---|
+| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
+| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
+| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
+| fp32 batch 32 | ~0.58 | Already at step 2000, matches the loss bf16 batch 16 reaches around step 6000 |
+
+**Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
+
+**Lower loss ≠ better audio.** Once overfitting begins, the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
+
+### logit_normal vs uniform
+
+logit_normal consistently reaches a lower loss floor than uniform. However, the perceptual improvement is dataset-dependent: on 10 clips the difference is marginal. It may matter more with larger datasets; no conclusion yet.
+
+### White noise
+
+Residual white noise in generated audio is primarily a **dataset** problem, not a training one. It was observed with all configs on 10 clips. Likely causes:
+- Too few clips for the model to confidently predict the target sound
+- Imprecise extraction prompts producing unfocused sync features
+- Missing mask when multiple objects are in frame
+
+CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
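One way to read the step counts in the observations table is as exposure per clip. The sketch below assumes every step draws `batch_size` items uniformly with replacement from the dataset, which is what `random.choices(dataset, k=batch_size)` in the trainer does; `draws_per_clip` is a hypothetical helper for illustration, not part of the repository.

```python
def draws_per_clip(steps: int, batch_size: int, n_clips: int) -> float:
    """Expected number of times each clip is sampled, assuming uniform
    sampling with replacement and batch_size draws per step."""
    return steps * batch_size / n_clips


# Step counts and batch sizes taken from the observations table; 10 clips.
configs = {
    "bf16 batch 16 @ step 6000": (6000, 16),
    "fp32 batch 32 @ step 2000": (2000, 32),
}
for name, (steps, batch_size) in configs.items():
    expected = draws_per_clip(steps, batch_size, n_clips=10)
    print(f"{name}: ~{expected:.0f} draws per clip")
```

Under that assumption the two configurations come out at roughly 9600 and 6400 draws per clip, and each further 1000 steps at batch 32 adds about 3200 more per clip. This is only bookkeeping; it does not by itself explain where overfitting starts.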
@@ -305,7 +305,24 @@ class SelvaLoraTrainer:
        feature_utils_orig = model["feature_utils"]

        data_dir   = Path(data_dir.strip())
-        output_dir = Path(output_dir.strip())
+
+        _out_str = output_dir.strip()
+        _out_p   = Path(_out_str)
+        # On Windows a Unix-style path like "/lora_output" has a root but no drive
+        # letter, so it resolves against the current drive (e.g. C:\lora_output),
+        # but the user almost certainly meant a subfolder of the ComfyUI output
+        # directory. Treat any non-absolute path AND any rooted path without a
+        # drive letter as relative to the ComfyUI output folder.
+        import sys as _sys
+        _unix_style_on_windows = (
+            _sys.platform == "win32"
+            and _out_p.is_absolute()
+            and not _out_p.drive  # e.g. Path("/foo").drive == "" on Windows
+        )
+        if not _out_p.is_absolute() or _unix_style_on_windows:
+            _out_p = Path(folder_paths.get_output_directory()) / _out_p.relative_to(_out_p.anchor)
+            print(f"[LoRA Trainer] output_dir resolved to: {_out_p}", flush=True)
+        output_dir = _out_p
        output_dir.mkdir(parents=True, exist_ok=True)

        alpha_val      = float(alpha) if alpha > 0.0 else float(rank)
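The path handling in this hunk can be tried on any OS with pathlib's pure path classes. The following is a minimal sketch, not the trainer's code: `resolve_output_dir` and the base directory are hypothetical, and `base` stands in for whatever `folder_paths.get_output_directory()` returns.

```python
from pathlib import PurePosixPath, PureWindowsPath


def resolve_output_dir(raw: str, base: str, flavour=PureWindowsPath):
    """Return `raw` unchanged if it is absolute for `flavour`; otherwise
    re-root it under `base` (stand-in for the ComfyUI output directory)."""
    p = flavour(raw.strip())
    if p.is_absolute():
        return p
    # Drop a leading root such as "\" so "/lora_output" becomes "lora_output".
    return flavour(base) / p.relative_to(p.anchor)


base = "C:/ComfyUI/output"  # hypothetical location
print(PureWindowsPath("/lora_output").is_absolute())
print(resolve_output_dir("/lora_output", base))
print(resolve_output_dir("lora_output", base))
print(resolve_output_dir("D:/loras", base))
print(resolve_output_dir("/lora_output", "/srv/comfy/output", PurePosixPath))
```

With the Windows flavour, pathlib reports a rooted path without a drive letter as non-absolute, so in this sketch a single `is_absolute()` check covers both the plain relative case and the `/lora_output` case. On POSIX the same string is a genuine absolute path and is left alone.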
@@ -370,7 +387,23 @@ class SelvaLoraTrainer:
                # Text → CLIP features (reuse already-loaded CLIP from inference model)
                text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()

-                dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+                # Pad/trim clip and sync features to fixed seq lengths — clips from
+                # shorter videos have fewer frames and would cause stack() to fail
+                clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
+                c_tgt  = seq_cfg.clip_seq_len
+                if clip_f.shape[1] < c_tgt:
+                    clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
+                elif clip_f.shape[1] > c_tgt:
+                    clip_f = clip_f[:, :c_tgt, :]
+
+                sync_f = bundle["sync_features"]  # [1, N_sync, 768]
+                s_tgt  = seq_cfg.sync_seq_len
+                if sync_f.shape[1] < s_tgt:
+                    sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
+                elif sync_f.shape[1] > s_tgt:
+                    sync_f = sync_f[:, :s_tgt, :]
+
+                dataset.append((x1, clip_f, sync_f, text_clip))
            except Exception as e:
                print(f"  [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
                traceback.print_exc()
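The pad/trim step added in this hunk can be illustrated in isolation. The sketch below uses NumPy as a stand-in for torch so it runs without the training stack; `fit_seq_len` and the target length of 64 are assumptions for illustration. In the diff, `F.pad(x, (0, 0, 0, n))` reads its tuple from the last dimension backwards, so it leaves the feature dimension alone and appends `n` zero rows to the sequence dimension; `np.pad` takes one `(before, after)` pair per axis in forward order instead.

```python
import numpy as np


def fit_seq_len(feat: np.ndarray, target: int) -> np.ndarray:
    """Zero-pad at the end, or truncate, axis 1 of a [1, N, D] array to `target`."""
    n = feat.shape[1]
    if n < target:
        # Pad only the sequence axis, and only at its end.
        return np.pad(feat, ((0, 0), (0, target - n), (0, 0)))
    return feat[:, :target, :]


short = np.ones((1, 16, 1024), dtype=np.float32)  # fewer frames than the target
long_ = np.ones((1, 80, 1024), dtype=np.float32)  # more frames than the target

# After normalisation both items share a shape and can be batched together.
batch = np.concatenate([fit_seq_len(f, 64) for f in (short, long_)], axis=0)
print(batch.shape)
```

The same helper shape would also remove the duplication between the clip and sync branches, since they differ only in the source tensor and the target length.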
@@ -473,6 +506,9 @@ class SelvaLoraTrainer:
        print(f"\n[LoRA Trainer] Training {remaining} steps "
              f"(step {start_step + 1} → {steps}, batch_size={batch_size})\n", flush=True)

+        last_step = start_step
+        completed = False
+        try:
            for step in range(start_step + 1, steps + 1):
                batch = random.choices(dataset, k=batch_size)
                x1_list, clip_list, sync_list, text_list = zip(*batch)
@@ -537,32 +573,45 @@ class SelvaLoraTrainer:
                            sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
                        print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)

+                last_step = step
                pbar_train.update(1)

-        # Save inference adapter (state_dict + meta only — SelvaLoraLoader compatible)
-        # Increment filename if a previous final already exists (resume case)
+            completed = True
+
+        finally:
+            # Save adapter and loss curves whether training completed or was cancelled.
+            # Skip if we never completed a single step (nothing useful to save).
+            if loss_history:
+                if completed:
+                    # Normal completion — use adapter_final.pt (increment if exists)
                    final_path = output_dir / "adapter_final.pt"
                    if final_path.exists():
                        i = 1
                        while (output_dir / f"adapter_final_{i:03d}.pt").exists():
                            i += 1
                        final_path = output_dir / f"adapter_final_{i:03d}.pt"
+                    label = "Done"
+                else:
+                    # Cancelled — include the step number so the file is useful for resume
+                    final_path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
+                    label = f"Cancelled at step {last_step}"
+
                torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
                (output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
-        print(f"\n[LoRA Trainer] Done. Adapter saved to {final_path}", flush=True)
-
-        # --- Return patched model ---
-        generator.eval()
-        generator.to(next(model["generator"].parameters()).device)
-        patched = {**model, "generator": generator}
+                print(f"\n[LoRA Trainer] {label}. Adapter saved to {final_path}", flush=True)

                smoothed     = _smooth_losses(loss_history)
                raw_img      = _draw_loss_curve(loss_history, log_interval, start_step)
-        smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step, smoothed=smoothed)
+                smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step,
+                                                smoothed=smoothed)
                raw_img.save(str(output_dir / "loss_raw.png"))
                smoothed_img.save(str(output_dir / "loss_smoothed.png"))
                print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)

-        loss_curve = _pil_to_tensor(smoothed_img)
+        # Reached only on normal completion (exception re-raises past this point)
+        generator.eval()
+        generator.to(next(model["generator"].parameters()).device)
+        patched = {**model, "generator": generator}

+        loss_curve = _pil_to_tensor(smoothed_img)
        return (patched, str(final_path), loss_curve)
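The control flow of this hunk (track `last_step`, set `completed` only after the loop, pick the filename in `finally`) can be reduced to a toy loop. This is a sketch under stated assumptions: `train` is hypothetical, `KeyboardInterrupt` is a stand-in for however the host signals cancellation, and the exception is swallowed here only so the demo can return a value, whereas the trainer lets it propagate.

```python
def train(steps: int, cancel_at=None):
    """Toy loop: returns the adapter filename that would be written, or None."""
    loss_history, last_step, completed = [], 0, False
    saved = None
    try:
        for step in range(1, steps + 1):
            if step == cancel_at:
                raise KeyboardInterrupt  # stand-in for a user cancel
            loss_history.append(1.0 / step)  # fake loss value
            last_step = step  # updated only after the step fully finished
        completed = True
    except KeyboardInterrupt:
        pass  # swallowed for the demo only; the trainer re-raises
    finally:
        if loss_history:  # nothing worth saving before the first step
            saved = ("adapter_final.pt" if completed
                     else f"adapter_cancelled_step{last_step:05d}.pt")
    return saved


print(train(5))
print(train(5, cancel_at=4))
print(train(5, cancel_at=1))
```

One edge case may be worth a second look in the diff itself: `final_path` and `smoothed_img` are assigned only inside the `if loss_history:` branch, so a run that finishes normally with an empty `loss_history` would presumably reach the `return` with those names unbound. The sketch returns `None` in that situation.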
@@ -284,7 +284,24 @@ def main():
            elif x1.shape[1] > tgt:
                x1 = x1[:, :tgt, :]
            text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
-            dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+
+            # Pad/trim clip and sync features to fixed seq lengths — shorter clips
+            # have fewer frames and would cause stack() to fail during batching
+            clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
+            c_tgt  = seq_cfg.clip_seq_len
+            if clip_f.shape[1] < c_tgt:
+                clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
+            elif clip_f.shape[1] > c_tgt:
+                clip_f = clip_f[:, :c_tgt, :]
+
+            sync_f = bundle["sync_features"]  # [1, N_sync, 768]
+            s_tgt  = seq_cfg.sync_seq_len
+            if sync_f.shape[1] < s_tgt:
+                sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
+            elif sync_f.shape[1] > s_tgt:
+                sync_f = sync_f[:, :s_tgt, :]
+
+            dataset.append((x1, clip_f, sync_f, text_clip))
        except Exception as e:
            print(f"  [LoRA] Warning: failed to process {npz_path.name}: {e}")
| Author | SHA1 | Message | Date |
|---|---|---|---|
| Ethanfel | 95136b53a0 | docs: add observations section with fp32/batch/precision findings. Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> | 2026-04-06 02:34:52 +02:00 |
| Ethanfel | 8f31d00beb | docs: add prompt guide and masking note to dataset preparation section. Poor prompts and missing masks are a common source of white noise in LoRA training — imprecise sync features force the adapter to compensate with noise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> | 2026-04-06 01:43:35 +02:00 |
| Ethanfel | 3ee1893e10 | fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder. On Windows, /folder is drive-relative (no drive letter) rather than a real absolute path. Redirect these to ComfyUI's output directory so files don't land at C:\folder. Also redirects plain relative paths (e.g. lora_output) to output/ instead of the process working directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> | 2026-04-06 01:13:59 +02:00 |
| Ethanfel | c86258d48f | fix: save adapter and loss curves on cancel, not only on normal completion. Wraps training loop in try/finally so adapter_final.pt and loss PNGs are always written. On cancellation the adapter is named adapter_cancelled_stepXXXXX.pt so it can be used with --resume. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> | 2026-04-06 01:07:04 +02:00 |
| Ethanfel | 8338560600 | fix: pad/trim clip and sync features to fixed seq_len at dataset load time. Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames, 8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load, same as latents are already normalized to latent_seq_len. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> | 2026-04-06 00:51:45 +02:00 |