docs: add observations section with fp32/batch/precision findings

Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs: add prompt guide and masking note to dataset preparation section
3 changed files with 208 additions and 80 deletions
@@ -36,11 +36,41 @@ For each video clip you want to train on:
 2. Connect it to **SelVA Feature Extractor**.
 3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
 4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
-5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
+5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features — be as specific as possible (see prompt guide below).
-6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
+6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible — see masking note below).
 > **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.
 ### Prompt guide
 The prompt is not just a label — it directly shapes what the Synchformer pays attention to in the video. Imprecise prompts produce unfocused sync features, which the LoRA then has to compensate for, often introducing noise.
 **Good prompts are specific about:**
 - The sound source (what object is making the sound)
 - The acoustic character (loud/quiet, sharp/soft, wet/dry)
 - The action producing the sound (if applicable)
 | Sound | Weak prompt | Strong prompt |
 |---|---|---|
 | Dog bark | `dog` | `a large dog barking loudly` |
 | Footsteps | `walking` | `heavy boots on a wooden floor` |
 | Water | `water` | `water dripping into a metal bucket` |
 | Explosion | `explosion` | `a large explosion with deep bass rumble` |
 | Door | `door` | `a heavy wooden door slamming shut` |
 **Rules of thumb:**
 - Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
 - Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
 - Avoid negations (`no background noise`) — the model does not understand negations in sync feature conditioning.
 ### Masking note
 If the video frame contains multiple moving objects, CLIP and sync features will be diluted by irrelevant motion. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:
 - Connect the mask to the **`mask`** input on SelVA Feature Extractor.
 - Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
 - Re-extract features with a mask even if you already have `.npz` files — better features directly reduce training noise.
 ### 1.2 Collect clean audio
 For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory:
@@ -328,3 +358,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
 ---
 ## Observations (work in progress)
 These are empirical findings from ongoing experiments. They will be promoted to the main guide once they have been validated further.
 ### Precision and batch size
 | Config | Smoothed loss at step 2000 | Notes |
 |---|---|---|
 | bf16 batch 1 | ~0.73 | Noisy gradients, slow |
 | bf16 batch 16 | ~0.65 | Stable; plateaus at ~0.59 around steps 6000–8000 |
 | bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
 | fp32 batch 32 | ~0.58 | At step 2000, already matches the loss that bf16 batch 16 reaches at step 6000 |
 **Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
 **Lower loss ≠ better audio.** Once overfitting begins, the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. steps 500, 1000, 2000) to find the perceptual sweet spot.
 ### logit_normal vs uniform
 logit_normal consistently reaches a lower loss floor than uniform. However, the perceptual improvement is dataset-dependent: on 10 clips the difference is marginal. It may matter more with larger datasets. No conclusion yet.
 ### White noise
 Residual white noise on generated audio is primarily a **dataset** problem, not a training one. Observed with all configs on 10 clips. Likely causes:
 - Too few clips for the model to confidently predict the target sound
 - Imprecise extraction prompts producing unfocused sync features
 - Missing mask when multiple objects are in frame
 CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
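To make the CFG remark concrete, here is a minimal numeric sketch. It assumes the standard classifier-free guidance combination, `uncond + scale * (cond - uncond)`, and further assumes that the adapter's bias appears only in the conditional prediction. Neither assumption is confirmed by these notes, the helper names are hypothetical, and all values are made up for illustration.

```python
def cfg_combine(cond: float, uncond: float, scale: float) -> float:
    """Standard classifier-free guidance mix (assumed formulation)."""
    return uncond + scale * (cond - uncond)


def guided_bias(bias: float, scale: float, cond: float = 1.0, uncond: float = 0.5) -> float:
    """How much a bias added to the conditional branch shifts the guided output."""
    clean = cfg_combine(cond, uncond, scale)
    biased = cfg_combine(cond + bias, uncond, scale)
    return biased - clean


# Under these assumptions the bias is multiplied by the CFG scale.
for scale in (3.0, 4.0):
    print(f"scale={scale}: output shift = {guided_bias(0.25, scale)}")
```

Under the stated assumptions the shift grows linearly with the scale, which is consistent with a lower CFG value reducing audible noise. If the bias were present equally in both branches, it would not be amplified this way.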
@@ -305,7 +305,24 @@ class SelvaLoraTrainer:
        feature_utils_orig = model["feature_utils"]
        data_dir   = Path(data_dir.strip())
-        output_dir = Path(output_dir.strip())
+
        _out_str = output_dir.strip()
        _out_p   = Path(_out_str)
        # On Windows a Unix-style path like "/lora_output" is technically absolute
        # (drive-relative) but the user almost certainly meant a subfolder of the
        # ComfyUI output directory. Treat any non-absolute path AND any path whose
        # only "absolute" anchor is a leading slash (no drive letter) as relative to
        # the ComfyUI output folder.
        import sys as _sys
        _unix_style_on_windows = (
            _sys.platform == "win32"
            and _out_p.is_absolute()
            and not _out_p.drive  # e.g. Path("/foo").drive == "" on Windows
        )
        if not _out_p.is_absolute() or _unix_style_on_windows:
            _out_p = Path(folder_paths.get_output_directory()) / _out_p.relative_to(_out_p.anchor)
            print(f"[LoRA Trainer] output_dir resolved to: {_out_p}", flush=True)
        output_dir = _out_p
        output_dir.mkdir(parents=True, exist_ok=True)
        alpha_val      = float(alpha) if alpha > 0.0 else float(rank)
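
The path handling in this hunk can be checked off-Windows with `PureWindowsPath`. The sketch below is a simplified stand-in, not the trainer's code: `resolve_output_dir` and the `base` argument are hypothetical, and `base` replaces the call to `folder_paths.get_output_directory()`. Note that pathlib reports a leading-slash path without a drive letter as non-absolute on Windows, so the plain `not is_absolute()` branch already covers that case.

```python
from pathlib import PureWindowsPath


def resolve_output_dir(raw: str, base: str) -> PureWindowsPath:
    """Resolve a user-supplied output_dir against a base output folder.

    Hypothetical helper using Windows path semantics on any platform.
    """
    p = PureWindowsPath(raw.strip())
    # Only paths with both a drive and a root (e.g. "D:\\loras") are kept as-is.
    if p.is_absolute():
        return p
    # Drop a bare leading-slash anchor, if any, then join onto the base folder.
    rel = p.relative_to(p.anchor) if p.anchor else p
    return PureWindowsPath(base) / rel


base = "C:/ComfyUI/output"  # illustrative location only
print(resolve_output_dir("/lora_output", base))
print(resolve_output_dir("lora_output", base))
print(resolve_output_dir("D:/loras", base))
```

Both `/lora_output` and `lora_output` end up under the base folder, while a path with a drive letter is left untouched.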
@@ -370,7 +387,23 @@ class SelvaLoraTrainer:
                # Text → CLIP features (reuse already-loaded CLIP from inference model)
                text_clip = feature_utils_orig.encode_text_clip([prompt]).cpu()
-                dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+                # Pad/trim clip and sync features to fixed seq lengths — clips from
                # shorter videos have fewer frames and would cause stack() to fail
                clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
                c_tgt  = seq_cfg.clip_seq_len
                if clip_f.shape[1] < c_tgt:
                    clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
                elif clip_f.shape[1] > c_tgt:
                    clip_f = clip_f[:, :c_tgt, :]
                sync_f = bundle["sync_features"]  # [1, N_sync, 768]
                s_tgt  = seq_cfg.sync_seq_len
                if sync_f.shape[1] < s_tgt:
                    sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
                elif sync_f.shape[1] > s_tgt:
                    sync_f = sync_f[:, :s_tgt, :]
                dataset.append((x1, clip_f, sync_f, text_clip))
            except Exception as e:
                print(f"  [LoRA Trainer] Warning: failed {npz_path.name}: {e}", flush=True)
                traceback.print_exc()
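
The pad/trim logic above is repeated for both feature types, so it can be read as one small helper. The following is a sketch only: `pad_or_trim` is a hypothetical name, and the frame counts (16 and 80 frames, target 64) are illustrative rather than taken from `seq_cfg`.

```python
import torch
import torch.nn.functional as F


def pad_or_trim(feat: torch.Tensor, target_len: int) -> torch.Tensor:
    """Zero-pad or truncate a [1, N, D] tensor along N to target_len."""
    n = feat.shape[1]
    if n < target_len:
        # F.pad reads pairs from the last dim backwards:
        # (D_left, D_right, N_left, N_right), so this appends zeros along N.
        return F.pad(feat, (0, 0, 0, target_len - n))
    return feat[:, :target_len, :]


short_clip = torch.ones(1, 16, 1024)  # fewer frames than the target
long_clip = torch.ones(1, 80, 1024)   # more frames than the target

# With equal lengths, stacking no longer fails.
batch = torch.stack([pad_or_trim(short_clip, 64), pad_or_trim(long_clip, 64)])
print(batch.shape)
```

Zero-padding means the padded frames carry no signal. Whether the model should also receive a mask for them is not addressed in this change.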
@@ -473,6 +506,9 @@ class SelvaLoraTrainer:
        print(f"\n[LoRA Trainer] Training {remaining} steps "
              f"(step {start_step + 1} → {steps}, batch_size={batch_size})\n", flush=True)
        last_step = start_step
        completed = False
        try:
            for step in range(start_step + 1, steps + 1):
                batch = random.choices(dataset, k=batch_size)
                x1_list, clip_list, sync_list, text_list = zip(*batch)
@@ -537,32 +573,45 @@ class SelvaLoraTrainer:
                            sf.write(str(wav_path), wav.squeeze(0).numpy(), sr)
                        print(f"[LoRA Trainer] Sample saved: {wav_path}", flush=True)
                last_step = step
                pbar_train.update(1)
-        # Save inference adapter (state_dict + meta only — SelvaLoraLoader compatible)
+            completed = True
-        # Increment filename if a previous final already exists (resume case)
+
        finally:
            # Save adapter and loss curves whether training completed or was cancelled.
            # Skip if we never completed a single step (nothing useful to save).
            if loss_history:
                if completed:
                    # Normal completion — use adapter_final.pt (increment if exists)
                    final_path = output_dir / "adapter_final.pt"
                    if final_path.exists():
                        i = 1
                        while (output_dir / f"adapter_final_{i:03d}.pt").exists():
                            i += 1
                        final_path = output_dir / f"adapter_final_{i:03d}.pt"
                    label = "Done"
                else:
                    # Cancelled — include the step number so the file is useful for resume
                    final_path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
                    label = f"Cancelled at step {last_step}"
                torch.save({"state_dict": get_lora_state_dict(generator), "meta": meta}, final_path)
                (output_dir / "meta.json").write_text(json.dumps(meta, indent=2))
-        print(f"\n[LoRA Trainer] Done. Adapter saved to {final_path}", flush=True)
+                print(f"\n[LoRA Trainer] {label}. Adapter saved to {final_path}", flush=True)
        # --- Return patched model ---
        generator.eval()
        generator.to(next(model["generator"].parameters()).device)
        patched = {**model, "generator": generator}
                smoothed     = _smooth_losses(loss_history)
                raw_img      = _draw_loss_curve(loss_history, log_interval, start_step)
-        smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step, smoothed=smoothed)
+                smoothed_img = _draw_loss_curve(loss_history, log_interval, start_step,
                                                smoothed=smoothed)
                raw_img.save(str(output_dir / "loss_raw.png"))
                smoothed_img.save(str(output_dir / "loss_smoothed.png"))
                print(f"[LoRA Trainer] Loss curves saved to {output_dir}", flush=True)
-        loss_curve = _pil_to_tensor(smoothed_img)
+        # Reached only on normal completion (exception re-raises past this point)
        generator.eval()
        generator.to(next(model["generator"].parameters()).device)
        patched = {**model, "generator": generator}
        loss_curve = _pil_to_tensor(smoothed_img)
        return (patched, str(final_path), loss_curve)
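
The save-on-cancel pattern in this hunk can be exercised in isolation. The sketch below is a toy reproduction under stated assumptions: `Cancelled`, `next_final_path`, `run_training`, and `demo` are hypothetical names, an empty file stands in for `torch.save`, and the loss values are fake.

```python
import tempfile
from pathlib import Path


class Cancelled(Exception):
    """Hypothetical stand-in for whatever interrupts the real training loop."""


def next_final_path(output_dir: Path) -> Path:
    """Return adapter_final.pt, or the first free adapter_final_NNN.pt."""
    final_path = output_dir / "adapter_final.pt"
    if not final_path.exists():
        return final_path
    i = 1
    while (output_dir / f"adapter_final_{i:03d}.pt").exists():
        i += 1
    return output_dir / f"adapter_final_{i:03d}.pt"


def run_training(output_dir: Path, steps: int, cancel_at=None) -> None:
    last_step = 0
    completed = False
    loss_history = []
    try:
        for step in range(1, steps + 1):
            if cancel_at is not None and step == cancel_at:
                raise Cancelled()
            loss_history.append(1.0 / step)  # fake loss value
            last_step = step
        completed = True
    finally:
        # Runs on both normal completion and cancellation.
        if loss_history:
            if completed:
                path = next_final_path(output_dir)
            else:
                path = output_dir / f"adapter_cancelled_step{last_step:05d}.pt"
            path.write_bytes(b"")  # stand-in for torch.save(...)


def demo() -> list:
    out = Path(tempfile.mkdtemp())
    run_training(out, steps=3)
    run_training(out, steps=3)
    try:
        run_training(out, steps=10, cancel_at=5)
    except Cancelled:
        pass  # the exception still propagates after the finally block
    return sorted(p.name for p in out.iterdir())


print(demo())
```

Because `finally` does not swallow the exception, any code placed after the `try` block runs only on normal completion, which matches the comment in the hunk.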
@@ -284,7 +284,24 @@ def main():
            elif x1.shape[1] > tgt:
                x1 = x1[:, :tgt, :]
            text_clip = encode_text_clip(clip_model, tokenizer_clip, [prompt], device).cpu()
-            dataset.append((x1, bundle["clip_features"], bundle["sync_features"], text_clip))
+
            # Pad/trim clip and sync features to fixed seq lengths — shorter clips
            # have fewer frames and would cause stack() to fail during batching
            clip_f = bundle["clip_features"]  # [1, N_clip, 1024]
            c_tgt  = seq_cfg.clip_seq_len
            if clip_f.shape[1] < c_tgt:
                clip_f = F.pad(clip_f, (0, 0, 0, c_tgt - clip_f.shape[1]))
            elif clip_f.shape[1] > c_tgt:
                clip_f = clip_f[:, :c_tgt, :]
            sync_f = bundle["sync_features"]  # [1, N_sync, 768]
            s_tgt  = seq_cfg.sync_seq_len
            if sync_f.shape[1] < s_tgt:
                sync_f = F.pad(sync_f, (0, 0, 0, s_tgt - sync_f.shape[1]))
            elif sync_f.shape[1] > s_tgt:
                sync_f = sync_f[:, :s_tgt, :]
            dataset.append((x1, clip_f, sync_f, text_clip))
        except Exception as e:
            print(f"  [LoRA] Warning: failed to process {npz_path.name}: {e}")
Author	SHA1	Message	Date
Ethanfel	95136b53a0	docs: add observations section with fp32/batch/precision findings Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 02:34:52 +02:00
Ethanfel	8f31d00beb	docs: add prompt guide and masking note to dataset preparation section Poor prompts and missing masks are a common source of white noise in LoRA training — imprecise sync features force the adapter to compensate with noise. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:43:35 +02:00
Ethanfel	3ee1893e10	fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder On Windows, /folder is drive-relative (no drive letter) rather than a real absolute path. Redirect these to ComfyUI's output directory so files don't land at C:\folder. Also redirects plain relative paths (e.g. lora_output) to output/ instead of the process working directory. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:13:59 +02:00
Ethanfel	c86258d48f	fix: save adapter and loss curves on cancel, not only on normal completion Wraps training loop in try/finally so adapter_final.pt and loss PNGs are always written. On cancellation the adapter is named adapter_cancelled_stepXXXXX.pt so it can be used with --resume. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 01:07:04 +02:00
Ethanfel	8338560600	fix: pad/trim clip and sync features to fixed seq_len at dataset load time Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames, 8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load, same as latents are already normalized to latent_seq_len. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-06 00:51:45 +02:00