docs: add observations section with fp32/batch/precision findings

Work-in-progress empirical notes: fp32 batch 32 reaches same quality as bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips. Lower loss does not reliably mean better audio on small datasets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 02:34:52 +02:00
parent 9fc739fe9e
commit 5baa070e61
1 changed files with 32 additions and 0 deletions
@@ -376,3 +376,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th

 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
+
+---
+
+## Observations (work in progress)
+
+These are empirical findings from ongoing experiments. They will be promoted to the main guide once more validated.
+
+### Precision and batch size
+
+| Config | Smoothed loss at step 2000 | Notes |
+|---|---|---|
+| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
+| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
+| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
+| fp32 batch 32 | ~0.58 | Matches bf16 batch 16 at step 6000 already at step 2000 |
+
+**Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
+
+**Lower loss ≠ better audio.** Once overfitting begins the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
+
+### logit_normal vs uniform
+
+logit_normal consistently reaches a lower loss floor than uniform. However perceptual improvement is dataset-dependent — on 10 clips the difference is marginal. May be more impactful with larger datasets. No conclusion yet.
+
+### White noise
+
+Residual white noise on generated audio is primarily a **dataset** problem, not a training one. Observed with all configs on 10 clips. Likely causes:
+- Too few clips for the model to confidently predict the target sound
+- Imprecise extraction prompts producing unfocused sync features
+- Missing mask when multiple objects are in frame
+
+CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.