From 5baa070e619ae3f81ba286aa4c343a2158935499 Mon Sep 17 00:00:00 2001
From: Ethanfel <ethan.fel@ts-pc.fr>
Date: Mon, 6 Apr 2026 02:34:52 +0200
Subject: [PATCH] docs: add observations section with fp32/batch/precision
 findings

Work-in-progress empirical notes: fp32 batch 32 reaches same quality as
bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips.
Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 LORA_TRAINING.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/LORA_TRAINING.md b/LORA_TRAINING.md
index da1ecc9..09621ed 100644
--- a/LORA_TRAINING.md
+++ b/LORA_TRAINING.md
@@ -376,3 +376,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
 
 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
+
+---
+
+## Observations (work in progress)
+
+These are empirical findings from ongoing experiments. They will be promoted to the main guide once more validated.
+
+### Precision and batch size
+
+| Config | Smoothed loss at step 2000 | Notes |
+|---|---|---|
+| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
+| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
+| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
+| fp32 batch 32 | ~0.58 | Matches bf16 batch 16 at step 6000 already at step 2000 |
+
+**Key finding:** fp32 batch 32 converges to the same perceptual quality point in ~2000 steps that bf16 batch 16 needs 6000+ steps to reach. However, fp32 batch 32 continues descending well past that point on small datasets (10 clips), eventually overfitting. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset** — later checkpoints sound worse despite lower loss.
+
+**Lower loss ≠ better audio.** Once overfitting begins the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
+
+### logit_normal vs uniform
+
+logit_normal consistently reaches a lower loss floor than uniform. However perceptual improvement is dataset-dependent — on 10 clips the difference is marginal. May be more impactful with larger datasets. No conclusion yet.
+
+### White noise
+
+Residual white noise on generated audio is primarily a **dataset** problem, not a training one. Observed with all configs on 10 clips. Likely causes:
+- Too few clips for the model to confidently predict the target sound
+- Imprecise extraction prompts producing unfocused sync features
+- Missing mask when multiple objects are in frame
+
+CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.