From 5baa070e619ae3f81ba286aa4c343a2158935499 Mon Sep 17 00:00:00 2001
From: Ethanfel
Date: Mon, 6 Apr 2026 02:34:52 +0200
Subject: [PATCH] docs: add observations section with fp32/batch/precision findings

Work-in-progress empirical notes: fp32 batch 32 reaches the same quality as
bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips.
Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6
---
 LORA_TRAINING.md | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/LORA_TRAINING.md b/LORA_TRAINING.md
index da1ecc9..09621ed 100644
--- a/LORA_TRAINING.md
+++ b/LORA_TRAINING.md
@@ -376,3 +376,35 @@ Make sure the SelVA LoRA Loader output is wired to the **Sampler** input, not th
 
 **Loss plateaus early (above 0.7)**
 Dataset is the bottleneck. Add more clips — diversity matters more than quantity.
+
+---
+
+## Observations (work in progress)
+
+These are empirical findings from ongoing experiments. They will be promoted to the main guide once they are better validated.
+
+### Precision and batch size
+
+| Config | Smoothed loss at step 2000 | Notes |
+|---|---|---|
+| bf16 batch 1 | ~0.73 | Noisy gradients, slow |
+| bf16 batch 16 | ~0.65 | Stable, plateaued around step 6000–8000 at ~0.59 |
+| bf16 batch 16 logit_normal | ~0.47 | Lower loss floor, similar or marginally better audio |
+| fp32 batch 32 | ~0.58 | At step 2000, already matches bf16 batch 16 at step 6000 |
+
+**Key finding:** fp32 batch 32 reaches in ~2000 steps the perceptual quality that bf16 batch 16 needs 6000+ steps to reach. However, on small datasets (10 clips) the fp32 batch 32 loss keeps descending well past that point and the model eventually overfits. **Stop fp32 batch 32 around step 2000 on a 10-clip dataset**: later checkpoints sound worse despite lower loss.
+
+**Lower loss ≠ better audio.** Once overfitting begins, the model memorizes training clips rather than generalizing to new video inputs. Test intermediate checkpoints (e.g. step 500, 1000, 2000) to find the perceptual sweet spot.
+
+### logit_normal vs uniform
+
+logit_normal consistently reaches a lower loss floor than uniform. However, the perceptual improvement is dataset-dependent: on 10 clips the difference is marginal. It may matter more with larger datasets. No conclusion yet.
+
+### White noise
+
+Residual white noise on generated audio is primarily a **dataset** problem, not a training one. It was observed with all configs on 10 clips. Likely causes:
+- Too few clips for the model to confidently predict the target sound
+- Imprecise extraction prompts producing unfocused sync features
+- Missing mask when multiple objects are in frame
+
+CFG scale amplifies any adapter noise bias. Reducing CFG to 3.0–3.5 or adapter strength to 0.6–0.7 helps at inference.
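
For readers unfamiliar with the logit_normal vs uniform comparison above, here is a minimal sketch of the two sampling schemes. It assumes the names refer to how the training timestep `t` is drawn, which the notes do not state explicitly. The function names, the `mean`/`std` defaults, and the 0.25–0.75 band are illustrative assumptions, not the trainer's actual code or settings.

```python
import math
import random


def sample_uniform(rng: random.Random) -> float:
    """Draw a timestep t uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Draw z ~ Normal(mean, std), then squash it into (0, 1) with a sigmoid.

    mean and std are hypothetical defaults chosen for illustration.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def fraction_in_middle(samples, low=0.25, high=0.75):
    """Fraction of samples that fall in the central band [low, high]."""
    inside = sum(1 for t in samples if low <= t <= high)
    return inside / len(samples)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
uniform_ts = [sample_uniform(rng) for _ in range(10_000)]
logit_normal_ts = [sample_logit_normal(rng) for _ in range(10_000)]

print("uniform, share of t in [0.25, 0.75]:     ", fraction_in_middle(uniform_ts))
print("logit_normal, share of t in [0.25, 0.75]:", fraction_in_middle(logit_normal_ts))
```

Under these assumptions, logit_normal concentrates samples around mid-range timesteps, while uniform spreads them evenly. Because the reported loss is then averaged over differently distributed timesteps, the loss values of the two configs may not be directly comparable, which would be consistent with the lower loss floor not translating into clearly better audio.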