From 9fc739fe9e3b33c7d273a571f3d952749e14c358 Mon Sep 17 00:00:00 2001
From: Ethanfel <ethan.fel@ts-pc.fr>
Date: Mon, 6 Apr 2026 01:43:28 +0200
Subject: [PATCH] docs: add prompt guide and masking note to dataset
 preparation section
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Poor prompts and missing masks are a common source of white noise in LoRA
training — imprecise sync features force the adapter to compensate with noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 LORA_TRAINING.md | 34 ++++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/LORA_TRAINING.md b/LORA_TRAINING.md
index e557d57..da1ecc9 100644
--- a/LORA_TRAINING.md
+++ b/LORA_TRAINING.md
@@ -36,11 +36,41 @@ For each video clip you want to train on:
 2. Connect it to **SelVA Feature Extractor**.
 3. Set **`cache_dir`** to a dedicated dataset folder, e.g. `dataset/my_sound`.
 4. Set **`name`** to a short descriptive label, e.g. `dog_bark`. The node will save `dog_bark_001.npz`, then `dog_bark_002.npz`, etc. automatically as you process more clips.
-5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt is used to condition the sync features — be specific.
-6. Optionally connect a **mask** to isolate the sound source in frame (recommended when the scene has multiple objects).
+5. Set the **`prompt`** to describe the sound (e.g. `a dog barking`). This prompt conditions the Synchformer sync features — be as specific as possible (see prompt guide below).
+6. Optionally connect a **mask** to isolate the sound source in frame (strongly recommended when multiple objects are visible — see masking note below).
 
 > **Tip:** The prompt used for feature extraction conditions the *visual sync features*. You can use a different, more precise prompt at training time — see Step 2.
 
+### Prompt guide
+
+The prompt is more than a label: it directly shapes what the Synchformer attends to in the video. Imprecise prompts produce unfocused sync features, and the LoRA then has to compensate for them, which often introduces noise.
+
+**Good prompts are specific about:**
+- The sound source (what object is making the sound)
+- The acoustic character (loud/quiet, sharp/soft, wet/dry)
+- The action producing the sound (if applicable)
+
+| Sound | Weak prompt | Strong prompt |
+|---|---|---|
+| Dog bark | `dog` | `a large dog barking loudly` |
+| Footsteps | `walking` | `heavy boots on a wooden floor` |
+| Water | `water` | `water dripping into a metal bucket` |
+| Explosion | `explosion` | `a large explosion with deep bass rumble` |
+| Door | `door` | `a heavy wooden door slamming shut` |
+
+**Rules of thumb:**
+- Describe the *sound*, not the visual scene. `a person hitting a drum` is better than `a drummer on stage`.
+- Keep prompts consistent across all clips for the same sound class. Mixing `a dog barking` and `loud barking dog` in the same dataset creates conflicting sync features.
+- Avoid negations (`no background noise`). The sync feature conditioning does not understand them.
+
+### Masking note
+
+If the video frame contains multiple moving objects, irrelevant motion dilutes both the CLIP features and the sync features. Use a segmentation mask (SAM2 or Grounding DINO+SAM) to isolate the sound source:
+
+- Connect the mask to the **`mask`** input on SelVA Feature Extractor.
+- Leave `mask_strength` at `1.0` for clean isolation; lower it only if the masked region is very small and the model loses context.
+- If you already have `.npz` files that were extracted without a mask, re-extract them with one. Better features directly reduce training noise.
+
 ### 1.2 Collect clean audio
 
 For each `.npz` file, place a matching audio file with the **same filename stem** in the same directory: