fix(ti): lower default lr/batch, add lr_batch sweep group

n4_baseline showed token_norm growing linearly without plateau, a classic
sign of lr too high relative to parameter count. With only K×1024 params,
gradient signal per param is already high-magnitude; high lr causes
overshoot rather than convergence.

- Default lr: 1e-3 → 2e-4 (matches LoRA working regime)
- Default batch_size: 16 → 4 (more diverse gradients, helps norm saturate)
- ti_sweep_1.json: add lr_batch group (lr_low_b4, lr_mid_b8,
  lr_low_b4_prefix, lr_2e3), restructure with clearer groups, annotate
  n4_baseline as completed with findings

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
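
The "growing linearly without plateau" observation could be checked numerically along the lines below. This is only a sketch, not code from this repository: the helper names (`relative_slope`, `has_plateaued`), the window size, and the tolerance are all hypothetical, and it assumes token_norm is logged once per step as a list of positive floats.

```python
import math


def relative_slope(values, window=500):
    """Least-squares slope over the last `window` points, divided by their mean.

    Hypothetical helper: a value near 0 suggests the series has flattened out.
    """
    tail = list(values[-window:])
    n = len(tail)
    if n < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = (n - 1) / 2
    mean_y = sum(tail) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(tail))
    var = sum((i - mean_x) ** 2 for i in range(n))
    # Guard against division by zero for an all-zero series.
    return (cov / var) / max(abs(mean_y), 1e-12)


def has_plateaued(values, window=500, rel_tol=1e-4):
    """True if the norm changes by less than `rel_tol` (relative) per step."""
    return abs(relative_slope(values, window)) < rel_tol


# Synthetic illustration only (not measured data):
linear = [1.0 + 0.01 * i for i in range(3000)]                      # keeps growing
saturating = [5.0 * (1 - math.exp(-i / 200)) for i in range(3000)]  # levels off
print(has_plateaued(linear), has_plateaued(saturating))
```

The threshold is arbitrary; in practice it would need tuning against the logging frequency and the noise level of the run.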
2026-04-08 23:42:22 +02:00
parent 92535deab2
commit f9d092158a
3 changed files with 42 additions and 31 deletions
@@ -75,9 +75,9 @@ def _get_system_info() -> dict:

 _PARAM_DEFAULTS = {
    "n_tokens":     4,
-    "lr":           1e-3,
+    "lr":           2e-4,
    "steps":        3000,
-    "batch_size":   16,
+    "batch_size":   4,
    "warmup_steps": 100,
    "seed":         42,
    "save_every":   1000,