feat: add batch_size parameter to training (default 4)

Replaces single-sample steps with batched sampling via random.choices(). Tensors are stacked to [B, T, C] before the forward pass; t now has shape [B]. Default grad_accum lowered from 4 to 1, since real batching already gives stable gradients.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
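
For reviewers, the batching described above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `sample_batch`, `shape`, and the toy `clips` data are hypothetical names, nested lists stand in for tensors so the snippet needs only the standard library, and drawing `t` uniformly from [0, 1) is an assumption about what `t` represents.

```python
import random


def sample_batch(clips, batch_size, rng):
    """Draw batch_size clips with replacement and stack them to [B, T, C]."""
    # random.choices samples WITH replacement, so a clip may repeat in a batch.
    picked = rng.choices(clips, k=batch_size)
    # Stacking requires every clip to share the same [T, C] shape.
    t_len, c_len = len(picked[0]), len(picked[0][0])
    for clip in picked:
        if len(clip) != t_len or any(len(frame) != c_len for frame in clip):
            raise ValueError("clips must share one [T, C] shape to be stacked")
    batch = [[list(frame) for frame in clip] for clip in picked]  # [B, T, C]
    # One value of t per sample, so t has shape [B] (uniform draw is assumed).
    t = [rng.random() for _ in range(batch_size)]
    return batch, t


def shape(nested):
    """Return the dimensions of a regular nested list, e.g. (B, T, C)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


rng = random.Random(0)
# Toy dataset: 10 clips, each [T=5, C=3].
clips = [[[float(i)] * 3 for _ in range(5)] for i in range(10)]
batch, t = sample_batch(clips, batch_size=4, rng=rng)
print(shape(batch), len(t))  # (4, 5, 3) 4
```

Because `random.choices()` draws with replacement, the same clip can appear more than once in a batch, which matters mostly for very small datasets.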
2026-04-05 23:36:12 +02:00
parent 3f67de694c
commit 09b3b94ddd
3 changed files with 28 additions and 22 deletions
@@ -107,7 +107,8 @@ The script will:
 | `--lr` | `1e-4` | Learning rate |
 | `--steps` | `2000` | Total training steps |
 | `--warmup_steps` | `100` | Linear LR warmup steps |
-| `--grad_accum` | `4` | Gradient accumulation steps (effective batch = grad_accum × 1) |
+| `--batch_size` | `4` | Clips per training step |
+| `--grad_accum` | `1` | Gradient accumulation steps |
 | `--save_every` | `500` | Save a checkpoint every N steps |
 | `--resume` | `None` | Path to a step checkpoint to resume from (e.g. `lora_output/adapter_step01000.pt`) |
 | `--precision` | `bf16` | Mixed precision: `bf16`, `fp16`, `fp32` |