feat: add resume support to train_lora.py

Step checkpoints now save optimizer state, scheduler state, and step
number alongside the LoRA weights. Pass --resume path/to/adapter_stepXXXXX.pt
to continue training from that checkpoint. --steps always means total steps,
so resuming from 1000 with --steps 2000 trains 1000 more steps.

adapter_final.pt format is unchanged (state_dict + meta only) so
SelvaLoraLoader remains compatible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-04-05 16:59:30 +02:00
parent 8e9114b92c
commit 2f4641247a
2 changed files with 38 additions and 5 deletions
+1
View File
@@ -109,6 +109,7 @@ The script will:
| `--warmup_steps` | `500` | Linear LR warmup steps |
| `--grad_accum` | `4` | Gradient accumulation steps (effective batch = grad_accum × 1) |
| `--save_every` | `500` | Save a checkpoint every N steps |
| `--resume` | `None` | Path to a step checkpoint to resume from (e.g. `lora_output/adapter_step01000.pt`) |
| `--precision` | `bf16` | Mixed precision: `bf16`, `fp16`, `fp32` |
| `--seed` | `42` | Random seed |