Peak norm was slamming output to full scale regardless of content level,
making generated audio several times louder than training clips.
RMS norm to -20 dBFS matches typical processed audio level.
Sampler exposes target_lufs (-40 to -6, default -20) for user control.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
9 experiments targeting loss 0.25-0.35 without LoRA+ noise.
Tests higher base LR (2e-4/3e-4/5e-4), curriculum combos, conservative
LoRA+ ratio=4, and rank 256 baseline + lr=3e-4.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inference is fast on RTX PRO 6000 — 8 steps was washing out quality
differences between experiments.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
15 experiments across rank (64/128), alpha, regularisation, LR, target
layers, and combined stacks. Based on tier1_thorough early results
confirming rank 64 sounds best perceptually.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scheduler: on re-run, reads existing experiment_summary.json and skips
already-completed experiments — safe to stop and restart mid-sweep.
tier1_thorough: adds g5 (lr 3e-5/3e-4), g6 (full target attn.qkv+linear1
at r16 and r64), and g4_full_r64_6k (6000-step extended run) — 17 total.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Always sample dataset[0] with fixed noise seed so checkpoints are
directly comparable (hear the model improve step by step)
- Save to output_dir/samples/step_XXXXX.wav instead of alongside checkpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dataset browser: audio/features now resolve through features/ subdir
- tier1_sweep.json: update data_dir to BJ dataset path
- tier1_thorough.json: 12-experiment overnight sweep across 4 groups
(rank 16/32/64, alpha scaling, LoRA+/dropout/curriculum isolation,
full Tier 1 stack at r16 and r64) — output to BJ/experiment/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Companion node for inspecting dataset.json entries by integer index.
Outputs video (.mp4), audio (.wav/.flac), features (.npz), frames dir,
mask dir, label, and max_index for constraining the index widget range.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Architecture comparison, capability matrix, integration cost estimate,
LoRA training difficulty analysis, and license implications.
Verdict: SelVA remains preferred for V2A + LoRA fine-tuning; AudioX
adds value for music generation, inpainting, and text-to-audio tasks.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New section 1.1 covers aspect ratio (16:9 landscape preferred), resolution
(≥480p), frame rate (any, use VHS_VIDEOINFO), and portrait handling
(center-crop to square). Based on CLIP 384px and Synchformer 224px internals.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Full research notes on cleaning, augmentation, and quality metrics for
generative model training. Covers LUFS normalization, AudioSep, waveform
augmentation (pitch shift, RIR, EQ), latent mixup, DNSMOS gating, tool
install commands, and key paper references.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
6 experiments: baseline, LoRA+ (ratio=16), dropout 0.05, dropout 0.1,
curriculum sampling, and all three combined. bf16 batch 16, 2000 steps,
seed 42. data_dir placeholder needs to be updated before running.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- trainer: raise ValueError early when remaining steps < log_interval (50)
instead of UnboundLocalError on smoothed_img/final_path at return
- trainer: use None in grad_norm_history instead of silent 0.0 when
grad_accum > log_interval and no optimizer step fired in the interval
- trainer: include start_step in _train_inner return dict
- scheduler: use start_step from result dict for min_loss_step and
loss_at_steps (fixes wrong step labels on resumed experiments)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trainer:
- Track gradient norm before clipping at each optimizer step
- Log avg grad_norm per log_interval alongside loss in console output
- Include grad_norm_history in _train_inner return dict
scheduler:
- Add system block to summary (GPU name, VRAM, torch/CUDA version)
- Include full loss_history and grad_norm_history arrays in each
experiment result (50-step resolution, not just save_every checkpoints)
- Add loss_std_last_quarter stability metric (std dev of raw loss over
last 25% of steps — high value indicates unstable training)
- Add log_interval field so consumers know the x-axis resolution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract _prepare_dataset() from SelvaLoraTrainer.train() as a module-level
function so the dataset can be encoded once and reused across experiments
- Change _train_inner() return value from tuple to dict (adds loss_history,
meta, completed; train() unpacks for ComfyUI — no change to node outputs)
- New SelvaLoraScheduler node: reads a JSON sweep file, runs N experiments
sequentially, writes experiment_summary.json (updated after each run) and
loss_comparison.png with all smoothed curves overlaid on the same axes
- Register SelvaLoraScheduler in nodes/__init__.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- LoRA dropout: applied to the LoRA path only (not frozen base weights),
0.05–0.1 helps regularize on small datasets (arXiv:2404.09610)
- LoRA+: separate optimizer param groups for lora_A and lora_B with
configurable LR ratio; ratio=16 enables LoRA+ (arXiv:2402.12354)
- Curriculum mode: logit_normal for first N% of steps then uniform,
directly addresses early convergence + fine-detail degradation at
boundaries (arXiv:2603.12517)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Work-in-progress empirical notes: fp32 batch 32 reaches same quality as
bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips.
Lower loss does not reliably mean better audio on small datasets.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Poor prompts and missing masks are a common source of white noise in LoRA
training — imprecise sync features force the adapter to compensate with noise.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
logit_normal reaches lower loss but perceptual improvement over uniform
is dataset-dependent. Keeping uniform as default to match original MMAudio
training behavior; logit_normal remains available as an option.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
same as latents are already normalized to latent_seq_len.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Uniform timestep sampling undertrained t>0.8 (the final denoising steps),
leaving residual noise that CFG amplifies at inference. Logit-normal sampling
concentrates training near t=0.5 while still covering the full range, improving
high-t coverage and reducing noise floor in generated audio.
Default changed from uniform to logit_normal (sigma=1.0). Previous behavior
available with timestep_mode=uniform.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Added batch_size VRAM table and updated step recommendations for batched training
- Added adapter strength section with practical guidance (0.6-0.7 for noise)
- Added ComfyUI node as Option A for training (not just CLI)
- Noted .mp3 as not recommended, soundfile fallback implied
- Added output files section with sample_*.wav and loss curve PNGs
- Added "LoRA has no effect" troubleshooting (wrong node wired)
- Updated loss convergence targets based on observed training runs
- Clarified linear1 target: 150+ clips recommended
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces single-sample steps with batched sampling via random.choices().
Tensors are stacked to [B, T, C] before the forward pass; t is now [B].
Default grad_accum lowered to 1 since real batching gives stable gradients.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Raw curve shown in light blue, EMA-smoothed (beta=0.9) overlay in darker
blue. Both saved as PNG at end of training. The node IMAGE output now
returns the smoothed version. Live preview also uses the smoothed overlay.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
500 warmup steps is 25% of a 2000-step run — too long. 100 steps lets
the full lr kick in much earlier without sacrificing stability.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The third element in ComfyUI's preview tuple is max_size in pixels, not
JPEG quality. Passing 85 was capping the live loss curve at 85×40px.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.enable_grad() alone is insufficient: operations on inference tensors
(created inside ComfyUI's outer inference_mode context) produce inference
tensors even inside enable_grad, breaking autograd. inference_mode(False)
exits the inference context so the deepcopy, apply_lora, and training loop
run with a fully clean autograd context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.enable_grad() re-enables grad tracking but nn.Parameters created while
torch.inference_mode() is active are inference tensors that can't enter autograd
regardless. Splitting into _train_inner() and calling it inside enable_grad()
ensures the deepcopy, apply_lora, and the training loop all run with a clean
autograd context.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI executes all nodes inside torch.no_grad(), which prevents gradient
tracking and makes loss.backward() fail. torch.enable_grad() overrides it.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
apply_lora() is called after generator.to(device), so lora_A/lora_B were
being created on CPU while the rest of the model was on CUDA.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
STFT hop-size rounding produces ±1 latent frame vs the expected seq length.
Clamp to seq_cfg.latent_seq_len after transpose so generator.forward assertion passes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Recent torchaudio defaults to torchcodec as the audio backend, which requires
FFmpeg shared libraries. Falls back to soundfile for envs where torchcodec
can't load (e.g. containerised ComfyUI without system FFmpeg).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.stft requires float32 input — casting vae_utils to bf16 caused silent
failures during dataset pre-loading. Also adds traceback.print_exc() so future
clip-load errors are visible in the ComfyUI log.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
At every save_every steps, run a quick 8-step no-CFG inference pass on
a random training clip and save the decoded waveform as
sample_stepXXXXX.wav next to the checkpoint. Uses the existing
generator.unnormalize + feature_utils.decode + vocode pipeline from
the sampler. Failure is non-fatal (logged and skipped).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Send updated loss curve to ComfyUI frontend every 50 steps via
pbar_train.update_absolute() with a JPEG preview tuple — same
mechanism as KSampler's denoising previews.
- Fix x-axis step labels for resumed runs (previously always started
at 0; now correctly shows start_step + offset).
- Split _draw_loss_curve (returns PIL Image) from _pil_to_tensor
(converts for ComfyUI IMAGE output).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs the full training loop inside ComfyUI. Reuses the already-loaded
CLIP model from the inference model for text encoding; loads only a
minimal VAE encoder separately (freed after dataset pre-loading).
Outputs:
- SELVA_MODEL with LoRA applied (ready to connect directly to Sampler)
- adapter_path STRING (for SelVA LoRA Loader in future sessions)
- loss_curve IMAGE (PIL-rendered line chart of training loss per 50 steps)
Progress is shown via ComfyUI ProgressBar (two phases: dataset loading,
then training steps). Resume is supported via resume_path input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>