275 Commits

Author SHA1 Message Date
Ethanfel bb07bc8169 fix(ti-trainer): guard spectral metrics, drop unused imports
- Wrap _spectral_metrics + _save_spectrogram in try-except so a matplotlib
  or STFT error doesn't abort the checkpoint save (matches LoRA trainer)
- Remove unused `import math` and `_pil_to_tensor` import
- Drop dead `img` variable (_save_spectrogram returns None)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 23:10:19 +02:00
Ethanfel e36cdd7947 fix(ti-trainer): fix gradient flow and spectral metric shapes
- Replace in-place text_clip assignment with torch.cat so the computation
  graph correctly links text_input → learned_tokens; in-place assignment
  into a requires_grad=False leaf severs the graph and learned_tokens
  receives no gradients
- _spectral_metrics(wav, sr): was passing wav.unsqueeze(0) [1,1,L] instead
  of wav [1,L]; stft mean(dim=1) would return wrong shape [1,T] not [n_freqs]
- _save_spectrogram(wav, sr, ...): was passing wav.squeeze(0) [L] (1D)
  instead of wav [1,L] as the function expects

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 23:08:13 +02:00
Ethanfel e56ece9c1c feat: add SelVA Textual Inversion Trainer and Loader nodes
Learns K CLIP token embeddings ([K, 1024]) with all model weights frozen,
keeping generated latents on the decoder's natural manifold — avoids the
quality degradation that affects LoRA on BJ's audio dataset.

- selva_textual_inversion_trainer.py: trains learned_tokens via AdamW,
  injects into last K positions of 77-token CLIP embedding, checkpoints
  with eval audio + spectral metrics
- selva_textual_inversion_loader.py: loads .pt bundle, returns
  TEXTUAL_INVERSION dict for sampler
- selva_sampler.py: optional textual_inversion input; injects into both
  text_clip and neg_text_clip before preprocess_conditions
- __init__.py: registers both new nodes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 23:01:44 +02:00
Ethanfel eed7eefeac feat: add SelVA HF Smoother and Spectral Matcher preprocessing nodes
Two ComfyUI nodes to reduce domain mismatch between custom training audio
and the MMAudio VAE's expected spectral distribution:

SelvaHfSmoother: blends a low-pass filtered copy (biquad) with the original
at a configurable cutoff and blend ratio. Attenuates extreme HF content that
BigVGANv2 handles poorly. RMS-preserving.

SelvaSpectralMatcher: computes the log-mel energy profile of the clip,
compares it per-band to the VAE's normalization means (DATA_MEAN_80D/128D),
and applies a smooth STFT-domain gain correction to match the codec's training
distribution. Configurable strength and max_gain_db clamp. RMS-preserving.

Recommended workflow: SpectralMatcher → HfSmoother → feature extraction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 20:28:16 +02:00
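Sketch for eed7eefeac: one way the SelvaHfSmoother idea (low-pass copy, blend, restore RMS) could look. This is not the node's code. The function names, the RBJ-cookbook coefficients, and the Q of 0.7071 are assumptions made for illustration.

```python
import math

def biquad_lowpass(x, sr, cutoff_hz, q=0.7071):
    """Second-order low-pass (RBJ cookbook form), direct form I."""
    w0 = 2.0 * math.pi * cutoff_hz / sr
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    # Coefficients normalized by a0.
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, s
        y2, y1 = y1, out
        y.append(out)
    return y

def rms(x):
    return math.sqrt(sum(s * s for s in x) / len(x))

def hf_smooth(x, sr, cutoff_hz, blend):
    """Blend a low-passed copy with the original, then restore the input RMS."""
    lp = biquad_lowpass(x, sr, cutoff_hz)
    mixed = [(1.0 - blend) * a + blend * b for a, b in zip(x, lp)]
    r_in, r_out = rms(x), rms(mixed)
    gain = r_in / r_out if r_out > 0.0 else 1.0
    return [s * gain for s in mixed]
```

With blend=0.0 the input passes through unchanged; with blend=1.0 only the filtered copy remains, rescaled to the original RMS.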
Ethanfel 107bb05f17 fix(vae-roundtrip): pass bigvgan path to encoder-only FeaturesUtils
AutoEncoderModule unconditionally asserts vocoder_ckpt_path is not None
even when need_vae_encoder=True. Pass best_netG.pt to satisfy the assert;
the vocoder weights are not actually used since decode+vocode go through
model["feature_utils"].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 20:05:44 +02:00
Ethanfel 10e6095e31 fix(vae-roundtrip): use model feature_utils for decode, add normalize/unnormalize, normalize output
- Load fresh FeaturesUtils only for encoding; use model["feature_utils"] for
  decode+vocode to mirror the exact path the sampler takes
- Apply generator.normalize() → unnormalize() around the encoded latent so the
  decoder receives latents in the same space it expects from inference
- Log both encoded and norm→unnorm latent stats to diagnose round-trip fidelity
- Normalize output to -27 dBFS (matching training clip RMS) and clamp to [-1, 1]
  to prevent clipping artifacts in the output waveform

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:50:01 +02:00
Ethanfel 528d33be39 fix: trim/pad latent to seq_cfg.latent_seq_len before decoding
Without this the decoder produced 7s instead of 8s due to STFT rounding.
Same fix as _prepare_dataset uses for training data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:22:09 +02:00
Ethanfel 8195c3114a feat: add SelVA VAE Roundtrip node
Encodes audio through the VAE then decodes straight back, bypassing the
diffusion model entirely. Use this to isolate whether saturation artifacts
are introduced by the codec reconstruction (VAE/DAC) or by the LoRA.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 19:15:20 +02:00
Ethanfel c8e6b91f67 feat: add alpha_scale_sweep to fix LoRA noise contamination
Previous sweep used alpha=rank (scale=1.0) which at rank 128/256 drowned
base model priors — spectral flatness went from 0.013 (baseline) to 0.094.
This sweep tests alpha dramatically below rank across r16/r32/r128 to find
the scale where LoRA nudges rather than overwrites.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:55:05 +02:00
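Sketch for c8e6b91f67: the standard LoRA scaling factor is alpha / rank, which is why alpha=rank gives the scale of 1.0 mentioned above. The alpha values printed here are illustrative only and are not taken from the sweep file.

```python
def lora_scale(alpha, rank):
    """Multiplier applied to the LoRA update (standard alpha / rank rule)."""
    return alpha / rank

# alpha = rank reproduces scale 1.0; smaller alphas shrink the update.
for rank in (16, 32, 128):
    for alpha in (rank, rank / 4, rank / 16):
        print(f"r={rank} alpha={alpha:g} scale={lora_scale(alpha, rank):.4f}")
```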
Ethanfel fdce9cbbf1 feat: evaluate adapters on all dataset clips, not just clip_001
- _eval_sample gains clip_idx param (default 0, backward compatible)
- Evaluator loops over all dataset clips per adapter, saves one WAV per clip
- Reference metrics computed for all clips and averaged
- Comparison chart and summary use avg_metrics across all clips
- Eliminates bias from evaluating on an unrepresentative single clip

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:42:55 +02:00
Ethanfel 42ceb4b153 fix: preserve original audio extension when copying reference file
shutil.copy2 was writing FLAC binary to reference.wav — unplayable.
Now copies as reference{.flac/.wav/etc} matching the source extension.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:31:26 +02:00
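Sketch for 42ceb4b153: keeping the source suffix when copying the reference clip. `copy_reference` is a hypothetical helper name; only `shutil.copy2` and the `reference` base name come from the message.

```python
import shutil
from pathlib import Path

def copy_reference(src, output_dir):
    """Copy src to output_dir/reference<ext>, reusing the source extension."""
    src = Path(src)
    dst = Path(output_dir) / f"reference{src.suffix}"
    shutil.copy2(src, dst)
    return dst
```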
Ethanfel 4505b89db1 feat: add reference audio to LoRA evaluator
Loads the first clip's original audio (same clip used for inference),
copies it to output_dir/reference.wav, runs spectral metrics and
saves a spectrogram. Appears first in the comparison chart so generated
samples can be judged against the target sound.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:30:33 +02:00
Ethanfel dbfa7b23fe feat: add eval_r128_candidates.json
Evaluates top 5 adapters from r128_sweet_spot: baseline, lr_5e4_r128,
lr_3e4_r256, lr_3e4_r128, curriculum_lr_3e4 final + step 6000 checkpoint
(before regression) for spectral comparison.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:28:28 +02:00
Ethanfel d2e1ea7b80 feat: add SelVA LoRA Evaluator node
Generates audio samples from a list of adapters against a fixed reference
clip, collects spectral metrics for each, and outputs a comparison bar
chart + eval_summary.json. Useful for comparing sweep candidates before
committing to a next round of training.

JSON format: name, data_dir, output_dir, steps, seed, adapters[{id, path}].
Empty path = baseline (no LoRA).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:26:50 +02:00
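Sketch for d2e1ea7b80: an evaluator file using the keys listed above, plus a small loader. All values (paths, ids, step count, seed) are placeholders, and `load_eval_config` and the `is_baseline` field are hypothetical additions for illustration.

```python
import json

EXAMPLE = """
{
  "name": "example_eval",
  "data_dir": "path/to/dataset/features",
  "output_dir": "path/to/eval_output",
  "steps": 25,
  "seed": 42,
  "adapters": [
    {"id": "baseline", "path": ""},
    {"id": "candidate_a", "path": "path/to/adapter_final.pt"}
  ]
}
"""

REQUIRED = {"name", "data_dir", "output_dir", "steps", "seed", "adapters"}

def load_eval_config(text):
    """Parse the JSON and flag the entry with an empty path as the baseline."""
    cfg = json.loads(text)
    missing = REQUIRED - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for adapter in cfg["adapters"]:
        adapter["is_baseline"] = adapter["path"] == ""
    return cfg
```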
Ethanfel 9a47508d2d fix: lower RMS normalization target from -23/-20 to -27 dBFS
Training clips at -23 LUFS measure -25 to -31 dBFS RMS (avg ~-27).
Normalizing output to -23 dBFS was 4-8 dB too loud, causing saturation
on clips with high crest factor and peaks near 0 dBFS.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 17:19:20 +02:00
Ethanfel 678c050f11 fix: make normalize(x1) assignment explicit in training loop
normalize() uses in-place ops so it worked, but reading the return value
makes the intent clear and guards against future refactors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 15:43:42 +02:00
Ethanfel 1be07a80d2 feat: add cosine LR decay schedule to trainer and scheduler
- Add lr_schedule param (constant|cosine) to SelvaLoraTrainer
- Cosine decays LR from initial value to ~0 after warmup, preventing
  the oscillation observed at steps 6000-8000 with lr=2e-4 flat
- Wire lr_schedule through scheduler _PARAM_DEFAULTS and _train_inner call
- Add g5_r128_lr_2e4_cosine and g5_r128_lr_3e4_cosine to r128_sweet_spot sweep

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 13:25:01 +02:00
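Sketch for 1be07a80d2: a constant or cosine schedule with optional warmup. The function name and the linear warmup shape are assumptions; the message only states that the cosine schedule decays from the initial LR to roughly zero after warmup.

```python
import math

def lr_at(step, total_steps, base_lr, warmup_steps=0, schedule="cosine"):
    """Learning rate at a given step for schedule in {'constant', 'cosine'}."""
    if warmup_steps and step < warmup_steps:
        # Assumption: linear ramp up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    if schedule == "constant":
        return base_lr
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    # Half-cosine from base_lr at progress 0 down to 0 at progress 1.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```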
Ethanfel 58e1985af2 feat: SelVA Skip Experiment node + save partial scalars on skip
- New node: SelVA Skip Experiment — writes skip_current.flag from UI,
  queue in a second workflow tab while scheduler is running
- SkipExperiment now attaches partial loss/grad/spectral data to the
  exception so the scheduler saves all collected scalars in the summary

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 13:10:43 +02:00
Ethanfel 264dc49d42 feat: skip_current.flag to cancel experiment and move to next
Create the flag file in the sweep output_root to skip the running
experiment at the next log interval (every 50 steps):
  touch /path/to/experiment/skip_current.flag

Scheduler marks it as 'skipped' in the summary and continues.
Skipped experiments are NOT resumed on restart (unlike failed ones).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 13:09:01 +02:00
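Sketch for 264dc49d42: polling for the flag file at each log interval. `check_skip` is a hypothetical helper, and deleting the flag once seen is an assumption (so that the following experiment is not skipped as well); the message does not say how the flag is cleared.

```python
from pathlib import Path

class SkipExperiment(Exception):
    """Raised to abandon the running experiment and move to the next one."""

def check_skip(output_root, step, log_interval=50):
    """Raise SkipExperiment if skip_current.flag exists at a log step."""
    if step % log_interval != 0:
        return
    flag = Path(output_root) / "skip_current.flag"
    if flag.exists():
        flag.unlink()  # assumption: consume the flag after honoring it
        raise SkipExperiment(f"skipped at step {step}")
```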
Ethanfel fec5c86f09 feat: add spectral_flatness and temporal_variance to eval metrics
spectral_flatness (Wiener entropy) — 0=tonal, 1=white noise.
Rising value across steps directly flags noise contamination.
temporal_variance — RMS std/mean per frame. Low = lifeless/compressed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:45:40 +02:00
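Sketch for fec5c86f09: both metrics in plain Python, assuming a precomputed power spectrum for flatness and non-overlapping frames for the RMS statistic. The frame length, the epsilon, and the function signatures are assumptions.

```python
import math

def spectral_flatness(power, eps=1e-12):
    """Wiener entropy: geometric mean over arithmetic mean of the power spectrum."""
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)

def temporal_variance(samples, frame_len=1024):
    """Std of per-frame RMS divided by its mean."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    if not frames:
        raise ValueError("signal shorter than one frame")
    rms = [math.sqrt(sum(s * s for s in f) / len(f)) for f in frames]
    mean = sum(rms) / len(rms)
    std = math.sqrt(sum((r - mean) ** 2 for r in rms) / len(rms))
    return std / (mean + 1e-12)
```

A flat spectrum gives a flatness near 1, while a spectrum dominated by a single bin gives a value near 0.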
Ethanfel 2861327016 feat: spectral metrics per eval sample in experiment summary
Computes hf_energy_ratio (>4kHz), spectral_centroid_hz, spectral_rolloff_hz
at each save_every checkpoint. Logged to console and stored in
experiment_summary.json under results.spectral_metrics[step].

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:44:43 +02:00
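Sketch for 2861327016: the three metrics computed from a list of bin frequencies and bin powers. The 4 kHz split comes from the message; the 85% rolloff threshold and the function name are assumptions.

```python
def spectral_summary(freqs_hz, power, hf_cutoff_hz=4000.0, rolloff=0.85):
    """Return HF energy ratio, spectral centroid, and rolloff frequency."""
    total = sum(power)
    if total <= 0.0:
        raise ValueError("spectrum has no energy")
    hf = sum(p for f, p in zip(freqs_hz, power) if f > hf_cutoff_hz)
    centroid = sum(f * p for f, p in zip(freqs_hz, power)) / total
    # Rolloff: lowest frequency below which `rolloff` of the energy lies.
    rolloff_hz = freqs_hz[-1]
    cumulative = 0.0
    for f, p in zip(freqs_hz, power):
        cumulative += p
        if cumulative >= rolloff * total:
            rolloff_hz = f
            break
    return {
        "hf_energy_ratio": hf / total,
        "spectral_centroid_hz": centroid,
        "spectral_rolloff_hz": rolloff_hz,
    }
```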
Ethanfel c4687521ef feat: save spectrogram PNG alongside each eval sample
Log-frequency dB spectrogram (inferno colormap, 100Hz–16kHz) saved as
step_XXXXX.png next to step_XXXXX.wav in samples/ subfolder.
Makes high-frequency rolloff (low bitrate signature) immediately visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:42:34 +02:00
Ethanfel 8717af2728 fix: prevent saturation from RMS normalization clipping peaks
RMS normalize to target then scale back if peaks exceed 1.0,
preserving dynamics instead of hard-clipping transients.
Eval sample target updated to -23 dBFS to match training data.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:29:29 +02:00
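Sketch for 8717af2728: RMS normalization that backs the gain off when peaks would exceed full scale, rather than hard-clipping. `rms_normalize` is a hypothetical name. The -23 dBFS default mirrors this commit; 9a47508d2d later lowers the target to -27 dBFS.

```python
import math

def rms_normalize(samples, target_dbfs=-23.0):
    """Scale to the target RMS, then reduce gain if any peak exceeds 1.0."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)
    gain = 10.0 ** (target_dbfs / 20.0) / rms
    peak = max(abs(s) for s in samples) * gain
    if peak > 1.0:
        # Scale the whole signal back so dynamics are preserved.
        gain /= peak
    return [s * gain for s in samples]
```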
Ethanfel 78e9838a83 fix: replace peak normalization with RMS normalization at -20 dBFS
Peak norm was slamming output to full scale regardless of content level,
making generated audio several times louder than training clips.
RMS norm to -20 dBFS matches typical processed audio level.
Sampler exposes target_lufs (-40 to -6, default -20) for user control.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 12:06:48 +02:00
Ethanfel 94610b8943 feat: r128_sweet_spot sweep — noise-free LR search + rank 256
9 experiments targeting loss 0.25-0.35 without LoRA+ noise.
Tests higher base LR (2e-4/3e-4/5e-4), curriculum combos, conservative
LoRA+ ratio=4, and rank 256 baseline + lr=3e-4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 10:46:08 +02:00
Ethanfel f5f7f2ae68 fix: eval sample seed 0 -> 42
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 10:32:43 +02:00
Ethanfel 1663b39833 fix: bump eval sample to 25 ODE steps (was 8)
Inference is fast on RTX PRO 6000 — 8 steps was washing out quality
differences between experiments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 10:32:27 +02:00
Ethanfel a7923d5fb7 feat: r64_overnight sweep — focused rank-64 ablation at 8000 steps
15 experiments across rank (64/128), alpha, regularisation, LR, target
layers, and combined stacks. Based on tier1_thorough early results
confirming rank 64 sounds best perceptually.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 01:32:23 +02:00
Ethanfel 786a57c424 feat: sweep resume + 5 additional experiments (LR, target, extended)
Scheduler: on re-run, reads existing experiment_summary.json and skips
already-completed experiments — safe to stop and restart mid-sweep.

tier1_thorough: adds g5 (lr 3e-5/3e-4), g6 (full target attn.qkv+linear1
at r16 and r64), and g4_full_r64_6k (6000-step extended run) — 17 total.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:59:16 +02:00
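Sketch for 786a57c424: filtering a sweep against an existing experiment_summary.json. The summary layout used here (an `experiments` list with `id` and `status` fields) is an assumption, as is the helper name.

```python
import json
from pathlib import Path

def pending_experiments(sweep, summary_path):
    """Return the sweep entries not yet marked completed in the summary file."""
    path = Path(summary_path)
    done = set()
    if path.exists():
        summary = json.loads(path.read_text())
        done = {e["id"] for e in summary.get("experiments", [])
                if e.get("status") == "completed"}
    return [e for e in sweep if e["id"] not in done]
```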
Ethanfel f15e02b0b8 fix: eval samples use fixed clip/seed, save to samples/ subfolder
- Always sample dataset[0] with fixed noise seed so checkpoints are
  directly comparable (hear the model improve step by step)
- Save to output_dir/samples/step_XXXXX.wav instead of alongside checkpoints

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:54:37 +02:00
Ethanfel 0682a536cb fix: point data_dir to features/ subdir where .npz and audio live
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:45:32 +02:00
Ethanfel 0000878e76 feat: thorough overnight sweep + dataset browser updates
- Dataset browser: audio/features now resolve through features/ subdir
- tier1_sweep.json: update data_dir to BJ dataset path
- tier1_thorough.json: 12-experiment overnight sweep across 4 groups
  (rank 16/32/64, alpha scaling, LoRA+/dropout/curriculum isolation,
  full Tier 1 stack at r16 and r64) — output to BJ/experiment/

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 00:38:19 +02:00
Ethanfel 675644189d feat: add SelVA Dataset Browser node
Companion node for inspecting dataset.json entries by integer index.
Outputs video (.mp4), audio (.wav/.flac), features (.npz), frames dir,
mask dir, label, and max_index for constraining the index widget range.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 14:55:27 +02:00
Ethanfel 82fb7a0009 docs: note AudioX shows no perceptual quality gain on V2A vs SelVA
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:12:00 +02:00
Ethanfel af4777d2d7 docs: add AudioX vs SelVA evaluation
Architecture comparison, capability matrix, integration cost estimate,
LoRA training difficulty analysis, and license implications.
Verdict: SelVA remains preferred for V2A + LoRA fine-tuning; AudioX
adds value for music generation, inpainting, and text-to-audio tasks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-07 09:11:09 +02:00
Ethanfel ed8abf7a5b docs: add video format recommendations to dataset preparation section
New section 1.1 covers aspect ratio (16:9 landscape preferred), resolution
(≥480p), frame rate (any, use VHS_VIDEOINFO), and portrait handling
(center-crop to square). Based on CLIP 384px and Synchformer 224px internals.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:44:14 +02:00
Ethanfel 21ed93d3ee docs: add audio dataset pipeline reference doc
Full research notes on cleaning, augmentation, and quality metrics for
generative model training. Covers LUFS normalization, AudioSep, waveform
augmentation (pitch shift, RIR, EQ), latent mixup, DNSMOS gating, tool
install commands, and key paper references.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:37:48 +02:00
Ethanfel f1e2bbd55b feat: add first experiment sweep file for Tier 1 ablation
6 experiments: baseline, LoRA+ (ratio=16), dropout 0.05, dropout 0.1,
curriculum sampling, and all three combined. bf16 batch 16, 2000 steps,
seed 42. data_dir placeholder needs to be updated before running.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:15:06 +02:00
Ethanfel 3d9221c248 fix: three bugs in scheduler and trainer
- trainer: raise ValueError early when remaining steps < log_interval (50)
  instead of UnboundLocalError on smoothed_img/final_path at return
- trainer: use None in grad_norm_history instead of silent 0.0 when
  grad_accum > log_interval and no optimizer step fired in the interval
- trainer: include start_step in _train_inner return dict
- scheduler: use start_step from result dict for min_loss_step and
  loss_at_steps (fixes wrong step labels on resumed experiments)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:11:25 +02:00
Ethanfel 2d200395af feat: add grad norm logging and richer experiment summary output
trainer:
- Track gradient norm before clipping at each optimizer step
- Log avg grad_norm per log_interval alongside loss in console output
- Include grad_norm_history in _train_inner return dict

scheduler:
- Add system block to summary (GPU name, VRAM, torch/CUDA version)
- Include full loss_history and grad_norm_history arrays in each
  experiment result (50-step resolution, not just save_every checkpoints)
- Add loss_std_last_quarter stability metric (std dev of raw loss over
  last 25% of steps — high value indicates unstable training)
- Add log_interval field so consumers know the x-axis resolution

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:06:39 +02:00
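Sketch for 2d200395af: the loss_std_last_quarter metric as a standalone function. Using the population standard deviation and a minimum window of two points are assumptions; the message only specifies the std dev of raw loss over the last 25% of steps.

```python
import statistics

def loss_std_last_quarter(loss_history):
    """Std dev of the final 25% of logged losses; higher means less stable."""
    window = max(2, len(loss_history) // 4)
    tail = loss_history[-window:]
    return statistics.pstdev(tail)
```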
Ethanfel 3ec380a27e feat: add SelVA LoRA Scheduler node for automated experiment sweeps
- Extract _prepare_dataset() from SelvaLoraTrainer.train() as a module-level
  function so the dataset can be encoded once and reused across experiments
- Change _train_inner() return value from tuple to dict (adds loss_history,
  meta, completed; train() unpacks for ComfyUI — no change to node outputs)
- New SelvaLoraScheduler node: reads a JSON sweep file, runs N experiments
  sequentially, writes experiment_summary.json (updated after each run) and
  loss_comparison.png with all smoothed curves overlaid on the same axes
- Register SelvaLoraScheduler in nodes/__init__.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 13:03:21 +02:00
Ethanfel 9bc2568543 docs: document LoRA dropout, LoRA+, and curriculum timestep sampling
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:45:53 +02:00
Ethanfel eb63c1ead7 feat: add LoRA dropout, LoRA+ asymmetric LR, and curriculum timestep sampling
- LoRA dropout: applied to the LoRA path only (not frozen base weights),
  0.05–0.1 helps regularize on small datasets (arXiv:2404.09610)
- LoRA+: separate optimizer param groups for lora_A and lora_B with
  configurable LR ratio; ratio=16 enables LoRA+ (arXiv:2402.12354)
- Curriculum mode: logit_normal for first N% of steps then uniform,
  directly addresses early convergence + fine-detail degradation at
  boundaries (arXiv:2603.12517)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 12:43:18 +02:00
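Sketch for eb63c1ead7: building the two LoRA+ parameter groups as plain dicts, the list-of-dicts shape that PyTorch optimizers accept. Matching on the substrings `lora_A` and `lora_B` and the helper name are assumptions. Following the LoRA+ paper, the B matrices get the higher learning rate.

```python
def lora_plus_groups(named_params, base_lr, ratio):
    """Split (name, param) pairs into A and B groups; B trains at base_lr * ratio."""
    a_params = [p for name, p in named_params if "lora_A" in name]
    b_params = [p for name, p in named_params if "lora_B" in name]
    return [
        {"params": a_params, "lr": base_lr},
        {"params": b_params, "lr": base_lr * ratio},
    ]
```

With ratio=1 both groups share one learning rate, which reduces to ordinary LoRA training.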
Ethanfel 5baa070e61 docs: add observations section with fp32/batch/precision findings
Work-in-progress empirical notes: fp32 batch 32 reaches same quality as
bf16 batch 16 in 1/3 the steps but overfits past ~2000 steps on 10 clips.
Lower loss does not reliably mean better audio on small datasets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 02:34:53 +02:00
Ethanfel 9fc739fe9e docs: add prompt guide and masking note to dataset preparation section
Poor prompts and missing masks are a common source of white noise in LoRA
training — imprecise sync features force the adapter to compensate with noise.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:43:28 +02:00
Ethanfel 57fae4a8ce chore: default timestep_mode back to uniform
logit_normal reaches lower loss but perceptual improvement over uniform
is dataset-dependent. Keeping uniform as default to match original MMAudio
training behavior; logit_normal remains available as an option.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:21:08 +02:00
Ethanfel 8e919c0459 fix: resolve relative and Unix-style output_dir paths to ComfyUI output folder
On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:14:04 +02:00
Ethanfel fec8eaac95 fix: save adapter and loss curves on cancel, not only on normal completion
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 01:06:44 +02:00
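Sketch for fec8eaac95: the try/finally shape that guarantees a save on both paths. `Cancelled`, `run_step`, and `save_adapter` are stand-ins for whatever the trainer actually uses; only the two file-name patterns come from the message.

```python
class Cancelled(Exception):
    """Stand-in for the interrupt raised when the user cancels."""

def train(total_steps, run_step, save_adapter):
    step = 0
    completed = False
    try:
        while step < total_steps:
            run_step(step)
            step += 1
        completed = True
    finally:
        # Runs on normal completion and on cancellation alike.
        if completed:
            save_adapter("adapter_final.pt")
        else:
            save_adapter(f"adapter_cancelled_step{step:05d}.pt")
```

The exception still propagates after the finally block, so the caller can tell a cancelled run from a finished one.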
Ethanfel d83632e754 fix: pad/trim clip and sync features to fixed seq_len at dataset load time
Clips from shorter videos produce fewer CLIP frames (e.g. 2s → 16 frames,
8s → 64 frames). Mixed-length datasets would cause torch.stack() to fail
during batching. Normalize to seq_cfg.clip_seq_len / sync_seq_len at load,
same as latents are already normalized to latent_seq_len.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:54:05 +02:00
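Sketch for d83632e754: forcing a per-clip feature sequence to a fixed length so clips can be stacked. Lists stand in for tensors here, and zero-padding is an assumption; the message does not say what value is used for padding.

```python
def fit_to_length(frames, target_len):
    """Trim or zero-pad a list of feature vectors to exactly target_len rows."""
    if not frames:
        raise ValueError("need at least one frame to infer the feature width")
    if len(frames) >= target_len:
        return frames[:target_len]
    width = len(frames[0])
    padding = [[0.0] * width for _ in range(target_len - len(frames))]
    return frames + padding
```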
Ethanfel a5014e49eb feat: add logit-normal timestep sampling to reduce white noise artifacts
Uniform timestep sampling undertrained t>0.8 (the final denoising steps),
leaving residual noise that CFG amplifies at inference. Logit-normal sampling
concentrates training near t=0.5 while still covering the full range, improving
high-t coverage and reducing noise floor in generated audio.

Default changed from uniform to logit_normal (sigma=1.0). Previous behavior
available with timestep_mode=uniform.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 00:35:42 +02:00
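Sketch for a5014e49eb: drawing a timestep under either mode. A logit-normal sample is a Gaussian draw passed through a sigmoid. The zero mean and the function name are assumptions; sigma=1.0 and the two mode names come from the message.

```python
import math
import random

def sample_timestep(mode="logit_normal", sigma=1.0, rng=random):
    """Draw one timestep in the unit interval."""
    if mode == "uniform":
        return rng.random()
    # Sigmoid of a Gaussian: samples cluster around t = 0.5
    # while still reaching toward both ends of the range.
    z = rng.gauss(0.0, sigma)
    return 1.0 / (1.0 + math.exp(-z))
```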