torchcodec/libavutil soname mismatch causes torchaudio to fail on every
file load, silently emptying clips. Add _load_wav() that tries torchaudio
first then falls back to soundfile (handles wav/flac without ffmpeg).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
After the finally block, offload_to_cpu moves the vocoder to CPU while
ref_mel stays on GPU. Fix: detect vocoder's current device via
next(vocoder.parameters()).device and move ref_mel there before vocoding.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves baseline.wav (ground truth roundtrip before training), stepN.wav
at each save_every checkpoint, and final.wav after training completes.
All use the same fixed reference segment (clip 0, position 0) for
direct comparison across checkpoints.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
44k variants use BigVGANv2 directly as the vocoder (no wrapper, no
@inference_mode decorator), accessible at feature_utils.tod.vocoder.
16k wraps BigVGANVocoder inside BigVGAN, accessed at .vocoder.vocoder.
Both trainer and loader now branch on model["mode"].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Spectral-loss-only fine-tuning of the BigVGAN vocoder (mel→waveform)
on BJ audio clips. DiT and VAE are completely frozen. Losses: mel L1
reconstruction + multi-resolution STFT magnitude L1 (same three
resolutions as the BigVGAN discriminator config). Saves in
{'generator': state_dict} format compatible with the original BigVGAN
checkpoint. Loader replaces vocoder weights in the loaded SELVA_MODEL
in-place so no full model reload is needed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two improvements for stronger steering effect:
1. Apply steering only during the conditional predict_flow pass by
monkey-patching predict_flow to set a flag via identity check
(cond is conditions). Hooks skip the unconditional pass, so
steering is amplified by cfg_strength (~4.5x) instead of canceling
out in the CFG guidance term.
2. Restore per-position [seq, hidden] steering vectors instead of
seq-averaged [hidden]. More spatially specific — captures positional
activation patterns rather than a global mean. Seq length mismatch
at inference time handled via linear interpolation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements per-block DiT activation steering as an alternative to textual
inversion. Extractor runs frozen generator on dataset with BJ vs empty
conditions, records mean hidden-state delta per block, saves [hidden_dim]
vectors (seq-averaged so they broadcast to any inference duration). Loader
reads the bundle. Sampler registers forward hooks during the ODE that add
strength × vec to each block output, cleaned up in a finally block.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TI via text conditioning produces buzz because SelVA's text path is
mean-pooled into a global DiT bias — not rich per-token cross-attention
like SD. The optimizer learns a constant spectral artifact rather than
semantic style shift.
ti_strength=1.0 (default) = full injection as before.
ti_strength<1.0 = lerp between original and injected text_clip,
allowing the effect to be dialled back without retraining.
Applies to both text_clip and neg_text_clip symmetrically.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Diagnosis: learned tokens grew to norm ~3.2 while real CLIP content tokens
sit at ~1.0. Model never trained on embeddings that large — activates buzz
artifact instead of semantic style shift.
Fix: measure mean token norm from content positions (1–20) of dataset CLIP
embeddings at startup, clamp learned_tokens per-token after every optimizer
step to max 1.5× that reference (50% headroom). Token norm is now logged
as current/limit for easy monitoring.
ti_sweep_1.json: rebuild around norm_clamp group — n4_clamped (primary
diagnostic), prefix_clamped, n8_prefix_clamped, warm_clamped.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
n4_baseline showed token_norm growing linearly without plateau — classic
sign of lr too high relative to parameter count. With only K×1024 params,
gradient signal per param is already high-magnitude; high lr causes
overshoot rather than convergence.
- Default lr: 1e-3 → 2e-4 (matches LoRA working regime)
- Default batch_size: 16 → 4 (more diverse gradients, helps norm saturate)
- ti_sweep_1.json: add lr_batch group (lr_low_b4, lr_mid_b8,
lr_low_b4_prefix, lr_2e3), restructure with clearer groups,
annotate n4_baseline as completed with findings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously the comparison PNG was only written at the very end of the sweep,
so an interrupted run produced no image at all. Now _save_comparison() is
called right after _write_summary() for every successful experiment, keeping
loss_comparison.png current throughout the sweep.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Saves baseline.wav + baseline.png in the checkpoint dir using the same
seed as the TI eval samples — direct A/B comparison at every checkpoint
without re-generating the baseline each time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Observation: n4_baseline loss barely moved (1.025→0.965 over 3000 steps),
token_norm grew linearly without plateau — generator likely ignores last-K
CLIP positions (EOS/padding zone) where suffix injects.
Fix: add inject_mode parameter throughout the pipeline:
- "suffix": replace last K positions (original behavior, model may ignore)
- "prefix": replace positions 1:1+K right after BOS — highest attention
weight in CLIP, much stronger gradient signal expected
Changes:
- selva_textual_inversion_trainer.py: _inject_tokens() helper centralises
the torch.cat construction for both modes; used in training loop and eval;
inject_mode stored in checkpoint files
- selva_textual_inversion_loader.py: reads inject_mode from checkpoint,
includes in TEXTUAL_INVERSION bundle
- selva_sampler.py: uses _inject_tokens() via bundle's inject_mode field
- selva_ti_scheduler.py: inject_mode in _PARAM_DEFAULTS, config, and
_train_inner call
- ti_sweep_1.json: updated with prefix_inject group (n4, n8, n4+warm);
n4_baseline marked completed; suffix experiments retained for comparison
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reuses _draw_loss_curve + _smooth_losses + _pil_to_tensor from the LoRA
trainer — raw loss in light blue, smoothed overlay in blue, matches the
LoRA trainer's visual style.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- SelvaTiScheduler: runs a JSON-defined sweep of TI training experiments,
loading the dataset once and reusing it across runs
- Collects per-experiment loss history, final/min loss, stability metric
(loss_std_last_quarter), and duration — written to experiment_summary.json
after each completed run so partial sweeps survive interruption
- Resume-aware: skips experiments already marked completed in an existing
summary file
- Outputs smoothed loss comparison chart (same axes, one curve per experiment)
- SelvaTextualInversionTrainer._train_inner now returns a dict
{embeddings_path, loss_history} so the scheduler can read results;
train() extracts just the path for ComfyUI
JSON format: name, description, data_dir, output_root, base config,
experiments list with id + param overrides
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Wrap _spectral_metrics + _save_spectrogram in try-except so a matplotlib
or STFT error doesn't abort the checkpoint save (matches LoRA trainer)
- Remove unused `import math` and `_pil_to_tensor` import
- Drop dead `img` variable (_save_spectrogram returns None)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace in-place text_clip assignment with torch.cat so the computation
graph correctly links text_input → learned_tokens; in-place assignment
into a requires_grad=False leaf severs the graph and learned_tokens
receives no gradients
- _spectral_metrics(wav, sr): was passing wav.unsqueeze(0) [1,1,L] instead
of wav [1,L]; stft mean(dim=1) would return wrong shape [1,T] not [n_freqs]
- _save_spectrogram(wav, sr, ...): was passing wav.squeeze(0) [L] (1D)
instead of wav [1,L] as the function expects
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Learns K CLIP token embeddings ([K, 1024]) with all model weights frozen,
keeping generated latents on the decoder's natural manifold — avoids the
quality degradation that affects LoRA on BJ's audio dataset.
- selva_textual_inversion_trainer.py: trains learned_tokens via AdamW,
injects into last K positions of 77-token CLIP embedding, checkpoints
with eval audio + spectral metrics
- selva_textual_inversion_loader.py: loads .pt bundle, returns
TEXTUAL_INVERSION dict for sampler
- selva_sampler.py: optional textual_inversion input; injects into both
text_clip and neg_text_clip before preprocess_conditions
- __init__.py: registers both new nodes
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two ComfyUI nodes to reduce domain mismatch between custom training audio
and the MMAudio VAE's expected spectral distribution:
SelvaHfSmoother: blends a low-pass filtered copy (biquad) with the original
at a configurable cutoff and blend ratio. Attenuates extreme HF content that
BigVGANv2 handles poorly. RMS-preserving.
SelvaSpectralMatcher: computes the log-mel energy profile of the clip,
compares it per-band to the VAE's normalization means (DATA_MEAN_80D/128D),
and applies a smooth STFT-domain gain correction to match the codec's training
distribution. Configurable strength and max_gain_db clamp. RMS-preserving.
Recommended workflow: SpectralMatcher → HfSmoother → feature extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AutoEncoderModule unconditionally asserts vocoder_ckpt_path is not None
even when need_vae_encoder=True. Pass best_netG.pt to satisfy the assert;
the vocoder weights are not actually used since decode+vocode go through
model["feature_utils"].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Load fresh FeaturesUtils only for encoding; use model["feature_utils"] for
decode+vocode to mirror the exact path the sampler takes
- Apply generator.normalize() → unnormalize() around the encoded latent so the
decoder receives latents in the same space it expects from inference
- Log both encoded and norm→unnorm latent stats to diagnose round-trip fidelity
- Normalize output to -27 dBFS (matching training clip RMS) and clamp to [-1, 1]
to prevent clipping artifacts in the output waveform
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without this the decoder produced 7s instead of 8s due to STFT rounding.
Same fix as _prepare_dataset uses for training data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Encodes audio through the VAE then decodes straight back, bypassing the
diffusion model entirely. Use this to isolate whether saturation artifacts
are introduced by the codec reconstruction (VAE/DAC) or by the LoRA.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _eval_sample gains clip_idx param (default 0, backward compatible)
- Evaluator loops over all dataset clips per adapter, saves one WAV per clip
- Reference metrics computed for all clips and averaged
- Comparison chart and summary use avg_metrics across all clips
- Eliminates bias from evaluating on an unrepresentative single clip
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
shutil.copy2 was writing FLAC binary to reference.wav — unplayable.
Now copies as reference{.flac/.wav/etc} matching the source extension.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Loads the first clip's original audio (same clip used for inference),
copies it to output_dir/reference.wav, runs spectral metrics and
saves a spectrogram. Appears first in the comparison chart so generated
samples can be judged against the target sound.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Generates audio samples from a list of adapters against a fixed reference
clip, collects spectral metrics for each, and outputs a comparison bar
chart + eval_summary.json. Useful for comparing sweep candidates before
committing to a next round of training.
JSON format: name, data_dir, output_dir, steps, seed, adapters[{id, path}].
Empty path = baseline (no LoRA).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Training clips at -23 LUFS measure -25 to -31 dBFS RMS (avg ~-27).
Normalizing output to -23 dBFS was 4-8 dB too loud, causing saturation
on clips with high crest factor and peaks near 0 dBFS.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
normalize() uses in-place ops so it worked, but reading the return value
makes the intent clear and guards against future refactors.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add lr_schedule param (constant|cosine) to SelvaLoraTrainer
- Cosine decays LR from initial value to ~0 after warmup, preventing
the oscillation observed at steps 6000-8000 with lr=2e-4 flat
- Wire lr_schedule through scheduler _PARAM_DEFAULTS and _train_inner call
- Add g5_r128_lr_2e4_cosine and g5_r128_lr_3e4_cosine to r128_sweet_spot sweep
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- New node: SelVA Skip Experiment — writes skip_current.flag from UI,
queue in a second workflow tab while scheduler is running
- SkipExperiment now attaches partial loss/grad/spectral data to the
exception so the scheduler saves all collected scalars in the summary
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Create the flag file in the sweep output_root to skip the running
experiment at the next log interval (every 50 steps):
touch /path/to/experiment/skip_current.flag
Scheduler marks it as 'skipped' in the summary and continues.
Skipped experiments are NOT resumed on restart (unlike failed ones).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Computes hf_energy_ratio (>4kHz), spectral_centroid_hz, spectral_rolloff_hz
at each save_every checkpoint. Logged to console and stored in
experiment_summary.json under results.spectral_metrics[step].
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log-frequency dB spectrogram (inferno colormap, 100Hz–16kHz) saved as
step_XXXXX.png next to step_XXXXX.wav in samples/ subfolder.
Makes high-frequency rolloff (low bitrate signature) immediately visible.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RMS normalize to target then scale back if peaks exceed 1.0,
preserving dynamics instead of hard-clipping transients.
Eval sample target updated to -23 dBFS to match training data.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Peak norm was slamming output to full scale regardless of content level,
making generated audio several times louder than training clips.
RMS norm to -20 dBFS matches typical processed audio level.
Sampler exposes target_lufs (-40 to -6, default -20) for user control.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Inference is fast on RTX PRO 6000 — 8 steps was washing out quality
differences between experiments.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scheduler: on re-run, reads existing experiment_summary.json and skips
already-completed experiments — safe to stop and restart mid-sweep.
tier1_thorough: adds g5 (lr 3e-5/3e-4), g6 (full target attn.qkv+linear1
at r16 and r64), and g4_full_r64_6k (6000-step extended run) — 17 total.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Always sample dataset[0] with fixed noise seed so checkpoints are
directly comparable (hear the model improve step by step)
- Save to output_dir/samples/step_XXXXX.wav instead of alongside checkpoints
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dataset browser: audio/features now resolve through features/ subdir
- tier1_sweep.json: update data_dir to BJ dataset path
- tier1_thorough.json: 12-experiment overnight sweep across 4 groups
(rank 16/32/64, alpha scaling, LoRA+/dropout/curriculum isolation,
full Tier 1 stack at r16 and r64) — output to BJ/experiment/
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Companion node for inspecting dataset.json entries by integer index.
Outputs video (.mp4), audio (.wav/.flac), features (.npz), frames dir,
mask dir, label, and max_index for constraining the index widget range.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- trainer: raise ValueError early when remaining steps < log_interval (50)
instead of UnboundLocalError on smoothed_img/final_path at return
- trainer: use None in grad_norm_history instead of silent 0.0 when
grad_accum > log_interval and no optimizer step fired in the interval
- trainer: include start_step in _train_inner return dict
- scheduler: use start_step from result dict for min_loss_step and
loss_at_steps (fixes wrong step labels on resumed experiments)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
trainer:
- Track gradient norm before clipping at each optimizer step
- Log avg grad_norm per log_interval alongside loss in console output
- Include grad_norm_history in _train_inner return dict
scheduler:
- Add system block to summary (GPU name, VRAM, torch/CUDA version)
- Include full loss_history and grad_norm_history arrays in each
experiment result (50-step resolution, not just save_every checkpoints)
- Add loss_std_last_quarter stability metric (std dev of raw loss over
last 25% of steps — high value indicates unstable training)
- Add log_interval field so consumers know the x-axis resolution
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Extract _prepare_dataset() from SelvaLoraTrainer.train() as a module-level
function so the dataset can be encoded once and reused across experiments
- Change _train_inner() return value from tuple to dict (adds loss_history,
meta, completed; train() unpacks for ComfyUI — no change to node outputs)
- New SelvaLoraScheduler node: reads a JSON sweep file, runs N experiments
sequentially, writes experiment_summary.json (updated after each run) and
loss_comparison.png with all smoothed curves overlaid on the same axes
- Register SelvaLoraScheduler in nodes/__init__.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- LoRA dropout: applied to the LoRA path only (not frozen base weights),
0.05–0.1 helps regularize on small datasets (arXiv:2404.09610)
- LoRA+: separate optimizer param groups for lora_A and lora_B with
configurable LR ratio; ratio=16 enables LoRA+ (arXiv:2402.12354)
- Curriculum mode: logit_normal for first N% of steps then uniform,
directly addresses early convergence + fine-detail degradation at
boundaries (arXiv:2603.12517)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
logit_normal reaches lower loss but perceptual improvement over uniform
is dataset-dependent. Keeping uniform as default to match original MMAudio
training behavior; logit_normal remains available as an option.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
On Windows, /folder is drive-relative (no drive letter) rather than a real
absolute path. Redirect these to ComfyUI's output directory so files don't
land at C:\folder. Also redirects plain relative paths (e.g. lora_output)
to output/ instead of the process working directory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wraps training loop in try/finally so adapter_final.pt and loss PNGs are
always written. On cancellation the adapter is named
adapter_cancelled_stepXXXXX.pt so it can be used with --resume.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>