ComfyUI-SelVA

Author	SHA1	Message	Date
Ethanfel	057bfb813d	feat: add SelvaDatasetResampler node (soxr VHQ) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:13:45 +02:00
Ethanfel	2c71d4c184	feat: add SelvaDatasetLoader node Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 14:09:43 +02:00
Ethanfel	d25df10aa5	feat: add audio dataset pipeline skeleton	2026-04-09 14:05:31 +02:00
Ethanfel	d70a4d2123	docs: add audio dataset pipeline implementation plan	2026-04-09 14:02:46 +02:00
Ethanfel	2b10205657	fix: raise segment_seconds max from 4s to 30s Hardcoded max of 4.0 prevented using full 8s clips. Raised to 30s. Also bumped default from 1.0 to 2.0 as a more sensible starting point. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:49:50 +02:00
Ethanfel	8166c56552	perf: gradient checkpointing on vocoder forward to reduce activation memory BigVGAN's 512x upsampling stack stores huge intermediate activations for backward even in snake_alpha_only mode (only 5K trainable params, but activation graph runs through the full network after each snake op). Wrapping vocoder() in checkpoint(use_reentrant=False) recomputes activations during backward instead of storing them — ~2x compute cost, large reduction in peak VRAM. Should allow batch_size > 1 on 96 GB without OOM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:45:24 +02:00
Ethanfel	eece79ccae	fix: correct MRD channel width to 128 and unload models before training Two bugs: 1. _DiscriminatorR used channels=32 but the BigVGAN pretrained discriminator checkpoint has channels=128. All convs in _DiscriminatorR now use 128, matching the checkpoint architecture so state_dict loads without error. 2. BigVGAN trainer OOM: SelVA generator and other ComfyUI models remain in VRAM during training (~90 GiB used). Add unload_all_models() + cache flush before the training loop to reclaim VRAM headroom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 13:40:01 +02:00
Ethanfel	357b875e5e	fix: strip inference tensor flags in DITTO optimizer Two crash paths under "RuntimeError: Inference tensors cannot be saved for backward": 1. clip_f / sync_f loaded from main-thread inference_mode carry the inference flag. Clone them on entry to the worker thread so the conditions built from them are clean non-inference tensors. Also clone x after Phase 1 before the STE reconnection — Phase 1 runs under no_grad and produces outputs that may still carry the flag through the conditions path. 2. net_generator.unnormalize + feature_utils.decode called outside any checkpoint wrapper with requires_grad=True input. Backward tried to save inference-flagged model weights. Wrapped both calls in checkpoint(use_reentrant=False) so they recompute on backward instead of storing activations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:18:20 +02:00
Ethanfel	211494a91c	fix: DITTO gradient never reached x0, remove unused imports and dead code DITTO critical bug: x was reassigned on every ODE step, so by the time loss.backward() ran, x pointed to the final output tensor (grad_fn, not a leaf) and x.grad was always None. The manual gradient transfer never fired — x0 was never updated. The optimization was a no-op. Fix: use a straight-through estimator after the no-grad prefix: x = x + (x0 - x0.detach()) This adds zero value but creates a grad_fn back to x0, so backward() propagates ∂loss/∂x (at the Phase-1/2 boundary) directly to x0.grad. Equivalent to truncated BPTT with ∂x_prefix/∂x0 ≈ I. Also remove unused imports (SelvaSampler, _inject_tokens, random) that caused cascade ImportError risk, and remove dead trainable_count variable in BigVGAN trainer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:10:02 +02:00
Ethanfel	1e9551152e	feat: add DITTO optimizer, upgrade BigVGAN trainer, document all nodes BigVGAN trainer (selva_bigvgan_trainer.py): - Add snake_alpha_only train mode: tunes only ~27K per-channel α params (0.024% of 112M) — physically cannot cause harmonic smearing - Add lambda_l2sp: L2-SP anchor regularization toward pretrained weights - Add optional discriminator_path: frozen MPD+MRD feature matching loss replaces mel L1 when a BigVGAN discriminator checkpoint is provided - Inline MPD + MRD discriminator implementations (no extra dependencies) DITTO optimizer (selva_ditto_optimizer.py): - New node: inference-time noise optimization (arXiv:2401.12179) - Optimizes x₀ via mel Gram matrix style loss against BJ reference clips - All model weights frozen — zero quality degradation risk - Truncated BPTT through last n_grad_steps of the ODE (configurable) - Gradient checkpointing on each differentiated step Docs: - README: document all 20 nodes (was 3), add workflow diagrams - STYLE_TRANSFER.md: new guide — DITTO, vocoder fine-tuning tiers, why LoRA/TI fail, combined approach, dataset prep Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 12:04:05 +02:00
Ethanfel	f17f6f0863	feat: save ground truth spectrogram once for direct comparison Writes _gt_spec.png from ref_mel before training starts so each step's _spec.png can be compared against the unmodified vocoder roundtrip target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:05:47 +02:00
Ethanfel	304d9d01bf	feat: save mel spectrogram PNG alongside each eval sample Adds _save_spectrogram() using PIL only (no matplotlib). Each _save_sample call now writes both a .wav and a _spec.png so training progress is visible without listening. Colour map is blue→green→yellow (viridis-ish), low frequencies at the bottom. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:03:28 +02:00
Ethanfel	0128a81cc2	fix: use full first clip for eval samples instead of 1s segment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 03:01:52 +02:00
Ethanfel	710261f5be	fix: add soundfile fallback for torchaudio.save in sample writing Same environment has no compatible ffmpeg/torchcodec for saving. Mirror the _load_wav pattern: try torchaudio, fall back to soundfile. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:58:07 +02:00
Ethanfel	5df2abd6dd	fix: handle all three inference-tensor sources in vocoder sanitization remove_parametrizations() stores weight as a plain __dict__ tensor (not nn.Parameter), making it invisible to _parameters iteration. Also, buffers (Activation1d anti-aliasing filters) are inference tensors that break the backward graph mid-network. Fix all three categories: 1. _parameters: clone().detach(), wrap as Parameter 2. plain __dict__ tensors: clone(), register_parameter (also makes trainable) 3. _buffers: clone() to strip inference flag without parametrizing Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:54:41 +02:00
Ethanfel	b243908873	debug: inspect conv_pre parametrizations and _parameters keys Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:46:16 +02:00
Ethanfel	9df855ee0e	debug: print is_inference() status before failing conv_pre call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:41:51 +02:00
Ethanfel	78f8aa98ad	fix: clone inference tensors at thread entry to strip the inference flag torch.inference_mode is thread-local, but the inference flag lives on the tensor object. Operations on inference tensors always propagate it, even in a clean thread. The only escape is .clone() called outside inference_mode. At thread entry (inference_mode disabled): clone clips and mel_converter buffers to get clean normal tensors before any training computation. Vocoder parameter clone() also now works correctly in this thread context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:35:48 +02:00
Ethanfel	e870446b0f	fix: run BigVGAN training in a fresh thread to escape inference_mode torch.inference_mode is thread-local. ComfyUI sets it on the node-execution thread; inference_mode(False) alone is insufficient to escape it in some environments (e.g. async wrappers, lora-manager hook). A new thread always starts clean. Moved all training logic into _do_train() called via threading.Thread so every tensor is a normal autograd tensor by default. Simplified parameter cloning: clone().detach().requires_grad_(True). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:30:53 +02:00
Ethanfel	df63b147e9	fix: sanitize all submodule buffers of mel_converter + guarantee target_mel output Previous fix only iterated mel_converter._buffers (direct buffers). Submodules (e.g. Spectrogram.window) still held inference tensors. Switch to .modules() to cover all nested buffers, matching the vocoder parameter sanitization. Also add a zeros+copy_ safety net on target_mel output so conv can save it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:14:12 +02:00
Ethanfel	51ac099073	fix: sanitize target_flat — clips are inference tensors from outer inference_mode The clips list is built inside ComfyUI's inference_mode context, so every element is an inference tensor. torch.stack().clone() propagates the flag. Use zeros+copy_ (same pattern as params/buffers) to get a normal tensor, so mel_converter(target_flat) inside no_grad produces a saveable input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:09:26 +02:00
Ethanfel	b7565ec458	fix: sanitize inference tensors in BigVGAN trainer via zeros+copy_ pattern param.data.clone() and tensor.detach() on inference tensors both produce inference tensors — the flag propagates through all operations on them. Inside inference_mode(False), torch.zeros() creates genuine normal tensors. Use zeros+copy_ to sanitize both vocoder parameters and mel_converter buffers once before training, so autograd can save inputs for backward. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 02:05:36 +02:00
Ethanfel	0fcb6d3106	fix(bigvgan-trainer): replace parameter objects to fully strip inference tensor flag param.data = clone() only replaces storage — the nn.Parameter object itself retains the inference tensor flag set when the model was loaded. Replace each parameter with a fresh nn.Parameter(data.clone()) created inside inference_mode(False) so both the object and its data are normal tensors. Move optimizer creation to after re-creation so it references the new objects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:58:57 +02:00
Ethanfel	c86306bde8	fix(bigvgan-trainer): clone vocoder parameters to strip inference tensor flag The vocoder is loaded inside ComfyUI's torch.inference_mode(), making all its parameters inference tensors. Autograd cannot save inference tensors for backward even with requires_grad=True. Clone all parameters inside torch.inference_mode(False) before training to get normal tensors. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:55:16 +02:00
Ethanfel	f04d59fe63	fix(bigvgan-trainer): clone mel outputs to strip inference tensor flag from buffers mel_converter buffers (mel_basis, hann_window) are inference tensors because the model was loaded inside ComfyUI's torch.inference_mode(). Operations on them propagate the flag to outputs. Clone both target_mel and pred_mel to get normal autograd-compatible tensors. .clone() is differentiable so the grad graph to vocoder parameters is preserved. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:51:28 +02:00
Ethanfel	daa36a5f7b	fix(bigvgan-trainer): clone target tensor to exit inference mode before backward Clips loaded outside torch.inference_mode(False) are inference tensors. Autograd cannot save them for backward. .clone() creates a normal tensor, same fix pattern as selva_lora_trainer's dist.mode().clone(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:47:47 +02:00
Ethanfel	16e20b30ce	fix(bigvgan-trainer): cast audio to model dtype to match bf16 mel_converter buffers Model loaded in bf16 causes mel_basis buffer to be bf16. Audio loaded from disk is float32, causing matmul dtype mismatch. Cast all audio tensors to model["dtype"] before passing to mel_converter/vocoder. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:46:01 +02:00
Ethanfel	ea7dfed27a	fix(bigvgan-trainer): fallback to soundfile when torchaudio ffmpeg backend fails torchcodec/libavutil soname mismatch causes torchaudio to fail on every file load, silently emptying clips. Add _load_wav() that tries torchaudio first then falls back to soundfile (handles wav/flac without ffmpeg). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:41:59 +02:00
Ethanfel	81ff0d46c9	fix(bigvgan-trainer): resolve device mismatch in _save_sample after offload After the finally block, offload_to_cpu moves the vocoder to CPU while ref_mel stays on GPU. Fix: detect vocoder's current device via next(vocoder.parameters()).device and move ref_mel there before vocoding. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:35:07 +02:00
Ethanfel	9fdeb65182	feat(bigvgan-trainer): add eval samples at checkpoints and end Saves baseline.wav (ground truth roundtrip before training), stepN.wav at each save_every checkpoint, and final.wav after training completes. All use the same fixed reference segment (clip 0, position 0) for direct comparison across checkpoints. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:30:34 +02:00
Ethanfel	790a53e3df	fix(bigvgan): add 44k/BigVGANv2 support to trainer and loader 44k variants use BigVGANv2 directly as the vocoder (no wrapper, no @inference_mode decorator), accessible at feature_utils.tod.vocoder. 16k wraps BigVGANVocoder inside BigVGAN, accessed at .vocoder.vocoder. Both trainer and loader now branch on model["mode"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:28:32 +02:00
Ethanfel	9c784b4bdb	feat: add BigVGAN vocoder fine-tuner and loader nodes Spectral-loss-only fine-tuning of the BigVGAN vocoder (mel→waveform) on BJ audio clips. DiT and VAE are completely frozen. Losses: mel L1 reconstruction + multi-resolution STFT magnitude L1 (same three resolutions as the BigVGAN discriminator config). Saves in {'generator': state_dict} format compatible with the original BigVGAN checkpoint. Loader replaces vocoder weights in the loaded SELVA_MODEL in-place so no full model reload is needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:26:12 +02:00
Ethanfel	115a0c3718	feat(steering): conditional-only injection + per-position vectors Two improvements for stronger steering effect: 1. Apply steering only during the conditional predict_flow pass by monkey-patching predict_flow to set a flag via identity check (cond is conditions). Hooks skip the unconditional pass, so steering is amplified by cfg_strength (~4.5x) instead of canceling out in the CFG guidance term. 2. Restore per-position [seq, hidden] steering vectors instead of seq-averaged [hidden]. More spatially specific — captures positional activation patterns rather than a global mean. Seq length mismatch at inference time handled via linear interpolation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 01:02:51 +02:00
Ethanfel	95923cdf42	feat: add activation steering pipeline (extractor, loader, sampler injection) Implements per-block DiT activation steering as an alternative to textual inversion. Extractor runs frozen generator on dataset with BJ vs empty conditions, records mean hidden-state delta per block, saves [hidden_dim] vectors (seq-averaged so they broadcast to any inference duration). Loader reads the bundle. Sampler registers forward hooks during the ODE that add strength × vec to each block output, cleaned up in a finally block. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 00:38:26 +02:00
Ethanfel	28ee3db337	feat(sampler): add ti_strength blend for TI injection TI via text conditioning produces buzz because SelVA's text path is mean-pooled into a global DiT bias — not rich per-token cross-attention like SD. The optimizer learns a constant spectral artifact rather than semantic style shift. ti_strength=1.0 (default) = full injection as before. ti_strength<1.0 = lerp between original and injected text_clip, allowing the effect to be dialled back without retraining. Applies to both text_clip and neg_text_clip symmetrically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-09 00:07:57 +02:00
Ethanfel	b89167cfae	fix(ti-trainer): clamp token norm to CLIP manifold to prevent buzz artifacts Diagnosis: learned tokens grew to norm ~3.2 while real CLIP content tokens sit at ~1.0. Model never trained on embeddings that large — activates buzz artifact instead of semantic style shift. Fix: measure mean token norm from content positions (1–20) of dataset CLIP embeddings at startup, clamp learned_tokens per-token after every optimizer step to max 1.5× that reference (50% headroom). Token norm is now logged as current/limit for easy monitoring. ti_sweep_1.json: rebuild around norm_clamp group — n4_clamped (primary diagnostic), prefix_clamped, n8_prefix_clamped, warm_clamped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:54:23 +02:00
Ethanfel	f9d092158a	fix(ti): lower default lr/batch, add lr_batch sweep group n4_baseline showed token_norm growing linearly without plateau — classic sign of lr too high relative to parameter count. With only K×1024 params, gradient signal per param is already high-magnitude; high lr causes overshoot rather than convergence. - Default lr: 1e-3 → 2e-4 (matches LoRA working regime) - Default batch_size: 16 → 4 (more diverse gradients, helps norm saturate) - ti_sweep_1.json: add lr_batch group (lr_low_b4, lr_mid_b8, lr_low_b4_prefix, lr_2e3), restructure with clearer groups, annotate n4_baseline as completed with findings Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:42:22 +02:00
Ethanfel	92535deab2	fix(ti-scheduler): save comparison image after each completed experiment Previously the comparison PNG was only written at the very end of the sweep, so an interrupted run produced no image at all. Now _save_comparison() is called right after _write_summary() for every successful experiment, keeping loss_comparison.png current throughout the sweep. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:39:30 +02:00
Ethanfel	0b24207ca5	feat(ti-trainer): generate baseline.wav once before training starts Saves baseline.wav + baseline.png in the checkpoint dir using the same seed as the TI eval samples — direct A/B comparison at every checkpoint without re-generating the baseline each time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:33:28 +02:00
Ethanfel	e1a2f0ed7d	feat: add inject_mode (suffix/prefix) to TI pipeline Observation: n4_baseline loss barely moved (1.025→0.965 over 3000 steps), token_norm grew linearly without plateau — generator likely ignores last-K CLIP positions (EOS/padding zone) where suffix injects. Fix: add inject_mode parameter throughout the pipeline: - "suffix": replace last K positions (original behavior, model may ignore) - "prefix": replace positions 1:1+K right after BOS — highest attention weight in CLIP, much stronger gradient signal expected Changes: - selva_textual_inversion_trainer.py: _inject_tokens() helper centralises the torch.cat construction for both modes; used in training loop and eval; inject_mode stored in checkpoint files - selva_textual_inversion_loader.py: reads inject_mode from checkpoint, includes in TEXTUAL_INVERSION bundle - selva_sampler.py: uses _inject_tokens() via bundle's inject_mode field - selva_ti_scheduler.py: inject_mode in _PARAM_DEFAULTS, config, and _train_inner call - ti_sweep_1.json: updated with prefix_inject group (n4, n8, n4+warm); n4_baseline marked completed; suffix experiments retained for comparison Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:31:52 +02:00
Ethanfel	f96265da23	feat(ti-trainer): add loss curve IMAGE output Reuses _draw_loss_curve + _smooth_losses + _pil_to_tensor from the LoRA trainer — raw loss in light blue, smoothed overlay in blue, matches the LoRA trainer's visual style. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:20:44 +02:00
Ethanfel	c0d95ce356	feat: add ti_sweep_1 experiment file First TI sweep covering the three most impactful axes: - token_count group: n_tokens 4 / 8 / 16 (capacity vs overfitting) - learning_rate group: 5e-4 / 1e-3 / 2e-3 with n_tokens=4 - warm_init group: n4 and n8 seeded from 'mechanical impact sound design' 7 experiments total, 3000 steps each, same data_dir as LoRA sweeps. n4_baseline (lr=1e-3, random init) is the primary reference point. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:14:31 +02:00
Ethanfel	e37bfe1b1c	feat: add SelVA TI Scheduler for sweep-based textual inversion experiments - SelvaTiScheduler: runs a JSON-defined sweep of TI training experiments, loading the dataset once and reusing it across runs - Collects per-experiment loss history, final/min loss, stability metric (loss_std_last_quarter), and duration — written to experiment_summary.json after each completed run so partial sweeps survive interruption - Resume-aware: skips experiments already marked completed in an existing summary file - Outputs smoothed loss comparison chart (same axes, one curve per experiment) - SelvaTextualInversionTrainer._train_inner now returns a dict {embeddings_path, loss_history} so the scheduler can read results; train() extracts just the path for ComfyUI JSON format: name, description, data_dir, output_root, base config, experiments list with id + param overrides Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:13:04 +02:00
Ethanfel	bb07bc8169	fix(ti-trainer): guard spectral metrics, drop unused imports - Wrap _spectral_metrics + _save_spectrogram in try-except so a matplotlib or STFT error doesn't abort the checkpoint save (matches LoRA trainer) - Remove unused `import math` and `_pil_to_tensor` import - Drop dead `img` variable (_save_spectrogram returns None) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:10:19 +02:00
Ethanfel	e36cdd7947	fix(ti-trainer): fix gradient flow and spectral metric shapes - Replace in-place text_clip assignment with torch.cat so the computation graph correctly links text_input → learned_tokens; in-place assignment into a requires_grad=False leaf severs the graph and learned_tokens receives no gradients - _spectral_metrics(wav, sr): was passing wav.unsqueeze(0) [1,1,L] instead of wav [1,L]; stft mean(dim=1) would return wrong shape [1,T] not [n_freqs] - _save_spectrogram(wav, sr, ...): was passing wav.squeeze(0) [L] (1D) instead of wav [1,L] as the function expects Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:08:13 +02:00
Ethanfel	e56ece9c1c	feat: add SelVA Textual Inversion Trainer and Loader nodes Learns K CLIP token embeddings ([K, 1024]) with all model weights frozen, keeping generated latents on the decoder's natural manifold — avoids the quality degradation that affects LoRA on BJ's audio dataset. - selva_textual_inversion_trainer.py: trains learned_tokens via AdamW, injects into last K positions of 77-token CLIP embedding, checkpoints with eval audio + spectral metrics - selva_textual_inversion_loader.py: loads .pt bundle, returns TEXTUAL_INVERSION dict for sampler - selva_sampler.py: optional textual_inversion input; injects into both text_clip and neg_text_clip before preprocess_conditions - __init__.py: registers both new nodes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 23:01:44 +02:00
Ethanfel	eed7eefeac	feat: add SelVA HF Smoother and Spectral Matcher preprocessing nodes Two ComfyUI nodes to reduce domain mismatch between custom training audio and the MMAudio VAE's expected spectral distribution: SelvaHfSmoother: blends a low-pass filtered copy (biquad) with the original at a configurable cutoff and blend ratio. Attenuates extreme HF content that BigVGANv2 handles poorly. RMS-preserving. SelvaSpectralMatcher: computes the log-mel energy profile of the clip, compares it per-band to the VAE's normalization means (DATA_MEAN_80D/128D), and applies a smooth STFT-domain gain correction to match the codec's training distribution. Configurable strength and max_gain_db clamp. RMS-preserving. Recommended workflow: SpectralMatcher → HfSmoother → feature extraction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 20:28:16 +02:00
Ethanfel	107bb05f17	fix(vae-roundtrip): pass bigvgan path to encoder-only FeaturesUtils AutoEncoderModule unconditionally asserts vocoder_ckpt_path is not None even when need_vae_encoder=True. Pass best_netG.pt to satisfy the assert; the vocoder weights are not actually used since decode+vocode go through model["feature_utils"]. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 20:05:44 +02:00
Ethanfel	10e6095e31	fix(vae-roundtrip): use model feature_utils for decode, add normalize/unnormalize, normalize output - Load fresh FeaturesUtils only for encoding; use model["feature_utils"] for decode+vocode to mirror the exact path the sampler takes - Apply generator.normalize() → unnormalize() around the encoded latent so the decoder receives latents in the same space it expects from inference - Log both encoded and norm→unnorm latent stats to diagnose round-trip fidelity - Normalize output to -27 dBFS (matching training clip RMS) and clamp to [-1, 1] to prevent clipping artifacts in the output waveform Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 19:50:01 +02:00
Ethanfel	528d33be39	fix: trim/pad latent to seq_cfg.latent_seq_len before decoding Without this the decoder produced 7s instead of 8s due to STFT rounding. Same fix as _prepare_dataset uses for training data. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-08 19:22:09 +02:00
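
Note on 528d33be39: the trim/pad step in that commit forces the latent to a fixed number of frames before decoding. Below is a minimal sketch of the idea in plain Python, not the repository's code. It assumes a latent can be treated as a list of equal-width frames and that short inputs are padded with zeros; the commit does not state the pad value. The name `trim_or_pad` is hypothetical.

```python
from typing import List, Sequence


def trim_or_pad(
    frames: Sequence[Sequence[float]],
    target_len: int,
    pad_value: float = 0.0,  # assumption: the commit does not say what value is used
) -> List[List[float]]:
    """Return a copy of `frames` with exactly `target_len` frames.

    Extra frames are dropped from the end; missing frames are appended
    as constant-valued frames of the same width as the first frame.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")

    # Trim: keep at most target_len frames (copy so the input is untouched).
    out = [list(frame) for frame in frames[:target_len]]

    # Pad: append constant frames until the target length is reached.
    if len(out) < target_len:
        width = len(out[0]) if out else 0
        out.extend([pad_value] * width for _ in range(target_len - len(out)))
    return out


if __name__ == "__main__":
    latent = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    print(len(trim_or_pad(latent, 2)), len(trim_or_pad(latent, 5)))
```

In the real pipeline the target length would come from `seq_cfg.latent_seq_len`. Fixing the frame count before decoding is what keeps the output at the full clip length rather than a shorter one caused by STFT rounding.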


218 Commits
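
Note on b89167cfae: that commit measures a reference token norm from content positions 1–20 of the dataset CLIP embeddings, then clamps each learned token to at most 1.5× that reference after every optimizer step. The sketch below illustrates the arithmetic in plain Python under stated assumptions. It uses nested lists instead of tensors, and the helper names (`l2_norm`, `reference_norm`, `clamp_token_norms`) are hypothetical, not taken from the repository.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_norm(vec: Vector) -> float:
    """Euclidean norm of one token embedding."""
    return math.sqrt(sum(x * x for x in vec))


def reference_norm(
    embeddings: Sequence[Sequence[Vector]], first: int = 1, last: int = 20
) -> float:
    """Mean token norm over positions first..last (inclusive) of every embedding."""
    norms = [l2_norm(tok) for emb in embeddings for tok in emb[first:last + 1]]
    if not norms:
        raise ValueError("no tokens found in the requested position range")
    return sum(norms) / len(norms)


def clamp_token_norms(tokens: Sequence[Vector], limit: float) -> List[List[float]]:
    """Rescale any token whose norm exceeds `limit`; leave the rest unchanged."""
    clamped = []
    for tok in tokens:
        norm = l2_norm(tok)
        scale = limit / norm if norm > limit else 1.0
        clamped.append([x * scale for x in tok])
    return clamped


if __name__ == "__main__":
    # One toy "embedding": a BOS slot, 20 unit-norm content tokens, one padding slot.
    dataset = [[[0.0, 0.0]] + [[1.0, 0.0]] * 20 + [[9.0, 9.0]]]
    limit = 1.5 * reference_norm(dataset)  # 50% headroom over the reference
    learned = clamp_token_norms([[3.0, 4.0], [0.6, 0.8]], limit)
    print(limit, [round(l2_norm(t), 3) for t in learned])
```

Rescaling keeps each token's direction and only shrinks its magnitude, so a token already inside the limit is left untouched.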