Recent torchaudio releases default to torchcodec as the audio backend, which
requires the FFmpeg shared libraries. Fall back to soundfile in environments
where torchcodec cannot load (e.g. containerised ComfyUI without a system FFmpeg).
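
A minimal sketch of the fallback order described above, not the code from
this commit. The helper names (load_with_fallback, _load_torchaudio,
_load_soundfile, DEFAULT_LOADERS) are hypothetical; only torchaudio.load and
soundfile.read are real library calls. Backends are imported lazily so that a
missing FFmpeg library only surfaces when that backend is actually tried.

```python
def load_with_fallback(path, loaders):
    """Try each (name, loader) pair in order; return (name, result) of the first success."""
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:  # ImportError, missing FFmpeg libs, decode errors, ...
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError(f"no audio backend could load {path}: " + "; ".join(errors))


def _load_torchaudio(path):
    # Hypothetical wrapper: imported lazily so a broken backend only fails here.
    import torchaudio
    waveform, sample_rate = torchaudio.load(path)  # waveform: (channels, frames)
    return waveform, sample_rate


def _load_soundfile(path):
    # Hypothetical wrapper around soundfile, returning the same layout as above.
    import soundfile as sf
    import torch
    data, sample_rate = sf.read(path, dtype="float32", always_2d=True)  # (frames, channels)
    return torch.from_numpy(data.T).contiguous(), sample_rate


# Preferred backend first, fallback second.
DEFAULT_LOADERS = [("torchaudio", _load_torchaudio), ("soundfile", _load_soundfile)]
```

Calling load_with_fallback(path, DEFAULT_LOADERS) returns the name of the
backend that succeeded together with the (waveform, sample_rate) pair, which
makes it easy to log which path was taken.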
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
torch.stft requires float32 input, so casting vae_utils to bf16 caused silent
failures during dataset pre-loading. Also add traceback.print_exc() so that
future clip-load errors are visible in the ComfyUI log.
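
A sketch of the error-reporting pattern, assuming a per-clip loader is passed
in. preload_clips and load_clip are hypothetical names; traceback.print_exc()
is the standard-library call mentioned above. A failing clip is reported with
its full stack trace and skipped instead of disappearing silently.

```python
import traceback


def preload_clips(paths, load_clip):
    """Load every clip, printing a traceback for each failure instead of hiding it."""
    loaded, failed = [], []
    for path in paths:
        try:
            loaded.append((path, load_clip(path)))
        except Exception:
            print(f"[preload] failed to load clip: {path}")
            traceback.print_exc()  # writes the full stack trace to stderr
            failed.append(path)
    return loaded, failed
```

In this sketch, load_clip is also where a cast back to float32 (for example
waveform.float() before the STFT) would belong.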
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every save_every steps, run a quick 8-step no-CFG inference pass on
a random training clip and save the decoded waveform as
sample_stepXXXXX.wav next to the checkpoint. Uses the existing
generator.unnormalize + feature_utils.decode + vocode pipeline from
the sampler. Failure is non-fatal (logged and skipped).
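
The control flow can be sketched as follows. This is an illustration under
assumptions, not the committed code: sample_path, maybe_save_sample, render
and write_wav are hypothetical names, and the five-digit zero padding is
inferred from the XXXXX placeholder. The render callable stands in for the
8-step inference plus unnormalize/decode/vocode pipeline.

```python
import os
import random
import traceback


def sample_path(checkpoint_dir, step):
    # Assumption: XXXXX means a five-digit, zero-padded step number.
    return os.path.join(checkpoint_dir, f"sample_step{step:05d}.wav")


def maybe_save_sample(step, save_every, clips, checkpoint_dir, render, write_wav, rng=random):
    """Render and save one sample on save steps; never raise into the training loop."""
    if save_every <= 0 or step % save_every != 0:
        return None
    try:
        clip = rng.choice(clips)       # random training clip
        waveform = render(clip)        # stand-in for inference + decode + vocode
        path = sample_path(checkpoint_dir, step)
        write_wav(path, waveform)
        return path
    except Exception:
        # Non-fatal: log the failure and let training continue.
        print(f"[sample] step {step}: sample generation failed, skipping")
        traceback.print_exc()
        return None
```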
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Send the updated loss curve to the ComfyUI frontend every 50 steps via
  pbar_train.update_absolute() with a JPEG preview tuple, the same
  mechanism KSampler uses for its denoising previews.
- Fix x-axis step labels for resumed runs (previously they always started
  at 0; now they correctly show start_step + offset).
- Split _draw_loss_curve (returns a PIL Image) from _pil_to_tensor
  (converts it for the ComfyUI IMAGE output).
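
The x-axis fix amounts to offsetting each logged point by the step the run
resumed from. A small sketch, assuming one loss value is recorded after every
log_every steps; loss_curve_steps and log_every are hypothetical names, and
the exact offset convention in the node may differ.

```python
def loss_curve_steps(num_points, start_step=0, log_every=50):
    """Absolute training step for each logged loss value.

    Assumption: the first point is recorded log_every steps after start_step.
    """
    return [start_step + (i + 1) * log_every for i in range(num_points)]
```

For a run resumed at step 1000, loss_curve_steps(3, start_step=1000) gives
[1050, 1100, 1150] rather than labels that restart from 0.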
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs the full training loop inside ComfyUI. Reuses the already-loaded
CLIP model from the inference model for text encoding; loads only a
minimal VAE encoder separately (freed after dataset pre-loading).
Outputs:
- SELVA_MODEL with LoRA applied (ready to connect directly to Sampler)
- adapter_path STRING (for SelVA LoRA Loader in future sessions)
- loss_curve IMAGE (PIL-rendered line chart of training loss, one point per 50 steps)
Progress is shown via ComfyUI ProgressBar (two phases: dataset loading,
then training steps). Resume is supported via resume_path input.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>