Ethanfel e5110b88e1 feat: auto input_sr — detect bandwidth and pick the best value
New "auto" option (now the default) on the Sampler's input_sr. detect_input_sr
finds the spectral cutoff cliff (steepest drop) and its dB confidence: effective
cutoff = that cliff if confident, else sr/2 — one rule that covers band-limited
(→ matched input_sr), full-band (→ 24000), and genuine low-rate files
(→ their rate). Rounds DOWN to the nearest supported Nyquist to avoid feeding
the model an empty band. Logs its decision. Falls back to 24000 when unsure.

Tests cover sharp 4/6/8/12 kHz cutoffs, full-band, genuine-8kHz, silence, stereo.
Verified end-to-end on the real model (8 kHz clip -> auto picks 16000).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 12:46:02 +02:00
2026-06-01 12:59:42 +02:00

ComfyUI-UniverSR

Audio super-resolution for ComfyUI — upscale low-bandwidth audio to a full 48 kHz with UniverSR, Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching (ICASSP 2026).

ICASSP 2026 arXiv Demo License: MIT

One model upscales 8 / 12 / 16 / 24 kHz effective bandwidth → 48 kHz for speech, music and sound effects. It works directly in the complex-STFT domain with flow matching (no neural vocoder) and regenerates the missing high-frequency band instead of merely interpolating, so muffled or band-limited audio gets believable "air" and detail back.

UniverSR overview



Features

  • 🎚️ 8 / 12 / 16 / 24 kHz → 48 kHz with a single model — speech, music, SFX.
  • 🧩 Two-node design — a cached Model Loader + a Super-Resolution sampler.
  • ⬇️ Auto-download of the official checkpoints into models/universr/ on first use.
  • 🔗 Long-audio chunking with click-free overlap-add (handles clips of any length).
  • 🎧 Stereo-aware — each channel is processed independently and preserved.
  • 🎛️ Wet/dry blend — full SR, or dial it back to gently brighten already-48 kHz audio (BWE).
  • 🎲 Seed control with global-RNG isolation (won't perturb other nodes' randomness).
  • 📊 Optional before/after spectrogram image output.
  • 🎬 Video in / out — extract a video's audio, super-resolve it, and remux it back onto the original video (no video re-encode), all with ffmpeg.
  • 📦 Self-contained — the UniverSR inference code is vendored; the only extra dependency beyond ComfyUI's stack is torchdiffeq.

Installation

cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-UniverSR.git
pip install -r ComfyUI-UniverSR/requirements.txt

Then restart ComfyUI. The nodes appear under the audio/UniverSR category.

Dependencies. torch, torchaudio, numpy and matplotlib already ship with ComfyUI. This node only adds:

torchdiffeq   einops   timm   huggingface_hub   pyyaml

(einops/timm/huggingface_hub/pyyaml are usually already present; torchdiffeq is the one that typically needs installing.) The universr package itself is vendored under vendor/ — if a pip-installed copy is found it is preferred, otherwise the bundled one is used, so no git+ install is required.

The video nodes additionally need ffmpeg on your PATH (apt install ffmpeg / brew install ffmpeg / conda install -c conda-forge ffmpeg) and soundfile (in requirements.txt). The audio SR nodes work without either.

GPU recommended. Inference runs on CUDA if available and falls back to CPU (much slower).


Models

| Preset | Domain | Hugging Face | Notes |
|---|---|---|---|
| universr-audio | General (music / SFX / mixed) | woongzip1/universr-audio | Recommended default. |
| universr-speech | Speech / voice | woongzip1/universr-speech | Tuned for voice recordings. |

Each preset is ~230 MB and downloads automatically to ComfyUI/models/universr/<preset>/ the first time you load it (it lands as config.yaml + pytorch_model.bin).

Manual / offline install — drop the two files into ComfyUI/models/universr/<name>/ yourself:

huggingface-cli download woongzip1/universr-audio \
  --local-dir ComfyUI/models/universr/universr-audio

Any folder you place under models/universr/ that contains config.yaml + pytorch_model.bin will also show up in the loader's model dropdown.
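The folder rule above can be sketched as a small directory scan. This is only an illustration: the helper name `find_local_checkpoints` is hypothetical, and the node's actual discovery code may differ.

```python
from pathlib import Path

# The two files a checkpoint folder must contain, per the text above.
REQUIRED_FILES = ("config.yaml", "pytorch_model.bin")


def find_local_checkpoints(models_dir):
    """Return the names of sub-folders that contain both required files.

    Hypothetical helper for illustration; not part of the node's API.
    """
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(
        child.name
        for child in root.iterdir()
        if child.is_dir()
        and all((child / name).is_file() for name in REQUIRED_FILES)
    )
```

Calling `find_local_checkpoints("ComfyUI/models/universr")` would then list every folder that qualifies for the dropdown under this rule.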


Nodes

LoadAudio ─────────────┐
                       ▼
UniverSR Model Loader ─► UniverSR Super-Resolution ─► SaveAudio / PreviewAudio
                                          └─ spectrogram ─► PreviewImage

UniverSR Model Loader

Loads (and caches) a checkpoint. Output: UNIVERSR_MODEL.

| Input | Type | Default | Description |
|---|---|---|---|
| model | choice | universr-audio | Preset to download, or a local checkpoint folder found under models/universr/. |
| device | auto / cuda / cpu | auto | Where to load the weights. auto picks CUDA when available. |
| tf32 (opt.) | bool | False | TF32 matmul + conv on Ampere+ (~1.15×). Tonally neutral in testing but not bit-exact; off = reference fp32. |
| compile (opt.) | bool | False | torch.compile the network (~2×). See Performance. |
| local_path (opt.) | string | "" | Override: a folder with config.yaml + pytorch_model.bin, or a raw training checkpoint (.pth / .ckpt). |
| config_path (opt.) | string | "" | config.yaml to pair with a raw checkpoint. Empty → the bundled default config. |

The loaded model is cached by (path, device), so re-running a graph or reusing the loader across runs does not reload the weights.
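A cache keyed by (path, device) can be sketched as below. The names `_MODEL_CACHE` and `load_cached`, and the `loader` callback, are assumptions made for illustration; the node's own cache may be organised differently.

```python
# Module-level cache: one entry per (checkpoint path, device) pair.
_MODEL_CACHE = {}


def load_cached(path, device, loader):
    """Return the model for (path, device); call `loader` only on a cache miss.

    `loader` is any callable taking (path, device) and returning a model.
    """
    key = (str(path), str(device))
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(path, device)
    return _MODEL_CACHE[key]
```

With this shape, the same checkpoint on the same device is loaded once, while loading it on a different device creates a second entry.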

UniverSR Super-Resolution

Runs the super-resolution. Outputs: AUDIO (48 kHz) and IMAGE (spectrogram).

| Input | Type | Default | Range | Description |
|---|---|---|---|---|
| audio | AUDIO | | | Input audio (any sample rate / mono or stereo). |
| model | UNIVERSR_MODEL | | | From the Model Loader. |
| input_sr | choice | auto | auto / 8000 / 12000 / 16000 / 24000 | Effective input bandwidth (Hz). Content is valid up to input_sr/2 and regenerated above it. auto detects the cutoff for you (see Choosing input_sr). |
| ode_method | choice | midpoint | euler / midpoint / rk4 | ODE solver. euler fastest → midpoint balanced → rk4 best. |
| ode_steps | int | 4 | 1–64 | Flow-matching integration steps. 4 is fast & validated; 4–10 is a good range. |
| guidance_scale | float | 1.5 | 0–6 | Classifier-free guidance. Higher = denser highs but less faithful. 0 disables CFG. |
| seed | int | 0 | | Noise seed for the flow source. 0 = random each run. |
| chunk_seconds | float | 10.0 | 0–120 | Process long audio in chunks this long to bound VRAM. 0 = whole clip at once. |
| overlap_seconds | float | 0.5 | 0–5 | Crossfade overlap between chunks (prevents seam clicks). |
| blend | float | 1.0 | 0–1 | Wet/dry mix. 1.0 = full SR; lower keeps more of the original. |
| unload_model | bool | false | | Free the model from VRAM after this run. |
| show_spectrogram | bool | true | | Also output a before/after spectrogram comparison image. |
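To make the ode_method / ode_steps trade-off concrete, here is a toy fixed-step integrator for dx/dt = velocity(x, t) on scalars. It is not the node's solver code; the function name `integrate` and its signature are assumptions for illustration. It shows why the methods differ in cost: euler evaluates the velocity once per step, midpoint twice, rk4 four times.

```python
def integrate(velocity, x0, steps, method="midpoint", t0=0.0, t1=1.0):
    """Integrate dx/dt = velocity(x, t) from t0 to t1 in `steps` equal steps."""
    h = (t1 - t0) / steps
    x, t = x0, t0
    for _ in range(steps):
        if method == "euler":
            # One velocity evaluation per step.
            x = x + h * velocity(x, t)
        elif method == "midpoint":
            # Two evaluations: probe half a step ahead, then take the full step.
            x_mid = x + 0.5 * h * velocity(x, t)
            x = x + h * velocity(x_mid, t + 0.5 * h)
        elif method == "rk4":
            # Four evaluations combined with the classic Runge-Kutta weights.
            k1 = velocity(x, t)
            k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
            k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
            k4 = velocity(x + h * k3, t + h)
            x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        else:
            raise ValueError(f"unknown method: {method}")
        t += h
    return x
```

On the test equation dx/dt = x with 4 steps, the error against e shrinks from euler to midpoint to rk4, while the number of velocity (network) evaluations grows from 4 to 8 to 16.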

UniverSR Load Video Audio

Upload or pick a video, extract its audio track (native rate/channels, via ffmpeg), and keep a reference to the source video for remuxing. The clip previews inline in the node — with an upload button and drag-and-drop, just like a normal video loader. Outputs UNIVERSR_VIDEO and AUDIO.

| Input | Type | Default | Description |
|---|---|---|---|
| video | upload / choice | | Drop or upload a video, or pick one from ComfyUI's input/ folder. |
| start_time (opt.) | float | 0.0 | Trim start, seconds. |
| duration (opt.) | float | 0.0 | Trim length, seconds (0 = to end). |

There is also a UniverSR Load Video Audio (Path) variant that takes an absolute video_path string (for files outside ComfyUI's input/ folder); it previews after you run it. Both feed the combiner.

UniverSR Video Combiner

Muxes an AUDIO track onto the source video without re-encoding the video (-c:v copy) and saves the result. If the loader trimmed the clip, the same trim is applied to the video so A/V stay aligned.

| Input | Type | Default | Description |
|---|---|---|---|
| video | UNIVERSR_VIDEO | | From UniverSR Load Video Audio. |
| audio | AUDIO | | The enhanced 48 kHz audio. |
| filename_prefix | string | UniverSR | Output name prefix (auto-incremented). |
| audio_codec (opt.) | choice | aac | aac / flac / pcm_s16le / libopus / libmp3lame. |
| save_output (opt.) | bool | true | Save to output/ (else temp/). |

Output: output_path (string) and an inline video preview.
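The kind of ffmpeg call the combiner describes (copy the video stream, encode only the new audio) might be assembled like this. The function name `build_mux_command` and the exact flag set are assumptions; the node's real command line is not shown in this README.

```python
def build_mux_command(video_path, audio_path, out_path,
                      audio_codec="aac", start_time=0.0, duration=0.0):
    """Build an ffmpeg argv that stream-copies the video and encodes new audio.

    Illustrative only: the flags are standard ffmpeg options, but this is not
    necessarily the command the node runs.
    """
    cmd = ["ffmpeg", "-y"]
    # Input-side trim of the video so it lines up with the trimmed audio.
    if start_time > 0:
        cmd += ["-ss", f"{start_time:.3f}"]
    if duration > 0:
        cmd += ["-t", f"{duration:.3f}"]
    cmd += ["-i", str(video_path), "-i", str(audio_path)]
    # Take video from the first input and audio from the second.
    cmd += ["-map", "0:v:0", "-map", "1:a:0"]
    # "-c:v copy" avoids re-encoding the video stream.
    cmd += ["-c:v", "copy", "-c:a", audio_codec]
    cmd += ["-shortest", str(out_path)]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Note that with stream copy, ffmpeg can only cut the video at keyframes, so a trimmed start may not be frame-accurate.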

Video workflow

UniverSR Load Video Audio ──┬─ audio ─► UniverSR Super-Resolution ─ audio ─┐
                            │                                              ▼
                            └────────────── video ──────────────► UniverSR Video Combiner ─► .mp4
                                              UniverSR Model Loader ─► (Super-Resolution)

Load the video → super-resolve its audio (set input_sr to the content bandwidth) → feed the enhanced audio and the video reference into the combiner. Ready-made graph: example_workflows/universr_video.json.


Choosing input_sr (the one setting that matters most)

input_sr tells the model the effective bandwidth of your content. Everything above input_sr / 2 is treated as missing and regenerated:

| input_sr | Treated as valid up to | The model regenerates |
|---|---|---|
| 8000 | 4 kHz | 4–24 kHz |
| 12000 | 6 kHz | 6–24 kHz |
| 16000 | 8 kHz | 8–24 kHz |
| 24000 | 12 kHz | 12–24 kHz |

auto (default) analyses the input's spectrum, finds the cutoff cliff, and picks the largest supported bandwidth at or below it (rounding down, to avoid feeding the model an empty band). It prints its decision, e.g. auto: cutoff 8.0 kHz (drop 53 dB) -> input_sr=16000. When there's no clear cutoff (full-band or gently rolled-off audio) it falls back to 24000 (least aggressive). Auto is most reliable on genuinely band-limited material (codecs, downsamples, telephone); for fine control or deliberate over-brightening, pick a value manually.
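The selection rule described above (find the steepest spectral drop, trust it only if it is large, then round down to a supported bandwidth) can be sketched as follows. This is not the node's detector: the function names, the per-band dB input, and the 20 dB confidence threshold are all assumptions for illustration.

```python
SUPPORTED_INPUT_SR = (8000, 12000, 16000, 24000)


def find_cliff(band_hz, band_db):
    """Return (cutoff_hz, drop_db) for the steepest drop between adjacent bands.

    band_hz[i] is the upper edge of band i; band_db[i] is its level in dB.
    """
    best_hz, best_drop = band_hz[-1], 0.0
    for i in range(len(band_db) - 1):
        drop = band_db[i] - band_db[i + 1]
        if drop > best_drop:
            best_hz, best_drop = band_hz[i], drop
    return best_hz, best_drop


def pick_input_sr(cutoff_hz, drop_db, file_sr, min_drop_db=20.0):
    """Map a detected cutoff to the largest supported input_sr at or below it."""
    # Trust the cliff only when the drop is large (threshold is an assumption);
    # otherwise fall back to the file's own Nyquist frequency.
    effective = cutoff_hz if drop_db >= min_drop_db else file_sr / 2
    # Round DOWN: keep only rates whose Nyquist does not exceed the cutoff.
    candidates = [sr for sr in SUPPORTED_INPUT_SR if sr / 2 <= effective]
    # Assumption: content narrower than 4 kHz maps to the smallest value.
    return max(candidates) if candidates else min(SUPPORTED_INPUT_SR)
```

Under these assumptions, a 48 kHz file with an 8 kHz cliff of 53 dB maps to 16000, a full-band 48 kHz file with no clear cliff maps to 24000, and a genuine 8 kHz file maps to 8000.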

Two ways to use it:

  1. Genuine low-rate audio (classic super-resolution). You have an 8 kHz (or 12/16/24 kHz) recording and want a full 48 kHz result → set input_sr to that rate. 8 kHz → 48 kHz is the strongest case (70 % of the model's training uses it).
  2. Brighten muffled but full-rate audio (bandwidth extension). Your file is already 48 kHz but sounds dull / rolled-off (e.g. generated audio, old MP3s). Pick the input_sr that matches where real content ends and let the model rebuild above it — 16000 (rebuild only above 8 kHz) is the most natural; 8000 is brighter and more aggressive. Combine with blend < 1.0 to keep the dry signal and add just a touch of high end.
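For the blend control mentioned in item 2, a plain linear wet/dry mix is one natural reading. The sketch below assumes exactly that (the README does not state the mixing law), and the function name `blend_wet_dry` is hypothetical. Both signals must already be at the same sample rate and length.

```python
def blend_wet_dry(wet, dry, blend):
    """Linear wet/dry mix: blend=1.0 returns the SR output, 0.0 the original.

    Assumes a linear crossfade between the two signals; the node may use a
    different law.
    """
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must be in [0, 1]")
    if len(wet) != len(dry):
        raise ValueError("wet and dry must have the same length")
    return [blend * w + (1.0 - blend) * d for w, d in zip(wet, dry)]
```

With blend = 0.6, for example, each output sample is 60 % super-resolved signal and 40 % original.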

The node always reproduces the model's training degradation internally (band-limit → super-resolve), so you don't need to pre-process or resample your audio — just pick the bandwidth.


Performance (speed)

Speedups live on the Model Loader. compile is the real, tonally-neutral win (op fusion); tf32 is a small extra that is off by default.

| Setting | Speedup (measured) | Notes |
|---|---|---|
| compile (opt-in) | ~2.1× | torch.compile the network: op fusion, no tonal change. The recommended speedup. |
| tf32 (default off) | ~1.15× | TF32 matmul + conv on Ampere+. Stacks with compile → ~2.5×. Tonally neutral in our spectral A/B but not bit-exact, so it is left off to keep the default at reference fp32. |

On the reference machine, a 12 s clip went 4.3 s → 1.7 s (2.48×) with both enabled.

About tf32: in a fixed-seed A/B, TF32 left the spectral centroid and >8 kHz energy unchanged to 3 significant figures (i.e. it does not darken the output). If you toggle it and the result sounds different, check your seed — with seed=0 every run draws new noise, so two runs differ regardless of TF32. To compare fairly, set a fixed seed and change only the toggle. Enabling tf32 also turns on cuDNN conv-TF32; disabling it restores true fp32 (PyTorch leaves conv-TF32 on by default otherwise).

About compile: the first run pays a one-time compile (~10–35 s); after that the compiled model is cached for the whole ComfyUI session. The model can only be compiled for a fixed input shape, so the node automatically pads every chunk to chunk_seconds, meaning clips of any length reuse the same compiled graph (no per-length recompiles). Set the sampler's chunk_seconds near your typical clip length so short clips aren't padded up wastefully. Requires CUDA; falls back to eager if compilation fails.
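The fixed-shape padding idea can be sketched as below. Zero-padding, the helper names, and the assumption that the model returns as many samples as it receives are all illustrative choices, not details confirmed by this README.

```python
def pad_to_fixed_length(chunk, sample_rate, chunk_seconds):
    """Zero-pad a chunk to exactly chunk_seconds; return (padded, original_len)."""
    target = int(round(sample_rate * chunk_seconds))
    if len(chunk) > target:
        raise ValueError("chunk is longer than chunk_seconds")
    padded = list(chunk) + [0.0] * (target - len(chunk))
    return padded, len(chunk)


def run_fixed_shape(model_fn, chunk, sample_rate, chunk_seconds):
    """Run model_fn on a fixed-size input, then trim the padding off the output.

    Assumes model_fn returns as many samples as it receives.
    """
    padded, n = pad_to_fixed_length(chunk, sample_rate, chunk_seconds)
    return model_fn(padded)[:n]
```

Because every call sees the same input length, a compiled model is traced once and reused; the cost is the wasted compute on the padded tail, which is why a chunk_seconds close to the typical clip length is suggested.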

Things that don't help here: CFG-batching, channel/chunk batching, and channels_last — the GPU is already compute-bound at batch 1, so they gave ~0 gain in testing. Going faster than compile requires bf16/fp16, which is not equal-quality (verify by ear first).

Suggested starting points by content type:

| Content | input_sr | guidance_scale | ode_method / ode_steps |
|---|---|---|---|
| Speech (8 kHz source) | 8000 | 1.0–1.5 | midpoint / 4 |
| Music (8 kHz source) | 8000 | 1.5–2.0 | midpoint / 4–8 |
| Sound effects | 8000 | ~1.5 | midpoint / 4 |
| Brighten dull 48 kHz audio | 16000 | 2.0–3.0 | midpoint / 4 (try blend 0.6–1.0) |

Notes:

  • Higher guidance_scale (>3) produces denser highs but can add hiss/artifacts.
  • Higher input rates (especially 24 kHz) reconstruct less high-frequency detail than 8 kHz, an upstream model limitation — see the UniverSR notes.

Long audio & chunking

UniverSR runs the whole clip through a flow-matching ODE in one pass, which exhausts VRAM on long files. This node splits the audio in the time domain and stitches the results with overlap-add and a linear crossfade (weight-normalised), so seams are click-free.

  • chunk_seconds — lower it if you hit out-of-memory errors; 0 processes the whole clip at once. Values below ~0.68 s are raised to the model's internal minimum automatically.
  • overlap_seconds — raise it slightly if you ever hear a seam between chunks.
  • Stereo is processed per-channel; a ComfyUI progress bar tracks batch × channels × chunks.
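Weight-normalised overlap-add with a linear crossfade can be sketched in plain Python as follows. The function name `process_in_chunks`, the sample-based (rather than seconds-based) arguments, and the assumption that `process` returns as many samples as it receives are illustrative; this is not the node's implementation.

```python
def process_in_chunks(samples, chunk_len, overlap, process):
    """Run `process` on overlapping chunks and stitch the results.

    Each chunk is weighted by a linear fade over the overlap region; the sum
    is divided by the accumulated weights, so seams carry no gain bump.
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must be in [0, chunk_len)")
    n = len(samples)
    out = [0.0] * n
    weight = [0.0] * n
    hop = chunk_len - overlap
    start = 0
    while start < n:
        end = min(start + chunk_len, n)
        chunk = process(samples[start:end])
        length = end - start
        for i in range(length):
            w = 1.0
            if overlap > 0:
                # Fade in at the head of every chunk except the first.
                if start > 0 and i < overlap:
                    w = min(w, (i + 1) / (overlap + 1))
                # Fade out at the tail of every chunk except the last.
                if end < n and i >= length - overlap:
                    w = min(w, (length - i) / (overlap + 1))
            out[start + i] += w * chunk[i]
            weight[start + i] += w
        if end == n:
            break
        start += hop
    # Weights are strictly positive wherever a chunk contributed.
    return [o / w for o, w in zip(out, weight)]
```

A quick sanity check for any such stitcher is to pass an identity `process`: the output should then match the input, whatever the chunk and overlap sizes.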

Example workflow

A ready-made graph is in example_workflows/universr_super_resolution.json; drag it onto the ComfyUI canvas. It wires LoadAudio → UniverSR Model Loader → UniverSR Super-Resolution → PreviewAudio, with the spectrogram going to a PreviewImage.


How it works

ComfyUI audio arrives at an arbitrary real sample rate. UniverSR's file API relies on torchaudio.load (whose torchcodec backend is fragile across environments), and its tensor API assumes the tensor is already at input_sr. So this node does the band-limit itself, entirely with pure-DSP resampling (no codec):

  1. Resample the input to 48 kHz.
  2. For each chunk, downsample to input_sr → hand UniverSR a genuine low-rate tensor.
  3. UniverSR upsamples back to 48 kHz internally and regenerates the high band via flow matching.
  4. Overlap-add the enhanced chunks; optionally blend with the dry signal.

This reproduces the exact training-time degradation (validated against the upstream pipeline). The node also snapshots and restores the global torch/CUDA RNG around inference, so seeding here never makes the rest of your ComfyUI graph deterministic.
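The snapshot-and-restore pattern for a global RNG can be illustrated with Python's standard `random` module as a stand-in, so the sketch runs without PyTorch. The context manager name `isolated_seed` is hypothetical. In PyTorch the analogous tools are `torch.get_rng_state` / `torch.set_rng_state` (plus the CUDA equivalents) or the `torch.random.fork_rng` context manager; which of these the node uses is not stated here.

```python
import random
from contextlib import contextmanager


@contextmanager
def isolated_seed(seed):
    """Seed the global RNG inside the block, then restore its previous state."""
    saved = random.getstate()      # snapshot the caller's RNG state
    random.seed(seed)              # deterministic draws inside the block
    try:
        yield
    finally:
        random.setstate(saved)     # caller's random stream continues untouched
```

Code running after the `with isolated_seed(...)` block draws exactly the numbers it would have drawn had the block never run, which is the property the paragraph above describes for the rest of the graph.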


Troubleshooting

| Symptom | Fix |
|---|---|
| Could not import the 'universr' package | pip install torchdiffeq into your ComfyUI Python env. |
| CUDA out of memory | Lower chunk_seconds (e.g. 5–8), or set the loader device to cpu. |
| Output sounds harsh / hissy | Lower guidance_scale; for BWE, raise input_sr and/or lower blend. |
| Result barely brighter | This is normal for higher input_sr; use a lower input_sr or raise guidance_scale. |
| First run hangs for a while | It's downloading the ~230 MB checkpoint; watch the console. |
| Spectrogram is blank | matplotlib is missing/headless; audio output is unaffected. |

Credits & license

UniverSR © Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang — DSPAI Lab, Yonsei University — released under the MIT License (see LICENSE). This repository wraps UniverSR for ComfyUI and vendors its inference code unmodified under vendor/. All credit for the model and method goes to the original authors.

@inproceedings{choi2026universr,
  title     = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author    = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year      = {2026}
}

Links: paper · demo · upstream repo
