# ComfyUI-UniverSR

**Audio super-resolution for ComfyUI**: upscale low-bandwidth audio to a full **48 kHz** with
[UniverSR](https://github.com/woongzip1/UniverSR), *Unified and Versatile Audio Super-Resolution via
Vocoder-Free Flow Matching* (ICASSP 2026).

[paper](https://arxiv.org/abs/2510.00771) · [demo](https://woongzip1.github.io/universr-demo/) · [license](LICENSE)

One model upscales **8 / 12 / 16 / 24 kHz** effective bandwidth → **48 kHz** for **speech, music and
sound effects**. It works directly in the complex‑STFT domain with flow matching (**no neural
vocoder**) and *regenerates* the missing high‑frequency band instead of merely interpolating, so
muffled or band‑limited audio gets believable "air" and detail back.

<p align="center">
  <img src="https://raw.githubusercontent.com/woongzip1/UniverSR/master/assets/overview.png" width="760" alt="UniverSR overview" />
</p>

---

## Table of contents

- [Features](#features)
- [Installation](#installation)
- [Models](#models)
- [Nodes](#nodes)
  - [UniverSR Model Loader](#universr-model-loader)
  - [UniverSR Super-Resolution](#universr-super-resolution)
  - [UniverSR Load Video Audio](#universr-load-video-audio)
  - [UniverSR Video Combiner](#universr-video-combiner)
- [Choosing `input_sr`](#choosing-input_sr-the-one-setting-that-matters-most)
- [Performance (speed)](#performance-speed)
- [Recommended settings](#recommended-settings)
- [Long audio & chunking](#long-audio--chunking)
- [Example workflow](#example-workflow)
- [How it works](#how-it-works)
- [Troubleshooting](#troubleshooting)
- [Credits & license](#credits--license)

---

## Features

- 🎚️ **8 / 12 / 16 / 24 kHz → 48 kHz** with a single model: speech, music, SFX.
- 🧩 **Two-node design**: a cached **Model Loader** + a **Super-Resolution** sampler.
- ⬇️ **Auto-download** of the official checkpoints into `models/universr/` on first use.
- 🔗 **Long-audio chunking** with click-free overlap-add (handles clips of any length).
- 🎧 **Stereo-aware**: each channel is processed independently and preserved.
- 🎛️ **Wet/dry blend**: full SR, or dial it back to gently brighten already-48 kHz audio (BWE).
- 🎲 **Seed control** with **global-RNG isolation** (won't perturb other nodes' randomness).
- 📊 Optional **before/after spectrogram** image output.
- 🎬 **Video in / out**: extract a video's audio, super-resolve it, and remux it back onto the
  original video (no video re-encode), all with `ffmpeg`.
- 📦 **Self-contained**: the UniverSR inference code is vendored; the only extra dependency that
  typically needs installing on top of ComfyUI's stack is `torchdiffeq`.

---

## Installation

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-UniverSR.git
pip install -r ComfyUI-UniverSR/requirements.txt
```

Then restart ComfyUI. The nodes appear under the **`audio/UniverSR`** category.

**Dependencies.** `torch`, `torchaudio`, `numpy` and `matplotlib` already ship with ComfyUI. This node
only adds:

```
torchdiffeq einops timm huggingface_hub pyyaml
```

(`einops`/`timm`/`huggingface_hub`/`pyyaml` are usually already present; `torchdiffeq` is the one
that typically needs installing.) The `universr` package itself is **vendored** under `vendor/`. If a
`pip`-installed copy is found it is preferred, otherwise the bundled one is used, so no `git+` install
is required.

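
The "installed copy first, vendored copy second" rule can be sketched as follows. This is only an illustration of the lookup order, not the node's actual import code; `resolve_package_source` and its arguments are hypothetical names.

```python
import importlib.util
import sys
from pathlib import Path


def resolve_package_source(package: str, vendor_dir: Path) -> str:
    """Report where `package` would be imported from (hypothetical helper).

    Returns "installed" when the package is already importable, otherwise
    makes the bundled copy importable and returns "vendored".
    """
    if importlib.util.find_spec(package) is not None:
        return "installed"
    # Appending (not prepending) keeps any later pip install ahead of the bundle.
    vendor = str(vendor_dir)
    if vendor not in sys.path:
        sys.path.append(vendor)
    return "vendored"
```

Appending the vendor directory to the end of `sys.path` is what gives a `pip`-installed copy priority in this sketch.
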
The **video** nodes additionally need **`ffmpeg`** on your `PATH` (`apt install ffmpeg` /
`brew install ffmpeg` / `conda install -c conda-forge ffmpeg`) and `soundfile` (in `requirements.txt`).
The audio SR nodes work without either.

> **GPU recommended.** Inference runs on CUDA if available and falls back to CPU (much slower).

---

## Models

| Preset | Domain | Hugging Face | Notes |
|---|---|---|---|
| `universr-audio` | General (music / SFX / mixed) | [`woongzip1/universr-audio`](https://huggingface.co/woongzip1/universr-audio) | **Recommended default.** |
| `universr-speech` | Speech / voice | [`woongzip1/universr-speech`](https://huggingface.co/woongzip1/universr-speech) | Tuned for voice recordings. |

Each preset is ~230 MB and **downloads automatically** to `ComfyUI/models/universr/<preset>/` the
first time you load it (it lands as `config.yaml` + `pytorch_model.bin`).

**Manual / offline install:** drop the two files into `ComfyUI/models/universr/<name>/` yourself:

```bash
huggingface-cli download woongzip1/universr-audio \
  --local-dir ComfyUI/models/universr/universr-audio
```

Any folder you place under `models/universr/` that contains `config.yaml` + `pytorch_model.bin` will
also show up in the loader's **model** dropdown.

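
To check which folders would qualify for the dropdown, the rule above (both files present) can be expressed as a small scan. This is a sketch of the rule as described, not the loader's real discovery code; `list_local_models` is a hypothetical name.

```python
from pathlib import Path

# The two files the section above says a checkpoint folder must contain.
REQUIRED_FILES = ("config.yaml", "pytorch_model.bin")


def list_local_models(models_root: Path) -> list[str]:
    """Return names of sub-folders that contain every required file."""
    if not models_root.is_dir():
        return []
    return sorted(
        child.name
        for child in models_root.iterdir()
        if child.is_dir() and all((child / name).is_file() for name in REQUIRED_FILES)
    )
```

For example, `list_local_models(Path("ComfyUI/models/universr"))` lists the folders that are complete; a folder with only `config.yaml` is skipped.
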

---

## Nodes

```
            LoadAudio ─────────────┐
                                   ▼
UniverSR Model Loader ─► UniverSR Super-Resolution ─► SaveAudio / PreviewAudio
                                                   └─ spectrogram ─► PreviewImage
```

### UniverSR Model Loader

Loads (and caches) a checkpoint. Output: **`UNIVERSR_MODEL`**.

| Input | Type | Default | Description |
|---|---|---|---|
| `model` | choice | `universr-audio` | Preset to download, or a local checkpoint folder found under `models/universr/`. |
| `device` | `auto` / `cuda` / `cpu` | `auto` | Where to load the weights. `auto` picks CUDA when available. |
| `tf32` *(opt.)* | bool | `False` | TF32 matmul + conv on Ampere+ (~1.15×). Tonally neutral in testing but not bit-exact; off = reference fp32. |
| `compile` *(opt.)* | bool | `False` | `torch.compile` the network (~2×). See [Performance](#performance-speed). |
| `local_path` *(opt.)* | string | `""` | Override: a folder with `config.yaml` + `pytorch_model.bin`, **or** a raw training checkpoint (`.pth` / `.ckpt`). |
| `config_path` *(opt.)* | string | `""` | `config.yaml` to pair with a raw checkpoint. Empty → the bundled default config. |

The loaded model is cached by `(path, device)`, so re-running a graph or reusing the loader across
runs does **not** reload the weights.

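
A cache keyed by `(path, device)` can be as small as a dictionary lookup. The sketch below shows the idea only; `get_model` and the `loader` callback are hypothetical and stand in for whatever the node really calls to read a checkpoint.

```python
from typing import Any, Callable, Dict, Tuple

# Module-level cache: one entry per (checkpoint path, device) pair.
_MODEL_CACHE: Dict[Tuple[str, str], Any] = {}


def get_model(path: str, device: str, loader: Callable[[str, str], Any]) -> Any:
    """Return the cached model for (path, device), loading it only once."""
    key = (path, device)
    if key not in _MODEL_CACHE:
        _MODEL_CACHE[key] = loader(path, device)  # expensive step, done once per key
    return _MODEL_CACHE[key]
```

Because the device is part of the key, the same checkpoint loaded on `cpu` and on `cuda` would occupy two entries.
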
### UniverSR Super-Resolution

Runs the super-resolution. Outputs: **`AUDIO`** (48 kHz) and **`IMAGE`** (spectrogram).

| Input | Type | Default | Range | Description |
|---|---|---|---|---|
| `audio` | AUDIO | - | - | Input audio (any sample rate / mono or stereo). |
| `model` | UNIVERSR_MODEL | - | - | From the Model Loader. |
| `input_sr` | choice | `8000` | 8000 / 12000 / 16000 / 24000 | **Effective input bandwidth (Hz).** Content is treated as valid up to `input_sr/2` and **regenerated above it**. See below. |
| `ode_method` | choice | `midpoint` | euler / midpoint / rk4 | ODE solver. `euler` fastest → `midpoint` balanced → `rk4` best. |
| `ode_steps` | int | `4` | 1–64 | Flow-matching integration steps. `4` is fast & validated; `4–10` is a good range. |
| `guidance_scale` | float | `1.5` | 0–6 | Classifier-free guidance. Higher = denser highs but less faithful. `0` disables CFG. |
| `seed` | int | `0` | - | Noise seed for the flow source. `0` = random each run. |
| `chunk_seconds` | float | `10.0` | 0–120 | Process long audio in chunks this long to bound VRAM. `0` = whole clip at once. |
| `overlap_seconds` | float | `0.5` | 0–5 | Crossfade overlap between chunks (prevents seam clicks). |
| `blend` | float | `1.0` | 0–1 | Wet/dry mix. `1.0` = full SR; lower keeps more of the original. |
| `unload_model` | bool | `false` | - | Free the model from VRAM after this run. |
| `show_spectrogram` | bool | `true` | - | Also output a before/after spectrogram comparison image. |

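
As a rough picture of what `blend` does, here is a wet/dry mix written out sample by sample. The table only says that `1.0` is full SR and lower values keep more of the original, so the plain linear mix below is an assumption, and `blend_wet_dry` is a hypothetical name.

```python
from typing import List, Sequence


def blend_wet_dry(dry: Sequence[float], wet: Sequence[float], blend: float) -> List[float]:
    """Linear wet/dry mix (assumed form): blend=1.0 is all wet, blend=0.0 is all dry."""
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must be within [0, 1]")
    if len(dry) != len(wet):
        raise ValueError("dry and wet signals must have the same length")
    return [blend * w + (1.0 - blend) * d for d, w in zip(dry, wet)]
```

With `blend=0.6`, each output sample would be 60 % enhanced signal and 40 % original, which matches the "add just a touch of high end" use described for bandwidth extension.
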
### UniverSR Load Video Audio

Upload or pick a video, extract its audio track (native rate/channels, via `ffmpeg`), and keep a
reference to the source video for remuxing. The clip **previews inline in the node**, with an upload
button and drag-and-drop, just like a normal video loader. Outputs **`UNIVERSR_VIDEO`** and **`AUDIO`**.

| Input | Type | Default | Description |
|---|---|---|---|
| `video` | upload / choice | - | Drop or upload a video, or pick one from ComfyUI's `input/` folder. |
| `start_time` *(opt.)* | float | `0.0` | Trim start, seconds. |
| `duration` *(opt.)* | float | `0.0` | Trim length, seconds (`0` = to end). |

There is also a **UniverSR Load Video Audio (Path)** variant that takes an absolute `video_path` string
(for files outside ComfyUI's `input/` folder); it previews after you run it. Both feed the combiner.

### UniverSR Video Combiner

Muxes an `AUDIO` track onto the source video **without re-encoding the video** (`-c:v copy`) and saves
the result. If the loader trimmed the clip, the same trim is applied to the video so A/V stay aligned.

| Input | Type | Default | Description |
|---|---|---|---|
| `video` | UNIVERSR_VIDEO | - | From **UniverSR Load Video Audio**. |
| `audio` | AUDIO | - | The enhanced 48 kHz audio. |
| `filename_prefix` | string | `UniverSR` | Output name prefix (auto-incremented). |
| `audio_codec` *(opt.)* | choice | `aac` | `aac` / `flac` / `pcm_s16le` / `libopus` / `libmp3lame`. |
| `save_output` *(opt.)* | bool | `true` | Save to `output/` (else `temp/`). |

Output: `output_path` (string) and an inline video preview.

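
For readers who want to reproduce the remux by hand, the sketch below builds an `ffmpeg` argument list that copies the video stream and encodes only the new audio. Only `-c:v copy` is stated above; the rest of the flag layout (`-map`, `-ss`, `-t`, `-shortest`) is one plausible way to do it with standard ffmpeg options, and `build_mux_command` is a hypothetical helper, not the node's code.

```python
from typing import List


def build_mux_command(
    video_path: str,
    audio_path: str,
    output_path: str,
    audio_codec: str = "aac",
    start_time: float = 0.0,
    duration: float = 0.0,
) -> List[str]:
    """Build (but do not run) an ffmpeg command that muxes new audio onto a video."""
    cmd = ["ffmpeg", "-y"]
    # Options placed before the first -i apply to the video input only,
    # mirroring a trim that was already applied to the audio.
    if start_time > 0:
        cmd += ["-ss", str(start_time)]
    if duration > 0:
        cmd += ["-t", str(duration)]
    cmd += ["-i", video_path, "-i", audio_path]
    cmd += ["-map", "0:v:0", "-map", "1:a:0"]       # video from input 0, audio from input 1
    cmd += ["-c:v", "copy", "-c:a", audio_codec]    # no video re-encode
    cmd += ["-shortest", output_path]
    return cmd
```

The list can be passed to `subprocess.run(cmd, check=True)`. Note that stream copy can only cut on keyframes, so the real node may handle trimmed clips differently.
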
#### Video workflow

```
UniverSR Load Video Audio ──┬─ audio ─► UniverSR Super-Resolution ─ audio ─┐
                            │                                              ▼
                            └────────────── video ──────────────► UniverSR Video Combiner ─► .mp4
UniverSR Model Loader ─► (Super-Resolution)
```

Load the video → super-resolve its audio (set `input_sr` to the content bandwidth) → feed the enhanced
audio **and** the `video` reference into the combiner. Ready-made graph:
[`example_workflows/universr_video.json`](example_workflows/universr_video.json).

---

## Choosing `input_sr` (the one setting that matters most)

`input_sr` tells the model the **effective bandwidth** of your content. Everything **above
`input_sr / 2`** is treated as missing and regenerated:

| `input_sr` | Treated as valid up to | The model regenerates |
|---|---|---|
| `8000` | 4 kHz | 4 – 24 kHz |
| `12000` | 6 kHz | 6 – 24 kHz |
| `16000` | 8 kHz | 8 – 24 kHz |
| `24000` | 12 kHz | 12 – 24 kHz |

Two ways to use it:

1. **Genuine low-rate audio (classic super-resolution).** You have an 8 kHz (or 12/16/24 kHz) recording
   and want a full 48 kHz result → set `input_sr` to that rate. **8 kHz → 48 kHz is the strongest
   case** (it makes up 70 % of the model's training).
2. **Brighten muffled but full-rate audio (bandwidth extension).** Your file is already 48 kHz but
   sounds dull / rolled-off (e.g. generated audio, old MP3s). Pick the `input_sr` that matches where
   real content ends and let the model rebuild above it: `16000` (rebuild only above 8 kHz) is the
   most natural; `8000` is brighter and more aggressive. Combine with **`blend < 1.0`** to keep the
   dry signal and add just a touch of high end.

> The node always reproduces the model's training degradation internally (band-limit → super-resolve),
> so you don't need to pre-process or resample your audio; just pick the bandwidth.

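
The table follows directly from the Nyquist limit: content is kept up to `input_sr / 2` and rebuilt from there up to half the 48 kHz output rate. A small helper makes the arithmetic explicit; `regenerated_band` is a hypothetical name used only for this illustration.

```python
OUTPUT_SR = 48_000  # the node always outputs 48 kHz
SUPPORTED_INPUT_SR = (8000, 12000, 16000, 24000)


def regenerated_band(input_sr: int) -> tuple[float, float]:
    """Return (low_hz, high_hz) of the band the model regenerates for `input_sr`."""
    if input_sr not in SUPPORTED_INPUT_SR:
        raise ValueError(f"input_sr must be one of {SUPPORTED_INPUT_SR}")
    # Everything above the input Nyquist frequency is treated as missing.
    return input_sr / 2, OUTPUT_SR / 2
```

For instance, `regenerated_band(16000)` gives `(8000.0, 24000.0)`, the `8 – 24 kHz` row above.
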

---

## Performance (speed)

Speedups live on the Model Loader. **`compile` is the real, tonally-neutral win** (op fusion); `tf32` is
a small extra that is off by default.

| Setting | Speedup (measured) | Notes |
|---|---|---|
| `compile` (opt-in) | ~2.1× | `torch.compile` the network: op fusion, no tonal change. The recommended speedup. |
| `tf32` (default **off**) | ~1.15× | TF32 matmul + conv on Ampere+. **Stacks with compile → ~2.5×.** Tonally neutral in our spectral A/B but not bit-exact, so it is left off and the default is reference fp32. |

On the reference machine, a 12 s clip went **4.3 s → 1.7 s (2.48×)** with both enabled.

**About `tf32`:** in a fixed-seed A/B, TF32 moved the spectral centroid from 564 Hz to 565 Hz and the
>8 kHz energy measure from 0.00825 to 0.00833, changes of about 1 % or less (i.e. it does **not**
darken the output). If you toggle it and the result sounds different, check your `seed`: with `seed=0`
every run draws new noise, so two runs differ regardless of TF32. To compare fairly, set a fixed `seed`
and change only the toggle. Enabling `tf32` also turns on cuDNN conv-TF32; disabling it restores true
fp32 (PyTorch leaves conv-TF32 on by default otherwise).

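
The two PyTorch switches involved are `torch.backends.cuda.matmul.allow_tf32` and `torch.backends.cudnn.allow_tf32`. The sketch below sets both together, which is the behaviour described above; `set_tf32` is a hypothetical name and the node's own helper may differ in detail. The optional `backends` argument exists only so the function can be exercised without a GPU stack.

```python
def set_tf32(enabled: bool, backends=None) -> None:
    """Toggle TF32 for matmul and cuDNN convolutions together.

    `backends` defaults to `torch.backends`; pass a stand-in object to
    try the function without PyTorch installed.
    """
    if backends is None:
        import torch  # imported lazily; only needed for the real toggle
        backends = torch.backends
    backends.cuda.matmul.allow_tf32 = enabled  # matmul TF32 (PyTorch default: off)
    backends.cudnn.allow_tf32 = enabled        # conv TF32 (PyTorch default: on)
```

Calling `set_tf32(False)` is what makes "off" mean real fp32 here, because the convolution flag would otherwise stay enabled.
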
**About `compile`:** the first run pays a one-time compile (~10–35 s); after that the compiled model is
cached for the whole ComfyUI session. The model can only be compiled for a **fixed input shape**, so the
node automatically **pads every chunk to `chunk_seconds`**, meaning clips of *any* length reuse the same
compiled graph (no per-length recompiles). Set the sampler's `chunk_seconds` near your typical clip length
so short clips aren't padded up wastefully. Requires CUDA; falls back to eager if compilation fails.

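
Padding to a fixed length can be pictured as below. The sketch assumes zero-padding at the end and returns the original length so the padded tail can be trimmed after inference; both choices, and the name `pad_to_fixed_length`, are assumptions rather than the node's documented behaviour.

```python
from typing import List, Sequence, Tuple


def pad_to_fixed_length(
    chunk: Sequence[float], chunk_seconds: float, sample_rate: int = 48_000
) -> Tuple[List[float], int]:
    """Zero-pad `chunk` to exactly chunk_seconds * sample_rate samples.

    Returns the padded chunk and the original length (for trimming later).
    """
    target = int(round(chunk_seconds * sample_rate))
    if len(chunk) > target:
        raise ValueError("chunk is longer than the fixed chunk length")
    padded = list(chunk) + [0.0] * (target - len(chunk))
    return padded, len(chunk)
```

With `chunk_seconds=10.0`, a 3 s clip would be padded to 480,000 samples, which is why matching `chunk_seconds` to the typical clip length avoids wasted compute.
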
> Things that *don't* help here: CFG-batching, channel/chunk batching, and `channels_last`. The GPU is
> already compute-bound at batch 1, so they gave ~0 gain in testing. Going faster than `compile` requires
> bf16/fp16, which is **not** equal-quality (verify by ear first).

## Recommended settings

| Content | `input_sr` | `guidance_scale` | `ode_method` / `ode_steps` |
|---|---|---|---|
| Speech (8 kHz source) | 8000 | 1.0 – 1.5 | midpoint / 4 |
| Music (8 kHz source) | 8000 | 1.5 – 2.0 | midpoint / 4–8 |
| Sound effects | 8000 | ~1.5 | midpoint / 4 |
| Brighten dull 48 kHz audio | 16000 | 2.0 – 3.0 | midpoint / 4 (try `blend` 0.6–1.0) |

Notes:
- Higher `guidance_scale` (>3) produces denser highs but can add hiss/artifacts.
- Higher input rates (especially 24 kHz) reconstruct less high-frequency detail than 8 kHz, an upstream
  model limitation; see the [UniverSR notes](https://github.com/woongzip1/UniverSR#-known-limitations--tips).

---

## Long audio & chunking

UniverSR runs the whole clip through a flow-matching ODE in one pass, which can exhaust VRAM on long
files. This node splits the audio in the time domain and stitches the results with **overlap-add and a
linear crossfade** (weight-normalised), so seams are click-free.

- `chunk_seconds`: lower it if you hit out-of-memory errors; `0` processes the whole clip at once.
  Values below ~0.68 s are raised to the model's internal minimum automatically.
- `overlap_seconds`: raise it slightly if you ever hear a seam between chunks.
- Stereo is processed per-channel; a ComfyUI progress bar tracks `batch × channels × chunks`.

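
The stitching idea (overlapping chunks, linear fades, then division by the summed weights) can be sketched in plain Python. This illustrates the technique named above under simplifying assumptions (lengths in samples, lists instead of tensors); the function names are hypothetical and the node's implementation may differ.

```python
from typing import List, Sequence, Tuple


def split_chunks(n_samples: int, chunk: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) spans covering n_samples, sharing `overlap` samples."""
    if chunk <= overlap:
        raise ValueError("chunk must be longer than overlap")
    hop = chunk - overlap
    spans, start = [], 0
    while True:
        end = min(start + chunk, n_samples)
        spans.append((start, end))
        if end == n_samples:
            return spans
        start += hop


def fade_window(length: int, overlap: int, fade_in: bool, fade_out: bool) -> List[float]:
    """Weights of 1.0 with linear ramps on the edges that touch a neighbour."""
    weights = [1.0] * length
    ramp = min(overlap, length)
    for i in range(ramp):
        gain = (i + 1) / (ramp + 1)  # strictly positive, so weights never sum to zero
        if fade_in:
            weights[i] *= gain
        if fade_out:
            weights[length - 1 - i] *= gain
    return weights


def overlap_add(
    chunks: Sequence[Sequence[float]],
    spans: Sequence[Tuple[int, int]],
    n_samples: int,
    overlap: int,
) -> List[float]:
    """Crossfade processed chunks back together, normalising by total weight."""
    out = [0.0] * n_samples
    weight = [0.0] * n_samples
    last = len(spans) - 1
    for idx, ((start, end), chunk) in enumerate(zip(spans, chunks)):
        w = fade_window(end - start, overlap, fade_in=idx > 0, fade_out=idx < last)
        for i in range(end - start):
            out[start + i] += w[i] * chunk[i]
            weight[start + i] += w[i]
    return [o / wt for o, wt in zip(out, weight)]
```

Dividing by the accumulated weight is what "weight-normalised" refers to: if every chunk is passed through unchanged, the stitched result equals the input, so the crossfade itself adds no level change at the seams.
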

---

## Example workflow

A ready-made graph is in [`example_workflows/universr_super_resolution.json`](example_workflows/universr_super_resolution.json);
**drag it onto the ComfyUI canvas**. It wires `LoadAudio → UniverSR Model Loader → UniverSR
Super-Resolution → PreviewAudio` with the spectrogram going to a `PreviewImage`.

---

## How it works

ComfyUI audio arrives at an arbitrary real sample rate. UniverSR's *file* API relies on
`torchaudio.load` (whose torchcodec backend is fragile across environments), and its *tensor* API
assumes the tensor is already at `input_sr`. So this node does the band-limit itself, entirely with
pure-DSP resampling (no codec):

1. Resample the input to 48 kHz.
2. For each chunk, downsample to `input_sr` → hand UniverSR a *genuine* low-rate tensor.
3. UniverSR upsamples back to 48 kHz internally and regenerates the high band via flow matching.
4. Overlap-add the enhanced chunks; optionally blend with the dry signal.

This reproduces the exact training-time degradation (validated against the upstream pipeline). The
node also **snapshots and restores the global torch/CUDA RNG** around inference, so seeding here never
changes the random state that the rest of your ComfyUI graph sees.

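
The snapshot-and-restore pattern is easy to see with Python's standard `random` module, used here purely as a stand-in for the torch and CUDA generators the node actually protects. In PyTorch the analogous calls would be `torch.get_rng_state()` / `torch.set_rng_state()` and `torch.cuda.get_rng_state_all()` / `torch.cuda.set_rng_state_all()`; `isolated_rng` is a hypothetical name.

```python
import random
from contextlib import contextmanager


@contextmanager
def isolated_rng(seed: int):
    """Seed the global RNG inside the block, then restore its previous state."""
    state = random.getstate()      # snapshot the caller's RNG state
    try:
        random.seed(seed)          # deterministic draws inside the block
        yield
    finally:
        random.setstate(state)     # the caller's random sequence continues untouched
```

Code that runs after a `with isolated_rng(1234):` block draws exactly the numbers it would have drawn had the block never executed.
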

---

## Troubleshooting

| Symptom | Fix |
|---|---|
| `Could not import the 'universr' package` | `pip install torchdiffeq` into your ComfyUI Python env. |
| CUDA out of memory | Lower `chunk_seconds` (e.g. 5–8), or set the loader `device` to `cpu`. |
| Output sounds harsh / hissy | Lower `guidance_scale`; for BWE, raise `input_sr` and/or lower `blend`. |
| Result barely brighter | This is normal for higher `input_sr`; use a lower `input_sr` or raise `guidance_scale`. |
| First run hangs for a while | It's downloading the ~230 MB checkpoint; watch the console. |
| Spectrogram is blank | `matplotlib` is missing/headless; audio output is unaffected. |

---

## Credits & license

UniverSR © Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang ([DSPAI Lab, Yonsei
University](http://dsp.yonsei.ac.kr/)), released under the **MIT License** (see [`LICENSE`](LICENSE)).
This repository wraps UniverSR for ComfyUI and vendors its inference code **unmodified** under
`vendor/`. All credit for the model and method goes to the original authors.

```bibtex
@inproceedings{choi2026universr,
  title = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
  author = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = {2026}
}
```

**Links:** [paper](https://arxiv.org/abs/2510.00771) · [demo](https://woongzip1.github.io/universr-demo/) ·
[upstream repo](https://github.com/woongzip1/UniverSR)