From 12cbc415cfb9d296565f1e7864c7df91fa98c832 Mon Sep 17 00:00:00 2001
From: Ethanfel <ethan.fel@ts-pc.fr>
Date: Mon, 1 Jun 2026 13:02:10 +0200
Subject: [PATCH] docs: full node documentation in README

Comprehensive README: features, install, model auto-download, a
parameter reference for both nodes, an input_sr guide (SR vs BWE),
recommended settings, chunking, how-it-works, and troubleshooting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 README.md | 296 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 224 insertions(+), 72 deletions(-)
diff --git a/README.md b/README.md
index fed9e93..3c3de82 100644
--- a/README.md
+++ b/README.md
@@ -1,72 +1,54 @@
 # ComfyUI-UniverSR
 
-ComfyUI nodes for **[UniverSR](https://github.com/woongzip1/UniverSR)** — *Unified and Versatile
-Audio Super-Resolution via Vocoder-Free Flow Matching* (ICASSP 2026,
-[arXiv:2510.00771](https://arxiv.org/abs/2510.00771)).
+**Audio super-resolution for ComfyUI** — upscale low-bandwidth audio to a full **48 kHz** with
+[UniverSR](https://github.com/woongzip1/UniverSR), *Unified and Versatile Audio Super-Resolution via
+Vocoder-Free Flow Matching* (ICASSP 2026).
 
-A single model upscales **8 / 12 / 16 / 24 kHz** effective bandwidth → **48 kHz** across speech,
-music and sound effects. It works directly in the complex‑STFT domain with flow matching — no neural
-vocoder — and regenerates the missing high‑frequency band rather than just interpolating.
+[![ICASSP 2026](https://img.shields.io/badge/ICASSP-2026-1f6feb.svg)](https://arxiv.org/abs/2510.00771)
+[![arXiv](https://img.shields.io/badge/arXiv-2510.00771-b31b1b.svg)](https://arxiv.org/abs/2510.00771)
+[![Demo](https://img.shields.io/badge/Demo-page-blue.svg)](https://woongzip1.github.io/universr-demo/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
 
-![overview](https://raw.githubusercontent.com/woongzip1/UniverSR/master/assets/overview.png)
+One model upscales **8 / 12 / 16 / 24 kHz** effective bandwidth → **48 kHz** for **speech, music and
+sound effects**. It works directly in the complex‑STFT domain with flow matching — **no neural
+vocoder** — and *regenerates* the missing high‑frequency band instead of merely interpolating, so
+muffled or band‑limited audio gets believable "air" and detail back.
+
+<p align="center">
+  <img src="https://raw.githubusercontent.com/woongzip1/UniverSR/master/assets/overview.png" width="760" alt="UniverSR overview" />
+</p>
 
 ---
 
-## Nodes
-
-| Node | Output | Purpose |
-|---|---|---|
-| **UniverSR Model Loader** | `UNIVERSR_MODEL` | Loads + caches a checkpoint. Auto-downloads the presets to `models/universr/`. |
-| **UniverSR Super-Resolution** | `AUDIO`, `IMAGE` | Runs the SR. Chunks long audio (click-free overlap-add). Optional before/after spectrogram. |
-
-Wire it up:
-
-```
-LoadAudio ─────────────┐
-                       ▼
-UniverSR Model Loader ─► UniverSR Super-Resolution ─► SaveAudio
-                                          └─ spectrogram ─► PreviewImage
-```
-
-### Model Loader
-- **model** — `universr-audio` (general; music/SFX/mixed, recommended) or `universr-speech` (voice).
-  Each downloads ~230 MB to `models/universr/<name>` on first use. Local checkpoint folders placed
-  in `models/universr/` also appear in this list.
-- **device** — `auto` / `cuda` / `cpu`.
-- **local_path** *(optional)* — override with a folder (`config.yaml` + `pytorch_model.bin`) or a raw
-  `.pth`/`.ckpt` training checkpoint.
-- **config_path** *(optional)* — `config.yaml` for a raw checkpoint. Empty → the bundled default config.
-
-### Super-Resolution
-- **input_sr** — the *effective bandwidth* of your content in Hz. The model treats everything up to
-  `input_sr/2` as valid and **regenerates above it**.
-  - `8000` → genuine low-rate audio (8 kHz → 48 kHz; the strongest, best-trained case).
-  - `16000` → brighten muffled but full-rate audio by regenerating only above 8 kHz (most natural).
-- **ode_method** — `euler` (fastest) → `midpoint` (balanced) → `rk4` (best).
-- **ode_steps** — flow-matching steps. `4` is fast and validated; `4–10` is a good range.
-- **guidance_scale** — classifier-free guidance. Speech `1.0–1.5`, music `1.5–2.0`, SFX `~1.5`.
-  Higher = denser highs but less faithful. `0` disables CFG.
-- **seed** — noise seed (`0` = random each run).
-- **chunk_seconds** / **overlap_seconds** — long-audio handling (see below). `chunk_seconds=0`
-  processes the whole clip at once.
-- **blend** — wet/dry mix. `1.0` = full SR. Lower keeps more of the original (handy for *bandwidth
-  extension* of already-48 kHz audio).
-- **unload_model** — free VRAM after the run.
-- **show_spectrogram** — also output a before/after spectrogram comparison `IMAGE`.
+## Table of contents
+- [Features](#features)
+- [Installation](#installation)
+- [Models](#models)
+- [Nodes](#nodes)
+  - [UniverSR Model Loader](#universr-model-loader)
+  - [UniverSR Super-Resolution](#universr-super-resolution)
+- [Choosing `input_sr`](#choosing-input_sr-the-one-setting-that-matters-most)
+- [Recommended settings](#recommended-settings)
+- [Long audio & chunking](#long-audio--chunking)
+- [Example workflow](#example-workflow)
+- [How it works](#how-it-works)
+- [Troubleshooting](#troubleshooting)
+- [Credits & license](#credits--license)
 
 ---
 
-## Long audio & chunking
+## Features
 
-UniverSR runs the whole clip through a flow-matching ODE in one shot, which OOMs on long files
-(the upstream notebook added chunking specifically to survive clips > 2 min). This node chunks in the
-time domain and stitches the results with **overlap-add + linear crossfade** (weight-normalised), so
-seams are click-free — an improvement over the upstream GUI's naive concatenation. Drop
-`chunk_seconds` if you hit VRAM limits; raise `overlap_seconds` if you ever hear a seam. Stereo is
-processed per-channel and preserved.
-
-> Compared to the `FoleyTune BWE` node (which brightens short foley clips and processes the whole clip
-> at once), this node adds the chunking needed for arbitrarily long sequences.
+- 🎚️ **8 / 12 / 16 / 24 kHz → 48 kHz** with a single model — speech, music, SFX.
+- 🧩 **Two-node design** — a cached **Model Loader** + a **Super-Resolution** sampler.
+- ⬇️ **Auto-download** of the official checkpoints into `models/universr/` on first use.
+- 🔗 **Long-audio chunking** with click-free overlap-add (handles clips of any length).
+- 🎧 **Stereo-aware** — each channel is processed independently and preserved.
+- 🎛️ **Wet/dry blend** — full SR, or dial it back to gently brighten already-48 kHz audio (BWE).
+- 🎲 **Seed control** with **global-RNG isolation** (won't perturb other nodes' randomness).
+- 📊 Optional **before/after spectrogram** image output.
+- 📦 **Self-contained** — the UniverSR inference code is vendored; the only extra dependency beyond
+  ComfyUI's stack is `torchdiffeq`.
 
 ---
 
@@ -74,35 +56,205 @@ processed per-channel and preserved.
 
 ```bash
 cd ComfyUI/custom_nodes
-git clone <this repo> ComfyUI-UniverSR
+git clone https://github.com/ethanfel/ComfyUI-UniverSR.git
 pip install -r ComfyUI-UniverSR/requirements.txt
 ```
 
-The `universr` model code is **vendored** under `vendor/` (an installed `pip` copy is preferred if
-present), so the only dependency beyond ComfyUI's stack is **`torchdiffeq`** (plus `einops`, `timm`,
-`huggingface_hub`, `pyyaml`, which ComfyUI usually already has). Weights download automatically on
-first use.
+Then restart ComfyUI. The nodes appear under the **`audio/UniverSR`** category.
+
+**Dependencies.** `torch`, `torchaudio`, `numpy` and `matplotlib` already ship with ComfyUI. This node
+only adds:
+
+```
+torchdiffeq   einops   timm   huggingface_hub   pyyaml
+```
+
+(`einops`/`timm`/`huggingface_hub`/`pyyaml` are usually already present; `torchdiffeq` is the one
+that typically needs installing.) The `universr` package itself is **vendored** under `vendor/` — if a
+`pip`-installed copy is found it is preferred, otherwise the bundled one is used, so no `git+` install
+is required.
+
+> **GPU recommended.** Inference runs on CUDA if available and falls back to CPU (much slower).
 
 ---
 
-## How it works (implementation note)
+## Models
 
-ComfyUI audio arrives at an arbitrary real sample rate. UniverSR's *file* path relies on
-`torchaudio.load` (fragile torchcodec backend), and its *tensor* path assumes the tensor is already at
-`input_sr`. So this node does the band-limit itself: resample to 48 kHz → downsample each chunk to
-`input_sr` (pure DSP, no codec) → hand UniverSR a genuine low-rate tensor to super-resolve. This
-exactly reproduces the model's training-time degradation.
+| Preset | Domain | Hugging Face | Notes |
+|---|---|---|---|
+| `universr-audio` | General (music / SFX / mixed) | [`woongzip1/universr-audio`](https://huggingface.co/woongzip1/universr-audio) | **Recommended default.** |
+| `universr-speech` | Speech / voice | [`woongzip1/universr-speech`](https://huggingface.co/woongzip1/universr-speech) | Tuned for voice recordings. |
+
+Each preset is ~230 MB and **downloads automatically** to `ComfyUI/models/universr/<preset>/` the
+first time you load it (it lands as `config.yaml` + `pytorch_model.bin`).
+
+**Manual / offline install** — drop the two files into `ComfyUI/models/universr/<name>/` yourself:
+
+```bash
+huggingface-cli download woongzip1/universr-audio \
+  --local-dir ComfyUI/models/universr/universr-audio
+```
+
+Any folder you place under `models/universr/` that contains `config.yaml` + `pytorch_model.bin` will
+also show up in the loader's **model** dropdown.
+
+---
+
+## Nodes
+
+```
+LoadAudio ─────────────┐
+                       ▼
+UniverSR Model Loader ─► UniverSR Super-Resolution ─► SaveAudio / PreviewAudio
+                                          └─ spectrogram ─► PreviewImage
+```
+
+### UniverSR Model Loader
+
+Loads (and caches) a checkpoint. Output: **`UNIVERSR_MODEL`**.
+
+| Input | Type | Default | Description |
+|---|---|---|---|
+| `model` | choice | `universr-audio` | Preset to download, or a local checkpoint folder found under `models/universr/`. |
+| `device` | `auto` / `cuda` / `cpu` | `auto` | Where to load the weights. `auto` picks CUDA when available. |
+| `local_path` *(opt.)* | string | `""` | Override: a folder with `config.yaml` + `pytorch_model.bin`, **or** a raw training checkpoint (`.pth` / `.ckpt`). |
+| `config_path` *(opt.)* | string | `""` | `config.yaml` to pair with a raw checkpoint. Empty → the bundled default config. |
+
+The loaded model is cached by `(path, device)`, so re-running a graph or reusing the loader across
+runs does **not** reload the weights.
+
+### UniverSR Super-Resolution
+
+Runs the super-resolution. Outputs: **`AUDIO`** (48 kHz) and **`IMAGE`** (spectrogram).
+
+| Input | Type | Default | Range | Description |
+|---|---|---|---|---|
+| `audio` | AUDIO | — | — | Input audio (any sample rate / mono or stereo). |
+| `model` | UNIVERSR_MODEL | — | — | From the Model Loader. |
+| `input_sr` | choice | `8000` | 8000 / 12000 / 16000 / 24000 | **Effective input bandwidth (Hz).** Content is treated as valid up to `input_sr/2` and **regenerated above it**. See below. |
+| `ode_method` | choice | `midpoint` | euler / midpoint / rk4 | ODE solver. `euler` fastest → `midpoint` balanced → `rk4` best. |
+| `ode_steps` | int | `4` | 1–64 | Flow-matching integration steps. `4` is fast & validated; `4–10` is a good range. |
+| `guidance_scale` | float | `1.5` | 0–6 | Classifier-free guidance. Higher = denser highs but less faithful. `0` disables CFG. |
+| `seed` | int | `0` | — | Noise seed for the flow source. `0` = random each run. |
+| `chunk_seconds` | float | `10.0` | 0–120 | Process long audio in chunks this long to bound VRAM. `0` = whole clip at once. |
+| `overlap_seconds` | float | `0.5` | 0–5 | Crossfade overlap between chunks (prevents seam clicks). |
+| `blend` | float | `1.0` | 0–1 | Wet/dry mix. `1.0` = full SR; lower keeps more of the original. |
+| `unload_model` | bool | `false` | — | Free the model from VRAM after this run. |
+| `show_spectrogram` | bool | `true` | — | Also output a before/after spectrogram comparison image. |
+
+---
+
+## Choosing `input_sr` (the one setting that matters most)
+
+`input_sr` tells the model the **effective bandwidth** of your content. Everything **above
+`input_sr / 2`** is treated as missing and regenerated:
+
+| `input_sr` | Treated as valid up to | The model regenerates |
+|---|---|---|
+| `8000`  | 4 kHz  | 4 – 24 kHz |
+| `12000` | 6 kHz  | 6 – 24 kHz |
+| `16000` | 8 kHz  | 8 – 24 kHz |
+| `24000` | 12 kHz | 12 – 24 kHz |
+
+Two ways to use it:
+
+1. **Genuine low-rate audio (classic super-resolution).** You have an 8 kHz (or 16/24 kHz) recording
+   and want a full 48 kHz result → set `input_sr` to that rate. **8 kHz → 48 kHz is the strongest
+   case** (the model is trained 70 % on it).
+2. **Brighten muffled but full-rate audio (bandwidth extension).** Your file is already 48 kHz but
+   sounds dull / rolled-off (e.g. generated audio, old MP3s). Pick the `input_sr` that matches where
+   real content ends and let the model rebuild above it — `16000` (rebuild only above 8 kHz) is the
+   most natural; `8000` is brighter and more aggressive. Combine with **`blend < 1.0`** to keep the
+   dry signal and add just a touch of high end.
+
+> The node always reproduces the model's training degradation internally (band-limit → super-resolve),
+> so you don't need to pre-process or resample your audio — just pick the bandwidth.
+
+---
+
+## Recommended settings
+
+| Content | `input_sr` | `guidance_scale` | `ode_method` / `ode_steps` |
+|---|---|---|---|
+| Speech (8 kHz source) | 8000 | 1.0 – 1.5 | midpoint / 4 |
+| Music (8 kHz source) | 8000 | 1.5 – 2.0 | midpoint / 4–8 |
+| Sound effects | 8000 | ~1.5 | midpoint / 4 |
+| Brighten dull 48 kHz audio | 16000 | 2.0 – 3.0 | midpoint / 4 (try `blend` 0.6–1.0) |
+
+Notes:
+- Higher `guidance_scale` (>3) produces denser highs but can add hiss/artifacts.
+- Higher input rates (especially 24 kHz) reconstruct less high-frequency detail than 8 kHz, an upstream
+  model limitation — see the [UniverSR notes](https://github.com/woongzip1/UniverSR#-known-limitations--tips).
+
+---
+
+## Long audio & chunking
+
+UniverSR runs the whole clip through a flow-matching ODE in one pass, which exhausts VRAM on long
+files. This node splits the audio in the time domain and stitches the results with **overlap-add and a
+linear crossfade** (weight-normalised), so seams are click-free.
+
+- `chunk_seconds` — lower it if you hit out-of-memory errors; `0` processes the whole clip at once.
+  Values below ~0.68 s are raised to the model's internal minimum automatically.
+- `overlap_seconds` — raise it slightly if you ever hear a seam between chunks.
+- Stereo is processed per-channel; a ComfyUI progress bar tracks `batch × channels × chunks`.
+
+---
+
+## Example workflow
+
+A ready-made graph is in [`example_workflows/universr_super_resolution.json`](example_workflows/universr_super_resolution.json)
+— **drag it onto the ComfyUI canvas**. It wires `LoadAudio → UniverSR Model Loader → UniverSR
+Super-Resolution → PreviewAudio` with the spectrogram going to a `PreviewImage`.
+
+---
+
+## How it works
+
+ComfyUI audio arrives at an arbitrary real sample rate. UniverSR's *file* API relies on
+`torchaudio.load` (whose torchcodec backend is fragile across environments), and its *tensor* API
+assumes the tensor is already at `input_sr`. So this node does the band-limit itself, entirely with
+pure-DSP resampling (no codec):
+
+1. Resample the input to 48 kHz.
+2. For each chunk, downsample to `input_sr` → hand UniverSR a *genuine* low-rate tensor.
+3. UniverSR upsamples back to 48 kHz internally and regenerates the high band via flow matching.
+4. Overlap-add the enhanced chunks; optionally blend with the dry signal.
+
+This reproduces the exact training-time degradation (validated against the upstream pipeline). The
+node also **snapshots and restores the global torch/CUDA RNG** around inference, so seeding here never
+makes the rest of your ComfyUI graph deterministic.
+
+---
+
+## Troubleshooting
+
+| Symptom | Fix |
+|---|---|
+| `Could not import the 'universr' package` | `pip install torchdiffeq` into your ComfyUI Python env. |
+| CUDA out of memory | Lower `chunk_seconds` (e.g. 5–8), or set the loader `device` to `cpu`. |
+| Output sounds harsh / hissy | Lower `guidance_scale`; for BWE, raise `input_sr` and/or lower `blend`. |
+| Result barely brighter | This is normal for higher `input_sr`; use a lower `input_sr` or raise `guidance_scale`. |
+| First run hangs for a while | It's downloading the ~230 MB checkpoint — watch the console. |
+| Spectrogram is blank | `matplotlib` is missing/headless; audio output is unaffected. |
+
+---
 
 ## Credits & license
 
-UniverSR © Woongjib Choi et al., DSPAI Lab, Yonsei University — released under the MIT License
-(see `LICENSE`). This node wrapper vendors the UniverSR inference code unmodified under `vendor/`.
+UniverSR © Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang — [DSPAI Lab, Yonsei
+University](http://dsp.yonsei.ac.kr/) — released under the **MIT License** (see [`LICENSE`](LICENSE)).
+This repository wraps UniverSR for ComfyUI and vendors its inference code **unmodified** under
+`vendor/`. All credit for the model and method goes to the original authors.
 
 ```bibtex
 @inproceedings{choi2026universr,
   title     = {{UniverSR}: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching},
   author    = {Choi, Woongjib and Lee, Sangmin and Lim, Hyungseob and Kang, Hong-Goo},
-  booktitle = {IEEE ICASSP},
+  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
   year      = {2026}
 }
 ```
+
+**Links:** [paper](https://arxiv.org/abs/2510.00771) · [demo](https://woongzip1.github.io/universr-demo/) ·
+[upstream repo](https://github.com/woongzip1/UniverSR)