feat: equal-quality speed options (TF32 + torch.compile)

Add two inference speedups to the Model Loader (TF32 on by default,
torch.compile opt-in), validated to leave the output perceptually identical
(deviation at the fp32 rounding floor):

- tf32 (default on): TF32 matmul on Ampere+ (~1.15x).
- compile (opt-in): torch.compile the UNet (~2.1x). Stacks with TF32 to
  ~2.5x (measured 4.3s -> 1.7s on a 12s clip).

torch.compile needs a static shape (the model's adaptive-avg-pool can't trace
dynamic shapes), so the sampler pads every chunk to chunk_seconds — clips of
any length reuse one compiled graph (no per-length recompiles; verified an 8s
clip after a 12s clip ran in 0.9s with no recompile).

Researched + profiled first: CFG-batching, channel/chunk batching, and
channels_last gave ~0 gain because the GPU is already compute-bound at batch 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:16:21 +02:00
parent 9a901adcc5
commit 104cd4bf5f
3 changed files with 97 additions and 11 deletions
@@ -30,6 +30,7 @@ muffled or bandlimited audio gets believable "air" and detail back.
- [UniverSR Load Video Audio](#universr-load-video-audio)
- [UniverSR Video Combiner](#universr-video-combiner)
- [Choosing `input_sr`](#choosing-input_sr-the-one-setting-that-matters-most)
- [Performance (speed)](#performance-speed)
- [Recommended settings](#recommended-settings)
- [Long audio & chunking](#long-audio--chunking)
- [Example workflow](#example-workflow)
@@ -125,6 +126,8 @@ Loads (and caches) a checkpoint. Output: **`UNIVERSR_MODEL`**.
|---|---|---|---|
| `model` | choice | `universr-audio` | Preset to download, or a local checkpoint folder found under `models/universr/`. |
| `device` | `auto` / `cuda` / `cpu` | `auto` | Where to load the weights. `auto` picks CUDA when available. |
| `tf32` *(opt.)* | bool | `True` | TF32 matmul on Ampere+ (~1.15×). Perceptually lossless, not bit-exact. |
| `compile` *(opt.)* | bool | `False` | `torch.compile` the network (~2.1×). See [Performance](#performance-speed). |
| `local_path` *(opt.)* | string | `""` | Override: a folder with `config.yaml` + `pytorch_model.bin`, **or** a raw training checkpoint (`.pth` / `.ckpt`). |
| `config_path` *(opt.)* | string | `""` | `config.yaml` to pair with a raw checkpoint. Empty → the bundled default config. |
@@ -223,6 +226,30 @@ Two ways to use it:
---
## Performance (speed)
Two **equal-quality** speedups live on the Model Loader (both leave the output perceptually identical —
measured deviation is at the fp32 rounding floor, ≈ 64 dB):
| Setting | Speedup (measured) | Notes |
|---|---|---|
| `tf32` (default **on**) | ~1.15× | TF32 matmul on Ampere+. One global flag; perceptually lossless but not bit-exact. |
| `compile` (opt-in) | ~2.1× | `torch.compile` the network. **Stacks with TF32 → ~2.5× total.** |
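
As a rough sketch of what these two settings correspond to in PyTorch (not necessarily how the node implements them): the helper name `apply_speed_options` and its parameters are hypothetical, while the `allow_tf32` flags, `torch.cuda.is_available`, and `torch.compile` are standard PyTorch APIs.

```python
def apply_speed_options(model, tf32=True, compile_model=False):
    """Hypothetical helper: toggle TF32 and optionally wrap `model` with torch.compile."""
    import torch  # imported lazily so this module loads even without PyTorch

    # TF32 is a process-wide switch; it only affects matmuls/convolutions
    # on GPUs that support it (Ampere or newer) and is ignored elsewhere.
    torch.backends.cuda.matmul.allow_tf32 = tf32
    torch.backends.cudnn.allow_tf32 = tf32

    # torch.compile is lazy: the graph is actually built on the first forward pass.
    # This sketch only compiles when CUDA is present, matching the "Requires CUDA" note.
    if compile_model and torch.cuda.is_available():
        model = torch.compile(model)
    return model
```

Because the TF32 flags are global, toggling them in this way would affect every other PyTorch model in the same process, not only the one passed in.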
On the reference machine, a 12 s clip went **4.3 s → 1.7 s (2.48×)** with both enabled, with a max
sample deviation of `2e-4` vs plain fp32.
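
A comparison like this can be reproduced with a small harness. The following is a minimal sketch, not the script used for the numbers above: `time_call` and `max_abs_deviation` are hypothetical helper names, and the lambdas and lists stand in for the fp32 and optimized pipelines and their outputs. On a GPU you would also call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously.

```python
import time

def time_call(fn, warmup=1, repeats=3):
    """Return the best wall-clock time of fn() in seconds, after warm-up runs."""
    for _ in range(warmup):  # warm-up absorbs one-time costs such as compilation
        fn()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def max_abs_deviation(reference, candidate):
    """Largest per-sample absolute difference between two equal-length signals."""
    if len(reference) != len(candidate):
        raise ValueError("signals must have the same length")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)

# Stand-in signals: in practice these are the fp32 output and the optimized output.
reference = [0.0, 0.5, -0.25, 1.0]
candidate = [0.0, 0.5001, -0.25, 0.9999]
deviation = max_abs_deviation(reference, candidate)

# Stand-in workloads: in practice these run the same clip through both configurations.
baseline_s = time_call(lambda: sum(range(200_000)))
optimized_s = time_call(lambda: sum(range(100_000)))
print(f"speedup {baseline_s / optimized_s:.2f}x, max deviation {deviation:.1e}")
```

Taking the best of several repeats after a warm-up keeps the one-time compile cost out of the steady-state figure.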
**About `compile`:** the first run pays a one-time compile (~10–35 s); after that the compiled model is
cached for the whole ComfyUI session. The model can only be compiled for a **fixed input shape**, so the
node automatically **pads every chunk to `chunk_seconds`** — meaning clips of *any* length reuse the same
compiled graph (no per-length recompiles). Set the sampler's `chunk_seconds` near your typical clip length
so short clips aren't padded up wastefully. Requires CUDA; falls back to eager if compilation fails.
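
The pad-to-fixed-length idea can be sketched without any framework. This is an illustration under assumptions, not the node's code: `pad_chunks`, `run_padded`, and `process_fixed` are hypothetical names, samples are plain Python lists, the padding is zeros, and the stand-in model returns one output sample per input sample (a real super-resolution model changes the sample rate, so the trim length would be scaled accordingly).

```python
def pad_chunks(samples, chunk_len):
    """Split `samples` into chunks of exactly `chunk_len`, zero-padding the last one.

    Returns (chunks, valid_lengths) so the padding can be trimmed afterwards.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks, valid = [], []
    for start in range(0, len(samples), chunk_len):
        piece = list(samples[start:start + chunk_len])
        valid.append(len(piece))                         # real samples in this chunk
        piece.extend([0.0] * (chunk_len - len(piece)))   # pad up to the fixed length
        chunks.append(piece)
    return chunks, valid

def run_padded(samples, chunk_len, process_fixed):
    """Run a fixed-shape function over a clip of any length and trim the padding."""
    chunks, valid = pad_chunks(samples, chunk_len)
    out = []
    for chunk, n_valid in zip(chunks, valid):
        out.extend(process_fixed(chunk)[:n_valid])       # drop output for padded samples
    return out

# Stand-in "model": doubles every sample and records the input lengths it was given.
seen_lengths = set()

def process_fixed(chunk):
    seen_lengths.add(len(chunk))
    return [2.0 * x for x in chunk]

clip_a = [0.1] * 10  # longer clip: two full chunks plus one padded chunk
clip_b = [0.2] * 3   # shorter clip: a single padded chunk
out_a = run_padded(clip_a, 4, process_fixed)
out_b = run_padded(clip_b, 4, process_fixed)
```

In this sketch `seen_lengths` ends up containing only the chunk length, which is the property that lets one compiled graph serve every clip. The cost is also visible: the 3-sample clip is processed as a full 4-sample chunk, which is why `chunk_seconds` is best set near the typical clip length.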
> These are the only speedups that leave the output perceptually unchanged. Things that *don't* help here: CFG-batching,
> channel/chunk batching, and `channels_last` — the GPU is already compute-bound at batch 1, so they
> gave ~0 gain in testing. Going faster than this requires bf16/fp16, which is **not** equal-quality
> (verify by ear first).
## Recommended settings
| Content | `input_sr` | `guidance_scale` | `ode_method` / `ode_steps` |