feat: equal-quality speed options (TF32 + torch.compile)

Add two inference speedups to the Model Loader, validated to leave the output perceptually identical (deviation at the fp32 rounding floor):

- tf32 (default on): TF32 matmul on Ampere+ (~1.15x).
- compile (opt-in): torch.compile the UNet (~2.1x). Stacks with TF32 to ~2.5x (measured 4.3s -> 1.7s on a 12s clip).

torch.compile needs a static shape (the model's adaptive-avg-pool can't trace dynamic shapes), so the sampler pads every chunk to chunk_seconds. Clips of any length therefore reuse one compiled graph, with no per-length recompiles (verified: an 8s clip run after a 12s clip took 0.9s with no recompile).

Researched and profiled first: CFG-batching, channel/chunk batching, and channels_last gave ~0 gain because the GPU is already compute-bound at batch 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-16 17:16:21 +02:00
parent 9a901adcc5
commit 104cd4bf5f
3 changed files with 97 additions and 11 deletions
@@ -30,6 +30,7 @@ muffled or band‑limited audio gets believable "air" and detail back.
  - [UniverSR Load Video Audio](#universr-load-video-audio)
  - [UniverSR Video Combiner](#universr-video-combiner)
 - [Choosing `input_sr`](#choosing-input_sr-the-one-setting-that-matters-most)
+- [Performance (speed)](#performance-speed)
 - [Recommended settings](#recommended-settings)
 - [Long audio & chunking](#long-audio--chunking)
 - [Example workflow](#example-workflow)
@@ -125,6 +126,8 @@ Loads (and caches) a checkpoint. Output: **`UNIVERSR_MODEL`**.
 |---|---|---|---|
 | `model` | choice | `universr-audio` | Preset to download, or a local checkpoint folder found under `models/universr/`. |
 | `device` | `auto` / `cuda` / `cpu` | `auto` | Where to load the weights. `auto` picks CUDA when available. |
+| `tf32` *(opt.)* | bool | `True` | TF32 matmul on Ampere+ (~1.15×). Perceptually lossless, not bit-exact. |
+| `compile` *(opt.)* | bool | `False` | `torch.compile` the network (~2.1×, CUDA only). See [Performance](#performance-speed). |
 | `local_path` *(opt.)* | string | `""` | Override: a folder with `config.yaml` + `pytorch_model.bin`, **or** a raw training checkpoint (`.pth` / `.ckpt`). |
 | `config_path` *(opt.)* | string | `""` | `config.yaml` to pair with a raw checkpoint. Empty → the bundled default config. |

@@ -223,6 +226,30 @@ Two ways to use it:

 ---

+## Performance (speed)
+
+Two **equal-quality** speedups live on the Model Loader (both leave the output perceptually identical —
+measured deviation is at the fp32 rounding floor, ≈ −64 dB):
+
+| Setting | Speedup (measured) | Notes |
+|---|---|---|
+| `tf32` (default **on**) | ~1.15× | TF32 matmul on Ampere or newer GPUs. One global flag; the result is perceptually lossless but not bit-exact. |
+| `compile` (opt-in) | ~2.1× | `torch.compile` the network. **Stacks with TF32 → ~2.5× total.** |
+
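As a rough illustration of what the two settings in the table correspond to in PyTorch, the sketch below sets the TF32 matmul flag and optionally wraps a model in `torch.compile`, returning the uncompiled model if anything goes wrong. `apply_speed_options` and its parameters are hypothetical names, not the node's actual code, and the guarded import is only there so the sketch loads on machines without PyTorch.

```python
try:
    import torch  # optional here so the sketch still imports without PyTorch
except ImportError:
    torch = None


def apply_speed_options(model, example_input=None, tf32=True, compile_model=False):
    """Hypothetical helper: set the TF32 matmul flag and optionally compile `model`."""
    if torch is None:
        return model

    # Process-wide switch: allow TF32 for float32 matmuls on GPUs that support it.
    torch.backends.cuda.matmul.allow_tf32 = tf32

    if not compile_model or not torch.cuda.is_available():
        return model  # eager mode

    try:
        compiled = torch.compile(model)
        if example_input is not None:
            # torch.compile is lazy: the real compilation happens on the first
            # call, so run once here to surface failures inside the try block.
            with torch.no_grad():
                compiled(example_input)
        return compiled
    except Exception:
        return model  # assumed fallback behaviour: keep the eager model
```

Passing an `example_input` with the fixed chunk shape is what lets a compile failure be caught up front instead of during the first real clip.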
+On the reference machine, a 12 s clip went **4.3 s → 1.7 s (2.48×)** with both enabled, with a max
+sample deviation of `2e-4` vs plain fp32.
+
+**About `compile`:** the first run pays a one-time compilation cost (~10–35 s); after that the compiled model
+is cached for the whole ComfyUI session. The model can only be compiled for a **fixed input shape**, so the
+node automatically **pads every chunk to `chunk_seconds`**. Clips of *any* length therefore reuse the same
+compiled graph, with no per-length recompiles. Set the sampler's `chunk_seconds` near your typical clip length
+so short clips aren't padded wastefully. `compile` requires CUDA and falls back to eager mode if compilation fails.
+
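The pad-to-a-fixed-shape idea can be sketched in a few lines of plain Python. This is an illustration only: the function names are hypothetical, lists stand in for audio tensors, and it assumes the model returns as many samples as it receives, which may not hold for the real network.

```python
def pad_to_fixed_length(samples, chunk_len):
    """Right-pad `samples` with zeros to exactly `chunk_len` items.

    Returns the padded list and the original length, so the caller can
    trim the model output back afterwards.
    """
    n = len(samples)
    if n > chunk_len:
        raise ValueError("chunk is longer than the fixed chunk length")
    return list(samples) + [0.0] * (chunk_len - n), n


def run_fixed_shape(model_fn, samples, chunk_len):
    """Call `model_fn` on a fixed-length input, then drop the padded tail.

    `model_fn` stands in for the compiled network. Because it always sees
    `chunk_len` samples, a shape-specialised graph never needs recompiling.
    Assumption: the output has the same length as the input.
    """
    padded, n = pad_to_fixed_length(samples, chunk_len)
    out = model_fn(padded)
    return out[:n]
```

Here `chunk_len` would be `chunk_seconds` multiplied by the working sample rate. The model processes the padded tail as well, which is why a `chunk_seconds` far above the typical clip length wastes work.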
+> These are the only speedups tested that leave the output perceptually unchanged. Things that *don't*
+> help here: CFG-batching, channel/chunk batching, and `channels_last`. The GPU is already compute-bound
+> at batch 1, so they gave ~0 gain in testing. Going faster than this requires bf16/fp16, which is
+> **not** equal-quality (verify by ear first).
+
 ## Recommended settings

 | Content | `input_sr` | `guidance_scale` | `ode_method` / `ode_steps` |