Files
ComfyUI-UniverSR/docs/plans/2026-06-17-auto-input-sr-design.md
Ethanfel 8d4cd71723 docs: design for auto input_sr (bandwidth auto-detect)
Cliff/edge spectral detector with a confidence score; effective cutoff =
cliff if confident else sr/2 (covers band-limited, full-band, and genuine
low-rate files in one rule). Round down to the nearest supported Nyquist.
Adds an "auto" dropdown option (default). Validated empirically before design.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 12:42:50 +02:00

59 lines
2.8 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Design: `auto` input_sr mode
Date: 2026-06-17
## Problem
`input_sr` is the most important and most confusing knob on the Sampler: it tells UniverSR the
**effective bandwidth** of the content (everything above `input_sr/2` is regenerated). Picking it wrong
either wastes the model on an empty band (Nyquist set too high → model treats empty band as valid and
won't regenerate it) or needlessly discards real content (Nyquist too low). An `auto` mode should read
the audio's spectrum and choose for the user.
## Feasibility (validated empirically)
- Naive "highest bin above a threshold" **fails** — it catches resampler skirts / leakage.
- A **cliff/edge detector with a confidence score** works: on broadband audio with a sharp cutoff it
recovers 4/6/8/12 kHz to within ~60 Hz with a 5060 dB drop; on full-band/soft-rolloff it reports a
small drop and abstains.
## Algorithm — `detect_input_sr(waveform, sr) -> (input_sr:int, info:dict)`
1. Mono-mix (mean over batch + channels); analyze at the **native** sample rate.
2. Average magnitude spectrum over time (n_fft=4096, Hann, hop n_fft//4), smoothed (avg-pool ~9 bins),
converted to dB relative to peak.
3. Find the **steepest negative gradient** (cliff). Confidence = `median(level just below) -
median(level just above)` in dB.
4. **Effective cutoff** = cliff frequency if a confident cliff (drop ≥ `CONF_DB` ≈ 25) exists below
~0.95·(sr/2); **else sr/2** (content fills its band → no edge). This single rule covers:
- 48 kHz file band-limited to 8 kHz → cliff 8 kHz → 16000.
- 48 kHz full-band → no cliff → 24 kHz → 24000 (the "unsure" fallback, emerges naturally).
- Genuine 8 kHz file → content to its 4 kHz Nyquist, no cliff → sr/2 = 4 kHz → 8000.
5. **Map:** largest supported Nyquist ≤ cutoff (+~300 Hz snap), `input_sr = Nyquist × 2`, clamped to
[8000, 24000]. Rounding **down** is deliberate (avoids the empty-band failure).
## UX
- Dropdown becomes `["auto", "8000", "12000", "16000", "24000"]`, default `"auto"`.
- Resolved in the Sampler `run()`: if `input_sr == "auto"`, call `detect_input_sr`, log the decision,
pass the concrete int to `super_resolve` (which keeps taking a plain int).
- Console log: `auto: cutoff 8.0 kHz (drop 53 dB) -> input_sr=16000` or
`auto: no clear cutoff -> input_sr=24000`.
## Edge cases
- Stereo / batch → mono-mix for analysis (one `input_sr` applies to the whole batch).
- Very long clips → one STFT over the whole signal; cost negligible vs the ODE.
- Empty/too-short audio → fall back to 24000.
## Testing
Unit-test `detect_input_sr` on synthetic signals:
- brick-wall broadband at 4/6/8/12 kHz → 8000/12000/16000/24000,
- full-band broadband → 24000,
- a "genuine 8 kHz" clip (sr=8000, full band) → 8000.
## Out of scope (YAGNI)
No new output ports, no sensitivity slider, no per-chunk re-detection.