diff --git a/README.md b/README.md new file mode 100644 index 0000000..4c1b824 --- /dev/null +++ b/README.md @@ -0,0 +1,80 @@ +# ComfyUI-Omnivoice + +A ComfyUI custom node for [OmniVoice](https://github.com/k2-fsa/OmniVoice) — a massive multilingual zero-shot TTS model supporting 600+ languages. + +## Features + +- **Voice Cloning** — clone any voice from a short reference audio clip +- **Voice Design** — describe a voice with text (e.g. "female, low pitch, british accent") +- **Auto Voice** — let the model pick a voice automatically +- **Audiobook-ready** — handles arbitrarily long text with near-constant VRAM via built-in chunking +- **Multilingual** — 600+ languages + +## Installation + +1. Clone into your ComfyUI custom nodes directory: + ```bash + cd ComfyUI/custom_nodes + git clone https://github.com/ethanfel/ComfyUI-Omnivoice.git + ``` + +2. Install dependencies: + ```bash + pip install omnivoice + ``` + +3. Restart ComfyUI. The nodes will appear under the **OmniVoice** category. + +> **Note:** OmniVoice requires PyTorch 2.8+ and a CUDA-capable GPU (or Apple Silicon). + +## Nodes + +### OmniVoice Model Loader + +Loads the OmniVoice model. Downloads automatically from HuggingFace on first run and caches locally. + +| Input | Type | Description | +|-------|------|-------------| +| `model_source` | dropdown | `Auto-download (HuggingFace)` or `Local path` | +| `local_path` | string | Path to local checkpoint (optional) | +| `device` | dropdown | `cuda:0`, `cuda:1`, or `cpu` | +| `dtype` | dropdown | `float16`, `bfloat16`, or `float32` | + +**Output:** `OMNIVOICE_MODEL` + +--- + +### OmniVoice Generate + +Generates speech from text using a loaded model. + +| Input | Type | Description | +|-------|------|-------------| +| `model` | OMNIVOICE_MODEL | From OmniVoice Model Loader | +| `text` | string | Text to synthesize (full pages supported) | +| `mode` | dropdown | `voice_cloning`, `voice_design`, or `auto_voice` | +| `ref_audio` | AUDIO | Reference audio for voice cloning (optional) | +| `ref_text` | string | Transcription of ref audio — auto-detected if blank (optional) | +| `instruct` | string | Voice description for voice design mode (optional) | +| `speed` | float | Speed multiplier — default 1.0 | +| `num_step` | int | Diffusion steps — default 32 (use 16 for faster generation) | + +**Output:** `AUDIO` at 24kHz — connects directly to ComfyUI's Save Audio node. + +## Example Workflow (Audiobook) + +``` +[OmniVoice Model Loader] ─────────────────────────┐ + ▼ +[Load Audio (narrator clip)] ──► [OmniVoice Generate] ──► [Save Audio] + ▲ + text = "Page 1 content..." + mode = voice_cloning +``` + +Repeat the Generate + Save Audio nodes for each page, reusing the same loader. + +## Credits + +- [OmniVoice](https://github.com/k2-fsa/OmniVoice) by k2-fsa +- [OmniVoice paper](https://arxiv.org/abs/2604.00688)