# ComfyUI-Omnivoice

A ComfyUI custom node for [OmniVoice](https://github.com/k2-fsa/OmniVoice) — a massive multilingual zero-shot TTS model supporting 600+ languages.

## Features

- **Voice Cloning** — clone any voice from a short reference audio clip
- **Voice Design** — describe a voice with text (e.g. "female, low pitch, british accent")
- **Auto Voice** — let the model pick a voice automatically
- **Audiobook-ready** — handles arbitrarily long text with near-constant VRAM via built-in chunking
- **Multilingual** — 600+ languages
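
The chunking mentioned above happens inside OmniVoice and is not documented in this README. As a rough illustration of the general idea only, the sketch below splits long text at sentence boundaries so that each piece can be synthesized separately. The function name `chunk_text` and the `max_chars` limit are hypothetical, and the punctuation rule only suits languages that end sentences with `.`, `!`, or `?`.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Group sentences into chunks of roughly max_chars characters.

    Illustrative only: this is NOT OmniVoice's actual chunking logic.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than max_chars is kept whole.
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Because each chunk is processed on its own, peak memory depends on the chunk size rather than on the total length of the text, which is what keeps VRAM use roughly flat for long inputs.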

## Installation

1. Clone into your ComfyUI custom nodes directory:
   ```bash
   cd ComfyUI/custom_nodes
   git clone https://github.com/ethanfel/ComfyUI-Omnivoice.git
   ```

2. Install `omnivoice` **without its pinned torch** (one-time manual step):
   ```bash
   pip install omnivoice --no-deps
   ```
   > **Why `--no-deps`?** `omnivoice` pins `torch==2.8.*` from a CUDA 12.8 index, so a normal install would replace ComfyUI's torch build. The `--no-deps` flag skips that pin, and ComfyUI's existing torch works fine at runtime.

3. Restart ComfyUI. ComfyUI Manager will install the remaining dependencies from `requirements.txt` automatically. The nodes will appear under the **OmniVoice** category.
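
To confirm that step 2 left ComfyUI's torch untouched, the installed versions can be compared before and after the install. The snippet below is a minimal sketch using only the standard library; it assumes the distribution names are `torch` and `omnivoice` (as in the `pip` command above) and that it is run with the same Python environment ComfyUI uses. The helper names are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("torch", "omnivoice")) -> dict:
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

If the reported torch version is the same before and after installing `omnivoice`, the pin was skipped as intended.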

## Nodes

### OmniVoice Model Loader

Loads the OmniVoice model. The model is downloaded from Hugging Face on first run and cached locally.

| Input | Type | Description |
|-------|------|-------------|
| `model_source` | dropdown | `Auto-download (HuggingFace)` or `Local path` |
| `local_path` | string | Path to a local checkpoint, for use with `Local path` (optional) |
| `device` | dropdown | `cuda:0`, `cuda:1`, or `cpu` |
| `dtype` | dropdown | `float16`, `bfloat16`, or `float32` |

**Output:** `OMNIVOICE_MODEL`
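
For readers unfamiliar with how ComfyUI nodes declare inputs, the table above could map onto a node class roughly as follows. This is a sketch of the standard ComfyUI conventions (`INPUT_TYPES`, `RETURN_TYPES`, `FUNCTION`, `CATEGORY`), not this repository's actual code. The class name, the `load` method, and the returned placeholder dictionary are hypothetical, and the real node loads model weights instead.

```python
class OmniVoiceModelLoaderSketch:
    """Illustrative only: mirrors the inputs listed in the table above."""

    @classmethod
    def INPUT_TYPES(cls):
        # A dropdown is declared as a tuple whose first element is a list
        # of the allowed options.
        return {
            "required": {
                "model_source": (["Auto-download (HuggingFace)", "Local path"],),
                "device": (["cuda:0", "cuda:1", "cpu"],),
                "dtype": (["float16", "bfloat16", "float32"],),
            },
            "optional": {
                "local_path": ("STRING", {"default": ""}),
            },
        }

    RETURN_TYPES = ("OMNIVOICE_MODEL",)
    FUNCTION = "load"
    CATEGORY = "OmniVoice"

    def load(self, model_source, device, dtype, local_path=""):
        if model_source == "Local path" and not local_path.strip():
            raise ValueError("local_path is required when model_source is 'Local path'")
        # Placeholder: a real loader would build the model here. This sketch
        # only returns the resolved settings.
        settings = {
            "source": model_source,
            "path": local_path or None,
            "device": device,
            "dtype": dtype,
        }
        return (settings,)  # ComfyUI expects a tuple matching RETURN_TYPES
```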

---

### OmniVoice Generate

Generates speech from text using a loaded model.

| Input | Type | Description |
|-------|------|-------------|
| `model` | OMNIVOICE_MODEL | From OmniVoice Model Loader |
| `text` | string | Text to synthesize (full pages supported) |
| `mode` | dropdown | `voice_cloning`, `voice_design`, or `auto_voice` |
| `ref_audio` | AUDIO | Reference audio for voice cloning (optional) |
| `ref_text` | string | Transcript of the reference audio; detected automatically if left blank (optional) |
| `instruct` | string | Voice description for voice design mode (optional) |
| `speed` | float | Speed multiplier — default 1.0 |
| `num_step` | int | Diffusion steps — default 32 (use 16 for faster generation) |

**Output:** `AUDIO` at 24 kHz, which connects directly to ComfyUI's Save Audio node.
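
In ComfyUI, an `AUDIO` value is a dictionary holding a `waveform` tensor shaped `[batch, channels, samples]` and an integer `sample_rate`. As a small sketch of working with that structure downstream, the hypothetical helper below computes clip duration. It reads only the `.shape` attribute, so it accepts any tensor-like object.

```python
def audio_duration_seconds(audio: dict) -> float:
    """Return the duration of a ComfyUI AUDIO dict in seconds.

    Assumes audio["waveform"] has shape [batch, channels, samples]
    and audio["sample_rate"] is the rate in Hz.
    """
    samples = audio["waveform"].shape[-1]
    return samples / audio["sample_rate"]
```

For example, a waveform with 48,000 samples at the node's 24 kHz output rate is two seconds long.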

## Example Workflow (Audiobook)

```
[OmniVoice Model Loader] ─────────────────────────┐
                                                    ▼
[Load Audio (narrator clip)] ──► [OmniVoice Generate] ──► [Save Audio]
                                        ▲
                              text = "Page 1 content..."
                              mode = voice_cloning
```

For each page, add another Generate + Save Audio pair and connect it to the same Model Loader output.
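
For books with many pages, an alternative to duplicating nodes in the graph is to queue one job per page through ComfyUI's HTTP API. The sketch below assumes a workflow exported in API format (a dictionary keyed by node ID, each entry holding `class_type` and `inputs`) and a server at the default `http://127.0.0.1:8188`. The function names and the `generate_node_id` argument are hypothetical; the node ID depends on your exported workflow.

```python
import copy
import json
import urllib.request


def build_page_prompts(template: dict, generate_node_id: str, pages: list[str]) -> list[dict]:
    """Return one workflow per page, each with the Generate node's text replaced."""
    prompts = []
    for page in pages:
        workflow = copy.deepcopy(template)  # leave the template untouched
        workflow[generate_node_id]["inputs"]["text"] = page
        prompts.append(workflow)
    return prompts


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> bytes:
    """POST a single workflow to ComfyUI's /prompt endpoint and return the raw reply."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

In practice the Save Audio node's filename prefix would also need to vary per page so that outputs are easy to tell apart.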

## Credits

- [OmniVoice](https://github.com/k2-fsa/OmniVoice) by k2-fsa
- [OmniVoice paper](https://arxiv.org/abs/2604.00688)