Initial release: ComfyUI-MisoTTS (modernized CSM 8B)
Modernized MisoTTS integration for ComfyUI with no torchtune/moshi:

- vendored plain-torch Llama backbone (csm_llama), parity-verified Δ=0 vs torchtune
- transformers.MimiModel codec (bit-identical codes to moshi), drops moshi/bnb/sphn
- low-memory loader: streams 32GB fp32 checkpoint to GPU in bf16 (~18GB VRAM)
- nodes: Model Loader, Generate (audiobook chunking + voice anchoring), EPUB Loader
- pin-free requirements; runs on modern torch / Blackwell GPUs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -0,0 +1,62 @@
# ComfyUI-MisoTTS

ComfyUI nodes for [**MisoTTS**](https://github.com/MisoLabsAI/MisoTTS), an 8B text-to-speech model built on the Sesame **CSM** architecture (Llama-3.2 backbone + audio decoder, Mimi codec, voice cloning).

This is a **modernized** integration: the upstream model pins `torch==2.4`, `torchtune`, and `moshi`, which won't run on recent GPUs (e.g. Blackwell / RTX 50-series) or alongside modern ComfyUI. This pack removes those constraints with **no change in output**.

## What's different from upstream
| Upstream | Here | Why |
|---|---|---|
| `torchtune` (deprecated, pins torch 2.4) | vendored plain-torch Llama (`misotts/csm_llama.py`) | runs on any modern torch; numerically **identical** (verified Δ = 0, same weights) |
| `moshi` + `bitsandbytes` + `sphn` for Mimi | `transformers.MimiModel` | **bit-identical** audio codes; drops those heavy deps |
| gated `meta-llama/Llama-3.2-1B` tokenizer | ungated `unsloth/Llama-3.2-1B` mirror | no HF gating; same tokenizer |
| loads 32 GB fp32 into CPU RAM | streams weights straight to GPU in bf16 | ~18 GB VRAM, ~0 CPU RAM |
| watermark on by default | not bundled | minimal deps (re-addable) |

Result: a pin-free `requirements.txt` (just `transformers`, `safetensors`, `tokenizers`, `torchaudio`).
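
The memory figures in the table follow from simple arithmetic. The sketch below is a back-of-the-envelope check that counts weights only and assumes a round 8 billion parameters (the exact parameter count is not stated here); `weight_gb` is a hypothetical helper, not part of this pack.

```python
# Rough weight-memory estimate. PARAMS is an assumed round figure, not the exact count.
PARAMS = 8_000_000_000
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(dtype: str, params: int = PARAMS) -> float:
    """Return weight storage in decimal gigabytes for the given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


fp32_gb = weight_gb("fp32")  # size of the fp32 checkpoint on disk
bf16_gb = weight_gb("bf16")  # the same weights once cast to bf16
print(f"fp32: {fp32_gb:.0f} GB, bf16: {bf16_gb:.0f} GB")
```

Under that assumption the bf16 weights alone come to about 16 GB. The remaining gap to the ~18 GB quoted above would plausibly be the codec, activations, and cache, though that breakdown is a guess rather than a measurement.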
## Requirements
- A CUDA build of PyTorch matching your GPU (for RTX 50-series: `cu128`+, torch ≥ 2.7).
- ~18 GB free VRAM for bf16 inference.
- First run downloads the model (~32 GB) from `MisoLabs/MisoTTS`, the Mimi codec (`kyutai/mimi`), and the tokenizer.

```bash
cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-MisoTTS
pip install -r ComfyUI-MisoTTS/requirements.txt
```
## Nodes
- **MisoTTS Model Loader**: loads the model (device / dtype). The 32 GB checkpoint is streamed to the GPU in the chosen dtype.
- **MisoTTS Generate**: text → speech. Handles long text (whole EPUB chapters) via sentence-aware chunking and keeps a consistent voice across chunks. Optional `ref_audio` + `ref_text` clone a specific voice.
- **MisoTTS EPUB Loader**: extracts a chapter range from an `.epub` as plain text.
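
To illustrate what extracting a chapter range from an `.epub` involves, here is a minimal standard-library sketch. It is not the node's actual code: `epub_chapters_to_text` is a hypothetical name, and it assumes that each spine entry of the EPUB corresponds to one chapter, which real books do not always honor.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <head>, <script> and <style>."""

    SKIP = {"head", "script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")  # keep adjacent blocks from running together

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def epub_chapters_to_text(path, first=0, last=None):
    """Return plain text for spine items first..last (0-based, inclusive).

    Assumption: one spine item == one chapter.
    """
    with zipfile.ZipFile(path) as zf:
        # container.xml points at the OPF package file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps ids to files; the spine gives the reading order.
        hrefs = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall(".//opf:manifest/opf:item", NS)
        }
        spine = [
            hrefs[ref.attrib["idref"]]
            for ref in opf.findall(".//opf:spine/opf:itemref", NS)
        ]
        last = len(spine) - 1 if last is None else last
        base = posixpath.dirname(opf_path)
        chapters = []
        for href in spine[first:last + 1]:
            parser = _TextExtractor()
            parser.feed(zf.read(posixpath.join(base, href)).decode("utf-8"))
            # Collapse all whitespace runs into single spaces.
            chapters.append(" ".join("".join(parser.parts).split()))
    return "\n\n".join(chapters)
```

The sketch ignores percent-encoded hrefs, non-UTF-8 content, and front matter such as cover pages, all of which a production loader would need to handle.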
## Audiobook / EPUB workflow
```
MisoTTS EPUB Loader ──text──▶ MisoTTS Generate ──audio──▶ Save Audio
MisoTTS Model Loader ─model──▶
(optional) Load Audio ──ref_audio──▶
```

**Voice consistency.** CSM-style models pick a fresh voice on each independent call, so a naïve chapter-at-a-time loop drifts. `MisoTTS Generate` avoids this by feeding the previous chunk(s) back as context (`context_window`, default `1`). For a *specific* narrator voice, connect a `ref_audio` clip (a few seconds) plus its `ref_text`; that voice is then anchored across every chunk. Set a fixed `seed` for reproducible narration.

Key `Generate` parameters (defaults in parentheses):

- `chunk_chars` (300): target characters per chunk; larger = fewer joins, more VRAM/time.
- `max_chunk_seconds` (30): cap on audio generated per chunk.
- `context_window` (1): prior chunks reused as context for voice consistency (0 = independent).
- `silence_ms` (250): gap inserted between chunks.
- `temperature` (0.9) / `topk` (50): sampling.
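
How `chunk_chars` and `context_window` interact is easier to see in code. The following is a simplified sketch, not the node's implementation: `chunk_text` and `plan_generation` are hypothetical helpers, and the sentence splitter is a naive regex that assumes sentences end in `.`, `!` or `?`.

```python
import re


def chunk_text(text: str, chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of roughly chunk_chars characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def plan_generation(chunks: list[str], context_window: int = 1) -> list[tuple[list[int], int]]:
    """Pair each chunk index with the earlier chunk indices reused as context."""
    plan = []
    for i in range(len(chunks)):
        start = max(0, i - context_window)
        plan.append((list(range(start, i)), i))
    return plan


chunks = chunk_text("One. Two! Three? Four.", chunk_chars=10)
for context, index in plan_generation(chunks, context_window=1):
    print(f"chunk {index}: {chunks[index]!r} with context chunks {context}")
```

In this sketch a sentence longer than `chunk_chars` is kept whole rather than split mid-sentence, and `context_window=0` yields an empty context list for every chunk, matching the "independent" behavior described above.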
## Notes
- **Speed**: an 8B autoregressive model at 12.5 Hz × 32 codebooks runs at ~0.2× realtime in eager mode, which is fine for batch/audiobook rendering but not for live use. A `torch.compile` path is a planned optimization.
- **Watermarking** is not applied. If you redistribute generated audio, consider the upstream project's guidance.
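
To put the speed note in concrete terms, the arithmetic below uses only the three figures quoted there. The helper names are hypothetical, and the ~0.2× factor is an approximate figure that will vary with hardware and settings.

```python
FRAME_RATE_HZ = 12.5   # codec frames per second of audio (from the speed note)
NUM_CODEBOOKS = 32     # codes per frame (from the speed note)
REALTIME_FACTOR = 0.2  # approximate eager-mode speed (from the speed note)


def codes_per_audio_second() -> float:
    """Codes the model must produce for one second of audio."""
    return FRAME_RATE_HZ * NUM_CODEBOOKS


def render_hours(audio_hours: float, realtime_factor: float = REALTIME_FACTOR) -> float:
    """Wall-clock hours needed to render the given amount of audio."""
    return audio_hours / realtime_factor


print(f"{codes_per_audio_second():.0f} codes per second of audio")
print(f"10 h audiobook: about {render_hours(10):.0f} h of rendering")
```

At these figures, every second of speech requires 400 generated codes, and a 10-hour audiobook would take roughly 50 hours to render.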
## Credits
Model: [MisoLabsAI/MisoTTS](https://github.com/MisoLabsAI/MisoTTS) (Sesame CSM architecture). Mimi codec: Kyutai. This repo only provides the ComfyUI integration and the torchtune/moshi-free runtime.