ComfyUI-MisoTTS/README.md
Ethanfel f7a6f7790d Initial release: ComfyUI-MisoTTS (modernized CSM 8B)
Modernized MisoTTS integration for ComfyUI with no torchtune/moshi:
- vendored plain-torch Llama backbone (csm_llama), parity-verified Δ=0 vs torchtune
- transformers.MimiModel codec (bit-identical codes to moshi), drops moshi/bnb/sphn
- low-memory loader: streams 32GB fp32 checkpoint to GPU in bf16 (~18GB VRAM)
- nodes: Model Loader, Generate (audiobook chunking + voice anchoring), EPUB Loader
- pin-free requirements; runs on modern torch / Blackwell GPUs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 23:37:54 +02:00

# ComfyUI-MisoTTS
ComfyUI nodes for [**MisoTTS**](https://github.com/MisoLabsAI/MisoTTS) — an 8B text-to-speech model built on the Sesame **CSM** architecture (Llama-3.2 backbone + audio decoder, Mimi codec, voice cloning).
This is a **modernized** integration: the upstream model pins `torch==2.4`, `torchtune`, and `moshi`, which won't run on recent GPUs (e.g. Blackwell / RTX 50-series) or alongside modern ComfyUI. This pack removes those constraints with **no change in output**.
## What's different from upstream
| Upstream | Here | Why |
|---|---|---|
| `torchtune` (deprecated, pins torch 2.4) | vendored plain-torch Llama (`misotts/csm_llama.py`) | runs on any modern torch; numerically **identical** (verified Δ = 0, same weights) |
| `moshi` + `bitsandbytes` + `sphn` for Mimi | `transformers.MimiModel` | **bit-identical** audio codes; drops those three heavy deps |
| gated `meta-llama/Llama-3.2-1B` tokenizer | ungated `unsloth/Llama-3.2-1B` mirror | no HF gating; same tokenizer |
| loads 32 GB fp32 into CPU RAM | streams weights straight to GPU in bf16 | ~18 GB VRAM, ~0 CPU RAM |
| watermark on by default | not bundled | minimal deps (re-addable) |
Result: a pin-free `requirements.txt` (just `transformers`, `safetensors`, `tokenizers`, `torchaudio`).
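
The memory figures in the table follow from simple arithmetic. A rough sketch, assuming a round 8 billion parameters (the exact count is not stated here) and decimal gigabytes:

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# PARAMS is an assumed round figure, not the exact MisoTTS parameter count.
PARAMS = 8_000_000_000
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(dtype: str, params: int = PARAMS) -> float:
    """Return the size of the weights alone, in decimal gigabytes."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(dtype):.0f} GB")
```

The bf16 weights alone come to roughly 16 GB. The gap up to the quoted ~18 GB is presumably caches, activations, and the codec, but that breakdown is an assumption rather than a measured figure.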
## Requirements
- A CUDA build of PyTorch matching your GPU (for RTX 50-series: `cu128`+, torch ≥ 2.7).
- ~18 GB free VRAM for bf16 inference.
- First run downloads the model (~32 GB) from `MisoLabs/MisoTTS`, the Mimi codec (`kyutai/mimi`), and the tokenizer.

To install, clone the repo into ComfyUI's `custom_nodes` directory and install its requirements:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-MisoTTS
pip install -r ComfyUI-MisoTTS/requirements.txt
```
## Nodes
- **MisoTTS Model Loader** — loads the model (device / dtype). The 32 GB checkpoint is streamed to the GPU in the chosen dtype.
- **MisoTTS Generate** — text → speech. Handles long text (whole EPUB chapters) via sentence-aware chunking and keeps a consistent voice across chunks. Optional `ref_audio` + `ref_text` clone a specific voice.
- **MisoTTS EPUB Loader** — extracts a chapter range from an `.epub` as plain text.
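
For readers curious what extracting a chapter range involves: an EPUB is a ZIP archive whose package document lists content files in reading order (the spine). The sketch below is a standard-library illustration, not the node's actual implementation. It assumes one spine item per chapter, which many but not all books follow, and the names `epub_text`, `spine_documents`, `first`, and `last` are hypothetical.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
BLOCK_TAGS = {"p", "div", "br", "h1", "h2", "h3", "li"}


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <head> contents."""

    _SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")  # block-level tags end a line

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def spine_documents(zf):
    """Return the archive paths of the content documents in spine order."""
    container = ET.fromstring(zf.read("META-INF/container.xml"))
    opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
    opf = ET.fromstring(zf.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in opf.findall(".//opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    return [
        posixpath.join(base, manifest[ref.get("idref")])
        for ref in opf.findall(".//opf:spine/opf:itemref", OPF_NS)
    ]


def epub_text(path, first=0, last=None):
    """Plain text of spine items first..last (0-based, last exclusive)."""
    chapters = []
    with zipfile.ZipFile(path) as zf:
        for name in spine_documents(zf)[first:last]:
            parser = _TextExtractor()
            parser.feed(zf.read(name).decode("utf-8", errors="replace"))
            parser.close()
            lines = "".join(parser.parts).splitlines()
            # Collapse runs of whitespace and drop empty lines.
            cleaned = [" ".join(line.split()) for line in lines]
            chapters.append("\n".join(line for line in cleaned if line))
    return "\n\n".join(chapters)
```

Calling `epub_text("book.epub", first=2, last=5)` would return spine items 2 to 4 as one string. Percent-encoded hrefs, front matter, and books that split a chapter across several files are not handled in this sketch.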
## Audiobook / EPUB workflow
```
MisoTTS EPUB Loader ──text──▶ MisoTTS Generate ──audio──▶ Save Audio
MisoTTS Model Loader ─model──▶
(optional) Load Audio ──ref_audio──▶
```
**Voice consistency.** CSM-style models pick a fresh voice on each independent call, so a naïve chapter-at-a-time loop drifts from one voice to another. `MisoTTS Generate` avoids this by feeding the previous chunk(s) back as context (`context_window`, default `1`). For a *specific* narrator voice, connect a `ref_audio` clip (a few seconds) plus its `ref_text` (the clip's transcript); that reference is then anchored across every chunk. Set a fixed `seed` for reproducible narration.
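
The rolling-context idea can be sketched as follows. This is an illustration only, not the node's code: `narrate` and `synthesize` are hypothetical names, and `synthesize(text, context)` stands in for the actual model call.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

# A context segment pairs a chunk's text with the audio generated for it.
Segment = Tuple[str, object]


def narrate(
    chunks: Iterable[str],
    synthesize: Callable[[str, List[Segment]], object],
    context_window: int = 1,
    reference: Optional[Segment] = None,
) -> List[object]:
    """Generate audio chunk by chunk, feeding recent output back as context.

    `synthesize` is a hypothetical stand-in for the model call. It receives
    the chunk text plus the context segments and returns that chunk's audio.
    """
    # A deque with maxlen keeps only the last `context_window` segments;
    # maxlen=0 keeps nothing, so every chunk is generated independently.
    recent = deque(maxlen=max(context_window, 0))
    outputs = []
    for text in chunks:
        # The reference clip, if any, leads the context for every chunk.
        context = [reference] if reference is not None else []
        context.extend(recent)
        audio = synthesize(text, context)
        outputs.append(audio)
        recent.append((text, audio))
    return outputs
```

Placing the reference segment first on every call is what keeps a cloned voice from fading as earlier chunks fall out of the window. Whether the real node orders its context this way is an assumption.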
Key `Generate` parameters:
- `chunk_chars` (300) — target characters per chunk; larger = fewer joins, more VRAM/time.
- `max_chunk_seconds` (30) — cap on audio generated per chunk.
- `context_window` (1) — prior chunks reused as context for voice consistency (0 = independent).
- `silence_ms` (250) — gap inserted between chunks.
- `temperature` (0.9) / `topk` (50) — sampling.
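
To make `chunk_chars` and `silence_ms` concrete, here is a minimal sketch of sentence-aware chunking and of sizing the silent gap. It is an illustration under stated assumptions, not the node's implementation: the regex-based sentence splitter, the function names, and the 24 kHz default sample rate are all assumptions.

```python
import re
from typing import List

# Split after sentence-ending punctuation followed by whitespace.
# A simple heuristic: abbreviations such as "Dr." will also trigger a split.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def chunk_text(text: str, chunk_chars: int = 300) -> List[str]:
    """Pack whole sentences into chunks of at most about chunk_chars characters.

    A single sentence longer than chunk_chars is kept whole rather than cut.
    """
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def silence_samples(silence_ms: int = 250, sample_rate: int = 24_000) -> int:
    """Number of zero samples needed for a gap of silence_ms at sample_rate."""
    return sample_rate * silence_ms // 1000
```

With these defaults, a 250 ms gap corresponds to 6000 samples at the assumed 24 kHz. Breaking only at sentence boundaries keeps each join on a natural pause, which is why larger `chunk_chars` values mean fewer audible seams.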
## Notes
- **Speed**: an 8B autoregressive model at 12.5 Hz × 32 codebooks runs at ~0.2× realtime in eager mode (about 5 s of compute per second of audio). That is fine for batch/audiobook rendering, not for live use. A `torch.compile` path is a planned optimization.
- **Watermarking** is not applied. If you redistribute generated audio, consider the upstream project's guidance.
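
The speed note can be turned into a quick render-time estimate. The sketch below uses only the figures quoted above and assumes "0.2× realtime" means 0.2 seconds of audio produced per wall-clock second; the function names are hypothetical.

```python
FRAME_RATE_HZ = 12.5   # codec frames per second of audio (from the speed note)
CODEBOOKS = 32         # codes per frame (from the speed note)
REALTIME_FACTOR = 0.2  # assumed: audio seconds produced per wall-clock second


def codes_per_audio_second() -> float:
    """Audio codes the model must emit for one second of speech."""
    return FRAME_RATE_HZ * CODEBOOKS


def render_hours(audio_hours: float, rtf: float = REALTIME_FACTOR) -> float:
    """Wall-clock hours needed to render audio_hours of speech."""
    return audio_hours / rtf


if __name__ == "__main__":
    print(f"{codes_per_audio_second():.0f} codes per second of audio")
    print(f"10 h audiobook: about {render_hours(10):.0f} h to render")
```

Actual throughput depends on the GPU, chunk length, and context size, so treat the result as an order-of-magnitude guide.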
## Credits
Model: [MisoLabsAI/MisoTTS](https://github.com/MisoLabsAI/MisoTTS) (Sesame CSM architecture). Mimi codec: Kyutai. This repo only provides the ComfyUI integration and the torchtune/moshi-free runtime.