ComfyUI-MisoTTS
ComfyUI nodes for MisoTTS — an 8B text-to-speech model built on the Sesame CSM architecture (Llama-3.2 backbone + audio decoder, Mimi codec, voice cloning).
This is a modernized integration: the upstream model pins torch==2.4, torchtune, and moshi, which won't run on recent GPUs (e.g. Blackwell / RTX 50-series) or alongside modern ComfyUI. This pack removes those constraints with no change in output.
What's different from upstream
| Upstream | Here | Why |
|---|---|---|
| torchtune (deprecated, pins torch 2.4) | vendored plain-torch Llama (misotts/csm_llama.py) | runs on any modern torch; numerically identical (verified Δ = 0, same weights) |
| moshi + bitsandbytes + sphn for Mimi | transformers.MimiModel | bit-identical audio codes; drops 4 heavy deps |
| gated meta-llama/Llama-3.2-1B tokenizer | ungated unsloth/Llama-3.2-1B mirror | no HF gating; same tokenizer |
| loads 32 GB fp32 into CPU RAM | streams weights straight to GPU in bf16 | ~18 GB VRAM, ~0 CPU RAM |
| watermark on by default | not bundled | minimal deps (re-addable) |
Result: a pin-free requirements.txt (just transformers, safetensors, tokenizers, torchaudio).
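
The memory figures in the table follow from simple arithmetic on the weight count. The sketch below assumes the "8B" label means roughly 8 billion parameters; the helper name `weight_gigabytes` is hypothetical and not part of this repo.

```python
def weight_gigabytes(n_params: float, bytes_per_param: int) -> float:
    """Size of the raw weights in GB (10^9 bytes), ignoring any runtime overhead."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # assumption: "8B" taken as exactly 8 billion parameters

fp32_gb = weight_gigabytes(N_PARAMS, 4)  # fp32 stores 4 bytes per parameter
bf16_gb = weight_gigabytes(N_PARAMS, 2)  # bf16 stores 2 bytes per parameter

print(f"fp32 checkpoint: {fp32_gb:.0f} GB, bf16 on GPU: {bf16_gb:.0f} GB")
```

Under this assumption the bf16 weights alone take about 16 GB, so the quoted ~18 GB VRAM figure presumably leaves the remainder for the Mimi codec, activations, and caches.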
Requirements
- A CUDA build of PyTorch matching your GPU (for RTX 50-series: cu128+, torch ≥ 2.7).
- ~18 GB free VRAM for bf16 inference.
- First run downloads the model (~32 GB) from MisoLabs/MisoTTS, the Mimi codec (kyutai/mimi), and the tokenizer.
cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-MisoTTS
pip install -r ComfyUI-MisoTTS/requirements.txt
Nodes
- MisoTTS Model Loader: loads the model (device / dtype). The 32 GB checkpoint is streamed to the GPU in the chosen dtype.
- MisoTTS Generate: text → speech. Handles long text (whole EPUB chapters) via sentence-aware chunking and keeps a consistent voice across chunks. Optional ref_audio + ref_text clone a specific voice.
- MisoTTS EPUB Loader: extracts a chapter range from an .epub as plain text.
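
For readers curious how chapter extraction from an EPUB can work, here is a minimal standard-library sketch. It is not the EPUB Loader node's actual implementation: the function name `epub_chapters` is hypothetical, and it assumes each spine item corresponds to one chapter, which does not hold for every book.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from html.parser import HTMLParser

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
SKIPPED = {"script", "style", "head"}
BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects visible text, ignoring <head>, <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK:
            self.parts.append(" ")  # keep adjacent blocks from running together

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join("".join(self.parts).split())


def epub_chapters(path, first=0, last=None):
    """Return plain text for spine items first..last (0-based, inclusive)."""
    with zipfile.ZipFile(path) as zf:
        # container.xml points at the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps ids to files; the spine gives the reading order.
        hrefs = {
            item.get("id"): item.get("href")
            for item in opf.findall(".//opf:manifest/opf:item", OPF_NS)
        }
        spine = [
            hrefs[ref.get("idref")]
            for ref in opf.findall(".//opf:spine/opf:itemref", OPF_NS)
        ]
        base = posixpath.dirname(opf_path)
        stop = None if last is None else last + 1
        chapters = []
        for href in spine[first:stop]:
            parser = _TextExtractor()
            parser.feed(zf.read(posixpath.join(base, href)).decode("utf-8"))
            chapters.append(parser.text())
    return chapters
```

Calling `epub_chapters("book.epub", first=2, last=4)` would return a list of three strings under these assumptions. URL-encoded hrefs and DRM-protected files are not handled here.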
Audiobook / EPUB workflow
MisoTTS EPUB Loader ──text──▶ MisoTTS Generate ──audio──▶ Save Audio
MisoTTS Model Loader ─model──▶
(optional) Load Audio ──ref_audio──▶
Voice consistency. CSM-style models pick a fresh voice on each independent call, so a naïve chapter-at-a-time loop drifts. MisoTTS Generate avoids this by feeding the previous chunk(s) back as context (context_window, default 1). For a specific narrator voice, connect a ref_audio clip (a few seconds) plus its ref_text — it's anchored across every chunk. Set a fixed seed for reproducible narration.
Key Generate parameters:
- chunk_chars (300): target characters per chunk; larger = fewer joins, more VRAM/time.
- max_chunk_seconds (30): cap on audio generated per chunk.
- context_window (1): prior chunks reused as context for voice consistency (0 = independent).
- silence_ms (250): gap inserted between chunks.
- temperature (0.9) / topk (50): sampling.
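
The interaction between chunk_chars and context_window can be sketched in a few lines of Python. This is an illustration only, not the node's actual code: the function names `split_into_chunks` and `context_for` are hypothetical, and the regex-based sentence splitter is an assumption about what "sentence-aware" means here.

```python
import re


def split_into_chunks(text: str, chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most chunk_chars characters.

    A single sentence longer than chunk_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > chunk_chars:
            chunks.append(current)  # close the chunk before it overflows
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def context_for(index: int, context_window: int = 1) -> list[int]:
    """Indices of the earlier chunks that would be fed back as context."""
    if context_window <= 0:
        return []  # 0 means every chunk is generated independently
    return list(range(max(0, index - context_window), index))


chunks = split_into_chunks("One. Two! Three? Four.", chunk_chars=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk), "context:", context_for(i, context_window=1))
```

With the default context_window of 1, chunk 2 is conditioned on chunk 1 only; a larger window trades more VRAM and time for a longer voice memory.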
Notes
- Speed: an 8B autoregressive model at 12.5 Hz × 32 codebooks is ~0.2× realtime in eager mode, which is fine for batch/audiobook rendering but not for live use. A torch.compile path is a planned optimization.
- Watermarking is not applied. If you redistribute generated audio, consider the upstream project's guidance.
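
To put the speed note in concrete terms, the arithmetic can be worked through as below. The constants come from the figures quoted above; the function names are hypothetical, and the ~0.2× realtime factor is an approximate number that will vary with hardware.

```python
FRAME_RATE_HZ = 12.5    # codec frames per second of audio
CODEBOOKS = 32          # codebook tokens generated per frame
REALTIME_FACTOR = 0.2   # approximate eager-mode speed (audio seconds per wall second)


def tokens_per_audio_second() -> float:
    """Codebook tokens the model must produce for one second of audio."""
    return FRAME_RATE_HZ * CODEBOOKS


def frames_for(seconds: float) -> float:
    """Frames needed for a clip, e.g. the max_chunk_seconds cap."""
    return seconds * FRAME_RATE_HZ


def render_hours(audio_hours: float, realtime_factor: float = REALTIME_FACTOR) -> float:
    """Wall-clock hours needed to render audio_hours of speech."""
    return audio_hours / realtime_factor


print(tokens_per_audio_second())  # tokens per second of audio
print(frames_for(30))             # frames in a 30-second chunk
print(render_hours(1.0))          # hours to render one hour of audio
```

At these figures each second of audio costs 400 codebook tokens, a 30-second chunk spans 375 frames, and a one-hour chapter takes roughly five hours to render.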
Credits
Model: MisoLabsAI/MisoTTS (Sesame CSM architecture). Mimi codec: Kyutai. This repo only provides the ComfyUI integration and the torchtune/moshi-free runtime.