ComfyUI-MisoTTS

ComfyUI nodes for MisoTTS — an 8B text-to-speech model built on the Sesame CSM architecture (Llama-3.2 backbone + audio decoder, Mimi codec, voice cloning).

This is a modernized integration: the upstream model pins torch==2.4, torchtune, and moshi, which won't run on recent GPUs (e.g. Blackwell / RTX 50-series) or alongside modern ComfyUI. This pack removes those constraints with no change in output.

What's different from upstream

Upstream | Here | Why
--- | --- | ---
torchtune (deprecated, pins torch 2.4) | vendored plain-torch Llama (misotts/csm_llama.py) | runs on any modern torch; numerically identical (verified Δ = 0, same weights)
moshi + bitsandbytes + sphn for Mimi | transformers.MimiModel | bit-identical audio codes; drops 4 heavy deps
gated meta-llama/Llama-3.2-1B tokenizer | ungated unsloth/Llama-3.2-1B mirror | no HF gating; same tokenizer
loads 32 GB fp32 into CPU RAM | streams weights straight to GPU in bf16 | ~18 GB VRAM, ~0 CPU RAM
watermark on by default | not bundled | minimal deps (re-addable)

Result: a pin-free requirements.txt (just transformers, safetensors, tokenizers, torchaudio).
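
The memory figures in the table follow from the parameter count. A back-of-the-envelope check, assuming "8B" means exactly 8 billion parameters and using decimal gigabytes (both are approximations, and `weight_gb` is a made-up helper name):

```python
def weight_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 8e9  # assumption: "8B" taken as exactly 8 billion parameters

fp32_gb = weight_gb(N_PARAMS, 4)  # fp32 stores 4 bytes per parameter
bf16_gb = weight_gb(N_PARAMS, 2)  # bf16 stores 2 bytes per parameter

print(f"fp32 weights: ~{fp32_gb:.0f} GB")
print(f"bf16 weights: ~{bf16_gb:.0f} GB")
```

The gap between the bf16 weights and the ~18 GB VRAM figure is presumably the Mimi codec, activations, and caches; that split is a guess, not a measurement.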

Requirements

  • A CUDA build of PyTorch matching your GPU (for RTX 50-series: cu128+, torch ≥ 2.7).
  • ~18 GB free VRAM for bf16 inference.
  • First run downloads the model (~32 GB) from MisoLabs/MisoTTS, the Mimi codec (kyutai/mimi), and the tokenizer.

Install

    cd ComfyUI/custom_nodes
    git clone https://github.com/ethanfel/ComfyUI-MisoTTS
    pip install -r ComfyUI-MisoTTS/requirements.txt

Nodes

  • MisoTTS Model Loader — loads the model (device / dtype). The 32 GB checkpoint is streamed to the GPU in the chosen dtype.
  • MisoTTS Generate — text → speech. Handles long text (whole EPUB chapters) via sentence-aware chunking and keeps a consistent voice across chunks. Optional ref_audio + ref_text clone a specific voice.
  • MisoTTS EPUB Loader — extracts a chapter range from an .epub as plain text.
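
For readers curious what "extracts a chapter range from an .epub as plain text" involves, here is a minimal standard-library sketch. It is not this pack's implementation: the function names are hypothetical, it assumes an unencrypted EPUB with UTF-8 content, and it treats each spine item as one chapter, which is only roughly true for real books.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}
_SKIP = {"head", "script", "style"}
_BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects body text, skipping <head>, <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP:
            self._skip += 1
        elif tag in _BLOCK:
            self.parts.append(" ")  # keep block elements from fusing

    def handle_endtag(self, tag):
        if tag in _SKIP and self._skip:
            self._skip -= 1
        elif tag in _BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

    def text(self) -> str:
        return " ".join("".join(self.parts).split())  # collapse whitespace


def epub_chapters(source) -> list[str]:
    """Return the plain text of each spine item, in reading order."""
    with zipfile.ZipFile(source) as zf:
        # container.xml points at the package (.opf) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))
        base = posixpath.dirname(opf_path)
        # The manifest maps ids to files; the spine gives reading order.
        hrefs = {item.attrib["id"]: item.attrib["href"]
                 for item in opf.findall(".//opf:manifest/opf:item", NS)}
        chapters = []
        for ref in opf.findall(".//opf:spine/opf:itemref", NS):
            path = posixpath.join(base, hrefs[ref.attrib["idref"]])
            parser = _TextExtractor()
            parser.feed(zf.read(path).decode("utf-8"))
            parser.close()
            chapters.append(parser.text())
        return chapters


def chapter_range(chapters: list[str], first: int, last: int) -> str:
    """Join a 1-based, inclusive chapter range into one text blob."""
    return "\n\n".join(chapters[first - 1:last])
```

`epub_chapters` accepts a path or a file-like object, so `chapter_range(epub_chapters("book.epub"), 3, 5)` would return chapters 3 through 5 as one string ready for the text input.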

Audiobook / EPUB workflow

MisoTTS EPUB Loader ──text──▶ MisoTTS Generate ──audio──▶ Save Audio
MisoTTS Model Loader ─model──▶
        (optional) Load Audio ──ref_audio──▶

Voice consistency. CSM-style models pick a fresh voice on each independent call, so a naïve chapter-at-a-time loop produces a voice that drifts from one call to the next. MisoTTS Generate avoids this by feeding the previous chunk(s) back as context (context_window, default 1). For a specific narrator voice, connect a ref_audio clip (a few seconds) plus its ref_text; that reference is then used as an anchor for every chunk. Set a fixed seed for reproducible narration.
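
The rolling-context idea can be sketched in a few lines. This is an illustration, not the node's code: `narrate` and the `synth(text, context)` callable are hypothetical stand-ins for whatever actually runs the model, and the audio type is left opaque.

```python
from collections import deque
from typing import Callable, Optional, Sequence

Segment = tuple[str, object]  # (text, audio); audio type left opaque here


def narrate(
    chunks: Sequence[str],
    synth: Callable[[str, list], object],  # hypothetical model call
    context_window: int = 1,
    reference: Optional[Segment] = None,
) -> list:
    """Generate chunks in order, feeding recent output back as context."""
    # A bounded deque drops the oldest chunk automatically;
    # maxlen=0 keeps nothing, so every call is independent.
    history: deque = deque(maxlen=context_window)
    outputs = []
    for text in chunks:
        context = []
        if reference is not None:
            context.append(reference)  # fixed anchor, present on every call
        context.extend(history)        # up to context_window prior chunks
        audio = synth(text, context)
        outputs.append(audio)
        history.append((text, audio))
    return outputs
```

With `context_window=1` each chunk sees only its predecessor (plus the reference, if given), which bounds the context length no matter how long the chapter is.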

Key Generate parameters:

  • chunk_chars (300) — target characters per chunk; larger = fewer joins, more VRAM/time.
  • max_chunk_seconds (30) — cap on audio generated per chunk.
  • context_window (1) — prior chunks reused as context for voice consistency (0 = independent).
  • silence_ms (250) — gap inserted between chunks.
  • temperature (0.9) / topk (50) — sampling.
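
To make chunk_chars, max_chunk_seconds, and silence_ms concrete, here is a rough sketch of sentence-aware chunking plus the related arithmetic. The helper names are hypothetical, the regex sentence splitter is deliberately naive, and the 24 kHz sample rate is an assumption about the Mimi codec's output; the 12.5 Hz frame rate is the figure quoted under Notes.

```python
import re

FRAME_RATE_HZ = 12.5    # codec frame rate quoted in the Notes section
SAMPLE_RATE_HZ = 24000  # assumption: Mimi output sample rate


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_text(text: str, chunk_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of about chunk_chars."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_chars:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
        else:
            current = candidate     # an oversize single sentence stays whole
    if current:
        chunks.append(current)
    return chunks


def max_frames(max_chunk_seconds: float) -> int:
    """Codec frames allowed per chunk before generation is cut off."""
    return int(max_chunk_seconds * FRAME_RATE_HZ)


def silence_samples(silence_ms: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of zero samples to insert between two chunks."""
    return sample_rate * silence_ms // 1000
```

Under these assumptions the defaults work out to 375 frames per chunk and 6000 samples of silence per join. Because chunks only break at sentence boundaries, chunk_chars is a target rather than a hard limit.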

Notes

  • Speed: an 8B autoregressive model at 12.5 Hz × 32 codebooks is ~0.2× realtime in eager mode — fine for batch/audiobook rendering, not live. A torch.compile path is a planned optimization.
  • Watermarking is not applied. If you redistribute generated audio, consider the upstream project's guidance.
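
The speed note translates into render-time estimates as follows. This is plain arithmetic on the figures quoted above; the function names are illustrative, and actual throughput will vary with GPU and settings.

```python
FRAME_RATE_HZ = 12.5   # codec frames per second of audio
CODEBOOKS = 32         # codes predicted per frame
REALTIME_FACTOR = 0.2  # quoted eager-mode speed (audio seconds per wall second)


def codes_per_audio_second() -> float:
    """Codebook entries the model must produce per second of speech."""
    return FRAME_RATE_HZ * CODEBOOKS


def render_hours(audio_hours: float, realtime_factor: float = REALTIME_FACTOR) -> float:
    """Wall-clock hours needed to render the given amount of audio."""
    return audio_hours / realtime_factor


print(f"{codes_per_audio_second():.0f} codes per second of audio")
print(f"10 h audiobook: ~{render_hours(10):.0f} h of rendering")
```

At ~0.2× realtime, every hour of narration costs roughly five hours of GPU time, which is why the pack targets offline batch rendering.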

Credits

Model: MisoLabsAI/MisoTTS (Sesame CSM architecture). Mimi codec: Kyutai. This repo only provides the ComfyUI integration and the torchtune/moshi-free runtime.
