
Ethanfel 953842c894 Add example workflows: audiobook (EPUB→Generate→Save) and basic TTS

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-06 23:41:50 +02:00


ComfyUI-MisoTTS

ComfyUI nodes for MisoTTS — an 8B text-to-speech model built on the Sesame CSM architecture (Llama-3.2 backbone + audio decoder, Mimi codec, voice cloning).

This is a modernized integration: the upstream model pins torch==2.4, torchtune, and moshi, which won't run on recent GPUs (e.g. Blackwell / RTX 50-series) or alongside modern ComfyUI. This pack removes those constraints with no change in output.

What's different from upstream

| Upstream | Here | Why |
| --- | --- | --- |
| `torchtune` (deprecated, pins torch 2.4) | vendored plain-torch Llama (`misotts/csm_llama.py`) | runs on any modern torch; numerically identical (verified Δ = 0, same weights) |
| `moshi` + `bitsandbytes` + `sphn` for Mimi | `transformers.MimiModel` | bit-identical audio codes; drops 4 heavy deps |
| gated `meta-llama/Llama-3.2-1B` tokenizer | ungated `unsloth/Llama-3.2-1B` mirror | no HF gating; same tokenizer |
| loads 32 GB fp32 into CPU RAM | streams weights straight to GPU in bf16 | ~18 GB VRAM, ~0 CPU RAM |
| watermark on by default | not bundled | minimal deps (re-addable) |

Result: a pin-free requirements.txt (just transformers, safetensors, tokenizers, torchaudio).

Requirements

A CUDA build of PyTorch matching your GPU (for RTX 50-series: cu128+, torch ≥ 2.7).
~18 GB free VRAM for bf16 inference.
First run downloads the model (~32 GB) from MisoLabs/MisoTTS, the Mimi codec (kyutai/mimi), and the tokenizer.
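
The 32 GB download and the ~18 GB VRAM figure both follow from parameter count times bytes per parameter. A rough check, assuming exactly 8e9 parameters (the README only says "8B", so the true count may differ) and decimal gigabytes:

```python
# Rough weight-memory estimate.
# PARAMS is an assumption: the README says "8B", not an exact count.
PARAMS = 8_000_000_000
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_gb(dtype: str, params: int = PARAMS) -> float:
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(dtype):.0f} GB")
```

Under these assumptions the bf16 weights alone come to about 16 GB. The remaining couple of gigabytes in the ~18 GB figure are presumably the codec, caches, and activations, though the README does not break this down.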

cd ComfyUI/custom_nodes
git clone https://github.com/ethanfel/ComfyUI-MisoTTS
pip install -r ComfyUI-MisoTTS/requirements.txt

Nodes

MisoTTS Model Loader — loads the model (device / dtype). The 32 GB checkpoint is streamed to the GPU in the chosen dtype.
MisoTTS Generate — text → speech. Handles long text (whole EPUB chapters) via sentence-aware chunking and keeps a consistent voice across chunks. Optional ref_audio + ref_text clone a specific voice.
MisoTTS EPUB Loader — extracts a chapter range from an .epub as plain text.
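
To illustrate what extracting a chapter range from an .epub involves, here is a minimal standard-library sketch. It is not the node's actual code: `epub_chapters`, `load_chapter_range`, and `xhtml_to_text` are hypothetical names, and the sketch assumes a simple EPUB where each spine item is one chapter and item paths are not URL-encoded.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from html.parser import HTMLParser

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style> and <head> contents."""

    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def xhtml_to_text(markup: str) -> str:
    """Strip tags and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def epub_chapters(path):
    """Return the plain text of each spine item, in reading order."""
    with zipfile.ZipFile(path) as zf:
        # container.xml points at the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).attrib["full-path"]
        opf = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to files; the spine gives reading order.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in opf.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        chapters = []
        for ref in opf.findall("opf:spine/opf:itemref", OPF_NS):
            raw = zf.read(posixpath.join(base, manifest[ref.attrib["idref"]]))
            chapters.append(xhtml_to_text(raw.decode("utf-8")))
        return chapters


def load_chapter_range(path, first: int, last: int) -> str:
    """Join chapters first..last (1-based, inclusive) into one text."""
    return "\n\n".join(epub_chapters(path)[first - 1:last])
```

Real books often put front matter, covers, and navigation pages in the spine too, so a spine index does not always match the chapter number printed in the book.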

Audiobook / EPUB workflow

MisoTTS EPUB Loader ──text──▶ MisoTTS Generate ──audio──▶ Save Audio
MisoTTS Model Loader ─model──▶
        (optional) Load Audio ──ref_audio──▶

Voice consistency. CSM-style models pick a fresh voice on each independent call, so a naïve chapter-at-a-time loop drifts. MisoTTS Generate avoids this by feeding the previous chunk(s) back as context (context_window, default 1). For a specific narrator voice, connect a ref_audio clip (a few seconds) plus its ref_text — it's anchored across every chunk. Set a fixed seed for reproducible narration.
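
The context-feedback idea can be sketched as a rolling window over previously generated chunks. This is an illustration only, not the node's implementation: `narrate` and the `synthesize(text, context)` callback are hypothetical names, and the audio type is left opaque.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence, Tuple

# A (text, audio) pair. The audio type is deliberately opaque in this sketch.
Segment = Tuple[str, object]


def narrate(
    chunks: Sequence[str],
    synthesize: Callable[[str, List[Segment]], object],  # hypothetical model call
    context_window: int = 1,
    reference: Optional[Segment] = None,
) -> list:
    """Generate chunks in order, feeding recent output back as context."""
    # deque(maxlen=0) silently stays empty, so context_window=0 gives
    # fully independent chunks.
    history: deque = deque(maxlen=context_window)
    outputs = []
    for text in chunks:
        # The reference clip, if any, is prepended for every chunk so the
        # narrator voice stays anchored; recent chunks follow it.
        context = ([reference] if reference is not None else []) + list(history)
        audio = synthesize(text, context)
        outputs.append(audio)
        history.append((text, audio))
    return outputs
```

With `context_window=1` each chunk sees only its immediate predecessor, which keeps the prompt short while still carrying the voice forward.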

Key Generate parameters:

chunk_chars (300) — target characters per chunk; larger = fewer joins, more VRAM/time.
max_chunk_seconds (30) — cap on audio generated per chunk.
context_window (1) — prior chunks reused as context for voice consistency (0 = independent).
silence_ms (250) — gap inserted between chunks.
temperature (0.9) / topk (50) — sampling.
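
As a rough picture of how `chunk_chars` and `silence_ms` might act, the sketch below packs whole sentences greedily up to a character budget and converts a gap in milliseconds to samples. The function names and the splitting regex are assumptions for illustration; the node's real chunker may handle abbreviations, quotes, and oversized sentences differently.

```python
import re
from typing import List

# Split after ., ! or ? followed by whitespace (a deliberately simple rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_text(text: str, chunk_chars: int = 300) -> List[str]:
    """Greedily pack whole sentences into chunks of at most chunk_chars.

    A single sentence longer than chunk_chars is kept whole rather than cut.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > chunk_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def silence_samples(silence_ms: int, sample_rate: int) -> int:
    """Number of zero samples needed for a gap of silence_ms milliseconds."""
    return silence_ms * sample_rate // 1000
```

For example, at a 24 kHz sample rate (Mimi's native rate; whether the node outputs at that rate is an assumption here) the default 250 ms gap is `silence_samples(250, 24000)` samples of silence.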

Notes

Speed: an 8B autoregressive model at 12.5 Hz × 32 codebooks is ~0.2× realtime in eager mode — fine for batch/audiobook rendering, not live. A torch.compile path is a planned optimization.
Watermarking is not applied. If you redistribute generated audio, consider the upstream project's guidance.
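
To put the speed note in concrete terms, the arithmetic below uses only the figures quoted above (12.5 Hz frame rate, 32 codebooks, ~0.2× realtime). The helper names are illustrative, and actual throughput will vary with GPU and settings.

```python
FRAME_RATE_HZ = 12.5    # codec frames per second of audio (from the note above)
CODEBOOKS = 32          # codes predicted per frame (from the note above)
REALTIME_FACTOR = 0.2   # approximate eager-mode speed quoted above


def codes_per_audio_second() -> float:
    """Audio codes the model must produce for one second of speech."""
    return FRAME_RATE_HZ * CODEBOOKS


def render_hours(audio_hours: float, realtime_factor: float = REALTIME_FACTOR) -> float:
    """Wall-clock hours needed to render audio_hours of speech."""
    return audio_hours / realtime_factor


if __name__ == "__main__":
    print(codes_per_audio_second(), "codes per second of audio")
    print(render_hours(10), "hours to render a 10-hour audiobook")
```

At these figures a 10-hour audiobook takes roughly 50 hours of compute, which is why the node targets batch rendering.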

Credits

Model: MisoLabsAI/MisoTTS (Sesame CSM architecture). Mimi codec: Kyutai. This repo only provides the ComfyUI integration and the torchtune/moshi-free runtime.
