Speed values below 0.3 produce noise, and a speed of 0.1 generates audio
10x the normal length, which can consume 20+ GB of VRAM and freeze the system.
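A minimal sketch of one way to enforce such a lower bound, assuming the speed arrives as a plain float. The names `MIN_SPEED`, `clamp_speed` and `length_multiplier` are hypothetical and not taken from the node code; the inverse relation between speed and length is inferred from the "0.1 gives 10x" figure above.

```python
MIN_SPEED = 0.3  # assumed floor: values below this are reported to produce noise


def clamp_speed(speed: float, minimum: float = MIN_SPEED) -> float:
    """Return `speed`, raised to `minimum` when it is lower."""
    return max(float(speed), minimum)


def length_multiplier(speed: float) -> float:
    """Approximate output length relative to speed 1.0 (assumed to be 1/speed)."""
    return 1.0 / speed
```

Clamping before generation keeps a stray 0.1 from requesting roughly ten times the normal audio length.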
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pin to a confirmed working version. The previous >=4.40.0 constraint was
too permissive, and older versions may lack APIs that omnivoice depends on.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The chapter title was appearing multiple times in the text (from <title>,
<h1>, and body). Now <title> and <h1>/<h2>/<h3> are removed from the body
text since the title is already available via the chapter_title output.
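The filtering described above could look roughly like the following sketch, which uses only the standard-library `html.parser`. The class and function names are hypothetical, and the real extractor may rely on a different HTML library.

```python
from html.parser import HTMLParser

# Tags whose text is dropped because the title is exposed separately.
SKIP_TAGS = {"title", "h1", "h2", "h3"}


class BodyTextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```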
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Newer torchaudio versions default to the TorchCodec backend for loading
audio files. Without it installed, omnivoice fails with ImportError when
loading reference audio.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Returns the title of the first selected chapter as a STRING so it can
be wired directly into a Save Audio node's filename field.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The broad except ImportError was swallowing the actual failure reason
(e.g. a missing transitive dep after --no-deps install). Now captures
and re-raises the original exception in the error message so users
can diagnose what is actually missing.
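As an illustration of the pattern, the sketch below wraps an import and carries the original failure reason into the new message. `import_or_explain` is a hypothetical helper, not the function used in the node code.

```python
import importlib


def import_or_explain(module_name: str):
    """Import a module, keeping the original failure reason in the error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Include the underlying message so a missing transitive
        # dependency is visible instead of being swallowed.
        raise ImportError(
            f"Failed to import {module_name!r}: {exc}. "
            "A dependency may be missing after a --no-deps install."
        ) from exc
```

Chaining with `from exc` also preserves the original traceback for anyone reading the console log.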
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Dynamic JS inputs that are not listed in INPUT_TYPES may be rejected by
ComfyUI's prompt validator and not passed to the Python function. Declaring
all 8 slots as optional fixes this while JS still controls which slots are
visible on the node.
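A hedged sketch of what declaring all slots up front might look like. The class name is hypothetical, and the 1-based `speaker_N` numbering and the `num_speakers` defaults are assumptions; only the slot count (8), the `speaker_N` naming and the `OMNIVOICE_SPEAKER` type come from these commit messages.

```python
MAX_SPEAKERS = 8  # matches the 8 slots mentioned above


class SpeakersRosterSketch:
    """Illustrative node skeleton: every dynamic slot is declared as optional."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                # Assumed widget options; the real defaults may differ.
                "num_speakers": ("INT", {"default": 2, "min": 1, "max": MAX_SPEAKERS}),
            },
            "optional": {
                # Declared so the prompt validator accepts them; the JS
                # extension still decides which slots are shown.
                f"speaker_{i}": ("OMNIVOICE_SPEAKER",)
                for i in range(1, MAX_SPEAKERS + 1)
            },
        }
```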
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop any audio file (wav/flac/mp3/ogg/m4a) into the presets cache dir and
it will appear as "<name> (local)" in the Voice Preset dropdown on next
ComfyUI restart. Add a same-stem .txt file for the transcript.
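The discovery step could be sketched as follows, assuming presets are found by scanning the cache directory once at startup. `scan_local_presets` and its return shape are hypothetical; the extension list and the "(local)" suffix follow the description above.

```python
from pathlib import Path

AUDIO_EXTS = {".wav", ".flac", ".mp3", ".ogg", ".m4a"}


def scan_local_presets(cache_dir):
    """Map '<name> (local)' to (audio_path, transcript) for each audio file."""
    presets = {}
    for path in sorted(Path(cache_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTS:
            # A same-stem .txt file, if present, supplies the transcript.
            txt = path.with_suffix(".txt")
            transcript = txt.read_text(encoding="utf-8").strip() if txt.is_file() else ""
            presets[f"{path.stem} (local)"] = (path, transcript)
    return presets
```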
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each line starting with [Tag] now begins a new segment so users don't need
blank lines between tagged speeches. Continuation lines (no tag) are joined
to the previous tagged segment for multi-line speeches.
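One possible shape for this parsing rule, shown as a sketch. The regular expression and the name `split_tagged` are assumptions; the real parser may treat blank lines or text before the first tag differently.

```python
import re

# A line such as "[Alice] Hello" starts a new segment.
TAG_RE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def split_tagged(text):
    """Return a list of (tag, speech) tuples, joining continuation lines."""
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are optional and carry no meaning here
        match = TAG_RE.match(line)
        if match:
            segments.append([match.group(1), match.group(2)])
        elif segments:
            # Untagged line: append to the previous tagged segment.
            segments[-1][1] = (segments[-1][1] + " " + line).strip()
        # Untagged lines before the first tag are ignored in this sketch.
    return [(tag, speech) for tag, speech in segments]
```

For example, the input `"[A] Hi\nthere\n[B] Yo"` yields one segment for A containing both lines and one for B.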
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add OmniVoiceSpeaker node (label + ref_audio + ref_text → OMNIVOICE_SPEAKER)
- Add OmniVoiceSpeakers node (roster with dynamic speaker_N inputs driven by
num_speakers INT widget; slots expand/collapse via ComfyUI JS extension)
- Add web/multi_speaker.js: ComfyUI extension that hooks onNodeCreated and
onConfigure to sync speaker_N inputs in real time (max 8 speakers)
- Extend OmniVoiceGenerate with optional speakers (OMNIVOICE_SPEAKERS) input;
when connected it routes each paragraph to the assigned speaker and
concatenates the results — supports alternate_paragraphs and tagged_speakers modes
- Remove OmniVoiceMultiSpeakerGenerate (generation now lives in the existing
Generate node)
- Refactor generator.py: extract _write_tmp_wav helper, add _tensors_to_audio
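The alternate_paragraphs routing mentioned above can be pictured with the sketch below. It assumes paragraphs are separated by blank lines and speakers are assigned round-robin; `assign_alternating` is a hypothetical helper, and speakers are represented by plain labels rather than OMNIVOICE_SPEAKER objects.

```python
import re


def assign_alternating(text, speakers):
    """Pair each paragraph with a speaker, cycling through the roster."""
    if not speakers:
        raise ValueError("at least one speaker is required")
    # Assumption: a paragraph is a block of text separated by blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(speakers[i % len(speakers)], p) for i, p in enumerate(paragraphs)]
```

Each (speaker, paragraph) pair would then be generated separately and the resulting audio concatenated in order.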
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chinese characters vs English words are self-identifying to the model.
No need for a separate language signal on either node.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice Design now outputs (instruct, language) — wire language directly
into Generate to avoid setting it in two places. Generate's language
input is now a STRING (accepts the connection or manual 'auto').
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pydub, tensorboardx, webdataset are omnivoice dependencies that won't
be present on a clean ComfyUI install since we use --no-deps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pinning transformers==5.3.0 risks conflicting with an existing ComfyUI venv.
Reverted to the permissive >=4.40.0, which worked in practice.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Generate: language dropdown (auto/English/Chinese), passed only in
voice_design and auto_voice modes where it selects the instruct vocab
- VoiceDesign: Chinese mode with dialect/age/pitch/gender dropdowns
using the model's validated Chinese instruct vocabulary (full-width commas, ，)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The model's _resolve_instruct() validates against a fixed vocabulary.
Only 10 accents are supported — removed all unsupported additions.
Updated tooltip to reflect actual constraints.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If instruct is set alongside ref_audio, it is now forwarded to
model.generate() — allowing accent/style transfer on top of the
cloned voice identity. Model may or may not honour both simultaneously.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Language: ~170 world languages with type-to-filter dropdown
Accent: 50+ regional varieties grouped by area
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OmniVoiceVoiceDesign: structured dropdowns for gender/age/pitch/accent
that compose into an instruct string — wire to Generate's instruct input.
OmniVoiceGenerate: new optional language dropdown (auto + 11 languages)
and guidance_scale (CFG, default 2.0) parameters.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Refreshed node IDs, positions and sizes from live session. Replaced
SaveAudio with PreviewAudio, added ref_text widget entry, updated
aux_id/ver properties.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI appends a hidden "fixed"/"randomize" value after every INT
named "seed". Without it the widget values were misaligned.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Compiles the model graph on first generation (~30-60s warmup) then
speeds up all subsequent generations in the session. Recommended for
audiobook pipelines. Default off.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Generate node height 340→400 to fit all 6 widgets, Voice Preset
height 80→100, SaveAudio position adjusted.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Catching bare Exception was silently swallowing real resampling errors.
Only ImportError should trigger the interpolate fallback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _resample: squeeze batch dim before torchaudio.Resample (expected 2D)
- weight scaling: each clip now trims to natural_length*weight samples,
dropping the broken target_per_unit double-multiplication
- empty trimmed guard: raise clear error when all weights are 0
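The trimming rule and the empty guard could be sketched like this, using plain Python sequences in place of audio tensors. `trim_by_weight` is a hypothetical name, and truncating `natural_length * weight` to an integer is an assumption about rounding.

```python
def trim_by_weight(clips, weights):
    """Trim each clip to int(len(clip) * weight) samples; drop empty results."""
    if len(clips) != len(weights):
        raise ValueError("clips and weights must have the same length")
    trimmed = []
    for clip, weight in zip(clips, weights):
        keep = int(len(clip) * weight)  # natural_length * weight, applied once
        if keep > 0:
            trimmed.append(clip[:keep])
    if not trimmed:
        # Guard: with every weight at 0 there is nothing left to use.
        raise ValueError("all clips are empty after weighting; set at least one weight above 0")
    return trimmed
```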
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>