Each line starting with [Tag] now begins a new segment so users don't need
blank lines between tagged speeches. Continuation lines (no tag) are joined
to the previous tagged segment for multi-line speeches.
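The splitting rule above can be sketched as follows. This is a minimal illustration, not the node's actual parser: the function name `split_tagged_segments` and the exact tag pattern (`[Name]` at the start of a line) are assumptions.

```python
import re

# Assumed tag syntax: "[Name] speech text" at the start of a line.
TAG_RE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def split_tagged_segments(text):
    """Return (tag, speech) pairs; untagged lines join the previous segment."""
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are tolerated but no longer required
        match = TAG_RE.match(line)
        if match:
            # A tagged line always starts a new segment.
            segments.append([match.group(1), match.group(2)])
        elif segments:
            # Continuation line: append to the most recent segment.
            segments[-1][1] = (segments[-1][1] + " " + line).strip()
        else:
            # Text before any tag is kept with no speaker assigned.
            segments.append([None, line])
    return [(tag, speech) for tag, speech in segments]
```

For example, `split_tagged_segments("[Alice] Hello\nthere.\n[Bob] Hi.")` yields one segment for Alice containing both lines and one for Bob.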
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add OmniVoiceSpeaker node (label + ref_audio + ref_text → OMNIVOICE_SPEAKER)
- Add OmniVoiceSpeakers node (roster with dynamic speaker_N inputs driven by
num_speakers INT widget; slots expand/collapse via ComfyUI JS extension)
- Add web/multi_speaker.js: ComfyUI extension that hooks onNodeCreated and
onConfigure to sync speaker_N inputs in real time (max 8 speakers)
- Extend OmniVoiceGenerate with optional speakers (OMNIVOICE_SPEAKERS) input;
when connected it routes each paragraph to the assigned speaker and
concatenates the results — supports alternate_paragraphs and tagged_speakers modes
- Remove OmniVoiceMultiSpeakerGenerate (generation now lives in the existing
Generate node)
- Refactor generator.py: extract _write_tmp_wav helper, add _tensors_to_audio
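A rough sketch of how alternate_paragraphs routing might assign speakers, assuming paragraphs are separated by blank lines and speakers are cycled in roster order. The function name `assign_alternating` and both assumptions are illustrative, not taken from generator.py.

```python
def assign_alternating(text, speaker_labels):
    """Pair each paragraph with a speaker label, cycling through the roster."""
    if not speaker_labels:
        raise ValueError("at least one speaker is required")
    # Assumption: a blank line separates paragraphs.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        (speaker_labels[i % len(speaker_labels)], paragraph)
        for i, paragraph in enumerate(paragraphs)
    ]
```

Each pair would then be generated with that speaker's reference audio and the resulting clips concatenated.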
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The model can tell Chinese characters from English words on its own,
so neither node needs a separate language signal.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Voice Design now outputs (instruct, language) — wire language directly
into Generate to avoid setting it in two places. Generate's language
input is now a STRING (accepts the connection or manual 'auto').
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pydub, tensorboardx, and webdataset are omnivoice dependencies that won't
be present on a clean ComfyUI install, since omnivoice is installed with --no-deps.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pinning transformers==5.3.0 risks conflicting with an existing ComfyUI venv.
Revert to the permissive >=4.40.0, which worked in practice.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Generate: language dropdown (auto/English/Chinese), passed only in
voice_design and auto_voice modes where it selects the instruct vocab
- VoiceDesign: Chinese mode with dialect/age/pitch/gender dropdowns
using the model's validated Chinese instruct vocabulary (full-width commas)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The model's _resolve_instruct() validates against a fixed vocabulary.
Only 10 accents are supported — removed all unsupported additions.
Updated tooltip to reflect actual constraints.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
If instruct is set alongside ref_audio, it is now forwarded to
model.generate() — allowing accent/style transfer on top of the
cloned voice identity. Model may or may not honour both simultaneously.
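One way the forwarding could look, as a sketch only: the keyword names passed to model.generate() (`text`, `ref_audio`, `ref_text`, `instruct`) and the helper name `build_generate_kwargs` are assumptions for illustration.

```python
def build_generate_kwargs(text, ref_audio=None, ref_text=None, instruct=None):
    """Collect keyword arguments for a generate() call (names are assumed)."""
    kwargs = {"text": text}
    if ref_audio is not None:
        kwargs["ref_audio"] = ref_audio
        if ref_text:
            kwargs["ref_text"] = ref_text
    # instruct is forwarded whether or not a reference clip is present.
    if instruct and instruct.strip():
        kwargs["instruct"] = instruct.strip()
    return kwargs
```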
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Language: ~170 world languages with type-to-filter dropdown
Accent: 50+ regional varieties grouped by area
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OmniVoiceVoiceDesign: structured dropdowns for gender/age/pitch/accent
that compose into an instruct string — wire to Generate's instruct input.
OmniVoiceGenerate: new optional language dropdown (auto + 11 languages)
and guidance_scale (CFG, default 2.0) parameters.
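Composing the dropdown values into an instruct string might look like the sketch below. The separator, the "none" sentinel, and the example values are assumptions; the model's real vocabulary is not reproduced here.

```python
def compose_instruct(gender="", age="", pitch="", accent="", separator=", "):
    """Join the selected attributes, skipping empty or 'none' choices."""
    chosen = []
    for value in (gender, age, pitch, accent):
        value = (value or "").strip()
        if value and value.lower() != "none":
            chosen.append(value)
    return separator.join(chosen)
```

`compose_instruct("female", "young", "none", "british accent")` gives a single string that can be wired into Generate's instruct input.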
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Refreshed node IDs, positions and sizes from live session. Replaced
SaveAudio with PreviewAudio, added ref_text widget entry, updated
aux_id/ver properties.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ComfyUI appends a hidden "fixed"/"randomize" value after every INT
named "seed". Without it the widget values were misaligned.
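The misalignment can be illustrated with a small helper that builds a workflow's widget value list. The helper name and the widget names other than "seed" are hypothetical; only the extra slot after the seed reflects the behaviour described above.

```python
def build_widgets_values(widgets, seed_mode="fixed"):
    """Flatten (name, value) pairs, adding the hidden slot after 'seed'."""
    values = []
    for name, value in widgets:
        values.append(value)
        if name == "seed":
            # ComfyUI stores the "fixed"/"randomize" control right after the seed.
            values.append(seed_mode)
    return values
```

Without the extra entry, every value after the seed would be read one position too early.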
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Compiles the model graph on first generation (~30-60s warmup) then
speeds up all subsequent generations in the session. Recommended for
audiobook pipelines. Default off.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Generate node height 340→400 to fit all 6 widgets, Voice Preset
height 80→100, SaveAudio position adjusted.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Catching bare Exception was silently swallowing real resampling errors.
Only ImportError should trigger the interpolate fallback.
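The narrowed exception handling can be sketched like this. To keep the example runnable without torch, the fallback is a pure-Python linear interpolation over a list of samples, standing in for the real interpolate fallback; the `module_name` parameter exists only so the fallback path can be exercised.

```python
import importlib


def linear_resample(samples, orig_sr, target_sr):
    """Pure-Python linear interpolation (stand-in for the real fallback)."""
    if not samples:
        return []
    n_out = max(1, round(len(samples) * target_sr / orig_sr))
    if n_out == 1 or len(samples) == 1:
        return [samples[0]] * n_out
    step = (len(samples) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


def resample(samples, orig_sr, target_sr, module_name="torchaudio"):
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Only a missing dependency selects the fallback.
        return linear_resample(samples, orig_sr, target_sr)
    # Errors raised here are real resampling failures and must propagate.
    return module.functional.resample(samples, orig_sr, target_sr)
```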
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _resample: squeeze batch dim before torchaudio.Resample (expected 2D)
- weight scaling: each clip now trims to natural_length*weight samples,
dropping the broken target_per_unit double-multiplication
- empty trimmed guard: raise clear error when all weights are 0
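The corrected weight scaling and the empty guard can be sketched with plain lists standing in for audio tensors. The function name `trim_by_weight` and the error wording are illustrative assumptions.

```python
def trim_by_weight(clips, weights):
    """Trim each clip to natural_length * weight samples; drop empty results."""
    if len(clips) != len(weights):
        raise ValueError("clips and weights must have the same length")
    trimmed = []
    for clip, weight in zip(clips, weights):
        keep = int(len(clip) * weight)  # one multiplication, no second scaling
        if keep > 0:
            trimmed.append(clip[:keep])
    if not trimmed:
        raise ValueError(
            "All reference clips were trimmed to zero length; "
            "set at least one weight above 0."
        )
    return trimmed
```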
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sets a default seed so the voice stays consistent across all generated
chunks when using the workflow as a starting point for audiobook pipelines.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
OmniVoice chunks long text internally; each chunk is a separate diffusion
pass with different random noise, causing voice drift between paragraphs.
Setting the same seed before each generate() call anchors the RNG state
and keeps the voice consistent. seed=0 means random (default behaviour).
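A minimal sketch of the reseeding pattern. It uses the standard library's `random.seed` as a stand-in so it runs without torch; in a torch pipeline, `torch.manual_seed` would be the natural choice for `seed_fn`. The names `generate_chunks` and `generate_fn` are hypothetical.

```python
import random


def generate_chunks(chunks, generate_fn, seed=0, seed_fn=random.seed):
    """Call generate_fn per chunk, resetting the RNG first when seed != 0."""
    outputs = []
    for chunk in chunks:
        if seed != 0:
            # Same seed before every call, so each chunk starts from the
            # same RNG state. seed == 0 leaves the RNG untouched (random).
            seed_fn(seed)
        outputs.append(generate_fn(chunk))
    return outputs
```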
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
get_text(separator=' ') collapsed all paragraphs into one line.
Now inserts \n\n at block-level element boundaries (p, h1-h6, div,
li, br, tr) before extraction, then normalises whitespace.
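The idea can be sketched with the standard library's `html.parser` instead of the parser used in the node, so the example has no third-party dependencies. The class and function names are illustrative.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6", "div", "li", "br", "tr"}


class _BlockTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")  # paragraph break at block boundaries

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Collapse source whitespace so only block tags create breaks.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_paragraphs(html):
    parser = _BlockTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Normalise: split on blank lines, trim, and drop empty paragraphs.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return "\n\n".join(p for p in paragraphs if p)
```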
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Models always download to ComfyUI/models/omnivoice/ via HuggingFace.
Local path added unnecessary complexity; users who want a custom path
can symlink into the models directory.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Model Loader → Load Audio → OmniVoice Generate → Save Audio.
Connect a Whisper node to ref_text for auto-transcription.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Users should connect a ComfyUI Whisper node to ref_text instead of
relying on omnivoice's internal ASR. Removes the error-catch workaround
and updates the tooltip accordingly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
requirements.txt cannot install omnivoice (it would pull in torch==2.8.*
and break ComfyUI). install.py now does exactly one thing: install
omnivoice --no-deps, skipped if already present. All other deps remain
in requirements.txt for ComfyUI Manager to handle normally.
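A sketch of what such an install.py could contain, assuming the import name matches the package name `omnivoice`. The function names are illustrative; the pip command is built separately so it can be inspected before it is run.

```python
import importlib.util
import subprocess
import sys


def pip_command_if_missing(package="omnivoice"):
    """Return the pip command to run, or None if the package is importable."""
    if importlib.util.find_spec(package) is not None:
        return None  # already present: do nothing
    # --no-deps keeps pip from touching torch or other shared packages.
    return [sys.executable, "-m", "pip", "install", "--no-deps", package]


def install_if_missing(package="omnivoice"):
    command = pip_command_if_missing(package)
    if command is not None:
        subprocess.check_call(command)
```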
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
install.py was running arbitrary pip installs as part of node loading,
which is dangerous in a shared venv. Standard approach: requirements.txt
lists the safe deps (transformers, accelerate, soundfile, etc.);
omnivoice itself must be installed once manually with --no-deps to avoid
overwriting ComfyUI's torch. README documents this clearly.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The transformers version cap was wrong: it would downgrade transformers in
shared venvs and break other nodes. The torchcodec issue is now handled in code.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
install.py: restore transformers>=5.0.0 (capping it would break other nodes).
generator.py: catch the torchcodec RuntimeError that fires when ref_text is
blank and transformers 5.x auto-transcription requires missing FFmpeg libs.
Raises a human-readable error telling the user to fill in ref_text manually.
Also updates the ref_text tooltip to recommend providing it explicitly.
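The error translation might be sketched as below. Matching on the substring "torchcodec" in the exception message, the wrapper name, and the keyword names passed to the generate callable are all assumptions, not the code in generator.py.

```python
def generate_with_ref(model_generate, text, ref_audio, ref_text=""):
    """Call model_generate and turn the torchcodec failure into a clear error."""
    try:
        return model_generate(text=text, ref_audio=ref_audio, ref_text=ref_text or None)
    except RuntimeError as exc:
        # Assumed heuristic: the failure mentions torchcodec and ref_text is blank.
        if not ref_text.strip() and "torchcodec" in str(exc).lower():
            raise RuntimeError(
                "Auto-transcription of the reference audio failed because "
                "FFmpeg libraries are missing. Fill in ref_text manually."
            ) from exc
        raise  # unrelated runtime errors are not rewritten
```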
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>