fix: wrap CLIP encoding in inference_mode during pre-generation

The CLIP weights are created as inference tensors when ComfyUI loads
them. inference_mode is thread-local, so the pre-generation worker
thread runs without it, and PyTorch rejects the inference tensors in
multi_head_attention_forward (they do not track a version counter).
Wrap the encode_text_clip call in torch.inference_mode(); text encoding
doesn't need gradients.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:10:58 +02:00
parent 10a71b0c4f
commit 32e5344ea2
@@ -529,6 +529,7 @@ def _pregenerate_lora_mels(model, data_dir, lora_adapter_path, device, dtype,
     prompt = prompt_map.get(npz_path.name, data.get("prompt", default_prompt))
     if isinstance(prompt, np.ndarray):
         prompt = str(prompt)
+    with torch.inference_mode():
         text_clip = feature_utils.encode_text_clip([prompt]).to(device, dtype)
     # Load clean audio
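
The failure mode described in the commit message can be reproduced in isolation with a minimal sketch. The names `load_weights_under_inference_mode` and `encode` are hypothetical stand-ins, not functions from this repository, and the single matmul is only a placeholder for the real CLIP text encoder. The error raised here ("cannot be saved for backward") may differ from the one seen in multi_head_attention_forward; the sketch only illustrates the general class of problem and why wrapping the call in torch.inference_mode() avoids it.

```python
import torch


def load_weights_under_inference_mode() -> torch.Tensor:
    """Hypothetical loader: tensors created inside inference mode
    are inference tensors (assumed to mirror how the CLIP weights arise)."""
    with torch.inference_mode():
        return torch.ones(4, 4)


def encode(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for encode_text_clip: a single matmul."""
    return x @ weight


weight = load_weights_under_inference_mode()
x = torch.ones(2, 4, requires_grad=True)  # autograd is active for this input

# Outside inference mode, autograd tries to save `weight` for backward,
# which is not allowed for inference tensors.
try:
    encode(x, weight)
    unwrapped_error = None
except RuntimeError as exc:
    unwrapped_error = str(exc)

# Inside inference mode no autograd state is recorded, so the call succeeds.
with torch.inference_mode():
    out = encode(x, weight)

print("weight is inference tensor:", weight.is_inference())
print("unwrapped call failed:", unwrapped_error is not None)
print("wrapped output requires grad:", out.requires_grad)
```

The wrapped output is itself an inference tensor, so this approach only fits when nothing downstream needs to backpropagate through it, which matches the commit's observation that text encoding doesn't need gradients.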