fix: wrap CLIP encoding in inference_mode during pre-generation

CLIP weights are inference tensors from ComfyUI loading. The worker thread runs without inference_mode, so PyTorch rejects inference tensors in multi_head_attention_forward (version counter tracking). Wrap the encode_text_clip call in torch.inference_mode() since text encoding doesn't need gradients. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-10 01:10:58 +02:00
parent 10a71b0c4f
commit 32e5344ea2
1 changed files with 2 additions and 1 deletions
@@ -529,6 +529,7 @@ def _pregenerate_lora_mels(model, data_dir, lora_adapter_path, device, dtype,
                prompt = prompt_map.get(npz_path.name, data.get("prompt", default_prompt))
                if isinstance(prompt, np.ndarray):
                    prompt = str(prompt)
+                with torch.inference_mode():
                    text_clip = feature_utils.encode_text_clip([prompt]).to(device, dtype)

                # Load clean audio