The original STAR inference uses total_noise_levels=900, preserving input
structure during SDEdit; we had 1000, which starts from near-pure noise
and destroys the input. Also, always append the quality prompt to the
user text instead of using it only as a fallback.
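For intuition, a self-contained sketch of SDEdit's partial noising (pure Python, with an illustrative linear beta schedule, not STAR's actual code):

```python
import math
import random

def alpha_bar(t, T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal coefficient prod(1 - beta_i) for i < t,
    under an illustrative linear beta schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        prod *= 1.0 - beta
    return prod

def sdedit_start(x0, t, T=1000):
    """Noise a clean sample to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar(t, T)
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * random.gauss(0.0, 1.0)
```

At t=900 the signal coefficient is small but nonzero, so denoising starts from a corrupted version of the input; at t=T it is essentially zero and the input is gone.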
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts text encoder and VAE loading back to using HuggingFace preset
names / repo IDs (downloading to library cache) while keeping the
attention dispatcher improvements (4D SDPA, math backend).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SDPA with 3D xformers-BMK tensors cannot use Flash Attention and falls
back to efficient_attention/math kernels that miscompute on Ada Lovelace
GPUs (e.g. RTX 6000 Pro), producing brownish line artifacts. Unsqueeze
to 4D (1, B*H, N, D) so Flash Attention is eligible. Also add a naive
"math" backend (chunked bmm) as a guaranteed-correct diagnostic baseline.
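The "math" backend is just exact softmax(Q·Kᵀ/√d)·V evaluated in query chunks to bound memory. A dependency-free sketch (plain lists standing in for tensors; the real backend chunks torch.bmm calls):

```python
import math

def _attn_rows(q_rows, k, v, scale):
    """Exact attention for a chunk of query rows against full K/V."""
    out = []
    for qr in q_rows:
        scores = [scale * sum(a * b for a, b in zip(qr, kr)) for kr in k]
        m = max(scores)                       # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(len(v))) / z
                    for d in range(len(v[0]))])
    return out

def attention_math(q, k, v, chunk=64):
    """Chunk over queries so peak memory is O(chunk * N), not O(N * N)."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(0, len(q), chunk):
        out.extend(_attn_rows(q[i:i + chunk], k, v, scale))
    return out
```

Because every chunk sees the full K/V, the chunked result is identical to the unchunked one, which is what makes it usable as a correctness baseline.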
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Download OpenCLIP ViT-H-14 to models/text_encoders/ and SVD temporal
VAE to models/vae/svd-temporal-vae/ instead of hidden library caches,
so they're visible, reusable, and shared with other nodes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Newer open_clip creates nn.MultiheadAttention with batch_first=True,
but STAR's embedder unconditionally permutes to [seq, batch, embed].
This causes a RuntimeError in the text encoder (attn_mask shape
mismatch). The patch detects batch_first at runtime and only permutes
when needed.
Patches in patches/ are auto-applied to the STAR submodule on startup
and skip gracefully if already applied.
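The gist of the conditional permute, sketched with a stand-in module (`to_seq_first` is a hypothetical helper name; the real patch edits STAR's embedder around its nn.MultiheadAttention call):

```python
def to_seq_first(x_bse, attn_module):
    """Permute [batch, seq, embed] -> [seq, batch, embed] only when the
    attention module was built with batch_first=False (older open_clip).
    x_bse is a nested list here; in the real patch it is a torch tensor
    and the permute is x.permute(1, 0, 2)."""
    if getattr(attn_module, "batch_first", False):
        return x_bse  # newer open_clip: module already expects batch-first
    B, S = len(x_bse), len(x_bse[0])
    return [[x_bse[b][s] for b in range(B)] for s in range(S)]
```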
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the auto-detect xformers shim with a runtime dispatcher that
always intercepts xformers.ops.memory_efficient_attention. A new
dropdown on STARModelLoader (and --attention CLI arg) lets users
explicitly select: sdpa (default), xformers, sageattn, or specific
SageAttention kernels (fp16 triton/cuda, fp8 cuda). Only backends
that successfully import appear as options.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Avoids requiring an xformers installation by shimming
xformers.ops.memory_efficient_attention with
torch.nn.functional.scaled_dot_product_attention when
xformers is not available.
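A sketch of that shim mechanism: register stub modules in sys.modules so downstream `import xformers.ops` succeeds (the real code passes torch.nn.functional.scaled_dot_product_attention, with layout handling, as the fallback):

```python
import importlib.util
import sys
import types

def ensure_xformers_shim(fallback_attention):
    """If xformers is not importable, register stub xformers / xformers.ops
    modules whose memory_efficient_attention delegates to fallback_attention.
    Returns True when the shim was installed, False if real xformers exists."""
    if importlib.util.find_spec("xformers") is not None:
        return False
    root = types.ModuleType("xformers")
    ops = types.ModuleType("xformers.ops")
    ops.memory_efficient_attention = fallback_attention
    root.ops = ops
    sys.modules["xformers"] = root
    sys.modules["xformers.ops"] = ops
    return True
```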
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone inference script that works outside ComfyUI — just activate
the same Python venv. Streams output frames to ffmpeg so peak RAM stays
bounded regardless of video length. Supports video files, image
sequences, and single images. Audio is automatically preserved from
input videos.
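The streaming approach can be sketched as a rawvideo pipe into ffmpeg (the flag set is a plausible approximation, not the script's exact command; `build_ffmpeg_cmd` is a hypothetical helper):

```python
import subprocess

def build_ffmpeg_cmd(width, height, fps, out_path, audio_src=None):
    """argv for ffmpeg reading raw RGB24 frames from stdin."""
    cmd = ["ffmpeg", "-y",
           "-f", "rawvideo", "-pix_fmt", "rgb24",
           "-s", f"{width}x{height}", "-r", str(fps), "-i", "-"]
    if audio_src:
        # second input supplies audio; "1:a?" maps it only if present
        cmd += ["-i", audio_src, "-map", "0:v", "-map", "1:a?", "-c:a", "copy"]
    cmd += ["-c:v", "libx264", "-pix_fmt", "yuv420p", out_path]
    return cmd

def stream_frames(frames, width, height, fps, out_path, audio_src=None):
    """Write frames (bytes of length width*height*3) one at a time,
    so peak RAM is about one frame regardless of video length."""
    proc = subprocess.Popen(
        build_ffmpeg_cmd(width, height, fps, out_path, audio_src),
        stdin=subprocess.PIPE)
    for frame in frames:
        proc.stdin.write(frame)
    proc.stdin.close()
    proc.wait()
```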
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>