The local_attn_mask was not being tiled across temporal dimensions,
causing assertion errors in streaming mode and incorrect masks otherwise.
Match the naxci1 reference: 4D tile/rearrange for Q/K temporal windows,
chunk-based score computation, and a topk <= 0 guard.
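The tiling fix can be sketched as follows; this is a minimal illustration under assumed shapes (a per-window 2D mask tiled over t_q query and t_k key temporal windows), not the pipeline's actual code:

```python
import torch

def tile_local_mask(local_mask: torch.Tensor, t_q: int, t_k: int) -> torch.Tensor:
    # local_mask: [h, w] boolean mask for a single temporal window (assumed shape).
    # Expand to 4D [t_q, h, t_k, w], then fold the temporal dims back into the
    # sequence dims -- the tile/rearrange over Q/K windows described above.
    tiled = local_mask.unsqueeze(0).unsqueeze(2).expand(t_q, -1, t_k, -1)
    return tiled.reshape(t_q * local_mask.shape[0], t_k * local_mask.shape[1])
```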
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use generate_draft_block_mask_refined for sparse attention mask (matches
naxci1's generate_draft_block_mask_sage with proper half-block key scoring)
- Remove spurious repeat_interleave(2, dim=-1) from generate_draft_block_mask
that doubled the key dimension incorrectly
- Add torch.clamp(0, 1) to _to_frames output (matches naxci1's tensor2video)
- Add .to(self.device) on LQ video slices in streaming loop for all pipelines
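The clamp change can be illustrated with a sketch of a frame-conversion helper (the real _to_frames likely differs; the [-1, 1] input range is an assumption):

```python
import torch

def to_frames(video: torch.Tensor) -> torch.Tensor:
    # Map assumed [-1, 1] decoder output to [0, 1] frames; clamping keeps
    # out-of-range decoder values from overflowing on a later uint8 cast.
    return ((video + 1.0) / 2.0).clamp(0.0, 1.0)
```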
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause of remaining ghosting: our single-stage temporal padding
(N+4 → floor to 8k+1) TRUNCATED frames when N+4 wasn't already 8k+1.
For 50 frames: 50+4=54 → floor to 49, LOSING the last input frame.
The pipeline then processed frames under a misaligned LQ→output mapping.
Fix matches naxci1/ComfyUI-FlashVSR_Stable two-stage approach:
1. Pad to next_8n5(N) (next integer >= N of form 8k+5, minimum 21)
2. Add 4 → result is always 8(k+1)+1, a valid 8k+1 — NEVER truncates
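The two stages can be sketched directly from the description above (next_8n5 is the reference helper name; padded_length is an illustrative wrapper):

```python
def next_8n5(n: int) -> int:
    # Smallest integer >= n of the form 8k + 5, with a floor of 21 (k >= 2).
    k = max((n - 5 + 7) // 8, 2)  # ceil((n - 5) / 8), clamped to the minimum
    return 8 * k + 5

def padded_length(n: int) -> int:
    # Stage 2: adding 4 gives 8k + 9 = 8(k + 1) + 1, always a valid 8m + 1,
    # so no input frames are ever truncated.
    return next_8n5(n) + 4
```

For the 50-frame example above: next_8n5(50) = 53, and 53 + 4 = 57 = 8*7 + 1.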
Also:
- kv_ratio default 2.0→3.0 (matches naxci1, max quality KV cache)
- local_range default 9→11 (more stable temporal consistency)
- sinusoidal_embedding_1d, precompute_freqs_cis, rope_apply: float32→float64
(matches naxci1 reference precision for embeddings and RoPE)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: three critical differences from naxci1 reference implementation:
1. Batch decode after loop → streaming per-chunk TCDecoder decode with LQ
conditioning inside the loop. The TCDecoder uses causal convolutions with
temporal memory that must be built incrementally per-chunk. Batch decode
breaks this design and loses LQ frame conditioning, causing ghosting.
2. Buffer_LQ4x_Proj → Causal_LQ4x_Proj for FlashVSR v1.1. The causal
variant reads the OLD cache before writing the new one (truly causal),
while Buffer writes cache BEFORE the conv call. Using the wrong variant
misaligns temporal LQ conditioning features.
3. Temporal padding formula: changed from round-up to largest_8n1_leq(N+4)
matching the naxci1 reference approach.
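The helper named in item 3 can be sketched as (illustrative; assumes n >= 1):

```python
def largest_8n1_leq(n: int) -> int:
    # Largest integer <= n of the form 8k + 1.
    return 8 * ((n - 1) // 8) + 1
```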
Changes:
- flashvsr_full.py: streaming TCDecoder decode per-chunk with LQ conditioning
and per-chunk color correction (was: batch VAE decode after loop)
- flashvsr_tiny.py: streaming TCDecoder decode per-chunk (was: batch decode)
- inference.py: use Causal_LQ4x_Proj, build TCDecoder for ALL modes (including
full), fix temporal padding to largest_8n1_leq(N+4), clear TCDecoder in
clear_caches()
- utils.py: add Causal_LQ4x_Proj class
- nodes.py: update progress bar estimation for new padding formula
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When sageattn fails, q/k/v are already in [b,n,s,d] format from the
rearrange before the call. Use SDPA directly on them instead of calling
_sdpa_fallback which expects [b,s,(n*d)] and crashes with a shape error.
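A sketch of the corrected fallback path (attn-mask and scale arguments omitted; assumes q/k/v already carry the [b, n, s, d] layout from the preceding rearrange):

```python
import torch
import torch.nn.functional as F

def sdpa_on_heads(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # [b, n, s, d] (batch, heads, seq, head_dim) is exactly what SDPA expects,
    # so no reshape back to [b, s, n*d] is needed before the call.
    return F.scaled_dot_product_attention(q, k, v)
```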
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SageAttention CUDA kernels don't support Blackwell yet. Catch runtime
failures from sageattn/sparse_sageattn, disable them, and fall back to
PyTorch SDPA. The try/except cost is paid only once per session.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verify attention backend functions are actually callable before marking
them available. Fall back to PyTorch SDPA instead of calling None.
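The check amounts to something like (illustrative helper):

```python
def pick_backend(candidates):
    # Return the first backend that is actually callable; a failed import
    # often leaves None in its place, which must not be marked available.
    for fn in candidates:
        if callable(fn):
            return fn
    return None  # caller falls back to PyTorch SDPA
```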
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Vendor a minimal diffsynth subset for FlashVSR inference (full/tiny pipelines,
v1 and v1.1 checkpoints auto-downloaded from HuggingFace). Includes segment-based
processing with temporal overlap and crossfade blending for bounded RAM on long videos.
Nodes: Load FlashVSR Model, FlashVSR Upscale, FlashVSR Segment Upscale.
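The crossfade blending over segment overlaps can be sketched as (linear weights and [t, ...] layout are assumptions):

```python
import torch

def crossfade(prev_tail: torch.Tensor, next_head: torch.Tensor) -> torch.Tensor:
    # Blend the t overlapping frames of two segments with linear weights
    # running 0 -> 1, so the output hands off smoothly from one segment
    # to the next.
    t = prev_tail.shape[0]
    w = torch.linspace(0.0, 1.0, t).view(t, *([1] * (prev_tail.dim() - 1)))
    return (1.0 - w) * prev_tail + w * next_head
```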
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>