Fix FlashVSR ghosting: streaming TCDecoder decode + Causal LQ projection

Root cause: three critical differences from naxci1 reference implementation:

1. Batch decode after loop → streaming per-chunk TCDecoder decode with LQ conditioning inside the loop. The TCDecoder uses causal convolutions with temporal memory that must be built incrementally per-chunk. Batch decode breaks this design and loses LQ frame conditioning, causing ghosting.
2. Buffer_LQ4x_Proj → Causal_LQ4x_Proj for FlashVSR v1.1. The causal variant reads the OLD cache before writing the new one (truly causal), while Buffer writes cache BEFORE the conv call. Using the wrong variant misaligns temporal LQ conditioning features.
3. Temporal padding formula: changed from round-up to largest_8n1_leq(N+4) matching the naxci1 reference approach.

Changes:
- flashvsr_full.py: streaming TCDecoder decode per-chunk with LQ conditioning and per-chunk color correction (was: batch VAE decode after loop)
- flashvsr_tiny.py: streaming TCDecoder decode per-chunk (was: batch decode)
- inference.py: use Causal_LQ4x_Proj, build TCDecoder for ALL modes (including full), fix temporal padding to largest_8n1_leq(N+4), clear TCDecoder in clear_caches()
- utils.py: add Causal_LQ4x_Proj class
- nodes.py: update progress bar estimation for new padding formula

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 17:42:20 +01:00
parent 94d9818675
commit fa250897a2
5 changed files with 196 additions and 98 deletions
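
For reference, the temporal padding change in item 3 of the commit message can be checked in isolation. The sketch below reproduces the arithmetic from the inference.py hunk as standalone functions. The name largest_8n1_leq comes from the commit message, but its body here is reconstructed from the diff. old_target and new_target are hypothetical helper names used only for this illustration and do not exist in the repository.

```python
def largest_8n1_leq(n: int) -> int:
    """Largest value of the form 8k + 1 that is <= n, clamped to at least 1."""
    return max(((n - 1) // 8) * 8 + 1, 1)


def old_target(num_frames: int) -> int:
    """Pre-commit rule: at least 25 frames, rounded UP to the next 8k + 1."""
    target = max(num_frames, 25)
    remainder = (target - 1) % 8
    if remainder != 0:
        target += 8 - remainder
    return target


def new_target(num_frames: int) -> int:
    """Post-commit rule: add 4 frames, then floor to the largest 8k + 1."""
    return largest_8n1_leq(num_frames + 4)


if __name__ == "__main__":
    # Compare both rules for a few input lengths N.
    for n in (1, 20, 25, 28, 30):
        print(f"N={n:3d}  old={old_target(n):3d}  new={new_target(n):3d}")
```

Because the new target equals N + 4 - ((N + 3) mod 8), it differs from N by between -3 and +4 frames: the clip is either padded with up to four repeats of the last frame or trimmed by up to three trailing frames, which is what the added `elif target < N` branch handles. For example, N = 20 gives a target of 17 under the new rule, whereas the old rule padded the same clip to 25.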
--- a/inference.py
+++ b/inference.py
@@ -648,7 +648,7 @@ class FlashVSRModel:
            ModelManager, FlashVSRFullPipeline,
            FlashVSRTinyPipeline, FlashVSRTinyLongPipeline,
        )
-        from .flashvsr_arch.models.utils import Buffer_LQ4x_Proj
+        from .flashvsr_arch.models.utils import Causal_LQ4x_Proj
        from .flashvsr_arch.models.TCDecoder import build_tcdecoder

        self.mode = mode
@@ -672,16 +672,18 @@ class FlashVSRModel:
            mm.load_models([dit_path])
            Pipeline = FlashVSRTinyLongPipeline if mode == "tiny-long" else FlashVSRTinyPipeline
            self.pipe = Pipeline.from_model_manager(mm, device=device)
-            self.pipe.TCDecoder = build_tcdecoder(
-                [512, 256, 128, 128], device, dtype, 16 + 768,
-            )
-            self.pipe.TCDecoder.load_state_dict(
-                load_file(tcd_path, device=device), strict=False,
-            )
-            self.pipe.TCDecoder.clean_mem()

-        # LQ frame projection
-        self.pipe.denoising_model().LQ_proj_in = Buffer_LQ4x_Proj(3, 1536, 1).to(device, dtype)
+        # TCDecoder for ALL modes (streaming per-chunk decode with LQ conditioning)
+        self.pipe.TCDecoder = build_tcdecoder(
+            [512, 256, 128, 128], device, dtype, 16 + 768,
+        )
+        self.pipe.TCDecoder.load_state_dict(
+            load_file(tcd_path, device=device), strict=False,
+        )
+        self.pipe.TCDecoder.clean_mem()
+
+        # LQ frame projection — Causal variant for FlashVSR v1.1
+        self.pipe.denoising_model().LQ_proj_in = Causal_LQ4x_Proj(3, 1536, 1).to(device, dtype)
        if os.path.exists(lq_path):
            lq_sd = load_file(lq_path, device="cpu")
            cleaned = {}
@@ -714,6 +716,8 @@ class FlashVSRModel:
            self.pipe.denoising_model().LQ_proj_in.clear_cache()
        if hasattr(self.pipe, "vae") and self.pipe.vae is not None:
            self.pipe.vae.clear_cache()
+        if hasattr(self.pipe, "TCDecoder") and self.pipe.TCDecoder is not None:
+            self.pipe.TCDecoder.clean_mem()

    # ------------------------------------------------------------------
    # Frame preprocessing / postprocessing helpers
@@ -743,7 +747,7 @@ class FlashVSRModel:
        1. Bicubic-upscale each frame to target resolution
        2. Centered symmetric padding to 128-pixel alignment (reflect mode)
        3. Normalize to [-1, 1]
-        4. Temporal padding: repeat last frame to reach 8k+1 count
+        4. Temporal padding: N+4 then floor to largest 8k+1 (matches naxci1 reference)

        No front dummy frames — the pipeline handles LQ indexing correctly
        starting from frame 0.
@@ -780,14 +784,16 @@ class FlashVSRModel:

        video = torch.stack(processed, 0).permute(1, 0, 2, 3).unsqueeze(0)

-        # Temporal padding: repeat last frame to reach 8k+1 (pipeline requirement)
-        target = max(N, 25)  # minimum 25 for streaming loop (P >= 1)
-        remainder = (target - 1) % 8
-        if remainder != 0:
-            target += 8 - remainder
+        # Temporal padding: N+4 then floor to largest 8k+1 (matches naxci1 reference)
+        num_with_pad = N + 4
+        target = ((num_with_pad - 1) // 8) * 8 + 1  # largest_8n1_leq
+        if target < 1:
+            target = 1
        if target > N:
            pad = video[:, :, -1:].repeat(1, 1, target - N, 1, 1)
            video = torch.cat([video, pad], dim=2)
+        elif target < N:
+            video = video[:, :, :target, :, :]
        nf = video.shape[2]

        return video, th, tw, nf, sh, sw, pad_top, pad_left