Vendor minimal diffsynth subset for FlashVSR inference (full/tiny pipelines, v1 and v1.1 checkpoints auto-downloaded from HuggingFace). Includes segment-based processing with temporal overlap and crossfade blending for bounded RAM on long videos. Nodes: Load FlashVSR Model, FlashVSR Upscale, FlashVSR Segment Upscale. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
342 lines
18 KiB
Markdown
342 lines
18 KiB
Markdown
# ComfyUI BIM-VFI + EMA-VFI + SGM-VFI + GIMM-VFI + FlashVSR
|
||
|
||
ComfyUI custom nodes for video frame interpolation using [BiM-VFI](https://github.com/KAIST-VICLab/BiM-VFI) (CVPR 2025), [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) (CVPR 2023), [SGM-VFI](https://github.com/MCG-NJU/SGM-VFI) (CVPR 2024), and [GIMM-VFI](https://github.com/GSeanCDAT/GIMM-VFI) (NeurIPS 2024), plus video super-resolution using [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) (arXiv 2025). Designed for long videos with thousands of frames — processes them without running out of VRAM.
|
||
|
||
## Which model should I use?
|
||
|
||
| | BIM-VFI | EMA-VFI | SGM-VFI | GIMM-VFI |
|
||
|---|---------|---------|---------|----------|
|
||
| **Best for** | General-purpose, non-uniform motion | Fast inference, light VRAM | Large motion, occlusion-heavy scenes | High multipliers (4x/8x) in a single pass |
|
||
| **Quality** | Highest overall | Good | Best on large motion | Good |
|
||
| **Speed** | Moderate | Fastest | Slowest | Fast for 4x/8x (single pass) |
|
||
| **VRAM** | ~2 GB/pair | ~1.5 GB/pair | ~3 GB/pair | ~2.5 GB/pair |
|
||
| **Params** | ~17M | ~14–65M | ~15M + GMFlow | ~80M (RAFT) / ~123M (FlowFormer) |
|
||
| **Arbitrary timestep** | Yes | Yes (with `_t` checkpoint) | No (fixed 0.5) | Yes (native single-pass) |
|
||
| **4x/8x mode** | Recursive 2x passes | Recursive 2x passes | Recursive 2x passes | Single forward pass (or recursive) |
|
||
| **Paper** | CVPR 2025 | CVPR 2023 | CVPR 2024 | NeurIPS 2024 |
|
||
| **License** | Research only | Apache 2.0 | Apache 2.0 | Apache 2.0 |
|
||
|
||
**TL;DR:** Start with **BIM-VFI** for best quality. Use **EMA-VFI** if you need speed or lower VRAM. Use **SGM-VFI** if your video has large camera motion or fast-moving objects that the others struggle with. Use **GIMM-VFI** when you want 4x or 8x interpolation without recursive passes — it generates all intermediate frames in a single forward pass per pair.
|
||
|
||
### Video Super-Resolution
|
||
|
||
FlashVSR is a different category — **spatial upscaling** rather than temporal interpolation. It can be combined with any of the VFI models above.
|
||
|
||
| | FlashVSR |
|
||
|---|----------|
|
||
| **Task** | 4x video super-resolution |
|
||
| **Architecture** | Wan 2.1-1.3B DiT + VAE (diffusion-based) |
|
||
| **Modes** | Full (best quality), Tiny (fast), Tiny-Long (streaming, lowest VRAM) |
|
||
| **VRAM** | ~8–12 GB (tiled, tiny mode) / ~16–24 GB (full mode) |
|
||
| **Params** | ~1.3B (DiT) + ~200M (VAE) |
|
||
| **Min input** | 21 frames |
|
||
| **Paper** | arXiv 2510.12747 |
|
||
| **License** | Apache 2.0 |
|
||
|
||
## Nodes
|
||
|
||
### BIM-VFI
|
||
|
||
#### Load BIM-VFI Model
|
||
|
||
Loads the BiM-VFI checkpoint. Auto-downloads from Google Drive on first use to `ComfyUI/models/bim-vfi/`.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **model_path** | Checkpoint file from `models/bim-vfi/` |
|
||
| **auto_pyr_level** | Auto-select pyramid level by resolution (<540p=3, 540p=5, 1080p=6, 4K=7) |
|
||
| **pyr_level** | Manual pyramid level (3-7), only used when auto is off |
|
||
|
||
#### BIM-VFI Interpolate
|
||
|
||
Interpolates frames from an image batch.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **images** | Input image batch |
|
||
| **model** | Model from the loader node |
|
||
| **multiplier** | 2x, 4x, or 8x frame rate (recursive 2x passes) |
|
||
| **batch_size** | Frame pairs processed simultaneously (higher = faster, more VRAM) |
|
||
| **chunk_size** | Process in segments of N input frames (0 = disabled). Bounds VRAM for very long videos. Result is identical to processing all at once |
|
||
| **keep_device** | Keep model on GPU between pairs (faster, ~200MB constant VRAM) |
|
||
| **all_on_gpu** | Keep all intermediate frames on GPU (fast, needs large VRAM) |
|
||
| **clear_cache_after_n_frames** | Clear CUDA cache every N pairs to prevent VRAM buildup |
|
||
|
||
#### BIM-VFI Segment Interpolate
|
||
|
||
Same as Interpolate but processes a single segment of the input. Chain multiple instances with Save nodes between them to bound peak RAM. The model pass-through output forces sequential execution.
|
||
|
||
### Tween Concat Videos
|
||
|
||
Concatenates segment video files into a single video using ffmpeg. Connect from any Segment Interpolate's model output to ensure it runs after all segments are saved. Works with all three models.
|
||
|
||
### EMA-VFI
|
||
|
||
#### Load EMA-VFI Model
|
||
|
||
Loads an EMA-VFI checkpoint. Auto-downloads from Google Drive on first use to `ComfyUI/models/ema-vfi/`. Variant (large/small) and timestep support are auto-detected from the filename.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **model_path** | Checkpoint file from `models/ema-vfi/` |
|
||
| **tta** | Test-time augmentation: flip input and average with unflipped result (~2x slower, slightly better quality) |
|
||
|
||
Available checkpoints:
|
||
| Checkpoint | Variant | Params | Arbitrary timestep |
|
||
|-----------|---------|--------|-------------------|
|
||
| `ours_t.pkl` | Large | ~65M | Yes |
|
||
| `ours.pkl` | Large | ~65M | No (fixed 0.5) |
|
||
| `ours_small_t.pkl` | Small | ~14M | Yes |
|
||
| `ours_small.pkl` | Small | ~14M | No (fixed 0.5) |
|
||
|
||
#### EMA-VFI Interpolate
|
||
|
||
Interpolates frames from an image batch. Same controls as BIM-VFI Interpolate.
|
||
|
||
#### EMA-VFI Segment Interpolate
|
||
|
||
Same as EMA-VFI Interpolate but processes a single segment. Same pattern as BIM-VFI Segment Interpolate.
|
||
|
||
### SGM-VFI
|
||
|
||
#### Load SGM-VFI Model
|
||
|
||
Loads an SGM-VFI checkpoint. Auto-downloads from Google Drive on first use to `ComfyUI/models/sgm-vfi/`. Variant (base/small) is auto-detected from the filename (default is small).
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **model_path** | Checkpoint file from `models/sgm-vfi/` |
|
||
| **tta** | Test-time augmentation: flip input and average with unflipped result (~2x slower, slightly better quality) |
|
||
| **num_key_points** | Sparsity of global matching (0.0 = global everywhere, 0.5 = default balance, higher = faster) |
|
||
|
||
Available checkpoints:
|
||
| Checkpoint | Variant | Params |
|
||
|-----------|---------|--------|
|
||
| `ours-1-2-points.pkl` | Small | ~15M + GMFlow |
|
||
|
||
#### SGM-VFI Interpolate
|
||
|
||
Interpolates frames from an image batch. Same controls as BIM-VFI Interpolate.
|
||
|
||
#### SGM-VFI Segment Interpolate
|
||
|
||
Same as SGM-VFI Interpolate but processes a single segment. Same pattern as BIM-VFI Segment Interpolate.
|
||
|
||
### GIMM-VFI
|
||
|
||
#### Load GIMM-VFI Model
|
||
|
||
Loads a GIMM-VFI checkpoint. Auto-downloads from [HuggingFace](https://huggingface.co/Kijai/GIMM-VFI_safetensors) on first use to `ComfyUI/models/gimm-vfi/`. The matching flow estimator (RAFT or FlowFormer) is auto-detected and downloaded alongside the main model.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **model_path** | Checkpoint file from `models/gimm-vfi/` |
|
||
| **ds_factor** | Downscale factor for internal processing (1.0 = full res, 0.5 = half). Lower = less VRAM, faster, less quality. Try 0.5 for 4K inputs |
|
||
|
||
Available checkpoints:
|
||
| Checkpoint | Variant | Params | Flow estimator (auto-downloaded) |
|
||
|-----------|---------|--------|----------------------------------|
|
||
| `gimmvfi_r_arb_lpips_fp32.safetensors` | RAFT | ~80M | `raft-things_fp32.safetensors` |
|
||
| `gimmvfi_f_arb_lpips_fp32.safetensors` | FlowFormer | ~123M | `flowformer_sintel_fp32.safetensors` |
|
||
|
||
#### GIMM-VFI Interpolate
|
||
|
||
Interpolates frames from an image batch. Same controls as BIM-VFI Interpolate, plus:
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **single_pass** | When enabled (default), generates all intermediate frames per pair in one forward pass using GIMM-VFI's arbitrary-timestep capability. No recursive 2x passes needed for 4x or 8x. Disable to use the standard recursive approach (same as BIM/EMA/SGM) |
|
||
|
||
#### GIMM-VFI Segment Interpolate
|
||
|
||
Same as GIMM-VFI Interpolate but processes a single segment. Same pattern as BIM-VFI Segment Interpolate.
|
||
|
||
**Output frame count (VFI models):** 2x = 2N-1, 4x = 4N-3, 8x = 8N-7
|
||
|
||
### FlashVSR
|
||
|
||
FlashVSR does **4x video super-resolution** (spatial upscaling), not frame interpolation. It uses a diffusion-based approach built on Wan 2.1-1.3B for temporally coherent upscaling.
|
||
|
||
#### Load FlashVSR Model
|
||
|
||
Downloads checkpoints from HuggingFace (~7.5 GB) on first use to `ComfyUI/models/flashvsr/`.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **mode** | Pipeline mode: `tiny` (fast TCDecoder decode), `tiny-long` (streaming TCDecoder, lowest VRAM for long videos), `full` (standard VAE decode, best quality) |
|
||
| **precision** | `bf16` (faster on modern GPUs) or `fp16` (for older GPUs) |
|
||
|
||
Checkpoints (auto-downloaded from [1038lab/FlashVSR](https://huggingface.co/1038lab/FlashVSR)):
|
||
| Checkpoint | Size | Description |
|
||
|-----------|------|-------------|
|
||
| `FlashVSR1_1.safetensors` | ~5 GB | Main DiT model (v1.1) |
|
||
| `Wan2.1_VAE.safetensors` | ~2 GB | Video VAE |
|
||
| `LQ_proj_in.safetensors` | ~50 MB | Low-quality frame projection |
|
||
| `TCDecoder.safetensors` | ~200 MB | Tiny conditional decoder (for tiny/tiny-long modes) |
|
||
| `Prompt.safetensors` | ~1 MB | Precomputed text embeddings |
|
||
|
||
#### FlashVSR Upscale
|
||
|
||
Upscales an image batch with 4x spatial super-resolution.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **images** | Input video frames (minimum 21 frames) |
|
||
| **model** | Model from the loader node |
|
||
| **scale** | Upscaling factor: 2x or 4x (4x is native resolution) |
|
||
| **frame_chunk_size** | Process in chunks of N frames to bound VRAM (0 = all at once). Recommended: 33 or 65. Each chunk must be >= 21 frames |
|
||
| **tiled** | Enable tiled VAE decode (reduces VRAM significantly) |
|
||
| **tile_size_h / tile_size_w** | VAE tile dimensions in latent space (default 60/104) |
|
||
| **topk_ratio** | Sparse attention ratio. Higher = faster, may lose fine detail (default 2.0) |
|
||
| **kv_ratio** | KV cache ratio. Higher = better quality, more VRAM (default 2.0) |
|
||
| **local_range** | Local attention window: 9 = sharper details, 11 = more temporal stability |
|
||
| **color_fix** | Apply wavelet color correction to prevent color shifts |
|
||
| **unload_dit** | Offload DiT to CPU before VAE decode (saves VRAM, slower) |
|
||
| **seed** | Random seed for the diffusion process |
|
||
|
||
#### FlashVSR Segment Upscale
|
||
|
||
Same as FlashVSR Upscale but processes a single segment of the input. Chain multiple instances with Save nodes between them to bound peak RAM. The model pass-through output forces sequential execution.
|
||
|
||
| Input | Description |
|
||
|-------|-------------|
|
||
| **segment_index** | Which segment to process (0-based) |
|
||
| **segment_size** | Number of input frames per segment (minimum 21) |
|
||
| **overlap_frames** | Overlapping frames between adjacent segments for temporal context and crossfade blending |
|
||
| **blend_frames** | Number of frames within the overlap to crossfade (must be <= overlap_frames) |
|
||
|
||
Plus all the same upscale parameters as FlashVSR Upscale.
|
||
|
||
## Installation
|
||
|
||
Clone into your ComfyUI `custom_nodes/` directory:
|
||
|
||
```bash
|
||
cd ComfyUI/custom_nodes
|
||
git clone https://github.com/your-user/ComfyUI-Tween.git
|
||
```
|
||
|
||
Dependencies (`gdown`, `cupy`, `timm`, `omegaconf`, `easydict`, `yacs`, `einops`, `huggingface_hub`, `safetensors`) are auto-installed on first load. The correct `cupy` variant is detected from your PyTorch CUDA version.
|
||
|
||
> **Warning:** `cupy` is a large package (~800MB) and compilation/installation can take several minutes. The first ComfyUI startup after installing this node may appear to hang while `cupy` installs in the background. Check the console log for progress. If auto-install fails (e.g. missing build tools in Docker), install manually with:
|
||
> ```bash
|
||
> pip install cupy-cuda12x # replace 12 with your CUDA major version
|
||
> ```
|
||
|
||
To install manually:
|
||
|
||
```bash
|
||
cd ComfyUI-Tween
|
||
python install.py
|
||
```
|
||
|
||
### Requirements
|
||
|
||
- PyTorch with CUDA
|
||
- `cupy` (matching your CUDA version, for BIM-VFI, SGM-VFI, and GIMM-VFI)
|
||
- `timm` (for EMA-VFI and SGM-VFI)
|
||
- `gdown` (for BIM-VFI/EMA-VFI/SGM-VFI model auto-download)
|
||
- `omegaconf`, `easydict`, `yacs`, `einops` (for GIMM-VFI)
|
||
- `huggingface_hub` (for GIMM-VFI and FlashVSR model auto-download)
|
||
- `safetensors` (for FlashVSR checkpoint loading)
|
||
|
||
## VRAM Guide
|
||
|
||
| VRAM | Recommended settings |
|
||
|------|---------------------|
|
||
| 8 GB | batch_size=1, chunk_size=500 |
|
||
| 24 GB | batch_size=2-4, chunk_size=1000 |
|
||
| 48 GB+ | batch_size=4-16, all_on_gpu=true |
|
||
| 96 GB+ | batch_size=8-16, all_on_gpu=true, chunk_size=0 |
|
||
|
||
## Acknowledgments
|
||
|
||
This project wraps the official [BiM-VFI](https://github.com/KAIST-VICLab/BiM-VFI) implementation by the [KAIST VIC Lab](https://github.com/KAIST-VICLab), the official [EMA-VFI](https://github.com/MCG-NJU/EMA-VFI) implementation by MCG-NJU, the official [SGM-VFI](https://github.com/MCG-NJU/SGM-VFI) implementation by MCG-NJU, the [GIMM-VFI](https://github.com/GSeanCDAT/GIMM-VFI) implementation by S-Lab (NTU), and [FlashVSR](https://github.com/OpenImagingLab/FlashVSR) by OpenImagingLab. GIMM-VFI architecture files in `gimm_vfi_arch/` are adapted from [kijai/ComfyUI-GIMM-VFI](https://github.com/kijai/ComfyUI-GIMM-VFI) with safetensors checkpoints from [Kijai/GIMM-VFI_safetensors](https://huggingface.co/Kijai/GIMM-VFI_safetensors). FlashVSR architecture files in `flashvsr_arch/` are adapted from [1038lab/ComfyUI-FlashVSR](https://github.com/1038lab/ComfyUI-FlashVSR) (a diffsynth subset) with safetensors checkpoints from [1038lab/FlashVSR](https://huggingface.co/1038lab/FlashVSR). Architecture files in `bim_vfi_arch/`, `ema_vfi_arch/`, `sgm_vfi_arch/`, `gimm_vfi_arch/`, and `flashvsr_arch/` are vendored from their respective repositories with minimal modifications (relative imports, device-awareness fixes, dtype safety patches, inference-only paths).
|
||
|
||
**BiM-VFI:**
|
||
> Wonyong Seo, Jihyong Oh, and Munchurl Kim.
|
||
> "BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions."
|
||
> *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2025.
|
||
> [[arXiv]](https://arxiv.org/abs/2412.11365) [[Project Page]](https://kaist-viclab.github.io/BiM-VFI_site/) [[GitHub]](https://github.com/KAIST-VICLab/BiM-VFI)
|
||
|
||
```bibtex
|
||
@inproceedings{seo2025bimvfi,
|
||
title={BiM-VFI: Bidirectional Motion Field-Guided Frame Interpolation for Video with Non-uniform Motions},
|
||
author={Seo, Wonyong and Oh, Jihyong and Kim, Munchurl},
|
||
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
|
||
year={2025}
|
||
}
|
||
```
|
||
|
||
**EMA-VFI:**
|
||
> Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, and Limin Wang.
|
||
> "Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation."
|
||
> *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2023.
|
||
> [[arXiv]](https://arxiv.org/abs/2303.00440) [[GitHub]](https://github.com/MCG-NJU/EMA-VFI)
|
||
|
||
```bibtex
|
||
@inproceedings{zhang2023emavfi,
|
||
title={Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation},
|
||
author={Zhang, Guozhen and Zhu, Yuhan and Wang, Haonan and Chen, Youxin and Wu, Gangshan and Wang, Limin},
|
||
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
|
||
year={2023}
|
||
}
|
||
```
|
||
|
||
**SGM-VFI:**
|
||
> Guozhen Zhang, Yuhan Zhu, Evan Zheran Liu, Haonan Wang, Mingzhen Sun, Gangshan Wu, and Limin Wang.
|
||
> "Sparse Global Matching for Video Frame Interpolation with Large Motion."
|
||
> *IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)*, 2024.
|
||
> [[arXiv]](https://arxiv.org/abs/2404.06913) [[GitHub]](https://github.com/MCG-NJU/SGM-VFI)
|
||
|
||
```bibtex
|
||
@inproceedings{zhang2024sgmvfi,
|
||
title={Sparse Global Matching for Video Frame Interpolation with Large Motion},
|
||
author={Zhang, Guozhen and Zhu, Yuhan and Liu, Evan Zheran and Wang, Haonan and Sun, Mingzhen and Wu, Gangshan and Wang, Limin},
|
||
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
|
||
year={2024}
|
||
}
|
||
```
|
||
|
||
**GIMM-VFI:**
|
||
> Zujin Guo, Wei Li, and Chen Change Loy.
|
||
> "Generalizable Implicit Motion Modeling for Video Frame Interpolation."
|
||
> *Advances in Neural Information Processing Systems (NeurIPS)*, 2024.
|
||
> [[arXiv]](https://arxiv.org/abs/2407.08680) [[GitHub]](https://github.com/GSeanCDAT/GIMM-VFI)
|
||
|
||
```bibtex
|
||
@inproceedings{guo2024gimmvfi,
|
||
title={Generalizable Implicit Motion Modeling for Video Frame Interpolation},
|
||
author={Guo, Zujin and Li, Wei and Loy, Chen Change},
|
||
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
|
||
year={2024}
|
||
}
|
||
```
|
||
|
||
**FlashVSR:**
|
||
> Junhao Zhuang, Ting-Che Lin, Xin Zhong, Zhihong Pan, Chun Yuan, and Ailing Zeng.
|
||
> "FlashVSR: Efficient Real-World Video Super-Resolution via Distilled Diffusion Transformer."
|
||
> *arXiv preprint arXiv:2510.12747*, 2025.
|
||
> [[arXiv]](https://arxiv.org/abs/2510.12747) [[GitHub]](https://github.com/OpenImagingLab/FlashVSR)
|
||
|
||
```bibtex
|
||
@article{zhuang2025flashvsr,
|
||
title={FlashVSR: Efficient Real-World Video Super-Resolution via Distilled Diffusion Transformer},
|
||
author={Zhuang, Junhao and Lin, Ting-Che and Zhong, Xin and Pan, Zhihong and Yuan, Chun and Zeng, Ailing},
|
||
journal={arXiv preprint arXiv:2510.12747},
|
||
year={2025}
|
||
}
|
||
```
|
||
|
||
## License
|
||
|
||
The BiM-VFI model weights and architecture code are provided by KAIST VIC Lab for **research and education purposes only**. Commercial use requires permission from the principal investigator (Prof. Munchurl Kim, mkimee@kaist.ac.kr). See the [original repository](https://github.com/KAIST-VICLab/BiM-VFI) for details.
|
||
|
||
The EMA-VFI model weights and architecture code are released under the [Apache 2.0 License](https://github.com/MCG-NJU/EMA-VFI/blob/main/LICENSE). See the [original repository](https://github.com/MCG-NJU/EMA-VFI) for details.
|
||
|
||
The SGM-VFI model weights and architecture code are released under the [Apache 2.0 License](https://github.com/MCG-NJU/SGM-VFI/blob/main/LICENSE). See the [original repository](https://github.com/MCG-NJU/SGM-VFI) for details.
|
||
|
||
The GIMM-VFI model weights and architecture code are released under the [Apache 2.0 License](https://github.com/GSeanCDAT/GIMM-VFI/blob/main/LICENSE). See the [original repository](https://github.com/GSeanCDAT/GIMM-VFI) for details. ComfyUI adaptation based on [kijai/ComfyUI-GIMM-VFI](https://github.com/kijai/ComfyUI-GIMM-VFI).
|
||
|
||
The FlashVSR model weights and architecture code are released under the [Apache 2.0 License](https://github.com/OpenImagingLab/FlashVSR/blob/main/LICENSE). See the [original repository](https://github.com/OpenImagingLab/FlashVSR) for details. Architecture files adapted from [1038lab/ComfyUI-FlashVSR](https://github.com/1038lab/ComfyUI-FlashVSR).
|